
Subject Notes
Unit I
VIII Semester
Subject Name: IT 802-Machine Learning

Unit-1 Content: Introduction, Examples of various Learning Paradigms, Perspectives and


Issues, Concept Learning, Version Spaces, Finite and Infinite Hypothesis Spaces, PAC
Learning, VC Dimension

INTRODUCTION

Machine learning (ML) is a category of algorithms that allows software applications to become more accurate in predicting outcomes without being explicitly programmed. The basic premise of machine learning is to build algorithms that can receive input data and use statistical analysis to predict an output, updating the outputs as new data becomes available.

Traditional software engineering combines human-created rules with data to produce answers to a problem. Machine learning, in contrast, uses data and answers to discover the rules behind a problem. To learn the rules governing a phenomenon, machines have to go through a learning process, trying different rules and learning from how well they perform; this is why it is known as Machine Learning.

There are multiple forms of Machine Learning: supervised, unsupervised, semi-supervised and reinforcement learning. Each form of Machine Learning has a different approach, but they all follow the same underlying process and theory.

The basic terminology used in Machine Learning is:

• Dataset: A set of data examples that contain features important to solving the
problem.

• Features: Important pieces of data that help us understand a problem. These are fed
into a Machine Learning algorithm to help it learn.

• Model: The representation (internal model) of a phenomenon that a Machine


Learning algorithm has learnt. It learns this from the data it is shown during training.
The model is the output you get after training an algorithm. For example, a decision
tree algorithm would be trained and produce a decision tree model.

The process of Machine Learning includes the following steps:

• Data Collection: Collect the data that the algorithm will learn from.

• Data Preparation: Format and engineer the data into the optimal format, extracting
important features and performing dimensionality reduction.

• Training: This is where the Machine Learning algorithm learns, by being shown the data that has been collected and prepared.

• Evaluation: Test the model to see how well it performs.

• Tuning: Fine tune the model to maximize its performance.

There are many approaches that can be taken when conducting Machine Learning. Supervised and unsupervised learning are well-established approaches and the most widely used. Semi-supervised and reinforcement learning are newer and more complex but have shown impressive results.

EXAMPLES OF VARIOUS LEARNING PARADIGMS

A learning paradigm basically states a particular pattern by which something or someone learns. The learning paradigms related to machine learning describe how a machine learns when some data is given to it, i.e., its pattern of approach for that particular data.

There are three basic types of learning paradigms widely associated with machine learning,
namely

1. Supervised Learning

2. Unsupervised Learning

3. Reinforcement Learning

Figure 1.1: Basic types of Machine Learning

Supervised Learning

Supervised learning is a machine learning task in which a function mapping inputs to outputs is learned from provided input-output pairs.

Figure 1.2: Supervised Learning

In this type of learning, you need to give both the input and the output (usually in the form of labels) to the computer for it to learn from. The computer generates a function based on this data, which can be anything from a simple line to a complex function, depending on the data provided.

This is the most basic type of learning paradigm, and most algorithms we learn today are
based on this type of learning pattern. For example:

Figure 1.3: Linear regression

Logistic Regression (0 or 1 logic, meaning yes or no)

Figure 1.4: Logistic regression Model

Classification: Machine is trained to classify something into some class.

• Classifying whether a patient has disease or not

• Classifying whether an email is spam or not

Regression: Machine is trained to predict some value like price, weight, or height.

• Predicting house/property price

• Predicting stock market price

Unsupervised Learning

In this type of learning paradigm, the computer is provided with just the input to develop a learning pattern; there are no labeled outputs to learn from.

Figure 1.5: Unsupervised Learning

This means that the computer has to recognize a pattern in the given input and develop a learning algorithm accordingly. So, we conclude that the machine learns through observation and finds structures in the data. This is still a largely unexplored field of machine learning, and big tech companies like Google and Microsoft are currently researching developments in it.

Clustering: A clustering problem is where you want to discover the inherent groupings in the
data

• such as grouping customers by purchasing behavior

Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data

• such as people that buy X also tend to buy Y

Reinforcement Learning

Reinforcement Learning is a type of Machine Learning, and thereby also a branch of Artificial Intelligence. It allows machines and software agents to automatically determine the ideal behavior within a specific context, in order to maximize their performance.

Figure 1.6: Reinforcement Learning

There is an excellent analogy to explain this type of learning paradigm, “training a dog”.

This learning paradigm is like a dog trainer, which teaches the dog how to respond to
specific signs, like a whistle, clap, or anything else. Whenever the dog responds correctly,
the trainer gives a reward to the dog, which can be a “Bone or a biscuit”.

A variety of different problems can be solved using Reinforcement Learning. Because RL


agents can learn without expert supervision, the type of problems that are best suited to RL
are complex problems where there appears to be no obvious or easily programmable
solution. Two of the main ones are:

Game playing — determining the best move to make in a game often depends on a number
of different factors; hence the number of possible states that can exist in a particular game
is usually very large.

Control problems — such as elevator scheduling. Again, it is not obvious what strategies
would provide the best, most timely elevator service. For control problems such as this, RL
agents can be left to learn in a simulated environment and eventually they will come up
with good controlling policies.

PERSPECTIVE & ISSUES

Perspective:

It involves searching a very large space of possible hypotheses to determine the one that best fits the observed data. Machine perception is the capability of a computer system to interpret data in a manner similar to the way humans use their senses to relate to the world around them. The basic way that computers take in and respond to their environment is through attached hardware. Until recently, input was limited to a keyboard or a mouse, but advances in technology, both in hardware and software, have
allowed computers to take in sensory input in a way similar to humans. Machine perception
allows the computer to use this sensory input, as well as conventional computational means
of gathering information, to gather information with greater accuracy and to present it in a
way that is more comfortable for the user.

The end goal of machine perception is to give machines the ability to see, feel and perceive
the world as humans do and therefore for them to be able to explain in a human way why
they are making their decisions, to warn us when it is failing and more importantly, the
reason why it is failing. This purpose is very similar to the proposed purposes for artificial
intelligence generally, except that machine perception would only grant machines limited
sentience, rather than bestow upon machines full consciousness, self-awareness, and
intentionality.

Issues:

Some of the issues that the science of machine perception still has to overcome include:

• Embodied Cognition - The theory that cognition is a full-body experience, and can therefore only exist, and be measured and analyzed, in its fullness if all required human abilities and processes are working together through a mutually aware and supportive systems network.

• The Principle of Similarity - The ability young children develop to determine what family a newly introduced stimulus falls under, even when that stimulus is different from the members with which the child usually associates that family. (An example could be a child figuring out that a Chihuahua is a dog and house pet rather than vermin.)

• The Unconscious Inference: The natural human behavior of determining if a new


stimulus is dangerous or not, what it is, and then how to relate to it without ever
requiring any new conscious effort.

• The innate human ability to follow the Likelihood Principle in order to learn from
circumstances and others over time.

• The Recognition-by-components theory - Being able to mentally analyze and break down complicated mechanisms into manageable parts with which to interact. For example: a person seeing both the cup and the handle parts that make up a mug full of hot cocoa, in order to use the handle to hold the mug and avoid being burned.

• The Free energy principle - Determining, long beforehand, how much energy one can safely delegate to being aware of things outside oneself without losing the energy required for sustaining one's life and functioning satisfactorily. This allows one to become optimally aware of the surrounding world without depleting one's energy so much as to experience damaging stress, decision fatigue, and/or exhaustion.

CONCEPT LEARNING

Concept learning, also known as category learning, is the search for and listing of attributes that can be used to distinguish exemplars from non-exemplars of various categories. More simply put, concepts are the mental categories that help us classify objects, events, or ideas, building on the understanding that each object, event, or idea has a set of common relevant features. Thus, concept learning is a strategy which requires a learner to compare and contrast groups or categories that contain concept-relevant features with groups or categories that do not contain concept-relevant features.

A concept in Machine Learning can be thought of as a Boolean-valued function defined over a large set of training examples. For example, one possible target concept may be to find the days on which a person enjoys his favorite sport. We have some attributes/features of the day, such as Sky, Air Temperature, Humidity, Wind, Water and Forecast, and based on these we have a target concept named EnjoySport.

We have the following training example available:

Table 1.1: Data table of finding the concept

Let’s Design the problem formally with TPE (Task, Performance, Experience):

Problem: Learning the days on which a person enjoys the sport.

Task T: Learn to predict the value of EnjoySport for an arbitrary day, based on the values of
the attributes of the day.

Performance measure P: Total percent of days (EnjoySport) correctly predicted.

Training experience E: A set of days with given labels (EnjoySport: Yes/No)

Let’s take a very simple hypothesis representation which consists of a conjunction of constraints on the instance attributes. We get a hypothesis hi from example i of our training set as below:

hi(x) := <x1, x2, x3, x4, x5, x6>

Where x1, x2, x3, x4, x5 and x6 are the values of Sky, Air-Temp, Humidity, Wind, Water and
Forecast.

Hence h1 (from the first row of the table above) will look like:

h1(x = 1): <Sunny, Warm, Normal, Strong, Warm, Same>   (Note: x = 1 represents a positive example.)

We want to find the most suitable hypothesis which can represent the concept. For
example, the person enjoys his favorite sport only on cold days with high humidity.

h(x=1) = <?, Cold, High, ?, ?, ?>

Here ‘?’ indicates that any value of the attribute is acceptable. The most general hypothesis is <?, ?, ?, ?, ?, ?>, under which every day is a positive example, and the most specific hypothesis is <∅, ∅, ∅, ∅, ∅, ∅>, under which no day is a positive example. The two most popular approaches to finding a suitable hypothesis are:

1. Find-S Algorithm

2. List-Then-Eliminate Algorithm

Find-S Algorithm:

Following are the steps for the Find-S algorithm (a minimal Python sketch follows the steps):

• Initialize h to the most specific hypothesis in H

• For each positive training example,

o For each attribute, constraint ai in h

▪ If the constraints ai is satisfied by x

▪ Then do nothing

▪ Else replace ai in h by the next more general constraint that is


satisfied by x

• Output hypothesis h
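
A minimal Python sketch of Find-S for conjunctive hypotheses such as EnjoySport is shown below. Only the first data row is quoted in the text above; the second row is a hypothetical illustration.

def find_s(examples):
    # examples: list of (attribute_tuple, label) pairs; label is "Yes" or "No"
    h = None  # the most specific hypothesis: nothing is acceptable yet
    for x, label in examples:
        if label != "Yes":
            continue  # Find-S ignores negative examples
        if h is None:
            h = list(x)  # the first positive example becomes the hypothesis
        else:
            # generalize every constraint that this positive example violates
            h = [hi if hi == xi else "?" for hi, xi in zip(h, x)]
    return h

data = [
    (("Sunny", "Warm", "Normal", "Strong", "Warm", "Same"), "Yes"),
    (("Sunny", "Warm", "High", "Strong", "Warm", "Same"), "Yes"),  # hypothetical row
]
print(find_s(data))  # ['Sunny', 'Warm', '?', 'Strong', 'Warm', 'Same']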

The LIST-THEN-ELIMINATE Algorithm:

Following are the steps for the LIST-THEN-ELIMINATE algorithm (a minimal Python sketch follows the steps):

VersionSpace <- a list containing every hypothesis in H

For each training example, <x, c(x)>

• Remove from VersionSpace any hypothesis h for which h(x) != c(x)

Output the list of hypotheses in VersionSpace.
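
A minimal Python sketch of List-Then-Eliminate, assuming the hypothesis space is small enough to enumerate (which is precisely the algorithm's practical limitation); the classify helper is a hypothetical argument supplied by the caller:

def list_then_eliminate(hypothesis_space, examples, classify):
    # hypothesis_space: iterable of candidate hypotheses h
    # examples: list of (x, c_x) pairs, where c_x is the true label c(x)
    # classify(h, x): label that hypothesis h assigns to instance x
    version_space = list(hypothesis_space)
    for x, c_x in examples:
        version_space = [h for h in version_space if classify(h, x) == c_x]
    return version_space

Each hypothesis is dropped as soon as one training example contradicts it; whatever remains is the version space.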

VERSION SPACE

A version space is a hierarchical representation of knowledge that enables you to keep track
of all the useful information supplied by a sequence of learning examples without
remembering any of the examples.

The version space method is a concept learning process accomplished by managing multiple
models within a version space.

Version Space Characteristics

• Tentative heuristics are represented using version spaces.

• A version space represents all the alternative plausible descriptions of a heuristic.

• A plausible description is one that is applicable to all known positive examples and
no known negative example.

• A version space description consists of two complementary trees:

• One that contains nodes connected to overly general models.

• One that contains nodes connected to overly specific models.

FINITE AND INFINITE HYPOTHESIS SPACES

A hypothesis is a function on the sample space, giving a value for each point in the sample
space. If the possible values are {0, 1} then we can identify a hypothesis with the subset of
those points that are given value 1. The error of a hypothesis is the probability of that
subset where the hypothesis disagrees with the true hypothesis. Learning from examples is
the process of making independent random observations and eliminating those hypotheses
that disagree with observations.

The hypothesis space is the set of all possible hypotheses (i.e., functions from inputs to the
outputs) that can be returned by a model. The hypothesis space is important because it
specifies what types of functions you can model and what types you cannot. The absolute
best error you can achieve on a dataset is lower bounded by the error of the “best” function
in your hypothesis space.

Suppose we have a finite set of hypotheses, H, and that we make m independent observations. If h is a hypothesis with error greater than ε, then the probability that it is consistent with a given observation is less than 1 - ε, and the probability that it is consistent with all m observations is less than (1 - ε)^m, which is less than exp(-εm). Therefore the total probability that some hypothesis with error greater than ε remains after m observations is less than |H| exp(-εm). We can set this bound at some desired level, say δ, and solve for m, giving

m > (ln|H| + ln(1/δ)) / ε
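
As a quick numeric check of this bound, the expression can be evaluated directly; the values of |H|, ε and δ below are illustrative choices, not values from the text:

import math

def pac_sample_bound(h_size, epsilon, delta):
    # smallest m such that |H| * exp(-epsilon * m) <= delta
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / epsilon)

# e.g. |H| = 973 (the conjunctive EnjoySport space), epsilon = 0.1, delta = 0.05
print(pac_sample_bound(973, 0.1, 0.05))  # 99 examples suffice under this bound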

Machine learning algorithms are analyzed in the PAC (Probably Approximately Correct) framework to understand how well they use data. The PAC analyses assume that the true answer/concept is in the given hypothesis space H. A consistent learner L with hypothesis space H is one that, given a training data set D, will always return a hypothesis h in H consistent with D if one exists, and otherwise will indicate that no such hypothesis exists. As the bound above shows, a finite hypothesis space always has polynomial sample complexity; for an infinite hypothesis space, the sample complexity can no longer be bounded in terms of |H| and is instead characterized by the VC dimension (discussed in a later section).

PAC LEARNING

Probably Approximately Correct (PAC) learning is a framework for mathematical analysis of


machine learning. PAC Learning deals with the question of how to choose the size of the
training set.

In this framework, the learner receives samples and must select a generalization function
(called the hypothesis) from a certain class of possible functions. The goal is that, with high
probability (the "probably" part), the selected function will have low generalization error
(the "approximately correct" part). The learner must be able to learn the concept given any
arbitrary approximation ratio, probability of success, or distribution of the samples.

Probably approximately correct (PAC) learning theory helps analyze whether and under
what conditions a learner L will probably output an approximately correct classifier.

Approximate: A hypothesis h ∈ H is approximately correct if its error over the distribution of inputs is bounded by some ε, 0 ≤ ε ≤ 1/2; i.e., error_D(h) < ε, where D is the distribution over inputs.

Probably: If L will output such a classifier with probability 1−δ, with 0 ≤ δ ≤ (1/2), we call
that classifier probably approximately correct.

Knowing that a target concept is PAC-learnable allows us to bound the sample size necessary to probably learn an approximately correct classifier, as in the bound derived in the previous section:

m ≥ (ln|H| + ln(1/δ)) / ε

To gain some intuition about this, note the effects on m when you alter the variables on the right-hand side. As the allowable error ε decreases, the necessary sample size grows. Likewise, it grows as the required probability of success increases (δ decreases), and with the size of the hypothesis space H. (Loosely, a hypothesis space is the set of classifiers the algorithm considers.) More plainly, as we consider more possible classifiers, or desire a lower error or a higher probability of correctness, we need more data to distinguish between them.

VC DIMENSION

The Vapnik–Chervonenkis (VC) dimension is a measure of the capacity (complexity,


expressive power, richness, or flexibility) of a set of functions that can be learned by a
statistical binary classification algorithm. It is defined as the cardinality of the largest set of
points that the algorithm can shatter.

The capacity of a classification model is related to how complicated it can be. For example,
consider the threshold of a high-degree polynomial: if the polynomial evaluates above zero,
that point is classified as positive, otherwise as negative. A high-degree polynomial can be
wiggly, so it can fit a given set of training points well. But one can expect that the classifier
will make errors on other points, because it is too wiggly. Such a polynomial has a high
capacity. A much simpler alternative is to threshold a linear function.

The VC dimension provides a measure of the complexity of a hypothesis space, or the "power" of a learning machine. A higher VC dimension implies the ability to represent more complex functions.

Suppose we want a model (e.g., some classifier) that generalizes well on unseen data. And
we are limited to a specific amount of sample data.

The following figure shows some Models (S1 up to Sk) of differing complexity (VC
dimension), here shown on the x-axis and called h.

Figure 1.7: S1 up to Sk models of differing VC dimension

The diagram shows that a higher VC dimension allows for a lower empirical risk (the error a
model makes on the sample data), but also introduces a higher confidence interval. This

interval can be seen as the confidence in the model's ability to generalize.

Low VC dimension (high bias)

If we use a model of low complexity, we introduce an assumption (bias) regarding the dataset; e.g., when using a linear classifier, we assume the data can be described by a linear model. If this is not the case, the given problem cannot be solved by a linear model, for example because the problem is of a nonlinear nature. We will end up with a badly performing model which will not be able to learn the data's structure. We should therefore try to avoid introducing a strong bias.

High VC dimension (greater confidence interval)

On the other side of the x-axis, we see models of higher complexity which might have such great capacity that they memorize the data instead of learning its general underlying structure, i.e., the model overfits. After realizing this problem, it seems that we should avoid complex models.

This may seem contradictory: we should not introduce a strong bias (i.e., too low a VC dimension), but we should also not have too high a VC dimension. This problem has deep roots in statistical learning theory and is known as the bias-variance tradeoff. What we should do in this situation is be as complex as necessary and as simple as possible, so when comparing two models which end up with the same empirical error, we should use the less complex one.

Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester

Unit II

Syllabus: Supervised Learning Algorithms: Learning a Class from Examples, Linear,


Non-linear, Multi-class and Multi-label classification, Decision Trees: ID3, Classification
and Regression Trees (CART), Regression: Linear Regression, Multiple Linear Regression,
Logistic Regression, Neural Networks: Introduction, Perceptron, Multilayer Perceptron,
Support vector machines: Linear and NonLinear, Kernel Functions, K-Nearest Neighbors.

Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO2): Apply various supervised learning methods to appropriate
problems.

SUPERVISED LEARNING ALGORITHMS

In Supervised learning, you train the machine using data which is well "labeled." It means
some data is already tagged with the correct answer. It can be compared to learning which
takes place in the presence of a supervisor or a teacher.

A supervised learning algorithm learns from labeled training data and helps you predict outcomes for unforeseen data. Successfully building, scaling, and deploying an accurate supervised machine learning data science model takes time and technical expertise from a team of highly skilled data scientists. Moreover, data scientists must rebuild models to make sure the insights remain true as the data changes.

• Supervised learning allows you to collect data or produce a data output from previous experience.

• Helps to optimize performance criteria using experience

• Supervised machine learning helps to solve various types of real-world computation


problems.

LEARNING A CLASS FROM EXAMPLES

Suppose we want to learn the class, C, of a “family car.” We have a set of examples of cars,
and we have a group of people that we survey to whom we show these cars. The people

look at the cars and label them; the cars that they believe are family cars are positive
examples, and the other are negative examples. Class learning is finding a description that is
shared by all the positive examples and none of the negative examples.

The features that separate a family car from other types of cars are the price and engine power. These two attributes are the inputs to the class recognizer.

Class C of a “family car”:-

• Prediction: Is car x a family car?

• Knowledge extraction: What do people expect from a family car?

Output: Positive (+) and negative (–) examples

Input representation:

x1: price, x2: engine power

Training Set X

Figure 2.1 Training set for the class of a “family car”

Class C has price as the first input attribute x1 and engine power as the second attribute x2

Figure 2.2 Example of a hypothesis class. The class of family car is a rectangle

Our training data can now be plotted in the two-dimensional (x1, x2) space, where each instance t is a data point at coordinates (x1^t, x2^t) and its type, namely positive versus negative, is given by r^t (see Figure 2.1). After further discussions with the expert and the analysis of the data, we may have reason to believe that for a car to be a family car, its price and engine power should each be in a certain range: (p1 ≤ price ≤ p2) AND (e1 ≤ engine power ≤ e2).

Figure 2.3: The error of hypothesis h given the training set X

The aim is to find h ∈ H that is as similar as possible to C, so that the hypothesis h makes a good prediction for an instance x. What we have is the training set X, which is a small subset of the set of all possible x. The empirical error is the proportion of training instances where the predictions of h do not match the required values given in X.
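
A minimal sketch of a rectangle hypothesis and its empirical error on a training set; the price/engine-power bounds and the three data points are illustrative values, not taken from the figures:

def h(price, power, p1, p2, e1, e2):
    # rectangle hypothesis: positive iff both attributes lie in the chosen ranges
    return 1 if (p1 <= price <= p2) and (e1 <= power <= e2) else 0

def empirical_error(X, r, p1, p2, e1, e2):
    # X: list of (price, engine power) instances, r: labels (1 = family car, 0 = not)
    mistakes = sum(1 for (x1, x2), rt in zip(X, r) if h(x1, x2, p1, p2, e1, e2) != rt)
    return mistakes / len(X)

X = [(12000, 150), (25000, 210), (9000, 60)]
r = [1, 1, 0]
print(empirical_error(X, r, p1=10000, p2=30000, e1=100, e2=250))  # 0.0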

LINEAR CLASSIFICATION

Linear classifiers classify data into labels based on a linear combination of the input features. Therefore, these classifiers separate data using a line, a plane, or a hyperplane (a plane in more than 2 dimensions). In their basic form they can only be used to classify data that is linearly separable, but they can be modified to handle non-linearly separable data. Major algorithms in linear classification are:

Perceptron: In the perceptron, we take a weighted linear combination of the input features and pass it through a thresholding function which outputs 1 or 0. The sign of w^T x tells us which side of the plane w^T x = 0 the point x lies on. Thus, by taking the threshold as 0, the perceptron classifies data based on which side of the plane the new point lies on.

The task during training is to arrive at the plane (defined by w) that accurately classifies the training data. If the data is linearly separable, perceptron training always converges.
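
A minimal NumPy sketch of perceptron training on a linearly separable toy problem (the learning rate, number of epochs and AND-style data are illustrative assumptions):

import numpy as np

def train_perceptron(X, y, lr=1.0, epochs=20):
    # X: (n_samples, n_features) array, y: labels in {0, 1}
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias input of 1
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for xi, target in zip(Xb, y):
            pred = 1 if xi @ w > 0 else 0      # threshold w^T x at 0
            w += lr * (target - pred) * xi     # update only on mistakes
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])                      # AND function, linearly separable
w = train_perceptron(X, y)
print([1 if np.append(x, 1) @ w > 0 else 0 for x in X])  # [0, 0, 0, 1]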

Logistic Regression: In logistic regression, we take a weighted linear combination of the input features and pass it through a sigmoid function, which outputs a number between 0 and 1. Unlike the perceptron, which just tells us which side of the plane the point lies on, logistic regression gives the probability of a point lying on a particular side of the plane. The probability of classification will be very close to 1 or 0 as the point moves far away from the plane; for points very close to the plane, it is close to 0.5.

SVM: There can be multiple hyperplanes that separate linearly separable data. SVM calculates the optimal separating hyperplane using concepts of geometry. A basic SVM can only separate linearly separable data, but we can modify our data and project it into higher dimensions to make it linearly separable.

NONLINEAR CLASSIFICATION

Nonlinear functions can be used to separate instances that are not linearly separable. We’ve
many nonlinear classifiers:

K-nearest-neighbors (kNN): The KNN algorithm assumes that similar things exist in close
proximity. In other words, similar things are near to each other. The KNN algorithm hinges
on this assumption being true enough for the algorithm to be useful. KNN captures the idea
of similarity.

Kernel SVM: SVM algorithms use a set of mathematical functions that are defined as the kernel. The function of a kernel is to take data as input and transform it into the required form. Different SVM algorithms use different types of kernel functions.

Decision Tree: The decision tree is a powerful and popular tool for classification and prediction. A decision tree is a flowchart-like tree structure, where each internal node
denotes a test on an attribute, each branch represents an outcome of the test, and each leaf
node (terminal node) holds a class label.

Multilayer Perceptron: A perceptron consists of an input layer and an output layer which are fully connected. MLPs have the same input and output layers but may have multiple hidden layers in between. The input layer is the first set of perceptrons, which output positive/negative based on the observed features in the data. A hidden layer is a set of perceptrons that uses the outputs of the previous layer as inputs, instead of using the original data. There can be multiple hidden layers, and the final layer is the output layer.

MULTICLASS CLASSIFICATION

Multiclass classification means a classification task with more than two classes, e.g., classify
a set of images of fruits which may be oranges, apples, or pears. Multiclass classification
makes the assumption that each sample is assigned to one and only one label: a fruit can be
either an apple or a pear but not both at the same time.

Figure 2.4: Multi class where one column = one class

MULTILABEL CLASSIFICATION

Multilabel classification assigns to each sample a set of target labels. This can be thought of
as predicting properties of a data-point that are not mutually exclusive, such as topics that
are relevant for a document. A text might be about any of religion, politics, finance or
education at the same time or none of these.

Figure 2.5: Multi label where one column = one class

DECISION TREES

A decision tree is a structure that contains nodes and edges and is built from a dataset (a table whose columns represent features/attributes and whose rows correspond to records). Each node is either used to make a decision (known as a decision node) or to represent an outcome (known as a leaf node).

ID3

ID3 stands for Iterative Dichotomiser 3 and is named so because the algorithm iteratively (repeatedly) dichotomizes (divides) features into two or more groups at each step. ID3 uses a top-down greedy approach to build a decision tree. In simple words, the top-down approach means that we start building the tree from the top, and the greedy approach means that at each iteration we select the best feature at the present moment to create a node.

ID3 uses Information Gain (or just Gain) to find the best feature. Information Gain measures the reduction in entropy and how well a given feature separates or classifies the target classes. The feature with the highest Information Gain is selected as the best one. A minimal sketch of the entropy and gain computation is given after the steps below.

ID3 Steps

• Calculate the Information Gain of each feature.

• Considering that all rows don’t belong to the same class, split the dataset S into
subsets using the feature for which the Information Gain is maximum.

• Make a decision tree node using the feature with the maximum Information gain.

• If all rows belong to the same class, make the current node as a leaf node with the
class as its label.

• Repeat for the remaining features until we run out of all features, or the decision
tree has all leaf nodes.
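
A minimal sketch of the entropy and Information Gain computation that ID3 uses to pick the best feature; the small dataset and feature names are illustrative:

import math
from collections import Counter

def entropy(labels):
    # H(S) = -sum over classes of p * log2(p)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(rows, labels, feature_index):
    # Gain(S, A) = H(S) - sum over values v of (|S_v| / |S|) * H(S_v)
    total = len(labels)
    gain = entropy(labels)
    for value in set(r[feature_index] for r in rows):
        subset = [l for r, l in zip(rows, labels) if r[feature_index] == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain

rows = [("Sunny", "False"), ("Sunny", "True"), ("Rain", "False"), ("Rain", "True")]
labels = ["No", "No", "Yes", "No"]
print(information_gain(rows, labels, 0), information_gain(rows, labels, 1))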

CLASSIFICATION AND REGRESSION TREES (CART)

Classification and Regression Trees, or CART for short, is a term introduced to refer to decision tree algorithms that can be used for classification or regression predictive modeling problems. The CART algorithm provides a foundation for important algorithms like bagged decision trees, random forests and boosted decision trees.

CART Model Representation

The representation for the CART model is a binary tree.

This is a binary tree in which each internal node represents a single input variable (x) and a split point on that variable (assuming the variable is numeric).

The leaf nodes of the tree contain an output variable (y) which is used to make a prediction.

Given a dataset with two inputs (x), height in centimeters and weight in kilograms, and an output of gender as male or female, below is an example of a binary decision tree.

Figure 2.6: Decision Tree

The tree can be stored to file as a graph or a set of rules. For example, below is the above
decision tree as a set of rules.

• If Height > 180 cm Then Male

• If Height <= 180 cm AND Weight > 80 kg Then Male

• If Height <= 180 cm AND Weight <= 80 kg Then Female

Making Predictions with CART Models

With the binary tree representation of the CART model, making predictions is relatively straightforward. The tree is traversed by evaluating the specific input, starting at the root node of the tree.

A learned binary tree is actually a partitioning of the input space. Each input variable can be viewed as a dimension of a p-dimensional space. The decision tree splits this space up into rectangles (when p = 2 input variables) or hyper-rectangles with more inputs. New data is filtered through the tree and lands in one of the rectangles, and the output value for that rectangle is the prediction made by the model. This gives some feeling for the type of decisions that a CART model is capable of making, e.g., boxy decision boundaries.
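
The rule set read off Figure 2.6 can be applied directly in code. A minimal sketch, using only the three rules listed above (the 180 cm and 80 kg thresholds come from that example; everything else is illustrative):

def predict_gender(height_cm, weight_kg):
    # traverse the binary tree from the root: each test sends the input down
    # one branch until a leaf (class label) is reached
    if height_cm > 180:
        return "Male"
    if weight_kg > 80:
        return "Male"
    return "Female"

print(predict_gender(185, 70))  # Male
print(predict_gender(170, 65))  # Female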

REGRESSION

Regression is a statistical method used in finance, investing, and other disciplines that
attempts to determine the strength and character of the relationship between one
dependent variable (usually denoted by Y) and a series of other variables (known as
independent variables).

Regression analysis is a predictive modeling technique that analyzes the relation between the target (dependent) variable and the independent variables in a dataset. The different regression analysis techniques are used when the target and independent variables show a linear or non-linear relationship with each other and the target variable contains continuous values. Regression techniques are mainly used to determine predictor strength, to forecast trends and time series, and to study cause-and-effect relationships.

LINEAR REGRESSION

Linear regression is one of the most basic types of regression in machine learning. The linear regression model consists of a predictor variable and a dependent variable related linearly to each other. When the data involves more than one independent variable, the model is called multiple linear regression.

The below-given equation is used to denote the linear regression model:

y=mx+c+e

Where m is the slope of the line, c is an intercept, and e represents the error in the model.

Figure 2.7: Linear regression

The best fit line is determined by varying the values of m and c. The predictor error is the difference between the observed values and the predicted values. The values of m and c are selected in such a way that they give the minimum predictor error. It is important to note that a simple linear regression model is susceptible to outliers; therefore, it should not be used for big-sized data.
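
A minimal NumPy sketch of fitting y = mx + c by least squares; the data points are illustrative:

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.1, 6.2, 7.9, 10.1])

m, c = np.polyfit(x, y, 1)             # choose m and c that minimize squared error
predictions = m * x + c
print(m, c)                            # slope and intercept of the best fit line
print(np.mean((y - predictions) ** 2)) # mean squared predictor error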

MULTIPLE LINEAR REGRESSION

Multiple linear regression (MLR), is a statistical technique that uses several explanatory
variables to predict the outcome of a response variable. The goal of multiple linear
regression (MLR) is to model the linear relationship between the explanatory (independent)
variables and response (dependent) variable.

The formula for calculating multiple linear regression is:

y = b0 + b1x1 + b2x2 + ... + bpxp + e

where y is the response variable, x1, ..., xp are the explanatory variables, b0 is the intercept, b1, ..., bp are the regression coefficients and e is the error term.

LOGISTIC REGRESSION

Logistic regression is a type of regression analysis technique which is used when the dependent variable is discrete, for example 0 or 1, or true or false. This means the
target variable can have only two values, and a sigmoid curve denotes the relation between
the target variable and the independent variable.

Logit function is used in Logistic Regression to measure the relationship between the target
variable and independent variables. Below is the equation that denotes the logistic
regression.

logit(p) = ln(p / (1 - p)) = b0 + b1X1 + b2X2 + b3X3 + ... + bkXk

Where, p is the probability of occurrence of the feature.

Figure 2.8: Logistic regression

When selecting logistic regression as the regression analysis technique, it should be noted that the dataset should be large, with an almost equal occurrence of the values of the target variable. Also, there should be no multicollinearity, which means there should be no correlation between the independent variables in the dataset.
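
A minimal scikit-learn sketch of logistic regression on a toy one-feature dataset (the data values are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])     # discrete target: 0 or 1

model = LogisticRegression()
model.fit(X, y)
print(model.predict([[2.5], [4.5]]))        # predicted classes
print(model.predict_proba([[2.5], [4.5]]))  # sigmoid probabilities for each class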

NEURAL NETWORKS: Introduction

Neural Network is a computing system made up of a number of simple, highly


interconnected processing elements, which process information by their dynamic state
response to external inputs.

Figure 2.9: Neural network layers

Neural networks are processing devices (algorithms or actual hardware) that are loosely modeled after the neuronal structure of the human cerebral cortex, but on much smaller
scales. A large Neural Network might have hundreds or thousands of processor units,
whereas a human brain has billions of neurons with a corresponding increase in magnitude
of their overall interaction and emergent behavior. Neural networks are typically organized
in layers. Layers are made up of a number of interconnected 'nodes' which contain an
'activation function'. Patterns are presented to the network via the 'input layer', which
communicates to one or more 'hidden layers' where the actual processing is done via a
system of weighted 'connections'. The hidden layers then link to an 'output layer' where the
answer is output.

PERCEPTRON

A Perceptron is an algorithm used for supervised learning of binary classifiers. Binary


classifiers decide whether an input, usually represented by a series of vectors, belongs to a
specific class. In short, a perceptron is a single-layer neural network. They consist of four
main parts including input values, weights and bias, net sum, and an activation function.

The process of Perceptron begins by taking all the input values and multiplying them by
their weights. Then, all of these multiplied values are added together to create the weighted
sum. The weighted sum is then applied to the activation function, producing the
perceptron's output. The activation function plays the integral role of ensuring the output is
mapped between required values such as (0,1) or (-1,1). It is important to note that the
weight of an input is indicative of the strength of a node. Similarly, an input's bias value
gives the ability to shift the activation function curve up or down.

Figure 2.10: The process of Perceptron

As a simplified form of a neural network, specifically a single-layer neural network,


perceptrons play an important role in binary classification. This means the perceptron is
used to classify data into two parts, hence binary. Sometimes, perceptrons are also referred
to as linear binary classifiers for this reason.

MULTILAYER PERCEPTRONS

The perceptron is very useful for classifying data sets that are linearly separable. It encounters serious limitations with data sets that do not conform to this pattern, as discovered with the XOR problem: the four points of the XOR function cannot be separated by any single line, i.e., they are not linearly separable. The MultiLayer Perceptron (MLP) breaks this restriction and classifies datasets which are not linearly separable. It does this by using a more robust and complex architecture to learn regression and classification models for difficult datasets.

The Perceptron consists of an input layer and an output layer which are fully connected.
MLPs have the same input and output layers but may have multiple hidden layers in
between the aforementioned layers

The algorithm for the MLP is as follows (a minimal forward-pass sketch in Python is given after the steps):

• Just as with the perceptron, the inputs are pushed forward through the MLP by taking the dot product of the input with the weights that exist between the input layer and the hidden layer (the input-to-hidden weights). This dot product yields a value at the hidden layer. We do not push this value forward as we would with a perceptron, though.

• MLPs utilize activation functions at each of their calculated layers. There are many
activation functions to discuss: rectified linear units (ReLU), sigmoid function, tanh.
Push the calculated output at the current layer through any of these activation
functions.

• Once the calculated output at the hidden layer has been pushed through the
activation function, push it to the next layer in the MLP by taking the dot product
with the corresponding weights.

• Repeat steps two and three until the output layer is reached.

• At the output layer, the calculations will either be used for a backpropagation
algorithm that corresponds to the activation function that was selected for the MLP
(in the case of training) or a decision will be made based on the output (in the case
of testing).
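
A minimal NumPy sketch of the forward pass described above, with one hidden layer and a sigmoid activation; the layer sizes and random weights are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, W_ih, b_h, W_ho, b_o):
    hidden_in = x @ W_ih + b_h               # dot product with input-to-hidden weights
    hidden_out = sigmoid(hidden_in)          # push through the activation function
    return sigmoid(hidden_out @ W_ho + b_o)  # repeat for the output layer

rng = np.random.default_rng(0)
x = np.array([0.5, -1.2, 3.0])                    # 3 input features
W_ih, b_h = rng.normal(size=(3, 4)), np.zeros(4)  # 4 hidden units
W_ho, b_o = rng.normal(size=(4, 1)), np.zeros(1)  # 1 output unit
print(mlp_forward(x, W_ih, b_h, W_ho, b_o))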

MLPs form the basis for all neural networks and have greatly improved the power of
computers when applied to classification and regression problems. Computers are no
longer limited by XOR cases and can learn rich and complex models thanks to the multilayer
perceptron.

SUPPORT VECTOR MACHINES: LINEAR AND NONLINEAR

Support Vector Machine is a linear model for classification and regression problems. It can solve linear and non-linear problems and works well for many practical problems. The idea of SVM is simple: the algorithm creates a line or a hyperplane which separates the data into classes.

SVM is an algorithm that takes the data as an input and outputs a line that separates those
classes if possible. Suppose we have a dataset as shown below and we need to classify the
red rectangles from the blue ellipses (positives from the negatives). So, our task is to find an
ideal line that separates this dataset in two classes (red and blue).

Figure 2.11: Data set to classify Blue and Red

There are infinitely many lines that can separate these two classes.

Figure 2.12: Classes separated by many lines

It’s visually quite intuitive in this case that the yellow line classifies better. The green line in the image above is quite close to the red class. Though it classifies the current dataset, it is not a generalized line, and in machine learning our goal is to find a more generalized separator.

The SVM algorithm finds the points closest to the line from both classes. These points are called support vectors. Now, compute the distance between the line and the support vectors. This distance is called the margin. The goal is to maximize the margin. The hyperplane for which the margin is maximum is called the optimal hyperplane.

Figure 2.13: Computing the Hyperplane

SVM tries to make a decision boundary in such a way that the separation between the two
classes (that street) is as wide as possible.

For a more complex dataset, which is not linearly separable, we use a non-linear SVM.

Figure 2.14: Data set for non linear SVM

This data is clearly not linearly separable. We cannot draw a straight line that can classify
this data. But this data can be converted to linearly separable data in higher dimension.
Let’s add one more dimension and call it z-axis. Let the co-ordinates on z-axis be governed
by the constraint,

z = x²+y²

So, basically z co-ordinate is the square of distance of the point from origin. Let’s plot the
data on z-axis.

Figure 2.15: Dataset on higher dimension

Now the data is clearly linearly separable. Let the purple line separating the data in the higher dimension be z = k, where k is a constant. Since z = x² + y², we get x² + y² = k, which is the equation of a circle. So, we can project this linear separator in the higher dimension back into the original dimensions using this transformation.

Figure 2.16: Decision boundary in original dimensions

Thus, we can classify data by adding an extra dimension to it so that it becomes linearly
separable and then projecting the decision boundary back to original dimensions using
mathematical transformation.

KERNEL FUNCTIONS

A kernel function is a method used to take data as input and transform it into the required form for processing. The term "kernel" is used because the set of mathematical functions used in the Support Vector Machine provides a window to manipulate the data. So, a kernel function generally transforms the training set of data so that a non-linear decision surface is able to be
transformed into a linear equation in a higher-dimensional space. Basically, the kernel returns the inner product between two points in a suitable feature space.

Standard Kernel Function Equation: K(x, y) = <φ(x), φ(y)>, where φ maps the inputs into the feature space.

Major kernel functions are listed below; a short scikit-learn comparison follows the list. To implement kernel functions, first install the scikit-learn library from the command prompt: pip install scikit-learn

Gaussian Kernel: It is used to perform transformation when there is no prior knowledge about the data.

Gaussian Kernel Radial Basis Function (RBF): Same as the above kernel function, adding the radial basis method to improve the transformation; a commonly used form is K(x, y) = exp(-γ ||x - y||²).

Figure 2.17: Gaussian Kernel Radial Basis Function

Sigmoid Kernel: This function is equivalent to a two-layer perceptron model of a neural network and is used as an activation function for artificial neurons; a commonly used form is K(x, y) = tanh(γ x^T y + c).

Figure 2.18: Sigmoid Kernel

Polynomial Kernel: It represents the similarity of vectors in the training set in a feature space over polynomials of the original variables used in the kernel; a commonly used form is K(x, y) = (x^T y + c)^d.

Figure 2.19: Polynomial Kernel

Linear Kernel: Used when data is linearly separable.
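
A minimal scikit-learn sketch comparing these kernels on a small non-linear dataset (the dataset and parameter choices are illustrative):

from sklearn.svm import SVC
from sklearn.datasets import make_circles

# concentric circles: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = SVC(kernel=kernel, gamma="scale")
    clf.fit(X, y)
    print(kernel, clf.score(X, y))  # training accuracy; the rbf kernel handles the circles well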

K-NEAREST NEIGHBORS

The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine learning algorithm that can be used to solve both classification and regression problems. The KNN algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. KNN works by finding the distances between a query and all the examples in the data, selecting the specified number of examples (K) closest to the query, then voting for the most frequent label (in the case of classification) or averaging the labels (in the case of regression).

In the case of both classification and regression, choosing the right K for the data is done by trying several values of K and picking the one that works best. A minimal Python sketch of the procedure is given after the steps below.

The KNN Algorithm

1. Load the data

2. Initialize K to your chosen number of neighbors

3. For each example in the data

3.1 Calculate the distance between the query example and the current example from the
data.

3.2 Add the distance and the index of the example to an ordered collection

4. Sort the ordered collection of distances and indices from smallest to largest (in ascending
order) by the distances

5. Pick the first K entries from the sorted collection

6. Get the labels of the selected K entries

7. If regression, return the mean of the K labels

8. If classification, return the mode of the K labels
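
A minimal Python sketch of these steps for classification, using Euclidean distance; the toy data is illustrative:

import math
from collections import Counter

def knn_classify(query, data, labels, k):
    # step 3: distance from the query to every example, kept with its index
    distances = [(math.dist(query, x), i) for i, x in enumerate(data)]
    # steps 4-5: sort by distance and keep the first K entries
    nearest = sorted(distances)[:k]
    # steps 6-8: collect the labels of the K neighbours and return the mode
    votes = [labels[i] for _, i in nearest]
    return Counter(votes).most_common(1)[0][0]

data = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8)]
labels = ["A", "A", "B", "B"]
print(knn_classify((1.1, 1.0), data, labels, k=3))  # "A"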

Advantages

• The algorithm is simple and easy to implement.

• There’s no need to build a model, tune several parameters, or make additional


assumptions.

• The algorithm is versatile. It can be used for classification, regression, and search

Disadvantages

• The algorithm gets significantly slower as the number of examples and/or


predictors/independent variables increase.

Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester

Unit III

Syllabus: Ensemble Learning: Ensemble Learning Model Combination Schemes,


Voting, Error-Correcting Output Codes, Bagging: Random Forest Trees, Boosting:
Adaboost, Stacking.

Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO3): Identify and integrate more than one technique to enhance the
performance of learning.

ENSEMBLE LEARNING
Ensemble learning is the process by which multiple models, such as classifiers or
experts, are strategically generated and combined to solve a particular computational
intelligence problem. Ensemble learning is primarily used to improve the performance
of a model on a task (classification, prediction, function approximation, etc.).

Ensemble methods are techniques that create multiple models and then combine them
to produce improved results. Ensemble methods usually produce more accurate
solutions than a single model would. This has been the case in a number of machine
learning competitions, where the winning solutions used ensemble methods.

Figure 3.1: Ensemble learning

A model is the output of an algorithm trained with data. This model is then used for
making predictions. The algorithm can be any machine learning algorithm, such as
logistic regression, decision tree, etc. These models, when used as inputs of ensemble
methods, are called “base models”.

ENSEMBLE LEARNING MODEL COMBINATION SCHEMES


Some ensemble methods use different types of base learning algorithms: such
heterogeneous weak learners are then combined into a “heterogeneous ensemble
model”.

There are different ways the multiple base-learners are combined to generate the final
output:

• Multiexpert methods have base-learners that work in parallel. These methods


can in turn be divided into two:

o In the global approach, also called learner fusion, given an input, all base-
learners generate an output, and all these outputs are used. Examples are
voting and stacking.

o In the local approach, or learner selection, for example, in mixture of


experts, there is a gating model, which looks at the input and chooses one
(or very few) of the learners as responsible for generating the output

• Multistage combination methods use a serial approach where the next base-
learner is trained with or tested on only the instances where the previous base-
learners are not accurate enough. The idea is that the base-learners (or the
different representations they use) are sorted in increasing complexity so that a
complex base-learner is not used (or its complex representation is not extracted)
unless the preceding simpler base-learners are not confident. An example is
cascading.

Figure 3.2: Base-learners are dj and their outputs are combined using f ()

Let us say that we have L base-learners. We denote by dj(x) the prediction of base-learner Mj given the arbitrary-dimensional input x. In the case of multiple representations, each Mj uses a different input representation xj. The final prediction is calculated from the predictions of the base-learners:

y = f(d1, d2, ..., dL | Φ)

where f() is the combining function, with Φ denoting its parameters. When there are K outputs, each learner produces dji(x), i = 1,...,K, j = 1,...,L, and, combining them, we also generate K values yi, i = 1,...,K. Then, for example in classification, we choose the class with the maximum yi value:

Choose Ci if yi = max_k yk

VOTING
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a linear combination of the learners (see figure 3.2):

yi = Σj wj dji,   where wj ≥ 0 and Σj wj = 1

This is also known as ensembles and linear opinion pools. In the simplest case, all
learners are given equal weight and we have simple voting that corresponds to taking
an average. Still, taking a (weighted) sum is only one of the possibilities and there are
also other combination rules, as shown in table 3.1. If the outputs are not posterior
probabilities, these rules require that outputs be normalized to the same scale.

Table 3.1: Classifier combination rules

An example of the use of these rules is shown in table 3.2, which demonstrates the
effects of different rules. Sum rule is the most intuitive and is the most widely used in
practice. Median rule is more robust to outliers; minimum and maximum rules are
pessimistic and optimistic, respectively.

Table 3.2: Example of combination rules on three learners and three classes


With the product rule, each learner has veto power; regardless of the other ones, if one
learner has an output of 0, the overall output goes to 0. After the combination rules, yi
do not necessarily sum up to 1.

In weighted sum, dji is the vote of learner j for class Ci and wj is the weight of its vote.
Simple voting is a special case where all voters have equal weight, namely, wj = 1/L. In
classification, this is called plurality voting, where the class having the maximum
number of votes is the winner. When there are two classes, this is majority voting, where
the winning class gets more than half of the votes. If the voters can also supply the
additional information of how much they vote for each class (e.g., by the posterior
probability), then after normalization, these can be used as weights in a weighted voting
scheme. Equivalently, if dji are the class posterior probabilities, P(Ci|x, Mj), then we can
just sum them up (wj = 1/L) and choose the class with maximum yi.
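As a rough illustration of these combination rules, the sketch below (Python, assuming NumPy is available) applies them to made-up outputs of three base-learners over three classes; the learner outputs and weights are purely illustrative.

import numpy as np

# d[j, i] = output of base-learner j for class C_i (e.g., posterior probabilities)
d = np.array([
    [0.2, 0.5, 0.3],   # learner 1
    [0.0, 0.6, 0.4],   # learner 2
    [0.4, 0.4, 0.2],   # learner 3
])

y_sum     = d.mean(axis=0)          # sum (average) rule
y_median  = np.median(d, axis=0)    # median rule, more robust to outliers
y_min     = d.min(axis=0)           # minimum rule (pessimistic)
y_max     = d.max(axis=0)           # maximum rule (optimistic)
y_product = d.prod(axis=0)          # product rule (each learner has veto power)

# weighted sum with weights w_j (equal weights here, i.e. simple voting)
w = np.array([1/3, 1/3, 1/3])
y_weighted = w @ d

print("predicted class (sum rule):", np.argmax(y_sum))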

In the case of regression, simple or weighted averaging or median can be used to fuse
the outputs of base-regressors. Median is more robust to noise than the average.

Another possible way to find wj is to assess the accuracies of the learners (regressor or
classifier) on a separate validation set and use that information to compute the weights,
so that we give more weights to more accurate learners. These weights can also be
learned from data.

Voting schemes can be seen as approximations under a Bayesian framework, with
weights approximating prior model probabilities and model decisions approximating
model-conditional likelihoods. For example, in classification we have wj ≡ P(Mj), dji =
P(Ci|x, Mj), and the combination corresponds to

P(Ci|x) = Σ over all models Mj of P(Ci|x, Mj) P(Mj)

Simple voting corresponds to a uniform prior. If we have a prior distribution preferring
simpler models, this would give larger weights to them.

We cannot integrate over all models; we only choose a subset for which we believe P
(Mj ) is high, or we can have another Bayesian step and calculate P (Mj |X), the
probability of a model given the sample, and sample high probable models from this
density.

Let us assume that dj are iid with expected value E[dj] and variance Var(dj). Then, when
we take a simple average with wj = 1/L, the expected value and variance of the output
are

E[y] = E[(1/L) Σj dj] = E[dj]
Var(y) = Var((1/L) Σj dj) = (1/L²) · L · Var(dj) = Var(dj)/L

The expected value does not change, so the bias does not change. But the variance, and
therefore the mean square error, decreases as the number of independent voters, L,
increases. In the general case,

Var(y) = (1/L²) [ Σj Var(dj) + 2 Σj Σi<j Cov(di, dj) ]

which implies that if learners are positively correlated, variance (and error) increase.
We can thus view using different algorithms and input features as efforts to decrease, if
not eliminate, the positive correlation.
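A quick simulation, not part of the original derivation, can illustrate this variance reduction under the independence assumption; the normal distribution and the numbers below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
L, trials = 10, 100_000
d = rng.normal(loc=1.0, scale=1.0, size=(trials, L))  # L iid base-learner outputs

single = d[:, 0]            # one learner
averaged = d.mean(axis=1)   # simple voting with w_j = 1/L

print(single.mean(), averaged.mean())   # both close to 1.0 (bias unchanged)
print(single.var(), averaged.var())     # variance drops from ~1.0 to ~1/L = 0.1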

Further decrease in variance is possible if the voters are not independent but negatively
correlated. The error then decreases if the accompanying increase in bias is not higher
because these aims are contradictory; we cannot have a number of classifiers that are
all accurate and negatively correlated. In mixture of experts for example, where learners
are localized, the experts are negatively correlated but biased.

If we view each base-learner as a random noise function added to the true
discriminant/regression function, and if these noise functions are uncorrelated with 0
mean, then the averaging of the individual estimates is like averaging over the noise. In
this sense, voting has the effect of smoothing in the functional space and can be thought
of as a regularizer with a smoothness assumption on the true function.

ERROR-CORRECTING OUTPUT CODES


In error-correcting output codes (ECOC), the main classification task is defined in terms
of a number of subtasks that are implemented by the base-learners. The idea is that the
original task of separating one class from all other classes may be a difficult problem.
Instead, we want to define a set of simpler classification problems, each specializing in
one aspect of the task, and combining these simpler classifiers, we get the final
classifier.

Base-learners are binary classifiers having output −1/ + 1, and there is a code matrix W
of K × L whose K rows are the binary codes of classes in terms of the L base-learners dj .
For example, if the second row of W is [−1, +1, +1, −1], this means that for us to say an
instance belongs to C2, the instance should be on the negative side of d1 and d4, and on
the positive side of d2 and d3. Similarly, the columns of the code matrix define the task


of the base-learners. For example, if the third column is [−1, +1, +1]T , we understand
that the task of the third base-learner, d3, is to separate the instances of C1 from the
instances of C2 and C3 combined. This is how we form the training set of the base-
learners. For example in this case, all instances labeled with C2 and C3 form X+3 and
instances labeled with C1 form X−3 , and d3 is trained so that xt ∈ X+3 give output +1
and xt ∈ X−3 give output −1.

The code matrix thus allows us to define a polychotomy (K > 2 classification problem) in
terms of dichotomies (K = 2 classification problem), and it is a method that is applicable
using any learning algorithm to implement the base-learners—for example, linear or
multilayer perceptrons (with a single output), decision trees, or SVMs whose original
definition is for two-class problems.

The typical one discriminant per class setting corresponds to the diagonal code matrix
where L = K. For example, for K = 4, we have

W =
+1 −1 −1 −1
−1 +1 −1 −1
−1 −1 +1 −1
−1 −1 −1 +1

The problem here is that if there is an error with one of the base-learners, there may be
a misclassification because the class code words are so similar. So the approach in
error-correcting codes is to have L > K and increase the Hamming distance between the
code words. One possibility is pairwise separation of classes where there is a separate
base-learner to separate Ci from Cj, for i < j. In this case, L = K(K − 1)/2 and with K = 4,
the code matrix is

W =
+1 +1 +1  0  0  0
−1  0  0 +1 +1  0
 0 −1  0 −1  0 +1
 0  0 −1  0 −1 −1

where a 0 entry denotes “don’t care.” That is, d1 is trained to separate C1 from C2 and
does not use the training instances belonging to the other classes. Similarly, we say that
an instance belongs to C2 if d1 = −1 and d4 = d5 = +1, and we do not consider the values
of d2, d3, and d6. The problem here is that L is O(K2), and for large K pairwise
separation may not be feasible.

If we can have L high, we can just randomly generate the code matrix with −1/ + 1 and
this will work fine, but if we want to keep L low, we need to optimize W. The approach is
to set L beforehand and then find W such that the distances between rows, and at the
same time the distances between columns, are as large as possible, in terms of Hamming
distance. With K classes, there are 2^(K−1) − 1 possible columns, namely, two-class


problems. This is because K bits can be written in 2^K different ways, and complements
(e.g., "0101" and "1010," from our point of view, define the same discriminant) divide
the possible combinations by 2; we then subtract 1 because a column of all 0s (or 1s)
is useless. For example, when K = 4, we have

W =
−1 −1 −1 −1 −1 −1 −1
−1 −1 −1 +1 +1 +1 +1
−1 +1 +1 −1 −1 +1 +1
+1 −1 +1 −1 +1 −1 +1

When K is large, for a given value of L, we look for L columns out of the 2^(K−1) − 1. We
would like these columns of W to be as different as possible so that the tasks to be
learned by the base-learners are as different from each other as possible. At the same
time, we would like the rows of W to be as different as possible so that we can have
maximum error correction in case one or more base-learners fail. ECOC can be written
as a voting scheme where the entries of W, wij, are considered as vote weights:

yi = Σj wij dj

and then we choose the class with the highest yi. Taking a weighted sum and then
choosing the maximum instead of checking for an exact match allows dj to no longer
need to be binary but to take a value between −1 and +1, carrying soft certainties
instead of hard decisions. Note that a value pj between 0 and 1, for example, a posterior
probability, can be converted to a value dj between −1 and +1 simply as dj = 2pj – 1
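A small sketch of ECOC decoding is given below; W is the pairwise code matrix for K = 4 discussed above, and the base-learner outputs d are invented soft values in [−1, +1].

import numpy as np

W = np.array([          # rows = classes, columns = base-learners (0 = "don't care")
    [+1, +1, +1,  0,  0,  0],
    [-1,  0,  0, +1, +1,  0],
    [ 0, -1,  0, -1,  0, +1],
    [ 0,  0, -1,  0, -1, -1],
])

d = np.array([-0.9, 0.1, 0.3, 0.8, 0.7, -0.2])   # hypothetical soft outputs of the L = 6 learners

y = W @ d                            # y_i = sum_j w_ij * d_j
predicted_class = int(np.argmax(y)) + 1
print("predicted class:", predicted_class)   # here class 2 (d1 negative, d4 and d5 positive)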

BAGGING
Bagging is a voting method whereby base-learners are made different by training them
over slightly different training sets. Bagging, which often considers homogeneous weak
learners, learns them independently from each other in parallel and combines them
following a deterministic averaging process.

In parallel methods we fit the different learners independently from each other, so it is
possible to train them concurrently. The most famous such approach is "bagging"
(standing for "bootstrap aggregating"), which aims at producing an ensemble model that
is more robust than the individual models composing it.

When training a model, no matter if we are dealing with a classification or a regression
problem, we obtain a function that takes an input, returns an output, and that is defined
with respect to the training dataset. Due to the theoretical variance of the training
dataset (we remind that a dataset is an observed sample coming from a true unknown
with respect to the training dataset. Due to the theoretical variance of the training
dataset (we remind that a dataset is an observed sample coming from a true unknown


underlying distribution), the fitted model is also subject to variability: if another dataset
had been observed, we would have obtained a different model.

The idea of bagging is then simple: we want to fit several independent models and
“average” their predictions in order to obtain a model with a lower variance. However,
we can’t, in practice, fit fully independent models because it would require too much
data. So, we rely on the good “approximate properties” of bootstrap samples
(representativity and independence) to fit models that are almost independent.

First, we create multiple bootstrap samples so that each new bootstrap sample will act
as another (almost) independent dataset drawn from the true distribution. Then, we can
fit a weak learner for each of these samples and finally aggregate them such that we kind
of "average" their outputs and, so, obtain an ensemble model with less variance than its
components. Roughly speaking, as the bootstrap samples are approximately
independent and identically distributed (i.i.d.), so are the learned base models. Then,
"averaging" the weak learners' outputs does not change the expected answer but reduces
its variance (just like averaging i.i.d. random variables preserves the expected value but
reduces the variance).

So, if we have L bootstrap samples (approximations of L independent datasets) of size B,
denoted

{z1^l, z2^l, ..., zB^l},   l = 1, ..., L

we can fit L almost independent weak learners (one on each dataset)

w1(.), w2(.), ..., wL(.)

and then aggregate them into some kind of averaging process in order to get an
ensemble model with a lower variance. For example, for a regression problem we can
define our strong model as the simple average

sL(.) = (1/L) Σl wl(.)

while for a classification problem the weak learners' outputs can be combined by a
majority vote.


There are several possible ways to aggregate the multiple models fitted in parallel. For a
regression problem, the outputs of individual models can literally be averaged to obtain
the output of the ensemble model. For a classification problem, the class output by
each model can be seen as a vote, and the class that receives the majority of the votes is
returned by the ensemble model (this is called hard voting). Still for a classification
problem, we can also consider the probabilities of each class returned by all the models,
average these probabilities, and keep the class with the highest average probability (this
is called soft voting). Averages or votes can either be simple or weighted if any relevant
weights can be used.
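A minimal sketch of bagging, assuming scikit-learn is available, is shown below; the synthetic dataset and the choice of 50 decision trees are illustrative.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

bag = BaggingClassifier(
    DecisionTreeClassifier(),   # the base model
    n_estimators=50,            # L bootstrap samples / weak learners
    bootstrap=True,             # sample the training set with replacement
    n_jobs=-1,                  # the L fits are independent, so they can run in parallel
    random_state=0,
)
bag.fit(X_train, y_train)       # each tree is fitted on its own bootstrap sample
print("test accuracy:", bag.score(X_test, y_test))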

Finally, we can mention that one of the big advantages of bagging is that it can be
parallelized. As the different models are fitted independently from each other,
intensive parallelization techniques can be used if required.

Figure 3.3: Bagging consists in fitting several base models on different bootstrap
samples

RANDOM FOREST TREES


The random forest approach is a bagging method where deep trees, fitted on bootstrap
samples, are combined to produce an output with lower variance. However, random
forests also use another trick to make the multiple fitted trees a bit less correlated with
each other: when growing each tree, instead of only sampling over the observations in
the dataset to generate a bootstrap sample, we also sample over features and keep only
a random subset of them to build the tree.


Figure 3.4: Random Forest Approach

Sampling over features has indeed the effect that all trees do not look at the exact same
information to make their decisions and, so, it reduces the correlation between the
different returned outputs. Another advantage of sampling over the features is that it
makes the decision-making process more robust to missing data: observations (from
the training dataset or not) with missing data can still be regressed or classified based
on the trees that take into account only features where data are not missing. Thus,
random forest algorithm combines the concepts of bagging and random feature
subspace selection to create more robust models.
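As a brief sketch, assuming scikit-learn is available, a random forest combining bootstrap sampling with random feature selection can be fitted as follows; the Iris dataset and the parameter values are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(
    n_estimators=200,      # number of deep trees, each fitted on a bootstrap sample
    max_features="sqrt",   # random subset of features considered at each split
    random_state=0,
)
print("cross-validated accuracy:", cross_val_score(forest, X, y, cv=5).mean())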

BOOSTING
Boosting methods work in the same spirit as bagging methods: we build a family of
models that are aggregated to obtain a strong learner that performs better. However,
unlike bagging, which mainly aims at reducing variance, boosting is a technique that
consists in fitting multiple weak learners sequentially in a very adaptive way: each
model in the sequence is fitted giving more importance to observations in the dataset
that were badly handled by the previous models in the sequence. Intuitively, each new
model focuses its efforts on the most difficult observations to fit up to now, so that we
obtain, at the end of the process, a strong learner with lower bias (even if we can notice
that boosting can also have the effect of reducing variance). Boosting, like bagging, can
be used for regression as well as for classification problems.

Being mainly focused at reducing bias, the base models that are often considered for
boosting are models with low variance but high bias. For example, if we want to use
trees as our base models, we will most of the time choose shallow decision trees with
only a few levels of depth. Another important reason that motivates the use of low variance but
high bias models as weak learners for boosting is that these models are in general less
computationally expensive to fit (few degrees of freedom when parametrized). Indeed,
as computations to fit the different models can’t be done in parallel (unlike bagging), it
could become too expensive to fit sequentially several complex models.


Once the weak learners have been chosen, we still need to define how they will be
sequentially fitted (what information from previous models do we consider when fitting
current model?) and how they will be aggregated (how do we aggregate the current
model to the previous ones?). We will discuss these questions in the two following
subsections, describing more especially two important boosting algorithms: adaboost
and gradient boosting.

In a nutshell, these two meta-algorithms differ on how they create and aggregate the
weak learners during the sequential process. Adaptive boosting updates the weights
attached to each of the training dataset observations whereas gradient boosting updates
the value of these observations. This main difference comes from the way both methods
try to solve the optimization problem of finding the best model that can be written as a
weighted sum of weak learners.

Figure 3.5: Boosting method

ADABOOST
In adaptive boosting (often called "adaboost"), we try to define our ensemble model as
a weighted sum of L weak learners

s_L(.) = sum over l = 1,...,L of c_l * w_l(.)

where the c_l's are coefficients and the w_l's are weak learners. Finding the best
ensemble model with this form is a difficult optimization problem. Then, instead of
trying to solve it in one single shot (finding all the coefficients and weak learners that
give the best overall additive model), we make use of an iterative optimization process
that is much more tractable, even if it can lead to a sub-optimal solution. More
especially, we add the weak learners one by one, looking at each iteration for the best
possible pair (coefficient, weak learner) to add to the current ensemble model. In other
words, we define recurrently the (s_l)'s such that

s_l(.) = s_(l-1)(.) + c_l * w_l(.)


where c_l and w_l are chosen such that s_l is the model that best fits the training data
and, so, is the best possible improvement over s_(l-1). We can then denote

(c_l, w_l) = argmin over (c, w) of E(s_(l-1)(.) + c * w(.))
           = argmin over (c, w) of the sum over n of e(y_n, s_(l-1)(x_n) + c * w(x_n))

where E(.) is the fitting error of the given model and e(.,.) is the loss/error function.
Thus, instead of optimizing “globally” over all the L models in the sum, we approximate
the optimum by optimizing “locally” building and adding the weak learners to the
strong model one by one.

More especially, when considering a binary classification problem, we can show that the
adaboost algorithm can be re-written into a process that proceeds as follows. First, it
updates the observation weights in the dataset and trains a new weak learner with a
special focus given to the observations misclassified by the current ensemble model.
Second, it adds the weak learner to the weighted sum according to an update coefficient
that expresses the performance of this weak model: the better a weak learner
performs, the more it contributes to the strong learner.

So, assume that we are facing a binary classification problem, with N observations in
our dataset and we want to use adaboost algorithm with a given family of weak models.
At the very beginning of the algorithm (first model of the sequence), all the observations
have the same weights 1/N. Then, we repeat L times (for the L learners in the sequence)
the following steps:

• fit the best possible weak model with the current observations weights
• compute the value of the update coefficient that is some kind of scalar evaluation
metric of the weak learner that indicates how much this weak learner should be
taken into account into the ensemble model
• update the strong learner by adding the new weak learner multiplied by its
update coefficient
• compute new observations weights that expresses which observations we would
like to focus on at the next iteration (weights of observations wrongly predicted
by the aggregated model increase and weights of the correctly predicted
observations decrease)
Repeating these steps, we have then sequentially built our L models and aggregated them
into a simple linear combination weighted by coefficients expressing the performance of
each learner. Notice that there exist variants of the initial adaboost algorithm, such as
LogitBoost (classification) or L2Boost (regression), that mainly differ by their choice of
loss function.
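A hedged sketch of this loop for binary labels coded as −1/+1 is given below, using decision stumps as the weak learners; the synthetic dataset, the number of learners L and the particular weight-update formulas follow the classical discrete adaboost scheme and are illustrative rather than the only possible choices.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_features=5, random_state=0)
y = 2 * y - 1                        # recode labels to -1 / +1
N, L = len(y), 25
w = np.full(N, 1.0 / N)              # all observations start with weight 1/N
stumps, alphas = [], []

for _ in range(L):
    # fit the best possible weak model with the current observation weights
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w * (pred != y)) / np.sum(w)           # weighted training error
    alpha = 0.5 * np.log((1 - err) / max(err, 1e-10))   # update coefficient of this learner
    w *= np.exp(-alpha * y * pred)                      # increase weights of misclassified points
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# strong learner: sign of the weighted sum of the weak learners
F = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print("training accuracy:", np.mean(np.sign(F) == y))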


Figure 3.6: Adaboost updates weights of the observations at each iteration.

STACKING
Stacking mainly differs from bagging and boosting on two points. First, stacking often
considers heterogeneous weak learners (different learning algorithms are combined),
whereas bagging and boosting consider mainly homogeneous weak learners. Second,
stacking learns to combine the base models using a meta-model, whereas bagging and
boosting combine weak learners following deterministic algorithms.

The idea of stacking is to learn several different weak learners and combine them by
training a meta-model to output predictions based on the multiple predictions returned
by these weak models. So, we need to define two things in order to build our stacking
model: the L learners we want to fit and the meta-model that combines them.

For example, for a classification problem, we can choose as weak learners a KNN
classifier, a logistic regression and an SVM, and decide to learn a neural network as the
meta-model. Then, the neural network will take as inputs the outputs of our three weak
learners and will learn to return final predictions based on them.

So, assume that we want to fit a stacking ensemble composed of L weak learners. Then
we have to follow the steps thereafter:

• split the training data in two folds
• choose L weak learners and fit them to the data of the first fold
• for each of the L weak learners, make predictions for observations in the second fold
• fit the meta-model on the second fold, using the predictions made by the weak
  learners as inputs
In the previous steps, we split the dataset in two folds because predictions on data that
have been used for the training of the weak learners are not relevant for the training of


the meta-model. Thus, an obvious drawback of this split of our dataset in two parts is
that we only have half of the data to train the base models and half of the data to train
the meta-model. In order to overcome this limitation, we can however follow some kind
of “k-fold cross-training” approach (similar to what is done in k-fold cross-validation)
such that all the observations can be used to train the meta-model: for any observation,
the prediction of the weak learners are done with instances of these weak learners
trained on the k-1 folds that do not contain the considered observation. In other words,
it consists in training on k-1 folds in order to make predictions on the remaining fold,
iterating so as to obtain predictions for the observations in every fold. Doing so, we
can produce relevant predictions for each observation of our dataset and then train our
meta-model on all these predictions.
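A minimal sketch with scikit-learn's StackingClassifier (assumed available) is given below; the three weak learners match the example above, while the logistic-regression meta-model and the synthetic dataset are illustrative substitutions.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("knn", KNeighborsClassifier()),
        ("svm", SVC(probability=True)),
        ("tree", DecisionTreeClassifier(max_depth=3)),
    ],
    final_estimator=LogisticRegression(),  # the meta-model
    cv=5,                                  # k-fold "cross-training" of the base models
)
print("cross-validated accuracy:", cross_val_score(stack, X, y, cv=5).mean())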

Figure 3.7: Stacking consists in training a meta-model to produce outputs based on the outputs returned by the weak learners

Multi-levels Stacking

A possible extension of stacking is multi-level stacking. It consists in doing stacking with
multiple layers. As an example, let's consider a 3-levels stacking. In the first level (layer),
we fit the L weak learners that have been chosen. Then, in the second level, instead of
fitting a single meta-model on the weak models' predictions (as described in the
previous subsection), we fit M such meta-models. Finally, in the third level we fit a last
meta-model that takes as inputs the predictions returned by the M meta-models of the
previous level.


Figure 3.8: Multi-level stacking considers several layers of stacking

For each meta-model of the different levels of a multi-level stacking ensemble model,
we have to choose a learning algorithm, which can be almost whatever we want (even
algorithms already used at lower levels). We can also mention that adding levels can
either be data expensive (if a k-fold-like technique is not used, more data are needed)
or time expensive (if a k-fold-like technique is used, a lot of models need to be fitted).


Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester

Unit IV

Syllabus: Unsupervised Learning: Introduction to clustering, Hierarchical: AGNES,
DIANA, Partitional: K-means clustering, K-Mode Clustering, Self-Organizing Map,
Expectation Maximization, Gaussian Mixture Models, Principal Component Analysis (PCA),
Locally Linear Embedding (LLE), Factor Analysis.

Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO4): Design probabilistic and unsupervised learning models for handling
unknown pattern.

INTRODUCTION TO CLUSTERING
Clustering is the most important technique of unsupervised learning. Clustering is an
unsupervised learning technique in which there are no predefined classes and no prior
information defining how the data should be grouped or labeled into separate
classes. A cluster is a collection of data objects which are similar to one another within
the same group (class or category) and are different from the objects in the other
clusters. It is an Exploratory Data Analysis (EDA) process which helps to discover hidden
patterns of interest or structure in data. Clustering can also work as a standalone tool to
get insights about the data distribution or as a preprocessing step in other
algorithms.

High quality clusters can be created by reducing the distance between the objects in the
same cluster known as intra-cluster minimization and increasing the distance with the
objects in the other cluster known as inter-cluster maximization.

Intra-cluster minimization: The closer the objects in a cluster, the more likely they
belong to the same cluster.

Inter-cluster Maximization: This makes the separation between two clusters. The
main goal is to maximize the distance between 2 clusters.


Figure 4.1: Intra and Inter cluster distances

HIERARCHICAL
Hierarchical clustering does not partition the dataset into clusters in a single step.
Instead, it involves multiple steps which run from a single cluster containing all the data
points to n clusters each containing a single data point.

This algorithm is further classified into Divisive and Agglomerative Methods.

Hierarchical clustering can be shown using the below diagram:

Figure 4.2: Types of Hierarchical clustering

Divisive Method

This method is also known as the top-down clustering method. It assigns all the data points
to a single cluster and then partitions the cluster into the two least similar clusters. Then


the same method is applied recursively on both the clusters until we get the cluster of
each data point.

Agglomerative method

It is also known as the bottom-up clustering method. Here it assigns the n data points to n
clusters and joins the most similar clusters by computing the similarity, i.e., the distance
between each of the clusters. This process is continued until a single cluster is obtained.

AGNES
AGNES is a bottom-up approach. Here every data point is initially assigned as its own
cluster, so if there are n data points, n clusters are formed initially. In each subsequent
iteration, similar clusters are merged (again based on the density and distances); this
continues until similar points are clustered together and are distinct from the other clusters.

Figure 4.3: AGNES Hierarchical Clustering

In step one, all the data points are assigned as clusters. In step two, depending
upon the density and distances, the points are clubbed into a cluster. Lastly, in step three,
all the similar points (depending upon density and distances) are clustered together and
are distinct from the other clusters, thus forming the final clusters.
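A short sketch of AGNES-style (bottom-up) clustering, assuming scikit-learn is available; the blob dataset, the average linkage and the choice of three clusters are illustrative.

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
agnes = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agnes.fit_predict(X)   # starts from one cluster per point and merges upward
print(labels[:10])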

DIANA
DIANA is also known as Divisive Analysis clustering algorithm. It is the top-down
approach form of hierarchical clustering where all data points are initially assigned a
single cluster. Further, the clusters are split into two least similar clusters. This is done
recursively until clusters groups are formed which are distinct to each other.


Figure 4.4: DIANA Hierarchical Clustering

In step 1, the blue outlined circle can be thought of as all the points being assigned to a
single cluster. Moving forward, it is divided into two red-colored clusters based on the
distances/density of points, so we have two red-colored clusters in step 2. Lastly, in
step 3, the two red clusters are further divided into two black dotted clusters each, again
based on density and distances, to give us the final four clusters. Since the points in the
respective four clusters are very similar to each other and very different when compared
to the other cluster groups, they are not further divided. Thus, this is how the user gets
DIANA clusters, or top-down hierarchical clusters.

In the partitioning method, given a database that contains multiple objects, the method
constructs a user-specified number of partitions of the data in which each partition
represents a cluster and a particular region. Many algorithms come under the
partitioning method; some of the popular ones are K-means clustering and K-modes
clustering.

K-MEANS CLUSTERING
The K means algorithm takes the input parameter K from the user and partitions the
dataset containing N objects into K clusters so that resulting similarity among the data
objects inside the group (intracluster) is high but the similarity of data objects with the
data objects from outside the cluster is low (intercluster). The similarity of the cluster is
determined with respect to the mean value of the cluster.

It is a type of squared-error algorithm. At the start, K objects are chosen randomly from
the dataset, each of which represents an initial cluster mean. Each of the remaining data
objects is assigned to the nearest cluster based on its distance from the cluster mean.
The new mean of each cluster is then calculated with the added data objects.


Figure 4.5: K-mean Clustering

Method:

1. Randomly assign K objects from the dataset (D) as cluster centers (C).

2. Assign each object to the cluster whose mean it is most similar to (the nearest cluster mean).

3. Update the cluster means, i.e., recalculate the mean of each cluster with the updated
assignments.

4. Repeat steps 2 and 3 until no change occurs.

Figure 4.6: Flow chart for K-mean Clustering

K-means clustering algorithm works efficiently only for numerical dataset. It cannot
give proper results for the categorical data because of the improper spatial
representation. K-Means Clustering fails to find patterns in the categorical dataset.
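A rough NumPy sketch of the method above is given below; the random 2-D data and K = 3 are illustrative.

import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]        # step 1: random K objects as centers
    for _ in range(n_iters):
        # step 2: assign each object to the nearest cluster mean
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: recalculate the mean of each cluster
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):                 # step 4: stop when nothing changes
            break
        centers = new_centers
    return labels, centers

X = np.random.default_rng(1).normal(size=(200, 2))
labels, centers = kmeans(X, K=3)
print(centers)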

K-MODES CLUSTERING


It is widely used algorithm for grouping the categorical data because it is easy to
implement and efficiently handles large amount of data. It defines clusters based on the
number of matching categories between data points. The k-modes clustering algorithm
is an extension of k-means clustering algorithm.

The modifications done in the k-means for k-modes are:

1. Using a simple matching dissimilarity measure for categorical objects.

2. Replacing means of clusters by modes.

3. Using a frequency-based method to update the modes.

Let X = {x11, x12, …, xnm} be the data set consisting of n objects with m attributes each.
The main objective of the k-modes clustering algorithm is to group the data objects X
into K clusters by minimizing the cost function

P(W, Q) = sum over l = 1,…,K and i = 1,…,n of w_il · d(X_i, Q_l)

where Q_l is the mode of cluster l, w_il indicates the membership of object X_i in cluster l,
and d(·,·) is the simple matching dissimilarity.

K-modes algorithm, where the inputs are the data objects (X) and the number of clusters (K):

Step 1: Randomly select the K initial modes from the data objects such that Cj, j = 1, 2, …, K.

Step 2: Find the matching dissimilarity between each of the K initial cluster modes and
each data object using the equation

d(X_i, C_j) = sum over l = 1,…,m of δ(x_il, c_jl),   where δ(a, b) = 0 if a = b and 1 otherwise.

Step 3: Evaluate the fitness (the value of the cost function above) for the current assignment.

Step 4: Find the minimum dissimilarity value for each data object, i.e., find the cluster
mode nearest to each object.

Step 5: Assign the data objects to the nearest cluster centroid modes.

Step 6: Update the modes by apply the frequency-based method on newly formed
clusters.

Step 7: Recalculate the similarity between the data objects and the updated modes.


Step 8: Repeat steps 4 and 5 until there is no change in the cluster membership of the data objects.
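A rough NumPy sketch of these steps is given below; the tiny categorical dataset and K = 2 are illustrative, and the frequency-based mode update simply takes the most frequent category per attribute.

import numpy as np

def matching_dissimilarity(a, B):
    """Number of mismatching attributes between object a and each mode in B."""
    return (a != B).sum(axis=1)

def kmodes(X, K, n_iters=20, seed=0):
    rng = np.random.default_rng(seed)
    modes = X[rng.choice(len(X), K, replace=False)].copy()   # step 1: random initial modes
    for _ in range(n_iters):
        # steps 2-5: assign each object to its nearest mode (simple matching dissimilarity)
        labels = np.array([matching_dissimilarity(x, modes).argmin() for x in X])
        # step 6: frequency-based update -- most frequent category per attribute
        new_modes = modes.copy()
        for k in range(K):
            members = X[labels == k]
            if len(members):
                for j in range(X.shape[1]):
                    values, counts = np.unique(members[:, j], return_counts=True)
                    new_modes[k, j] = values[counts.argmax()]
        if np.array_equal(new_modes, modes):                  # step 8: stop when membership is stable
            break
        modes = new_modes
    return labels, modes

X = np.array([["red", "small"], ["red", "big"], ["blue", "big"],
              ["blue", "small"], ["blue", "big"], ["red", "small"]])
labels, modes = kmodes(X, K=2)
print(labels, modes)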

SELF-ORGANIZING MAP
A self-organizing map (SOM) is a clustering technique that helps to uncover categories
in large datasets, such as to find customer profiles based on a list of past purchases. Self-
organizing maps are unsupervised neural networks, where nodes (reference vectors)
are arranged in a single, 2-dimensional grid, which can take the shape of either
rectangles or hexagons.

One example of a data type with more than two dimensions is color. Colors have three
dimensions, typically represented by RGB (red, green, blue) values. SOM can distinguish
between two color clusters.

SOM comprises neurons in the grid, which gradually adapt to the intrinsic shape of data.
The result allows visualizing data points and identifying clusters in a lower dimension.
The iterative process to learn the shape of data by SOM is given below:

Step 0: Randomly position the grid’s neurons in the data space.

Step 1: Select one data point, either randomly or systematically cycling through the
dataset in order

Step 2: Find the neuron that is closest to the chosen data point. This neuron is called the
Best Matching Unit (BMU).

Step 3: Move the BMU closer to that data point. The distance moved by the BMU is
determined by a learning rate, which decreases after each iteration.

Step 4: Move the BMU’s neighbors closer to that data point as well, with farther away
neighbors moving less. Neighbors are identified using a radius around the BMU, and the
value for this radius decreases after each iteration.

Step 5: Update the learning rate and BMU radius, before repeating Steps 1 to 4. Iterate
these steps until positions of neurons have been stabilized.
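A compact NumPy sketch of this iterative process on a rectangular grid is given below; the grid size, the exponentially decaying learning rate and radius, and the random colour data are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
data = rng.random((500, 3))                  # e.g. RGB colours in [0, 1]
grid_h, grid_w, n_iters = 10, 10, 2000

# step 0: randomly position the grid's neurons in the data space
weights = rng.random((grid_h, grid_w, 3))
rows, cols = np.indices((grid_h, grid_w))

for t in range(n_iters):
    lr = 0.5 * np.exp(-t / n_iters)               # learning rate decreases each iteration
    radius = (grid_w / 2) * np.exp(-t / n_iters)  # neighbourhood radius shrinks too

    x = data[rng.integers(len(data))]             # step 1: pick one data point
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(dists.argmin(), dists.shape)   # step 2: Best Matching Unit

    # steps 3-4: move the BMU and its neighbours towards the data point,
    # with farther neighbours moving less
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    weights += lr * influence[..., None] * (x - weights)

print(weights.shape)   # the neurons have adapted to the intrinsic shape of the data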

EXPECTATION MAXIMIZATION
The Expectation-Maximization (EM) algorithm is a way to find maximum-likelihood
estimates for model parameters when your data is incomplete, has missing data points,
or has unobserved (hidden) latent variables.

The basic steps for the EM algorithm are:

1. An initial guess is made for the model’s parameters and a probability distribution
is created. This is sometimes called the “E-Step” for the “Expected” distribution.

2. Newly observed data is fed into the model.


3. The probability distribution from the E-step is tweaked to include the new data.
This is sometimes called the “M-step.”

4. Steps 2 and 3 are repeated until stability (i.e., a distribution that doesn't
change from the E-step to the M-step) is reached.

The EM algorithm always improves a parameter's estimate through the process given
above. However, it sometimes needs a few random starts to find the best model. The
EM algorithm can be very slow, even on the fastest computer. It works best when the
dataset has a small percentage of missing data and the dimensionality of the data isn't
too big.

GAUSSIAN MIXTURE MODELS
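
A Gaussian mixture model (GMM) represents the data distribution as a weighted sum of K Gaussian components, each with its own mean and covariance, and its parameters are usually estimated with the EM algorithm described above: the E-step computes the responsibility of each component for each point, and the M-step re-estimates the mixing weights, means and covariances from those responsibilities. A minimal sketch using scikit-learn's GaussianMixture (assumed available) is:

from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)
gmm = GaussianMixture(n_components=3, covariance_type="full", random_state=0)
gmm.fit(X)                      # parameters estimated with EM
print(gmm.weights_)             # mixing proportions of the 3 Gaussians
print(gmm.predict(X[:5]))       # hard cluster assignments
print(gmm.predict_proba(X[:5])) # soft (posterior) responsibilities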

PRINCIPAL COMPONENT ANALYSIS (PCA)


Principal Component Analysis, or PCA, is a dimensionality-reduction method that is
often used to reduce the dimensionality of large data sets by transforming a large set of
variables into a smaller one that still contains most of the information in the large set.
Smaller data sets are easier to explore and visualize, and they make analyzing data
much easier and faster for machine learning algorithms, as there are no extraneous
variables to process.

Step 1: Standardization

First step is to standardize the range of the continuous initial variables so that each one
of them contributes equally to the analysis. This can be done by subtracting the mean
and dividing by the standard deviation for each value of each variable.

Step 2: Covariance Matrix Computation

The second step is to understand how the variables of the input data set vary from the
mean with respect to each other. Variables are sometimes highly correlated in such a way
that they contain redundant information. So, in order to identify these correlations, the
covariance matrix is computed.

The covariance matrix is a p × p symmetric matrix (where p is the number of
dimensions) that has as entries the covariances associated with all possible pairs of the
initial variables. If a covariance is positive, then the two variables increase or decrease
together (correlated); if a covariance is negative, then one increases when the other
decreases (inversely correlated).

Step 3: Compute the Eigenvectors and Eigen-values of the Covariance Matrix


Eigenvectors and Eigen-values are the linear algebra concepts that are computed from
the covariance matrix in order to determine the principal components of the data.
Principal components are new variables that are constructed as linear combinations or
mixtures of the initial variables. These combinations are done in such a way that the
new variables are uncorrelated and most of the information within the initial variables
is squeezed or compressed into the first components. So, the idea is 10-dimensional
data gives you 10 principal components, but PCA tries to put maximum possible
information in the first component, then maximum remaining information in the second
and so on, until having something like shown figure below.

Figure 4.7: Percentage of Variance for each Principal component

Computing the eigenvectors and ordering them by their eigenvalues in descending
order helps to find the principal components in order of significance.

Step 4: Feature Vector

In this step, we choose whether to keep all the components or discard those of lesser
significance (of low eigenvalues), and form the feature vector with the remaining ones.
So, the feature vector is simply a matrix that has as columns the eigenvectors of the
components that are kept.

Step 5: Recast the Data along the Principal Components Axes


In this step, the aim is to use the feature vector formed using the eigenvectors of the
covariance matrix, to reorient the data from the original axes to the ones represented by
the principal components. This can be done by multiplying the transpose of the original
data set by the transpose of the feature vector.
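A hedged NumPy sketch of the five steps above is shown below; the random data matrix and the choice of keeping k = 2 components are illustrative.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                       # 100 observations, 5 variables

# Step 1: standardization
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 2: covariance matrix (p x p)
C = np.cov(Z, rowvar=False)

# Step 3: eigenvectors and eigenvalues of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]                   # sort by eigenvalue, descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Step 4: feature vector -- keep the first k components
k = 2
feature_vector = eigvecs[:, :k]

# Step 5: recast the data along the principal component axes
X_pca = Z @ feature_vector
print(eigvals / eigvals.sum())                      # percentage of variance explained
print(X_pca.shape)                                  # (100, 2)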

LOCALLY LINEAR EMBEDDING (LLE)


Locally Linear Embedding (LLE) is a method of non-linear dimensionality reduction.
Dimensionality reduction helps reduce the complexity of a machine learning model,
which helps reduce over-fitting. This is because the more features we take, the more
complex the model gets, and this may cause the model to fit the data too well, causing
over-fitting. Features that do not help decide the output label may also be used, which
may not help in real life.

For example, in the house price prediction problem, it may have a feature like the age of
the seller, which may not affect the house price. Dimensionality reduction helps to keep
the more important features in the feature set, reducing the number of features
required to predict the output.

Figure 4.8: Dimensionality Reduction

The LLE algorithm is an unsupervised method for dimensionality reduction. It tries to
reduce these n dimensions while trying to preserve the geometric features of the
original non-linear feature structure. If we have D dimensions for data X1, we try to
reduce X1 to X2 in a feature space with d dimensions.

LLE first finds the k-nearest neighbors of the points. Then, it approximates each data
vector as a weighted linear combination of its k-nearest neighbors. Finally, it computes
the weights that best reconstruct the vectors from its neighbors, then produce the low-


dimensional vectors best reconstructed by these weights. If K is chosen to be too small
or too large, it will not be able to accommodate the geometry of the original data. Here,
for each data point, we compute the K nearest neighbors. A weighted aggregation of the
neighbors of each point is computed to construct a new point.

Figure 4.9: Locally Linear Embedding method

The following equation gives the reconstruction cost that is minimized to find the
weights, where Xj ranges over the K nearest neighbors of point Xi:

E(W) = sum over i of | Xi − sum over j of Wij Xj |²,   subject to sum over j of Wij = 1

Now, for defining the new vector space Y that minimizes the cost with the Yi as the new
points, the following equation is used:

Φ(Y) = sum over i of | Yi − sum over j of Wij Yj |²
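A short sketch with scikit-learn's LocallyLinearEmbedding (assumed available); the swiss-roll dataset, K = 12 neighbours and d = 2 are illustrative.

from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)  # K neighbours, d = 2
Y = lle.fit_transform(X)
print(Y.shape)                       # (1000, 2)
print(lle.reconstruction_error_)     # cost of reconstructing points from their neighbours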


FACTOR ANALYSIS
Factor analysis is a way to take a mass of data and shrink it to a smaller data set that
is more manageable and more understandable. It is a way to find hidden patterns, show
how those patterns overlap, and show what characteristics are seen in multiple patterns.
It is also used to create a set of variables for similar items in the set. It can be a very
useful tool for complex sets of data involving psychological studies, socioeconomic
status, and other involved concepts. A "factor" is a set of observed variables that have
similar response patterns; they are associated with a hidden variable (called a
confounding variable) that cannot be directly measured. Factors are listed according to
factor loadings, or how much variation in the data they can explain.

Types of Factor Analysis:

1. Exploratory factor analysis is used when the user doesn't have any idea about
what the structure of the data is or how many dimensions are in a set of variables.

2. Confirmatory factor analysis is used for verification, as long as the user has a
specific idea about what the structure of the data is or how many dimensions are in a
set of variables.

The key concept of factor analysis is that multiple observed variables have similar
patterns of responses because they are all associated with a latent (i.e., not directly
measured) variable. In every factor analysis, there are the same numbers of factors as
there are variables. Each factor captures a certain amount of the overall variance in the
observed variables, and the factors are always listed in order of how much variation
they explain.
The Eigen value is a measure of how much of the variance of the observed variables a
factor explains. Any factor with an Eigen value ≥1 explains more variance than a single
observed variable. So, if the factor for socioeconomic status had an Eigen value of 2.3 it
would explain as much variance as 2.3 of the three variables. This factor, which
captures most of the variance in those three variables, could then be used in other
analyses.

Factor Loading

The relationship of each variable to the underlying factor is expressed by the factor
loading. Here is an example of the output of a simple factor analysis looking at
indicators of wealth, with just six variables and two resulting factors.

Variables Factor 1 Factor 2

Income 0.65 0.11


Education 0.59 0.25

Occupation 0.48 0.19

House value 0.38 0.60

Number of public parks in neighborhood 0.13 0.57

Number of violent crimes per year in neighborhood 0.23 0.55

Table 4.1: Output of a simple factor analysis for wealth indication

The variable with the strongest association to the underlying latent variable Factor 1 is
income, with a factor loading of 0.65.

Since factor loadings can be interpreted like standardized regression coefficients, so the
variable income has a correlation of 0.65 with Factor 1. This would be considered a
strong association for a factor analysis.

Two other variables, education, and occupation are also associated with Factor 1. Based
on the variables loading highly onto Factor 1, it is “Individual socioeconomic status.”

House value, number of public parks, and number of violent crimes per year, however,
have high factor loadings on the other factor, Factor 2. They seem to indicate the overall
wealth within the neighborhood, so it is Factor 2 “Neighborhood socioeconomic status.”

The variable house value also is marginally important in Factor 1 (loading = 0.38) since
the value of a person’s house should be associated with his or her income.


Subject Notes
IT 802 (A) - Machine Learning
B.Tech IT-8th Semester

Unit V

Syllabus: Probabilistic Learning: Bayesian Learning, Bayes Optimal Classifier, Naïve
Bayes Classifier, Bayesian Belief Networks, Mining Frequent Patterns.

Course Objective: To familiarize students with the knowledge of machine learning and enable
them to apply suitable machine learning techniques for data handling and to gain knowledge
from it. Evaluate the performance of algorithms and to provide solution for various real-world
applications.
_____________________________________________________________________________
Course Outcome (CO5): Analyze the co-occurrence of data to find interesting frequent
patterns and preprocess the data before applying to any real-world problem for evaluation.

PROBABILISTIC LEARNING
Probabilistic classification learning is one form of implicit learning in which cues are
probabilistically associated with outcomes and participants process associations
without explicit awareness. A probabilistic classifier is a classifier that is able to predict,
given an observation of an input, a probability distribution over a set of classes, rather
than only outputting the most likely class that the observation should belong to.
Probabilistic classifiers provide classification that can be useful in its own right or when
combining classifiers into ensembles.

A probabilistic method or model is based on the theory of probability, or on the fact that
randomness plays a role in predicting future events. The opposite is deterministic:
something deterministic can be predicted exactly, without the added complication of
randomness.

BAYESIAN LEARNING
In Bayesian machine learning, follow these three steps:

1. To define a model, use a “generative process” for the data, i.e., a sequence of
steps describing how the data was created.

a. This generative process includes the unknown model parameters.

b. Incorporate prior beliefs about these parameters, which take the form of
distributions over values that the parameters might take.


2. Data are viewed as observations from the generative process.

3. After running the learning algorithm, it is left with an updated belief about the
parameters — i.e., a new distribution over the parameters.

The Bayesian strategy is particularly useful when:

• User has prior beliefs about unknown model parameters or explicit information
about data generation — i.e., useful info user wants to incorporate.

• User has few data or many unknown model parameters, and it is hard to get an
accurate result with data alone (without the added structure or information).

• User wants to capture the uncertainty about the result — how sure or unsure the
model is instead of only a single “best” result.

Bayesian learning Example:

Suppose a user grabs a carton of milk from the fridge, sees that it is seven days past the
expiration date, and wants to know if the milk is still good or if it has gone bad. A quick
internet search leads him to believe that there is roughly a 50-50 chance that the milk is
still good. This is a prior belief (Figure 1).

From past experience, the user has some knowledge about how smelly milk gets when it
has gone bad. Specifically, let's suppose he rates smelliness on a scale of 0-10 (0 being no
smell and 10 being completely rancid) and has probability distributions over the
smelliness of good milk and of bad milk (Figure 2).

Here’s how Bayesian learning works: When he gets some data, i.e., when he smells the
milk (Figure 3), He can apply the machinery of Bayesian inference (Figure 4) to
compute an updated belief about whether the milk is still good or has gone bad (Figure
5).

For example, if the user observes that the milk is about a 5 out of 10 on the smelly scale, he
can then use Bayesian learning to factor in his prior beliefs and the distributions over the
smelliness of good vs. bad milk to return an updated belief: there is now a 33%
chance that the milk is still good and a 67% chance that the milk has gone bad.


Figure 5.1: Bayesian learning to find if the milk is gone bad

Consider a program that computes an updated belief about whether the milk has gone bad
whenever the user smells the milk. Such a program will do the following:

1. Encode prior beliefs about whether the milk is still good or has gone bad and
probability distributions over the smelliness of good vs. bad milk.

2. Smell the milk and give this observation as an input to the program.

3. Do Bayesian learning automatically and return an updated belief about whether


or not the milk has gone bad.

For a Bayesian model, the user needs to mathematically derive an inference algorithm, i.e.,
the learning algorithm that computes the final distribution over beliefs given the data.
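A tiny sketch of this update is given below; the prior is the 50-50 belief from the example, while the two likelihood values for observing smelliness 5 are assumptions chosen so that the posterior lands near the 33% / 67% split described above.

prior_good, prior_bad = 0.5, 0.5
p_smell5_given_good = 0.10       # assumed: good milk rarely smells this much
p_smell5_given_bad  = 0.20       # assumed: bad milk smells this much more often

evidence = p_smell5_given_good * prior_good + p_smell5_given_bad * prior_bad
posterior_good = p_smell5_given_good * prior_good / evidence
posterior_bad  = p_smell5_given_bad  * prior_bad  / evidence
print(round(posterior_good, 2), round(posterior_bad, 2))   # ~0.33 and ~0.67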

BAYES OPTIMAL CLASSIFIER


The Bayes Optimal Classifier is a probabilistic model that makes the most probable
prediction. It is described using Bayes' Theorem, which provides a principled way of
calculating a conditional probability. It is also closely related to Maximum a Posteriori
(MAP), a probabilistic framework that finds the most probable hypothesis for a training
dataset.

The Bayes Optimal Classifier is computationally expensive, if not intractable, to calculate,
and instead simplifications such as the Gibbs algorithm and Naive Bayes can be used to
approximate the outcome.

The equation below demonstrates how to calculate the conditional probability for a new
instance (vj) given the training data (D) and a space of hypotheses (H):

P(vj | D) = sum over hi in H of P(vj | hi) * P(hi | D)


where vj is a new instance to be classified, H is the set of hypotheses for classifying the
instance, hi is a given hypothesis, P(vj | hi) is the posterior probability for vj given
hypothesis hi, and P(hi | D) is the posterior probability of the hypothesis hi given the
data D.

Selecting the outcome with the maximum probability is an example of Bayes optimal
classification:

argmax over vj of ( sum over hi in H of P(vj | hi) * P(hi | D) )

Any model that classifies examples using this equation is a Bayes optimal classifier and
no other model can outperform this technique, on average.

Any system that classifies new instances according to [the equation] is called a Bayes
optimal classifier, or Bayes optimal learner. No other classification method using the
same hypothesis space and same prior knowledge can outperform this method on
average. Although the classifier makes optimal predictions, it is not perfect given the
uncertainty in the training data and incomplete coverage of the problem domain and
hypothesis space. As such, the model will make errors. These errors are often referred
to as Bayes errors. Because the Bayes classifier is optimal, the Bayes error is the
minimum possible error.
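A small illustration of this rule is given below; the three hypotheses, their posteriors P(hi | D) and their predictions are made up, and they show that the Bayes optimal label can differ from the prediction of the single most probable (MAP) hypothesis.

p_h_given_D = {"h1": 0.4, "h2": 0.3, "h3": 0.3}   # posterior of each hypothesis given the data
# P(v | h) for the two candidate labels v in {pos, neg} under each hypothesis
p_v_given_h = {
    "h1": {"pos": 1.0, "neg": 0.0},
    "h2": {"pos": 0.0, "neg": 1.0},
    "h3": {"pos": 0.0, "neg": 1.0},
}

scores = {
    v: sum(p_v_given_h[h][v] * p_h_given_D[h] for h in p_h_given_D)
    for v in ("pos", "neg")
}
print(scores)                         # {'pos': 0.4, 'neg': 0.6}
print(max(scores, key=scores.get))    # 'neg', even though the MAP hypothesis h1 says 'pos'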

NAÏVE BAYES CLASSIFIER


Naive Bayes classifiers are a collection of classification algorithms based on Bayes'
Theorem. It is not a single algorithm but a family of algorithms where all of them share a
common principle, i.e., every pair of features being classified is independent of each
other.

It is a classification technique based on Bayes' Theorem with an assumption of
independence among predictors. In simple terms, a Naive Bayes classifier assumes that
the presence of a particular feature in a class is unrelated to the presence of any other
feature.

For example, some fruit may be an apple if it is red, round, and about 3 inches in
diameter. Even if these features depend on each other or upon the existence of the other
features, all these properties independently contribute to the probability that this fruit
is an apple and that is why it is known as ‘Naive’.

Naive Bayes model is easy to build and particularly useful for very large data sets. Along
with simplicity, Naive Bayes is known to outperform even highly sophisticated
classification methods.

Bayes' theorem provides a way of calculating the posterior probability P(c|x) from P(c), P(x)
and P(x|c). The equation is given below:

P(c|x) = P(x|c) * P(c) / P(x)


where,

• P(c|x) is the posterior probability of class (c, target) given predictor (x,
attributes).

• P(c) is the prior probability of class.

• P(x|c) is the likelihood which is the probability of predictor given class.

• P(x) is the prior probability of predictor.

Working of Naive Bayes algorithm:

A training data set of weather and corresponding target variable ‘Play’ (suggesting
possibilities of playing) is given below. User needs to classify whether players will play
or not based on weather condition. Given below steps to perform classification:

Step 1: Convert the data set into a frequency table

Step 2: Create Likelihood table by finding the probabilities like Overcast probability =
0.29 and probability of playing is 0.64.

Figure 5.2: A training data set of weather

Step 3: Now, use Naive Bayesian equation to calculate the posterior probability for each
class. The class with the highest posterior probability is the outcome of prediction.


Problem: Players will play if the weather is sunny. Is this statement correct?

We can check it using the posterior probability.

P(Yes | Sunny) = P( Sunny | Yes) * P(Yes) / P (Sunny)

Here we have P (Sunny |Yes) = 3/9 = 0.33, P(Sunny) = 5/14 = 0.36, P( Yes)= 9/14 = 0.64

Now, P (Yes | Sunny) = 0.33 * 0.64 / 0.36 = 0.60, which is the higher posterior probability, so the prediction is Yes.
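A minimal Python sketch of the same calculation is shown below; the weather/Play counts are assumed from the standard 14-example data set that the quoted probabilities imply (the exact table is in Figure 5.2 and is not reproduced here):

```python
# Counts assumed from the standard 14-example weather data set.
counts = {("Sunny", "Yes"): 3, ("Sunny", "No"): 2,
          ("Overcast", "Yes"): 4, ("Overcast", "No"): 0,
          ("Rainy", "Yes"): 2, ("Rainy", "No"): 3}

total = sum(counts.values())                                         # 14 examples
n_yes = sum(c for (w, play), c in counts.items() if play == "Yes")   # 9

p_sunny_given_yes = counts[("Sunny", "Yes")] / n_yes                 # 3/9  = 0.33
p_yes = n_yes / total                                                # 9/14 = 0.64
p_sunny = (counts[("Sunny", "Yes")] + counts[("Sunny", "No")]) / total  # 5/14 = 0.36

p_yes_given_sunny = p_sunny_given_yes * p_yes / p_sunny
print(round(p_yes_given_sunny, 2))                                   # 0.6 -> predict "Yes"
```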

Naive Bayes uses a similar method to predict the probability of different class based on
various attributes. This algorithm is mostly used in text classification and with problems
having multiple classes.

Applications of Naive Bayes Algorithms

• Real time Prediction: Naive Bayes is an eager learning classifier and it is very
fast. Thus, it can be used for making predictions in real time.

• Multi class Prediction: This algorithm is also well known for multi class
prediction feature. Here we can predict the probability of multiple classes of
target variable.

• Text classification/ Spam Filtering/ Sentiment Analysis: Naive Bayes classifiers are
widely used in text classification (because of their good results in multi-class problems
and the independence assumption) and often achieve a higher success rate than other
algorithms. As a result, they are widely used in spam filtering (identifying spam e-mail)
and sentiment analysis (in social media analysis, to identify positive and
negative customer sentiments).

• Recommendation System: A Naive Bayes classifier and collaborative filtering
together build a recommendation system that uses machine learning and data
mining techniques to filter unseen information and predict whether a user would
like a given resource or not.

BAYESIAN BELIEF NETWORKS


Bayesian Belief Network or Bayesian Network or Belief Network is a Probabilistic
Graphical Model (PGM) that represents conditional dependencies between random
variables through a Directed Acyclic Graph (DAG). The main objective of these networks
is to understand the structure of causal relations.

Conditional probability is the probability of a random variable when some other
random variable is given. It is written as P(A | B) = P(A, B) / P(B).

If these two random variables are dependent, P(A, B) = P(A | B) * P(B).


If they are independent, then P(A, B) = P(A) * P(B).

The joint probabilities in a belief network are calculated by the chain rule

P(X1, X2, ..., Xn) = Π P(Xi | Parents(Xi))

To be able to calculate the joint distribution, one needs the conditional probabilities
indicated by the network.
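For example (a hypothetical three-node chain A → B → C, not the network of Figure 5.3), the chain rule gives

P(A, B, C) = P(A) * P(B | A) * P(C | B)

so only one prior and two conditional probability tables are needed instead of the full joint table.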

Bayesian Network can be used for building models from data and experts’ opinions, and
it consists of two parts:

• Directed Acyclic Graph

• Table of conditional probabilities.

The generalized form of a Bayesian network that represents and solves decision problems
under uncertain knowledge is known as an Influence diagram.

A Bayesian network graph is made up of nodes and Arcs (directed links), where:

Figure 5.3: A Bayesian network graph

Each node corresponds to a random variable, and a variable can be continuous or
discrete.


Arcs or directed arrows represent the causal relationships or conditional probabilities
between random variables. These directed links or arrows connect pairs of nodes in
the graph.

These links represent that one node directly influences the other node; if there is no
directed link, the nodes are independent of each other.

• In the above diagram, A, B, C, and D are random variables represented by the
nodes of the network graph.

• If node B relates to node A by a directed arrow, then node A is called the parent
of Node B.

• Node C is independent of node A.

The Bayesian network has mainly two components:

• Causal Component

• Actual numbers

Each node in the Bayesian network has a conditional probability distribution P(Xi
| Parent(Xi)), which determines the effect of the parent on that node.
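A minimal sketch (a hypothetical two-node network Rain → WetGrass with made-up CPT values, not the graph of Figure 5.3) shows how the joint probability is assembled from the node-wise conditional distributions:

```python
# Hypothetical two-node network Rain -> WetGrass with made-up probabilities.
p_rain = {True: 0.2, False: 0.8}                      # prior P(Rain)
p_wet_given_rain = {True:  {True: 0.9, False: 0.1},   # CPT P(WetGrass | Rain)
                    False: {True: 0.1, False: 0.9}}

def joint(rain, wet):
    """P(Rain = rain, WetGrass = wet) from the network's chain rule."""
    return p_rain[rain] * p_wet_given_rain[rain][wet]

print(joint(True, True))                                               # 0.18
print(sum(joint(r, w) for r in (True, False) for w in (True, False)))  # 1.0
```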

MINING FREQUENT PATTERNS


Frequent Pattern Mining (Association Rule Mining) is an analytical process that finds
frequent patterns, associations, or causal structures from data sets found in various
kinds of databases such as relational databases, transactional databases, and other data
repositories. Frequent patterns are collections of items which appear in a data set at a
significant frequency (usually greater than a predefined threshold) and can thus reveal
association rules and relations between variables.

A frequent pattern represents a set of items co-occurring in a comparatively large number of
transactions. The frequency is quantified using the support metric. Itemset support is
the number of transactions where the itemset elements appear together divided by the
total number of transactions.

Support(item1) = count(item1)/count(all transactions)

Minimum support is a threshold used by the following algorithms in order to discard
sets of items from the analysis which don’t appear frequently enough.

The strength of the association rule between two items (for instance item1 and item2), or
the association confidence, is the number of transactions containing item1 and
item2 divided by the number of transactions containing item1:

Confidence (item1→item2)=count(item1 & item2)/count(item1)


The confidence metric estimates the likelihood that a transaction containing item1 will
also include item2.

The lift is the ratio of the confidence of the rule to the support of item2, i.e. how much
more often the two items occur together than would be expected if they were independent.
A lift greater than 1 means that item1 and item2 are more likely to be present together in
transactions, while values below 1 correspond to cases where the two items are rarely
associated.

Lift(item1→item2) = (Confidence (item1→item2))/(Support (item2)).
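The sketch below computes support, confidence and lift for a small hypothetical list of transactions (item names are illustrative only):

```python
# Support, confidence and lift for a tiny set of example transactions.
transactions = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
                {"bread", "milk", "butter"}, {"bread", "milk"}]
n = len(transactions)

def support(items):
    """Fraction of transactions containing every item in `items`."""
    return sum(items <= t for t in transactions) / n

def confidence(antecedent, consequent):
    """Estimated P(consequent in transaction | antecedent in transaction)."""
    return support(antecedent | consequent) / support(antecedent)

def lift(antecedent, consequent):
    return confidence(antecedent, consequent) / support(consequent)

print(support({"bread"}))               # 0.8
print(confidence({"bread"}, {"milk"}))  # 0.75
print(lift({"bread"}, {"milk"}))        # 0.9375 -> slightly below 1
```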

Apriori algorithm

The Apriori algorithm uses data organized in a horizontal layout. It is founded on the fact
that if a subset S appears k times in a database, any other subset S1 which contains S will
appear k times or less. This implies that, once a minimum support threshold (the minimum
frequency an itemset needs in order not to be discarded) has been chosen, we can avoid
counting S1 or any other superset of S if support(S) < minimum support. It can be said
that all such candidates are discarded a priori.

The algorithm computes the counts for all itemsets of k elements (starting with k = 1).
During the next iterations the previous frequent sets are joined to create all possible
(k + 1)-itemsets. The combinations appearing at a frequency below the minimum support
threshold are discarded. The iterations end when no further extensions (joins) can be
found.
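A minimal, unoptimised Apriori sketch in Python (the transactions and the min_support value are illustrative assumptions) shows this count–join–prune cycle:

```python
# Minimal Apriori sketch (illustrative, not optimized): count, join, prune.
def apriori(transactions, min_support):
    n = len(transactions)
    # k = 1: count single items and keep only the frequent ones
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]): sum(i in t for t in transactions) / n
                for i in items}
    frequent = {s: sup for s, sup in frequent.items() if sup >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        # join step: build candidate k-itemsets from the frequent (k-1)-itemsets
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune step: keep only candidates meeting the minimum support
        frequent = {c: sum(c <= t for t in transactions) / n for c in candidates}
        frequent = {c: sup for c, sup in frequent.items() if sup >= min_support}
        result.update(frequent)
        k += 1
    return result

transactions = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
                {"bread", "milk", "butter"}, {"bread", "milk"}]
print(apriori(transactions, min_support=0.6))
# e.g. {frozenset({'bread'}): 0.8, frozenset({'milk'}): 0.8,
#       frozenset({'butter'}): 0.6, frozenset({'bread', 'milk'}): 0.6}
```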

Eclat algorithm

The Eclat (Equivalence Class Clustering and bottom-up Lattice traversal) algorithm uses
data organized in a vertical layout, which associates each element with the list of
transactions that contain it. In an iterative, depth-first-search manner the algorithm
calculates, for all combinations of k items (starting from the single-item lists, it first
calculates all pairs of 2 items), the list of common transactions. In a nutshell, during step k
all combinations of k items are obtained by intersecting the lists of transactions
associated with the (k-1)-itemsets. k is incremented by 1 each time, until no frequent or
candidate itemsets can still be found.

The Eclat algorithm is generally faster than Apriori and requires only one database scan,
which finds the support for all itemsets with 1 element. All k > 1 iterations rely only
on previously stored data.
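A minimal Eclat-style sketch (same illustrative transactions as above) builds the vertical tid-lists in one scan and then intersects them depth-first:

```python
# Vertical layout: each item -> set of transaction ids containing it.
transactions = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
                {"bread", "milk", "butter"}, {"bread", "milk"}]
min_support = 0.6
n = len(transactions)

# single database scan: build the tid-lists
tidlists = {}
for tid, t in enumerate(transactions):
    for item in t:
        tidlists.setdefault(item, set()).add(tid)

def eclat(prefix, items, out):
    """Depth-first extension of `prefix` with each remaining frequent item."""
    for i, (item, tids) in enumerate(items):
        sup = len(tids) / n
        if sup >= min_support:
            out[prefix + (item,)] = sup
            # extend this prefix: intersect tid-lists, no rescan of the database
            rest = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
            eclat(prefix + (item,), rest, out)
    return out

print(eclat((), sorted(tidlists.items()), {}))
# e.g. {('bread',): 0.8, ('bread', 'milk'): 0.6, ('butter',): 0.6, ('milk',): 0.8}
```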

FP tree algorithm

The FP tree algorithm uses data organized in a horizontal layout. It is the most
computationally efficient algorithm of the three presented in this section. It only performs 2
database scans and keeps the data in an easily exploitable tree structure.


The first database scan counts all items and sorts them in the order of their global
occurrence in the database (the equivalent of applying a counter to the unraveled data of
all transactions). The second pass iterates through the list of transactions; for each
transaction it sorts the elements by the global order obtained in the first pass and
introduces them as nodes of a tree grown in depth. These nodes are introduced with a
count value of 1. Continuing the iterations, for each transaction new nodes are added to
the tree at the point where the ordered items differ from the existing tree. If the same
prefix already exists, all common nodes increase their count value by one.

The FP tree can be pruned by removing all nodes having a count value below a
minimum occurrence threshold. The remaining tree can be traversed; for instance, all
paths from the root node to the leaves correspond to clusters of frequently co-occurring
items.
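The sketch below shows the two-pass FP-tree construction only (a real implementation would also keep header links per item for mining the tree); the transactions and the minimum count are illustrative assumptions:

```python
from collections import Counter

# Minimal sketch of the two-pass FP-tree construction described above.
class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 1
        self.children = {}

def build_fp_tree(transactions, min_count):
    # first pass: global item counts, used both for pruning and for ordering
    counts = Counter(i for t in transactions for i in t)
    order = {i: c for i, c in counts.items() if c >= min_count}

    root = Node(None, None)
    # second pass: insert each transaction, items sorted by global frequency
    for t in transactions:
        items = sorted((i for i in t if i in order),
                       key=lambda i: (-order[i], i))
        node = root
        for item in items:
            if item in node.children:
                node.children[item].count += 1   # shared prefix: bump count
            else:
                node.children[item] = Node(item, node)
            node = node.children[item]
    return root

def show(node, depth=0):
    for child in node.children.values():
        print("  " * depth + f"{child.item}:{child.count}")
        show(child, depth + 1)

transactions = [{"bread", "milk"}, {"bread", "butter"}, {"milk", "butter"},
                {"bread", "milk", "butter"}, {"bread", "milk"}]
show(build_fp_tree(transactions, min_count=3))
# bread:4
#   milk:3
#     butter:1
#   butter:1
# milk:1
#   butter:1
```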
