
COURSE TITLE: MACHINE LEARNING

COURSE CODE: B20CI0502


SEMESTER : V
UNIT -2
Syllabus: Classification – Decision Tree, K-nearest neighbour, Logistic Regression,
Support Vector Machine Algorithm, Naïve Bayes Algorithm, Random Forest Algorithm

Classification
In machine learning, classification is a predictive modelling problem where a class label is predicted for a given example of input data. A classification algorithm, in general, is a function that weighs the input features so that the output separates one class into positive values and the other into negative values. Classification is defined as the process of recognizing, understanding, and grouping objects and ideas into preset categories, also known as "sub-populations." With the help of these pre-categorized training datasets, classification programs in machine learning leverage a wide range of algorithms to classify future data into the respective, relevant categories. Classification algorithms use the input training data to predict the likelihood or probability that the data that follows will fall into one of the predetermined categories. One of the most common applications of classification is filtering emails into "spam" or "non-spam," as done by today's top email service providers.
The classification algorithm is a supervised learning technique that uses labeled training data to categorize new observations. In classification, a program uses the dataset or observations provided to learn how to categorize new observations into various classes or groups, for instance 0 or 1, red or blue, yes or no, spam or not spam. Classes can also be described as targets, labels, or categories. Because classification is a supervised learning technique, it requires labeled input data comprising both input and output information. There are two types of learners.

 Lazy Learners

A lazy learner first stores the training dataset and waits for the test dataset to arrive. Classification is then carried out using the most related data in the stored training dataset. Less time is spent on training, but more time is spent on prediction. Examples include case-based reasoning and the KNN algorithm.

 Eager Learners

Eager learners build a classification model from a training dataset before receiving a test dataset. They spend more time on training and less time on prediction. Examples include ANN, Naive Bayes, and decision trees.
In simple words, classification is a type of pattern recognition in which classification
algorithms are performed on training data to discover the same pattern in new data sets.

Working of Classification ("How does classification work?")


Data classification is a two-step process:
In the first step, a classifier is built describing a predetermined set of data classes
or concepts. This is the learning step or training phase, where a classification
algorithm builds the classifier by analyzing or learning from a training set made
up of database tuples and their associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n database attributes, respectively, A1, A2, ..., An. Each tuple, X, is assumed to belong to a predefined class as determined by another database attribute called the class label attribute. The class label attribute is discrete-valued and unordered. It is categorical in
that each value serves as a category or class. The individual tuples making up the training
set are referred to as training tuples and are selected from the database under analysis.
In the context of classification, data tuples can be referred to as samples, examples,
instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as supervised learning: the learning of the classifier is "supervised" in that it is told to which
class each training tuple belongs. It contrasts with unsupervised learning in which the
class label of each training tuple is not known, and the number or set of classes to be
learned may not be known in advance.
This first step of the classification process can also be viewed as the learning of a mapping
or function, y = f (X), that can predict the associated class label y of a given tuple X. In this
view, we wish to learn a mapping or function that separates the data classes. Typically,
this mapping is represented in the form of classification rules, decision trees, or
mathematical formulae.
Training data are analyzed by a classification algorithm. In a loan-application example, the class label attribute is the loan decision, and the learned model or classifier is represented in the form of classification rules that identify loan applications as being either safe or risky. The rules can be used to categorize future
data tuples, as well as provide deeper insight into the database contents. They also
provide a compressed representation of the data.


(“What about classification accuracy?”)
In the second step, the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the
accuracy of the classifier, this estimate would likely be optimistic, because the classifier
tends to overfit the data (i.e., during learning it may incorporate some particular
anomalies of the training data that are not present in the general data set overall).
Therefore, a test set is used, made up of test tuples and their associated class labels. These
tuples are randomly selected from the general data set. They are independent of the
training tuples, meaning that they are not used to construct the classifier. The accuracy
of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier. The associated class label of each test tuple is compared with
the learned classifier’s class prediction for that tuple. If the accuracy of the classifier is
considered acceptable, the classifier can be used to classify future data tuples for which
the class label is not known. (Such data are also referred to in the machine learning
literature as “unknown” or “previously unseen” data.).
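As a minimal sketch of this two-step process in code (the dataset and classifier below are illustrative stand-ins, using scikit-learn's bundled iris data rather than anything from this unit):

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
# Step 1 (learning): build the classifier from the training tuples only.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier().fit(X_train, y_train)
# Step 2 (classification): estimate accuracy on independent test tuples,
# i.e., the percentage of test tuples the classifier labels correctly.
y_pred = clf.predict(X_test)
print("Test accuracy:", accuracy_score(y_test, y_pred))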
How does prediction differ from classification?

Data prediction is a two-step process, similar to that of data classification. However, for prediction,
we lose the terminology of “class label attribute” because the attribute for which values
are being predicted is continuous-valued (ordered) rather than categorical (discrete-
valued and unordered). The attribute can be referred to simply as the predicted attribute.
Prediction and classification also differ in the methods that are used to build their
respective models. As with classification, the training set used to build a predictor should
not be used to assess its accuracy. An independent test set should be used instead. The
accuracy of a predictor is estimated by computing an error based on the difference
between the predicted value and the actual known value of y for each of the test tuples,
X.

Classification Tasks in Machine Learning


There are four different types of classification tasks in machine learning, as follows:

 Binary Classification

 Multi-Class Classification

 Multi-Label Classification

 Imbalanced Classification

Binary Classification

Classification tasks with only two class labels are referred to as binary classification. Examples include:
Prediction of conversion (buy or not).
Churn forecast (churn or not).
Detection of spam email (spam or not).
Binary classification problems often require two classes, one representing the normal
state and the other representing the aberrant state.
For instance, the normal condition is "not spam," while the abnormal state is "spam."
Another illustration is when a task involving a medical test has a normal condition of
"cancer not identified" and an abnormal state of "cancer detected."
Class label 0 is given to the class in the normal state, whereas class label 1 is given to the
class in the abnormal condition.
A model that forecasts a Bernoulli probability distribution for each case is frequently used
to represent a binary classification task.
The discrete probability distribution known as the Bernoulli distribution deals with the
situation where an event has a binary result of either 0 or 1. In terms of classification, this
indicates that the model forecasts the likelihood that an example would fall within class
1, or the abnormal state.
The following are well-known binary classification algorithms:
Logistic Regression
Support Vector Machines
Naive Bayes
Decision Trees
Some algorithms, such as Support Vector Machines and Logistic Regression, were designed expressly for binary classification and do not natively support more than two classes.
Let us now discuss Multi-Class Classification.

Multi-Class Classification

Multi-class labels are used in classification tasks referred to as multi-class classification.


Examples include:
Face classification.
Plant species classification.
Optical character recognition.
The multi-class classification does not have the idea of normal and abnormal outcomes,
in contrast to binary classification. Instead, instances are grouped into one of several
well-known classes.
In some cases, the number of class labels could be rather high. In a facial recognition
system, for instance, a model might predict that a shot belongs to one of thousands or
tens of thousands of faces.
Text translation models and other problems involving word prediction could be
categorized as a particular case of multi-class classification. Each word in the sequence of
words to be predicted requires a multi-class classification, where the vocabulary size
determines the number of possible classes that may be predicted and may range from
tens of thousands to hundreds of thousands of words.
Multiclass classification tasks are frequently modeled using a model that forecasts a
Multinoulli probability distribution for each example.
The Multinoulli distribution is a discrete probability distribution that covers an event with a categorical outcome k in {1, 2, 3, ..., K}. In terms of classification, this implies that the model forecasts the probability that a given example belongs to a particular class label.
For multi-class classification, many binary classification techniques are applicable.
The following well-known algorithms can be used for multi-class classification:
Gradient Boosting
Decision Trees
K-Nearest Neighbors
Random Forest
Naive Bayes
Multi-class problems can also be solved using algorithms created for binary classification. To do this, a strategy known as "one-vs-rest" or "one-vs-one" is used, which involves fitting multiple binary classification models:
One-vs-Rest: Fit a single binary classification model for each class versus all other classes.
One-vs-One: Fit a single binary classification model for each pair of classes.
The following binary classification algorithms can apply these multi-class techniques:
Support Vector Machine
Logistic Regression
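A brief sketch of the two strategies using scikit-learn's wrappers; the iris data here is just a stand-in multi-class dataset, and the printed estimator counts merely show how many binary models each strategy fits:

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

X, y = load_iris(return_X_y=True)  # 3 classes
# One-vs-Rest: one binary model per class -> 3 fitted estimators here.
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
# One-vs-One: one binary model per pair of classes -> 3 fitted estimators here.
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(len(ovr.estimators_), len(ovo.estimators_))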
Let us now learn about Multi-Label Classification.

Multi-Label Classification

Multi-label classification problems are those that feature two or more class labels and
allow for the prediction of one or more class labels for each example.
Think about the photo classification example. Here a model can predict the existence of
many known things in a photo, such as “person”, “apple”, "bicycle," etc. A particular photo
may have multiple objects in the scene.
This greatly contrasts with multi-class classification and binary classification, which
anticipate a single class label for each occurrence.
Multi-label classification problems are frequently modeled using a model that forecasts
many outcomes, with each outcome being forecast as a Bernoulli probability distribution.
In essence, this approach predicts several binary classifications for each example.
Classification algorithms designed for binary or multi-class problems cannot be directly applied to multi-label classification. Instead, so-called multi-label versions of the algorithms, which are specialized versions of the conventional classification algorithms, are used, including:
Multi-label Gradient Boosting
Multi-label Random Forests
Multi-label Decision Trees
Another strategy is to use a separate classification algorithm to predict each class label.
Now, we will look into the Imbalanced Classification Task in detail.

Imbalanced Classification

The term "imbalanced classification" describes classification jobs where the distribution
of examples within each class is not equal.
A majority of the training dataset's instances belong to the normal class, while a minority
belong to the abnormal class, making imbalanced classification tasks binary classification
tasks in general.
Examples include:
Clinical diagnostic procedures
Detection of outliers
Fraud investigation
Although they could need unique methods, these issues are modeled as binary
classification jobs.
By oversampling the minority class or undersampling the majority class, specialized
strategies can be employed to alter the sample composition in the training dataset.
Examples include:
 SMOTE Oversampling
 Random Undersampling
It is possible to utilize specialized modeling techniques, like the cost-sensitive machine
learning algorithms that give the minority class more consideration when fitting the
model to the training dataset.
Examples include:
 Cost-sensitive Support Vector Machines
 Cost-sensitive Decision Trees
 Cost-sensitive Logistic Regression
Since reporting the classification accuracy may be deceptive, alternate performance
indicators may be necessary.
Examples include:
 F-Measure
 Recall
 Precision
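A short sketch of two of the remedies above; SMOTE assumes the third-party imbalanced-learn package is installed, and the data is synthetic:

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE

# Synthetic data with a roughly 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("Before resampling:", Counter(y))
# SMOTE oversampling: synthesize new minority-class examples.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("After resampling:", Counter(y_res))
# Cost-sensitive alternative: weight classes inversely to their frequency.
clf = LogisticRegression(class_weight="balanced").fit(X, y)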

Decision Tree

What is a Decision Tree?


Decision tree is one of the most popular and powerful tools for classification and prediction.
Decision tree is a flowchart like structure where,
● each internal node denotes a test on an attribute
● each branch represents an outcome of the test
● each leaf node/terminal nodes holds a class label

The topmost node in any decision tree is the root node. Internal nodes are denoted by rectangles and leaf nodes by ovals. Like any tree, a decision tree can be a binary tree or a non-binary tree.
The image above is an example of a decision tree. It represents the concept buys_computer, which predicts whether a customer at AllElectronics is likely to purchase a computer or not.
How are decision trees used for classification?
A decision tree can easily be converted to classification rules. Given a tuple X for which the associated class label is unknown, the attribute values of the tuple are tested against the decision tree. A path is traced from the root to a leaf node, which holds the class prediction for that tuple.
Why are decision tree classifiers so popular?
The popularity of decision tree classifiers is based on the following characteristics:
● It does not require any domain knowledge or parameter setting and therefore is
appropriate for exploratory knowledge discovery
● it can handle high dimensional data
● It can be easily understood by humans.
● The learning and classification steps of decision tree induction are simple and fast
● it has good accuracy
Some applications of decision tree induction algorithms are:
● Medicine
● Manufacturing and production
● Financial analysis
● Astronomy
● Molecular biology
Decision tree Induction
Algorithm to Generate decision tree:
Generate a decision tree from the training tuples of data partition D. Most algorithms for decision tree induction follow such a top-down approach, which starts with a training set of tuples and their associated class labels. The training set is recursively partitioned into smaller subsets as the tree is built.
A basic decision tree algorithm is summarized below.
Input
● Data Partition(D), which is a set of training tuples and their associated class labels
● Attribute list: Set of candidate attributes
● Attribute selection method: a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes.
The splitting criterion consists of:
● Splitting attribute
● Split point or Splitting subset

Output: A Decision Tree


ALGORITHM STEPS:
1. create a node N;
2. if tuples in D are all of the same class, C, then
2.1. return N as a leaf node labeled with the class C;
3. if attribute_list is empty then
3.1. return N as a leaf node labeled with the majority class in D;
4. apply Attribute_selection_method(D, attribute_list) to find the "best" splitting_criterion;
5. label node N with the splitting_criterion;
6. if splitting_attribute is discrete-valued and multiway splits are allowed then (// not restricted to binary trees)
6.1. attribute_list ← attribute_list − splitting_attribute; (// remove splitting attribute)
7. for each outcome j of splitting_criterion (// partition the tuples and grow subtrees for each partition)
8. let Dj be the set of data tuples in D satisfying outcome j; (// a partition)
9. if Dj is empty then
9.1. attach a leaf labeled with the majority class in D to node N;
10. else
10.1. attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
11. end for
12. return N;

The Approach is as follows:


● The algorithm is called with three parameters:
○ D - Data partition(initially it is the complete set of training tuples and their
associated class labels)
○ attribute_list - it is list of attributes describing the tuples
○ Attribute_selection_method: an efficient procedure for selecting the attribute that "best" discriminates the given tuples according to class. The Attribute_selection_method uses a selection measure such as information gain or the Gini index.
● The tree starts as a single node, N, representing the training tuples in D (step 1).
● If the tuples in D are all of the same class, then node N becomes a leaf and is labeled with that class (steps 2 and 2.1). Note that steps 3 and 3.1 are also terminating conditions.
● If not, the splitting criterion is determined by calling the Attribute Selection
Method in the algorithm. The splitting criterion tells us which attribute to test at
node N by determining the “best” way to separate or partition the tuples in D into
individual classes (step 4). The splitting criterion indicates the splitting attribute and may also indicate either a split-point or a splitting subset. It also tells us which branches to grow from node N with respect to the outcomes of the chosen test.
● The splitting criterion is determined so that, ideally, the resulting partitions at
each branch are as “pure” as possible. (A partition is pure if all of the tuples in it
belong to the same class. In other words, if we were to split up the tuples in D
according to the mutually exclusive outcomes of the splitting criterion, we hope
for the resulting partitions to be as pure as possible.)
● N is labeled with the splitting criterion, which serves as a test at the node (step 5). The possible outcome scenarios after applying the splitting criterion are:
○ A is discrete-valued: the outcomes of the test at node N correspond directly to the known values of A. A branch is created for each known value aj of A and labeled with that value.
○ A is continuous-valued: the test at node N has two possible outcomes, corresponding to the conditions A ≤ split_point and A > split_point, respectively, where split_point is the split-point returned by the attribute selection method as part of the splitting criterion.
○ A is discrete-valued and a binary tree must be produced: the test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A, returned by the attribute selection method as part of the splitting criterion. It is a subset of the known values of A. If a given tuple has value aj of A and aj ∈ SA, then the test at node N is satisfied. Two branches are grown from N. By convention, the left branch out of N is labeled yes, so that D1 corresponds to the subset of class-labeled tuples in D that satisfy the test. The right branch out of N is labeled no, so that D2 corresponds to the subset of class-labeled tuples from D that do not satisfy the test.
The figure below shows the partitioning scenarios with examples.
● The algorithm uses the same process recursively to form a decision tree for the tuples at each resulting partition, Dj, of D (step 10.1).
● The recursive partitioning stops only when any one of the following terminating
conditions is true:
○ All of the tuples in partition D (represented at node N) belong to the same class (step 2).
○ There are no remaining attributes on which the tuples may be further partitioned (step 3). In this case, majority voting is employed (step 3.1). This involves converting node N into a leaf and labeling it with the most common class in D. Alternatively, the class distribution of the node tuples may be stored.
○ There are no tuples for a given branch, that is, a partition Dj is empty (step 9). In this case, a leaf is created with the majority class in D (step 9.1).
● The resulting decision tree is returned (step 12)
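The recursion above can be sketched in Python for discrete-valued attributes with multiway splits. This is a minimal illustration rather than a faithful reproduction of any particular textbook implementation; best_attribute stands in for Attribute_selection_method and uses information gain, which is defined in the next section:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    # Minimizing the expected information after the split maximizes the gain.
    def info_after_split(a):
        parts = {}
        for row, lab in zip(rows, labels):
            parts.setdefault(row[a], []).append(lab)
        return sum(len(p) / len(labels) * entropy(p) for p in parts.values())
    return min(attributes, key=info_after_split)

def generate_tree(rows, labels, attributes):
    # rows: list of dicts mapping attribute name -> discrete value.
    if len(set(labels)) == 1:                        # step 2: pure partition -> leaf
        return labels[0]
    if not attributes:                               # step 3: no attributes -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    a = best_attribute(rows, labels, attributes)     # step 4
    node = {a: {}}
    remaining = [x for x in attributes if x != a]    # step 6.1: remove splitting attribute
    for v in {row[a] for row in rows}:               # step 7: one subtree per outcome
        sub = [(r, l) for r, l in zip(rows, labels) if r[a] == v]
        sub_rows, sub_labels = zip(*sub)
        node[a][v] = generate_tree(list(sub_rows), list(sub_labels), remaining)
    return node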
Attribute Selection Measures
An attribute selection measure is a heuristic for selecting the splitting criterion that “best”
separates a given data partition, D, of class-labeled training tuples into individual classes.
If we were to split D into smaller partitions according to the outcomes of the splitting
criterion, ideally each partition would be pure (i.e., all of the tuples that fall into a given
partition would belong to the same class)
The Three main popular attribute selection measures are:
● Information Gain
● Gain Ratio
● Gini Index
The notation used in the attribute selection measures is as follows:
Let D, the data partition, be a training set of class-labeled tuples. Suppose the class label attribute has m distinct values defining m distinct classes, Ci (for i = 1, ..., m). Let Ci,D be the set of tuples of class Ci in D. Let |D| and |Ci,D| denote the number of tuples in D and Ci,D, respectively.

Information Gain
Information gain is used to decide which feature/attribute yields the maximum information about a class. Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.
The expected information needed to classify a tuple in D is given by
Info(D) = − ∑ᵢ pᵢ log₂(pᵢ), with the sum running over i = 1 to m, where
● pᵢ is the probability that an arbitrary tuple in D belongs to class Cᵢ, and is estimated by |Ci,D|/|D|
● the log function to base 2 is used because the information is encoded in bits
● Info(D) is just the average amount of information needed to identify the class label of a tuple in D
Info(D) is also known as the entropy of D.
How much more information would we still require (after partitioning on an attribute A with v distinct outcomes) to arrive at an exact classification? This amount is calculated as
Info_A(D) = ∑ⱼ (|Dⱼ|/|D|) × Info(Dⱼ), with the sum running over j = 1 to v,
where
● |Dⱼ|/|D| acts as the weight of the jth partition
● Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A

The smaller the expected information (still) required, the greater the purity of the partitions.
Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A), that is,
Gain(A) = Info(D) − Info_A(D)
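These quantities are easy to compute directly. The sketch below uses made-up class counts loosely modelled on the classic buys_computer data mentioned earlier (9 "yes" and 5 "no" tuples, split three ways by a hypothetical attribute A):

import math

def info(labels):
    # Info(D) = -sum_i p_i * log2(p_i)
    total = len(labels)
    probs = [labels.count(c) / total for c in set(labels)]
    return -sum(p * math.log2(p) for p in probs)

def info_after_partition(partitions):
    # Info_A(D) = sum_j |D_j|/|D| * Info(D_j)
    total = sum(len(p) for p in partitions)
    return sum(len(p) / total * info(p) for p in partitions)

D = ["yes"] * 9 + ["no"] * 5
partitions = [["yes"] * 2 + ["no"] * 3, ["yes"] * 4, ["yes"] * 3 + ["no"] * 2]
print(round(info(D), 3))                                      # 0.940
print(round(info(D) - info_after_partition(partitions), 3))   # Gain(A) ≈ 0.246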
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values.
Gain Ratio is an extension to information gain that attempts to overcome this bias. It applies a kind of normalization to information gain using a "split information" value, defined as
SplitInfo_A(D) = − ∑ⱼ (|Dⱼ|/|D|) × log₂(|Dⱼ|/|D|), with the sum running over j = 1 to v.
This value represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. The gain ratio is then defined as GainRatio(A) = Gain(A) / SplitInfo_A(D).
The attribute with the maximum gain ratio is selected as the splitting attribute.

Tree Pruning:

When a decision tree is built, many of the branches will reflect anomalies in the
training data due to noise or outliers. Tree pruning methods address this problem of
overfitting the data. Such methods typically use statistical measures to remove the
least reliable branches.
Advantages of Pruned Trees:
1. Smaller.
2. Less Complex.
3. Easier to comprehend.
4. Faster and better at correctly classifying independent test data.

Tree pruning approaches:


There are two common approaches to tree pruning: Pre pruning and Post pruning.
In the pre-pruning approach, a tree is "pruned" by halting its construction early (e.g., by deciding not to further split or partition the subset of training tuples at a given node). Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples or the probability distribution of those tuples.
If partitioning the tuples at a node would result in a split whose goodness measure (e.g., information gain) falls below a prespecified threshold, then further partitioning of the given subset is halted.
There are difficulties, however, in choosing an appropriate threshold. High
thresholds could result in oversimplified trees, whereas low thresholds could result
in very little simplification. The second and more common approach is Post Pruning,
which removes subtrees from a “fully grown” tree. A subtree at a given node is
pruned by removing its branches and replacing it with a leaf. The leaf is labeled with
the most frequent class among the subtree being replaced. The Cost Complexity
Pruning Algorithm used in CART is an example of the post pruning approach. This
approach considers the cost complexity of a tree to be a function of the number of
leaves in the tree and the error rate of the tree (where the error rate is the percentage
of tuples misclassified by the tree). It starts from the bottom of the tree. For each
internal node, N, it computes the cost complexity of the subtree at N, and the cost
complexity of the subtree at N if it were to be pruned (i.e., replaced by a leaf node).
The two values are compared. If pruning the subtree at node N would result in a
smaller cost complexity, then the subtree is pruned. Otherwise, it is kept. A pruning
set of class-labeled tuples is used to estimate cost complexity. This set is independent
of the training set used to build the unpruned tree and of any test set used for
accuracy estimation. The algorithm generates a set of progressively pruned trees. In
general, the smallest decision tree that minimizes the cost complexity is preferred.
Pessimistic pruning, by contrast, uses the training set to estimate error rates; it adjusts the error rates obtained from the training set by adding a penalty, so as to counter the bias incurred.
The “best” pruned tree is the one that minimizes the number of encoding bits. This
method adopts the Minimum Description Length (MDL) principle. The basic idea is
that the simplest solution is preferred. Unlike cost complexity pruning, it does not
require an independent set of tuples.
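Cost complexity pruning is exposed directly by scikit-learn's CART-based DecisionTreeClassifier. A possible sketch, with an arbitrary stand-in dataset and a held-out split playing the role of the independent pruning set:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_prune, y_train, y_prune = train_test_split(X, y, random_state=0)
# Effective alphas of the progressively pruned subtrees.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
# Keep the pruned tree that scores best on the independent pruning set.
best_alpha = max(
    path.ccp_alphas,
    key=lambda a: DecisionTreeClassifier(random_state=0, ccp_alpha=a)
        .fit(X_train, y_train)
        .score(X_prune, y_prune),
)
print("chosen ccp_alpha:", best_alpha)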
ISSUES IN DECISION TREE LEARNING
Practical issues in learning decision trees include determining how deeply to grow the
decision tree, handling continuous attributes, choosing an appropriate attribute selection
measure, handling training data with missing attribute values, handling attributes with
differing costs, and improving computational efficiency

Avoiding Overfitting the Data


The algorithm described grows each branch of the tree just deeply enough to perfectly classify the training examples. While this is sometimes a reasonable strategy, it can lead to difficulties when there is noise in the data, or when the number of training examples is too small to produce a representative sample of the true target function. In either of these cases, this simple algorithm can produce trees that overfit the training examples.
We will say that a hypothesis overfits the training examples if some other hypothesis that
fits the training examples less well actually performs better over the entire distribution
of instances (i.e., including instances beyond the training set).

Definition: Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H, such that h has smaller error than h' over the training examples, but h' has a smaller error than h over the entire distribution of instances.
Figure 3.6 illustrates the impact of overfitting in a typical application of decision tree learning. In this case, the ID3 algorithm is applied to the task of learning which medical
patients have a form of diabetes. The horizontal axis of this plot indicates the total
number of nodes in the decision tree, as the tree is being constructed.
The vertical axis indicates the accuracy of predictions made by the tree. The solid line
shows the accuracy of the decision tree over the training examples, whereas the broken
line shows accuracy measured over an independent set of test examples (not included in
the training set). Predictably, the accuracy of the tree
over the training examples increases monotonically as the tree is grown. However, the
accuracy measured over the independent test examples first increases, then decreases.
As can be seen, once the tree size exceeds approximately 25 nodes, further elaboration of
the tree decreases its accuracy over the test examples despite increasing its accuracy on
the training examples.
How can it be possible for tree h to fit the training examples better than h', but for it to
perform more poorly over subsequent examples? One way this can occur is when the
training examples contain random errors or noise.

Consider, for example, the effect of adding the following incorrectly labelled training example:
<Fever = Yes, Cough = No, Breathing Issues = No, Infected = Yes>


Given the original error-free data, ID3 produces the decision tree shown below. However,
the addition of this incorrect example will now cause ID3 to construct a more complex
tree. In particular, the new example will be sorted into the second leaf node from the left
in the learned tree of the figure below, along with the previous positive examples D9 and D11. Because the new example is labelled as a negative example, ID3 will search for further
refinements to the tree below this node. Of course as long as the new erroneous example
differs in some arbitrary way from the other examples affiliated with this node, ID3 will
succeed in finding a new decision attribute to separate out this new example from the
two previous positive examples at this tree node. The result is that ID3 will output a
decision tree (h) that is more complex than the original tree from Figure (h'). Of course
h will fit the collection of training examples perfectly, whereas the simpler h' will not.
However, given that the new decision node is simply a consequence of fitting the noisy training example, we expect h' to outperform h over subsequent data drawn from the same instance distribution.
The above example illustrates how random noise in the training examples can lead to
overfitting. In fact, overfitting is possible even when the training data are noise-free,
especially when small numbers of examples are associated with leaf nodes. In this case,
it is quite possible for coincidental regularities to occur, in which some attribute happens
to partition the examples very well, despite being unrelated to the actual target function.
Whenever such coincidental regularities exist, there is a risk of overfitting.
Overfitting is a significant practical difficulty for decision tree learning and many other
learning methods. For example, in one experimental study of ID3 involving five different
learning tasks with noisy, nondeterministic data overfitting was found to decrease the
accuracy of learned decision trees by 10-25% on most problems. There are several
approaches to avoiding overfitting in decision tree learning.
These can be grouped into two classes:
 approaches that stop growing the tree earlier, before it reaches the point where it
perfectly classifies the training data,
 approaches that allow the tree to overfit the data, and then post-prune the tree.
Although the first of these approaches might seem more direct, the second approach of
post-pruning overfit trees has been found to be more successful in practice. This is due to
the difficulty in the first approach of estimating precisely when to stop growing the tree.
Regardless of whether the correct tree size is found by stopping early or by post-pruning,
a key question is what criterion is to be used to determine the correct final tree size.
Approaches include:
 Use a separate set of examples, distinct from the training examples, to evaluate the
utility of post-pruning nodes from the tree.
 Use all the available data for training, but apply a statistical test to estimate
whether expanding (or pruning) a particular node is likely to produce an
improvement beyond the training set.
 Use an explicit measure of the complexity for encoding the training examples and
the decision tree, halting growth of the tree when this encoding size is minimized.
This approach, based on a heuristic called the Minimum Description Length
principle.

Figure 3.6
Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the
accuracy of the tree measured over the training examples increases monotonically.
However, when measured over a set of test examples independent of the training
examples, accuracy first increases, then decreases.
The first of the above approaches is the most common and is often referred to as a
training and validation set approach. We discuss the two main variants of this approach
below. In this approach, the available data are separated into two sets of examples: a
training set, which is used to form the learned hypothesis, and a separate validation set,
which is used to evaluate the accuracy of this hypothesis over subsequent data and, in
particular, to evaluate the impact of pruning this hypothesis. The motivation is this: Even
though the learner may be misled by random errors and coincidental regularities within
the training set, the validation set is unlikely to exhibit the same random fluctuations.
Therefore, the validation set can be expected to provide a safety check against overfitting
the spurious characteristics of the training set. Of course, it is important that the
validation set be large enough to itself provide a statistically significant sample of the
instances. One common heuristic is to withhold one-third of the available examples for
the validation set, using the other two-thirds for training.

REDUCED ERROR PRUNING


How exactly might we use a validation set to prevent overfitting? One approach, called
reduced-error pruning (Quinlan 1987), is to consider each of the decision nodes in
the tree to be candidates for pruning. Pruning a decision node consists of removing the
subtree rooted at that node, making it a leaf node, and assigning it the most common
classification of the training examples affiliated with that node.
Nodes are removed only if the resulting pruned tree performs no worse than the original
over the validation set. This has the effect that any leaf node added due to coincidental
regularities in the training set is likely to be pruned because these same coincidences are
unlikely to occur in the validation set. Nodes are pruned iteratively, always choosing the
node whose removal most increases the decision tree accuracy over the validation set.
Pruning of nodes continues until further pruning is harmful (i.e., decreases accuracy of
the tree over the validation set). The impact of reduced-error pruning on the accuracy of
the decision tree is illustrated in Figure 3.7. As in Figure 3.6, the accuracy of the tree is
shown measured over both training examples and test examples. The additional line in
Figure 3.7 shows accuracy over the test examples as the tree is pruned. When pruning
begins, the tree is at its maximum size and lowest accuracy over the test set. As pruning
proceeds, the number of nodes is reduced and accuracy over the test set increases. Here,
the available data has been split into three subsets: the training examples, the validation
examples used for pruning the tree, and a set of test examples used to provide an
unbiased estimate of accuracy over future unseen examples. The plot shows accuracy
over the training and test sets. Accuracy over the validation set used for pruning is not
shown. Using a separate set of data to guide pruning is an effective approach provided
a large amount of data is available. The major drawback of this approach is that when
data is limited, withholding part of it for the validation set reduces even further the
number of examples available for training. The following section presents an alternative
approach to pruning that has been found useful in many practical situations where data
is limited. Many additional techniques have been proposed as well, involving partitioning
the available data several different times in multiple ways, then averaging the results.

FIGURE 3.7
Effect of reduced-error pruning in decision tree learning. This plot shows the same curves of training and
test set accuracy as in Figure 3.6. In addition, it shows the impact of reduced error pruning of the tree
produced by ID3. Notice the increase in accuracy over the test set as nodes are pruned from the tree. Here,
the validation set used for pruning is distinct from both the training and test sets.

Underfitting in Machine Learning


A statistical model or a machine learning algorithm is said to underfit when it cannot capture the underlying trend of the data, i.e., it performs poorly even on the training data and fails to generalize to testing data. (It's just like trying to fit undersized pants!)
Underfitting destroys the accuracy of our machine learning model. Its occurrence simply means that our model or algorithm does not fit the data well enough. It usually happens when we have too little data to build an accurate model, or when we try to fit a linear model to non-linear data. In such cases, the rules of the machine learning model are too simple to capture the data, and the model will probably make a lot of wrong predictions. Underfitting can be avoided by using more data, and also by reducing the number of features through feature selection.
In a nutshell, underfitting refers to a model that can neither perform well on the training data nor generalize to new data.
Reasons for Underfitting
1. High bias and low variance.
2. The size of the training dataset used is not enough.
3. The model is too simple.
4. Training data is not cleaned and also contains noise in it.
Techniques to Reduce Underfitting
1. Increase model complexity.
2. Increase the number of features, performing feature engineering.
3. Remove noise from the data.
4. Increase the number of epochs or increase the duration of training to get better
results.

Overfitting in Machine Learning


A statistical model is said to be overfitted when it does not make accurate predictions on testing data. When a model is trained on so much data that it starts learning from the noise and inaccurate entries in the data set, testing on new data results in high variance: the model fails to categorize the data correctly because of too many details and too much noise. Common causes of overfitting are non-parametric and non-linear methods, because these types of machine learning algorithms have more freedom in building the model from the dataset and can therefore build unrealistic models. Solutions to avoid overfitting include using a linear algorithm if we have linear data, or using parameters such as the maximal depth if we are using decision trees.
In a nutshell, overfitting is a problem where the evaluation of a machine learning algorithm on its training data differs markedly from its performance on unseen data.

Reasons for Overfitting:


1. High variance and low bias.
2. The model is too complex.
3. The size of the training data is too small relative to the model's complexity.

Techniques to Reduce Overfitting


1. Increase training data.
2. Reduce model complexity.
3. Early stopping during the training phase (have an eye over the loss over the training
period as soon as loss begins to increase stop training).
4. Ridge Regularization and Lasso Regularization.
5. Use dropout for neural networks to tackle overfitting.
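As an illustration of point 2 (reducing model complexity), the sketch below contrasts an unconstrained decision tree with a depth-capped one; the dataset and the depth of 3 are arbitrary choices for demonstration:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
capped = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)
# The full tree usually fits the training data perfectly yet generalizes worse.
print("full  :", full.score(X_train, y_train), full.score(X_test, y_test))
print("capped:", capped.score(X_train, y_train), capped.score(X_test, y_test))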
Example of Constructing a Decision Tree
We’ll be using a sample dataset of COVID-19 infection. A preview of the entire dataset is
shown below.
ID Fever Cough Breathing Issues Infected
1 No No No No
2 Yes Yes Yes Yes
3 Yes Yes No No
4 Yes No Yes Yes
5 Yes Yes Yes Yes
6 No Yes No No
7 Yes No Yes Yes
8 Yes No Yes Yes
9 No Yes Yes Yes
10 Yes Yes No Yes
11 No Yes No No
12 No Yes Yes Yes
13 No Yes Yes No
14 Yes Yes No No

The columns used to make decision nodes viz. ‘Breathing Issues’, ‘Cough’ and ‘Fever’ are
called feature columns or just features and the column used for leaf nodes i.e. ‘Infected’ is
called the target column.
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n

From the total of 14 rows in our dataset S, there are 8 rows with the target

value YES and 6 rows with the target value NO. The entropy of S is calculated as:
Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) ≈ 0.99

IG calculation for Fever:


In this(Fever) feature there are 8 rows having value YES and 6 rows having value NO.
As shown below, in the 8 rows with YES for Fever, there are 6 rows having target
value YES and 2 rows having target value NO.
Fever Cough Breathing Issues Infected
Yes Yes Yes Yes
Yes Yes No No
Yes No Yes Yes
Yes Yes Yes Yes
Yes No Yes Yes
Yes No Yes Yes
Yes Yes No Yes
Yes Yes No No

As shown below, in the 6 rows with NO, there are 2 rows having target
value YES and 4 rows having target value NO.
Fever Cough Breathing Issues Infected
No No No No
No Yes No No
No Yes Yes Yes
No Yes No No
No Yes Yes Yes
No Yes Yes No

The block, below, demonstrates the calculation of Information Gain for Fever.
# total rows

|S| = 14
For v = YES, |Sᵥ| = 8

Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81


For v = NO, |Sᵥ| = 6

Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.91

# Expanding the summation in the IG formula:

IG(S, Fever) = Entropy(S) - (|Sʏᴇꜱ| / |S|) * Entropy(Sʏᴇꜱ) - (|Sɴᴏ| / |S|) * Entropy(Sɴᴏ)

∴ IG(S, Fever) = 0.99 - (8/14) * 0.81 - (6/14) * 0.91 ≈ 0.13


Next, we calculate the IG for the features “Cough” and “Breathing issues”.
IG(S, Cough) = 0.04
IG(S, BreathingIssues) = 0.40
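The numbers above can be reproduced with a few lines of plain Python (a quick check, not part of the original worked example):

import math

def entropy(yes, no):
    total = yes + no
    return -sum(c / total * math.log2(c / total) for c in (yes, no) if c > 0)

e_S = entropy(8, 6)                                   # Entropy(S) ≈ 0.99
ig_fever = e_S - (8/14) * entropy(6, 2) - (6/14) * entropy(2, 4)
print(round(e_S, 2), round(ig_fever, 2))              # 0.99 0.13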

Since the feature Breathing Issues has the highest Information Gain, it is used to create the root node. Hence, after this initial step our tree looks like this:

Next, from the remaining two unused features, namely, Fever and Cough, we decide
which one is the best for the left branch of Breathing Issues.
Since the left branch of Breathing Issues denotes YES, we will work with the subset of the
original data i.e the set of rows having YES as the value in the Breathing Issues
column. These 8 rows are shown below:
Fever Cough Breathing Issues Infected
Yes Yes Yes Yes
Yes No Yes Yes
Yes Yes Yes Yes
Yes No Yes Yes
Yes No Yes Yes
No Yes Yes Yes
No Yes Yes Yes
No Yes Yes No

Next, we calculate the IG for the features Fever and Cough using the subset Sʙʏ (the set of rows with Breathing Issues = Yes), which is shown above:


Note: For IG calculation the Entropy will be calculated from the subset Sʙʏ and not the
original dataset S.

IG(Sʙʏ, Fever) = 0.20

IG(Sʙʏ, Cough) = 0.09

The IG of Fever is greater than that of Cough, so we select Fever as the left branch of Breathing Issues. Our tree now looks like this:
Next, we find the feature with the maximum IG for the right branch of Breathing Issues.
But, since there is only one unused feature left we have no other choice but to make it the
right branch of the root node.

So our tree now looks like this:

There are no more unused features, so we stop here and jump to the final step of creating
the leaf nodes.

For the left leaf node of Fever, we see the subset of rows from the original data set that
has Breathing Issues and Fever both values as YES.
Fever Cough Breathing Issues Infected
Yes Yes Yes Yes
Yes No Yes Yes
Yes Yes Yes Yes
Yes No Yes Yes
Yes No Yes Yes

Since all the values in the target column are YES, we label the left leaf node as YES, but to
make it more logical we label it Infected.
Similarly, for the right node of Fever we see the subset of rows from the original data set
that have Breathing Issues value as YES and Fever as NO.

Fever Cough Breathing Issues Infected
No Yes Yes Yes
No Yes Yes No
No Yes Yes No
Here not all, but most, of the values are NO; hence NO, or Not Infected, becomes our right leaf node.
Our tree, now, looks like this:

We repeat the same process for the node Cough; however, here both the left and right leaves turn out to be the same, i.e., NO or Not Infected, as shown below:

The right node of Breathing Issues is as good as just a leaf node with class 'Not Infected'. This is one of the drawbacks of ID3: it does not do pruning.

Example
In the following example, we are going to implement a Decision Tree classifier on the Pima Indian Diabetes dataset −
First, start with importing necessary python packages −
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Next, load the Pima Indian Diabetes dataset from its CSV file as follows −
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()

pregnant glucose bp skin insulin bmi pedigree age label


0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
Now, split the dataset into features and target variable as follows −
feature_cols = ['pregnant', 'insulin', 'bmi', 'age','glucose','bp','pedigree']
X = pima[feature_cols] # Features
y = pima.label # Target variable
Next, we will divide the data into train and test split. The following code will split the
dataset into 70% training data and 30% of testing data −
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
Next, train the model with the help of DecisionTreeClassifier class of sklearn as follows −
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
At last we need to make prediction. It can be done with the help of following script −
y_pred = clf.predict(X_test)
Next, we can get the accuracy score, confusion matrix and classification report as follows

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[116 30]
[ 46 39]]
Classification Report:
precision recall f1-score support
0 0.72 0.79 0.75 146
1 0.57 0.46 0.51 85
micro avg 0.67 0.67 0.67 231
macro avg 0.64 0.63 0.63 231
weighted avg 0.66 0.67 0.66 231

Accuracy: 0.670995670995671

Visualizing Decision Tree


The above decision tree can be visualized with the help of following code −
from sklearn.tree import export_graphviz
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn versions
from IPython.display import Image
import pydotplus

dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
filled=True, rounded=True,
special_characters=True,feature_names = feature_cols,class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Pima_diabetes_Tree.png')
Image(graph.create_png())

K-Nearest Neighbour

o K-Nearest Neighbour is one of the simplest Machine Learning algorithms, based on the Supervised Learning technique.
o K-NN algorithm assumes the similarity between the new case/data and the available cases, and puts the new case into the category that is most similar to the available categories.
o K-NN algorithm stores all the available data and classifies a new data point based on similarity. This means that when new data appears, it can be easily classified into a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for Regression as well as for Classification but mostly
it is used for the Classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any
assumption on underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training
set immediately instead it stores the dataset and at the time of classification, it
performs an action on the dataset.
o KNN algorithm at the training phase just stores the dataset and when it gets new
data, then it classifies that data into a category that is much similar to the new data.
o Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it works on a similarity measure. Our KNN model will find the features of the new image most similar to the cat and dog images and, based on the most similar features, put it in either the cat or the dog category.

The most basic instance-based method is the k-NEAREST NEIGHBOR algorithm. This algorithm assumes all instances correspond to points in the n-dimensional space ℝⁿ. The nearest neighbors of an instance are defined in terms of the standard Euclidean distance. More precisely, let an arbitrary instance x be described by the feature vector
⟨a₁(x), a₂(x), ..., aₙ(x)⟩
where aᵣ(x) denotes the value of the rth attribute of instance x. Then the distance between two instances xᵢ and xⱼ is defined to be d(xᵢ, xⱼ), where
d(xᵢ, xⱼ) = √( ∑ᵣ (aᵣ(xᵢ) − aᵣ(xⱼ))² ), with the sum running over r = 1 to n.
In nearest-neighbor learning the target function may be either discrete-valued or real-valued. Let us first consider learning discrete-valued target functions of the form f : ℝⁿ → V, where V is the finite set {v₁, ..., vₛ}. The k-NEAREST NEIGHBOR algorithm approximates such a discrete-valued target function. The value f̂(x_q) returned by this algorithm as its estimate of f(x_q) is just the most common value of f among the k training examples nearest to x_q. If we choose k = 1, then the 1-NEAREST NEIGHBOR algorithm assigns to f̂(x_q) the value f(xᵢ), where xᵢ is the training instance nearest to x_q. For larger values of k, the algorithm assigns the most common value among the k nearest training examples.
The figure below illustrates the operation of the k-NEAREST NEIGHBOR algorithm for the case where the instances are points in a two-dimensional space and where the target function is boolean-valued. The positive and negative training examples are shown by "+" and "−" respectively. A query point x_q is shown as well. Note that the 1-NEAREST NEIGHBOR algorithm classifies x_q as a positive example in this figure, whereas the 5-NEAREST NEIGHBOR algorithm classifies it as a negative example.

FIGURE
k-NEAREST NEIGHBOR: A set of positive and negative training examples is shown on the left, along with a query instance x_q to be classified. The 1-NEAREST NEIGHBOR algorithm classifies x_q positive, whereas the 5-NEAREST NEIGHBOR algorithm classifies it as negative. On the right is the decision surface induced by the 1-NEAREST NEIGHBOR algorithm for a typical set of training examples. The convex polygon surrounding each training example indicates the region of instance space closest to that point (i.e., the instances for which the 1-NEAREST NEIGHBOR algorithm will assign the classification belonging to that training example).

Note that the k-NEAREST NEIGHBOR algorithm never forms an explicit general hypothesis f̂ regarding the target function f. It simply computes the classification of each new query instance as needed. Nevertheless, we can still ask what the implicit general function is, or what classifications would be assigned if we were to hold the training examples constant and query the algorithm with every possible instance in X. The diagram on the right side of the figure above shows the shape of the decision surface induced by 1-NEAREST NEIGHBOR over the entire instance space. This kind of diagram is often called the Voronoi diagram of the set of training examples.

Why do we need a K-NN Algorithm?

Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1, so this data point will lie in which of these categories. To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular dataset. Consider the below diagram:

How does K-NN work?


The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
o Step-6: Our model is ready.
o Suppose we have a new data point and we need to put it in the required category. Consider the below image:
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean distance is the distance between two points, which we have already studied in geometry. It can be calculated as:
d = √((x₂ − x₁)² + (y₂ − y₁)²)
o By calculating the Euclidean distance we got the nearest neighbors, as three


nearest neighbors in category A and two nearest neighbors in category B. Consider
the below image:

o As we can see the 3 nearest neighbors are from category A, hence this new data
point must belong to category A.

How to select the value of K in the K-NN Algorithm?


Below are some points to remember while selecting the value of K in the K-NN algorithm:
o There is no particular way to determine the best value for "K", so we need to try
some values to find the best out of them. The most preferred value for K is 5.
o A very low value for K, such as K = 1 or K = 2, can be noisy and subject to the effects of outliers in the model.
o Larger values for K smooth out noise, but too large a value can blur the class boundaries and make computation more expensive.

Steps to implement the K-NN algorithm:


 Data Pre-processing step
 Fitting the K-NN algorithm to the Training set
 Predicting the test result
 Test accuracy of the result(Creation of Confusion matrix)
 Visualizing the test set result.

K Nearest Neighbor Algorithm

TABLE
The k-NEAREST NEIGHBOR algorithm for approximating a discrete-valued function f : ℝⁿ → V.

Example of k-nearest neighbour algorithm


For the data set given below, predict the output of the following instance for k=5, <
Brightness=20, Saturation=35, Class=?>

BRIGHTNESS SATURATION CLASS


40 20 Red
50 50 Blue
60 90 Blue
10 25 Red
70 70 Blue
60 10 Red
25 80 Blue

We have two columns — Brightness and Saturation. Each row in the table has a class of
either Red or Blue.
Before we introduce a new data entry, let's assume the value of K is 5.
To know its class, we have to calculate the distance from the new entry to other entries
in the data set using the Euclidean distance formula.

Here's the formula: √((X₂ − X₁)² + (Y₂ − Y₁)²)


Where:
 X₂ = New entry's brightness (20).
 X₁= Existing entry's brightness.
 Y₂ = New entry's saturation (35).
 Y₁ = Existing entry's saturation.
Distance #1
For the first row, d1:

BRIGHTNESS SATURATION CLASS


40 20 Red

d1 = √((20 − 40)² + (35 − 20)²)
= √(400 + 225)
= √625
= 25
We now know the distance from the new data entry to the first entry in the table. Let's
update the table.
BRIGHTNESS SATURATION CLASS DISTANCE
40 20 Red 25
50 50 Blue 33.54
60 90 Blue 68.01
10 25 Red 14.14
70 70 Blue 61.03
60 10 Red 47.17
25 80 Blue 45.28
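These distances can be checked with a few lines of NumPy; this snippet is an
illustrative aid, not part of the original worked example:

import numpy as np

points = np.array([[40, 20], [50, 50], [60, 90], [10, 25], [70, 70], [60, 10], [25, 80]])
classes = ['Red', 'Blue', 'Blue', 'Red', 'Blue', 'Red', 'Blue']
query = np.array([20, 35])   # the new entry (Brightness=20, Saturation=35)

# Euclidean distance from the new entry to every existing entry
distances = np.sqrt(((points - query) ** 2).sum(axis=1))
for (b, s), c, d in zip(points, classes, distances):
    print(b, s, c, round(d, 2))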

Let's rearrange the distances in ascending order:


BRIGHTNESS SATURATION CLASS DISTANCE
10 25 Red 14.14
40 20 Red 25
50 50 Blue 33.54
25 80 Blue 45.28
60 10 Red 47.17
70 70 Blue 61.03
60 90 Blue 68.01
Since we chose 5 as the value of K, we'll only consider the first five rows. That is:
BRIGHTNESS SATURATION CLASS DISTANCE
10 25 Red 14.14
40 20 Red 25
50 50 Blue 33.54
25 80 Blue 45.28
60 10 Red 47.17
As you can see above, the majority class within the 5 nearest neighbors to the new entry
is Red. Therefore, we'll classify the new entry as Red.
Here's the updated table:

BRIGHTNESS SATURATION CLASS


40 20 Red
50 50 Blue
60 90 Blue
10 25 Red
70 70 Blue
60 10 Red
25 80 Blue
20 35 Red

Implementation in Python
As we know, the K-nearest neighbors (KNN) algorithm can be used for both classification
and regression. The following are recipes in Python for using KNN as a classifier as well
as a regressor −
KNN as Classifier
First, start with importing necessary python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read dataset to pandas dataframe as follows −
dataset = pd.read_csv(path, names=headernames)
dataset.head()
slno. sepal-length sepal-width petal-length petal-width Class

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

Data Preprocessing will be done with the help of following script lines −
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into train and test split. Following code will split the dataset
into 60% training data and 40% of testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
Next, data scaling will be done as follows −
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Next, train the model with the help of KNeighborsClassifier class of sklearn as follows −
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
At last, we need to make predictions. It can be done with the help of the following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60

Accuracy: 0.8833333333333333

KNN as Regressor
First, start with importing necessary Python packages −
import numpy as np
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read dataset to pandas dataframe as follows −
data = pd.read_csv(path, names=headernames)
array = data.values
# First two measurements as inputs; third measurement as the regression target
X = array[:, :2].astype(float)
Y = array[:, 2].astype(float)
data.shape

output:(150, 5)
Next, import KNeighborsRegressor from sklearn to fit the model −
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, Y)
At last, we can find the MSE as follows −
print("The MSE is:", np.power(Y - knnr.predict(X), 2).mean())
Output
The MSE is: 0.12226666666666669
Naïve Bayes Algorithm
What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities, such as the probability that a given tuple belongs
to a particular class. Bayesian classification is based on Bayes’ theorem, described below.
Studies comparing classification algorithms have found a simple Bayesian classifier
known as the naive Bayesian classifier to be comparable in performance with decision
tree and selected neural network classifiers. Bayesian classifiers have also exhibited high
accuracy and speed when applied to large databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class
conditional independence. It is made to simplify the computations involved and, in this
sense, is considered “naïve.” Bayesian belief networks are graphical models, which unlike
naïve Bayesian classifiers, allow the representation of dependencies among subsets of
attributes. Bayesian belief networks can also be used for classification.
Bayes’ Theorem
Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who
did early work in probability and decision theory during the 18th century.
Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is
described by measurements made on a set of n attributes. Let H be some hypothesis, such
as that the data tuple X belongs to a specified class C. For classification problems, we want
to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or
observed data tuple X. In other words, we are looking for the probability that tuple X
belongs to class C, given that we know the attribute description of X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For
example, suppose our world of data tuples is confined to customers described by the
attributes age and income, respectively, and that X is a 35-year-old customer with an
income of $40,000. Suppose that H is the hypothesis that our customer will buy a
computer. Then P(H|X) reflects the probability that customer X will buy a computer given
that we know the customer’s age and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example,
this is the probability that any given customer will buy a computer, regardless of age,
income, or any other information, for that matter. The posterior probability, P(H|X), is
based on more information (e.g., customer information) than the prior probability, P(H),
which is independent of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the
probability that a customer, X, is 35 years old and earns $40,000, given that we know the
customer will buy a computer.
P(X)is the prior probability of X. Using our example, it is the probability that a person
from our set of customers is 35 years old and earns $40,000.
“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from
the given data, as we shall see below. Bayes’ theorem is useful in that it provides a way of
calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X).
Bayes’ theorem is

Naïve Bayesian Classification


One highly practical Bayesian learning method is the naive Bayes learner, often called the
naive Bayes classifier. In some domains its performance has been shown to be
comparable to that of neural network and decision tree learning.
The naive Bayes classifier applies to learning tasks where each instance x is described by
a conjunction of attribute values and where the target function f(x) can take on any
value from some finite set V. A set of training examples of the target function is provided,
and a new instance is presented, described by the tuple of attribute values
(a1, a2, ..., an). The learner is asked to predict the target value, or classification,
for this new instance. The Bayesian approach to classifying the new instance is to assign
the most probable target value, vMAP, given the attribute values (a1, a2, ..., an) that
describe the instance:

vMAP = argmax_{vj ∈ V} P(vj | a1, a2, ..., an)

We can use Bayes theorem to rewrite this expression as

vMAP = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) / P(a1, a2, ..., an)
     = argmax_{vj ∈ V} P(a1, a2, ..., an | vj) P(vj) --------------------------------(6.19)
Now we could attempt to estimate the two terms in Equation (6.19) based on the training
data. It is easy to estimate each of the P(vj) simply by counting the frequency with which
each target value vj occurs in the training data. However, estimating the different
P(a1, a2, ..., an | vj) terms in this fashion is not feasible unless we have a very, very
large set of training data. The problem is that the number of these terms is equal to the
number of possible instances times the number of possible target values. Therefore, we
need to see every instance in the instance space many times in order to obtain reliable
estimates.
The naive Bayes classifier is based on the simplifying assumption that the attribute values
are conditionally independent given the target value. In other words, the assumption is
that given the target value of the instance, the probability of observing the conjunction
a1, a2, ..., an is just the product of the probabilities for the individual attributes:

P(a1, a2, ..., an | vj) = ∏i P(ai | vj)

Substituting this into Equation (6.19), we have the approach used by the naive Bayes
classifier.
Naive Bayes classifier:

vNB = argmax_{vj ∈ V} P(vj) ∏i P(ai | vj)
where vNB denotes the target value output by the naive Bayes classifier. Notice that in a
naive Bayes classifier the number of distinct P(ai | vj) terms that must be estimated from
the training data is just the number of distinct attribute values times the number of
distinct target values, a much smaller number than if we were to estimate the
P(a1, a2, ..., an | vj) terms as first contemplated.
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2,..., An.
2. Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned
on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if
and only if

P(Ci | X) > P(Cj | X)   for 1 ≤ j ≤ m, j ≠ i
Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem,

P(Ci | X) = P(X | Ci) P(Ci) / P(X) ----------------------------(2)
3. As P(X) is constant for all classes, only P(X| Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|
Ci). Otherwise, we maximize P(X| Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive
to compute P(X | Ci). To reduce computation, the naïve assumption of class-conditional
independence is made. Thus,

P(X | Ci) = ∏(k=1 to n) P(xk | Ci) = P(x1 | Ci) × P(x2 | Ci) × ··· × P(xn | Ci) --------------------------------(3)

We can easily estimate the probabilities P(x1| Ci), P(x2| Ci),..., P(xn|Ci)from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-
valued. For instance, to compute P(X|Ci), we consider the following:
(a) If Ak is categorical, then P(xk| Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by | Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward. A continuous-valued attribute is typically assumed to have a
Gaussian distribution with a mean µ and standard deviation σ, defined by

g(x, µ, σ) = (1 / (√(2π) σ)) e^(−(x − µ)² / (2σ²)) -----------------------(4)

so that

P(xk | Ci) = g(xk, µCi, σCi) ----------------------------(5)
These equations may appear daunting, but hold on! We need to compute µCi and σCi ,
which are the mean (i.e., average) and standard deviation, respectively, of the values of
attribute Ak for training tuples of class Ci . We then plug these two quantities into Equation
(4), together with xk, in order to estimate P(xk|Ci). For example, let X = (35, $40,000),
where A1 and A2 are the attributes age and income, respectively. Let the class label
attribute be buys computer. The associated class label for X is yes (i.e., buys computer =
yes). Let’s suppose that age has not been discretized and therefore exists as a continuous-
valued attribute. Suppose that from the training set, we find that customers in D who buy
a computer are 38 ±12 years of age. In other words, for attribute age and this class, we
have µ = 38 years and σ = 12. We can plug these quantities, along with x1 = 35 for our
tuple X into Equation (4) in order to estimate P(age = 35|buys computer = yes).
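As a quick illustrative check (not in the original text, and assuming SciPy is available),
this estimate can be computed with the normal density:

from scipy.stats import norm

mu, sigma = 38, 12   # mean and standard deviation of age for buys_computer = yes
# Gaussian density of Equation (4) evaluated at age = 35
print(norm.pdf(35, loc=mu, scale=sigma))   # ≈ 0.0322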

5. In order to predict the class label of X, P(X | Ci) P(Ci) is evaluated for each class
Ci. The classifier predicts that the class label of tuple X is the class Ci if and only if

P(X | Ci) P(Ci) > P(X | Cj) P(Cj)   for 1 ≤ j ≤ m, j ≠ i ------------------(6)

In other words, the predicted class label is the class Ci for which P(X | Ci) P(Ci) is the
maximum.
Bayesian classifiers are also useful in that they provide a theoretical justification for other
classifiers that do not explicitly use Bayes’ theorem. For example, under certain
assumptions, it can be shown that many neural network and curve-fitting algorithms
output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.

Example of Naïve Bayes


Predict the outcome for (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
Given the data set
Day Outlook Temperature Humidity Wind PlayTennis
1 Sunny Hot High Weak No
2 Sunny Hot High Strong No
3 Overcast Hot High Weak Yes
4 Rain Mild High Weak Yes
5 Rain Cool Normal Weak Yes
6 Rain Cool Normal Strong No
7 Overcast Cool Normal Strong Yes
8 Sunny Mild High Weak No
9 Sunny Cool Normal Weak Yes
10 Rain Mild Normal Weak Yes
11 Sunny Mild Normal Strong Yes
12 Overcast Mild High Strong Yes
13 Overcast Hot Normal Weak Yes
14 Rain Mild High Strong No

P(PlayTennis=Yes) = 9/14 = 0.64
P(PlayTennis=No) = 5/14 = 0.36

Outlook Y N
Sunny 2/9 3/5
Overcast 4/9 0
Rain 3/9 2/5

Temperature Y N
Hot 2/9 2/5
Mild 4/9 2/5
Cool 3/9 1/5

Humidity Y N
High 3/9 4/5
Normal 6/9 1/5

Windy Y N
Strong 3/9 3/5
Weak 6/9 2/5

(Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)

vNB = argmax_{vj ∈ {yes, no}} P(vj) P(Outlook=Sunny | vj) P(Temperature=Cool | vj)
P(Humidity=High | vj) P(Wind=Strong | vj)

Vnb(yes) = P(yes) P(Sunny|yes) P(Cool|yes) P(High|yes) P(Strong|yes)
         = (9/14)(2/9)(3/9)(3/9)(3/9) ≈ 0.0053
Vnb(no) = P(no) P(Sunny|no) P(Cool|no) P(High|no) P(Strong|no)
        = (5/14)(3/5)(1/5)(4/5)(3/5) ≈ 0.0206

Normalizing so the two values sum to 1:

Vnb(yes) = Vnb(yes) / (Vnb(yes) + Vnb(no)) ≈ 0.205
Vnb(no) = Vnb(no) / (Vnb(yes) + Vnb(no)) ≈ 0.795

Since Vnb(no) > Vnb(yes), the classifier predicts PlayTennis = No for this instance.
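The same hand computation can be reproduced in a few lines of Python; this snippet is an
illustrative sketch, not part of the original example:

# Priors and conditional probabilities read off the frequency tables above
p_yes, p_no = 9/14, 5/14
v_yes = p_yes * (2/9) * (3/9) * (3/9) * (3/9)   # Sunny, Cool, High, Strong | yes
v_no = p_no * (3/5) * (1/5) * (4/5) * (3/5)     # Sunny, Cool, High, Strong | no
print(round(v_yes, 4), round(v_no, 4))          # 0.0053 0.0206
print(round(v_no / (v_yes + v_no), 3))          # 0.795, so predict PlayTennis = No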

Example
Depending on our data set, we can choose any of the Naïve Bayes models explained above.
Here, we are implementing the Gaussian Naïve Bayes model in Python −
We will start with required imports as follows −
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
Now, by using make_blobs() function of Scikit learn, we can generate blobs of points with
Gaussian distribution as follows −
from sklearn.datasets import make_blobs
X, y = make_blobs(300, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');
Next, for using GaussianNB model, we need to import and make its object as follows −
from sklearn.naive_bayes import GaussianNB
model_GNB = GaussianNB()
model_GNB.fit(X, y);
Now, we have to make predictions. It can be done after generating some new data as follows −

rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model_GNB.predict(Xnew)
Next, we are plotting new data to find its boundaries −
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='summer', alpha=0.1)
plt.axis(lim);
Now, with the help of following line of codes, we can find the posterior probabilities of
first and second label −
yprob = model_GNB.predict_proba(Xnew)
yprob[-10:].round(3)
Output
array([[0.998, 0.002],
[1. , 0. ],
[0.987, 0.013],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[0. , 1. ],
[0.986, 0.014]]
)

Support Vector Machine Algorithm


Support Vector Machine or SVM is one of the most popular Supervised Learning
algorithms, which is used for Classification as well as Regression problems. However,
primarily, it is used for Classification problems in Machine Learning.
The goal of the SVM algorithm is to create the best line or decision boundary that can
segregate n-dimensional space into classes so that we can easily put the new data point in
the correct category in the future. This best decision boundary is called a hyperplane.
SVM chooses the extreme points/vectors that help in creating the hyperplane. These
extreme cases are called support vectors, and hence the algorithm is termed Support
Vector Machine. Consider the below diagram in which there are two different categories
that are classified using a decision boundary or hyperplane:

Example: SVM can be understood with the example that we used in the KNN classifier.
Suppose we see a strange cat that also has some features of dogs, and we want a model
that can accurately identify whether it is a cat or a dog. Such a model can be created by
using the SVM algorithm. We first train our model with lots of images of cats and dogs so
that it can learn about their different features, and then we test it with this strange
creature. The support vector machine creates a decision boundary between the two classes
(cat and dog) and chooses the extreme cases (support vectors), so it will see the extreme
cases of cats and dogs. On the basis of the support vectors, it will classify the creature
as a cat. Consider the below diagram:

Types of SVM
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable data: if a dataset can be classified
into two classes by using a single straight line, the data is termed linearly separable,
and the classifier used is called the Linear SVM classifier. Linear SVMs use a linear
decision boundary to separate the data points of different classes. When the data can be
precisely linearly separated, linear SVMs are very suitable. This means that a single
straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the data
points into their respective classes. A hyperplane that maximizes the margin between the
classes is the decision boundary.
Non-linear SVM: Non-linear SVM is used for non-linearly separable data: if a dataset
cannot be classified by using a straight line, the data is termed non-linear, and the
classifier used is called the Non-linear SVM classifier. Non-linear SVM can be used to
classify data when it cannot be separated into two classes by a straight line (in the
case of 2D). By using kernel functions, non-linear SVMs can handle non-linearly separable
data. The original input data is transformed by these kernel functions into a
higher-dimensional feature space, where the data points can be linearly separated. A
linear SVM is then used to locate a decision boundary in this transformed space, which
corresponds to a non-linear boundary in the original space.

Advantages of SVM
Effective in high-dimensional cases.
It is memory efficient, as it uses a subset of training points in the decision function,
called support vectors.
Different kernel functions can be specified for the decision function, and it is possible
to specify custom kernels.

How does SVM work?

One reasonable choice as the best hyperplane is the one that represents the largest
separation or margin between the two classes.

Fig: Multiple hyperplanes separate the data from two classes


So we choose the hyperplane whose distance to the nearest data point on each side is
maximized. If such a hyperplane exists, it is known as the maximum-margin hyperplane/hard
margin. So from the above figure, we choose L2. Let's consider a scenario like the one
shown below:
Fig: Selecting a hyperplane for data with an outlier
Here we have one blue ball in the boundary of the red balls. So how does SVM classify the
data? It's simple! The blue ball in the boundary of the red ones is an outlier of the blue
balls. The SVM algorithm has the characteristic of ignoring outliers and finding the best
hyperplane that maximizes the margin. SVM is robust to outliers.

Fig: Hyperplane which is the most optimized one


So for this type of data, what SVM does is find the maximum margin as with the previous
data sets, and in addition it adds a penalty each time a point crosses the margin. The
margins in such cases are called soft margins. When there is a soft margin, the SVM tries
to minimize (1/margin + λ(∑penalty)). Hinge loss is a commonly used penalty: if there are
no violations, the hinge loss is zero; if there are violations, the hinge loss is
proportional to the distance of the violation. A small sketch of this loss follows.
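As an illustrative sketch (the labels and decision scores below are assumed values, not
taken from the text), the hinge loss for labels y ∈ {−1, +1} and decision scores w·x + b
can be written as:

import numpy as np

def hinge_loss(y, scores):
    # Zero loss for points on the correct side of the margin;
    # otherwise the loss grows linearly with the distance of the violation.
    return np.maximum(0, 1 - y * scores)

y = np.array([1, 1, -1])              # true labels
scores = np.array([2.0, 0.3, -0.5])   # w·x + b for each point
print(hinge_loss(y, scores))          # [0.  0.7 0.5]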

What to do if data are not linearly separable?

Original 1D dataset for classification


Say our data is as shown in the figure above. SVM solves this by creating a new variable
using a kernel. For a point xi on the line, we create a new variable yi as a function of
its distance from the origin o. If we plot this, we get something like the figure below.
Mapping 1D data to 2D to become able to separate the two classes
In this case, the new variable y is created as a function of distance from the origin. A
non-linear function that creates a new variable is referred to as a kernel.
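A minimal sketch of this 1D-to-2D mapping, using y = x² (a function of the distance from
the origin) as the new variable; the data values here are illustrative assumptions:

import numpy as np
import matplotlib.pyplot as plt

x = np.array([-4, -3, -2, 2, 3, 4, -1, -0.5, 0, 0.5, 1])   # 1D points
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1])       # outer class vs inner class
y_new = x ** 2   # the new variable grows with distance from the origin

# In the (x, y_new) plane the two classes become separable by a horizontal line
plt.scatter(x, y_new, c=labels, cmap='summer')
plt.show()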

SVM Implementation in Python


Predict whether a cancer is benign or malignant. Using historical data about patients
diagnosed with cancer, doctors can differentiate malignant cases from benign ones, given
the independent attributes.
Steps
 Load the breast cancer dataset from sklearn.datasets
 Separate input features and target variables.
 Build and train the SVM classifiers using RBF kernel.
 Plot the scatter plot of the input features.
 Plot the decision boundary.

# Load the important packages


from sklearn.datasets import load_breast_cancer
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

# Load the datasets


cancer = load_breast_cancer()
X = cancer.data[:, :2]
y = cancer.target

#Build the model


svm = SVC(kernel="rbf", gamma=0.5, C=1.0)
# Train the model
svm.fit(X, y)
# Plot Decision Boundary
DecisionBoundaryDisplay.from_estimator(
    svm,
    X,
    response_method="predict",
    cmap=plt.cm.Spectral,
    alpha=0.8,
    xlabel=cancer.feature_names[0],
    ylabel=cancer.feature_names[1],
)
# Scatter plot
plt.scatter(X[:, 0], X[:, 1], c=y, s=20, edgecolors="k")
plt.show()

Breast Cancer Classifications with SVM RBF kernel

Implementing SVM in Python


For implementing SVM in Python we will start with the standard libraries import as
follows −

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()
Next, we are creating a sample dataset, having linearly separable data, using make_blobs
from sklearn.datasets for classification using SVM −
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');
The following would be the output after generating sample dataset having 100 samples
and 2 clusters −
We know that SVM supports discriminative classification: it divides the classes from each
other by finding a line in the case of two dimensions, or a manifold in the case of
multiple dimensions. It is implemented on the above dataset as follows −

xfit = np.linspace(-1, 3.5)


plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
plt.plot([0.6], [2.1], 'x', color='black', markeredgewidth=4, markersize=12)
for m, b in [(1, 0.65), (0.5, 1.6), (-0.2, 2.9)]:
    plt.plot(xfit, m * xfit + b, '-k')
plt.xlim(-1, 3.5);
The output is as follows −

We can see from the above output that there are three different separators that perfectly
discriminate the above samples.
As discussed, the main goal of SVM is to divide the datasets into classes to find a maximum
marginal hyperplane (MMH) hence rather than drawing a zero line between classes we
can draw around each line a margin of some width up to the nearest point. It can be done
as follows −

xfit = np.linspace(-1, 3.5)


plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
for m, b, d in [(1, 0.65, 0.33), (0.5, 1.6, 0.55), (-0.2, 2.9, 0.2)]:
    yfit = m * xfit + b
    plt.plot(xfit, yfit, '-k')
    plt.fill_between(xfit, yfit - d, yfit + d, edgecolor='none', color='#AAAAAA', alpha=0.4)
plt.xlim(-1, 3.5);
From the above image in output, we can easily observe the “margins” within the
discriminative classifiers. SVM will choose the line that maximizes the margin.
Next, we will use Scikit-Learn’s support vector classifier to train an SVM model on this
data. Here, we are using linear kernel to fit SVM as follows −

from sklearn.svm import SVC # "Support vector classifier"


model = SVC(kernel='linear', C=1E10)
model.fit(X, y)
The output is as follows −
SVC(C=10000000000.0, cache_size=200, class_weight=None, coef0=0.0,
decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
kernel='linear', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)

Random Forest
A random forest is a supervised machine learning algorithm that is constructed from
decision tree algorithms. This algorithm is applied in various industries such as banking
and e-commerce to predict behavior and outcomes.
A random forest is a machine learning technique that’s used to solve regression and
classification problems. It utilizes ensemble learning, which is a technique that combines
many classifiers to provide solutions to complex problems.
A random forest algorithm consists of many decision trees. The ‘forest’ generated by the
random forest algorithm is trained through bagging or bootstrap aggregating. Bagging is
an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.
The random forest algorithm establishes the outcome based on the predictions of the
decision trees. It predicts by taking the average or mean of the outputs from the various
trees. Increasing the number of trees increases the precision of the outcome.
A random forest overcomes the limitations of a single decision tree algorithm: it reduces
the overfitting of datasets and increases precision. It also generates predictions
without requiring many configurations in packages (like scikit-learn).

Features of a Random Forest Algorithm


 It’s more accurate than the decision tree algorithm.
 It provides an effective way of handling missing data.
 It can produce a reasonable prediction without hyper-parameter tuning.
 It solves the issue of overfitting in decision trees.
 In every random forest tree, a subset of features is selected randomly at the node’s
splitting point.
How random forest algorithm works
Decision trees are the building blocks of a random forest algorithm. A decision tree is a
decision support technique that forms a tree-like structure.
Applying decision trees in random forest
The main difference between the decision tree algorithm and the random forest
algorithm is that establishing root nodes and segregating nodes is done randomly in the
latter. The random forest employs the bagging method to generate the required
prediction.
Bagging involves using different samples of data (training data) rather than just one
sample. A training dataset comprises observations and features that are used for making
predictions. The decision trees produce different outputs, depending on the training data
fed to the random forest algorithm. These outputs will be ranked, and the highest will be
selected as the final output.
Our first example can still be used to explain how random forests work. Instead of having
a single decision tree, the random forest will have many decision trees. Let’s assume we
have only four decision trees. In this case, the training data comprising the phone’s
observations and features will be divided into four root nodes.
The root nodes could represent four features that could influence the customer’s choice
(price, internal storage, camera, and RAM). The random forest will split the nodes by
selecting features randomly. The final prediction will be selected based on the outcome
of the four trees.
The outcome chosen by most decision trees will be the final choice. If three trees
predict buying, and one tree predicts not buying, then the final prediction will be buying.
In this case, it's predicted that the customer will buy the phone. This majority vote is
sketched in code below.
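As a tiny illustration (the tree outputs here are the hypothetical votes from the example
above, not output from a real model), the majority vote can be written with Python's
Counter:

from collections import Counter

tree_predictions = ['buy', 'buy', 'buy', 'not buy']   # one vote per decision tree
final_prediction = Counter(tree_predictions).most_common(1)[0][0]
print(final_prediction)   # 'buy'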
Classification in random forests
Classification in random forests employs an ensemble methodology to attain the
outcome. The training data is fed to train various decision trees. This dataset consists of
observations and features that will be selected randomly during the splitting of nodes.
A random forest system relies on various decision trees. Every decision tree consists of
decision nodes, leaf nodes, and a root node. The leaf node of each tree is the final output
produced by that specific decision tree. The selection of the final output follows the
majority-voting system. In this case, the output chosen by the majority of the decision
trees becomes the final output of the random forest system. The diagram below shows a
simple random forest classifier.
Let’s take an example of a training dataset consisting of various fruits such as bananas,
apples, pineapples, and mangoes. The random forest classifier divides this dataset into
subsets. These subsets are given to every decision tree in the random forest system. Each
decision tree produces its specific output. For example, the prediction for trees 1 and 2
is apple.
Another decision tree (n) has predicted banana as the outcome. The random forest
classifier collects the majority voting to provide the final prediction. The majority of the
decision trees have chosen apple as their prediction. This makes the classifier
choose apple as the final prediction.

Regression in random forests


Regression is the other task performed by a random forest algorithm. A random forest
regression follows the concept of simple regression. Values of the independent variables
(features) and the dependent variable (target) are passed to the random forest model.
We can run random forest regressions in various programs such as SAS, R, and python.
In a random forest regression, each tree produces a specific prediction. The mean
prediction of the individual trees is the output of the regression (a minimal sketch is
shown below). This is contrary to random forest classification, whose output is determined
by the mode of the decision trees' classes.
Although random forest regression and linear regression follow the same concept, they
differ in terms of functions. The function of linear regression is y = bx + c, where y is
the dependent variable, x is the independent variable, b is the estimation parameter, and
c is a constant. The function of a complex random forest regression is like a black box.

Applications of random forest


Some of the applications of the random forest may include:
Banking
Random forest is used in banking to predict the creditworthiness of a loan applicant. This
helps the lending institution make a good decision on whether to give the customer the
loan or not. Banks also use the random forest algorithm to detect fraudsters.
Health care
Health professionals use random forest systems to diagnose patients. Patients are
diagnosed by assessing their previous medical history. Past medical records are reviewed
to establish the right dosage for the patients.
Stock market
Financial analysts use it to identify potential markets for stocks. It also enables them to
identify the behavior of stocks.
E-commerce
Through random forest algorithms, e-commerce vendors can predict the preferences of
customers based on past consumption behavior.
Advantages of random forest
 It can perform both regression and classification tasks.
 A random forest produces good predictions that can be understood easily.
 It can handle large datasets efficiently.
 The random forest algorithm provides a higher level of accuracy in predicting
outcomes over the decision tree algorithm.
Disadvantages of random forest
 When using a random forest, more resources are required for computation.
 It consumes more time compared to a decision tree algorithm.

Implementation in Python
First, start with importing necessary Python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read dataset to pandas dataframe as follows −
dataset = pd.read_csv(path, names=headernames)
dataset.head()
sepal-length sepal-width petal-length petal-width Class

0 5.1 3.5 1.4 0.2 Iris-setosa

1 4.9 3.0 1.4 0.2 Iris-setosa

2 4.7 3.2 1.3 0.2 Iris-setosa

3 4.6 3.1 1.5 0.2 Iris-setosa

4 5.0 3.6 1.4 0.2 Iris-setosa

Data Preprocessing will be done with the help of following script lines −
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into train and test split. The following code will split the
dataset into 70% training data and 30% of testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
Next, train the model with the help of RandomForestClassifier class of sklearn as follows

from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)
At last, we need to make predictions. It can be done with the help of the following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −

from sklearn.metrics import classification_report, confusion_matrix, accuracy_score


result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[
[14 0 0]
[ 0 18 1]
[ 0 0 12]
]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 14
Iris-versicolor 1.00 0.95 0.97 19
Iris-virginica 0.92 1.00 0.96 12
micro avg 0.98 0.98 0.98 45
macro avg 0.97 0.98 0.98 45
weighted avg 0.98 0.98 0.98 45

Accuracy: 0.9777777777777777
