ML Unit-2
Classification
In machine learning, classification is a predictive modelling problem where the class label
is anticipated for a specific example of input data. A classification algorithm, in general,
is a function that weighs the input features so that the output separates one class into
positive values and the other into negative values. Classification is defined as the process
of recognition, understanding, and grouping of objects and ideas into preset categories
also known as “sub-populations.” With the help of these pre-categorized training
datasets, classification in machine learning leverages a wide range of algorithms
to classify future datasets into respective and relevant categories. Classification
algorithms used in machine learning utilize input training data for the purpose of
predicting the likelihood or probability that the data that follows will fall into one of the
predetermined categories. One of the most common applications of classification is for
filtering emails into “spam” or “non-spam”, as used by today’s top email service providers.
Based on training data, the Classification algorithm is a Supervised Learning technique
used to categorize new observations. In classification, a program uses the dataset or
observations provided to learn how to categorize new observations into various classes
or groups. For instance, 0 or 1, red or blue, yes or no, spam or not spam, etc. Targets,
labels, or categories can all be used to describe classes. Because classification is a
supervised learning technique, the algorithm uses labeled data comprising both input and
output information. There are two types of learners.
Lazy Learners
It first stores the training dataset before waiting for the test dataset to arrive. When using
a lazy learner, the classification is carried out using the training dataset's most
appropriate data. Less time is spent on training, but more time is spent on predictions.
Some of the examples are case-based reasoning and the KNN algorithm.
Eager Learners
Before obtaining a test dataset, eager learners build a classification model using a training
dataset. They spend more time studying and less time predicting. Some of the examples
are ANN, naive Bayes, and Decision trees.
In simple words, classification is a type of pattern recognition in which classification
algorithms are performed on training data to discover the same pattern in new data sets.
Data classification is a two-step process:
In the first step, a classifier is built describing a predetermined set of data classes
or concepts. This is the learning step or training phase, where a classification
algorithm builds the classifier by analyzing or learning from a training set made
up of database tuples and their associated class labels.
A tuple, X, is represented by an n-dimensional attribute vector,
X = (x1, x2,..., xn), depicting n measurements made on the tuple from n database
attributes, respectively, A1, A2, ..., An. Each tuple, X, is assumed to belong to a
predefined class as determined by another database attribute called the class label
attribute. The class label attribute is discrete-valued and unordered. It is categorical in
that each value serves as a category or class. The individual tuples making up the training
set are referred to as training tuples and are selected from the database under analysis.
In the context of classification, data tuples can be referred to as samples, examples,
instances, data points, or objects.
Because the class label of each training tuple is provided, this step is also known as
supervised learning: the learning of the classifier is “supervised” in that it is told to which
class each training tuple belongs. It contrasts with unsupervised learning in which the
class label of each training tuple is not known, and the number or set of classes to be
learned may not be known in advance.
This first step of the classification process can also be viewed as the learning of a mapping
or function, y = f (X), that can predict the associated class label y of a given tuple X. In this
view, we wish to learn a mapping or function that separates the data classes. Typically,
this mapping is represented in the form of classification rules, decision trees, or
mathematical formulae.
Training data are analyzed by a classification algorithm. Here, the class label attribute is
a loan decision, and the learned model or classifier is represented in the form of
classification rules. Also, the mapping is represented as classification rules that identify
loan applications as being either safe or risky. The rules can be used to categorize future
data tuples, as well as provide deeper insight into the database contents. They also
provide a compressed representation of the data.
(“What about classification accuracy?”)
In the second step , the model is used for classification. First, the predictive
accuracy of the classifier is estimated. If we were to use the training set to measure the
accuracy of the classifier, this estimate would likely be optimistic, because the classifier
tends to overfit the data (i.e., during learning it may incorporate some particular
anomalies of the training data that are not present in the general data set overall).
Therefore, a test set is used, made up of test tuples and their associated class labels. These
tuples are randomly selected from the general data set. They are independent of the
training tuples, meaning that they are not used to construct the classifier. The accuracy
of a classifier on a given test set is the percentage of test set tuples that are correctly
classified by the classifier. The associated class label of each test tuple is compared with
the learned classifier’s class prediction for that tuple. If the accuracy of the classifier is
considered acceptable, the classifier can be used to classify future data tuples for which
the class label is not known. (Such data are also referred to in the machine learning
literature as “unknown” or “previously unseen” data).
How does prediction differ from classification?
Data prediction is a two-step process, similar to that of data classification. However, for prediction,
we lose the terminology of “class label attribute” because the attribute for which values
are being predicted is continuous-valued (ordered) rather than categorical (discrete-
valued and unordered). The attribute can be referred to simply as the predicted attribute.
Prediction and classification also differ in the methods that are used to build their
respective models. As with classification, the training set used to build a predictor should
not be used to assess its accuracy. An independent test set should be used instead. The
accuracy of a predictor is estimated by computing an error based on the difference
between the predicted value and the actual known value of y for each of the test tuples,
X.
There are four main types of classification tasks:
Binary Classification
Multi-Class Classification
Multi-Label Classification
Imbalanced Classification
Binary Classification
Classification tasks with only two class labels are referred to as binary classification.
Examples comprise -
Prediction of conversion (buy or not).
Churn forecast (churn or not).
Detection of spam email (spam or not).
Binary classification problems often require two classes, one representing the normal
state and the other representing the aberrant state.
For instance, the normal condition is "not spam," while the abnormal state is "spam."
Another illustration is when a task involving a medical test has a normal condition of
"cancer not identified" and an abnormal state of "cancer detected."
Class label 0 is given to the class in the normal state, whereas class label 1 is given to the
class in the abnormal condition.
A model that forecasts a Bernoulli probability distribution for each case is frequently used
to represent a binary classification task.
The discrete probability distribution known as the Bernoulli distribution deals with the
situation where an event has a binary result of either 0 or 1. In terms of classification, this
indicates that the model forecasts the likelihood that an example would fall within class
1, or the abnormal state.
The following are well-known binary classification algorithms:
Logistic Regression
Support Vector Machines
Simple Bayes
Decision Trees
Some algorithms, such as Support Vector Machines and Logistic Regression, were
created expressly for binary classification and do not by default support more than two
classes.
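As a quick illustration (not part of the original material), the following minimal sketch fits scikit-learn's LogisticRegression on a synthetic two-class dataset; the dataset and all parameter values are assumptions chosen purely for demonstration −
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Synthetic two-class data: label 0 = normal state, label 1 = abnormal state
X, y = make_classification(n_samples=500, n_features=4, n_classes=2, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
clf = LogisticRegression()
clf.fit(X_train, y_train)
# predict_proba gives the Bernoulli probability of each class for every test example
print(clf.predict_proba(X_test[:5]))
print("Accuracy:", accuracy_score(y_test, clf.predict(X_test)))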
Let us now discuss Multi-Class Classification.
Multi-Class Classification
Classification tasks with more than two class labels are referred to as multi-class classification. Examples include face classification, plant species classification, and optical character recognition. Unlike binary classification, there is no notion of normal and abnormal outcomes; instead, each example is classified as belonging to one of a range of known classes. Such tasks are often modeled with a Categorical (Multinoulli) probability distribution over the class labels. Popular multi-class algorithms include k-Nearest Neighbors, Decision Trees, Naive Bayes, Random Forest, and Gradient Boosting. Algorithms designed for binary classification, such as Logistic Regression and Support Vector Machines, can be adapted to multi-class problems using one-vs-rest or one-vs-one strategies.
Multi-Label Classification
Multi-label classification problems are those that feature two or more class labels and
allow for the prediction of one or more class labels for each example.
Think about the photo classification example. Here a model can predict the existence of
many known things in a photo, such as “person”, “apple”, "bicycle," etc. A particular photo
may have multiple objects in the scene.
This greatly contrasts with multi-class classification and binary classification, which
anticipate a single class label for each occurrence.
Multi-label classification problems are frequently modeled using a model that forecasts
many outcomes, with each outcome being forecast as a Bernoulli probability distribution.
In essence, this approach predicts several binary classifications for each example.
Classification algorithms designed for binary or multi-class problems cannot be applied
directly to multi-label classification. Instead, specialized multi-label versions of the
conventional classification algorithms are used, including:
Multi-label Gradient Boosting
Multi-label Random Forests
Multi-label Decision Trees
Another strategy is to use a separate classification algorithm to predict each class label.
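A minimal sketch of this idea (illustrative only, with assumed data and parameters) fits one binary classifier per label using scikit-learn's MultiOutputClassifier −
from sklearn.datasets import make_multilabel_classification
from sklearn.multioutput import MultiOutputClassifier
from sklearn.linear_model import LogisticRegression
# Each row gets a vector of 0/1 labels, e.g. person / apple / bicycle present in a photo
X, Y = make_multilabel_classification(n_samples=200, n_features=10, n_classes=3, random_state=1)
model = MultiOutputClassifier(LogisticRegression(max_iter=1000))
model.fit(X, Y)
print(model.predict(X[:3]))   # one 0/1 prediction per label for each example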
Now, we will look into the Imbalanced Classification Task in detail.
Imbalanced Classification
The term "imbalanced classification" describes classification jobs where the distribution
of examples within each class is not equal.
A majority of the training dataset's instances belong to the normal class, while a minority
belong to the abnormal class, making imbalanced classification tasks binary classification
tasks in general.
Examples comprise -
Clinical diagnostic procedures
Detection of outliers
Fraud investigation
These problems are modeled as binary classification tasks, although they may require
specialized techniques.
By oversampling the minority class or undersampling the majority class, specialized
strategies can be employed to alter the sample composition in the training dataset.
Examples comprise -
SMOTE Oversampling
Random Undersampling
It is possible to utilize specialized modeling techniques, like the cost-sensitive machine
learning algorithms that give the minority class more consideration when fitting the
model to the training dataset.
Examples comprise:
Cost-sensitive Support Vector Machines
Cost-sensitive Decision Trees
Cost-sensitive Logistic Regression
Since classification accuracy can be misleading on imbalanced data, alternative
performance metrics may be necessary.
Examples comprise -
F-Measure
Recall
Precision
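The sketch below (illustrative only; the synthetic data and settings are assumptions) shows a cost-sensitive logistic regression fitted with class_weight='balanced' and evaluated with precision, recall, and F-measure rather than plain accuracy −
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, f1_score
# Roughly 99% of examples in the normal class, 1% in the abnormal class
X, y = make_classification(n_samples=5000, n_features=5, n_informative=3, n_redundant=1, weights=[0.99], random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=1)
# class_weight='balanced' gives the minority class more weight while fitting
clf = LogisticRegression(class_weight='balanced')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F-Measure:", f1_score(y_test, y_pred))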
Decision Tree
The topmost node in any decision tree is the root node. Internal nodes are denoted by
rectangles and leaf nodes by ovals. Like any tree, a decision tree can be binary or
non-binary.
The image above is an example of a decision tree. It represents the concept buys_computer,
which predicts whether a customer at AllElectronics is likely to purchase a computer or
not.
How are decision trees used for classification?
A decision tree can easily be converted to classification rules. Given a tuple X whose
associated class label is unknown, the attribute values of the tuple are tested against the
decision tree. A path is traced from the root to a leaf node, which holds the class
prediction for that tuple.
Why are decision tree classifiers so popular?
The popularity of decision tree classifiers is based on the following characteristics:
● It does not require any domain knowledge or parameter setting and is therefore appropriate for exploratory knowledge discovery.
● It can handle high-dimensional data.
● It can be easily understood by humans.
● The learning and classification steps of decision tree induction are simple and fast.
● It has good accuracy.
Some applications of decision tree induction algorithms are:
● Medicine
● Manufacturing and production
● Financial analysis
● Astronomy
● Molecular biology
Decision tree Induction
Algorithm to Generate decision tree:
Generate a decision tree from the training tuples of data partition D. Most algorithms for
decision tree induction follow such a top-down approach, which starts with a
training set of tuples and their associated class labels. The training set is recursively
partitioned into smaller subsets as the tree is being built.
A basic decision tree algorithm is summarized below.
Input
● Data Partition(D), which is a set of training tuples and their associated class labels
● Attribute list: Set of candidate attributes
● Attribute selection method: a procedure to determine the splitting criterion that
"best" partitions the data tuples into individual classes.
The splitting criterion consists of:
● Splitting attribute
● Split point or Splitting subset
Information Gain
Information gain is used to decide which feature/attribute yields the most information
about a class. Let node N represent or hold the tuples of partition D. The attribute
with the highest information gain is chosen as the splitting attribute for node N.
The expected information needed to classify a tuple in D is given by
Info(D) = - ∑ pᵢ log₂(pᵢ) ; i = 1 to m
where,
● pᵢ is the probability that an arbitrary tuple in D belongs to class Cᵢ and is estimated by |Cᵢ,D| / |D|
● The log function to the base 2 is used because the information is encoded in bits
● Info(D) is just the average amount of information needed to identify the class label of a tuple in D
Info(D) is also known as the entropy of D.
How much more information would we require (after partitioning) to make an exact
classification? This amount is calculated as
Info_A(D) = ∑ (|Dⱼ| / |D|) × Info(Dⱼ) ; j = 1 to v
where,
● |Dⱼ| / |D| acts as the weight of the jth partition
● Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A
The smaller the expected information (still) required, the greater the purity of the
partitions.
Information gain is defined as the difference between the original information requirement
(i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after
partitioning on A) that is,
Gain(A) = Info(D) − Info_A(D)
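A small Python sketch of these formulas (the class counts below are made-up toy values, not taken from any dataset in these notes) −
from math import log2
def info(class_counts):
    # Info(D) = - sum(p_i * log2(p_i)) over the class proportions
    total = sum(class_counts)
    return -sum((c / total) * log2(c / total) for c in class_counts if c > 0)
def info_after_split(partitions):
    # Info_A(D): weighted average of Info(D_j) over the partitions produced by attribute A
    total = sum(sum(p) for p in partitions)
    return sum((sum(p) / total) * info(p) for p in partitions)
def gain(class_counts, partitions):
    return info(class_counts) - info_after_split(partitions)
# Toy example: 9 "yes" and 5 "no" tuples, split by some attribute A into three partitions
D = [9, 5]
partitions_A = [[2, 3], [4, 0], [3, 2]]   # class counts inside each partition
print(round(info(D), 3), round(gain(D, partitions_A), 3))   # approximately 0.94 and 0.247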
Gain Ratio
The information gain measure is biased toward tests with many outcomes. That is, it
prefers to select attributes having a large number of values.
Gain Ratio is an extension of Information Gain that attempts to overcome this bias. It
applies a kind of normalization to information gain using the split information value, which is
defined as
SplitInfo_A(D) = − ∑ (|Dⱼ| / |D|) × log₂(|Dⱼ| / |D|) ; j = 1 to v
This value represents the potential information generated by splitting the training data
set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. The gain
ratio is then defined as
GainRatio(A) = Gain(A) / SplitInfo_A(D)
The attribute with the maximum gain ratio is selected as the splitting attribute.
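Continuing the toy example above (the values are assumed purely for illustration), the split information and gain ratio can be computed as −
from math import log2
def split_info(partition_sizes, total):
    # SplitInfo_A(D) = - sum((|D_j|/|D|) * log2(|D_j|/|D|))
    return -sum((s / total) * log2(s / total) for s in partition_sizes if s > 0)
# Toy split of |D| = 14 tuples into partitions of sizes 5, 4 and 5
sizes, total = [5, 4, 5], 14
gain_A = 0.247                    # information gain of A from the previous sketch
si = split_info(sizes, total)     # potential information of the split itself
print(round(si, 3), round(gain_A / si, 3))   # SplitInfo and GainRatio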
Tree Pruning:
When a decision tree is built, many of the branches will reflect anomalies in the
training data due to noise or outliers. Tree pruning methods address this problem of
overfitting the data. Such methods typically use statistical measures to remove the
least reliable branches.
Advantages of Pruned Trees:
1. Smaller.
2. Less Complex.
3. Easier to comprehend.
4. Faster and better at correctly classifying independent test data.
Figure 3.6
Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the
accuracy of the tree measured over the training examples increases monotonically.
However, when measured over a set of test examples independent of the training
examples, accuracy first increases, then decreases.
The most common way to determine the correct final tree size is the training and
validation set approach. In this approach, the available data are separated into two sets of examples: a
training set, which is used to form the learned hypothesis, and a separate validation set,
which is used to evaluate the accuracy of this hypothesis over subsequent data and, in
particular, to evaluate the impact of pruning this hypothesis. The motivation is this: Even
though the learner may be misled by random errors and coincidental regularities within
the training set, the validation set is unlikely to exhibit the same random fluctuations.
Therefore, the validation set can be expected to provide a safety check against overfitting
the spurious characteristics of the training set. Of course, it is important that the
validation set be large enough to itself provide a statistically significant sample of the
instances. One common heuristic is to withhold one-third of the available examples for
the validation set, using the other two-thirds for training.
FIGURE 3.7
Effect of reduced-error pruning in decision tree learning. This plot shows the same curves of training and
test set accuracy as in Figure 3.6. In addition, it shows the impact of reduced error pruning of the tree
produced by ID3. Notice the increase in accuracy over the test set as nodes are pruned from the tree. Here,
the validation set used for pruning is distinct from both the training and test sets.
The columns used to make decision nodes viz. ‘Breathing Issues’, ‘Cough’ and ‘Fever’ are
called feature columns or just features and the column used for leaf nodes i.e. ‘Infected’ is
called the target column.
Entropy(S) = - ∑ pᵢ * log₂(pᵢ) ; i = 1 to n
From the total of 14 rows in our dataset S, there are 8 rows with the target
value YES and 6 rows with the target value NO. The entropy of S is calculated as:
Entropy(S) = - (8/14) * log₂(8/14) - (6/14) * log₂(6/14) = 0.99
In the 8 rows with Fever = YES, there are 6 rows having target value YES and 2 rows having
target value NO. As shown below, in the 6 rows with Fever = NO, there are 2 rows having
target value YES and 4 rows having target value NO.
Fever Cough Breathing Issues Infected
No No No No
No Yes No No
No Yes Yes Yes
No Yes No No
No Yes Yes Yes
No Yes Yes No
The block below demonstrates the calculation of Information Gain for Fever.
# total rows
|S| = 14
For v = YES, |Sᵥ| = 8
Entropy(Sᵥ) = - (6/8) * log₂(6/8) - (2/8) * log₂(2/8) = 0.81
For v = NO, |Sᵥ| = 6
Entropy(Sᵥ) = - (2/6) * log₂(2/6) - (4/6) * log₂(4/6) = 0.92
# Expected entropy after splitting on Fever, weighted by subset size
IG(S, Fever) = Entropy(S) - (8/14) * 0.81 - (6/14) * 0.92 ≈ 0.13
Since the feature Breathing Issues has the highest Information Gain, it is used to create
the root node. Hence, after this initial step, our tree looks like this:
Next, from the remaining two unused features, namely, Fever and Cough, we decide
which one is the best for the left branch of Breathing Issues.
Since the left branch of Breathing Issues denotes YES, we will work with the subset of the
original data i.e the set of rows having YES as the value in the Breathing Issues
column. These 8 rows are shown below:
Fever Cough Breathing Issues Infected
Yes Yes Yes Yes
Yes No Yes Yes
Yes Yes Yes Yes
Yes No Yes Yes
Yes No Yes Yes
No Yes Yes Yes
No Yes Yes Yes
No Yes Yes No
Next, we calculate the IG for the features Fever and Cough using this subset Sʙʏ (the rows with Breathing Issues = YES).
IG of Fever is greater than that of Cough, so we select Fever as the left branch of Breathing
Issues: Our tree now looks like this:
Next, we find the feature with the maximum IG for the right branch of Breathing Issues.
But, since there is only one unused feature left we have no other choice but to make it the
right branch of the root node.
There are no more unused features, so we stop here and jump to the final step of creating
the leaf nodes.
For the left leaf node of Fever, we look at the subset of rows from the original data set
that have both Breathing Issues and Fever equal to YES.
Fever Cough Breathing Issues Infected
Yes Yes Yes Yes
Yes No Yes Yes
Yes Yes Yes Yes
Yes No Yes Yes
Yes No Yes Yes
Since all the values in the target column are YES, we label the left leaf node as YES, but to
make it more logical we label it Infected.
Similarly, for the right node of Fever we see the subset of rows from the original data set
that have Breathing Issues value as YES and Fever as NO.
We repeat the same process for the node Cough, however here both left and right leaves
turn out to be the same i.e. NO or Not Infected as shown below:
The right node of Breathing issues is as good as just a leaf node with class ‘Not infected’.
This is one of the drawbacks of ID3: it does not perform pruning.
Example
In the following example, we are going to implement a Decision Tree classifier on the Pima
Indians Diabetes dataset −
First, start with importing necessary python packages −
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
Next, load the Pima Indians Diabetes dataset from a CSV file as follows −
col_names = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age', 'label']
pima = pd.read_csv(r"C:\pima-indians-diabetes.csv", header=None, names=col_names)
pima.head()
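The feature-selection, train/test split, and model-fitting steps appear to have been dropped from the notes at this point. A plausible reconstruction is shown below; the chosen feature columns, split ratio, and random_state are assumptions, so the exact accuracy obtained may differ slightly from the figure quoted next −
from sklearn import metrics
feature_cols = ['pregnant', 'glucose', 'bp', 'skin', 'insulin', 'bmi', 'pedigree', 'age']
X = pima[feature_cols]   # features
y = pima.label           # target variable
# Split into 70% training data and 30% test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
# Fit a decision tree classifier and evaluate it on the test data
clf = DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_pred))
One such run reported the following accuracy −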
Accuracy: 0.670995670995671
To visualize the fitted tree, the following additional imports are needed −
from sklearn.tree import export_graphviz
from io import StringIO
from IPython.display import Image
import pydotplus
dot_data = StringIO()
export_graphviz(clf, out_file=dot_data,
filled=True, rounded=True,
special_characters=True, feature_names=feature_cols, class_names=['0','1'])
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_png('Pima_diabetes_Tree.png')
Image(graph.create_png())
K-Nearest neighbour
The most basic instance-based method is the k-NEAREST NEIGHBOR Algorithm. This
algorithm assumes all instances correspond to points in the n-dimensional space Rn. The
nearest neighbors of an instance are defined in terms of the standard Euclidean distance.
More precisely, let an arbitrary instance x be described by the feature vector
⟨ a₁(x), a₂(x), ..., aₙ(x) ⟩
where aᵣ(x) denotes the value of the rth attribute of instance x. Then the distance
between two instances xᵢ and xⱼ is defined to be d(xᵢ, xⱼ), where
d(xᵢ, xⱼ) = √( ∑ᵣ₌₁ⁿ ( aᵣ(xᵢ) − aᵣ(xⱼ) )² )
FIGURE
k-NEAREST NEIGHBOR: A set of positive and negative training examples is shown on the
left, along with a query instance xq to be classified. The 1-NEAREST NEIGHBOR algorithm
classifies xq positive, whereas 5-NEAREST NEIGHBOR classifies it as negative. On the right is the
decision surface induced by the 1-NEAREST NEIGHBOR algorithm for a typical set of
training examples. The convex polygon surrounding each training example indicates the
region of instance space closest to that point (i.e., the instances for which the 1-NEAREST
NEIGHBOR will assign the classification belonging to that training example).
Note the k-NEAREST NEIGHBOR Algorithm never forms an explicit general hypothesis f
regarding the target function f . It simply computes the classification of each new query
instance as needed. Nevertheless, we can still ask what the implicit general function is, or
what classifications would be assigned if we were to hold the training examples constant
and query the algorithm with every possible instance in X. The diagram on the right side
of Figure 8.1 shows the shape of this decision surface induced by 1-NEAREST NEIGHBOR
over the entire instance space. This kind of diagram is often called the Voronoi diagram
of the set of training examples
Suppose there are two categories, i.e., Category A and Category B, and we have a new data
point x1, so this data point will lie in which of these categories. To solve this type of
problem, we need a K-NN algorithm. With the help of K-NN, we can easily identify the
category or class of a particular dataset. Consider the below diagram:
Firstly, we will choose the number of neighbors; here we choose k = 5.
Next, we will calculate the Euclidean distance between the data points. The Euclidean
distance is the distance between two points, which we have already studied in geometry.
It can be calculated as:
d = √( (x₂ − x₁)² + (y₂ − y₁)² )
As we can see, three of the five nearest neighbors are from Category A and two are from
Category B; hence this new data point must belong to Category A.
TABLE
We have two columns — Brightness and Saturation. Each row in the table has a class of
either Red or Blue.
Before we introduce a new data entry, let's assume the value of K is 5.
To know its class, we have to calculate the distance from the new entry to other entries
in the data set using the Euclidean distance formula.
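The table of Brightness/Saturation values is not reproduced in these notes, so the sketch below uses made-up values purely to illustrate the distance computation and the majority vote with K = 5 −
from math import sqrt
from collections import Counter
# Hypothetical (brightness, saturation, class) rows -- not the original table
training = [(40, 20, 'Red'), (50, 50, 'Blue'), (60, 90, 'Blue'), (10, 25, 'Red'),
            (70, 70, 'Blue'), (60, 10, 'Red'), (25, 80, 'Blue')]
new_entry = (20, 35)
k = 5
def euclidean(p, q):
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
# Sort the rows by distance to the new entry and vote among the k nearest
nearest = sorted(training, key=lambda row: euclidean(row[:2], new_entry))[:k]
print(Counter(label for *_, label in nearest).most_common(1)[0][0])   # majority class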
Implementation in Python
As we know, the K-nearest neighbors (KNN) algorithm can be used for both classification
and regression. The following are recipes in Python for using KNN as a classifier and as a
regressor −
KNN as Classifier
First, start with importing necessary python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read dataset to pandas dataframe as follows −
dataset = pd.read_csv(path, names=headernames)
dataset.head()
slno. sepal-length sepal-width petal-length petal-width Class
Data Preprocessing will be done with the help of following script lines −
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into train and test split. Following code will split the dataset
into 60% training data and 40% of testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.40)
Next, data scaling will be done as follows −
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Next, train the model with the help of KNeighborsClassifier class of sklearn as follows −
from sklearn.neighbors import KNeighborsClassifier
classifier = KNeighborsClassifier(n_neighbors=8)
classifier.fit(X_train, y_train)
At last we need to make prediction. It can be done with the help of following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:",)
print (result1)
result2 = accuracy_score(y_test,y_pred)
print("Accuracy:",result2)
Output
Confusion Matrix:
[[21 0 0]
[ 0 16 0]
[ 0 7 16]]
Classification Report:
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 21
Iris-versicolor 0.70 1.00 0.82 16
Iris-virginica 1.00 0.70 0.82 23
micro avg 0.88 0.88 0.88 60
macro avg 0.90 0.90 0.88 60
weighted avg 0.92 0.88 0.88 60
Accuracy: 0.8833333333333333
KNN as Regressor
First, start with importing necessary Python packages −
import numpy as np
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read dataset to pandas dataframe as follows −
data = pd.read_csv(path, names=headernames)
array = data.values
X = array[:, :2]
y = array[:, 2]
data.shape
output:(150, 5)
Next, import KNeighborsRegressor from sklearn to fit the model −
from sklearn.neighbors import KNeighborsRegressor
knnr = KNeighborsRegressor(n_neighbors=10)
knnr.fit(X, y)
At last, we can find the MSE as follows −
print ("The MSE is:",format(np.power(y-knnr.predict(X),2).mean()))
Output
The MSE is: 0.12226666666666669
Naïve Bayes Algorithm
What are Bayesian classifiers?” Bayesian classifiers are statistical classifiers. They can
predict class membership probabilities, such as the probability that a given tuple belongs
to a particular class. Bayesian classification is based on Bayes’ theorem, described below.
Studies comparing classification algorithms have found a simple Bayesian classifier
known as the naive Bayesian classifier to be comparable in performance with decision
tree and selected neural network classifiers. Bayesian classifiers have also exhibited high
accuracy and speed when applied to large databases.
Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is
independent of the values of the other attributes. This assumption is called class
conditional independence. It is made to simplify the computations involved and, in this
sense, is considered “naïve.” Bayesian belief networks are graphical models, which unlike
naïve Bayesian classifiers, allow the representation of dependencies among subsets of
attributes. Bayesian belief networks can also be used for classification.
Bayes’ Theorem
Bayes’ theorem is named after Thomas Bayes, a nonconformist English clergyman who
did early work in probability and decision theory during the 18th century.
Let X be a data tuple. In Bayesian terms, X is considered “evidence.” As usual, it is
described by measurements made on a set of n attributes. Let H be some hypothesis, such
as that the data tuple X belongs to a specified class C. For classification problems, we want
to determine P(H|X), the probability that the hypothesis H holds given the “evidence” or
observed data tuple X. In other words, we are looking for the probability that tuple X
belongs to class C, given that we know the attribute description of X.
P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For
example, suppose our world of data tuples is confined to customers described by the
attributes age and income, respectively, and that X is a 35-year-old customer with an
income of $40,000. Suppose that H is the hypothesis that our customer will buy a
computer. Then P(H|X) reflects the probability that customer X will buy a computer given
that we know the customer’s age and income.
In contrast, P(H) is the prior probability, or a priori probability, of H. For our example,
this is the probability that any given customer will buy a computer, regardless of age,
income, or any other information, for that matter. The posterior probability, P(H|X), is
based on more information (e.g., customer information) than the prior probability, P(H),
which is independent of X.
Similarly, P(X|H) is the posterior probability of X conditioned on H. That is, it is the
probability that a customer, X, is 35 years old and earns $40,000, given that we know the
customer will buy a computer.
P(X)is the prior probability of X. Using our example, it is the probability that a person
from our set of customers is 35 years old and earns $40,000.
“How are these probabilities estimated?” P(H), P(X|H), and P(X) may be estimated from
the given data, as we shall see below. Bayes’ theorem is useful in that it provides a way of
calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X).
Bayes’ theorem is
P(H | X) = P(X | H) P(H) / P(X)
Now consider estimating, from the training data, the two terms P(vj) and
P(a1, a2, ..., an | vj) that appear in the Bayesian expression
vMAP = argmax over vj in V of [ P(a1, a2, ..., an | vj) P(vj) ].
It is easy to estimate each of the P(vj) simply by counting the frequency with which
each target value vj occurs in the training data. However, estimating the different
P(a1, a2, ..., an | vj) terms in this fashion is not feasible
unless we have a very, very large set of training data. The problem is that the number of
these terms is equal to the number of possible instances times the number of possible
target values. Therefore, we need to see every instance in the instance space many times
in order to obtain reliable estimates.
The naive Bayes classifier is based on the simplifying assumption that the attribute values
are conditionally independent given the target value. In other words, the assumption is
that given the target value of the instance, the probability of observing the conjunction
a1, a2, ..., an is just the product of the probabilities for the individual attributes:
P(a1, a2, ..., an | vj) = ∏ᵢ P(aᵢ | vj)
Substituting this into the expression above, we have the approach used by the naive Bayes
classifier.
Naive Bayes classifier:
vNB = argmax over vj in V of [ P(vj) ∏ᵢ P(aᵢ | vj) ]
where vNB denotes the target value output by the naive Bayes classifier. Notice that in a
naive Bayes classifier the number of distinct P(aᵢ | vj) terms that must be estimated from
the training data is just the number of distinct attribute values times the number of
distinct target values, a much smaller number than if we were to estimate the
P(a1, a2, ..., an | vj) terms as first contemplated.
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2,..., xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2,..., An.
2. Suppose that there are m classes, C1, C2,..., Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned
on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if
and only if
P(Ci | X) > P(Cj | X) for 1 ≤ j ≤ m, j ≠ i
Thus we maximize P(Ci |X). The class Ci for which P(Ci |X) is maximized is called the
maximum posteriori hypothesis. By Bayes’ theorem,
P(Ci | X) = P(X | Ci) P(Ci) / P(X)
3. As P(X) is constant for all classes, only P(X| Ci)P(Ci) need be maximized. If the class
prior probabilities are not known, then it is commonly assumed that the classes are
equally likely, that is, P(C1) = P(C2) = ··· = P(Cm), and we would therefore maximize P(X|
Ci). Otherwise, we maximize P(X| Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive
to compute P(X| Ci). To reduce computation, the naive assumption of class-conditional
independence is made. Thus,
P(X | Ci) = ∏ₖ₌₁ⁿ P(xₖ | Ci) = P(x₁ | Ci) × P(x₂ | Ci) × ··· × P(xₙ | Ci) ----(3)
We can easily estimate the probabilities P(x1| Ci), P(x2| Ci),..., P(xn|Ci)from the training
tuples. For each attribute, we look at whether the attribute is categorical or continuous-
valued. For instance, to compute P(X|Ci), we consider the following:
(a) If Ak is categorical, then P(xk| Ci) is the number of tuples of class Ci in D having the
value xk for Ak, divided by | Ci,D|, the number of tuples of class Ci in D.
(b) If Ak is continuous-valued, then we need to do a bit more work, but the calculation is
pretty straightforward. A continuous-valued attribute is typically assumed to have a
Gaussian distribution with a mean µ and standard deviation σ, defined by
g(x, µ, σ) = ( 1 / (√(2π) σ) ) e^( −(x − µ)² / (2σ²) ) ----(4)
so that
P(xₖ | Ci) = g(xₖ, µCi, σCi) ----(5)
These equations may appear daunting, but hold on! We need to compute µCi and σCi ,
which are the mean (i.e., average) and standard deviation, respectively, of the values of
attribute Ak for training tuples of class Ci . We then plug these two quantities into Equation
(4), together with xk, in order to estimate P(xk|Ci). For example, let X = (35, $40,000),
where A1 and A2 are the attributes age and income, respectively. Let the class label
attribute be buys computer. The associated class label for X is yes (i.e., buys computer =
yes). Let’s suppose that age has not been discretized and therefore exists as a continuous-
valued attribute. Suppose that from the training set, we find that customers in D who buy
a computer are 38 ±12 years of age. In other words, for attribute age and this class, we
have µ = 38 years and σ = 12. We can plug these quantities, along with x1 = 35 for our
tuple X into Equation (4) in order to estimate P(age = 35|buys computer = yes).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci . The
classifier predicts that the class label of tuple X is the class Ci if and only if
P(X | Ci) P(Ci) > P(X | Cj) P(Cj) for 1 ≤ j ≤ m, j ≠ i ----(6)
In other words, the predicted class label is the class Ci for which P(X|Ci)P(Ci) is the
maximum.
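As a quick numeric check of step 4(b) above (using the example values µ = 38, σ = 12 and xk = 35), the Gaussian density of Equation (4) can be evaluated as follows −
from math import sqrt, pi, exp
def gaussian(x, mu, sigma):
    return (1.0 / (sqrt(2 * pi) * sigma)) * exp(-(x - mu) ** 2 / (2 * sigma ** 2))
# Estimate of P(age = 35 | buys_computer = yes), roughly 0.0322
print(round(gaussian(35, 38, 12), 4))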
Bayesian classifiers are also useful in that they provide a theoretical justification for other
classifiers that do not explicitly use Bayes’ theorem. For example, under certain
assumptions, it can be shown that many neural network and curve-fitting algorithms
output the maximum posteriori hypothesis, as does the naïve Bayesian classifier.
For the PlayTennis training data, the prior probabilities are:
P(PlayTennis = Yes) = 9/14 = 0.64
P(PlayTennis = No) = 5/14 = 0.36
Outlook Y N
Sunny 2/9 3/5
Overcast 4/9 0
Rain 3/9 2/5
Temperature Y N
Hot 2/9 2/5
Mild 4/9 2/5
Cool 3/9 1/5
Humidity Y N
High 3/9 4/5
Normal 6/9 1/5
Windy Y N
Strong 3/9 3/5
Weak 6/9 2/5
Normalizing these quantities to sum to one:
Vnb(yes) / (Vnb(yes) + Vnb(no)) = 0.205
Vnb(no) / (Vnb(yes) + Vnb(no)) = 0.795
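The unnormalized quantities behind these two figures can be reproduced from the tables above. The sketch below assumes the query instance is Outlook = Sunny, Temperature = Cool, Humidity = High, Wind = Strong (the instance these numbers correspond to) −
# P(yes) * product of P(attribute value | yes), using the table entries above
p_yes = (9/14) * (2/9) * (3/9) * (3/9) * (3/9)
# P(no) * product of P(attribute value | no)
p_no = (5/14) * (3/5) * (1/5) * (4/5) * (3/5)
print(round(p_yes, 4), round(p_no, 4))        # about 0.0053 and 0.0206
print(round(p_yes / (p_yes + p_no), 3))       # about 0.205
print(round(p_no / (p_yes + p_no), 3))        # about 0.795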
Example
Depending on our data set, we can choose any of the Naïve Bayes models. Here, we are
implementing the Gaussian Naïve Bayes model in Python −
We will start with required imports as follows −
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
Now, by using make_blobs() function of Scikit learn, we can generate blobs of points with
Gaussian distribution as follows −
from sklearn.datasets import make_blobs
X, y = make_blobs(300, 2, centers=2, random_state=2, cluster_std=1.5)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');
Next, for using GaussianNB model, we need to import and make its object as follows −
from sklearn.naive_bayes import GaussianNB
model_GNB = GaussianNB()
model_GNB.fit(X, y);
Now, we have to do prediction. It can be done after generating some new data as follows
−
rng = np.random.RandomState(0)
Xnew = [-6, -14] + [14, 18] * rng.rand(2000, 2)
ynew = model_GNB.predict(Xnew)
Next, we are plotting new data to find its boundaries −
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
lim = plt.axis()
plt.scatter(Xnew[:, 0], Xnew[:, 1], c=ynew, s=20, cmap='summer', alpha=0.1)
plt.axis(lim);
Now, with the help of following line of codes, we can find the posterior probabilities of
first and second label −
yprob = model_GNB.predict_proba(Xnew)
yprob[-10:].round(3)
Output
array([[0.998, 0.002],
[1. , 0. ],
[0.987, 0.013],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[1. , 0. ],
[0. , 1. ],
[0.986, 0.014]]
)
Support Vector Machine (SVM)
Example: SVM can be understood with the example that we have used in the KNN
classifier. Suppose we see a strange cat that also has some features of dogs, so if we want
a model that can accurately identify whether it is a cat or dog, so such a model can be
created by using the SVM algorithm. We will first train our model with lots of images of
cats and dogs so that it can learn about different features of cats and dogs, and then we
test it with this strange creature. The SVM creates a decision boundary between the two
classes (cat and dog) using the extreme cases (support vectors), so it examines the extreme
cases of cats and dogs and, on the basis of the support vectors, classifies the new creature
as a cat. Consider the below diagram:
Types of SVM
SVM can be of two types:
Linear SVM: Linear SVM is used for linearly separable data, which means if a dataset can
be classified into two classes by using a single straight line, then such data is termed as
linearly separable data, and the classifier used is called a Linear SVM classifier. Linear SVMs
use a linear decision boundary to separate the data points of different classes. When the
data can be precisely linearly separated, linear SVMs are very suitable. This means that a
single straight line (in 2D) or a hyperplane (in higher dimensions) can entirely divide the
data points into their respective classes. A hyperplane that maximizes the margin between
the classes is the decision boundary.
Non-linear SVM: Non-Linear SVM is used for non-linearly separated data, which means if
a dataset cannot be classified by using a straight line, then such data is termed as non-
linear data and the classifier used is called a Non-linear SVM classifier. Non-Linear SVM can
be used to classify data when it cannot be separated into two classes by a straight line (in
the case of 2D). By using kernel functions, nonlinear SVMs can handle nonlinearly
separable data. The original input data is transformed by these kernel functions into a
higher-dimensional feature space, where the data points can be linearly separated. A
linear SVM is used to locate a nonlinear decision boundary in this modified space.
Advantages of SVM
Effective in high-dimensional cases.
It is memory efficient, as it uses a subset of training points (called support vectors) in the
decision function.
Different kernel functions can be specified for the decision function, and it is possible to
specify custom kernels.
One reasonable choice as the best hyperplane is the one that represents the largest
separation or margin between the two classes.
First, start with importing necessary Python packages −
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()
Next, we are creating a sample dataset, having linearly separable data, from
sklearn.dataset.sample_generator for classification using SVM −
from sklearn.datasets import make_blobs
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=0.50)
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer');
The following would be the output after generating sample dataset having 100 samples
and 2 clusters −
We know that SVM supports discriminative classification: it divides the classes from each
other by finding a line in the case of two dimensions, or a manifold in the case of multiple
dimensions. It is implemented on the above dataset as follows −
We can see from the above output that there are three different separators that perfectly
discriminate the above samples.
As discussed, the main goal of SVM is to divide the datasets into classes by finding a maximum
marginal hyperplane (MMH). Hence, rather than drawing a zero-width line between the classes,
we can draw around each line a margin of some width, up to the nearest point. It can be done
as follows −
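The fitting and plotting code itself is not included in the notes; a minimal sketch of the idea, assuming a linear-kernel sklearn.svm.SVC fitted on the X, y blobs generated above, would be −
from sklearn.svm import SVC
model = SVC(kernel='linear', C=1E10)   # a very large C approximates a hard margin
model.fit(X, y)
# Re-plot the data, then draw the decision boundary and the +/-1 margin lines
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap='summer')
ax = plt.gca()
xlim, ylim = ax.get_xlim(), ax.get_ylim()
xx = np.linspace(xlim[0], xlim[1], 30)
yy = np.linspace(ylim[0], ylim[1], 30)
YY, XX = np.meshgrid(yy, xx)
P = model.decision_function(np.c_[XX.ravel(), YY.ravel()]).reshape(XX.shape)
ax.contour(XX, YY, P, levels=[-1, 0, 1], linestyles=['--', '-', '--'], colors='k')
# The support vectors are the training points that lie on the margin
ax.scatter(model.support_vectors_[:, 0], model.support_vectors_[:, 1], s=200, facecolors='none', edgecolors='k')
plt.show()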
Random Forest
A random forest is a supervised machine learning algorithm that is constructed from
decision tree algorithms. This algorithm is applied in various industries such as banking
and e-commerce to predict behavior and outcomes.
A random forest is a machine learning technique that’s used to solve regression and
classification problems. It utilizes ensemble learning, which is a technique that combines
many classifiers to provide solutions to complex problems.
A random forest algorithm consists of many decision trees. The ‘forest’ generated by the
random forest algorithm is trained through bagging or bootstrap aggregating. Bagging is
an ensemble meta-algorithm that improves the accuracy of machine learning algorithms.
The random forest algorithm establishes the outcome based on the predictions of the
decision trees. It predicts by taking the average or mean of the outputs from the various
trees (or, for classification, a majority vote). Increasing the number of trees generally
improves the precision of the outcome.
A random forest eradicates the limitations of a decision tree algorithm. It reduces the
overfitting of datasets and increases precision. It generates predictions without requiring
many configurations in packages (like scikit-learn).
Implementation in Python
First, start with importing necessary Python packages −
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Next, download the iris dataset from its weblink as follows −
path = "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
Next, we need to assign column names to the dataset as follows −
headernames = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'Class']
Now, we need to read dataset to pandas dataframe as follows −
dataset = pd.read_csv(path, names=headernames)
dataset.head()
sepal-length sepal-width petal-length petal-width Class
Data Preprocessing will be done with the help of following script lines −
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 4].values
Next, we will divide the data into train and test split. The following code will split the
dataset into 70% training data and 30% of testing data −
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30)
Next, train the model with the help of RandomForestClassifier class of sklearn as follows
−
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=50)
classifier.fit(X_train, y_train)
At last, we need to make prediction. It can be done with the help of following script −
y_pred = classifier.predict(X_test)
Next, print the results as follows −
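The printing code appears to have been omitted here; following the same pattern as the KNN classifier example above, it would be −
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
result = confusion_matrix(y_test, y_pred)
print("Confusion Matrix:")
print(result)
result1 = classification_report(y_test, y_pred)
print("Classification Report:")
print(result1)
result2 = accuracy_score(y_test, y_pred)
print("Accuracy:", result2)
Output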
Accuracy: 0.9777777777777777