4 Classification
What is Classification?
Classification is the task of assigning objects to one of several predefined categories (classes): a model is learned from a set of labeled training instances and is then used to predict the class labels of new, unseen instances.
Evaluation of classifiers:
There are several methods commonly used to evaluate the performance of a classifier, which are as follows:
Holdout Method –
In the holdout method, the original data set with labeled instances is partitioned into two disjoint sets, called the training set and the test set. A classification model is induced from the training set, and its performance is then evaluated on the test set.
The accuracy of the classifier can be estimated from the accuracy of the induced model on the test set. The holdout method has several well-known limitations. First, fewer labeled instances are available for training because some of the data are withheld for testing.
As a result, the induced model may not be as good as one trained on all of the labeled examples. Second, the model can be highly dependent on the composition of the training and test sets: the smaller the training set, the larger the variance of the induced model.
On the other hand, if the training set is too large, then the estimated accuracy computed from the smaller test set is less reliable, and the estimate has a wide confidence interval. Finally, the training and test sets are no longer independent of each other, since both are subsets of the same original data.
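A minimal sketch of the holdout method in Python, assuming scikit-learn is available; the iris dataset, the decision-tree learner, and the 2/3 : 1/3 split ratio are purely illustrative choices:

```python
# Minimal sketch of the holdout method using scikit-learn.
# The dataset, learner, and split ratio are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Partition the labeled data into two disjoint sets: training and test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=42)

# Induce a classification model from the training set only.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Estimate the classifier's accuracy from the held-out test set.
print("holdout accuracy:", accuracy_score(y_test, model.predict(X_test)))
```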
Random Subsampling –
The holdout method can be repeated several times to improve the estimate of a classifier's performance. This approach is called random subsampling.
Let acc_i be the model accuracy during the i-th iteration. The overall accuracy is given by
acc_sub = (1/k) · Σ_{i=1}^{k} acc_i
Random subsampling still encounters some of the problems of the holdout approach because it does not use as much of the data as possible for training. It also has no control over the number of times each record is used for testing and training; consequently, some records may be used for training more often than others.
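A short sketch of random subsampling under the same illustrative assumptions as above (iris data, a decision tree, and an arbitrary choice of k = 10):

```python
# Sketch of random subsampling: repeat the holdout split k times
# and average the accuracies. k and the split ratio are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
k = 10
accuracies = []
for i in range(k):
    # A different random split in each iteration (note: no control over
    # how often any given record ends up in the test set).
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=i)
    model = DecisionTreeClassifier().fit(X_tr, y_tr)
    accuracies.append(accuracy_score(y_te, model.predict(X_te)))

# Overall accuracy: acc_sub = (1/k) * sum(acc_i)
print("random subsampling accuracy:", np.mean(accuracies))
```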
Cross-Validation:
An alternative to random subsampling is cross-validation. In this approach, each record is used the same number of times for training and exactly once for testing.
For example, suppose we partition the data into two equal-sized subsets. First, we choose one of the subsets for training and the other for testing. We then swap the roles of the subsets, so that the previous training set becomes the test set. This approach is known as twofold cross-validation.
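A brief sketch of k-fold cross-validation with scikit-learn (assumed available); the dataset and learner are again illustrative, and n_splits=2 gives the twofold case described above:

```python
# Sketch of k-fold cross-validation; n_splits=2 is twofold cross-validation,
# so each record is used exactly once for testing.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=2, shuffle=True, random_state=0)   # twofold CV
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print("per-fold accuracy:", scores, "mean:", scores.mean())
```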
Decision Trees:
A decision tree classifies an instance by sorting it down the tree from the root to a leaf node, which provides the classification; an instance sorted down the leftmost branch of such a tree would, for example, be classified as a negative instance.
In other words, we can say that a decision tree represents a disjunction of conjunctions of constraints on the attribute values of instances.
Decision trees also have some practical limitations:
● Decision trees are less appropriate for estimation tasks where the goal is to predict the value of a continuous attribute.
● Decision trees are prone to errors in classification problems with many classes and a relatively small number of training examples.
● Decision trees can be computationally expensive to train.
The most commonly used criteria for splitting a decision tree node are:
o Reduction in Variance
o Gini Impurity
o Information Gain
o Chi-Square
o Decision Tree Splitting Method #1: Reduction in Variance
Reduction in Variance is a method for splitting a node that is used when the target variable is continuous, i.e., for regression problems. It is so called because it uses the variance of the target as the measure for deciding which feature a node is split on.
Here are the steps to split a decision tree using reduction in variance (a short sketch follows the list):
1. For each split, individually calculate the variance of each child node
2. Calculate the variance of each split as the weighted average variance of
child nodes
3. Select the split with the lowest variance
4. Perform steps 1-3 until completely homogeneous nodes are achieved
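The following minimal Python sketch evaluates two candidate splits by the weighted child variance; the target values are made-up numbers, and numpy is assumed to be available:

```python
# Sketch of the reduction-in-variance criterion for a regression split.
# The target values and the candidate splits are made-up numbers.
import numpy as np

def split_variance(left_targets, right_targets):
    """Weighted average variance of the two child nodes produced by a split."""
    left, right = np.asarray(left_targets), np.asarray(right_targets)
    n = len(left) + len(right)
    return (len(left) / n) * left.var() + (len(right) / n) * right.var()

# Compare two candidate splits of the same parent node and keep the one
# with the lowest weighted child variance (steps 1-3 above).
split_a = split_variance([1.0, 1.2, 0.9], [5.0, 5.5, 4.8])
split_b = split_variance([1.0, 5.0, 0.9], [1.2, 5.5, 4.8])
print("split A:", split_a, "split B:", split_b)  # split A has the lower variance
```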
o Decision Tree Splitting Method #2: Gini Impurity / Gini Index
The Gini Index of a node is calculated as:
Gini = 1 − Σ_i (p_i)^2
where p_i is the probability (relative frequency) of class i in the node. A lower Gini value indicates a purer node.
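A rough Python sketch of this calculation; the class counts are made-up numbers for illustration:

```python
# Gini Index of a node: 1 - sum(p_i^2) over the classes in the node.
# The class counts below are made up purely for illustration.
def gini_index(class_counts):
    total = sum(class_counts)
    return 1.0 - sum((c / total) ** 2 for c in class_counts)

print(gini_index([10, 0]))   # pure node  -> 0.0
print(gini_index([5, 5]))    # 50/50 node -> 0.5 (maximum for two classes)
```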
o Decision Tree Splitting Method #3: Information Gain
Entropy is used for calculating the purity of a node: the lower the entropy, the higher the purity of the node. The entropy of a homogeneous node is zero. Since the Information Gain here is computed by subtracting the entropy from 1, Information Gain is higher for purer nodes, with a maximum value of 1. The entropy of a node is calculated as:
Entropy = − Σ_i p_i log2(p_i)
where p_i is the probability of class i in the node. Here are the steps to split a decision tree using information gain (a short sketch follows the list):
1. For each split, individually calculate the entropy of each child node
2. Calculate the entropy of each split as the weighted average entropy of
child nodes
3. Select the split with the lowest entropy or highest information gain
4. Until you achieve homogeneous nodes, repeat steps 1-3
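A minimal Python sketch of this criterion; the class counts of the child nodes are made-up numbers for illustration:

```python
# Sketch of the entropy-based splitting criterion.
# Class counts are made-up numbers for illustration.
import math

def entropy(class_counts):
    """Entropy = -sum(p_i * log2(p_i)); 0 for a homogeneous node."""
    total = sum(class_counts)
    probs = [c / total for c in class_counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)

def split_entropy(children):
    """Weighted average entropy of the child nodes produced by a split."""
    n = sum(sum(child) for child in children)
    return sum((sum(child) / n) * entropy(child) for child in children)

# Choose the split with the lowest weighted entropy (highest information gain).
print(split_entropy([[5, 0], [0, 5]]))   # perfect split -> 0.0
print(split_entropy([[3, 2], [2, 3]]))   # poor split    -> close to 1.0
```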
o Decision Tree Splitting Method #4: Chi-Square
The Chi-Square value for a class in a child node is given by:
Chi-Square (class) = √((Actual − Expected)^2 / Expected)
where Expected is the count the class would have if the child node had the same class distribution as its parent. This formula gives the Chi-Square value for one class; take the sum of the Chi-Square values for all the classes in a node to calculate the Chi-Square for that node. The higher the value, the greater the difference between the parent and child distributions, i.e., the higher the homogeneity of the child nodes. Here are the steps to split a decision tree using Chi-Square (a short sketch follows the list):
1. For each split, individually calculate the Chi-Square value of each child
node by taking the sum of Chi-Square values for each class in a node
2. Calculate the Chi-Square value of each split as the sum of Chi-Square
values for all the child nodes
3. Select the split with the highest Chi-Square value
4. Until you achieve homogeneous nodes, repeat steps 1-3
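A minimal sketch of this criterion for a two-class problem; the child-node counts and the 50/50 parent distribution are made-up numbers, and the expected counts follow the assumption described above:

```python
# Sketch of the Chi-Square splitting criterion for a two-class problem.
# Observed child-node counts are made-up numbers for illustration.
import math

def chi_square_node(actual_counts, parent_ratios):
    """Chi-Square of a child node: sum over classes of
    sqrt((actual - expected)^2 / expected), where the expected count
    assumes the child keeps the parent's class distribution."""
    total = sum(actual_counts)
    value = 0.0
    for actual, ratio in zip(actual_counts, parent_ratios):
        expected = total * ratio
        value += math.sqrt((actual - expected) ** 2 / expected)
    return value

# Chi-Square of a split = sum of the Chi-Square values of its child nodes.
parent = [0.5, 0.5]                       # parent node is 50/50
split = chi_square_node([8, 2], parent) + chi_square_node([2, 8], parent)
print("Chi-Square of the split:", split)  # a higher value means a better split
```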
Bayes' Theorem:
o Bayes' theorem is also known as Bayes' Rule or Bayes' law, which is
used to determine the probability of a hypothesis with prior knowledge.
It depends on the conditional probability.
o The formula for Bayes' theorem is given as:
P(A|B) = P(B|A) * P(A) / P(B)
Where,
P(A|B) is the posterior probability: the probability of hypothesis A given the observed evidence B.
P(B|A) is the likelihood: the probability of the evidence given that the hypothesis is true.
P(A) is the prior probability: the probability of the hypothesis before observing the evidence.
P(B) is the marginal probability: the probability of the evidence.
The working of the Naïve Bayes' classifier can be understood with the help of the example below:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play". Using this dataset, we need to decide whether or not we should play on a particular day according to the weather conditions. To solve this problem, we need to follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate Likelihood table by finding the probabilities of given features.
3. Now, use Bayes theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the below dataset:
Outlook Play
0 Rainy Yes
1 Sunny Yes
2 Overcast Yes
3 Overcast Yes
4 Sunny No
5 Rainy Yes
6 Sunny Yes
7 Overcast Yes
8 Rainy No
9 Sunny No
10 Sunny Yes
11 Rainy No
12 Overcast Yes
13 Overcast Yes
Frequency table for the Outlook feature:
Weather    Yes   No
Overcast   5     0
Rainy      2     2
Sunny      3     2
Total      10    4

Likelihood table:
Weather    Yes   No    P(Weather)
Overcast   5     0     5/14 = 0.36
Rainy      2     2     4/14 = 0.29
Sunny      3     2     5/14 = 0.36
All        10/14 = 0.71    4/14 = 0.29
Applying Bayes' theorem:
P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny)
P(Sunny|Yes) = 3/10 = 0.3
P(Sunny) = 5/14 = 0.36
P(Yes) = 10/14 = 0.71
So P(Yes|Sunny) = 0.3 * 0.71 / 0.36 ≈ 0.60
P(No|Sunny) = P(Sunny|No) * P(No) / P(Sunny)
P(Sunny|No) = 2/4 = 0.5
P(No) = 4/14 = 0.29
P(Sunny) = 5/14 = 0.36
So P(No|Sunny) = 0.5 * 0.29 / 0.36 ≈ 0.40
As we can see from the above calculation, P(Yes|Sunny) > P(No|Sunny).
Hence, on a sunny day, the player can play the game.
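The same calculation can be reproduced with a short piece of plain Python; the code below simply recomputes the frequency counts, likelihoods, and posteriors from the dataset listed above:

```python
# Sketch reproducing the weather example above with plain Python:
# frequency counts -> likelihoods -> posterior via Bayes' theorem.
from collections import Counter

data = [("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"), ("Overcast", "Yes"),
        ("Sunny", "No"), ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Rainy", "No"), ("Sunny", "No"), ("Sunny", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Overcast", "Yes")]

n = len(data)
play_counts = Counter(play for _, play in data)             # frequency of Yes / No
sunny_given = Counter(play for outlook, play in data if outlook == "Sunny")
p_sunny = sum(1 for outlook, _ in data if outlook == "Sunny") / n

for label in ("Yes", "No"):
    prior = play_counts[label] / n                          # P(label)
    likelihood = sunny_given[label] / play_counts[label]    # P(Sunny | label)
    posterior = likelihood * prior / p_sunny                # Bayes' theorem
    print(f"P({label} | Sunny) = {posterior:.2f}")          # 0.60 vs 0.40
```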
Advantages of Naïve Bayes Classifier:
o Naïve Bayes is one of the fastest and simplest ML algorithms for predicting the class of a dataset.
o It can be used for binary as well as multi-class classification.
o It performs well on multi-class prediction problems compared to many other algorithms.
o It is the most popular choice for text classification problems.
Bayesian Belief Networks:
The following diagram shows a directed acyclic graph over six Boolean variables.
The conditional probability table for the variable LungCancer (LC) gives the probability of each value of LC for every possible combination of the values of its parent nodes, FamilyHistory (FH) and Smoker (S), as follows −
What’s KNN?
KNN is pretty simple. Imagine that you have data about colored balls:
● Purple balls;
● Yellow balls;
● And a ball whose color, purple or yellow, you do not know, although you have all the data about it (except the color label).
So, how are you going to work out the ball's color? Imagine you are like a machine: you only have the ball's characteristics (the data), but not its final label. How will you determine the ball's color (the final label, i.e., its class)?
Note: let's suppose that the data with the number 1 (and label R) refer to the purple balls and the data with the number 2 (and label A) refer to the yellow balls; this is just to make the explanation easier.
Each row refers to a ball and each column to one of the ball's characteristics; the last column holds the class (color) of each ball:
● R -> purple;
● A -> yellow
We have 5 balls there (5 rows), each with its own classification. You could try to discover the new ball's color (its class) in N different ways; one of them is to compare the new ball's characteristics with all the others and see which it resembles most. If the data (characteristics) of this new ball (whose correct class you do not know) are most similar to the data of the yellow balls, then the color of the new ball is yellow; if the data of the new ball are more similar to the data of the purple balls than to the yellow ones, then the color of the new ball is purple. It looks very simple, and that is almost what KNN does, but in a more sophisticated way.
The KNN's steps are (a short sketch follows the list):
1 — Receive an unclassified data point;
2 — Measure the distance (Euclidean, Manhattan, Minkowski, or weighted) from the new data point to all the data points that are already classified;
3 — Get the K smallest distances (K is a parameter that you define);
4 — Check the classes of the points with the shortest distances and count how often each class appears;
5 — Take as the predicted class the class that appeared the most times;
6 — Classify the new data point with the class obtained in step 5.
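As a rough sketch, these steps map onto scikit-learn's KNeighborsClassifier (assumed available); the five balls and their two numeric characteristics below are made-up values standing in for the toy example above:

```python
# Sketch of the KNN steps using scikit-learn; the five "balls" and their
# two made-up characteristics mirror the toy example in the text.
from sklearn.neighbors import KNeighborsClassifier

X = [[1.0, 1.1], [1.2, 0.9], [1.1, 1.0],   # purple balls (class "R")
     [3.0, 3.2], [3.1, 2.9]]               # yellow balls (class "A")
y = ["R", "R", "R", "A", "A"]

# Steps 2-6: measure distances, take the K nearest, vote, assign the class.
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[1.05, 1.0]]))   # -> ['R'] (the new ball looks purple)
```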
Calculating distance:
Calculating the distance between two points (your new sample and each data point already in your dataset) is very simple. As said before, there are several ways to compute this value; in this article we will use the Euclidean distance.
Characteristics of KNN
Between-sample geometric distance
The k-nearest-neighbor classifier is commonly based on the Euclidean
distance between a test sample and the specified training samples. Let x_i be an input sample with p features (x_i1, x_i2, …, x_ip), n be the total number of input samples (i = 1, 2, …, n) and p the total number of features (j = 1, 2, …, p).
The Euclidean distance between samples x_i and x_l (l = 1, 2, …, n) is defined as
d(x_i, x_l) = √((x_i1 − x_l1)^2 + (x_i2 − x_l2)^2 + ⋯ + (x_ip − x_lp)^2)
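A minimal Python sketch of this distance, assuming plain lists of feature values:

```python
# Euclidean distance between a test sample x_i and a training sample x_l
# with p features, matching the formula above.
import math

def euclidean_distance(xi, xl):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xl)))

print(euclidean_distance([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))   # -> 5.0
```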