Classification Algorithms
The classification algorithm is a supervised learning technique that is used to identify the category of new observations on the basis of training data. In classification, a program learns from the given dataset or observations and then classifies new observations into a number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, cat or dog. Classes can also be called targets, labels, or categories.
Unlike regression, the output variable of classification is a category, not a numeric value, such as "Green or Blue" or "fruit or animal". Since classification is a supervised learning technique, it takes labeled input data, meaning each input comes with a corresponding output.
y = f(x), where y = a categorical output
The main goal of a classification algorithm is to identify the category of a given dataset, and these algorithms are mainly used to predict the output for categorical data. The idea can be pictured with two classes, class A and class B: the points within each class have features that are similar to each other and dissimilar to the other class.
The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of Classifications:
o Binary Classifier: If the classification problem has only two possible outcomes, it is called a binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, it is called a multi-class classifier.
Examples: classification of types of crops, classification of types of music.
In classification problems, there are two types of learners:
1. Lazy Learners: A lazy learner first stores the training dataset and waits until it receives the test dataset. In the lazy learner's case, classification is done on the basis of the most related data stored in the training dataset. It takes less time in training but more time for predictions (see the sketch after this list).
Examples: K-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training dataset before receiving a test dataset. Opposite to lazy learners, an eager learner takes more time in learning and less time in prediction.
Examples: decision trees, Naïve Bayes, ANN.
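To make the lazy-versus-eager distinction concrete, here is a minimal sketch using scikit-learn's K-NN classifier, a lazy learner: fitting essentially just stores the data, and the neighbour search happens at prediction time. The toy feature values and class names are made up for illustration.

from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: two features per sample, two classes
X_train = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [4.8, 5.1]]
y_train = ["class A", "class A", "class B", "class B"]

knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)          # lazy: fitting mostly stores the dataset
print(knn.predict([[1.1, 1.0]]))   # the real work happens here, at prediction time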
Classification algorithms can be further divided into mainly two categories:
o Linear Models
   o Logistic Regression
   o Support Vector Machines
o Non-linear Models
   o K-Nearest Neighbours
   o Kernel SVM
   o Naïve Bayes
   o Decision Tree Classification
   o Random Forest Classification
Evaluating a Classification Model
Once a classification model is completed, its performance can be evaluated. Common methods are:
1. Log Loss or Cross-Entropy Loss: for a model that outputs a probability p for the positive class, the loss for a single observation with true label y is −(y log(p) + (1−y) log(1−p)). For a good binary classification model, the log loss should be near 0.
2. Confusion Matrix: a table of correct and incorrect predictions per class, described in detail later in this section.
3. AUC-ROC curve: the area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate at different classification thresholds.
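As a quick check of the log-loss formula above, here is a minimal Python sketch. The label and probability arrays are made-up values, and the log_loss helper is written out here for illustration (scikit-learn also provides an equivalent function).

import numpy as np

def log_loss(y_true, p_pred, eps=1e-15):
    # Clip probabilities so log(0) never occurs
    p = np.clip(p_pred, eps, 1 - eps)
    # -(y*log(p) + (1-y)*log(1-p)), averaged over observations
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

y_true = np.array([1, 0, 1, 1])          # true labels (hypothetical)
p_pred = np.array([0.9, 0.1, 0.8, 0.4])  # predicted probabilities (hypothetical)
print(log_loss(y_true, p_pred))          # a value near 0 indicates a better model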
Classification algorithms can be used in many different places. Popular use cases include email spam detection, speech recognition, identification of cancerous tumour cells, drug classification, and biometric identification.
Logistic Regression
Logistic regression predicts the output of a categorical dependent variable. Therefore the outcome must be a categorical or discrete value: Yes or No, 0 or 1, true or false, etc. But instead of giving an exact value of 0 or 1, it gives probabilistic values which lie between 0 and 1. The hypothesis of logistic regression limits its output to the range between 0 and 1. A linear function fails to represent this, since it can take values greater than 1 or less than 0, which is not possible under the hypothesis of logistic regression.
o Logistic regression is very similar to linear regression except in how it is used: linear regression is used for solving regression problems, whereas logistic regression is used for solving classification problems.
o In logistic regression, instead of fitting a regression line, we fit an "S"-shaped logistic function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something, such as whether cells are cancerous or not, or whether a mouse is obese or not based on its weight.
o Logistic regression is a significant machine learning algorithm because it can provide probabilities and classify new data using both continuous and discrete datasets.
o Logistic regression can be used to classify observations using different types of data and can easily determine the most effective variables for the classification.
The logistic (sigmoid) function behaves as follows:
o The sigmoid function is a mathematical function used to map predicted values to probabilities.
o It maps any real value to a value within the range 0 to 1.
o The output of logistic regression must stay between 0 and 1 and cannot go beyond this limit, so it forms a curve like the letter "S". This S-shaped curve is called the sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which decides between the classes 0 and 1: values above the threshold tend to 1, and values below the threshold tend to 0.
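A minimal Python sketch of the sigmoid function and the threshold rule just described; the 0.5 threshold and the sample value of z are illustrative choices, not fixed requirements.

import numpy as np

def sigmoid(z):
    # Maps any real value into the range (0, 1), producing the "S" curve
    return 1.0 / (1.0 + np.exp(-z))

z = 2.0                        # e.g., output of b0 + b1*x (hypothetical value)
p = sigmoid(z)                 # probability, here about 0.88
label = 1 if p >= 0.5 else 0   # threshold rule: above 0.5 -> class 1
print(p, label)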
Assumptions for Logistic Regression:
o The dependent variable must be categorical in nature.
o The independent variables should not have multi-collinearity.
The logistic regression equation can be obtained from the linear regression equation. The mathematical steps to get the logistic regression equation are given below:
o We know the equation of the straight line can be written as:
y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
o In logistic regression y can be between 0 and 1 only, so let's divide the above equation by (1−y):
y/(1−y), which is 0 for y = 0 and infinity for y = 1
o But we need a range between −∞ and +∞; taking the logarithm of the equation, it becomes:
log[y/(1−y)] = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ
Hypothesis Representation
In linear regression, the hypothesis was hΘ(x) = β₀ + β₁X. For logistic regression, we expect our hypothesis to give values between 0 and 1, so we modify it slightly:
Z = β₀ + β₁X
hΘ(x) = sigmoid(Z) = 1 / (1 + e^(−(β₀ + β₁X)))
Cost Function
We learnt about the cost function J(θ) in linear regression; the cost function represents the optimization objective, i.e. we create a cost function and minimize it so that we can develop an accurate model with minimum error. If we try to use the cost function of linear regression in logistic regression, the result is a non-convex function with many local minima, in which it would be very difficult to minimize the cost value and find the global minimum. Logistic regression therefore uses the log-loss cost:
J(θ) = −(1/m) Σ [y log(hΘ(x)) + (1−y) log(1−hΘ(x))]
which is convex, so gradient descent can reach the global minimum.
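The convex log-loss cost above can be minimized with plain gradient descent. The sketch below is a bare-bones illustration on a made-up one-feature dataset; the learning rate and iteration count are arbitrary choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical one-feature dataset: small x -> class 0, large x -> class 1
X = np.array([0.5, 1.0, 1.5, 3.0, 3.5, 4.0])
y = np.array([0, 0, 0, 1, 1, 1])

b0, b1 = 0.0, 0.0                  # parameters of Z = b0 + b1*X
lr = 0.1                           # learning rate (arbitrary choice)
for _ in range(5000):
    p = sigmoid(b0 + b1 * X)       # predicted probabilities
    # Gradients of the average log-loss cost J with respect to b0 and b1
    b0 -= lr * np.mean(p - y)
    b1 -= lr * np.mean((p - y) * X)

print(b0, b1)                      # decision boundary near x = -b0/b1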
On the basis of the categories, logistic regression can be classified into three types:
o Binomial: In binomial logistic regression, there can be only two possible types of the dependent variable, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more possible unordered types of the dependent variable, such as "cat", "dog", or "sheep" (a sketch of this case follows the list).
o Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types of the dependent variable, such as "low", "Medium", or "High".
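For instance, scikit-learn's LogisticRegression handles both the binomial and the multinomial case. A hedged sketch with made-up three-class data:

from sklearn.linear_model import LogisticRegression

# Hypothetical data with three unordered classes, as in the multinomial case
X = [[1.0], [1.2], [5.0], [5.2], [9.0], [9.1]]
y = ["cat", "cat", "dog", "dog", "sheep", "sheep"]

clf = LogisticRegression().fit(X, y)
print(clf.predict([[5.1]]))        # -> likely "dog"
print(clf.predict_proba([[5.1]]))  # one probability per class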
Decision Tree Classification Algorithm
There are various algorithms in machine learning, so choosing the best algorithm for the given dataset and problem is the main point to remember while creating a machine learning model. Below are two reasons for using a decision tree:
o Decision trees usually mimic human thinking ability while making a decision, so they are easy to understand.
o The logic behind a decision tree can be easily understood because it shows a tree-like structure.
Decision Tree Terminologies
Root Node: The root node is where the decision tree starts. It represents the entire dataset, which further gets divided into two or more homogeneous sets.
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be segregated further after reaching a leaf node.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes according to the given conditions.
Branch/Sub-Tree: A tree formed by splitting the tree.
Pruning: Pruning is the process of removing unwanted branches from the tree.
Parent/Child Node: The root node of the tree is called the parent node, and the other nodes are called the child nodes.
In a decision tree, to predict the class of a given record, the algorithm starts from the root node of the tree. It compares the value of the root attribute with the corresponding attribute of the record and, based on the comparison, follows a branch and jumps to the next node.
For the next node, the algorithm again compares the attribute value with the other sub-nodes and moves further. It continues this process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain the possible values for the best attribute.
o Step-4: Generate the decision tree node which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in Step-3. Continue this process until a stage is reached where you cannot further classify the nodes; these final nodes are the leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether to accept the offer or not. To solve this problem, the decision tree starts with the root node (the Salary attribute, chosen by ASM). The root node splits further into the next decision node (distance from the office) and one leaf node based on the corresponding labels. The next decision node further splits into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (Accepted offer and Declined offer). A sketch of how such a tree could be learned in code follows.
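The steps above are what library implementations carry out internally. Here is a minimal sketch using scikit-learn's DecisionTreeClassifier; the salary/distance numbers and labels are entirely made up to echo the job-offer example.

from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical dataset: [salary_in_lakhs, distance_km], label = offer decision
X = [[4, 2], [9, 3], [9, 30], [15, 30], [15, 2], [5, 25]]
y = ["Decline", "Accept", "Decline", "Accept", "Accept", "Decline"]

tree = DecisionTreeClassifier(criterion="entropy")  # split by information gain
tree.fit(X, y)
# Print the learned root node and splits in text form
print(export_text(tree, feature_names=["salary", "distance"]))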
Attribute Selection Measures
While implementing a decision tree, the main issue that arises is how to select the best attribute for the root node and for the sub-nodes. To solve such problems there is a technique called the Attribute Selection Measure, or ASM. With this measurement, we can easily select the best attribute for the nodes of the tree. There are two popular techniques for ASM:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of the change in entropy after segmenting a dataset based on an attribute.
o It calculates how much information a feature provides about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and the node/attribute having the highest information gain is split first. It can be calculated using the below formula:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
Entropy measures the impurity in a given attribute; for a two-class set, it can be calculated as:
Entropy(S) = −P(yes) log₂ P(yes) − P(no) log₂ P(no)
Where,
o S = total number of samples
o P(yes) = probability of yes
o P(no) = probability of no
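A minimal Python sketch of these two formulas; the node labels and the candidate split are made-up examples.

import numpy as np

def entropy(labels):
    # Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def information_gain(parent, subsets):
    # Entropy(S) minus the weighted average entropy of the subsets after a split
    n = len(parent)
    weighted = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted

parent = ["yes"] * 5 + ["no"] * 5                     # hypothetical node labels
split = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]  # a candidate split
print(information_gain(parent, split))                # higher gain => better split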
2. Gini Index:
o The Gini index is a measure of impurity or purity used while creating a decision tree in the CART (Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini index.
o The CART algorithm only creates binary splits, and it uses the Gini index to choose them.
o The Gini index can be calculated using the below formula:
Gini Index = 1 − Σⱼ Pⱼ²
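And a matching sketch for the Gini index formula, on the same style of made-up labels:

import numpy as np

def gini_index(labels):
    # Gini = 1 - sum_j P_j^2 over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

print(gini_index(["yes"] * 4 + ["no"]))      # low impurity -> preferred split
print(gini_index(["yes"] * 5 + ["no"] * 5))  # maximum two-class impurity (0.5)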
Pruning is the process of deleting unnecessary nodes from a tree in order to get the optimal decision tree.
A too-large tree increases the risk of overfitting, while a small tree may not capture all the important features of the dataset. A technique that decreases the size of the learned tree without reducing accuracy is therefore known as pruning. There are mainly two types of tree pruning technique used:
o Cost Complexity Pruning
o Reduced Error Pruning
Advantages of the Decision Tree
o It is simple to understand, as it follows the same process a human follows while making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement for data cleaning compared to other algorithms.
Entropy in Machine Learning
Entropy is defined as a measure of the randomness or disorder of the information being processed in machine learning. In other words, entropy is a machine learning metric that measures the unpredictability, or impurity, in the system.
When information is processed in the system, every piece of information has a specific value and can be used to draw conclusions. If it is easy to draw a valuable conclusion from a piece of information, its entropy is lower; if the entropy is higher, it is difficult to draw any conclusion from that piece of information.
Entropy is a useful tool in machine learning to understand various concepts such as feature
selection, building decision trees, and fitting classification models, etc.
Consider a dataset having a total of N classes; then the entropy (E) can be determined with the formula below:
E = −Σᵢ pᵢ log₂ pᵢ
Where,
pᵢ = the probability of randomly selecting an example of class i.
For a two-class problem, entropy lies between 0 and 1; depending on the number of classes in the dataset, it can be greater than 1. Either way, a high value of entropy denotes high disorder (impurity), and a low value denotes low disorder.
Let's understand it with an example where we have a dataset having three colours of fruits: red, green, and yellow. Suppose we have 2 red, 2 green, and 4 yellow observations in the dataset. Then, as per the above equation:
E = −(pr log₂ pr + pg log₂ pg + py log₂ py)
Where,
pr = 2/8 = 1/4 [as only 2 out of the 8 samples are red fruits]
pg = 2/8 = 1/4 [as only 2 out of the 8 samples are green fruits]
py = 4/8 = 1/2 [as 4 out of the 8 samples are yellow fruits]
So E = −(1/4 × (−2) + 1/4 × (−2) + 1/2 × (−1)) = 0.5 + 0.5 + 0.5 = 1.5.
Now consider a case where all observations belong to the same class; then the entropy will always be 0:
E = −(1 × log₂ 1) = 0
When the entropy is 0, the dataset has no impurity. A dataset with no impurity is not useful for learning, since there is nothing left to separate. Further, if the entropy is 1 (a perfectly even mix of two classes), this kind of dataset is good for learning.
While building any machine learning model, the first thing that comes to mind is how to build an accurate, "good fit" model and what challenges will come up during the procedure. Precision and recall are two of the most important, but most often confused, concepts in machine learning. They are performance metrics used for pattern recognition and classification. These concepts are essential for building a good machine learning model, since some models require more precision and some require more recall. It is therefore important to understand the balance between precision and recall, or simply the precision-recall trade-off.
The confusion matrix helps us visualize where our model gets confused in discriminating between two classes. It can be understood as a 2×2 matrix where the rows represent the actual (truth) labels and the columns represent the predicted labels.
This matrix consists of 4 elements that count the correct and incorrect predictions. Each element is named with two words:
o True or False
o Positive or Negative
If the predicted and truth labels match, the prediction is said to be correct ("true"); when they mismatch, the prediction is said to be incorrect ("false"). Positive and negative refer to the predicted label.
There are four combinations in the confusion matrix, which are as follows:
o True Positive: how many times the model correctly classifies a positive sample as positive.
o False Negative: how many times the model incorrectly classifies a positive sample as negative.
o False Positive: how many times the model incorrectly classifies a negative sample as positive.
o True Negative: how many times the model correctly classifies a negative sample as negative.
Hence, from these four counts we can calculate the various performance metrics (such as accuracy, precision, and recall) for binary classification problems, as sketched below.
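As a hedged sketch, scikit-learn can produce the confusion matrix and the derived metrics directly; the actual and predicted labels below are made up.

from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 1, 0, 1]  # actual labels (hypothetical)
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]  # predicted labels (hypothetical)

# Rows are actual labels, columns are predicted labels; for binary {0, 1}:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)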
What is Precision?
Precision is calculated as the ratio of the number of positive samples correctly classified as positive to the total number of samples classified as positive:
Precision = TP / (TP + FP)
Where,
o TP = True Positives
o FP = False Positives
o The precision of a machine learning model will be low when the denominator TP + FP is much larger than the numerator TP (i.e., there are many false positives).
o The precision of the model will be high when the numerator TP is close to the denominator TP + FP (i.e., there are few false positives).
Case 1: The model correctly classified two positive samples while incorrectly classifying one negative sample as positive. Hence, according to the precision formula:
Precision = TP / (TP + FP) = 2 / (2 + 1) = 2/3 ≈ 0.667
Case 2: Three positive samples are correctly classified, and one negative sample is incorrectly classified as positive:
Precision = TP / (TP + FP) = 3 / (3 + 1) = 3/4 = 0.75
Case 3: Three positive samples are correctly classified, and no negative sample is incorrectly classified as positive:
Precision = TP / (TP + FP) = 3 / (3 + 0) = 1
Hence, in the last scenario we have a precision of 1, or 100%: every sample classified as positive really is positive, and no negative sample is incorrectly classified as positive.
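Checking the three cases numerically (the TP/FP counts are those stated above):

def precision(tp, fp):
    # Precision = TP / (TP + FP)
    return tp / (tp + fp)

print(precision(2, 1))  # Case 1: 0.667
print(precision(3, 1))  # Case 2: 0.75
print(precision(3, 0))  # Case 3: 1.0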
What is Recall?
Recall is calculated as the ratio of the number of positive samples correctly classified as positive to the total number of positive samples. Recall measures the model's ability to detect positive samples: the higher the recall, the more positive samples are detected.
Recall = TP / (TP + FN)
Where,
o TP = True Positives
o FN = False Negatives
o The recall of a machine learning model will be low when the denominator TP + FN is much larger than the numerator TP (i.e., many positive samples are missed).
o The recall of the model will be high when the numerator TP is close to the denominator TP + FN (i.e., few positive samples are missed).
Below are some examples of calculating recall in machine learning.
Example 1: Let's understand the calculation of recall with four different cases, where each case has the same recall of 0.667 but differs in the classification of the negative samples.
The classification of the negative samples differs across the cases: cases A and B classify two negative samples as negative, case C classifies only one negative sample as negative, and case D does not classify any negative sample as negative.
However, recall is independent of how the negative samples are classified; hence, we can neglect the negative samples and only count the positive samples and how they are classified.
Suppose we have two positive samples that are correctly classified as positive, while one positive sample is incorrectly classified as negative.
Hence, the true positive count is 2 and the false negative count is 1. Then the recall will be:
Recall = TP / (TP + FN) = 2 / (2 + 1) = 2/3 ≈ 0.667
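The same calculation, verified in code with the counts from the example above:

tp, fn = 2, 1              # counts taken from the example above
recall = tp / (tp + fn)    # Recall = TP / (TP + FN)
print(round(recall, 3))    # 0.667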
Example 2: Now we have a scenario where all positive samples are correctly classified as positive. Hence, the true positive count is 3 and the false negative count is 0:
Recall = TP / (TP + FN) = 3 / (3 + 0) = 1, i.e., 100%.
Precision | Recall
--------- | ------
It helps us measure the model's ability to classify positive samples correctly among everything it labels positive. | It helps us measure how many of the positive samples were correctly classified by the ML model.
While calculating the precision of a model, we consider both the positive and the negative samples that are classified as positive. | While calculating the recall of a model, we only need the positive samples; all negative samples are neglected.
When a model classifies most of the positive samples correctly but also produces many false positives, it is said to be a high-recall, low-precision model. | When a model classifies only a few samples as positive but gets those right, it is said to be a high-precision, low-recall model.
The precision of a machine learning model depends on both the negative and the positive samples. | The recall of a machine learning model depends on the positive samples and is independent of the negative samples.
Precision considers all samples that are classified as positive, whether correctly or incorrectly. | Recall cares about correctly classifying all positive samples; it does not consider whether any negative sample is classified as positive.
If the recall is 100%, the model has detected all positive samples as positive; it says nothing about how the negative samples are classified. The model could still misclassify many negative samples as positive, and recall simply neglects those samples, so 100% recall can coexist with a high false positive count.
Example 3: In this scenario, the model does not identify any positive sample correctly; all positive samples are incorrectly classified as negative. Hence, the true positive count is 0 and the false negative count is 3. Then the recall will be:
Recall = TP / (TP + FN) = 0 / (0 + 3) = 0
This means the model has not correctly classified any positive samples.