Classification Algorithms

What is the Classification Algorithm?

The Classification algorithm is a supervised learning technique used to identify the
category of new observations on the basis of training data. In classification, a program
learns from the given dataset or observations and then classifies new observations into a
number of classes or groups, such as Yes or No, 0 or 1, Spam or Not Spam, or cat or dog.
Classes can be called targets, labels, or categories.

Unlike regression, the output variable of classification is a category rather than a numerical
value, such as "Green or Blue" or "fruit or animal". Since the Classification algorithm is a
supervised learning technique, it takes labeled input data, meaning each input comes with its
corresponding output.

In a classification algorithm, the input variable (x) is mapped to a discrete output (y):

y = f(x), where y is the categorical output

A classic example of an ML classification algorithm is an email spam detector.

The main goal of the Classification algorithm is to identify the category of a given dataset,
and these algorithms are mainly used to predict the output for categorical data.

Classification algorithms can be better understood using the diagram below, which shows two
classes, Class A and Class B. Each class has features that are similar to the other members of
its own class and dissimilar to the other class.
The algorithm which implements the classification on a dataset is known as a classifier.
There are two types of classifications:

o Binary Classifier: If the classification problem has only two possible outcomes, it is
called a binary classifier.
Examples: YES or NO, MALE or FEMALE, SPAM or NOT SPAM, CAT or DOG, etc.
o Multi-class Classifier: If a classification problem has more than two outcomes, it is
called a multi-class classifier.
Examples: classification of types of crops, classification of types of music.

Learners in Classification Problems:

In classification problems, there are two types of learners:

1. Lazy Learners: A lazy learner first stores the training dataset and waits until it
receives the test dataset. In the lazy learner's case, classification is done on the basis of
the most closely related data stored in the training dataset. It takes less time in training but
more time for predictions.
Examples: K-NN algorithm, case-based reasoning.
2. Eager Learners: Eager learners develop a classification model based on a training
dataset before receiving a test dataset. Opposite to lazy learners, an eager learner
takes more time in learning and less time in prediction. Examples: decision trees,
Naïve Bayes, ANN.
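The distinction is easy to see in code. Below is a minimal sketch, assuming scikit-learn is available and using a made-up toy dataset, that contrasts a lazy learner (k-NN) with an eager learner (a decision tree):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Toy dataset, generated only for illustration.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Lazy learner: fit() essentially just stores the data;
# the real work happens at prediction time.
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)

# Eager learner: the model (the tree structure) is built during fit().
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

print("k-NN accuracy:", knn.score(X_test, y_test))
print("Tree accuracy:", tree.score(X_test, y_test))
```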

Types of ML Classification Algorithms:

Classification algorithms can be divided into two main categories:

o Linear Models
  o Logistic Regression
  o Support Vector Machines
o Non-linear Models
  o K-Nearest Neighbours
  o Kernel SVM
  o Naïve Bayes
  o Decision Tree Classification
  o Random Forest Classification

Evaluating a Classification model:


Once our model is complete, it is necessary to evaluate its performance, whether it is a
classification or a regression model. A classification model can be evaluated in the
following ways:

1. Log Loss or Cross-Entropy Loss:

o It is used for evaluating the performance of a classifier whose output is a probability
value between 0 and 1.
o For a good binary classification model, the value of log loss should be near 0.
o The value of log loss increases as the predicted probability deviates from the actual value.
o A lower log loss represents a higher accuracy of the model.
o For binary classification, cross-entropy can be calculated as:

−(y log(p) + (1 − y) log(1 − p))

where y is the actual label and p is the predicted probability.
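As a rough sketch of how this formula is applied in practice, the snippet below computes the average binary cross-entropy with NumPy; the labels and probabilities are illustrative values, not from any real model:

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-15):
    """Average of -(y*log(p) + (1-y)*log(1-p)) over all samples."""
    p = np.clip(y_pred, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

y_true = np.array([1, 0, 1, 1])
y_pred = np.array([0.9, 0.1, 0.8, 0.6])
print(binary_cross_entropy(y_true, y_pred))  # small value -> good predictions
```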

2. Confusion Matrix:

o The confusion matrix provides a matrix/table as output that describes the
performance of the model.
o It is also known as the error matrix.
o The matrix presents the prediction results in a summarized form, giving the total
number of correct and incorrect predictions. It looks like the table below:

Actual Positive Actual Negative

Predicted Positive True Positive False Positive

Predicted Negative False Negative True Negative
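Assuming scikit-learn is available, a confusion matrix can be computed directly from illustrative labels like these (note that scikit-learn orders the rows and columns as [negative, positive], i.e. [[TN, FP], [FN, TP]], which differs from the layout of the table above):

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only; 1 = positive, 0 = negative.
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Layout: [[TN, FP],
#          [FN, TP]]
print(confusion_matrix(y_actual, y_predicted))
```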

3. AUC-ROC curve:

o ROC stands for Receiver Operating Characteristic curve, and AUC stands for Area
Under the Curve.
o It is a graph that shows the performance of the classification model at different
classification thresholds.
o The ROC curve is plotted as TPR against FPR, with the TPR (True Positive Rate) on the
Y-axis and the FPR (False Positive Rate) on the X-axis.
o It can also be used to visualize the performance of a multi-class classification model
by plotting one curve per class.
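A minimal sketch with scikit-learn, using made-up scores, shows how the TPR/FPR pairs and the AUC are obtained:

```python
from sklearn.metrics import roc_curve, roc_auc_score

# Illustrative values; y_score is the model's predicted probability of class 1.
y_true  = [0, 0, 1, 1, 0, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.9, 0.5]

# One (FPR, TPR) point per threshold; plotting tpr vs fpr gives the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
print("AUC:", roc_auc_score(y_true, y_score))
```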

Use cases of Classification Algorithms

Classification algorithms can be used in different places. Below are some popular use cases
of Classification Algorithms:

o Email Spam Detection
o Speech Recognition
o Identification of cancer tumor cells
o Drugs Classification
o Biometric Identification, etc.

Logistic Regression in Machine Learning


o Logistic regression is one of the most popular Machine Learning algorithms, which
comes under the Supervised Learning technique. It is used for predicting the
categorical dependent variable using a given set of independent variables.

Logistic regression predicts the output of a categorical dependent variable. Therefore, the
outcome must be a categorical or discrete value: Yes or No, 0 or 1, True or False, etc. But
instead of giving the exact values 0 and 1, it gives probabilistic values which lie between 0
and 1. The hypothesis of logistic regression constrains the cost function between 0 and 1;
linear functions fail to represent it, as they can take values greater than 1 or less than 0,
which is not possible under the hypothesis of logistic regression.

o Logistic regression is similar to linear regression except in how the two are used:
linear regression is used for solving regression problems, whereas logistic regression
is used for solving classification problems.
o In Logistic regression, instead of fitting a regression line, we fit an "S" shaped logistic
function, which predicts two maximum values (0 or 1).
o The curve from the logistic function indicates the likelihood of something such as
whether the cells are cancerous or not, a mouse is obese or not based on its weight,
etc.
o Logistic Regression is a significant machine learning algorithm because it has the
ability to provide probabilities and classify new data using continuous and discrete
datasets.
o Logistic regression can be used to classify observations using different types of data
and can easily determine the most effective variables for the classification.
The image below shows the logistic function:

Logistic Function (Sigmoid Function):

o The sigmoid function is a mathematical function used to map the predicted values to
probabilities.
o It maps any real value into another value within a range of 0 and 1.
o The output of logistic regression must be between 0 and 1 and cannot go beyond
this limit, so it forms a curve like the "S" shape. The S-form curve is called the
sigmoid function or the logistic function.
o In logistic regression, we use the concept of a threshold value, which defines the
boundary between predicting 0 or 1: values above the threshold tend to 1, and
values below the threshold tend to 0.
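A minimal sketch of the sigmoid and the threshold rule, assuming NumPy (the threshold of 0.5 is the usual default, not a fixed requirement):

```python
import numpy as np

def sigmoid(z):
    """Maps any real value into the (0, 1) range, producing the S-shaped curve."""
    return 1.0 / (1.0 + np.exp(-z))

probs = sigmoid(np.array([-5.0, 0.0, 5.0]))
print(probs)                       # -> approx. [0.007, 0.5, 0.993]
print((probs >= 0.5).astype(int))  # thresholding turns probabilities into classes
```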
Assumptions for Logistic Regression:

o The dependent variable must be categorical in nature.
o The independent variables should not have multicollinearity.

Logistic Regression Equation:

The logistic regression equation can be obtained from the linear regression equation. The
mathematical steps to get the logistic regression equation are given below:

o We know the equation of a straight line can be written as:

y = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

o In logistic regression, y can be between 0 and 1 only, so we divide the above
equation by (1 − y):

y / (1 − y), which is 0 for y = 0 and infinity for y = 1

o But we need a range between −infinity and +infinity; taking the logarithm, the
equation becomes:

log[y / (1 − y)] = β₀ + β₁x₁ + β₂x₂ + … + βₙxₙ

The above equation is the final equation for logistic regression (the log-odds, or logit, form).

Hypothesis Representation

When using linear regression, we used the following hypothesis:

hΘ(x) = β₀ + β₁X

For logistic regression, we modify it a little:

hΘ(x) = σ(Z) = σ(β₀ + β₁X)

We expect our hypothesis to give values between 0 and 1.

Z = β₀ + β₁X

hΘ(x) = sigmoid(Z)

i.e. hΘ(x) = 1/(1 + e^−(β₀ + β₁X))


The Hypothesis of logistic regression
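In practice, the coefficients β₀ and β₁ are learned from data. A minimal sketch with scikit-learn, using a made-up one-feature dataset (hours studied vs. pass/fail), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical data: hours studied -> pass (1) / fail (0).
X = np.array([[1], [2], [3], [4], [5], [6]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
print(model.intercept_, model.coef_)  # the learned beta0 and beta1
print(model.predict_proba([[3.5]]))   # probability of each class
print(model.predict([[3.5]]))         # class label after thresholding
```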

Cost Function

We learnt about the cost function J(θ) in linear regression. The cost function represents the
optimization objective: we create a cost function and minimize it so that we can develop an
accurate model with minimum error.

The cost function of linear regression (mean squared error) is:

J(θ) = (1/2m) · Σ (hθ(x) − y)²

If we try to use the cost function of linear regression in logistic regression, it is of no use:
it ends up being a non-convex function with many local minima, in which it would be very
difficult to minimize the cost value and find the global minimum.

For logistic regression, the cost function is defined as:

Cost(hθ(x), y) = −log(hθ(x))      if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))  if y = 0

This cost function is convex, so gradient descent can reach the global minimum.

Type of Logistic Regression:

On the basis of the categories, Logistic Regression can be classified into three types:

o Binomial: In binomial Logistic regression, there can be only two possible types of the
dependent variables, such as 0 or 1, Pass or Fail, etc.
o Multinomial: In multinomial logistic regression, there can be 3 or more possible
unordered types of the dependent variable, such as "cat", "dog", or "sheep".
o Ordinal: In ordinal logistic regression, there can be 3 or more possible ordered types
of the dependent variable, such as "Low", "Medium", or "High".

Decision Tree Classification Algorithm


o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a Decision tree, there are two nodes, which are the Decision Node and Leaf
Node. Decision nodes are used to make any decision and have multiple branches,
whereas Leaf nodes are the output of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node,
which expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further
splits the tree into subtrees.
o The diagram below shows the general structure of a decision tree:

Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the
given dataset and problem is the main point to remember while creating a machine learning
model. Below are the two reasons for using the Decision tree:

o Decision trees usually mimic human thinking while making a decision, so they are
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-
like structure.

Decision Tree Terminologies

 Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.

 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.

 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
 Branch/Sub Tree: A subtree formed by splitting a larger tree.

 Pruning: Pruning is the process of removing the unwanted branches from the tree.

 Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the
root node of the tree. This algorithm compares the values of root attribute with the record
(real dataset) attribute and, based on the comparison, follows the branch and jumps to the
next node.

For the next node, the algorithm again compares the attribute value with the other sub-
nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the below algorithm:

o Step-1: Begin the tree with the root node, say S, which contains the complete
dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain the possible values of the best attribute.
o Step-4: Generate the decision tree node that contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created
in Step-3. Continue this process until a stage is reached where the nodes cannot be
classified further; such final nodes are called leaf nodes.

Example: Suppose a candidate has a job offer and wants to decide whether to accept it or
not. To solve this problem, the decision tree starts with the root node (the Salary attribute,
as chosen by ASM). The root node splits further into the next decision node (Distance from
the office) and one leaf node based on the corresponding labels. The next decision node
further splits into one decision node (Cab facility) and one leaf node. Finally, the decision
node splits into two leaf nodes (Accepted offer and Declined offer). Consider the diagram
below:
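As an illustration of these steps, the sketch below fits a CART tree with scikit-learn on a hypothetical version of the job-offer data (the features and labels are invented for this example) and prints the learned rules:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical job-offer records: [salary, distance_km, cab_facility (0/1)].
X = [[60, 5, 1], [60, 25, 0], [30, 5, 1], [80, 25, 1], [30, 25, 0], [80, 5, 0]]
y = [1, 0, 0, 1, 0, 1]  # 1 = accept the offer, 0 = decline

# CART picks the best attribute at each node; its ASM is the Gini index by default.
tree = DecisionTreeClassifier(criterion="gini", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["salary", "distance", "cab"]))
```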
Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best
attribute for the root node and for the sub-nodes. To solve such problems there is a
technique called the Attribute Selection Measure, or ASM. With this measurement, we
can easily select the best attribute for the nodes of the tree. There are two popular ASM
techniques:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation
of a dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision
tree.
o A decision tree algorithm always tries to maximize the value of information gain, and
the node/attribute with the highest information gain is split first. Information gain
can be calculated using the formula below:

Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]

Entropy: Entropy is a metric that measures the impurity in a given attribute. It specifies the
randomness in the data. For a binary target, entropy can be calculated as:

Entropy(S) = −P(yes)·log₂(P(yes)) − P(no)·log₂(P(no))

where:

o S = the total sample set
o P(yes) = probability of yes
o P(no) = probability of no
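A minimal sketch of these two formulas in NumPy, using a made-up split for illustration:

```python
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum(p * log2(p)) over the class probabilities."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, subsets):
    """Entropy(S) minus the weighted average entropy of the subsets."""
    n = len(parent)
    weighted_avg = sum(len(s) / n * entropy(s) for s in subsets)
    return entropy(parent) - weighted_avg

# Hypothetical node with 9 'yes' and 5 'no', split into two branches.
parent = ["yes"] * 9 + ["no"] * 5
left, right = parent[:8], parent[8:]  # an invented split for illustration
print(information_gain(parent, [left, right]))
```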

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in
the CART(Classification and Regression Tree) algorithm.
o An attribute with a low Gini index should be preferred over one with a high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − ∑ⱼ (Pⱼ)²
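The Gini index has the same shape as the entropy helper above; a minimal NumPy sketch:

```python
import numpy as np

def gini_index(labels):
    """Gini = 1 - sum(p_j^2); 0 means a pure node, 0.5 is the binary maximum."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_index(["yes"] * 5 + ["no"] * 5))  # 0.5: maximally impure binary node
print(gini_index(["yes"] * 10))              # 0.0: pure node
```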

Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the
optimal decision tree.

A too-large tree increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the learning tree
without reducing accuracy is therefore known as pruning. There are mainly two types of
tree pruning techniques used:

o Cost Complexity Pruning
o Reduced Error Pruning

Advantages of the Decision Tree

o It is simple to understand, as it follows the same process a human follows while
making a decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may
increase.

Introduction to Entropy in Machine Learning

Entropy is defined as the randomness or disorder of the information being processed in
machine learning. In other words, entropy is the machine learning metric that measures
the unpredictability or impurity in the system.

When information is processed in the system, every piece of information has a specific
value and can be used to draw conclusions. If it is easy to draw a valuable conclusion from
a piece of information, entropy is low; if entropy is high, it is difficult to draw any
conclusion from that piece of information.


Entropy is a useful tool in machine learning to understand various concepts such as feature
selection, building decision trees, and fitting classification models, etc.

What is Entropy in Machine Learning

Entropy is the measurement of disorder or impurity in the information processed in
machine learning. It determines how a decision tree chooses to split data.

We can understand entropy with a simple example: flipping a coin. When we flip a coin,
there can be two outcomes, but it is difficult to predict the exact outcome because there is
no direct relation between the flip and its result. With a 50% probability for each outcome,
entropy is at its highest. This is the essence of entropy in machine learning.

Mathematical Formula for Entropy

Consider a data set having a total of N classes; then the entropy E can be determined with
the formula below:

E = −∑ᵢ₌₁ᴺ Pᵢ · log₂(Pᵢ)

where Pᵢ is the probability of randomly selecting an example in class i.

For a two-class dataset, entropy lies between 0 and 1; depending on the number of classes
in the dataset, it can be greater than 1. In either case, a high value indicates a high level of
disorder.

Let's understand it with an example where we have a dataset having three colors of fruits as
red, green, and yellow. Suppose we have 2 red, 2 green, and 4 yellow observations
throughout the dataset. Then as per the above equation:

E = −(Pr·log₂(Pr) + Pg·log₂(Pg) + Py·log₂(Py))

Where;

Pr = Probability of choosing red fruits;

Pg = Probability of choosing green fruits and;


Py = Probability of choosing yellow fruits.

Pr = 2/8 = 1/4 [as 2 of the 8 observations are red fruits]

Pg = 2/8 = 1/4 [as 2 of the 8 observations are green fruits]

Py = 4/8 = 1/2 [as 4 of the 8 observations are yellow fruits]

Now our final equation becomes:

E = −((1/4)·log₂(1/4) + (1/4)·log₂(1/4) + (1/2)·log₂(1/2)) = 0.5 + 0.5 + 0.5 = 1.5

So, the entropy will be 1.5.
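As a quick check, the same value can be computed in a few lines of NumPy:

```python
import numpy as np

# The fruit example: 2 red, 2 green, and 4 yellow out of 8 observations.
p = np.array([2/8, 2/8, 4/8])
print(-np.sum(p * np.log2(p)))  # -> 1.5
```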

Let's consider a case when all observations belong to the same class; then entropy will
always be 0.

E = −(1 · log₂(1))

= 0

When entropy becomes 0, the dataset has no impurity. A dataset with no impurity offers
nothing left to learn from a split, whereas a dataset with entropy 1 (a maximally mixed
two-class set) is good for learning.

Precision and Recall in Machine Learning

While building any machine learning model, the first thing that comes to mind is how to
build an accurate, 'good fit' model and what challenges will come up during the procedure.
Precision and recall are the two most important, but often confused, concepts in machine
learning. They are performance metrics used for pattern recognition and classification.
These concepts are essential for building a good machine learning model that gives precise
and accurate results. Some models require more precision and some require more recall,
so it is important to understand the balance between the two, or simply, the precision-
recall trade-off.

Confusion Matrix in Machine Learning


Confusion Matrix helps us to display the performance of a model or how a model has made
its prediction in Machine Learning.

Confusion Matrix helps us to visualize where our model gets confused in discriminating
between two classes. It can be understood well through a 2×2 matrix, where one dimension
represents the actual truth labels and the other represents the predicted labels.

This matrix consists of 4 main elements that count the correct and incorrect predictions.
Each element is named with one word from each of the following pairs:

o True or False
o Positive or Negative

If the predicted and truth labels match, then the prediction is said to be correct, but when
the predicted and truth labels are mismatched, then the prediction is said to be incorrect.
Further, positive and negative represents the predicted labels in the matrix.

There are four metrics combinations in the confusion matrix, which are as follows:

o True Positive: The model correctly classifies a positive sample as positive.
o False Negative: The model incorrectly classifies a positive sample as negative.
o False Positive: The model incorrectly classifies a negative sample as positive.
o True Negative: The model correctly classifies a negative sample as negative.

Hence, the confusion matrix summarizes all four prediction outcomes of a binary
classification problem.

Now we can understand the concepts of Precision and Recall.


What is Precision?

Precision is defined as the ratio of correctly classified positive samples (True Positives) to
the total number of samples classified as positive (whether correctly or incorrectly):

Precision = True Positive / (True Positive + False Positive)
Precision = TP / (TP + FP)

o TP - True Positive
o FP - False Positive

o The precision of a machine learning model will be low when the denominator
TP + FP is much larger than TP, i.e., when there are many false positives.
o The precision of the machine learning model will be high when TP is close to
TP + FP, i.e., when there are few false positives.

Hence, precision helps us to gauge how reliable the model is when it classifies a sample
as positive.

Examples to calculate the Precision in the machine learning model

Below are some examples for calculating Precision in Machine Learning:

Case 1 - In this scenario, the model correctly classified two positive samples while
incorrectly classifying one negative sample as positive. Hence, according to the precision
formula:

Precision = TP/(TP + FP) = 2/(2 + 1) = 2/3 = 0.667

Case 2- In this scenario, we have three Positive samples that are correctly classified, and one
Negative sample is incorrectly classified.

Putting TP = 3 and FP = 1 in the precision formula, we get:

Precision = TP/(TP + FP) = 3/(3 + 1) = 3/4 = 0.75

Case 3 - In this scenario, we have three positive samples that are correctly classified and
no negative sample that is incorrectly classified.

Putting TP = 3 and FP = 0 in the precision formula, we get:

Precision = TP/(TP + FP) = 3/(3 + 0) = 3/3 = 1

Hence, in the last scenario, we have a precision value of 1, or 100%, since every sample
classified as positive is truly positive and no negative sample is incorrectly classified as
positive.
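The three cases can be reproduced with a trivial helper (the TP/FP counts come straight from the scenarios above):

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

print(precision(tp=2, fp=1))  # Case 1 -> 0.667
print(precision(tp=3, fp=1))  # Case 2 -> 0.75
print(precision(tp=3, fp=0))  # Case 3 -> 1.0
```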

What is Recall?

Recall is calculated as the ratio of the number of positive samples correctly classified as
positive to the total number of positive samples. Recall measures the model's ability to
detect positive samples: the higher the recall, the more positive samples are detected.

Recall = True Positive / (True Positive + False Negative)
Recall = TP / (TP + FN)

o TP - True Positive
o FN - False Negative

o The recall of a machine learning model will be low when the denominator TP + FN
is much larger than TP, i.e., when many positive samples are missed.
o The recall of a machine learning model will be high when TP is close to TP + FN,
i.e., when few positive samples are missed.

Unlike precision, recall is independent of how the negative samples are classified. Further,
if the model classifies all positive samples as positive, then recall will be 1.

Examples to calculate the Recall in the machine learning model

Below are some examples for calculating Recall in machine learning as follows

Example 1 - Let's understand the calculation of recall with four different cases where each
case has the same recall of 0.667 but differs in the classification of negative samples. See
how:

In this scenario, the classification of the negative samples is different in each case. Case A
has two negative samples classified as negative, and Case B has two negative samples
classified as negative; Case C has only one negative sample classified as negative, while
Case D does not classify any negative sample as negative.

However, recall is independent of how the negative samples are classified; hence, we can
neglect the negative samples and consider only the samples related to the positive class.

In each case above, two positive samples are correctly classified as positive, while one
positive sample is incorrectly classified as negative. Hence, the true positive count is 2 and
the false negative count is 1. Then recall will be:

Recall = TP/(TP + FN) = 2/(2 + 1) = 2/3 = 0.667

Example-2

Now, we have another scenario where all positive samples are classified correctly as
positive. Hence, the true positive count is 3 while the false negative count is 0.

Recall = TP/(TP + FN) = 3/(3 + 0) = 3/3 = 1


Difference between Precision and Recall in Machine Learning

o Precision measures the ability of the model to classify positive samples reliably,
while recall measures how many of the actual positive samples were correctly
classified by the ML model.
o While calculating precision, we consider all samples that are classified as positive,
whether correctly or incorrectly; while calculating recall, we need only the positive
samples, and all negative samples are neglected.
o When a model classifies most of the positive samples correctly but also produces
many false positives, it is said to be a high-recall, low-precision model; when a
model labels only a few samples as positive but those are mostly correct, it is said
to be a high-precision, low-recall model.
o The precision of a machine learning model depends on both the negative and the
positive samples, whereas recall depends only on the positive samples and is
independent of the negative samples.

If the recall is 100%, the model has detected all positive samples as positive, regardless of
how the negative samples are classified. However, the model could still misclassify many
negative samples as positive; recall simply neglects those, which can leave a high false
positive count in the model.

Example-3

In this scenario, the model does not identify any positive sample as positive: all positive
samples are incorrectly classified as negative. Hence, the true positive count is 0 and the
false negative count is 3. Then recall will be:

Recall = TP/(TP + FN) = 0/(0 + 3) = 0/3 = 0

This means the model has not correctly classified any Positive Samples.
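The three recall examples can likewise be reproduced with a small helper; note that false positives never appear in the formula:

```python
def recall(tp, fn):
    """Recall = TP / (TP + FN); negative samples never enter the formula."""
    return tp / (tp + fn)

print(recall(tp=2, fn=1))  # Example 1 -> 0.667
print(recall(tp=3, fn=0))  # Example 2 -> 1.0
print(recall(tp=0, fn=3))  # Example 3 -> 0.0
```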


Why use Precision and Recall in Machine Learning models?


This question is very common among machine learning engineers and data researchers.
The use of precision and recall varies according to the type of problem being solved.

o Use Precision when we care about every sample the model classifies as positive,
whether correctly or incorrectly, i.e., when false positives are costly.
o On the other end, if our goal is to detect all positive samples, then use Recall. Here,
we do not care how the negative samples are classified, i.e., when false negatives
are costly.
