Lecture 11 - 09.09.24 Classification Part 1

Classification - 1

Prof. Sashikumaar Ganesan, IISc Bangalore


Agenda

• Introduction to Classification
• Logistic Regression
• Decision Trees
• Classification Metrics
• Confusion Matrix
• Q&A
Introduction to Classification and Classification Algorithms
Classification

Classification is a type of supervised learning where the goal is to predict the categorical class labels of new instances based on labelled training data.

[Figures: location of classification on the ML tree; a sample classification]


Introduction to Classification

Different Classification Algorithms

Logistic Regression
It is a simple and widely used method for binary classification. It
uses a logistic function to model the probability of the output
class.

Decision Trees
A decision tree is a hierarchical model that partitions the
feature space into a set of rectangular regions. Each leaf
node represents a class label.

Image Credits : Logistic regression – statquest , Decision Tree: cs.cmu.edu


Introduction to Classification

Different Classification Algorithms – Contd.

Random Forest
• Random forest is an ensemble method that combines
multiple decision trees to improve the prediction
accuracy.
• It creates a set of decision trees on randomly selected
subsets of the training data and then combines their
predictions.
Naive Bayes
• It is a probabilistic classifier that uses Bayes'
theorem to predict the class label of a new
instance.
• Naive Bayes assumes that the features are
conditionally independent given the class label.

Image Credits : Random Forest – wikipedia , Naive Bayes: https://medium.com/analytics-vidhya/na%C3%AFve-bayes-algorithm-5bf31e9032a2
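To make these four algorithms concrete, here is a minimal sketch that trains each of them on the same data; the synthetic dataset and parameter choices below are illustrative assumptions, not taken from the lecture.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB

# Illustrative synthetic binary-classification dataset
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Naive Bayes": GaussianNB(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))  # mean accuracy on held-out data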


Introduction to Classification

Different Classification Algorithms – Contd.

Support Vector Machines (SVM)


• SVM is a powerful classification method that constructs a hyperplane in a high-dimensional space to separate the classes.
• SVMs maximize the margin between the hyperplane and the closest points from each class.

Image Credits : wikipedia


Introduction to Classification
SVM Use Cases

When to use SVM?
• Binary classification with clear separation
• High-dimensional, small/medium datasets

Types of SVM:
• Linear SVM
• Non-linear SVM (kernel-based)
• Support Vector Regression (SVR)

When not to use SVM?
• Very large or imbalanced datasets
• When probabilistic outputs are needed

Image Credits : wikipedia
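As a rough sketch of the SVM types listed above (the toy data and kernel settings are my own illustration):

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

linear_svm = SVC(kernel="linear")            # Linear SVM
rbf_svm = SVC(kernel="rbf", gamma="scale")   # Non-linear (kernel-based) SVM

for model in (linear_svm, rbf_svm):
    model.fit(X, y)
    print(model.kernel, model.score(X, y))

Note that SVC returns decision scores rather than probabilities by default; setting probability=True fits an extra calibration step, which is one reason SVMs are less convenient when probabilistic outputs are needed.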


Introduction to Classification

Binary Classification

• Binary classification has only two labels for classification
• Relevant examples for this model are
• Email – spam/not spam
• Anomaly / not Anomaly
• Popular algorithms used for binary classification
• Naïve Bayes
• Logistic Regression
• Support Vector Machines (SVM)
Introduction to Classification

Multi-Class Classification
• Multi-class classification has more than two labels for
classification
• Relevant examples for this model are
• Handwritten digits recognition
• Face expression classification
• Popular algorithms used for multi-class classification
• Naïve Bayes
• Random Forest
• Decision trees
• SVM
Introduction to Classification

Multi-Label Classification

• Multi-label classification involves two or more class labels, where one or more class labels may be predicted for each example.
• Relevant examples for this model are
• Image classification with multiple objects on image
• Popular Multilabel algorithms
• Multi label decision trees
• Multi Label Random forests
Image Credits : https://towardsdatascience.com/yolo-object-detection-with-opencv-and-python-21e50ac599e9
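A minimal multi-label sketch (synthetic data; the choice of a random forest, which in scikit-learn accepts a multi-hot label matrix directly, is my assumption rather than the lecture's):

from sklearn.datasets import make_multilabel_classification
from sklearn.ensemble import RandomForestClassifier

# Each row of Y is a multi-hot vector: a sample may carry several labels at once
X, Y = make_multilabel_classification(n_samples=200, n_classes=4, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X, Y)
print(Y[0])                   # e.g. [0 1 1 0] - two labels on one sample
print(clf.predict(X[:1])[0])  # the prediction is also a multi-hot vector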
Practical Example
Multi-Label vs Multi-Class -> Practical Example (Songs)

Multi-Class:
• Placing all your songs into a specific folder, such as by year or by music director
• Once placed, the song will belong to (or will be inside) that specific folder only

Multi-Label:
• Tagging your songs in your media player under different playlists
• The song can be part of multiple playlists
Classification

Summary

• Classification is a type of supervised learning where the goal is to predict the categorical class labels
• There are different types of classification models, such as decision trees, random forests, logistic regression, and SVMs
• The most used types of classification are binary, multi-class, and multi-label
• In multi-label classification, an object can be associated with multiple classes based on a probability distribution
Test your understanding

1. True or False: In multi-class classification, each instance can belong to only one class.

2. True or False: Multi-label classification allows an instance to belong to multiple classes simultaneously.

3. Which of the following is NOT typically used as a classification algorithm?
a) Random Forest b) Support Vector Machine c) Linear Regression d) Naive Bayes
Solutions

1. Answer: True. Explanation: In multi-class classification, each instance is assigned to exactly one class out of three or more possible classes. This is different from multi-label classification, where an instance can belong to multiple classes simultaneously.

2. Answer: True Explanation: In multi-label classification, each instance can be associated with
multiple classes at the same time. For example, a movie could be classified as both "action"
and "comedy".

3. Answer: Linear Regression Explanation: Linear Regression is primarily used for regression
tasks, where the goal is to predict a continuous numerical value. It's not typically used for
classification problems, which involve predicting discrete class labels.
Logistic Regression
Logistic Regression

• Logistic regression is a statistical method used for binary or multi-class classification problems
• The logistic regression model is based on the logistic
function, which maps any real-valued input to a
probability value between 0 and 1.

Image Credits : Statquest


Logistic Regression

How is it different from linear regression?

• Linear regression maps the input values to the output values (in continuous
domain)

• The logistic regression model is based on the logistic function, which maps any
real-valued input to a probability value between 0 and 1.
Logistic Regression

Solution: use a function of z that goes from 0 to 1: σ(z) = 1 / (1 + e^(−z)), where z = w∙x + b

Credits : Stanford – logistic regression


Logistic Regression

Idea of Logistic Regression

• Compute w∙x+b
• Pass it through the sigmoid function: σ(w∙x+b)
• Treat it as a probability

Here the value 0.5 is the decision boundary: if σ(w∙x+b) > 0.5, we predict the positive class; otherwise, the negative class.
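A small numerical sketch of these three steps (the weights, bias, and input below are made-up values for illustration):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical learned parameters and one input vector
w = np.array([0.5, -1.2, 0.8])
b = 0.1
x = np.array([1.0, 0.3, 2.0])

z = np.dot(w, x) + b          # step 1: compute w.x + b
p = sigmoid(z)                # step 2: map to a probability in (0, 1)
label = int(p > 0.5)          # step 3: apply the 0.5 decision boundary
print(z, p, label)            # 1.84, ~0.86, 1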
Logistic Regression

Idea of Logistic Regression

[Figure omitted. Credits: Stanford – logistic regression]


Test your understanding

1. Which of the following is the standard activation function used in Logistic Regression?
a) ReLU b) Sigmoid c) Tanh d) Softmax

2. Which of the following loss functions is typically used in Logistic Regression?
a) Mean Squared Error b) Cross-Entropy Loss c) Hinge Loss d) Huber Loss

3. A Logistic Regression model outputs a probability of 0.7 for a certain instance. If the
classification threshold is 0.6, how will this instance be classified in a binary problem?
Solutions

1. Answer: b) Sigmoid. Explanation: The sigmoid function is used to transform the output to a probability between 0 and 1.

2. Answer: b) Cross-Entropy Loss. Explanation: Cross-Entropy Loss (also known as Log Loss) is the standard loss function for Logistic Regression, as it measures the performance of a classification model whose output is a probability value between 0 and 1.

3. Answer: The instance will be classified as the positive class (typically labeled as 1). Explanation: Since 0.7 > 0.6 (the threshold), the model predicts the positive class for this instance.
Accuracy of the Classification Model
Accuracy

MNIST Handwritten Dataset

• A set of small images of digits handwritten by high school students and employees of the US Census Bureau
• Generally referred to as the "hello world" problem of ML classification
• 70,000 datapoints with 10 classes, each image of size 28x28 (784 features)

Image Credits : Statquest


Accuracy

Let's look at a binary classification model

Let's build a basic binary classification model using the MNIST dataset, where we must predict whether a digit is a "4" or not.

Positive class: images of the digit 4. Negative class: images of all other digits.

Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Géron, 3rd Edition
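Following the approach in Géron's book (cited above), a sketch of this "4 or not 4" detector could look like the following; the choice of SGDClassifier and the exact settings are my assumptions.

from sklearn.datasets import fetch_openml
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

mnist = fetch_openml("mnist_784", version=1, as_frame=False)
X, y = mnist.data, mnist.target           # y holds string labels "0".."9"
X_train, y_train = X[:60000], y[:60000]   # conventional MNIST train split

y_train_4 = (y_train == "4")              # True for 4s, False otherwise

sgd_clf = SGDClassifier(random_state=42)
print(cross_val_score(sgd_clf, X_train, y_train_4, cv=3, scoring="accuracy"))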
Accuracy

How well did the model perform?

• Based on three-fold cross-validation, our model has obtained an accuracy of 96.47%
• Which seems like a reasonably good result for the given model
But is this metric sufficient?

Did it quantify the model properly?

Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Géron, 3rd Edition
Accuracy

Let's build our custom classifier


This classifier never predicts anything as "positive": it classifies every image as "not a 4" (negative).

Reference : Hands of Machine learning with scikit learn keras and tensorflow – Geron 3rd Edition
Accuracy

How did my custom model perform?

Prediction accuracy of the "Never 4" model: 90.3%

Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Géron, 3rd Edition
Accuracy

How did this happen?

• To really understand this, we need to look at the distribution of class labels within our dataset
• More than 90% of the data belongs to the negative class (digits other than 4)
• This means a model that predicts every instance as negative will be about 90% accurate

Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Géron, 3rd Edition
Accuracy

So can't I use accuracy to measure performance?

• Accuracy is still a valid measure of the performance of a classification model
• However, accuracy is not the preferred metric when we have a highly imbalanced dataset
• For example: credit card fraud detection (where 99% of the data is not fraud)
• Therefore, we need a metric that takes the different types of misclassification into account (actual positive predicted as negative, or actual negative predicted as positive)
Accuracy

Summary

• Accuracy is the ratio of the number of correct classifications to the total number of classifications
• Accuracy can give a misleading picture of the quality of the model when the dataset is imbalanced
Confusion Matrix
Confusion Matrix

What is a confusion matrix ?

The confusion matrix (also called the error matrix) summarizes the performance of a classification algorithm by tabulating predicted labels against actual labels

Reference: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow – Géron, 3rd Edition
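A minimal sketch of computing a binary confusion matrix with scikit-learn (the label vectors are toy values of my own choosing):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual labels
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # model predictions

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[3 1], [1 3]] here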
Confusion Matrix

Precision
• Assesses the accuracy of positive predictions made by a model
• Ratio of true positive predictions to the total number of positive
predictions (both true positives and false positives)
• Gauges the proportion of correctly predicted positive instances out of all
instances the model predicted as positive
• Valuable for evaluating the model's capability to avoid making incorrect
positive predictions and to minimize false positives

Precision = TP / (TP + FP)
Confusion Matrix

Recall
• Ability of a model to correctly identify all relevant instances in the dataset
• Ratio of true positive predictions to the total number of actual positive
instances
• Quantifies the model's effectiveness in capturing and "recalling" instances of a
particular class, thereby providing insight into its ability to minimize false
negatives (missed instances)

Recall = TP / (TP + FN)
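Continuing the toy confusion-matrix example above, the two formulas can be checked directly:

from sklearn.metrics import precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

print(precision_score(y_true, y_pred))  # TP / (TP + FP) = 3 / 4 = 0.75
print(recall_score(y_true, y_pred))     # TP / (TP + FN) = 3 / 4 = 0.75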
Confusion Matrix

Understanding Precision and Recall

• Precision can be made 100% by making just a single positive prediction and ensuring it is correct (since FP = 0, precision will be 1)
• Recall is also known as sensitivity or the True Positive Rate
• Recall can be increased by reducing the number of False Negatives (i.e., mistakenly classifying a positive instance as negative)
Confusion Matrix

Understanding Precision and Recall – Contd.


Scenario:

• An operator at a radar screen identifies enemy missiles as dots. However, occasionally, the radar may also display dots for flocks of birds or other obstacles.

High-precision operator:
• Will mark dots on the screen as missiles cautiously to avoid false positives, ensuring only actual missiles are selected.
• Would have chosen very few dots as missiles.

High-recall operator:
• Will choose most dots as missiles to avoid missing actual missiles (false negatives).
• Would have selected numerous dots as missiles.
Confusion Matrix

Understanding Precision and Recall – Contd.


Need for high precision

• Instances where reducing false positives is essential include scenarios like ensuring the
safety of videos for children.
• While allowing a few false negatives (labeling a child-safe video as unsafe) might be acceptable, the priority is to prevent any non-child-safe video from being falsely marked as safe (a false positive).
Need for high recall
• Instances where reducing false negatives is crucial include situations such as identifying
shoplifters in a high-end jewellery store's surveillance video.
• In this case, the priority is to minimize instances where shoplifters are not detected (false
negatives).
• This might involve adjusting the model to accept more false positives (misidentifying non-shoplifters as shoplifters) as a trade-off.
Confusion Matrix

F1 Score

• The F1 score is a single metric that combines precision and recall, offering a balanced evaluation of a model's binary classification performance.
• Harmonic Mean: It is calculated as the harmonic mean of precision and recall, providing a measure that considers both false positives and false negatives:

F1 = 2 × (Precision × Recall) / (Precision + Recall)
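A quick check of the harmonic-mean formula against scikit-learn, reusing the same toy labels as the earlier sketches:

from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [0, 0, 1, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]

p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(2 * p * r / (p + r))       # harmonic mean computed by hand
print(f1_score(y_true, y_pred))  # matches scikit-learn's F1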
Confusion Matrix

F1 Score

• Range:
• The F1 score ranges between 0 and 1, with higher values
indicating better model performance.
• Class Imbalance:
• Useful when dealing with imbalanced datasets, where one
class is more prevalent than the other.
• Balancing Precision and Recall:
• Helps find a middle ground between identifying true positive
instances (precision) and capturing all relevant positive
instances (recall).
Test your understanding
1. In a binary classification problem, which of the following is NOT a component of a confusion
matrix? a) True Positives b) False Negatives c) True Negatives d) Actual Positives

2. True or False: Accuracy is always the best metric to evaluate a classification model, regardless
of class imbalance.

3. Which metric is defined as the ratio of correctly predicted positive samples to the total
predicted positive samples? a) Recall b) Precision c) F1-score d) Specificity

4. What does the area under the ROC curve (AUC-ROC) represent? a) The model's ability to
distinguish between classes b) The model's accuracy c) The model's precision d) The model's
recall

5. In a binary classification problem, a model achieves the following results: True Positives = 80, False Positives = 20, False Negatives = 10, True Negatives = 90. Calculate the model's precision and recall.
Solutions

1. Answer: d) Actual Positives Explanation: A confusion matrix typically contains True Positives, True
Negatives, False Positives, and False Negatives. "Actual Positives" is a sum of True Positives and False
Negatives, not a direct component of the matrix.

2. Answer: False Explanation: Accuracy can be misleading in cases of class imbalance. Other metrics like
precision, recall, or F1-score may be more appropriate depending on the problem and class distribution.

3. Answer: b) Precision Explanation: Precision is defined as TP / (TP + FP), where TP is True Positives and FP
is False Positives. It measures the accuracy of positive predictions.

4. Answer: a) The model's ability to distinguish between classes Explanation: AUC-ROC represents the
model's ability to distinguish between positive and negative classes across various thresholds. A higher
AUC indicates better discrimination.

5. Answer: Precision = TP / (TP + FP) = 80 / (80 + 20) = 0.8 or 80% Recall = TP / (TP + FN) = 80 / (80 + 10) =
0.889 or 88.9%
Multiclass Classification
Multiclass Classification

How can we build Multi-class classifier models

• We have learnt how binary classifiers classify data into either positive or negative class.
• However, not all binary classifiers, such as SVM and SGD classifiers, are inherently equipped to handle multi-class classification problems
• We can combine an ensemble of binary classification models to perform multi-class classification
• There are two main strategies:
• One vs All (One vs Rest, OVR)
• One vs One (OVO)
Multiclass Classification

One vs All

• For an N-class classification problem, we build N binary classification models
• In this case, the classification models will be
• Green vs [Red,Blue]
• Blue vs [Red,Green]
• Red vs [Blue,Green]

Image Reference : https://www.cc.gatech.edu/classes/AY2016/cs4476_fall/results/proj4/html/jnanda3/index.html


Multiclass Classification

One vs All – Contd.

• The final classification is based on the score (e.g., probability or decision-function value) that each binary classifier assigns to its own class
• For example:
• Green vs [Red,Blue] : 0.8
• Blue vs [Red,Green] : 0.4
• Red vs [Blue,Green] : -0.3
• Based on the scores, we classify the instance as Green

Image Reference : https://www.researchgate.net/figure/The-considered-one-vs-all-multiclass-classification-approach_fig2_257018675


Multiclass Classification

One vs All – Continued

• To use any binary classifier with the OVR strategy, we can make use of the OneVsRestClassifier from the sklearn module (see the sketch below)
• This method may perform poorly on imbalanced datasets, since we will be comparing one class against all the other classes combined
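A minimal sketch of the OVR strategy with OneVsRestClassifier (the base estimator and toy data are my choices):

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Three-class toy problem
X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)
print(len(ovr.estimators_))   # N = 3 binary classifiers, one per class
print(ovr.predict(X[:5]))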
Multiclass Classification

One vs One

• We build an ensemble of binary classification models, where we compare only two classes at a time
• To compare each class with every other class, the number of classifiers required is

Total Classifiers = N × (N − 1) / 2

Image Reference : https://www.sciencedirect.com/science/article/abs/pii/S0950705116301459


Multiclass Classification

One vs One

• For this example, where N = 3, we get a total of 3 classifiers:
• Green vs Blue
• Green vs Red
• Blue vs Red
• Each binary classifier predicts one class label; the majority vote among these labels is taken as the final class
• Models like SVM default to the OVO strategy, since SVMs scale poorly with training-set size and it is easier to train many smaller models than a few bigger ones (see the sketch below)
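A matching sketch of the OVO strategy with OneVsOneClassifier (same toy setup as the OVR sketch above):

from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=300, n_classes=3, n_informative=4,
                           random_state=0)

ovo = OneVsOneClassifier(LinearSVC()).fit(X, y)
print(len(ovo.estimators_))   # N*(N-1)/2 = 3 pairwise classifiers
print(ovo.predict(X[:5]))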
Multiclass Classification

Summary

• One vs One builds N*(N-1)/2 classifiers, while One vs All uses N classifiers for a multi-class classification problem with N classes
• For large datasets with multiple classes, OVR can be challenging, as it may lead to an imbalanced dataset for each binary classifier
• OVO deals with a comparatively smaller dataset per model (since only 2 classes are involved); however, the number of models is much higher than in OVR
