
Classification

Summer 2024
© IIT Roorkee India

Dr. Sharma T.
[Course outline]

Data Analytics
- Exploration (explaining the past): Univariate, Bivariate
- Modeling (predicting the future): Classification, Regression, Clustering

Classification
- Classification algorithms
  - Linear classifiers (e.g. Logistic Regression)
  - Non-linear classifiers (e.g. K-Nearest Neighbors)
  - Support Vector Machines
  - Neural Networks
- Model Selection and Evaluation
  - Training, testing and validation datasets
  - Metrics for evaluating classification models
- Handling Imbalanced Datasets
  - Strategies for handling imbalanced datasets (undersampling, oversampling,
    class weighting)
[Comparison of modeling tasks]
- Classification: outputs a category; objective is predictive analysis; uses
  labeled data.
- Regression: outputs a real value; objective is predictive analysis; uses
  labeled data.
- Clustering: outputs patterns within a group of uncategorized data; objective
  is pattern recognition; uses unlabeled data.
- Association: identifies associations and wider dependencies between
  different data objects; objective is pattern recognition; uses unlabeled
  data.
Introduction to Classification
What?
• Classification is a fundamental concept in the field of machine learning.
• It involves identifying the category or class to which a new observation
belongs, based on a set of labeled training data.
• It is a supervised learning technique that is used to categorize or label
a set of data into different classes or categories.

Types of classification
1) Binary Classification: predicting one of two possible outcomes, typically
represented by 1 and 0, True and False, or Positive and Negative.
− For example, classifying an email as spam or not spam, or diagnosing a patient as
having a disease or not.
2) Multiclass Classification: predicting one of more than two possible
outcomes.
− For example, classifying an object as a car, bicycle, or motorcycle, or recognizing
different types of fruits.
3) Multilabel Classification: predicting one or more outcomes for each
sample. In other words, each sample can belong to multiple categories or
classes at the same time.
− For example, classifying a movie as belonging to multiple genres such as action,
comedy, and drama.
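
A quick illustration (not from the original slides) of how the target labels
differ across the three settings; the data below is made up:

import numpy as np

# Binary: one label per sample, two possible values
y_binary = np.array([0, 1, 1, 0])            # e.g. spam / not spam

# Multiclass: one label per sample, more than two possible values
y_multiclass = np.array([0, 2, 1, 2])        # e.g. car / bicycle / motorcycle

# Multilabel: several labels per sample, encoded as an indicator matrix
# (each row marks every genre a movie belongs to)
y_multilabel = np.array([[1, 0, 1],          # action + drama
                         [0, 1, 0],          # comedy only
                         [1, 1, 1]])         # all three genres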

Real-world applications of classification
− Image classification: recognizing objects or people in images and
categorizing them into specific classes
− Spam filtering: classifying emails as spam or not spam
− Medical diagnosis: diagnosing diseases based on symptoms and test
results
− Credit risk assessment: predicting the likelihood of a loan default
based on various factors such as credit history, income, and job
stability

− Sentiment analysis: classifying the sentiment of a piece of text as
positive, negative, or neutral
− Customer segmentation: dividing customers into different groups
based on their purchasing behavior and demographics
− Fraud detection: identifying fraudulent transactions in financial systems
− Marketing: classifying customers based on their likelihood to respond
to a marketing campaign, or to purchase a certain product or service.

Basic Terminology

Feature and Target Variables
• Feature variables: (also called predictors, inputs, or attributes) are the
variables used to describe an instance (such as an individual, item, or
event).
− These features are used to build a model that makes predictions about the
target variable (also called response, label, or output).
• Target variable: is the variable that we want to predict based on the
feature variables.
− In a classification problem, the target variable is categorical (e.g. Yes/No,
A/B/C), while in regression problems the target variable is continuous (e.g.
age, salary, height).

Examples
• In a housing price prediction dataset, the feature variables could be
the number of bedrooms, square footage, neighborhood, and so on,
while the target variable would be the price of the house.

• In a medical diagnosis dataset, the feature variables could be patient
symptoms, medical history, and test results, while the target variable would
be the diagnosis (e.g. flu, pneumonia, etc.).
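
To make the split between features and target concrete, here is a minimal
sketch in pandas using the housing example above; the column names and values
are made up:

import pandas as pd

# Hypothetical housing data: 'price' is the target, the rest are features
df = pd.DataFrame({
    "bedrooms": [2, 3, 4],
    "sqft":     [850, 1200, 1800],
    "price":    [150000, 230000, 310000],
})

X = df.drop(columns=["price"])   # feature variables (predictors / inputs)
y = df["price"]                  # target variable (response / label)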

Model Training
This is the process of building a machine learning model using a training
dataset.

The model is trained to learn the relationship between the features (input
variables) and the target variable.

This process involves selecting an appropriate algorithm, defining the
hyperparameters, and fitting the model to the training data.

Prediction
Once the model is trained, it can be used to make predictions on new,
unseen data.

During prediction, the feature values are input into the model, and the
target variable is predicted based on the learned relationship.
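
A minimal sketch of the train-then-predict workflow with scikit-learn; the
synthetic dataset and the choice of logistic regression are illustrative
assumptions, not part of the original slides:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data: features X, target y
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Training: choose an algorithm and fit it to the training data
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Prediction: feed new feature values into the trained model
X_new = X[:5]                       # stand-in for genuinely new, unseen data
print(model.predict(X_new))         # predicted classes
print(model.predict_proba(X_new))   # predicted class probabilities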

Overfitting and Underfitting
• Overfitting and underfitting are two common issues faced while
training machine learning models.

• Overfitting occurs when a model is trained too well on the training data and
fits the noise in the data instead of the underlying pattern.
• As a result, it performs well on the training data but poorly on the unseen
data or validation data.
• Overfitting can be identified by having a high accuracy on the training data but
a low accuracy on the validation data.

• Underfitting, on the other hand, occurs when a model is not complex
enough to capture the underlying pattern in the data. It results in a
low accuracy on both the training and validation data.

It is important to strike a balance between overfitting and underfitting to
build an effective model.
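
A small sketch of how this balance shows up in practice, assuming a decision
tree whose max_depth controls complexity (synthetic data; not from the
slides). A large gap between training and validation accuracy signals
overfitting; low accuracy on both signals underfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):   # too simple, moderate, unconstrained
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 2),  # training accuracy
          round(tree.score(X_val, y_val), 2))      # validation accuracy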

[Figure: illustration of overfitting and underfitting]
Bias and Variance
• Bias and variance are two important concepts in machine learning
that describe the error in a model's predictions.

• Bias refers to the error that is introduced by assuming that the relationship
between the features and target is too simple.
• A model with high bias pays little attention to the training data and
oversimplifies the relationship between the features and target.
• As a result, it often has a high training error and a high test error.

• Variance, on the other hand, refers to the error that is introduced by
the model being too complex and fitting the training data too closely.
• A model with high variance pays too much attention to the training data and
overfits it, capturing the noise in the data as well as the underlying
relationship.
• As a result, it has a low training error but a high test error.

The goal in building a machine learning model is to find a balance between
bias and variance to minimize the total error. This is often referred to as
the bias-variance trade-off.

Classification Algorithms

Linear classifiers
• A linear classifier is a machine learning algorithm that uses a linear
function to separate data into different classes.
• The goal of a linear classifier is to find the hyperplane (a line or a plane
in high-dimensional space) that best separates the data into their
respective classes.
• The hyperplane is defined by a set of coefficients that are estimated during
the training phase.
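
A minimal sketch of a linear classifier's decision rule; the coefficients w
and intercept b below are made up, standing in for values estimated during
training:

import numpy as np

w = np.array([0.8, -0.4])   # hyperplane coefficients (one per feature)
b = 0.1                     # intercept

def linear_classify(x):
    # The sign of the linear function decides which side of the
    # hyperplane the observation falls on, and hence its class.
    return 1 if np.dot(w, x) + b >= 0 else 0

print(linear_classify(np.array([1.0, 0.5])))    # -> 1
print(linear_classify(np.array([-1.0, 2.0])))   # -> 0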

Examples of linear classifiers include
• Logistic Regression,
• Support Vector Machines (SVM) with linear kernels, and
• Linear Discriminant Analysis (LDA).

These algorithms make predictions based on the values of the features and the
coefficients of the hyperplane, which are used to determine the class of an
observation.

Logistic Regression: Definition
• Logistic Regression is a popular supervised machine learning
algorithm used for binary classification problems.

• In logistic regression, the target variable is binary and the prediction is
made based on the relationship between the independent (or feature) variables
and the dependent (or target) variable.

• The main objective of logistic regression is to find the best fitting model
(i.e., a line or hyperplane) that separates the classes in the feature space.

Logistic Regression: How it works
• The algorithm works by modeling the probability of an event occurring (e.g.,
a customer buying a product) using a sigmoid function (the logistic function).

• The output of the logistic regression model is a probability score between
0 and 1, which can then be used to make a binary classification.

The logistic (sigmoid) function is f(x) = 1 / (1 + e^(-x)), where:

• x is a linear combination of one or more features in the dataset.
• f(x) is a probability between 0 and 1.
• For example, if the output of the function is above 0.5, the output is
considered as 1. On the other hand, if the output is less than 0.5, the
output is classified as 0.
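
A minimal sketch of this function and the 0.5 threshold; the input value is
made up:

import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = 1.2                       # a linear combination of features, e.g. w.x + b
p = sigmoid(x)                # probability of the positive class
label = 1 if p > 0.5 else 0   # apply the 0.5 decision threshold
print(p, label)               # ~0.769 -> class 1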

Model Selection and Evaluation

Training, Testing and Validation datasets
Training, testing, and validation datasets are used in the process of
developing and evaluating a machine learning model.

1. Training dataset: This dataset is used to train the machine learning model.
− The model is trained by fitting the model to the training data. The model
learns the patterns in the data and uses them to make predictions.

2. Testing dataset: This dataset is used to evaluate the performance of the
machine learning model after it has been trained.
− The model is presented with new, unseen data and it makes
predictions based on what it has learned from the training data.
− The accuracy of these predictions is then used to evaluate the
performance of the model.

3. Validation dataset: This dataset is used to tune the hyperparameters of the
machine learning model.
− The model is trained on the training data and then evaluated on the
validation data.
− The hyperparameters are adjusted based on the performance of the
model on the validation data.
− This helps to prevent overfitting of the model to the training data.
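
A common way to produce the three datasets is two successive splits. A minimal
sketch with scikit-learn, assuming an illustrative 60/20/20 split:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off the test set, then carve a validation set out of the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

# Result: 60% train, 20% validation, 20% test
print(len(X_train), len(X_val), len(X_test))   # 600 200 200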

Metrics for evaluating classification models
Classification models can be evaluated using a variety of metrics, depending
on the specific use case and requirements.

Some of the most commonly used metrics are:

1. Accuracy
2. Confusion matrix
3. Precision
4. Recall
5. F1 score
6. ROC curve (Receiver Operating Characteristic)
7. AUC (Area Under the Curve)
Accuracy
Ratio of correct predictions made by a classifier to the total number of
predictions made

Accuracy = Number of correct predictions / Total number of predictions

Confusion Matrix
A 2-D table that shows the number of true positive, true negative, false
positive, and false negative predictions made by the model.

The entries in the matrix can then be used to calculate various performance
metrics, such as precision, recall, F1-score, and AUC for the ROC curve.
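
A minimal sketch using scikit-learn's confusion_matrix; the labels and
predictions are made up:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# scikit-learn convention: rows are actual classes, columns are predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[3 1]
                                          #  [1 3]]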
Precision
The number of true positive predictions (i.e. positive predictions that are
actually correct) divided by the total number of positive predictions made by
the model.

Precision = True Positives / (True Positives + False Positives)

Recall
The proportion of actual positive instances that are correctly classified
as positive by the model.

Also called TPR (True Positive Rate) or Sensitivity.

Recall = True Positives / (True Positives + False Negatives)

F1 score
A metric that combines precision and recall. It is calculated as the
harmonic mean of precision and recall.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)

The F1 Score ranges between 0 and 1, with 1 being the best possible
score and 0 the worst.
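
A minimal sketch computing the metrics above with scikit-learn, reusing the
made-up labels from the confusion-matrix example (TP=3, FP=1, FN=1, TN=3):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # 6/8           = 0.75
print(precision_score(y_true, y_pred))   # 3/(3+1)       = 0.75
print(recall_score(y_true, y_pred))      # 3/(3+1)       = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean = 0.75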

ROC curve
• A graphical representation of the performance of a binary
classification model as the discrimination threshold (probability threshold)
is varied.

• It plots the true positive rate (TPR) against the false positive rate
(FPR) at various threshold settings.

• The ROC curve is a useful tool for evaluating the trade-off between
the true positive rate and the false positive rate of a classifier.

• FPR = FP / (FP + TN), i.e. the probability of a false alarm
• TPR = TP / (TP + FN), i.e. the probability of detection

AUC
• The interpretation of the ROC curve is based on the Area Under the
Curve (AUC), which summarizes the overall performance of the
model.

• An AUC of 1 indicates a perfect model, while an AUC of 0.5 represents a
random model.

• A higher AUC value indicates a better performance, with a larger area under
the curve meaning a greater balance between TPR and FPR.
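
A minimal sketch of computing ROC points and the AUC with scikit-learn; the
labels and predicted probabilities are made up:

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]               # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)                          # points on the ROC curve
print(roc_auc_score(y_true, y_scores))   # area under the curve, ~0.89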

Demo: Logistic Regression
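
The demo code itself did not survive extraction; below is a minimal
end-to-end sketch of what such a demo typically looks like, assuming
scikit-learn's built-in breast-cancer dataset (the original demo may have
used different data):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in binary classification dataset (malignant vs. benign tumors)
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale the features, then fit logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate with the metrics covered above
y_pred  = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_proba))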

