
Classification

Summer 2024
© IIT Roorkee India

Dr. Sharma T.
[Course outline]

Data Analytics
- Exploration (explaining the past): Univariate, Bivariate
- Modeling (predicting the future): Classification, Regression, Clustering

Classification
- Classification algorithms
  - Linear classifiers (e.g. Logistic Regression)
  - Non-linear classifiers (e.g. K-Nearest Neighbors)
  - Support Vector Machines
  - Neural Networks
- Model Selection and Evaluation
  - Training, testing and validation datasets
  - Metrics for evaluating classification models
- Handling Imbalanced Datasets
  - Strategies for handling imbalanced datasets (undersampling, oversampling,
    class weighting)
[Comparison of modeling tasks]
- Classification: outputs a category; objective is predictive analysis; uses
  labeled data.
- Regression: outputs a real value; objective is predictive analysis; uses
  labeled data.
- Clustering: outputs patterns within a group of uncategorized data; objective
  is pattern recognition; uses unlabeled data.
- Association: identifies associations and wider dependencies between
  different data objects; objective is pattern recognition; uses unlabeled
  data.
Introduction to Classification
What?
• Classification is a fundamental concept in the field of machine learning.
• It involves identifying the category or class to which a new observation
belongs, based on a set of labeled training data.
• It is a supervised learning technique that is used to categorize or label
a set of data into different classes or categories.

Types of classification
1) Binary Classification: predicting one of two possible outcomes, typically
represented by 1 and 0, True and False, or Positive and Negative.
− For example, classifying an email as spam or not spam, or diagnosing a patient as
having a disease or not.
2) Multiclass Classification: predicting one of more than two possible
outcomes.
− For example, classifying an object as a car, bicycle, or motorcycle, or recognizing
different types of fruits.
3) Multilabel Classification: predicting one or more outcomes for each
sample. In other words, each sample can belong to multiple categories or
classes at the same time.
− For example, classifying a movie as belonging to multiple genres such as action,
comedy, and drama.
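
A quick illustration (not from the original slides) of how the target labels
differ across the three settings; the data below is made up:

import numpy as np

# Binary: one label per sample, two possible values
y_binary = np.array([0, 1, 1, 0])            # e.g. spam / not spam

# Multiclass: one label per sample, more than two possible values
y_multiclass = np.array([0, 2, 1, 2])        # e.g. car / bicycle / motorcycle

# Multilabel: several labels per sample, encoded as an indicator matrix
# (each row marks every genre a movie belongs to)
y_multilabel = np.array([[1, 0, 1],          # action + drama
                         [0, 1, 0],          # comedy only
                         [1, 1, 1]])         # all three genres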

Real-world applications of classification
− Image classification: recognizing objects or people in images and
categorizing them into specific classes
− Spam filtering: classifying emails as spam or not spam
− Medical diagnosis: diagnosing diseases based on symptoms and test
results
− Credit risk assessment: predicting the likelihood of a loan default
based on various factors such as credit history, income, and job
stability

− Sentiment analysis: classifying the sentiment of a piece of text as
positive, negative, or neutral
− Customer segmentation: dividing customers into different groups
based on their purchasing behavior and demographics
− Fraud detection: identifying fraudulent transactions in financial systems
− Marketing: classifying customers based on their likelihood to respond
to a marketing campaign, or to purchase a certain product or service.

Basic Terminology

Feature and Target Variables
• Feature variables: (also called predictors, inputs, or attributes) are the
variables used to describe an instance (such as an individual, item, or
event).
− These features are used to build a model that makes predictions about the
target variable (also called response, label, or output).
• Target variable: is the variable that we want to predict based on the
feature variables.
− In a classification problem, the target variable is categorical (e.g. Yes/No,
A/B/C), while in regression problems the target variable is continuous (e.g.
age, salary, height).

Examples
• In a housing price prediction dataset, the feature variables could be
the number of bedrooms, square footage, neighborhood, and so on,
while the target variable would be the price of the house.

• In a medical diagnosis dataset, the feature variables could be patient
symptoms, medical history, and test results, while the target variable would
be the diagnosis (e.g. flu, pneumonia, etc.).
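
To make the split between features and target concrete, here is a minimal
sketch in pandas using the housing example above; the column names and values
are made up:

import pandas as pd

# Hypothetical housing data: 'price' is the target, the rest are features
df = pd.DataFrame({
    "bedrooms": [2, 3, 4],
    "sqft":     [850, 1200, 1800],
    "price":    [150000, 230000, 310000],
})

X = df.drop(columns=["price"])   # feature variables (predictors / inputs)
y = df["price"]                  # target variable (response / label)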

Model Training
This is the process of building a machine learning model using a training
dataset.

The model is trained to learn the relationship between the features (input
variables) and the target variable.

This process involves selecting an appropriate algorithm, defining the
hyperparameters, and fitting the model to the training data.

Prediction
Once the model is trained, it can be used to make predictions on new,
unseen data.

During prediction, the feature values are input into the model, and the
target variable is predicted based on the learned relationship.
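
A minimal sketch of the train-then-predict workflow with scikit-learn; the
synthetic dataset and the choice of logistic regression are illustrative
assumptions, not part of the original slides:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary-classification data: features X, target y
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Training: choose an algorithm and fit it to the training data
model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Prediction: feed new feature values into the trained model
X_new = X[:5]                       # stand-in for genuinely new, unseen data
print(model.predict(X_new))         # predicted classes
print(model.predict_proba(X_new))   # predicted class probabilities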

Overfitting and Underfitting
• Overfitting and underfitting are two common issues faced while
training machine learning models.

• Overfitting occurs when a model is trained too well on the training data and
fits the noise in the data instead of the underlying pattern.
• As a result, it performs well on the training data but poorly on the unseen
data or validation data.
• Overfitting can be identified by having a high accuracy on the training data but
a low accuracy on the validation data.

• Underfitting, on the other hand, occurs when a model is not complex
enough to capture the underlying pattern in the data. It results in a
low accuracy on both the training and validation data.

It is important to strike a balance between overfitting and underfitting to
build an effective model.
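
A small sketch of how this balance shows up in practice, assuming a decision
tree whose max_depth controls complexity (synthetic data; not from the
slides). A large gap between training and validation accuracy signals
overfitting; low accuracy on both signals underfitting:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

for depth in (1, 5, None):   # too simple, moderate, unconstrained
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(depth,
          round(tree.score(X_train, y_train), 2),  # training accuracy
          round(tree.score(X_val, y_val), 2))      # validation accuracy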

[Figure: illustration of overfitting and underfitting]
Bias and Variance
• Bias and variance are two important concepts in machine learning
that describe the error in a model's predictions.

• Bias refers to the error that is introduced by assuming that the relationship
between the features and target is too simple.
• A model with high bias pays little attention to the training data and
oversimplifies the relationship between the features and target.
• As a result, it often has a high training error and a high test error.

• Variance, on the other hand, refers to the error that is introduced by
the model being too complex and fitting the training data too closely.
• A model with high variance pays too much attention to the training data and
overfits it, capturing the noise in the data as well as the underlying
relationship.
• As a result, it has a low training error but a high test error.

The goal in building a machine learning model is to find a balance between
bias and variance to minimize the total error. This is often referred to as
the bias-variance trade-off.

Classification Algorithms

Linear classifiers
• A linear classifier is a machine learning algorithm that uses a linear
function to separate data into different classes.
• The goal of a linear classifier is to find the hyperplane (a line or a plane
in high-dimensional space) that best separates the data into their
respective classes.
• The hyperplane is defined by a set of coefficients that are estimated during
the training phase.
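
A minimal sketch of a linear classifier's decision rule; the coefficients w
and intercept b below are made up, standing in for values estimated during
training:

import numpy as np

w = np.array([0.8, -0.4])   # hyperplane coefficients (one per feature)
b = 0.1                     # intercept

def linear_classify(x):
    # The sign of the linear function decides which side of the
    # hyperplane the observation falls on, and hence its class.
    return 1 if np.dot(w, x) + b >= 0 else 0

print(linear_classify(np.array([1.0, 0.5])))    # -> 1
print(linear_classify(np.array([-1.0, 2.0])))   # -> 0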

Examples of linear classifiers include
• Logistic Regression,
• Support Vector Machines (SVM) with linear kernels, and
• Linear Discriminant Analysis (LDA).

These algorithms make predictions based on the values of the features and the
coefficients of the hyperplane, which are used to determine the class of an
observation.

Logistic Regression: Definition
• Logistic Regression is a popular supervised machine learning
algorithm used for binary classification problems.

• In logistic regression, the target variable is binary and the prediction is
made based on the relationship between the independent (or feature) variables
and the dependent (or target) variable.

• The main objective of logistic regression is to find the best fitting model
(i.e., a line or hyperplane) that separates the classes in the feature space.

Logistic Regression: How it works
• The algorithm works by modeling the probability of an event occurring (e.g.,
a customer buying a product) using a sigmoid function (the logistic function).

• The output of the logistic regression model is a probability score between
0 and 1, which can then be used to make a binary classification.

The logistic (sigmoid) function is f(x) = 1 / (1 + e^(-x)), where:

• x is a linear combination of one or more features in the dataset.
• f(x) is a probability between 0 and 1.
• For example, if the output of the function is above 0.5, the output is
considered as 1. On the other hand, if the output is less than 0.5, the
output is classified as 0.
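
A minimal sketch of this function and the 0.5 threshold; the input value is
made up:

import numpy as np

def sigmoid(x):
    # Logistic function: maps any real number to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

x = 1.2                       # a linear combination of features, e.g. w.x + b
p = sigmoid(x)                # probability of the positive class
label = 1 if p > 0.5 else 0   # apply the 0.5 decision threshold
print(p, label)               # ~0.769 -> class 1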

Model Selection and Evaluation

Training, Testing and Validation datasets
Training, testing, and validation datasets are used in the process of
developing and evaluating a machine learning model.

1. Training dataset: This dataset is used to train the machine learning model.
− The model is trained by fitting the model to the training data. The model
learns the patterns in the data and uses them to make predictions.

2. Testing dataset: This dataset is used to evaluate the performance of the
machine learning model after it has been trained.
− The model is presented with new, unseen data and it makes
predictions based on what it has learned from the training data.
− The accuracy of these predictions is then used to evaluate the
performance of the model.

3. Validation dataset: This dataset is used to tune the hyperparameters of the
machine learning model.
− The model is trained on the training data and then evaluated on the
validation data.
− The hyperparameters are adjusted based on the performance of the
model on the validation data.
− This helps to prevent overfitting of the model to the training data.
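
A common way to produce the three datasets is two successive splits. A minimal
sketch with scikit-learn, assuming an illustrative 60/20/20 split:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First split off the test set, then carve a validation set out of the rest
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0)

# Result: 60% train, 20% validation, 20% test
print(len(X_train), len(X_val), len(X_test))   # 600 200 200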

Metrics for evaluating classification models
Classification models can be evaluated using a variety of metrics, depending
on the specific use case and requirements.

Some of the most commonly used metrics are:

1. Accuracy
2. Confusion matrix
3. Precision
4. Recall
5. F1 score
6. ROC curve (Receiver Operating Characteristic)
7. AUC (Area Under the Curve)
Accuracy
Ratio of correct predictions made by a classifier to the total number of
predictions made

Accuracy = Number of correct predictions / Total number of predictions

Confusion Matrix
A 2-D table that shows the number of true positive, true negative, false
positive, and false negative predictions made by the model.

The entries in the matrix can then be used to calculate various performance
metrics, such as precision, recall, F1-score, and AUC for the ROC curve.
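
A minimal sketch using scikit-learn's confusion_matrix; the labels and
predictions are made up:

from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # model predictions

# scikit-learn convention: rows are actual classes, columns are predicted:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))   # [[3 1]
                                          #  [1 3]]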
Precision
The number of true positive predictions (i.e. positive predictions that are
actually correct) divided by the total number of positive predictions made by
the model.

Precision = True Positives / (True Positives + False Positives)

Recall
The proportion of actual positive instances that are correctly classified
as positive by the model.

Also called TPR (True Positive Rate) or Sensitivity.

Recall = True Positives / (True Positives + False Negatives)

F1 score
A metric that combines precision and recall. It is calculated as the
harmonic mean of precision and recall.

F1-score = 2 * (Precision * Recall) / (Precision + Recall)

The F1 Score ranges between 0 and 1, with 1 being the best possible
score and 0 the worst.
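
A minimal sketch computing the metrics above with scikit-learn, reusing the
made-up labels from the confusion-matrix example (TP=3, FP=1, FN=1, TN=3):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # 6/8           = 0.75
print(precision_score(y_true, y_pred))   # 3/(3+1)       = 0.75
print(recall_score(y_true, y_pred))      # 3/(3+1)       = 0.75
print(f1_score(y_true, y_pred))          # harmonic mean = 0.75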

ROC curve
• A graphical representation of the performance of a binary
classification model as the discrimination threshold (probability threshold)
is varied.

• It plots the true positive rate (TPR) against the false positive rate
(FPR) at various threshold settings.

• The ROC curve is a useful tool for evaluating the trade-off between
the true positive rate and the false positive rate of a classifier.

• FPR = FP / (FP + TN), i.e. the probability of a false alarm
• TPR = TP / (TP + FN), i.e. the probability of detection

AUC
• The interpretation of the ROC curve is based on the Area Under the
Curve (AUC), which summarizes the overall performance of the
model.

• An AUC of 1 indicates a perfect model, while an AUC of 0.5 represents a
random model.

• A higher AUC value indicates a better performance, with a larger area under
the curve meaning a greater balance between TPR and FPR.
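
A minimal sketch of computing ROC points and the AUC with scikit-learn; the
labels and predicted probabilities are made up:

from sklearn.metrics import roc_curve, roc_auc_score

y_true   = [0, 0, 1, 1, 0, 1]               # actual labels
y_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]  # predicted probabilities

fpr, tpr, thresholds = roc_curve(y_true, y_scores)
print(fpr, tpr)                          # points on the ROC curve
print(roc_auc_score(y_true, y_scores))   # area under the curve, ~0.89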

Demo: Logistic Regression
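
The demo code itself did not survive extraction; below is a minimal
end-to-end sketch of what such a demo typically looks like, assuming
scikit-learn's built-in breast-cancer dataset (the original demo may have
used different data):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Built-in binary classification dataset (malignant vs. benign tumors)
X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scale the features, then fit logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

# Evaluate with the metrics covered above
y_pred  = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]
print("Accuracy:", accuracy_score(y_test, y_pred))
print("AUC:", roc_auc_score(y_test, y_proba))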

