5 Classification
Summer 2024
© IIT Roorkee India
Dr. Sharma T.
[Figure: overview of analysis tasks: explaining the past, univariate and bivariate exploration, classification, and clustering.
Classification algorithms shown: linear classifiers (e.g. Logistic Regression), non-linear classifiers (e.g. K-Nearest Neighbors), Support Vector Machines, Neural Networks.
Outputs shown: a category; a real value; patterns within a group of uncategorized data; associations between different data objects; wider dependencies.]
Introduction to Classification
What?
• Classification is a fundamental concept in the field of machine learning
Types of classification
1) Binary Classification: predicting one of two possible outcomes, typically
represented by 1 and 0, True and False, or Positive and Negative.
− For example, classifying an email as spam or not spam, or diagnosing a patient as
having a disease or not.
2) Multiclass Classification: predicting one of more than two possible
outcomes.
− For example, classifying an object as a car, bicycle, or motorcycle, or recognizing
different types of fruits.
3) Multilabel Classification: predicting one or more outcomes for each
sample. In other words, each sample can belong to multiple categories or
classes at the same time.
− For example, classifying a movie as belonging to multiple genres such as action,
comedy, and drama.
Real-world applications of classification
− Image classification: recognizing objects or people in images and
categorizing them into specific classes
− Spam filtering: classifying emails as spam or not spam
− Medical diagnosis: diagnosing diseases based on symptoms and test
results
− Credit risk assessment: predicting the likelihood of a loan default
based on various factors such as credit history, income, and job
stability
− Sentiment analysis: classifying the sentiment of a piece of text as
positive, negative, or neutral
− Customer segmentation: dividing customers into different groups
based on their purchasing behavior and demographics
− Fraud detection: identifying fraudulent transactions in financial systems
− Marketing: classifying customers based on their likelihood to respond
to a marketing campaign, or to purchase a certain product or service.
Basic Terminology
Feature and Target Variables
• Feature variables: (also called predictors, inputs, or attributes) are the
variables used to describe an instance (such as an individual, item, or
event).
− These features are used to build a model that makes predictions about the
target variable (also called response, label, or output).
• Target variable: is the variable that we want to predict based on the
feature variables.
− In a classification problem, the target variable is categorical (e.g. Yes/No,
A/B/C), while in regression problems the target variable is continuous (e.g.
age, salary, height).
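The split between feature and target variables can be sketched in code. This is a minimal illustration only: the records and the column names (`age`, `income`, `bought`) are hypothetical and do not come from the slides.

```python
# Hypothetical records: each dict describes one instance (a customer).
records = [
    {"age": 25, "income": 30000, "bought": "No"},
    {"age": 47, "income": 82000, "bought": "Yes"},
    {"age": 35, "income": 54000, "bought": "Yes"},
]

feature_names = ["age", "income"]  # feature variables (inputs)
target_name = "bought"             # target variable (categorical label)

# X holds one row of feature values per instance; y holds the matching labels.
X = [[row[name] for name in feature_names] for row in records]
y = [row[target_name] for row in records]
```

Because `bought` takes the values Yes/No, this would be a classification problem; a continuous target such as salary would make it a regression problem.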
Examples
• In a housing price prediction dataset, the feature variables could be
the number of bedrooms, square footage, neighborhood, and so on,
while the target variable would be the price of the house.
Model Training
This is the process of building a machine learning model using a training
dataset.
The model is trained to learn the relationship between the features (input
variables) and the target variable.
Prediction
Once the model is trained, it can be used to make predictions on new,
unseen data.
During prediction, the feature values are input into the model, and the
target variable is predicted based on the learned relationship.
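The train-then-predict cycle can be sketched with a deliberately simple model. The example below assumes a nearest-centroid classifier, chosen only because it is short; it is not an algorithm named in the slides, and the data and function names are hypothetical.

```python
def train(X, y):
    """Learn one mean feature vector (centroid) per class from training data."""
    sums, counts = {}, {}
    for features, label in zip(X, y):
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(model, features):
    """Return the class whose centroid is closest to the given feature values."""
    def squared_distance(centroid):
        return sum((c - f) ** 2 for c, f in zip(centroid, features))
    return min(model, key=lambda label: squared_distance(model[label]))


# Hypothetical training data: two features per instance, two classes.
X_train = [[1.0, 1.0], [2.0, 1.0], [8.0, 9.0], [9.0, 8.0]]
y_train = ["A", "A", "B", "B"]

model = train(X_train, y_train)         # model training
prediction = predict(model, [1.5, 2.0])  # prediction on a new, unseen instance
```

Here `train` captures the learned relationship as class centroids, and `predict` applies that relationship to feature values the model has not seen before.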
Overfitting and Underfitting
• Overfitting and underfitting are two common issues faced while
training machine learning models.
• Underfitting, on the other hand, occurs when a model is not complex
enough to capture the underlying pattern in the data. It results in
low accuracy on both the training and validation data.
Bias and Variance
• Bias and variance are two important concepts in machine learning
that describe the error in a model's predictions.
• Variance, on the other hand, refers to the error that is introduced by
the model being too complex and fitting the training data too closely.
• A model with high variance pays too much attention to the training data and
overfits it, capturing the noise in the data as well as the underlying
relationship.
• As a result, it has a low training error but a high test error.
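The gap between training error and test error can be illustrated with a small sketch. It assumes a 1-nearest-neighbor rule as the high-variance model and a hypothetical one-dimensional dataset in which one training label (at x = 3) is noise; neither is taken from the slides.

```python
def nearest_neighbor_predict(train, x):
    """Predict the label of the single closest training point (memorization)."""
    _, nearest_label = min(train, key=lambda pair: abs(pair[0] - x))
    return nearest_label


def error_rate(train, data):
    """Fraction of points in `data` that the memorizing model gets wrong."""
    wrong = sum(1 for x, y in data if nearest_neighbor_predict(train, x) != y)
    return wrong / len(data)


# Assumed underlying rule: label is 1 when x >= 5, otherwise 0.
# The training point (3, 1) contradicts that rule, i.e. it is noise.
train = [(1, 0), (2, 0), (3, 1), (4, 0), (6, 1), (7, 1), (8, 1), (9, 1)]
test = [(1.2, 0), (2.8, 0), (3.2, 0), (5.5, 1)]

train_error = error_rate(train, train)  # every point is its own nearest neighbor
test_error = error_rate(train, test)    # points near x = 3 inherit the noisy label
```

The model reproduces the training labels exactly, including the noisy one, so its training error is zero while test points close to the noisy example are misclassified.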
Classification Algorithms
Linear classifiers
• A linear classifier is a machine learning algorithm that uses a linear
function to separate data into different classes.
• The goal of a linear classifier is to find the hyperplane (a line in two
dimensions, a plane in three, and their analogue in higher dimensions) that
best separates the data into their respective classes.
• The hyperplane is defined by a set of coefficients that are estimated
during the training phase.
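A linear decision rule can be sketched as follows. The coefficients below are hand-picked for illustration rather than estimated from data, and the function name `linear_classify` is hypothetical.

```python
def linear_classify(weights, bias, features):
    """Classify by which side of the hyperplane w . x + b = 0 the point lies on."""
    score = sum(w * x for w, x in zip(weights, features)) + bias
    return 1 if score >= 0 else 0


# Hand-picked coefficients (an assumption): the boundary is the line x1 + x2 = 5.
weights = [1.0, 1.0]
bias = -5.0

above = linear_classify(weights, bias, [4.0, 3.0])  # score = 2.0, class 1
below = linear_classify(weights, bias, [1.0, 2.0])  # score = -2.0, class 0
```

In a trained model, `weights` and `bias` would be the coefficients estimated during the training phase.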
Examples of linear classifiers include
• Logistic Regression,
• Support Vector Machines (SVM) with linear kernels, and
• Linear Discriminant Analysis (LDA).
Logistic Regression: Definition
• Logistic Regression is a popular supervised machine learning
algorithm used for binary classification problems.
Logistic Regression: How it works
• The algorithm works by modeling the probability of an event occurring (e.g.,
a customer buying a product) using a sigmoid function (the logistic function).
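How the sigmoid turns a linear score into a probability can be sketched as below. The coefficients are hypothetical, hand-picked values rather than fitted ones, and the 0.5 decision threshold is a common default assumed here, not something the slides specify.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number to a value between 0 and 1."""
    # Adequate for moderate z; very large negative z would overflow math.exp.
    return 1.0 / (1.0 + math.exp(-z))


def predict_probability(weights, bias, features):
    """Probability of the positive class for one instance."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)


def predict_class(weights, bias, features, threshold=0.5):
    """Turn the probability into a 0/1 decision using an assumed threshold."""
    return 1 if predict_probability(weights, bias, features) >= threshold else 0


# Hypothetical single-feature model: score = 0.8 * x - 4.0
weights, bias = [0.8], -4.0
likely_buyer = predict_class(weights, bias, [10.0])   # z = 4.0
unlikely_buyer = predict_class(weights, bias, [2.0])  # z = -2.4
```

A score of zero maps to a probability of exactly 0.5, which is why the sign of the linear score and the 0.5 threshold give the same decision.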
Model Selection and Evaluation
Training, Testing, and Validation datasets
Training, testing, and validation datasets are used in the process of
developing and evaluating a machine learning model.
2. Testing dataset: This dataset is used to evaluate the performance of
the machine learning model after it has been trained.
− The model is presented with new, unseen data and it makes
predictions based on what it has learned from the training data.
− The accuracy of these predictions is then used to evaluate the
performance of the model.
3. Validation dataset: This dataset is used to tune the hyperparameters
of the machine learning model.
− The model is trained on the training data and then evaluated on the
validation data.
− The hyperparameters are adjusted based on the performance of the
model on the validation data.
− This helps to prevent overfitting of the model to the training data.
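One possible way to carve a dataset into the three parts is sketched below. The 60/20/20 proportions, the fixed seed, and the function name are assumptions made for illustration; the slides do not prescribe a split ratio.

```python
import random


def train_validation_test_split(rows, validation_fraction=0.2,
                                test_fraction=0.2, seed=0):
    """Shuffle the rows, then slice them into three disjoint subsets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility

    n_test = int(len(shuffled) * test_fraction)
    n_validation = int(len(shuffled) * validation_fraction)

    test = shuffled[:n_test]
    validation = shuffled[n_test:n_test + n_validation]
    train = shuffled[n_test + n_validation:]
    return train, validation, test


rows = list(range(10))  # stand-in for ten labeled instances
train, validation, test = train_validation_test_split(rows)
```

Hyperparameters would be tuned against `validation`, and `test` would be touched only once at the end so that it remains unseen data.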
Metrics for evaluating classification models
Classification models can be evaluated using a variety of metrics, depending
on the specific use case and requirements.
1. Accuracy
2. Confusion matrix
3. Precision
4. Recall
5. F1 score
6. ROC curve (Receiver Operating Characteristic)
7. AUC (Area Under the Curve)
Accuracy
Ratio of correct predictions made by a classifier to the total number of
predictions made
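The ratio can be computed directly, as in this minimal sketch. The label lists are hypothetical.

```python
def accuracy(y_true, y_pred):
    """Correct predictions divided by the total number of predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical true labels and model predictions for eight instances.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

score = accuracy(y_true, y_pred)  # 6 correct out of 8
```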
Confusion Matrix
A 2-D table that shows the number of true positive, true negative, false
positive, and false negative predictions made by the model.
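The four cells of a binary confusion matrix can be counted as in the sketch below. The data is hypothetical, and treating the label 1 as the positive class is an assumption.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = tn = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif t == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}


y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
counts = confusion_counts(y_true, y_pred)
```

The other metrics in this section can all be derived from these four counts.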
Recall
The proportion of actual positive instances that are correctly classified
as positive by the model.
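In terms of counts, recall is TP / (TP + FN). A minimal sketch with hypothetical data follows; returning 0.0 when there are no actual positives is a convention assumed here.

```python
def recall(y_true, y_pred, positive=1):
    """Share of actual positives that the model also labels positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp + fn == 0:
        return 0.0  # assumed convention when no actual positives exist
    return tp / (tp + fn)


# Four actual positives, of which the model finds three.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 1]
value = recall(y_true, y_pred)
```

The false positive in the last position does not affect recall, because recall only looks at instances that are actually positive.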
F1 score
A metric that combines precision and recall. It is calculated as the
harmonic mean of precision and recall.
F1-score = 2 * (Precision * Recall) / (Precision + Recall)
The F1 Score ranges between 0 and 1, with 1 being the best possible
score and 0 the worst.
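The formula translates directly into code. In this sketch, returning 0.0 when precision and recall are both zero is an assumed convention to avoid division by zero.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the degenerate case
    return 2 * (precision * recall) / (precision + recall)


balanced = f1_score(0.5, 0.5)    # equal inputs give the same value back
lopsided = f1_score(0.9, 0.1)    # dragged toward the weaker of the two
```

Unlike an arithmetic mean, the harmonic mean stays low when either precision or recall is low, so a model cannot score well on F1 by excelling at only one of them.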
ROC curve
• A graphical representation of the performance of a binary
classification model as the discrimination threshold (probability threshold)
is varied.
• It plots the true positive rate (TPR) against the false positive rate
(FPR) at various threshold settings.
• The ROC curve is a useful tool for evaluating the trade-off between
the true positive rate and the false positive rate of a classifier.
• FPR = FP / (FP + TN)
i.e. probability of false alarm
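The points of a ROC curve can be sketched by sweeping the threshold over predicted scores. The example uses FPR = FP / (FP + TN) as above and the standard definition TPR = TP / (TP + FN); the scores and labels are hypothetical, and the sketch assumes both classes are present in the data.

```python
def roc_point(y_true, scores, threshold):
    """Return (FPR, TPR) when scores >= threshold are predicted positive."""
    tp = fp = tn = fn = 0
    for t, s in zip(y_true, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and t == 1:
            tp += 1
        elif predicted == 1:
            fp += 1
        elif t == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # true positive rate
    fpr = fp / (fp + tn)  # false positive rate (probability of false alarm)
    return fpr, tpr


# Hypothetical labels and predicted probabilities of the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# One threshold above every score, then one at each distinct score.
thresholds = [0.9, 0.8, 0.4, 0.35, 0.1]
points = [roc_point(y_true, scores, th) for th in thresholds]
```

Lowering the threshold moves the point from (0, 0), where nothing is predicted positive, to (1, 1), where everything is.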
AUC
• The ROC curve is commonly summarized by the Area Under the Curve
(AUC), a single number that describes the overall performance of the
model.
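Given a list of (FPR, TPR) points, the area can be approximated with the trapezoidal rule, as sketched below. The ROC points are hypothetical values supplied by hand, and the function name is illustrative.

```python
def auc_trapezoid(points):
    """Area under a curve given as (FPR, TPR) points, via the trapezoidal rule."""
    ordered = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(ordered, ordered[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC points for a small classifier.
roc_points = [(0.0, 0.0), (0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
auc = auc_trapezoid(roc_points)

diagonal = auc_trapezoid([(0.0, 0.0), (1.0, 1.0)])             # random guessing
perfect = auc_trapezoid([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])  # perfect ranking
```

An AUC of 0.5 corresponds to the diagonal of random guessing, and 1.0 to a classifier that ranks every positive above every negative.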
Demo: Logistic Regression