Evaluation Metrics For Classification Model in Python
Last Updated :
27 May, 2024
Classification is a supervised machine-learning technique that predicts a class label from the input data. Many algorithms can be used to build a classification model, such as the Stochastic Gradient Descent Classifier, Support Vector Machine Classifier, Random Forest Classifier, etc. To choose the right model, it is important to gauge the performance of each candidate.
This tutorial will look at different evaluation metrics to check the model's performance and explore which metrics to choose based on the situation.
Understanding Classification Evaluation Metrics
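Understanding classification evaluation metrics is crucial for assessing the performance of machine learning models, especially in tasks like binary or multiclass classification. The metrics covered here are accuracy, the confusion matrix, precision, recall, the F1 score, and the ROC curve with its AUC.
Let's consider the MNIST dataset and try to understand the metrics based on a classifier. MNIST is a set of 70,000 small images of handwritten digits. The code below loads the dataset, trains a small convolutional network, and stores the predicted and true class labels used in the rest of the tutorial.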
Python
from keras.datasets import mnist
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, roc_auc_score, roc_curve
import matplotlib.pyplot as plt
import seaborn as sns
(train_X, train_y), (test_X, test_y) = mnist.load_data()
train_X = train_X.reshape((train_X.shape[0], 28, 28, 1)).astype('float32') / 255
test_X = test_X.reshape((test_X.shape[0], 28, 28, 1)).astype('float32') / 255
train_y = to_categorical(train_y)
test_y = to_categorical(test_y)
model = Sequential([
    Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(100, activation='relu'),
    Dense(10, activation='softmax')
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(train_X, train_y, epochs=3, batch_size=200, validation_split=0.2, verbose=2)
y_pred = model.predict(test_X)
y_pred_classes = y_pred.argmax(axis=1)
y_test_classes = test_y.argmax(axis=1)
Output:
Epoch 1/3
240/240 - 24s - loss: 0.3217 - accuracy: 0.9088 - val_loss: 0.1426 - val_accuracy: 0.9573 - 24s/epoch - 98ms/step
Epoch 2/3
240/240 - 18s - loss: 0.1022 - accuracy: 0.9707 - val_loss: 0.0805 - val_accuracy: 0.9770 - 18s/epoch - 76ms/step
Epoch 3/3
240/240 - 21s - loss: 0.0667 - accuracy: 0.9808 - val_loss: 0.0659 - val_accuracy: 0.9815 - 21s/epoch - 89ms/step
313/313 [==============================] - 2s 6ms/step
1. Accuracy
Classification accuracy is the simplest evaluation metric. It is the number of correct predictions divided by the total number of predictions; multiplying the result by 100 gives a percentage. Accuracy works well when the target classes are approximately balanced. For example, an animal dataset with 60% dogs and 40% cats is reasonably balanced. It is calculated as:
Accuracy = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}
In the context of the MNIST dataset, accuracy measures how often the model correctly identifies the handwritten digits.
Python
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test_classes, y_pred_classes)
print(f'Accuracy: {accuracy:.4f}')
Output:
Accuracy: 0.9822
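The caveat about balanced classes can be made concrete with a small sketch. The labels below are hypothetical (95 negatives and 5 positives, not taken from MNIST), and the "model" simply predicts the majority class every time:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels: 95 negatives (0) and 5 positives (1).
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

# Accuracy from the definition: correct predictions / total predictions.
manual_accuracy = np.mean(y_true == y_majority)

# The same value from scikit-learn.
majority_accuracy = accuracy_score(y_true, y_majority)

print(f'Manual accuracy:  {manual_accuracy:.2f}')
print(f'sklearn accuracy: {majority_accuracy:.2f}')
```

Both values come out at 0.95 even though this predictor never finds a single positive, which is why accuracy alone can be misleading on imbalanced data.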
2. Confusion Matrix
The confusion matrix is another way to evaluate the performance of a classifier. It counts the number of times instances of class A are classified as class B. For example, it shows how many times images of 5s were classified as some other digit.
It is a table that summarizes the predictions made by the model against the actual class labels. For a binary problem, the confusion matrix has four cells, one for each combination of predicted and actual class: True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). For a multiclass problem such as MNIST, it is a 10 x 10 table in which the rows are the actual classes and the columns are the predicted classes, so correct predictions lie on the diagonal.
Let's compute the confusion matrix for the MNIST classifier. The steps are as follows:
Python
cm = confusion_matrix(y_test_classes, y_pred_classes)
plt.figure(figsize=(10, 4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Output:
Confusion Matrix
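The four binary counts (TP, TN, FP, FN) can be read off for a single digit by collapsing the labels to "is it a 5 or not". The sketch below uses a handful of made-up labels rather than the MNIST predictions, so the numbers are illustrative only:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted digit labels (not from the MNIST run above).
y_true_digits = np.array([5, 5, 5, 3, 8, 5, 1, 0, 5, 7])
y_pred_digits = np.array([5, 3, 5, 3, 5, 5, 1, 0, 6, 7])

# Collapse to a binary problem: 1 if the digit is a 5, else 0.
y_true_is5 = (y_true_digits == 5).astype(int)
y_pred_is5 = (y_pred_digits == 5).astype(int)

# With labels=[0, 1], the flattened 2 x 2 matrix is ordered TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true_is5, y_pred_is5, labels=[0, 1]).ravel()

print(f'TP={tp}, FP={fp}, FN={fn}, TN={tn}')
```

The same idea applies to the real predictions by replacing the toy arrays with `y_test_classes` and `y_pred_classes`.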
3. Precision, Recall and F1 Score
A confusion matrix is a great way to evaluate the performance of a classifier, but sometimes we need a more concise metric. Precision, recall, and the F1 score provide that.
3.1 Precision: Precision provides the accuracy of the positive prediction made by the classifier. The equation is as follows:
Precision = True Positive / (True Positive + False Positive)
When to choose precision?
In some cases, we need high precision. For example, consider a classifier trained to detect videos that are safe for kids. Here, we prefer a classifier that keeps only safe videos (high precision), even if it rejects many good videos (low recall).
Precision is typically used with another metric called recall (also known as sensitivity or the true positive rate, TPR).
3.2 Recall: Recall is the ratio of the number of true positive predictions (positives correctly detected by the classifier) to the total number of actual positive instances in the dataset. It measures the completeness of positive predictions. The equation is as follows:
Recall = True Positive / (True Positive + False Negative)
When to choose recall?
In other cases, high recall matters more than high precision. Suppose you train a classifier for fire detection: a model tuned for high precision raises few false alarms but can miss actual fires. Here it is more important to maintain high recall. Security guards will get some false alarms, but they will be alerted in almost every actual case.
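The trade-off between the two metrics can be sketched by moving the decision threshold of a binary classifier. The labels and scores below are invented for illustration (1 stands for "fire"), and the three thresholds are arbitrary choices:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical ground truth (1 = fire) and classifier scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95])

results = {}
for threshold in (0.3, 0.5, 0.9):
    # Predict the positive class whenever the score reaches the threshold.
    y_pred = (scores >= threshold).astype(int)
    results[threshold] = (
        precision_score(y_true, y_pred),
        recall_score(y_true, y_pred),
    )

for threshold, (precision, recall) in results.items():
    print(f'threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}')
```

With these made-up scores, the lowest threshold catches every positive (recall 1.0) at a precision of about 0.57, while the highest threshold reaches precision 1.0 but recall drops to 0.25.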
3.3 F1 Score: The F1 score is the harmonic mean of precision and recall. It favors classifiers that have similar precision and recall. Here, the classifier will only get a high F1 score if both recall and precision are high. The equation is as follows:
F1 = 2 * (Precision * Recall) / (Precision + Recall)
When to choose F1 Score?
The F1 score is useful when classes are imbalanced and accuracy can be misleading. It balances precision and recall, which matters when false positives and false negatives are both costly, as in medical diagnosis. Because it is a harmonic mean, a low value of either precision or recall pulls the score down.
For the implementation, refer to the code below:
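As a small worked example of the formulas above, the sketch below starts from hypothetical counts for one positive class (the values of `tp`, `fp`, and `fn` are made up) and compares the F1 score with a plain arithmetic mean:

```python
# Hypothetical counts for one positive class, chosen only for illustration.
tp, fp, fn = 30, 10, 60

precision = tp / (tp + fp)
recall = tp / (tp + fn)

# F1 is the harmonic mean of precision and recall.
f1 = 2 * (precision * recall) / (precision + recall)

# The arithmetic mean is shown only for comparison.
arithmetic_mean = (precision + recall) / 2

print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
print(f'Arithmetic mean: {arithmetic_mean:.4f}')
```

Here precision is 0.75 but recall is only about 0.33, and the F1 score (about 0.46) sits closer to the weaker of the two than the arithmetic mean (about 0.54) does.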
Python
precision = precision_score(y_test_classes, y_pred_classes, average='macro')
recall = recall_score(y_test_classes, y_pred_classes, average='macro')
f1 = f1_score(y_test_classes, y_pred_classes, average='macro')
print(f'Precision: {precision:.4f}')
print(f'Recall: {recall:.4f}')
print(f'F1 Score: {f1:.4f}')
Output:
Precision: 0.9823
Recall: 0.9821
F1 Score: 0.9822
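In the above code, we use precision_score(), recall_score(), and f1_score() from sklearn.metrics. Because MNIST has ten classes, average='macro' computes each metric separately for every class and then takes the unweighted mean.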
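To see what `average='macro'` does, the following sketch compares it with the per-class scores returned by `average=None`. The three-class labels are hypothetical and much smaller than MNIST:

```python
import numpy as np
from sklearn.metrics import precision_score

# Hypothetical 3-class labels, for illustration only.
y_true = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 2, 2, 2, 2, 0, 2])

# average=None returns one precision value per class.
per_class_precision = precision_score(y_true, y_pred, average=None)

# average='macro' is the unweighted mean of those per-class values.
macro_precision = precision_score(y_true, y_pred, average='macro')

print('Per-class precision:', np.round(per_class_precision, 4))
print(f'Macro precision: {macro_precision:.4f}')
```

Because every class counts equally in the macro average, a rare class with poor precision lowers the score as much as a common one would.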
4. ROC Curve
The Receiver Operating Characteristic (ROC) curve is a graphical representation of the performance of a classification model at various thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR). The Area Under the ROC Curve (AUC-ROC) summarizes the curve in a single number for a binary classification model. Its value lies between 0 and 1, where a higher value indicates better performance and 0.5 corresponds to random guessing. AUC-ROC is largely insensitive to class distribution and gives an aggregate measure of performance across all possible classification thresholds. For a multiclass problem such as MNIST, it can be computed one-vs-rest, treating each class in turn as the positive class.
The true positive rate is calculated as:
TPR = True Positives / (True Positives + False Negatives)
It measures how well the model identifies the instances that are actually positive. It is also known as sensitivity.
The false positive rate is calculated as:
FPR = False Positives / (False Positives + True Negatives)
It is also referred to as inverted specificity (1 - specificity), where specificity is calculated as:
Specificity = True Negative / (True Negative + False Positive)
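The three rates can be computed directly from the confusion-matrix counts at one threshold. The sketch below is a minimal illustration on invented binary labels and scores, with 0.5 as an assumed threshold; each point on a ROC curve corresponds to one such (FPR, TPR) pair:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical binary labels and predicted scores, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1, 0, 1])
scores = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.55, 0.60, 0.70, 0.80, 0.95])

# Assumed decision threshold for this sketch.
y_pred = (scores >= 0.5).astype(int)

# With labels=[0, 1], the flattened matrix is ordered TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

tpr = tp / (tp + fn)          # sensitivity / recall
fpr = fp / (fp + tn)          # false positive rate
specificity = tn / (tn + fp)  # true negative rate

print(f'TPR: {tpr:.4f}')
print(f'FPR: {fpr:.4f}')
print(f'Specificity: {specificity:.4f}')
```

On this toy data the FPR equals 1 - specificity, as stated above.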
Let's get to the implementation part using sklearn. The code is as follows:
Python
# test_y holds the one-hot true labels; y_pred_prob holds the predicted class probabilities
y_pred_prob = model.predict(test_X)
roc_auc = roc_auc_score(test_y, y_pred_prob, multi_class='ovr')
print(f'ROC AUC: {roc_auc:.4f}')
# Plotting ROC Curve for one class (e.g., class 0)
fpr, tpr, _ = roc_curve(y_test_classes == 0, y_pred_prob[:, 0])
plt.plot(fpr, tpr, label='Class 0 ROC curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend(loc='best')
plt.show()
Output:
ROC AUC: 0.9998
ROC-AUC Curve
Conclusion
It is important to gain insights on how well a machine learning algorithm performs on unseen data. Choosing the right evaluation metrics will help identify the right ML algorithm that performs well. Here, we have gone through different evaluation metrics and also discussed how to choose the right evaluation metrics for classification.