
Machine Learning Model Evaluation

Model evaluation is the process of using metrics to analyze how well a model performs. Think of training a model like teaching a student: model evaluation is the test that shows whether the student truly learned the subject or just memorized the answers. It helps us answer two questions:
Did the model actually learn the underlying patterns?
Will it fail on new, unseen questions?

Model development is a multi-step process, and we need to keep checking how well the model will predict on future data and where its weaknesses lie. There are many metrics for this. Cross-validation is one such model evaluation technique, applied during the training phase.

Cross-Validation: The Ultimate Practice Test


Cross-validation is a method in which we do not use the whole dataset for training: part of the dataset is reserved for testing the model. There are many types of cross-validation, of which K-Fold cross-validation is the most widely used. In K-Fold cross-validation the original dataset is divided into k subsets, known as folds. The process is repeated k times; in each round one fold is used for testing and the remaining k-1 folds are used for training. This technique tends to generalize the model well and gives a more reliable estimate of the error rate.
Holdout is the simplest approach and is used with neural networks as well as many other classifiers. In this technique the dataset is split into a train set and a test set, usually in a ratio such as 70:30 or 80:20. The larger portion is used for training the model and the smaller portion for testing it. The sketch below shows both approaches.
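As an illustration, here is a minimal sketch using scikit-learn's KFold, cross_val_score and train_test_split; the iris dataset and the decision tree are used here purely as placeholders:

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load a small example dataset (illustration only)
X, y = load_iris(return_X_y=True)

# K-Fold cross-validation: 5 folds, each fold is used once as the test set
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=kfold)
print("Per-fold accuracy:", scores, "mean:", scores.mean())

# Holdout: a single 80:20 train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)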

Evaluation Metrics for Classification Task


Classification is used to categorize data into predefined labels or classes.
To evaluate the performance of a classification model we commonly use
metrics such as accuracy, precision, recall, F1 score and confusion matrix.
These metrics are useful in assessing how well the model distinguishes
between classes, especially in the case of imbalanced datasets. By
understanding the strengths and weaknesses of each metric, we can
select the most appropriate one for a given classification problem.

In this Python code, we import the iris dataset, whose features are the length and width of the sepals and petals. The target values are Iris setosa, Iris virginica, and Iris versicolor. After importing the dataset we split it into train and test sets in the ratio 80:20. We then fit a Decision Tree classifier, run predictions on the test set, and calculate the accuracy, precision, recall, and F1 score. We also plot the confusion matrix.
Importing Libraries and Dataset

Now let’s load the toy iris flowers dataset from the sklearn.datasets library and then split it into training and testing parts (for model evaluation) in an 80:20 ratio.
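The import and loading steps are not reproduced in full in this excerpt; a minimal sketch using the standard scikit-learn API (the exact import list is an assumption) would be:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix, ConfusionMatrixDisplay)

# Load the iris features and target labels
X, y = load_iris(return_X_y=True)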

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=20)

Now, let’s train a Decision Tree Classifier model on the training data, and
then we will move on to the evaluation part of the model using different
metrics.
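The training code itself is not shown in the excerpt; a minimal sketch (default hyperparameters are an assumption) would be:

# Fit a Decision Tree on the training split and predict on the test split
clf = DecisionTreeClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)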

1. Accuracy

Accuracy is defined as the ratio of the number of correct predictions to the total number of predictions. This is the most fundamental metric used to evaluate the model. The formula is given by:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
However, accuracy has a drawback: it does not work well on an imbalanced dataset. Suppose a model simply predicts the majority class for most of the data. It achieves high accuracy, but it cannot classify the minority class labels and its real performance is poor.

print("Accuracy:", accuracy_score(y_test,

Output:

Accuracy: 0.9333333333333333
2. Precision and Recall

Precision is the ratio of true positives to the sum of true positives and false positives. It focuses on how reliable the positive predictions are.
Precision = TP / (TP + FP)
The drawback of precision is that it does not consider the true negatives and false negatives.
Recall is the ratio of true positives to the sum of true positives and false negatives. It measures how many of the actual positive samples are correctly identified.
Recall = TP / (TP + FN)
The drawback of recall is that optimizing for it alone often leads to a higher false positive rate.

print("Precision:", precision_score(y_test,

average="weighted"))

print('Recall:', recall_score(y_test,

average="weighted"))

Output:

Precision: 0.9435897435897436
Recall: 0.9333333333333333

3. F1 score

F1 score is the harmonic mean of precision and recall. Because of the precision-recall trade-off, increasing precision tends to decrease recall and vice versa. The goal of the F1 score is to combine the two into a single number.
F1 Score = (2 × Precision × Recall) / (Precision + Recall)
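The corresponding call is not shown in the excerpt; with the variable names assumed above it would look like:

print("F1 score:", f1_score(y_test, y_pred, average="weighted"))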
Output:

F1 score: 0.9327777777777778

4. Confusion Matrix

A confusion matrix is an N x N matrix, where N is the number of target classes. It tabulates the actual labels against the predicted labels. Some of the terminology used in the matrix is as follows:
True Positives: It is also known as TP. It is the output in which the
actual and the predicted values are YES.
True Negatives: It is also known as TN. It is the output in which the
actual and the predicted values are NO.
False Positives: It is also known as FP. It is the output in which the
actual value is NO but the predicted value is YES.
False Negatives: It is also known as FN. It is the output in which the
actual value is YES but the predicted value is NO.

cm_display = ConfusionMatrixDisplay(confusion_matrix=confusion_matrix(y_test, y_pred))
cm_display.plot()

Output:
(Figure: confusion matrix for the output of the model)

In the output, the accuracy of the model is 93.33%. Precision is approximately 0.944, recall is 0.933, and the F1 score is approximately 0.933. Finally the confusion matrix is plotted. Here the class labels denote the target classes:

0 = Setosa
1 = Versicolor
2 = Virginica

From the confusion matrix, we see that all 8 Setosa test cases and all 11 Versicolor test cases were correctly predicted. Of the Virginica test cases, 2 were misclassified and the remaining 9 were correctly predicted.

5. AUC-ROC Curve

AUC (Area Under the Curve) is an evaluation metric used to analyze a classification model across different threshold values. The Receiver Operating Characteristic (ROC) curve plots the model’s performance at every threshold using two parameters:
TPR: the true positive rate, which follows the same formula as recall, TP / (TP + FN).
FPR: the false positive rate, defined as the ratio of false positives to the sum of false positives and true negatives, FP / (FP + TN).
The curve is useful because it shows the model’s capacity to distinguish between the classes. Let us illustrate this with a simple Python example.
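The original listing is not reproduced here; the following minimal sketch uses made-up binary labels and scores, chosen so that roc_auc_score returns 0.75 as in the quoted output:

from sklearn.metrics import roc_auc_score

# Hypothetical ground-truth labels and predicted scores (illustrative only)
y_true = [0, 0, 1, 1]
y_scores = [0.1, 0.6, 0.4, 0.8]

print("Auc", roc_auc_score(y_true, y_scores))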

Output:

Auc 0.75

The AUC score is a useful metric for evaluating the model, as it highlights the model’s capacity to separate the classes. In the above code, 0.75 is a reasonably good AUC score. An AUC of 0.5 corresponds to random guessing; a model is considered better the closer its score gets to 1.

Evaluation Metrics for Regression Task


Regression is used to predict continuous values, most often to find the relationship between a dependent and an independent variable. For classification we use a confusion matrix, accuracy, F1 score, etc., but in regression we are predicting a numerical value that may differ from the actual output, so we rely on error calculations that summarize how close the predictions are to the actual values. There are many metrics available for evaluating a regression model.
In this Python code we implement a simple regression model using the Mumbai weather CSV file, which contains the columns Day, Hour, Temperature, Relative Humidity, Wind Speed and Wind Direction. The link for the dataset is here.
We are interested in the relationship between Temperature and Relative Humidity. Here Relative Humidity is the dependent variable and Temperature is the independent variable. We perform linear regression and use different metrics to evaluate the performance of the model. To calculate the metrics we make extensive use of the sklearn library.

from sklearn.metrics import mean_absolute_error, \
    mean_squared_error, mean_absolute_percentage_error

Now let’s load the data into a pandas DataFrame and then split it into training and testing parts (for model evaluation) in an 80:20 ratio, as sketched below.
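The loading code is not shown in full in this excerpt; a minimal sketch (the file name and exact column names are assumptions) would be:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read the weather data; the file name and column names are assumed here
df = pd.read_csv("mumbai_weather.csv")
X = df[["Temperature"]]          # independent variable
Y = df["Relative Humidity"]      # dependent variable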

X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.20,
                                                    random_state=0)

Now, let’s train a simple linear regression model on the training data, and then we will move on to the evaluation part of the model using different metrics.
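The training step is not reproduced in the excerpt; a minimal sketch would be:

# Fit the linear regression and predict relative humidity on the test set
reg = LinearRegression()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)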

1. Mean Absolute Error ( MAE)

This is the simplest metric used to analyze the loss over the whole dataset. The error is the difference between the predicted and actual values, and MAE is defined as the average of the absolute errors: we take the absolute value of each error, sum them, and divide by the total number of data points. It is always a non-negative value. The formula of MAE is given by
MAE = (1/N) Σ_{i=1}^{N} |y_pred – y_actual|

mae = mean_absolute_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Error", mae)
Output:

Mean Absolute Error 1.7236295632503873

2. Mean Squared Error( MSE)

The most commonly used metric is Mean Squared Error, or MSE. It is a loss function: we take the difference between the predicted and actual values, square it, and average over all the data points in the dataset. MSE is always non-negative because the values are squared, and the smaller the MSE, the better the performance of the model. The formula of MSE is given by
MSE = (1/N) Σ_{i=1}^{N} (y_pred – y_actual)^2
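The corresponding call is not shown in the excerpt; with the names assumed above:

mse = mean_squared_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Square Error", mse)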

Output:

Mean Square Error 3.9808057060106954

3. Root Mean Squared Error( RMSE)

RMSE is a popular metric and is an extension of MSE: it is simply the square root of the MSE. It indicates how much the data points are spread around the best-fit line, and a lower value means that the data points lie closer to that line.
RMSE = sqrt( (1/N) Σ_{i=1}^{N} (y_pred – y_actual)^2 )

rmse = mean_squared_error(y_true=Y_test, y_pred=Y_pred,
                          squared=False)
print("Root Mean Square Error", rmse)

Output:
Root Mean Square Error 1.9951956560725306

4. Mean Absolute Percentage Error ( MAPE)

MAPE expresses the error as a percentage. For each data point we take the absolute difference between the actual and predicted value and divide it by the absolute actual value; these ratios are then summed and averaged. The smaller the percentage, the better the performance of the model. The formula is given by
MAPE = (1/N) Σ_{i=1}^{N} (|y_pred – y_actual| / |y_actual|) × 100%
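The corresponding call is not shown in the excerpt; note that sklearn's mean_absolute_percentage_error returns a fraction rather than a percentage, which is why the output below is about 0.023 (roughly 2.3%):

mape = mean_absolute_percentage_error(y_true=Y_test, y_pred=Y_pred)
print("Mean Absolute Percentage Error", mape)  # returned as a fraction, not multiplied by 100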

Output:

Mean Absolute Percentage Error 0.02334408993333347

Evaluating machine learning models is an important step in ensuring their effectiveness and reliability in real-world applications. Using appropriate metrics, such as accuracy, precision, recall and F1 score for classification, and regression-specific measures like MAE, MSE, RMSE and MAPE, lets us assess model performance for different tasks. Moreover, adopting evaluation techniques like cross-validation and holdout helps ensure that models generalize well to unseen data.
