P06 The Classification Pipeline Ans
This practical is best done only after you have gone through lecture L04: The Classification Pipeline.
We will work with the MNIST dataset, which has a total of 70,000 small images of digits: 60,000 for training and 10,000 for testing.
In [1]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
np.random.seed(42) # to ensure we have consistent results
In [2]:
from sklearn.datasets import fetch_openml
X, y = fetch_openml('mnist_784', version=1, return_X_y=True)
In [3]:
# convert X, y to numpy arrays (for newer version of sklearn)
if isinstance(X, pd.DataFrame):
    X = X.values
if isinstance(y, pd.Series):
    y = y.values
X stores all the digit images in MNIST. There are 70,000 rows and 784 columns in X . Each row represents one sample (an image) in MNIST. So, each image in MNIST is a vector of size 784, which corresponds to 28x28 pixels (28 x 28 = 784). Each pixel has a value between 0 (black) and 255 (white).
y contains the labels for the corresponding images, stored as strings '0' to '9'.
In [4]:
print('Shape of X:', X.shape)
print('Shape of y:', y.shape)
In [5]:
print('label frequency')
print(pd.Series(y).value_counts())
label frequency
1 7877
7 7293
3 7141
2 6990
9 6958
0 6903
6 6876
8 6825
4 6824
5 6313
dtype: int64
The following code displays one sample from the dataset. First, we reshape the selected sample's feature vector from (784,) to (28, 28). Then, we invoke the matplotlib command imshow to show the image.
In [6]:
def display(one_digit):
    one_digit_image = one_digit.reshape(28, 28)
    plt.imshow(one_digit_image, cmap=matplotlib.cm.gray, interpolation='nearest')
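The cell above only defines the plotting helper; a possible usage (index 0 is an arbitrary choice for illustration, and its label happens to be '5') is:
display(X[0])  # show the first image in the dataset
plt.show()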
Data preparation
Split dataset into training and testing set
First, split the dataset ( X , y ) into the training set ( X_train , y_train ), which stores the first 60,000 samples, and the testing set ( X_test , y_test ), which stores the last 10,000 samples.
In [7]:
X_train, X_test, y_train, y_test = X[:60000], X[60000:], y[:60000], y[60000:]
In [8]:
print('Labels of first 10 training samples:', y_train[:10]) # show the first 10 sample
Labels of first 10 training samples: ['5', '0', '4', '1', '9', '2', '1', '3', '1', '4']
Categories (10, object): ['0', '1', '2', '3', ..., '6', '7', '8', '9']
2. Binary Classification
In this section, we learn how to build a binary classifier that is able to distinguish between just two classes: 5 and not-5 images.
The following code creates the target variables y_train_5 and y_test_5 , which are True only for samples with digit 5 and False otherwise.
In [9]:
y_train_5 = (y_train == '5')
y_test_5 = (y_test == '5')
In [10]:
print('y_train : y_train_5')
for i in range(20):
print(y_train[i], ' : ', y_train_5[i])
y_train : y_train_5
5 : True
0 : False
4 : False
1 : False
9 : False
2 : False
1 : False
3 : False
1 : False
4 : False
3 : False
5 : True
3 : False
6 : False
1 : False
7 : False
2 : False
8 : False
6 : False
9 : False
In [11]:
from sklearn.linear_model import SGDClassifier
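This cell only shows the import; a possible answer that creates and fits the digit-5 detector (random_state=42 is an assumption for reproducibility) is:
# Possible answer: create and fit a digit-5 detector (hyperparameters assumed)
sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(X_train, y_train_5)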
Performing predictions
Now, we can use it to detect images of the digit 5.
In [12]:
y_pred = sgd_clf.predict(X_train)
Let's look at our detection results for some randomly selected samples.
In [13]:
def peek_results(actual, predicted, num=20):
    print('actual | Predicted')
    print('------------------')
    for i in range(num):
        sel = np.random.randint(0, len(y_train))
        print(actual[sel], ' |', predicted[sel])

peek_results(y_train_5, y_pred)
actual | Predicted
------------------
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | True
False | False
False | False
First, we use the accuracy measure to evaluate the performance of the system.
$$\text{Accuracy} = \frac{\#\text{Correctly predicted}}{\#\text{All samples}}$$
This can be done easily through the command accuracy_score. All we need to do is to provide the
vector of actual labels ( y_train_5 ) and the corresponding vector of predicted labels ( y_pred ).
The following code computes the training accuracy. We get a very high accuracy!
Ans:
In [14]:
from sklearn.metrics import accuracy_score
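The Ans cell only shows the import; a possible completion matching the text above is:
# Possible answer: training accuracy of the detector
print('Training accuracy:', accuracy_score(y_train_5, y_pred))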
Let's use cross_val_score to evaluate our SGDClassifier model using K-fold cross-validation.
The following code computes and shows the validation accuracies of our model using 3-fold cross-validation. Set scoring = 'accuracy' .
Ans:
In [15]:
from sklearn.model_selection import cross_val_score
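A possible completion of this cell, under the same assumptions as above:
# Possible answer: 3-fold cross-validated accuracy of sgd_clf
scores = cross_val_score(sgd_clf, X_train, y_train_5, cv=3, scoring='accuracy')
print(scores)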
Notes: the function cross_val_score will not update our classifier sgd_clf with any of the fitted models. This is because cross_val_score makes copies of sgd_clf and trains using only the copies. Consequently, none of the models built are saved.
Since only about 10% of the images are 5s, a classifier that predicts not-5 for every image would still be about 90% accurate. So, even if we classify all digit-5 images wrongly, the accuracy will still be as high as 0.90.
This shows that the accuracy metric can be misleading when the dataset is skewed. The accuracy measure is not suitable for this dataset.
In [16]:
from sklearn.metrics import accuracy_score
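A minimal sketch of this point, assuming the cell is meant to check the accuracy of a baseline that always predicts not-5:
# Baseline that always predicts "not 5"; on this skewed dataset its accuracy is still about 0.90
y_pred_never5 = np.zeros(len(y_train_5), dtype=bool)
print('Baseline accuracy:', accuracy_score(y_train_5, y_pred_never5))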
A better way to evaluate a classifier on a skewed dataset is the confusion matrix, which breaks the predictions down into four counts:
1. TN (true negatives): the number of negative samples that are correctly predicted as negative
2. FP (false positives): the number of negative samples that are falsely predicted as positive
3. TP (true positives): the number of positive samples that are correctly predicted as positive
4. FN (false negatives): the number of positive samples that are falsely predicted as negative
The following steps compute the confusion matrix for our classifier. First, we need to get the
prediction results for all training samples.
One common way to do so is to build a model using the whole training set and then use the
model to predict the labels of the training set itself.
We can also use cross-validation to predict the labels of the training set through the command cross_val_predict. With 3-fold cross-validation, the training set is split into three folds and the predictions are obtained as follows:
1. Train on the samples in fold2 and fold3, predict the labels of samples in fold 1
2. Train on the samples in fold1 and fold3, predict the labels of samples in fold 2
3. Train on the samples in fold1 and fold2, predict the labels of samples in fold 3
Combining the prediction result from all 3 folds, we get the cross-validated prediction for all
samples in the whole training set.
Note that the function cross_val_predict will not update our classifier sgd_clf with any of the fitted models. This is because cross_val_predict makes copies of sgd_clf and trains using only the copies. Consequently, none of the models built are saved.
The following code shows how to use 3-fold cross validation to predict the label for all training
samples.
In [17]:
from sklearn.model_selection import cross_val_predict
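A possible completion of this cell, producing the cross-validated predictions used below:
# Possible answer: 3-fold cross-validated predictions for all training samples
y_pred_cv = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3)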
In [18]:
peek_results(y_train_5, y_pred_cv)
actual | Predicted
------------------
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
False | False
True | True
False | False
False | False
False | False
False | False
False | False
Ans:
[[52336 2243]
[ 1060 4361]]
In [19]:
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_train_5, y_pred_cv)
print(cm)
[[52336 2243]
[ 1060 4361]]
In [21]:
import seaborn as sns
In [24]:
group_names = ['True Neg', 'False Pos', 'False Neg', 'True Pos']
labels = [f'{name}\n{count}' for name, count in zip(group_names, cm.flatten())]
labels = np.asarray(labels).reshape(2, 2)
sns.heatmap(cm, annot=labels, fmt='', cmap='Blues')
precision is the accuracy of the positive predictions. In Scikit-Learn, this is implemented by the
function precision_score. A high precision is desired. It is high if most samples that the model
predicts as positive are indeed positive.
$$\text{precision} = \frac{TP}{TP + FP}$$
recall is the ratio of positive instances that are correctly detected by the classifier. In Scikit-
Learn, this is implemented by the function recall_score. A high recall is desirable. It is high if
most of the positive samples in the dataset are identified as positive by our model.
$$\text{recall} = \frac{TP}{TP + FN}$$
f1 combines precision and recall into a single metric. It is the harmonic mean of the two
measures. In Scikit-Learn, this is implemented by the function f1_score. A high f1 score is
desirable. F1 score is high only if both recall and precision are high. It will be low when either
one of the two measures is low.
$$F_1 = \frac{TP}{TP + \frac{FN + FP}{2}}$$
Ans:
precision = 0.6604
recall = 0.8045
f1 score = 0.7253
In [ ]:
from sklearn.metrics import precision_score, recall_score, f1_score
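A possible completion of this cell that reproduces the answer values above, using the cross-validated predictions y_pred_cv:
# Possible answer: precision, recall and F1 on the cross-validated predictions
print('precision = {:.4f}'.format(precision_score(y_train_5, y_pred_cv)))
print('recall = {:.4f}'.format(recall_score(y_train_5, y_pred_cv)))
print('f1 score = {:.4f}'.format(f1_score(y_train_5, y_pred_cv)))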
Unfortunately, in most cases, increasing precision reduces recall and vice versa. This is called the precision/recall tradeoff.
In the following, we shall plot the precision-recall performance on our training set to show the trade-off between these two measures as we adjust the threshold value. Once we have the precision-recall graph, we can use it to decide on a suitable threshold value for our final model.
In [ ]:
def peek_scores(actual, scores, num=20):
    print('actual | score')
    print('------------------')
    for i in range(num):
        sel = np.random.randint(0, len(y_train))
        print(actual[sel], ' |', scores[sel])
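The y_scores array used in the next cell is not computed in any cell shown; a minimal sketch, assuming the scores come from the decision function of the fitted sgd_clf, is:
# Assumed: decision-function scores for every training sample from the fitted sgd_clf
y_scores = sgd_clf.decision_function(X_train)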
In [ ]:
peek_scores(y_train_5, y_scores)
We can also retrieve the cross-validated scores for all samples through the function cross_val_predict with the parameter method="decision_function" . Each sample's score is produced by the model trained on the folds that did not contain that sample.
In [ ]:
y_scores_cv = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method="decision_function")
y_scores_cv
To generate the precision-recall graph, we need to compute the precision and recall at different threshold values. The Scikit-Learn function precision_recall_curve automatically computes precision-recall pairs for different threshold values.
In [ ]:
from sklearn.metrics import precision_recall_curve
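A possible completion of this cell, using the cross-validated scores y_scores_cv computed above:
# Possible answer: precision/recall pairs over all candidate thresholds
precisions, recalls, thresholds = precision_recall_curve(y_train_5, y_scores_cv)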
Using the precision and recall values at different thresholds, we can now plot our precision-recall
graph.
In [ ]:
def plot_precision_vs_recall(precisions, recalls):
    plt.plot(recalls, precisions, "b-", linewidth=3)
    plt.plot(np.linspace(0, 1, 20), np.linspace(1, 0, 20), 'k--')
    plt.xlabel("Recall", fontsize=16)
    plt.ylabel("Precision", fontsize=16)
    plt.axis([0, 1, 0, 1])

plt.figure(figsize=(8, 6))
plot_precision_vs_recall(precisions, recalls)
plt.title('Precision-Recall Graph (Training Set)', fontsize=20)
plt.show()
The ROC curve plots the TPR (true positive rate, or recall) against the FPR (false positive rate), the ratio of negative instances that are incorrectly classified as positive:
$$FPR = \frac{FP}{TN + FP}$$
In [ ]:
from sklearn.metrics import roc_curve
fpr, tpr, thresholds = roc_curve(y_train_5, y_scores_cv)
Using the FPR and TPR values at different thresholds, we can now plot our ROC curve.
In [ ]:
def plot_roc_curve(fpr, tpr, style='b-', label=None):
    plt.plot(fpr, tpr, style, linewidth=3, label=label)
    plt.plot([0, 1], [0, 1], 'k--')
    plt.axis([0, 1, 0, 1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('TPR vs FPR', fontsize=20)

plot_roc_curve(fpr, tpr)
In [ ]:
from sklearn.metrics import roc_auc_score
auc = roc_auc_score(y_train_5, y_scores_cv)
print('AUC = {:.4f}'.format(auc))
In [ ]:
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import precision_score, recall_score, f1_score
Hints: Use cross_validate to evaluate the model. The cross_validate function differs from cross_val_score in that it allows specifying multiple metrics for evaluation and can return both validation and training scores.
In [ ]:
from sklearn.model_selection import cross_validate
from sklearn.linear_model import SGDClassifier
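The question text for this part is not shown; the following is a hypothetical sketch of how the hint might be applied to an SGDClassifier (the scoring metrics and random_state are assumptions):
# Hypothetical sketch: 3-fold cross-validation of an SGDClassifier on precision and recall
sgd_clf = SGDClassifier(random_state=42)  # random_state assumed
results = cross_validate(sgd_clf, X_train, y_train_5, cv=3,
                         scoring=['precision', 'recall'], return_train_score=True)
print('validation precision:', results['test_precision'].mean())
print('validation recall:', results['test_recall'].mean())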
Q2. Create and evaluate the performance of a RandomForestClassifier to classify digit-5 images.
Create a random forest classifier (n_estimators = 10, random_state = 42). Then evaluate the performance of your system in terms of the precision and recall metrics on the training set using 3-fold cross-validation.
Ans (your answer may differ slightly):
In [ ]:
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
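A possible answer following the parameters given in the question (the variable name forest_clf is an assumption, reused in the sketches further below):
# Possible answer: random forest evaluated with 3-fold CV on precision and recall
forest_clf = RandomForestClassifier(n_estimators=10, random_state=42)
results = cross_validate(forest_clf, X_train, y_train_5, cv=3,
                         scoring=['precision', 'recall'])
print('validation precision:', results['test_precision'].mean())
print('validation recall:', results['test_recall'].mean())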
Analysis: From the above results, we can see that SGDClassifier does not perform as well; this is because it is a linear classifier and thus suffers from underfitting. Random Forest performs better since it is able to model non-linear data quite well.
First, compute the cross-validated classification scores for all models. Use the function cross_val_predict to compute the scores for all samples. Use 3-fold cross-validation ( cv=3 ). Set the parameter method as follows:
SGDClassifier: decision_function
RandomForestClassifier: predict_proba
In [ ]:
print('Computing scores for SGDClassifier...', end = '')
y_scores_sgd = cross_val_predict(sgd_clf, X_train, y_train_5, cv=3, method='decision_function')
print('done')
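The corresponding scores for the random forest are not shown; a sketch assuming the classifier is stored in forest_clf (see above) and the predict_proba method named in the instructions:
# Assumed counterpart for the random forest: cross-validated class probabilities
print('Computing scores for RandomForestClassifier...', end='')
y_scores_forest = cross_val_predict(forest_clf, X_train, y_train_5, cv=3, method='predict_proba')
print('done')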
Note: for SGDClassifier, cross_val_predict returns a vector of size (N,) which stores the decision score of each image for the digit-5 (positive) class. On the other hand, RandomForestClassifier returns a matrix of size (N, 2) where the first column is the probability that the data belong to class 0 (non digit-5), and the second column is the probability that the data belong to class 1 (digit-5).
In [ ]:
print('Shape of y_scores_sgd:', y_scores_sgd.shape)
print('Shape of y_scores_forest:', y_scores_forest.shape)
To make the scores consistent, we only retain the second column of the RandomForestClassifier scores.
In [ ]:
y_scores_forest = y_scores_forest[:, 1]
First, we get the FPR, TPR and the thresholds of the classifiers by calling roc_curve function (Refer to
the above example to see how to do it).
In [ ]:
fpr_sgd, tpr_sgd, thresholds_sgd = roc_curve (y_train_5, y_scores_sgd)
fpr_forest, tpr_forest, thresholds_forest = roc_curve (y_train_5, y_scores_forest)
Then, plot the ROC curve for the classifiers. You may invoke the function plot_roc_curve that we have
defined above to do this.
In [ ]:
plt.figure(figsize = (8, 6))
plot_roc_curve(fpr_sgd, tpr_sgd, 'r-', 'SGDClassifier')
plot_roc_curve(fpr_forest, tpr_forest, 'b-', 'RandomForestClassifier')
plt.legend (loc='lower right')
plt.show()
Q5. Compute and show the AUC measures for the classifiers.
In [ ]:
from sklearn.metrics import roc_auc_score
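A possible completion of this cell, using the cross-validated scores computed above:
# Possible answer: AUC for both classifiers
print('AUC (SGDClassifier) = {:.4f}'.format(roc_auc_score(y_train_5, y_scores_sgd)))
print('AUC (RandomForestClassifier) = {:.4f}'.format(roc_auc_score(y_train_5, y_scores_forest)))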