Module 3.4 Classification Models, Case Study
Fraud detection is a task inherently suitable for machine learning, as machine learning-based models can scan through
huge transactional datasets, detect unusual activity, and identify all cases that might be prone to fraud. Also, the
computations of these models are faster compared to traditional rule-based approaches. By collecting data from
various sources and then mapping them to trigger points, machine learning solutions are able to discover the rate of
defaulting or fraud propensity for each potential customer and transaction, providing key alerts and insights for the
financial institutions.
In this case study, we will use various classification-based models to detect whether a transaction is a normal payment
or a fraud. The foci of this case study are:
1. Handling an unbalanced dataset, given that the fraud dataset is highly unbalanced with a small number of
fraudulent observations.
2. Selecting the right evaluation metric, given that one of the main goals is to reduce false negatives (cases in
which fraudulent transactions incorrectly go unnoticed).
The dataset used is obtained from Kaggle. It holds transactions by European cardholders that occurred over two days in September 2013, with 492 cases of fraud out of 284,807 transactions (roughly 0.17% of all transactions).
In the classification framework defined for this case study, the response (or target) variable has the column name
“Class.” This column has a value of 1 in the case of fraud and a value of 0 otherwise. The dataset has been anonymized
for privacy reasons. Given that certain feature names are not provided (i.e., they are called V1, V2, V3, etc.), the
visualization and feature importance will not give much insight into the behavior of the model.
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report, recall_score
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings("ignore")
The following sections walk through some high-level data inspection. Since the feature descriptions are not provided, visualizing the data will not lead to much insight, so that step is skipped in this case study. Also, this data comes from Kaggle and is already in a cleaned format, without any empty rows, empty columns, or #N/A values, so data cleaning or categorization is unnecessary.
The first thing we must do is get a basic sense of our data. Remember, except for the transaction time and amount, we do not know the names of the other columns. The only thing we know is that the values of those columns have been scaled. Let’s look at the shape and columns of the data:
# load the dataset (assumed saved locally as creditcard.csv, the file name used on Kaggle)
data = read_csv('creditcard.csv')
# shape
print(data.shape)
# peek at data
set_option('display.width', 100)
data.head(5)
We will also drop the “Time” column, as it is not relevant for prediction purposes. Since all the anonymous variables (V1 to V28) are already standardized, we also scale the “Amount” column using standardization. We covered scaling and normalizing data in our LSTM lecture.
# scaling the data using standardization and removing the first column (Time) from the dataset
data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1))
data = data.drop(['Time'], axis = 1)
Notice the stark imbalance of the data labels. Most of the transactions are nonfraud. If we use this dataset as the base
for our modeling, most models will not place enough emphasis on the fraud signals; the nonfraud data points will
drown out any weight the fraud signals provide. As is, we may encounter difficulties modeling the prediction of fraud,
with this imbalance leading the models to simply assume all transactions are nonfraud. This would be an unacceptable
result. We will explore some ways of dealing with this issue in the subsequent sections.
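As a quick check, the imbalance can be confirmed by counting the labels directly. A minimal sketch, assuming the data frame loaded above:
# count nonfraud (0) versus fraud (1) transactions, in absolute and relative terms
print(data['Class'].value_counts())
print(data['Class'].value_counts(normalize=True))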
We will use 80% of the dataset for model training and 20% for testing:
# train-test split
Y = data["Class"]
X = data.loc[:, data.columns != 'Class']
validation_size = 0.2
seed = 7
X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size, random_state=seed)
In this step, we will evaluate different machine learning models. To obtain reliable performance estimates, we use ten-fold cross-validation. Let us design our test harness. We will evaluate algorithms using the accuracy metric. This is a gross metric that gives us a quick idea of how correct a given model is, and it is useful for binary classification problems.
Let’s create a baseline of performance for this problem and spot-check a number of different algorithms. It is important
to bear in mind that we will train all the algorithms using the default hyperparameters. The accuracy of many machine
learning algorithms is highly sensitive to the hyperparameters chosen for training the model. We will tune the
hyperparameters of select models later.
# spot-check the candidate models with default hyperparameters, scored on accuracy
# (the same four models are re-trained later on the SMOTE'd data)
num_folds = 10
scoring = 'accuracy'
models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC(random_state=seed)))
models.append(('CART', DecisionTreeClassifier(random_state=seed)))
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
After performing the k-fold cross validation on the models shown above, the overall performance is as follows:
# compare algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,4)
pyplot.show()
The accuracy of the overall result is quite high, but let us check how well it predicts the fraud cases. Choosing the KNN model (the one with the highest accuracy) from the results above and looking at its result on the test set:
# prepare model
model = KNeighborsClassifier()
model.fit(X_train, Y_train)
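To reproduce the accuracy figure and confusion matrix discussed below, the fitted model can be scored on the held-out test set. A minimal sketch, assuming the train-test split defined earlier:
# predict on the test set and report overall accuracy and the confusion matrix
predictions = model.predict(X_validation)
print(accuracy_score(Y_validation, predictions))
print(confusion_matrix(Y_validation, predictions))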
In Python’s sklearn, the Confusion Matrix created has four different quadrants:
• Top-Left Quadrant = True Negatives
• Top-Right Quadrant = False Positives
• Bottom-Left Quadrant = False Negatives
• Bottom-Right Quadrant = True Positives
Overall accuracy is strong, as shown by the 0.9992977774656788 output, but the confusion matrix tells a different story. Despite the high accuracy level, 33 out of 100 instances of fraud are missed and incorrectly predicted as nonfraud. The false negative rate is substantial. False negatives in fraud are cases where fraudulent transactions or activities are incorrectly identified by anti-fraud systems as legitimate and therefore allowed to proceed.
The intention of a fraud detection model is to minimize these false negatives. To do so, the first step would be to choose
the right evaluation metric.
As discussed in the overview of classification models and evaluation metrics, accuracy is the number of correct predictions made as a ratio of all predictions made. Precision is the number of items correctly identified as positive out of the total items identified as positive by the model. Recall is the number of items correctly identified as positive out of the total number of actual positives.
For this type of problem, we should focus on recall, the ratio of true positives to the sum of true positives and false
negatives. So, if false negatives are high, then the value of recall will be low.
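As a brief illustration of how recall relates to the confusion matrix, the metric can be computed by hand and cross-checked against sklearn. A minimal sketch, reusing the predictions from the KNN evaluation above:
# recall = TP / (TP + FN), computed from the confusion matrix entries
tn, fp, fn, tp = confusion_matrix(Y_validation, predictions).ravel()
print(tp / (tp + fn))
# the same value via sklearn's built-in metric
print(recall_score(Y_validation, predictions))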
Since we encountered poor model performance in the previous section due to the unbalanced dataset, we will now focus our attention on that issue. The main problem is a very poor recall rate for the minority class when the original imbalanced data is used for training the model.
Let us recall from Module 2.2 EDA Part 2 Cross-Sectional Data some techniques on how to treat class imbalance. One
of the widely adopted class imbalance techniques for dealing with highly unbalanced datasets is called resampling. It
consists of removing samples from the majority class (under-sampling) and/or adding more examples from the
minority class (over-sampling).
Despite the advantage of balancing classes, these techniques also have their weaknesses. The simplest implementation
of over-sampling is to duplicate random records from the minority class, which can cause overfitting. In under-
sampling, the simplest technique involves removing random records from the majority class, which can cause a loss of
information.
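For illustration only, the two naive resampling approaches described above can be implemented with the imbalanced-learn package. This is a hedged sketch and not part of the case study pipeline, which uses SMOTE instead:
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
# naive over-sampling: duplicate random minority-class records
X_over, Y_over = RandomOverSampler(random_state=seed).fit_resample(X_train, Y_train)
# naive under-sampling: discard random majority-class records
X_under, Y_under = RandomUnderSampler(random_state=seed).fit_resample(X_train, Y_train)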
SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique in which synthetic samples are generated for the minority class. This algorithm helps to overcome the overfitting problem posed by random oversampling. Rather than simply duplicating existing records, SMOTE creates new, synthetic observations whose feature values lie close to those of the minority class. These synthetic training records are generated by randomly selecting one or more of the k-nearest neighbors of each minority-class sample and interpolating between them. After completing the oversampling, the problem of an imbalanced dataset is resolved and we are ready to test different classification models.
# implementing SMOTE to treat imbalanced data
oversample = SMOTE(sampling_strategy=0.45, random_state=seed)
X_train_res, Y_train_res = oversample.fit_resample(X_train, Y_train)
The sampling_strategy argument, if not specified, has a default setting of “auto”, in which case the SMOTE algorithm oversamples the minority class until it is equal in size to the majority class. If declared as a float, as in our code, the oversampling ratio is 45%, which corresponds to the desired ratio of the number of samples in the minority class to the number of samples in the majority class after resampling.
Sidebar: if you encounter an error when running the code above, run the relevant troubleshooting command in the Anaconda PowerShell Prompt and then restart the kernel.
After running the code above, the minority class is oversampled and now amounts to 45% of the majority class. We still preserve the fact that there are more nonfraud cases, but the fraud cases (Class = 1) are no longer severely underrepresented.
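To verify the new class ratio, the resampled labels can be counted directly. A minimal sketch, assuming the X_train_res and Y_train_res arrays produced by the SMOTE step above:
# compare the class counts before and after resampling
print(pd.Series(Y_train).value_counts())
print(pd.Series(Y_train_res).value_counts())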
We will train all the models again, this time using the SMOTE’d train set. We will also set the evaluation metric to recall: if false negatives are high, then the value of recall will be low. Models will now be ranked according to this metric.
num_folds = 10
scoring = 'recall'
models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC(random_state=seed)))
models.append(('CART', DecisionTreeClassifier(random_state=seed)))
results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train_res, Y_train_res, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Again, expect the code above to run for at least another 45 minutes.
We see that the KNN model has the best recall of the four models, followed by the CART model. Logistic regression
and SVM did poorly. We continue by evaluating the test set using the trained KNN and CART models:
# prepare model 1
model1 = KNeighborsClassifier()
model1.fit(X_train_res, Y_train_res)
# prepare model 2
model2 = DecisionTreeClassifier(random_state=seed)
model2.fit(X_train_res, Y_train_res)
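The comparison described next can be reproduced by scoring both fitted models on the original (non-resampled) test set. A minimal sketch, assuming the evaluation imports used earlier:
# evaluate both SMOTE-trained models on the held-out test set
for label, m in [('KNN', model1), ('CART', model2)]:
    preds = m.predict(X_validation)
    print(label, 'accuracy:', accuracy_score(Y_validation, preds))
    print(confusion_matrix(Y_validation, preds))
    print(classification_report(Y_validation, preds))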
Performing SMOTE improved the recall of the KNN and CART models for the minority class, although accuracy decreased slightly (not by a significant amount).
Let us see if we can improve the predictive power of the selected models. Recall that the code above used default hyperparameter settings. We can perform hyperparameter tuning using grid search. Let us perform a grid search for the CART model. As covered in the previous lecture notes, typical hyperparameters that need tuning for CART models are the maximum depth and the minimum number of samples required to split a node. We will also include the minimum impurity decrease to help with tree pruning.
# using GridSearchCV to fine tune CART model hyperparameters
max_depth = [5,10,20]
min_samples_split = [10,20,40,60]
min_impurity_decrease = [0.0001, 0.0005, 0.001, 0.005, 0.01]
param_grid = dict(max_depth=max_depth, min_samples_split=min_samples_split,
min_impurity_decrease=min_impurity_decrease)
cf_model = DecisionTreeClassifier(random_state=seed)
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
grid = GridSearchCV(estimator=cf_model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X_train_res, Y_train_res)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
The above code will run for at least 1 hour. Combining grid search with k-fold cross-validation takes up memory and uses a significant amount of computational time. Make sure to close software/applications that may hog your device’s memory.
Note that the grid search is performed on the SMOTE’d dataset and with recall as our scoring metric. After the grid search, the hyperparameters below were chosen:
In the next step, the final model is prepared, and the result on the test set is checked:
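A minimal sketch of this step, assuming the best parameters found by the grid search are reused to refit the tree on the SMOTE’d training set (the variable name final_model is introduced here for illustration):
# refit the CART model with the hyperparameters selected by the grid search
final_model = DecisionTreeClassifier(random_state=seed, **grid_result.best_params_)
final_model.fit(X_train_res, Y_train_res)
# check accuracy, recall, and the confusion matrix on the test set
final_preds = final_model.predict(X_validation)
print(accuracy_score(Y_validation, final_preds))
print(classification_report(Y_validation, final_preds))
print(confusion_matrix(Y_validation, final_preds))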
The accuracy and recall for the minority class of the model are high. Let’s look at the confusion matrix:
This is a trade-off the financial institution would have to consider: there is an inherent balance between the operational overhead and possible customer experience impact of processing false positives, and the financial loss resulting from missing fraud cases through false negatives.
Let’s also see which variables had the biggest impact on our classifier. From the image below, we can see that V14 has single-handedly dominated the classification.
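A feature importance chart of this kind can be produced directly from the fitted tree. A minimal sketch, assuming the tuned final_model from the previous step:
# plot the CART feature importances to see which variables drive the classification
importances = pd.Series(final_model.feature_importances_, index=X.columns)
importances.sort_values().plot(kind='barh', figsize=(8, 6))
pyplot.title('CART Feature Importance')
pyplot.show()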
Part 8: Conclusion
In this case study, we performed fraud detection on credit card transactions. We illustrated how different classification machine learning models stack up against each other and demonstrated that choosing the right metric can make an important difference in model evaluation. SMOTE led to a significant improvement: all fraud cases in the test set were correctly identified after it was applied. This came with a trade-off, though. The reduction in false negatives was accompanied by an increase in false positives.
Overall, by using different machine learning models, choosing the right evaluation metrics, and handling unbalanced
data, we demonstrated how the implementation of a simple classification-based model can produce robust results for
fraud detection. For methods and techniques in EDA and feature engineering, revisit Module 2.4 EDA Part 4
Classification Task in Python and R.