
MODULE 3.4 SUPERVISED LEARNING: CLASSIFICATION MODELS PART 2


Case Study: Credit Card Fraud Detection

Fraud detection is a task inherently suited to machine learning: machine learning-based models can scan through
huge transactional datasets, detect unusual activity, and flag the cases most likely to be fraudulent. These models
also compute faster than traditional rule-based approaches. By collecting data from various sources and mapping
them to trigger points, machine learning solutions can estimate the default rate or fraud propensity for each potential
customer and transaction, providing key alerts and insights for financial institutions.

In this case study, we will use various classification-based models to detect whether a transaction is a normal payment
or a fraud. The foci of this case study are:
1. Handling an unbalanced dataset, given that the fraud dataset is highly unbalanced with a small number of
fraudulent observations.
2. Selecting the right evaluation metric, given that one of the main goals is to reduce false negatives (cases in
which fraudulent transactions incorrectly go unnoticed).

Part 1: Defining the use case

The dataset used is obtained from Kaggle. This dataset holds transactions by European cardholders that occurred over
two days in September 2013, with 492 cases of fraud out of 284,807 transactions.

In the classification framework defined for this case study, the response (or target) variable has the column name
“Class.” This column has a value of 1 in the case of fraud and a value of 0 otherwise. The dataset has been anonymized
for privacy reasons. Given that certain feature names are not provided (i.e., they are called V1, V2, V3, etc.), the
visualization and feature importance will not give much insight into the behavior of the model.

Part 2: Loading Python Packages and Dataset

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot
from pandas import read_csv, set_option
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler

from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

from pickle import dump, load

import warnings
warnings.filterwarnings("ignore")

path = r'C:\Users\cdeani\Documents\Duane DECSC 131\05 Kaggle Datasets\creditcard.csv'
data = pd.read_csv(path)

Part 3: Exploratory Data Analysis

The following sections walk through some high-level data inspection. Since the feature descriptions are not provided,
visualizing the data will not lead to much insight, so that step is skipped in this case study. Also, this data is from
Kaggle and is already in a cleaned format, without any empty rows, columns, or #N/A values, so data cleaning or
categorization is unnecessary.

The first thing we must do is get a basic sense of our data. Remember, apart from the transaction time and amount, we
do not know the names of the other columns. The only thing we know is that the values of those columns have been
scaled. Let’s look at the shape and columns of the data:

# shape
print(data.shape)

# peek at data
set_option('display.width', 100)
data.head(5)

We will also drop the “Time” column as it is not relevant for prediction purposes. Since all the anonymized variables
(V1 to V28) are already standardized, we also scale the “Amount” column in the same way. (Scaling and normalizing
data were covered in the LSTM lecture.)

# scaling the data using standardization and removing the first column (Time) from the dataset
data['normAmount'] = StandardScaler().fit_transform(np.array(data['Amount']).reshape(-1, 1))
data = data.drop(['Time'], axis = 1)

# get a cursory summary of the dataset
data.dtypes

As shown, the variable names are nondescript (V1, V2, etc.). Also, the data type for the entire dataset is float, except
Class, which is of type integer. How many are fraud and how many are not fraud? Let us check:

# check proportion of target variable
class_names = {0:'Not Fraud', 1:'Fraud'}
print(data.Class.value_counts().rename(index = class_names))

Notice the stark imbalance of the data labels. Most of the transactions are nonfraud. If we use this dataset as the base
for our modeling, most models will not place enough emphasis on the fraud signals; the nonfraud data points will
drown out any weight the fraud signals provide. As is, we may encounter difficulties modeling the prediction of fraud,
with this imbalance leading the models to simply assume all transactions are nonfraud. This would be an unacceptable
result. We will explore some ways of dealing with this issue in the subsequent sections.
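As a quick sanity check (a minimal sketch reusing the data frame already loaded, where “Class” is 0 or 1), the fraud share can be computed directly:

# fraction of transactions labeled as fraud
fraud_ratio = data['Class'].mean()
print("Fraud cases: {:.4%} of all transactions".format(fraud_ratio))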

Part 4: Train-Test Split

We will use 80% of the dataset for model training and 20% for testing:

# train-test split
Y = data["Class"]
X = data.loc[:, data.columns != 'Class']
test_size = 0.2
seed = 7
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=test_size, random_state=seed)

Part 5: Initial Run of Classification Models

In this step, we will evaluate several machine learning models. To optimize the various hyperparameters of the
models, we use ten-fold cross-validation. Let us design our test harness. We will first evaluate the algorithms using the
accuracy metric, a coarse measure that gives a quick idea of how correct a given model is and is commonly used for
binary classification problems.

# test options for classification
num_folds = 10
scoring = 'accuracy'

Let’s create a baseline of performance for this problem and spot-check a number of different algorithms. It is important
to bear in mind that we will train all the algorithms using the default hyperparameters. The accuracy of many machine
learning algorithms is highly sensitive to the hyperparameters chosen for training the model. We will tune the
hyperparameters of select models later.

# spot-check basic classification algorithms
models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC(random_state=seed)))
models.append(('CART', DecisionTreeClassifier(random_state=seed)))
Again, all the algorithms use default tuning parameters. We will display the mean and standard deviation of the
accuracy score for each algorithm as we calculate and collect the results for later use. The code below runs 10-fold
cross-validation for the four models. Expect it to run for at least 45 minutes; given the size of the dataset, the SVM
and CART models take the longest.

results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train, Y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

After performing the k-fold cross validation on the models shown above, the overall performance is as follows:

# compare algorithms
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
fig.set_size_inches(8,4)
pyplot.show()

The overall accuracy is quite high. But let us check how well the models predict the fraud cases. Choosing the KNN
model (which has the highest accuracy) from the results above and looking at the result on the test set:

# prepare model
model = KNeighborsClassifier()
model.fit(X_train, Y_train)

# estimate accuracy on the test set
predictions = model.predict(X_test)
print(accuracy_score(Y_test, predictions))
print(classification_report(Y_test, predictions))

And producing the confusion matrix yields:

df_cm = pd.DataFrame(confusion_matrix(Y_test, predictions),
                     columns=np.unique(Y_test), index=np.unique(Y_test))
df_cm.index.name = 'Actual'
df_cm.columns.name = 'Predicted'
sns.heatmap(df_cm, cmap="Blues", annot=True, annot_kws={"size": 16})

In Python’s sklearn, the Confusion Matrix created has four different quadrants:
• Top-Left Quadrant = True Negatives
• Top-Right Quadrant = False Positives
• Bottom-Left Quadrant = False Negatives
• Bottom-Right Quadrant = True Positives
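These four counts can also be read off programmatically. A small sketch using the predictions from the KNN model above and sklearn’s confusion_matrix, which for the labels 0 and 1 returns the counts in the order tn, fp, fn, tp when flattened:

# unpack the confusion matrix into its four quadrants
tn, fp, fn, tp = confusion_matrix(Y_test, predictions).ravel()
print("True Negatives:", tn)
print("False Positives:", fp)
print("False Negatives:", fn)
print("True Positives:", tp)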

Overall accuracy is strong, as shown by the output of 0.9992977774656788, but the confusion matrix tells a different
story. Despite the high accuracy level, 33 out of the 100 instances of fraud in the test set are missed and incorrectly
predicted as nonfraud. The false negative rate is substantial. False negatives in fraud detection are fraudulent
transactions or activities that the anti-fraud system incorrectly identifies as legitimate and therefore allows to proceed.

The intention of a fraud detection model is to minimize these false negatives. To do so, the first step would be to choose
the right evaluation metric.
As discussed in the overview of classification models and evaluation metrics, accuracy is the number of correct
predictions as a ratio of all predictions made. Precision is the number of items correctly identified as positive out
of all items the model identified as positive. Recall is the number of items correctly identified as positive out of all
items that are actually positive.

For this type of problem, we should focus on recall, the ratio of true positives to the sum of true positives and false
negatives. If false negatives are high, the value of recall will be low.
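Expressed in terms of the confusion-matrix counts unpacked in the sketch above, the three metrics can be computed by hand and cross-checked against sklearn’s built-in precision_score and recall_score (a brief illustrative sketch):

from sklearn.metrics import precision_score, recall_score

# manual formulas from the confusion-matrix counts
accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
print("Accuracy: %.4f, Precision: %.4f, Recall: %.4f" % (accuracy, precision, recall))

# cross-check against sklearn on the same predictions
print(precision_score(Y_test, predictions), recall_score(Y_test, predictions))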

Part 6: Dealing with class imbalance

Since we encountered poor model performance in the previous section due to the unbalanced dataset, we will now
focus our attention on that. The main issue is that we get a very poor recall rate for the minority class when the
original imbalanced data is used to train the model.

Let us recall from Module 2.2 EDA Part 2 Cross-Sectional Data some techniques for treating class imbalance. One of
the most widely adopted techniques for dealing with highly unbalanced datasets is resampling, which consists of
removing samples from the majority class (under-sampling) and/or adding more examples to the minority class
(over-sampling).

Despite the advantage of balancing the classes, these techniques have their weaknesses. The simplest implementation
of over-sampling is to duplicate random records from the minority class, which can cause overfitting. The simplest
under-sampling technique removes random records from the majority class, which can cause a loss of information.
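For illustration only, a rough pandas sketch of both naive variants (not used in the rest of this case study, which relies on SMOTE instead):

# rebuild a training frame so the label travels with the features
train = X_train.copy()
train['Class'] = Y_train.values
minority = train[train['Class'] == 1]
majority = train[train['Class'] == 0]

# naive random over-sampling: duplicate minority rows until the classes match
oversampled = pd.concat([majority, minority.sample(len(majority), replace=True, random_state=seed)])

# naive random under-sampling: drop majority rows down to the minority count
undersampled = pd.concat([majority.sample(len(minority), random_state=seed), minority])

print(oversampled['Class'].value_counts())
print(undersampled['Class'].value_counts())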

SMOTE (Synthetic Minority Oversampling Technique) is an oversampling technique in which synthetic samples are
generated for the minority class. It helps overcome the overfitting problem posed by random oversampling. Rather
than simply duplicating existing data, SMOTE creates new, synthetic observations whose values lie close to those of
existing minority-class samples. Each synthetic training record is made by randomly selecting one of the K nearest
neighbors of a minority-class sample and interpolating between the two. After oversampling, the imbalance of the
dataset is resolved and we are ready to test different classification models.
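As a toy illustration of that interpolation step (using only NumPy; this is not the imblearn implementation, and the “neighbor” below is simply assumed rather than computed):

# toy illustration of how one synthetic minority sample is formed
minority_points = X_train[Y_train == 1].values     # minority-class feature vectors
base, neighbor = minority_points[0], minority_points[1]   # assume these two are near neighbors
lam = np.random.rand()                             # random interpolation factor in [0, 1]
synthetic = base + lam * (neighbor - base)         # new point on the segment between them
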
# implementing SMOTE to treat imbalanced data
print("Before SMOTE, counts of label '1': {}".format(sum(Y_train == 1)))
print("Before SMOTE, counts of label '0': {} \n".format(sum(Y_train == 0)))

# import SMOTE module from the imblearn library
# pip install imblearn (if you don't have imblearn installed yet)
from imblearn.over_sampling import SMOTE

sm = SMOTE(sampling_strategy=0.45, random_state=seed)
X_train_res, Y_train_res = sm.fit_resample(X_train, Y_train.ravel())

print("After SMOTE, counts of label '1': {}".format(sum(Y_train_res == 1)))
print("After SMOTE, counts of label '0': {}".format(sum(Y_train_res == 0)))

If the sampling_strategy argument is not provided, it defaults to “auto”, in which case SMOTE oversamples the
minority class until it equals the majority class. When given as a float, as in our code, it specifies the desired ratio of
the number of minority samples to the number of majority samples after resampling; here 0.45 means the minority
class is oversampled to 45% of the majority class.
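As a back-of-the-envelope check of that ratio (reusing the counts printed above):

# expected minority count after SMOTE with sampling_strategy=0.45
n_majority = sum(Y_train == 0)
print("Expected minority count after SMOTE: {}".format(int(0.45 * n_majority)))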

Sidebar: if you encounter an import error when running the code above (for example, because imblearn is not yet
installed), install the package from the Anaconda PowerShell Prompt (pip install imblearn) and then restart the kernel.

After running the code above, the minority class has been oversampled and is now 45% of the majority class. We still
preserve the fact that there are more nonfraud cases, but the fraud cases (Class = 1) are no longer severely
underrepresented.

We will train all the models again, this time using the SMOTE’d training set, and set the evaluation metric to recall:
if false negatives are high, the value of recall will be low. Models will now be ranked according to this metric.

# tuning the models to use 'Recall' and SMOTE'd training set

num_folds = 10
scoring = 'recall'

models = []
models.append(('LR', LogisticRegression(random_state=seed)))
models.append(('KNN', KNeighborsClassifier()))
models.append(('SVM', SVC(random_state=seed)))
models.append(('CART', DecisionTreeClassifier(random_state=seed)))

results = []
names = []
for name, model in models:
    kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
    cv_results = cross_val_score(model, X_train_res, Y_train_res, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
Again, expect the code above to run for at least another 45 minutes.

We see that the KNN model has the best recall of the four models, followed by the CART model. Logistic regression
and SVM did poorly. We continue by evaluating the test set using the trained KNN and CART models:

# prepare model 1
model1 = KNeighborsClassifier()
model1.fit(X_train_res, Y_train_res)

# estimate performance on the test set
predictions1 = model1.predict(X_test)
print(accuracy_score(Y_test, predictions1))
print(classification_report(Y_test, predictions1))

# prepare model 2
model2 = DecisionTreeClassifier(random_state=seed)
model2.fit(X_train_res, Y_train_res)

# estimate performance on the test set
predictions2 = model2.predict(X_test)
print(accuracy_score(Y_test, predictions2))
print(classification_report(Y_test, predictions2))

Performing SMOTE improved the recall of the KNN and CART models for the minority class, although accuracy
dropped slightly (not by a significant amount).

Part 7: Model Tuning and Testing the Final Model

Let us see if we can improve the predictive power of the selected models. Recall that the code above used default
hyperparameter settings. We can tune hyperparameters using grid search. Let us perform a grid search for the CART
model. As covered in the previous lecture notes, typical hyperparameters that need tuning for CART models are the
maximum depth and the minimum number of samples required to split a node. We will also include the minimum
impurity decrease to help with tree pruning.
# using GridSearchCV to fine tune CART model hyperparameters
max_depth = [5,10,20]
min_samples_split = [10,20,40,60]
min_impurity_decrease = [0.0001, 0.0005, 0.001, 0.005, 0.01]
param_grid = dict(max_depth=max_depth, min_samples_split=min_samples_split,
min_impurity_decrease=min_impurity_decrease)
cf_model = DecisionTreeClassifier(random_state=seed)
kfold = KFold(n_splits=num_folds, random_state=seed, shuffle=True)
grid = GridSearchCV(estimator=cf_model, param_grid=param_grid, scoring=scoring, cv=kfold)
grid_result = grid.fit(X_train_res, Y_train_res)
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))

The above code will run for at least an hour. Combining grid search with k-fold cross-validation takes up memory and
uses a significant amount of computational time. Make sure to close software or applications that may hog your
device’s memory.
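One practical way to cut the wall-clock time, if your machine has spare CPU cores, is to let GridSearchCV parallelize the work through its n_jobs argument (a minor variation on the call above; the results are unchanged):

# same grid search, spread across all available CPU cores
grid = GridSearchCV(estimator=cf_model, param_grid=param_grid, scoring=scoring, cv=kfold, n_jobs=-1)
grid_result = grid.fit(X_train_res, Y_train_res)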

Note that the grid search is performed on the SMOTE’d dataset with recall as the scoring metric. After the grid
search, the following hyperparameters were chosen: max_depth=20, min_impurity_decrease=0.0001, and min_samples_split=10.

In the next step, the final model is prepared, and the result on the test set is checked:

# prepare final classifier model
cf_model = DecisionTreeClassifier(max_depth=20, min_impurity_decrease=0.0001, min_samples_split=10,
                                  random_state=seed)
cf_model.fit(X_train_res, Y_train_res)

# predict against the original test set
cf_pred = cf_model.predict(X_test)
print(accuracy_score(Y_test, cf_pred))
print(classification_report(Y_test, cf_pred))

The accuracy and the recall for the minority class are both high. Let’s look at the confusion matrix:

# confusion matrix of final model
df_cm_final = pd.DataFrame(confusion_matrix(Y_test, cf_pred),
                           columns=np.unique(Y_test), index=np.unique(Y_test))
df_cm_final.index.name = 'Actual'
df_cm_final.columns.name = 'Predicted'
sns.heatmap(df_cm_final, cmap="Blues", annot=True, annot_kws={"size": 16})

The results on the test set are impressive, with high accuracy and, importantly, fewer false negatives. However, we
see that one outcome of using our over-sampled (SMOTE’d) training data is a propensity for false positives: cases in
which nonfraud transactions are misclassified as fraudulent.

This is a trade-off the financial institution would have to consider: there is an inherent cost balance between the
operational overhead (and possible customer-experience impact) of processing false positives and the financial loss
resulting from fraud cases missed through false negatives.
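A simple way to make that trade-off concrete is to attach unit costs to each error type and compare totals. The figures below are hypothetical placeholders, not actual business numbers:

# illustrative cost comparison using made-up unit costs
cost_per_false_positive = 5      # hypothetical review cost for a flagged legitimate transaction
cost_per_false_negative = 500    # hypothetical average loss from a missed fraud case

tn_f, fp_f, fn_f, tp_f = confusion_matrix(Y_test, cf_pred).ravel()
total_cost = fp_f * cost_per_false_positive + fn_f * cost_per_false_negative
print("Estimated total cost of errors: {}".format(total_cost))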

Let’s see also which variables had the biggest impact on our classifier.

# extracting feature importance
def plot_importance(model, features, num=len(X), save=False):
    # rank the model's feature importances against the feature names
    feature_imp = pd.DataFrame({'Value': model.feature_importances_, 'Feature': features.columns})
    pyplot.figure(figsize=(10, 10))
    sns.set(font_scale=1)
    sns.barplot(x="Value", y="Feature", data=feature_imp.sort_values(by="Value", ascending=False)[0:num])
    pyplot.title('Features')
    pyplot.tight_layout()
    if save:
        pyplot.savefig('importances.png')
    pyplot.show()

plot_importance(cf_model, X, 29)  # there are 29 feature/independent variables

From the resulting plot, we can see that V14 has by far the largest impact on the classification.
Part 8: Conclusion

In this case study, we performed fraud detection on credit card transactions. We illustrated how different classification
machine learning models stack up against each other and demonstrated that choosing the right metric can make an
important difference in model evaluation. SMOTE led to a significant improvement: all fraud cases in the test set were
correctly identified after applying it. This came with a trade-off, though, as the reduction in false negatives was
accompanied by an increase in false positives.

Overall, by using different machine learning models, choosing the right evaluation metrics, and handling unbalanced
data, we demonstrated how the implementation of a simple classification-based model can produce robust results for
fraud detection. For methods and techniques in EDA and feature engineering, revisit Module 2.4 EDA Part 4
Classification Task in Python and R.
