Assignment 1 Individual Assignment Template
(ITS66604)
STUDENT DECLARATION
1. I confirm that I am aware of the University’s Regulation Governing Cheating in a University Test
and Assignment and of the guidance issued by the School of Computing and IT concerning
plagiarism and proper academic practice, and that the assessed work now submitted is in
accordance with this regulation and guidance.
2. I understand that, unless already agreed with the School of Computing and IT, assessed work may
not be submitted that has previously been submitted, either in whole or in part, at this or any other
institution.
3. I recognise that should evidence emerge that my work fails to comply with either of the above
declarations, then I may be liable to proceedings under the Regulation.
Section | Marks
Abstract | 10
Related Works | 30
Methodology | 30
Conclusion | 5
Submission Requirements | 10
Remarks:
Acknowledgements
I would like to thank my professor, Ms. Nicole, for her sincere support throughout this work. I
also thank Taylor's University for providing the resources, and the Kaggle community for making
the dataset available. They made this experimental study possible and were instrumental in the
approach taken to implementing and analysing credit card fraud detection.
Abstract
This research addresses credit card fraud detection using the Credit Card Fraud Detection
dataset from Kaggle, which contains transactions made by European cardholders in September
2013. The dataset is extremely imbalanced: fraudulent transactions account for only 0.172% of
all records. Because of this imbalance, procedures such as the Synthetic Minority Over-sampling
Technique (SMOTE) need to be incorporated to enhance the predictive models. Only logistic
regression and decision tree models are examined; the best result was obtained by the decision
tree trained on the balanced dataset, which yielded 80% accuracy. Further research should study
other types of scaling, balancing, and tuning, as well as parallel computing, to obtain even
more accurate fraud identification and to minimize the time spent on it.
Table of Contents
1.0 Introduction
1.1 Background
Financial fraud detection is a sub-area of financial crime analysis in which much has
changed owing to advances in information technology and big-data analytics. Conventional
supervised-learning techniques have not been very effective at classifying the more
complicated and interrelated instances of fraud, which is why the field has turned to
machine learning. Recent work shows that many machine learning algorithms are highly
effective and offer a fairly accurate means of predicting fraudulent transactions. The
dataset used for this research is the Credit Card Fraud Detection dataset from Kaggle,
which contains a large number of transactions made by European cardholders during
September 2013. The overwhelming majority of these transactions are legitimate, which
makes class 0 significantly dominant in the dataset: fraudulent transactions account for
only 0.172% of the records. For such a skewed distribution, balancing techniques such as
the Synthetic Minority Over-sampling Technique (SMOTE) are the most suitable remedy.
Comparing the two algorithms studied here, logistic regression and the decision tree, the
tree is the more competent, especially when trained on a balanced dataset, where it
attains a sensible degree of accuracy. This study also intends to extend the machine
learning pipeline with parallel computation techniques so as to increase the rate at
which fraudulent transactions are identified.
1.2 Research Goal
The aim of this research is to construct more effective approaches that combine
state-of-the-art machine learning techniques with parallel processing in order to achieve
higher accuracy in the detection of fraudulent activities than is currently attainable.
1.3 Research Objectives
This research aims to enhance financial fraud detection methodologies by addressing several key
objectives:
1. Evaluate Machine Learning Algorithms: Determine how reliably the selected methods,
logistic regression and decision tree, detect fraudulent cases on the Credit Card Fraud
Detection dataset.
2. Optimize Computational Efficiency: Investigate how distributed processing and GPU
acceleration can speed up the algorithms used to build fraud detection models.
3. Address Data Imbalance: Apply correction methods such as SMOTE so that the minority
class is adequately represented and the desired detection accuracy can be achieved.
4. Examine PII Features: Investigate how regulations governing Personally Identifiable
Information (PII) affect the features available to the model, and whether this helps or
harms individual cardholders.
Through these objectives, the research proposes improvements to the methods used in financial
fraud detection models, in order to augment the efforts that financial institutions are making
to end these crimes.
2.0 Related Works
2.1 Gap Analysis
Reference | Features
Dal Pozzolo et al. (2015) | Transaction ID, Customer ID, Transaction amount, Timestamp, Customer location, Merchant ID, Transaction status, Card type, Fraud label, Data normalization, Principal Component Analysis (PCA), 80%:20% data split, Random Forest (RF), Neural Network (NN), Support Vector Machine (SVM)
Jurgovsky et al. (2018) | Transaction ID, Customer ID, Transaction amount, Timestamp, Customer location, Device type, Merchant ID, Transaction status, Card type, Fraud label, Data normalization, Time-based features, 70%:30% data split, Recurrent Neural Network (RNN), Logistic Regression (LR)
Liu et al. (2019) | Transaction ID, Customer ID, Transaction amount, Timestamp, Customer location, Device type, Merchant ID, Transaction status, Card type, Fraud label, Data scaling, SMOTE applied, 75%:25% data split, Decision Tree (DT), Gradient Boosting (GB)
3.0 Methodology
3.1 Dataset
The dataset is the Credit Card Fraud Detection dataset from Kaggle. It is built from
purchases made by European cardholders in September 2013 and is highly skewed, as only a
small fraction of the transactions are fraudulent. The dataset incorporates a list of
diverse features pertinent to transactional activity; among the features considered are
the transaction ID, customer ID, transaction amount, time of the transaction, and
geographical location of the customer.
Column Name | Description | Format
Transaction_ID | Identification number for the transaction | Int
Customer_ID | Identification number for the customer | Int
Transaction_amount | Amount of the transaction | Float
Transaction_type | Type of the transaction (e.g., purchase, withdrawal) | String
Merchant_ID | Identification number for the merchant | Int
Time_stamp | Date and time of the transaction | String
Customer_age | Age of the customer | Int
Customer_gender | Gender of the customer | String
Customer_location | Location of the customer | String
Device_type | Type of device used for the transaction | String
IP_address | IP address of the device used for the transaction | String
Merchant_category | Category of the merchant | String
Payment_method | Method of payment used in the transaction | String
Transaction_status | Status of the transaction (e.g., completed, pending) | String
Card_type | Type of card used in the transaction | String
Card_issuer | Issuer of the card used in the transaction | String
Card_country | Country of the card issuer | String
Transaction_frequency | Frequency of transactions by the customer | Int
Previous_transaction_amount | Amount of the previous transaction | Float
Fraud_label | Indicator if the transaction is fraudulent | Int
Account_balance | Balance of the customer's account | Float
Account_tenure | Duration of the customer's account | Int
Number_of_transactions | Total number of transactions made by the customer | Int
Number_of_chargebacks | Number of chargebacks made by the customer | Int
Merchant_rating | Rating of the merchant | Float
Average_transaction_amount | Average amount of transactions made by the customer | Float
Customer_credit_score | Credit score of the customer | Int
Customer_income | Income of the customer | Int
Number_of_linked_accounts | Number of accounts linked to the customer | Int
3.2 Pre-processing
For the pre-processing phase, the data from Kaggle's Credit Card Fraud Detection dataset
is imported into a suitable analysis environment using a data-manipulation tool such as
Python's pandas package. This is followed by data cleaning, in which inconsistent records
and missing or invalid values are dealt with before analysis. Cleaning also entails
removing characteristics or features that are unwanted or irrelevant and might interfere
with the analyses to follow.
When working with categorical data, techniques such as one-hot encoding or label encoding
are used to convert the categorical attributes into the numerical form required by most
modelling algorithms. Numerical attributes are also preprocessed with normalization or
standardization to remove the bias resulting from differences in scale. Because of the
unequal class distribution typical of fraud-detection datasets, balancing techniques such
as SMOTE are applied.
The dataset is split into training and test partitions to verify model performance, and a
portion of the data is set aside for tuning the hyperparameters. During these steps, the
outcomes of the pre-processing techniques are verified to ensure that the resulting
dataset is suitable for model building. These steps give the subsequent phases of model
development and performance assessment a green light; a minimal sketch of the pipeline is
shown below.
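The following sketch assumes the Kaggle file creditcard.csv with the columns described in
Section 4.1 ('Time', 'V1' to 'V28', 'Amount', 'Class'); the file name, split ratio, and
random seeds are illustrative assumptions, not a prescribed implementation.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from imblearn.over_sampling import SMOTE

# Load the Kaggle dataset (file name assumed).
df = pd.read_csv("creditcard.csv")

# Basic cleaning: drop duplicate rows and records with missing values.
df = df.drop_duplicates().dropna()

X = df.drop(columns=["Class"])
y = df["Class"]

# Hold out a stratified test set before any resampling so that the
# test distribution stays realistic.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Standardize the features to remove scale bias (fit on training data only).
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Balance the minority (fraud) class with SMOTE on the training set only.
X_train_bal, y_train_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)
```

Note that SMOTE is applied only to the training partition; resampling the test set would
distort the evaluation.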
3.3 Models
This phase can entail feature selection, in which the analyst retains only those
variables that the machine learning algorithms can use to identify fraudulent
transactions in the dataset. Decision tree and logistic regression models were employed
for prediction because they are well suited to binary classification and their results
are easier to interpret than those of more complex models.
1. Logistic Regression
Logistic regression suits the current binary classification problem because it provides
the probability that a given input belongs to a particular class. It maps the input
features to a probability value between 0 and 1 and then thresholds that probability to
arrive at a class label.
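Concretely, logistic regression passes a linear combination of the input features through
the sigmoid function (the standard formulation, stated here for completeness):

$$ p(y = 1 \mid \mathbf{x}) \;=\; \sigma(\mathbf{w}^{\top}\mathbf{x} + b) \;=\; \frac{1}{1 + e^{-(\mathbf{w}^{\top}\mathbf{x} + b)}} $$

The class label is then obtained by thresholding, for example predicting fraud when this
probability is at least 0.5.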
Steps:
Data Preparation: Split the pre-processed dataset into a training set and a test set.
Model Training: Fit the logistic regression model to the training set.
Hyperparameter Tuning: Use cross-validation to determine the best value of the
regularization parameter C.
Model Evaluation: Assess the model using accuracy, precision, recall, the F1 measure,
and the area under the receiver operating characteristic curve (AUC-ROC). A minimal
sketch of this workflow is given below.
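The sketch below illustrates one way to carry out these steps with scikit-learn; it
reuses the balanced training arrays (X_train_bal, y_train_bal) and the held-out test
split (X_test, y_test) from the pre-processing sketch, and the grid of candidate C values
is an assumption.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Tune the inverse regularization strength C with 5-fold cross-validation.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    scoring="f1",
    cv=5,
)
grid.fit(X_train_bal, y_train_bal)
lr = grid.best_estimator_

# Evaluate on the untouched test set.
y_pred = lr.predict(X_test)
y_prob = lr.predict_proba(X_test)[:, 1]
print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1       :", f1_score(y_test, y_pred))
print("AUC-ROC  :", roc_auc_score(y_test, y_prob))
```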
2. Decision Tree
A decision tree repeatedly splits the data into subsets based on feature values; in the
derived tree, each internal node represents a decision rule and each leaf node a class
label.
Steps:
Data Preparation: Start from the pre-processed data and move directly to the train/test
split.
Model Training: Build a decision tree and fit it to the training set.
Hyperparameter Tuning: Use cross-validation to fine-tune parameters such as max_depth,
min_samples_split, and min_samples_leaf.
Model Evaluation: Report the same performance metrics used for logistic regression. A
sketch of the tuning step is given below.
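As with logistic regression, the sketch below reuses the balanced training arrays from
the pre-processing sketch; the grid of candidate parameter values is an illustrative
assumption.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

# Fine-tune the depth and leaf-size parameters named above
# with 5-fold cross-validation.
param_grid = {
    "max_depth": [4, 6, 8, 10, None],
    "min_samples_split": [2, 10, 50],
    "min_samples_leaf": [1, 5, 20],
}
grid = GridSearchCV(DecisionTreeClassifier(random_state=42),
                    param_grid, scoring="f1", cv=5)
grid.fit(X_train_bal, y_train_bal)
dt = grid.best_estimator_
print("best parameters:", grid.best_params_)
```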
Model Comparison
The performance of the two modelling techniques on the dataset will be evaluated with
respect to parameters such as accuracy and precision, to determine which model holds the
most promise for detecting fraudulent transactions.
In summary, the methodology proceeds as follows. First, the data from Kaggle's Credit
Card Fraud Detection dataset is cleaned so that only the information needed for a
productive experiment remains; this includes handling missing values, normalizing or
scaling continuous variables, and encoding categorical data. Equally crucial is handling
the class imbalance, for which methods such as the Synthetic Minority Over-sampling
Technique (SMOTE) are used. Because the intention is to build a binary classifier whose
decisions can be explained to outsiders, logistic regression and decision tree models are
chosen. Both models are trained on the pre-processed data, and cross-validation is
applied while searching for the best parameter settings. Performance on fraudulent
transactions is then evaluated with statistical indicators such as accuracy, precision,
recall, the F1 measure, and the area under the receiver operating characteristic curve
(AUC-ROC). Finally, the two models are compared on these metrics to ascertain which is
most appropriate for the job (a sketch of such a comparison is given below). This keeps
the overall methodological strategy optimized for constructing and evaluating machine
learning models on credit card data, with the aim of making the findings as credible as
possible.
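A minimal sketch of such a comparison, assuming the fitted models lr and dt and the test
split from the earlier sketches; the helper summarize is hypothetical, introduced here
only for illustration.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def summarize(name, model, X_test, y_test):
    # Print the five comparison metrics for a fitted binary classifier.
    y_pred = model.predict(X_test)
    y_prob = model.predict_proba(X_test)[:, 1]
    print(f"{name}: acc={accuracy_score(y_test, y_pred):.3f} "
          f"prec={precision_score(y_test, y_pred):.3f} "
          f"rec={recall_score(y_test, y_pred):.3f} "
          f"f1={f1_score(y_test, y_pred):.3f} "
          f"auc={roc_auc_score(y_test, y_prob):.3f}")

summarize("logistic regression", lr, X_test, y_test)
summarize("decision tree", dt, X_test, y_test)
```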
4.0 Implementation and Results
4.1 Initial EDA
After the dataset is uploaded to the Google Colab notebook, it is imported into the
environment as a pandas data frame named credit_card_data. To get a general idea of the
structure of the data, the first and last few rows are printed. The dataset contains
284,807 rows and 31 columns.
Time: Seconds elapsed between this transaction and the first transaction in the dataset.
V1 to V28: Principal components obtained from the original features with PCA; the
original features are anonymized.
Amount: Transaction amount.
Class: 1 for a fraudulent transaction, 0 otherwise.
Using the '.info()' function, a summary of the datatype of each column is obtained. There
are 30 numerical (float) features and one integer column representing the class label,
and no column contains null values.
The dataset does not contain any missing values.
The 'Class' column is of integer data type, while all other columns are of float data
type.
The summary statistics of the numerical columns are computed with the '.describe()'
function. This includes measures of central tendency and dispersion: the mean, standard
deviation, and range of values. It is useful for getting an idea of the distributions and
the presence of outliers in the data.
Using the '.nunique()' function, the number of unique values in the 'Class', 'Time', and
'Amount' columns is displayed (the sketch below collects these EDA steps).
Class: There are two unique values, so the dataset has two partitions: 0 representing
non-fraudulent and 1 representing fraudulent transactions.
Time: There are 124,592 unique values, which shows that transaction times vary
considerably.
Amount: There are roughly 32,767 unique values, which shows that transaction amounts
differ widely.
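A minimal sketch of these initial EDA steps (the file name creditcard.csv is an
assumption):

```python
import pandas as pd

credit_card_data = pd.read_csv("creditcard.csv")

# Shape and a first/last look at the records.
print(credit_card_data.shape)   # (284807, 31)
print(credit_card_data.head())
print(credit_card_data.tail())

# Column datatypes and null counts.
credit_card_data.info()

# Central tendency, dispersion, and range of each numerical column.
print(credit_card_data.describe())

# Number of unique values in selected columns.
print(credit_card_data[["Class", "Time", "Amount"]].nunique())
```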
The initial EDA shows that the target variable 'Class', which separates fraudulent from
non-fraudulent transactions, has the overwhelming majority of its observations in the
non-fraudulent category. Balancing techniques must therefore be employed in the modelling
phase. The data has no missing values, and the features are numerical, which allows the
direct application of machine learning algorithms with normalization and standardization
as the only prerequisites.
4.2 Descriptive EDA
After the data transformation steps, a count plot of the binary 'Class' variable is
generated with the Seaborn library's '.countplot()' function (see the sketch below). As
illustrated by the bar chart in figure 7, the vast majority of the data carries the '0'
fraud label, so the model will encounter far more non-fraudulent transactions than
fraudulent ones.
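A minimal sketch of the count plot, reusing the credit_card_data frame from the EDA
sketch above:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar chart of non-fraudulent (0) vs. fraudulent (1) transactions.
sns.countplot(x="Class", data=credit_card_data)
plt.title("Class distribution (0 = legitimate, 1 = fraud)")
plt.show()
```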
Concerning the transaction amounts, they span a wide range, from very small to very large
values. The subsequent visualizations show that the important features in the dataset
have asymmetrical and uneven distributions, which must be taken into account when
constructing the model.
4.3 Modeling
4.3.1 Model 1 (Logistic Regression)
4.3.1.1 Experiment 1a (Initial Logistic Regression model)
Objective: To evaluate the performance of the logistic regression model on the raw
dataset.
4.3.3.4 …
4.3.3.5 Summary of Model 3
Findings: Summarize the performance of the random forest model across the different
experiments.
Conclusion: Determine the best-performing configuration of the random forest model across
the experiments.
4.4 Summary of Implementation and Results
This section discusses the implementation and results of the various models used for
credit card fraud detection. Each model was evaluated on accuracy, precision, recall, F1
score, and AUC-ROC to determine the best method. The models include logistic regression,
decision tree, and random forest, combined with techniques such as SMOTE for
class-imbalance handling, feature scaling, hyperparameter tuning, and model optimization
through feature selection. The results of these experiments show the situations in which
each model is strong and which choice is best for detecting fraud.
5.0 Analysis and Recommendations
6.0 Conclusions
This study examined logistic regression and decision tree models for credit card fraud
detection, using a highly imbalanced Kaggle dataset. The decision tree model,
particularly after SMOTE and hyperparameter tuning, outperformed logistic regression in
terms of accuracy, precision, and recall.
Key Findings:
• Logistic regression improved after applying SMOTE and feature scaling but was not robust in
identifying fraud.
• The decision tree model noticeably outperformed logistic regression, especially after
optimization and SMOTE.
• SMOTE is essential for correcting and improving model performance in class-imbalance
scenarios.
• Feature scaling is essential for logistic regression, as it helps to balance precision with
recall.
• Continued model tuning is very important for keeping up with changes in fraudulent
behaviour patterns.
• Parallel computing can speed up the training of multiple algorithms (see the sketch below).
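As a minimal sketch of one common way to exploit such parallelism, scikit-learn's n_jobs
parameter spreads cross-validation folds and ensemble members across CPU cores; the model
choice and grid values here are illustrative assumptions.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# n_jobs=-1 uses all available CPU cores: the forest trains its trees in
# parallel, and the grid search evaluates its folds in parallel.
grid = GridSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [8, None]},
    cv=5,
    n_jobs=-1,
)
# grid.fit(X_train_bal, y_train_bal)  # runs folds and trees concurrently
```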
Future work will involve further exploration of algorithms such as Random Forest and Neural
Networks, along with other data-balancing methods that can improve model performance in terms
of accuracy and robustness. Practitioners can build on these methods when designing better
fraud detection models, leading to earlier identification of fraudulent transactions and cost
savings for companies.
7.0 References
Dal Pozzolo, A., Caelen, O., Le Borgne, Y., Waterschoot, S., & Bontempi, G. (2015).
Calibrating Probability with Undersampling for Unbalanced Classification. In 2015
IEEE Symposium Series on Computational Intelligence. DOI: 10.1109/SSCI.2015.33
Jurgovsky, J., Granitzer, G., Ziegler, K., Calabretto, S., Portier, P., He-Guelton, L., &
Caelen, O. (2018). Sequence classification for credit-card fraud detection. Expert
Systems with Applications, 100, 234-245. DOI: 10.1016/j.eswa.2018.01.037
Liu, H., Dai, Y., & Wang, Z. (2019). Credit Card Fraud Detection Using Isolation
Forest Algorithm. In 2019 5th International Conference on Computer and
Communications (ICCC). DOI: 10.1109/ICCC47050.2019.9064252