Novel Machine Learning Approach for Analyzing Anonymous Credit Card Fraud Patterns
Sylvester Manlangit
Charles Darwin University, NT, Australia
[email protected]
Sami Azam
Charles Darwin University, NT, Australia
[email protected]
Bharanidharan Shanmugam
Charles Darwin University, NT, Australia
[email protected]
Asif Karim
Charles Darwin University, NT, Australia
[email protected]
ABSTRACT
Fraudulent credit card transactions are on the rise and have become a significant
problem for financial institutions and individuals. Various methods have already been
implemented to handle the issue, but embezzlers have always managed to employ
innovative tactics to circumvent security measures and execute fraudulent transactions.
Thus, instead of a rule-based system, an intelligent and adaptable machine learning
based algorithm is needed to tackle such sophisticated digital theft. The presented
framework uses k-NN for classification and Principal Component Analysis (PCA) for
raw data transformation. Neighbours (anomalies in data) were created using the
Synthetic Minority Oversampling Technique (SMOTE), and a distance-based feature
selection method was employed. When tested on the instances misclassified in the
previous work, the proposed process performed well, achieving a precision of 98.32%
and an F-Score of 97.44% for k-NN, and a precision of 100% and an F-Score of 98.24%
for the Time subset. The work also demonstrates a larger and clearer classification
breakdown, which aids in achieving a higher precision rate and an improved recall rate.
1. INTRODUCTION
Credit cards nowadays are responsible for transactions on the scale of billions of
dollars. 1 The global card business in 2014 alone had a financial volume of around USD
28.84 trillion. 2 This means that the importance of credit cards has increased; they have
become part of the financial system. The convenience they bring to consumers is one of
the major factors, and they can also be used as substitute loan products. 3 In fact, credit
cards are the key vehicle for the global e-commerce industry, which, as of 2017, was a
USD 1.5 trillion industry in terms of turnover. 4
It is now a common process to pay with credit cards. 5 With the growth of credit
card transactions, fraudulent transactions have also increased. The concern is not just
financial; nowadays identity fraud has also become a real problem. 6 Besides, with the
ever-growing increase in online transactions, where the card remains unswiped or
physically absent, the rate of fraud is rocketing. For reference, online payment systems
in 2015 churned out more than $31 trillion worldwide, while credit card losses
accounted for $21 billion in the same year. 7 This rate is expected to grow by 51% by
the year 2020. 8
A fraudulent transaction basically means utilizing someone's credit card without
their knowledge or authorization. The perpetrator in most cases has no relation to the
owner, never intends to reveal any knowledge about themselves or the process used in
the embezzlement, and the amount is never returned. 9, 10 Merchants are more at risk
than consumers. In the event of fraud, the merchant is one of the prime sufferers, as
his/her product is compromised. Merchants often have to reimburse chargeback fees and
face the risk of having their accounts closed. 10 This can lead to serious damage to the
merchant's reputation, and they can even face lawsuits of varying nature. 11
With the increase in the versatility of payment methods, new fraud patterns have
emerged, rendering current fraud detection systems unsuccessful. 12 Another reason why
fraud detection systems fail is that the persons committing fraud constantly change their
patterns, and this is exactly why the barrier devised against fraudsters must employ
machine learning concepts, not just to tackle fraud but also to address the phenomenon
of "Concept Drift". Stolen credit card information can often be used by malicious agents
to carry out transactions on the black market, where cryptocurrencies such as Bitcoin
are already in heavy use. 13
As mentioned by Hand et al. 14, there are two levels of fraud protection: fraud
prevention and fraud detection. Fraud prevention comprises measures taken to stop
fraud from happening, while fraud detection pertains to detecting fraudulent transactions
the moment they happen. 1 Innovation is needed for fraud detection because fraudsters
are also evolving. 6
There are a number of critical factors in training an algorithm for fraud detection. 12
Further, public data is not always available due to confidentiality issues. 15 The
designed system also has to address factors such as the non-stationary distribution of the
data, decidedly imbalanced class distributions (skewed towards authentic observations)
and unceasing streams of transactions. 15
Several studies have identified a rising need for strictly accurate, high-performance
fraud detection systems, based on automated machine learning principles, which can
keep pace with or even outpace the phenomenon of "Concept Drift" in the problem
domain. Concept Drift refers to the change in the behaviour of fraudsters and the
methods they apply over time. Existing rule-based static systems are too far behind the
times to cope with the continuous cycle of innovative fraud methods employed by crime
gangs and thus leave much to be desired. To address this gap, this study proposes an
advanced method based on feature selection over two of the most critical parameters
and uses the k-NN algorithm for effective classification to build the model. The
appropriateness of k-NN for the problem statement has been demonstrated in previous
studies, and the model has been trained and tested in the present work.
2. RELATED WORKS
2.1 Fraud detection
Numerous studies have been carried out on fraud detection. One of these studies
compared the performance of classification algorithms such as Logistic Regression,
Support Vector Machines and Random Forests, where Random Forest showed the best
performance. However, the attribute set selected for the experimentation could have
been expanded further to produce a more accurate result. 6 Another study used feature
engineering for detecting fraud; the researchers used the original features to create a
pattern. An example was 'Time', from which a spending pattern was formed specifying
the times a particular customer uses their credit card. But as the dataset used is
proprietary, the discussion on the specific features used, including response and
calculation time, is rather limited. 12 Other works range from handling the imbalanced
dataset to creating new ways of improving fraud detection. 1, 15, 16 Another study
surveyed data mining techniques used in fraud detection; some of the examples
mentioned are clustering, classification and Neural Networks. Neural Networks were
found to be the most impactful, but the research also mentions the difficulty of
developing Neural Network based systems due to the lack of usable datasets. 17 One of
the main issues with fraud detection is the involvement of large and growing databases. 1
Another problem is the small number of fraudulent transactions compared to normal
transactions; even the best algorithms result in many false positives (normal transactions
classified as fraudulent). 1 Lepoivre et al. demonstrated a system based on unsupervised
methods such as PCA and the SimpleKMeans (SKM) algorithm. 5 PCA has been used to
represent transactions described by attributes such as amount and date in a subspace
smaller than the initial one, with a view to minimizing information loss. The authors
clustered the data using SKM with impressive results. However, the test datasets were
extremely limited and the system relies more on minimizing execution time than on
accuracy and precision. Therefore, it would be hard to deploy in real situations, as high
accuracy as well as precision in the final results are essential aspects of such sensitive
automated processes. Table 1 provides a summary of the related works discussed for
this study.
Table 1. Summary of related works

Author: S. Bhattacharyya, S. Jha, K. Tharakunnel and C. Westland [6]
Literature: Data mining for credit card fraud: A comparative study
Research Focus: The study compared the performance of Logistic Regression, Support Vector Machines and Random Forests
Outcome: Random Forest showed better performance
Identified Shortcoming: The selected attribute set could have been expanded further

Author: A. C. Bahnsen, D. Aouada, A. Stojanovic and B. Ottersten [12]
Literature: Feature engineering strategies for credit card fraud detection
Research Focus: Feature engineering for detecting fraud
Outcome: Patterns for frauds have been created using different features
Identified Shortcoming: Due to the proprietary nature of the dataset, the discussion on the specific features created has been rather limited

Author: K. Chaudhary and B. Mallick [17]
Literature: Data mining techniques in Fraud Detection: Credit Card
Research Focus: Exploration of clustering, classification and Neural Networks in fraud detection
Outcome: Using Neural Networks performed optimally
Identified Shortcoming: The difficulty in developing systems based on Neural Networks

Author: M. R. Lepoivre, C. O. Avanzini, G. Bignon, L. Legendre, and A. K. Piwele [5]
Literature: Credit card fraud detection with unsupervised algorithms (Report)
Research Focus: Application of PCA and the SimpleKMeans (SKM) algorithm
Outcome: Better representation of features in a reduced subspace
Identified Shortcoming: The test dataset was extremely limited

In addition, Refs. [1], [15] and [16] discuss issues such as handling imbalanced
datasets and creating new ways of improving fraud detection.
Data mining is a widely used discipline in the fraud detection arena, where large
datasets are analysed to find unknown relationships in the data and to present them in a
way that can be understood by the owner of the data and is useful. 14 This is also known
as secondary data analysis. There are two types of methods for analysing datasets:
supervised and unsupervised. In supervised methods, it is assumed that past transactions
are available and dependable; however, fraud patterns that have already taken place are
often limited. 15 Unsupervised methods require little or no prior classification of
anomalies. Hence, they are suitable for transactions with no label. These methods
mostly rely on outliers, a basic but non-standard form of an observation. Stream Outlier
Detection based on Reverse k-Nearest Neighbours is one such method.
3. IMPLEMENTATION METHODS
3.2 Cross-Validation
In cross-validation, the training and validation sets cross over in successive rounds so
that each data point is given a chance of being validated. The basic form of
cross-validation is k-fold cross-validation, in which the dataset is divided equally into k
groups. The number of iterations performed is the same as k; in each iteration, k−1 of the
groups are used for training and validation, while the remaining group is used for testing
the performance of the prediction model. 20
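As an illustration of the k-fold scheme described above, the following minimal Python sketch (not part of the original study; scikit-learn and synthetic placeholder data are assumed) splits a dataset into k folds and averages the score over the folds:

from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the transaction data (the real dataset is not reproduced here)
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95, 0.05], random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)   # k = 5 folds
scores = []
for train_idx, test_idx in kf.split(X):
    # k-1 folds train the model, the remaining fold tests it
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X[train_idx], y[train_idx])
    scores.append(f1_score(y[test_idx], model.predict(X[test_idx])))

print(sum(scores) / len(scores))   # average F1 over the k folds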
Confusion Matrix: Classification algorithms are usually assessed using the Confusion
Matrix shown in Figure 1.

                    Predicted Negative    Predicted Positive
Actual Negative     TN                    FP
Actual Positive     FN                    TP

In the above figure, the columns are the predicted classes, while the rows are the
actual classes. TN (True Negative) is the number of correctly classified negative
examples, FP (False Positive) is the number of negative examples misclassified as
positive, FN (False Negative) is the number of positive examples misclassified as
negative, and TP (True Positive) is the number of correctly classified positive
examples. 21 In this research, fraudulent transactions have been marked as positive,
while the non-fraudulent ones are marked as negative. Predictive accuracy may not be a
good measure if the class imbalance is large. 21
Recall: The ratio of correctly classified positive instances to the total number of
actual positive instances is termed the Recall. This measurement reflects the capability
of the classification algorithm to correctly classify the actual fraudulent instances.
Equation 1 measures the percentage of fraud samples correctly classified as fraud.

Recall = TP / (TP + FN)    (1)
Precision: Precision demonstrates the proportion of true fraud instances among the
instances predicted as fraud. Equation 2 measures the precision accuracy.

Precision = TP / (TP + FP)    (2)
For performance analysis of the algorithm, both Recall and Precision will be
evaluated along with the F1 Score (using equation 3). F1 Score projects a better balance
of the performance measure by taking both Precision and Recall into account.
F1 Score = 2 x (Precision x Recall) / (Precision + Recall)    (3)
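A minimal Python sketch (an illustration, not code from the study; the confusion-matrix counts are hypothetical) computing these three measures:

# Hypothetical confusion-matrix counts used purely for illustration
tp, fp, fn = 113, 2, 4

recall = tp / (tp + fn)                               # Equation 1
precision = tp / (tp + fp)                            # Equation 2
f1 = 2 * precision * recall / (precision + recall)    # Equation 3

print(f"Recall={recall:.4f}, Precision={precision:.4f}, F1={f1:.4f}")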
4. METHODOLOGY
The first part of the study is to analyse the dataset. The aim is to understand what
anonymization process was used and how it transformed the data. After that, the original
features 'Time' and 'Amount' will be broken down into separate groupings. This gives
an insight into how many fraudulent transactions are present in each grouping, which
will be presented using a histogram. These groupings may be used to determine the
training set size of the proposed process. A process for detecting fraud will then be
proposed and tested.

The first comparison will be between the dataset used in the initial analysis and the
dataset size of the proposed fraud detection process. The next comparison will use test
data that was misclassified in the initial analysis. A table will be shown for each
comparison.

As suggested by Pozzolo et al. 15, a fraud detection system should produce a good
ranking of transactions by their probability of being fraudulent, rather than merely
classifying the transactions. The effectiveness of the detection method will depend on
how closely it can predict the correct classification of new data. The suggested detection
method should output the fraud probability of the new data, and it should also show how
it reached the result.
The adopted methodology includes the PCA transformation of the feature set to a
subspace with minimal information loss. The features "Time" and "Amount" will be left
out of this process. The Synthetic Minority Oversampling Technique (SMOTE) will
then be applied to the selected training set. Grouping of the features is the next step, for
the reason described at the beginning of this section. Once the grouping is completed,
k-NN will be applied for classification purposes, after which the results will be
evaluated.
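The following Python sketch outlines this sequence of steps under stated assumptions: scikit-learn and imbalanced-learn are available, and simple placeholder data stands in for the real transactions. It illustrates the order of operations, not the study's exact implementation (time subsetting and feature grouping would narrow the training set before the final step):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from imblearn.over_sampling import SMOTE

# Placeholder data: 28 raw columns plus a highly imbalanced class label
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(5000, 28))
y = (rng.random(5000) < 0.01).astype(int)      # ~1% "fraud"

# Step 1: PCA-transform the raw features ('Time' and 'Amount' would be excluded)
X_pca = PCA(n_components=28).fit_transform(X_raw)

# Step 2: oversample the minority (fraud) class with SMOTE on the training subset
X_res, y_res = SMOTE(random_state=0).fit_resample(X_pca, y)

# Step 3: classify with k-NN and evaluate the results
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_res, y_res)
print(knn.predict(X_pca[:5]))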
4.1 Anonymized credit card transaction dataset
The dataset used in this research is from Kaggle.com. It is anonymized credit card
data from European credit card users. It was anonymized to protect the privacy of the
credit card users: the data has been distorted in such a way that it is impossible to
identify any individual. Anonymization is the process of distorting data to preserve
privacy. The dataset has numerical input variables that have been PCA transformed. 16
Principal Component Analysis (PCA) can be a basis for multivariate data analysis; one
of its goals is finding connections within the data. 22 PCA has transformed the original
values into numbers, effectively protecting the privacy of the credit card users.

The dataset records a total of 284,807 transactions over a period of 2 days. Out of
all the recorded transactions, 492 have been classified as fraud (0.172% of total
transactions), so the dataset is highly unbalanced. 16 According to Han et al. 23, two
types of imbalance in a dataset are generally observed. The first is between-class
imbalance, where one class has more samples than the other (as cited by Chawla), 21
and the other is within-class imbalance, where some subsets within a class have fewer
samples than other subsets of the same class. 24 The majority class is the class having
many samples and the minority class is the one having fewer samples. 23

The dataset used in this research has an imbalance between classes: the fraudulent
transactions form the minority class, and the non-fraudulent transactions form the
majority class. Sampling techniques have been suggested to address imbalanced
datasets.
The dataset has 30 columns. Features V1 to V28 are values that were PCA
transformed to preserve confidentiality, while the features "Time" and "Amount" were
left untransformed. "Time" is the number of seconds elapsed since the first transaction
in the dataset, and "Amount" is the transaction amount. The feature "Class" is the
classification of the transaction, showing whether the transaction is a fraud. The
research 16 pointed out that the "Amount" feature can be used for example-dependent
cost-sensitive learning. This research may use the "Time" and "Amount" features to find
patterns that will lead to detecting future fraudulent transactions. The features are
arranged from the highest to the lowest variance, V1 having the highest variance and
V28 the lowest, as shown in Table 2. It is advised 16 that precision and recall accuracy
be measured using the Area Under the Precision-Recall Curve (AUPRC).
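A brief Python sketch of how this dataset might be inspected (the file name creditcard.csv and the use of pandas are assumptions; the figures quoted above come from the dataset itself, not from this snippet):

import pandas as pd

df = pd.read_csv("creditcard.csv")            # columns: Time, V1..V28, Amount, Class

print(df.shape)                               # number of transactions and columns
counts = df["Class"].value_counts()           # 0 = non-fraud, 1 = fraud
print(counts)
print(f"Fraud ratio: {counts[1] / len(df):.3%}")   # roughly 0.172%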
In the case of the credit card transactions, features of the original data are transformed
into a smaller subspace with minimal information loss. 5 The transformation of the
original data into a smaller subspace can also be described as dimensionality reduction,
which can easily be achieved using PCA. 25 If the original data is to be reduced to one
dimension, it has been suggested to create a principal component that captures the most
variation. 25 Another characteristic of PCA is that it expresses the data in a way that
highlights its similarities and differences. Patterns in high-dimensional data are often
hard to represent graphically; however, with the ability of PCA to reduce dimensions,
analysis can be made much easier and more intuitive. 26
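As a minimal illustration of this dimensionality reduction (again a sketch with made-up data, not the study's code), PCA can project 28 features onto two components for plotting and analysis:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 28))          # stand-in for features V1..V28

pca = PCA(n_components=2)                # keep the two directions with the most variance
X_2d = pca.fit_transform(X)

print(X_2d.shape)                        # (1000, 2): each transaction now has two coordinates
print(pca.explained_variance_ratio_)     # share of total variance kept by each component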
Feature selection identifies the features that contribute to the result. There are two broad
categories: wrappers and filters. Wrappers evaluate features using algorithms and, based
on the resulting accuracy, each feature is eliminated or retained. Filters use a
heuristic-based evaluation of the features depending on the general characteristics of the
data. 27 In the first part of the study, a feature will be used to build a classification model
using the logistic regression algorithm. The results will be compared to check whether
there are features that give higher fraud detection accuracy than the others.
Now, using Sturges' rule, the number of classes that will be used to divide the
transactions is determined. Sturges' rule is used to determine the number of class
groupings when creating a histogram. The computation is given below:

Number of Classes = 1 + 3.3 log₁₀ n
                  = 1 + 3.3 log₁₀ (86400)
                  = 1 + 3.3 (4.936514)
                  = 17.29049535
The value of n is the total number of seconds in a day. The formula's result is 17.29;
rounding it off gives 17, so the histogram will have 17 groupings. These groupings may
be used when proposing a new fraud detection process.

Next, the class width has to be determined in order to find out which transaction falls
into which class. The smallest value refers to the starting second of the day, while the
largest value refers to the last second in a day (86400); these can be seen in Table 1. The
class width is the range divided by the number of classes, which gives 5082.35
(86400/17). This work rounds the result off to the nearest 100, making it 5100. The first
class in Table 4 has 3 and 7 instances of fraud on the first and second day respectively.
These results will be used to divide the dataset.
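The two calculations above can be reproduced with a short sketch (illustrative only):

import math

n = 86400                                   # seconds in a day
num_classes = 1 + 3.3 * math.log10(n)       # Sturges' rule
print(num_classes)                          # ~17.29, rounded to 17 classes

classes = round(num_classes)                # 17
class_width = (86400 - 0) / classes         # range / number of classes
print(class_width)                          # ~5082.35, rounded to 5100 in this study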
This study also transformed the unit of measure from seconds into hours of the day,
assuming that the first recorded transaction happened at 12 midnight. Table 4 shows the
converted values, and it provides additional insight into why there were fewer
transactions in the earlier classes: they occurred in the earlier parts of the day, which are
mostly non-working hours, so fewer people were carrying out transactions. As
demonstrated in Table 5, this timeframe has 17 classes.
Table 6 shows there are four transactions not classified under class 1 of the feature
'Amount'. Further breakdown is needed because 99.17% of the fraud transactions
belong to Class 1 ($0 to $1400).

Table 8 shows that most of the fraudulent transactions have amounts ranging from
$0 to $4; this range contains 44.92% of all fraudulent transactions. These small-valued
transactions might be one of the causes of lost time and resources. Processing these
small transactions requires the bank to allocate personnel to check the legitimacy of the
transactions and requires merchants to pay fees for failed or fraudulent transactions. The
value of such transactions is not balanced against the resources allocated to them. As
suggested by Pozzolo et al., 16 adding a cost-sensitive equation to the classification
algorithm would make it possible to prioritise small-valued transactions whose features
are similar to those of big-valued transactions. This would shorten investigation time
and allow fraud analysts to focus on bigger-valued transactions, or improve productivity
when investigating small-valued transactions.
With the information gained in the earlier parts of the study, this section proposes a
fraud detection method. Figure 4 sketches out the proposed framework.
The process starts when a new transaction comes in. This transaction is PCA
transformed in such a way that it follows the original dataset's PCA transformation
process, denoted as point 1 in Figure 4. A subset from the original dataset is used as the
training set for the classification of the new transaction. The subset matches the class
groupings of the feature 'Time' in Table 3, denoted as point 2 in Figure 4. The next
process is to apply the Synthetic Minority Oversampling Technique (SMOTE) to the
fraud class of the selected subset, denoted as point 3 in Figure 4. SMOTE is an
oversampling approach which, instead of carrying out oversampling with replacement,
creates synthetic samples. These synthetic samples are generated along the line
segments joining the bona fide minority samples and their k minority class nearest
neighbours.
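To make the idea concrete, the following toy sketch (not the study's implementation) interpolates one synthetic point on the segment between a minority sample and a randomly chosen one of its k nearest minority-class neighbours, which is essentially what SMOTE does for each synthetic sample:

import numpy as np

rng = np.random.default_rng(2)
fraud_samples = rng.normal(loc=3.0, size=(6, 2))   # toy minority (fraud) points in 2-D

def smote_like_sample(minority, k=3):
    """Create one synthetic point between a minority point and a random
    one of its k nearest minority-class neighbours."""
    base = minority[rng.integers(len(minority))]
    dists = np.linalg.norm(minority - base, axis=1)
    neighbours = minority[np.argsort(dists)[1:k + 1]]   # skip the point itself
    neighbour = neighbours[rng.integers(len(neighbours))]
    gap = rng.random()                                  # random position on the segment
    return base + gap * (neighbour - base)

print(smote_like_sample(fraud_samples))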
Fraud data have been identified as anomalies in the pattern; they usually stand alone,
surrounded by non-fraudulent data, as can be seen in Figure 6.
The fraudulent transactions are located inside the yellow circles shown in Figure 5.
Since the process uses k-NN to classify the test data, the nearest neighbours determine
the classification of the test data. Figure 5 shows that three fraud data points are
surrounded by several non-fraud data points. Although k can have the value of 1, such a
small value is easily influenced by noise or unrelated data. Other sources suggest using
the formula √n, where n is the number of fraud instances in the subset, to determine the
value of k. 30
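A small sketch of this rule of thumb (illustrative; the fraud count used here is a placeholder):

import math
from sklearn.neighbors import KNeighborsClassifier

n_fraud_in_subset = 49                              # hypothetical number of fraud instances in the subset
k = max(1, round(math.sqrt(n_fraud_in_subset)))     # k = sqrt(n) rule of thumb -> 7 here
knn = KNeighborsClassifier(n_neighbors=k)
print(k)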
In Figure 6, there is an area between the fraud data and the surrounding
non-fraudulent data. This is where the SMOTE samples will be placed, creating
neighbours around the fraud data and thereby forming a clustered area around it. If the
test data falls within this area, it can be interpreted as having the same features as the
fraudulent data.

Several studies have described and used various feature selection methods.
Examples of these methods are Information Gain, Gain Ratio and Correlation based
Feature Selection (CFS). 31 These methods are used with other classification
algorithms; Information Gain and Gain Ratio have mostly been used in Decision Trees,
where this type of selection chooses the best feature for splitting the decision tree.
The proposed process uses the k-NN algorithm to classify a test data point or a new
transaction. This study follows the distance-based feature selection method proposed by
Kim et al. 32 for feature selection in a fault detection prediction model. That model has
two outcomes, which is similar to fraud detection. The method uses the Euclidean
distance formula, the mean absolute value and the standard deviation of each feature.
The mean absolute value (MAV) refers to the average distance of the values from the
mean, 33 and the standard deviation (SD) refers to how far the values are from the
mean. 34 These statistical measures group features based on how clustered they are; this
separates features whose data are clustered from features whose data are spread out.
Equation 4 is used for calculating the distance between features.
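Equation 4 itself is not reproduced here; the following sketch shows one plausible reading of the procedure, in which each feature is summarised by its (MAV, SD) pair and the Euclidean distance between those pairs measures how close two features are. Treat it as an assumption-laden illustration rather than the exact formula of Kim et al.:

import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 6))                     # toy data: 500 rows, 6 features

mav = np.abs(X - X.mean(axis=0)).mean(axis=0)     # mean absolute distance of values from the mean
sd = X.std(axis=0)                                # standard deviation of each feature
summary = np.column_stack([mav, sd])              # one (MAV, SD) point per feature

# Pairwise Euclidean distance between feature summaries
dist = np.linalg.norm(summary[:, None, :] - summary[None, :, :], axis=-1)

# For each feature, list the other features ordered from closest to farthest
for i, order in enumerate(np.argsort(dist, axis=1)):
    print(f"feature {i}: closest others -> {list(order[1:])}")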
Figure 7. Class 1 in time group, using features V14, V12, V11 and V4
Figure 8. Linear projection of class 1 in time group using all the features
Using the proposed feature selection method on the class 1 grouping of the feature
'Time', Table 9 shows the top 10 features that have the shortest distance in relation to
the other features. The columns labelled 2 to 10 list the features whose distance is
closest to the primary feature in the column labelled 'feature'. These groupings will be
used in the classification process when applying the k-NN algorithm.
Table 9. Top 10 features with the shortest total distance with respect to the other features
Upon checking the groupings, there are features that were repeatedly used, as shown
in Table 10. However, there were also features that were missed and not included in any
grouping. When Random Forest performed feature groupings, it used all the features
when classifying the test data. This study aims to ensure that when detecting fraud, the
process uses all the features but places them into different groupings. The remaining
features are ranked in the lower parts of the table.
Table 10. Features repeatedly used, while some features were not used
6. RESULTS
In this section, the proposed process is compared with the results obtained from
using a classification algorithm in a tool. The sampling techniques used on the dataset
are listed in Table 11.

Table 12 shows a comparison between the performance of the k-NN classification
algorithm and the performance achieved using the time groupings of the dataset with the
same settings applied. The precision increased when using the time grouping; the
example used was the class 1 grouping. Although there was a decrease in recall
accuracy, the amount of misclassified data was lower because the original subset
contains fewer fraud data compared to the whole dataset. The proposed grouping also
prevents the characteristics of unrelated data from getting mixed in with the test data.
This section interprets the finding as preventing the regular spending patterns of other
time groupings from mixing with the fraud pattern present in the selected time grouping.
Table 12. Comparison between k-NN classification algorithm and a time subset
grouping
Following the proposed process, a subset of the dataset is used to identify the nearest
neighbours of the selected instance. The time of the selected instance is 48313.69189,
which falls in the 9th class or grouping of the 'Time' feature; the range of the grouping
is from 40801 seconds to 45900 seconds. The next step is to determine the feature
groupings, five in total. Table 14 shows the groupings of features used when the
classification process is implemented. All the features were present in one of the
groupings.
Table 15 shows the summary of each grouping compared to the whole dataset.
Using the dataset used in the initial analysis, the nearest neighbours of the selected
instance were divided into 3 Frauds and 4 Non-Frauds.
Table 14. Groupings to be used for the classification process
v16: 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19, 20
v11: 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18, 19
v2: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13
v27: 12, 14, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25
Therefore, the algorithm classified it as non-fraud because there are more non-fraud
neighbours than fraud neighbours. In the case of the proposed process, all the groupings
lean towards the fraud class. This also means that the proposed process works correctly
in classifying the data.

Table 14 also indicates that the fraud neighbours are closer to the tested instance.
Group 4 showed the nearest distance, which is 0.98. The non-fraud neighbours in the
proposed process displayed a shorter distance compared to the non-fraud neighbours in
the initial analysis.

The initial analysis showed a greater number of non-fraud neighbours compared to
fraud neighbours. In the feature groupings, groups 1, 3 and 4 showed that the nearest
neighbours of the tested instance belong to the fraud class. Group 2 is the only feature
grouping that contained non-fraud neighbours, and even that grouping showed more
fraud neighbours than non-fraud ones.
Table 15. Classification summaries for each grouping and the whole dataset
Grouping        Class       Count   Total Distance   Average Distance
Whole dataset   Fraud       3       4.76             1.59
Whole dataset   Non-Fraud   4       9.15             2.29
Group 1         Fraud       7       9.82             1.40
Group 1         Non-Fraud   0       0.00             0.00
Group 2         Fraud       4       6.26             1.57
Group 2         Non-Fraud   3       5.98             1.99
Group 3         Fraud       7       10.10            1.44
Group 3         Non-Fraud   0       0.00             0.00
Group 4         Fraud       7       6.85             0.98
Group 4         Non-Fraud   0       0.00             0.00
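The neighbour counts and distances of the kind reported above can be obtained from a fitted k-NN model; the following sketch (with placeholder data, not the study's code) uses scikit-learn's kneighbors method, which returns the distances and indices of the nearest training points for a query instance:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
X_train = rng.normal(size=(200, 5))
y_train = (rng.random(200) < 0.1).astype(int)        # 1 = fraud, 0 = non-fraud

knn = KNeighborsClassifier(n_neighbors=7).fit(X_train, y_train)

x_test = rng.normal(size=(1, 5))                     # the selected instance
dist, idx = knn.kneighbors(x_test)                   # distances and indices of the 7 nearest neighbours
neighbour_classes = y_train[idx[0]]

for label, name in ((1, "Fraud"), (0, "Non-Fraud")):
    mask = neighbour_classes == label
    avg = round(float(dist[0][mask].mean()), 2) if mask.any() else 0.0
    print(name, "count:", int(mask.sum()),
          "total distance:", round(float(dist[0][mask].sum()), 2),
          "average distance:", avg)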
7. CONCLUSION
Based on the findings of the prior research work, an intelligent machine learning
based framework has been developed in this study to effectively combat credit card
fraud. It achieved an F-Score of 97.44%, with precision and recall of 98.32% and
96.58% respectively. The model employs k-NN, Principal Component Analysis (PCA)
and SMOTE, and a distance-based feature selection method has also been introduced.
The proposed process uses a smaller training set, which reduces the training and testing
time of the classification algorithm as well as the resources needed to run it. Applying a
smaller subset cut down the number of misclassified instances, because the instances
are grouped in a way that captures time-specific spending characteristics. The proposed
process performed commendably when using the misclassified instances in the test
dataset from the initial analysis. 2 The process demonstrated a larger and clearer
classification breakdown. However, some misclassified instances produced the same
results under the proposed process. This study identified one possible reason for the
misclassification: too many SMOTE instances were generated, which allowed the
fraudulent data to get mixed with the non-fraudulent data.

The dataset used in this study contains features that separate the fraudulent and
non-fraudulent data. This helped the classification algorithms create models with high
fraud detection accuracy. On the other hand, there were features that clearly mixed both
classes. The reason Random Forest and Logistic Regression could produce good results
was the combination of all features: they used the features that clearly separated the
classes as the main identifiers of the fraudulent class, while using the other features as a
gradient separator. The structure of the features helped in the creation of the fraud
detection method. This study concludes that the interpretation of the dataset directly
affects the fraud detection process. Without factors or features that help separate a
fraudulent from a non-fraudulent transaction, detecting fraud becomes a difficult task.
8. FUTURE WORK
The success of the proposed fraud detection process in classifying the instances
misclassified by the k-NN classification algorithm shows promise. However, there is
room for improvement, such as handling the different number of fraudulent transactions
obtained each time a grouping is made. Further, the percentage value of SMOTE needs
to be identified correctly. The comparison of results only tested the misclassified
instances in the test data, and this study did not cover the effects of the quantity of
SMOTE instances; the main reason for using SMOTE is to create an area around the
isolated fraud instances. This is one of the future works of this study.

Transforming the new data into PCA format is part of the proposed process. The
process of applying PCA to new data needs to be studied: as the original data has been
concealed, there is a need to know how PCA can affect the performance of a dataset and
the possibility of information gain or loss when transforming a dataset.

The transformation of the dataset into numerical form added to the effectiveness of
the k-NN algorithm. This study has defined fraud as an anomaly in the spending pattern
of a person; the possibility of this happening regularly is under 1% in a group of
280,000 transactions. Using this definition, new data that has the same features or
numerical values as a recorded fraudulent transaction in the same time grouping will be
classified as a fraudulent transaction.

Checking the proposed process on other binary-outcome datasets is another objective
for the future, to see whether it performs well in detecting the positive class of such
datasets. If the performance of the proposed process is high, this could be interpreted as
the proposed process being applicable to binary-outcome datasets using time series
groupings.
9. REFERENCES
[1] S. Jha and C. Westland, A Descriptive Study of Credit Card Fraud Pattern. Global
Business Review, 14(3), p373-84, 2013.
https://doi.org/10.1177/0972150913494713
[2] S. Manlangit, S. Azam, B. Shanmugam, K. Kannoorpatti, M. Jonkman and A.
Balasubramaniam, An efficient method for detecting fraudulent transactions using
classification algorithms on an anonymized credit card dataset. Intelligent Systems
Design and Applications, Springer, 736, p418-429, 2018.
https://doi.org/10.1007/978-3-319-76348-4_41
[3] J. M. Liñares-Zegarra and J. O. S. Wilson, Credit card interest rates and risk: new
evidence from US survey data. The European Journal of Finance, 20(10), p892-
914, 2014. https://doi.org/10.1080/1351847X.2013.839461
[4] Statista. e-Commerce. Retrieved from:
https://www.statista.com/outlook/243/100/ecommerce/worldwide#, 2018.
[5] M. R. Lepoivre, C. O. Avanzini, G. Bignon, L. Legendre, and A. K. Piwele, Credit
card fraud detection with unsupervised algorithms (Report). Journal of Advances
in Information Technology, 7(1), 34, 2016. https://doi.org/10.12720/jait.7.1.34-38
[6] S. Bhattacharyya, S. Jha, K. Tharakunnel and C. Westland, Data mining for
credit card fraud: A comparative study. Decision Support Systems, 50(3), p602-613,
2011. https://doi.org/10.1016/j.dss.2010.08.008
[7] C. Jiang, J. Song, G. Liu, L. Zheng and W. Luan, Credit Card Fraud Detection: A
Novel Approach Using Aggregation Strategy and Feedback Mechanism. IEEE
Internet of Things Journal, March, 2018.
https://doi.org/10.1109/JIOT.2018.2816007
[8] The Nilson Report. Retrieved from:
https://www.nilsonreport.com/upload/content promo/The Nilson Report 10-17-
2016.pdf, October, 2016.
[9] A. Prakash and C. Chandrasekar, A parameter optimized approach for improving
credit card fraud detection. International Journal of Computer Science Issues,
10(1), p360-366, 2013.
[10] V. R. Ganji and S. N. P. Mannem, Credit card fraud detection using anti-k
nearest neighbor algorithm. International Journal on Computer Science and
Engineering, 4(6), p1035-1039, 2012.
[11] R. Elitzur, Y. Sai, A Laboratory Study Designed for Reducing the Gap between
Information Security Knowledge and Implementation. International Journal of
Electronic Commerce Studies, 1(1), p37-50, 2010.
[12] A. C. Bahnsen, D. Aouada, A. Stojanovic and B. Ottersten, Feature engineering
strategies for credit card fraud detection. Expert Systems with Applications, 51,
p134-142, 2016. https://doi.org/10.1016/j.eswa.2015.12.030
[13] J. Jose, K. Kannoorpatti, B. Shanmugam, S. Azam, K. Yeo, A Critical Review of
Bitcoins Usage by Cybercriminals. International Conference on Computer
Communication and Informatics (ICCCI), India, 2017.
[14] D. J. Hand, H. Mannila and P. Smyth, Principles of data mining, MIT Press, 2001.
[15] A. D. Pozzolo, O. Caelen, Y. L. Borgne, S. Waterschoot and G. Bontempi, Learned
lessons in credit card fraud detection from a practitioner perspective. Expert
Systems with Applications, 41(10), p4915-4928, 2014.
https://doi.org/10.1016/j.eswa.2014.02.026
[16] A. D. Pozzolo, O. Caelen, R. A. Johnson and G. Bontempi, Calibrating Probability
with Undersampling for Unbalanced Classification. In IEEE Symposium Series on
Computational Intelligence (SSCI), 2015.