Hybrid Machine Learning System For Solving Fraud Detection Tasks

The document discusses a hybrid machine learning system for fraud detection tasks. It proposes a system consisting of two subsystems: an unsupervised learning subsystem for anomaly detection and a supervised learning subsystem for interpreting anomaly types. The system allows high-speed processing of real-time data streams and was effective in detecting anomalies in real data.


IEEE Third International Conference on Data Stream Mining & Processing

August 21-25, 2020, Lviv, Ukraine

Hybrid Machine Learning System for Solving Fraud Detection Tasks

Olena Vynokurova, GeoGuard, Kharkiv, Ukraine
Dmytro Peleshko, GeoGuard, Lviv\Kharkiv, Ukraine
Oleksandr Bondarenko, GeoGuard, Kharkiv, Ukraine
Vadim Ilyasov, GeoGuard, Kharkiv, Ukraine
Vladislav Serzhantov, GeoGuard, Kharkiv, Ukraine
Marta Peleshko, Lviv State University of Life Safety, Lviv, Ukraine

Abstract: In parallel with technological development, the problem of fraud detection is becoming more and more important. The increasing number of electronic transactions in various business environments, on the one hand, and software and technology development, on the other hand, lead to an active increase in electronic crime. In this paper, a hybrid machine learning system for solving anomaly detection tasks is proposed. The hybrid system consists of two subsystems: an anomaly detection subsystem (based on unsupervised learning) and an anomaly type interpretation subsystem (based on supervised learning). The advantage of the proposed hybrid system is high-speed data processing when the data are fed in real time. The effectiveness of the proposed approach was confirmed by solving an anomaly detection problem on real data streams.

Keywords: fraud detection, anomaly detection, hybrid system, isolation forest, random forest, transactions, machine learning

I. INTRODUCTION

In parallel with technological development, the problem of fraud detection is becoming more and more important. The increasing number of electronic transactions in various business environments, on the one hand, and software and technology development, on the other hand, lead to an active increase in electronic crime. Authentication methods are no longer the only way to protect against fraud. Early detection of fraud is one of the main ways to prevent it, and blocking anomalous electronic transactions is in some cases almost the main way to avoid fraud. However, the development of mathematical methods for detecting fraud stimulates the development of increasingly skilled ways of concealing it. As a result, practical fraud detection algorithms are no longer universal.

In many cases, increasing the accuracy of early identification of anomalous electronic transactions requires specialized software solutions. The essence of specialization is to use models that are adapted to the specifics of the company's business activities. For example, most of the available scientific papers on detecting fraud-related anomalies concern credit cards. This means that these fraud detection methods focus on the specific nature of input data containing information from credit card transactions, and the resulting classification models are tightly tied to this business domain. This specialization is not entirely a disadvantage, as it allows fairly easy scaling of systems or expansion of the types of detection when new types of fraud appear.

II. RELATED WORK

Using a set of rules is one of the first approaches to developing fraud detection systems. This approach has been developed in the form of knowledge-based systems, whose best-known implementations are expert systems [1].

Using a predefined set of rules simplifies the software development of fraud detection systems, but in general such systems have a number of disadvantages:

• the development of rules depends on the quality of the examination of the business environment. The effectiveness of the rule set therefore depends directly on the qualifications of the expert analysts who create the rules [2].

• expanding the system of rules is costly. New experts are needed to expand the set of rules. Therefore, the appearance of new types of fraud leads to increased costs for modifying software systems.

• with a significant increase in the set of rules, the speed of the system can decrease significantly. This problem becomes more acute when large feature vectors are used.

• when the rules use thresholds, it is very difficult to make these values adapt to environmental conditions.

Another defining characteristic of rule-based systems is the size of the rule base [1]. Small rule bases occur primarily in cases where the input data vectors have a small dimension. Therefore, software solutions based on such rule bases are characterized by high speed. But the accuracy of these systems will again depend on experts.

In terms of support, small rule bases are much easier to administer, which is another of their advantages.

978-1-7281-3214-3/20/$31.00 ©2020 IEEE 1

Authorized licensed use limited to: National Inst of Training & Indust Eng - Mumbai. Downloaded on August 12,2021 at 08:06:29 UTC from IEEE Xplore. Restrictions apply.
However, modern operational processes manipulate large-scale vectors. This leads to a significant increase in the size of the rule base and reduces the advantages of using rules for fraud detection tasks.

Other methods for solving fraud detection tasks are statistical methods. This group includes methods that are based on elements of probability theory, mathematical statistics and data collected over a period of time.

Using statistical methods is one of the main modern directions in the development of fraud detection methods. On the one hand, high accuracy of anomaly detection is obtained. On the other hand, the use of various inaccurately estimated parameters greatly reduces the flexibility of these methods and their adaptability to changes in input data. For example, many methods require setting thresholds. Other methods require information on the statistical distribution of input data, etc. [3].

Today, all statistical methods of fraud detection can be divided into two categories: supervised and unsupervised methods. Both categories rely on historical data (records of past observations) for effective fraud detection. The depth of this history may differ from method to method. One of the main problems of supervised methods is the need for labeled feature sets at the input.

This is not always possible, which motivates the development of unsupervised methods. In [4], the authors used a combination of unsupervised and supervised methods based on a self-organizing map and a neural network. Another example of a combination is the hybrid methods from [5]. In [6], a classification of various hybrid methods is presented. This classification describes a variety of combined uses of unsupervised and supervised methods. In addition, a significant number of comparative experiments were conducted to assess the effectiveness of their use.

From the point of view of the development and operation of software systems for fraud detection, machine learning methods have three main advantages:

1. An increase in the number of electronic transactions usually leads to an increase in the accuracy of fraud detection models.

2. The dimensions of modern data sets make it impossible to analyze them without automation. Machine learning methods significantly simplify and speed up the processing of large data sets.

3. The use of machine learning methods makes it possible to detect hidden dependencies. This is important for improving the accuracy of the systems and for increasing their resistance to the emergence of new types of fraud.

The analysis of scientific results shows that the most popular methods for fraud detection include logistic regression, random forests and support vector machines [7, 8]. It should be noted that [7] shows the efficiency of classification of anomalies using SVM, and [8] shows the effectiveness of anomaly detection using random forests.

III. ANOMALY DETECTION HYBRID ARCHITECTURE FOR FRAUD DETECTION PROBLEMS

To solve the fraud detection task, a pipeline for real-time anomaly detection is proposed.

Fig. 1. Architecting an Anomaly Detection Pipeline

This pipeline consists of two flows: a training/testing flow, which trains, tests and retrains the developed model, and an evaluation flow, which processes the data stream from the server in real time. The data are collected using the NoSQL database Elasticsearch, and each transaction is stored in a JSON file. Depending on the solution type of the transaction (Android, iOS, gdk (Windows and Mac), Solus (html), Plugin (Windows and Mac)), different types of information about the transaction can be stored in the JSON file.

For the training/testing flow, we collected historical data for the dataset, which must be balanced across all types of transactions and their combinations. After that, the data are fed to the feature embedding stage, which is the most time-consuming; at this stage the data are cleaned, normalized and embedded, and the final training and testing dataset for the developed model is formed.

The next stage is the development of an anomaly detector model and a system for interpreting anomaly types. Then the model is trained and tested, after which we get a model ready for use in the evaluation flow.

The evaluation flow consists of capturing the data stream from the server and forwarding it to the feature engineering stage. Using the embedding methods developed in the training/testing flow, a dataset is obtained which is fed to the deployed model. After that, the transaction score is calculated and a decision is made.

The feature engineering stage is the most time-consuming. It consists of selecting fields from the database, correlation analysis, data cleaning and preprocessing, and feature embedding. Based on correlation analysis, we selected 64 fields for building the dataset. Furthermore, we have to fill gaps in the data, since each transaction may have different fields filled in depending on the solution type. All gaps must be filled using machine learning methods or expert analysis of each field. This is important because the quality of gap filling affects the accuracy of anomaly detection. We also need to normalize and encode numeric data. The next stage of feature engineering is categorical data embedding. Among the analyzed fields, 70% are categorical variables. Different fields require different encodings, or the combination and joint encoding of several features. We used Label Encoding, One Hot Encoding, Embedding Vector, Binary Encoding, Hashing, Crosstab, Frequency Encoding and some modified methods.

The current version of the anomaly detector model is based on data from 64 database fields. These fields describe the IDs of users and devices; information about the geolocation of the users and devices that perform the transaction; the history of transactions; information about the connection type (gsm, gps, wifi, ip); information about running processes and software on devices that can be rooted; spoofed information about users and their location; etc.

Based on these fields and their combinations, 41 features were developed:

$$x(k) = \{x_1(k), x_2(k), \ldots, x_n(k)\} \qquad (1)$$

where $x(k)$ is a transaction, $x_i(k)$, $i = 1, \ldots, n$, are its features (in our case $n = 41$), and $k$ is real time.

This set of features forms the training and testing dataset.

At the case study stage, it was determined that solving the fraud detection problem requires not only detecting anomalous transactions but also interpreting why a transaction was failed. Existing anomaly detection methods solve only the first problem. Thus, it is necessary to develop a hybrid model that solves both fraud detection and interpretation of the transaction anomaly type.

The approach proposed in this paper uses two sequential models to solve the fraud detection problem. The first model is a binary classifier that solves the ‘fraud’ or ‘non-fraud’ problem. The second model is a multi-class classifier that determines the ‘fraud’ type. The general architecture is shown in Fig. 2, and a more detailed architecture of the deployed model is shown in Fig. 3.

Fig. 2. General architecture of model

Fig. 3. Detailed architecture of deployed model
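
The two-stage architecture of Figs. 2 and 3 can be illustrated with a minimal sketch, which is not the deployed GeoGuard model. It assumes scikit-learn, three synthetic features instead of the 41 used in the paper, and two hypothetical anomaly labels (`gps_spoofing`, `location_jumping`) borrowed from the anomaly types named in the experiments. The helper `classify` is likewise hypothetical. The tree counts (300 and 500) follow the values reported for the detector and the interpreter.

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins for embedded transaction features (assumption: 3 features).
normal = rng.normal(0.0, 1.0, size=(500, 3))
gps = rng.normal(8.0, 0.5, size=(50, 3))     # hypothetical "gps_spoofing" examples
jump = rng.normal(-8.0, 0.5, size=(50, 3))   # hypothetical "location_jumping" examples

# Stage 1: unsupervised anomaly detector fitted on unlabeled history
# (assumed here to consist of normal traffic only).
detector = IsolationForest(n_estimators=300, random_state=0).fit(normal)

# Stage 2: supervised interpreter fitted on labeled anomalies only.
interpreter = RandomForestClassifier(n_estimators=500, random_state=0)
interpreter.fit(
    np.vstack([gps, jump]),
    ["gps_spoofing"] * len(gps) + ["location_jumping"] * len(jump),
)

def classify(x):
    """Return 'normal' or the anomaly type predicted by the interpreter."""
    x = np.asarray(x, dtype=float).reshape(1, -1)
    if detector.predict(x)[0] == 1:          # +1 = inlier, -1 = outlier
        return "normal"
    return str(interpreter.predict(x)[0])    # only flagged transactions reach stage 2
```

In this sketch only transactions flagged by the detector reach the interpreter. The paper additionally forwards borderline transactions, which would require thresholding the detector score rather than using `predict` directly.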

Many methods for anomaly detection exist today. The Robust Covariance method [9], the One-class SVM method [10], the Local Outlier Factor method [11], the KNN method [12] and Isolation Forest [13] were investigated, and Isolation Forest was selected for developing the anomaly detector model. Among the different anomaly detection algorithms, Isolation Forest is one with unique capabilities. It is a model-free algorithm that is computationally efficient, can easily be adapted for use with parallel computing paradigms, and has been proven to be highly effective in detecting anomalies. In our case, it gave the best accuracy among the methods considered.

Thus, the vector $x(k) = \{x_1(k), x_2(k), \ldots, x_n(k)\}$ from the constructed dataset is fed to generate an isolation tree. The data are recursively partitioned by randomly selecting a feature $x_q$ and a random split value between the minimum $\min(x_q)$ and maximum $\max(x_q)$ values of the selected feature, and so on until the tree is constructed. Thus, we get an isolation tree which is a proper binary tree, where each node in the tree has exactly zero or two daughter nodes.

The task of detecting anomalies is to provide a rating of transactions that reflects the degree of their anomaly. Thus, one way to detect an anomalous transaction is to sort the transactions according to their path lengths or anomaly scores; anomalies are the transactions at the top of the list. The path length and anomaly estimate are determined by the algorithm proposed in [13].

In the case of Isolation Forest, the anomaly score is defined as:

$$s(x, l) = 2^{-\frac{E(p(x))}{m(l)}} \qquad (2)$$

where $p(x)$ is the path length of observation $x$, $m(l)$ is the average path length of an unsuccessful search in a Binary Search Tree and $l$ is the number of external nodes. More on the anomaly score and its components can be read in [13].

Each observation is given an anomaly score, and the following decisions can be made on its basis:

• a score close to 1 indicates anomalous transactions;

• a score much smaller than 0.5 indicates normal transactions;

• if all scores are close to 0.5, then the entire sample does not seem to have clearly distinct anomalous transactions.

The random forest method [14] was chosen to develop the interpreter model of the hybrid fraud detection system. The interpreter model explains to the provider companies why a transaction was determined to be abnormal. Because several types of anomalies can be present in a single transaction in the system under development, random forest-based methods are given priority.

For each decision tree, Scikit-learn calculates a node's importance using Gini Importance, assuming only two child nodes (binary tree):

$$n_j^{im} = w_j U_j - w_j^{r} U_j^{r} - w_j^{l} U_j^{l} \qquad (3)$$

where $n_j^{im}$ is the importance of node $j$, $w_j$ is the weighted number of samples reaching node $j$, $U_j$ is the impurity value of node $j$, and the superscripts $r$ and $l$ refer to $r_j$, the child node from the right split on node $j$, and $l_j$, the child node from the left split on node $j$.

The importance of each feature on a decision tree is obtained in the form:

$$f_i^{im} = \frac{\sum_{j} n_j^{im}}{\sum_{k \in \text{all nodes}} n_k^{im}} \qquad (4)$$

where $f_i^{im}$ is the importance of feature $i$, $n_j^{im}$ is the importance of node $j$, and the sum in the numerator runs over the nodes $j$ that split on feature $i$.

After that, these feature importances are normalized using the expression (the left-hand side denotes the normalized value)

$$f_i^{im} = \frac{f_i^{im}}{\sum_{j \in \text{all features}} f_j^{im}}. \qquad (5)$$

The final feature importance, at the Random Forest level, is its average over all the trees. The sum of the feature's importance values over the trees is calculated and divided by the total number of trees:

$$RF_i^{im} = \frac{\sum_{j \in \text{all trees}} f_{ij}^{im}}{Tr} \qquad (6)$$

where $RF_i^{im}$ is the importance of feature $i$ calculated from all trees in the Random Forest model, $f_{ij}^{im}$ is the normalized importance of feature $i$ in tree $j$, and $Tr$ is the total number of trees.

Anomalous transactions and borderline transactions that have been detected by the anomaly detector model are fed to the interpreter model. Cross-validation was used for tuning the hyperparameters of the interpreter model. Therefore, we have a cascade of classifiers, which as a whole constitutes the proposed anomaly detector model.

IV. EXPERIMENTS

The proposed hybrid system was developed to solve the problem of detecting anomalies in the geolocation of users during transactions (GPS spoofing, Wi-Fi spoofing, location jumping, etc.). Experimental studies were conducted on the database of the GeoGuard company.

The specific feature of the system is that the decision cannot be made at the level of detecting an anomaly alone: the type of anomaly must be explained to justify the decision.

The input vector consisted of 41 features based on information collected in the NoSQL Elasticsearch database in JSON form for each transaction. Transaction information depends on the type of the user's operating system (Android, iOS, macOS, Windows) and consists of fields describing the user's geolocation when executing a transaction, obtained from gsm, gps and wi-fi sources, and fields describing the user's device. Transactions appear in the environment at a rate of 800 transactions per minute.
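
As a numerical check of the anomaly score in Eq. (2), which the detector applies to each incoming transaction, the following sketch evaluates $s(x, l)$ for a given mean path length. The closed form used for $m(l)$ is taken from the Isolation Forest paper [13], not from this paper, and the sub-sample size of 256 in the usage note is an assumption (it is the default suggested in [13]). The function names are hypothetical.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def average_path_length(l: int) -> float:
    """m(l): average path length of an unsuccessful BST search, as given in [13]."""
    if l <= 1:
        return 0.0
    if l == 2:
        return 1.0  # exact value, since H(1) = 1
    harmonic = math.log(l - 1) + EULER_GAMMA  # approximation of H(l - 1)
    return 2.0 * harmonic - 2.0 * (l - 1) / l

def anomaly_score(mean_path_length: float, l: int) -> float:
    """Eq. (2): s(x, l) = 2 ** (-E(p(x)) / m(l))."""
    return 2.0 ** (-mean_path_length / average_path_length(l))
```

With an assumed sub-sample of 256 points, `average_path_length(256)` is roughly 10.24. A transaction isolated after 4 splits on average then scores roughly 0.76, one that needs 16 splits scores roughly 0.34, and a mean path length equal to $m(l)$ gives exactly the 0.5 boundary mentioned in the decision rules.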

The training dataset consists of about 90 000 samples. The number of trees was 300 for the anomaly detection model and 500 for the interpreter model.

Table I shows the detection results and the interpretation accuracy of the proposed hybrid machine learning system.

TABLE I. THE RESULTS OF ANOMALIES DETECTION AND THEIR INTERPRETATION (ACCURACY, %)

Solution                   Detector      Detector     Classification &
                           (training)    (testing)    Interpreter
all solutions              91            90           90
ios                        91            90           95
android                    92            90           95
plugin (Windows+MacOs)     99            96           99
gdk (Windows+MacOs)        97            96           98

The results show that anomaly detection on each individual solution is at least as accurate as, and for most solutions more accurate than, detection when all solutions are combined into one dataset. The specificity of the system is the need to balance the dataset, in which all known anomalies on each operating system must be represented. It is also a feature of the system that multiple anomalies may be present in a transaction at the same time, which complicates the process of interpretation.

V. CONCLUSION

In this paper, a hybrid machine learning system for solving anomaly detection problems is proposed. The hybrid system consists of two subsystems, an anomaly detection subsystem and an anomaly type interpretation (classification) subsystem, which are based on a cascade of decision trees with supervised and unsupervised learning. The advantage of the hybrid system is the speed of processing data that are fed in real time. The effectiveness of the proposed approach has been confirmed in solving the practical problem of detecting anomalies in user geolocation during transaction execution (GPS spoofing, Wi-Fi spoofing, location jumping, etc.).

REFERENCES

[1] N. F. Ryman-Tubb, P. Krause, and W. Garn, “How Artificial Intelligence and machine learning research impacts payment card fraud detection: A survey and industry benchmark,” Engineering Applications of Artificial Intelligence, no. 76, pp. 130–157, 2018.
[2] R. P. Dazeley, “To The Knowledge Frontier and Beyond: A Hybrid System for Incremental Contextual-Learning and Prudence Analysis,” PhD thesis, University of Tasmania, 2006. https://eprints.utas.edu.au/8173
[3] A. Patcha and J. M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Computer Networks, vol. 51 (12), pp. 3448–3470, 2007.
[4] J. T. Quah and M. Sriganesh, “Real-time credit card fraud detection using computational intelligence,” Expert Systems with Applications, vol. 35 (4), pp. 1721–1732, 2008.
[5] Y. Moreau, E. Lerouge, H. Verrelst, C. Stormann, P. Burge, and K. U. Leuven, “A hybrid system for fraud detection in mobile communications,” Neural Networks, pp. 447–454, April 1999.
[6] C. Phua, V. Lee, K. Smith, and R. Gayler, “A Comprehensive Survey of Data Mining-based Fraud Detection Research,” DOI: 10.1016/j.chb.2012.01.002, arXiv:1009.6119, 2010.
[7] S. Bhattacharyya, S. Jha, K. Tharakunnel, and J. C. Westland, “Data mining for credit card fraud: A comparative study,” Decision Support Systems, vol. 50 (3), pp. 602–613, 2011.
[8] D. Meyer, F. Leisch, and K. Hornik, “The support vector machine under test,” Neurocomputing, vol. 55 (12), pp. 169–186, 2003.
[9] P. J. Rousseeuw and M. Hubert, “Anomaly Detection by Robust Statistics,” arXiv:1707.09752v2 [stat.ML], 14 Oct 2017.
[10] R. Zhang, Sh. Zhang, S. Muthuraman, and J. Jiang, “One Class Support Vector Machine for Anomaly Detection in the Communication Network Performance Data,” in 5th WSEAS Int. Conference on Applied Electromagnetics, Wireless and Optical Communications, 2007.
[11] M. M. Breunig, H.-P. Kriegel, R. T. Ng, and J. Sander, “LOF: Identifying Density-Based Local Outliers,” in Proc. ACM SIGMOD 2000 Int. Conf. on Management of Data, Dallas, TX, 2000.
[12] Y. Djenouri, A. Belhadi, J. C. Lin, and A. Cano, “Adapted K-Nearest Neighbors for Detecting Anomalies on Spatio–Temporal Traffic Flow,” IEEE Access, vol. 7, pp. 10015–10027, 2019. doi: 10.1109/ACCESS.2019.2891933
[13] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation Forest,” in Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422, December 2008. https://doi.org/10.1109/ICDM.2008.17
[14] L. Breiman, “Random Forests,” Machine Learning, vol. 45 (1), pp. 5–32, 2001. doi: 10.1023/A:1010933404324
