ML-1 Research Paper
Karan More
Dept: B.Tech-CSE
Lovely Professional University
Phagwara, Punjab
Reg. No.: 12223186
Abstract— With the development of technology and the continuous growth of the economy in the modern world, fraudulent activities have become commonplace in the financial industry, costing institutions and consumers hundreds of billions of dollars every year. Fraudsters keep changing their tactics to exploit the weaknesses of current preventive measures, and most of them target the financial sector. These crimes include credit card fraud, health and auto insurance fraud, money laundering, securities and commodities fraud, and insider trading. Fraud prevention systems alone do not give adequate protection against such crimes. Fraud detection systems are therefore more important than ever, because they reduce costs by detecting fraudulent acts that have already been committed. Over the last couple of decades, researchers have studied anomaly detection techniques extensively for this purpose, many of which employ statistical, artificial intelligence, and machine learning models. Until recently, supervised learning algorithms made up the majority of the models studied. However, supervised learning models are associated with many challenges that have been, and can be, addressed by the semi-supervised and unsupervised learning models proposed in recently published literature. This survey aims to investigate and present a thorough review of the most popular and effective anomaly detection techniques applied to detect financial fraud, with a focus on highlighting recent advancements in semi-supervised and unsupervised learning.

Keywords— Outlier, Anomaly detection, Outlier detection, Machine learning, Deep learning, Financial fraud, Credit card fraud, Insurance fraud, Securities and commodities fraud, Insider trading, Money laundering

I. INTRODUCTION

Anomaly detection is a very broad term that centers on finding instances of data or occurrences that do not exhibit the anticipated behavior (Chandola, Banerjee, & Kumar, 2009). The terms anomaly detection and outlier detection are often considered synonymous. Most of the methods developed for anomaly detection are essentially equivalent but are named after their domains of application. Other terms for anomalies include discordant objects, exceptions, aberrations, peculiarities, and impurities (Chandola et al., 2009; Aggarwal, 2013). This paper uses the term anomaly detection, and outlier detection where the approach to the problem calls for it.

Anomalies matter because they translate into important, actionable, and often critical information in many applications. The reasons why data may contain anomalies vary, for instance malice or system failure, but common to all of them is that the anomalous characteristics are of particular interest to the analyst (Chandola et al., 2009). Anomalous behavior in credit card data, for example, may indicate identity theft or fraudulent transactions performed by an unauthorized third party (Singh & Upadhyaya, 2012). Anomalous traffic patterns in a computer network may indicate a malicious attempt to breach or compromise the system, which can cause severe disruptions, or may reveal a hacked computer that is sending sensitive data to an unauthorized destination (Chandola et al., 2009; Singh & Upadhyaya, 2012). In the health sector, anomaly detection techniques detect malignant cells or regions within medical images, including MRI scans (Spence et al., 2001; Han et al., 2020). In spacecraft, anomalous readings or measurements from sensors could mean a faulty component, and in nature, an earthquake can be predicted by finding anomalies in the precursor data (Fujimaki, Yairi, & Machida, 2005; Saradjian & Akhoondzadeh, 2011). In all of the applications described, there is a sense of a "normal" model of the data from which the anomalies deviate (Aggarwal, 2013).

Financial fraud, which encompasses but is not limited to credit card fraud, insurance fraud, money laundering, healthcare fraud, and securities and commodities fraud, has recently drawn considerable concern and attention from those trying to prevent it. The alarming increase in rates of financial fraud poses serious economic concerns. Global losses are estimated at billions of dollars annually; for instance, the cost in the United States has been estimated at more than $400 billion per year (Bhattacharyya, Jha, Tharakunnel, & Westland, 2011; Kirkos, Spathis, & Manolopoulos, 2007). It is also important to mention that financial fraud has serious impacts at the industry level and is connected with the funding of prohibited activities such as organized crime and drug trafficking (West & Bhattacharya, 2016). The losses from this crime are usually borne by the companies and traders involved in commerce, who end up paying most of the expenses that result from fraud. These include chargebacks to the merchant, administrative costs related to the fraud, and, more importantly, the loss of confidence among consumers who are victimized by such fraud (Quah & Sriganesh, 2008; Sánchez et al., 2009). Consequences such as these are dire and justify developing the strategies and techniques needed for detection.

The purpose of this work is to give an overview of the state of the art in recent research and literature on anomaly detection techniques used in financial fraud. It is intended to clarify research directions and the techniques that have been advanced, and to serve as a complete guide to recent contributions, developments, and experimental outcomes within this field.

We outline the paper as follows: Section II gives an overview of related surveys in the area. The motivation behind this overview is to show that past survey papers have often maintained rather narrow scopes, and consequently to highlight the need for a centralized source of information for anomaly detection in the financial fraud domain. Section III of the paper defines the anomaly along with a high-level overview of the
associated detection task and an overview of the nature of the problem as well as its associated challenges. In Section IV, background information on fraud in the financial domain is given, with an outline of the different types of fraudulent acts committed and insight into how they occur. In Section V, a critical review of the surveyed literature on anomaly detection techniques applied to detect financial fraud is presented, summarizing key findings, limitations, and suggestions from published research. Finally, we conclude in Section VI with remarks on the key findings and suggestions for future research avenues.

II. RELATED SURVEYS

Much published research has studied the application of anomaly detection techniques in various domains, and this work has been the focus of several recent survey and review papers. Several of those surveys cover a broad scope of applications, strategies, and techniques and have had significant influence on further research in diverse fields.

One of the earliest surveys on anomaly or outlier detection methodologies was published in 2004 by Hodge and Austin and provides a thorough review of the subject. It gives much background information on outliers or anomalies and the challenges that surround the task of detecting them, along with a thorough review of early statistical, machine learning, and ensemble methods applied to the task. In 2009, Chandola et al. also conducted a survey on the various anomaly detection techniques proposed in research that had not yet been discussed by Hodge and Austin, giving even further insight into the many real-life applications in which they are used (Chandola et al., 2009). In 2012, a survey published by Zimek et al. reviewed unsupervised anomaly detection techniques specifically for high-dimensional numerical data, discussing the aspects of the 'curse of dimensionality' in great detail (Zimek, Schubert, & Kriegel, 2012). The survey also compared two categories of specialized algorithms: one that addressed the presence of irrelevant features or attributes, and another that was more concerned with efficiency and effectiveness issues (Zimek et al., 2012). Anomaly detection for temporal data is a further problem, which Gupta et al. surveyed comprehensively in 2014 (Gupta, Gao, Aggarwal, & Han, 2014). Given the advances in computational power, which have made various forms of temporal data available, the authors comprehensively survey the techniques that have been proposed for anomaly detection in time-series data (Gupta et al., 2014). The authors provide valuable insight into the various applications of temporal anomaly detection and the related challenges in each domain.

Other survey papers concentrate more specifically on the techniques and applications of anomaly detection that have been researched in the context of financial fraud. Bolton and Hand, who are the authors of some of the earliest and most influential surveys of statistical fraud detection, provide an in-depth background on the different types of financial fraud and how they are committed, such as credit card fraud, insurance fraud, money laundering, and others (Bolton & Hand, 2002). In their 2002 paper, they also reflect on the challenges of identifying fraud in different settings while reviewing the methods applied in research to identify the numerous categories of fraud. Two years later, Kou et al. published a review with a similar structure, highlighting the deep learning methods that had recently drawn attention in financial fraud detection. Phua et al. approach fraud detection from a practical, data-oriented, and performance-driven point of view rather than from the application-oriented or technique-oriented views of previous survey papers (Phua, Lee, Smith, & Gayler, 2010). Their work also covers a wider range of fraud types, methods, and techniques than previous surveys, with some discussion of internal fraud and the implementation of hybrid approaches.

Pourhabibi et al. (Pourhabibi, Ong, Kam, & Boo, 2020) conducted a systematic literature review of the graph-based anomaly detection techniques that have been applied in published literature on financial fraud. The authors widely surveyed the methods advanced to analyze connectivity patterns in communication networks to identify suspicious behaviors (Pourhabibi et al., 2020). The layout of the review is analogous to that of the survey by Ngai et al. (Ngai et al., 2011): it discusses the limitations of the different methods and gives an overview of the main graph-based approaches, namely community-based, probabilistic-based, structural-based, compression-based, and decomposition-based (Pourhabibi et al., 2020). The applications covered include banking fraud detection, insurance fraud detection, anti-money laundering, and others. The authors explain in detail the merits of each technique and the difficulties encountered.

III. MODEL EVALUATION & FORMULATION

What are Anomalies?
Many authors have offered slightly different definitions of an anomaly, yet no single definition has been universally adopted. Exact definitions of an anomaly rely on assumptions about the structure of the data and the application in question. Still, there are definitions that can be regarded as general to most, if not all, cases irrespective of setting or application. Among these, perhaps the most commonly quoted and used is from Hawkins: "An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism" (Hawkins, 1980). This definition rests on a statistics-based intuition whereby normal data follows a generating mechanism, and
anomalies are samples or instances that fail to fit the described mechanism. Anomalies therefore often carry important information about a system, especially about the characteristics that affect the generating mechanism (Aggarwal, 2013). Throughout the rest of this paper, we use the term anomaly as defined by Hawkins. Fig. 1 shows a two-dimensional graph that illustrates the concept with multiple anomalies. In the figure, the data form two normal regions, N1 and N2, because most of the observations fall within those regions. The observations that are distant from most of the rest, whether individually or in a small group, like points o1 and o2 and the region O3, are anomalies.

Fig. 1. Graphical visualization of anomalous data in a simple two-dimensional representation.

Anomaly detection is akin to noise removal, where the issue is unwanted noise in data, but is not the same thing (Chandola et al., 2009). Real-world data are often noisy to an extent that may not interest the analyst but that causes difficulties in the analysis of data. Normally, only the most significant deviations are of interest (Aggarwal, 2013). Because noise corrupts data analysis, it calls for noise removal, which involves removing any unwanted objects before data analysis (Xiong, Pandey, Steinbach, & Kumar, 2006). Another related but different concept is novelty detection. These methods detect previously unobserved or new patterns in the data, and the principal difference from anomaly detection is that new patterns are usually incorporated into the model following their detection (Markou & Singh, 2003).

Anomalies arise from human error, instrument errors and faults, natural deviations in populations, fraudulent activity, behavioural changes of systems, or faults within a system, among other causes (Hodge & Austin, 2004). How an anomaly detection system deals with an anomaly depends on the application area. For instance, assume that the anomaly is a typographical error made by a data-entry clerk. In that case, a simple notification to the clerk will correct the error and bring the anomalous data back to a normal entry. Anomalous data coming from instrument readings can simply be deleted once identified. More importantly, anomaly detection systems in critical environments, such as intrusion monitoring or fraud detection systems, must be able to detect anomalies quickly and in real time, with a suitable alarm to permit intervention (Hodge & Austin, 2004).

Types of Anomalies
There are three types of anomalies, and the type of anomaly being addressed is an important component to consider in any anomaly detection approach.

Point anomalies are the simplest form of anomaly and are the focus of the largest proportion of studies that use anomaly detection. A point anomaly is a single data instance or event that is drastically different from all other data. In simple words, point anomalies are those points that lie outside the boundary that is established as normal or a separating boundary, such as the points o1 and o2 in Fig. 1. For example, consider credit card usage data for individuals defined by only one feature: the purchase amount. Any transaction that exceeds the normal spending range for that individual would be a point anomaly.

Contextual anomalies are data instances that are anomalous in a given context but not otherwise. The notion of context for this type of anomaly is induced by the structure of the data set, and thus it must be specified as part of the problem formulation. In general, every data instance is defined by two kinds of attributes. The first are the contextual attributes, whose values determine the context for that instance; in the time-series case, this is the value of time. The second are the behavioural attributes, which describe the non-contextual features of an object, such as the purchase amount in the credit card data set. Anomalous behavior is then identified based on the behavioural attributes within some specific context. For example, a contextual attribute in credit card fraud is the time of purchase. Assume a person has a weekly bill of a hundred dollars except in Christmas week, when it is a thousand dollars. A thousand dollars spent in an average week in May would be considered a contextual anomaly because it does not conform to the typical behavior in the context of time, even though spending the same amount in the week of Christmas would be considered normal (Chandola et al., 2009).

Lastly, collective anomalies are a set of related instances that are anomalous with respect to the whole data set. The events or data instances in a collective anomaly are not necessarily anomalies in themselves, but as a group their appearance is anomalous. Point anomalies can be found in any data set; however, collective anomalies exist only in a data set that has some form of relation between its data instances. Contextual anomalies occur only if there are contextual attributes in the data.

Challenges in Anomaly Detection
Since an anomaly is an observation or event that is not consistent with expected normal behavior, as defined earlier, a basic intuitive approach is to define a region that represents normal behavior. All observations lying outside the boundary of such a normal region are termed anomalies, and one simple illustration is presented in Fig. 1. This is, on the face of it, a very intuitive approach, but the peculiarities of anomaly detection problems make it much more challenging than most analytical and learning problems. Several factors contribute to these complexities.

It is very difficult to define a normal region or boundary that includes all possibilities of normal behavior, and the boundary between normal and anomalous behavior usually lacks precision. This implies that observations near the boundary, whether normal or anomalous, are likely to be misclassified. Maliciously induced anomalies usually evolve and change, since the adversary of the anomaly detection system is trying to masquerade anomalous events as normal, which makes them harder to detect. In general, the same also holds for most other domains, in which normal behavior is always changing and the definition of normal behavior today might not be
appropriately indicative of what will happen in the future. Concept drift is the term applied in most literature to the phenomenon of underlying models changing over time (Gama, Žliobaitė, Pechenizkiy, & Bouchachia, 2014). Anomalies are also heterogeneous, meaning irregular. Hence, one class of anomalies could have dramatically different features indicating abnormality compared to another anomalous class (Pang et al., 2020).

Also, the nature of an anomaly is domain-dependent or application-dependent. In medicine, minor shifts from the benchmark can be considered anomalous, for example shifts in body temperature. On the other hand, a similar variation in the financial world, for example in the stock market, could be within the norm (Chandola et al., 2009). A technique developed for one domain may therefore not be easily applied to another.

Other factors that contribute to the complexity of anomaly detection have to do with the inability to acquire labeled data. The reasons include privacy concerns when dealing with sensitive data and the cost of data labelling when it has to be carried out by humans (Domingues, Filippone, Michiardi, & Zouaoui, 2018). Anomalous examples are also scarce compared with normal instances, which usually make up the vast majority of the data (Pang et al., 2020; Boukerche, Zheng, & Alfandi, 2020). In such cases, a standard classifier tends to ignore the small classes because they are swamped by the larger ones. In addition, noise is quite often present in normal data instances and acts like an anomaly, which makes it challenging to deduce clear boundaries or decision rules within the given data set.

Examiners, 2020). Neither are individual fraudsters and perpetrators. Well-organized crime communities and perpetrators have also been investing in expanding and evolving their techniques (Bhattacharyya et al., 2011). Therefore, it is important to continually evolve and improve fraud detection techniques, since fraudsters are always adapting and innovating strategies and methods to breach them (West & Bhattacharya, 2016).

The Internet Crime Complaint Center (IC3) reports that 467,361 internet crime complaints were filed in the US, with reported losses surpassing $3.5 billion. The losses stem from many kinds of acts, including compromise of business e-mails for unauthorized transfers of funds, identity fraud through cheques and credit card applications, and much more. According to the ACFE's 2020 Report to the Nations, research and investigations show that organizations worldwide lose 5 percent of revenue to fraud every year and that the average fraud case remains undetected for 14 months, costing an average of 8,300 dollars per month (Association of Certified Fraud Examiners, 2020). Table 1 shows the complaints and total losses that the IC3 reports for each year from 2015 to 2019. From Table 1, it can be observed that the growth in total reported losses has accelerated each year since 2017, which further underlines the point that fraudulent efforts never stop adapting and evolving around the systems put in place against them.

Table 1
Annual complaints received and total loss reported by IC3 (Center, 2019).