Anomaly / Fraud Detection
Anomaly detection (also known as outlier detection) is the process of finding data objects
whose behavior differs markedly from expected behavior. Such objects are called outliers
or anomalies.
It has many applications in business, from intrusion detection (identifying strange patterns
in network traffic that could signal a hack) to health monitoring (spotting a malignant
tumor in an MRI scan), and from fraud detection in credit card transactions to fault
detection in operating environments.
Anomalies can be broadly categorized as:
1. Point anomalies: A single instance of data is anomalous if it's too far off from the
rest. Business use case: Detecting credit card fraud based on "amount spent."
The simplest approach to identifying irregularities in data is to flag the data points that
deviate from common statistical properties of a distribution, such as the mean, median,
mode, and quantiles. Let's say an anomalous data point is one that deviates from the mean
by more than a certain number of standard deviations.
Algorithm: compute the mean and standard deviation of the data, then flag any point whose
distance from the mean exceeds the chosen multiple of the standard deviation.
Example data: 10, 11, 15, 25, 35, 30, 7, 68
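This statistical check can be sketched in a few lines of plain Python, using the example
data above. The two-standard-deviation cutoff is an illustrative assumption, not a rule
from the text:

```python
from statistics import mean, pstdev

def zscore_outliers(data, threshold=2.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    mu = mean(data)
    sigma = pstdev(data)  # population standard deviation
    return [x for x in data if abs(x - mu) / sigma > threshold]

points = [10, 11, 15, 25, 35, 30, 7, 68]
print(zscore_outliers(points))  # -> [68]
```

Here 68 sits about 2.3 standard deviations above the mean (25.125), so it is the only
point flagged at this threshold.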
A second approach is distance- and density-based. Assumption: normal data points occur in
dense neighborhoods, while abnormalities lie far away.
The nearest set of data points is evaluated using a score, which could be the Euclidean
distance or a similar measure depending on the type of data (categorical or numerical).
These techniques fall broadly into two algorithms:
1. K-nearest neighbor: k-NN is a simple, non-parametric lazy-learning technique that
classifies data based on similarity under distance metrics such as Euclidean, Manhattan,
Minkowski, or Hamming distance.
2. Relative density of data: This is better known as the local outlier factor (LOF). This
concept is based on a distance metric called reachability distance, and compares the local
density around a point with the densities around its neighbors.
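The nearest-neighbor idea above can be sketched in plain Python: score each point by the
distance to its k-th nearest neighbor, so points in sparse regions score highest. The
sample points, the choice of k, and the function name are illustrative assumptions:

```python
import math

def knn_outlier_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest
    neighbor. Larger scores mean sparser neighborhoods, i.e. likelier
    outliers under the density assumption."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if i != j
        )
        scores.append(dists[k - 1])
    return scores

# Three points in a dense cluster plus one far-away point.
pts = [(1, 1), (1.2, 0.9), (0.9, 1.1), (8, 8)]
scores = knn_outlier_scores(pts, k=2)
print(max(range(len(pts)), key=scores.__getitem__))  # -> 3, the isolated point
```

Libraries such as scikit-learn ship production versions of both algorithms
(`KNeighborsClassifier`, `LocalOutlierFactor`); this sketch only illustrates the scoring idea.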
Clustering is one of the most popular concepts in the domain of unsupervised learning.
Assumption: Data points that are similar tend to belong to similar groups or clusters, as
determined by their distance from local centroids.
K-means is a widely used clustering algorithm. It partitions the data into 'k' clusters of
similar points. Data instances that fall far from every cluster could potentially be marked
as anomalies.