Anomaly Detection
Anomaly/Outlier Detection
• Working assumption:
• There are considerably more “normal” observations than “abnormal”
observations (outliers/anomalies) in the data
Types of Anomalies/Outliers
• Univariate Outliers – Data points that are significantly different from the other data points in a single variable
• Considering a city's temperatures in summer, a recorded value of around 100 degrees Celsius on a particular day, far beyond any plausible reading, would be a univariate outlier
• Z-Score Method – The Z-score measures how many standard deviations a data point lies from the mean: Z = (x − μ) / σ
• A positive Z-score means that the data point is above the mean
• A negative Z-score means that the data point is below the mean
• A Z-score close to 0 means that the data point is close to the mean
• In general, a data point with a Z-score greater than 3 or less than -3 is considered anomalous (see the sketch below)
• Assumption: this method requires the data to be approximately normally distributed
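A minimal NumPy sketch of this rule (the temperature values are illustrative, not real measurements):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Return indices of points whose |Z-score| exceeds the threshold.
    Assumes the data are approximately normally distributed."""
    x = np.asarray(values, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.where(np.abs(z) > threshold)[0], z

# Illustrative summer temperatures; the 100-degree reading is the anomaly
temps = [31, 33, 30, 32, 34, 29, 31, 33, 32, 30, 35, 28, 31, 100]
idx, z = zscore_outliers(temps)
print(idx, z[idx])  # flags the last point (Z-score around 3.6)
```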
Anomaly Detection Schemes
Univariate Outliers
• Interquartile Range (IQR) Method and Box-and-Whisker Plot – Flags points that fall below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1 covers the middle 50% of the data
• Case 2 – The data consists of the number of goals scored by the top goal scorer in every World Cup from 1930 through 2018 (21 competitions in total). Determine the anomalies in these tallies (a worked sketch follows)
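A short sketch of the IQR rule applied to Case 2. The tallies below are approximate and included only for illustration (13 is Just Fontaine's 1958 record):

```python
import numpy as np

# Approximate top-scorer goal tallies per World Cup, 1930-2018 (21 values)
goals = [8, 5, 7, 9, 11, 13, 4, 9, 10, 7, 6, 6, 6, 6, 6, 6, 8, 5, 5, 6, 6]

q1, q3 = np.percentile(goals, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker fences
outliers = [g for g in goals if g < lower or g > upper]
print(f"Q1={q1}, Q3={q3}, IQR={iqr}, fences=({lower}, {upper})")
print("Anomalous tallies:", outliers)  # [13] with these illustrative numbers
```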
Anomaly Detection Schemes
Multivariate Outliers
• Isolation Forest Method – Isolation Forest is an unsupervised machine learning algorithm for anomaly detection, built from an ensemble of decision trees
• Samples that travel deeper into the tree are less likely to be anomalies, as they require more cuts to isolate. Conversely, samples that end up in shorter branches indicate anomalies, as the tree separates them from the other observations more easily
• When the decision tree is created, it takes fewer nodes to reach an outlier than to reach a normal data point
• This method detects anomalies directly through isolation: how easily a data point can be separated from the rest of the data
• The Isolation Forest builds binary trees that recursively generate partitions by randomly selecting a feature and then randomly selecting a split value for that feature. The partitioning continues until every data point is separated from the rest of the samples (see the sketch below)
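A minimal scikit-learn sketch of the method on synthetic 2-D data (the cluster parameters and the contamination value are assumptions chosen for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=0.0, scale=1.0, size=(300, 2))   # dense "normal" cloud
anomalies = rng.uniform(low=6.0, high=9.0, size=(5, 2))  # five far-away points
X = np.vstack([normal, anomalies])

# contamination = expected fraction of anomalies (our assumption here)
clf = IsolationForest(n_estimators=100, contamination=0.02, random_state=0)
labels = clf.fit_predict(X)        # -1 = anomaly, 1 = inlier
scores = clf.decision_function(X)  # lower score = easier to isolate

print("Flagged indices:", np.where(labels == -1)[0])  # should include 300-304
```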
Isolation Forest Method
The algorithm will start building binary decision trees by
randomly splitting the dataset
After that, the algorithm will split the data randomly again and continue building the decision tree
The same process of random splitting continues until all the data points are separated. An isolated point may take only three splits to separate, while points that lie close to one another need many more splits (iterations) before they are separated
The algorithm creates a random forest of such decision trees and calculates the average number of splits needed to isolate each data point. The lower the number of splits needed to isolate a point, the more likely it is to be an outlier (a toy illustration follows)
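The averaging step can be illustrated with a toy one-dimensional version of a single isolation tree (random split values only; a real isolation forest also samples features and subsamples the data):

```python
import numpy as np

def isolation_depth(x, data, rng, max_depth=50):
    """Number of random splits needed to isolate the value x."""
    depth = 0
    while len(data) > 1 and depth < max_depth:
        split = rng.uniform(data.min(), data.max())
        data = data[data < split] if x < split else data[data >= split]
        depth += 1
    return depth

values = np.append(np.random.default_rng(1).normal(0, 1, 100), 10.0)

def avg_depth(x, trials=200):
    return np.mean([isolation_depth(x, values, np.random.default_rng(s))
                    for s in range(trials)])

print("outlier (10.0):", avg_depth(10.0))      # isolated in very few splits
print("typical point:", avg_depth(values[0]))  # needs many more splits
```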
A data point that needs many splits to isolate is not an outlier; a data point that is isolated in very few splits is an outlier
Anomaly Detection Schemes
Multivariate Outliers
• Local Outlier Factor (LOF) Algorithm – LOF is an algorithm
used for unsupervised outlier detection
• It produces an anomaly score for each observation that reflects how isolated the point is relative to its neighbourhood
• For each point, compute the density of its local neighbourhood
• When a point is considered an outlier based on its local neighbourhood, it is a local outlier. LOF identifies outliers by considering the density of the neighbourhood
• Outliers are points with the largest LOF values
LOF Algorithm
• LOF is based on the following:
K-distance and K-neighbours
Reachability Distance (RD)
Local Reachability Density (LRD)
Local Outlier Factor (LOF)
• K-distance is the distance between a point and its Kᵗʰ nearest neighbour. The K-neighbours, denoted Nₖ(A), are the set of points that lie in or on the circle of radius K-distance around A
In other words, the reachability distance of Xi from Xj is RDₖ(Xi, Xj) = max(K-distance(Xj), d(Xi, Xj)): if Xi lies within the K-neighbours of Xj, the reachability distance is the K-distance of Xj; otherwise it is the actual distance between Xi and Xj (a small sketch follows)
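A small NumPy/scikit-learn sketch that computes these quantities straight from the definitions (a naive O(n²) version, for illustration only):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def reachability_distances(X, k):
    """RD[i, j] = RD_k(Xi, Xj) = max(K-distance(Xj), d(Xi, Xj))."""
    # k+1 neighbours, because each point's nearest neighbour is itself
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    k_dist = dist[:, -1]  # each point's K-distance
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # pairwise d(Xi, Xj)
    return np.maximum(k_dist[None, :], d)

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0]])
print(reachability_distances(X, k=2))
```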
• Local Reachability Density (LRD) – The inverse of the average reachability distance from a point to its K-neighbours: LRDₖ(A) = 1 / (mean of RDₖ(A, X) over X in Nₖ(A))
This tells how far a point is from the nearest cluster of points: low values of LRD imply that the closest cluster is far from the point
• Local Outlier Factor (LOF) – The ratio of the average LRD of a point's K-neighbours to the point's own LRD: LOFₖ(A) = (mean of LRDₖ(X) over X in Nₖ(A)) / LRDₖ(A)
LOF ≈ 1 => point has a density similar to its neighbours
LOF < 1 => Inlier (similar data point that is inside the density cluster)
LOF >> 1 => Outlier (point lies in a much sparser region than its neighbours)
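In practice all four steps are wrapped up in scikit-learn's LocalOutlierFactor; a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),   # dense cluster
               [[5.0, 5.0], [6.0, -4.0]]])   # two isolated points

lof = LocalOutlierFactor(n_neighbors=20)     # K = 20
labels = lof.fit_predict(X)                  # -1 = outlier, 1 = inlier
lof_values = -lof.negative_outlier_factor_   # sklearn stores the negated LOF

print("Outlier indices:", np.where(labels == -1)[0])  # expect 100 and 101
print("Their LOF values:", lof_values[labels == -1])  # well above 1
```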
• Case Study: Credit Card Fraud Detection – This dataset contains transactions that occurred over two days, with 492 frauds out of 284,807 transactions
• It contains only numerical input variables. The feature 'Time' contains the seconds
that have elapsed between each transaction and the first transaction in the dataset.
The feature 'Amount' is the transaction Amount. Feature 'Class' is the target
variable, and it takes value 1 in case of fraud and 0 otherwise
• The remaining features (V1–V28) are numerical values produced by a PCA transformation; the original attributes (such as cardholder and card details) are withheld for confidentiality
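A sketch of applying Isolation Forest to this dataset (it assumes the Kaggle credit-card fraud CSV is saved locally as creditcard.csv; exact results vary with the random seed):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest

df = pd.read_csv("creditcard.csv")  # assumed local copy of the dataset
X = df.drop(columns=["Class"])
y = df["Class"]                     # 1 = fraud, 0 = normal

# Use the known fraud rate (492 / 284,807) as the contamination estimate
clf = IsolationForest(n_estimators=100,
                      contamination=492 / 284_807,
                      random_state=0)
pred = clf.fit_predict(X)           # -1 = flagged as anomalous
caught = ((pred == -1) & (y == 1)).sum()
print(f"Frauds flagged: {caught} of {y.sum()}")
```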