
Anomaly Detection

Anomaly/Outlier Detection

• Anomaly – A set of data points that are considerably different from the remainder of the data

• Anomaly detection is the technique used to identify these outliers
Business Applications
• Credit card fraud detection (e.g., spending an unusually large amount in a single day, inconsistent with your previous spending pattern, triggers an alert or a temporary card block)

• Telecommunication fraud detection

• Network intrusion detection (identifying strange patterns in network traffic that could signal a hack)

• Fault detection in operating environments

• System health monitoring (e.g., spotting a malignant tumor in an MRI scan)
Anomaly Detection
• Challenges
• How many outliers are there in the data?
• Method is unsupervised
• Finding the needle in a haystack

• Working assumption:
• There are considerably more “normal” observations than “abnormal”
observations (outliers/anomalies) in the data
Types of Anomalies/Outliers
• Univariate Outliers – Univariate outliers are data points that are significantly different from the other data points in a single variable
• Considering the temperature of a city in summer, a recorded temperature of around 100 degrees Celsius on a particular day would be a univariate outlier (almost certainly a sensor or data-entry error)

• Multivariate Outliers – Multivariate outliers are data points that are significantly different from the other data points across multiple variables considered jointly
• In a class of students, the average height and weight are 1.7 m and 50 kg, respectively. If a student has a height of 4 m and a weight of 10 kg, that combination is a multivariate outlier
Anomaly Detection Schemes
Univariate Outliers
• Z-Score Method – The Z-score method identifies outliers based on their distance from the mean, measured in standard deviations

• A positive Z-score means that the data point is above the mean
• A negative Z-score means that the data point is below the mean
• A Z-score close to 0 means that the data point is close to the
mean
• In general, a data point with a Z-score greater than 3 or less
than -3 is considered anomalous
• Assumption: this method requires the data to be approximately normally distributed
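
A minimal sketch of the rule in Python (the |z| > 3 threshold follows the slide; the data array is an illustrative sample, not from the source):

```python
import numpy as np

# Illustrative data: 100 points from a normal distribution plus one extreme value
rng = np.random.default_rng(0)
data = np.append(rng.normal(loc=10, scale=1, size=100), 25.0)

# Z-score: how many standard deviations each point lies from the mean
z = (data - data.mean()) / data.std()

# Flag points with |z| > 3 as anomalous
print("outliers:", data[np.abs(z) > 3])
```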
Anomaly Detection Schemes
Univariate Outliers
• Interquartile Range (IQR) Method and Box-and-Whisker Plot – This method identifies outliers based on the distribution of the data

A data point is an outlier if it is < Q1 − 1.5 × (Q3 − Q1) or > Q3 + 1.5 × (Q3 − Q1), where Q3 − Q1 is the interquartile range (IQR)
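
A matching sketch for the IQR rule (the values are illustrative; NumPy's default percentile interpolation is assumed):

```python
import numpy as np

data = np.array([52, 56, 53, 57, 55, 54, 58, 51, 90])  # illustrative values

q1, q3 = np.percentile(data, [25, 75])          # first and third quartiles
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # box-plot "fences"

print("outliers:", data[(data < lower) | (data > upper)])  # flags 90
```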
Case Analysis

• Case 1 – Student participation rates in the SAT (Scholastic Assessment Test) in Connecticut school districts in 2022 are recorded. Determine the anomalies in the rates
Note: The SAT is a standardized test widely used for college admissions in the United States

• Case 2 – The data consists of the number of goals scored by the top goal scorer in every World Cup from 1930 through 2018 (21 competitions in total). Determine the anomalies in the scores (a sketch follows below)
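
As a sketch of how Case 2 might be approached with the two univariate methods above (the goal counts here are hypothetical placeholders, not the actual World Cup figures):

```python
import numpy as np

# Hypothetical top-scorer goal counts for 21 tournaments (placeholder data)
goals = np.array([8, 5, 7, 9, 11, 13, 4, 9, 10, 7, 6, 6, 6, 6, 6, 6, 8, 5, 5, 6, 6])

# IQR rule (box-and-whisker fences)
q1, q3 = np.percentile(goals, [25, 75])
iqr = q3 - q1
print("IQR outliers:", goals[(goals < q1 - 1.5 * iqr) | (goals > q3 + 1.5 * iqr)])

# Z-score rule; with only 21 observations, |z| > 3 is very conservative
z = (goals - goals.mean()) / goals.std()
print("z-score outliers:", goals[np.abs(z) > 3])
```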
Anomaly Detection Schemes
Multivariate Outliers
• Isolation Forest Method – Isolation Forest is an unsupervised machine learning algorithm for anomaly detection. The method is built on decision trees

• Samples that travel deeper into the tree are less likely to be anomalies, as more cuts are required to isolate them. Conversely, samples that end up in shorter branches indicate anomalies, as the tree can separate them from the other observations more easily

• Each tree in an Isolation Forest is called an Isolation Tree


Isolation Forest Method
• Isolation Forest Method Steps:

1. Given a dataset, a (sub-)sample of the data points is assigned to a binary tree
2. Branching of the tree starts by selecting a random feature (from the set of all N features). Branching is then done on a random threshold (between the minimum and maximum values of the selected feature)
3. If the value of a data point is less than the selected threshold, it goes to the left branch, else to the right. A node is thus split into left and right branches
4. The process from step 2 continues recursively until each data point is completely isolated
5. The above steps are repeated to construct many random binary trees (see the sketch below)
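
A from-scratch sketch of steps 1–5 (function and variable names are my own; this is illustrative, not a production implementation):

```python
import random

def isolation_path_length(point, data, depth=0, max_depth=20):
    """Follow `point` through random splits (steps 2-4) and return the
    depth at which it becomes isolated from the rest of `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    n_features = len(data[0])
    f = random.randrange(n_features)              # step 2: pick a random feature
    lo = min(row[f] for row in data)
    hi = max(row[f] for row in data)
    if lo == hi:                                  # constant feature: no split possible
        return depth
    t = random.uniform(lo, hi)                    # step 2: pick a random threshold
    # step 3: keep only the branch that `point` falls into
    side = [row for row in data if (row[f] < t) == (point[f] < t)]
    return isolation_path_length(point, side, depth + 1, max_depth)

# Step 5: average the path length over many random trees (100 here)
data = [[1.0, 2.0], [1.1, 2.1], [0.9, 1.9], [1.2, 2.2], [8.0, 9.0]]
for p in data:
    avg = sum(isolation_path_length(p, data) for _ in range(100)) / 100
    print(p, round(avg, 2))   # the far-away point gets the shortest average path
```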
Anomaly Detection Schemes
Multivariate Outliers
• Isolation Forest Method

• Once a tree is built, it takes fewer splits to reach the outliers than to reach normal data points
• This method detects anomalies directly through isolation (how easily a data point can be separated from the rest of the data)
• The Isolation Forest detects anomalies using binary trees that recursively generate partitions by randomly selecting a feature and then randomly selecting a split value for that feature. The partitioning process continues until every data point is separated from the rest of the samples
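
In practice this is available as scikit-learn's IsolationForest; a minimal usage sketch (the data and the contamination value are assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
X = rng.normal(loc=0.0, scale=1.0, size=(300, 2))     # mostly "normal" points
X = np.vstack([X, [[6.0, 6.0], [-5.0, 7.0]]])         # two obvious anomalies

# contamination = expected fraction of outliers (an assumption here)
clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
labels = clf.fit_predict(X)        # -1 = anomaly, 1 = inlier
scores = clf.decision_function(X)  # lower = more anomalous

print("flagged as anomalies:", X[labels == -1])
```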
Isolation Forest Method
The algorithm starts building binary decision trees by randomly splitting the dataset

The algorithm then splits the data randomly again and continues building the decision tree
Isolation Forest Method
The same process of random splitting continues until all the data points are separated

Suppose the algorithm takes three splits to isolate one far-away point. Isolating the remaining points takes many more splits, because they are close to each other (far more iterations are required)

The algorithm builds a random forest of such decision trees and calculates the average number of splits needed to isolate each data point. The fewer split operations needed to isolate a point, the more likely that point is an outlier
Isolation Forest Method

Left: many splits are required to isolate the data point, so it is not an outlier. Right: the data point is isolated in very few splits, so it is an outlier
Anomaly Detection Schemes
Multivariate Outliers
• Local Outlier Factor (LOF) Algorithm – LOF is an algorithm
used for unsupervised outlier detection
• It produces an anomaly score that quantifies how much of an outlier each data point in the data set is
• For each point, the density of its local neighbourhood is computed
• When a point is considered an outlier based on its local neighbourhood, it is a local outlier. LOF identifies outliers by considering the density of the neighbourhood
• Outliers are the points with the largest LOF values
LOF Algorithm
• LOF is based on the following:
• K-distance and K-neighbours
• Reachability Distance (RD)
• Local Reachability Density (LRD)
• Local Outlier Factor (LOF)
LOF Algorithm
• K-distance is the distance between a point and its Kᵗʰ nearest neighbour. K-neighbours, denoted by Nₖ(A), is the set of points that lie in or on the circle of radius K-distance centred at the point

If K = 2, the K-neighbours of A are B, C, and D. Here K = 2, and ||N₂(A)|| = 3 (a tie at the K-distance radius puts three points in the set)
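
A small sketch of computing K-distance and K-neighbours with scikit-learn's NearestNeighbors (the points are illustrative; note that, unlike the definition above, sklearn returns exactly K neighbours and does not expand the set on ties):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [5.0, 5.0]])

k = 2
# k+1 because each point's nearest neighbour is itself at distance 0
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, idx = nn.kneighbors(X)

k_distance = dist[:, -1]   # distance to the k-th nearest neighbour
print("k-distance of each point:", np.round(k_distance, 3))
print("k-neighbour indices:", idx[:, 1:])  # drop the self-match in column 0
```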
LOF Algorithm
• Reachability Distance (RD)

RD(Xi, Xj) = max(K-distance(Xj), distance(Xi, Xj))

It is defined as the maximum of K-distance of Xj and the distance between Xi and Xj

In other words, if a point Xi lies within the K-neighbours of Xj, the reachability distance is the K-distance of Xj; otherwise, it is the actual distance between Xi and Xj
LOF Algorithm
• Local Reachability Density (LRD)

LRD is the inverse of the average reachability distance of A from its neighbours

According to the LRD formula (below), the greater the average reachability distance (i.e., the farther the neighbours are from the point), the lower the density of points around that point

This tells how far a point is from the nearest cluster of points: low values of LRD imply that the closest cluster is far from the point
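
The formula in question, written in the standard notation (this is the usual definition from Breunig et al.; the slide's original rendering is not reproduced here):

```latex
\mathrm{LRD}_k(A) = \left( \frac{\sum_{B \in N_k(A)} \mathrm{RD}_k(A, B)}{\lVert N_k(A) \rVert} \right)^{-1}
```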
LOF Algorithm
• Local Outlier Factor (LOF)

The LRD of each point is compared with the average LRD of its K neighbours

LOF is the ratio of the average LRD of the K neighbours of A to the LRD of A

If the point is an inlier (not an outlier), the average LRD of its neighbours is approximately equal to its own LRD (because the densities of the point and its neighbours are roughly equal), so LOF is close to 1. If the point is an outlier, its LRD is lower than the average LRD of its neighbours, and the LOF value is high
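
Written out (again in the standard notation, assuming the definitions above):

```latex
\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \mathrm{LRD}_k(B)}{\lVert N_k(A) \rVert \cdot \mathrm{LRD}_k(A)}
= \frac{\text{average LRD of the K-neighbours of } A}{\mathrm{LRD}_k(A)}
```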
LOF Algorithm
• Local Outlier Factor (LOF)

LOF ≈ 1 => similar data point (density comparable to its neighbours)

LOF < 1 => inlier (a point inside a dense cluster)

LOF > 1 => outlier
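
A minimal usage sketch with scikit-learn's LocalOutlierFactor (the data is illustrative; note the library exposes the score as negative_outlier_factor_, i.e., -LOF):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[7.0, 7.0]]])          # one clear outlier

lof = LocalOutlierFactor(n_neighbors=20)  # n_neighbors is the "K" above
labels = lof.fit_predict(X)               # -1 = outlier, 1 = inlier

# negative_outlier_factor_ is -LOF, so values far below -1 indicate outliers
lof_scores = -lof.negative_outlier_factor_
print("largest LOF value:", lof_scores.max())
print("points flagged as outliers:", X[labels == -1])
```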


Case Analysis
• Credit Card Fraud Detection

• The dataset contains transactions made by credit cards in September 2013 by European cardholders

• This dataset presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions

• It contains only numerical input variables. The feature 'Time' contains the seconds
that have elapsed between each transaction and the first transaction in the dataset.
The feature 'Amount' is the transaction Amount. Feature 'Class' is the target
variable, and it takes value 1 in case of fraud and 0 otherwise

• The remaining features (V1–V28) are principal components obtained with PCA; the original features are not provided for confidentiality reasons
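
A sketch of how this case might be approached, assuming the standard Kaggle creditcard.csv layout (the file path and the choice of Isolation Forest are assumptions for illustration):

```python
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.metrics import classification_report

df = pd.read_csv("creditcard.csv")        # hypothetical local path
X = df.drop(columns=["Class"])
y = df["Class"]                           # 1 = fraud, 0 = normal

# contamination set to the known fraud rate: 492 / 284807
clf = IsolationForest(contamination=492 / 284807, random_state=0)
pred = clf.fit_predict(X)                 # -1 = anomaly, 1 = normal
pred = (pred == -1).astype(int)           # map to the dataset's labels

print(classification_report(y, pred, digits=3))
```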
