Unit 2 - Part A

This document discusses anomaly detection and dimensionality reduction. It defines anomalies as deviations from expected patterns in data. There are different types of anomalies, including point anomalies where a single data point is very different, and collective anomalies where a group of related data points together deviate from larger patterns. Anomaly detection is useful for applications like fraud detection, health monitoring, and intrusion detection to identify unexpected patterns that could signal problems. Dimensionality reduction techniques like PCA can also help detect anomalies by reducing noise and identifying underlying patterns in data.

Uploaded by

Raksheet Jain

Outlier Detection and

Dimensionality Reduction
Unit 2
What is Anomaly?
• Various management software can be used to evaluate the operational performance of applications and Key Performance Indicators (KPIs) that measure the success of the organization
• Within a given dataset, there are data patterns that represent business as usual
• An unexpected change within these data patterns, or an event that does not conform to the expected data pattern, is considered an anomaly
• In other words, an anomaly is a deviation from business as usual
Topics to be covered….
• Introduction to anomaly (outlier) detection
• Types of anomaly detection
• Applications of Outlier detection
• Proximity based Outlier detection: distance and density based outlier
detection
• One class SVM
• Principal Component Analysis (PCA),
• Applications of PCA,
• Autoencoders: Denoising Autoencoders, Variational Autoencoders
• Applications of Autoencoders
What is Anomaly?
• It is not unusual for an e-Commerce website to collect a large amount of revenue on specific days, such as the festival season, because of the high volume of sales during that period
• It would be an anomaly if a company did not have a high sales volume on these days, especially if festival sales in previous years were very high
• A value can be an anomaly if it breaks a pattern that is normal for the data from that particular metric
• Anomalies aren't categorically good or bad; they are deviations from the expected value for a metric at a given point in time
Introduction
• Usually, anomalies are difficult for a human expert to detect manually
• These items/events are called outliers
• Anomalous data can indicate critical incidents, such as a technical glitch, or potential opportunities
Outliers
• Outliers are often visible symptoms.
Example: Outliers
• Suppose an e-commerce website has a glitch in pricing
• A price is entered as $10 instead of $100 for a product
Example: Outliers
• Deeper inspection can identify the underlying error
• For example, customers start buying multiple units of the underpriced product
Need for Anomaly Detection
• Normal activity can be compared with known diseases such as malaria, dengue, swine flu, etc., for which we have a cure
• SARS-CoV-2 (COVID-19), on the other hand, is an anomaly: it shows the characteristics of a normal disease, with the exception of delayed symptoms
• Had the SARS-CoV-2 anomaly been detected at a very early stage, its spread could have been contained significantly
• Since SARS-CoV-2 is an entirely new anomaly that had never been seen before, even a supervised learning procedure would have failed to detect it as an anomaly
Need for Anomaly Detection
• A supervised learning model learns patterns from the features and labels in the dataset
• By providing normal data on pre-existing diseases to an unsupervised learning algorithm, we could have detected this virus as an anomaly with high probability, since it would not have fallen into the category of normal diseases
• Therefore, unsupervised learning methods are preferred over supervised learning methods in most cases
What is time series data anomaly detection?
• Anomaly detection is based on the ability to accurately analyze time series data in real time
• Time series data is composed of a sequence of values over time
• Each sample is typically a pair of two items: a timestamp for when the metric was measured, and the value associated with the metric at that time
• Time series data is a record that contains the information necessary for making educated guesses about what can reasonably be expected in the future
• Anomaly detection systems use those expectations to identify actionable signals within the data
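As a minimal sketch of these (timestamp, value) pairs, the snippet below flags a value that deviates from the running mean of all previous values; the data and the ±50% tolerance are made-up assumptions for illustration, not a production detector:

```python
from datetime import datetime

# A time series sample is a (timestamp, value) pair, as described above.
series = [
    (datetime(2021, 3, 1, 12, 0), 101.0),
    (datetime(2021, 3, 1, 12, 1), 98.5),
    (datetime(2021, 3, 1, 12, 2), 102.3),
    (datetime(2021, 3, 1, 12, 3), 250.0),   # unexpected spike
    (datetime(2021, 3, 1, 12, 4), 99.1),
]

# Naive expectation (an assumption for this sketch): each value should stay
# within +/-50% of the running mean of the values seen so far.
anomalies = []
for i in range(1, len(series)):
    prev_mean = sum(v for _, v in series[:i]) / i
    ts, v = series[i]
    if abs(v - prev_mean) > 0.5 * prev_mean:
        anomalies.append(ts)
print(anomalies)   # only the 12:03 spike is flagged
```

Real systems replace the running mean with a seasonal or model-based expectation, but the structure (expected value vs. observed value per timestamp) is the same.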
Applications
• Business
• Intrusion detection (identifying strange patterns in network traffic
that could signal a hack)
• Health monitoring (spotting a malignant tumor in an MRI scan)
• Fraud detection in credit card transactions….
Anomaly detection is not Noise detection
Anomaly detection is related to, but distinct from, noise removal and novelty detection
• Novelty detection
is concerned with identifying an unobserved pattern in new observations not included in the training data
Ex: a sudden interest in a new channel on YouTube during Christmas
• Noise removal
is the process of immunizing the analysis from the occurrence of unwanted observations
In other words, removing noise from an otherwise meaningful signal
Categories of Anomaly
1. Point anomaly (Global outlier)
2. Contextual anomalies
3. Collective anomalies
Point Anomaly
• The value of the outlier is much different from the other samples
• Business use case: detecting credit card fraud based on "amount spent"
• Use of Zoom meetings from January to February 2021 increased by 100%; from February to March it increased by 400%
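A point anomaly in an "amount spent" metric can be flagged with a simple z-score check against historical values; the numbers and the 3-sigma threshold below are illustrative assumptions, not part of the slides:

```python
import numpy as np

# Hypothetical monthly "amount spent" values; the last one is a point anomaly.
spend = np.array([120, 135, 110, 128, 140, 125, 132, 900], dtype=float)

mean, std = spend[:-1].mean(), spend[:-1].std()  # baseline from history
z = (spend - mean) / std                         # z-score of each value
outliers = np.where(np.abs(z) > 3)[0]            # common 3-sigma rule
print(outliers)   # index 7 (the 900 deposit) is flagged
```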
Point Anomaly
• Example (figure): a point anomaly in a global-economy metric
Contextual (Conditional) Outliers
• Its value significantly deviates from the rest of the data points in the same context
• The same value may not be considered an outlier if it occurred in a different context
• Common in time-series data
Contextual (Conditional) Outliers
• Generally, for time series data, the "context" is temporal, because time series data are records of a specific quantity over time
• Business use case: spending INR 5,000 on food every day during the holiday season is normal, but may be odd otherwise
• In Mumbai it rains in June; if it rains in January, it is an outlier
Contextual (Conditional) Outliers
• Example (figure): a contextual outlier due to a pricing glitch
Contextual (Conditional) Outliers
• Values are not outside the normal global range
• But they are abnormal compared to the seasonal pattern
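The Mumbai rainfall example can be sketched by comparing a new observation against the statistics of its context (the same month in previous years). The rainfall numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical monthly rainfall (mm) for Mumbai over 3 years:
# rows = years, columns = months (Jan..Dec). Values are invented.
rain = np.array([
    [2, 1, 0, 5, 40, 500, 600, 450, 300, 80, 10, 3],
    [1, 0, 2, 8, 55, 520, 650, 480, 320, 70, 12, 2],
    [3, 2, 1, 6, 50, 510, 620, 460, 310, 90, 11, 4],
], dtype=float)

month_mean = rain.mean(axis=0)          # per-month (contextual) baseline
month_std = rain.std(axis=0) + 1e-9     # avoid division by zero

# New observation: 120 mm of rain in January (month index 0).
obs_month, obs_value = 0, 120.0
z = (obs_value - month_mean[obs_month]) / month_std[obs_month]
print(z > 3)   # True: globally modest, but extreme for January
```

The same 120 mm in June would be unremarkable; the context (month) is what makes it an outlier.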
Collective Anomaly
• A set of data instances collectively helps in detecting anomalies
• Business use case: someone is unexpectedly trying to copy data from a remote machine to a local host; an anomaly that would be flagged as a potential cyber attack
• A subset of data points within a dataset is considered anomalous if those values as a collection deviate significantly from the entire dataset, but the values of the individual data points are not themselves anomalous in either a contextual or global sense
Collective Anomaly
• In time series data, a collective anomaly can appear as normal peaks and valleys occurring outside of the time frame in which that seasonal sequence is normal
• Or as a combination of time series that is in an outlier state as a group
Collective Anomaly
• Two time series that were discovered to be related to each other are combined into a single anomaly
• For each time series, the individual behavior does not deviate significantly from the normal range, but the combined anomaly indicates a bigger issue
Collective Anomaly
• A group of data points in a large dataset is significantly different from the other points, but each data point on its own is not anomalous
• Ex: two time series are combined; individually they do not deviate significantly, but after combining they indicate a big issue
Collective Anomaly
• A group of people leave a neighborhood on the same day
• Generally, individuals leave from time to time
• It is unusual that a large group leaves the neighborhood at the same time
Collective Anomaly
• Running an ad campaign
• With an increase in budget, there is an increase in the number of clicks and impressions
• When these changes happen together unexpectedly, there may be an issue with the campaign
• Individually, they are not anomalous
Example: anomalies
• A plane landing on a highway is a global outlier, because it is a truly rare event that a plane would have to land there
• If the highway was congested with traffic at 3 a.m., that would be a contextual outlier, since traffic doesn't usually start until later in the morning when people are heading to work
• If every car on the freeway moved to the left lane at the same time, that would be a collective outlier: although it is not rare for individual cars to move to the left lane, it is unusual that all cars would relocate at the same exact time
Example: anomalies
• A banking customer who normally deposits no more than INR 1,000 a month in checks at a local ATM suddenly makes two cash deposits of INR 5,000 each in the span of two weeks; this is a global anomaly, because such an event has never before occurred in this customer's history
• The time series data of their weekly deposits would show an abrupt recent spike
• Such a drastic change would raise alarms, as these large deposits could imply illicit commerce or money laundering
When to use time series anomaly detection?
• Depending on the business model and use case, anomaly detection is used for valuable metrics such as:
• Web page views
• Daily active users
• Mobile app installs
• Cost per click
• Customer acquisition costs
• Revenue per click
• Volume of transactions
• Average order value
• And more
Simple Example of Anomaly
• In a two-dimensional dataset (axes X and Y), N1 and N2 are regions of normal behaviour
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
Key Challenges in Anomaly Detection
• Defining a representative normal region is challenging
• The boundary between normal and outlying behaviour is often not
precise
• The exact notion of an outlier is different for different application
domains
• Availability of labelled data for training/validation
• Malicious adversaries
• Data might contain noise
• Normal behaviour keeps evolving
Aspects of Anomaly Detection Problem
• Nature of input data
• Availability of supervision
• Type of anomaly: point, contextual, collective
• Output of anomaly detection
• Evaluation of anomaly detection techniques
Input Data
• Most common form of data handled by anomaly detection techniques is Record Data
  • Univariate
  • Multivariate

Tid | SrcIP         | Start time | Dest IP        | Dest Port | Number of bytes | Attack
 1  | 206.135.38.95 | 11:07:20   | 160.94.179.223 | 139       | 192             | No
 2  | 206.163.37.95 | 11:13:56   | 160.94.179.219 | 139       | 195             | No
 3  | 206.163.37.95 | 11:14:29   | 160.94.179.217 | 139       | 180             | No
 4  | 206.163.37.95 | 11:14:30   | 160.94.179.255 | 139       | 199             | No
 5  | 206.163.37.95 | 11:14:32   | 160.94.179.254 | 139       | 19              | Yes
 6  | 206.163.37.95 | 11:14:35   | 160.94.179.253 | 139       | 177             | No
 7  | 206.163.37.95 | 11:14:36   | 160.94.179.252 | 139       | 172             | No
 8  | 206.163.37.95 | 11:14:38   | 160.94.179.251 | 139       | 285             | Yes
 9  | 206.163.37.95 | 11:14:41   | 160.94.179.250 | 139       | 195             | No
10  | 206.163.37.95 | 11:14:44   | 160.94.179.249 | 139       | 163             | Yes
Input Data – Nature of Attributes
• Nature of attributes
  • Binary
  • Categorical
  • Continuous
  • Hybrid

Tid | SrcIP         | Duration | Dest IP        | Number of bytes | Internal
 1  | 206.163.37.81 | 0.10     | 160.94.179.208 | 150             | No
 2  | 206.163.37.99 | 0.27     | 160.94.179.235 | 208             | No
 3  | 160.94.123.45 | 1.23     | 160.94.179.221 | 195             | Yes
 4  | 206.163.37.37 | 112.03   | 160.94.179.253 | 199             | No
 5  | 206.163.37.41 | 0.32     | 160.94.179.244 | 181             | No
Data Labels
• Supervised Anomaly Detection
• Labels available for both normal data and anomalies
• Similar to rare class mining
• Semi-supervised Anomaly Detection
• Labels available only for normal data
• Unsupervised Anomaly Detection
• No labels assumed
• Based on the assumption that anomalies are very rare
compared to normal data
Output of Anomaly Detection
• Label
• Each test instance is given a normal or anomaly label
• This is especially true of classification-based approaches
• Score
• Each test instance is assigned an anomaly score
• Allows the output to be ranked
• Requires an additional threshold parameter
Taxonomy of Anomaly Detection Approaches
• Point Anomaly Detection
  • Classification Based: Rule Based, Neural Networks Based, SVM Based
  • Nearest Neighbor Based: Distance Based, Density Based
  • Clustering Based
  • Statistical: Parametric, Non-parametric
  • Others: Information Theory Based, Spectral Decomposition Based, Visualization Based
• Contextual Anomaly Detection
• Collective Anomaly Detection
• Online Anomaly Detection
• Distributed Anomaly Detection

*Outlier Detection – A Survey, Varun Chandola, Arindam Banerjee, and Vipin Kumar, Technical Report TR07-17, University of Minnesota
Nearest Neighbor Based Techniques
• Key assumption: normal points have close neighbors while
anomalies are located far from other points
• General two-step approach:
1. Compute the neighborhood for each data record
2. Analyze the neighborhood to determine whether the data record is an anomaly or not
• Categories:
• Distance based methods
• Anomalies are data points most distant from other points
• Density based methods
• Anomalies are data points in low density regions
Nearest Neighbor Based Techniques
• Advantage
• Can be used in unsupervised or semi-supervised setting (do not make any
assumptions about data distribution)
• Drawbacks
• If normal points do not have a sufficient number of neighbors, the techniques may fail
• Computationally expensive
• In high-dimensional spaces, data is sparse and the concept of similarity may not be meaningful anymore
Due to the sparseness, distances between any two data records may become quite similar => each data record may be considered a potential outlier!
Nearest Neighbor Based Techniques
• Distance based approaches
• A point O in a dataset is a DB(p, d) outlier if at least a fraction p of the points in the dataset lie at a distance greater than d from O
• Density based approaches
• Compute local densities of particular regions and declare instances in low density regions as potential anomalies
• Approaches:
• Local Outlier Factor (LOF)
• Connectivity Outlier Factor (COF)
• Multi-Granularity Deviation Factor (MDEF)
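The DB(p, d) definition above translates directly into code. This sketch uses randomly generated data and arbitrary p and d values for illustration:

```python
import numpy as np

def is_db_outlier(o, X, p, d):
    """O is a DB(p, d)-outlier if at least fraction p of the dataset
    lies at a distance greater than d from O."""
    dists = np.linalg.norm(X - o, axis=1)   # Euclidean distances to O
    frac_far = np.mean(dists > d)           # fraction of points beyond d
    return frac_far >= p

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))               # cluster around the origin
print(is_db_outlier(np.array([6.0, 6.0]), X, p=0.95, d=2.0))  # far point
print(is_db_outlier(np.array([0.0, 0.0]), X, p=0.95, d=2.0))  # central point
```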
Nearest Neighbor Based Techniques
• Distance based approach
• For each data point d, compute the distance to its k-th nearest neighbor, dk
• Sort all data points according to the distance dk
• Outliers are points that have the largest distance dk and are therefore located in the sparser neighborhoods
• Usually, data points that have the top n% of distances dk are identified as outliers
• n is a user parameter
• Not suitable for datasets that have modes with varying density
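The k-th-nearest-neighbor distance procedure above can be sketched with scikit-learn's NearestNeighbors; the data, k, and the n% cutoff are assumptions for illustration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])            # plant one far-away point

k = 5
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)  # +1: each point is its own 0-NN
dists, _ = nn.kneighbors(X)
dk = dists[:, k]                            # distance to the k-th nearest neighbor

n_pct = 1.0                                 # flag the top n% (user parameter)
n_out = max(1, int(len(X) * n_pct / 100))
outlier_idx = np.argsort(dk)[-n_out:]       # largest dk values
print(outlier_idx)                          # includes the planted point (index 200)
```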
Nearest Neighbor Based Techniques
Density based approach
Local Outlier Factor (LOF)
• Local outlier factor (LOF) is an algorithm used for unsupervised outlier detection
• When a point is considered an outlier based on its local neighborhood, it is a local outlier
• LOF identifies an outlier by considering the density of its neighborhood
• LOF performs well when the density of the data is not the same throughout the dataset
• It produces an anomaly score that represents the data points which are outliers in the dataset
• It does this by measuring the local density deviation of a given data point with respect to the data points near it
Local Outlier Factor (LOF)
Sequential Steps for LOF:
• K-distance and K-neighbors
• Reachability distance (RD)
• Local reachability density (LRD)
• Local Outlier Factor (LOF)
Local Outlier Factor (LOF)
K-DISTANCE AND K-NEIGHBORS
• K-distance is the distance between a point and its Kᵗʰ nearest neighbor
• K-neighbors, denoted by Nk(A), is the set of points that lie in or on the circle of radius K-distance
• The number of K-neighbors can be greater than or equal to the value of K (because of ties)
• Consider four points A, B, C, and D
• If K=2, the K-neighbors of A will be C, B, and D
• Here, the value of K=2 but ||N₂(A)|| = 3
• Therefore, ||Nk(point)|| will always be greater than or equal to K
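The tie-breaking behavior above can be checked with a small sketch; the four points are hypothetical coordinates chosen so that A's neighbor distances tie:

```python
import numpy as np

# Hypothetical points: B, C, D are all at distance 1 from A, so the
# 2-distance of A is 1 and all three fall inside N_2(A).
pts = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (-1.0, 0.0), "D": (0.0, 1.0)}

def k_distance_and_neighbors(point, others, k):
    """k-distance = distance to the k-th nearest neighbor;
    N_k = all points whose distance is <= k-distance (ties included)."""
    dists = sorted((np.linalg.norm(np.subtract(point, q)), name)
                   for name, q in others.items())
    k_dist = dists[k - 1][0]
    neighbors = {name for d, name in dists if d <= k_dist}
    return k_dist, neighbors

others = {n: p for n, p in pts.items() if n != "A"}
k_dist, n_k = k_distance_and_neighbors(pts["A"], others, k=2)
print(k_dist, sorted(n_k))   # K=2 but ||N_2(A)|| = 3 because of ties
```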
Local Outlier Factor (LOF)
REACHABILITY DISTANCE (RD)
• It is defined as the maximum of the K-distance of Xj and the distance between Xi and Xj
• The distance measure is problem-specific (Euclidean, Manhattan, etc.)
• If a point Xi lies within the K-neighbors of Xj, the reachability distance will be the K-distance of Xj (blue line in the figure); otherwise, the reachability distance will be the distance between Xi and Xj (orange line)
Local Outlier Factor (LOF)

LOCAL REACHABILITY DENSITY (LRD)
• The reachability distances to all of the k-nearest neighbors of a point are calculated to determine the Local Reachability Density (LRD) of that point
• The local reachability density is a measure of the density of the k-nearest points around a point
• The closer the points are, the smaller the distance and the higher the density; hence the inverse is taken in the equation
• LRD is the inverse of the average reachability distance of A from its neighbors
• Intuitively, according to the LRD formula, the greater the average reachability distance (i.e., the farther the neighbors are from the point), the lower the density of points around that point
• This tells how far a point is from the nearest cluster of points
• A low value of LRD implies that the closest cluster is far from the point
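The verbal definitions above correspond to the standard LOF formulas (a sketch in the usual notation, with Nₖ(A) and k-distance as defined earlier):

```latex
\text{reach-dist}_k(A, B) = \max\{\, k\text{-distance}(B),\; d(A, B) \,\}

\mathrm{lrd}_k(A) = \left( \frac{\sum_{B \in N_k(A)} \text{reach-dist}_k(A, B)}{\lVert N_k(A) \rVert} \right)^{-1}

\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}_k(B)}{\lVert N_k(A) \rVert \cdot \mathrm{lrd}_k(A)}
```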
Local Outlier Factor (LOF)
LOCAL OUTLIER FACTOR (LOF)
• The LRD of each point is compared with the average LRD of its K neighbors
• LOF is the ratio of the average LRD of the K neighbors of A to the LRD of A
• Intuitively, if the point is not an outlier (i.e., an inlier), the average LRD of its neighbors is approximately equal to the LRD of the point (because the density of the point and of its neighbors are roughly equal); in that case, LOF is nearly equal to 1
• On the other hand, if the point is an outlier, the LRD of the point is less than the average LRD of its neighbors, and the LOF value will be high
• Generally, if LOF > 1 the point is considered an outlier, but that is not always true
• If we know that there is only one outlier in the data, we take the maximum LOF value among all the LOF values, and the point corresponding to that maximum is considered the outlier
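In practice, the whole pipeline above is available as scikit-learn's LocalOutlierFactor; the data and n_neighbors choice below are illustrative assumptions:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
X = rng.normal(0, 0.5, size=(100, 2))       # dense "normal" cluster
X = np.vstack([X, [[4.0, 4.0]]])            # one obvious local outlier

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                 # -1 = outlier, 1 = inlier
scores = -lof.negative_outlier_factor_      # LOF scores; larger => more anomalous
print(labels[-1], scores.argmax())          # the planted point is flagged
```

Note that fit_predict follows the label-style output discussed earlier, while negative_outlier_factor_ gives the score-style output that can be ranked or thresholded.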