Outlier Analysis
Outlier Analysis
Outlier Analysis
• Outlier and Outlier Analysis
• Outlier Detection Methods
• Statistical Approaches
• Proximity-Base Approaches
• Clustering-Base Approaches
• Classification Approaches
• Mining Contextual and Collective Outliers
• Outlier Detection in High Dimensional Data
• Summary
2
What Are Outliers?
An outlier is “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a
different mechanism” [Hawkins 1980]
and
“an outlier observation is one that appears to deviate markedly from other members of the sample in which it occurs” [Barnett
and Lewis 1994].
• Ex.: Unusual credit card purchase, sports: Michael Jordon, Wayne Gretzky, ...
4
Types of Outliers (II)
• Collective Outliers
• A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data objects
may not be outliers
• Applications: E.g.: Collective Outlier
• Intrusion detection: When a number of computers
keep sending denial-of-service packages to each other
• Financial markets: Few companies buying and selling
the same shares among themselves
Detection of collective outliers
Consider not only behavior of individual objects, but also that of
groups of objects
Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure
on objects.
5
Reasons of Outliers in Data
• Naturally occurring anomalies
• Wrong values
• Fraud
• Any other reason?
Questions for Discussion
Questions:
• Distance-based Methods:
• Normal data points are close to many other points, others are outliers
• Model-based Methods:
• Learns a classifier model to detect outliers, then applies the model on test data points
• Angle-based Methods:
• Based on the mean angle formed by the point with pairs of other points in the dataset
Statistics-based Methods
• Normal data points in high probability regions, outliers in low
probability regions
• Important Theorems:
• Markov’s Theorem: Let X be a random variable that takes only non-
negative random values. Then for any constant α satisfying , the
following is true:
Disadvantages:
• Assumption of a particular distribution may not be valid, particularly for high-
dimensional data
• Several hypothesis tests available, making decision harder
Hypothesis Testing Methods
• Grubb’s Test –
• to detect single outlier in a (approximately normal) univariate distribution
• Tietjen-Moore Test –
• generalization of the Grubbs test to the case of more than one outlier
• requires the number of outliers to be specified exactly.
• https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/
Outliers using Quartiles
• Find the 1st and 3rd Quartiles of the data
• Compute IQR = Q3 – Q1
• Any point outside [Q1 – 1.5*IQR, Q3 + 1.5*IQR] (or [Q1 – 3*IQR, Q3 +
3*IQR]) can be called an outlier, depending on the application
Robust Univariate Statistics
• Correctly captures important properties of the underlying distribution even in the presence of many
outliers.
• Robust Univariate Statistics – Median and Median Absolute Deviation (MAD) to replace Mean and SD
• Median Absolute Deviation (MAD) is the median of the absolute deviations from the data’s median
• MAD = mediani(|xi − medianj(xj)|)
• A robust outlier detection technique known as Hampel X84 is based on Median and MAD
• Hampel X84 marks outliers as those data points that are more than 1.4826 MADs away from the
median, where is the number of standard deviations away from the mean one would have used if
there were no outliers in the dataset
• Example – Median of t1 to t9 is 32 and MAD is 7, and let us use 2 sds away from the mean as outliers
• In terms of Hampel X84 test, points that are 1.4826 ∗ 2 = 2.9652 away from the median as outliers
• So the normal value range would be [32 − 7 ∗ 2.9652, 32 + 7 ∗ 2.9652]= [11.2436, 52.7564], which
is a much more reasonable range than [−511.06, 784.62] derived using mean and standard deviation
• Under the new normal range, t1[Age] and t9[Age] are now correctly flagged as outliers
• Python implementation: https://github.com/MichaelisTrofficus/hampel_filter
Distance-based Methods
A normal data point should be close to many other data points, and data points that deviate from
such normal behavior are declared outliers
Advantages:
• Purely based on data – no assumption on distribution
• Easy to apply
Disadvantages:
• Will fail on data sets where: normal data instance does not have close neighbours, or anomalies
have some close data points
• Computational complexity of distance between every two points is high
• Performance depends on the distance metric chosen
• Difficult to use for complex data like graphs, sequences etc
Model-based Methods
• Train a classifier model to distinguish between normal data and anomalies
• Apply the trained classifier to test data points to determine if it is an outlier
• One-class vs. multi-class
Advantages:
• Can use powerful ML algorithms
• Testing is quite fast
Disadvantages:
• Relies on the availability of accurate labels for various normal classes
Model Based Outlier Detection
Applications
Is SBI an outlier?
SBI staring at intense competition post-HDFC Bank
merger. But what makes it a forever-value stock?
https://economictimes.indiatimes.com/prime/money-and-markets/sbi-
staring-at-intense-competition-post-hdfc-bank-merger-but-what-makes-
it-a-forever-value-stock/primearticleshow/102613813.cms
Is Rainfall in Aug 2023 an Outlier?
Clouds of worry over rain deficit as monsoon goes from
"above normal" to "below normal" in just 15 days
https://economictimes.indiatimes.com/news/india/clouds-of-worry-ov
er-rain-deficit-as-monsoon-goes-from-above-normal-to-below-normal-i
n-just-15-days/articleshow/102760409.cms?from=mdr
What about landslides – in
Uttarakhand?
Are they an outlier?
https://timesofindia.indiatimes.com/india/a-sinking-feeling-across-uttar
akhand/articleshow/102872610.cms