Outlier Analysis

The document provides an overview of outlier analysis, defining outliers and discussing various detection methods including statistical, proximity-based, clustering-based, and classification approaches. It categorizes outliers into global, contextual, and collective types, and highlights their significance in applications such as fraud detection and medical analysis. Additionally, it outlines advantages and disadvantages of different detection techniques, including statistical methods, distance-based methods, and model-based methods.

Uploaded by

Shreya Parekh
Outlier Analysis
• Outlier and Outlier Analysis
• Outlier Detection Methods
• Statistical Approaches
• Proximity-Based Approaches
• Clustering-Based Approaches
• Classification Approaches
• Mining Contextual and Collective Outliers
• Outlier Detection in High Dimensional Data
• Summary
What Are Outliers?
An outlier is “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980]
and
“an outlier observation is one that appears to deviate markedly from other members of the sample in which it occurs” [Barnett and Lewis 1994].

• Ex.: An unusual credit card purchase; in sports, Michael Jordan, Wayne Gretzky, ...
• Outliers are different from noise
• Noise is random error or variance in a measured variable
• Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: a new pattern is flagged as an outlier at an early stage, but is later merged into the model
• Applications:
• Credit card fraud detection
• Telecom fraud detection
• Customer segmentation
• Medical analysis
Types of Outliers (I)
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
• An object O_g is a global outlier if it deviates significantly from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
• An object O_c is a contextual outlier if it deviates significantly with respect to a selected context
• Ex. 30°C in Ahmedabad: outlier? (It depends on whether it is summer or winter.)
• Attributes of data objects should be divided into two groups
• Contextual attributes: define the context, e.g., time and location
• Behavioral attributes: characteristics of the object used in outlier evaluation, e.g., temperature
• Can be viewed as a generalization of local outliers, whose density deviates significantly from that of their local area
• Issue: How to define or formulate a meaningful context?
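The split into contextual and behavioral attributes can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the readings are hypothetical, the season is taken as the single contextual attribute, temperature as the single behavioral attribute, and a per-context z-score with a threshold of 2 stands in for whatever deviation measure an application would actually choose.

```python
from collections import defaultdict
from statistics import mean, pstdev


def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) pairs whose value is far from the mean of its own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        values = by_context[context]
        mu, sigma = mean(values), pstdev(values)
        # Compare each value only against values that share its context.
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical temperature readings in degrees Celsius, labeled by season.
readings = [
    ("summer", 41), ("summer", 42), ("summer", 40),
    ("summer", 43), ("summer", 41), ("summer", 42),
    ("winter", 14), ("winter", 15), ("winter", 13), ("winter", 16),
    ("winter", 14), ("winter", 15), ("winter", 13), ("winter", 30),
]
print(contextual_outliers(readings))
```

In this made-up data, 30 lies well inside the overall range of temperatures, so it would not stand out as a global outlier; it is flagged only because it is compared against the other winter readings.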

Types of Outliers (II)
• Collective outliers
• A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
• Applications, e.g.:
• Intrusion detection: when a number of computers keep sending denial-of-service packets to each other
• Financial markets: a few companies buying and selling the same shares among themselves
• Detection of collective outliers
• Consider not only the behavior of individual objects, but also that of groups of objects
• Requires background knowledge of the relationship among data objects, such as a distance or similarity measure on objects

Reasons for Outliers in Data
• Naturally occurring anomalies
• Wrong values
• Fraud
• Any other reason?
Questions for Discussion

Questions:

• Does outlier mean wrong value?

• Are outliers the same as noise?

• We’ve found outliers in the data. What to do with them?


Overview of Outlier Detection Techniques
• Statistics-based Methods:
• Normal data points lie in high-probability regions; outliers lie in low-probability regions

• Distance-based Methods:
• Normal data points are close to many other points; points that are not are outliers

• Model-based Methods:
• Learn a classifier model to detect outliers, then apply the model to test data points

• Angle-based Methods:
• Based on the mean angle formed by the point with pairs of other points in the dataset
Statistics-based Methods
• Normal data points lie in high-probability regions; outliers lie in low-probability regions
• Important theorems:
• Markov’s Theorem: Let X be a random variable that takes only non-negative values. Then for any constant $\alpha > 0$, the following is true:
$P(X \ge \alpha) \le \frac{E[X]}{\alpha}$
• Chebychev’s Theorem: For any random variable X with mean $\mu$ and finite standard deviation $\sigma$,
$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$,
where k is a positive constant
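Chebychev’s bound can be checked numerically with a short Python sketch. The exponential sample, its size, and the seed are assumptions chosen only for illustration; any sample would do, because the bound applies to every distribution with a finite variance, including the empirical distribution of a sample when the population formulas for mean and standard deviation are used.

```python
import random
from statistics import mean, pstdev


def tail_fraction(sample, k):
    """Fraction of the sample lying at least k standard deviations from the mean."""
    mu, sigma = mean(sample), pstdev(sample)
    return sum(abs(x - mu) >= k * sigma for x in sample) / len(sample)


random.seed(0)
# Hypothetical skewed sample: 10,000 draws from an exponential distribution.
sample = [random.expovariate(1.0) for _ in range(10_000)]

for k in (2, 3, 4):
    # The observed tail fraction never exceeds the Chebychev bound 1 / k^2.
    print(k, tail_fraction(sample, k), 1 / k**2)
```

The observed fractions are typically far below 1/k², which shows why the bound is safe but loose: it assumes nothing about the shape of the distribution.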
Questions
• Compare Chebychev’s Theorem with a similar result that you know
about.
• Which result gives a better bound and why?
Statistics-based Methods
• Two broad categories of methods:
• Hypothesis Testing Methods
• Distribution Fitting Methods

• Hypothesis testing methods
• Null hypothesis: there is no outlier.
• Compute a test statistic and use a test such as Grubbs’ test to decide whether to reject the null hypothesis

• Distribution fitting methods
• Infer a probability density function (pdf) from the observed data
Statistical Methods – Advantages and Disadvantages
Advantages:
• Discovered outliers have a statistical interpretation
• Produce a score or a confidence interval for every data point, rather than a binary decision
• Unsupervised: no need for a labeled training data set

Disadvantages:
• The assumption of a particular distribution may not be valid, particularly for high-dimensional data
• Several hypothesis tests are available, which makes the choice of test harder
Hypothesis Testing Methods

• Grubbs’ Test
• detects a single outlier in an (approximately normal) univariate distribution

• Tietjen-Moore Test
• generalization of Grubbs’ test to the case of more than one outlier
• requires the number of outliers to be specified exactly

• Generalized Extreme Studentized Deviate (ESD) Test
• requires only an upper bound on the suspected number of outliers
• used when the exact number of outliers is unknown
Grubbs’ Test
• A standard test for a single outlier
• If an outlier is detected, it is expunged from the dataset and the test is repeated until no outliers are detected.
• Multiple iterations change the probabilities of detection, and the test should not be used for small sample sizes
• H0: there are no outliers in the dataset
• Ha: there is exactly one outlier in the dataset
• Grubbs’ test statistic is defined as
$G = \max_{i=1,\dots,N} \frac{|Y_i - \bar{Y}|}{s}$, with $\bar{Y}$ and $s$ denoting the sample mean and sample standard deviation respectively
• At significance level $\alpha$, the null hypothesis of no outliers is rejected if
$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$
• where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t-distribution with N−2 degrees of freedom at significance level $\alpha/(2N)$
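The test statistic itself is easy to compute; a minimal Python sketch follows. The Age values are hypothetical and are not the data set of the worked example in these slides. The critical value is passed in as a parameter rather than computed, since deriving it requires a t-distribution quantile from a table or a statistics package; 2.21 is the critical value the worked example in these slides quotes for nine observations.

```python
from math import sqrt
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, index), where G = max_i |Y_i - mean| / s and index locates that point."""
    y_bar, s = mean(values), stdev(values)  # sample mean and sample standard deviation
    index = max(range(len(values)), key=lambda i: abs(values[i] - y_bar))
    return abs(values[index] - y_bar) / s, index


def grubbs_rejects(values, g_critical):
    """Reject H0 (no outliers) when G exceeds the supplied critical value."""
    g, index = grubbs_statistic(values)
    return g > g_critical, g, index


# Hypothetical Age column with one suspicious entry.
ages = [22, 25, 27, 30, 32, 35, 38, 41, 1000]
rejected, g, index = grubbs_rejects(ages, g_critical=2.21)
print(rejected, round(g, 2), ages[index])
```

Note that G can never exceed (N − 1)/√N, which is about 2.67 for N = 9, so a single extreme value in a small sample pushes G close to that ceiling regardless of how extreme it is.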


Example of Outlier Detection using Grubbs’ Test
• Refer to the following data. Does any value seem to be an outlier, by inspection?
• Perform Grubbs’ test to test for outliers on the data:
Grubbs’ Test in Action
• Test statistic: G = 2.66 (for t9)
• G_critical: 2.21 at the chosen significance level
• As G > G_critical, reject H0: t9[Age] is an outlier.
• Remove t9
• For the remaining 8 Age values, mean = 28.88, sd = 12.62
• Grubbs’ test statistic = 1.32, for t8
• G_critical: 2.01 at the same significance level
• As G < G_critical, fail to reject H0: no further outlier is detected.
Distribution Fitting
• First fit a distribution to the data, to understand the normal behavior
• Apply a statistical inference procedure to determine whether a given data point belongs to the fitted distribution
• Declare data points that have a low probability of belonging to the learned statistical model as outliers.
Distribution Fitting for Univariate Data
• Compute the mean and SD of the data
• Compute the z-scores of the data points
• Any point whose absolute z-score exceeds a specified threshold, such as |z| > 3, can be considered an outlier
• However, any outliers present in the data distort the computed mean and standard deviation, so some outliers can go undetected: the masking problem
• E.g., the mean and sd of Age are 136.78 and 323.92
• So if the threshold is |z| > 2, then t9[Age] is declared an outlier
• but t1[Age] falls within the normal range and is missed as an outlier

• https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/
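The masking problem is easy to reproduce. The following Python sketch uses a hypothetical Age column (not the data set behind the figures quoted above) containing one implausibly large and one implausibly small value.

```python
from statistics import mean, stdev


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds the threshold."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical ages: 2 and 900 both look wrong on inspection.
ages = [2, 24, 26, 29, 31, 33, 36, 40, 900]

print(zscore_outliers(ages, threshold=2.0))  # 900 is flagged, 2 is masked
print(zscore_outliers(ages, threshold=3.0))  # nothing is flagged at all
```

The value 900 inflates both the mean and the standard deviation so much that 2 ends up well within two standard deviations of the mean. With only nine points, no z-score computed from the sample standard deviation can exceed (N − 1)/√N, about 2.67, so the common |z| > 3 rule cannot flag anything here.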
Outliers using Quartiles
• Find the 1st and 3rd quartiles (Q1 and Q3) of the data
• Compute IQR = Q3 – Q1
• Any point outside [Q1 – 1.5*IQR, Q3 + 1.5*IQR] (or [Q1 – 3*IQR, Q3 + 3*IQR]) can be called an outlier, depending on the application
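A minimal Python sketch of the quartile rule, using the standard library. The data are hypothetical, and the "inclusive" quartile convention is an assumption: several conventions for computing quartiles exist and they can give slightly different fences on small samples.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (outliers, (low, high)) using the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high], (low, high)


# Hypothetical Age column.
ages = [2, 24, 26, 29, 31, 33, 36, 40, 900]

print(iqr_outliers(ages))         # 1.5 * IQR fences
print(iqr_outliers(ages, k=3.0))  # wider 3 * IQR fences
```

On this data the 1.5*IQR fences flag both 2 and 900, while the wider 3*IQR fences flag only 900, which illustrates why the multiplier is an application-dependent choice.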
Robust Univariate Statistics
• Robust statistics correctly capture important properties of the underlying distribution even in the presence of many outliers.
• Robust univariate statistics: the median and the Median Absolute Deviation (MAD) replace the mean and SD
• The MAD is the median of the absolute deviations from the data’s median
• MAD = median_i(|x_i − median_j(x_j)|)
• A robust outlier detection technique known as Hampel X84 is based on the median and MAD
• Hampel X84 marks as outliers those data points that are more than 1.4826 θ MADs away from the median, where θ is the number of standard deviations away from the mean one would have used if there were no outliers in the dataset
• Example: the median of t1 to t9 is 32 and the MAD is 7; suppose points more than 2 sds away from the mean count as outliers (θ = 2)
• Under the Hampel X84 test, points more than 1.4826 ∗ 2 = 2.9652 MADs away from the median are outliers
• So the normal value range is [32 − 7 ∗ 2.9652, 32 + 7 ∗ 2.9652] = [11.2436, 52.7564], which is a much more reasonable range than the [−511.06, 784.62] derived using the mean and standard deviation
• Under the new normal range, t1[Age] and t9[Age] are both correctly flagged as outliers
• Python implementation: https://github.com/MichaelisTrofficus/hampel_filter
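The Hampel X84 rule can also be sketched directly with the standard library, independently of the linked package. The function names below are illustrative, and the `ages` list is hypothetical; the first call simply reproduces the range from the example above using the quoted median (32), MAD (7), and 2 standard deviations.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    m = median(values)
    return median([abs(x - m) for x in values])


def hampel_x84_bounds(center, mad_value, theta):
    """Normal range: center plus or minus 1.4826 * theta * MAD."""
    half_width = 1.4826 * theta * mad_value
    return center - half_width, center + half_width


def hampel_x84_outliers(values, theta=2.0):
    """Return the values falling outside the Hampel X84 normal range."""
    low, high = hampel_x84_bounds(median(values), mad(values), theta)
    return [x for x in values if x < low or x > high]


# Reproduce the range from the example: median 32, MAD 7, theta 2.
print(hampel_x84_bounds(32, 7, 2))

# Hypothetical Age column: both extreme values are caught.
ages = [2, 24, 26, 29, 31, 33, 36, 40, 900]
print(hampel_x84_outliers(ages, theta=2.0))
```

Because the median and MAD barely move when a few extreme values are added, the low value 2 is no longer masked by the high value 900.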
Distance-based Methods

A normal data point should be close to many other data points; data points that deviate from such normal behavior are declared outliers

Advantages:
• Purely data-driven: no assumption about the distribution
• Easy to apply

Disadvantages:
• Fails on data sets where normal data instances do not have close neighbours, or where anomalies have some close neighbours
• Computing the distance between every pair of points is computationally expensive
• Performance depends on the distance metric chosen
• Difficult to use for complex data such as graphs and sequences
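One common way to turn this idea into a score is the distance to the k-th nearest neighbour; the Python sketch below takes that approach as an assumption, since the slides do not name a specific algorithm. The points, the Euclidean metric, and k = 2 are all illustrative choices.

```python
from math import dist


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Brute force: distances from p to every other point, O(n^2) overall.
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Hypothetical 2-D data: a tight cluster plus one isolated point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)
most_isolated = max(range(len(points)), key=scores.__getitem__)
print(points[most_isolated], round(scores[most_isolated], 2))
```

The nested loop over all pairs makes the quadratic cost listed among the disadvantages visible, and swapping `dist` for another metric would change the scores, which is the dependence on the distance metric noted above.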
Model-based Methods
• Train a classifier model to distinguish between normal data and anomalies
• Apply the trained classifier to each test data point to determine whether it is an outlier
• One-class vs. multi-class

Advantages:
• Can use powerful ML algorithms
• Testing is quite fast

Disadvantages:
• Relies on the availability of accurate labels for various normal classes
Model Based Outlier Detection
Applications
Is SBI an outlier?
SBI staring at intense competition post-HDFC Bank merger. But what makes it a forever-value stock?

https://economictimes.indiatimes.com/prime/money-and-markets/sbi-staring-at-intense-competition-post-hdfc-bank-merger-but-what-makes-it-a-forever-value-stock/primearticleshow/102613813.cms
Is Rainfall in Aug 2023 an Outlier?
Clouds of worry over rain deficit as monsoon goes from "above normal" to "below normal" in just 15 days

https://economictimes.indiatimes.com/news/india/clouds-of-worry-over-rain-deficit-as-monsoon-goes-from-above-normal-to-below-normal-in-just-15-days/articleshow/102760409.cms?from=mdr
What about landslides – in Uttarakhand?
Are they an outlier?

There were 11,219 landslides in the state from 1988 to 2022. .. This year, there have been 1,123 already with the monsoon’s end still a month away.

https://timesofindia.indiatimes.com/india/a-sinking-feeling-across-uttarakhand/articleshow/102872610.cms
