Outlier Analysis

The document provides an overview of outlier analysis, defining outliers and discussing various detection methods including statistical, proximity-based, clustering-based, and classification approaches. It categorizes outliers into global, contextual, and collective types, and highlights their significance in applications such as fraud detection and medical analysis. Additionally, it outlines advantages and disadvantages of different detection techniques, including statistical methods, distance-based methods, and model-based methods.

Uploaded by

Shreya Parekh

Outlier Analysis
• Outlier and Outlier Analysis
• Outlier Detection Methods
• Statistical Approaches
• Proximity-Based Approaches
• Clustering-Based Approaches
• Classification Approaches
• Mining Contextual and Collective Outliers
• Outlier Detection in High Dimensional Data
• Summary
What Are Outliers?
An outlier is “an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism” [Hawkins 1980], and “an outlier observation is one that appears to deviate markedly from other members of the sample in which it occurs” [Barnett and Lewis 1994].

• Ex.: an unusual credit card purchase; in sports: Michael Jordan, Wayne Gretzky, ...

• Outliers are different from noise

• Noise is random error or variance in a measured variable
• Noise should be removed before outlier detection
• Outliers are interesting: they violate the mechanism that generates the normal data
• Outlier detection vs. novelty detection: a novel pattern is flagged as an outlier at an early stage, but is later merged into the model
• Applications:
• Credit card fraud detection
• Telecom fraud detection
• Customer segmentation
• Medical analysis
Types of Outliers (I)
• Three kinds: global, contextual and collective outliers
• Global outlier (or point anomaly)
• An object Og is a global outlier if it deviates significantly from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: Find an appropriate measurement of deviation
• Contextual outlier (or conditional outlier)
• An object Oc is a contextual outlier if it deviates significantly with respect to a selected context
• Ex. Is 30 °C in Ahmedabad an outlier? It depends on whether it is summer or winter
• Attributes of data objects should be divided into two groups
• Contextual attributes: define the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• Can be viewed as a generalization of local outliers, whose density deviates significantly from that of their local area
• Issue: How to define or formulate meaningful context?
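
A minimal Python sketch of the contextual idea above: the behavioral attribute (temperature) is scored only against readings that share the same contextual attribute (season). The readings, the season labels, the function name and the threshold of 2 standard deviations are hypothetical values chosen for illustration; none of them come from the slides.

```python
from collections import defaultdict
from statistics import mean, stdev


def contextual_outliers(records, threshold=2.0):
    """Flag (context, value) pairs whose value lies more than `threshold`
    sample standard deviations from the mean of its own context."""
    by_context = defaultdict(list)
    for context, value in records:
        by_context[context].append(value)

    flagged = []
    for context, value in records:
        values = by_context[context]
        mu, sigma = mean(values), stdev(values)
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((context, value))
    return flagged


# Hypothetical daily maximum temperatures in degrees Celsius.
readings = [
    ("summer", 40), ("summer", 41), ("summer", 39), ("summer", 42),
    ("summer", 40), ("summer", 41), ("summer", 30),
    ("winter", 28), ("winter", 29), ("winter", 30), ("winter", 27),
    ("winter", 28), ("winter", 29), ("winter", 30),
]
print(contextual_outliers(readings))
```

In this made-up sample the 30 °C reading is flagged in the summer context but not in the winter context, which is the behavior the Ahmedabad example describes.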

Types of Outliers (II)
• Collective Outliers
• A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
• Applications, e.g.:
• Intrusion detection: a number of computers keep sending denial-of-service packets to each other
• Financial markets: a few companies buying and selling the same shares among themselves
• Detection of collective outliers
• Consider not only the behavior of individual objects, but also that of groups of objects
• Requires background knowledge of the relationship among data objects, such as a distance or similarity measure on objects

Reasons for Outliers in Data
• Naturally occurring anomalies
• Wrong values
• Fraud
• Any other reason?
Questions for Discussion

• Does an outlier mean a wrong value?

• Are outliers the same as noise?

• We’ve found outliers in the data. What should we do with them?

Overview of Outlier Detection Techniques
• Statistics-based Methods:
• Normal data points lie in high-probability regions; outliers lie in low-probability regions

• Distance-based Methods:
• Normal data points are close to many other points; points that are not are outliers

• Model-based Methods:
• Learn a classifier model to detect outliers, then apply the model to test data points

• Angle-based Methods:
• Based on the mean angle formed by the point with pairs of other points in the dataset
Statistics-based Methods
• Normal data points lie in high-probability regions; outliers lie in low-probability regions
• Important Theorems:
• Markov’s Theorem: Let X be a random variable that takes only non-negative values. Then for any constant α satisfying $E[X] < \alpha$, the following is true: $P(X > \alpha) \le \frac{E[X]}{\alpha}$

• Chebychev’s Theorem: For any arbitrary random variable X with mean μ and standard deviation σ, $P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}$, where k is a positive constant
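
To make Chebychev’s bound concrete, the following Python sketch compares the observed fraction of points at least k standard deviations from the mean with the bound 1/k². The exponential sample, the seed and the function name are assumptions made for illustration; any data set would do, since the bound holds for every distribution with finite variance.

```python
import random
from statistics import fmean, pstdev


def tail_fraction(sample, k):
    """Fraction of points at least k standard deviations from the mean,
    using the sample's own mean and population standard deviation."""
    mu = fmean(sample)
    sigma = pstdev(sample, mu)
    return sum(abs(x - mu) >= k * sigma for x in sample) / len(sample)


# Hypothetical skewed sample (exponential with rate 1).
random.seed(0)
sample = [random.expovariate(1.0) for _ in range(10_000)]

for k in (1.5, 2.0, 3.0):
    observed = tail_fraction(sample, k)
    print(f"k={k}: observed {observed:.4f}, Chebychev bound {1 / k**2:.4f}")
```

Because the bound assumes nothing about the shape of the distribution, the observed fractions can be expected to sit well below it.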
Questions
• Compare Chebychev’s Theorem with a similar result that you know about.
• Which result gives a better bound and why?
Statistics-based Methods
• Two broad categories of methods:
• Hypothesis Testing Methods
• Distribution Fitting Methods

• Hypothesis testing methods –

• Null hypothesis: there is no outlier.
• Compute a test statistic and use a test such as Grubbs’ Test to decide whether the null hypothesis can be rejected

• Distribution fitting methods –

• Infer a probability density function (pdf) from the observed data
Statistical Methods – Advantages and Disadvantages
Advantages:
• Discovered outliers have a statistical interpretation
• Provide a score or a confidence interval for every data point, rather than a binary decision
• Unsupervised method: no need for a labeled training data set

Disadvantages:
• The assumption of a particular distribution may not be valid, particularly for high-dimensional data
• Several hypothesis tests are available, which makes choosing one harder
Hypothesis Testing Methods

• Grubbs’ Test –
• detects a single outlier in an (approximately normal) univariate distribution

• Tietjen-Moore Test –
• generalization of Grubbs’ test to the case of more than one outlier
• requires the number of outliers to be specified exactly

• Generalized Extreme Studentized Deviate (ESD) Test –

• requires only an upper bound on the suspected number of outliers
• used when the exact number of outliers is unknown
Grubbs’ Test
• A standard test for a single outlier
• When an outlier is detected, it is expunged from the dataset and the test is repeated until no outliers are detected
• Multiple iterations change the probabilities of detection, and the test is not to be used for a small sample size
• H0: there are no outliers in the dataset
• Ha: there is exactly one outlier in the dataset
• Grubbs’ test statistic is defined as
• $G = \max_{i=1,\dots,N} \frac{|x_i - \bar{x}|}{s}$, with $\bar{x}$ and $s$ denoting the sample mean and sample sd respectively
• The null hypothesis of no outliers is rejected at significance level α if
$G > \frac{N-1}{\sqrt{N}} \sqrt{\frac{t^2_{\alpha/(2N),\,N-2}}{N - 2 + t^2_{\alpha/(2N),\,N-2}}}$
• where $t_{\alpha/(2N),\,N-2}$ is the upper critical value of the t-distribution with N-2 df and significance level α/(2N)
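
A minimal Python sketch of one round of the test, using the standard two-sided formulation. The Python standard library has no t-distribution, so the sketch assumes the caller supplies the t critical value (from a table or a statistics package); the function names and the data are hypothetical and are not the table used in the slides.

```python
from math import sqrt
from statistics import mean, stdev


def grubbs_statistic(values):
    """Return (G, index): G = max_i |x_i - mean| / s, and the index of the
    point that attains the maximum."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    index = max(range(len(values)), key=lambda i: abs(values[i] - x_bar))
    return abs(values[index] - x_bar) / s, index


def grubbs_critical(n, t_crit):
    """Critical value for G given t_crit, the upper alpha/(2n) critical
    value of the t-distribution with n - 2 degrees of freedom."""
    return (n - 1) / sqrt(n) * sqrt(t_crit**2 / (n - 2 + t_crit**2))


# Hypothetical measurements; the last one looks suspicious.
data = [10.0, 11.0, 9.0, 10.0, 12.0, 10.0, 11.0, 9.0, 30.0]
g, idx = grubbs_statistic(data)
print(f"G = {g:.3f} for data[{idx}] = {data[idx]}")
```

If G exceeds the critical value, the flagged point would be removed and the test repeated on the remaining values.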

Example of Outlier Detection using Grubbs’ Test
• Refer to the following data. Does any value seem to be an outlier, by inspection?
• Perform Grubbs’ Test to test for outliers on the data:
Grubbs’ Test in Action
• Test statistic: G = 2.66 (for t9)
• Gcritical: 2.21 at the chosen significance level
• As G > Gcritical, reject H0. t9[Age] is an outlier.
• Remove t9
• For the remaining 8 Age values, mean = 28.88, sd = 12.62
• Grubbs’ test statistic = 1.32, for t8
• Gcritical: 2.01 at the chosen significance level
• As G < Gcritical, fail to reject H0: no further outlier is detected.
Distribution Fitting
• First fit a distribution to the data, to understand the normal behavior
• Apply a statistical inference procedure to determine whether a certain data point belongs to the fitted distribution
• Declare data points that have a low probability of belonging to the learned statistical model as outliers
Distribution Fitting for Univariate Data
• Compute the mean and SD of the data
• Compute the z-scores of the data points
• Any point whose z-score exceeds a specified threshold in absolute value, such as |z| > 3, can be considered an outlier
• However, the outliers present in the data, if any, distort the computed mean and standard deviation: the Masking Problem
• E.g., the mean and sd of Age are 136.78 and 323.92
• So if the threshold is |z| > 2, the normal range is [−511.06, 784.62] and t9[Age] is declared an outlier
• t1[Age] falls inside this range and is missed as an outlier

• https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/
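
The masking problem can be reproduced with a short Python sketch. The ages below are hypothetical (they are not the table from the slides) and the function name is an assumption; the sample contains two implausible values, 1 and 1000.

```python
from statistics import mean, stdev


def z_score_outliers(values, threshold=3.0):
    """Return the values whose absolute z-score exceeds `threshold`."""
    mu, sigma = mean(values), stdev(values)
    return [x for x in values if abs(x - mu) / sigma > threshold]


# Hypothetical ages: both 1 and 1000 are implausible.
ages = [1, 25, 27, 30, 32, 35, 38, 41, 1000]
print(z_score_outliers(ages, threshold=2.0))
```

The extreme value 1000 inflates both the mean and the standard deviation, so the value 1 ends up with a small z-score and is not flagged. Note also that with nine points and the sample standard deviation, the largest attainable |z| is (n − 1)/√n ≈ 2.67, so a threshold of 3 could never flag anything in a sample this small.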
Outliers using Quartiles
• Find the 1st and 3rd quartiles (Q1 and Q3) of the data
• Compute IQR = Q3 – Q1
• Any point outside [Q1 – 1.5*IQR, Q3 + 1.5*IQR] (or [Q1 – 3*IQR, Q3 + 3*IQR]) can be called an outlier, depending on the application
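
The quartile rule can be sketched in a few lines of Python. The ages are hypothetical, the function name is an assumption, and the "inclusive" quartile convention of `statistics.quantiles` is one of several in use; other conventions give slightly different fences.

```python
from statistics import quantiles


def iqr_outliers(values, k=1.5):
    """Return (outliers, (low, high)) using the fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in values if x < low or x > high], (low, high)


# Hypothetical ages, not the table from the slides.
ages = [1, 25, 27, 30, 32, 35, 38, 41, 1000]
outliers, fences = iqr_outliers(ages)
print(outliers, fences)
```

Passing `k=3` gives the wider fences mentioned above, which on this sample keep the value 1 and flag only 1000.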
Robust Univariate Statistics
• Robust statistics correctly capture important properties of the underlying distribution even in the presence of many outliers
• Median and Median Absolute Deviation (MAD) replace mean and SD
• Median Absolute Deviation (MAD) is the median of the absolute deviations from the data’s median
• $\mathrm{MAD} = \mathrm{median}_i(|x_i - \mathrm{median}_j(x_j)|)$
• A robust outlier detection technique known as Hampel X84 is based on median and MAD
• Hampel X84 marks as outliers those data points that are more than 1.4826 θ MADs away from the median, where θ is the number of standard deviations away from the mean one would have used if there were no outliers in the dataset
• Example: the median of t1 to t9 is 32 and the MAD is 7; suppose points more than 2 sds away from the mean count as outliers (θ = 2)
• In terms of the Hampel X84 test, points that are more than 1.4826 ∗ 2 = 2.9652 MADs away from the median are outliers
• So the normal value range is [32 − 7 ∗ 2.9652, 32 + 7 ∗ 2.9652] = [11.2436, 52.7564], which is a much more reasonable range than [−511.06, 784.62] derived using mean and standard deviation
• Under the new normal range, t1[Age] and t9[Age] are now correctly flagged as outliers
• Python implementation: https://github.com/MichaelisTrofficus/hampel_filter
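
The Hampel X84 rule is small enough to sketch with the standard library alone. This is not the linked package; the function names and the `theta` parameter (the number of standard deviations one would have used on clean data) are naming assumptions, and the age list is hypothetical. The first call reuses the median (32) and MAD (7) quoted in the example above.

```python
from statistics import median


def mad(values):
    """Median absolute deviation from the median."""
    med = median(values)
    return median([abs(x - med) for x in values])


def hampel_range(med, mad_value, theta=2.0):
    """Normal range: median +/- 1.4826 * theta * MAD."""
    half_width = 1.4826 * theta * mad_value
    return med - half_width, med + half_width


def hampel_outliers(values, theta=2.0):
    """Return the values that fall outside the Hampel X84 normal range."""
    low, high = hampel_range(median(values), mad(values), theta)
    return [x for x in values if x < low or x > high]


# Median and MAD as quoted in the example above.
low, high = hampel_range(32, 7, theta=2)
print(round(low, 4), round(high, 4))

# Hypothetical ages, not the table from the slides.
ages = [1, 25, 27, 30, 32, 35, 38, 41, 1000]
print(hampel_outliers(ages))
```

Because the median and MAD barely move when a few extreme values are added, both implausible ages in the hypothetical list fall outside the range.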
Distance-based Methods

A normal data point should be close to many other data points; data points that deviate from this behavior are declared outliers

Advantages:
• Purely based on data: no assumption on the distribution
• Easy to apply

Disadvantages:
• Fails on data sets where normal instances do not have close neighbours, or where anomalies have some close data points
• Computing the distance between every pair of points is expensive
• Performance depends on the distance metric chosen
• Difficult to use for complex data such as graphs and sequences
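
One common way to turn the distance-based idea into a score is the distance to the k-th nearest neighbour; the slides do not prescribe a specific scoring rule, so this choice, the Euclidean metric, the function name and the points below are all assumptions for illustration. The brute-force loop also shows where the pairwise cost comes from.

```python
from math import dist


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest
    neighbour; larger scores mean more isolated points."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point: O(n) work per point,
        # O(n^2) overall.
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


# Hypothetical 2-D points: a tight cluster plus one distant point.
points = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2), (8.0, 8.0)]
scores = knn_distance_scores(points, k=2)
most_isolated = max(range(len(points)), key=scores.__getitem__)
print(points[most_isolated], scores)
```

Swapping `dist` for another metric changes the scores, which illustrates the dependence on the chosen distance measure.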
Model-based Methods
• Train a classifier model to distinguish between normal data and anomalies
• Apply the trained classifier to each test data point to determine whether it is an outlier
• One-class vs. multi-class

Advantages:
• Can use powerful ML algorithms
• Testing is quite fast

Disadvantages:
• Relies on the availability of accurate labels for various normal classes
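
The train-then-apply workflow can be sketched with a deliberately simple stand-in model. The slides name no particular classifier; the nearest-centroid rule, the function names, the labels and the training points below are all assumptions chosen to keep the example self-contained.

```python
from math import dist
from statistics import fmean


def fit_centroids(samples, labels):
    """Learn one centroid per class label from labeled training points."""
    centroids = {}
    for label in set(labels):
        members = [s for s, l in zip(samples, labels) if l == label]
        centroids[label] = tuple(fmean(coord) for coord in zip(*members))
    return centroids


def predict(centroids, point):
    """Assign the label of the closest centroid."""
    return min(centroids, key=lambda label: dist(centroids[label], point))


# Hypothetical labeled training data.
train_x = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (6.0, 6.5), (6.2, 5.9)]
train_y = ["normal", "normal", "normal", "anomaly", "anomaly"]

model = fit_centroids(train_x, train_y)
print(predict(model, (1.1, 1.0)), predict(model, (5.8, 6.1)))
```

Training needs accurate labels for both classes, while prediction is a handful of distance computations per test point, which mirrors the trade-off listed above.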
Model Based Outlier Detection
Applications
Is SBI an outlier?
SBI staring at intense competition post-HDFC Bank merger. But what makes it a forever-value stock?

https://economictimes.indiatimes.com/prime/money-and-markets/sbi-staring-at-intense-competition-post-hdfc-bank-merger-but-what-makes-it-a-forever-value-stock/primearticleshow/102613813.cms
Is Rainfall in Aug 2023 an Outlier?
Clouds of worry over rain deficit as monsoon goes from "above normal" to "below normal" in just 15 days

https://economictimes.indiatimes.com/news/india/clouds-of-worry-over-rain-deficit-as-monsoon-goes-from-above-normal-to-below-normal-in-just-15-days/articleshow/102760409.cms?from=mdr
What about landslides in Uttarakhand?
Are they an outlier?

There were 11,219 landslides in the state from 1988 to 2022. ... This year, there have been 1,123 already with the monsoon’s end still a month away.

https://timesofindia.indiatimes.com/india/a-sinking-feeling-across-uttarakhand/articleshow/102872610.cms
