Outlier Detection & Analysis 03

Uploaded by

pujiswathy

Outlier Detection & Analysis

Outlier Analysis - Outline
• Introduction / Motivation / Definition
• Statistical-based Detection
– Distribution-based, depth-based
• Deviation-based Method
– Sequential exception, OLAP data cube
• Distance-based Detection
– Index-based, nested-loop, cell-based, local-outliers
Introduction
• Traditional Data Mining Categories
– Majority of Objects
• Dependency detection
• Class identification
• Class description
– Exceptions
• Exception/outlier detection
Motivation for Outlier Analysis
• Fraud Detection (Credit card, telecommunications, criminal
activity in e-Commerce)
• Customized Marketing (high/low income buying habits)
• Medical Treatments (unusual responses to various drugs)
• Analysis of performance statistics (professional athletes)
• Weather Prediction
• Financial Applications (loan approval, stock tracking)

“One person’s noise could be another person’s signal.”

What is an outlier?

• Observations inconsistent with the rest of the dataset – Global Outlier

• Special outliers – Local Outlier
– Observations inconsistent with their neighborhoods
– A local instability or discontinuity
Causes of Outliers
• Poor data quality / contamination
• Low quality measurements, malfunctioning
equipment, manual error
• Correct but exceptional data
Outlier Detection Approaches
• Objective:
– Define what data can be considered as
inconsistent in a given data set
• Statistical-Based Outlier Detection
• Deviation-Based Outlier Detection
• Distance-Based Outlier Detection
– Find an efficient method to mine the outliers
Why A Special Technique to Identify
Outliers?
• Why not just modify clustering or other algorithms to
detect outliers?
– Performance considerations
– Results depend on the clustering algorithm and its
parameters
– Only certain attributes may have outlier properties, no
need to disqualify the entire tuple
– Contamination may occur by “column”, not by row
Statistical-Based Outlier
Detection (Distribution-based)
• Assumptions:
– Knowledge of data (distribution,
mean, variance)
• Statistical discordancy test
– Data is assumed to be part of a
working hypothesis
– Each data object in the dataset
is compared to the working
hypothesis and is either
accepted in the working
hypothesis or rejected as
discordant into an alternative
hypothesis (outliers)
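A minimal sketch of such a discordancy test, assuming the working hypothesis is a normal distribution with known mean and standard deviation. The function name `discordant`, the significance level `alpha`, and the sample readings are illustrative assumptions, not part of the original slides.

```python
from statistics import NormalDist

def discordant(values, mean, std, alpha=0.01):
    """Return the values rejected by a two-sided test under N(mean, std^2)."""
    # Critical z value for a two-sided test at significance level alpha.
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # A value is discordant when its standardized distance from the mean
    # exceeds the critical value.
    return [x for x in values if abs(x - mean) / std > z_crit]

# Hypothetical sensor readings; the working hypothesis is N(10.0, 0.2^2).
readings = [9.8, 10.1, 10.0, 9.9, 14.2, 10.2]
outliers = discordant(readings, mean=10.0, std=0.2)
print(outliers)
```

The result depends entirely on the assumed distribution and on `alpha`, which is the weakness the slides point out for this family of methods.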
Statistical-Based Outlier
Detection (Depth-based)
• Data is organized into layers
according to some
definition of depth
• Shallow layers are more
likely to contain
outliers than deep
layers
• Computation is efficient only
for k < 4, where k is the
number of dimensions
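One possible definition of depth in two dimensions is convex hull peeling: the points on the outer hull form layer 1, the hull of the remaining points forms layer 2, and so on. The sketch below assumes this definition (the slides do not fix one) and uses hypothetical function names and sample points.

```python
def cross(o, a, b):
    """Z component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Monotone-chain convex hull; collinear edge points are not hull vertices."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peel_depths(points):
    """Map each point to its layer number (1 = outermost, shallowest layer)."""
    remaining = set(points)
    depth = {}
    layer = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)
        layer += 1
    return depth

# Hypothetical data: a square, a triangle inside it, and a centre point.
points = [(0, 0), (4, 0), (4, 4), (0, 4), (1, 1), (3, 1), (2, 3), (2, 2)]
depth = peel_depths(points)
print(depth)
```

Points with depth 1 are the candidates most likely to be outliers under this definition.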
Statistical-Based Outlier Detection
• Strengths
– Most outlier research has been done in this area, many
data distributions are known
• Weaknesses
– Almost all of the statistical models are univariate (they
handle only one attribute), and those that are multivariate
are efficient only for k < 4
– All models assume the distribution is known, which is not
always the case
– Outlier detection depends entirely on the
distribution used
Deviation-Based Outlier Detection
• Simulates a mechanism familiar to humans:
after seeing a series of similar data, an
element that disturbs the series is considered an
exception
• Sequential Exception Techniques
• OLAP Data Cube Techniques
Sequential Exception

• Select subsets of data Ij (j=1,2,…,n) from the dataset I

• Compare the dissimilarity of I and (I-Ij)
• Find the smallest subset Ij that reduces the
dissimilarity the most
• Smoothing factor: SF(Ij) = C(I-Ij) * (D(I) - D(I-Ij))
– D is a dissimilarity function
– C is a cardinality function, for example, the number of
elements in the dataset
Example
Let the data set I be the set of integer values {1,4,4,4}

Ij        I-Ij        C(I-Ij)   D(I-Ij)   SF(Ij)
{}        {1,4,4,4}   4         1.69       0.00
{4}       {1,4,4}     3         2.00      -0.93
{4,4}     {1,4}       2         2.25      -1.12
{4,4,4}   {1}         1         0.00       1.69
{1}       {4,4,4}     3         0.00       5.07
{1,4}     {4,4}       2         0.00       3.38
{1,4,4}   {4}         1         0.00       1.69

Note: when Ij = {}, D(I) = D(I-Ij) = 1.69 and SF(Ij) = 0.

When Ij = {1}, SF(Ij) has the maximum value, so {1} is the outlier set.
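The table can be reproduced with a short sketch. It assumes D is the population variance, which matches the D values listed (1.69, 2.00, 2.25), and SF(Ij) = C(I-Ij) * (D(I) - D(I-Ij)), which matches the SF column. Function names are illustrative. The code does not round D(I) before multiplying, so some values differ slightly from the table (for example 5.0625 instead of 5.07).

```python
from collections import Counter
from statistics import pvariance

def dissimilarity(values):
    """D: population variance (assumed); 0 for fewer than two elements."""
    return pvariance(values) if len(values) > 1 else 0.0

def smoothing_factor(full, removed):
    """SF(Ij) = C(I - Ij) * (D(I) - D(I - Ij)), with C the element count."""
    rest = list((Counter(full) - Counter(removed)).elements())
    return len(rest) * (dissimilarity(full) - dissimilarity(rest))

I = [1, 4, 4, 4]
candidates = [(4,), (4, 4), (4, 4, 4), (1,), (1, 4), (1, 4, 4)]
factors = {c: smoothing_factor(I, c) for c in candidates}
best = max(factors, key=factors.get)

for subset, sf in factors.items():
    print(subset, round(sf, 4))
print("outlier set:", best)
```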
OLAP Data Cube Technique
• The deviation detection process overlaps with cube
computation
• Precomputed measures indicating data exceptions
are needed
• A cell value is considered an exception if it is
significantly different from the expected value, based
on a statistical model
• Use visual cues such as background color to reflect
the degree of exception
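As a rough illustration of flagging exceptional cells, the sketch below assumes a simple additive model for a 2-D slice of a cube: the expected value of a cell is row mean + column mean - grand mean, and a cell is an exception when its standardized residual exceeds a threshold. The model, the threshold, and the sample figures are assumptions made for this example; the slides do not specify the statistical model.

```python
from statistics import mean, pstdev

def cell_exceptions(table, threshold=1.5):
    """Return (row, col) indices of cells far from their expected value."""
    grand = mean(v for row in table for v in row)
    row_means = [mean(row) for row in table]
    col_means = [mean(col) for col in zip(*table)]
    # Residual = observed value minus the value expected under the additive model.
    residuals = {
        (i, j): value - (row_means[i] + col_means[j] - grand)
        for i, row in enumerate(table)
        for j, value in enumerate(row)
    }
    spread = pstdev(list(residuals.values()))
    if spread == 0:
        return []
    return [cell for cell, r in residuals.items() if abs(r) / spread > threshold]

# Hypothetical sales by region (rows) and quarter (columns).
sales = [
    [10, 12, 11],
    [20, 22, 21],
    [30, 32, 60],
]
exceptions = cell_exceptions(sales)
print(exceptions)
```

The standardized residual could also drive the visual cue, for example by mapping larger values to a darker background color.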
Distance-Based Outlier Detection
• Distance-based: An object O in a dataset T is a
DB(p,D) outlier if at least a fraction p of the objects in T
are at a distance of at least D from O
• A point O in a dataset is an outlier with respect to
parameters k and d if no more than k points in the
dataset are at a distance of d or less from O.
• Relative measurement: Let Dk(O) denote the distance
of the kth nearest neighbor of O. It is a measure of
how much of an outlier point O is.
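The first and third definitions can be checked directly with a brute-force sketch. It assumes Euclidean distance and counts the fraction over all objects in T, including O itself; both choices, the function names, and the sample points are assumptions for illustration.

```python
from math import dist

def is_db_outlier(index, data, p, d):
    """True if at least a fraction p of the objects are at distance >= d from data[index]."""
    o = data[index]
    far = sum(1 for x in data if dist(o, x) >= d)
    return far >= p * len(data)

def kth_nn_distance(index, data, k):
    """Dk(O): distance from data[index] to its k-th nearest neighbor."""
    o = data[index]
    distances = sorted(dist(o, x) for i, x in enumerate(data) if i != index)
    return distances[k - 1]

# Hypothetical data: four clustered points and one distant point.
data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
flags = [is_db_outlier(i, data, p=0.8, d=5) for i in range(len(data))]
ranks = [kth_nn_distance(i, data, k=2) for i in range(len(data))]
print(flags)
print(ranks)
```

Sorting objects by Dk(O) in descending order gives a ranking of outliers without having to choose p and D in advance.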
Index-based Algorithm
• Indexing structures such as the R-tree (R+-tree) or the K-D (K-D-B) tree are built for
the multi-dimensional database
• The index is used to search for neighbors of each object O within radius D
around that object.
• Once more than M (M = N(1-p)) neighbors of object O are found, O is not an outlier.
• Worst-case computational complexity is O(K*n^2), where K is the dimensionality and
n is the number of objects in the dataset.
• Pros: scales well with K
• Cons: index construction can be time-consuming
Nested-loop Algorithm
• Divides the buffer space into two halves (first and second
arrays)
• Breaks the data into blocks and feeds two blocks at a time into the
arrays.
• Directly computes the distance between each pair of objects,
inside an array or between arrays
• Decides which objects are outliers.
• Here comes an example:…
• Same computational complexity as the index-based algorithm
• Pros: avoids index structure construction
• Tries to minimize the I/Os
Example – stage 1
• A is the target block on stage 1
• Load A into the first array (1R)
• Load B into the second array (1R)
• Load C into the second array (1R)
• Load D into the second array (1R)
• Total: 4 Reads
• DB blocks: A, B, C, D
• Buffer at the starting point of stage 1: A (first array), B (second array)
• Buffer at the end point of stage 1: A (first array), D (second array)

Example – stage 2
• D is the target block on stage 2
• D is already in the buffer (no R)
• A is already in the buffer (no R)
• Load B into the first array (1R)
• Load C into the first array (1R)
• Total: 2 Reads
• DB blocks: A, B, C, D
• Buffer at the starting point of stage 2: A (first array), D (second array)
• Buffer at the end point of stage 2: C (first array), D (second array)

Example – stage 3
• C is the target block on stage 3
• C is already in the buffer (no R)
• D is already in the buffer (no R)
• Load A into the second array (1R)
• Load B into the second array (1R)
• Total: 2 Reads
• DB blocks: A, B, C, D
• Buffer at the starting point of stage 3: C (first array), D (second array)
• Buffer at the end point of stage 3: C (first array), B (second array)

Example – stage 4
• B is the target block on stage 4
• B is already in the buffer (no R)
• C is already in the buffer (no R)
• Load A into the first array (1R)
• Load D into the first array (1R)
• Total: 2 Reads
• DB blocks: A, B, C, D
• Buffer at the starting point of stage 4: C (first array), B (second array)
• Buffer at the end point of stage 4: D (first array), B (second array)

Every block is ¼ of the DB. From stage 1-4, a grand total of 10 blocks are read,
amounting to 10/4 passes over the entire dataset.
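The read counts of the four stages can be reproduced with a small simulation of the buffer schedule. This sketch only counts block reads and does not compute distances. The ordering rule (stream blocks that have already been targets first, so that the block left in the buffer can become the next target) is inferred from the example, and the function name is illustrative.

```python
def nested_loop_reads(blocks):
    """Return the number of block reads in each stage of the schedule."""
    target = blocks[0]
    other = None        # block currently held in the non-target array
    finished = []       # blocks that have already served as target
    reads_per_stage = []
    for _ in blocks:
        reads = 0
        if other is None:
            reads += 1  # very first stage: the target block itself is read
        # Blocks that are not in the buffer and must be streamed in.
        rest = [b for b in blocks if b != target and b != other]
        # Already-processed targets go first; the sort is stable, so DB order
        # is kept otherwise.
        rest.sort(key=lambda b: b not in finished)
        reads += len(rest)  # each streamed block costs one read
        finished.append(target)
        last = rest[-1] if rest else other
        # The old target stays in the buffer; the last streamed block
        # becomes the target of the next stage.
        target, other = last, target
        reads_per_stage.append(reads)
    return reads_per_stage

stage_reads = nested_loop_reads(["A", "B", "C", "D"])
print(stage_reads, sum(stage_reads))
```

With four blocks the schedule costs 4 + 2 + 2 + 2 reads, which is the total of 10 block reads stated in the example.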
Cell-Based Algorithm

• Divide the dataset into cells with side length D / (2 * sqrt(K))

– K is the dimensionality, D is the distance

• Define Layer-1 neighbors – all the immediate neighbor cells. The maximum distance between
a point in a cell and a point in its Layer-1 neighbor cells is D

• Define Layer-2 neighbors – the cells within 3 cells of a certain cell. The minimum distance
between a point in a cell and a point outside its Layer-2 neighbors is D

• Criteria (M is the maximum number of objects that may lie within distance D of an outlier)
– Search a cell internally. If there are more than M objects inside, none of the objects in this cell is an outlier
– Search its layer-1 neighbors. If there are more than M objects inside a cell and its layer-1 neighbors, none of the objects in
this cell is an outlier
– Search its layer-2 neighbors. If there are no more than M objects inside a cell, its layer-1 neighbor cells, and its
layer-2 neighbor cells, all the objects in this cell are outliers
– Otherwise, the objects in this cell could be outliers. Calculate the distance between each
object in this cell and the objects in the layer-2 neighbor cells to see whether the total number of points
within distance D is more than M or not.
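The criteria above can be sketched for the two-dimensional case. The code assumes a cell side of D / (2 * sqrt(2)), so that any two points in a cell and its Layer-1 neighbors are at most D apart, and it counts an object as lying within its own D-neighborhood. Function names and the sample data are illustrative assumptions.

```python
from collections import defaultdict
from math import dist, floor, sqrt

def cell_based_outliers(points, d, m):
    """Return the points with at most m objects within distance d (2-D only)."""
    side = d / (2 * sqrt(2))
    cells = defaultdict(list)
    for pt in points:
        cells[(floor(pt[0] / side), floor(pt[1] / side))].append(pt)

    def ring(cell, lo, hi):
        """Points in cells whose cell distance from `cell` is between lo and hi."""
        cx, cy = cell
        found = []
        for dx in range(-hi, hi + 1):
            for dy in range(-hi, hi + 1):
                if lo <= max(abs(dx), abs(dy)) <= hi:
                    found.extend(cells.get((cx + dx, cy + dy), []))
        return found

    outliers = []
    for cell, members in list(cells.items()):
        if len(members) > m:
            continue                      # rule 1: the cell alone is dense enough
        near = len(members) + len(ring(cell, 1, 1))
        if near > m:
            continue                      # rule 2: cell plus Layer-1 is dense enough
        layer2 = ring(cell, 2, 3)
        if near + len(layer2) <= m:
            outliers.extend(members)      # rule 3: too few objects even with Layer-2
            continue
        for pt in members:                # rule 4: check Layer-2 objects one by one
            close = near + sum(1 for q in layer2 if dist(pt, q) <= d)
            if close <= m:
                outliers.append(pt)
    return outliers

# Hypothetical data: a tight cluster and one isolated point.
data = [(0.1, 0.1), (0.2, 0.1), (0.1, 0.3), (0.3, 0.3), (0.2, 0.2), (10.0, 10.0)]
found = cell_based_outliers(data, d=2.0, m=1)
print(found)
```

Only the last rule needs pairwise distance computations, which is where the cell-based method saves work compared with the nested-loop algorithm.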
• An example
Example
Red – A certain cell

Yellow – Layer-1 Neighbor Cells

Blue – Layer-2 Neighbor Cells

Notes:
The maximum distance between a point in the red cell and a point in its layer-1
neighbor cells is D

The minimum distance between a point in the red cell and a point outside its layer-2
neighbor cells is D
Distance-Based Outlier
Detection (Local Outliers)
• Some outliers can be
defined as global outliers,
some can be defined as
local outliers to a given
cluster
• O2 would not normally be
considered an outlier
with regular
distance-based outlier
detection, since it looks at
the global picture
Distance-Based Outlier
Detection (Local Outliers)
• Each data object is
assigned a local
outlier factor (LOF)
• Objects which are
closer to dense
clusters receive a
higher LOF
• LOF varies according
to the parameter
MinPts
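A compact sketch of the LOF computation, following the usual definitions of k-distance, reachability distance, and local reachability density with k = MinPts. It assumes Euclidean distance and no duplicate points (duplicates would make a density infinite). The function name and the sample points are illustrative.

```python
from math import dist

def lof_scores(points, min_pts):
    """Return the local outlier factor of each point (brute force)."""
    n = len(points)
    d = [[dist(a, b) for b in points] for a in points]
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((d[i][j], j) for j in range(n) if j != i)
        kd = others[min_pts - 1][0]          # distance to the MinPts-th neighbor
        k_dist.append(kd)
        # The neighborhood holds every point within the k-distance (ties included).
        neigh.append([j for dij, j in others if dij <= kd])

    def lrd(i):
        """Local reachability density: inverse of the mean reachability distance."""
        total = sum(max(k_dist[j], d[i][j]) for j in neigh[i])
        return len(neigh[i]) / total

    lrds = [lrd(i) for i in range(n)]
    # LOF: average ratio of the neighbors' densities to the point's own density.
    return [
        sum(lrds[j] for j in neigh[i]) / (len(neigh[i]) * lrds[i])
        for i in range(n)
    ]

# Hypothetical data: a unit square and one point far away from it.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, min_pts=2)
print([round(s, 2) for s in scores])
```

Scores close to 1 indicate a density similar to that of the neighbors; scores well above 1 indicate a point that is much sparser than its neighborhood. Rerunning with a different `min_pts` changes the scores, which is the dependence on MinPts noted above.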
Distance-Based Outlier
Detection (Partition-based)

• Partition-based detection
– Use BIRCH clustering to identify clusters/partitions of
non-outliers
– Prune partitions that do not contain outliers
– Use Index/Nested Loop algorithms on the remaining
data points
– Since many data points are removed during pruning,
efficiency increases significantly.
