Outlier Detection & Analysis 03
Outlier - Outline
• Introduction / Motivation / Definition
• Statistical-based Detection
– Distribution-based, depth-based
• Deviation-based Method
– Sequential exception, OLAP data cube
• Distance-based Detection
– Index-based, nested-loop, cell-based, local-outliers
Introduction
• Traditional Data Mining Categories
– Majority of Objects
• Dependency detection
• Class identification
• Class description
– Exceptions
• Exception/outlier detection
Motivation for Outlier Analysis
• Fraud Detection (Credit card, telecommunications, criminal
activity in e-Commerce)
• Customized Marketing (high/low income buying habits)
• Medical Treatments (unusual responses to various drugs)
• Analysis of performance statistics (professional athletes)
• Weather Prediction
• Financial Applications (loan approval, stock tracking)
• Observations inconsistent with the rest of the dataset – Global Outlier
When Ij={1}, SF(Ij) has the maximum value, so {1} is the outlier set
OLAP Data Cube Technique
• Deviation detection process is overlapped with cube
computation
• Precomputed measures indicating data exceptions
are needed
• A cell value is considered an exception if it is
significantly different from the expected value, based
on a statistical model
• Use visual cues such as background color to reflect
the degree of exception
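The exception test described above can be sketched in a few lines. The slides do not say which statistical model is used, so this sketch assumes a simple additive model on a 2-D slice of the cube (expected value = row mean + column mean - grand mean) and flags a cell when its residual exceeds a multiple of the residual spread. The function name `cell_exceptions` and the `threshold` parameter are hypothetical.

```python
from statistics import mean, pstdev

def cell_exceptions(table, threshold=2.0):
    """Flag cells of a 2-D table that deviate strongly from an additive model.

    Assumption: expected value = row mean + column mean - grand mean.
    Returns a list of (row, column) index pairs.
    """
    rows, cols = len(table), len(table[0])
    grand = mean([v for row in table for v in row])
    row_means = [mean(row) for row in table]
    col_means = [mean([table[r][c] for r in range(rows)]) for c in range(cols)]

    # Residual = observed value minus the value the model expects.
    residuals = [[table[r][c] - (row_means[r] + col_means[c] - grand)
                  for c in range(cols)] for r in range(rows)]
    spread = pstdev([v for row in residuals for v in row])
    if spread == 0:
        return []  # the model fits every cell exactly

    return [(r, c) for r in range(rows) for c in range(cols)
            if abs(residuals[r][c]) > threshold * spread]

sales = [[10, 12, 11, 13],
         [20, 22, 21, 23],
         [30, 32, 31, 33],
         [40, 42, 41, 90]]   # the last cell breaks the row/column pattern
print(cell_exceptions(sales))
```

The ratio `abs(residual) / spread` could also drive the background color mentioned above, with larger ratios mapped to stronger colors.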
Distance-Based Outlier Detection
• Distance-based: An object O in a dataset T is a
DB(p,D) outlier if at least a fraction p of the objects in T
are at distance >= D from O
• A point O in a dataset is an outlier with respect to
parameters k and d if no more than k points in the
dataset are at a distance of d or less from O.
• Relative measurement: Let Dk(O) denote the distance
of the kth nearest neighbor of O. It is a measure of
how much of an outlier point O is.
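Both measures above can be computed by brute force. The sketch below is a minimal illustration, not an efficient implementation: `is_db_outlier` follows the DB(p,D) bullet as written (objects at distance D or more count as far), and `kth_nn_distance` returns Dk(O). The function names and the sample points are hypothetical.

```python
import math

def is_db_outlier(i, points, p, d):
    """DB(p, D) test for points[i]: at least a fraction p of all objects
    lie at distance >= d from it."""
    o = points[i]
    far = sum(1 for x in points if math.dist(o, x) >= d)
    return far >= p * len(points)

def kth_nn_distance(i, points, k):
    """Dk(O): distance from points[i] to its k-th nearest neighbor,
    not counting the point itself."""
    o = points[i]
    others = sorted(math.dist(o, x) for j, x in enumerate(points) if j != i)
    return others[k - 1]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(is_db_outlier(4, points, p=0.75, d=1.0))   # the isolated point
print(is_db_outlier(0, points, p=0.75, d=1.0))   # a point inside the cluster
print(kth_nn_distance(4, points, k=1), kth_nn_distance(0, points, k=1))
```

Ranking objects by `kth_nn_distance` gives a relative ordering of "outlierness" without having to choose D in advance.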
Index-based Algorithm
• Indexing Structures such as R-tree (R+-tree), K-D (K-D-B) tree are built for
the multi-dimensional database
• The index is used to search for neighbors of each object O within radius D
around that object.
• Once more than M (M = N(1-p)) neighbors of object O are found, O is not an outlier.
• Worst-case computational complexity is O(K*n^2), where K is the dimensionality and
n is the number of objects in the dataset.
• Pros: scales well with K
• Cons: constructing the index may take considerable time
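A minimal sketch of the index-based search, assuming a small hand-written k-d tree in place of a production R-tree or K-D-B tree. The search around each object stops as soon as the neighbor limit N(1-p) (called `m` in the code) is exceeded. Two conventions here are assumptions: an object counts toward its own neighborhood, and objects at distance exactly D count as far. All function names are hypothetical.

```python
import math

def build_kdtree(points, depth=0):
    """Build a k-d tree as nested tuples: (point, axis, left, right)."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda q: q[axis])
    mid = len(pts) // 2
    return (pts[mid], axis,
            build_kdtree(pts[:mid], depth + 1),
            build_kdtree(pts[mid + 1:], depth + 1))

def count_within(node, o, d, limit):
    """Count tree points closer than d to o; stop once the count exceeds limit."""
    if node is None:
        return 0
    point, axis, left, right = node
    count = 1 if math.dist(point, o) < d else 0
    diff = o[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    if count <= limit:
        count += count_within(near, o, d, limit - count)
    # The far side can only hold neighbors if the splitting plane is closer than d.
    if count <= limit and abs(diff) < d:
        count += count_within(far, o, d, limit - count)
    return count

def index_based_outliers(points, p, d):
    """Return the DB(p, D) outliers, using the tree to find neighbors."""
    m = int(len(points) * (1 - p))   # most objects an outlier may have within d
    tree = build_kdtree(points)
    return [o for o in points if count_within(tree, o, d, m) <= m]

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(index_based_outliers(points, p=0.75, d=1.0))
```

The early stop is what keeps the per-object search cheap for non-outliers: most objects in dense regions exceed the limit after visiting only a few tree nodes.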
Nested-loop Algorithm
• Divides the buffer space into two halves (first and second
arrays)
• Break data into blocks and then feed two blocks into the
arrays.
• Directly computes the distance between each pair of objects,
inside the array or between arrays
• Decide which objects are outliers.
• Here comes an example:…
• Same computational complexity as the index-based algorithm
• Pros: Avoid index structure construction
• Try to minimize the I/Os
Example – stage 1
(Figure: a two-array buffer next to a database of four blocks A, B, C, D)
• A is the target block on stage 1
• Load A into the first array (1 read)
• Load B into the second array (1 read)
• Load C into the second array (1 read)
• Load D into the second array (1 read)
• End of stage 1: the buffer holds A and D. Total: 4 reads
• Following stages: the buffer ends with C and D (total: 2 reads), then C and B (total: 2 reads), then D and B (total: 2 reads)
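The block schedule in the example above can be simulated to check the read counts. This is a sketch under stated assumptions: each block fills one buffer array, the block left in the second array becomes the next target, and a block that has not yet been a target is always read last. As in the DB(p,D) definition, objects at distance D or more count as far, and an object counts toward its own neighborhood. The function name `nested_loop_outliers` is hypothetical.

```python
import math

def nested_loop_outliers(blocks, p, d):
    """Return (outliers, block_reads) for a dataset pre-split into blocks."""
    n = sum(len(b) for b in blocks)
    m = int(n * (1 - p))        # most objects an outlier may have within d
    done = set()                # blocks that have already been the target
    target, resident = 0, None  # resident: non-target block left in the buffer
    reads = 1                   # loading the first target block
    outliers = []
    while True:
        done.add(target)
        rest = [b for b in range(len(blocks)) if b not in (target, resident)]
        rest.sort(key=lambda b: b not in done)  # untargeted blocks are read last
        in_buffer = [target] if resident is None else [target, resident]
        counts = [0] * len(blocks[target])
        for b in in_buffer + rest:
            if b in rest:
                reads += 1      # block b must be fetched into the second array
            for i, o in enumerate(blocks[target]):
                if counts[i] <= m:  # skip objects already known to be non-outliers
                    counts[i] += sum(1 for x in blocks[b] if math.dist(o, x) < d)
        outliers += [o for o, c in zip(blocks[target], counts) if c <= m]
        if len(done) == len(blocks):
            return outliers, reads
        target, resident = rest[-1], target

blocks = [[(0.0, 0.0), (0.1, 0.0)],   # A
          [(0.0, 0.1), (0.1, 0.1)],   # B
          [(0.2, 0.0), (0.2, 0.1)],   # C
          [(5.0, 5.0), (0.1, 0.2)]]   # D
print(nested_loop_outliers(blocks, p=0.75, d=1.0))
```

With four blocks the simulated schedule needs 4 + 2 + 2 + 2 = 10 block reads, compared with 16 if every target block were compared against freshly loaded copies of all four blocks.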
Cell-based Algorithm
• Define Layer-1 neighbors – all the immediate neighbor cells. The maximum distance between
an object in a cell and an object in one of its layer-1 neighbor cells is D
• Define Layer-2 neighbors – the cells within 3 cells of a given cell (excluding layer-1 neighbors). The minimum distance
between an object in a cell and an object outside its layer-2 neighbors is D
• Criteria
– Search a cell internally. If it contains more than M objects, none of the objects in this cell is an outlier
– Search its layer-1 neighbors. If the cell and its layer-1 neighbors together contain more than M objects, none of the objects in
this cell is an outlier
– Search its layer-2 neighbors. If the cell, its layer-1 neighbor cells, and its layer-2 neighbor cells together contain no more than M objects,
all the objects in this cell are outliers
– Otherwise, the objects in this cell could be outliers; compute the distance between each object in this cell and
the objects in its layer-2 neighbor cells to see whether the total number of objects
within distance D is more than M or not.
• An example
Example
(Figure: grid of cells, with a given cell shown in red)
Note: The maximum distance between a point in the red cell and a point in its layer-1 neighbor cells is D
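The four criteria can be sketched for 2-D data as follows. The cell side length D / (2 * sqrt(2)) is an assumption not stated in the slides; it is chosen so that the layer-1 and layer-2 distance properties hold. `m` plays the role of M, an object counts toward its own neighborhood, and objects at distance exactly D count as far. The function names are hypothetical.

```python
import math
from collections import defaultdict

def cell_based_outliers(points, p, d):
    """Return the DB(p, D) outliers of 2-D points using grid cells."""
    m = int(len(points) * (1 - p))
    side = d / (2 * math.sqrt(2))        # assumed cell side length
    cells = defaultdict(list)
    for pt in points:
        cells[(math.floor(pt[0] / side), math.floor(pt[1] / side))].append(pt)

    def ring(cx, cy, lo, hi):
        """Objects in cells between lo and hi cells away from (cx, cy)."""
        found = []
        for dx in range(-hi, hi + 1):
            for dy in range(-hi, hi + 1):
                if max(abs(dx), abs(dy)) >= lo:
                    found.extend(cells.get((cx + dx, cy + dy), ()))
        return found

    outliers = []
    for (cx, cy), members in cells.items():
        inner = len(members) + len(ring(cx, cy, 1, 1))   # cell + layer 1
        if len(members) > m or inner > m:
            continue                      # criteria 1 and 2: no outliers here
        layer2 = ring(cx, cy, 2, 3)
        if inner + len(layer2) <= m:
            outliers.extend(members)      # criterion 3: all are outliers
            continue
        for o in members:                 # criterion 4: check layer 2 object by object
            near = inner + sum(1 for x in layer2 if math.dist(o, x) < d)
            if near <= m:
                outliers.append(o)
    return outliers

points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
print(cell_based_outliers(points, p=0.75, d=1.0))
```

Only the cells that reach the last criterion need pairwise distance computations, which is where the saving over the nested-loop approach comes from.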
• Partition-based detection
– Use BIRCH clustering to identify clusters/partitions of
non-outliers
– Prune partitions that do not contain outliers
– Use Index/Nested Loop algorithms on the remaining
data points
– Since many data points are removed during pruning,
efficiency increases significantly.