Outlier Detection
Outlier Analysis
Outlier – data objects that are grossly different from or
inconsistent with the remaining set of data
Causes
Measurement / Execution errors
Inherent data variability
Outliers – may be valuable patterns
Fraud detection
Customized marketing
Medical Analysis
Outlier Mining
Given n data points and k, the expected number of
outliers, find the top k objects that are most dissimilar from the rest
Define inconsistent data
Residuals in Regression
Difficulties – Multi-dimensional data, non-numeric data
Mine the outliers
Visualization based methods
Not applicable to cyclic plots, high dimensional data and categorical data
Approaches
Statistical Approach
Distance-based approach
Density based outlier approach
Deviation based approach
Statistical Distribution-based Outlier
detection
Assumes data follows a probability distribution and uses
discordancy test
Discordancy testing
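Working hypothesis – H: o_i ∈ F, i = 1, 2, ..., n
Test verifies whether an object o_i is significantly different from the distribution F
Significance probability SP(v_i) = Prob(T > v_i), where T is the test statistic and v_i is its value for o_i
If SP(v_i) is sufficiently small, o_i is discordant and the working hypothesis is rejected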
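The significance probability can be sketched in a few lines of Python. This is only an illustration under stated assumptions: F is taken to be a normal distribution fitted to the data, the statistic T is the absolute standardized deviation, and the threshold `alpha` and the function names are hypothetical choices, not part of the original material.

```python
from statistics import NormalDist, mean, stdev

def significance_probabilities(values):
    """Return SP(v_i) = Prob(T > v_i) for each value, assuming F is normal.

    Assumption: T = |x - mean| / stdev, so SP is a two-sided tail
    probability under the standard normal distribution.
    """
    mu = mean(values)
    sigma = stdev(values)
    std_normal = NormalDist()
    result = []
    for x in values:
        v = abs(x - mu) / sigma               # value of the statistic for this object
        result.append(2 * (1 - std_normal.cdf(v)))  # Prob(T > v) for T = |Z|
    return result

def discordant(values, alpha=0.05):
    """Objects whose significance probability falls below alpha (hypothetical threshold)."""
    sps = significance_probabilities(values)
    return [x for x, sp in zip(values, sps) if sp < alpha]

data = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 25.0]
print(discordant(data))
```

Because the mean and standard deviation are estimated from data that include the suspect value, a single extreme object can mask itself in small samples; the sketch ignores that effect.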
Procedures for detecting Outliers
Block procedures
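Either all suspect objects are treated as outliers or all are accepted as consistent
Consecutive (sequential) procedures
Inside-out procedure: the object least likely to be an outlier is tested first
If it is an outlier, all more extreme values are also considered outliers; otherwise the next most extreme object is tested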
Distance-based outlier
A DB(p, D)-outlier is an object O in a dataset T such that at least
a fraction p of the objects in T lies at a distance greater than D
from O
Object does not have enough neighbours
Avoids excessive computation of Statistical models
If an object o is an outlier according to a discordancy test, then o is also a
DB(p, D)-outlier for some suitably chosen p and D
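The DB(p, D) definition translates almost directly into code. The following is a minimal sketch that assumes Euclidean distance and objects stored as coordinate tuples; the function names and the sample points are illustrative only.

```python
import math

def is_db_outlier(o, dataset, p, D):
    """True if at least a fraction p of the objects in dataset lie farther than D from o."""
    far = sum(1 for x in dataset if math.dist(o, x) > D)
    return far >= p * len(dataset)

def db_outliers(dataset, p, D):
    """All DB(p, D)-outliers, found by checking every object against every other."""
    return [o for o in dataset if is_db_outlier(o, dataset, p, D)]

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]
print(db_outliers(points, p=0.8, D=1.0))
```

This direct check compares every pair of objects, which is the cost the index-based and cell-based algorithms try to reduce.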
Distance based Outlier Detection
Index based Algorithm
Uses multi-dimensional indexing structures such as k-d trees and R-trees
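M – maximum number of objects allowed within the dmin-neighborhood of an outlier (dmin is the distance threshold, written D in the definition)
Once M+1 neighbours are found, o is not an outlier
Worst-case complexity O(n^2 k) apart from index construction, where n is the number of objects and k the dimensionality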
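The early-termination rule can be illustrated as follows. Note the assumptions: a plain linear scan stands in for the range query that a k-d tree or R-tree would answer, the neighbour count includes the object itself, and the function name is hypothetical.

```python
import math

def index_style_outliers(dataset, dmin, M):
    """Flag objects with at most M objects (itself included) within distance dmin.

    Assumption: a linear scan replaces the index lookup; only the
    stop-after-M+1-neighbours logic is the point of this sketch.
    """
    outliers = []
    for o in dataset:
        neighbours = 0
        for x in dataset:
            if math.dist(o, x) <= dmin:
                neighbours += 1
                if neighbours > M:   # M + 1 neighbours found: o is not an outlier
                    break
        else:
            outliers.append(o)       # scan finished without exceeding M
    return outliers

sample = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]
print(index_style_outliers(sample, dmin=1.0, M=1))
```

With a real spatial index the inner loop would visit only nearby objects, but the worst case still touches every pair.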
Cell based Algorithm
Complexity: O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality
Data space is partitioned into cells with side length dmin / (2√k)
Two layers surround each cell
First layer – one cell thick
Second layer – ⌈2√k − 1⌉ cells thick (rounded up to the nearest integer)
Algorithm processes cells instead of objects
Maintains three counts: cell_count, cell_+_1_layer_count,
cell_+_2_layers_count
An object in a cell can be an outlier only if cell_+_1_layer_count <= M; if the count
exceeds M, no objects in the cell are outliers
If cell_+_2_layers_count <= M, then all objects in the cell are outliers
If it is > M, some objects in the cell may be outliers
These are found by object-by-object processing of the objects in the cell
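The cell-level pruning rules above can be sketched as below. This is a simplified illustration, not the published algorithm: counts are recomputed on demand rather than maintained incrementally, the object-by-object step scans the whole data set instead of only the second-layer cells, neighbour counts include the object itself, and all names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

def cell_based_outliers(dataset, dmin, M):
    k = len(dataset[0])                          # dimensionality
    side = dmin / (2 * math.sqrt(k))             # cell side length
    layer2 = math.ceil(2 * math.sqrt(k) - 1)     # thickness of the second layer

    # Assign every object to its cell.
    cells = defaultdict(list)
    for o in dataset:
        cells[tuple(math.floor(c / side) for c in o)].append(o)

    def count_within(cell, radius):
        """Number of objects in all cells at most `radius` cells away from `cell`."""
        total = 0
        for offset in product(range(-radius, radius + 1), repeat=k):
            key = tuple(c + d for c, d in zip(cell, offset))
            total += len(cells.get(key, ()))
        return total

    outliers = []
    for cell, members in cells.items():
        if count_within(cell, 1) > M:
            continue                              # cell + layer 1 too dense: no outliers here
        if count_within(cell, 1 + layer2) <= M:
            outliers.extend(members)              # cell + both layers sparse: all are outliers
            continue
        for o in members:                         # undecided cell: check each object
            near = sum(1 for x in dataset if math.dist(o, x) <= dmin)
            if near <= M:
                outliers.append(o)
    return outliers

grid_sample = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (5.0, 5.0)]
print(cell_based_outliers(grid_sample, dmin=1.0, M=1))
```

The pruning is safe because any two objects in a cell and its first layer are at most dmin apart, while any object beyond the second layer is more than dmin away from every object in the cell.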
Density based Outlier detection
Local outlier – an object that is outlying relative to its local
neighbourhood, particularly with respect to the density of the
neighbourhood
o2 is a local outlier with respect to cluster C2; o1 is also an outlier; none of the objects
in C1 are treated as outliers
The k-distance of an object p is the distance d(p, o) between p and an object o in D,
i.e. the largest distance from p to any of its k nearest neighbors, such that
there are at least k objects o' in D (other than p) that are as close as or closer to p than o:
d(p, o') <= d(p, o)
there are at most k-1 objects o'' in D (other than p) that are closer to p than o:
d(p, o'') < d(p, o)
k-distance neighborhood of p
contains every object whose distance from p is not greater than the k-distance
of p, with k = MinPts
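A small sketch may help make the two definitions concrete. It assumes Euclidean distance and refers to objects by their index in a list; the function names are illustrative. Note that ties can make the neighborhood larger than k.

```python
import math

def k_distance(i, dataset, k):
    """Distance from object i to its k-th nearest neighbor among the other objects."""
    p = dataset[i]
    dists = sorted(math.dist(p, o) for j, o in enumerate(dataset) if j != i)
    return dists[k - 1]

def k_distance_neighborhood(i, dataset, k):
    """Indices of all other objects no farther from object i than its k-distance."""
    p = dataset[i]
    kd = k_distance(i, dataset, k)
    return [j for j, o in enumerate(dataset) if j != i and math.dist(p, o) <= kd]

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 0.0)]
print(k_distance(0, pts, 1))               # two objects tie at the 1-distance
print(k_distance_neighborhood(0, pts, 1))  # so the neighborhood holds both of them
```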
Reachability distance of p with respect to o is max{k-distance(o), d(p, o)}, with k = MinPts
Local reachability density of p is the inverse of the
average reachability distance of p with respect to its MinPts-nearest
neighbors.
Local outlier factor (LOF) of p captures the degree to
which we call p an outlier.
It is the average of the ratios of the local reachability densities of
p's MinPts-nearest neighbors to that of p.
LOF is close to 1 for objects deep inside a cluster and higher for outliers
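The LOF computation can be sketched end to end in plain Python. This is a minimal quadratic-time illustration assuming Euclidean distance, with the reachability distance of p with respect to o taken as max{k-distance(o), d(p, o)}; the function name is hypothetical, and duplicate points that make an average reachability distance zero are not handled.

```python
import math

def lof_scores(dataset, min_pts):
    n = len(dataset)
    dist = [[math.dist(a, b) for b in dataset] for a in dataset]

    # k-distance and k-distance neighborhood of every object (k = min_pts).
    k_dist, neigh = [], []
    for i in range(n):
        others = sorted((dist[i][j], j) for j in range(n) if j != i)
        kd = others[min_pts - 1][0]
        k_dist.append(kd)
        neigh.append([j for d, j in others if d <= kd])

    def reach_dist(i, j):
        """Reachability distance of object i with respect to object j."""
        return max(k_dist[j], dist[i][j])

    # Local reachability density: inverse of the average reachability distance.
    lrd = []
    for i in range(n):
        avg = sum(reach_dist(i, j) for j in neigh[i]) / len(neigh[i])
        lrd.append(1.0 / avg)

    # LOF: average ratio of the neighbors' densities to the object's own density.
    return [sum(lrd[j] for j in neigh[i]) / (lrd[i] * len(neigh[i]))
            for i in range(n)]

square_plus_far = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(square_plus_far, min_pts=2)
print(scores)
```

In this sample the four corner points have a LOF of about 1, while the distant point scores well above 1 because its neighbors are much denser than it is.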
Deviation based Outlier detection
Identifies outliers by examining the main characteristics
of objects in a group
Objects that “deviate” from this description are
considered outliers
Sequential exception technique
Simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects
Given a data set D, a sequence of subsets {D_1, D_2, ..., D_m} is built
such that D_(j-1) ⊂ D_j, where D_j ⊆ D; dissimilarities are assessed between
subsets in the sequence
Exception Set – Smallest subset of objects whose removal
results in greatest reduction of dissimilarity
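The idea of an exception set can be illustrated with a brute-force sketch. Several things here are assumptions rather than part of the material above: variance is used as the dissimilarity function, the reduction in dissimilarity is scaled by the number of remaining objects (one common way to define a smoothing factor), candidate subsets are enumerated exhaustively up to a hypothetical `max_size` instead of being built as a sequence, and the function name is illustrative.

```python
from itertools import combinations
from statistics import pvariance

def exception_set(values, max_size=2):
    """Smallest subset whose removal gives the largest scaled drop in variance.

    Assumptions: dissimilarity = population variance; the drop is scaled by
    the number of objects left; max_size must be smaller than len(values).
    """
    n = len(values)
    base = pvariance(values)
    best_score, best = 0.0, ()
    for size in range(1, max_size + 1):          # smaller subsets are tried first
        for idx in combinations(range(n), size):
            rest = [v for i, v in enumerate(values) if i not in idx]
            score = len(rest) * (base - pvariance(rest))
            if score > best_score:               # strict, so ties keep the smaller subset
                best_score, best = score, idx
    return [values[i] for i in best]

readings = [1, 2, 2, 3, 2, 1, 50]
print(exception_set(readings))
```

Scaling by the size of the remaining set is what stops the search from simply removing as many objects as possible; exhaustive enumeration is only practical for very small inputs.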