0% found this document useful (0 votes)
35 views17 pages

Outlier Detection

data mining

Uploaded by

prathap badam
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
35 views17 pages

Outlier Detection

data mining

Uploaded by

prathap badam
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 17

Outlier Analysis

1
Outlier Analysis
 Outlier – data objects that are grossly different from or
inconsistent with the remaining set of data
 Causes
 Measurement / Execution errors
 Inherent data variability
 Outliers – maybe valuable patterns
 Fraud detection
 Customized marketing
 Medical Analysis

2
Outlier Mining
 Given n data points and k – expected number of
outliers find the top k dissimilar objects
 Define inconsistent data
 Residuals in Regression
 Difficulties – Multi-dimensional data, non-numeric data
 Mine the outliers
 Visualization based methods
 Not applicable to cyclic plots, high dimensional data and categorical data
 Approaches
 Statistical Approach
 Distance-based approach
 Density based outlier approach
 Deviation based approach

3
Statistical Distribution-based Outlier
detection
 Assumes data follows a probability distribution and uses
discordancy test
 Discordancy testing
 Working hypothesis – H: oi ∈ F i=1,2,..n
 Test verifies whether an object oi is significantly different from F
 Significance probability SP(vi) = Prob(T>vi)

 IF SP is small oi is discordant and working hypothesis is rejected

and alternate hypothesis that oi comes from another distribution


model G is adopted
4
Statistical Distribution-based Outlier
detection
 Alternative distributions
 Inherent alternative distribution
 Alternative hypothesis: All objects arise from another distribution G

 Mixture alternative distribution


 Discordant values are not outliers but contaminants from G H’: oi ∈ (1-
λ) F + λG i=1,2,..n

 Slippage alternative distribution


 Some Objects are independent observations from a modified version
of F (different parameters)

5
Statistical Distribution-based Outlier
detection
 Procedures for detecting Outliers
 Block procedures
 All are outliers or all are consistent

 Consecutive Procedures
 Inside-out procedure: Least likely object is tested first
 If it is an outlier – more extreme values are also considered as outliers

 Disadvantages of Statistical Approach


 Tests are for single attributes
 Data distribution may not be known
6
Distance based Outlier Detection

 Distance-based outlier
 A DB(p, D)-outlier is an object O in a dataset T such that at least
a fraction p of the objects in T lies at a distance greater than D
from O
 Object does not have enough neighbours
 Avoids excessive computation of Statistical models
 If an object is an outlier according to a discordancy test then o is
DB(p, D) outlier for some p and D

7
Distance based Outlier Detection
 Index based Algorithm
 Uses multi-dimensional indexing structures such as k-d trees and R-trees
 M – maximum number of objects within dmin neighborhood
 Once M+1 neighbours are found o is not an outlier
 O(n2k) apart from index construction

 Nested loop algorithm


 Avoids index construction
 Tries to minimize I/Os
 Divides memory buffer space into two halves and data set into several logical
blocks

8
Distance based Outlier Detection
 Cell based Algorithm
 Complexity : O(ck +n) c- depends on number of cells ; k – dimensionality
 Data space is partitioned into cells: dmin / 2√k
 Two layers surround each cell
 First layer – One cell thick
 Second layer -  2√k-1  cells thick
 Algorithm processes cells instead of objects
 Maintains three counts: cell_count, cell_+_1_layer_count,
cell_+_2_layers_count
 An object in a cell is an outlier if cell_+_1_layer_count <= M, if not, no
objects in the cell are outliers
 If cell_+_2_layers_count, <= M then all objects in cell – Outliers
 If > M some may be outliers
 Object by object processing has to be done

9
Density based Outlier detection

 Previous methods assume data are uniformly


distributed
 Data may have different density distributions
 Difficulty in choosing dmin

10
Density based Outlier detection
 Local Outlier – if its outlying relative to its local
neighbourhood particularly wrt the density of the
neighborhood
 O2 is a local outlier wrt C2; o1 is also an outlier; none of the objects
in C1 are treated as outliers

 Considers degree to which an object is an outlier


 Local Outlier factor – degree depends on how isolated the object is
wrt its surroundings

11
Density based Outlier detection
 The k-distance of an object p is the maximal distance that p gets
from its k-nearest neighbors d(p, o)
 there are at least k objects in D that are as close as or closer to p than o;
for k o’ d(p, o’) <= d(p, o)
 there are at most k-1 objects that are closer to p than o; for k-1 o” d(p,
o”) < d(p, o)

 k-distance neighborhood
 contains every object whose distance is not greater than the MinPts (k)-
distance of p

 The reachability distance of an object p with respect to object o, is


defined as reach_distMinPts(p, o) = max { MinPts-distance(o), d(p, o) }
12
OPTICS

 Complexity : O(n log n)

13
Density based Outlier detection
 Local reachability density of p is the inverse of the
average reachability density based on the MinPts-
nearest neighbors of p.
 Local outlier factor (LOF) of p captures the degree to
which we call p an outlier.
 It is the average of the ratio of the local reachability density of p
and those of p’s MinPts-nearest neighbors.
 LOF is higher for outliers

14
Deviation based Outlier detection
 Identifies outliers by examining the main characteristics
of objects in a group
 Objects that “deviate” from this description are
considered outliers
 Sequential exception technique
 Simulates the way in which humans can distinguish unusual
objects from among a series of supposedly like objects

15
Deviation based Outlier detection
 Sequential exception technique
 Given a data set D a sequence of subsets {D1, D2, ..Dm} is built
such that Dj-1 ⊆ Dj; Dissimilarities are assessed between
subsets in the sequence
 Exception Set – Smallest subset of objects whose removal
results in greatest reduction of dissimilarity

 Dissimilarity function – 1/n ∑i=1 n (xi-x’)2

 Smoothing factor: Assesses how much the dissimilarity can be


reduced by removing the subset from the original set of objects
 Can be repeated to avoid the influence of order

16
Deviation based Outlier detection

 OLAP Data Cube technique


 Uses data cubes to identify regions of anomalies
 A cell value in a cube is an exception if it differs
significantly from an expected value
 Visualization effects guide user
 May drill down

17

You might also like