WHAT ARE OUTLIERS?
A database may contain data objects that do not comply with the general behavior or model of the data. These data objects are outliers. Outliers can be caused by measurement or execution error. the outliers may be of particular interest
Applications:
Fraud detection
Medicine Public health
Sports statistics Detecting measurement errors
OUTLIER DETECTION METHODS
Statistical Distribution-Based Outlier Detection Distance-Based Outlier Detection Density-Based Local Outlier Detection Deviation-Based Outlier Detection
Statistical Distribution-Based Outlier Detection
assumes a distribution for the given data set identifies outliers with respect to the model using a discordancy test requires knowledge of the data set parameters knowledge of distribution parameters expected number of outliers.
How does the discordancy testing work?
This test examines two hypotheses: working hypothesis alternative hypothesis
A working hypothesis, H, is a statement that the entire data set of n objects comes from an initial distribution model, F, that is, H : oi E F, where i = 1, 2, , n. Verifies whether oi is <> in relation to F Assume T is some statistic used as discordancy test Assume value of the statistic for object oi is vi Then distribution T is constructed SP(vi)=Prob(T > vi), is evaluated If SP(vi) is small H is rejected
An alternative hypothesis, H, which states that oi comes from another distribution model, G, is adopted. The result is very much dependent on which model F is chosen because oi may be an outlier under one model and a perfectly valid value under another.
kinds of alternative distributions. Inherent alternative distribution H : oi E G, where i = 1, 2, : : : , n Mixture alternative distribution G. H : oi E (1-mu)F +muG, where i = 1, 2, : : : , n. Slippage alternative distribution
Distance-Based Outlier Detection
An object, o, in a data set, D, is a distancebased (DB) outlier with parameters pct and dmin,11 that is, a DB(pct;dmin)-outlier, if at least a fraction, pct, of the objects in D lie at a distance greater than dmin from o.
algorithms for mining distance-based outliers
Index-based algorithm Nested-loop algorithm Cell-based algorithm
Density-Based Local Outlier Detection
Distance-based outlier detection is based on global distance distribution It encounters difficulties to identify outliers if data is not uniformly distributed.
Deviation-Based Outlier Detection
it identifies outliers by examining the main characteristics of objects in a group
two techniques for deviation-based outlier detection Sequential Exception Technique OLAP Data Cube Technique
Sequential Exception Technique
OLAP Data Cube Technique