Missing and Outlier
What Are Outliers?
Outlier: A data object that deviates significantly from the normal
objects as if it were generated by a different mechanism
Ex.: Unusual credit card purchase; in sports: Michael Jordan, Wayne
Gretzky, ...
Outliers are different from noise
Noise is random error or variance in a measured variable
Customer segmentation
Medical analysis
Types of Outliers (I)
Three kinds: global, contextual and collective outliers
Global outlier (or point anomaly)
groups of objects
Need to have the background knowledge on the relationship
The border between normal and outlier objects is often a gray area
between normal objects and outliers. It may help hide outliers and
reduce the effectiveness of outlier detection
Understandability
Understand why these are outliers: Justification of the detection
Model normal objects & report those not matching the model as
outliers, or
Model outliers and treat those not matching the model as normal
Challenges
Imbalanced classes, i.e., outliers are rare: Boost the outlier class and
Problem 2: Costly, since clustering must be done first, even though there
are far fewer outliers than normal objects
Newer methods: tackle outliers directly
Outlier Detection III: Semi-Supervised Methods
Situation: In many applications, the amount of labeled data is often
small: Labels could be on outliers only, normal objects only, or both
Semi-supervised outlier detection: Regarded as applications of semi-
supervised learning
If some labeled normal objects are available
Use the labeled examples and the proximate unlabeled objects to
train a model for normal objects
Those not fitting the model of normal objects are detected as outliers
If only some labeled outliers are available, a small number of labeled
outliers may not cover the possible outliers well
To improve the quality of outlier detection, one can get help from
models for normal objects learned from unsupervised methods
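
The first case above (some labeled normal objects are available) can be sketched roughly as follows. This is a minimal illustration, not a reference method: the function name `semi_supervised_outliers`, the one-dimensional data, the neighborhood radius `eps`, the Gaussian model, and the `threshold` of 3 standard deviations are all assumptions chosen for the example.

```python
from statistics import mean, pstdev

def semi_supervised_outliers(labeled_normal, unlabeled, eps=0.5, threshold=3.0):
    """Flag unlabeled values that do not fit a model built from normal objects.

    Hypothetical helper: eps, threshold, and the Gaussian model are assumptions.
    """
    # Step 1: treat unlabeled objects close to a labeled normal object as normal.
    normal_set = list(labeled_normal)
    for x in unlabeled:
        if any(abs(x - n) <= eps for n in labeled_normal):
            normal_set.append(x)

    # Step 2: fit a simple model (mean and standard deviation) to the normal set.
    mu = mean(normal_set)
    sigma = pstdev(normal_set)
    if sigma == 0:
        # Degenerate model: anything different from the single normal value is flagged.
        return [x for x in unlabeled if x != mu]

    # Step 3: objects that do not fit the model of normal objects are outliers.
    return [x for x in unlabeled if abs(x - mu) / sigma > threshold]

labeled_normal = [10.0, 10.2, 9.9]
unlabeled = [10.1, 9.7, 10.4, 17.5]
print(semi_supervised_outliers(labeled_normal, unlabeled))  # [17.5]
```

Only the labeled normal objects seed the neighborhood step, so a distant unlabeled value such as 17.5 never enters the model it is later scored against.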
Outlier Detection (1): Statistical Methods
Statistical methods (also known as model-based methods) assume that the normal
data follow some statistical model (a stochastic model)
The data not following the model are outliers.
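
As a rough illustration of the statistical approach, the sketch below assumes the normal data follow a single univariate Gaussian and flags values that lie more than a chosen number of standard deviations from the mean. The function name `gaussian_outliers`, the sample readings, and the cutoff of 3 are assumptions made for the example, not part of the original material.

```python
from statistics import mean, pstdev

def gaussian_outliers(values, threshold=3.0):
    """Return values whose z-score under a fitted Gaussian exceeds threshold.

    Hypothetical helper: the Gaussian assumption and threshold are illustrative.
    """
    mu = mean(values)        # estimated mean of the assumed Gaussian
    sigma = pstdev(values)   # estimated (population) standard deviation
    if sigma == 0:
        return []            # all values identical: nothing deviates
    return [x for x in values if abs(x - mu) / sigma > threshold]

# Thirty readings between 9.8 and 10.2, plus one value far from the rest.
readings = [10.0 + 0.1 * ((i % 5) - 2) for i in range(30)] + [25.0]
print(gaussian_outliers(readings))  # [25.0]
```

Note that the outlier itself inflates the estimated mean and standard deviation, so with very small samples a fixed cutoff of 3 may never be reached; robust estimates are one common way around this.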
Outlier Detection (2): Proximity-Based Methods
An object is an outlier if the nearest neighbors of the object are far away, i.e., the
proximity of the object deviates significantly from the proximity of most of the other
objects in the same data set
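
One simple way to make this concrete is to score each object by the distance to its k-th nearest neighbor and flag objects whose score exceeds a radius. The sketch below is only an illustration under assumptions: the names `knn_distance` and `proximity_outliers`, Euclidean distance, `k=2`, and `radius=1.0` are choices made for the example, and the data set must contain more than `k` points.

```python
import math

def knn_distance(points, index, k):
    """Distance from points[index] to its k-th nearest other point (Euclidean)."""
    dists = sorted(
        math.dist(points[index], p)
        for j, p in enumerate(points)
        if j != index
    )
    return dists[k - 1]

def proximity_outliers(points, k=2, radius=1.0):
    """Flag points whose k-th nearest neighbor is farther away than radius.

    Hypothetical helper: k and radius are illustrative, assumes len(points) > k.
    """
    return [
        p for i, p in enumerate(points)
        if knn_distance(points, i, k) > radius
    ]

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (5.0, 5.0)]
print(proximity_outliers(points))  # [(5.0, 5.0)]
```

This brute-force version compares every pair of points, so it is quadratic in the number of objects; it is meant to show the idea rather than to scale.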