741 Outlier Detection
Jian Pei
Simon Fraser University
’Oumuamua
Detection Challenges
• Interpretability
  • Understand why these are outliers: justification of the detection
  • Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism
Outlier Detection Methods
• Supervised, semi-supervised, and unsupervised methods, depending on whether labeled data can be obtained
• Assumptions about normal data and outliers
• Idea: since the normal data samples often share certain similarities, they can often be represented in a more succinct way compared with their original representation
• Samples which cannot be well reconstructed by such an alternative, succinct representation are regarded as outliers
• Two types of reconstruction-based outlier detection methods
• Matrix-factorization based methods for numeric data
• Pattern-based compression methods for categorical data
• The k-distance of p is the distance between p and its k-th nearest neighbor
• In a set D of points, for any positive integer k, the k-distance of object p, denoted as k-distance(p), is the distance d(p, o) between p and an object o such that
  • For at least k objects o′ ∈ D \ {p}, d(p, o′) ≤ d(p, o)
  • For at most (k−1) objects o′ ∈ D \ {p}, d(p, o′) < d(p, o)
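In code, k-distance(p) is simply the k-th smallest distance from p to the other objects. A minimal NumPy sketch (the function name and sample points are illustrative, not from the slides):

```python
import numpy as np

def k_distance(p, D, k):
    # D is assumed to already exclude p itself (i.e., it plays the role of D \ {p})
    dists = np.sort(np.linalg.norm(D - p, axis=1))
    # the k-th smallest distance satisfies both conditions in the definition
    return dists[k - 1]

# toy data: the neighbors of p lie at distances 1, 2, and 5
p = np.array([0.0, 0.0])
D = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 4.0]])
print(k_distance(p, D, 2))  # → 2.0
```

Ties are handled automatically: if several objects share the k-th smallest distance, the sorted array still returns that common value.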
• Find a succinct representation
• Use the succinct representation to reconstruct the original data samples
• Measure the quality (i.e., goodness) of reconstruction
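For numeric data, these three steps can be sketched with a rank-1 SVD as the matrix factorization; the data and the single planted outlier below are fabricated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
# normal samples lie near the line y = 2x in 2-D (a 1-D structure)
t = rng.normal(size=50)
X = np.column_stack([t, 2 * t]) + rng.normal(scale=0.05, size=(50, 2))
X = np.vstack([X, [[-2.0, 4.0]]])  # planted outlier: violates the linear structure

# step 1: succinct representation = projection onto the top singular vector
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
codes = Xc @ Vt[0]                      # one number per sample

# step 2: reconstruct each sample from its code
recon = np.outer(codes, Vt[0])

# step 3: reconstruction error measures the goodness of reconstruction
errors = np.linalg.norm(Xc - recon, axis=1)
print(int(np.argmax(errors)))           # index of the worst-reconstructed sample
```

The normal samples are reconstructed almost perfectly from a single coordinate, while the planted outlier incurs a large residual, which is exactly the signal a reconstruction-based detector scores on.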
• Train a classification model that can distinguish “normal” data from outliers
• A brute-force approach: Consider a training set that contains some samples
labeled as “normal” and others labeled as “outlier”
• A training set in practice is typically heavily biased: the number of “normal”
samples likely far exceeds that of outlier samples
• Cannot detect unseen anomalies
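The brute-force approach and its bias problem can be sketched with a small logistic-regression classifier trained by gradient descent; the data are synthetic, and the per-sample reweighting shown is one common heuristic for countering label imbalance, not the only one:

```python
import numpy as np

rng = np.random.default_rng(1)
# heavily biased training set: 500 "normal" samples vs. only 10 "outlier" samples
X = np.vstack([rng.normal(0.0, 1.0, size=(500, 2)),
               rng.normal(4.0, 0.5, size=(10, 2))])
y = np.array([0.0] * 500 + [1.0] * 10)   # 1 = outlier

# per-sample weights counter the label bias ("balanced" reweighting heuristic)
w_sample = np.where(y == 1, len(y) / (2 * 10), len(y) / (2 * 500))

# plain logistic regression trained by gradient descent (illustrative sketch)
Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
theta = np.zeros(3)
for _ in range(2000):
    p = 1 / (1 + np.exp(-Xb @ theta))
    grad = Xb.T @ (w_sample * (p - y)) / len(y)
    theta -= 0.1 * grad

def predict(points):
    Pb = np.hstack([points, np.ones((len(points), 1))])
    return (1 / (1 + np.exp(-Pb @ theta)) > 0.5).astype(int)

print(predict(np.array([[4.2, 3.8], [0.1, -0.2]])))  # outlier-like vs. normal-like
```

Note the limitation from the slide: this classifier only separates the outlier pattern it was shown; an anomaly from a region unlike the 10 training outliers may still fall on the "normal" side of the boundary.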
• Interpretability of outliers
• Identifying which subspaces manifest the outliers, or assessing the "outlying-ness" of the objects
• Data sparsity: data in high-dimensional spaces are often sparse
• The distance between objects becomes heavily dominated by noise as the dimensionality increases
• Data subspaces
• Local behavior and patterns of data
• Scalability with respect to dimensionality
• The number of possible subspaces increases exponentially with dimensionality
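The noise-domination effect can be observed directly: as dimensionality grows, the relative contrast between a point's farthest and nearest neighbor shrinks, so distances carry less and less discriminating information. A toy experiment with uniform random points (the point counts and dimensions are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
contrast = {}
for d in (2, 100, 10000):
    X = rng.uniform(size=(200, d))
    # distances from one query point to all the others
    dists = np.linalg.norm(X[0] - X[1:], axis=1)
    # relative contrast: how much farther the farthest neighbor is than the nearest
    contrast[d] = (dists.max() - dists.min()) / dists.min()
    print(d, round(contrast[d], 3))
```

In low dimensions the farthest neighbor is many times farther than the nearest; in very high dimensions the two are nearly indistinguishable, which is why distance-based outlier scores degrade and subspace methods become attractive.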