
741 Outlier Detection

The document discusses outlier detection in data mining, highlighting its significance in various applications such as fraud detection and healthcare. It categorizes outliers into global, contextual, and collective types, and outlines the challenges and methods for detecting them, including supervised, unsupervised, and statistical approaches. The document also covers specific techniques like boxplots, Grubbs' test, and kernel density estimation for identifying outliers in datasets.

Outlier Detection

Jian Pei
Simon Fraser University
’Oumuamua

• “In 2017, an astronomical event occurred that was unlike any other: for the first time, we observed an object that we are certain originated from beyond our Solar System.”

Jian Pei: Data Mining -- Outlier Detection


Fraud detection

• Credit card transaction fraud detection
  • Unusual amounts
  • Unusual locations/time
• Fraud detection in stock markets
  • Example: unusual chain transactions by a small group of connected users
• Fraud detection in healthcare insurance
  • A child and the parents frequently see the same medical doctor at the same time
  • A patient sees a medical doctor for flu every week



Outlier Analysis

• “One person’s noise is another person’s signal”
• Outliers: the objects considerably dissimilar from the remainder of the data



Outliers and Noise

• Outliers are different from noise
  • Noise is random error or variance in a measured variable
  • Outliers are interesting: an outlier violates the mechanism that generates the normal data
• Outlier detection vs. novelty detection
  • At an early stage, a novel pattern may be regarded as an outlier
  • But later it is merged into the model



Types of Outliers

• Three kinds: global, contextual and collective outliers


• A data set may have multiple types of outlier
• One object may belong to more than one type of outlier
• Global outlier (or point anomaly)
• An outlier object significantly deviates from the rest of the data set
• challenge: find an appropriate measurement of deviation



Contextual Outliers
• An outlier object deviates significantly based on a selected context
  • Ex. Is 10°C in Vancouver an outlier? (It depends on whether it is summer or winter.)
• Attributes of data objects should be divided into two groups
  • Contextual attributes: define the context, e.g., time & location
  • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• A generalization of local outliers, whose density significantly deviates from that of their local area
• Challenge: how to define or formulate a meaningful context?



Collective Outliers
• A subset of data objects collectively deviate significantly from the
whole data set, even if the individual data objects may not be
outliers
• Application example: intrusion detection when a number of computers keep sending denial-of-service packets to each other
• Detection of collective outliers
• Consider not only behavior of individual objects, but also that
of groups of objects
• Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure on
objects



Outlier Detection: Challenges

• Modeling normal objects and outliers properly
  • Hard to enumerate all possible normal behaviors in an application
  • The border between normal and outlier objects is often a gray area
• Application-specific outlier detection
  • The choice of distance measure among objects and the model of relationship among objects are often application-dependent
  • Example: in clinical data, a small deviation could be an outlier, while in marketing analysis larger fluctuations are expected



Outlier Detection: Challenges

• Handling noise in outlier detection
  • Noise may distort the normal objects and blur the distinction between normal objects and outliers
  • Noise may help hide outliers and reduce the effectiveness of outlier detection
• Interpretability
  • Understand why these are outliers: justification of the detection
  • Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism



Outlier Detection Methods

• Whether user-labeled examples of outliers can be obtained
  • Supervised, semi-supervised, and unsupervised methods
• Assumptions about normal data and outliers
  • Statistical, proximity-based, and clustering-based methods



Supervised Methods

• Modeling outlier detection as a classification problem


• Samples examined by domain experts used for training & testing
• Methods for learning a classifier for outlier detection effectively:
  • Model normal objects & report those not matching the model as outliers, or
  • Model outliers and treat those not matching the model as normal
• Challenges
  • Imbalanced classes, i.e., outliers are rare: boost the outlier class and make up some artificial outliers
  • Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers)
Unsupervised Methods
• Assume the normal objects are somewhat “clustered” into multiple groups, each having some distinct features
• An outlier is expected to be far away from any group of normal objects
• Weakness: cannot detect collective outliers effectively
  • Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area
• Many clustering methods can be adapted for unsupervised methods
  • Find clusters, then outliers: objects not belonging to any cluster



Unsupervised Methods: Challenges
Semi-Supervised Methods

• In many applications, the amount of labeled data is small
• Labels could be on outliers only, normal objects only, or both
• If some labeled normal objects are available
  • Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
  • Those not fitting the model of normal objects are detected as outliers
• If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well
  • To improve the quality of outlier detection, one can get help from models for normal objects learned from unsupervised methods



Statistical/Model-based
Methods
• Make assumptions of data normality: normal data objects
are generated by a statistical (stochastic) model, and that
data not following the model are outliers
• Effectiveness of statistical methods highly depends on
whether the assumption of statistical model holds in the
real data
• There are many kinds of statistical models
• Parametric vs. non-parametric



Proximity-based Methods
• An object is an outlier if the nearest neighbors of the object
are far away, i.e., the proximity of the object significantly
deviates from the proximity of most of the other objects in the
same data set
• The effectiveness of proximity-based methods highly relies on
the proximity measure
• In some applications, proximity or distance measures cannot
be obtained easily
• Often have difficulty identifying a group of outliers that stay close to each other
• Two major types of proximity-based outlier detection methods
• Distance-based vs. density-based



Reconstruction-based Methods

• Idea: since the normal data samples often share certain similarities, they can often be represented in a more succinct way, compared with their original representation
• Samples that cannot be well reconstructed by such an alternative, succinct representation are regarded as outliers
• Two types of reconstruction-based outlier detection methods
• Matrix-factorization based methods for numeric data
• Pattern-based compression methods for categorical data



Statistical Approaches
• Learn a generative model fitting the given data set, and then identify those objects in
low-probability regions of the model as outliers
• Two categories
• A parametric method assumes that the normal data objects are generated by a
parametric distribution with a finite number of parameters Θ
• A nonparametric method tries to determine the model from the input data
• A nonparametric method is not completely parameter-free



Detection of Univariate Outliers Based on
Normal Distribution
• Univariate outlier detection using maximum likelihood method
• Example
  • A city’s average temperature values in July in the last 10 years are, in value-ascending order, 24.0°C, 28.9°C, 28.9°C, 29.0°C, 29.1°C, 29.1°C, 29.2°C, 29.2°C, 29.3°C, and 29.4°C
• Model assumption: the average temperature follows a normal distribution $N(\mu, \sigma^2)$
• Task: estimate the parameters $\mu$ and $\sigma$, that is, maximize the log-likelihood function



Maximum Likelihood Estimates

• Results



Outlier Detection
• The most deviating value, 24.0°C, is 4.61°C away from the estimated mean
• Model assumption: the μ ± 3σ region contains 99.7% of the data under the assumption of normal distribution
• Because 4.61 / 1.51 = 3.04 > 3, the probability that the value 24.0°C is generated by the normal distribution is less than 0.15%, so it is an outlier
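
The estimation and the three-sigma check can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `mle_normal` is made up here, and σ ≈ 1.51 is taken from the ratio on the slide. Recomputing σ from the ten listed values with the divide-by-n estimator gives a value closer to 1.54, so the verdict for 24.0°C is borderline and sensitive to which estimate is used.

```python
import math

# July average temperatures (°C) from the example.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

def mle_normal(values):
    """Return the maximum likelihood estimates (mu, sigma) of a normal distribution."""
    n = len(values)
    mu = sum(values) / n
    # The MLE of the variance divides by n, not n - 1.
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, math.sqrt(var)

mu, sigma = mle_normal(temps)

# Deviation of the smallest value, measured in standard deviations,
# once with the sigma recomputed here and once with the slide's 1.51.
z_recomputed = abs(24.0 - mu) / sigma
z_slide = abs(24.0 - mu) / 1.51

print(f"mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"z (recomputed sigma) = {z_recomputed:.2f}, z (sigma = 1.51) = {z_slide:.2f}")
```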



Boxplot
• A five-number summary: the smallest nonoutlier value (Min), the lower quartile (Q1), the median (Q2), the upper quartile (Q3), and the largest nonoutlier value (Max)
• The interquartile range (IQR) is defined as Q3 − Q1
• Any object that is more than 1.5 × IQR smaller than Q1 or 1.5 × IQR larger than Q3 is treated as an outlier, because for normally distributed data the region between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR contains 99.3% of the objects
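
As a rough sketch of the 1.5 × IQR rule, the following uses Python's `statistics.quantiles` to obtain the quartiles. The helper name `iqr_outliers` and the choice of the "inclusive" quartile method are assumptions made for this illustration; other quartile conventions can shift the fences slightly.

```python
import statistics

def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]

# The July temperature values from the earlier univariate example.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
flagged = iqr_outliers(temps)
print(flagged)
```

On this data the rule flags only 24.0, which lies far below the lower fence.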



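Grubbs’ Test (Maximum Normed Residual Test)
• z-score: $z = \frac{|x - \bar{x}|}{s}$, where $\bar{x}$ is the mean and $s$ is the standard deviation of the input data
• An object is an outlier if
  $z \ge \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}$
• where $t_{\alpha/(2n),\,n-2}$ is the value taken by a t-distribution with $n-2$ degrees of freedom at a significance level of $\alpha/(2n)$, and $n$ is the number of objects in the data set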



Detection of Multivariate Outliers Using the Mahalanobis Distance

• Let $\bar{o}$ be the sample mean vector
• For an object $o$, the squared Mahalanobis distance from $o$ to $\bar{o}$ is $\mathrm{Mdist}(o, \bar{o}) = (o - \bar{o})^T S^{-1} (o - \bar{o})$, where $S$ is the sample covariance matrix
• Detect outliers using the univariate variable $\mathrm{Mdist}(o, \bar{o})$
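
To make the definition concrete, here is a minimal pure-Python sketch restricted to two dimensions, so that the inverse of S can be written out by hand. The function name `mahalanobis_sq_2d` and the sample points are hypothetical; a real application would typically rely on a linear algebra library.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix S (dividing by n - 1).
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of the 2x2 matrix [[sxx, sxy], [sxy, syy]].
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    dists = []
    for x, y in points:
        dx, dy = x - mx, y - my
        dists.append(dx * (ixx * dx + ixy * dy) + dy * (ixy * dx + iyy * dy))
    return dists

# Five points close to the line y = x, plus one point that breaks the trend.
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.9), (5, 5.1), (3, 0.5)]
dists = mahalanobis_sq_2d(points)
most_outlying = dists.index(max(dists))
print(most_outlying, [round(d, 2) for d in dists])
```

The off-trend point (3, 0.5) receives the largest distance even though its coordinates are unremarkable one dimension at a time, which is the reason for using the covariance matrix rather than per-attribute scores.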



Detection of Multivariate Outliers Using the χ²-statistic

• For an object $o$, the χ²-statistic is $\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$
  • $o_i$ is the value of $o$ on the i-th dimension
  • $E_i$ is the mean of the i-th dimension among all objects
  • $n$ is the dimensionality
• The larger the χ²-statistic, the more outlying the object
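
The statistic is straightforward to compute directly. The sketch below assumes every dimension has a nonzero mean (otherwise the division is undefined); the function name and the sample objects are invented for illustration.

```python
def chi_square_scores(objects):
    """Chi-square statistic of each object against the per-dimension means."""
    n_dims = len(objects[0])
    # E_i: mean of the i-th dimension among all objects.
    means = [sum(o[i] for o in objects) / len(objects) for i in range(n_dims)]
    return [
        sum((o[i] - means[i]) ** 2 / means[i] for i in range(n_dims))
        for o in objects
    ]

objects = [(10, 20), (11, 19), (9, 21), (10, 20), (30, 5)]
scores = chi_square_scores(objects)
print([round(s, 2) for s in scores])
```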



Detection Using a Mixture of Parametric
Distributions
• Model assumption: the normal data objects are generated by
multiple normal distributions

• Use the EM (expectation-maximization) algorithm to learn the parameters



Non-parametric Method

• Does not assume an a priori statistical model; instead, determines the model from the input data
• Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
• Examples: histogram and kernel density estimation



Histogram
• A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
• Hard to choose an appropriate bin size for histogram
• Too small bin size → normal objects in empty/rare
bins, false positive
• Too big bin size → outliers in some frequent bins,
false negative
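
The bin-size trade-off can be seen with a small sketch that scores each value by the fraction of the data sharing its bin (a low score means more outlying). The transaction amounts and the function name `histogram_scores` are made up for this illustration and are not taken from the slide's histogram.

```python
def histogram_scores(values, bin_width):
    """Score each value by the fraction of the data falling in the same bin."""
    counts = {}
    for v in values:
        b = int(v // bin_width)
        counts[b] = counts.get(b, 0) + 1
    n = len(values)
    return [counts[int(v // bin_width)] / n for v in values]

# Hypothetical transaction amounts; the last one is the suspicious transaction.
amounts = [120, 340, 560, 80, 910, 450, 230, 670, 150, 7500]

wide = histogram_scores(amounts, 1000)   # nine amounts share one bin
narrow = histogram_scores(amounts, 100)  # almost every amount sits alone in its bin
print(wide)
print(narrow)
```

With a bin width of 1000 the large amount stands out clearly; with a width of 100 most normal amounts also fall into rare bins and become indistinguishable from it, which is the false-positive problem described above.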



Kernel Density Estimation
• Treat an observed object as an indicator of high probability
density in the surrounding region
• The probability density at a point depends on the distances
from this point to the observed objects
• Use a kernel function to model the influence of a sample
point within its neighborhood
• A kernel K() is a non-negative real-valued integrable function that satisfies two conditions
  • $\int_{-\infty}^{\infty} K(x)\,dx = 1$
  • $K(-x) = K(x)$ for any $x$



Kernel Density Estimation
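• Using a kernel function $K$, estimate the density at $x$ by $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$, where $X_1, \ldots, X_n$ are the observed objects and $h$ is the bandwidth
• A frequently used kernel is a standard Gaussian function with mean 0 and variance 1, $K(u) = \frac{1}{\sqrt{2\pi}} e^{-u^2/2}$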
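
A direct transcription of the estimator with the standard Gaussian kernel might look as follows. The bandwidth h = 0.5 is an arbitrary choice for this sketch, and the data are the July temperatures from the earlier univariate example.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel: mean 0, variance 1."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h):
    """Kernel density estimate at x with bandwidth h."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
h = 0.5  # assumed bandwidth, chosen only for illustration

density_low = kde(24.0, temps, h)
density_typical = kde(29.1, temps, h)
print(f"density at 24.0: {density_low:.4f}, density at 29.1: {density_typical:.4f}")
```

Objects in low-density regions, such as 24.0 here, are the candidates for outliers; how low is low enough remains an application-dependent threshold.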



Proximity-based Outlier Detection
• Objects far away from the others are outliers
• The proximity of an outlier deviates significantly from that of most of the
others in the data set
• Distance-based outlier detection: An object o is an outlier if its neighborhood
does not have enough other points
• Density-based outlier detection: An object o is an outlier if its density is
relatively much lower than that of its neighbors



Distance-based Outliers

• A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
  • The larger D, the more outlying
  • The larger p, the more outlying



Drawback of Distance-based Methods

• Both o1 and o2 are outliers
• Distance-based methods can detect o1, but not o2



Density-based Methods: Intuition

• Compare objects to their local neighborhoods, instead of to the global data distribution
• The density around an outlier object is significantly different from the density around its neighbors
• Use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier



K-Distance

• The k-distance of p is the distance between p and its k-th nearest neighbor
• In a set D of points, for any positive integer k, the k-distance of object p, denoted as k-distance(p), is the distance d(p, o) between p and an object o such that
  • For at least k objects o’ ∈ D \ {p}, d(p, o’) ≤ d(p, o)
  • For at most (k-1) objects o’ ∈ D \ {p}, d(p, o’) < d(p, o)



K-distance Neighborhood

• Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance
• N_{k-distance(p)}(p) = {q ∈ D \ {p} | d(p, q) ≤ k-distance(p)}
• N_{k-distance(p)}(p) can be written as N_k(p)
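
The k-distance and its neighborhood can be sketched as below. Objects are referred to by index so that duplicates are handled cleanly; the function names and the one-dimensional sample data are hypothetical. Note that ties can make the neighborhood larger than k.

```python
import math

def k_distance(i, data, k):
    """Distance from data[i] to its k-th nearest neighbor, excluding data[i] itself."""
    dists = sorted(math.dist(data[i], q) for j, q in enumerate(data) if j != i)
    return dists[k - 1]

def k_neighborhood(i, data, k):
    """Indices of all objects within the k-distance of data[i]."""
    kd = k_distance(i, data, k)
    return [j for j, q in enumerate(data) if j != i and math.dist(data[i], q) <= kd]

# One-dimensional points written as 1-tuples; the value 2 appears twice.
data = [(0,), (1,), (2,), (2,), (10,)]
kd = k_distance(0, data, 2)
nk = k_neighborhood(0, data, 2)
print(kd, nk)
```

Here the 2-distance of the first point is 2, and because two objects lie at exactly that distance, its neighborhood contains three objects rather than two.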



Reachability Distance
• The reachability distance of object p with respect to object o is reach-dist_k(p, o) = max{k-distance(o), d(p, o)}
• If p and o are close to each other, reach-dist_k(p, o) is the k-distance of o; otherwise, it is the real distance d(p, o)

Local Reachability Density
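$\mathrm{lrd}_k(o) = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} \mathrm{reachdist}_k(o' \leftarrow o)}$, where $\mathrm{reachdist}_k(o' \leftarrow o) = \max\{k\text{-distance}(o'),\ d(o, o')\}$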
Local outlier factor
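
The pieces above combine into the local outlier factor, which in its standard definition is the average ratio of the local reachability density of an object's neighbors to its own: LOF_k(o) = (1 / |N_k(o)|) Σ_{o' ∈ N_k(o)} lrd_k(o') / lrd_k(o). The following brute-force sketch is an illustration under that definition, not the slide's own code; the helper names are invented, and it assumes no duplicate points (duplicates can make a reachability sum zero).

```python
import math

def knn_info(data, k):
    """For every object, return its k-distance and its k-distance neighborhood (indices)."""
    kdist, neigh = [], []
    for i, p in enumerate(data):
        d = sorted((math.dist(p, q), j) for j, q in enumerate(data) if j != i)
        kd = d[k - 1][0]
        kdist.append(kd)
        neigh.append([j for dist, j in d if dist <= kd])
    return kdist, neigh

def lof_scores(data, k):
    """Local outlier factor of every object in data."""
    kdist, neigh = knn_info(data, k)

    def reach_dist(i, j):
        # Reachability distance of data[i] with respect to data[j]:
        # max{k-distance(data[j]), d(data[i], data[j])}.
        return max(kdist[j], math.dist(data[i], data[j]))

    n = len(data)
    # lrd: neighborhood size divided by the summed reachability distances.
    lrd = [len(neigh[i]) / sum(reach_dist(i, j) for j in neigh[i]) for i in range(n)]
    # LOF: average of lrd(neighbor) / lrd(object) over the neighborhood.
    return [sum(lrd[j] for j in neigh[i]) / (len(neigh[i]) * lrd[i]) for i in range(n)]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(data, k=2)
print([round(s, 2) for s in scores])
```

Scores close to 1 indicate an object whose density matches that of its neighbors; the isolated point (5, 5) gets a score well above 1.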

Reconstruction-based Approaches

• Find a succinct representation
• Use the succinct representation to reconstruct the original data samples
• Measure the quality (i.e., goodness) of reconstruction



Matrix Factorization Based Methods for
Numeric Data



Singular Value Decomposition (SVD)
$X \approx U \Sigma V^T$



Clustering-based Outlier Detection
• An object is an outlier if
  • It does not belong to any cluster;
  • There is a large distance between the object and its closest cluster; or
  • It belongs to a small or sparse cluster
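
The second criterion, distance to the closest cluster, can be sketched as follows. The cluster centers are assumed to come from a clustering step run beforehand (for example k-means), and the centers, threshold, and function names here are hypothetical.

```python
import math

def distance_to_closest_center(obj, centers):
    """Euclidean distance from obj to the nearest cluster center."""
    return min(math.dist(obj, c) for c in centers)

def cluster_outliers(objects, centers, threshold):
    """Objects whose distance to the closest center exceeds the threshold."""
    return [o for o in objects if distance_to_closest_center(o, centers) > threshold]

# Assumed output of a prior clustering step.
centers = [(0, 0), (10, 10)]
objects = [(0.5, 0.2), (9.6, 10.3), (5, 5), (0, 1)]

far_objects = cluster_outliers(objects, centers, threshold=3)
print(far_objects)
```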



Classification-based Outlier Detection

• Train a classification model that can distinguish “normal” data from outliers
• A brute-force approach: Consider a training set that contains some samples
labeled as “normal” and others labeled as “outlier”
• A training set in practice is typically heavily biased: the number of “normal”
samples likely far exceeds that of outlier samples
• Cannot detect unseen anomalies



One-Class Model
• A classifier is built to describe only the normal class
• Learn the decision boundary of the normal class using
classification methods such as SVM
• Any samples that do not belong to the normal class (not
within the decision boundary) are declared as outliers
• Advantage: can detect new outliers that may not appear
close to any outlier objects in the training set
• Extension: Normal objects may belong to multiple classes



Semi-Supervised Learning
Methods
• Combine classification-based and clustering-based methods
• Method
• Use a clustering-based approach to find a large cluster, C,
and a small cluster, C1
• Since some objects in C carry the label “normal”, treat all
objects in C as normal
• Use the one-class model of this cluster to identify normal
objects in outlier detection
• Since some objects in cluster C1 carry the label “outlier”,
declare all objects in C1 as outliers
• Any object that does not fall into the model for C (such as
a) is considered an outlier as well



Contextual Outliers

• An outlier object deviates significantly based on a selected context
  • Ex. Is 10°C in Vancouver an outlier? (It depends on whether it is summer or winter.)
• Attributes of data objects should be divided into two groups
  • Contextual attributes: define the context, e.g., time & location
  • Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• A generalization of local outliers, whose density significantly deviates from that of their local area
• Challenge: how to define or formulate a meaningful context?



Detection of Contextual Outliers

• If the contexts can be clearly identified, transform the problem into conventional outlier detection
  • Identify the context of the object using the contextual attributes
  • Calculate the outlier score for the object in the context using a conventional outlier detection method



Example

• Detect outlier customers in the context of customer groups
  • Contextual attributes: age group, postal code
  • Behavioral attributes: the number of transactions per year, annual total transaction amount
• Method, for a customer c
  • Locate c’s context;
  • Compare c with the other customers in the same group; and
  • Use a conventional outlier detection method
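
The three steps can be sketched as grouping customers by their contextual attributes and then applying a simple z-score test inside each group. Everything concrete here is an assumption for illustration: the field names, the customer records, the threshold of 2, and the simplification of using only one behavioral attribute (the annual amount).

```python
import statistics
from collections import defaultdict

def contextual_outliers(customers, threshold=2.0):
    """Flag customers whose annual amount deviates strongly within their own context."""
    # Step 1: locate each customer's context (age group, postal code).
    groups = defaultdict(list)
    for c in customers:
        groups[(c["age_group"], c["postal_code"])].append(c)
    flagged = []
    for members in groups.values():
        amounts = [m["annual_amount"] for m in members]
        if len(amounts) < 3:
            continue  # too few peers to compare against
        mean = statistics.mean(amounts)
        sd = statistics.pstdev(amounts)
        if sd == 0:
            continue
        # Steps 2 and 3: compare with peers using a conventional z-score test.
        flagged.extend(
            m["id"] for m in members if abs(m["annual_amount"] - mean) / sd > threshold
        )
    return flagged

amounts_a = [1000, 1100, 900, 1050, 950, 1000, 1020, 980, 5000]
amounts_b = [4800, 5200, 5000, 5100, 4900]
customers = [
    {"id": f"a{i}", "age_group": "18-25", "postal_code": "V5A", "annual_amount": x}
    for i, x in enumerate(amounts_a, 1)
] + [
    {"id": f"b{i}", "age_group": "45-60", "postal_code": "V6B", "annual_amount": x}
    for i, x in enumerate(amounts_b, 1)
]

flagged = contextual_outliers(customers)
print(flagged)
```

An annual amount of 5000 is flagged in the first group but is entirely typical in the second, which is what distinguishes a contextual outlier from a global one.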



Modeling Normal Behavior

• Model the “normal” behavior with respect to contexts
  • Use a training data set to train a model that predicts the expected behavior attribute values with respect to the contextual attribute values
  • An object is a contextual outlier if its behavior attribute values significantly deviate from the values predicted by the model
• Use a prediction model to link the contexts and behavior
  • Avoid explicit identification of specific contexts
  • Some possible methods: regression, Markov models, and finite state automata
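
Taking regression as the prediction model, a minimal sketch fits a least-squares line from one contextual attribute x to one behavioral attribute y and flags objects whose residual is large relative to the spread of all residuals. The data, the threshold of 2.5, and the function name are assumptions; note also that the outlier itself influences the fitted line, which a more careful method would guard against.

```python
import statistics

def regression_outliers(xs, ys, threshold=2.5):
    """Fit y = a + b*x by least squares and flag points with large standardized residuals."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual: observed behavior minus the behavior predicted from the context.
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    spread = statistics.pstdev(residuals)
    if spread == 0:
        return []
    return [i for i, r in enumerate(residuals) if abs(r) / spread > threshold]

xs = list(range(10))
ys = [2 * x + 1 for x in xs]
ys[5] = 25  # one object whose behavior departs from the trend

flagged = regression_outliers(xs, ys)
print(flagged)
```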



Collective Outliers

• Objects as a group deviate significantly from the entire data
• Examine the structure of the data set, i.e., the relationships between multiple data objects
• The structures are often not explicitly defined and have to be discovered as part of the outlier detection process



Detecting High Dimensional Outliers

• Interpretability of outliers
• Which subspaces manifest the outliers or an assessment regarding the “outlying-ness” of the
objects
• Data sparsity: data in high-D spaces are often sparse
• The distance between objects becomes heavily dominated by noise as the dimensionality
increases
• Data subspaces
• Local behavior and patterns of data
• Scalability with respect to dimensionality
• The number of subspaces increases exponentially



HilOut
• Finds distance-based outliers, but uses the ranks of distance instead of the absolute distance in outlier detection
• For each object o, find the k-nearest neighbors of o, denoted by nn_1(o), . . . , nn_k(o), where k is an application-dependent parameter
• The weight of o is $w(o) = \sum_{i=1}^{k} \mathrm{dist}(o, nn_i(o))$
• All objects are ranked in weight-descending order
• The top-l objects in weight are outliers
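
The weight-and-rank idea can be illustrated with a brute-force sketch, taking the weight of an object to be the sum of the distances to its k nearest neighbors. This only demonstrates the ranking; it does not reproduce how HilOut itself searches for neighbors efficiently in high-dimensional data, and the function name and sample points are invented.

```python
import math

def top_weight_outliers(data, k, l):
    """Indices of the l objects with the largest sum of distances to their k nearest neighbors."""
    weights = []
    for i, o in enumerate(data):
        dists = sorted(math.dist(o, q) for j, q in enumerate(data) if j != i)
        weights.append((sum(dists[:k]), i))
    # Rank all objects in weight-descending order and keep the top l.
    weights.sort(reverse=True)
    return [i for _weight, i in weights[:l]]

data = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
top = top_weight_outliers(data, k=2, l=1)
print(top)
```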



Angle-based Outliers



Summary

• Outlier detection and applications
• Types of outliers
• Statistical methods
• Proximity-based methods
• Clustering- and classification-based methods
• Contextual outliers and collective outliers
• Outlier detection for high-dimensional data
