
741 Outlier Detection

The document discusses outlier detection in data mining, highlighting its significance in various applications such as fraud detection and healthcare. It categorizes outliers into global, contextual, and collective types, and outlines the challenges and methods for detecting them, including supervised, unsupervised, and statistical approaches. The document also covers specific techniques like boxplots, Grubbs' test, and kernel density estimation for identifying outliers in datasets.

Uploaded by

nadun.emailclient


Outlier Detection

Jian Pei
Simon Fraser University

’Oumuamua

• “In 2017, an astronomical event occurred that was unlike any other: for the
first time, we observed an object that we are certain originated from beyond
our Solar System.”

Jian Pei: Data Mining -- Outlier Detection

Fraud detection
• Credit card transaction fraud detection
• Unusual amounts
• Unusual locations/time
• Fraud detection in stock markets
• Example: unusual chain transactions by a small group of connected users
• Fraud detection in healthcare insurance
• A child and the parents frequently see the same medical doctor at the same time
• A patient sees a medical doctor for flu every week

Outlier Analysis

• “One person’s noise is another person’s signal”
• Outliers: the objects considerably dissimilar from the remainder of the data

Outliers and Noise

• Outliers are different from noise
• Noise is random error or variance in a measured variable
• Outliers are interesting: an outlier violates the mechanism that generates the normal data
• Outlier detection vs. novelty detection
• In novelty detection, new patterns may be regarded as outliers at an early stage
• But they are later merged into the model

Types of Outliers

• Three kinds: global, contextual and collective outliers

• A data set may have multiple types of outlier
• One object may belong to more than one type of outlier
• Global outlier (or point anomaly)
• An outlier object significantly deviates from the rest of the data set
• Challenge: find an appropriate measurement of deviation

Contextual Outliers
• An outlier object deviates significantly based on a selected context
• Ex. Is 10°C in Vancouver an outlier? (It depends on whether it is summer or winter.)
• Attributes of data objects should be divided into two groups
• Contextual attributes: define the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• A generalization of local outliers, whose density significantly deviates from that of their local area
• Challenge: how to define or formulate a meaningful context?

Collective Outliers
• A subset of data objects collectively deviates significantly from the whole data set, even if the individual data objects may not be outliers
• Application example: intrusion detection when a number of computers keep sending denial-of-service packets to each other
• Detection of collective outliers
• Consider not only the behavior of individual objects, but also that of groups of objects
• Need background knowledge on the relationship among data objects, such as a distance or similarity measure on objects

Outlier Detection: Challenges
• Modeling normal objects and outliers properly
• Hard to enumerate all possible normal behaviors in an application
• The border between normal and outlier objects is often a gray area
• Application-specific outlier detection
• Choice of distance measure among objects and the model of relationship among objects are often application-dependent
• Example: in clinical data a small deviation could be an outlier, while in marketing analysis larger fluctuations are expected

Outlier Detection: Challenges
• Handling noise in outlier detection
• Noise may distort the normal objects and blur the distinction between normal objects and outliers
• Noise may help hide outliers and reduce the effectiveness of outlier detection
• Interpretability
• Understand why these are outliers: justification of the detection
• Specify the degree of an outlier: the unlikelihood of the object being generated by a normal mechanism

Outlier Detection Methods
• Whether user-labeled examples of outliers can be obtained
• Supervised, semi-supervised, and unsupervised methods
• Assumptions about normal data and outliers
• Statistical, proximity-based, and clustering-based methods

Supervised Methods

• Modeling outlier detection as a classification problem
• Samples examined by domain experts are used for training & testing
• Methods for learning a classifier for outlier detection effectively:
• Model normal objects & report those not matching the model as outliers, or
• Model outliers and treat those not matching the model as normal
• Challenges
• Imbalanced classes, i.e., outliers are rare: boost the outlier class and make up some artificial outliers
• Catch as many outliers as possible, i.e., recall is more important than accuracy (i.e., not mislabeling normal objects as outliers)

Unsupervised Methods
• Assume the normal objects are somewhat “clustered” into multiple groups, each having some distinct features
• An outlier is expected to be far away from any group of normal objects
• Weakness: cannot detect collective outliers effectively
• Normal objects may not share any strong patterns, but the collective outliers may share high similarity in a small area
• Many clustering methods can be adapted for unsupervised methods
• Find clusters, then outliers: those not belonging to any cluster

Unsupervised Methods: Challenges

Semi-Supervised Methods
• In many applications, the amount of labeled data is small
• Labels could be on outliers only, normal objects only, or both
• If some labeled normal objects are available
• Use the labeled examples and the proximate unlabeled objects to train a model for normal objects
• Those not fitting the model of normal objects are detected as outliers
• If only some labeled outliers are available, a small number of labeled outliers may not cover the possible outliers well
• To improve the quality of outlier detection, one can get help from models for normal objects learned from unsupervised methods

Statistical/Model-based Methods
• Make assumptions of data normality: normal data objects are generated by a statistical (stochastic) model, and data not following the model are outliers
• Effectiveness of statistical methods highly depends on whether the assumption of the statistical model holds in the real data
• There are many kinds of statistical models
• Parametric vs. non-parametric

Proximity-based Methods
• An object is an outlier if the nearest neighbors of the object
are far away, i.e., the proximity of the object significantly
deviates from the proximity of most of the other objects in the
same data set
• The effectiveness of proximity-based methods highly relies on
the proximity measure
• In some applications, proximity or distance measures cannot
be obtained easily
• Often have difficulty identifying a group of outliers that stay close to each other
• Two major types of proximity-based outlier detection methods
• Distance-based vs. density-based

Reconstruction-based Methods

• Idea: since the normal data samples often share certain similarities, they can often be represented in a more succinct way, compared with their original representation
• Samples that cannot be well reconstructed by such an alternative, succinct representation are regarded as outliers
• Two types of reconstruction-based outlier detection methods
• Matrix-factorization-based methods for numeric data
• Pattern-based compression methods for categorical data

Statistical Approaches
• Learn a generative model fitting the given data set, and then identify those objects in
low-probability regions of the model as outliers
• Two categories
• A parametric method assumes that the normal data objects are generated by a
parametric distribution with a finite number of parameters Θ
• A nonparametric method tries to determine the model from the input data
• A nonparametric method is not completely parameter-free


Detection of Univariate Outliers Based on Normal Distribution
• Univariate outlier detection using the maximum likelihood method
• Example
• A city’s average temperature values in July in the last 10 years are, in value-ascending order, 24.0°C, 28.9°C, 28.9°C, 29.0°C, 29.1°C, 29.1°C, 29.2°C, 29.2°C, 29.3°C, and 29.4°C
• Model assumption: the average temperature follows a normal distribution $N(\mu, \sigma^2)$
• Task: estimate the parameters $\mu$ and $\sigma$, that is, maximize the log-likelihood function

Maximum Likelihood Estimates

• Results
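For a normal model the maximization has a closed form. The sketch below writes the observed values as $x_1, \dots, x_n$ (notation assumed here, not taken from the slides) and states the log-likelihood together with the estimates obtained by setting its partial derivatives with respect to $\mu$ and $\sigma^2$ to zero.

```latex
\ln L(\mu, \sigma^2) = \sum_{i=1}^{n} \ln f(x_i \mid \mu, \sigma^2)
  = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2

\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2
```

Note that the maximum likelihood variance divides by $n$, not $n - 1$.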


Outlier Detection
• The most deviating value, 24.0°C, is 4.61°C away from the estimated mean
• Model assumption: the μ ± 3σ region contains 99.7% of the data under the assumption of normal distribution
• Because $\frac{4.61}{1.51} = 3.04 > 3$, the probability that the value 24.0°C is generated by the normal distribution is less than 0.15%, so it is an outlier
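The 3σ check can be reproduced in a few lines of Python. This is a minimal sketch that assumes the maximum likelihood variance (dividing by n) and uses only the ten temperatures from the example; the function names are illustrative. Recomputing from those values gives σ̂ ≈ 1.54, which puts 24.0°C at roughly 2.99σ, slightly below the 3.04 quoted on the slide, so the verdict for this value is sensitive to how σ is estimated and rounded.

```python
import math

# July average temperatures from the example (degrees Celsius).
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]


def normal_mle(values):
    """Return the maximum likelihood estimates (mean, std) of a normal model."""
    n = len(values)
    mu = sum(values) / n
    # The MLE of the variance divides by n, not n - 1.
    var = sum((v - mu) ** 2 for v in values) / n
    return mu, math.sqrt(var)


def sigma_rule_outliers(values, threshold=3.0):
    """Return (value, z) pairs lying more than threshold * std from the mean."""
    mu, sigma = normal_mle(values)
    scored = [(v, abs(v - mu) / sigma) for v in values]
    return [(v, z) for v, z in scored if z > threshold]


mu, sigma = normal_mle(temps)
z_lowest = abs(temps[0] - mu) / sigma
print(f"mu = {mu:.2f}, sigma = {sigma:.2f}, z(24.0) = {z_lowest:.2f}")
```

With a strict threshold of 3 the recomputed score does not flag 24.0°C; a slightly lower threshold such as 2.5 (an arbitrary choice for illustration) does.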

Boxplot
• A five-number summary: the smallest nonoutlier value (Min), the lower quartile (Q1), the median (Q2), the upper quartile (Q3), and the largest nonoutlier value (Max)
• The interquartile range (IQR) is defined as Q3 − Q1
• Any object that is more than 1.5 × IQR smaller than Q1 or 1.5 × IQR larger than Q3 is treated as an outlier, because for normally distributed data the region between Q1 − 1.5 × IQR and Q3 + 1.5 × IQR contains 99.3% of the objects
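The 1.5 × IQR rule is easy to sketch with the Python standard library. The example below reuses the July temperatures from the earlier univariate example; the function name and the choice of the "inclusive" quartile convention are assumptions made for illustration, and other quartile conventions can shift the fences slightly.

```python
import statistics


def boxplot_outliers(values, whisker=1.5):
    """Return (q1, q3, outliers) using the whisker * IQR rule."""
    # method="inclusive" interpolates between data points, treating the
    # sample minimum and maximum as the 0th and 100th percentiles.
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return q1, q3, outliers


temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
q1, q3, outliers = boxplot_outliers(temps)
print(q1, q3, outliers)
```

Because quartiles are barely affected by a single extreme value, the rule flags 24.0°C here without the estimate of spread being inflated by the outlier itself.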

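Grubbs’ Test (Maximum Normed Residual Test)
• z-score: $z = \frac{|x - \mu|}{\sigma}$, where $\mu$ is the mean and $\sigma$ the standard deviation of the data
• An object is an outlier if $z \ge \frac{n-1}{\sqrt{n}} \sqrt{\frac{t^2_{\alpha/(2n),\,n-2}}{n - 2 + t^2_{\alpha/(2n),\,n-2}}}$
• where $t_{\alpha/(2n),\,n-2}$ is the value taken by a t-distribution with n − 2 degrees of freedom at a significance level of α/(2n), and n is the number of objects in the data set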

Detection of Multivariate Outliers Using the Mahalanobis Distance
• Let $\bar{o}$ be the sample mean vector
• For an object o, the squared Mahalanobis distance from o to $\bar{o}$ is $\mathrm{Mdist}(o, \bar{o}) = (o - \bar{o})^T S^{-1} (o - \bar{o})$, where S is the sample covariance matrix
• Detect outliers using the univariate variable $\mathrm{Mdist}(o, \bar{o})$
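A small Python sketch of the squared Mahalanobis distance, restricted to two dimensions so that the covariance matrix can be inverted in closed form without a linear algebra library. The data points and function name are hypothetical, chosen so that one point breaks the correlation between the two attributes.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Sample covariance matrix S = [[sxx, sxy], [sxy, syy]] (divides by n - 1).
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Closed-form inverse of a symmetric 2x2 matrix.
    ixx, iyy, ixy = syy / det, sxx / det, -sxy / det
    scores = []
    for x, y in points:
        dx, dy = x - mx, y - my
        scores.append(dx * dx * ixx + 2 * dx * dy * ixy + dy * dy * iyy)
    return scores


# Hypothetical data: y roughly follows x, except for the last point.
points = [(1, 1.0), (2, 2.1), (3, 2.9), (4, 4.2), (5, 5.0), (3, 7.0)]
scores = mahalanobis_sq_2d(points)
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(points[most_outlying], round(scores[most_outlying], 2))
```

The last point has an unremarkable x value and only a moderately large y value, but it is the most outlying once the correlation captured by S is taken into account. The resulting scores form a univariate variable, so a univariate test can then be applied to them.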

Detection of Multivariate Outliers Using the χ2-statistic
• For an object o, the χ2-statistic is $\chi^2 = \sum_{i=1}^{n} \frac{(o_i - E_i)^2}{E_i}$
• $o_i$ is the value of o on the i-th dimension
• $E_i$ is the mean of the i-th dimension among all objects
• n is the dimensionality
• The larger the χ2-statistic, the more outlying the object
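The χ2-statistic translates directly into code. The following Python sketch assumes every dimension has a positive mean (otherwise the division is undefined) and uses made-up two-dimensional objects; the function name is illustrative.

```python
def chi_square_scores(objects):
    """Chi-square statistic of each object against the per-dimension means."""
    n_dims = len(objects[0])
    # E_i: mean of the i-th dimension among all objects.
    means = [sum(o[i] for o in objects) / len(objects) for i in range(n_dims)]
    return [
        sum((o[i] - means[i]) ** 2 / means[i] for i in range(n_dims))
        for o in objects
    ]


# Hypothetical objects; the last one is far from the rest in both dimensions.
objects = [(10, 20), (11, 19), (9, 21), (10, 20), (30, 5)]
scores = chi_square_scores(objects)
print([round(s, 2) for s in scores])
```

The object with the largest score is reported as the most outlying one.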

Detection Using a Mixture of Parametric
Distributions
• Model assumption: the normal data objects are generated by multiple normal distributions
• Use the EM (expectation-maximization) algorithm to learn the parameters

Non-parametric Method
• Do not assume an a priori statistical model; instead, determine the model from the input data
• Not completely parameter-free, but the number and nature of the parameters are flexible and not fixed in advance
• Examples: histogram and kernel density estimation

Histogram
• A transaction in the amount of $7,500 is an outlier, since only 0.2% of transactions have an amount higher than $5,000
• Hard to choose an appropriate bin size for the histogram
• Too small a bin size → normal objects in empty/rare bins, false positives
• Too big a bin size → outliers in some frequent bins, false negatives
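The effect of the bin size can be seen in a short Python sketch. It scores each value by the fraction of all values that share its bin, so a low score suggests an outlier. The transaction amounts and both bin widths are invented for illustration.

```python
def histogram_scores(values, bin_width):
    """Score each value by the fraction of values in its bin (low = suspicious)."""
    counts = {}
    for v in values:
        b = int(v // bin_width)
        counts[b] = counts.get(b, 0) + 1
    n = len(values)
    return [counts[int(v // bin_width)] / n for v in values]


# Hypothetical transaction amounts in dollars; 7500 is the unusual one.
amounts = [120, 340, 560, 80, 950, 410, 230, 670, 300, 7500]

wide = histogram_scores(amounts, bin_width=1000)
narrow = histogram_scores(amounts, bin_width=100)
print(wide[-1], wide[0])      # the large amount sits alone in a rare bin
print(narrow[-1], narrow[0])  # with narrow bins, ordinary amounts look rare too
```

With the wide bins the large amount stands out clearly, while with the narrow bins most ordinary amounts also fall into bins of their own, which is the false-positive problem described above.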

Kernel Density Estimation
• Treat an observed object as an indicator of high probability
density in the surrounding region
• The probability density at a point depends on the distances
from this point to the observed objects
• Use a kernel function to model the influence of a sample
point within its neighborhood
• A kernel K() is a non-negative real-valued integrable function that satisfies two conditions
• $\int_{-\infty}^{+\infty} K(x)\,dx = 1$
• $K(-x) = K(x)$ for any x

Kernel Density Estimation
• Using a kernel function K, the density is estimated by $\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)$
• A frequently used kernel is a standard Gaussian function with mean 0 and variance 1
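A minimal Python sketch of the estimator with a standard Gaussian kernel. The bandwidth h = 0.5 is an arbitrary assumption for illustration (in practice it has to be tuned), and the data are the July temperatures from the earlier univariate example.

```python
import math


def gaussian_kernel(u):
    """Standard Gaussian density (mean 0, variance 1)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def kde(x, samples, h):
    """Kernel density estimate at x with bandwidth h."""
    n = len(samples)
    return sum(gaussian_kernel((x - xi) / h) for xi in samples) / (n * h)


temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
h = 0.5  # assumed bandwidth

dense = kde(29.1, temps, h)
sparse = kde(24.0, temps, h)
print(round(dense, 3), round(sparse, 3))
```

The estimated density at 24.0°C comes almost entirely from that observation itself, so it is several times lower than the density near 29.1°C; objects in such low-density regions are the candidate outliers.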

Proximity-based Outlier Detection
• Objects far away from the others are outliers
• The proximity of an outlier deviates significantly from that of most of the
others in the data set
• Distance-based outlier detection: An object o is an outlier if its neighborhood
does not have enough other points
• Density-based outlier detection: An object o is an outlier if its density is
relatively much lower than that of its neighbors


Distance-based Outliers
• A DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O
• The larger D, the more outlying
• The larger p, the more outlying
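The DB(p, D) definition can be checked by brute force. The Python sketch below compares every pair of objects, which is only practical for small data sets, and it computes the fraction over the other objects in T (excluding O itself), which is an assumption of this sketch. The points and parameter values are hypothetical.

```python
import math


def db_outliers(points, p, d):
    """Return the points for which at least a fraction p of the other
    points lie at a distance greater than d."""
    outliers = []
    for i, o in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        far = sum(1 for q in others if math.dist(o, q) > d)
        if far >= p * len(others):
            outliers.append(o)
    return outliers


# Hypothetical data: four points on a unit square and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(db_outliers(points, p=0.9, d=2.0))
```

Raising D to a value larger than any pairwise distance in the data leaves no object that qualifies.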

Drawback of Distance-based Methods
• Both o1 and o2 are outliers
• Distance-based methods can detect o1, but not o2

Density-based Methods: Intuition
• Compare outliers with their local neighborhoods, instead of the global data distribution
• The density around an outlier object is significantly different from the density around its neighbors
• Use the relative density of an object against its neighbors as the indicator of the degree to which the object is an outlier

K-Distance

• The k-distance of p is the distance between p and its k-th nearest neighbor
• In a set D of points, for any positive integer k, the k-distance of object p, denoted as k-distance(p), is the distance d(p, o) between p and an object o such that
• For at least k objects $o' \in D \setminus \{p\}$, $d(p, o') \le d(p, o)$
• For at most (k-1) objects $o' \in D \setminus \{p\}$, $d(p, o') < d(p, o)$

K-distance Neighborhood

• Given the k-distance of p, the k-distance neighborhood of p contains every object whose distance from p is not greater than the k-distance
• $N_{k\text{-distance}(p)}(p) = \{q \in D \setminus \{p\} \mid d(p, q) \le k\text{-distance}(p)\}$
• $N_{k\text{-distance}(p)}(p)$ can be written as $N_k(p)$

Reachability Distance
• The reachability distance of object p with respect to object o is $\text{reach-dist}_k(p, o) = \max\{k\text{-distance}(o), d(p, o)\}$
• If p and o are close to each other, reach-dist(p, o) is the k-distance of o; otherwise, it is the real distance

Local Reachability Density
$lrd_k(o) = \frac{|N_k(o)|}{\sum_{o' \in N_k(o)} \text{reach-dist}_k(o, o')}$

Local outlier factor
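The pieces above (k-distance, k-distance neighborhood, reachability distance, local reachability density) can be put together in a brute-force Python sketch. For the final step it uses the standard local outlier factor definition, the average of $lrd_k(o') / lrd_k(o)$ over the neighbors $o' \in N_k(o)$. The function name and test points are hypothetical, and the sketch assumes there are no duplicate points (which would make a reachability sum zero).

```python
import math


def lof_scores(points, k):
    """Local outlier factor of every point, by brute force (O(n^2) distances)."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]

    # k-distance(p): distance from p to its k-th nearest neighbor.
    k_dist = [
        sorted(dist[i][j] for j in range(n) if j != i)[k - 1] for i in range(n)
    ]
    # N_k(p): every other object within k-distance(p); ties can make it
    # contain more than k objects.
    neigh = [
        [j for j in range(n) if j != i and dist[i][j] <= k_dist[i]]
        for i in range(n)
    ]

    def reach_dist(p, o):
        # reach-dist_k(p, o) = max{k-distance(o), d(p, o)}
        return max(k_dist[o], dist[p][o])

    # lrd_k(p) = |N_k(p)| / sum of reach-dist_k(p, o) over o in N_k(p)
    lrd = [
        len(neigh[i]) / sum(reach_dist(i, o) for o in neigh[i]) for i in range(n)
    ]
    # LOF_k(p) = average of lrd_k(o) / lrd_k(p) over o in N_k(p)
    return [
        sum(lrd[o] for o in neigh[i]) / (len(neigh[i]) * lrd[i]) for i in range(n)
    ]


# Hypothetical data: a tight square of four points plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
scores = lof_scores(points, k=2)
print([round(s, 2) for s in scores])
```

Points inside the square get a factor of 1 because their density matches that of their neighbors, while the distant point gets a factor well above 1.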
Reconstruction-based Approaches

• Find a succinct representation
• Use the succinct representation to reconstruct the original data samples
• Measure the quality (i.e., goodness) of reconstruction

Matrix Factorization Based Methods for
Numeric Data


Singular Value Decomposition (SVD)
$X \approx U \Sigma V^T$
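One common way to turn the factorization into an outlier score is sketched below, under the assumptions that the rows $x_i$ of $X$ are the data objects and that only the top $k$ singular values and vectors are kept ($k$ is a tuning choice that the slide does not specify).

```latex
X_k = U_k \Sigma_k V_k^{T}, \qquad
\hat{x}_i = x_i V_k V_k^{T}, \qquad
\mathrm{score}(x_i) = \lVert x_i - \hat{x}_i \rVert_2^{2}
```

Here $\hat{x}_i$ is the i-th row of the rank-$k$ approximation $X_k$. Objects with a large score are poorly reconstructed by the succinct representation and are candidate outliers.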

Clustering-based Outlier Detection
• An object is an outlier if
• It does not belong to any cluster;
• There is a large distance between the object and its closest cluster; or
• It belongs to a small or sparse cluster

Classification-based Outlier Detection

• Train a classification model that can distinguish “normal” data from outliers
• A brute-force approach: Consider a training set that contains some samples
labeled as “normal” and others labeled as “outlier”
• A training set in practice is typically heavily biased: the number of “normal”
samples likely far exceeds that of outlier samples
• Cannot detect unseen anomalies

One-Class Model
• A classifier is built to describe only the normal class
• Learn the decision boundary of the normal class using
classification methods such as SVM
• Any samples that do not belong to the normal class (not
within the decision boundary) are declared as outliers
• Advantage: can detect new outliers that may not appear
close to any outlier objects in the training set
• Extension: Normal objects may belong to multiple classes


Semi-Supervised Learning Methods
• Combine classification-based and clustering-based methods
• Method
• Use a clustering-based approach to find a large cluster, C,
and a small cluster, C1
• Since some objects in C carry the label “normal”, treat all
objects in C as normal
• Use the one-class model of this cluster to identify normal
objects in outlier detection
• Since some objects in cluster C1 carry the label “outlier”,
declare all objects in C1 as outliers
• Any object that does not fall into the model for C (such as
a) is considered an outlier as well


Contextual Outliers
• An outlier object deviates significantly based on a selected context
• Ex. Is 10°C in Vancouver an outlier? (It depends on whether it is summer or winter.)
• Attributes of data objects should be divided into two groups
• Contextual attributes: define the context, e.g., time & location
• Behavioral attributes: characteristics of the object, used in outlier evaluation, e.g., temperature
• A generalization of local outliers, whose density significantly deviates from that of their local area
• Challenge: how to define or formulate a meaningful context?

Detection of Contextual Outliers
• If the contexts can be clearly identified, transform the problem into conventional outlier detection
• Identify the context of the object using the contextual attributes
• Calculate the outlier score for the object in the context using a conventional outlier detection method

Example
• Detect outlier customers in the context of customer groups
• Contextual attributes: age group, postal code
• Behavioral attributes: the number of transactions per year, annual total transaction amount
• Method
• Locate c’s context;
• Compare c with the other customers in the same group; and
• Use a conventional outlier detection method
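The three-step method can be sketched in Python as follows. The customer records, the postal code prefixes, and the choice of a simple z-score with threshold 2 as the "conventional" detector are all assumptions made for illustration; any other univariate or multivariate detector could be plugged in at the last step.

```python
import math
from collections import defaultdict


def contextual_outliers(customers, threshold=2.0):
    """Flag customers whose behavior deviates from their own context group.

    Each customer is (customer_id, context, value): context is a tuple of
    contextual attributes, value is a single behavioral attribute.
    """
    # Step 1: locate each customer's context by grouping on it.
    groups = defaultdict(list)
    for _cid, context, value in customers:
        groups[context].append(value)

    flagged = []
    for cid, context, value in customers:
        # Step 2: compare the customer with the others in the same group.
        values = groups[context]
        mean = sum(values) / len(values)
        std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
        # Step 3: conventional detector, here a z-score threshold.
        if std > 0 and abs(value - mean) / std > threshold:
            flagged.append(cid)
    return flagged


# Hypothetical data: (id, (age group, postal code prefix), transactions per year)
young = ("20-29", "V5A")
senior = ("60-69", "V6B")
customers = [
    ("c1", young, 10), ("c2", young, 12), ("c3", young, 11), ("c4", young, 9),
    ("c5", young, 13), ("c6", young, 10), ("c7", young, 60),
    ("d1", senior, 55), ("d2", senior, 60), ("d3", senior, 58), ("d4", senior, 62),
    ("d5", senior, 57), ("d6", senior, 59), ("d7", senior, 61),
]
print(contextual_outliers(customers))
```

Sixty transactions per year is flagged for customer c7 but not for d2, because the same behavioral value is unusual in one context and ordinary in the other.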

Modeling Normal Behavior
• Model the “normal” behavior with respect to contexts
• Use a training data set to train a model that predicts the expected behavior attribute values with respect to the contextual attribute values
• An object is a contextual outlier if its behavior attribute values significantly deviate from the values predicted by the model
• Use a prediction model to link the contexts and behavior
• Avoid explicit identification of specific contexts
• Some possible methods: regression, Markov Models, and Finite State Automaton …

Collective Outliers

• Objects as a group deviate significantly from the entire data
• Examine the structure of the data set, i.e., the relationships between multiple data objects
• The structures are often not explicitly defined and have to be discovered as part of the outlier detection process.

Detecting High Dimensional Outliers

• Interpretability of outliers
• Which subspaces manifest the outliers or an assessment regarding the “outlying-ness” of the
objects
• Data sparsity: data in high-D spaces are often sparse
• The distance between objects becomes heavily dominated by noise as the dimensionality
increases
• Data subspaces
• Local behavior and patterns of data
• Scalability with respect to dimensionality
• The number of subspaces increases exponentially


HilOut
• Find distance-based outliers, but uses the ranks of distance instead of
the absolute distance in outlier detection
• For each object, o, find the k-nearest neighbors of o, denoted by
nn1(o), . . . , nnk(o), where k is an application-dependent parameter
• The weight of o is $w(o) = \sum_{i=1}^{k} dist(o, nn_i(o))$
• All objects are ranked in weight-descending order
• The top-l objects in weight are outliers
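The ranking step can be sketched in Python by brute force. The sketch takes the weight of an object to be the sum of the distances to its k nearest neighbors, which is an assumption stated here rather than something the slide text spells out, and it scans all pairs of objects; the published HilOut algorithm avoids that full scan with a space-filling curve, which this sketch does not attempt. Function name and data are hypothetical.

```python
import math


def top_weight_outliers(points, k, top_l):
    """Return the top_l (weight, point) pairs, where weight is the sum of
    distances from the point to its k nearest neighbors."""
    weighted = []
    for i, o in enumerate(points):
        dists = sorted(math.dist(o, q) for j, q in enumerate(points) if j != i)
        weighted.append((sum(dists[:k]), o))
    # Rank all objects in weight-descending order and keep the top l.
    weighted.sort(key=lambda pair: pair[0], reverse=True)
    return weighted[:top_l]


# Hypothetical data: four points on a unit square and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
top = top_weight_outliers(points, k=2, top_l=1)
print(top)
```

Because only the ranking matters, no distance threshold has to be chosen; the user picks k and the number l of outliers to report.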

Angle-based Outliers


Summary
• Outlier detection and applications
• Types of outliers
• Statistical methods
• Proximity-based methods
• Clustering- and classification-based methods
• Contextual outliers and collective outliers
• Outlier detection for high-dimensional data
