Outlier Detection

data mining

Uploaded by

prathap badam

Outlier Analysis
• Outlier: a data object that is grossly different from, or inconsistent with, the remaining set of data
• Causes
    - Measurement or execution errors
    - Inherent data variability
• Outliers may themselves be valuable patterns
    - Fraud detection
    - Customized marketing
    - Medical analysis

Outlier Mining
• Given n data points and k, the expected number of outliers, find the top k objects that are most dissimilar from the remaining data
• Define which data are inconsistent
    - Example: residuals in regression
    - Difficulties: multi-dimensional data, non-numeric data
• Mine the outliers so defined
• Visualization-based methods
    - Not applicable to cyclic plots, high-dimensional data, or categorical data
• Approaches
    - Statistical approach
    - Distance-based approach
    - Density-based approach
    - Deviation-based approach
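
One way to make "inconsistent" concrete is the regression-residual idea mentioned on this slide. The sketch below is only an illustration under stated assumptions: it fits a straight line by ordinary least squares and ranks objects by absolute residual. The function name `top_k_residual_outliers` and the sample data are hypothetical and not from the slides.

```python
def top_k_residual_outliers(xs, ys, k):
    """Fit y = a + b*x by least squares; return indices of the k largest |residuals|."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Absolute residual of each object with respect to the fitted line
    residuals = [abs(y - (intercept + slope * x)) for x, y in zip(xs, ys)]
    ranked = sorted(range(n), key=lambda i: residuals[i], reverse=True)
    return ranked[:k]


# Hypothetical data: roughly y = 2x, with one gross deviation at index 4
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.0, 8.1, 25.0, 12.0, 14.1, 15.9]
suspects = top_k_residual_outliers(xs, ys, k=1)
```

Note that a gross outlier also pulls the fitted line toward itself, so in practice a robust fit may be preferable; the plain least-squares fit is used here only to keep the sketch short.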

Statistical Distribution-based Outlier Detection
• Assumes the data follow a probability distribution F and identifies outliers with a discordancy test
• Discordancy testing
    - Working hypothesis H: o_i ∈ F, for i = 1, 2, ..., n
    - The test verifies whether an object o_i is significantly different from what F predicts
    - Significance probability: SP(v_i) = Prob(T > v_i), where T is the test statistic and v_i is its value for o_i
    - If SP(v_i) is sufficiently small, o_i is discordant: the working hypothesis is rejected and the alternative hypothesis, that o_i comes from another distribution model G, is adopted
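
As a minimal sketch of a discordancy test, assume F is a normal distribution with known mean and standard deviation, and take T = |X - mu| / sigma as the test statistic, so that SP(v) = Prob(T > v) = 2(1 - Φ(v)). The choice of F, the statistic, the cut-off `alpha`, and the function names are assumptions made for illustration, not something the slides prescribe.

```python
from statistics import NormalDist


def significance_probability(value, mu, sigma):
    """SP(v) = Prob(T > v) for T = |X - mu| / sigma with X ~ Normal(mu, sigma)."""
    v = abs(value - mu) / sigma
    return 2.0 * (1.0 - NormalDist().cdf(v))


def is_discordant(value, mu, sigma, alpha=0.01):
    """Reject the working hypothesis H (value comes from F) when SP is below alpha."""
    return significance_probability(value, mu, sigma) < alpha


# Hypothetical working distribution F = Normal(50, 5)
far_value_is_discordant = is_discordant(71.0, mu=50.0, sigma=5.0)
near_value_is_discordant = is_discordant(52.0, mu=50.0, sigma=5.0)
```

Here the value 71 lies 4.2 standard deviations from the mean and is rejected, while 52 is not.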

Statistical Distribution-based Outlier Detection
• Alternative distributions
    - Inherent alternative distribution
        Alternative hypothesis H': all objects arise from another distribution G, that is, o_i ∈ G for i = 1, 2, ..., n
    - Mixture alternative distribution
        Discordant values are not outliers in F but contaminants from G: H': o_i ∈ (1 - λ)F + λG, for i = 1, 2, ..., n
    - Slippage alternative distribution
        Some objects are independent observations from a modified version of F with different parameters

Statistical Distribution-based Outlier Detection
• Procedures for detecting outliers
    - Block procedures
        Either all suspect objects are treated as outliers or all are accepted as consistent
    - Consecutive (sequential) procedures
        Inside-out procedure: the object least likely to be an outlier is tested first
        If it is found to be an outlier, all more extreme values are also considered outliers; otherwise the next most extreme object is tested
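
The inside-out procedure can be sketched roughly as follows. This is an illustration under assumptions: extremeness is measured as distance from the median, and the discordancy test is a simple z-score against the objects that are less extreme than the one being tested. The names `inside_out_outliers`, `max_suspects`, and `threshold` are hypothetical.

```python
from statistics import mean, stdev


def inside_out_outliers(values, max_suspects, threshold=3.0):
    """Test the suspects from least to most extreme; stop at the first discordant one."""
    center = sorted(values)[len(values) // 2]                # median-like centre, used for ranking only
    ranked = sorted(values, key=lambda v: abs(v - center))   # least extreme first
    first_suspect = len(ranked) - max_suspects
    for i in range(first_suspect, len(ranked)):
        trusted = ranked[:i]                                 # objects less extreme than the one under test
        score = abs(ranked[i] - mean(trusted)) / stdev(trusted)
        if score > threshold:
            return ranked[i:]                                # this object and every more extreme one
    return []


# Hypothetical measurements with two gross values
readings = [10.1, 9.8, 10.3, 9.9, 10.0, 10.2, 9.7, 14.0, 18.0]
flagged = inside_out_outliers(readings, max_suspects=3)
```

The sketch assumes at least two trusted values with non-zero spread, so that the standard deviation is defined.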

• Disadvantages of the statistical approach
    - Most tests are for single attributes
    - The data distribution may not be known

Distance-based Outlier Detection
• Distance-based outlier
    - An object O in a dataset T is a DB(p, D)-outlier if at least a fraction p of the objects in T lie at a distance greater than D from O
    - Intuitively, the object does not have enough neighbours
    - Avoids the excessive computation associated with fitting statistical models
    - If an object o is an outlier according to a discordancy test, then o is also a DB(p, D)-outlier for some suitably chosen p and D
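
The DB(p, D) definition can be checked directly by brute force. The sketch below assumes Euclidean distance and counts the fraction over all objects in T, including O itself (which is at distance 0); the function name and sample points are illustrative only.

```python
import math


def db_outliers(points, p, D):
    """Return the indices of the DB(p, D)-outliers, using Euclidean distance."""
    n = len(points)
    result = []
    for i, o in enumerate(points):
        # Number of objects in T that lie farther than D from o
        far = sum(1 for q in points if math.dist(o, q) > D)
        if far >= p * n:
            result.append(i)
    return result


# Hypothetical data: a tight group plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
flagged = db_outliers(points, p=0.8, D=3.0)
```

This takes O(n^2) distance computations, which is what the index-based, nested-loop, and cell-based algorithms try to reduce.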

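Distance-based Outlier Detection
• Index-based algorithm
    - Uses multi-dimensional indexing structures such as k-d trees and R-trees to search for the neighbours of each object o within radius dmin (the distance threshold D)
    - M: the maximum number of objects allowed within the dmin-neighbourhood of an outlier
    - Once M + 1 neighbours are found, o is not an outlier and the search for o can stop
    - Worst-case complexity O(n^2 k), where n is the number of objects and k is the dimensionality, apart from index construction
• Nested-loop algorithm
    - Avoids index construction
    - Tries to minimize the number of I/Os
    - Divides the memory buffer space into two halves and the data set into several logical blocks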
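
The early-termination rule (stop as soon as M + 1 neighbours are found) can be sketched with a plain in-memory nested loop. This is a simplification: it does not model the buffer halves or logical blocks of the nested-loop algorithm, and it assumes that an object counts in its own dmin-neighbourhood, which matches M = n(1 - p) under the DB(p, D) definition. The function name is hypothetical.

```python
import math


def nested_loop_outliers(points, dmin, M):
    """Return indices of objects with at most M objects within distance dmin."""
    outliers = []
    for i, o in enumerate(points):
        neighbours = 0
        for q in points:
            if math.dist(o, q) <= dmin:
                neighbours += 1
                if neighbours > M:
                    break                  # M + 1 neighbours found: o cannot be an outlier
        else:
            outliers.append(i)             # inner loop finished without exceeding M
    return outliers


# Hypothetical data: a tight group plus one distant point
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
flagged = nested_loop_outliers(points, dmin=3.0, M=1)
```

For dense regions the inner loop usually stops after a few comparisons, which is where the savings over the full O(n^2) scan come from.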

Distance-based Outlier Detection
• Cell-based algorithm
    - Complexity: O(c^k + n), where c is a constant depending on the number of cells and k is the dimensionality
    - The data space is partitioned into cells with side length dmin / (2√k)
    - Two layers surround each cell
        First layer: one cell thick
        Second layer: ⌈2√k - 1⌉ cells thick
    - The algorithm processes cells instead of individual objects
    - Maintains three counts per cell: cell_count, cell_+_1_layer_count, cell_+_2_layers_count
    - An object in a cell can be an outlier only if cell_+_1_layer_count <= M; otherwise no object in the cell is an outlier
    - If cell_+_2_layers_count <= M, then all objects in the cell are outliers
    - If cell_+_2_layers_count > M, some objects in the cell may be outliers, and object-by-object processing has to be done
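
The cell rules above can be sketched for the two-dimensional case (k = 2), where the second layer is ⌈2√2 - 1⌉ = 2 cells thick. This is a simplified illustration, not a tuned implementation: it assumes Euclidean distance, counts an object in its own neighbourhood, and uses hypothetical names and data.

```python
import math
from collections import defaultdict


def cell_based_outliers(points, dmin, M):
    """Indices of 2-D points with at most M points (itself included) within dmin."""
    k = 2
    side = dmin / (2 * math.sqrt(k))                   # cell side length
    reach = 1 + math.ceil(2 * math.sqrt(k) - 1)        # layer 1 plus layer 2, in cells
    cells = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        cells[(math.floor(x / side), math.floor(y / side))].append(idx)

    def members(cx, cy, lo, hi):
        """Indices stored in cells whose cell distance from (cx, cy) lies in [lo, hi]."""
        found = []
        for dx in range(-hi, hi + 1):
            for dy in range(-hi, hi + 1):
                if lo <= max(abs(dx), abs(dy)) <= hi:
                    found.extend(cells.get((cx + dx, cy + dy), []))
        return found

    outliers = []
    for (cx, cy), own in cells.items():
        count1 = len(own) + len(members(cx, cy, 1, 1))     # cell_+_1_layer_count
        if count1 > M:
            continue                                       # no object in this cell is an outlier
        layer2 = members(cx, cy, 2, reach)
        if count1 + len(layer2) <= M:                      # cell_+_2_layers_count
            outliers.extend(own)                           # every object in this cell is an outlier
            continue
        for i in own:                                      # object-by-object processing
            near = count1 + sum(1 for j in layer2 if math.dist(points[i], points[j]) <= dmin)
            if near <= M:
                outliers.append(i)
    return sorted(outliers)


# Hypothetical data: a dense 5 x 5 grid plus four scattered points
grid = [(i * 0.1, j * 0.1) for i in range(5) for j in range(5)]
points = grid + [(5.0, 5.0), (3.0, 0.0), (1.25, 0.0), (1.6, 0.0)]
flagged = cell_based_outliers(points, dmin=1.0, M=3)
```

The sketch relies on two properties of this cell size: any object in the cell or its first layer is within dmin of every object in the cell, and any object beyond the second layer is farther than dmin away, so only second-layer objects ever need individual distance checks.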

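Density-based Outlier Detection
• The previous methods depend on the global distribution of the data and implicitly assume that the data are uniformly distributed
• Real data may have regions with different density distributions
• This makes it difficult to choose a single dmin for the whole data set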

Density-based Outlier Detection
• Local outlier: an object is a local outlier if it is outlying relative to its local neighbourhood, particularly with respect to the density of that neighbourhood
    - Example: o2 is a local outlier with respect to cluster C2; o1 is also an outlier; none of the objects in cluster C1 is treated as an outlier
• Considers the degree to which an object is an outlier
• Local outlier factor (LOF): the degree depends on how isolated the object is with respect to its surrounding neighbourhood

Density-based Outlier Detection
• The k-distance of an object p is the distance d(p, o) between p and an object o in D such that
    - there are at least k objects o' in D \ {p} with d(p, o') <= d(p, o), that is, at least k objects are as close as or closer to p than o
    - there are at most k-1 objects o'' in D \ {p} with d(p, o'') < d(p, o), that is, at most k-1 objects are closer to p than o
• k-distance neighbourhood of p
    - Contains every object whose distance from p is not greater than the k-distance of p; for LOF, k is set to MinPts
• The reachability distance of an object p with respect to an object o is defined as
    reach_dist_MinPts(p, o) = max { MinPts-distance(o), d(p, o) }
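
The three definitions above translate almost directly into code. The sketch below assumes Euclidean distance and distinct points (an object is excluded from its own neighbourhood by value); the function names and the sample data are illustrative only.

```python
import math


def k_distance(data, p, k):
    """Distance from p to its k-th nearest neighbour among the other objects."""
    dists = sorted(math.dist(p, q) for q in data if q != p)
    return dists[k - 1]


def k_neighbourhood(data, p, k):
    """All objects other than p whose distance from p is at most k-distance(p)."""
    kd = k_distance(data, p, k)
    return [q for q in data if q != p and math.dist(p, q) <= kd]


def reach_dist(data, p, o, k):
    """reach_dist_k(p, o) = max { k-distance(o), d(p, o) }."""
    return max(k_distance(data, o, k), math.dist(p, o))


# Hypothetical data: a unit square plus one isolated point
data = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
kd_corner = k_distance(data, (0, 0), k=2)
rd = reach_dist(data, (1, 1), (5, 5), k=2)
```

In the example, the reachability distance of (1, 1) with respect to (5, 5) is the 2-distance of (5, 5) rather than the smaller true distance, which shows how the definition smooths distances to objects in sparse regions. Ties can make the k-distance neighbourhood contain more than k objects.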

OPTICS
• Complexity: O(n log n) when a spatial index is used

Density-based Outlier Detection
• The local reachability density (lrd) of p is the inverse of the average reachability distance of p with respect to its MinPts-nearest neighbours
• The local outlier factor (LOF) of p captures the degree to which we call p an outlier
    - It is the average, over the MinPts-nearest neighbours o of p, of the ratio lrd(o) / lrd(p)
    - LOF is close to 1 for objects deep inside a cluster and higher for outliers
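
Putting the definitions together, LOF can be computed by brute force as sketched below. The sketch assumes Euclidean distance and distinct points (duplicates would give a zero reachability sum), and it is a quadratic-time illustration rather than an efficient implementation; `lof_scores` and the sample data are hypothetical.

```python
import math


def lof_scores(data, min_pts):
    """Return the local outlier factor of every object in data."""
    n = len(data)
    dist = [[math.dist(a, b) for b in data] for a in data]

    # MinPts-distance and MinPts-distance neighbourhood of each object
    kd = [sorted(dist[i][j] for j in range(n) if j != i)[min_pts - 1] for i in range(n)]
    nbrs = [[j for j in range(n) if j != i and dist[i][j] <= kd[i]] for i in range(n)]

    def lrd(i):
        """Inverse of the average reachability distance from i to its neighbours."""
        total = sum(max(kd[o], dist[i][o]) for o in nbrs[i])
        return len(nbrs[i]) / total

    dens = [lrd(i) for i in range(n)]
    # Average ratio lrd(o) / lrd(p) over the neighbours o of p
    return [sum(dens[o] for o in nbrs[i]) / (len(nbrs[i]) * dens[i]) for i in range(n)]


# Hypothetical data: a unit square plus one isolated point
data = [(0, 0), (1, 0), (0, 1), (1, 1), (5, 5)]
scores = lof_scores(data, min_pts=2)
```

In this example the four corner points receive a score of 1, while the isolated point receives a much larger score.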

Deviation-based Outlier Detection
• Identifies outliers by examining the main characteristics of the objects in a group
• Objects that "deviate" from this description are considered outliers
• Sequential exception technique
    - Simulates the way in which humans distinguish unusual objects from among a series of supposedly like objects

Deviation-based Outlier Detection
• Sequential exception technique
    - Given a data set D, a sequence of subsets {D_1, D_2, ..., D_m} is built such that D_{j-1} ⊆ D_j; dissimilarities are assessed between consecutive subsets in the sequence
    - Exception set: the smallest subset of objects whose removal results in the greatest reduction of dissimilarity in the remaining set
    - Dissimilarity function: for numeric data, for example, the variance (1/n) Σ_{i=1..n} (x_i - x̄)^2, where x̄ is the mean of the n values
    - Smoothing factor: assesses how much the dissimilarity can be reduced by removing a subset from the original set of objects
    - The process can be repeated with different orderings to reduce the influence of the input order
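
The idea of an exception set can be sketched with variance as the dissimilarity function. Two assumptions are made here that the slides do not state: the smoothing factor is computed as SF(S) = |D - S| * (dissimilarity(D) - dissimilarity(D - S)), a formulation from the literature on this technique, and candidate subsets are enumerated exhaustively up to a small size instead of being built sequentially. All names are hypothetical.

```python
from itertools import combinations


def variance(values):
    """Dissimilarity of a set of numbers: (1/n) * sum of (x_i - mean)^2."""
    m = sum(values) / len(values)
    return sum((x - m) ** 2 for x in values) / len(values)


def smoothing_factor(data, subset):
    """Assumed form: size of the remainder times the reduction in dissimilarity."""
    rest = list(data)
    for x in subset:
        rest.remove(x)
    return len(rest) * (variance(data) - variance(rest))


def exception_set(data, max_size=2):
    """Smallest subset (up to max_size) with the largest smoothing factor."""
    best, best_sf = (), 0.0
    for size in range(1, max_size + 1):
        for subset in combinations(data, size):
            sf = smoothing_factor(data, subset)
            if sf > best_sf:               # strict comparison keeps the smaller subset on ties
                best, best_sf = subset, sf
    return list(best), best_sf


# Hypothetical data with one deviating value
values = [3, 4, 4, 5, 40]
exceptions, sf = exception_set(values)
```

Exhaustive enumeration is exponential in the subset size, which is why the sequential technique scans a sequence of growing subsets instead.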

Deviation-based Outlier Detection
• OLAP data cube technique
    - Uses data cubes to identify regions of anomalies in large multidimensional data
    - A cell value in a cube is an exception if it differs significantly from its expected value under a statistical model
    - Visual cues, such as background colour, guide the user to exceptions
    - The user may drill down to analyse them further
