
Outlier detection

• Outlier detection from a collection of patterns is an active area of research in data mining.
• Several modelling techniques are resistant to outliers or reduce their impact.
• Detecting and understanding outliers can lead to interesting findings.
• Outliers are generally defined as samples that
are exceptionally far from the mainstream of
data.
• Outlier Detection may be defined as the
process of detecting and subsequently
excluding outliers from a given set of data.
• There are no standardized Outlier
identification methods as these are largely
dependent upon the data set.
• Outlier Detection as a branch of data mining
has many applications in data stream analysis.
• Outlier Detection Techniques
• Why do I want to detect outliers? The answer, and the choice of technique, depends on:
– (i) Which and how many features am I considering for outlier detection? (univariate / multivariate)
– (ii) Can I assume a distribution of values for my selected features? (parametric / non-parametric)
• 1. Numeric Outlier
– Numeric Outlier is the simplest, nonparametric outlier
detection technique in a one-dimensional feature
space.
– The outliers are identified by means of the IQR (InterQuartile Range). The first and the third quartile (Q1, Q3) are calculated, and an outlier is then a data point xi that lies outside the interquartile range, i.e. below Q1 - k*IQR or above Q3 + k*IQR.
– Using the interquartile multiplier value k = 1.5, these range limits are the typical upper and lower whiskers of a box plot, as in the sketch below.
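A minimal sketch of this rule in Python (NumPy assumed; the data values are hypothetical):

import numpy as np

# hypothetical one-dimensional sample
x = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 9.8])

q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
iqr = q3 - q1                         # interquartile range
k = 1.5                               # interquartile multiplier

lower, upper = q1 - k * iqr, q3 + k * iqr   # box-plot whiskers
outliers = x[(x < lower) | (x > upper)]
print(outliers)                       # 9.8 lies outside the whiskers
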
• 2. Z-Score
• The Z-score technique assumes a Gaussian distribution of the data. The outliers are the data points in the tails of the distribution and therefore far from the mean.
• The z-score of a sample x is z = (x - mean) / standard deviation. When computing the z-score for each sample in the data set, a threshold must be specified.
For a normal distribution it is estimated that
68% of the data points lie within +/- 1 standard deviation,
95% of the data points lie within +/- 2 standard deviations, and
99.7% of the data points lie within +/- 3 standard deviations.

If the absolute z-score of a data point is greater than 3, it indicates that the data point is quite different from the other data points. Such a data point can be an outlier.
• In a survey, people were asked how many children they had. Suppose the data obtained from people is
• 1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
• Clearly, 15 is an outlier in this dataset, as the sketch below confirms.
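A minimal z-score sketch in Python for the survey data above (NumPy assumed; the threshold of 3 follows the rule of thumb stated earlier):

import numpy as np

children = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])

mean, std = children.mean(), children.std()
z = (children - mean) / std              # z-score of each sample

threshold = 3                            # |z| > 3 flags a data point as an outlier
print(children[np.abs(z) > threshold])   # the value 15 is flagged
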
• 3. DBSCAN
• DBSCAN is a nonparametric, density-based outlier detection method in a one- or multi-dimensional feature space. Here, all data points are defined either as Core Points, Border Points or Noise Points.
• Core Points are data points that have at least MinPts neighbouring
data points within a distance ε.
• Border Points are neighbours of a Core Point within the distance ε but with fewer than MinPts neighbours within the distance ε.
• All other data points are Noise Points, also identified as outliers.
• Outlier detection thus depends on the required number of
neighbours MinPts, the distance ε and the selected distance
measure, like Euclidean or Manhattan.
In the illustrated example with MinPts = 3: all the data points with at least 3 points (including themselves) within the circle of radius ε are considered Core Points, represented by the red color; all the data points with fewer than 3 but more than 1 point within the circle (including themselves) are considered Border Points, represented by the yellow color; and the data points with no point other than themselves inside the circle are considered Noise Points, represented by the purple color. A code sketch follows below.
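A minimal sketch using scikit-learn's DBSCAN (the data, ε and MinPts values are hypothetical); points labelled -1 are the Noise Points, i.e. the outliers:

import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical two-dimensional data: two small clusters and one isolated point
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
              [9.0, 0.5]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)   # eps = ε, min_samples = MinPts
outliers = X[db.labels_ == -1]               # label -1 marks Noise Points
print(outliers)                              # the isolated point [9.0, 0.5]
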
• 4. Isolation Forest
• This nonparametric method is ideal for large datasets in a one- or multi-dimensional feature space.
• The isolation number is of paramount
importance in this Outlier Detection
technique.
• The isolation number is the number of splits
needed to isolate a data point.
• It requires fewer splits to isolate an outlier than it
does to isolate a nonoutlier, i.e. an outlier has a lower
isolation number in comparison to a nonoutlier point.
• A data point is therefore defined as an outlier if its isolation number is lower than the threshold.
• The threshold is defined based on the estimated percentage of outliers in the data, which is the starting point of this outlier detection algorithm (see the sketch below).
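A minimal sketch using scikit-learn's IsolationForest (the data and the contamination value, i.e. the estimated percentage of outliers, are hypothetical):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # inliers around the origin
               [[8.0, 8.0], [-9.0, 7.0]]])        # two points far from the rest

# contamination = estimated share of outliers, the starting point of the algorithm
iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)                  # -1 for outliers, 1 for inliers
print(X[labels == -1])                   # the extreme points should be flagged
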
• Outlier Detection Methods
• Models for Outlier Detection Analysis
• 1. Extreme Value Analysis
• Extreme Value Analysis is the most basic form of outlier detection and works well for one-dimensional data.
• In this Outlier analysis approach, it is assumed
that values which are too large or too small are
outliers. Z-test and Student’s t-test are classic
examples.
• 2. Linear Models
• In this approach, the data is modelled into a
lower-dimensional sub-space with the use of
linear correlations.
• Then the distance of each data point to a plane that fits the sub-space is calculated. This distance is used to find outliers (see the sketch below).
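A minimal sketch of this idea using PCA as the linear model (scikit-learn assumed; the data and the cut-off are hypothetical), where the distance of each point to the fitted sub-space is its reconstruction error:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
t = rng.uniform(-1, 1, size=100)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=100)])   # nearly linear data
X = np.vstack([X, [[0.0, 3.0]]])                                # one off-line point

pca = PCA(n_components=1).fit(X)                   # fit a 1-D linear sub-space
X_proj = pca.inverse_transform(pca.transform(X))   # project back to the original space
dist = np.linalg.norm(X - X_proj, axis=1)          # distance to the sub-space

print(X[dist > dist.mean() + 3 * dist.std()])      # far-from-plane points flagged
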
• 3. Probabilistic and Statistical Models
• In this approach, Probabilistic and Statistical Models
assume specific distributions for data.
• They make use of expectation-maximization (EM) methods to estimate the parameters of the model.
• Finally, they calculate the probability of membership of each data point in the fitted distribution.
• The points with a low probability of membership are marked as outliers (see the sketch below).
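A minimal sketch of this approach with a two-component Gaussian mixture fitted by EM (scikit-learn's GaussianMixture; the data, number of components and probability threshold are hypothetical):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # cluster 1
               rng.normal(6, 1, size=(100, 2)),   # cluster 2
               [[20.0, 20.0]]])                   # point far from both clusters

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM parameter estimation
log_prob = gmm.score_samples(X)          # log-likelihood of each point under the model

threshold = np.percentile(log_prob, 1)   # flag the least probable ~1% as outliers
print(X[log_prob < threshold])
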
• 4. Proximity-based Models
• In this method, outliers are modelled as points
isolated from the rest of the observations.
• Cluster analysis, density-based analysis, and
nearest neighborhood are the principal
approaches of this kind.
• 5. Information-Theoretic Models
• In this method, the outliers increase the
minimum code length to describe a data set.
• Outlier Detection Methods In Use
• 1. High Dimensional Outlier Detection
• In many applications, data sets may contain thousands of
features. The traditional outlier detection approaches such as
PCA and LOF will not be effective.
• The High Contrast Subspaces for Density-Based Outlier Ranking (HiCS) method is an effective method to find outliers in high dimensional data sets.
• In high dimensions the contrast in distances to different data points becomes nonexistent. This basically means that using methods such as LOF, which are based on the nearest neighborhood, for high dimensional data sets will lead to outlier scores which are close to each other.
• 2. Proximity Method
• (i) Use clustering methods to identify the natural
clusters in the data (such as the k-means
algorithm).
• (ii) Identify and mark the cluster centroids.
• (iii) Identify data instances that are a fixed
distance or percentage distance from cluster
centroids.
• (iv) Filter out the outlier candidates from the training dataset and assess the model’s performance (see the sketch below).
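A minimal sketch of steps (i)-(iv) with k-means (scikit-learn assumed; the data, number of clusters and the percentage cut-off are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # natural cluster 1
               rng.normal(8, 1, size=(100, 2)),   # natural cluster 2
               [[4.0, 15.0]]])                    # point far from both centroids

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # (i) cluster the data
centroids = km.cluster_centers_[km.labels_]                   # (ii) centroid of each point
dist = np.linalg.norm(X - centroids, axis=1)                  # distance to own centroid

candidates = X[dist > np.percentile(dist, 99)]    # (iii) percentage-distance rule
print(candidates)                                 # (iv) candidates to filter out
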
• 3. Projection Method
• Projection methods are relatively simple to apply and
quickly highlight extraneous values.
• (i) Use projection methods to summarize your data to
two dimensions (such as PCA, SOM or Sammon’s
mapping).
• (ii) Visualize the mapping and identify outliers by hand.
• (iii) Use proximity measures from projected values or
codebook vectors to identify outliers.
• (iv) Filter out the outlier candidates from the training dataset and assess the model’s performance (see the sketch below).
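A minimal sketch of steps (i)-(iv) using PCA as the projection method (scikit-learn and matplotlib assumed; the data and cut-off are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 10)),   # ten-dimensional inliers
               [[6.0] * 10]])                      # one extreme point

Z = PCA(n_components=2).fit_transform(X)   # (i) summarise the data in two dimensions

plt.scatter(Z[:, 0], Z[:, 1])              # (ii) visualise and inspect by hand
plt.title("PCA projection")
plt.show()

# (iii) a simple proximity measure on the projected values
dist = np.linalg.norm(Z - Z.mean(axis=0), axis=1)
print(X[dist > np.percentile(dist, 99)])   # (iv) outlier candidates to filter out
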
• Outlier Detection Applications
• social network analysis
• cyber-security
• distributed systems
• health care
• bio-informatics.
