
Outlier detection

• Outlier detection from a collection of patterns is an active area of research in data mining.
• Several modelling techniques are resistant to outliers or reduce their impact.
• Detecting and understanding outliers can lead to interesting findings.
• Outliers are generally defined as samples that
are exceptionally far from the mainstream of
data.
• Outlier Detection may be defined as the
process of detecting and subsequently
excluding outliers from a given set of data.
• There are no standardized Outlier
identification methods as these are largely
dependent upon the data set.
• Outlier Detection as a branch of data mining
has many applications in data stream analysis.
• Outlier Detection Techniques
• Why do I want to detect outliers? The answer, and the choice of technique, depends on:
– (i) Which and how many features am I considering for outlier detection? (univariate / multivariate)
– (ii) Can I assume a distribution of values for my selected features? (parametric / non-parametric)
• 1. Numeric Outlier
– Numeric Outlier is the simplest, nonparametric outlier
detection technique in a one-dimensional feature
space.
– The outliers are identified by means of the IQR (InterQuartile Range). The first and the third quartile (Q1, Q3) are calculated, and an outlier is then a data point xi that lies outside the interquartile range, i.e. below Q1 - k*IQR or above Q3 + k*IQR.
– Using the interquartile multiplier value k = 1.5, these range limits are the typical upper and lower whiskers of a box plot, as in the sketch below.
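A minimal sketch of this rule in Python (NumPy assumed; the data values are hypothetical):

import numpy as np

# hypothetical one-dimensional sample
x = np.array([2.1, 2.4, 2.5, 2.7, 3.0, 3.1, 3.3, 9.8])

q1, q3 = np.percentile(x, [25, 75])   # first and third quartiles
iqr = q3 - q1                         # interquartile range
k = 1.5                               # interquartile multiplier

lower, upper = q1 - k * iqr, q3 + k * iqr   # box-plot whiskers
outliers = x[(x < lower) | (x > upper)]
print(outliers)                       # 9.8 lies outside the whiskers
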
• 2. Z-Score
• The Z-score technique assumes a Gaussian distribution of the data. The outliers are the data points in the tails of the distribution and therefore far from the mean.
• The z-score of a sample x is z = (x - mean) / standard deviation. When computing the z-score for each sample in the data set, a threshold must be specified.
For a normal distribution it is estimated that
68% of the data points lie within +/- 1 standard deviation,
95% of the data points lie within +/- 2 standard deviations, and
99.7% of the data points lie within +/- 3 standard deviations.

If the absolute z-score of a data point is greater than 3, it indicates that the data point is quite different from the other data points. Such a data point can be an outlier.
• In a survey, people were asked how many children they had. Suppose the data obtained from people is
• 1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2
• Clearly, 15 is an outlier in this dataset, as the sketch below confirms.
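A minimal z-score sketch in Python for the survey data above (NumPy assumed; the threshold of 3 follows the rule of thumb stated earlier):

import numpy as np

children = np.array([1, 2, 2, 2, 3, 1, 1, 15, 2, 2, 2, 3, 1, 1, 2])

mean, std = children.mean(), children.std()
z = (children - mean) / std              # z-score of each sample

threshold = 3                            # |z| > 3 flags a data point as an outlier
print(children[np.abs(z) > threshold])   # the value 15 is flagged
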
• 3. DBSCAN
• DBSCAN is a nonparametric, density-based outlier detection method in a one- or multi-dimensional feature space. Here, all data points are defined either as Core Points, Border Points or Noise Points.
• Core Points are data points that have at least MinPts neighbouring
data points within a distance ε.
• Border Points are neighbours of a Core Point within the distance ε but with fewer than MinPts neighbours within the distance ε.
• All other data points are Noise Points, also identified as outliers.
• Outlier detection thus depends on the required number of
neighbours MinPts, the distance ε and the selected distance
measure, like Euclidean or Manhattan.
In the illustrated example with MinPts = 3: all the data points with at least 3 points (including themselves) within the circle of radius ε are considered Core Points, represented by the red color; all the data points with fewer than 3 but more than 1 point within the circle (including themselves) are considered Border Points, represented by the yellow color; and the data points with no point other than themselves inside the circle are considered Noise Points, represented by the purple color. A code sketch follows below.
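A minimal sketch using scikit-learn's DBSCAN (the data, ε and MinPts values are hypothetical); points labelled -1 are the Noise Points, i.e. the outliers:

import numpy as np
from sklearn.cluster import DBSCAN

# hypothetical two-dimensional data: two small clusters and one isolated point
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.9, 1.0],
              [5.0, 5.1], [5.2, 4.9], [4.9, 5.0],
              [9.0, 0.5]])

db = DBSCAN(eps=0.5, min_samples=3).fit(X)   # eps = ε, min_samples = MinPts
outliers = X[db.labels_ == -1]               # label -1 marks Noise Points
print(outliers)                              # the isolated point [9.0, 0.5]
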
• 4. Isolation Forest
• This nonparametric method is ideal for large datasets in a one- or multi-dimensional feature space.
• The isolation number is of paramount
importance in this Outlier Detection
technique.
• The isolation number is the number of splits
needed to isolate a data point.
• It requires fewer splits to isolate an outlier than it
does to isolate a nonoutlier, i.e. an outlier has a lower
isolation number in comparison to a nonoutlier point.
• A data point is therefore defined as an outlier if its isolation number is lower than the threshold.
• The threshold is defined based on the estimated percentage of outliers in the data, which is the starting point of this outlier detection algorithm (see the sketch below).
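A minimal sketch using scikit-learn's IsolationForest (the data and the contamination value, i.e. the estimated percentage of outliers, are hypothetical):

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # inliers around the origin
               [[8.0, 8.0], [-9.0, 7.0]]])        # two points far from the rest

# contamination = estimated share of outliers, the starting point of the algorithm
iso = IsolationForest(contamination=0.02, random_state=42).fit(X)
labels = iso.predict(X)                  # -1 for outliers, 1 for inliers
print(X[labels == -1])                   # the extreme points should be flagged
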
• Outlier Detection Methods
• Models for Outlier Detection Analysis
• 1. Extreme Value Analysis
• Extreme Value Analysis is the most basic form of outlier detection and works well for one-dimensional data.
• In this Outlier analysis approach, it is assumed
that values which are too large or too small are
outliers. Z-test and Student’s t-test are classic
examples.
• 2. Linear Models
• In this approach, the data is modelled into a
lower-dimensional sub-space with the use of
linear correlations.
• Then the distance of each data point to a plane that fits the sub-space is calculated. This distance is used to find outliers (see the sketch below).
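A minimal sketch of this idea using PCA as the linear model (scikit-learn assumed; the data and the cut-off are hypothetical), where the distance of each point to the fitted sub-space is its reconstruction error:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
t = rng.uniform(-1, 1, size=100)
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=100)])   # nearly linear data
X = np.vstack([X, [[0.0, 3.0]]])                                # one off-line point

pca = PCA(n_components=1).fit(X)                   # fit a 1-D linear sub-space
X_proj = pca.inverse_transform(pca.transform(X))   # project back to the original space
dist = np.linalg.norm(X - X_proj, axis=1)          # distance to the sub-space

print(X[dist > dist.mean() + 3 * dist.std()])      # far-from-plane points flagged
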
• 3. Probabilistic and Statistical Models
• In this approach, Probabilistic and Statistical Models
assume specific distributions for data.
• They make use of expectation-maximization (EM) methods to estimate the parameters of the model.
• Finally, they calculate the probability of membership of each data point in the fitted distribution.
• The points with a low probability of membership are marked as outliers (see the sketch below).
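A minimal sketch of this approach with a two-component Gaussian mixture fitted by EM (scikit-learn's GaussianMixture; the data, number of components and probability threshold are hypothetical):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # cluster 1
               rng.normal(6, 1, size=(100, 2)),   # cluster 2
               [[20.0, 20.0]]])                   # point far from both clusters

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)   # EM parameter estimation
log_prob = gmm.score_samples(X)          # log-likelihood of each point under the model

threshold = np.percentile(log_prob, 1)   # flag the least probable ~1% as outliers
print(X[log_prob < threshold])
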
• 4. Proximity-based Models
• In this method, outliers are modelled as points
isolated from the rest of the observations.
• Cluster analysis, density-based analysis, and
nearest neighborhood are the principal
approaches of this kind.
• 5. Information-Theoretic Models
• In this method, the outliers increase the
minimum code length to describe a data set.
• Outlier Detection Methods In Use
• 1. High Dimensional Outlier Detection
• In many applications, data sets may contain thousands of
features. The traditional outlier detection approaches such as
PCA and LOF will not be effective.
• The High Contrast Subspaces for Density-Based Outlier Ranking (HiCS) method is an effective method to find outliers in high dimensional data sets.
• In high dimensions the contrast in distances to different data points becomes nonexistent. This basically means that using methods such as LOF, which are based on the nearest neighborhood, for high dimensional data sets will lead to outlier scores which are close to each other.
• 2. Proximity Method
• (i) Use clustering methods to identify the natural
clusters in the data (such as the k-means
algorithm).
• (ii) Identify and mark the cluster centroids.
• (iii) Identify data instances that are a fixed
distance or percentage distance from cluster
centroids.
• (iv) Filter out the outlier candidates from the training dataset and assess the model’s performance (see the sketch below).
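A minimal sketch of steps (i)-(iv) with k-means (scikit-learn assumed; the data, number of clusters and the percentage cut-off are hypothetical):

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)),   # natural cluster 1
               rng.normal(8, 1, size=(100, 2)),   # natural cluster 2
               [[4.0, 15.0]]])                    # point far from both centroids

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)   # (i) cluster the data
centroids = km.cluster_centers_[km.labels_]                   # (ii) centroid of each point
dist = np.linalg.norm(X - centroids, axis=1)                  # distance to own centroid

candidates = X[dist > np.percentile(dist, 99)]    # (iii) percentage-distance rule
print(candidates)                                 # (iv) candidates to filter out
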
• 3. Projection Method
• Projection methods are relatively simple to apply and
quickly highlight extraneous values.
• (i) Use projection methods to summarize your data to
two dimensions (such as PCA, SOM or Sammon’s
mapping).
• (ii) Visualize the mapping and identify outliers by hand.
• (iii) Use proximity measures from projected values or
codebook vectors to identify outliers.
• (iv) Filter out the outlier candidates from the training dataset and assess the model’s performance (see the sketch below).
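A minimal sketch of steps (i)-(iv) using PCA as the projection method (scikit-learn and matplotlib assumed; the data and cut-off are hypothetical):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, size=(200, 10)),   # ten-dimensional inliers
               [[6.0] * 10]])                      # one extreme point

Z = PCA(n_components=2).fit_transform(X)   # (i) summarise the data in two dimensions

plt.scatter(Z[:, 0], Z[:, 1])              # (ii) visualise and inspect by hand
plt.title("PCA projection")
plt.show()

# (iii) a simple proximity measure on the projected values
dist = np.linalg.norm(Z - Z.mean(axis=0), axis=1)
print(X[dist > np.percentile(dist, 99)])   # (iv) outlier candidates to filter out
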
• Outlier Detection Applications
• social network analysis
• cyber-security
• distributed systems
• health care
• bio-informatics.
