
Outlier Detection and Removal

Uploaded by

Niharika Khanna

Outlier

What is an outlier?
An outlier is a data point that deviates significantly from the rest of the data. It can be
either much higher or much lower than the other data points, and its presence can have a
significant impact on the results of machine learning algorithms. Outliers can be caused by
measurement or execution errors. The analysis of outlier data is referred to as outlier
analysis or outlier mining.
Types of Outliers
There are two main types of outliers:
Global outliers: Global outliers are isolated data points that are far away from the main body
of the data. They are often easy to identify and remove.
Contextual outliers: Contextual outliers are data points that are unusual in a specific context
but may not be outliers in a different context. They are often more difficult to identify and
may require additional information or domain knowledge to determine their significance.
Why Should You Detect Outliers?
In the machine learning pipeline, data cleaning and preprocessing is an important step as it
helps you better understand the data. During this step, you deal with missing values, detect
outliers, and more.
Because outliers are abnormally low or abnormally high values, their presence can skew the
results of statistical analyses on the dataset. This could lead to less effective and less
useful models.
But dealing with outliers often requires domain expertise, and none of the outlier detection
techniques should be applied without understanding the data distribution and the use case.
Outlier Detection
How to Detect Outliers Using Standard Deviation
When the data, or certain features in the dataset, follow a normal distribution, you can use the
standard deviation of the data, or the equivalent z-score to detect outliers.
In statistics, standard deviation measures the spread of data around the mean, and in essence,
it captures how far away from the mean the data points are.
For data that is normally distributed, around 68.3% of the data will lie within one standard
deviation of the mean. Close to 95.4% and 99.7% of the data lie within two and three
standard deviations of the mean, respectively.
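The percentages above can be checked numerically. The following is a minimal sketch, assuming an exactly normal distribution; the helper name fraction_within is hypothetical and not part of the original text. It relies on the standard identity that the probability of falling within k standard deviations of the mean equals erf(k / sqrt(2)).

```python
import math


def fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    # For a normal variable X: P(|X - mu| < k * sigma) = erf(k / sqrt(2)).
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(k):.4f}")
# within 1 standard deviation(s): 0.6827
# within 2 standard deviation(s): 0.9545
# within 3 standard deviation(s): 0.9973
```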

Let’s denote the standard deviation of the distribution by σ, and the mean by μ.

One approach to outlier detection is to set the lower limit to three standard deviations below
the mean (μ - 3*σ), and the upper limit to three standard deviations above the mean (μ +
3*σ). Any data point that falls outside this range is detected as an outlier.
If the data is normally distributed, about 99.7% of it lies within three standard deviations
of the mean, so roughly 0.3% of the points in the dataset will be flagged as outliers.
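The three-standard-deviation rule can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the function name three_sigma_outliers and the sample values are made up for the example, and the population standard deviation (statistics.pstdev) is used for σ. A point x is flagged when its z-score, (x - μ) / σ, is larger than 3 in absolute value.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Return the points lying outside [mu - k*sigma, mu + k*sigma]."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)  # population standard deviation (an assumption)
    lower, upper = mu - k * sigma, mu + k * sigma
    return [x for x in values if x < lower or x > upper]


# Hypothetical sample: 20 readings close to 50 and one extreme reading.
readings = [48, 49, 50, 51, 52] * 4 + [120]
print(three_sigma_outliers(readings))  # [120]
```

Note that an extreme value inflates both the mean and the standard deviation, so in very small samples it may not exceed the three-standard-deviation limit at all.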
Detecting Outliers Using the Interquartile Range (IQR)
In statistics, interquartile range or IQR is a quantity that measures the difference between the
first and the third quartiles in a given dataset.

• The first quartile is also called the lower quartile, or the 25th percentile.
• If q25 is the first quartile, 25% of the points in the dataset have values less than q25.
• The third quartile is also called the upper quartile, or the 75th percentile.
• If q75 is the third quartile, 75% of the points have values less than q75.
• Using the above notation, IQR = q75 - q25.

The interquartile range method flags all points that lie outside the range [q25 - 1.5*IQR,
q75 + 1.5*IQR] as outliers.
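The IQR rule can be sketched as follows. The function names iqr_bounds and iqr_outliers and the sample data are hypothetical. Quartile conventions differ between tools; this sketch assumes the "inclusive" method of Python's statistics.quantiles, so other libraries may give slightly different limits on small datasets.

```python
import statistics


def iqr_bounds(values, k=1.5):
    """Return (q25 - k*IQR, q75 + k*IQR) for the given values."""
    # quantiles(..., n=4) returns the three cut points q25, q50, q75.
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q75 - q25
    return q25 - k * iqr, q75 + k * iqr


def iqr_outliers(values, k=1.5):
    """Return the points lying outside the IQR-based limits."""
    lower, upper = iqr_bounds(values, k)
    return [x for x in values if x < lower or x > upper]


# Hypothetical sample with one extreme value.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
print(iqr_bounds(data))    # (9.375, 16.375)
print(iqr_outliers(data))  # [102]
```

Because quartiles are barely affected by a single extreme value, this rule also works on small samples where the standard deviation approach can fail.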

In summary:
• If the data, or the feature of interest, is normally distributed, you can use the standard
deviation or z-score to label points that are more than three standard deviations away from
the mean as outliers.
• If the data is not normally distributed, you can use the interquartile range or percentile
methods to detect outliers.
Removing Outliers
Once outliers have been identified, the next step is to remove or otherwise treat them. There
are several methods for handling outliers, including:
• Trimming: This involves removing the data points that fall outside a specified range, or a
fixed percentage of the most extreme values.
• Winsorizing: This involves replacing extreme values with the nearest value that falls within
a specified range.
• Imputation: This involves replacing missing or extreme values with a substitute value, such
as the mean or median of the data.
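The three treatments above can be sketched side by side. This is a minimal illustration: the function names, the sample data, and the limits 9.375 and 16.375 are assumptions made for the example (limits of this kind could come from any detection rule). The winsorizing variant shown here clips values to the limits themselves, which is one common reading of "nearest value within the range", and the imputation variant substitutes the median of the in-range points.

```python
import statistics


def trim(values, lower, upper):
    """Trimming: drop every point outside [lower, upper]."""
    return [x for x in values if lower <= x <= upper]


def winsorize(values, lower, upper):
    """Winsorizing (clipping variant): pull extreme points back to the nearest limit."""
    return [min(max(x, lower), upper) for x in values]


def impute_median(values, lower, upper):
    """Imputation: replace points outside the limits with the median of the inliers."""
    fill = statistics.median(trim(values, lower, upper))
    return [x if lower <= x <= upper else fill for x in values]


# Hypothetical data and limits.
data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]
lower, upper = 9.375, 16.375

print(trim(data, lower, upper))           # the value 102 is dropped
print(winsorize(data, lower, upper))      # 102 becomes 16.375
print(impute_median(data, lower, upper))  # 102 becomes 12
```

Trimming shortens the dataset, while winsorizing and imputation keep its length, which matters when the rows must stay aligned with other features.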
