Detecting Data Outliers
Doctor of Education
Major in Industrial Management
What is an outlier?
The short story? It’s a data point that is significantly different from other
data points in a data set. The long story? There isn’t a strict mathematical
definition of what is or isn’t an outlier. In the end, detecting and handling
outliers is often a somewhat subjective exercise.
So how can you dive into a new data set, find the outliers, and clean them?
Keep reading for tips and tricks to help you detect and handle outliers.
Outliers are inevitable, especially for large data sets. On their own, they are
not problematic. However, in the context of the larger data set, it is essential
to identify, verify, and deal with outliers appropriately so that your
interpretation of the data is as accurate as possible.
The first step in dealing with outliers is finding them. There are two ways to
approach this.
This histogram of our pocket change example shows an outlier on the far right
for Day 4 ($101.20).
Scatter Plot: A scatter plot (also called a scatter diagram or scatter graph)
shows a collection of points on an x-y coordinate axis, where the x-axis
(horizontal axis) represents the independent variable and the y-axis (vertical
axis) represents the dependent variable.
A scatter plot is useful to find outliers in bivariate data (data with two
variables). You can easily spot the outliers because they will be far away from
the majority of points on the scatter plot.
This scatter plot of our pocket change example shows an outlier, far away
from all the other points, for Day 4 ($101.20).
Here we’ll talk about a simple, widely used, and proven technique to identify
outliers.
Sorting the data helps you spot outliers at the very top or bottom of the
column. However, there could be more outliers that might be difficult to catch.
Step 2: Quartiles
In any ordered range of values, there are three quartiles that divide the range
into four equal groups. The second quartile (Q2) is the median, since it divides
the ordered range into two equal groups. For an odd number of observations,
the median is the middle value of the sorted range; for an even number, it is
the average of the two middle values.
To calculate the first quartile (Q1) and the third quartile (Q3), calculate the
medians of the first half and the second half of the sorted range, respectively.
In this case, Q1 is 0.565 and Q3 is 3.775.
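The median-of-halves calculation described above can be sketched in Python. This is a minimal illustration, not the only convention: it assumes at least two values, and for an odd number of observations it leaves the middle value out of both halves. Other quartile methods exist and give slightly different results. The function name `quartiles` and the sample values are made up for illustration; they are not the pocket change data.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd count, skip the middle value so both halves are the same size.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)

sample = [7, 1, 5, 3, 8, 2, 6, 4]  # made-up values, for illustration only
q1, q2, q3 = quartiles(sample)
print(q1, q2, q3)  # 2.5 4.5 6.5
```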
A data point that falls outside the inner fence is called a minor outlier.
Inner Fence:
Lower bound = Q1 - (1.5 * (Q3-Q1))
Upper bound = Q3 + (1.5 * (Q3-Q1))
The data points for Day 11 and Day 4, that is $9.04 and $101.20 respectively,
qualify as minor outliers.
A data point that falls outside the outer fence is called a major outlier.
Outer Fence:
Lower bound = Q1 - (3 * (Q3-Q1))
Upper bound = Q3 + (3 * (Q3-Q1))
The data point for Day 4 (which is $101.20) qualifies as a major outlier.
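As a rough check, the fence rules can be applied to the quartiles given for the pocket change example (Q1 = 0.565, Q3 = 3.775). The sketch below assumes the usual convention that each upper bound mirrors its lower bound, that is, Q3 plus the same multiple of the interquartile range. The helper names `fences` and `classify` are hypothetical. `classify` reports only the most severe label, so a major outlier, which also lies outside the inner fence, is reported as "major".

```python
Q1, Q3 = 0.565, 3.775  # quartiles stated for the pocket change example

def fences(q1, q3, k):
    """Return (lower, upper) bounds at k times the interquartile range."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def classify(value, q1=Q1, q3=Q3):
    """Label a value as 'major', 'minor', or 'none' using the two fences."""
    inner_low, inner_high = fences(q1, q3, 1.5)
    outer_low, outer_high = fences(q1, q3, 3.0)
    if value < outer_low or value > outer_high:
        return "major"
    if value < inner_low or value > inner_high:
        return "minor"
    return "none"

print(fences(Q1, Q3, 1.5))  # roughly (-4.25, 8.59)
print(fences(Q1, Q3, 3.0))  # roughly (-9.065, 13.405)
print(classify(9.04), classify(101.20))  # minor major
```

With these quartiles, $9.04 lies just above the inner upper bound of about 8.59, while $101.20 is far beyond the outer upper bound of about 13.41.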
For example, you have the following data points as peak temperature of Delhi
(in Celsius) over the past two weeks: 30°, 31°, 28°, 30°, 31°, 33°, 32°, 31°,
300°, 30°, 29°, 28°, 30°, 31°. Day 9 had a peak temperature of 300°C, which
is clearly unrealistic.
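The same fence approach can be tried on the Delhi temperatures listed above. The sketch below assumes the median-of-halves quartile method and the outer fence at 3 times the interquartile range; the function name `major_outlier_days` is hypothetical. It only flags readings for review. Deciding whether a flagged reading is actually wrong still requires verification.

```python
from statistics import median

# Peak temperatures in Celsius for days 1 to 14, as listed in the text.
temps = [30, 31, 28, 30, 31, 33, 32, 31, 300, 30, 29, 28, 30, 31]

def major_outlier_days(readings):
    """Return (day, value) pairs that fall outside the outer fence."""
    ordered = sorted(readings)
    n = len(ordered)
    half = n // 2
    q1 = median(ordered[:half])
    # For an odd count, the middle value is left out of both halves.
    q3 = median(ordered[half + 1:] if n % 2 else ordered[half:])
    iqr = q3 - q1
    low, high = q1 - 3 * iqr, q3 + 3 * iqr
    return [(day, t) for day, t in enumerate(readings, start=1)
            if t < low or t > high]

print(major_outlier_days(temps))  # [(9, 300)]
```

Here Q1 is 30 and Q3 is 31, so the outer fence runs from 27 to 34 and only the Day 9 reading falls outside it.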
On the other hand, when you look at the pocket change example, it is not
unrealistic to have $101.20 in your pocket. It is possible that you just withdrew
$100 from the ATM right before you recorded the data point.
To handle such situations, it is a good practice to have protocols in place to
verify each outlier.
If the outlier is incorrect, there are two ways to deal with it:
1. Resurvey the data point: This is the most foolproof way of dealing with
incorrect outliers. Resurveying is easier when using mobile-based data
collection applications like Collect.
2. Delete the outlier data point: Resurveying may not be feasible in all
cases due to resource constraints. In such situations, you should delete
the outlier data point.
That $100 bill is an outlier — a data point that is significantly different from
other data points.
Outliers can represent accurate or inaccurate data. For example, if you
reported finding a $200 bill in your pocket, people would rightly ignore your
story. That outlier would be inaccurate, since $200 bills do not exist. It is
more likely a misreported $20 bill.
It is important to find and deal with outliers, since they can skew interpretation
of the data. For example, imagine that you want to know how much money
you keep in your pocket each day. At the end of each day, you empty your
pockets, count the money, and record the total. The results after 12 days are
in the table to the right.
Day 4 is clearly an outlier. If you exclude Day 4 from your calculations, you
would conclude that you keep an average of $2.25 in your pocket. However, if
you don’t exclude Day 4, the average money in your pocket would be $10.49.
These are vastly different results.
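The effect of a single outlier on the mean can be cross-checked with a few lines of arithmetic. Since the table itself is not reproduced here, this sketch works only from the figures quoted above and assumes the $2.25 average over the other 11 days is a rounded value taken at face value. The variable names are illustrative.

```python
average_without_day4 = 2.25  # stated in the text, presumably rounded
day4 = 101.20                # the outlier recorded on Day 4
days = 12

# Rebuild the 12-day total from the 11-day average plus the outlier.
total = average_without_day4 * (days - 1) + day4
average_with_day4 = total / days

print(round(average_with_day4, 2))
```

The result is about $10.50, which agrees with the reported $10.49 up to rounding of the inputs. One reading out of twelve raises the mean to more than four times its value without that reading.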