0% found this document useful (0 votes)
128 views

Detecting Data Outliers

This document discusses how to detect outliers in a dataset. It provides two main ways to find outliers: 1) Visualizing the data using histograms and scatter plots to spot outliers visually. 2) Using statistical techniques like sorting the data, calculating quartiles, and establishing inner and outer fences based on the interquartile range. Outliers identified need to be verified and appropriately handled by either resurveying the data point or deleting incorrect outliers. Finding and addressing outliers is important to ensure accurate interpretation and analysis of the overall dataset.

Uploaded by

Judy Ann Galleno
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
128 views

Detecting Data Outliers

This document discusses how to detect outliers in a dataset. It provides two main ways to find outliers: 1) Visualizing the data using histograms and scatter plots to spot outliers visually. 2) Using statistical techniques like sorting the data, calculating quartiles, and establishing inner and outer fences based on the interquartile range. Outliers identified need to be verified and appropriately handled by either resurveying the data point or deleting incorrect outliers. Finding and addressing outliers is important to ensure accurate interpretation and analysis of the overall dataset.

Uploaded by

Judy Ann Galleno
Copyright
© © All Rights Reserved
Available Formats
Download as DOCX, PDF, TXT or read online on Scribd
You are on page 1/ 7

Republic of the Philippines

CAPIZ STATE UNIVERSITY


GRADUATE SCHOOL
Roxas City Main Campus  

Doctor of Education
Major in Industrial Management

DETECTING DATA OUTLIERS THROUGH HISTOGRAM AND BOX PLOT


MEASURE OF SKEWNESS AND KURTOSIS

How to Find Outliers in a Data Set


!mXvbbsnA?nBk.fL?.xb like.

Z?B.VLxc cc:?z?v:X n.b?,

What is an outlier?

it’s a data point that is significantly different from other data points in a
data set. The long story? There isn’t a strong mathematical definition for what
is or isn’t an outlier. In the end, detecting and handling outliers is often a
somewhat subjective exercise.

So how can you dive into a new data set, find the outliers, and clean them?
Keep reading for tips and tricks to help you detect and handle outliers.

How to Find Outliers

Outliers are inevitable, especially for large data sets. On their own, they are
not problematic. However, in the context of the larger data set, it is essential
to identify, verify, and accordingly deal with outliers to ensure that your data
interpretation is as accurate as possible.
The first step in dealing with outliers is finding them. There are two ways to
approach this.

Visualize the Data


Depending on your data set, you can use some simple tools to visualize your
data and spot outliers visually.

Histogram: A histogram is the best way to check univariate data — data


containing a single variable — for outliers. A histogram divides the range of
values into various groups (or buckets), and then shows the frequency — how
many times the data falls into each group — through a bar graph. Assuming
that these buckets are arranged in increasing order, you should be able to
spot outliers easily at the far left (very small values) or at the far right (very
large values).

This histogram of our pocket change example shows an outlier on the far right
for Day 4 ($101.2).

Scatter Plot: A scatter plot (also called a scatter diagram or scatter graph)
shows a collection of points on an x-y coordinate axis, where the x-axis
(horizontal axis) represents the independent variable and the y-axis (vertical
axis) represents the dependent variable.
A scatter plot is useful to find outliers in bivariate data (data with two
variables). You can easily spot the outliers because they will be far away from
the majority of points on the scatter plot.
This scatter plot of our pocket change example shows an outlier — far away
from all the other points — for Day 4 ($101.2).

The Statistical Way


Using statistical techniques is a more thorough approach to identifying
outliers. There are several advanced statistical tools and packages that you
could use to identify outliers.

Here we’ll talk about a simple, widely used, and proven technique to identify
outliers.

Step 1: Sort the Data


Sort the data in the column in ascending order (smallest to largest). You can
do this in Excel by selecting the “Sort & Filter” option in the top right in the
home toolbar.

Sorting the data helps you spot outliers at the very top or bottom of the
column. However, there could be more outliers that might be difficult to catch.

Step 2: Quartiles
In any ordered range of values, there are three quartiles that divide the range
into four equal groups. The second quartile (Q2) is nothing but the median,
since it divides the ordered range into two equal groups. For an odd number
of observations, the median is equal to the middle value of the sorted range.

In this example, since we have an even number of observations (12), we need


to calculate the average of the sixth and seventh-position values in the
ordered range — that is, the average of 1.38 and 1.77. The median of the
range works out to be 1.575.

To calculate the first (Q1) and third quartiles (Q3), you need to simply
calculate the medians of the first half and second half respectively. In this
case, Q1 is 0.565 and Q3 is 3.775.

Step 3: Inner and Outer Fences


The inner and outer fences are ranges that you can calculate using the Q1
and Q3. To do this, you need to first calculate the interquartile range — the
difference between Q1 and Q3. In this case, Q3-Q1 = 3.21.

A data point that falls outside the inner fence is called a minor outlier.

Inner Fence:
Lower bound = Q1 - (1.5 * (Q3-Q1))

Upper bound = Q3 + (1.5 * (Q3-Q1))


In our example, the bounds for the inner fence are:

Lower Bound = 0.565 - (1.5*3.21) = -4.25

Upper Bound = 3.775 + (1.5*3.21) = 8.59

The data points for Day 11 and Day 4, that is 9.04 and 101.20 respectively,
qualify as minor outliers.

A data point that falls outside the outer fence is called a major outlier.

Outer Fence:
Lower bound = Q1 - (3 * (Q3-Q1))

Upper bound = Q3 + (3 * (Q3-Q1))

In our example, the bounds for the outer fence are:

Lower Bound = 0.565 - (3*3.21) = -9.07

Upper Bound = 3.775 + (3*3.21) = 13.41

The data point for Day 11 (which is $101.20) qualifies as a major outlier.

How to Deal With Outliers


Now that you have identified all your outliers, you should look at each outlier
in the context of the other data points in the range, as well as the whole data
set. This requires prior knowledge on the nature of the data set, data
validation protocols, and the behavior of the variable you are analyzing.

For example, you have the following data points as peak temperature of Delhi
(in Celsius) over the past two weeks: 30°, 31°, 28°, 30°, 31°, 33°, 32°, 31°,
300°, 30°, 29°, 28°, 30°, 31°. Day 9 had a peak temperature of 300°C, which
is clearly unrealistic.

On the other hand, when you look at the pocket change example, it is not
unrealistic to have $101.20 in your pocket. It is possible that you just withdrew
$100 from the ATM right before you recorded the data point.
To handle such situations, it is a good practice to have protocols in place to
verify each outlier.

If your outlier is verified to be correct, you should leave it untouched. Such an


outlier, in fact, emerges as a useful insight from your data set — and is worth
looking into.

If the outlier is incorrect, there are two ways to deal with it:

1. Resurvey the data point: This is the most foolproof way of dealing with
incorrect outliers. Resurveying becomes easier while using mobile-based
data collection applications like Collect.
1. Delete the outlier data point: Resurveying may not be feasible in all
cases due to resource constraints. In such situations, you should delete
the outlier data point.

What Are Outliers and Why Are They


Important?
Imagine that you generally keep spare change and small bills in your pocket.
If you reach in your pocket and find a $1 bill, a quarter, a dime, and 3 pennies,
you won’t be surprised. If you find a $100 bill, you will certainly be surprised.

That $100 bill is an outlier — a data point that is significantly different from
other data points.
Outliers can represent accurate or inaccurate data. For example, if you
reported finding a $200 bill in your pocket, people would rightly ignore your
story. That outlier would be inaccurate, since $200 bills do not exist. This is
likely to be misreporting for a $20 bill.

However, a report of finding a $100 bill could be an accurate outlier. While


that data point is abnormal, it is possible. Perhaps you had just withdrawn
$100 from an ATM with no small bills.

It is important to find and deal with outliers, since they can skew interpretation
of the data. For example, imagine that you want to know how much money
you keep in your pocket each day. At the end of each day, you empty your
pockets, count the money, and record the total. The results after 12 days are
in the table to the right.

Day 4 is clearly an outlier. If you exclude Day 4 from your calculations, you
would conclude that you keep an average of $2.25 in your pocket. However, if
you don’t exclude Day 4, the average money in your pocket would be $10.49.
These are vastly different results.

You might also like