Detecting Data Outliers

This document discusses how to detect outliers in a dataset. It provides two main ways to find outliers: 1) Visualizing the data using histograms and scatter plots to spot outliers visually. 2) Using statistical techniques like sorting the data, calculating quartiles, and establishing inner and outer fences based on the interquartile range. Outliers identified need to be verified and appropriately handled by either resurveying the data point or deleting incorrect outliers. Finding and addressing outliers is important to ensure accurate interpretation and analysis of the overall dataset.

Uploaded by

Judy Ann Galleno


Republic of the Philippines

CAPIZ STATE UNIVERSITY

GRADUATE SCHOOL
Roxas City Main Campus

Doctor of Education
Major in Industrial Management

DETECTING DATA OUTLIERS THROUGH HISTOGRAM AND BOX PLOT

MEASURE OF SKEWNESS AND KURTOSIS

How to Find Outliers in a Data Set

What is an outlier?

In short, an outlier is a data point that is significantly different from the
other data points in a data set. The long story? There is no strict mathematical
definition of what is or isn’t an outlier. In the end, detecting and handling
outliers is often a somewhat subjective exercise.

So how can you dive into a new data set, find the outliers, and clean them?
Keep reading for tips and tricks to help you detect and handle outliers.

How to Find Outliers

Outliers are inevitable, especially for large data sets. On their own, they are
not problematic. However, in the context of the larger data set, it is essential
to identify, verify, and accordingly deal with outliers to ensure that your data
interpretation is as accurate as possible.

The first step in dealing with outliers is finding them. There are two ways to
approach this.

Visualize the Data

Depending on your data set, you can use some simple tools to plot your
data and spot outliers visually.

Histogram: A histogram is the best way to check univariate data (data
containing a single variable) for outliers. A histogram divides the range of
values into groups (or buckets) and then shows the frequency, that is, how
many times the data falls into each group, as a bar graph. Assuming
that these buckets are arranged in increasing order, you should be able to
spot outliers easily at the far left (very small values) or at the far right (very
large values).

A histogram of the pocket change example shows an outlier on the far right
for Day 4 ($101.20).
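
As a rough illustration of the bucketing a histogram performs, the sketch below
groups values into equal-width buckets and counts each group. Python is an
assumed choice (the document itself only mentions Excel), and the `histogram`
helper and the bucket width of 50 are hypothetical. The input is the list of
Delhi temperature readings quoted later in this document, not the pocket change
table.

```python
from collections import Counter

BIN_WIDTH = 50  # hypothetical bucket width, chosen only for this illustration


def histogram(values, bin_width):
    """Count how many values fall into each equal-width bucket.

    Each bucket is keyed by its lower edge; empty buckets are omitted.
    """
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Peak temperatures (Celsius) from the Delhi example in this document.
temps = [30, 31, 28, 30, 31, 33, 32, 31, 300, 30, 29, 28, 30, 31]

buckets = histogram(temps, BIN_WIDTH)
for lower_edge, count in buckets.items():
    # One '#' per observation gives a crude text bar chart.
    print(f"[{lower_edge}, {lower_edge + BIN_WIDTH}): {'#' * count}")
```

Thirteen readings land in the lowest bucket and a single reading sits alone in
the bucket starting at 300. That isolated bar is what you would look for at the
far right of a histogram.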

Scatter Plot: A scatter plot (also called a scatter diagram or scatter graph)
shows a collection of points on an x-y coordinate axis, where the x-axis
(horizontal axis) represents the independent variable and the y-axis (vertical
axis) represents the dependent variable.

A scatter plot is useful for finding outliers in bivariate data (data with two
variables). You can easily spot the outliers because they will be far away from
the majority of points on the scatter plot.

A scatter plot of the pocket change example shows an outlier, far away
from all the other points, for Day 4 ($101.20).

The Statistical Way

Using statistical techniques is a more thorough approach to identifying
outliers. There are several advanced statistical tools and packages that you
could use to identify outliers.

Here we’ll talk about a simple, widely used, and proven technique to identify
outliers.

Step 1: Sort the Data

Sort the data in the column in ascending order (smallest to largest). You can
do this in Excel by selecting the “Sort & Filter” option in the top right in the
home toolbar.

Sorting the data helps you spot outliers at the very top or bottom of the
column. However, there could be more outliers that might be difficult to catch.

Step 2: Quartiles

In any ordered range of values, there are three quartiles that divide the range
into four equal groups. The second quartile (Q2) is the median,
since it divides the ordered range into two equal groups. For an odd number
of observations, the median is equal to the middle value of the sorted range.

In this example, since we have an even number of observations (12), we need
to calculate the average of the sixth- and seventh-position values in the
ordered range, that is, the average of 1.38 and 1.77. The median of the
range works out to be 1.575.

To calculate the first (Q1) and third (Q3) quartiles, simply calculate the
medians of the first half and the second half of the sorted range, respectively.
In this case, Q1 is 0.565 and Q3 is 3.775.
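
The median-of-halves calculation can be sketched in Python as follows. This is
a minimal illustration, not the document's spreadsheet procedure. The
`quartiles` function is a hypothetical helper, it assumes at least two
observations, and for an odd number of observations it leaves the middle value
out of both halves, which is one common convention that the document does not
spell out. Because the 12-day pocket change table is not reproduced here, the
example uses the Delhi temperature readings quoted later in this document.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method.

    Assumes at least two observations. For an odd count, the middle
    value is excluded from both halves (an assumed convention).
    """
    data = sorted(values)      # Step 1: sort in ascending order
    half = len(data) // 2
    lower = data[:half]        # first half of the sorted range
    upper = data[-half:]       # second half of the sorted range
    return median(lower), median(data), median(upper)


# Peak temperatures (Celsius) from the Delhi example in this document.
temps = [30, 31, 28, 30, 31, 33, 32, 31, 300, 30, 29, 28, 30, 31]

q1, q2, q3 = quartiles(temps)
print("Q1 =", q1, "median =", q2, "Q3 =", q3)
```

For these 14 readings the sketch gives Q1 = 30, a median of 30.5, and Q3 = 31.
Other quartile conventions, such as the interpolating ones built into some
spreadsheet and statistics packages, can give slightly different values.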

Step 3: Inner and Outer Fences

The inner and outer fences are ranges that you can calculate using Q1
and Q3. To do this, you need to first calculate the interquartile range, which
is Q3 minus Q1. In this case, Q3 - Q1 = 3.775 - 0.565 = 3.21.

A data point that falls outside the inner fence is called a minor outlier.

Inner Fence:
Lower bound = Q1 - (1.5 * (Q3-Q1))

Upper bound = Q3 + (1.5 * (Q3-Q1))

In our example, the bounds for the inner fence are:

Lower Bound = 0.565 - (1.5*3.21) = -4.25

Upper Bound = 3.775 + (1.5*3.21) = 8.59

The data points for Day 11 and Day 4 (9.04 and 101.20, respectively) fall
outside the inner fence, so both qualify at least as minor outliers.

A data point that falls outside the outer fence is called a major outlier.

Outer Fence:
Lower bound = Q1 - (3 * (Q3-Q1))

Upper bound = Q3 + (3 * (Q3-Q1))

In our example, the bounds for the outer fence are:

Lower Bound = 0.565 - (3*3.21) = -9.07

Upper Bound = 3.775 + (3*3.21) = 13.41

The data point for Day 4 (which is $101.20) qualifies as a major outlier.
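
The fence arithmetic can be checked with a short Python sketch. It takes the
Q1 and Q3 values stated in this example as given. The `fences` and `classify`
helpers are hypothetical names introduced only for illustration, and Python is
an assumed choice of language.

```python
def fences(q1, q3, k):
    """Return (lower bound, upper bound) at k interquartile ranges from Q1 and Q3."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def classify(value, q1, q3):
    """Label a value using the inner (1.5 x IQR) and outer (3 x IQR) fences."""
    inner_low, inner_high = fences(q1, q3, 1.5)
    outer_low, outer_high = fences(q1, q3, 3.0)
    if value < outer_low or value > outer_high:
        return "major outlier"
    if value < inner_low or value > inner_high:
        return "minor outlier"
    return "not an outlier"


Q1, Q3 = 0.565, 3.775  # quartiles stated in the worked example

for amount in (1.38, 9.04, 101.20):
    print(amount, "->", classify(amount, Q1, Q3))
```

With these quartiles, 9.04 is labelled a minor outlier and 101.20 a major
outlier, while 1.38 is not flagged. The sketch reports only the most severe
label for each value, so a value beyond the outer fence is reported as major
even though it also lies outside the inner fence.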

How to Deal With Outliers

Now that you have identified all your outliers, you should look at each outlier
in the context of the other data points in the range, as well as the whole data
set. This requires prior knowledge of the nature of the data set, the data
validation protocols, and the behavior of the variable you are analyzing.

For example, suppose you have the following peak temperatures for Delhi
(in Celsius) over the past two weeks: 30°, 31°, 28°, 30°, 31°, 33°, 32°, 31°,
300°, 30°, 29°, 28°, 30°, 31°. Day 9 had a peak temperature of 300°C, which
is clearly unrealistic.

On the other hand, when you look at the pocket change example, it is not
unrealistic to have $101.20 in your pocket. It is possible that you just withdrew
$100 from the ATM right before you recorded the data point.

To handle such situations, it is a good practice to have protocols in place to
verify each outlier.

If your outlier is verified to be correct, you should leave it untouched. Such an
outlier, in fact, emerges as a useful insight from your data set and is worth
looking into.

If the outlier is incorrect, there are two ways to deal with it:

1. Resurvey the data point: This is the most foolproof way of dealing with
incorrect outliers. Resurveying becomes easier when using mobile-based
data collection applications like Collect.
2. Delete the outlier data point: Resurveying may not be feasible in all
cases due to resource constraints. In such situations, you should delete
the outlier data point.
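
For the deletion option, the sketch below drops a reading once verification has
marked it incorrect and compares the mean before and after. It uses the Delhi
temperature readings from the example above. The `verified_incorrect` set is a
hypothetical stand-in for whatever verification protocol you follow, and Python
is an assumed choice of language.

```python
from statistics import mean

# Peak temperatures (Celsius) from the Delhi example in this document.
temps = [30, 31, 28, 30, 31, 33, 32, 31, 300, 30, 29, 28, 30, 31]

# Zero-based positions marked incorrect after verification (hypothetical).
# Position 8 is Day 9, the 300 degree reading.
verified_incorrect = {8}

# Keep every reading except the ones verified to be incorrect.
cleaned = [t for i, t in enumerate(temps) if i not in verified_incorrect]

print(f"mean with the outlier:    {mean(temps):.2f}")
print(f"mean without the outlier: {mean(cleaned):.2f}")
```

The single bad reading pulls the mean from about 30.3 up to about 49.6, which
shows how strongly one incorrect outlier can distort a summary statistic.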

What Are Outliers and Why Are They Important?

Imagine that you generally keep spare change and small bills in your pocket.
If you reach in your pocket and find a $1 bill, a quarter, a dime, and 3 pennies,
you won’t be surprised. If you find a $100 bill, you will certainly be surprised.

That $100 bill is an outlier: a data point that is significantly different from
other data points.

Outliers can represent accurate or inaccurate data. For example, if you
reported finding a $200 bill in your pocket, people would rightly ignore your
story. That outlier would be inaccurate, since $200 bills do not exist. It is
more likely a misreported $20 bill.

However, a report of finding a $100 bill could be an accurate outlier. While
that data point is abnormal, it is possible. Perhaps you had just withdrawn
$100 from an ATM with no small bills.

It is important to find and deal with outliers, since they can skew interpretation
of the data. For example, imagine that you want to know how much money
you keep in your pocket each day. At the end of each day, you empty your
pockets, count the money, and record the total. The results after 12 days are
in the table to the right.

Day 4 is clearly an outlier. If you exclude Day 4 from your calculations, you
would conclude that you keep an average of $2.25 in your pocket. However, if
you don’t exclude Day 4, the average money in your pocket would be $10.49.
These are vastly different results.
