0% found this document useful (0 votes)

52 views19 pages

Identifying and Handling Outliers in Pandas - A Step-By-Step Guide - by Arvid Eichner - Python in Plain English

The document discusses identifying and handling outliers in Pandas. It explains what outliers are and explores visual identification using boxplots and histograms. It also covers calculating z-scores and using the interquartile range method to identify outliers numerically. Finally, it demonstrates how to remove outliers from a Pandas DataFrame.

Uploaded by

Pedro Henrique Geanini Dicati

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

52 views19 pages

Identifying and Handling Outliers in Pandas - A Step-By-Step Guide - by Arvid Eichner - Python in Plain English

Uploaded by

Pedro Henrique Geanini Dicati

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 19

Open in app

Member-only story

Identifying and Handling Outliers in Pandas: A

Step-By-Step Guide
Learn How To Handle Outliers Like a Pro

Arvid Eichner · Follow

Published in Python in Plain English
8 min read · Oct 19, 2023

Listen Share More

Outliers are exceptional data points within your dataset, caused by chance,
anomalies, or even measurement errors. Effectively identifying and handling
outliers is a complex yet critical process because ignoring them can lead to biased
results. Pandas, the versatile data manipulation library in Python, provides a set of
tools for efficiently handling outliers. In this step-by-step guide, we will explore
what outliers are, how to detect them, what actions to take when handling them,
and how to leverage pandas along the way. We will look at boxplots, z_scores and the
interquartile range (IQR) method.

Before We Start
You should NEVER remove outliers without questioning their origin. Not all outliers
are created equal and for some types of outliers there are strong arguments against
removing them. If you are not certain that your outliers can be removed, or for a
more introductory guide on outliers using real-world examples, check out this story
first:

Don‘t Just Remove Outliers From Your Data – Think Twice

Rethinking Outliers: Strategies for Informed Data Decisions
arvideichner.medium.com

Okay, with that out of the way, lets see how we can identify and get rid of outliers in
your data using the Python library pandas.

Identifying Outliers Using Pandas

The first step in handling outliers is to identify their presence within your dataset.
Outliers can be identified visually or by using metrics such as z-scores or the
interquartile range (IQR) method. First, let’s quickly create some sample data.

'''
I use the numpy library to generate normally distributed data,
add 5 outliers, then combine everything in a new DataFrame:
'''

# We need pandas and numpy so let's import both libraries

import pandas as pd
import numpy as np

# Next, we set a random seed for reproducibility

np.random.seed(42)

# Generate random data following a normal distribution

# [Mean = 50, Standard Deviation = 10]
data = np.random.normal(loc=50, scale=10, size=1000)

# Introduce 5 outliers
outliers = np.random.normal(loc=200, scale=50, size=5)
data = np.concatenate([data, outliers])

# Shuffle the data to randomize the order

np.random.shuffle(data)

# Create final DataFrame containing outliers

outlier_df = pd.DataFrame({'values': data})

Since np.random.normal creates random data, I first set a seed using np.random.seed(42) .

By setting the same seed (42) in your code, you can recreate the same exact data points,
even though they are random.

In this code snippet, we create a DataFrame ’outlier_df’ with one variable called
‘values’ which contains 1000 normally distributed data points with mean (M) = 50
and standard deviation (SD) = 10. 'values' also contains 5 additional data points
that follow a normal distribution with M = 200 and SD = 50; i.e., potential outliers.
Let’s try to find them:

1) Identifying Outliers: Visual Inspection

The easiest method to spot outliers is to visually inspect the data (or plots thereof) to
spot values that seem odd or don’t fit in with the rest of the data.

# create boxplot
outlier_df.boxplot()
We can see that, as expected, most values lie between 0 and 100 and the median is
around 50. This makes sense given that we created data that is normally distributed
around a mean of 50 with a standard deviation of 10.

We can also see some values — 5 to be exact — that are above 150. They appear far
from all other values, like outcasts struggling to fit in. We just identified the
outliers!

In a boxplot, the data is divided into quartiles, with the central box representing the
interquartile range (IQR) that spans the 25th to 75th percentiles. The line inside the box
marks the median (50th percentile), and the “whiskers” (T-shaped lines) represent a limit
of 1.5 * IQR.

We can also use different plots, such as histograms. Histograms are great to get a
general feel for the distribution of a variable. It is often a good idea to start your data
exploration by looking at a histogram. As you will see, they are not as powerful for
spotting outliers though, especially compared to boxplots.

# create histogram
outlier_df.hist()
Again, we can see that our data seems to be normally distributed around 50
(although it appears a little skewed). Looking at this histogram there don’t seem to
be any outliers though, what’s up with that?

In this histogram, we can’t spot the 5 outliers since they get lost among the 1000 “normal”
data points.

That’s the tricky thing about outliers. They don’t show up in large numbers (or else
they wouldn’t be outliers). But even in low numbers they can disproportionately
affect your analysis due to their exceptionally high or low values.

We can combat this by increasing the number of ‘bins’ for the histogram. The
hist() function groups values into bins and then plots the bins. We can change the
number of bins to use by changing the value for the bins parameter when calling
the function:

# creating another histogram with more bins

outlier_df.plot.hist(bins=100)
Increasing the number of bins made the histogram more granular. Now it looks
more like a normal distribution and, more importantly, we can see the outliers. You
might have to squint a bit but at the bottom right corner of the plot we can see our 5
outliers as tiny blue bars.

Using a higher bin count, we can visually identify outliers using a histogram. But as you
notice, boxplots are far superior for visual outlier detection.

2) Identifying Outliers: z-Score

Apart from visual identification, we can also calculate metrics to determine what
represents “normal” values in our data. We can then look for data points that lie
outside of this normal range of data to find our outliers.

For instance, we can calculate the so-called z-score for each data point. It measures
the distance of a data point from the mean in standard deviations. High absolute z-
scores can be considered outliers (usually z-scores greater than 2 or 3).

# Calculate the z-scores

z_scores = np.abs((outlier_df - outlier_df.mean()) / outlier_df.std())
# add z_scores as a new column to our DataFrame
outlier_df['z_scores'] = z_scores

We obtain the z_scores by calculating the distance from the mean outlier_df —

outlier_df.mean() and dividing it by the standard deviation outlier_df.std() for

each data point. We use np.abs to get absolute values. This is necessary, since we
want to identify potential outliers both higher and lower than the rest of the data.
Finally, I added the z_scores to the DataFrame.

Each value in our DataFrame now has its own z_score, representing how far off it is
from the rest of the data.

For example, we can see that the z_score 0.246940 of value 54.979983 in row 3 is higher
than the z_score 0.127266 of value 49.039401 in row 0. That’s because the mean of
'values' is 51.0598 and 54 is further away from 51 than 49 is.

# quickly checking the mean of 'values'

outlier_df['values'].mean()

>> 51.05976702436777

3) Identifying Outliers: IQR Method

The IQR is a statistical measure that provides insight into the spread of data within
the interquartile range, which covers the middle 50% of the data. Calculate the IQR:
Start by computing the IQR, which is the difference between the 75th percentile (Q3)
and the 25th percentile (Q1).

# Calculate 25% percentile and 75% percentile

Q1 = outlier_df['values'].quantile(0.25)
Q3 = outlier_df['values'].quantile(0.75)

# Calculate Interquartile Range (IQR)

IQR = Q3 - Q1

Next, we define upper and lower bounds. These are the bounds that define the
range of “acceptable” or “normal” values.

Usually, the lower bound is defined as Q1 - 1.5 * IQR and the upper bound as Q3 + 1.5 *
IQR. You can change out the 1.5 if you want your outlier detection to be more sensitive or
less sensitive.

# Define lower and upper bounds

lower_bound = Q1 – 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

Now that we have the upper and lower bounds, we can identify our outliers: Data
points falling below the lower bound or above the upper bound are considered
potential outliers.

Removing Outliers Using Pandas

Now, the only thing left to do is to remove our outliers. Remember that you should
always think twice about removing outliers. Never remove remove data points
simply because they are outliers.

If you are certain, that your outliers can be removed, for instance because they are
created by measurement errors, you can go ahead and remove them from your data
set.

1) Removing Outliers Identified by Boxplot

First, we used a boxplot to find our outliers. The plot revealed that all but 5 data
points are within a range from 0 to 100. All five outliers had values well above 150.
Therefore, we can easily remove the outliers like this:

# Remove outliers (values larger than 150)

cleaned_df = outlier_df[outlier_df['values'] <= 150]

To remove the outliers, we create a new DataFrame df_cleaned which only contains
values from outlier_df that are smaller or equal to 150. We can confirm, whether
the outlier removal was sucessfull by plotting another boxplot:

# select values column from cleaned_df

values = cleaned_df[['values']]

# create new boxplot for values

values.boxplot()

No more outliers, success!

2) Removing Outliers Identified by z-Score and IQR
Removing outliers identified by z-score or IQR is equally straight-forward. To
remove z-score outliers, remove all values that feature z-scores above a predefined
threshold (usually 2 or 3).

# Define threshold (here: remove values with z-score > 2)

threshold = 2

# Remove outliers based on z-score and threshold

cleaned_df = outlier_df[outlier_df['z_scores'] <= threshold]

Similarly for IQR:

'''
We already defined lower and upper bounds above
So now we only need to remove values that are:

1) lower that the lower bound

2) higher than the upper bound

'''

# Remove outliers based on lower & upper bounds

cleaned_df = outlier_df[(outlier_df['Values'] >= lower_bound) &
(outlier_df['Values'] <= upper_bound)]

Handling outliers is a crucial step in data analysis and is essential for producing
accurate and meaningful results. With the power of Pandas and statistical methods,
you can effectively identify and manage outliers, ensuring that your analyses are
based on reliable data. By incorporating these techniques into your data analysis
workflow, you’ll be better equipped to derive valuable insights from your datasets.

Mastering Data Aggregation with Pandas GroupBy: An Easy-to-

Understand Guide
In data analysis, extracting meaningful insights requries data
aggregation. Pandas offers the powerful groupby method…
python.plainenglish.io

Why You Should Start Using Pandas for Data Preparation Today
arvideichner.medium.com

In Plain English
Thank you for being a part of our community! Before you go:

Be sure to clap and follow the writer! 👏

You can find even more content at PlainEnglish.io 🚀
Sign up for our free weekly newsletter. 🗞️
Follow us on Twitter(X), LinkedIn, YouTube, and Discord.

Data Science Pandas Python Programming Outliers

Written by Arvid Eichner

122 Followers · Writer for Python in Plain English

Ph.D. candidate in Information Systems / Data Science, passionate about Python, R, data, and statistics

More from Arvid Eichner and Python in Plain English

Arvid Eichner

A Step-by-Step Guide to Installing MySQL and MySQL Workbench on

Windows
· 6 min read · Oct 23, 2023

Júlio Almeida in Python in Plain English

Claude 3: The king of data extraction

Unleashing the Future of Data Extraction: Inside Claude 3’s Revolutionary Leap
· 7 min read · Mar 27, 2024

1.3K 5

Builescu Daniel in Python in Plain English

My Boss Laughed at Python…Then I Showed Him This

And Streamlined My Data Analysis Workflow

· 5 min read · Mar 28, 2024

2.2K 23
Arvid Eichner

Don‘t Just Remove Outliers From Your Data – Think Twice

Rethinking Outliers: Strategies for Informed Data Decisions

· 5 min read · Oct 14, 2023

96 1

See all from Arvid Eichner

See all from Python in Plain English

Recommended from Medium

Sze Zhong LIM in Data And Beyond

Mastering Exploratory Data Analysis (EDA): Everything You Need To

Know
A systematic approach to EDA your data and prep it for machine learning.

18 min read · Apr 6, 2024

343 3

Dylan Cooper in Stackademic

Mojo, 90,000 Times Faster Than Python, Finally Open Sourced!

On March 29, 2024, Modular Inc. announced the open sourcing of the core components of
Mojo.

· 10 min read · Apr 8, 2024

3K 21

Lists

Coding & Development

11 stories · 568 saves

Predictive Modeling w/ Python

20 stories · 1111 saves

General Coding Knowledge

20 stories · 1124 saves

Practical Guides to Machine Learning

10 stories · 1324 saves

Rany ElHousieny in Level Up Coding

Understanding Principal Component Analysis (PCA) Through Everyday

Examples
Principal Component Analysis (PCA) is a statistical technique used to simplify the complexity
in high-dimensional data while retaining…
14 min read · Dec 21, 2023

149

Richard Warepam in ILLUMINATION

How To Perform Statistical Analysis Using Python: The Ultimate Guide

5 Proven Methods that Every Data Science Professional Uses

8 min read · Mar 5, 2024

609 2
Tim Sumner in Towards Data Science

A New Coefficient of Correlation

What if you were told there exists a new way to measure the relationship between two variables
just like correlation except possibly…

10 min read · Mar 31, 2024

2.6K 35

Subha

Handling missing values in dataset — 7 methods that you need to know

While working with data it is a common scenario for the data scientists to deal with missing
values. Handling these missing values could…

9 min read · Feb 13, 2024

117

See more recommendations

CSEC Office Administration June 2015 P2
No ratings yet
CSEC Office Administration June 2015 P2
20 pages
AACVPR Guidelines For AACVPR Guidelines For Pulmonary Rehabilitation Programs (4 Edition)
No ratings yet
AACVPR Guidelines For AACVPR Guidelines For Pulmonary Rehabilitation Programs (4 Edition)
37 pages
Story Timelevel 2
No ratings yet
Story Timelevel 2
33 pages
Guide On Outlier Detection Methods
No ratings yet
Guide On Outlier Detection Methods
11 pages
Handling Ouliers
No ratings yet
Handling Ouliers
5 pages
Explanatory Data Analysis
100% (1)
Explanatory Data Analysis
28 pages
11 Different Ways For Outlier Detection in Python
No ratings yet
11 Different Ways For Outlier Detection in Python
11 pages
ML Ex2
No ratings yet
ML Ex2
7 pages
Feature Engineering
No ratings yet
Feature Engineering
66 pages
Feature Engineering
No ratings yet
Feature Engineering
63 pages
Outlier Detection
No ratings yet
Outlier Detection
41 pages
Outlier Detection and Removal
No ratings yet
Outlier Detection and Removal
2 pages
Outlier Treatment
No ratings yet
Outlier Treatment
16 pages
Dsi237 Group 2
No ratings yet
Dsi237 Group 2
27 pages
Week-6 DS Practical
No ratings yet
Week-6 DS Practical
12 pages
Outlier Detection and Capping
No ratings yet
Outlier Detection and Capping
7 pages
Detecting Data Outliers
No ratings yet
Detecting Data Outliers
7 pages
Outliers ML
No ratings yet
Outliers ML
14 pages
Detecting Data Outliers
No ratings yet
Detecting Data Outliers
7 pages
6735367a5d6e24a5f185bf9c 99512104437
No ratings yet
6735367a5d6e24a5f185bf9c 99512104437
2 pages
Handling Outliers
No ratings yet
Handling Outliers
6 pages
Dealing With Outliers
No ratings yet
Dealing With Outliers
19 pages
Ads Exp 7
No ratings yet
Ads Exp 7
10 pages
Outliers
No ratings yet
Outliers
3 pages
ISAT 600 Progress Report 3
No ratings yet
ISAT 600 Progress Report 3
4 pages
4 - Outliers - +transformaations ML
No ratings yet
4 - Outliers - +transformaations ML
28 pages
1 Program
No ratings yet
1 Program
20 pages
5 Ways To Find Outliers in Your Data - Statistics by Jim
No ratings yet
5 Ways To Find Outliers in Your Data - Statistics by Jim
35 pages
Outliers in Machine Learning
No ratings yet
Outliers in Machine Learning
13 pages
What Is Outlier
No ratings yet
What Is Outlier
3 pages
Unit 4
No ratings yet
Unit 4
17 pages
17 dm2 Anomaly Detection 2022 23
No ratings yet
17 dm2 Anomaly Detection 2022 23
113 pages
How To Handle Outliers
No ratings yet
How To Handle Outliers
6 pages
Concepts of EDA, Outliers-Detection and Treatment
No ratings yet
Concepts of EDA, Outliers-Detection and Treatment
99 pages
How To Calculate Outliers
No ratings yet
How To Calculate Outliers
7 pages
WINSEM2024-25 CBS3006 ETH VL2024250505168 2025-01-09 Reference-Material-III
No ratings yet
WINSEM2024-25 CBS3006 ETH VL2024250505168 2025-01-09 Reference-Material-III
4 pages
DSBDA Lab Assignment No 2
No ratings yet
DSBDA Lab Assignment No 2
7 pages
Krishnendu PCB-IT602B
No ratings yet
Krishnendu PCB-IT602B
11 pages
Nikita Prasad - Outliers Basics
No ratings yet
Nikita Prasad - Outliers Basics
13 pages
ML Lab Manual Bcsl602
No ratings yet
ML Lab Manual Bcsl602
108 pages
Module 11 (C)
No ratings yet
Module 11 (C)
4 pages
Data Minning Unit 4-1
No ratings yet
Data Minning Unit 4-1
10 pages
Outlier Analysis in Data Mining
No ratings yet
Outlier Analysis in Data Mining
5 pages
3-Introduction To Data Cleaning Outlires
No ratings yet
3-Introduction To Data Cleaning Outlires
5 pages
Lecture 05: Feature Engineering: Ms. Mehroz Sadiq
No ratings yet
Lecture 05: Feature Engineering: Ms. Mehroz Sadiq
69 pages
Univariate Outlier Detection
No ratings yet
Univariate Outlier Detection
9 pages
Anomaly Detection and Outlier Analysis
No ratings yet
Anomaly Detection and Outlier Analysis
25 pages
Ads 7
No ratings yet
Ads 7
6 pages
Outlier Detection in Non-Gaussian Distributions Uitschieter Detectie in Niet-Gauss Verdelingen
No ratings yet
Outlier Detection in Non-Gaussian Distributions Uitschieter Detectie in Niet-Gauss Verdelingen
45 pages
Lec 7 Data Visualization Basic Statistics Updated 21102024 122008pm
No ratings yet
Lec 7 Data Visualization Basic Statistics Updated 21102024 122008pm
39 pages
Expt 2
No ratings yet
Expt 2
3 pages
Prog 1
No ratings yet
Prog 1
3 pages
RousseeuwHubert AnomalyDetection WIRES 2018
No ratings yet
RousseeuwHubert AnomalyDetection WIRES 2018
14 pages
Advanced Python Programming Data Science: The University of Sheffield
No ratings yet
Advanced Python Programming Data Science: The University of Sheffield
55 pages
ML Lab Manual
No ratings yet
ML Lab Manual
110 pages
Data Cleaning
No ratings yet
Data Cleaning
4 pages
Hduud
No ratings yet
Hduud
55 pages
Unit 5 - Lecture 1 - Outlier Detection
No ratings yet
Unit 5 - Lecture 1 - Outlier Detection
30 pages
Advanced Data Analysis Techniques 3
No ratings yet
Advanced Data Analysis Techniques 3
31 pages
Notes PDF ML Day 17
No ratings yet
Notes PDF ML Day 17
9 pages
Cefr Letters b2 and c1
No ratings yet
Cefr Letters b2 and c1
32 pages
Problem Set 1: Big-And Divide & Conquer: CS 3510: Design & Analysis of Algorithms
No ratings yet
Problem Set 1: Big-And Divide & Conquer: CS 3510: Design & Analysis of Algorithms
5 pages
Order of The Mass-2
100% (2)
Order of The Mass-2
2 pages
Case Study 21. Human Resource Planning R PDF
0% (1)
Case Study 21. Human Resource Planning R PDF
3 pages
AoR-2020 (MGNREGA), Vol-I
No ratings yet
AoR-2020 (MGNREGA), Vol-I
135 pages
Unit 2 - Approaches To Tourism Entrepreneurship
100% (1)
Unit 2 - Approaches To Tourism Entrepreneurship
16 pages
2021 - IEEE Style Manual
0% (1)
2021 - IEEE Style Manual
70 pages
Writing Research Report
No ratings yet
Writing Research Report
33 pages
Abeya Merga Research
No ratings yet
Abeya Merga Research
45 pages
76 Command Set
No ratings yet
76 Command Set
27 pages
Creative Writing: A Short Guide To Teaching Imaginative Thinking
100% (1)
Creative Writing: A Short Guide To Teaching Imaginative Thinking
4 pages
Media and Information Literacy: Sablayan National Comprehensive High School
No ratings yet
Media and Information Literacy: Sablayan National Comprehensive High School
3 pages
Slope Stability PDF
No ratings yet
Slope Stability PDF
6 pages
The Alchemist Test Study Guide
No ratings yet
The Alchemist Test Study Guide
2 pages
Qadaqadar PDF
No ratings yet
Qadaqadar PDF
4 pages
NA DeiselShip Latest
No ratings yet
NA DeiselShip Latest
105 pages
Customers To Be Linkedfinal
No ratings yet
Customers To Be Linkedfinal
8 pages
DIVIDENDS
No ratings yet
DIVIDENDS
2 pages
PA 6.0 Amplifier Datasheet
No ratings yet
PA 6.0 Amplifier Datasheet
6 pages
Further Studies Maths P1 Memo 2024
No ratings yet
Further Studies Maths P1 Memo 2024
19 pages
Memory Addressing and Instruction Formats
No ratings yet
Memory Addressing and Instruction Formats
9 pages
Latest Battery Datasheet
No ratings yet
Latest Battery Datasheet
4 pages
Theo 5 - Module 4
No ratings yet
Theo 5 - Module 4
26 pages
Processing of Leather by Microbial Enzyme
100% (1)
Processing of Leather by Microbial Enzyme
13 pages
Info Age
No ratings yet
Info Age
31 pages
Cosmetic & Homecare Industry
No ratings yet
Cosmetic & Homecare Industry
2 pages
Math g1 m2 Full Module
No ratings yet
Math g1 m2 Full Module
379 pages