
Identifying and Handling Outliers in Pandas: A Step-By-Step Guide
Learn How To Handle Outliers Like a Pro

Arvid Eichner
Published in Python in Plain English
8 min read · Oct 19, 2023


Outliers are exceptional data points within your dataset, caused by chance,
anomalies, or even measurement errors. Effectively identifying and handling
outliers is a complex yet critical process because ignoring them can lead to biased
results. Pandas, the versatile data manipulation library in Python, provides a set of
tools for efficiently handling outliers. In this step-by-step guide, we will explore
what outliers are, how to detect them, what actions to take when handling them,
and how to leverage pandas along the way. We will look at boxplots, z-scores, and the
interquartile range (IQR) method.

Before We Start
You should NEVER remove outliers without questioning their origin. Not all outliers
are created equal, and for some types there are strong arguments against removing
them. If you are not certain that your outliers can be removed, or for a more
introductory guide on outliers using real-world examples, check out this story
first:

Don‘t Just Remove Outliers From Your Data – Think Twice
Rethinking Outliers: Strategies for Informed Data Decisions
arvideichner.medium.com

Okay, with that out of the way, let's see how we can identify and get rid of outliers in
your data using the Python library pandas.

Identifying Outliers Using Pandas


The first step in handling outliers is to identify their presence within your dataset.
Outliers can be identified visually or by using metrics such as z-scores or the
interquartile range (IQR) method. First, let’s quickly create some sample data.

'''
I use the numpy library to generate normally distributed data,
add 5 outliers, then combine everything in a new DataFrame:
'''

# We need pandas and numpy, so let's import both libraries
import pandas as pd
import numpy as np

# Next, we set a random seed for reproducibility
np.random.seed(42)

# Generate random data following a normal distribution
# [Mean = 50, Standard Deviation = 10]
data = np.random.normal(loc=50, scale=10, size=1000)

# Introduce 5 outliers
outliers = np.random.normal(loc=200, scale=50, size=5)
data = np.concatenate([data, outliers])

# Shuffle the data to randomize the order
np.random.shuffle(data)

# Create final DataFrame containing outliers
outlier_df = pd.DataFrame({'values': data})

Since np.random.normal creates random data, I first set a seed using np.random.seed(42).

By setting the same seed (42) in your code, you can recreate the exact same data points,
even though they are random.
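The effect of the seed can be checked in a few lines. This is a minimal sketch, independent of the article's dataset: resetting the seed replays the same random stream.

```python
import numpy as np

# Same seed, same draws: resetting the seed replays the RNG stream
np.random.seed(42)
first = np.random.normal(loc=50, scale=10, size=5)

np.random.seed(42)
second = np.random.normal(loc=50, scale=10, size=5)

print(np.array_equal(first, second))  # True
```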

In this code snippet, we create a DataFrame ’outlier_df’ with one variable called
‘values’ which contains 1000 normally distributed data points with mean (M) = 50
and standard deviation (SD) = 10. 'values' also contains 5 additional data points
that follow a normal distribution with M = 200 and SD = 50; i.e., potential outliers.
Let’s try to find them:

1) Identifying Outliers: Visual Inspection


The easiest method to spot outliers is to visually inspect the data (or plots thereof) to
spot values that seem odd or don’t fit in with the rest of the data.

# create boxplot
outlier_df.boxplot()

We can see that, as expected, most values lie between 0 and 100 and the median is
around 50. This makes sense given that we created data that is normally distributed
around a mean of 50 with a standard deviation of 10.

We can also see some values — 5 to be exact — that are above 150. They appear far
from all other values, like outcasts struggling to fit in. We just identified the
outliers!

In a boxplot, the data is divided into quartiles, with the central box representing the
interquartile range (IQR) that spans the 25th to 75th percentiles. The line inside the box
marks the median (50th percentile), and the "whiskers" (T-shaped lines) extend to the most
extreme data points that still lie within 1.5 * IQR of the box; anything beyond them is
plotted as an individual point.
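The whisker positions follow from simple quartile arithmetic. As an illustration (a sketch using fresh sample data, not the exact outlier_df from above), the 1.5 * IQR fences can be computed directly:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
s = pd.Series(np.random.normal(loc=50, scale=10, size=1000))

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1

# Default boxplot fences: whiskers reach the last point inside these limits
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points beyond the fences are the individual dots ("fliers") on the plot
fliers = s[(s < lower_fence) | (s > upper_fence)]
print(len(fliers), "points plotted individually")
```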

We can also use different plots, such as histograms. Histograms are great to get a
general feel for the distribution of a variable. It is often a good idea to start your data
exploration by looking at a histogram. As you will see, they are not as powerful for
spotting outliers though, especially compared to boxplots.

# create histogram
outlier_df.hist()

Again, we can see that our data seems to be normally distributed around 50
(although it appears a little skewed). Looking at this histogram, there don't seem to
be any outliers though. What's up with that?

In this histogram, we can’t spot the 5 outliers since they get lost among the 1000 “normal”
data points.

That’s the tricky thing about outliers. They don’t show up in large numbers (or else
they wouldn’t be outliers). But even in low numbers they can disproportionately
affect your analysis due to their exceptionally high or low values.

We can combat this by increasing the number of ‘bins’ for the histogram. The
hist() function groups values into bins and then plots the bins. We can change the
number of bins to use by changing the value for the bins parameter when calling
the function:

# creating another histogram with more bins
outlier_df.plot.hist(bins=100)

Increasing the number of bins made the histogram more granular. Now it looks
more like a normal distribution and, more importantly, we can see the outliers. You
might have to squint a bit but at the bottom right corner of the plot we can see our 5
outliers as tiny blue bars.

Using a higher bin count, we can visually identify outliers using a histogram. But as you
notice, boxplots are far superior for visual outlier detection.
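Why the outliers disappear at the default bin count can also be seen numerically. This sketch recreates the sample data from above (minus the shuffle, which doesn't affect bin counts) and compares bin occupancy:

```python
import numpy as np

np.random.seed(42)
data = np.concatenate([np.random.normal(loc=50, scale=10, size=1000),
                       np.random.normal(loc=200, scale=50, size=5)])

# With the default 10 bins, the bulk of the data towers over the
# outlier bins; the rightmost bin holds at most the 5 outliers
counts10, _ = np.histogram(data, bins=10)
counts100, _ = np.histogram(data, bins=100)

print("tallest bin:", counts10.max(), "| rightmost bin:", counts10[-1])
```

With 100 bins each bar covers a much narrower value range, so the outlier bars stand apart from the bulk instead of being absorbed into wide, nearly empty bins.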

2) Identifying Outliers: z-Score


Apart from visual identification, we can also calculate metrics to determine what
represents “normal” values in our data. We can then look for data points that lie
outside of this normal range of data to find our outliers.

For instance, we can calculate the so-called z-score for each data point. It measures
the distance of a data point from the mean in standard deviations. Data points with high
absolute z-scores (usually greater than 2 or 3) can be considered outliers.

# Calculate the z-scores for the 'values' column
z_scores = np.abs((outlier_df['values'] - outlier_df['values'].mean())
                  / outlier_df['values'].std())

# Add z_scores as a new column to our DataFrame
outlier_df['z_scores'] = z_scores

We obtain the z-scores by subtracting the mean (outlier_df.mean()) from each data
point and dividing the result by the standard deviation (outlier_df.std()). We use
np.abs to get absolute values. This is necessary since we want to identify potential
outliers both higher and lower than the rest of the data. Finally, I added the
z_scores to the DataFrame.

Each value in our DataFrame now has its own z_score, representing how far off it is
from the rest of the data.

For example, we can see that the z_score 0.246940 of value 54.979983 in row 3 is higher
than the z_score 0.127266 of value 49.039401 in row 0. That’s because the mean of
'values' is 51.0598 and 54 is further away from 51 than 49 is.

# quickly checking the mean of 'values'
outlier_df['values'].mean()

>> 51.05976702436777
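As a sanity check, a single z-score can be recomputed by hand and compared against the vectorized version. A self-contained sketch, rebuilding the sample data (the shuffle is omitted here, so row positions differ from the article's output, but the mean and the scores themselves are unaffected):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.concatenate([np.random.normal(loc=50, scale=10, size=1000),
                       np.random.normal(loc=200, scale=50, size=5)])
outlier_df = pd.DataFrame({'values': data})

mean = outlier_df['values'].mean()
std = outlier_df['values'].std()

# Vectorized z-scores, as in the article
outlier_df['z_scores'] = np.abs((outlier_df['values'] - mean) / std)

# The same score for one value, computed by hand
x = outlier_df['values'].iloc[0]
z_by_hand = abs(x - mean) / std

print(round(mean, 4), round(z_by_hand, 4))
```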

3) Identifying Outliers: IQR Method


The IQR is a statistical measure that provides insight into the spread of the data:
the interquartile range covers the middle 50% of the values. Start by computing the
IQR, which is the difference between the 75th percentile (Q3) and the 25th
percentile (Q1).

# Calculate 25th and 75th percentiles
Q1 = outlier_df['values'].quantile(0.25)
Q3 = outlier_df['values'].quantile(0.75)

# Calculate Interquartile Range (IQR)
IQR = Q3 - Q1

Next, we define upper and lower bounds. These are the bounds that define the
range of “acceptable” or “normal” values.

Usually, the lower bound is defined as Q1 - 1.5 * IQR and the upper bound as Q3 + 1.5 *
IQR. You can change out the 1.5 if you want your outlier detection to be more sensitive or
less sensitive.

# Define lower and upper bounds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

Now that we have the upper and lower bounds, we can identify our outliers: Data
points falling below the lower bound or above the upper bound are considered
potential outliers.
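Before removing anything, it is worth looking at which rows the bounds actually flag. A sketch rebuilding the sample data and bounds from above (shuffle omitted; it doesn't change which values are flagged):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.concatenate([np.random.normal(loc=50, scale=10, size=1000),
                       np.random.normal(loc=200, scale=50, size=5)])
outlier_df = pd.DataFrame({'values': data})

Q1 = outlier_df['values'].quantile(0.25)
Q3 = outlier_df['values'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Boolean mask: True marks a potential outlier
mask = (outlier_df['values'] < lower_bound) | (outlier_df['values'] > upper_bound)
print(mask.sum(), "potential outliers flagged")
print(outlier_df[mask].describe())
```

Note that the mask can flag a few of the "normal" points as well: with 1,000 draws, a handful typically falls outside 1.5 * IQR even without injected outliers, which is exactly why you should inspect before deleting.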

Removing Outliers Using Pandas


Now, the only thing left to do is to remove our outliers. Remember that you should
always think twice about removing outliers. Never remove data points simply
because they are outliers.

If you are certain that your outliers can be removed, for instance because they were
caused by measurement errors, you can go ahead and remove them from your
dataset.

1) Removing Outliers Identified by Boxplot


First, we used a boxplot to find our outliers. The plot revealed that all but 5 data
points are within a range from 0 to 100. All five outliers had values well above 150.
Therefore, we can easily remove the outliers like this:

# Remove outliers (values larger than 150)
cleaned_df = outlier_df[outlier_df['values'] <= 150]

To remove the outliers, we create a new DataFrame cleaned_df which only contains
values from outlier_df that are smaller than or equal to 150. We can confirm whether
the outlier removal was successful by plotting another boxplot:

# select values column from cleaned_df
values = cleaned_df[['values']]

# create new boxplot for values
values.boxplot()

No more outliers, success!


2) Removing Outliers Identified by z-Score and IQR
Removing outliers identified by z-score or IQR is equally straightforward. To
remove z-score outliers, drop all values whose z-scores exceed a predefined
threshold (usually 2 or 3).

# Define threshold (here: remove values with z-score > 2)
threshold = 2

# Remove outliers based on z-score and threshold
cleaned_df = outlier_df[outlier_df['z_scores'] <= threshold]

Similarly for IQR:

'''
We already defined lower and upper bounds above,
so now we only need to remove values that are:
1) lower than the lower bound
2) higher than the upper bound
'''

# Remove outliers based on lower & upper bounds
cleaned_df = outlier_df[(outlier_df['values'] >= lower_bound) &
                        (outlier_df['values'] <= upper_bound)]
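A quick way to confirm the IQR removal worked is to compare row counts and check the remaining extremes. A self-contained sketch of the full pipeline:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.concatenate([np.random.normal(loc=50, scale=10, size=1000),
                       np.random.normal(loc=200, scale=50, size=5)])
outlier_df = pd.DataFrame({'values': data})

Q1 = outlier_df['values'].quantile(0.25)
Q3 = outlier_df['values'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

cleaned_df = outlier_df[(outlier_df['values'] >= lower_bound) &
                        (outlier_df['values'] <= upper_bound)]

# Everything left lies inside the bounds, and the row count dropped
# by at least the 5 injected outliers
print(len(outlier_df) - len(cleaned_df), "rows removed")
print(round(cleaned_df['values'].max(), 2))
```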

Handling outliers is a crucial step in data analysis and is essential for producing
accurate and meaningful results. With the power of Pandas and statistical methods,
you can effectively identify and manage outliers, ensuring that your analyses are
based on reliable data. By incorporating these techniques into your data analysis
workflow, you’ll be better equipped to derive valuable insights from your datasets.
