Identifying and Handling Outliers in Pandas - A Step-By-Step Guide - by Arvid Eichner - Python in Plain English
Outliers are exceptional data points within your dataset, caused by chance,
anomalies, or even measurement errors. Effectively identifying and handling
outliers is a complex yet critical process because ignoring them can lead to biased
results. Pandas, the versatile data manipulation library in Python, provides a set of
tools for efficiently handling outliers. In this step-by-step guide, we will explore
what outliers are, how to detect them, what actions to take when handling them,
and how to leverage pandas along the way. We will look at boxplots, z-scores, and
the interquartile range (IQR) method.
Before We Start
You should NEVER remove outliers without questioning their origin. Not all outliers
are created equal, and for some types of outliers there are strong arguments against
removing them. If you are not certain that your outliers can be removed, or for a
more introductory guide on outliers using real-world examples, check out this story
first:
Okay, with that out of the way, let's see how we can identify and get rid of outliers in
your data using the Python library pandas.
'''
I use the numpy library to generate normally distributed data,
add 5 outliers, then combine everything in a new DataFrame:
'''
import numpy as np
import pandas as pd

np.random.seed(42)
# 1000 normally distributed data points (M = 50, SD = 10)
data = np.random.normal(loc=50, scale=10, size=1000)
# Introduce 5 outliers
outliers = np.random.normal(loc=200, scale=50, size=5)
data = np.concatenate([data, outliers])
outlier_df = pd.DataFrame({'values': data})
Since np.random.normal creates random data, I first set a seed using np.random.seed(42).
By setting the same seed (42) in your code, you can recreate the exact same data points,
even though they are randomly generated.
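To illustrate the effect of the seed with a minimal sketch (not part of the original snippet): resetting the seed replays the exact same "random" sequence:

```python
import numpy as np

np.random.seed(42)
first = np.random.normal(size=3)

np.random.seed(42)  # reset to the same seed
second = np.random.normal(size=3)

# Both draws are identical because the seed fixes the generator state
print(np.array_equal(first, second))  # → True
```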
In this code snippet, we create a DataFrame ’outlier_df’ with one variable called
‘values’ which contains 1000 normally distributed data points with mean (M) = 50
and standard deviation (SD) = 10. 'values' also contains 5 additional data points
that follow a normal distribution with M = 200 and SD = 50; i.e., potential outliers.
Let’s try to find them:
# create boxplot
outlier_df.boxplot()
We can see that, as expected, most values lie between 0 and 100 and the median is
around 50. This makes sense given that we created data that is normally distributed
around a mean of 50 with a standard deviation of 10.
We can also see some values — 5 to be exact — that are above 150. They appear far
from all other values, like outcasts struggling to fit in. We just identified the
outliers!
In a boxplot, the data is divided into quartiles, with the central box representing the
interquartile range (IQR) that spans the 25th to 75th percentiles. The line inside the box
marks the median (50th percentile), and the “whiskers” (T-shaped lines) represent a limit
of 1.5 * IQR.
We can also use different plots, such as histograms. Histograms are great for getting a
general feel for the distribution of a variable, and it is often a good idea to start your
data exploration by looking at one. As you will see, though, they are not as powerful for
spotting outliers, especially compared to boxplots.
# create histogram
outlier_df.hist()
Again, we can see that our data seems to be normally distributed around 50
(although it appears slightly skewed). Looking at this histogram, there don't seem to
be any outliers, though. What's up with that?
In this histogram, we can’t spot the 5 outliers since they get lost among the 1000 “normal”
data points.
That’s the tricky thing about outliers. They don’t show up in large numbers (or else
they wouldn’t be outliers). But even in low numbers they can disproportionately
affect your analysis due to their exceptionally high or low values.
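To make that concrete, here is a small sketch (not from the article) showing how just 5 extreme values out of 1005 already shift the mean:

```python
import numpy as np

np.random.seed(42)
clean = np.random.normal(loc=50, scale=10, size=1000)
# Add 5 outliers from a very different distribution (M = 200, SD = 50)
contaminated = np.concatenate([clean, np.random.normal(loc=200, scale=50, size=5)])

# Only about 0.5% of the points are outliers, yet the mean visibly shifts
print(f"clean mean: {clean.mean():.2f}, contaminated mean: {contaminated.mean():.2f}")
```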
We can combat this by increasing the number of bins for the histogram. The
hist() function groups values into bins and then plots the bin counts. We can change
the number of bins by passing a value for the bins parameter when calling the
function:
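A sketch of what that call might look like (bins=100 is an arbitrary choice for illustration; the setup repeats the DataFrame from the beginning):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.concatenate([np.random.normal(50, 10, 1000),
                       np.random.normal(200, 50, 5)])
outlier_df = pd.DataFrame({'values': data})

# More, narrower bins: the few outlier bars near 200 stand apart
outlier_df.hist(bins=100)
```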
Using a higher bin count, we can visually identify the outliers in a histogram. But as
you can see, boxplots remain far superior for visual outlier detection.
Beyond visual inspection, we can also use statistical methods. For instance, we can
calculate the so-called z-score for each data point. It measures the distance of a data
point from the mean in standard deviations. Data points with high absolute z-scores
(usually greater than 2 or 3) can be considered outliers.
We obtain the z-scores by subtracting the mean of outlier_df from each value and
dividing the result by the standard deviation.
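Sketched in code (the column name z_score is my own choice, the standard deviation is pandas' default sample standard deviation, and the setup repeats the DataFrame from the beginning; exact numbers depend on the generated data):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.concatenate([np.random.normal(50, 10, 1000),
                       np.random.normal(200, 50, 5)])
outlier_df = pd.DataFrame({'values': data})

# z-score: how many standard deviations a value lies from the mean
mean = outlier_df['values'].mean()
std = outlier_df['values'].std()
outlier_df['z_score'] = (outlier_df['values'] - mean) / std

# Flag values whose absolute z-score exceeds 3
z_outliers = outlier_df[outlier_df['z_score'].abs() > 3]
```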
Each value in our DataFrame now has its own z_score, representing how far off it is
from the rest of the data.
For example, we can see that the z-score 0.246940 of value 54.979983 in row 3 is higher
than the z-score 0.127266 of value 49.039401 in row 0. That's because the mean of
'values' is 51.0598 (the value returned by outlier_df['values'].mean()), and 54.98 is
further away from it than 49.04 is.
Next, we define upper and lower bounds. These are the bounds that define the
range of “acceptable” or “normal” values.
Usually, the lower bound is defined as Q1 - 1.5 * IQR and the upper bound as Q3 + 1.5 *
IQR. You can swap out the factor 1.5 if you want your outlier detection to be more or
less sensitive.
Now that we have the upper and lower bounds, we can identify our outliers: Data
points falling below the lower bound or above the upper bound are considered
potential outliers.
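The steps above can be sketched as follows (variable names such as lower_bound and iqr_outliers are my own; the setup repeats the DataFrame from the beginning):

```python
import numpy as np
import pandas as pd

np.random.seed(42)
data = np.concatenate([np.random.normal(50, 10, 1000),
                       np.random.normal(200, 50, 5)])
outlier_df = pd.DataFrame({'values': data})

# Quartiles and interquartile range
Q1 = outlier_df['values'].quantile(0.25)
Q3 = outlier_df['values'].quantile(0.75)
IQR = Q3 - Q1

# "Acceptable" range: everything within 1.5 * IQR of the central box
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

# Data points outside the bounds are potential outliers
iqr_outliers = outlier_df[(outlier_df['values'] < lower_bound) |
                          (outlier_df['values'] > upper_bound)]
```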
If you are certain that your outliers can be removed, for instance because they were
caused by measurement errors, you can go ahead and remove them from your
dataset.
To remove the outliers, we create a new DataFrame df_cleaned which only contains
the values from outlier_df that are smaller than or equal to 150. We can confirm
whether the outlier removal was successful by plotting another boxplot:
'''
We already defined lower and upper bounds above,
so now we only need to keep the values that fall inside them:
'''
df_cleaned = outlier_df[(outlier_df['values'] >= lower_bound) &
                        (outlier_df['values'] <= upper_bound)]
# Confirm the removal with another boxplot
df_cleaned.boxplot()
Handling outliers is a crucial step in data analysis and is essential for producing
accurate and meaningful results. With the power of Pandas and statistical methods,
you can effectively identify and manage outliers, ensuring that your analyses are
based on reliable data. By incorporating these techniques into your data analysis
workflow, you’ll be better equipped to derive valuable insights from your datasets.