5/18/2019 Cleaning Dirty Data with Pandas & Python | DevelopIntelligence Blog
Cleaning Dirty Data with Pandas & Python
Pandas is a popular Python library used for data science and
analysis. Used in conjunction with other data science toolsets
like SciPy, NumPy, and Matplotlib, a modeler can create end-
to-end analytic workflows to solve business problems.
While you can do a lot of really powerful things with Python
and data analysis, your analysis is only ever as good as your
dataset. And many datasets have missing, malformed, or
erroneous data. It's often unavoidable: anything from
incomplete reporting to technical glitches can cause "dirty"
data.
Thankfully, Pandas provides a robust library of functions to
help you clean up, sort through, and make sense of your
datasets, no matter what state they're in. For our example,
we're going to use a dataset of 5,000 movies scraped from
IMDB. It contains information on the actors, directors, budget,
and gross, as well as the IMDB rating and release year. In
practice, you'll be using much larger datasets consisting of
potentially millions of rows, but this is a good sample dataset
to start with.
www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
Unfortunately, some of the fields in this dataset aren't filled in,
and some of them have default values such as 0 or NaN (Not a
Number).
No good. Let's go through some Pandas hacks you can use to
clean up your dirty data.
Getting started
To get started with Pandas, first you will need to have it
installed. You can do so by running:
$ pip install pandas
Then we need to load the data we downloaded into Pandas.
You can do this with a few Python commands:
import pandas as pd
data = pd.read_csv('movie_metadata.csv')
Make sure you have your movie dataset in the same folder as
you’re running the Python script. If you have it stored
elsewhere, you’ll need to change the read_csv parameter to
point to the file's location.
Look at your data
To check out the basic structure of the data we just read in,
you can use the head() command to print out the first five
rows. That should give you a general idea of the structure of
the dataset.
data.head()
When we look at the dataset either in Pandas or in a more
traditional program like Excel, we can start to note down the
problems, and then we'll come up with solutions to fix those
problems.
Pandas has some selection methods which you can use to
slice and dice the dataset based on your queries. Let's go
through some quick examples before moving on:
Look at some basic stats for the 'imdb_score' column: data.imdb_score.describe()
Select a column: data['movie_title']
Select the first 10 rows of a column: data['duration'][:10]
Select multiple columns: data[['budget','gross']]
Select all movies over two hours long: data[data['duration'] > 120]
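If you want to try these selection methods without the full IMDB file, a tiny hand-made DataFrame works just as well (the rows and values below are invented for illustration):

```python
import pandas as pd

# A small stand-in for the movie dataset (values invented for illustration)
data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta', 'Gamma'],
    'duration': [95, 142, 130],
    'imdb_score': [6.1, 7.8, 8.2],
    'budget': [1_000_000, 5_000_000, 2_500_000],
    'gross': [2_000_000, 4_000_000, 9_000_000],
})

stats = data.imdb_score.describe()          # count, mean, std, min, quartiles, max
titles = data['movie_title']                # a single column (a Series)
pair = data[['budget', 'gross']]            # multiple columns (a DataFrame)
long_movies = data[data['duration'] > 120]  # boolean-mask filtering
```

Here long_movies keeps only the rows where the mask is True: Beta and Gamma in this toy data.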
Deal with missing data
One of the most common problems is missing data. This could
be because it was never filled out properly, the data wasn't
available, or there was a computing error. Whatever the
reason, if we leave the blank values in there, it will cause
errors in analysis later on. There are a couple of ways to deal
with missing data:
Add in a default value for the missing data
Get rid of (delete) the rows that have missing data
Get rid of (delete) the columns that have a high incidence
of missing data
We’ll go through each of those in turn.
Add default values
First of all, we should probably get rid of all those nasty NaN
values. But what to put in their place? Well, this is where you're
going to have to eyeball the data a little bit. For our example,
let’s look at the ‘country’ column. It’s straightforward enough,
but some of the movies don’t have a country provided so the
data shows up as NaN. In this case, we probably don’t want to
assume the country, so we can replace it with an empty string
or some other default value.
data.country = data.country.fillna('')
This replaces the NaN entries in the ‘country’ column with the
empty string, but we could just as easily tell it to replace with a
default name such as "None Given". You can find more
information on fillna() in the Pandas documentation.
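Here's a self-contained sketch of the same fix on a toy frame (the country values are invented), using an explicit placeholder instead of the empty string:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta', 'Gamma'],
    'country': ['USA', np.nan, 'UK'],
})

# Before: one missing country
missing_before = data['country'].isna().sum()

# Replace NaN with a visible default value
data.country = data.country.fillna('None Given')
missing_after = data['country'].isna().sum()
```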
With numerical data like the duration of the movie, a
calculation like taking the mean duration can help us even the
dataset out. It's not a great measure, but it's an estimate of
what the duration could be based on the other data. That way
we don't have crazy numbers like 0 or NaN throwing off our
analysis.
data.duration =
data.duration.fillna(data.duration.mean())
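To see the mean fill in action on a small invented sample:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'duration': [90.0, np.nan, 110.0]})

# The mean is computed over the non-null values only: (90 + 110) / 2 = 100
data.duration = data.duration.fillna(data.duration.mean())
```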
Remove incomplete rows
Let’s say we want to get rid of any rows that have a missing
value. It’s a pretty aggressive technique, but there may be a
use case where that’s exactly what you want to do.
Dropping all rows with any NA values is easy:
data.dropna()
Note that, like most Pandas operations, dropna() returns a
new DataFrame rather than modifying the original in place, so
assign the result if you want to keep it.
Of course, we can also drop rows that have all NA values:
data.dropna(how='all')
We can also put a limitation on how many non-null values
need to be in a row in order to keep it (in this example, the
data needs to have at least 5 non-null values):
data.dropna(thresh=5)
Let’s say for instance that we don’t want to include any movie
that doesn’t have information on when the movie came out:
data.dropna(subset=['title_year'])
The subset parameter allows you to choose which columns
you want to look at. You can also pass it a list of column
names here.
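A quick runnable comparison of these row-level options, on invented data with different amounts of missingness per row:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta', None, 'Delta'],
    'duration': [95.0, np.nan, np.nan, 120.0],
    'title_year': [2001.0, 2005.0, np.nan, np.nan],
})

any_na = data.dropna()                         # drops every row containing a NaN
all_na = data.dropna(how='all')                # drops only rows that are entirely NaN
at_least_2 = data.dropna(thresh=2)             # keeps rows with >= 2 non-null values
has_year = data.dropna(subset=['title_year'])  # keeps rows where title_year is present
```

Only the first row is fully populated, so dropna() keeps just that one; the third row is entirely empty, so how='all' drops only it.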
Deal with error-prone columns
We can apply the same kind of criteria to our columns. We just
need to use the parameter axis=1 in our code. That means to
operate on columns, not rows. (We could have used axis=0 in
our row examples, but it is 0 by default if you don’t enter
anything.)
Drop the columns that are all NA values:
data.dropna(axis=1, how='all')
Drop all columns with any NA values:
data.dropna(axis=1, how='any')
The same threshold and subset parameters from above apply
as well. For more information and examples, visit the Pandas
documentation.
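The same idea column-wise, on invented data where one column is entirely empty and another is partially empty:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta'],
    'duration': [95.0, np.nan],        # partially empty column
    'aspect_ratio': [np.nan, np.nan],  # entirely empty column
})

no_empty_cols = data.dropna(axis=1, how='all')  # drops only aspect_ratio
no_na_cols = data.dropna(axis=1, how='any')     # drops duration and aspect_ratio
```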
Normalize data types
Sometimes, especially when you're reading in a CSV with a
bunch of numbers, some of the numbers will read in as strings
instead of numeric values, or vice versa. Here's a way you can
fix that and normalize your data types:
data = pd.read_csv('movie_metadata.csv', dtype=
{'duration': int})
This tells Pandas that the column 'duration' needs to be an
integer value. Similarly, if we want the release year to be a
string and not a number, we can do the same kind of thing:
data = pd.read_csv('movie_metadata.csv', dtype=
{'title_year': str})
Keep in mind that this reads the CSV from disk again, so
make sure you either normalize your data types first or dump
your intermediary results to a file before doing so.
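If you'd rather not re-read the file, you can convert types on the DataFrame you already have. astype() and pd.to_numeric() are the usual tools; note that a column containing NaN can't be cast to a plain int, so the sketch below (on invented values) coerces bad entries to NaN and fills them before casting:

```python
import pandas as pd

data = pd.DataFrame({
    'duration': ['95', '142', 'bad'],  # numbers that arrived as strings
    'title_year': [2001, 2005, 2009],
})

# Coerce unparseable strings to NaN instead of raising, then fill and cast to int
data['duration'] = pd.to_numeric(data['duration'], errors='coerce')
data['duration'] = data['duration'].fillna(data['duration'].mean()).astype(int)

# And the other direction: turn a numeric column into strings
data['title_year'] = data['title_year'].astype(str)
```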
Change casing
Columns with user-provided data are ripe for corruption.
People make typos, leave their caps lock on (or off), and add
extra spaces where they shouldn't.
To change all our movie titles to uppercase (remember to
assign the result, since the .str methods return a new Series):
data['movie_title'] = data['movie_title'].str.upper()
Similarly, to get rid of trailing whitespace:
data['movie_title'] = data['movie_title'].str.strip()
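The .str accessor methods chain nicely, so both cleanups can happen in one pass; a runnable sketch on invented titles:

```python
import pandas as pd

data = pd.DataFrame({'movie_title': ['  the matrix ', 'Avatar  ', 'up']})

# Strip stray whitespace, then normalize casing, in one chained expression
data['movie_title'] = data['movie_title'].str.strip().str.upper()
```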
We won’t be able to cover correcting spelling mistakes in this
tutorial, but you can read up on fuzzy matching for more
information.
Rename columns
Finally, if your data was generated by a computer program, it
probably has some computer-generated column names, too.
Those can be hard to read and understand while working, so if
you want to rename a column to something more user-
friendly, you can do it like this:
data.rename(columns = {'title_year':'release_date',
'movie_facebook_likes':'facebook_likes'})
Here we’ve renamed ‘title_year’ to ‘release_date’ and
‘movie_facebook_likes’ to simply ‘facebook_likes’. Since this is
not an in-place operation, you’ll need to save the DataFrame
by assigning it to a variable.
data = data.rename(columns =
{'title_year':'release_date',
'movie_facebook_likes':'facebook_likes'})
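A minimal check on toy data that the assigned version really changes the column labels:

```python
import pandas as pd

data = pd.DataFrame({
    'title_year': [2001, 2005],
    'movie_facebook_likes': [120, 45],
})

# rename() returns a new DataFrame, so capture the result
data = data.rename(columns={'title_year': 'release_date',
                            'movie_facebook_likes': 'facebook_likes'})
```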
Save your results
When you’re done cleaning your data, you may want to export
it back into CSV format for further processing in another
program. This is easy to do in Pandas:
data.to_csv('cleanfile.csv', encoding='utf-8')
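One detail worth knowing: to_csv() writes the row index as an extra column by default, which you usually don't want when the file will be read back later. A sketch using a temporary file and invented data:

```python
import os
import tempfile

import pandas as pd

data = pd.DataFrame({'movie_title': ['Alpha'], 'duration': [95]})

# Write without the index column so the file round-trips cleanly
path = os.path.join(tempfile.mkdtemp(), 'cleanfile.csv')
data.to_csv(path, encoding='utf-8', index=False)

reloaded = pd.read_csv(path)
```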
More resources
Of course, this is only the tip of the iceberg. With variations in
user environments, languages, and user input, there are many
ways that a potential dataset may be dirty or corrupted. At this
point you should have learned some of the most common
ways to clean your dataset with Pandas and Python.
For more resources on Pandas and data cleaning, see these
additional resources:
Pandas documentation
Messy Data Tutorial
Kaggle Datasets
Python for Data Analysis (“The Pandas Book”)
Al Nelson
Al is a geek about all things tech. He's a professional technical
writer and software developer who loves writing for tech
businesses and cultivating happy users. You can find him on
the web at http://www.alnelsonwrites.com or on Twitter as
@musegarden.
August 10th, 2017 | Python, Uncategorized
Copyright © 2017, Develop Intelligence | All Rights Reserved.