5/18/2019 Cleaning Dirty Data with Pandas & Python | DevelopIntelligence Blog
Cleaning Dirty Data with Pandas & Python
Pandas is a popular Python library used for data science and
analysis. Used in conjunction with other data science toolsets
like SciPy, NumPy, and Matplotlib, a modeler can create end-
to-end analytic workflows to solve business problems.
While you can do a lot of really powerful things with Python
and data analysis, your analysis is only ever as good as your
dataset. And many datasets have missing, malformed, or
erroneous data. It's often unavoidable: anything from
incomplete reporting to technical glitches can cause "dirty"
data.
Thankfully, Pandas provides a robust library of functions to
help you clean up, sort through, and make sense of your
datasets, no matter what state they're in. For our example,
we're going to use a dataset of 5,000 movies scraped from
IMDB. It contains information on the actors, directors, budget,
and gross, as well as the IMDB rating and release year. In
practice, you'll be using much larger datasets consisting of
potentially millions of rows, but this is a good sample dataset
to start with.
www.developintelligence.com/blog/2017/08/data-cleaning-pandas-python/
Unfortunately, some of the fields in this dataset aren't filled in,
and some of them have default values such as 0 or NaN (Not a
Number).
No good. Let's go through some Pandas hacks you can use to
clean up your dirty data.
Getting started
To get started with Pandas, first you will need to have it
installed. You can do so by running:
$ pip install pandas
Then we need to load the data we downloaded into Pandas.
You can do this with a few Python commands:
import pandas as pd
data = pd.read_csv('movie_metadata.csv')
Make sure you have your movie dataset in the same folder as
you’re running the Python script. If you have it stored
elsewhere, you’ll need to change the read_csv parameter to
point to the file's location.
Look at your data
To check out the basic structure of the data we just read in,
you can use the head() command to print out the first five
rows. That should give you a general idea of the structure of
the dataset.
data.head()
When we look at the dataset either in Pandas or in a more
traditional program like Excel, we can start to note down the
problems, and then we'll come up with solutions to fix those
problems.
Pandas has some selection methods which you can use to
slice and dice the dataset based on your queries. Let's go
through some quick examples before moving on:
Look at some basic stats for the 'imdb_score' column: data.imdb_score.describe()
Select a column: data['movie_title']
Select the first 10 rows of a column: data['duration'][:10]
Select multiple columns: data[['budget','gross']]
Select all movies over two hours long: data[data['duration'] > 120]
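If you want to try these selection methods without the full IMDB file, a tiny hand-made DataFrame works just as well (the rows and values below are invented for illustration):

```python
import pandas as pd

# A small stand-in for the movie dataset (values invented for illustration)
data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta', 'Gamma'],
    'duration': [95, 142, 130],
    'imdb_score': [6.1, 7.8, 8.2],
    'budget': [1_000_000, 5_000_000, 2_500_000],
    'gross': [2_000_000, 4_000_000, 9_000_000],
})

stats = data.imdb_score.describe()          # count, mean, std, min, quartiles, max
titles = data['movie_title']                # a single column (a Series)
pair = data[['budget', 'gross']]            # multiple columns (a DataFrame)
long_movies = data[data['duration'] > 120]  # boolean-mask filtering
```

Here long_movies keeps only the rows where the mask is True: Beta and Gamma in this toy data.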
Deal with missing data
One of the most common problems is missing data. This could
be because it was never filled out properly, the data wasn't
available, or there was a computing error. Whatever the
reason, if we leave the blank values in there, it will cause
errors in analysis later on. There are a couple of ways to deal
with missing data:
Add in a default value for the missing data
Get rid of (delete) the rows that have missing data
Get rid of (delete) the columns that have a high incidence
of missing data
We’ll go through each of those in turn.
Add default values
First of all, we should probably get rid of all those nasty NaN
values. But what to put in their place? Well, this is where you're
going to have to eyeball the data a little bit. For our example,
let’s look at the ‘country’ column. It’s straightforward enough,
but some of the movies don’t have a country provided so the
data shows up as NaN. In this case, we probably don’t want to
assume the country, so we can replace it with an empty string
or some other default value.
data.country = data.country.fillna('')
This replaces the NaN entries in the ‘country’ column with the
empty string, but we could just as easily tell it to replace with a
default name such as "None Given". You can find more
information on fillna() in the Pandas documentation.
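Here's a self-contained sketch of the same fix on a toy frame (the country values are invented), using an explicit placeholder instead of the empty string:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta', 'Gamma'],
    'country': ['USA', np.nan, 'UK'],
})

# Before: one missing country
missing_before = data['country'].isna().sum()

# Replace NaN with a visible default value
data.country = data.country.fillna('None Given')
missing_after = data['country'].isna().sum()
```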
With numerical data like the duration of the movie, a
calculation like taking the mean duration can help us even the
dataset out. It's not a great measure, but it's an estimate of
what the duration could be based on the other data. That way
we don't have crazy numbers like 0 or NaN throwing off our
analysis.
data.duration =
data.duration.fillna(data.duration.mean())
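To see the mean fill in action on a small invented sample:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({'duration': [90.0, np.nan, 110.0]})

# The mean is computed over the non-null values only: (90 + 110) / 2 = 100
data.duration = data.duration.fillna(data.duration.mean())
```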
Remove incomplete rows
Let’s say we want to get rid of any rows that have a missing
value. It’s a pretty aggressive technique, but there may be a
use case where that’s exactly what you want to do.
Dropping all rows with any NA values is easy:
data.dropna()
Note that, like most Pandas operations, dropna() returns a
new DataFrame rather than modifying the original in place, so
assign the result if you want to keep it.
Of course, we can also drop rows that have all NA values:
data.dropna(how='all')
We can also put a limitation on how many non-null values
need to be in a row in order to keep it (in this example, the
data needs to have at least 5 non-null values):
data.dropna(thresh=5)
Let’s say for instance that we don’t want to include any movie
that doesn’t have information on when the movie came out:
data.dropna(subset=['title_year'])
The subset parameter allows you to choose which columns
you want to look at. You can also pass it a list of column
names here.
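A quick runnable comparison of these row-level options, on invented data with different amounts of missingness per row:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta', None, 'Delta'],
    'duration': [95.0, np.nan, np.nan, 120.0],
    'title_year': [2001.0, 2005.0, np.nan, np.nan],
})

any_na = data.dropna()                         # drops every row containing a NaN
all_na = data.dropna(how='all')                # drops only rows that are entirely NaN
at_least_2 = data.dropna(thresh=2)             # keeps rows with >= 2 non-null values
has_year = data.dropna(subset=['title_year'])  # keeps rows where title_year is present
```

Only the first row is fully populated, so dropna() keeps just that one; the third row is entirely empty, so how='all' drops only it.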
Deal with error-prone columns
We can apply the same kind of criteria to our columns. We just
need to use the parameter axis=1 in our code. That means to
operate on columns, not rows. (We could have used axis=0 in
our row examples, but it is 0 by default if you don’t enter
anything.)
Drop the columns that are all NA values:
data.dropna(axis=1, how='all')
Drop all columns with any NA values:
data.dropna(axis=1, how='any')
The same threshold and subset parameters from above apply
as well. For more information and examples, visit the Pandas
documentation.
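The same idea column-wise, on invented data where one column is entirely empty and another is partially empty:

```python
import pandas as pd
import numpy as np

data = pd.DataFrame({
    'movie_title': ['Alpha', 'Beta'],
    'duration': [95.0, np.nan],        # partially empty column
    'aspect_ratio': [np.nan, np.nan],  # entirely empty column
})

no_empty_cols = data.dropna(axis=1, how='all')  # drops only aspect_ratio
no_na_cols = data.dropna(axis=1, how='any')     # drops duration and aspect_ratio
```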
Normalize data types
Sometimes, especially when you're reading in a CSV with a
bunch of numbers, some of the numbers will read in as strings
instead of numeric values, or vice versa. Here's a way you can
fix that and normalize your data types:
data = pd.read_csv('movie_metadata.csv', dtype=
{'duration': int})
This tells Pandas that the column 'duration' needs to be an
integer value. Similarly, if we want the release year to be a
string and not a number, we can do the same kind of thing:
data = pd.read_csv('movie_metadata.csv', dtype=
{'title_year': str})
Keep in mind that this reads the CSV from disk again, so
make sure you either normalize your data types first or dump
your intermediary results to a file before doing so.
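If you'd rather not re-read the file, you can convert types on the DataFrame you already have. astype() and pd.to_numeric() are the usual tools; note that a column containing NaN can't be cast to a plain int, so the sketch below (on invented values) coerces bad entries to NaN and fills them before casting:

```python
import pandas as pd

data = pd.DataFrame({
    'duration': ['95', '142', 'bad'],  # numbers that arrived as strings
    'title_year': [2001, 2005, 2009],
})

# Coerce unparseable strings to NaN instead of raising, then fill and cast to int
data['duration'] = pd.to_numeric(data['duration'], errors='coerce')
data['duration'] = data['duration'].fillna(data['duration'].mean()).astype(int)

# And the other direction: turn a numeric column into strings
data['title_year'] = data['title_year'].astype(str)
```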
Change casing
Columns with user-provided data are ripe for corruption.
People make typos, leave their caps lock on (or off), and add
extra spaces where they shouldn't.
To change all our movie titles to uppercase (remember to
assign the result, since the .str methods return a new Series):
data['movie_title'] = data['movie_title'].str.upper()
Similarly, to get rid of trailing whitespace:
data['movie_title'] = data['movie_title'].str.strip()
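The .str accessor methods chain nicely, so both cleanups can happen in one pass; a runnable sketch on invented titles:

```python
import pandas as pd

data = pd.DataFrame({'movie_title': ['  the matrix ', 'Avatar  ', 'up']})

# Strip stray whitespace, then normalize casing, in one chained expression
data['movie_title'] = data['movie_title'].str.strip().str.upper()
```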
We won’t be able to cover correcting spelling mistakes in this
tutorial, but you can read up on fuzzy matching for more
information.
Rename columns
Finally, if your data was generated by a computer program, it
probably has some computer-generated column names, too.
Those can be hard to read and understand while working, so if
you want to rename a column to something more user-
friendly, you can do it like this:
data.rename(columns = {'title_year':'release_date',
'movie_facebook_likes':'facebook_likes'})
Here we’ve renamed ‘title_year’ to ‘release_date’ and
‘movie_facebook_likes’ to simply ‘facebook_likes’. Since this is
not an in-place operation, you’ll need to save the DataFrame
by assigning it to a variable.
data = data.rename(columns =
{'title_year':'release_date',
'movie_facebook_likes':'facebook_likes'})
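A minimal check on toy data that the assigned version really changes the column labels:

```python
import pandas as pd

data = pd.DataFrame({
    'title_year': [2001, 2005],
    'movie_facebook_likes': [120, 45],
})

# rename() returns a new DataFrame, so capture the result
data = data.rename(columns={'title_year': 'release_date',
                            'movie_facebook_likes': 'facebook_likes'})
```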
Save your results
When you’re done cleaning your data, you may want to export
it back into CSV format for further processing in another
program. This is easy to do in Pandas:
data.to_csv('cleanfile.csv', encoding='utf-8')
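One detail worth knowing: to_csv() writes the row index as an extra column by default, which you usually don't want when the file will be read back later. A sketch using a temporary file and invented data:

```python
import os
import tempfile

import pandas as pd

data = pd.DataFrame({'movie_title': ['Alpha'], 'duration': [95]})

# Write without the index column so the file round-trips cleanly
path = os.path.join(tempfile.mkdtemp(), 'cleanfile.csv')
data.to_csv(path, encoding='utf-8', index=False)

reloaded = pd.read_csv(path)
```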
More resources
Of course, this is only the tip of the iceberg. With variations in
user environments, languages, and user input, there are many
ways that a potential dataset may be dirty or corrupted. At this
point you should have learned some of the most common
ways to clean your dataset with Pandas and Python.
For more resources on Pandas and data cleaning, see these
additional resources:
Pandas documentation
Messy Data Tutorial
Kaggle Datasets
Python for Data Analysis (“The Pandas Book”)
Al Nelson
Al is a geek about all things tech. He's a professional technical
writer and software developer who loves writing for tech
businesses and cultivating happy users. You can find him on
the web at http://www.alnelsonwrites.com or on Twitter as
@musegarden.
August 10th, 2017 | Python, Uncategorized
Copyright © 2017, Develop Intelligence | All Rights Reserved.