Data Cleaning
Objectives
Develop data cleaning strategies.
Data Analysis Workflow
What does the data analysis workflow look like?
START WITH A QUESTION → COLLECT & CLEAN DATA → EXPLORATORY DATA ANALYSIS (EDA)
How can data be messy?
1. Duplicate or unnecessary data
2. Inconsistent text and typos
3. Missing data
4. Outliers
… and more!
[Table: Most Visited US Websites (as of 2020) — excerpt: 36, netflix.com, 37,000,000]
How can data be messy?
1. Duplicate or unnecessary data
● Look for duplicates and dig into why there are multiple values (see the sketch below)
● Filter data down as appropriate
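Where the bullet says "see the sketch below": a minimal sketch of checking for duplicates in pandas. The sites DataFrame and its website/visits columns are hypothetical, not from the slides.

import pandas as pd

# Hypothetical example data with one exact repeat
sites = pd.DataFrame({
    "website": ["netflix.com", "netflix.com", "reddit.com"],
    "visits": [37_000_000, 37_000_000, 184_000_000],
})

sites.duplicated()               # True for rows that repeat an earlier row
sites[sites.duplicated(keep=False)]  # show every copy before deciding what to drop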
How can data be messy?
[Table: Most Visited US Websites (as of 2020) — excerpt: 6, yelp.com, ---; 7, reddit.com, 184,000,000]
How can data be messy?
Check summary statistics for each column of data (see the sketch below):
● Minimum and maximum of numerical values
● Unique values of categoricals
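A minimal sketch of these checks, reusing the hypothetical sites DataFrame from the sketch above:

sites.describe()                 # count, min, max, etc. for numerical columns
sites["website"].unique()        # unique values of a categorical column
sites["website"].value_counts()  # how often each value appears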
How can data be messy?
4. Outliers
● Are distant from other observations
How can data be messy?
How to find outliers:
● Plots
● Statistics
How to deal with outliers (see the sketch below):
● Remove them
● Assign mean or median value
● Predict value with model
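A minimal sketch of the statistics approach, using the common 1.5 × IQR rule on the hypothetical sites data from above; the rule and column names are illustrative assumptions, not from the slides.

visits = sites["visits"]  # hypothetical numerical column
q1, q3 = visits.quantile(0.25), visits.quantile(0.75)
iqr = q3 - q1

# Values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as outliers
outlier_mask = (visits < q1 - 1.5 * iqr) | (visits > q3 + 1.5 * iqr)

sites[~outlier_mask]                        # option 1: remove them
visits.mask(outlier_mask, visits.median())  # option 2: assign the median value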
How can data be messy?
1. Duplicate or unnecessary data
2. Inconsistent text and typos
3. Missing data
4. Outliers
… and more!
Data Analysis Workflow
What does the data analysis workflow look like?
EXPLORATORY DATA ANALYSIS (EDA)
Detecting Missing Values
The dataset contains several missing values.
import pandas as pd
df = pd.read_csv("coffee.csv")
df
Detecting Missing Values
Quickly check for missing values with .info()
df.info()
Detecting Missing Values
Use .isna() for elementwise True/False values
df.isna()
How to Detect Missing Values
Use the .isna() result as a data mask
~df.shipping.isna()
df[~df.shipping.isna()]
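One more standard use of the .isna() result, not shown on the slide: summing the Boolean values counts the missings in each column.

df.isna().sum()  # True counts as 1, giving missing counts per column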
Methods to Handle Missings
df.dropna()
Dropping Missing Values
Use pandas .dropna() to drop missing values
● Drops all rows with any missing value by default
● Use subset to drop based on specific columns only
df.dropna(subset=["price_lb"])
Filling Missings with a Value
Use pandas .fillna() to fill missing values
● Only fill with reasonable values!
df.shipping.fillna(0)
Imputing Missings
Use pandas .fillna() to fill missing values with the mean or median
price_avg = df.price_lb.mean()
price_avg
14.105
df.price_lb.fillna(price_avg)
Detecting Missing Values
.info()
● Count of non-null values for each column
.isna()
● Boolean True/False for each element
● Can be used as a data mask
Methods to Handle Missings
.dropna()
1. Drop rows with missing values
.fillna()
2. Fill missing values with a standard value such as zero
3. Impute missings with the mean or median
(see the sketch below)
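Where the summary says "see the sketch below": the three options side by side on a hypothetical toy Series.

import pandas as pd

s = pd.Series([1.0, None, 3.0])  # hypothetical data with one missing value

s.dropna()          # 1. drop the row with the missing value
s.fillna(0)         # 2. fill with a standard value such as zero
s.fillna(s.mean())  # 3. impute with the mean (here 2.0)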
import pandas as pd
df = pd.read_csv("coffee_data.csv")
df
Rename Data Columns
None of these column names are valid Python identifiers
● Rename to make analysis easier
df.columns
df.rename(columns={
    'Price per Pound': 'price_lb',
    'Shipping Price': 'shipping',
    'Favorite?': 'favorite'
}, inplace=True)
Rename Data Columns
Pass a dictionary to the columns argument of pandas .rename()
df
What is the average shipping?
Why does using .mean() on the shipping column cause an error?
df.shipping.mean()
df.head(3)
What is the average shipping?
Checking .dtypes shows the
shipping column contains strings
df.dtypes
Updating a Column Datatype
Convert a column's datatype with the .astype() method
df['shipping'] = df.shipping.astype('float')
df.dtypes
df.shipping.mean()
2.4257142857142857
Dropping Columns
The favorite column contains the
same value for every row
df.favorite.value_counts()
Dropping Columns
Drop unnecessary columns with pandas .drop()
● axis=0 refers to the row dimension
● axis=1 refers to the column dimension
df.drop('favorite', axis=1)
Managing Columns of Data
● Inconsistent text
● Typos
● Extra whitespace
import pandas as pd
df = pd.read_csv("cities.csv")
df
Convert Column to Upper- or Lowercase
● Inconsistencies in the state column
● Convert the column to uppercase
● Reference string methods with .str
df.state = df.state.str.upper()
df.state
Remove Specific Characters
df
df.city.unique()
Removing Whitespace
Strip whitespace from the front or end of string data
● Whitespace includes spaces, tabs, newline characters, etc.
city = df.city.str.strip()
city.unique()
Checking for Substrings
Which cities contain "Los"?
● Check elementwise with .str.contains()
● Use the result as a data mask
df.city.str.contains('Los')
df[df.city.str.contains('Los')]
Analyzing Text Data
Text data is notoriously messy.
Inconsistent text or typos
● .str.upper(), .str.lower()
● .str.replace()
Extra whitespace
● .str.strip()
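The summary lists .str.replace() without an example; a minimal sketch, assuming hypothetical stray punctuation and a typo in the data.

# Hypothetical cleanup: strip stray periods, then fix a known typo
df.state = df.state.str.replace('.', '', regex=False)
df.city = df.city.str.replace('Los Angles', 'Los Angeles', regex=False)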
import pandas as pd
df = pd.read_csv("wxyz.csv")
df.head()
Stock Prices Case Study
What kind of data do we have?
df.shape
(308, 3)
df.info()
All columns contain strings
Convert Strings to Numerics
Convert the price column to numerical values
df.head()
● Remove the dollar signs, then convert
df.price = df.price.str.replace('$', '', regex=False).astype('float')
df.dtypes
Stock Prices Case Study
Which Monday time saw the
highest stock price in September?
df.sample(5)
Create Datetime Column
● Combine the day and time columns (see the sketch below)
df['day_of_week'] = df.date_time.dt.weekday
df.sample(5)
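The combine step itself is not visible in the transcript; a minimal sketch of one way to build date_time with pd.to_datetime(), assuming the raw columns are strings named date and time (hypothetical names).

import pandas as pd

# Hypothetical reconstruction: concatenate the date and time strings,
# then parse the result into a single datetime64 column
df['date_time'] = pd.to_datetime(df['date'] + ' ' + df['time'])
df['day_of_week'] = df.date_time.dt.weekday  # Monday == 0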
Stock Prices Case Study
Which Monday time saw the highest stock price in September?
● Select Mondays
mondays = df[df.day_of_week == 0]
mondays.sample(5)
● Sort to find the maximum price
(mondays[['date_time', 'price']]
    .sort_values('price', ascending=False)
    .head(3))
What should we do when exploring new data?
● What steps are necessary to make the data ready for analysis? e.g., .replace(), .to_datetime()
● Use exceptions to handle unexpected values
Circus Performers Case Study
Which type of performer is the
most experienced on average?
import pandas as pd
df = pd.read_csv("circus.csv")
df.head()
df.shape
(50, 2)
df.info()
Manipulating Email Data
Create new columns for the domain
str.split("jamie50@liontamer.org", "@")
['jamie50', 'liontamer.org']
str.split("jamie50@liontamer.org", "@")[1]
'liontamer.org'
Manipulating Email Data
Create new columns for the domain
df.loc[:2].email.map(
    lambda x: str.split(x, "@")[1]
)
df.email.map(
    lambda x: str.split(x, "@")[1]
)
def check_for_at_symbol(email):
    if '@' not in email:
        print(email)
    return ('@' in email)

at_symbol_test = df.email.map(check_for_at_symbol)
jonathan104ringmaster.net
Creating Custom Function
Create a custom Python function to create the domain column

def get_domain(email):
    if '@' not in email:
        return None
    return str.split(email, '@')[1]

df['domain'] = df.email.map(get_domain)
df.loc[13:16]
Using Python Exceptions
Create the domain column by catching errors with an exception

def get_domain_exception(email):
    try:
        return str.split(email, '@')[1]
    except IndexError:
        print(email)
        return None

df['domain'] = df.email.map(get_domain_exception)
jonathan104ringmaster.net
Circus Performers Case Study
Which type of performer is the
most experienced on average?
df.head(3)
df.shape
(50, 2)
df.email.nunique()
40
Remove Duplicate Rows
Use pandas .drop_duplicates() to remove duplicate rows
● The entire row must be an exact match
● Use the subset argument with a column name or list to match on specific columns
df.drop_duplicates().shape
(40, 3)
df.drop_duplicates(subset='email').shape
(40, 3)
Circus Performers Case Study
Which type of performer is the most experienced on average?
df.drop_duplicates(inplace=True)
(df.groupby('domain')
    .performances
    .mean()
    .sort_values(ascending=False))
Case Study #3: Comparing Against Group Statistics
Penguins Case Study
Standardize the penguin masses
by species and sort by standard
mass ascending.
df.shape
(344, 7)
df.info()
Handling Missing Values
What kind of missings do we have?
df[df.bill_length_mm.isna()]
df.dropna(subset=["bill_length_mm"], inplace=True)
df.shape
(342, 7)
Handling Missing Values
What kind of missings do we have?
df[df.sex.isna()]
Penguins Case Study
Standardize the penguin masses
by species and sort by standard
mass ascending.
Pandas Transform
Use transform to produce group aggregates for each row
df.groupby("species").body_mass_g.mean()
df["mass_species_mean"] = (df.groupby("species").body_mass_g
    .transform(lambda x: x.mean()))
df[["species", "body_mass_g", "mass_species_mean"]].sample(5)
Standard Penguin Mass
Standardize the penguin masses by species.
df["mass_standard"] = (df.groupby("species").body_mass_g
    .transform(lambda x: (x - x.mean()) / x.std()))
df[["species", "body_mass_g", "mass_species_mean", "mass_standard"]].sample(5)
Penguins Case Study
Standardize the penguin masses
by species and sort by standard
mass ascending.
df.sort_values("mass_standard").head()