Data Cleaning

Objectives
Develop data cleaning strategies:

● Handling missing values

● Tidying string data

● Cleaning datasets through case studies
What is Data Cleaning?
Data Analysis Workflow
What does the data analysis workflow look like?

START WITH A QUESTION
→ COLLECT & CLEAN DATA
→ EXPLORATORY DATA ANALYSIS (EDA)
→ MODELS & ALGORITHMS
→ COMMUNICATE RESULTS
How can data be messy?

1. Duplicate or unnecessary data
2. Inconsistent text and typos
3. Missing data
4. Outliers
… and more!
How can data be messy?
Most Visited US Websites (as of 2020)

rank  website           monthly_traffic
1     youtube.com       1,626,000,000
2     en.wikipedia.org  1,032,000,000
2     en.wikipedia.org  1,032,000,000
3     twitter.com       536,000,000
4     Facebook          512,000,000
5     amazon.com        492 million
6     yelp.com          ---
7     reddit.com        184,000,000
36    netflix.com       37,000,000

Every kind of messiness appears here: a duplicate row (rank 2), inconsistent text ("Facebook" instead of facebook.com), an inconsistent number format ("492 million"), a missing value (yelp.com), and an unnecessary row (rank 36 in a top-7 list).
How can data be messy?
1. Duplicate or unnecessary data

● Look for duplicates and dig into why there are multiple values
● Filter data down as appropriate (see the sketch below)
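A minimal sketch of that inspect-then-filter step, assuming the websites table above has been loaded into a pandas dataframe; the literal values here are a hypothetical reconstruction of a few rows:

import pandas as pd

# Hypothetical reconstruction of part of the websites table above
df = pd.DataFrame({
    "rank": [1, 2, 2, 3],
    "website": ["youtube.com", "en.wikipedia.org", "en.wikipedia.org", "twitter.com"],
    "monthly_traffic": [1_626_000_000, 1_032_000_000, 1_032_000_000, 536_000_000],
})

# Show every row involved in a duplication so you can dig into why it exists
print(df[df.duplicated(keep=False)])

# Once understood, filter the data down as appropriate
df = df.drop_duplicates()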
How can data be messy?
Most Visited US Websites (as of 2020), after removing the duplicate and unnecessary rows:

rank  website           monthly_traffic
1     youtube.com       1,626,000,000
2     en.wikipedia.org  1,032,000,000
3     twitter.com       536,000,000
4     Facebook          512,000,000
5     amazon.com        492 million
6     yelp.com          ---
7     reddit.com        184,000,000
How can data be messy?
2. Inconsistent text and typos

Check summary statistics for each column of data (a sketch follows):
● Minimum and maximum of numerical values
● Unique values of categoricals
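A minimal sketch of that check, assuming the websites dataframe from above with monthly_traffic already converted to a numeric type:

print(df.monthly_traffic.min(), df.monthly_traffic.max())  # extremes reveal suspect numbers
print(df.website.unique())                                 # unique values reveal typos like "Facebook"
print(df.describe(include="all"))                          # summary statistics for every column at once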
How can data be messy?
Most Visited US Websites (as of 2020), after fixing the inconsistent text and number formats:

rank  website           monthly_traffic
1     youtube.com       1,626,000,000
2     en.wikipedia.org  1,032,000,000
3     twitter.com       536,000,000
4     facebook.com      512,000,000
5     amazon.com        492,000,000
6     yelp.com          ---
7     reddit.com        184,000,000
How can data be messy?
4. Outliers

● Are distant from other observations
● May not accurately represent the real world
● Can significantly impact analysis
How can data be messy?
How to find outliers:
● Plots
● Statistics

How to deal with outliers (a sketch follows):
● Remove them
● Assign the mean or median value
● Predict the value with a model
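A sketch of finding and handling outliers with the IQR rule, one common statistics-based test. The coffee dataframe and its price_lb column from the next section are borrowed here for illustration, and the 1.5 multiplier is a convention, not a requirement:

df.price_lb.plot.box()  # plots: eyeball observations far from the rest

q1, q3 = df.price_lb.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df.price_lb < q1 - 1.5 * iqr) | (df.price_lb > q3 + 1.5 * iqr)

df_trimmed = df[~is_outlier]                           # option 1: remove them
df.loc[is_outlier, "price_lb"] = df.price_lb.median()  # option 2: assign the median value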
Handling Missing Values

Missing Values
● Unfortunately very common
● Occur for many reasons
● Detect with pandas
● Several ways to handle missings


Detecting Missing Values
Load in data about coffee; the dataset contains several missings

import pandas as pd
df = pd.read_csv("coffee.csv")
df
Detecting Missing Values
Quickly check missings with .info()

df.info()
Detecting Missing Values
Use .isna() for elementwise True/False values

df.isna()
How to Detect Missing Values
Use the .isna() result as a data mask

~df.shipping.isna()
df[~df.shipping.isna()]
Methods to Handle Missings

1. Drop rows with missing values
2. Fill missing values with a standard value such as zero
3. Impute missings with mean or median
Dropping Missing Values
Use pandas to drop with .dropna()
● Drops all rows with any missing by default
● Use subset to drop only some missings

df.dropna()
df.dropna(subset=["price_lb"])
Filling Missings with a Value
Use pandas .fillna() to fill missings
● Only fill with reasonable values!

df.shipping.fillna(0)
Imputing Missings
Use pandas .fillna() to fill missings with mean or median values

price_avg = df.price_lb.mean()
price_avg
14.105

df.price_lb.fillna(price_avg)
Detecting Missing Values

.info()
● Count of non-null values for each column

.isna()
● Boolean True/False for each element
● Can be used as a data mask
Methods to Handle Missings

.dropna()
1. Drop rows with missing values

.fillna()
2. Fill missing values with a standard value such as zero
3. Impute missings with mean or median

4. Use a model to predict missings (advanced; see the sketch below)
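A minimal sketch of the advanced option, assuming scikit-learn is installed and restricting to the numeric coffee columns; KNNImputer is one of several model-based imputers:

from sklearn.impute import KNNImputer

num_cols = ["price_lb", "shipping"]
imputer = KNNImputer(n_neighbors=3)  # predict each missing value from the most similar rows
df[num_cols] = imputer.fit_transform(df[num_cols])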
Managing Columns of Data
Rename Data Columns
Load in data about coffee

import pandas as pd
df = pd.read_csv("coffee_data.csv")
df
Rename Data Columns
None of these column names are valid Python variables
● Rename to make analysis easier

df.columns
Index(['Price per Pound', 'Shipping Price', 'Favorite?'], dtype='object')
Rename Data Columns
Pass a dictionary to the columns argument of pandas .rename()

df.rename(columns={
    'Price per Pound': 'price_lb',
    'Shipping Price': 'shipping',
    'Favorite?': 'favorite'
}, inplace=True)
df
What is the average shipping?
Why does using .mean() on the shipping column cause an error?

df.shipping.mean()
TypeError: Could not convert 3.000.001.995.490.004.002.50 to numeric

df.head(3)
What is the average shipping?
Checking .dtypes shows the shipping column contains strings

df.dtypes
Updating a Column Datatype
Convert a column’s datatype with the .astype() method

df['shipping'] = df.shipping.astype('float')
df.dtypes

df.shipping.mean()
2.4257142857142857
Dropping Columns
The favorite column contains the same value for every row

df.favorite.value_counts()

Drop unnecessary columns with pandas .drop()
● axis=0 refers to the row dimension
● axis=1 refers to the column dimension

df.drop('favorite', axis=1)
Managing Columns of Data

● Rename columns by passing an update dictionary into .rename()
● Convert a column’s datatype with .astype()
● Use .drop() and axis=1 to drop a column from the dataframe
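The whole section as one sketch, assuming the coffee_data.csv file from the slides above:

import pandas as pd

df = pd.read_csv("coffee_data.csv")
df = df.rename(columns={
    'Price per Pound': 'price_lb',
    'Shipping Price': 'shipping',
    'Favorite?': 'favorite',
})
df['shipping'] = df.shipping.astype('float')  # strings to floats
df = df.drop('favorite', axis=1)              # a constant column adds nothing
print(df.shipping.mean())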
Cleaning String Data

Analyzing Text Data
Text data is notoriously messy.
● Inconsistent text
● Typos
● Extra whitespace
● Extra characters in numerical values (e.g. commas, dollar signs)
Analyzing Text Data
Load in data about US cities

import pandas as pd
df = pd.read_csv("cities.csv")
df
Convert Column to Upper- or Lowercase
● Inconsistencies in state column
● Convert column to uppercase
● Reference string methods with .str

df.state = df.state.str.upper()
df.state
Remove Specific Characters
● Commas in population column
● Remove by replacing commas with the empty string

pop = df.population.str.replace(',', '')
pop.astype('int')
Analyzing Text Data
df

df.city.unique()
array(['Chicago ', 'Los Angeles ', 'Omaha ', 'Dallas ', 'Philadelphia ', 'Los Alamos '], dtype=object)
Removing Whitespace
Strip whitespace from the front or end of string data
● Whitespace includes spaces, tabs, newline characters, etc.

city = df.city.str.strip()
city.unique()
array(['Chicago', 'Los Angeles', 'Omaha', 'Dallas', 'Philadelphia', 'Los Alamos'], dtype=object)
Checking for Substrings
Which cities contain “Los”?
● Check elementwise with .str.contains()
● Use the result as a data mask

df.city.str.contains('Los')
df[df.city.str.contains('Los')]
Analyzing Text Data
Text data is notoriously messy.

Inconsistent text or typos
● .str.upper(), .str.lower()
● .str.replace()

Extra whitespace
● .str.strip()

Characters in numerical values
● .str.replace()

Searching for substrings
● .str.contains()
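The toolkit applied end to end, assuming the cities.csv file from the slides above:

import pandas as pd

df = pd.read_csv("cities.csv")
df.state = df.state.str.upper()                       # inconsistent text
df.population = (df.population.str.replace(',', '')   # extra characters in numbers
                 .astype('int'))
df.city = df.city.str.strip()                         # extra whitespace
print(df[df.city.str.contains('Los')])                # substring search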
Case Study #1: Exploring New Data

What should we do when exploring new data?
Ask many questions and be skeptical.
● How do these data help answer the project question?
● What kind of data is given in each column?
● Do the data contain missing values?
● What steps are necessary to make these data ready for analysis?
Stock Prices Case Study
Which Monday time saw the highest stock price in September?

What kind of data do we have?

import pandas as pd
df = pd.read_csv("wxyz.csv")
df.head()

df.shape
(308, 3)

df.info()
All columns contain strings.
Convert Strings to Numerics
Convert the price column to numerical values
● Remove dollar signs (passing regex=False makes the literal match explicit)
● Convert from strings to floats

df.price = df.price.str.replace('$', '', regex=False)
df.head()

df.price = df.price.astype('float')
df.dtypes
Stock Prices Case Study
Which Monday time saw the highest stock price in September?

df.sample(5)
Create Datetime Column
● Combine day and time columns
● Convert to datetime

df['date_time'] = df.day + ' ' + df.time
df['date_time'] = pd.to_datetime(df.date_time)
df.head()
df.dtypes
Create Day of Week Column
Use the weekday property of datetime (Monday is 0)

df['day_of_week'] = df.date_time.dt.weekday
df.sample(5)
Stock Prices Case Study
Which Monday time saw the highest stock price in September?
● Select Mondays
● Sort to find the maximum price

mondays = df[df.day_of_week == 0]
mondays.sample(5)

(mondays[['date_time', 'price']]
 .sort_values('price', ascending=False)
 .head(3))
What should we do when exploring new data?

.head(), .info(), .dtypes, .shape
● What kind of data is given in each column?
● Do the data contain missing values?

.str.replace(), pd.to_datetime()
● What steps are necessary to make data ready for analysis?

● How do these data help answer the project question?
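The full case study as one sketch, assuming the wxyz.csv file from the slides above:

import pandas as pd

df = pd.read_csv("wxyz.csv")
df.price = df.price.str.replace('$', '', regex=False).astype('float')
df['date_time'] = pd.to_datetime(df.day + ' ' + df.time)
mondays = df[df.date_time.dt.weekday == 0]  # Monday is weekday 0

print(mondays[['date_time', 'price']]
      .sort_values('price', ascending=False)
      .head(3))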
Case Study #2: Diagnosing Errors

How can we diagnose errors?
Data inconsistencies may cause errors when operating on columns.

Build custom functions to:
● Include print statements
● Add conditional statements
● Use exceptions
Circus Performers Case Study
Which type of performer is the most experienced on average?

import pandas as pd
df = pd.read_csv("circus.csv")
df.head()

df.shape
(50, 2)

df.info()
Manipulating Email Data
Create new columns for domain

str.split("jamie50@liontamer.org", "@")
['jamie50', 'liontamer.org']

str.split("jamie50@liontamer.org", "@")[1]
'liontamer.org'

df.loc[:2].email.map(
    lambda x: str.split(x, "@")[1]
)
Manipulating Email Data
Create new columns for domain

df.email.map(
    lambda x: str.split(x, "@")[1]
)
IndexError: list index out of range
Creating Custom Function
Create a custom Python function to diagnose the error

def check_for_at_symbol(email):
    if '@' not in email:
        print(email)
    return ('@' in email)

at_symbol_test = df.email.map(check_for_at_symbol)
jonathan104ringmaster.net
Creating Custom Function
Create a custom Python function to build the domain column

def get_domain(email):
    if '@' not in email:
        return None
    return str.split(email, '@')[1]

df['domain'] = df.email.map(get_domain)
df.loc[13:16]
Using Python Exceptions
Create the domain column by catching errors with an exception

def get_domain_exception(email):
    try:
        return str.split(email, '@')[1]
    except IndexError:
        print(email)
        return None

df['domain'] = df.email.map(get_domain_exception)
jonathan104ringmaster.net
Circus Performers Case Study
Which type of performer is the most experienced on average?

df.head(3)

df.shape
(50, 2)

df.email.nunique()
40
Remove Duplicate Rows
Use pandas to remove duplicate rows with .drop_duplicates()
● The entire row must be an exact match
● Use the subset argument with a column name or list to match on specific columns
(Note: .shape is an attribute, not a method, so no parentheses.)

df.drop_duplicates().shape
(40, 3)

df.drop_duplicates(subset='email').shape
(40, 3)
Circus Performers Case Study
Which type of performer is the most experienced on average?

df.drop_duplicates(inplace=True)
(df.groupby('domain')
 .performances
 .mean()
 .sort_values(ascending=False))
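The full case study as one sketch, assuming the circus.csv file from the slides above and mirroring the steps already shown:

import pandas as pd

df = pd.read_csv("circus.csv")

def get_domain(email):
    if '@' not in email:  # guard against malformed addresses
        return None
    return email.split('@')[1]

df['domain'] = df.email.map(get_domain)
df = df.drop_duplicates()

print(df.groupby('domain')
      .performances
      .mean()
      .sort_values(ascending=False))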
Case Study #3: Comparing Against Group Statistics
Penguins Case Study
Standardize the penguin masses by species and sort by standard mass ascending.

import seaborn as sns
df = sns.load_dataset("penguins")
df.head()
Penguins Case Study
What kind of data do we have?

df.shape
(344, 7)

df.info()
Handling Missing Values
What kind of missings do we have?

df[df.bill_length_mm.isna()]
df.dropna(subset=["bill_length_mm"], inplace=True)

df.shape
(342, 7)

df[df.sex.isna()]
Penguins Case Study
Standardize the penguin masses by species and sort by standard mass ascending.
Pandas Transform
Use transform to produce group aggregates for each row

df.groupby("species").body_mass_g.mean()

df["mass_species_mean"] = (df.groupby("species").body_mass_g
                           .transform(lambda x: x.mean()))
df[["species", "body_mass_g", "mass_species_mean"]].sample(5)
Standard Penguin Mass
Standardize the penguin masses by species.

df["mass_standard"] = (df.groupby("species").body_mass_g
                       .transform(lambda x: (x - x.mean()) / x.std()))
df[["species", "body_mass_g", "mass_species_mean", "mass_standard"]].sample(5)
Penguins Case Study
Standardize the penguin masses by species and sort by standard mass ascending.

df.sort_values("mass_standard").head()
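A quick sanity check, not in the original slides: within each species the standardized masses should have mean approximately 0 and standard deviation approximately 1.

print(df.groupby("species").mass_standard.agg(["mean", "std"]).round(3))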