Cleaning Data in Python

The document outlines a course on cleaning data in Python, detailing common data problems, data type constraints, and methods for handling issues like duplicates and out-of-range values. It includes practical examples using Python code to demonstrate how to clean data effectively, including converting data types and managing categorical data. The course emphasizes the importance of data cleaning to ensure accurate analysis and decision-making.


Data type constraints

Adel Nehme
Content Developer @ DataCamp

Course outline

Chapter 1 - Common data problems

Why do we need to clean data?

Garbage in, garbage out.

Data type constraints

Datatype     Example                              Python data type
Text data    First name, last name, address ...   str
Integers     # Subscribers, # products sold ...   int
Decimals     Temperature, $ exchange rates ...    float
Binary       Is married, new customer, yes/no ... bool
Dates        Order dates, ship dates ...          datetime
Categories   Marriage status, gender ...          category

Strings to integers

# Import CSV file and output header
sales = pd.read_csv('sales.csv')
sales.head(2)

   SalesOrderID  Revenue  Quantity
0         43659   23153$        12
1         43660    1457$         2

# Get data types of columns
sales.dtypes

SalesOrderID     int64
Revenue         object
Quantity         int64
dtype: object

String to integers

# Get DataFrame information
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31465 entries, 0 to 31464
Data columns (total 3 columns):
SalesOrderID    31465 non-null int64
Revenue         31465 non-null object
Quantity        31465 non-null int64
dtypes: int64(2), object(1)
memory usage: 737.5+ KB

# Print sum of the Revenue column
sales['Revenue'].sum()

'23153$1457$36865$32474$472$27510$16158$5694$6876$40487$807$6893$9153$6895$4216..

# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')

# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'
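The Revenue fix above can be sketched end to end on a toy DataFrame (the values here are made up for illustration):

```python
# Toy sketch of the Revenue fix: strings with '$' symbols cast to integers.
import pandas as pd

sales = pd.DataFrame({
    'SalesOrderID': [43659, 43660],
    'Revenue': ['23153$', '1457$'],   # stored as strings because of the '$'
    'Quantity': [12, 2],
})

assert sales['Revenue'].dtype == 'object'   # text column

# Strip the currency symbol, then cast to integer
sales['Revenue'] = sales['Revenue'].str.strip('$').astype('int')

assert sales['Revenue'].dtype == 'int'      # passes silently
print(sales['Revenue'].sum())               # prints 24610, a numeric sum
```

Summing before the fix would have concatenated the strings instead of adding the numbers.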


The assert statement

# This will pass
assert 1+1 == 2

# This will not pass
assert 1+1 == 3

AssertionError    Traceback (most recent call last)
assert 1+1 == 3
...
AssertionError:

Numeric or categorical?

marriage_status
3
1
2
...

0 = Never married, 1 = Married, 2 = Separated, 3 = Divorced

df['marriage_status'].describe()

      marriage_status
mean  1.4
std   0.20
min   0.00
50%   1.8
...

Numeric or categorical?

# Convert to categorical
df["marriage_status"] = df["marriage_status"].astype('category')
df.describe()

        marriage_status
count   241
unique  4
top     1
freq    120

Let's practice!
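The effect of the cast can be seen on a minimal sketch (the codes below are made up):

```python
# A minimal sketch of why .astype('category') changes describe().
import pandas as pd

df = pd.DataFrame({'marriage_status': [0, 1, 1, 2, 3, 1]})

# As integers, describe() computes a (meaningless) mean and std
print(df['marriage_status'].describe()['mean'])

# Cast to categorical
df['marriage_status'] = df['marriage_status'].astype('category')

# Now describe() reports count / unique / top / freq instead
summary = df['marriage_status'].describe()
assert summary['unique'] == 4   # codes 0, 1, 2, 3
assert summary['top'] == 1      # most frequent code
```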


Data range constraints

Adel Nehme
Content Developer @ DataCamp

Motivation

movies.head()

   movie_name     avg_rating
0  The Godfather  5
1  Frozen 2       ...
2  Shrek          ...
...

Motivation

import matplotlib.pyplot as plt
plt.hist(movies['avg_rating'])
plt.title('Average rating of movies (1-5)')

Can future sign-ups exist?

# Import datetime
import datetime as dt
today_date = dt.date.today()
user_signups[user_signups['subscription_date'] > dt.date.today()]

  subscription_date  user_name  ...  Country
0        01/05/2021  Marah      ...  Nauru
1        09/08/2020  Joshua     ...  Austria
2        04/01/2020  Heidi      ...  Guinea
3        11/10/2020  Rina       ...  Turkmenistan
4        11/07/2020  Christine  ...  Marshall Islands
5        07/07/2020  Ayanna     ...  Gabon


How to deal with out of range data?

Dropping data
Setting custom minimums and maximums
Treat as missing and impute
Setting a custom value depending on business assumptions

Movie example

import pandas as pd
# Output movies with rating > 5
movies[movies['avg_rating'] > 5]

    movie_name        avg_rating
23  A Beautiful Mind  6
65  La Vita e Bella   6
77  Amelie            6

# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]

# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace = True)

# Assert results
assert movies['avg_rating'].max() <= 5
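Both dropping strategies above are equivalent, as a quick sketch with hypothetical movies shows:

```python
# Sketch of the rating-cap logic; movie names and ratings are illustrative.
import pandas as pd

movies = pd.DataFrame({
    'movie_name': ['The Godfather', 'A Beautiful Mind', 'Amelie'],
    'avg_rating': [5, 6, 6],          # 6 is outside the 1-5 range
})

# Drop out-of-range rows by filtering...
filtered = movies[movies['avg_rating'] <= 5]

# ...or equivalently with .drop() on the offending index
dropped = movies.drop(movies[movies['avg_rating'] > 5].index)

assert filtered.equals(dropped)
assert filtered['avg_rating'].max() <= 5
```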

Movie example

# Convert avg_rating > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5

# Assert statement
assert movies['avg_rating'].max() <= 5

Remember, no output means it passed.

Date range example

import pandas as pd
import datetime as dt

# Output data types
user_signups.dtypes

subscription_date    object
user_name            object
Country              object
dtype: object

# Convert to date
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date


Date range example

today_date = dt.date.today()

Drop the data:

# Drop values using filtering
user_signups = user_signups[user_signups['subscription_date'] < today_date]

# Drop values using .drop()
user_signups.drop(user_signups[user_signups['subscription_date'] > today_date].index, inplace = True)

Hardcode dates with an upper limit:

# Replace future dates with today's date using filtering
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date

# Assert is true
assert user_signups['subscription_date'].max() <= today_date

Let's practice!
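The hardcoding approach can be sketched with made-up sign-up dates (one deliberately in the future):

```python
# Sketch of the subscription-date fix; names and dates are made up.
import datetime as dt
import pandas as pd

user_signups = pd.DataFrame({
    'user_name': ['Marah', 'Joshua'],
    'subscription_date': ['01/05/2150', '09/08/2020'],  # first one is in the future
})

# Convert day-first strings to date objects
user_signups['subscription_date'] = pd.to_datetime(
    user_signups['subscription_date'], dayfirst=True).dt.date

today_date = dt.date.today()

# Hardcode future dates to today's date
user_signups.loc[user_signups['subscription_date'] > today_date,
                 'subscription_date'] = today_date

assert user_signups['subscription_date'].max() <= today_date
```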

Uniqueness constraints

Adel Nehme
Content Developer @ DataCamp

What are duplicate values?

All columns have the same values:

first_name  last_name   address                                     height  weight
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  193 cm  87 kg
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  193 cm  87 kg


What are duplicate values?

Most columns have the same values:

first_name  last_name   address                                     height  weight
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  193 cm  87 kg
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  194 cm  87 kg

Why do they happen?


How to find duplicate values?

# Print the header
height_weight.head()

   first_name  last_name  address                        height  weight
0  Lane        Reese      534-1559 Nam St.               181     64
1  Ivor        Pierce     102-3364 Non Road              168     66
2  Roary       Gibson     P.O. Box 344, 7785 Nisi Ave    191     99
3  Shannon     Little     691-2550 Consectetuer Street   185     65
4  Abdul       Fry        4565 Risus St.                 169     65

# Get duplicates across all columns
duplicates = height_weight.duplicated()
print(duplicates)

1     False
...   ...
22    True
23    False
...   ...

How to find duplicate rows?

# Get duplicate rows
duplicates = height_weight.duplicated()
height_weight[duplicates]

     first_name  last_name  address                               height  weight
100  Mary        Colon      4674 Ut Rd.                           179     75
101  Ivor        Pierce     102-3364 Non Road                     168     88
102  Cole        Palmer     8366 At, Street                       178     91
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83

The .duplicated() method:

subset: list of column names to check for duplication.
keep: whether to keep the first ('first'), last ('last') or all (False) duplicate values.

# Column names to check for duplication
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)


How to find duplicate rows?

# Output duplicate values
height_weight[duplicates]

     first_name  last_name  address                               height  weight
1    Ivor        Pierce     102-3364 Non Road                     168     66
22   Cole        Palmer     8366 At, Street                       178     91
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
37   Mary        Colon      4674 Ut Rd.                           179     75
100  Mary        Colon      4674 Ut Rd.                           179     75
101  Ivor        Pierce     102-3364 Non Road                     168     88
102  Cole        Palmer     8366 At, Street                       178     91
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83

# Output duplicate values, sorted
height_weight[duplicates].sort_values(by = 'first_name')

     first_name  last_name  address                               height  weight
22   Cole        Palmer     8366 At, Street                       178     91
102  Cole        Palmer     8366 At, Street                       178     91
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83
1    Ivor        Pierce     102-3364 Non Road                     168     66
101  Ivor        Pierce     102-3364 Non Road                     168     88
37   Mary        Colon      4674 Ut Rd.                           179     75
100  Mary        Colon      4674 Ut Rd.                           179     75


How to treat duplicate values?

The .drop_duplicates() method:

subset: list of column names to check for duplication.
keep: whether to keep the first ('first'), last ('last') or all (False) duplicate values.
inplace: drop duplicated rows directly inside the DataFrame without creating a new object (True).

# Drop complete duplicates
height_weight.drop_duplicates(inplace = True)

How to treat duplicate values?

# Output remaining incomplete duplicates
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

     first_name  last_name  address                               height  weight
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83
1    Ivor        Pierce     102-3364 Non Road                     168     66
101  Ivor        Pierce     102-3364 Non Road                     168     88


How to treat duplicate values?

The .groupby() and .agg() methods:

# Group by column names and produce statistical summaries
column_names = ['first_name','last_name','address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(by = column_names).agg(summaries).reset_index()

# Make sure aggregation is done
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

   first_name  last_name  address  height  weight

Let's practice!
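The groupby/agg collapse can be sketched on two conflicting rows (values made up):

```python
# Sketch: combining near-duplicate rows with groupby/agg.
import pandas as pd

height_weight = pd.DataFrame({
    'first_name': ['Ivor', 'Ivor'],
    'last_name':  ['Pierce', 'Pierce'],
    'address':    ['102-3364 Non Road'] * 2,
    'height':     [168, 168],
    'weight':     [66, 88],            # conflicting measurements
})

column_names = ['first_name', 'last_name', 'address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = (height_weight.groupby(by=column_names)
                              .agg(summaries)
                              .reset_index())

# The two rows collapse into one, with weight averaged to 77.0
assert len(height_weight) == 1
assert height_weight.loc[0, 'weight'] == 77.0
```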

Chapter 2 - Text and categorical data problems

Membership constraints

Adel Nehme
Content Developer @ DataCamp


Categories and membership constraints

Predefined finite set of categories:

Type of data               Example values           Numeric representation
Marriage Status            unmarried, married       0, 1
Household Income Category  0-20K, 20-40K, ...       0, 1, ...
Loan Status                default, payed, no_loan  0, 1, 2

Marriage status can only be unmarried or married.

Why could we have these problems?

How do we treat these problems?

An example:

# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data

   name      birthday    blood_type
1  Beth      2019-10-20  B-
2  Ignatius  2020-07-08  A-
3  Paul      2019-08-12  O+
4  Helen     2019-03-17  O-
5  Jennifer  2019-12-17  Z+   <-- not a valid blood type
6  Kennedy   2020-04-27  A+
7  Keith     2019-04-19  AB+

# Correct possible blood types
categories

   blood_type
1  O-
2  O+
3  A-
4  A+
5  B+
6  B-
7  AB+
8  AB-

A note on joins:

A left anti join on blood types keeps only the rows of study_data whose blood type is not in categories; an inner join keeps only the rows whose blood type appears in both.


Finding inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

{'Z+'}

# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]

   name      birthday    blood_type
5  Jennifer  2019-12-17  Z+

Dropping inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]

   name      birthday    blood_type
1  Beth      2019-10-20  B-
2  Ignatius  2020-07-08  A-
3  Paul      2019-08-12  O+
4  Helen     2019-03-17  O-
...
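The whole membership check runs end to end on a small sketch (two made-up rows):

```python
# Sketch of the membership-constraint check with a shortened study_data.
import pandas as pd

study_data = pd.DataFrame({'name': ['Beth', 'Jennifer'],
                           'blood_type': ['B-', 'Z+']})
categories = pd.DataFrame({'blood_type': ['O-', 'O+', 'A-', 'A+',
                                          'B+', 'B-', 'AB+', 'AB-']})

# Set difference finds values not in the allowed categories
inconsistent_categories = set(study_data['blood_type']).difference(
    categories['blood_type'])
assert inconsistent_categories == {'Z+'}

# Filter the offending rows out
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
consistent_data = study_data[~inconsistent_rows]
assert list(consistent_data['name']) == ['Beth']
```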

Let's practice!

Categorical variables

Adel Nehme
Content Developer @ DataCamp

What type of errors could we have?

I) Value inconsistency
   Capitalization: 'married', 'Married', 'UNMARRIED', 'unmarried' ...
   Inconsistent fields: 'married', 'Maried', 'UNMARRIED', 'not married' ...
   Trailing white spaces: 'married ', ' married ' ...

II) Collapsing too many categories into too few
   Creating new groups: 0-20K, 20-40K categories ... from continuous household income data
   Mapping groups to new ones: mapping household income categories to 'rich', 'poor'

III) Making sure data is of type category (seen in Chapter 1)

Value consistency

# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

unmarried    352
married      268
MARRIED      204
UNMARRIED    176
dtype: int64

Value consistency

# Get value counts on DataFrame
marriage_status.groupby('marriage_status').count()

                 household_income  gender
marriage_status
MARRIED          204               204
UNMARRIED        176               176
married          268               268
unmarried        352               352

# Capitalize
marriage_status['marriage_status'] = marriage_status['marriage_status'].str.upper()
marriage_status['marriage_status'].value_counts()

UNMARRIED    528
MARRIED      472

# Lowercase
marriage_status['marriage_status'] = marriage_status['marriage_status'].str.lower()
marriage_status['marriage_status'].value_counts()

unmarried    528
married      472


Value consistency

Trailing spaces: 'married ', 'married', 'unmarried', ' unmarried' ...

# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

unmarried    352
unmarried    268
married      204
married      176
dtype: int64

(The repeated labels above differ only by surrounding whitespace, which does not show when printed.)

# Strip all spaces
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
demographics['marriage_status'].value_counts()

unmarried    528
married      472

Collapsing data into categories

Create categories out of data: an income_group column from the household_income column.

# Using qcut()
import pandas as pd
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q = 3,
                                       labels = group_names)
# Print income_group column
demographics[['income_group', 'household_income']]

   income_group  household_income
0  200K-500K     189243
1  500K+         778533
...

# Using cut() - create category ranges and names
import numpy as np
ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins = ranges,
                                      labels = group_names)
demographics[['income_group', 'household_income']]

   income_group  household_income
0  0-200K        189243
1  500K+         778533
...

Note that qcut() splits on quantiles, so 189243 lands in the wrong group; cut() uses the explicit ranges and assigns it correctly.
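The cut() binning can be sketched with a few hypothetical incomes:

```python
# Sketch of pd.cut with made-up incomes; np.inf serves as an open upper bound.
import numpy as np
import pandas as pd

demographics = pd.DataFrame({'household_income': [189243, 778533, 341000]})

ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.cut(demographics['household_income'],
                                      bins=ranges, labels=group_names)

# Each income lands in the bin whose range contains it
assert list(demographics['income_group']) == ['0-200K', '500K+', '200K-500K']
```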


Collapsing data into categories

Map categories to fewer ones: reducing the number of categories in a categorical column.

operating_system column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'
operating_system column should become: 'DesktopOS', 'MobileOS'

# Create mapping dictionary and replace
mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'Linux':'DesktopOS',
           'IOS':'MobileOS', 'Android':'MobileOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
devices['operating_system'].unique()

array(['DesktopOS', 'MobileOS'], dtype=object)

Let's practice!

Cleaning text data

Adel Nehme
Content Developer @ DataCamp

What is text data?

Type of data    Example values
Names           Alex, Sara ...
Phone numbers   +96171679912 ...
Emails          adel@datacamp.com ...
Passwords       ...

Common text data problems:

1) Data inconsistency: +96171679912 or 0096171679912 or ..?
2) Fixed length violations: passwords need to be at least 8 characters
3) Typos: +961.71.679912


Example

phones = pd.read_csv('phones.csv')
print(phones)

            Full name      Phone number
0     Noelani A. Gray  001-702-397-5143
1      Myles Z. Gomez  001-329-485-0540
2        Gil B. Silva  001-195-492-2338
3  Prescott D. Hardin   +1-297-996-4904   <-- Inconsistent data format
4  Benedict G. Valdez  001-969-820-3536
5    Reece M. Andrews              4138   <-- Length violation
6      Hayfa E. Keith  001-536-175-8444
7     Hedley I. Logan  001-681-552-1823
8    Jack W. Carrillo  001-910-323-5265
9     Lionel M. Davis  001-143-119-9210

Example - the target output after cleaning:

            Full name   Phone number
0     Noelani A. Gray  0017023975143
1      Myles Z. Gomez  0013294850540
2        Gil B. Silva  0011954922338
3  Prescott D. Hardin  0012979964904
4  Benedict G. Valdez  0019698203536
5    Reece M. Andrews            NaN
6      Hayfa E. Keith  0015361758444
7     Hedley I. Logan  0016815521823
8    Jack W. Carrillo  0019103235265
9     Lionel M. Davis  0011431199210

Fixing the phone number column

# Replace "+" with "00"
phones["Phone number"] = phones["Phone number"].str.replace("+", "00")
phones

            Full name      Phone number
0     Noelani A. Gray  001-702-397-5143
1      Myles Z. Gomez  001-329-485-0540
2        Gil B. Silva  001-195-492-2338
3  Prescott D. Hardin  001-297-996-4904
4  Benedict G. Valdez  001-969-820-3536
5    Reece M. Andrews              4138
6      Hayfa E. Keith  001-536-175-8444
7     Hedley I. Logan  001-681-552-1823
8    Jack W. Carrillo  001-910-323-5265
9     Lionel M. Davis  001-143-119-9210


Fixing the phone number column

# Replace "-" with nothing
phones["Phone number"] = phones["Phone number"].str.replace("-", "")
phones

            Full name   Phone number
0     Noelani A. Gray  0017023975143
1      Myles Z. Gomez  0013294850540
2        Gil B. Silva  0011954922338
3  Prescott D. Hardin  0012979964904
4  Benedict G. Valdez  0019698203536
5    Reece M. Andrews           4138
6      Hayfa E. Keith  0015361758444
7     Hedley I. Logan  0016815521823
8    Jack W. Carrillo  0019103235265
9     Lionel M. Davis  0011431199210

# Replace phone numbers with lower than 10 digits to NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones

            Full name   Phone number
0     Noelani A. Gray  0017023975143
1      Myles Z. Gomez  0013294850540
2        Gil B. Silva  0011954922338
3  Prescott D. Hardin  0012979964904
4  Benedict G. Valdez  0019698203536
5    Reece M. Andrews            NaN
6      Hayfa E. Keith  0015361758444
7     Hedley I. Logan  0016815521823
8    Jack W. Carrillo  0019103235265
9     Lionel M. Davis  0011431199210
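The three fixes chain together on a toy sample (three made-up numbers; regex=False is passed explicitly so "+" is treated literally in all pandas versions):

```python
# Sketch of the phone fixes on made-up numbers.
import numpy as np
import pandas as pd

phones = pd.DataFrame({'Phone number': ['001-702-397-5143',
                                        '+1-297-996-4904',
                                        '4138']})

# Replace "+" with "00", then drop the dashes
phones['Phone number'] = (phones['Phone number']
                          .str.replace('+', '00', regex=False)
                          .str.replace('-', '', regex=False))

# Numbers shorter than 10 digits become missing
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, 'Phone number'] = np.nan

assert phones['Phone number'].str.len().min() >= 10
```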

Fixing the phone number column

# Find length of each row in Phone number column
sanity_check = phones['Phone number'].str.len()

# Assert minimum phone number length is 10
assert sanity_check.min() >= 10

# Assert all numbers do not have "+" or "-"
assert phones['Phone number'].str.contains(r'\+|-').any() == False

Remember, assert returns nothing if the condition passes.

But what about more complicated examples?

phones.head()

          Full name    Phone number
0     Olga Robinson  +(01706)-25891
1       Justina Kim    +0500-571437
2    Tamekah Henson      +0800-1111
3     Miranda Solis   +07058-879063
4  Caldwell Gilliam  +(016977)-8424

Regular expressions give us a supercharged Ctrl+F.


Regular expressions in action

# Replace anything that is not a digit with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex = True)
phones.head()

          Full name  Phone number
0     Olga Robinson    0170625891
1       Justina Kim    0500571437
2    Tamekah Henson      08001111
3     Miranda Solis   07058879063
4  Caldwell Gilliam    0169778424

Let's practice!

Chapter 3 - Advanced data problems

Uniformity

Adel Nehme
Content Developer @ DataCamp


Uniformity

Column       Unit
Temperature  32°C is also 89.6°F
Weight       70 kg is also 11 st.
Date         26-11-2019 is also 26, November, 2019
Money        100$ is also 10763.90¥

An example

temperatures = pd.read_csv('temperature.csv')
temperatures.head()

   Date      Temperature
0  03.03.19  14.0
1  04.03.19  15.0
2  05.03.19  18.0
3  06.03.19  16.0
4  07.03.19  62.6   <--


An example

# Import matplotlib
import matplotlib.pyplot as plt
# Create scatter plot
plt.scatter(x = 'Date', y = 'Temperature', data = temperatures)
# Create title, xlabel and ylabel
plt.title('Temperature in Celsius March 2019 - NYC')
plt.xlabel('Dates')
plt.ylabel('Temperature in Celsius')
# Show plot
plt.show()

Treating temperature data

C = (F - 32) × 5/9

temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
temp_cels = (temp_fah - 32) * (5/9)
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels

# Assert conversion is correct
assert temperatures['Temperature'].max() < 40
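The conversion runs end to end on a small sketch (readings above 40 are assumed to be Fahrenheit, mirroring the rule above):

```python
# Sketch of the unit fix with made-up readings.
import pandas as pd

temperatures = pd.DataFrame({'Temperature': [14.0, 15.0, 62.6]})

# Convert suspected Fahrenheit readings (> 40) to Celsius
fah_rows = temperatures['Temperature'] > 40
temperatures.loc[fah_rows, 'Temperature'] = (
    (temperatures.loc[fah_rows, 'Temperature'] - 32) * (5 / 9))

assert temperatures['Temperature'].max() < 40
print(temperatures['Temperature'].round(1).tolist())  # [14.0, 15.0, 17.0]
```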


Treating date data

birthdays.head()

   Birthday         First name  Last name
0  27/27/19         Rowan       Nunez
1  03-29-19         Brynn       Yang
2  March 3rd, 2019  Sophia      Reilly
3  24-03-19         Deacon      Prince
4  06-03-19         Griffith    Neal

Datetime formatting

datetime is useful for representing dates.

Date                datetime format
25-12-2019          %d-%m-%Y
December 25th 2019  %c
12-25-2019          %m-%d-%Y
...                 ...

Treating date data

pandas.to_datetime() can recognize most formats automatically, but sometimes fails with erroneous or unrecognizable formats.

# Converts to datetime - but won't work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'])

ValueError: month must be in 1..12

# Will work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                                       # Attempt to infer format of each date
                                       infer_datetime_format=True,
                                       # Return NA for rows where conversion failed
                                       errors = 'coerce')


Treating date data

birthdays.head()

   Birthday    First name  Last name
0  NaT         Rowan       Nunez
1  2019-03-29  Brynn       Yang
2  2019-03-03  Sophia      Reilly
3  2019-03-24  Deacon      Prince
4  2019-06-03  Griffith    Neal

birthdays['Birthday'] = birthdays['Birthday'].dt.strftime("%d-%m-%Y")
birthdays.head()

   Birthday    First name  Last name
0  NaT         Rowan       Nunez
1  29-03-2019  Brynn       Yang
2  03-03-2019  Sophia      Reilly
3  24-03-2019  Deacon      Prince
4  03-06-2019  Griffith    Neal

Treating ambiguous date data

Is 2019-03-08 in August or March?

Convert to NA and treat accordingly
Infer the format by understanding the data source
Infer the format by understanding previous and subsequent data in the DataFrame

Let's practice!


Cross field validation

Adel Nehme
Content Developer @ DataCamp

Motivation

import pandas as pd
flights = pd.read_csv('flights.csv')
flights.head()

   flight_number  economy_class  business_class  first_class  total_passengers
0  DL140          100            60              40           200
1  BA248          130            100             70           300
2  MEA124         100            50              50           200
3  AFR939         140            70              90           300
4  TKA101         130            100             20           250

Cross field validation

The use of multiple fields in a dataset to sanity check data integrity.

   flight_number  economy_class + business_class + first_class = total_passengers
0  DL140          100 + 60 + 40 = 200
1  BA248          130 + 100 + 70 = 300
2  MEA124         100 + 50 + 50 = 200
3  AFR939         140 + 70 + 90 = 300
4  TKA101         130 + 100 + 20 = 250

sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis = 1)
passenger_equ = sum_classes == flights['total_passengers']
# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

users.head()

   user_id  Age  Birthday
0  32985    22   1998-03-02
1  94387    27   1993-12-04
2  34236    42   1978-11-24
3  12551    31   1989-01-03
4  55212    18   2002-07-02
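The row-sum check can be sketched with two hypothetical flights, one deliberately inconsistent:

```python
# Sketch of cross field validation on passenger totals.
import pandas as pd

flights = pd.DataFrame({
    'economy_class':    [100, 130],
    'business_class':   [60, 100],
    'first_class':      [40, 70],
    'total_passengers': [200, 290],   # second row should be 300
})

sum_classes = flights[['economy_class', 'business_class',
                       'first_class']].sum(axis=1)
passenger_equ = sum_classes == flights['total_passengers']

# Split consistent and inconsistent rows
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

assert len(inconsistent_pass) == 1
assert len(consistent_pass) == 1
```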


Cross field validation

import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])
today = dt.date.today()
# For each row in the Birthday column, calculate year difference
age_manual = today.year - users['Birthday'].dt.year
# Find instances where ages match
age_equ = age_manual == users['Age']
# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]

What to do when we catch inconsistencies?

Let's practice!

Completeness

Adel Nehme
Content Developer @ DataCamp

What is missing data?

Can be represented as NA, nan, 0, ., ...
Technical error
Human error

Airquality example

import pandas as pd
airquality = pd.read_csv('airquality.csv')
print(airquality)

      Date        Temperature  CO2
987   20/04/2004  16.8         0.0
2119  07/06/2004  18.7         0.8
2451  20/06/2004  -40.0        NaN   <--
1984  01/06/2004  19.6         1.8
8299  19/02/2005  11.2         1.2
...   ...         ...          ...

Airquality example

# Return missing values
airquality.isna()

      Date   Temperature  CO2
987   False  False        False
2119  False  False        False
2451  False  False        True
1984  False  False        False
8299  False  False        False


Airquality example

# Get summary of missingness
airquality.isna().sum()

Date             0
Temperature      0
CO2            366
dtype: int64

Missingno

Useful package for visualizing and understanding missing data:

import missingno as msno
import matplotlib.pyplot as plt
# Visualize missingness
msno.matrix(airquality)
plt.show()

Airquality example

# Isolate missing and complete values aside
missing = airquality[airquality['CO2'].isna()]
complete = airquality[~airquality['CO2'].isna()]


Airquality example

# Describe complete DataFrame
complete.describe()

       Temperature          CO2
count  8991.000000  8991.000000
mean     18.317829     1.739584
std       8.832116     1.537580
min      -1.900000     0.000000
...            ...          ...
max      44.600000    11.900000

# Describe missing DataFrame
missing.describe()

       Temperature  CO2
count   366.000000  0.0
mean    -39.655738  NaN   <--
std       5.988716  NaN
min     -49.000000  NaN   <--
...            ...  ...
max     -30.000000  NaN   <--

sorted_airquality = airquality.sort_values(by = 'Temperature')
msno.matrix(sorted_airquality)
plt.show()


Missingness types


How to deal with missing data?

Simple approaches:
1. Drop missing data
2. Impute with statistical measures (mean, median, mode, ...)

More complex approaches:
1. Imputing using an algorithmic approach
2. Imputing with machine learning models

Dealing with missing data

airquality.head()

   Date        Temperature  CO2
0  05/03/2005  8.5          2.5
1  23/08/2004  21.8         0.0
2  18/02/2005  6.3          1.0
3  08/02/2005  -31.0        NaN
4  13/03/2005  19.9         0.1

Dropping missing values

# Drop missing values
airquality_dropped = airquality.dropna(subset = ['CO2'])
airquality_dropped.head()

   Date        Temperature  CO2
0  05/03/2005  8.5          2.5
1  23/08/2004  21.8         0.0
2  18/02/2005  6.3          1.0
4  13/03/2005  19.9         0.1
5  02/04/2005  17.0         0.8

Replacing with statistical measures

co2_mean = airquality['CO2'].mean()
airquality_imputed = airquality.fillna({'CO2': co2_mean})
airquality_imputed.head()

   Date        Temperature  CO2
0  05/03/2005  8.5          2.500000
1  23/08/2004  21.8         0.000000
2  18/02/2005  6.3          1.000000
3  08/02/2005  -31.0        1.739584
4  13/03/2005  19.9         0.100000
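Both options can be sketched side by side on three made-up readings:

```python
# Sketch of dropping vs imputing missing CO2 values.
import numpy as np
import pandas as pd

airquality = pd.DataFrame({'Temperature': [8.5, 21.8, -31.0],
                           'CO2': [2.5, 0.0, np.nan]})

# Option 1: drop rows where CO2 is missing
airquality_dropped = airquality.dropna(subset=['CO2'])
assert len(airquality_dropped) == 2

# Option 2: impute missing CO2 with the column mean
co2_mean = airquality['CO2'].mean()          # mean of the non-missing values: 1.25
airquality_imputed = airquality.fillna({'CO2': co2_mean})
assert airquality_imputed['CO2'].isna().sum() == 0
assert airquality_imputed.loc[2, 'CO2'] == 1.25
```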


Let's practice!

Chapter 4 - Record linkage

Comparing strings

Adel Nehme
Content Developer @ DataCamp

Minimum edit distance

The least possible number of steps needed to transition from one string to another.


Minimum edit distance

Minimum edit distance

Minimum edit distance so far: 2
Minimum edit distance: 5



Minimum edit distance algorithms
Algorithm Operations
Damerau-Levenshtein insertion, substitution, deletion, transposition
Levenshtein insertion, substitution, deletion
Hamming substitution only
Jaro distance transposition only
... ...

Possible packages: nltk, thefuzz, textdistance...
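As a sketch of what these packages compute under the hood, here is a minimal pure-Python Levenshtein distance (insertion, substitution, deletion only; the function name is ours, not from any of the packages above):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein('Reeding', 'Reading'))  # 1 (one substitution: e -> a)
print(levenshtein('sign', 'sing'))        # 2 (Levenshtein has no transposition)
```

Note that 'sign' to 'sing' costs 2 here, whereas Damerau-Levenshtein would count the swap as a single transposition.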


Simple string comparison

# Lets us compare between two strings
from thefuzz import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')

86



Partial strings and different orderings

# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')

90

# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')

86

Comparison with arrays

# Import process
from thefuzz import process

# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets',
                     'Houson vs Los Angeles', 'Heat vs Bulls'])

process.extract(string, choices, limit = 2)

[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
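If installing thefuzz is not an option, the standard library's difflib offers a rough equivalent (scores are SequenceMatcher ratios on a 0-1 scale, so the numbers differ from WRatio's 0-100 scores; the example data is ours):

```python
from difflib import SequenceMatcher, get_close_matches

# Similarity ratio between two strings (0.0 to 1.0)
score = SequenceMatcher(None, 'Reeding', 'Reading').ratio()
print(round(score, 3))  # 0.857

# Closest matches from an array of choices, best first, above a cutoff
matches = get_close_matches('California',
                            ['Cali', 'Calefornia', 'New York'],
                            n=2, cutoff=0.8)
print(matches)  # ['Calefornia']
```

Unlike fuzz.WRatio, SequenceMatcher does not handle partial substrings or reordered tokens specially, so thefuzz remains the better fit for those cases.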


Collapsing categories with string similarity

Chapter 2: Use .replace() to collapse "eur" into "Europe"

What if there are too many variations?
"EU", "eur", "Europ", "Europa", "Erope", "Evropa" ...

String similarity!

Collapsing categories with string matching

print(survey['state'].unique())

  state
0 California
1 Cali
2 Calefornia
3 Calefornie
4 Californie
5 Calfornia
6 Calefernia
7 New York
8 New York City
...

categories

  state
0 California
1 New York



Collapsing all of the state

# For each correct category
for state in categories['state']:
  # Find potential matches in states with typos
  matches = process.extract(state, survey['state'], limit = survey.shape[0])
  # For each potential match
  for potential_match in matches:
    # If high similarity score
    if potential_match[1] >= 80:
      # Replace typo with correct category
      survey.loc[survey['state'] == potential_match[0], 'state'] = state
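The same collapse can be sketched with only the standard library, swapping process.extract for difflib (the 0.8 cutoff here is difflib's 0-1 ratio, not thefuzz's 0-100 score; the lists are illustrative):

```python
from difflib import SequenceMatcher

states = ['California', 'Cali', 'Calefornia', 'Calefornie',
          'New York', 'New York City']
categories = ['California', 'New York']

# For each correct category, overwrite values that are similar enough
for category in categories:
    for i, value in enumerate(states):
        # If high similarity score, replace typo with correct category
        if SequenceMatcher(None, value, category).ratio() >= 0.8:
            states[i] = category

print(states)
# ['California', 'Cali', 'California', 'California', 'New York', 'New York City']
```

Note how the cutoff matters: 'Cali' and 'New York City' score below 0.8 against their intended categories, so they survive untouched and would need a higher-recall scorer or manual mapping.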


Generating pairs
Let's practice!

Adel Nehme
Content Developer @ DataCamp
Motivation

When joins won't work


Record linkage

Our DataFrames


census_A

given_name surname date_of_birth suburb state address_1


rec_id
rec-1070-org michaela neumann 19151111 winston hills cal stanley street
rec-1016-org courtney painter 19161214 richlands txs pinkerton circuit
...

census_B

given_name surname date_of_birth suburb state address_1


rec_id
rec-561-dup-0 elton NaN 19651013 windermere ny light setreet
rec-2642-dup-0 mitchell maxon 19390212 north ryde cal edkins street
...

The recordlinkage package



Generating pairs


Blocking

Generating pairs


# Import recordlinkage
import recordlinkage

# Create indexing object


indexer = recordlinkage.Index()

# Generate pairs blocked on state


indexer.block('state')
pairs = indexer.index(census_A, census_B)
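Blocking itself is simple to sketch without the library: only records that share a value of the blocking key (here 'state') become candidate pairs, instead of the full cross product (toy records and ids are ours, not from the census data):

```python
from itertools import product
from collections import defaultdict

# Toy census records: (record id, state)
census_A = [('rec-1', 'cal'), ('rec-2', 'txs')]
census_B = [('rec-9', 'cal'), ('rec-8', 'ny'), ('rec-7', 'cal')]

# Group each table's record ids by the blocking key
block_A, block_B = defaultdict(list), defaultdict(list)
for rec_id, state in census_A:
    block_A[state].append(rec_id)
for rec_id, state in census_B:
    block_B[state].append(rec_id)

# Candidate pairs come only from matching blocks
pairs = [(a, b)
         for state in block_A
         for a, b in product(block_A[state], block_B.get(state, []))]

print(pairs)  # [('rec-1', 'rec-9'), ('rec-1', 'rec-7')]
```

Here blocking cuts 2 x 3 = 6 possible pairs down to 2, which is the whole point: comparison cost grows with the number of pairs, not rows.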



Generating pairs

print(pairs)

MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org',
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org',
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org',
'rec-1280-org',...........
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61,
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])

Comparing the DataFrames

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')

# Find similar matches for pairs of surname and address_1 using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

# Find matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)
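Conceptually, the comparison step just scores each candidate pair column by column: 1/0 for exact matches, and 1.0 when a string-similarity ratio clears the threshold. A stdlib sketch of that idea (made-up records; this is not recordlinkage's internals):

```python
from difflib import SequenceMatcher

def compare_pair(rec_a, rec_b, exact_cols, string_cols, threshold=0.85):
    """Score one candidate pair: 1/0 per exact column, 1.0/0.0 per string column."""
    scores = {}
    for col in exact_cols:
        scores[col] = int(rec_a[col] == rec_b[col])
    for col in string_cols:
        ratio = SequenceMatcher(None, rec_a[col], rec_b[col]).ratio()
        scores[col] = 1.0 if ratio >= threshold else 0.0
    return scores

a = {'date_of_birth': '19151111', 'state': 'nsw', 'surname': 'neumann'}
b = {'date_of_birth': '19151111', 'state': 'nsw', 'surname': 'neumman'}
print(compare_pair(a, b, ['date_of_birth', 'state'], ['surname']))
# {'date_of_birth': 1, 'state': 1, 'surname': 1.0}
```

Summing a row of such scores is exactly what the later filtering step does to decide whether a pair is a probable match.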


Finding matching pairs

print(potential_matches)

                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...

Finding the only pairs we want

potential_matches[potential_matches.sum(axis = 1) >= 2]

                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
...



Linking DataFrames
Let's practice!

Adel Nehme
Content Developer @ DataCamp

Record linkage



Our DataFrames

census_A

               given_name  surname  date_of_birth  suburb         state  address_1
rec_id
rec-1070-org   michaela    neumann  19151111       winston hills  nsw    stanley street
rec-1016-org   courtney    painter  19161214       richlands      vic    pinkerton circuit
...

census_B

               given_name  surname  date_of_birth  suburb         state  address_1
rec_id
rec-561-dup-0  elton       NaN      19651013       windermere     vic    light setreet
rec-2642-dup-0 mitchell    maxon    19390212       north ryde     nsw    edkins street
...

What we've already done

# Import recordlinkage and generate full pairs
import recordlinkage
indexer = recordlinkage.Index()
indexer.block('state')
full_pairs = indexer.index(census_A, census_B)

# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

potential_matches = compare_cl.compute(full_pairs, census_A, census_B)


What we're doing now

Our potential matches

potential_matches

Probable matches

matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
print(matches)



Get the indices

matches.index

MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org',
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', ...

# Get indices from census_B only
duplicate_rows = matches.index.get_level_values(1)
print(duplicate_rows)

Index(['rec-2404-dup-0', 'rec-4178-dup-0', 'rec-1054-dup-0', 'rec-4663-dup-0',
'rec-485-dup-0', 'rec-2950-dup-0', 'rec-1234-dup-0', ... , 'rec-299-dup-0'])


Linking DataFrames

# Import recordlinkage and generate pairs and compare across columns
...

# Generate potential matches
potential_matches = compare_cl.compute(full_pairs, census_A, census_B)

# Isolate matches with matching values for 3 or more columns
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get index for matching census_B rows only
duplicate_rows = matches.index.get_level_values(1)

# Finding duplicates in census_B
census_B_duplicates = census_B[census_B.index.isin(duplicate_rows)]

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames!
full_census = pd.concat([census_A, census_B_new])
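Stripped of pandas, the final linking logic is just set membership on the matched ids (a toy sketch; the ids and values are illustrative):

```python
# Ids flagged as duplicates of census_A rows (level 1 of the matches index)
duplicate_rows = {'rec-561-dup-0'}

census_A = {'rec-1070-org': 'michaela neumann', 'rec-1016-org': 'courtney painter'}
census_B = {'rec-561-dup-0': 'elton', 'rec-2642-dup-0': 'mitchell maxon'}

# Keep only census_B rows that are NOT duplicates, then combine
census_B_new = {k: v for k, v in census_B.items() if k not in duplicate_rows}
full_census = {**census_A, **census_B_new}

print(sorted(full_census))
# ['rec-1016-org', 'rec-1070-org', 'rec-2642-dup-0']
```

The duplicate census_B row is dropped rather than merged, which is the course's convention: census_A is treated as the authoritative copy of any linked record.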



Congratulations!
Let's practice!

Adel Nehme
Content Developer @ DataCamp

What we've learned

Chapter 1 - Common data problems



What we've learned

Chapter 2 - Text and categorical data problems
Chapter 3 - Advanced data problems


What we've learned

Chapter 4 - Record linkage

More to learn on DataCamp!

Working with Dates and Times in Python
Regular Expressions in Python
Dealing with Missing Data in Python
And more!



More to learn!


Thank you!
