Cleaning Data in Python
Data type constraints
Adel Nehme
Content Developer @ DataCamp
assert 1+1 == 3

AssertionError                            Traceback (most recent call last)
...
AssertionError:

Numeric or categorical?

df['marriage_status'].describe()

       marriage_status
...
mean               1.4
std               0.20
min               0.00
50%                1.8 ...

# Convert to categorical
df["marriage_status"] = df["marriage_status"].astype('category')
df.describe()
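A minimal sketch of the conversion above on a made-up frame (the values in `df` are illustrative, not the course data). It shows how the summary changes once the integer codes are treated as categories.

```python
import pandas as pd

# Hypothetical data: marriage status stored as integer codes 0-3
df = pd.DataFrame({"marriage_status": [0, 1, 1, 2, 3, 1]})

# As integers, describe() reports mean, std, and quantiles
numeric_summary = df["marriage_status"].describe()

# Convert to categorical, then confirm the dtype with an assert
df["marriage_status"] = df["marriage_status"].astype("category")
assert df["marriage_status"].dtype == "category"

# As a category, describe() reports count, unique, top, and freq
categorical_summary = df["marriage_status"].describe()
print(categorical_summary)
```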
        marriage_status
count               241
unique                4
top                   1
freq                120

Let's practice!

Data range constraints

   movie_name      avg_rating
0  The Godfather   5
1  Frozen 2        3
2  Shrek           4
...
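Assuming `avg_rating` is meant to stay on a 1 to 5 scale, two common ways to handle a value above the limit are dropping the row or capping the value. The sketch below uses made-up data in the shape of the table above; the fourth row is invented for illustration.

```python
import pandas as pd

# Hypothetical ratings; 6 falls outside the assumed 1-5 scale
movies = pd.DataFrame({
    "movie_name": ["The Godfather", "Frozen 2", "Shrek", "Made-up Movie"],
    "avg_rating": [5, 3, 4, 6],
})

# Option 1: drop the rows that break the constraint
movies_dropped = movies[movies["avg_rating"] <= 5]

# Option 2: cap out-of-range values at the upper limit
movies_capped = movies.copy()
movies_capped.loc[movies_capped["avg_rating"] > 5, "avg_rating"] = 5

# Either way, the range check now passes silently
assert movies_dropped["avg_rating"].max() <= 5
assert movies_capped["avg_rating"].max() <= 5
```

Dropping is usually reasonable only when few rows are affected; capping keeps the row but changes its value.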
Motivation

import matplotlib.pyplot as plt
plt.hist(movies['avg_rating'])
plt.title('Average rating of movies (1-5)')

# Assert statement
assert movies['avg_rating'].max() <= 5

Remember, no output means it passed

Motivation

Can future sign-ups exist?

subscription_date    object
user_name            object
Country              object
dtype: object

# Convert to date
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date

# Import date time
import datetime as dt
today_date = dt.date.today()
user_signups[user_signups['subscription_date'] > dt.date.today()]
     first_name  last_name  address                               height  weight
1    Ivor        Pierce     102-3364 Non Road                     168     66
22   Cole        Palmer     8366 At, Street                       178     91
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
37   Mary        Colon      4674 Ut Rd.                           179     75
100  Mary        Colon      4674 Ut Rd.                           179     75
101  Ivor        Pierce     102-3364 Non Road                     168     88
102  Cole        Palmer     8366 At, Street                       178     91
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83

duplicates = height_weight.duplicated(subset = column_names, keep = False)

     first_name  last_name  address                               height  weight
22   Cole        Palmer     8366 At, Street                       178     91
102  Cole        Palmer     8366 At, Street                       178     91
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83
1    Ivor        Pierce     102-3364 Non Road                     168     66
101  Ivor        Pierce     102-3364 Non Road                     168     88
37   Mary        Colon      4674 Ut Rd.                           179     75
100  Mary        Colon      4674 Ut Rd.                           179     75
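keep : Which occurrence to treat as the original: first ( 'first' ), last ( 'last' ) or none ( False ). With .duplicated(), False flags every duplicate row; with .drop_duplicates(), False drops them all.
inplace : Drop duplicated rows directly inside the DataFrame without creating a new object ( True ).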
# Drop duplicates
height_weight.drop_duplicates(inplace = True)
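A small sketch of both calls on a few rows taken from the table above. The list in `column_names` is an assumption, since the slides do not show how it is defined, and the index is reset for brevity.

```python
import pandas as pd

height_weight = pd.DataFrame({
    "first_name": ["Ivor", "Cole", "Cole", "Ivor"],
    "last_name": ["Pierce", "Palmer", "Palmer", "Pierce"],
    "address": ["102-3364 Non Road", "8366 At, Street",
                "8366 At, Street", "102-3364 Non Road"],
    "height": [168, 178, 178, 168],
    "weight": [66, 91, 91, 88],
})

# Assumed identifying columns; not shown in the slides
column_names = ["first_name", "last_name", "address"]

# keep=False flags every member of a duplicate group
duplicates = height_weight.duplicated(subset=column_names, keep=False)
print(height_weight[duplicates].sort_values(by="first_name"))

# With no subset, only rows identical in every column are dropped
deduped = height_weight.drop_duplicates()
print(deduped)
```

The two Cole Palmer rows are complete duplicates, so one is dropped. The two Ivor Pierce rows differ in weight and both survive, which is why a further step is needed for them.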
height_weight = height_weight.groupby(by = column_names).agg(summaries).reset_index()
# Make sure aggregation is done
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

Let's practice!
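For rows that share identifying columns but disagree on other values, the aggregation step might look like the sketch below. Both `column_names` and the `summaries` rule (maximum height, mean weight) are assumptions made for illustration.

```python
import pandas as pd

height_weight = pd.DataFrame({
    "first_name": ["Ivor", "Ivor", "Mary"],
    "last_name": ["Pierce", "Pierce", "Colon"],
    "address": ["102-3364 Non Road", "102-3364 Non Road", "4674 Ut Rd."],
    "height": [168, 168, 179],
    "weight": [66, 88, 75],
})

column_names = ["first_name", "last_name", "address"]
# Assumed rule: keep the tallest height and the average weight
summaries = {"height": "max", "weight": "mean"}

# Collapse each group of identifying columns into a single row
height_weight = (
    height_weight.groupby(by=column_names).agg(summaries).reset_index()
)

# No row should be flagged once each group is collapsed
duplicates = height_weight.duplicated(subset=column_names, keep=False)
assert not duplicates.any()
print(height_weight)
```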
Membership constraints
   name      birthday    blood_type
1  Beth      2019-10-20  B-
2  Ignatius  2020-07-08  A-
3  Paul      2019-08-12  O+
4  Helen     2019-03-17  O-
5  Jennifer  2019-12-17  Z+  <--
6  Kennedy   2020-04-27  A+
7  Keith     2019-04-19  AB+

   blood_type
1  O-
2  O+
3  A-
4  A+
5  B+
6  B-
7  AB+
8  AB-
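One way to find the row marked above is a set difference between the observed values and the allowed ones. The frame names `study_data` and `categories` are assumptions; the values come from the two tables.

```python
import pandas as pd

study_data = pd.DataFrame({
    "name": ["Beth", "Ignatius", "Paul", "Helen", "Jennifer", "Kennedy", "Keith"],
    "blood_type": ["B-", "A-", "O+", "O-", "Z+", "A+", "AB+"],
})
categories = pd.DataFrame({
    "blood_type": ["O-", "O+", "A-", "A+", "B+", "B-", "AB+", "AB-"],
})

# Values present in the data but missing from the allowed categories
inconsistent_categories = set(study_data["blood_type"]).difference(
    categories["blood_type"]
)

# Split rows into those that violate the constraint and those that do not
inconsistent_rows = study_data["blood_type"].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]
consistent_data = study_data[~inconsistent_rows]
print(inconsistent_data)
```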
Let's practice!

Categorical variables
What type of errors could we have?

I) Value inconsistency
Inconsistent fields: 'married' , 'Maried' , 'UNMARRIED' , 'not married' ..
Trailing white spaces: 'married ' , ' married ' ..

II) Collapsing too many categories into a few
Creating new groups: 0-20K , 20-40K categories ... from continuous household income data
Mapping groups to new ones: Mapping household income categories to 2 'rich' , 'poor'

III) Making sure data is of type category (seen in Chapter 1)

Value consistency

Capitalization: 'married' , 'Married' , 'UNMARRIED' , 'unmarried' ..

# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

unmarried    352
married      268
MARRIED      204
UNMARRIED    176
dtype: int64
unmarried 528
married 472
unmarried 352
unmarried 268
married 204
married 176
dtype: int64
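Both problems, capitalization and stray whitespace, can be handled with pandas string methods. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical column mixing capitalization and leading/trailing spaces
demographics = pd.DataFrame({
    "marriage_status": ["married", "MARRIED", " unmarried", "UNMARRIED ", "married "],
})

# Lowercase everything, then strip surrounding whitespace
demographics["marriage_status"] = (
    demographics["marriage_status"].str.lower().str.strip()
)

counts = demographics["marriage_status"].value_counts()
print(counts)
```

After both steps only two distinct values remain, so the counts add up per category as in the output above.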
   category   household_income
0  200K-500K  189243
1  500K+      778533
..

   category  Income
0  0-200K    189243
1  500K+     778533
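Creating groups from continuous income and then mapping them to fewer groups could be sketched as below. The bin edges are assumptions chosen to match the labels in the table, and the 'rich' / 'poor' mapping is made up.

```python
import numpy as np
import pandas as pd

demographics = pd.DataFrame({"household_income": [189243, 778533, 250000, 40000]})

# Assumed bin edges: (0, 200K], (200K, 500K], (500K, inf)
ranges = [0, 200000, 500000, np.inf]
group_names = ["0-200K", "200K-500K", "500K+"]

# Create categories from the continuous column
demographics["income_group"] = pd.cut(
    demographics["household_income"], bins=ranges, labels=group_names
)

# Collapse the three groups into two; this mapping is illustrative only
mapping = {"0-200K": "poor", "200K-500K": "poor", "500K+": "rich"}
demographics["wealth"] = demographics["income_group"].astype(str).map(mapping)
print(demographics)
```

Explicit edges in `pd.cut` make the category boundaries predictable, whereas quantile-based binning depends on how the data happens to be distributed.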
3) Typos:
+961.71.679912
Fixing the phone number column

# Find length of each row in Phone number column
sanity_check = phone['Phone number'].str.len()

# Assert all numbers do not have "+" or "-"

But what about more complicated examples?

phones.head()

   Full name         Phone number
...
3  Miranda Solis     +07058-879063
...

Supercharged control + F

   Full name         Phone number
0  Olga Robinson     0170625891
1  Justina Kim       0500571437
2  Tamekah Henson    08001111
3  Miranda Solis     07058879063
4  Caldwell Gilliam  0169778424

Let's practice!
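A regular expression can strip everything that is not a digit in one pass. The sketch below uses one row from the slides plus an invented one; the pattern `\D+` matches any run of non-digit characters.

```python
import pandas as pd

phones = pd.DataFrame({
    "Full name": ["Miranda Solis", "Hypothetical Person"],
    "Phone number": ["+07058-879063", "+(0123)-456 789"],
})

# Replace every run of non-digit characters with nothing
phones["Phone number"] = phones["Phone number"].str.replace(
    r"\D+", "", regex=True
)

# Check that no "+" or "-" survived
assert not phones["Phone number"].str.contains(r"[+-]").any()
print(phones)
```

Passing `regex=True` explicitly matters because recent pandas versions treat the pattern as a literal string by default.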
In this chapter
Uniformity
An example

temperatures = pd.read_csv('temperature.csv')
temperatures.head()
25-12-2019            %d-%m-%Y
December 25th 2019    %c
12-25-2019            %m-%d-%Y
...                   ...

Sometimes fails with erroneous or unrecognizable formats

ValueError: month must be in 1..12

# Will work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                                       # Attempt to infer format of each date
                                       # (deprecated since pandas 2.0, which infers a consistent format by default)
                                       infer_datetime_format=True,
                                       # Return NaT for rows where conversion failed
                                       errors = 'coerce')
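When the intended format is known, passing it explicitly avoids guessing between day-first and month-first dates. A minimal sketch with made-up values, assuming the column is meant to be day-month-year:

```python
import pandas as pd

birthdays = pd.DataFrame({"Birthday": ["25-12-2019", "31-01-2020", "not a date"]})

# Parse with an explicit format; unparseable rows become NaT
birthdays["Birthday"] = pd.to_datetime(
    birthdays["Birthday"], format="%d-%m-%Y", errors="coerce"
)

# Count the rows that failed to convert
print(birthdays["Birthday"].isna().sum())

# Render every parsed date in one consistent string format
birthdays["Birthday_str"] = birthdays["Birthday"].dt.strftime("%d-%m-%Y")
```

A value such as 01-02-2019 is still ambiguous on its own; no parser can decide that without knowledge of the data source.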
flights = pd.read_csv('flights.csv')

   flight_number  economy_class  business_class  first_class  total_passengers
0  DL140          100            60              40           200
1  BA248          130            100             70           300
2  MEA124         100            50              50           200
3  AFR939         140            70              90           300
4  TKA101         130            100             20           250
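One check this table invites is whether the three class columns add up to `total_passengers`. The sketch below assumes that is the intended rule and adds an invented flight with a wrong total to show a failure.

```python
import pandas as pd

flights = pd.DataFrame({
    "flight_number": ["DL140", "BA248", "XX000"],
    "economy_class": [100, 130, 100],
    "business_class": [60, 100, 50],
    "first_class": [40, 70, 50],
    "total_passengers": [200, 300, 250],  # the last total is deliberately wrong
})

# Sum across the class columns, row by row
sum_classes = flights[["economy_class", "business_class", "first_class"]].sum(axis=1)
passenger_equ = sum_classes == flights["total_passengers"]

# Separate rows that fail the cross-field check from those that pass
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]
print(inconsistent_pass)
```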
Let's practice!

Completeness
What is missing data?

Airquality example

import pandas as pd
airquality = pd.read_csv('airquality.csv')
print(airquality)

      Date        Temperature  CO2
987   20/04/2004  16.8         0.0
2119  07/06/2004  18.7         0.8
2451  20/06/2004  -40.0        NaN  <--
1984  01/06/2004  19.6         1.8
8299  19/02/2005  11.2         1.2
...   ...         ...          ...

      Date   Temperature  CO2
987   False  False        False
2119  False  False        False
2451  False  False        True
1984  False  False        False
8299  False  False        False
Airquality example
# Isolate missing and complete values aside
missing = airquality[airquality['CO2'].isna()]
complete = airquality[~airquality['CO2'].isna()]
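Once the missing rows are isolated, two simple ways to deal with them are dropping or imputing. The sketch below uses a few rows from the table above; filling with the column mean is an assumption made for illustration, not a recommendation for every dataset.

```python
import numpy as np
import pandas as pd

airquality = pd.DataFrame({
    "Date": ["20/04/2004", "07/06/2004", "20/06/2004", "01/06/2004"],
    "Temperature": [16.8, 18.7, -40.0, 19.6],
    "CO2": [0.0, 0.8, np.nan, 1.8],
})

# Count missing values per column
missing_counts = airquality.isna().sum()
print(missing_counts)

# Option 1: drop rows where CO2 is missing
airquality_dropped = airquality.dropna(subset=["CO2"])

# Option 2: fill missing CO2 with the mean of the observed values
co2_mean = airquality["CO2"].mean()
airquality_imputed = airquality.fillna({"CO2": co2_mean})
```

Comparing `missing` and `complete` first (for example with `.describe()`) can show whether values are missing at random or tied to another column, such as the very low temperature in the flagged row.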
Least possible amount of steps needed to transition from one string to another
Chapter 4 - Record linkage
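One common way to count those steps is the Levenshtein distance, where each insertion, deletion, or substitution of a single character costs 1. The function below is a plain-Python sketch of that variant, not the implementation any particular library uses; `edit_distance` is a name chosen for illustration.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: insert, delete, or substitute, each at cost 1."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(edit_distance("intention", "execution"))  # 5
```

Similarity scores used for string matching are typically derived from a distance like this, scaled so that identical strings score highest.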
Collapsing categories with string similarity

Collapsing categories with string matching

Chapter 2

print(survey['state'].unique())

categories

Let's practice!

Generating pairs
Motivation

When joins won't work

census_B

95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])

compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')
# Find matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)
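The result of the comparison is a DataFrame indexed by record pairs, with one column per compared field. A sketch of filtering it with plain pandas follows; the frame is built by hand here rather than computed, and the threshold of 2 agreeing columns is an assumption.

```python
import pandas as pd

# Hand-built stand-in for the comparison result: 1.0 marks agreement on a column
index = pd.MultiIndex.from_tuples(
    [(0, 10), (1, 11), (2, 12)], names=["rec_id_1", "rec_id_2"]
)
potential_matches = pd.DataFrame(
    {"surname": [1.0, 1.0, 0.0], "address_1": [1.0, 0.0, 0.0]}, index=index
)

# Keep pairs that agree on at least 2 columns (assumed threshold)
matches = potential_matches[potential_matches.sum(axis=1) >= 2]
print(matches)
```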
...
Linking DataFrames

# Import recordlinkage and generate pairs and compare across columns
...
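Once matching pairs are known, linking could proceed as in the sketch below, which starts from a hand-built `matches` frame instead of running the comparison. The tiny `census_A` and `census_B` frames and the name `full_census` are invented for illustration.

```python
import pandas as pd

census_A = pd.DataFrame(
    {"surname": ["smith", "jones"]}, index=pd.Index([0, 1], name="rec_id_1")
)
census_B = pd.DataFrame(
    {"surname": ["smyth", "brown"]}, index=pd.Index([10, 11], name="rec_id_2")
)

# Hypothetical result: record 10 in census_B duplicates record 0 in census_A
matches = pd.DataFrame(
    {"surname": [1.0]},
    index=pd.MultiIndex.from_tuples([(0, 10)], names=["rec_id_1", "rec_id_2"]),
)

# The second index level lists census_B rows already present in census_A
duplicate_rows = matches.index.get_level_values("rec_id_2")

# Keep only the new census_B rows, then stack them under census_A
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]
full_census = pd.concat([census_A, census_B_new])
print(full_census)
```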
Chapter 2 - Text and categorical data problems
Chapter 3 - Advanced data problems
Thank you!