Cleaning Data in Python

The document outlines a course on cleaning data in Python, detailing common data problems, data type constraints, and methods for handling issues like duplicates and out-of-range values. It includes practical examples using Python code to demonstrate how to clean data effectively, including converting data types and managing categorical data. The course emphasizes the importance of data cleaning to ensure accurate analysis and decision-making.


Data type constraints

Adel Nehme
Content Developer @ DataCamp

Course outline

Chapter 1 - Common data problems

Why do we need to clean data?

Garbage in, garbage out.

Data type constraints

Datatype     Example                              Python data type
Text data    First name, last name, address ...   str
Integers     # Subscribers, # products sold ...   int
Decimals     Temperature, $ exchange rates ...    float
Binary       Is married, new customer, yes/no ... bool
Dates        Order dates, ship dates ...          datetime
Categories   Marriage status, gender ...          category

Strings to integers

# Import CSV file and output header
sales = pd.read_csv('sales.csv')
sales.head(2)

   SalesOrderID  Revenue  Quantity
0         43659   23153$        12
1         43660    1457$         2

# Get data types of columns
sales.dtypes

SalesOrderID     int64
Revenue         object
Quantity         int64
dtype: object

String to integers

# Get DataFrame information
sales.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31465 entries, 0 to 31464
Data columns (total 3 columns):
SalesOrderID    31465 non-null int64
Revenue         31465 non-null object
Quantity        31465 non-null int64
dtypes: int64(2), object(1)
memory usage: 737.5+ KB

# Print sum of the Revenue column
sales['Revenue'].sum()

'23153$1457$36865$32474$472$27510$16158$5694$6876$40487$807$6893$9153$6895$4216..

# Remove $ from Revenue column
sales['Revenue'] = sales['Revenue'].str.strip('$')
sales['Revenue'] = sales['Revenue'].astype('int')

# Verify that Revenue is now an integer
assert sales['Revenue'].dtype == 'int'
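The Revenue fix above can be sketched end to end on a toy DataFrame (the values here are made up for illustration):

```python
# Toy sketch of the Revenue fix: strings with '$' symbols cast to integers.
import pandas as pd

sales = pd.DataFrame({
    'SalesOrderID': [43659, 43660],
    'Revenue': ['23153$', '1457$'],   # stored as strings because of the '$'
    'Quantity': [12, 2],
})

assert sales['Revenue'].dtype == 'object'   # text column

# Strip the currency symbol, then cast to integer
sales['Revenue'] = sales['Revenue'].str.strip('$').astype('int')

assert sales['Revenue'].dtype == 'int'      # passes silently
print(sales['Revenue'].sum())               # prints 24610, a numeric sum
```

Summing before the fix would have concatenated the strings instead of adding the numbers.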


The assert statement

# This will pass
assert 1+1 == 2

# This will not pass
assert 1+1 == 3

AssertionError    Traceback (most recent call last)
assert 1+1 == 3
...
AssertionError:

Numeric or categorical?

marriage_status
3
1
2
...

0 = Never married, 1 = Married, 2 = Separated, 3 = Divorced

df['marriage_status'].describe()

      marriage_status
mean  1.4
std   0.20
min   0.00
50%   1.8
...

Numeric or categorical?

# Convert to categorical
df["marriage_status"] = df["marriage_status"].astype('category')
df.describe()

        marriage_status
count   241
unique  4
top     1
freq    120

Let's practice!
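The effect of the cast can be seen on a minimal sketch (the codes below are made up):

```python
# A minimal sketch of why .astype('category') changes describe().
import pandas as pd

df = pd.DataFrame({'marriage_status': [0, 1, 1, 2, 3, 1]})

# As integers, describe() computes a (meaningless) mean and std
print(df['marriage_status'].describe()['mean'])

# Cast to categorical
df['marriage_status'] = df['marriage_status'].astype('category')

# Now describe() reports count / unique / top / freq instead
summary = df['marriage_status'].describe()
assert summary['unique'] == 4   # codes 0, 1, 2, 3
assert summary['top'] == 1      # most frequent code
```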


Data range constraints

Adel Nehme
Content Developer @ DataCamp

Motivation

movies.head()

   movie_name     avg_rating
0  The Godfather  5
1  Frozen 2       ...
2  Shrek          ...
...

Motivation

import matplotlib.pyplot as plt
plt.hist(movies['avg_rating'])
plt.title('Average rating of movies (1-5)')

Can future sign-ups exist?

# Import datetime
import datetime as dt
today_date = dt.date.today()
user_signups[user_signups['subscription_date'] > dt.date.today()]

  subscription_date  user_name  ...  Country
0        01/05/2021  Marah      ...  Nauru
1        09/08/2020  Joshua     ...  Austria
2        04/01/2020  Heidi      ...  Guinea
3        11/10/2020  Rina       ...  Turkmenistan
4        11/07/2020  Christine  ...  Marshall Islands
5        07/07/2020  Ayanna     ...  Gabon


How to deal with out of range data?

Dropping data
Setting custom minimums and maximums
Treat as missing and impute
Setting a custom value depending on business assumptions

Movie example

import pandas as pd
# Output movies with rating > 5
movies[movies['avg_rating'] > 5]

    movie_name        avg_rating
23  A Beautiful Mind  6
65  La Vita e Bella   6
77  Amelie            6

# Drop values using filtering
movies = movies[movies['avg_rating'] <= 5]

# Drop values using .drop()
movies.drop(movies[movies['avg_rating'] > 5].index, inplace = True)

# Assert results
assert movies['avg_rating'].max() <= 5
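Both dropping strategies above are equivalent, as a quick sketch with hypothetical movies shows:

```python
# Sketch of the rating-cap logic; movie names and ratings are illustrative.
import pandas as pd

movies = pd.DataFrame({
    'movie_name': ['The Godfather', 'A Beautiful Mind', 'Amelie'],
    'avg_rating': [5, 6, 6],          # 6 is outside the 1-5 range
})

# Drop out-of-range rows by filtering...
filtered = movies[movies['avg_rating'] <= 5]

# ...or equivalently with .drop() on the offending index
dropped = movies.drop(movies[movies['avg_rating'] > 5].index)

assert filtered.equals(dropped)
assert filtered['avg_rating'].max() <= 5
```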

Movie example

# Convert avg_rating > 5 to 5
movies.loc[movies['avg_rating'] > 5, 'avg_rating'] = 5

# Assert statement
assert movies['avg_rating'].max() <= 5

Remember, no output means it passed.

Date range example

import pandas as pd
import datetime as dt

# Output data types
user_signups.dtypes

subscription_date    object
user_name            object
Country              object
dtype: object

# Convert to date
user_signups['subscription_date'] = pd.to_datetime(user_signups['subscription_date']).dt.date


Date range example

today_date = dt.date.today()

Drop the data:

# Drop values using filtering
user_signups = user_signups[user_signups['subscription_date'] < today_date]

# Drop values using .drop()
user_signups.drop(user_signups[user_signups['subscription_date'] > today_date].index, inplace = True)

Hardcode dates with an upper limit:

# Replace future dates with today's date using filtering
user_signups.loc[user_signups['subscription_date'] > today_date, 'subscription_date'] = today_date

# Assert is true
assert user_signups['subscription_date'].max() <= today_date

Let's practice!
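The hardcoding approach can be sketched with made-up sign-up dates (one deliberately in the future):

```python
# Sketch of the subscription-date fix; names and dates are made up.
import datetime as dt
import pandas as pd

user_signups = pd.DataFrame({
    'user_name': ['Marah', 'Joshua'],
    'subscription_date': ['01/05/2150', '09/08/2020'],  # first one is in the future
})

# Convert day-first strings to date objects
user_signups['subscription_date'] = pd.to_datetime(
    user_signups['subscription_date'], dayfirst=True).dt.date

today_date = dt.date.today()

# Hardcode future dates to today's date
user_signups.loc[user_signups['subscription_date'] > today_date,
                 'subscription_date'] = today_date

assert user_signups['subscription_date'].max() <= today_date
```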

Uniqueness constraints

Adel Nehme
Content Developer @ DataCamp

What are duplicate values?

All columns have the same values:

first_name  last_name   address                                     height  weight
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  193 cm  87 kg
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  193 cm  87 kg


What are duplicate values?

Most columns have the same values:

first_name  last_name   address                                     height  weight
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  193 cm  87 kg
Justin      Saddlemyer  Boulevard du Jardin Botanique 3, Bruxelles  194 cm  87 kg

Why do they happen?


How to find duplicate values?

# Print the header
height_weight.head()

   first_name  last_name  address                        height  weight
0  Lane        Reese      534-1559 Nam St.               181     64
1  Ivor        Pierce     102-3364 Non Road              168     66
2  Roary       Gibson     P.O. Box 344, 7785 Nisi Ave    191     99
3  Shannon     Little     691-2550 Consectetuer Street   185     65
4  Abdul       Fry        4565 Risus St.                 169     65

# Get duplicates across all columns
duplicates = height_weight.duplicated()
print(duplicates)

1     False
...   ...
22    True
23    False
...   ...

How to find duplicate rows?

# Get duplicate rows
duplicates = height_weight.duplicated()
height_weight[duplicates]

     first_name  last_name  address                               height  weight
100  Mary        Colon      4674 Ut Rd.                           179     75
101  Ivor        Pierce     102-3364 Non Road                     168     88
102  Cole        Palmer     8366 At, Street                       178     91
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83

The .duplicated() method:

subset: list of column names to check for duplication.
keep: whether to keep the first ('first'), last ('last') or all (False) duplicate values.

# Column names to check for duplication
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)


How to find duplicate rows?

# Output duplicate values
height_weight[duplicates]

     first_name  last_name  address                               height  weight
1    Ivor        Pierce     102-3364 Non Road                     168     66
22   Cole        Palmer     8366 At, Street                       178     91
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
37   Mary        Colon      4674 Ut Rd.                           179     75
100  Mary        Colon      4674 Ut Rd.                           179     75
101  Ivor        Pierce     102-3364 Non Road                     168     88
102  Cole        Palmer     8366 At, Street                       178     91
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83

# Output duplicate values, sorted
height_weight[duplicates].sort_values(by = 'first_name')

     first_name  last_name  address                               height  weight
22   Cole        Palmer     8366 At, Street                       178     91
102  Cole        Palmer     8366 At, Street                       178     91
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83
1    Ivor        Pierce     102-3364 Non Road                     168     66
101  Ivor        Pierce     102-3364 Non Road                     168     88
37   Mary        Colon      4674 Ut Rd.                           179     75
100  Mary        Colon      4674 Ut Rd.                           179     75


How to treat duplicate values?

The .drop_duplicates() method:

subset: list of column names to check for duplication.
keep: whether to keep the first ('first'), last ('last') or all (False) duplicate values.
inplace: drop duplicated rows directly inside the DataFrame without creating a new object (True).

# Drop complete duplicates
height_weight.drop_duplicates(inplace = True)

How to treat duplicate values?

# Output remaining incomplete duplicates
column_names = ['first_name','last_name','address']
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

     first_name  last_name  address                               height  weight
28   Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  195     83
103  Desirae     Shannon    P.O. Box 643, 5251 Consectetuer, Rd.  196     83
1    Ivor        Pierce     102-3364 Non Road                     168     66
101  Ivor        Pierce     102-3364 Non Road                     168     88


How to treat duplicate values?

The .groupby() and .agg() methods:

# Group by column names and produce statistical summaries
column_names = ['first_name','last_name','address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = height_weight.groupby(by = column_names).agg(summaries).reset_index()

# Make sure aggregation is done
duplicates = height_weight.duplicated(subset = column_names, keep = False)
height_weight[duplicates].sort_values(by = 'first_name')

   first_name  last_name  address  height  weight

Let's practice!
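The groupby/agg collapse can be sketched on two conflicting rows (values made up):

```python
# Sketch: combining near-duplicate rows with groupby/agg.
import pandas as pd

height_weight = pd.DataFrame({
    'first_name': ['Ivor', 'Ivor'],
    'last_name':  ['Pierce', 'Pierce'],
    'address':    ['102-3364 Non Road'] * 2,
    'height':     [168, 168],
    'weight':     [66, 88],            # conflicting measurements
})

column_names = ['first_name', 'last_name', 'address']
summaries = {'height': 'max', 'weight': 'mean'}
height_weight = (height_weight.groupby(by=column_names)
                              .agg(summaries)
                              .reset_index())

# The two rows collapse into one, with weight averaged to 77.0
assert len(height_weight) == 1
assert height_weight.loc[0, 'weight'] == 77.0
```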

Chapter 2 - Text and categorical data problems

Membership constraints

Adel Nehme
Content Developer @ DataCamp


Categories and membership constraints

Predefined finite set of categories:

Type of data               Example values           Numeric representation
Marriage Status            unmarried, married       0, 1
Household Income Category  0-20K, 20-40K, ...       0, 1, ...
Loan Status                default, payed, no_loan  0, 1, 2

Marriage status can only be unmarried or married.

Why could we have these problems?

How do we treat these problems?

An example:

# Read study data and print it
study_data = pd.read_csv('study.csv')
study_data

   name      birthday    blood_type
1  Beth      2019-10-20  B-
2  Ignatius  2020-07-08  A-
3  Paul      2019-08-12  O+
4  Helen     2019-03-17  O-
5  Jennifer  2019-12-17  Z+   <-- not a valid blood type
6  Kennedy   2020-04-27  A+
7  Keith     2019-04-19  AB+

# Correct possible blood types
categories

   blood_type
1  O-
2  O+
3  A-
4  A+
5  B+
6  B-
7  AB+
8  AB-

A note on joins:

A left anti join on blood types keeps only the rows of study_data whose blood type is not in categories; an inner join keeps only the rows whose blood type appears in both.


Finding inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
print(inconsistent_categories)

{'Z+'}

# Get and print rows with inconsistent categories
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
study_data[inconsistent_rows]

   name      birthday    blood_type
5  Jennifer  2019-12-17  Z+

Dropping inconsistent categories

inconsistent_categories = set(study_data['blood_type']).difference(categories['blood_type'])
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
inconsistent_data = study_data[inconsistent_rows]

# Drop inconsistent categories and get consistent data only
consistent_data = study_data[~inconsistent_rows]

   name      birthday    blood_type
1  Beth      2019-10-20  B-
2  Ignatius  2020-07-08  A-
3  Paul      2019-08-12  O+
4  Helen     2019-03-17  O-
...
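The whole membership check runs end to end on a small sketch (two made-up rows):

```python
# Sketch of the membership-constraint check with a shortened study_data.
import pandas as pd

study_data = pd.DataFrame({'name': ['Beth', 'Jennifer'],
                           'blood_type': ['B-', 'Z+']})
categories = pd.DataFrame({'blood_type': ['O-', 'O+', 'A-', 'A+',
                                          'B+', 'B-', 'AB+', 'AB-']})

# Set difference finds values not in the allowed categories
inconsistent_categories = set(study_data['blood_type']).difference(
    categories['blood_type'])
assert inconsistent_categories == {'Z+'}

# Filter the offending rows out
inconsistent_rows = study_data['blood_type'].isin(inconsistent_categories)
consistent_data = study_data[~inconsistent_rows]
assert list(consistent_data['name']) == ['Beth']
```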

Let's practice!

Categorical variables

Adel Nehme
Content Developer @ DataCamp

What type of errors could we have?

I) Value inconsistency
   Capitalization: 'married', 'Married', 'UNMARRIED', 'unmarried' ...
   Inconsistent fields: 'married', 'Maried', 'UNMARRIED', 'not married' ...
   Trailing white spaces: 'married ', ' married ' ...

II) Collapsing too many categories into too few
   Creating new groups: 0-20K, 20-40K categories ... from continuous household income data
   Mapping groups to new ones: mapping household income categories to 'rich', 'poor'

III) Making sure data is of type category (seen in Chapter 1)

Value consistency

# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

unmarried    352
married      268
MARRIED      204
UNMARRIED    176
dtype: int64

Value consistency

# Get value counts on DataFrame
marriage_status.groupby('marriage_status').count()

                 household_income  gender
marriage_status
MARRIED          204               204
UNMARRIED        176               176
married          268               268
unmarried        352               352

# Capitalize
marriage_status['marriage_status'] = marriage_status['marriage_status'].str.upper()
marriage_status['marriage_status'].value_counts()

UNMARRIED    528
MARRIED      472

# Lowercase
marriage_status['marriage_status'] = marriage_status['marriage_status'].str.lower()
marriage_status['marriage_status'].value_counts()

unmarried    528
married      472


Value consistency

Trailing spaces: 'married ', 'married', 'unmarried', ' unmarried' ...

# Get marriage status column
marriage_status = demographics['marriage_status']
marriage_status.value_counts()

unmarried    352
unmarried    268
married      204
married      176
dtype: int64

(The repeated labels above differ only by surrounding whitespace, which does not show when printed.)

# Strip all spaces
demographics['marriage_status'] = demographics['marriage_status'].str.strip()
demographics['marriage_status'].value_counts()

unmarried    528
married      472

Collapsing data into categories

Create categories out of data: an income_group column from the household_income column.

# Using qcut()
import pandas as pd
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.qcut(demographics['household_income'], q = 3,
                                       labels = group_names)
# Print income_group column
demographics[['income_group', 'household_income']]

   income_group  household_income
0  200K-500K     189243
1  500K+         778533
...

# Using cut() - create category ranges and names
import numpy as np
ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
# Create income group column
demographics['income_group'] = pd.cut(demographics['household_income'], bins = ranges,
                                      labels = group_names)
demographics[['income_group', 'household_income']]

   income_group  household_income
0  0-200K        189243
1  500K+         778533
...

Note that qcut() splits on quantiles, so 189243 lands in the wrong group; cut() uses the explicit ranges and assigns it correctly.
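The cut() binning can be sketched with a few hypothetical incomes:

```python
# Sketch of pd.cut with made-up incomes; np.inf serves as an open upper bound.
import numpy as np
import pandas as pd

demographics = pd.DataFrame({'household_income': [189243, 778533, 341000]})

ranges = [0, 200000, 500000, np.inf]
group_names = ['0-200K', '200K-500K', '500K+']
demographics['income_group'] = pd.cut(demographics['household_income'],
                                      bins=ranges, labels=group_names)

# Each income lands in the bin whose range contains it
assert list(demographics['income_group']) == ['0-200K', '500K+', '200K-500K']
```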


Collapsing data into categories

Map categories to fewer ones: reducing the number of categories in a categorical column.

operating_system column is: 'Microsoft', 'MacOS', 'IOS', 'Android', 'Linux'
operating_system column should become: 'DesktopOS', 'MobileOS'

# Create mapping dictionary and replace
mapping = {'Microsoft':'DesktopOS', 'MacOS':'DesktopOS', 'Linux':'DesktopOS',
           'IOS':'MobileOS', 'Android':'MobileOS'}
devices['operating_system'] = devices['operating_system'].replace(mapping)
devices['operating_system'].unique()

array(['DesktopOS', 'MobileOS'], dtype=object)

Let's practice!

Cleaning text data

Adel Nehme
Content Developer @ DataCamp

What is text data?

Type of data    Example values
Names           Alex, Sara ...
Phone numbers   +96171679912 ...
Emails          adel@datacamp.com ...
Passwords       ...

Common text data problems:

1) Data inconsistency: +96171679912 or 0096171679912 or ..?
2) Fixed length violations: passwords need to be at least 8 characters
3) Typos: +961.71.679912


Example

phones = pd.read_csv('phones.csv')
print(phones)

            Full name      Phone number
0     Noelani A. Gray  001-702-397-5143
1      Myles Z. Gomez  001-329-485-0540
2        Gil B. Silva  001-195-492-2338
3  Prescott D. Hardin   +1-297-996-4904   <-- Inconsistent data format
4  Benedict G. Valdez  001-969-820-3536
5    Reece M. Andrews              4138   <-- Length violation
6      Hayfa E. Keith  001-536-175-8444
7     Hedley I. Logan  001-681-552-1823
8    Jack W. Carrillo  001-910-323-5265
9     Lionel M. Davis  001-143-119-9210

Example - the target output after cleaning:

            Full name   Phone number
0     Noelani A. Gray  0017023975143
1      Myles Z. Gomez  0013294850540
2        Gil B. Silva  0011954922338
3  Prescott D. Hardin  0012979964904
4  Benedict G. Valdez  0019698203536
5    Reece M. Andrews            NaN
6      Hayfa E. Keith  0015361758444
7     Hedley I. Logan  0016815521823
8    Jack W. Carrillo  0019103235265
9     Lionel M. Davis  0011431199210

Fixing the phone number column

# Replace "+" with "00"
phones["Phone number"] = phones["Phone number"].str.replace("+", "00")
phones

            Full name      Phone number
0     Noelani A. Gray  001-702-397-5143
1      Myles Z. Gomez  001-329-485-0540
2        Gil B. Silva  001-195-492-2338
3  Prescott D. Hardin  001-297-996-4904
4  Benedict G. Valdez  001-969-820-3536
5    Reece M. Andrews              4138
6      Hayfa E. Keith  001-536-175-8444
7     Hedley I. Logan  001-681-552-1823
8    Jack W. Carrillo  001-910-323-5265
9     Lionel M. Davis  001-143-119-9210


Fixing the phone number column

# Replace "-" with nothing
phones["Phone number"] = phones["Phone number"].str.replace("-", "")
phones

            Full name   Phone number
0     Noelani A. Gray  0017023975143
1      Myles Z. Gomez  0013294850540
2        Gil B. Silva  0011954922338
3  Prescott D. Hardin  0012979964904
4  Benedict G. Valdez  0019698203536
5    Reece M. Andrews           4138
6      Hayfa E. Keith  0015361758444
7     Hedley I. Logan  0016815521823
8    Jack W. Carrillo  0019103235265
9     Lionel M. Davis  0011431199210

# Replace phone numbers with lower than 10 digits to NaN
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, "Phone number"] = np.nan
phones

            Full name   Phone number
0     Noelani A. Gray  0017023975143
1      Myles Z. Gomez  0013294850540
2        Gil B. Silva  0011954922338
3  Prescott D. Hardin  0012979964904
4  Benedict G. Valdez  0019698203536
5    Reece M. Andrews            NaN
6      Hayfa E. Keith  0015361758444
7     Hedley I. Logan  0016815521823
8    Jack W. Carrillo  0019103235265
9     Lionel M. Davis  0011431199210
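The three fixes chain together on a toy sample (three made-up numbers; regex=False is passed explicitly so "+" is treated literally in all pandas versions):

```python
# Sketch of the phone fixes on made-up numbers.
import numpy as np
import pandas as pd

phones = pd.DataFrame({'Phone number': ['001-702-397-5143',
                                        '+1-297-996-4904',
                                        '4138']})

# Replace "+" with "00", then drop the dashes
phones['Phone number'] = (phones['Phone number']
                          .str.replace('+', '00', regex=False)
                          .str.replace('-', '', regex=False))

# Numbers shorter than 10 digits become missing
digits = phones['Phone number'].str.len()
phones.loc[digits < 10, 'Phone number'] = np.nan

assert phones['Phone number'].str.len().min() >= 10
```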

Fixing the phone number column

# Find length of each row in Phone number column
sanity_check = phones['Phone number'].str.len()

# Assert minimum phone number length is 10
assert sanity_check.min() >= 10

# Assert all numbers do not have "+" or "-"
assert phones['Phone number'].str.contains(r'\+|-').any() == False

Remember, assert returns nothing if the condition passes.

But what about more complicated examples?

phones.head()

          Full name    Phone number
0     Olga Robinson  +(01706)-25891
1       Justina Kim    +0500-571437
2    Tamekah Henson      +0800-1111
3     Miranda Solis   +07058-879063
4  Caldwell Gilliam  +(016977)-8424

Regular expressions give us a supercharged Ctrl+F.


Regular expressions in action

# Replace anything that is not a digit with nothing
phones['Phone number'] = phones['Phone number'].str.replace(r'\D+', '', regex = True)
phones.head()

          Full name  Phone number
0     Olga Robinson    0170625891
1       Justina Kim    0500571437
2    Tamekah Henson      08001111
3     Miranda Solis   07058879063
4  Caldwell Gilliam    0169778424

Let's practice!

Chapter 3 - Advanced data problems

Uniformity

Adel Nehme
Content Developer @ DataCamp


Uniformity

Column       Unit
Temperature  32°C is also 89.6°F
Weight       70 kg is also 11 st.
Date         26-11-2019 is also 26, November, 2019
Money        100$ is also 10763.90¥

An example

temperatures = pd.read_csv('temperature.csv')
temperatures.head()

   Date      Temperature
0  03.03.19  14.0
1  04.03.19  15.0
2  05.03.19  18.0
3  06.03.19  16.0
4  07.03.19  62.6   <--


An example

# Import matplotlib
import matplotlib.pyplot as plt
# Create scatter plot
plt.scatter(x = 'Date', y = 'Temperature', data = temperatures)
# Create title, xlabel and ylabel
plt.title('Temperature in Celsius March 2019 - NYC')
plt.xlabel('Dates')
plt.ylabel('Temperature in Celsius')
# Show plot
plt.show()

Treating temperature data

C = (F - 32) × 5/9

temp_fah = temperatures.loc[temperatures['Temperature'] > 40, 'Temperature']
temp_cels = (temp_fah - 32) * (5/9)
temperatures.loc[temperatures['Temperature'] > 40, 'Temperature'] = temp_cels

# Assert conversion is correct
assert temperatures['Temperature'].max() < 40
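The conversion runs end to end on a small sketch (readings above 40 are assumed to be Fahrenheit, mirroring the rule above):

```python
# Sketch of the unit fix with made-up readings.
import pandas as pd

temperatures = pd.DataFrame({'Temperature': [14.0, 15.0, 62.6]})

# Convert suspected Fahrenheit readings (> 40) to Celsius
fah_rows = temperatures['Temperature'] > 40
temperatures.loc[fah_rows, 'Temperature'] = (
    (temperatures.loc[fah_rows, 'Temperature'] - 32) * (5 / 9))

assert temperatures['Temperature'].max() < 40
print(temperatures['Temperature'].round(1).tolist())  # [14.0, 15.0, 17.0]
```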


Treating date data

birthdays.head()

   Birthday         First name  Last name
0  27/27/19         Rowan       Nunez
1  03-29-19         Brynn       Yang
2  March 3rd, 2019  Sophia      Reilly
3  24-03-19         Deacon      Prince
4  06-03-19         Griffith    Neal

Datetime formatting

datetime is useful for representing dates.

Date                datetime format
25-12-2019          %d-%m-%Y
December 25th 2019  %c
12-25-2019          %m-%d-%Y
...                 ...

Treating date data

pandas.to_datetime() can recognize most formats automatically, but sometimes fails with erroneous or unrecognizable formats.

# Converts to datetime - but won't work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'])

ValueError: month must be in 1..12

# Will work!
birthdays['Birthday'] = pd.to_datetime(birthdays['Birthday'],
                                       # Attempt to infer format of each date
                                       infer_datetime_format=True,
                                       # Return NA for rows where conversion failed
                                       errors = 'coerce')


Treating date data

birthdays.head()

   Birthday    First name  Last name
0  NaT         Rowan       Nunez
1  2019-03-29  Brynn       Yang
2  2019-03-03  Sophia      Reilly
3  2019-03-24  Deacon      Prince
4  2019-06-03  Griffith    Neal

birthdays['Birthday'] = birthdays['Birthday'].dt.strftime("%d-%m-%Y")
birthdays.head()

   Birthday    First name  Last name
0  NaT         Rowan       Nunez
1  29-03-2019  Brynn       Yang
2  03-03-2019  Sophia      Reilly
3  24-03-2019  Deacon      Prince
4  03-06-2019  Griffith    Neal

Treating ambiguous date data

Is 2019-03-08 in August or March?

Convert to NA and treat accordingly
Infer the format by understanding the data source
Infer the format by understanding previous and subsequent data in the DataFrame

Let's practice!


Cross field validation

Adel Nehme
Content Developer @ DataCamp

Motivation

import pandas as pd
flights = pd.read_csv('flights.csv')
flights.head()

   flight_number  economy_class  business_class  first_class  total_passengers
0  DL140          100            60              40           200
1  BA248          130            100             70           300
2  MEA124         100            50              50           200
3  AFR939         140            70              90           300
4  TKA101         130            100             20           250

Cross field validation

The use of multiple fields in a dataset to sanity check data integrity.

   flight_number  economy_class + business_class + first_class = total_passengers
0  DL140          100 + 60 + 40 = 200
1  BA248          130 + 100 + 70 = 300
2  MEA124         100 + 50 + 50 = 200
3  AFR939         140 + 70 + 90 = 300
4  TKA101         130 + 100 + 20 = 250

sum_classes = flights[['economy_class', 'business_class', 'first_class']].sum(axis = 1)
passenger_equ = sum_classes == flights['total_passengers']
# Find and filter out rows with inconsistent passenger totals
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

users.head()

   user_id  Age  Birthday
0  32985    22   1998-03-02
1  94387    27   1993-12-04
2  34236    42   1978-11-24
3  12551    31   1989-01-03
4  55212    18   2002-07-02
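The row-sum check can be sketched with two hypothetical flights, one deliberately inconsistent:

```python
# Sketch of cross field validation on passenger totals.
import pandas as pd

flights = pd.DataFrame({
    'economy_class':    [100, 130],
    'business_class':   [60, 100],
    'first_class':      [40, 70],
    'total_passengers': [200, 290],   # second row should be 300
})

sum_classes = flights[['economy_class', 'business_class',
                       'first_class']].sum(axis=1)
passenger_equ = sum_classes == flights['total_passengers']

# Split consistent and inconsistent rows
inconsistent_pass = flights[~passenger_equ]
consistent_pass = flights[passenger_equ]

assert len(inconsistent_pass) == 1
assert len(consistent_pass) == 1
```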


Cross field validation

import pandas as pd
import datetime as dt

# Convert to datetime and get today's date
users['Birthday'] = pd.to_datetime(users['Birthday'])
today = dt.date.today()
# For each row in the Birthday column, calculate year difference
age_manual = today.year - users['Birthday'].dt.year
# Find instances where ages match
age_equ = age_manual == users['Age']
# Find and filter out rows with inconsistent age
inconsistent_age = users[~age_equ]
consistent_age = users[age_equ]

What to do when we catch inconsistencies?

Let's practice!

Completeness

Adel Nehme
Content Developer @ DataCamp

What is missing data?

Can be represented as NA, nan, 0, ., ...
Technical error
Human error

Airquality example

import pandas as pd
airquality = pd.read_csv('airquality.csv')
print(airquality)

      Date        Temperature  CO2
987   20/04/2004  16.8         0.0
2119  07/06/2004  18.7         0.8
2451  20/06/2004  -40.0        NaN   <--
1984  01/06/2004  19.6         1.8
8299  19/02/2005  11.2         1.2
...   ...         ...          ...

Airquality example

# Return missing values
airquality.isna()

      Date   Temperature  CO2
987   False  False        False
2119  False  False        False
2451  False  False        True
1984  False  False        False
8299  False  False        False


Airquality example

# Get summary of missingness
airquality.isna().sum()

Date             0
Temperature      0
CO2            366
dtype: int64

Missingno

Useful package for visualizing and understanding missing data:

import missingno as msno
import matplotlib.pyplot as plt
# Visualize missingness
msno.matrix(airquality)
plt.show()

Airquality example

# Isolate missing and complete values aside
missing = airquality[airquality['CO2'].isna()]
complete = airquality[~airquality['CO2'].isna()]


Airquality example

# Describe complete DataFrame
complete.describe()

       Temperature          CO2
count  8991.000000  8991.000000
mean     18.317829     1.739584
std       8.832116     1.537580
min      -1.900000     0.000000
...            ...          ...
max      44.600000    11.900000

# Describe missing DataFrame
missing.describe()

       Temperature  CO2
count   366.000000  0.0
mean    -39.655738  NaN   <--
std       5.988716  NaN
min     -49.000000  NaN   <--
...            ...  ...
max     -30.000000  NaN   <--

sorted_airquality = airquality.sort_values(by = 'Temperature')
msno.matrix(sorted_airquality)
plt.show()


Missingness types


How to deal with missing data?

Simple approaches:
1. Drop missing data
2. Impute with statistical measures (mean, median, mode, ...)

More complex approaches:
1. Imputing using an algorithmic approach
2. Imputing with machine learning models

Dealing with missing data

airquality.head()

   Date        Temperature  CO2
0  05/03/2005  8.5          2.5
1  23/08/2004  21.8         0.0
2  18/02/2005  6.3          1.0
3  08/02/2005  -31.0        NaN
4  13/03/2005  19.9         0.1

Dropping missing values

# Drop missing values
airquality_dropped = airquality.dropna(subset = ['CO2'])
airquality_dropped.head()

   Date        Temperature  CO2
0  05/03/2005  8.5          2.5
1  23/08/2004  21.8         0.0
2  18/02/2005  6.3          1.0
4  13/03/2005  19.9         0.1
5  02/04/2005  17.0         0.8

Replacing with statistical measures

co2_mean = airquality['CO2'].mean()
airquality_imputed = airquality.fillna({'CO2': co2_mean})
airquality_imputed.head()

   Date        Temperature  CO2
0  05/03/2005  8.5          2.500000
1  23/08/2004  21.8         0.000000
2  18/02/2005  6.3          1.000000
3  08/02/2005  -31.0        1.739584
4  13/03/2005  19.9         0.100000
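Both options can be sketched side by side on three made-up readings:

```python
# Sketch of dropping vs imputing missing CO2 values.
import numpy as np
import pandas as pd

airquality = pd.DataFrame({'Temperature': [8.5, 21.8, -31.0],
                           'CO2': [2.5, 0.0, np.nan]})

# Option 1: drop rows where CO2 is missing
airquality_dropped = airquality.dropna(subset=['CO2'])
assert len(airquality_dropped) == 2

# Option 2: impute missing CO2 with the column mean
co2_mean = airquality['CO2'].mean()          # mean of the non-missing values: 1.25
airquality_imputed = airquality.fillna({'CO2': co2_mean})
assert airquality_imputed['CO2'].isna().sum() == 0
assert airquality_imputed.loc[2, 'CO2'] == 1.25
```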


Let's practice!

Chapter 4 - Record linkage

Comparing strings

Adel Nehme
Content Developer @ DataCamp

Minimum edit distance

The least possible number of steps needed to transition from one string to another.


Minimum edit distance

Minimum edit distance

Minimum edit distance so far: 2
Minimum edit distance: 5



Minimum edit distance algorithms
Algorithm Operations
Damerau-Levenshtein insertion, substitution, deletion, transposition
Levenshtein insertion, substitution, deletion
Hamming substitution only
Jaro distance transposition only
... ...

Possible packages: nltk, thefuzz, textdistance...
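As a sketch of what these packages compute under the hood, here is a minimal pure-Python Levenshtein distance (insertion, substitution, deletion only; the function name is ours, not from any of the packages above):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string a into string b (dynamic programming)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

print(levenshtein('Reeding', 'Reading'))  # 1 (one substitution: e -> a)
print(levenshtein('sign', 'sing'))        # 2 (Levenshtein has no transposition)
```

Note that 'sign' to 'sing' costs 2 here, whereas Damerau-Levenshtein would count the swap as a single transposition.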


Simple string comparison

# Lets us compare between two strings
from thefuzz import fuzz

# Compare reeding vs reading
fuzz.WRatio('Reeding', 'Reading')

86



Partial strings and different orderings

# Partial string comparison
fuzz.WRatio('Houston Rockets', 'Rockets')

90

# Partial string comparison with different order
fuzz.WRatio('Houston Rockets vs Los Angeles Lakers', 'Lakers vs Rockets')

86

Comparison with arrays

# Import process
from thefuzz import process

# Define string and array of possible matches
string = "Houston Rockets vs Los Angeles Lakers"
choices = pd.Series(['Rockets vs Lakers', 'Lakers vs Rockets',
                     'Houson vs Los Angeles', 'Heat vs Bulls'])

process.extract(string, choices, limit = 2)

[('Rockets vs Lakers', 86, 0), ('Lakers vs Rockets', 86, 1)]
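If installing thefuzz is not an option, the standard library's difflib offers a rough equivalent (scores are SequenceMatcher ratios on a 0-1 scale, so the numbers differ from WRatio's 0-100 scores; the example data is ours):

```python
from difflib import SequenceMatcher, get_close_matches

# Similarity ratio between two strings (0.0 to 1.0)
score = SequenceMatcher(None, 'Reeding', 'Reading').ratio()
print(round(score, 3))  # 0.857

# Closest matches from an array of choices, best first, above a cutoff
matches = get_close_matches('California',
                            ['Cali', 'Calefornia', 'New York'],
                            n=2, cutoff=0.8)
print(matches)  # ['Calefornia']
```

Unlike fuzz.WRatio, SequenceMatcher does not handle partial substrings or reordered tokens specially, so thefuzz remains the better fit for those cases.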


Collapsing categories with string similarity

Chapter 2: Use .replace() to collapse "eur" into "Europe"

What if there are too many variations?
"EU", "eur", "Europ", "Europa", "Erope", "Evropa" ...

String similarity!

Collapsing categories with string matching

print(survey['state'].unique())

  state
0 California
1 Cali
2 Calefornia
3 Calefornie
4 Californie
5 Calfornia
6 Calefernia
7 New York
8 New York City
...

categories

  state
0 California
1 New York



Collapsing all of the state

# For each correct category
for state in categories['state']:
  # Find potential matches in states with typos
  matches = process.extract(state, survey['state'], limit = survey.shape[0])
  # For each potential match
  for potential_match in matches:
    # If high similarity score
    if potential_match[1] >= 80:
      # Replace typo with correct category
      survey.loc[survey['state'] == potential_match[0], 'state'] = state
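The same collapse can be sketched with only the standard library, swapping process.extract for difflib (the 0.8 cutoff here is difflib's 0-1 ratio, not thefuzz's 0-100 score; the lists are illustrative):

```python
from difflib import SequenceMatcher

states = ['California', 'Cali', 'Calefornia', 'Calefornie',
          'New York', 'New York City']
categories = ['California', 'New York']

# For each correct category, overwrite values that are similar enough
for category in categories:
    for i, value in enumerate(states):
        # If high similarity score, replace typo with correct category
        if SequenceMatcher(None, value, category).ratio() >= 0.8:
            states[i] = category

print(states)
# ['California', 'Cali', 'California', 'California', 'New York', 'New York City']
```

Note how the cutoff matters: 'Cali' and 'New York City' score below 0.8 against their intended categories, so they survive untouched and would need a higher-recall scorer or manual mapping.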


Generating pairs
Let's practice!

Adel Nehme
Content Developer @ DataCamp
Motivation

When joins won't work


Record linkage

Our DataFrames


census_A

given_name surname date_of_birth suburb state address_1


rec_id
rec-1070-org michaela neumann 19151111 winston hills cal stanley street
rec-1016-org courtney painter 19161214 richlands txs pinkerton circuit
...

census_B

given_name surname date_of_birth suburb state address_1


rec_id
rec-561-dup-0 elton NaN 19651013 windermere ny light setreet
rec-2642-dup-0 mitchell maxon 19390212 north ryde cal edkins street
...

The recordlinkage package



Generating pairs


Blocking

Generating pairs


# Import recordlinkage
import recordlinkage

# Create indexing object


indexer = recordlinkage.Index()

# Generate pairs blocked on state


indexer.block('state')
pairs = indexer.index(census_A, census_B)
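Blocking itself is simple to sketch without the library: only records that share a value of the blocking key (here 'state') become candidate pairs, instead of the full cross product (toy records and ids are ours, not from the census data):

```python
from itertools import product
from collections import defaultdict

# Toy census records: (record id, state)
census_A = [('rec-1', 'cal'), ('rec-2', 'txs')]
census_B = [('rec-9', 'cal'), ('rec-8', 'ny'), ('rec-7', 'cal')]

# Group each table's record ids by the blocking key
block_A, block_B = defaultdict(list), defaultdict(list)
for rec_id, state in census_A:
    block_A[state].append(rec_id)
for rec_id, state in census_B:
    block_B[state].append(rec_id)

# Candidate pairs come only from matching blocks
pairs = [(a, b)
         for state in block_A
         for a, b in product(block_A[state], block_B.get(state, []))]

print(pairs)  # [('rec-1', 'rec-9'), ('rec-1', 'rec-7')]
```

Here blocking cuts 2 x 3 = 6 possible pairs down to 2, which is the whole point: comparison cost grows with the number of pairs, not rows.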



Generating pairs

print(pairs)

MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org',
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', 'rec-1146-org',
'rec-1157-org', 'rec-1165-org', 'rec-1185-org', 'rec-1234-org', 'rec-1271-org',
'rec-1280-org',...........
66, 14, 13, 18, 34, 39, 0, 16, 80, 50, 20, 69, 28, 25, 49, 77, 51, 85, 52, 63, 74, 61,
83, 91, 22, 26, 55, 84, 11, 81, 97, 56, 27, 48, 2, 64, 5, 17, 29, 60, 72, 47, 92, 12,
95, 15, 19, 57, 37, 70, 94]], names=['rec_id_1', 'rec_id_2'])

Comparing the DataFrames

# Generate the pairs
pairs = indexer.index(census_A, census_B)

# Create a Compare object
compare_cl = recordlinkage.Compare()

# Find exact matches for pairs of date_of_birth and state
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')

# Find similar matches for pairs of surname and address_1 using string similarity
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

# Find matches
potential_matches = compare_cl.compute(pairs, census_A, census_B)
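Conceptually, the comparison step just scores each candidate pair column by column: 1/0 for exact matches, and 1.0 when a string-similarity ratio clears the threshold. A stdlib sketch of that idea (made-up records; this is not recordlinkage's internals):

```python
from difflib import SequenceMatcher

def compare_pair(rec_a, rec_b, exact_cols, string_cols, threshold=0.85):
    """Score one candidate pair: 1/0 per exact column, 1.0/0.0 per string column."""
    scores = {}
    for col in exact_cols:
        scores[col] = int(rec_a[col] == rec_b[col])
    for col in string_cols:
        ratio = SequenceMatcher(None, rec_a[col], rec_b[col]).ratio()
        scores[col] = 1.0 if ratio >= threshold else 0.0
    return scores

a = {'date_of_birth': '19151111', 'state': 'nsw', 'surname': 'neumann'}
b = {'date_of_birth': '19151111', 'state': 'nsw', 'surname': 'neumman'}
print(compare_pair(a, b, ['date_of_birth', 'state'], ['surname']))
# {'date_of_birth': 1, 'state': 1, 'surname': 1.0}
```

Summing a row of such scores is exactly what the later filtering step does to decide whether a pair is a probable match.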


Finding matching pairs

print(potential_matches)

                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2
rec-1070-org rec-561-dup-0               0      1      0.0        0.0
             rec-2642-dup-0              0      1      0.0        0.0
             rec-608-dup-0               0      1      0.0        0.0
...
rec-1631-org rec-4070-dup-0              0      1      0.0        0.0
             rec-4862-dup-0              0      1      0.0        0.0
             rec-629-dup-0               0      1      0.0        0.0
...

Finding the only pairs we want

potential_matches[potential_matches.sum(axis = 1) >= 2]

                             date_of_birth  state  surname  address_1
rec_id_1     rec_id_2
rec-4878-org rec-4878-dup-0              1      1      1.0        0.0
rec-417-org  rec-2867-dup-0              0      1      0.0        1.0
rec-3964-org rec-394-dup-0               0      1      1.0        0.0
rec-1373-org rec-4051-dup-0              0      1      1.0        0.0
             rec-802-dup-0               0      1      1.0        0.0
rec-3540-org rec-470-dup-0               0      1      1.0        0.0
...



Linking DataFrames
Let's practice!

Adel Nehme
Content Developer @ DataCamp

Record linkage



Our DataFrames

census_A

               given_name  surname  date_of_birth  suburb         state  address_1
rec_id
rec-1070-org   michaela    neumann  19151111       winston hills  nsw    stanley street
rec-1016-org   courtney    painter  19161214       richlands      vic    pinkerton circuit
...

census_B

               given_name  surname  date_of_birth  suburb         state  address_1
rec_id
rec-561-dup-0  elton       NaN      19651013       windermere     vic    light setreet
rec-2642-dup-0 mitchell    maxon    19390212       north ryde     nsw    edkins street
...

What we've already done

# Import recordlinkage and generate full pairs
import recordlinkage
indexer = recordlinkage.Index()
indexer.block('state')
full_pairs = indexer.index(census_A, census_B)

# Comparison step
compare_cl = recordlinkage.Compare()
compare_cl.exact('date_of_birth', 'date_of_birth', label='date_of_birth')
compare_cl.exact('state', 'state', label='state')
compare_cl.string('surname', 'surname', threshold=0.85, label='surname')
compare_cl.string('address_1', 'address_1', threshold=0.85, label='address_1')

potential_matches = compare_cl.compute(full_pairs, census_A, census_B)


What we're doing now

Our potential matches

potential_matches

Probable matches

matches = potential_matches[potential_matches.sum(axis = 1) >= 3]
print(matches)



Get the indices

matches.index

MultiIndex(levels=[['rec-1007-org', 'rec-1016-org', 'rec-1054-org', 'rec-1066-org',
'rec-1070-org', 'rec-1075-org', 'rec-1080-org', 'rec-110-org', ...

# Get indices from census_B only
duplicate_rows = matches.index.get_level_values(1)
print(duplicate_rows)

Index(['rec-2404-dup-0', 'rec-4178-dup-0', 'rec-1054-dup-0', 'rec-4663-dup-0',
'rec-485-dup-0', 'rec-2950-dup-0', 'rec-1234-dup-0', ... , 'rec-299-dup-0'])


Linking DataFrames

# Import recordlinkage and generate pairs and compare across columns
...

# Generate potential matches
potential_matches = compare_cl.compute(full_pairs, census_A, census_B)

# Isolate matches with matching values for 3 or more columns
matches = potential_matches[potential_matches.sum(axis = 1) >= 3]

# Get index for matching census_B rows only
duplicate_rows = matches.index.get_level_values(1)

# Finding duplicates in census_B
census_B_duplicates = census_B[census_B.index.isin(duplicate_rows)]

# Finding new rows in census_B
census_B_new = census_B[~census_B.index.isin(duplicate_rows)]

# Link the DataFrames!
full_census = pd.concat([census_A, census_B_new])
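Stripped of pandas, the final linking logic is just set membership on the matched ids (a toy sketch; the ids and values are illustrative):

```python
# Ids flagged as duplicates of census_A rows (level 1 of the matches index)
duplicate_rows = {'rec-561-dup-0'}

census_A = {'rec-1070-org': 'michaela neumann', 'rec-1016-org': 'courtney painter'}
census_B = {'rec-561-dup-0': 'elton', 'rec-2642-dup-0': 'mitchell maxon'}

# Keep only census_B rows that are NOT duplicates, then combine
census_B_new = {k: v for k, v in census_B.items() if k not in duplicate_rows}
full_census = {**census_A, **census_B_new}

print(sorted(full_census))
# ['rec-1016-org', 'rec-1070-org', 'rec-2642-dup-0']
```

The duplicate census_B row is dropped rather than merged, which is the course's convention: census_A is treated as the authoritative copy of any linked record.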



Congratulations!
Let's practice!

Adel Nehme
Content Developer @ DataCamp

What we've learned

Chapter 1 - Common data problems



What we've learned

Chapter 2 - Text and categorical data problems
Chapter 3 - Advanced data problems


What we've learned

Chapter 4 - Record linkage

More to learn on DataCamp!

Working with Dates and Times in Python
Regular Expressions in Python
Dealing with Missing Data in Python
And more!



More to learn!


Thank you!
