
Data Engineering 101: Data Cleaning using PySpark

Shwetank Singh
GritSetGrow - GSGLearn.com
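
All of the snippets below assume an active SparkSession and the aliases used throughout the deck. A minimal setup sketch (the app name and file path are placeholders, not part of the original slides):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Create (or reuse) a SparkSession; the app name is arbitrary
spark = SparkSession.builder.appName('data-cleaning-101').getOrCreate()

# Load a DataFrame to clean; replace the path and options with your own source
df = spark.read.csv('path/to/data.csv', header=True, inferSchema=True)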

dropDuplicates()
Removes duplicate rows from the
DataFrame based on specified columns

df = df.dropDuplicates(['id', 'name'])

Removes duplicate rows from the DataFrame, keeping only unique combinations of 'id' and 'name'
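
As a quick illustration with a made-up DataFrame, rows sharing the same ('id', 'name') pair collapse to a single (arbitrarily chosen) row, regardless of the other columns:

data = [(1, 'Ann', 'HR'), (1, 'Ann', 'Finance'), (2, 'Bob', 'IT')]
toy_df = spark.createDataFrame(data, ['id', 'name', 'department'])

# Two rows remain: one of the (1, 'Ann', ...) rows and (2, 'Bob', 'IT')
toy_df.dropDuplicates(['id', 'name']).show()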


dropna()
Removes rows with null values in
specified columns

df = df.dropna(subset=['important_column'])

Removes any row where the 'important_column' contains a null value
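
dropna() also accepts 'how' and 'thresh' arguments when the rule is not tied to specific columns; a brief sketch of the common variants:

# Drop rows where every column is null
df = df.dropna(how='all')

# Drop rows that have fewer than 2 non-null values
df = df.dropna(thresh=2)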


fillna()
Replaces null values with specified
values

df = df.fillna({'age': 0, 'salary': 50000})

Replaces null values in the 'age' column with 0 and in the 'salary' column with 50000


replace()
Replaces specified values with new
values

df = df.replace({'old': 'legacy', 'new': 'current'}, subset=['status'])

Replaces 'old' with 'legacy' and 'new' with 'current' in the 'status' column. Note that PySpark's replace() takes a flat value-to-replacement mapping plus an optional column subset, not a pandas-style nested per-column dictionary.


cast()
Changes the data type of a column

df = df.withColumn('salary', df['salary'].cast('double'))

Converts the 'salary' column to the double data type


withColumn()
Adds a new column or replaces an
existing one

df = df.withColumn('full_name', F.concat(df['first_name'], F.lit(' '), df['last_name']))

Creates a new 'full_name' column by concatenating 'first_name', a space, and 'last_name'


drop()
Removes specified columns from the
DataFrame

df = df.drop('unnecessary_column')

Removes the column named 'unnecessary_column' from the DataFrame


withColumnRenamed()
Renames a column in the DataFrame (PySpark DataFrames have no rename() method)

df = df.withColumnRenamed('old_name', 'new_name')

Renames the column 'old_name' to 'new_name'


trim()
Removes leading and trailing
whitespace from string columns

df = df.withColumn('name', F.trim(df['name']))

Removes leading and trailing spaces from the 'name' column


regexp_replace()
Replaces substrings matching a regular
expression

df = df.withColumn('phone', F.regexp_replace(df['phone'], r'[^\d]', ''))

Removes all non-digit characters from the 'phone' column


filter()
Filters rows based on a condition

df = df.filter(df['age'] > 18)

Keeps only the rows where 'age' is greater than 18


where()
Another method to filter rows based on
a condition

df = df.where(df['status'] == 'active')

Keeps only the rows where the 'status' is 'active'


distinct()
Removes duplicate rows from the
DataFrame

df = df.select('category').distinct()

Returns a DataFrame with unique 'category' values


unionByName()
Combines two DataFrames, aligning
them by column names

df_combined = df1.unionByName(df2, allowMissingColumns=True)

Combines df1 and df2, matching columns by name and allowing missing columns
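
As a small made-up example of the allowMissingColumns behaviour, columns present on only one side are filled with nulls on the other:

df1 = spark.createDataFrame([(1, 'Ann')], ['id', 'name'])
df2 = spark.createDataFrame([(2, 30)], ['id', 'age'])

# Result has columns id, name, age; the missing values become null
df1.unionByName(df2, allowMissingColumns=True).show()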


join()
Combines two DataFrames based on a
key

df_joined = df1.join(df2, on='id', how='left')

Performs a left join of df1 and df2 on the 'id' column


groupBy()
Groups the DataFrame by specified
columns

df_grouped = df.groupBy('department').agg(F.avg('salary').alias('avg_salary'))

Groups by 'department' and calculates the average salary for each group
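
agg() can take several aggregations in one pass, so related summary columns are computed together; a sketch assuming the same 'department' and 'salary' columns:

df_grouped = (
    df.groupBy('department')
      .agg(
          F.avg('salary').alias('avg_salary'),
          F.max('salary').alias('max_salary'),
          F.count('*').alias('employees'),
      )
)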


agg()
Performs aggregations on grouped data

df_summary = df.agg(F.min('age'), F.max('age'), F.avg('salary'))

Calculates the minimum and maximum age, and the average salary, across the entire DataFrame


pivot()
Pivots a DataFrame from long to wide
format

df_pivoted = df.groupBy('date').pivot('category').sum('amount')

Creates a pivot table with 'date' as rows, 'category' values as columns, and the sum of 'amount' as values
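
For intuition, a toy example (data invented here): three long-format rows become one wide row per date, with one column per category and nulls where a combination is absent:

data = [('2024-01-01', 'food', 10), ('2024-01-01', 'fuel', 20), ('2024-01-02', 'food', 5)]
sales = spark.createDataFrame(data, ['date', 'category', 'amount'])

# Columns: date, food, fuel; the 2024-01-02 row has null under fuel
sales.groupBy('date').pivot('category').sum('amount').show()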


explode()
Splits array columns into multiple rows

df = df.withColumn('item', F.explode(df['items_list']))

Creates a new row for each element in the 'items_list' array
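
A minimal made-up example: each array element becomes its own row, with the other columns repeated alongside it:

orders = spark.createDataFrame([(1, ['a', 'b'])], ['order_id', 'items_list'])

# Produces two rows: (1, 'a') and (1, 'b')
orders.withColumn('item', F.explode(orders['items_list'])).select('order_id', 'item').show()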


coalesce()
Returns the first non-null value in a list
of columns

df = df.withColumn('best_contact', F.coalesce(df['email'], df['phone'], df['address']))

Creates a 'best_contact' column with the first non-null value from email, phone, or address
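
F.coalesce() is evaluated per row, left to right; a toy sketch with invented contact data:

contacts = spark.createDataFrame(
    [(None, '555-0100', 'Main St'), ('a@x.com', None, None)],
    ['email', 'phone', 'address'],
)

# Row 1 -> '555-0100' (email is null), row 2 -> 'a@x.com'
contacts.withColumn('best_contact', F.coalesce('email', 'phone', 'address')).show()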


when()
Conditional operations in DataFrame

df = df.withColumn('age_group', F.when(df['age'] < 18, 'minor').otherwise('adult'))

Creates an 'age_group' column, assigning 'minor' if age < 18, otherwise 'adult'
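
when() calls can be chained before otherwise() to build more than two buckets; a sketch with made-up thresholds:

df = df.withColumn(
    'age_group',
    F.when(df['age'] < 13, 'child')
     .when(df['age'] < 18, 'minor')
     .otherwise('adult'),
)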


split()
Splits a string column into an array

df = df.withColumn('first_name', F.split(df['full_name'], ' ').getItem(0))

Splits 'full_name' by space and takes the first item as 'first_name'


substring()
Extracts a substring from a string
column

df = df.withColumn('year', F.substring(df['date'], 1, 4))

Extracts the first 4 characters from the 'date' column as 'year'


to_date()
Converts a string to a date type

df = df.withColumn('date', F.to_date(df['string_date'], 'yyyy-MM-dd'))

Converts 'string_date' to a date type using the specified format


isNull()
Checks if a column value is null

df = df.filter(df['important_column'].isNull())

Keeps only the rows where 'important_column' is null


isNotNull()
Checks if a column value is not null

df = df.filter(df['important_column'].isNotNull())

Keeps only the rows where 'important_column' is not null


isin()
Checks if a column value is in a list of
values

df = df.filter(df['category'].isin(['A', 'B', 'C']))

Keeps only the rows where 'category' is either 'A', 'B', or 'C'


between()
Filters for values in a specified range

df = df.filter(df['age'].between(18, 65))

Keeps only the rows where 'age' is between 18 and 65 (inclusive)


rdd.map()
Applies a function to each row of the
DataFrame

df = df.rdd.map(lambda x: (x['id'], x['name'].upper())).toDF(['id', 'name'])

Converts 'name' to uppercase for each row and creates a new DataFrame
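
Dropping to the RDD API serializes every row through Python, so when a built-in function covers the transformation it is usually better to stay in the DataFrame API. The same uppercase cleanup, sketched with F.upper():

df = df.withColumn('name', F.upper(df['name']))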


withColumnRenamed()
Renames a column

df = df.withColumnRenamed('old_column_name', 'new_column_name')

Renames the column 'old_column_name' to 'new_column_name'


na.fill()
Fills null values with specified values

df = df.na.fill({'age': 0, 'name': 'Unknown'})

Replaces null values in 'age' with 0 and in 'name' with 'Unknown'


na.replace()
Replaces specified values with new values (here, null)

df = df.na.replace(['N/A', 'NA'], None)

Replaces 'N/A' and 'NA' values with null across all columns


approx_count_distinct()
Estimates the number of distinct items
in a column

df = df.agg(F.approx_count_distinct('user_id').alias('distinct_users'))

Estimates the count of distinct 'user_id' values


collect_list()
Collects values from a column into a list

df = df.groupBy('category').agg(F.collect_list('item').alias('items'))

Groups by 'category' and collects all 'item' values into a list
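
If duplicate items should be dropped while collecting, collect_set() is the deduplicating counterpart (element order is not guaranteed by either function):

df = df.groupBy('category').agg(F.collect_set('item').alias('unique_items'))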


array_contains()
Checks if an array column contains a
value

df = df.filter(F.array_contains(df['tags'], 'important'))

Keeps only rows where the 'tags' array contains 'important'


to_timestamp()
Converts a string to a timestamp

df = df.withColumn('timestamp', F.to_timestamp(df['string_timestamp'], 'yyyy-MM-dd HH:mm:ss'))

Converts 'string_timestamp' to a timestamp using the specified format


unix_timestamp()
Converts a date/timestamp to Unix
timestamp

df = df.withColumn('unix_time', F.unix_timestamp(df['date']))

Converts the 'date' column to a Unix timestamp (seconds since 1970-01-01)


from_unixtime()
Converts Unix timestamp to a readable
date string

df = df.withColumn('readable_time', F.from_unixtime(df['unix_time']))

Converts 'unix_time' to a readable date string


datediff()
Calculates the difference between two
dates

df = df.withColumn('days_diff', F.datediff(df['end_date'], df['start_date']))

Calculates the number of days between 'start_date' and 'end_date'


months_between()
Calculates the number of months
between two dates

df = df.withColumn('months_employed', F.months_between(F.current_date(), df['hire_date']))

Calculates the number of months between 'hire_date' and the current date


round()
Rounds a numeric column to specified
decimal places

df = df.withColumn('rounded_salary', F.round(df['salary'], 2))

Rounds the 'salary' column to 2 decimal places


upper()
Converts a string column to uppercase

df = df.withColumn('uppercase_name', F.upper(df['name']))

Converts the 'name' column to uppercase


lower()
Converts a string column to lowercase

df = df.withColumn('lowercase_email', F.lower(df['email']))

Converts the 'email' column to lowercase


concat_ws()
Concatenates multiple columns with a
separator

df = df.withColumn('full_address', F.concat_ws(', ', df['street'], df['city'], df['country']))

Concatenates 'street', 'city', and 'country' with a comma separator


create_map()
Creates a map column from key-value
pairs

df = df.withColumn('name_map',
    F.create_map(F.lit('first'), df['first_name'], F.lit('last'), df['last_name']))

Creates a map column with 'first' and 'last' as keys, and the corresponding name columns as values


size()
Returns the size of an array or map
column

df = df.withColumn('list_size', F.size(df['item_list']))

Adds a column with the size of the 'item_list' array


first()
Returns the first value in a group

df = df.groupBy('category').agg(F.first('item').alias('first_item'))

Groups by 'category' and gets the first 'item' in each group


last()
Returns the last value in a group

df = df.groupBy('category').agg(F.last('item').alias('last_item'))

Groups by 'category' and gets the last 'item' in each group


regexp_extract()
Extracts a pattern from a string column

df = df.withColumn('zip_code', F.regexp_extract(df['address'], r'\d{5}', 0))

Extracts a 5-digit zip code from the 'address' column


Window
Defines a window specification for window functions

window_spec = Window.partitionBy('department').orderBy(F.desc('salary'))
df = df.withColumn('salary_rank', F.rank().over(window_spec))

Creates a window spec partitioned by 'department' and ordered by descending 'salary', then ranks salaries within each department
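
A common follow-up is keeping only the top-ranked row per partition; a sketch building on the window above (note that rank() gives ties the same rank, while row_number() would pick exactly one row per department):

# Keep the highest-paid employee(s) in each department
df_top = df.filter(df['salary_rank'] == 1)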
