Data Engineering 101: Data Cleaning using PySpark
Shwetank Singh, GritSetGrow - GSGLearn.com
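All snippets below assume an active SparkSession, an existing DataFrame df, and the standard functions import:
from pyspark.sql import functions as F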
dropDuplicates()
Removes duplicate rows from the
DataFrame based on specified columns
df = df.dropDuplicates(['id', 'name'])
dropna()
Removes rows with null values in
specified columns
df = df.dropna(subset=['important_column'])
fillna()
Replaces null values with specified
values
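A minimal sketch, assuming hypothetical 'age' and 'city' columns:
df = df.fillna({'age': 0, 'city': 'unknown'})
Replaces nulls in 'age' with 0 and nulls in 'city' with 'unknown'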
replace()
Replaces specified values with new
values
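A minimal sketch, assuming a hypothetical 'status' column:
df = df.replace('N/A', 'unknown', subset=['status'])
Replaces the string 'N/A' with 'unknown' in the 'status' column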
cast()
Changes the data type of a column
df = df.withColumn('salary', df['salary'].cast('double'))
Converts the 'salary' column to double data type
withColumn()
Adds a new column or replaces an
existing one
df = df.withColumn('full_name', F.concat(df['first_name'], F.lit(' '), df['last_name']))
Creates a new 'full_name' column by
concatenating 'first_name', a space, and
'last_name'
drop()
Removes specified columns from the
DataFrame
df = df.drop('unnecessary_column')
withColumnRenamed()
Renames columns in the DataFrame
df = df.withColumnRenamed('old_name', 'new_name')
trim()
Removes leading and trailing
whitespace from string columns
df = df.withColumn('name', F.trim(df['name']))
regexp_replace()
Replaces substrings matching a regular
expression
df = df.withColumn('phone', F.regexp_replace(df['phone'], r'[^\d]', ''))
Removes all non-digit characters from the
'phone' column
filter()
Filters rows based on a condition
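A minimal sketch, assuming a hypothetical 'age' column:
df = df.filter(df['age'] >= 18)
Keeps only rows where 'age' is at least 18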
where()
An alias for filter(); filters rows based on a condition
df = df.where(df['status'] == 'active')
distinct()
Removes duplicate rows from the
DataFrame
df = df.select('category').distinct()
unionByName()
Combines two DataFrames, aligning
them by column names
df_combined = df1.unionByName(df2, allowMissingColumns=True)
Combines df1 and df2, matching columns by
name and allowing missing columns
join()
Combines two DataFrames based on a
key
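A minimal sketch, assuming DataFrames df1 and df2 that share an 'id' column:
df_joined = df1.join(df2, on='id', how='left')
Left-joins df2 onto df1 on 'id'; other how values include 'inner', 'outer', and 'left_anti'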
groupBy()
Groups the DataFrame by specified
columns
df_grouped = df.groupBy('department').agg(F.avg('salary').alias('avg_salary'))
Groups by 'department' and calculates the
average salary for each group
agg()
Performs aggregations on grouped data
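A minimal sketch, assuming hypothetical 'department' and 'salary' columns:
df_stats = df.groupBy('department').agg(
    F.min('salary').alias('min_salary'),
    F.max('salary').alias('max_salary'))
Computes the minimum and maximum salary per department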
pivot()
Pivots a DataFrame from long to wide
format
df_pivoted = df.groupBy('date').pivot('category').sum('amount')
Creates a pivot table with 'date' as rows,
'category' as columns, and sum of 'amount' as
values
explode()
Splits array columns into multiple rows
df = df.withColumn('item', F.explode(df['items_list']))
coalesce()
Returns the first non-null value in a list of columns (F.coalesce; not to be confused with DataFrame.coalesce, which reduces partitions)
df = df.withColumn('best_contact', F.coalesce(df['email'], df['phone'], df['address']))
Creates a 'best_contact' column with the first
non-null value from email, phone, or address
when()
Applies if/else-style conditional logic to column values, typically chained with otherwise()
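A minimal sketch, assuming a hypothetical 'age' column:
df = df.withColumn('age_group', F.when(df['age'] < 18, 'minor').otherwise('adult'))
Labels each row 'minor' or 'adult' based on 'age'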
split()
Splits a string column into an array
df = df.withColumn('first_name', F.split(df['full_name'], ' ').getItem(0))
Splits 'full_name' by space and takes the first
item as 'first_name'
substring()
Extracts a substring from a string
column
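A minimal sketch, assuming a hypothetical 'phone' column; note that the position argument is 1-based:
df = df.withColumn('area_code', F.substring(df['phone'], 1, 3))
Extracts the first three characters of 'phone' into 'area_code'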
to_date()
Converts a string to a date type
df = df.withColumn('date', F.to_date(df['string_date'], 'yyyy-MM-dd'))
Converts 'string_date' to a date type using the
specified format
isNull()
Checks if a column value is null
df = df.filter(df['important_column'].isNull())
isNotNull()
Checks if a column value is not null
df = df.filter(df['important_column'].isNotNull())
isin()
Checks if a column value is in a list of
values
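A minimal sketch, assuming a hypothetical 'status' column:
df = df.filter(df['status'].isin('active', 'pending'))
Keeps rows whose 'status' is either 'active' or 'pending'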
between()
Filters for values in a specified range (bounds inclusive)
df = df.filter(df['age'].between(18, 65))
rdd.map()
Applies a function to each row of the
DataFrame
df = df.rdd.map(lambda x: (x['id'], x['name'].upper())).toDF(['id', 'name'])
Converts 'name' to uppercase for each row and
creates a new DataFrame
withColumnRenamed()
Renames a column
df = df.withColumnRenamed('old_column_name', 'new_column_name')
Renames the column 'old_column_name' to
'new_column_name'
na.fill()
Fills null values with specified values
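na.fill() is an alias for fillna(); a minimal sketch, assuming a hypothetical 'quantity' column:
df = df.na.fill(0, subset=['quantity'])
Replaces nulls in 'quantity' with 0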
na.replace()
Replaces specified values with new values; pass None as the replacement to turn them into nulls
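A minimal sketch, assuming a hypothetical 'status' column:
df = df.na.replace('N/A', None, subset=['status'])
Turns 'N/A' strings in 'status' into proper nulls so that dropna()/fillna() can handle them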
approx_count_distinct()
Estimates the number of distinct items
in a column
df = df.agg(F.approx_count_distinct('user_id').alias('distinct_users'))
Estimates the count of distinct 'user_id' values
collect_list()
Collects values from a column into a list
df = df.groupBy('category').agg(F.collect_list('item').alias('items'))
Groups by 'category' and collects all 'item' values
into a list
array_contains()
Checks if an array column contains a
value
df = df.filter(F.array_contains(df['tags'], 'important'))
to_timestamp()
Converts a string to a timestamp
df = df.withColumn('timestamp', F.to_timestamp(df['string_timestamp'], 'yyyy-MM-dd HH:mm:ss'))
Converts 'string_timestamp' to a timestamp
using the specified format
unix_timestamp()
Converts a date/timestamp to Unix
timestamp
df = df.withColumn('unix_time', F.unix_timestamp(df['date']))
Converts 'date' column to Unix timestamp
(seconds since 1970-01-01)
from_unixtime()
Converts Unix timestamp to a readable
date string
df = df.withColumn('readable_time', F.from_unixtime(df['unix_time']))
Converts 'unix_time' to a readable date string
datediff()
Calculates the difference between two
dates
df = df.withColumn('days_diff', F.datediff(df['end_date'], df['start_date']))
Calculates the number of days between
'start_date' and 'end_date'
months_between()
Calculates the number of months
between two dates
df = df.withColumn('months_employed', F.months_between(F.current_date(), df['hire_date']))
Calculates the number of months between
'hire_date' and the current date
round()
Rounds a numeric column to specified
decimal places
df = df.withColumn('rounded_salary', F.round(df['salary'], 2))
Rounds the 'salary' column to 2 decimal places
upper()
Converts a string column to uppercase
df = df.withColumn('uppercase_name', F.upper(df['name']))
Converts the 'name' column to uppercase
lower()
Converts a string column to lowercase
df = df.withColumn('lowercase_email', F.lower(df['email']))
Converts the 'email' column to lowercase
concat_ws()
Concatenates multiple columns with a
separator
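A minimal sketch, assuming hypothetical 'city', 'state', and 'zip' columns:
df = df.withColumn('address', F.concat_ws(', ', df['city'], df['state'], df['zip']))
Joins the three columns into one string separated by ', '; unlike concat(), null inputs are skipped rather than nulling the whole result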
create_map()
Creates a map column from key-value
pairs
df = df.withColumn('name_map', F.create_map(F.lit('first'), df['first_name'], F.lit('last'), df['last_name']))
Creates a map column with 'first' and 'last' as
keys, and corresponding name columns as values
size()
Returns the size of an array or map
column
df = df.withColumn('list_size', F.size(df['item_list']))
first()
Returns the first value in a group
df = df.groupBy('category').agg(F.first('item').alias('first_item'))
Groups by 'category' and gets the first 'item' in
each group
last()
Returns the last value in a group
df = df.groupBy('category').agg(F.last('item').alias('last_item'))
regexp_extract()
Extracts a pattern from a string column
df = df.withColumn('zip_code', F.regexp_extract(df['address'], r'\d{5}', 0))
Extracts a 5-digit zip code from the 'address'
column
Window
Defines a window specification for window functions
from pyspark.sql.window import Window
window_spec = Window.partitionBy('department').orderBy(F.desc('salary'))
df = df.withColumn('salary_rank', F.rank().over(window_spec))
Creates a window spec partitioned by
'department' and ordered by descending 'salary',
then ranks salaries within each department