Cleaning Data With PySpark Chapter 2
operations
CLEANING DATA WITH PYSPARK
Mike Metzger
Data Engineering Consultant
DataFrame refresher
DataFrames:
Immutable
Select
voter_df.select(voter_df.name)
withColumn
voter_df.withColumn('year', F.year(voter_df.date))  # F = pyspark.sql.functions
drop
voter_df.drop('unused_column')
filter / where
voter_df.filter(voter_df['name'].isNotNull())
voter_df.filter(F.year(voter_df.date) > 1800)  # F = pyspark.sql.functions
voter_df.where(voter_df['_c0'].contains('VOTE'))
Negate with ~
voter_df.where(~ voter_df._c1.isNull())
import pyspark.sql.functions as F
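As a rough sketch of how the operations above chain together: because DataFrames are immutable, each call returns a new DataFrame and leaves voter_df as it was. The helper name tidy_voters is hypothetical; the column names (name, _c0, unused_column) are the ones used in the examples above.

```python
def tidy_voters(voter_df):
    """Chain the refresher operations; voter_df itself is never modified."""
    # filter() returns a new DataFrame holding only rows with a name.
    named = voter_df.filter(voter_df['name'].isNotNull())
    # where() is an alias for filter(); keep rows whose _c0 contains 'VOTE'.
    votes = named.where(named['_c0'].contains('VOTE'))
    # drop() removes a column, select() keeps only the listed ones.
    return votes.drop('unused_column').select('name')
```

The result has to be assigned to be kept, for example cleaned_df = tidy_voters(voter_df).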
Conditional clauses
Conditional Clauses are:
.when()
.otherwise()
Without .otherwise(), rows that match no .when() condition are left null:
Name Age
Alice 14
Bob 18 Adult
Candice 38 Adult
df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .when(df.Age < 18, "Minor"))
Name Age
Alice 14 Minor
Bob 18 Adult
Candice 38 Adult
df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .otherwise("Minor"))
Name Age
Alice 14 Minor
Bob 18 Adult
Candice 38 Adult
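To make the evaluation order concrete, here is a small plain-Python model of a .when() chain next to a Spark helper that applies the same rule. It is a sketch, not Spark's implementation: first_match, with_age_group, and the column name AgeGroup are hypothetical names, and the ages are the ones from the tables above.

```python
def first_match(value, branches, default=None):
    """Plain-Python model of chained .when(): the first true condition wins."""
    for condition, result in branches:
        if condition(value):
            return result
    # No branch matched: mirrors the null Spark gives without .otherwise().
    return default


def with_age_group(df, column_name="AgeGroup"):
    """Spark version: append a label column to a DataFrame that has an Age column."""
    # Imported here so the model above can be tried without a Spark install.
    from pyspark.sql import functions as F

    age_group = F.when(df.Age >= 18, "Adult").otherwise("Minor")
    return df.withColumn(column_name, age_group)


ages = [14, 18, 38]
adult_only = [(lambda age: age >= 18, "Adult")]
print([first_match(a, adult_only) for a in ages])                   # [None, 'Adult', 'Adult']
print([first_match(a, adult_only, default="Minor") for a in ages])  # ['Minor', 'Adult', 'Adult']
```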
Defined...
User defined functions or UDFs
Python method
Stored as a variable
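from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
def reverseString(mystr):
    return mystr[::-1]
# Wrap the Python method as a UDF and store it as a variable
udfReverseString = udf(reverseString, StringType())
user_df = user_df.withColumn('ReverseName',
                             udfReverseString(user_df.Name))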
Alice 14 H
Bob 18 S
Candice 63 G
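The table above appears to show a single letter assigned to each row. One way such a column could be produced is a UDF that takes no arguments. The following is only a sketch under that assumption: LABELS, pick_label, add_label_column, and the column name Class are hypothetical, and the letters are simply the ones visible in the table.

```python
import random

LABELS = ['G', 'H', 'S']  # letters seen in the table; their meaning is not stated


def pick_label():
    """Return one label at random; the function needs no column input."""
    return random.choice(LABELS)


def add_label_column(user_df, column_name='Class'):
    """Attach a randomly chosen label to every row of user_df."""
    # Imported here so pick_label can be tried without a Spark install.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # The output is random, so tell Spark not to assume repeated calls agree.
    udf_pick_label = F.udf(pick_label, StringType()).asNondeterministic()
    # An argument-less UDF is called with empty parentheses.
    return user_df.withColumn(column_name, udf_pick_label())
```

Because the label is random, re-running the job can give a row a different letter unless the result is cached or written out.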
Partitioning
DataFrames are broken up into partitions
.select(...)
.write(...)
0 Smith John TX
1 Wilson A. IL
2 Adams Wendy OR
Completely parallel
0 Smith John TX
134520871 Wilson A. IL
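The jump from 0 to 134520871 in the second table is the kind of gap produced by IDs that each partition generates on its own. Assuming the function being illustrated is Spark's monotonically_increasing_id(), the sketch below models the layout described in the Spark documentation (partition ID in the upper 31 bits, record number within the partition in the lower 33 bits). The names sketch_monotonic_id, with_row_id, and ROW_ID are hypothetical, and the values it prints come from that layout, not from the table.

```python
RECORD_BITS = 33  # low bits hold the row's position inside its partition


def sketch_monotonic_id(partition_index, row_in_partition):
    """Model of the ID layout: no coordination between partitions is needed."""
    return (partition_index << RECORD_BITS) + row_in_partition


def with_row_id(df, column_name='ROW_ID'):
    """Spark version: add a unique, increasing, non-sequential ID column."""
    # Imported here so the model above can be tried without a Spark install.
    from pyspark.sql import functions as F

    return df.withColumn(column_name, F.monotonically_increasing_id())


# Two partitions: the first holds one row, the second holds two.
partitions = [['Smith'], ['Wilson', 'Adams']]
ids = [sketch_monotonic_id(p, r)
       for p, rows in enumerate(partitions)
       for r, _ in enumerate(rows)]
print(ids)  # [0, 8589934592, 8589934593]
```

The IDs are unique and increasing, but not consecutive, which is the trade-off for generating them completely in parallel.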