Cleaning Data With PySpark - Chapter 4
Pipelines
Mike Metzger
Data Engineering Consultant
What is a data pipeline?
A set of steps to process data from source(s) to final output
Transformations
.withColumn(), .filter(), .drop()
Output(s)
CSV, Parquet, database
Validation
Analysis
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import monotonically_increasing_id

# Define the expected schema for the incoming CSV data
schema = StructType([
    StructField('name', StringType(), False),
    StructField('age', StringType(), False)
])
# Apply the schema before loading the file
df = spark.read.format('csv').schema(schema).load('datafile')
# Add a unique ID per row
df = df.withColumn('id', monotonically_increasing_id())
...
# Write the results out as Parquet and JSON
df.write.parquet('outdata.parquet')
df.write.json('outdata.json')
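As a quick check that the pipeline wrote what was expected, the Parquet output can be read back and inspected; this is a minimal sketch using the file written above:

# Read the Parquet output back and confirm schema and row count
out_df = spark.read.parquet('outdata.parquet')
out_df.printSchema()
print('Rows written: %d' % out_df.count())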
What are we trying to parse?
Incorrect data
  Empty rows
  Headers
Nested structures
  Multiple delimiters
Non-regular data
  Differing numbers of columns per row
The CSV reader defaults to using , as the delimiter

Example rows (width, height, image):
200 300 affenpinscher;0
600 450 Collie;307 Collie;101
600 449 Japanese_spaniel;23
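To illustrate parsing a non-standard layout like the example rows above, here is a minimal sketch; the file name 'annotations.csv', the tab separator, and the column names are assumptions for illustration, not part of the course data:

from pyspark.sql.functions import split, col

# Read the raw rows with an assumed tab separator (not the default comma)
df = spark.read.csv('annotations.csv', sep='\t')

# Rename the positional columns based on the assumed layout: width, height, dog entry
df = df.withColumnRenamed('_c0', 'width') \
       .withColumnRenamed('_c1', 'height') \
       .withColumnRenamed('_c2', 'dog_entry')

# Break the nested 'breed;count' field apart on the semicolon delimiter
df = df.withColumn('dog_parts', split(col('dog_entry'), ';'))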
Definition
Validation is:
Data types
Comparatively fast
# Load the parsed data and the list of known-valid companies
parsed_df = spark.read.parquet('parsed_data.parquet')
company_df = spark.read.parquet('companies.parquet')

# Inner join keeps only rows whose company exists in company_df
verified_df = parsed_df.join(company_df, parsed_df.company == company_df.company)
This automatically removes any rows with a company not present in company_df!
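To see how many rows the validation join removed, compare row counts before and after; a minimal sketch using the DataFrames above:

# Count rows before and after the validation join
parsed_count = parsed_df.count()
verified_count = verified_df.count()
print('Rows removed by validation: %d' % (parsed_count - verified_count))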
Calculations
Analysis calculations (UDF)
def getAvgSale(saleslist):
    # Running totals across all entries in the sales list
    totalsales = 0
    count = 0
    for sale in saleslist:
        # Each entry contributes two sale amounts (fields 2 and 3)
        totalsales += sale[2] + sale[3]
        count += 2
    return totalsales / count
# Read the raw sales data via the SparkSession (not from an existing DataFrame)
df = spark.read.csv('datafile')
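To apply getAvgSale() within Spark, it must be wrapped as a UDF with a declared return type and used in a column expression; a minimal sketch, assuming the sales data sits in a hypothetical sales_list column:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Wrap the Python function as a Spark UDF that returns a double
udf_getavgsale = udf(getAvgSale, DoubleType())

# Apply the UDF; 'sales_list' is an assumed column name for illustration
df = df.withColumn('avg_sale', udf_getavgsale(df.sales_list))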
Next Steps
Review Spark documentation