Cleaning Data With PySpark - Chapter 4
Pipelines
Mike Metzger
Data Engineering Consultant
What is a data pipeline?
A set of steps to process data from source(s) to final output
Transformations
.withColumn(), .filter(), .drop()
Output(s)
CSV, Parquet, database
Validation
Analysis
from pyspark.sql.types import StructType, StructField, StringType
from pyspark.sql.functions import monotonically_increasing_id

# Define the expected schema for the incoming CSV data
schema = StructType([
    StructField('name', StringType(), False),
    StructField('age', StringType(), False)
])
# Apply the schema before loading the file
df = spark.read.format('csv').schema(schema).load('datafile')
# Add a unique ID per row
df = df.withColumn('id', monotonically_increasing_id())
...
# Write the results out as Parquet and JSON
df.write.parquet('outdata.parquet')
df.write.json('outdata.json')
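As a quick check that the pipeline wrote what was expected, the Parquet output can be read back and inspected; this is a minimal sketch using the file written above:

# Read the Parquet output back and confirm schema and row count
out_df = spark.read.parquet('outdata.parquet')
out_df.printSchema()
print('Rows written: %d' % out_df.count())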
What are we trying to parse?
Incorrect data
  Empty rows
  Headers
Nested structures
  Multiple delimiters
Non-regular data
  Differing numbers of columns per row
The CSV reader defaults to using , as the delimiter

Example rows (width, height, image):
200 300 affenpinscher;0
600 450 Collie;307 Collie;101
600 449 Japanese_spaniel;23
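To illustrate parsing a non-standard layout like the example rows above, here is a minimal sketch; the file name 'annotations.csv', the tab separator, and the column names are assumptions for illustration, not part of the course data:

from pyspark.sql.functions import split, col

# Read the raw rows with an assumed tab separator (not the default comma)
df = spark.read.csv('annotations.csv', sep='\t')

# Rename the positional columns based on the assumed layout: width, height, dog entry
df = df.withColumnRenamed('_c0', 'width') \
       .withColumnRenamed('_c1', 'height') \
       .withColumnRenamed('_c2', 'dog_entry')

# Break the nested 'breed;count' field apart on the semicolon delimiter
df = df.withColumn('dog_parts', split(col('dog_entry'), ';'))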
Definition
Validation is:
Data types
Comparatively fast
# Load the parsed data and the list of known-valid companies
parsed_df = spark.read.parquet('parsed_data.parquet')
company_df = spark.read.parquet('companies.parquet')

# Inner join keeps only rows whose company exists in company_df
verified_df = parsed_df.join(company_df, parsed_df.company == company_df.company)
This automatically removes any rows with a company not present in company_df!
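To see how many rows the validation join removed, compare row counts before and after; a minimal sketch using the DataFrames above:

# Count rows before and after the validation join
parsed_count = parsed_df.count()
verified_count = verified_df.count()
print('Rows removed by validation: %d' % (parsed_count - verified_count))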
Calculations
Analysis calculations (UDF)
def getAvgSale(saleslist):
    # Running totals across all entries in the sales list
    totalsales = 0
    count = 0
    for sale in saleslist:
        # Each entry contributes two sale amounts (fields 2 and 3)
        totalsales += sale[2] + sale[3]
        count += 2
    return totalsales / count
# Read the raw sales data via the SparkSession (not from an existing DataFrame)
df = spark.read.csv('datafile')
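To apply getAvgSale() within Spark, it must be wrapped as a UDF with a declared return type and used in a column expression; a minimal sketch, assuming the sales data sits in a hypothetical sales_list column:

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Wrap the Python function as a Spark UDF that returns a double
udf_getavgsale = udf(getAvgSale, DoubleType())

# Apply the UDF; 'sales_list' is an assumed column name for illustration
df = df.withColumn('avg_sale', udf_getavgsale(df.sales_list))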
Next Steps
Review Spark documentation