Cleaning Data With PySpark: Chapter 2

The document covers techniques for cleaning DataFrames in PySpark: filtering rows, selecting columns, adding new columns, and handling null values. It shows conditional expressions (.when() / .otherwise()) for transforming columns, how Python user defined functions (UDFs) can be wrapped and called like native Spark functions, and how DataFrames are partitioned and processed lazily, with transformations deferred until an action runs so Spark can optimize them. It also demonstrates generating unique identifiers in parallel with monotonically_increasing_id().


DataFrame column operations

Mike Metzger, Data Engineering Consultant
DataFrame refresher
DataFrames:

Made up of rows & columns

Immutable

Use various transformation operations to modify data

# Return rows where name starts with "M"


voter_df.filter(voter_df.name.like('M%'))

# Return name and position only


voters = voter_df.select('name', 'position')

Common DataFrame transformations
Filter / Where

voter_df.filter(voter_df.date > '2019-01-01')  # or voter_df.where(...); ISO date format compares correctly

Select

voter_df.select(voter_df.name)

withColumn

voter_df.withColumn('year', F.year(voter_df.date))  # F is pyspark.sql.functions

drop

voter_df.drop('unused_column')
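
These calls each return a new DataFrame, so they chain naturally. A minimal sketch, assuming voter_df has name, date, and position columns:

import pyspark.sql.functions as F

# Each transformation returns a new, immutable DataFrame
cleaned_df = (voter_df
    .filter(voter_df.name.isNotNull())          # keep rows with a name
    .withColumn('year', F.year(voter_df.date))  # derive a year column
    .select('name', 'position', 'year'))        # keep only the needed columns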

Filtering data
Remove nulls

Remove odd entries

Split data from combined sources

Negate with ~

voter_df.filter(voter_df['name'].isNotNull())
voter_df.filter(F.year(voter_df.date) > 1800)  # F.year() extracts the year from a date column
voter_df.where(voter_df['_c0'].contains('VOTE'))
voter_df.where(~ voter_df._c1.isNull())

Column string transformations
Contained in pyspark.sql.functions

import pyspark.sql.functions as F

Applied per column as transformation


voter_df.withColumn('upper', F.upper('name'))

Can create intermediary columns


voter_df.withColumn('splits', F.split('name', ' '))

Can cast to other types


voter_df.withColumn('year', voter_df['_c4'].cast(IntegerType()))  # IntegerType from pyspark.sql.types

ArrayType() column functions
Various utility functions / transformations to interact with ArrayType() columns:

F.size(<column>) - returns the number of elements in an ArrayType() column

.getItem(<index>) - retrieves a specific item at an index of a list column
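
A short sketch of both, building on the 'splits' column created with F.split() above (the first_name / last_name column names are just illustrative):

import pyspark.sql.functions as F

# Number of tokens produced by F.split()
voter_df = voter_df.withColumn('name_parts', F.size('splits'))

# First and last elements of the array column
voter_df = voter_df.withColumn('first_name', voter_df.splits.getItem(0))
voter_df = voter_df.withColumn('last_name',
    voter_df.splits.getItem(F.size('splits') - 1))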

Let's practice!

Conditional DataFrame column operations
Conditional clauses
Conditional clauses are:

An inline version of if / then / else

.when()

.otherwise()

CLEANING DATA WITH PYSPARK


Conditional example
.when(<if condition>, <then x>)

df.select(df.Name, df.Age, F.when(df.Age >= 18, "Adult"))

Name      Age
Alice     14
Bob       18    Adult
Candice   38    Adult

Another example
Multiple .when()

df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .when(df.Age < 18, "Minor"))

Name      Age
Alice     14    Minor
Bob       18    Adult
Candice   38    Adult

Otherwise
.otherwise() is like else

df.select(df.Name, df.Age,
          F.when(df.Age >= 18, "Adult")
           .otherwise("Minor"))

Name      Age
Alice     14    Minor
Bob       18    Adult
Candice   38    Adult
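
Pulling the three examples together, a minimal runnable sketch (assuming a local SparkSession):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('Alice', 14), ('Bob', 18), ('Candice', 38)],
    ['Name', 'Age'])

# Rows matching no .when() clause receive the .otherwise() value
df = df.withColumn('Group',
    F.when(df.Age >= 18, 'Adult').otherwise('Minor'))
df.show()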

Let's practice!

User defined functions
Defined...
User defined functions (UDFs) are:

A Python method

Wrapped via the pyspark.sql.functions.udf method

Stored as a variable

Called like a normal Spark function

Reverse string UDF
Define a Python method

def reverseString(mystr):
    return mystr[::-1]

Wrap the function and store as a variable

udfReverseString = udf(reverseString, StringType())  # udf from pyspark.sql.functions, StringType from pyspark.sql.types

Use with Spark

user_df = user_df.withColumn('ReverseName',
udfReverseString(user_df.Name))
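
pyspark.sql.functions.udf can also be used as a decorator, which folds the define and wrap steps into one; a minimal sketch:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def reverseString(mystr):
    return mystr[::-1]

user_df = user_df.withColumn('ReverseName', reverseString(user_df.Name))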

Argument-less example
import random

def sortingCap():
    return random.choice(['G', 'H', 'R', 'S'])

udfSortingCap = udf(sortingCap, StringType())
user_df = user_df.withColumn('Class', udfSortingCap())

Name      Age   Class
Alice     14    H
Bob       18    S
Candice   63    G
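
One caveat: Spark assumes UDFs are deterministic and may re-execute them during optimization, so a random UDF like this can yield different values for the same row on different runs. In Spark 2.3+ you can mark the UDF nondeterministic to tell the optimizer not to do that:

udfSortingCap = udf(sortingCap, StringType()).asNondeterministic()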

Let's practice!

Partitioning and lazy processing
Partitioning
DataFrames are broken up into partitions

Partition size can vary

Each partition is handled independently (see the sketch below)
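
A quick way to inspect and control partitioning (the count of 100 is arbitrary):

# How many partitions back this DataFrame?
print(voter_df.rdd.getNumPartitions())

# Redistribute into an explicit number of partitions (returns a new DataFrame)
voter_df = voter_df.repartition(100)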

Lazy processing
Transformations are lazy
.withColumn(...)

.select(...)

Nothing is actually done until an action is performed


.count()

.write(...)

Transformations can be re-ordered for best performance

This sometimes causes unexpected behavior (a minimal sketch follows)
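
A minimal sketch of the lazy pattern, assuming voter_df already exists: the transformations return immediately and only build a plan, and the work happens at the action:

import pyspark.sql.functions as F

# Transformations: returned instantly, nothing is computed yet
voter_df = voter_df.withColumn('upper_name', F.upper('name'))
voter_df = voter_df.select('upper_name', 'date')

# Action: triggers the actual computation across all partitions
print(voter_df.count())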

Adding IDs
Normal ID fields:

Common in relational databases

Usually an increasing, sequential, unique integer

Not very parallel

id   last name   first name   state
0    Smith       John         TX
1    Wilson      A.           IL
2    Adams       Wendy        OR

Monotonically increasing IDs
pyspark.sql.functions.monotonically_increasing_id()

Integer (64-bit), increases in value, unique

Not necessarily sequential (gaps exist)

Completely parallel

id          last name   first name   state
0           Smith       John         TX
134520871   Wilson      A.           IL
675824594   Adams       Wendy        OR
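
Usage is a single withColumn() call; IDs are generated independently per partition, so they stay unique with no coordination between workers:

import pyspark.sql.functions as F

# 64-bit IDs: unique and increasing, but not sequential across partitions
voter_df = voter_df.withColumn('ROW_ID', F.monotonically_increasing_id())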

Notes
Remember, Spark is lazy!

IDs may be assigned out of the order you expect

If performing a join, the ID may be assigned after the join completes

Test your transformations

Let's practice!