
Performing operations on multiple columns in a PySpark DataFrame


You can use reduce, for loops, or list comprehensions to apply PySpark functions to multiple columns in a DataFrame.

Using iterators to apply the same operation on multiple columns is vital for maintaining a DRY codebase.

Let’s explore different ways to lowercase all of the columns in a DataFrame to illustrate this concept.

If you’re using the Scala API, see this blog post on performing
operations on multiple columns in a Spark DataFrame with
foldLeft.

Lowercase all columns with reduce

Let’s import the reduce function from functools and use it to lowercase all the columns in a DataFrame.

from functools import reduce
from pyspark.sql.functions import col, lower

source_df = spark.createDataFrame(
    [
        ("Jose", "BLUE"),
        ("li", "BrOwN")
    ],
    ["name", "eye_color"]
)

# Thread source_df through withColumn once per column, lowercasing each one
actual_df = reduce(
    lambda memo_df, col_name: memo_df.withColumn(col_name, lower(col(col_name))),
    source_df.columns,
    source_df
)

actual_df.show()

+----+---------+
|name|eye_color|
+----+---------+
|jose|     blue|
|  li|    brown|
+----+---------+
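
For readers less familiar with reduce, the call above simply threads the DataFrame through withColumn once per column. Conceptually it unrolls to chained withColumn calls, sketched here with the two columns of source_df:

# The reduce call above is equivalent to chaining withColumn manually
actual_df = (
    source_df
    .withColumn("name", lower(col("name")))
    .withColumn("eye_color", lower(col("eye_color")))
)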

The physical plan that’s generated by this code looks efficient.


actual_df.explain()

== Physical Plan ==
*Project [lower(name#0) AS name#5, lower(eye_color#1) AS eye_color#9]
+- Scan ExistingRDD[name#0,eye_color#1]

It is no secret that reduce is not among the favored functions of the Pythonistas. — dawg

Let’s see how we can achieve the same result with a for loop.

Lowercase all columns with a for loop


Let’s use the same source_df as earlier and build up the actual_df with
a for loop.
actual_df = source_df

for col_name in actual_df.columns:
    actual_df = actual_df.withColumn(col_name, lower(col(col_name)))

This code is a bit ugly, but Spark is smart and generates the same physical plan.

actual_df.explain()

== Physical Plan ==
*Project [lower(name#18) AS name#23, lower(eye_color#19) AS eye_color#27]
+- Scan ExistingRDD[name#18,eye_color#19]

Let’s see how we can also use a list comprehension to write this code.

Lowercase all columns with a list comprehension

Let’s use the same source_df as earlier and lowercase all the columns with list comprehensions that are beloved by Pythonistas far and wide.

actual_df = source_df.select(
    *[lower(col(col_name)).name(col_name) for col_name in source_df.columns]
)
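
Column.name used above is simply an alias for Column.alias, so if alias reads more naturally to you, the comprehension can be written equivalently:

actual_df = source_df.select(
    *[lower(col(col_name)).alias(col_name) for col_name in source_df.columns]
)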

Spark is still smart and generates the same physical plan.

actual_df.explain()

== Physical Plan ==
*Project [lower(name#36) AS name#41, lower(eye_color#37) AS eye_color#42]
+- Scan ExistingRDD[name#36,eye_color#37]

Let’s mix it up and see how these solutions work when they’re run
on some, but not all, of the columns in a DataFrame.

Performing operations on a subset of the DataFrame columns

Let’s define a remove_some_chars function that removes all exclamation points and question marks from a column.
from pyspark.sql.functions import regexp_replace

def remove_some_chars(col_name):
    # Build a regex that matches any character we want to strip, e.g. "\!|\?"
    removed_chars = ("!", "?")
    regexp = "|".join(r"\{0}".format(i) for i in removed_chars)
    return regexp_replace(col_name, regexp, "")
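
remove_some_chars takes a column name and returns a Column expression, so it drops straight into withColumn or select. As a quick illustration (a hypothetical one-column usage, assuming a DataFrame df with a team column):

df = df.withColumn("team", remove_some_chars("team"))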

Let’s use reduce to apply the remove_some_chars function to two columns in a new DataFrame.
source_df = spark.createDataFrame(
    [
        ("h!o!c!k!e!y", "rangers", "new york"),
        ("soccer", "??nacional!!", "medellin")
    ],
    ["sport", "team", "city"]
)
source_df.show()

+-----------+------------+--------+
|      sport|        team|    city|
+-----------+------------+--------+
|h!o!c!k!e!y|     rangers|new york|
|     soccer|??nacional!!|medellin|
+-----------+------------+--------+

actual_df = reduce(
    lambda memo_df, col_name: memo_df.withColumn(col_name, remove_some_chars(col_name)),
    ["sport", "team"],
    source_df
)

actual_df.show()

+------+--------+--------+
| sport|    team|    city|
+------+--------+--------+
|hockey| rangers|new york|
|soccer|nacional|medellin|
+------+--------+--------+

Let’s try building up the actual_df with a for loop.

actual_df = source_df

for col_name in ["sport", "team"]:
    actual_df = actual_df.withColumn(col_name, remove_some_chars(col_name))

The for loop looks pretty clean. Now let’s try it with a list comprehension.

source_df.select(
    *[remove_some_chars(col_name).name(col_name) if col_name in ["sport", "team"]
      else col_name for col_name in source_df.columns]
)

Wow, the list comprehension is really ugly for a subset of the columns 😿

reduce, for loops, and list comprehensions are all outputting the same physical plan as in the previous example, so each option is equally performant when executed.

== Physical Plan ==
*Project [regexp_replace(sport#109, \!|\?, ) AS sport#116,
regexp_replace(team#110, \!|\?, ) AS team#117, city#111]
+- Scan ExistingRDD[sport#109,team#110,city#111]

What approach should you use?


for loops seem to yield the most readable code. List
comprehensions can be used for operations that are performed on
all columns of a DataFrame, but should be avoided for operations
performed on a subset of the columns. The reduce code is pretty
clean too, so that’s also a viable alternative.

It’s best to write functions that operate on a single column and wrap the iterator in a separate DataFrame transformation so the code can easily be applied to multiple columns.

Let’s define a multi_remove_some_chars DataFrame transformation that takes an array of col_names as an argument and applies remove_some_chars to each col_name.
def multi_remove_some_chars(col_names):
    def inner(df):
        for col_name in col_names:
            df = df.withColumn(
                col_name,
                remove_some_chars(col_name)
            )
        return df
    return inner

We can invoke multi_remove_some_chars as follows:

multi_remove_some_chars(["sport", "team"])(source_df)
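
If you’re on Spark 3.0 or later, the same function also plugs into DataFrame.transform, which keeps the call site tidy when chaining several custom transformations. A small sketch using the source_df from above:

actual_df = source_df.transform(multi_remove_some_chars(["sport", "team"]))
actual_df.show()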

This separation of concerns creates a codebase that’s easy to test and reuse.
