UDF in PySpark

PySpark UDF (a.k.a. User Defined Function) is one of the most useful features of Spark SQL & DataFrame, used to extend PySpark's built-in capabilities. Note: UDFs are among the most expensive operations in Spark, so use them only when there is no suitable built-in alternative.

1. PySpark UDF Introduction

1.1 What is UDF?


UDFs (User Defined Functions) will be nothing new if you come from a SQL background: most traditional RDBMS databases support them. You register a function with the database and then use it in SQL like any regular function. PySpark UDFs work in a similar way. You write a function in Python syntax and either wrap it with PySpark SQL's udf() to use it on DataFrames, or register it to use it in SQL.

1.2 Why do we need a UDF?


UDFs extend the functionality of the framework and let you re-use the same logic across multiple DataFrames. For example, suppose you want to convert the first letter of every word in a name string to a capital letter. If PySpark's built-in features did not cover this, you could create a UDF and reuse it on as many DataFrames and SQL expressions as needed. Before you create any UDF, do your research to check whether a similar function is already available among the Spark SQL built-in functions: PySpark SQL provides many predefined common functions, and more are added with every release, so it is best to check before reinventing the wheel. When you do create UDFs, design them very carefully, otherwise you will run into optimization and performance issues.
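As a case in point, the word-capitalization example above is already covered by a built-in: pyspark.sql.functions.initcap capitalizes the first letter of every word. A minimal sketch, assuming the df DataFrame created in section 2.1 below:

from pyspark.sql.functions import col, initcap

# initcap() title-cases every word in the column, so no UDF is needed for this case
df.select(col("Seqno"), initcap(col("Name")).alias("Name")).show(truncate = False)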

2. Create PySpark UDF

2.1 Create a DataFrame


Before we jump into creating a UDF, let's first create a PySpark DataFrame.

import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["Seqno", "Name"]

data = [("1", "John Jones"),


("2", "Tracey Smith"),
("3", "Amy Sanders")]

df = spark.createDataFrame(data = data, schema = columns)

df.show(truncate = False)

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |John Jones  |
|2    |Tracey Smith|
|3    |Amy Sanders |
+-----+------------+

2.2 Create a Python Function


The first step in creating a UDF is writing a Python function. The snippet below creates a function convertCase() which takes a string parameter and converts the first letter of every word to a capital letter. A UDF can take parameters of your choice and return a value.

def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr

Note that there may be better ways to write this function; for the purposes of this article, performance is not the main concern.

2.3 Convert a Python function to PySpark UDF


Now convert the convertCase() function to a UDF by passing it to PySpark SQL's udf() function, which lives in the pyspark.sql.functions module; make sure you import it before using it. udf() wraps the Python function and returns a user-defined function object that can be applied to DataFrame columns and used like a built-in function.
from pyspark.sql.functions import col,udf
from pyspark.sql.types import StringType

# Converting function to UDF
convertUDF = udf(lambda z: convertCase(z), StringType())

Note: The default return type of udf() is StringType, so you can also write the above statement without specifying the return type.

# Converting function to UDF
# StringType() is the default, hence not required
convertUDF = udf(lambda z: convertCase(z))
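Conversely, if your UDF returns something other than a string, declare the return type explicitly. Below is a minimal sketch under that assumption; the nameLengthUDF name and the length logic are illustrative, not part of the original example.

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

# A UDF returning an integer must declare IntegerType explicitly
nameLengthUDF = udf(lambda z: len(z) if z is not None else 0, IntegerType())
df.select(col("Name"), nameLengthUDF(col("Name")).alias("NameLength")).show(truncate = False)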

3. Using UDF with DataFrame

3.1 Using UDF with PySpark DataFrame select()


Now you can use convertUDF() on a DataFrame column like a regular built-in function.

df.select(col("Seqno"), \
convertUDF(col("Name")).alias("Name") ) \
.show(truncate = False)

+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

3.2 Using UDF with PySpark DataFrame withColumn()


You can also use a UDF with the DataFrame withColumn() function. To illustrate this, let's create another function, upperCase(), which converts the input string to upper case.

def upperCase(str):
    return str.upper()

Let's convert the upperCase() Python function to a UDF and then use it with DataFrame withColumn(). The example below converts the values of the "Name" column to upper case and creates a new column "Curated Name".

upperCaseUDF = udf(lambda z:upperCase(z), StringType())

df.withColumn("Curated Name", upperCaseUDF(col("Name"))) \


.show(truncate = False)

+-----+------------+------------+
|Seqno|Name        |Curated Name|
+-----+------------+------------+
|1    |John Jones  |JOHN JONES  |
|2    |Tracey Smith|TRACEY SMITH|
|3    |Amy Sanders |AMY SANDERS |
+-----+------------+------------+

3.3 Registering PySpark UDF & use it on SQL


In order to use the convertCase() function in PySpark SQL, you need to register it with PySpark by using spark.udf.register().

""" Using UDF on SQL """


spark.udf.register("convertUDF", convertCase, StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("SELECT seqno, convertUDF(Name) AS name FROM NAME_TABLE") \
.show(truncate = False)

+-----+-------------+
|seqno|name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

This yields the same output as the example in section 3.1.

4. Creating UDF using annotation


In the previous sections, you learned that creating a UDF is a two-step process: first you create a Python function, then you convert it to a UDF with the udf() function. You can avoid the second step and do it all in a single step by using the @udf annotation.

@udf(returnType=StringType())
def upperCase(str):
    return str.upper()

df.withColumn("Curated Name", upperCase(col("Name"))) \


.show(truncate = False)

+-----+------------+------------+
|Seqno|Name        |Curated Name|
+-----+------------+------------+
|1    |John Jones  |JOHN JONES  |
|2    |Tracey Smith|TRACEY SMITH|
|3    |Amy Sanders |AMY SANDERS |
+-----+------------+------------+

This produces the same output as section 3.2.

5. Special Handling

5.1 Execution order


One thing to be aware of is that PySpark/Spark does not guarantee the order of evaluation of subexpressions: expressions are not guaranteed to be evaluated left-to-right or in any other fixed order. Spark reorders execution during query optimization and planning, so subexpressions in AND, OR, WHERE and HAVING clauses may run in a different order than written. When you design and use a UDF you therefore have to be very careful, especially with null handling, since an expression that is evaluated earlier than expected can cause runtime exceptions.

"""
No guarantee Name is not null will execute first
If convertUDF(Name) like '%John%' execute first then
you will get runtime error
"""
spark.sql("SELECT Seqno, convertUDF(Name) AS Name FROM NAME_TABLE" + \
"where Name is not null and convertUDF(Name) like '%John%'") \
.show(truncate = False)

+-----+-----------+
|Seqno|Name       |
+-----+-----------+
|1    |John Jones |
+-----+-----------+

Here the query succeeds because NAME_TABLE contains no null names, but Spark may still evaluate convertUDF(Name) before the null filter, so the same pattern can fail at runtime on data that contains nulls, as the next section shows.

5.2 Handling null check


UDFs are error-prone when not designed carefully. For example, consider a column that contains the value null on some records:

""" null check """

columns = ["Seqno", "Name"]


data = [("1", "John Jones"),
("2", "Tracey Smith"),
("3", "Amy Sanders"),
("4", None)
]

df2 = spark.createDataFrame(data = data, schema = columns)


df2.show(truncate = False)
df2.createOrReplaceTempView("NAME_TABLE2")

spark.sql("SELECT convertUDF(Name) FROM NAME_TABLE2") \


.show(truncate = False)

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |John Jones  |
|2    |Tracey Smith|
|3    |Amy Sanders |
|4    |NULL        |
+-----+------------+
---------------------------------------------------------------------------
PythonException                           Traceback (most recent call last)
Cell In[12], line 15
     11 df2.show(truncate = False)
     12 df2.createOrReplaceTempView("NAME_TABLE2")
     14 spark.sql("SELECT convertUDF(Name) FROM NAME_TABLE2") \
---> 15     .show(truncate = False)

File C:\Program Files\spark-3.5.1-bin-hadoop3\python\pyspark\sql\dataframe.py:945, in DataFrame.show(self, n, truncate, vertical)
    885 def show(self, n: int = 20, truncate: Union[bool, int] = True, vertical: bool = False) -> None:
    886     """Prints the first ``n`` rows to the console.
    887
    888     .. versionadded:: 1.3.0
    (...)
    943     name | Bob
    944     """
--> 945     print(self._show_string(n, truncate, vertical))

File C:\Program Files\spark-3.5.1-bin-hadoop3\python\pyspark\sql\dataframe.py:976, in DataFrame._show_string(self, n, truncate, vertical)
    967 except ValueError:
    968     raise PySparkTypeError(
    969         error_class="NOT_BOOL",
    970         message_parameters={
    (...)
    973         },
    974     )
--> 976 return self._jdf.showString(n, int_truncate, vertical)

File C:\Program Files\spark-3.5.1-bin-hadoop3\python\lib\py4j-0.10.9.7-src.zip\py4j\java_gateway.py:1322, in JavaMember.__call__(self, *args)
   1316 command = proto.CALL_COMMAND_NAME +\
   1317     self.command_header +\
   1318     args_command +\
   1319     proto.END_COMMAND_PART
   1321 answer = self.gateway_client.send_command(command)
-> 1322 return_value = get_return_value(
   1323     answer, self.gateway_client, self.target_id, self.name)
   1325 for temp_arg in temp_args:
   1326     if hasattr(temp_arg, "_detach"):

File C:\Program Files\spark-3.5.1-bin-hadoop3\python\pyspark\errors\exceptions\captured.py:185, in capture_sql_exception.<locals>.deco(*a, **kw)
    181 converted = convert_exception(e.java_exception)
    182 if not isinstance(converted, UnknownException):
    183     # Hide where the exception came from that shows a non-Pythonic
    184     # JVM exception message.
--> 185     raise converted from None
    186 else:
    187     raise

PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "C:\Users\Dell\AppData\Local\Temp\ipykernel_18024\595000715.py", line 3, in convertCase
AttributeError: 'NoneType' object has no attribute 'split'
Note that in the snippet above, the record with Seqno 4 has the value None for the Name column. Since the UDF does not handle null, using it on this DataFrame raises the error shown above (in Python, None is treated as null). Points to remember: it is always best practice to check for null inside the UDF itself rather than relying on a check outside it. If you cannot add a null check inside the UDF, at least use IF or CASE WHEN in SQL to check for null and call the UDF conditionally.
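For the first approach, here is a minimal sketch of a null-safe variant of convertCase() that guards inside the function itself (the convertCaseSafe name is illustrative, not part of the original example):

def convertCaseSafe(str):
    # Return an empty string instead of failing when the input is null/None
    if str is None:
        return ""
    resStr = ""
    for x in str.split(" "):
        resStr = resStr + x[0:1].upper() + x[1:] + " "
    return resStr

The second approach, guarding with a lambda at registration time, is what the following example does.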

spark.udf.register("_nullsafeUDF", lambda str: convertCase(str) if not str is None else "", StringType())

spark.sql("SELECT _nullsafeUDF(Name) FROM NAME_TABLE2") \


.show(truncate = False)

spark.sql("SELECT Seqno, _nullsafeUDF(Name) AS Name FROM NAME_TABLE2 " + \


" WHERE Name is not null and _nullsafeUDF(Name) like '%JOHN%'") \
.show(truncate = False)
+------------------+
|_nullsafeUDF(Name)|
+------------------+
|John Jones        |
|Tracey Smith      |
|Amy Sanders       |
|                  |
+------------------+

+-----+----+
|Seqno|Name|
+-----+----+
+-----+----+

This executes successfully without errors because we check for null/None when registering the UDF.

5.3 Performance concern using UDF


UDFs are a black box to PySpark, so it cannot apply its optimizations to them, and you lose the optimizations PySpark applies to DataFrames/Datasets. When possible, use the Spark SQL built-in functions, since these are optimized by Spark; create a UDF only when no existing built-in function provides what you need.
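As an illustration, the upperCase() UDF from section 3.2 can be replaced by the built-in upper() function, which Spark can optimize. A minimal sketch using the df DataFrame from section 2.1:

from pyspark.sql.functions import col, upper

# The built-in upper() gives the same result as upperCaseUDF without the UDF overhead
df.withColumn("Curated Name", upper(col("Name"))) \
    .show(truncate = False)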

6. Complete PySpark UDF Example


Below is a complete PySpark UDF example in Python.

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()

columns = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]

df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate=False)

def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr

""" Converting function to UDF """
convertUDF = udf(lambda z: convertCase(z))

df.select(col("Seqno"), \
          convertUDF(col("Name")).alias("Name")) \
    .show(truncate=False)

def upperCase(str):
    return str.upper()

upperCaseUDF = udf(lambda z: upperCase(z), StringType())

df.withColumn("Curated Name", upperCaseUDF(col("Name"))) \
    .show(truncate=False)

""" Using UDF on SQL """
spark.udf.register("convertUDF", convertCase, StringType())
df.createOrReplaceTempView("NAME_TABLE")
spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE") \
    .show(truncate=False)

spark.sql("select Seqno, convertUDF(Name) as Name from NAME_TABLE " + \
          "where Name is not null and convertUDF(Name) like '%John%'") \
    .show(truncate=False)

""" null check """
columns = ["Seqno", "Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders"),
        ("4", None)]
df2 = spark.createDataFrame(data=data, schema=columns)
df2.show(truncate=False)
df2.createOrReplaceTempView("NAME_TABLE2")

spark.udf.register("_nullsafeUDF", lambda str: convertCase(str) if str is not None else "", StringType())

spark.sql("select _nullsafeUDF(Name) from NAME_TABLE2") \
    .show(truncate=False)

spark.sql("select Seqno, _nullsafeUDF(Name) as Name from NAME_TABLE2 " + \
          "where Name is not null and _nullsafeUDF(Name) like '%John%'") \
    .show(truncate=False)

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+

+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

+-----+------------+------------+
|Seqno|Name        |Curated Name|
+-----+------------+------------+
|1    |john jones  |JOHN JONES  |
|2    |tracey smith|TRACEY SMITH|
|3    |amy sanders |AMY SANDERS |
+-----+------------+------------+

+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

+-----+-----------+
|Seqno|Name       |
+-----+-----------+
|1    |John Jones |
+-----+-----------+

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
|4    |NULL        |
+-----+------------+

+------------------+
|_nullsafeUDF(Name)|
+------------------+
|John Jones        |
|Tracey Smith      |
|Amy Sanders       |
|                  |
+------------------+

+-----+-----------+
|Seqno|Name       |
+-----+-----------+
|1    |John Jones |
+-----+-----------+

