UDF in PySpark
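A UDF (User Defined Function) is one of the most useful features of Spark SQL and DataFrame; it lets you extend PySpark's built-in capabilities with your own Python functions.
Note: UDFs are among the most expensive operations in PySpark, because Spark cannot optimize them and data has to be serialized between the JVM and the Python worker. Use them only when they are essential and no built-in function does the job.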
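import findspark
findspark.init()
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["Seqno", "Name"]
data = [("1", "John Jones"), ("2", "Tracey Smith"), ("3", "Amy Sanders")]
df = spark.createDataFrame(data=data, schema=columns)
df.show(truncate = False)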
+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |John Jones  |
|2    |Tracey Smith|
|3    |Amy Sanders |
+-----+------------+
def convertCase(str):
    resStr = ""
    arr = str.split(" ")
    for x in arr:
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
Note that there may be a better way to write this function, but for the purposes of this article I am not concerned with performance or more elegant alternatives.
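As one possible alternative (a sketch, not the author's version), the same capitalization can be written with a generator expression and `str.join`. The name `convert_case_compact` is hypothetical. Unlike `convertCase()`, it does not append a trailing space, so its output differs slightly from the results shown in this article.

```python
def convertCase(text):
    """Approach used in this article: capitalize each space-separated word."""
    resStr = ""
    for x in text.split(" "):
        resStr = resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr  # note the trailing space after the last word


def convert_case_compact(text):
    """Hypothetical alternative: same capitalization, no trailing space."""
    return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))


print(repr(convertCase("john jones")))           # 'John Jones '
print(repr(convert_case_compact("john jones")))  # 'John Jones'
```

The trailing space added by `convertCase()` is why the converted Name column in the outputs is one character wider than the input column.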
Note: The default return type of udf() is StringType, so the statement that converts convertCase() to a UDF can also be written without specifying a return type.
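A minimal sketch of both forms, assuming pyspark is installed. The wrapper `make_convert_udfs` is a hypothetical helper, used here only so that the pyspark import is deferred until the UDFs are actually needed.

```python
def convertCase(text):
    """Capitalize the first letter of each space-separated word."""
    resStr = ""
    for x in text.split(" "):
        resStr = resStr + x[0:1].upper() + x[1:] + " "
    return resStr


def make_convert_udfs():
    """Return (explicit, implicit) UDF wrappers around convertCase.

    Hypothetical helper: pyspark is imported here rather than at module
    level, so convertCase() stays usable where Spark is not installed.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    explicit = udf(lambda z: convertCase(z), StringType())  # return type stated
    implicit = udf(lambda z: convertCase(z))                # relies on the StringType default
    return explicit, implicit
```

With Spark available, `convertUDF, _ = make_convert_udfs()` should give a UDF that can be applied as `convertUDF(col("Name"))`.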
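from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
convertUDF = udf(lambda z: convertCase(z), StringType())  # wrap the Python function as a UDF
df.select(col("Seqno"), \
    convertUDF(col("Name")).alias("Name") ) \
   .show(truncate = False)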
+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+
def upperCase(str):
    return str.upper()
Let’s convert the upperCase() Python function to a UDF and then use it with the DataFrame withColumn() method. Applied to the “Name” column, it creates a new column, “Curated Name”, that holds the upper-case values.
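The conversion and the withColumn() call could look roughly like this. It is a sketch under assumptions: `add_curated_name` is a hypothetical helper, and `df` is assumed to be a Spark DataFrame with a string column "Name".

```python
def upperCase(text):
    """Return the input string in upper case."""
    return text.upper()


def add_curated_name(df):
    """Return df with an extra "Curated Name" column (upper-cased "Name").

    Hypothetical helper; pyspark is imported on first call so that
    upperCase() can be used without a Spark installation.
    """
    from pyspark.sql.functions import col, udf
    from pyspark.sql.types import StringType

    upperCaseUDF = udf(lambda z: upperCase(z), StringType())
    return df.withColumn("Curated Name", upperCaseUDF(col("Name")))
```

Calling `add_curated_name(df).show(truncate=False)` on the sample DataFrame should print a table with the three columns Seqno, Name and Curated Name.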
+-----+------------+------------+
|Seqno|Name        |Curated Name|
+-----+------------+------------+
|1    |John Jones  |JOHN JONES  |
|2    |Tracey Smith|TRACEY SMITH|
|3    |Amy Sanders |AMY SANDERS |
+-----+------------+------------+
+-----+-------------+
|seqno|name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+
@udf(returnType=StringType())
def upperCase(str):
    return str.upper()
+-----+------------+------------+
|Seqno|Name        |Curated Name|
+-----+------------+------------+
|1    |John Jones  |JOHN JONES  |
|2    |Tracey Smith|TRACEY SMITH|
|3    |Amy Sanders |AMY SANDERS |
+-----+------------+------------+
5. Special Handling
"""
There is no guarantee that `Name is not null` is evaluated first.
If convertUDF(Name) like '%John%' is evaluated first,
you will get a runtime error.
"""
spark.sql("SELECT Seqno, convertUDF(Name) AS Name FROM NAME_TABLE " + \
          "where Name is not null and convertUDF(Name) like '%John%'") \
     .show(truncate = False)
Note the space after NAME_TABLE in the first literal. Without it, the two strings are joined into NAME_TABLEwhere and Spark rejects the query:
ParseException:
[PARSE_SYNTAX_ERROR] Syntax error at or near 'is'.(line 1, pos 65)
== SQL ==
SELECT Seqno, convertUDF(Name) AS Name FROM NAME_TABLEwhere Name is not null and convertUDF(Name) like '%John%'
-----------------------------------------------------------------^^^
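When a query is split across several string literals, a missing space at a join is easy to overlook. The plain-Python sketch below reproduces the pitfall and shows one way to sidestep it with a single triple-quoted string. The variable names `broken` and `query` are illustrative only.

```python
# Two literals joined without a separating space (the pitfall).
broken = "SELECT Seqno, convertUDF(Name) AS Name FROM NAME_TABLE" + \
         "where Name is not null and convertUDF(Name) like '%John%'"

# One way to avoid it: a single triple-quoted string, where the line
# breaks themselves act as whitespace between SQL tokens.
query = """
    SELECT Seqno, convertUDF(Name) AS Name
    FROM NAME_TABLE
    WHERE Name IS NOT NULL AND convertUDF(Name) LIKE '%John%'
"""

print("NAME_TABLEwhere" in broken)  # True: table name and keyword are fused
print(" ".join(query.split()))      # the same query collapsed onto one line
```

The multi-line string can be passed to `spark.sql(query)` as it is.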
+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |John Jones  |
|2    |Tracey Smith|
|3    |Amy Sanders |
|4    |NULL        |
+-----+------------+
---------------------------------------------------------------------------
PythonException Traceback (most recent call last)
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "C:\Users\Dell\AppData\Local\Temp\ipykernel_18024\595000715.py", line 3, in convertCase
AttributeError: 'NoneType' object has no attribute 'split'
Note that in the data above, the record with Seqno 4 has the value None in the Name column. Because the UDF does not handle null, applying it to this DataFrame raises the error shown above. Note that in Python, None is treated as null.
Points to remember: it is always best practice to check for null inside the UDF itself rather than outside it. If you cannot do the null check inside the UDF, at least use IF or CASE WHEN to check for null and call the UDF conditionally.
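A sketch of the first recommendation, where the None check lives inside the function that gets wrapped as a UDF. The name `convert_case_null_safe` is hypothetical, and returning an empty string for None is just one possible choice.

```python
def convertCase(text):
    """Capitalize the first letter of each space-separated word."""
    resStr = ""
    for x in text.split(" "):
        resStr = resStr + x[0:1].upper() + x[1:] + " "
    return resStr


def convert_case_null_safe(text):
    """Hypothetical null-safe variant: return "" for None instead of raising."""
    if text is None:
        return ""
    return convertCase(text)


print(repr(convert_case_null_safe(None)))           # ''
print(repr(convert_case_null_safe("amy sanders")))  # 'Amy Sanders '
```

With an active session, this function could be registered in place of a lambda, for example `spark.udf.register("_nullsafeUDF", convert_case_null_safe, StringType())`.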
spark.udf.register("_nullsafeUDF", lambda str: convertCase(str) if str is not None else "", StringType())
+-----+----+
|Seqno|Name|
+-----+----+
+-----+----+
This executes successfully, without errors, because we check for null (None) when registering the UDF.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType
spark = SparkSession.builder.appName('SparkByExamples.com').getOrCreate()
columns = ["Seqno","Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders")]
df = spark.createDataFrame(data=data,schema=columns)
df.show(truncate=False)
def convertCase(str):
    resStr=""
    arr = str.split(" ")
    for x in arr:
        resStr= resStr + x[0:1].upper() + x[1:len(x)] + " "
    return resStr
# Convert the Python function to a UDF (StringType is the default return type)
convertUDF = udf(lambda z: convertCase(z))
df.select(col("Seqno"), \
    convertUDF(col("Name")).alias("Name") ) \
   .show(truncate=False)
def upperCase(str):
    return str.upper()
columns = ["Seqno","Name"]
data = [("1", "john jones"),
        ("2", "tracey smith"),
        ("3", "amy sanders"),
        ('4',None)]
df2 = spark.createDataFrame(data=data,schema=columns)
df2.show(truncate=False)
df2.createOrReplaceTempView("NAME_TABLE2")
spark.udf.register("_nullsafeUDF", lambda str: convertCase(str) if str is not None else "" , StringType())
+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
+-----+------------+

+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

+-----+------------+-------------+
|Seqno|Name        |Cureated Name|
+-----+------------+-------------+
|1    |john jones  |JOHN JONES   |
|2    |tracey smith|TRACEY SMITH |
|3    |amy sanders |AMY SANDERS  |
+-----+------------+-------------+

+-----+-------------+
|Seqno|Name         |
+-----+-------------+
|1    |John Jones   |
|2    |Tracey Smith |
|3    |Amy Sanders  |
+-----+-------------+

+-----+-----------+
|Seqno|Name       |
+-----+-----------+
|1    |John Jones |
+-----+-----------+

+-----+------------+
|Seqno|Name        |
+-----+------------+
|1    |john jones  |
|2    |tracey smith|
|3    |amy sanders |
|4    |NULL        |
+-----+------------+

+------------------+
|_nullsafeUDF(Name)|
+------------------+
|John Jones        |
|Tracey Smith      |
|Amy Sanders       |
|                  |
+------------------+

+-----+-----------+
|Seqno|Name       |
+-----+-----------+
|1    |John Jones |
+-----+-----------+