Convert Python Functions into PySpark UDF
Last Updated: 28 Apr, 2025
In this article, we will discuss the process of converting Python functions into PySpark User-Defined Functions (UDFs). PySpark UDFs are a powerful tool for data processing and analysis, as they allow Python functions to be used within the Spark ecosystem. By converting Python functions into UDFs, we can leverage Spark's distributed processing capabilities to perform complex transformations and operations on large datasets.
PySpark
PySpark is the Python library for Spark programming. It provides a Python API for interacting with the Spark ecosystem, including support for data frames, SQL operations, and machine learning.
User Defined Function (UDF)
A User Defined Function (UDF) is a function that is defined and written by the user, rather than being provided by the system. UDFs in PySpark are used to perform specific operations or calculations on data within the Spark ecosystem.
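In PySpark, the typical pattern is to write an ordinary Python function, wrap it with udf() together with a return type, and then call the wrapped version on DataFrame columns. Below is a minimal sketch of this pattern; the function to_upper and the sample data are only illustrative and are not part of the examples that follow.
Python
# A minimal sketch of the UDF pattern (illustrative names and data)
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gfg").getOrCreate()

# An ordinary Python function...
def to_upper(s):
    return s.upper()

# ...wrapped so Spark can apply it to every value of a column
to_upper_udf = udf(to_upper, StringType())

df = spark.createDataFrame([("alice",), ("bob",)], ["name"])
df.select(to_upper_udf("name").alias("upper_name")).show()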
Distributed Processing
Spark is a distributed computing framework, which means that it can process data in parallel across a cluster of machines. This enables Spark to handle very large datasets, and perform operations quickly by breaking them down into smaller chunks.
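As a rough illustration of this chunking, you can inspect how many partitions a DataFrame has been split into. The sketch below assumes a local SparkSession; the exact partition counts depend on your cluster and configuration.
Python
# Sketch: inspect how Spark splits a DataFrame into partitions
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gfg").getOrCreate()

df = spark.range(0, 1000000)

# Each partition is processed in parallel by the executors
print(df.rdd.getNumPartitions())

# repartition() redistributes the rows into a different number of chunks
print(df.repartition(8).rdd.getNumPartitions())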
Example 1:
In this example, we define a Python function square() that takes a single argument and returns its square. We then create a UDF from this function using the udf() function and specify that the return type of the UDF is IntegerType(). Finally, we use the UDF in a data frame operation to square the values in the "id" column of the data frame. The output of the code will be a data frame with a single column containing the squares of the values 1 through 9 (the upper bound of spark.range(1, 10) is exclusive).
Python
# Import required modules
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gfg").getOrCreate()

# Define the Python function
def square(x):
    return x * x

# Create the UDF with an explicit return type
square_udf = udf(square, IntegerType())

# Use the UDF in a DataFrame operation
df = spark.range(1, 10)
df.select(square_udf("id").alias("squared")).show()
Output: The output below is the result of running the UDF square_udf on the 'id' column of the data frame df. The UDF applies the Python function square to each value of the 'id' column and returns the result in a new column 'squared' containing the squared values.
+-------+
|squared|
+-------+
| 1|
| 4|
| 9|
| 16|
| 25|
| 36|
| 49|
| 64|
| 81|
+-------+
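The same UDF can also be created with the @udf decorator instead of wrapping the function afterwards. The sketch below is an equivalent rewrite of Example 1 and assumes the same SparkSession spark as above.
Python
# Sketch: the decorator form, equivalent to udf(square, IntegerType())
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(returnType=IntegerType())
def square(x):
    return x * x

spark.range(1, 10).select(square("id").alias("squared")).show()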
Example 2:
In this example, we define a Python function concat_strings(x, y) that takes two arguments and returns them joined into a single string separated by a space. We then create a UDF from this function using the udf() function and specify that its return type is StringType(). Finally, we create a sample data frame df for demonstration purposes and run the UDF on its columns as shown below.
Python
# Importing required modules
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gfg").getOrCreate()

# Define the Python function
def concat_strings(x, y):
    return x + ' ' + y

# Create the UDF with an explicit return type
concat_strings_udf = udf(concat_strings, StringType())

# Create a sample data frame
df = spark.createDataFrame([('John', 'Doe'),
                            ('Adam', 'Smith'),
                            ('Jane', 'Doe')],
                           ['first_name', 'last_name'])
df.show()

print("after applying udf function")
df.select(concat_strings_udf('first_name', 'last_name').alias('full_name')).show()
Output: As we can see, the UDF concatenates the values of the 'first_name' and 'last_name' columns and returns the concatenated string in a new column 'full_name'.
+----------+---------+
|first_name|last_name|
+----------+---------+
| John| Doe|
| Adam| Smith|
| Jane| Doe|
+----------+---------+
after applying udf function
+----------+
| full_name|
+----------+
| John Doe|
|Adam Smith|
| Jane Doe|
+----------+
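A UDF can also be registered under a name so that it is callable from Spark SQL queries. The sketch below assumes the spark session, the concat_strings function, the StringType import, and the data frame df from Example 2; the registered name concat_strings_sql and the view name people are only illustrative.
Python
# Sketch: register the same Python function for use in SQL queries
spark.udf.register("concat_strings_sql", concat_strings, StringType())

# Expose the data frame as a temporary view and call the UDF from SQL
df.createOrReplaceTempView("people")
spark.sql(
    "SELECT concat_strings_sql(first_name, last_name) AS full_name FROM people"
).show()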