Convert Python Functions into PySpark UDF
In this article, we will discuss the process of converting Python functions into PySpark User-Defined Functions (UDFs). PySpark UDFs are a powerful tool for data processing and analysis because they allow Python functions to be used within the Spark ecosystem. By converting Python functions into UDFs, we can leverage the distributed processing capabilities of Spark to perform complex transformations and operations on large datasets.
PySpark
PySpark is the Python library for Spark programming. It provides a Python API for interacting with the Spark ecosystem, including support for data frames, SQL operations, and machine learning.
User Defined Function (UDF)
A User Defined Function (UDF) is a function that is defined and written by the user, rather than being provided by the system. UDFs in PySpark are used to perform specific operations or calculations on data within the Spark ecosystem.
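Besides wrapping an existing function with udf(), a Python function can also be turned into a UDF with the @udf decorator. The minimal sketch below assumes a function named greet, which is purely illustrative and not part of the examples that follow.
Python
# A minimal sketch: @udf wraps the function into a UDF in one step.
# greet is a hypothetical example function, used only for illustration.
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def greet(name):
    return 'Hello, ' + name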
Distributed Processing
Spark is a distributed computing framework, which means that it can process data in parallel across a cluster of machines. This enables Spark to handle very large datasets, and perform operations quickly by breaking them down into smaller chunks.
Example 1:
In this example, we define a Python function square() that takes a single argument and returns its square. We then create a UDF from this function using the udf() function, specifying IntegerType() as the return type. Finally, we use the UDF in a data frame operation to square the values in the "id" column. Since spark.range(1, 10) excludes the upper bound, the output is a single-column data frame containing the squares of 1 through 9.
Python
# Import required modules
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("gfg").getOrCreate()

# Define the Python function
def square(x):
    return x * x

# Create the UDF with an explicit return type
square_udf = udf(square, IntegerType())

# Use the UDF in a DataFrame operation
# (spark.range(1, 10) produces ids 1 through 9)
df = spark.range(1, 10)
df.select(square_udf("id").alias("squared")).show()
Output: The output below is the result of running the UDF square_udf on the 'id' column of the data frame df. The UDF applies the Python function square to each value of the 'id' column, producing a new column 'squared' that contains the squared values.
+-------+
|squared|
+-------+
| 1|
| 4|
| 9|
| 16|
| 25|
| 36|
| 49|
| 64|
| 81|
+-------+
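Continuing from the example above, the same function can also be exposed to Spark SQL by registering it with spark.udf.register(). In the sketch below, the UDF name square_sql and the view name nums are illustrative assumptions, not part of the example above.
Python
# A minimal sketch: register square() for use in SQL queries.
# The UDF name "square_sql" and the view name "nums" are assumptions.
spark.udf.register("square_sql", square, IntegerType())
df.createOrReplaceTempView("nums")
spark.sql("SELECT square_sql(id) AS squared FROM nums").show()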
Example 2:
In this example, we define a Python function concat_strings(x, y) that takes two arguments, concatenates them with a space, and returns the resulting string. We then create a UDF from this function using the udf() function, specifying StringType() as the return type. We then create an example data frame df for demonstration purposes and apply the UDF to its columns, as shown below.
Python
# Importing required modules
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName("gfg").getOrCreate()

# Define the Python function
def concat_strings(x, y):
    return x + ' ' + y

# Create the UDF with an explicit return type
concat_strings_udf = udf(concat_strings, StringType())

# Create a data frame for demonstration
df = spark.createDataFrame([('John', 'Doe'),
                            ('Adam', 'Smith'),
                            ('Jane', 'Doe')],
                           ['first_name', 'last_name'])
df.show()

print("after applying udf function")

# Apply the UDF to two columns at once
df.select(concat_strings_udf('first_name', 'last_name').alias('full_name')).show()
Output: As we can see, the UDF concatenates the values of the 'first_name' and 'last_name' columns and returns the concatenated string in a new column 'full_name'.
+----------+---------+
|first_name|last_name|
+----------+---------+
| John| Doe|
| Adam| Smith|
| Jane| Doe|
+----------+---------+
after applying udf function
+----------+
| full_name|
+----------+
| John Doe|
|Adam Smith|
| Jane Doe|
+----------+
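For larger datasets, a vectorized pandas UDF is usually faster than a row-at-a-time UDF, because it processes whole batches of rows as pandas Series instead of calling the Python function once per row. The sketch below is a hedged alternative to the example above: it requires the pyarrow package to be installed, reuses the df from Example 2, and uses the illustrative name concat_strings_vec.
Python
# A minimal sketch of a vectorized pandas UDF (requires pyarrow).
# concat_strings_vec is a hypothetical name used only for illustration.
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def concat_strings_vec(x: pd.Series, y: pd.Series) -> pd.Series:
    # Operates on whole batches of rows as pandas Series
    return x + ' ' + y

df.select(concat_strings_vec('first_name', 'last_name').alias('full_name')).show()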