Create new column with function in Spark Dataframe
Last Updated : 28 Apr, 2025
In this article, we are going to learn how to create a new column with a function in a PySpark data frame in Python.
PySpark is a popular Python library for distributed data processing that provides high-level APIs for working with big data. The DataFrame class is a key component of PySpark, as it lets you manipulate tabular data with distributed computing. To follow this article, you will need PySpark installed and some familiarity with basic data frame operations.
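If PySpark is not installed yet, it can be added from PyPI (a typical setup; your environment may differ):
pip install pyspark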
Define a function:
First, we define a plain Python function that will be used to create the new column. This function takes an age as input and returns a string indicating whether the person is a "child" or an "adult".
Python3
# Import udf to wrap the Python function
from pyspark.sql.functions import udf

# Defining a function
def age_group(age):
    if age < 18:
        return "child"
    else:
        return "adult"

# Wrapping age_group in a UDF so it can be
# applied to a DataFrame column
age_group_udf = udf(age_group)
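By default, udf() treats the wrapped function's return type as StringType. The return type can also be declared explicitly, which matters when the function returns something other than a string. A minimal sketch building on the function above:
Python3
from pyspark.sql.types import StringType

# Explicitly declaring the UDF's return type
# (StringType is also the default)
age_group_udf = udf(age_group, StringType())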
Create a new column with a function using the withColumn() method in PySpark
In this approach, we add a new column to a data frame by defining a custom function and applying it to the data frame through a UDF. The UDF takes a column of the data frame as input, applies the custom function to each value, and returns the result as a new column.
Here are the steps for using the withColumn() method to create a new column called "age_group" in our data frame:
Python3
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf

# create a SparkSession
spark = SparkSession.builder.getOrCreate()

# create a list of tuples
data = [("Alice", 1), ("Bob", 2), ("Charlie", 3)]

# create a DataFrame from the list
df = spark.createDataFrame(data, ["name", "age"])

print("Dataframe Before adding new col : ")

# view the DataFrame
df.show()

# define the function that maps an age to a label
def age_group(age):
    if age < 18:
        return "child"
    else:
        return "adult"

# wrap the function in a UDF
age_group_udf = udf(age_group)

# create a new column called "age_group"
df = df.withColumn("age_group", age_group_udf(df.age))

print("Dataframe After adding new col : ")

# view the DataFrame with the new column
df.show()
Output :
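With the sample data above (all three ages are below 18), the two show() calls produce output along these lines:
Dataframe Before adding new col :
+-------+---+
|   name|age|
+-------+---+
|  Alice|  1|
|    Bob|  2|
|Charlie|  3|
+-------+---+

Dataframe After adding new col :
+-------+---+---------+
|   name|age|age_group|
+-------+---+---------+
|  Alice|  1|    child|
|    Bob|  2|    child|
|Charlie|  3|    child|
+-------+---+---------+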
Create a new column with a function using a registered PySpark UDF
In this approach, we add a new column to a data frame by defining a custom function and registering it as a UDF using the spark.udf.register() method. We then use the selectExpr() method to select the existing columns of the data frame along with the new column, which is created by applying the registered UDF to a column of the DataFrame inside a SQL expression. Here is the complete code:
Python3
# Import required modules
from pyspark.sql import SparkSession

# Creating the SparkSession
spark = SparkSession.builder.appName(
    "Creating new column using UDF").getOrCreate()

# Creating the DataFrame
data = [("John", 25), ("Mike", 30),
        ("Emily", 35)]
columns = ["name", "age"]
df = spark.createDataFrame(data, columns)

print("Dataframe Before adding new col : ")
df.show()

# Define the function used as a UDF
def my_udf(age):
    if age < 30:
        return "Young"
    else:
        return "Old"

# Register the UDF under the name "age_group"
# so it can be called in SQL expressions
udf_age_group = spark.udf.register(
    "age_group", my_udf)

# Use the UDF in a select statement
df = df.selectExpr("name", "age",
                   "age_group(age) as age_group")

print("Dataframe After adding new col : ")

# Show the DataFrame
df.show()
Output :
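With this sample data, the two show() calls produce output along these lines:
Dataframe Before adding new col :
+-----+---+
| name|age|
+-----+---+
| John| 25|
| Mike| 30|
|Emily| 35|
+-----+---+

Dataframe After adding new col :
+-----+---+---------+
| name|age|age_group|
+-----+---+---------+
| John| 25|    Young|
| Mike| 30|      Old|
|Emily| 35|      Old|
+-----+---+---------+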
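Note that spark.udf.register() also returns the registered UDF as a callable, so the handle stored in udf_age_group can be used directly with the DataFrame API as well. A short sketch:
Python3
# Equivalent result using withColumn with the
# handle returned by spark.udf.register
df = df.withColumn("age_group", udf_age_group(df.age))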