How to add a column to a nested struct in PySpark
Last Updated: 26 Apr, 2025
In this article, we are going to learn how to add a column to a nested struct using PySpark in Python.
Have you ever worked with a PySpark data frame? If yes, then you surely know how to add a column in PySpark, but did you know that you can also create a struct column? A struct is used to programmatically specify the schema of the data frame and to create complex, nested columns. Apart from creating a nested struct, you can also add a column to an existing nested struct in a PySpark data frame later on. In this article, we will discuss exactly that: how to add a column to a nested struct in PySpark.
Stepwise Implementation to add a column to a nested struct.
Step 1: First of all, import the required libraries, i.e., SparkSession, StructType, StructField, StringType, IntegerType, col, lit, and when. The SparkSession library is used to create the session, while StructType defines the structure of the data frame and StructField defines its columns. StringType and IntegerType represent string and integer values in the data frame respectively. The col function returns a column based on the given column name, lit creates a column from a literal value, and when is used to build conditional expressions.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
from pyspark.sql import SparkSession
Step 2: Now, create a Spark session using the getOrCreate() function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, define the data set as a list of rows.
data_set = [((nested_values_1), column_value_1),
            ((nested_values_2), column_value_2),
            ((nested_values_3), column_value_3)]
Step 4: Later on, define the structure using the StructType and StructField functions.
schema = StructType([
    StructField('column_1', StructType([
        StructField('nested_column_1', column_type(), True),
        StructField('nested_column_2', column_type(), True),
        StructField('nested_column_3', column_type(), True)])),
    StructField('column_2', column_type(), True)])
Step 5: Further, create a PySpark data frame using the specified structure and data set.
df = spark_session.createDataFrame(data=data_set, schema=schema)
Step 6: Moreover, we add a new column to the nested struct using the withField() function, passing the nested column name and the new value (built with lit(), and optionally when() for conditional values) as arguments.
updated_df = df.withColumn("column_name",
                           col("column_name").withField("nested_column_name",
                                                        lit("replace_value")))
Step 7: Finally, we display the updated data frame.
updated_df.show()
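Note: withField() is only available from Spark 3.1 onwards. On older Spark versions, a common workaround is to rebuild the struct with the struct() function, keeping the existing nested fields and appending the new one. The sketch below reuses the placeholder names from the template above; it is an alternative technique, not part of the withField() steps:
from pyspark.sql.functions import struct, col, lit

# Rebuild the struct: expand the existing nested fields with "column_name.*"
# and append the new field under an alias (sketch with placeholder names)
updated_df = df.withColumn("column_name",
                           struct(col("column_name.*"),
                                  lit("replace_value").alias("nested_column_name")))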
Example 1:
In this example, we have defined the data structure and data set and created the PySpark data frame according to that structure, with two columns, 'Date_Of_Birth' and 'Age'. The 'Date_Of_Birth' column is a nested struct with 'Date' and 'Month' fields. We then add a 'Year' field to the nested struct 'Date_Of_Birth' by checking whether 'Age' equals 18, setting the value 2004 if the condition is met and 2002 otherwise.

Python3
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
from pyspark.sql import SparkSession

# Create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# Each row holds a (Date, Month) tuple for the nested struct, plus an Age
data_set = [((21, 2), 18), ((16, 4), 20),
            ((11, 1), 18), ((6, 3), 20)]

# 'Date_Of_Birth' is a nested struct with 'Date' and 'Month' fields
schema = StructType([
    StructField('Date_Of_Birth', StructType([
        StructField('Date', IntegerType(), True),
        StructField('Month', IntegerType(), True)])),
    StructField('Age', IntegerType(), True)])

df = spark_session.createDataFrame(data=data_set, schema=schema)

# Add a 'Year' field inside the nested struct, chosen conditionally on Age
updated_df = df.withColumn("Date_Of_Birth",
                           col("Date_Of_Birth").withField(
                               "Year", when(col("Age") == 18,
                                            lit(2004)).otherwise(lit(2002))))
updated_df.show()
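To confirm that the new 'Year' field was added inside the struct rather than as a top-level column, you can inspect the schema. The output sketched in the comments below is what we would expect; the exact nullability flags can vary with the Spark version:
updated_df.printSchema()
# root
#  |-- Date_Of_Birth: struct (nullable = true)
#  |    |-- Date: integer (nullable = true)
#  |    |-- Month: integer (nullable = true)
#  |    |-- Year: integer (nullable = false)
#  |-- Age: integer (nullable = true)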
Output:
Example 2:
In this example, we have defined the data structure and data set and created the PySpark data frame according to that structure, with four columns: 'Full_Name', 'Date_Of_Birth', 'Gender', and 'Fees'. The 'Full_Name' column is a nested struct with 'First_Name' and 'Last_Name' fields. We then add the nested column 'Middle_Name' to the struct 'Full_Name' by checking whether 'Gender' equals 'Male', adding the value 'Singh' if the condition is met and 'Kaur' otherwise.
Python3
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
from pyspark.sql.functions import col, lit, when
from pyspark.sql import SparkSession

# Create the Spark session
spark_session = SparkSession.builder.getOrCreate()

# Each row holds a (First_Name, Last_Name) tuple, plus flat columns
data_set = [(('Vansh', 'Rai'), '2000-21-02', 'Male', 13000),
            (('Ria', 'Kapoor'), '2004-01-06', 'Female', 10000)]

# 'Full_Name' is a nested struct with 'First_Name' and 'Last_Name' fields
schema = StructType([
    StructField('Full_Name', StructType([
        StructField('First_Name', StringType(), True),
        StructField('Last_Name', StringType(), True)])),
    StructField('Date_Of_Birth', StringType(), True),
    StructField('Gender', StringType(), True),
    StructField('Fees', IntegerType(), True)])

df = spark_session.createDataFrame(data=data_set, schema=schema)

# Add a 'Middle_Name' field inside the nested struct, chosen by Gender
updated_df = df.withColumn("Full_Name",
                           col("Full_Name").withField(
                               "Middle_Name",
                               when(col("Gender") == "Male",
                                    lit("Singh")).otherwise(lit("Kaur"))))
updated_df.show()
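Once the field exists, it can be read back with the usual dot notation for nested struct columns, for example:
# Pull individual fields back out of the nested struct with dot notation
updated_df.select("Full_Name.First_Name",
                  "Full_Name.Middle_Name",
                  "Full_Name.Last_Name").show()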
Output:
