Custom row (List of CustomTypes) to PySpark dataframe
In this article, we are going to learn how to convert custom rows (a list of custom-type objects) into a PySpark data frame in Python.
We will explore how to create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. PySpark data frames are a powerful and efficient tool for working with large datasets in a distributed computing environment. They are similar to a table in a relational database or a data frame in R or Python. By creating a data frame from a list of custom objects, we can easily convert structured data into a format that can be analyzed and processed using PySpark's built-in functions and libraries.
Syntax of the CustomType class used to create the PySpark data frame:
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary
Explanation:
- The keyword class is used to define a new class.
- CustomType is the name of the class.
- Inside the class block, we have a special method called __init__, which is used to initialize the object when it is created. The __init__ method takes three arguments: name, age, and salary, and assigns them to the object's properties with the same name.
- self is a reference to the object itself, which is passed to the method automatically when the object is created.
- The properties name, age, and salary are set using the self.property_name = value notation, as the short sketch below shows.
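For instance, a single CustomType object can be created and its attributes read back directly (a minimal sketch; the print statements are only for illustration):
Python3
# Create one CustomType instance and access its attributes
person = CustomType("John", 30, 5000)
print(person.name)    # John
print(person.age)     # 30
print(person.salary)  # 5000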
Approach 1:
In the example below, we create a PySpark data frame from a list of custom objects, where each object represents a row in the data frame. The custom objects hold information about a person, such as their name, age, and salary. We first convert the list of custom objects to a list of Row objects using a list comprehension, and then create a data frame from that list of Row objects with the createDataFrame() method.
Step 1: The first line imports the Row class from the pyspark.sql module, which is used to create a row object for a data frame.
Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.
Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.
Step 4: A list comprehension is used to convert the list of CustomType objects into a list of Row objects, where each CustomType object is mapped to a Row object with the same name, age, and salary.
Step 5: The createDataFrame() method is called on the SparkSession object (spark) with the list of Row objects as input, creating a DataFrame.
Step 6: The data frame is displayed using the show() method.
Python3
# Importing required modules
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Define a custom class to represent a row in the dataframe
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary

# Create a list of CustomType objects
data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]

# Convert the list of CustomType objects to a list of Row objects
rows = [Row(name=d.name, age=d.age, salary=d.salary) for d in data]

# Create a dataframe from the list of Row objects
df = spark.createDataFrame(rows)

# Show the dataframe
df.show()
Output:
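The exact rendering depends on the Spark version; with Spark 3.x, which preserves the keyword order of Row fields, df.show() should produce a table roughly like this:
+----+---+------+
|name|age|salary|
+----+---+------+
|John| 30|  5000|
|Mary| 25|  6000|
|Mike| 35|  7000|
+----+---+------+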
Approach 2:
In this example, we convert the list of custom objects directly to an RDD and then convert the RDD to a data frame using the createDataFrame() method.
Step 1: The first line imports the Row class from the pyspark.sql module, which is not actually used in this code.
Step 2: A custom class called CustomType is defined with a constructor that takes in three parameters: name, age, and salary. These will represent the columns of the data frame.
Step 3: A list of CustomType objects is created with three instances, each with a different name, age, and salary.
Step 4: The parallelize() method of the SparkContext is called with the list of CustomType objects as input, creating an RDD (Resilient Distributed Dataset).
Step 5: The createDataFrame method is called on the SparkSession object (spark) with the RDD as input, creating a DataFrame.
Step 6: The data frame is displayed using the show method.
Python3
from pyspark.sql import Row
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Metadata").getOrCreate()

# Define a custom class to represent a row in the dataframe
class CustomType:
    def __init__(self, name, age, salary):
        self.name = name
        self.age = age
        self.salary = salary

# Create a list of CustomType objects
data = [CustomType("John", 30, 5000),
        CustomType("Mary", 25, 6000),
        CustomType("Mike", 35, 7000)]

# Convert the list of CustomType objects to an RDD
rdd = spark.sparkContext.parallelize(data)

# Create a dataframe from the rdd
df = spark.createDataFrame(rdd)

# Show the dataframe
df.show()
Output:
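Because the schema here is inferred from the attributes of the CustomType objects, the columns may come out in alphabetical order rather than in declaration order; on a typical Spark 3.x setup the output of df.show() looks roughly like this:
+---+----+------+
|age|name|salary|
+---+----+------+
| 30|John|  5000|
| 25|Mary|  6000|
| 35|Mike|  7000|
+---+----+------+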
Approach 3:
In this approach, we first define the schema for the data frame using the StructType class. We create three fields, name, age, and salary, with the types StringType, IntegerType, and IntegerType respectively. Then, we create a list of custom objects, where each object is a Python dictionary whose keys correspond to the field names in the schema. Finally, we use the createDataFrame() method with the list of dictionaries and the schema to create the data frame, and display it using the show() method.
Step 1: Define the schema for the data frame using the StructType class: This class allows you to define the structure and types of the columns in the data frame. You can define the name and type of each column using the StructField class.
Step 2: Create a list of custom objects: The custom objects can be in the form of Python dictionaries, where each dictionary represents a row in the data frame and the keys of the dictionary correspond to the column names defined in the schema.
Step 3: Create the data frame: Use the createDataFrame method and pass in the list of custom objects and the schema to create the data frame.
Step 4: Show the data frame: To display the data frame, use the show() method on the data frame object.
Python3
# Importing required modules
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Create a SparkSession
spark = SparkSession.builder.appName("Myapp").getOrCreate()

# Step 1: Define the schema for the dataframe
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("salary", IntegerType(), True)
])

# Step 2: Create a list of custom objects (dictionaries)
data = [{"name": "John", "age": 30, "salary": 5000},
        {"name": "Mary", "age": 25, "salary": 6000},
        {"name": "Mike", "age": 35, "salary": 7000}]

# Step 3: Create the dataframe
df = spark.createDataFrame(data, schema)

# Step 4: Show the dataframe
df.show()
Output:
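Because the schema is supplied explicitly, the column order is fixed here, and df.show() should produce output resembling:
+----+---+------+
|name|age|salary|
+----+---+------+
|John| 30|  5000|
|Mary| 25|  6000|
|Mike| 35|  7000|
+----+---+------+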
All three approaches achieve the same result: a data frame with three rows and three columns named "name", "age", and "salary". The data in the data frame is the same as the data in the list of custom objects.
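As a quick sanity check, printSchema() can be called on any of the resulting data frames to confirm the column names and types. Note that Approach 3 yields IntegerType for age and salary because the schema was supplied explicitly, whereas the other approaches infer LongType for Python integers. A sketch of what Approach 3 prints:
Python3
# Inspect the schema of the dataframe created in Approach 3
df.printSchema()
# Expected output:
# root
#  |-- name: string (nullable = true)
#  |-- age: integer (nullable = true)
#  |-- salary: integer (nullable = true)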