Convert PySpark Row List to Pandas DataFrame
Last Updated : 25 Mar, 2022
In this article, we will convert a PySpark Row list to a Pandas DataFrame. A Row object represents a single row of a PySpark DataFrame, so a DataFrame can naturally be represented as a Python list of Row objects.
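A Row behaves like a named tuple: its fields can be read by attribute or by key. Here is a minimal sketch (the field names are illustrative and match the examples below):
Python
from pyspark.sql import Row

# A Row is a named-tuple-like record mapping field names to values
row = Row(Topic='Arrays', Difficulty=5)

# Fields can be accessed by attribute or by key
print(row.Topic)          # Arrays
print(row['Difficulty'])  # 5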
Method 1 : Using createDataFrame() and toPandas()
Here is the syntax of the createDataFrame() method :
Syntax : current_session.createDataFrame(data, schema=None, samplingRatio=None, verifySchema=True)
Parameters :
- data : an RDD, a Python list, or a pandas.DataFrame containing the rows.
- schema : a datatype string, a list of column names, or a pyspark.sql.types.DataType describing the columns of the DataFrame.
- samplingRatio -> float : the ratio of rows sampled when inferring the schema.
- verifySchema -> bool : whether to verify that the data types of each row match the schema.
Returns : a PySpark DataFrame object.
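As a side note, the schema parameter also accepts a DDL-formatted string. Here is a minimal sketch (the column values and the 'schema_demo' app name are illustrative):
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('schema_demo').getOrCreate()

# Passing an explicit DDL schema string instead of letting Spark infer it
df = spark.createDataFrame([('Graphs', 8), ('Greedy', 4)],
                           schema='Topic string, Difficulty int')
df.printSchema()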
Example:
In this example, we will pass the Row list as data and create a PySpark DataFrame. We will then use the toPandas() method to get a Pandas DataFrame.
Python
# Importing PySpark and, importantly,
# Row from pyspark.sql
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# PySpark session
row_pandas_session = SparkSession.builder.appName(
    'row_pandas_session'
).getOrCreate()

# List of sample Row objects
row_object_list = [Row(Topic='Dynamic Programming', Difficulty=10),
                   Row(Topic='Arrays', Difficulty=5),
                   Row(Topic='Sorting', Difficulty=6),
                   Row(Topic='Binary Search', Difficulty=7)]

# Creating a PySpark DataFrame using createDataFrame()
df = row_pandas_session.createDataFrame(row_object_list)

# Printing the Spark DataFrame
df.show()

# Conversion to a Pandas DataFrame
pandas_df = df.toPandas()

# Final result
print(pandas_df)
Output :

Method 2 : Using parallelize()
We are going to use parallelize() to create an RDD. parallelize() distributes the elements of a local, pre-defined collection across the cluster to form a resilient distributed dataset (RDD) that can be operated on in parallel. Here is the syntax of parallelize() :
Syntax : sc.parallelize(data, numSlices=None)
sc : Spark Context object
Parameters :
- data : the local collection from which the RDD is to be made.
- numSlices : the number of partitions to create. This is an optional parameter; a short sketch of it is given below.
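Here is a minimal sketch of numSlices in action (the 'parallelize_demo' app name and the sample data are illustrative):
Python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('parallelize_demo').getOrCreate()
sc = spark.sparkContext

# Explicitly split the data into 2 partitions
rdd = sc.parallelize([1, 2, 3, 4], numSlices=2)
print(rdd.getNumPartitions())  # 2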
Example:
In this example, we first create an RDD from the Row list using parallelize(), then pass the RDD to createDataFrame() to create a PySpark DataFrame, and finally use toPandas() to get a Pandas DataFrame.
Python
# Importing PySpark and, importantly,
# Row from pyspark.sql
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import Row

# PySpark session
row_pandas_session = SparkSession.builder.appName(
    'row_pandas_session'
).getOrCreate()

# List of sample Row objects
row_object_list = [Row(Topic='Dynamic Programming', Difficulty=10),
                   Row(Topic='Arrays', Difficulty=5),
                   Row(Topic='Sorting', Difficulty=6),
                   Row(Topic='Binary Search', Difficulty=7)]

# Creating an RDD from the Row list
rdd = row_pandas_session.sparkContext.parallelize(row_object_list)

# DataFrame created from the RDD
df = row_pandas_session.createDataFrame(rdd)

# Checking the DataFrame
df.show()

# Conversion to a Pandas DataFrame
df2 = df.toPandas()

# Final DataFrame
print(df2)
Output :

Method 3 : Iterating through the Row list
In this method, we traverse the Row list and convert each Row object into a single-row PySpark DataFrame using createDataFrame(). We then convert each of these with toPandas() and append() it to an accumulating Pandas DataFrame, which will be our final answer. The details of append() are given below :
Syntax : df.append(other, ignore_index=False, verify_integrity=False, sort=False)
df : Pandas DataFrame
Parameters :
- other : the DataFrame, Series, dict-like object, or list of these whose rows are to be appended.
- ignore_index : if True, relabel the resulting index as 0, 1, ..., n-1 instead of keeping the original index labels.
- verify_integrity : if True, raise a ValueError when the resulting index contains duplicates.
- sort : sort the columns if the columns of df and other are not aligned.
Returns : a new DataFrame with the rows of other appended.
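Note that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas versions, pandas.concat() performs the same row-wise combination. Here is a minimal equivalent sketch (the sample frames are illustrative):
Python
import pandas as pd

df1 = pd.DataFrame({'Topic': ['Arrays'], 'Difficulty': [5]})
df2 = pd.DataFrame({'Topic': ['Sorting'], 'Difficulty': [6]})

# pd.concat([df1, df2]) is the modern replacement for df1.append(df2)
combined = pd.concat([df1, df2], ignore_index=True)
print(combined)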
Example:
In this example, we create a single-row PySpark DataFrame from each Row with createDataFrame(), convert it with toPandas(), and use append() to accumulate the rows into one Pandas DataFrame.
Python
# Importing PySpark
# Importing Pandas for append()
import pyspark
import pandas
from pyspark.sql import SparkSession
from pyspark.sql import Row

# PySpark session
row_pandas_session = SparkSession.builder.appName(
    'row_pandas_session'
).getOrCreate()

# List of sample Row objects
row_object_list = [Row(Topic='Dynamic Programming', Difficulty=10),
                   Row(Topic='Arrays', Difficulty=5),
                   Row(Topic='Sorting', Difficulty=6),
                   Row(Topic='Binary Search', Difficulty=7)]

# Our final DataFrame, initialized empty
mega_df = pandas.DataFrame()

# Traversing through the Row list
for row in row_object_list:
    # Creating a Spark DataFrame from a single Row
    small_df = row_pandas_session.createDataFrame([row])

    # Appending the Pandas version of small_df to mega_df
    mega_df = mega_df.append(small_df.toPandas(),
                             ignore_index=True)

# Printing our desired DataFrame
print(mega_df)
Output :

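On pandas 2.0 and later, where append() is no longer available, the same result can be obtained by collecting the per-row frames in a list and combining them once with pandas.concat(). Here is a sketch that reuses row_pandas_session and row_object_list from the example above:
Python
# Variant of Method 3 for pandas >= 2.0 (DataFrame.append() was removed).
# Reuses row_pandas_session and row_object_list from the example above.
partial_dfs = [row_pandas_session.createDataFrame([row]).toPandas()
               for row in row_object_list]

# A single concat call is also faster than appending inside the loop
mega_df = pandas.concat(partial_dfs, ignore_index=True)
print(mega_df)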