Optimize Conversion between PySpark and Pandas DataFrames
Last Updated: 28 Apr, 2025
PySpark and Pandas are two open-source Python libraries used for data analysis and data handling.
Conversion between PySpark and Pandas DataFrames
In this article, we are going to talk about how we can convert a PySpark DataFrame into a Pandas DataFrame and vice versa. Their conversion can be easily done in PySpark.
Converting Pandas DataFrame into a PySpark DataFrame
Here, we'll convert a Pandas DataFrame into a PySpark DataFrame. First, we import the PySpark and Pandas libraries and start a Spark session. Next, we create a Pandas DataFrame. To convert it, we pass the Pandas DataFrame to the createDataFrame() method and store the result, here in the same variable that held the Pandas DataFrame. These steps turn the Pandas DataFrame into a PySpark DataFrame.
Example:
Python3
# importing pandas and PySpark libraries
import pandas as pd
import pyspark

# initializing the PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a pandas DataFrame
df = pd.DataFrame({
    'Cardinal': [1, 2, 3],
    'Ordinal': ['First', 'Second', 'Third']
})

# converting the pandas DataFrame into a PySpark DataFrame
df = spark.createDataFrame(df)

# printing the first two rows
df.show(2)
Output:
If you would like to keep using the Pandas DataFrame later, store the PySpark DataFrame in a different variable instead of overwriting it.
Converting PySpark DataFrame into a Pandas DataFrame
Now we will convert a PySpark DataFrame into a Pandas DataFrame. The steps are the same as before, except this time we call the toPandas() method on the PySpark DataFrame.
Syntax to use toPandas() method:
spark_DataFrame.toPandas()
Example:
Python3
# importing the PySpark library
import pyspark

# importing Row from pyspark.sql for creating the DataFrame
from pyspark.sql import Row

# initializing the PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a PySpark DataFrame
spark_df = spark.createDataFrame([
    Row(Cardinal=1, Ordinal='First'),
    Row(Cardinal=2, Ordinal='Second'),
    Row(Cardinal=3, Ordinal='Third')
])

# converting spark_df into a pandas DataFrame
pandas_df = spark_df.toPandas()
pandas_df.head()
Output:
Now we will measure how long the conversion above takes.
Python3
%%time
import numpy as np
import pandas as pd
import pyspark

# creating the PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

spark_df.toPandas()
Output
3.17 s
Now let's enable PyArrow and see how long the same conversion takes.
Python3
%%time
import numpy as np
import pandas as pd
import pyspark

# creating the PySpark session
spark = pyspark.sql.SparkSession.builder.getOrCreate()

# creating a 10 x 10 PySpark DataFrame of random integers
spark_df = spark.createDataFrame(
    pd.DataFrame(np.random.randint(1, 101, size=(10, 10))))

# enabling PyArrow (on Spark 3.x this key is deprecated in favour of
# 'spark.sql.execution.arrow.pyspark.enabled')
spark.conf.set('spark.sql.execution.arrow.enabled', 'true')

spark_df.toPandas()
Output
460 ms
Here we can see that the time required to convert the PySpark DataFrame to a Pandas DataFrame drops drastically, from about 3.17 s to 460 ms, once PyArrow is enabled.