How to re-partition pyspark dataframe in Python
Last Updated: 26 Apr, 2025
Are you a data science or machine learning enthusiast who likes to work with data? Have you ever needed to repartition a PySpark dataset and wondered how to go about it? Don't worry! In this article, we will discuss how to re-partition a PySpark data frame in Python.
Modules Required:
- Pyspark: PySpark is the Spark library that lets you run Python applications on Apache Spark. It can be installed with the following command:
pip install pyspark
Stepwise Implementation:
Step 1: First of all, import the required library, i.e. SparkSession. The SparkSession library is used to create the session.
from pyspark.sql import SparkSession
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
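Optionally, you can give the session a name with the builder's appName method. This is a minimal sketch; the application name below is just an illustrative value:
spark_session = SparkSession.builder.appName('repartition-example').getOrCreate()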
Step 3: Then, read the CSV file and display it to check that it has been read correctly.
data_frame = spark_session.read.csv('#Path of CSV file', sep = ',', inferSchema = True, header = True)
data_frame.show()
Step 4: Next, obtain the number of RDD partitions in the data frame before the repartition of data using the getNumPartitions function.
print(data_frame.rdd.getNumPartitions())
Step 5: Next, repartition the data using the select and repartition functions, where the select function takes the column names to keep and the repartition function takes the number of partitions to create.
data_frame_partition = data_frame.select(#Column names which need to be partitioned).repartition(#Number of partitions)
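As a side note, the repartition function can also take column names directly, in which case the rows are hash-partitioned by those columns. A minimal sketch, assuming the same data_frame and the longitude and latitude columns used later in the example:
data_frame_partition = data_frame.repartition(4, "longitude", "latitude")
print(data_frame_partition.rdd.getNumPartitions())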
Step 6: Finally, obtain the number of RDD partitions in the data frame after the repartition, again using the getNumPartitions function. This is done to verify that the repartition was successful.
print(data_frame_partition.rdd.getNumPartitions())
We have read the california_housing_train.csv file in this example and obtained the current number of partitions. We then selected two columns, longitude and latitude, repartitioned that data into 4 partitions, and again checked the number of partitions of the new data frame to confirm that it was repartitioned correctly.
Python
# Python program to repartition
# Pyspark dataframe
# Import the SparkSession library
from pyspark.sql import SparkSession
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
# Read the CSV file
data_frame = spark_session.read.csv('california_housing_train.csv',
                                    sep=',', inferSchema=True,
                                    header=True)
# Display the csv file read
print(data_frame.head())
# Get number of partitions in data frame using getNumPartitions function
print(" Before repartition", data_frame.rdd.getNumPartitions())
# Repartition the CSV file by longitude, latitude columns
data_frame_partition = data_frame.select(data_frame.longitude,
data_frame.latitude).repartition(4)
# Get number of partitions in data frame using getNumPartitions function
print(" After repartition", data_frame_partition.rdd.getNumPartitions())
Output:
Row(longitude=-114.31, latitude=34.19, housing_median_age=15.0,
total_rooms=5612.0, total_bedrooms=1283.0, population=1015.0, households=472.0,
median_income=1.4936, median_house_value=66900.0)
Before repartition 1
After repartition 4
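To see how the rows are distributed across the new partitions, here is a small sketch, assuming the data_frame_partition created above, that uses the spark_partition_id function from pyspark.sql.functions:
from pyspark.sql.functions import spark_partition_id

# Tag each row with the id of the partition it belongs to,
# then count the rows in each partition
data_frame_partition.withColumn("partition_id", spark_partition_id()) \
    .groupBy("partition_id").count().show()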