Split DataFrame by Row Index in PySpark
Last Updated: 26 Apr, 2025
In this article, we are going to learn about splitting a PySpark data frame by row index in Python.
In data science there are often very large datasets, and plenty of modules, functions, and methods are available to process that data. In this article, we are going to process data by splitting a data frame by row index using PySpark in Python.
Modules Required:
PySpark: PySpark is the Python API for Apache Spark. It lets you work with Spark from Python and offers a DataFrame interface that feels familiar to users of the Pandas and Scikit-learn libraries. This module can be installed through the following command in Python:
pip install pyspark
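If you want to confirm that the installation worked before moving on (a small optional check, not part of the original steps), you can import the package and print its version:
# Quick sanity check: import PySpark and print the installed version
import pyspark
print(pyspark.__version__)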
Stepwise Implementation:
Step 1: First of all, import the required names: SparkSession, Window, monotonically_increasing_id, and ntile. The SparkSession class is used to create the session, while the Window class operates on a group of rows and returns a single value for every input row. The monotonically_increasing_id function generates a column of monotonically increasing 64-bit integers, and the ntile function returns the ntile group id (from 1 to n inclusive) in an ordered window partition.
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
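To make the roles of these two functions concrete, here is a minimal self-contained sketch; the toy demo_df data frame and the group count of 3 are illustrative assumptions, not part of the example built later in this article:
# Small demo of monotonically_increasing_id() and ntile()
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile

spark = SparkSession.builder.getOrCreate()

# A toy data frame with six rows (illustrative only)
demo_df = spark.createDataFrame([(letter,) for letter in 'abcdef'], ('letter',))

# monotonically_increasing_id() tags every row with a unique, increasing
# 64-bit integer (increasing, but not necessarily consecutive)
demo_df = demo_df.withColumn('id', monotonically_increasing_id())

# ntile(3) splits the rows, ordered by that id, into 3 roughly equal
# groups and returns the group number (1, 2 or 3) for each row
demo_df.withColumn('group', ntile(3).over(Window.orderBy('id'))).show()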
Step 2: Now, create a spark session using the getOrCreate function.
spark_session = SparkSession.builder.getOrCreate()
Step 3: Then, either create the data frame from a list of strings or read it from a CSV file.
values = [#Declare the list of strings]
data_frame = spark_session.createDataFrame(values, ('value',))
or
data_frame = spark_session.read.csv('#Path of CSV file', sep=',', inferSchema=True, header=True)
Step 4: Later on, create a function that, when called, will split the PySpark data frame by row index.
def split_by_row_index(df, number_of_partitions=#Number_of_partitions):
Step 4.1: Further, assign a row_id column that has the row order for the data frame using the monotonically_increasing_id function.
updated_df = df.withColumn('_row_id', monotonically_increasing_id())
Step 4.2: Moreover, assign each row to one of the partitions by breaking the continuous increasing sequence into groups. For this, we use the ntile function, which returns the ntile group id for each row over a window ordered by the row id.
updated_df = updated_df.withColumn('_partition', ntile(number_of_partitions).over(Window.orderBy(updated_df._row_id)))
Step 4.3: Next, return the list of split data frames, one per partition, with the helper columns dropped.
return [updated_df.filter(updated_df._partition == i+1).drop('_row_id', '_partition') for i in range(number_of_partitions)]
Step 5: Finally, call the function and collect the rows of each resulting data frame to see the PySpark data frame split by row index.
[i.collect() for i in split_by_row_index(data_frame)]
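The default of four partitions comes from the function signature; you can also pass the count explicitly. Below is a short usage sketch that assumes data_frame and split_by_row_index are already defined as above:
# Split into 2 chunks instead of the default 4 and check their sizes
chunks = split_by_row_index(data_frame, number_of_partitions=2)
print([chunk.count() for chunk in chunks])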
Example 1:
In this example, we have created the data frame from a list of strings and then split it by row index into four partitions, assigning a group id to each partition.
Python3
# Python program to split data frame by row index
# Import the libraries SparkSession, Window,
# monotonically_increasing_id, and ntile libraries
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
# Define the list from which you want to create the data frame
values = [(str(i),) for i in range(20)]
# Create the data frame from the list defined
data_frame = spark_session.createDataFrame(values, ('value',))
# Create a function which, when called, splits the
# PySpark data frame by row index
def split_by_row_index(df, number_of_partitions=4):
# Assign a row_id column that has the row order for the data frame
updated_df = df.withColumn('_row_id',
monotonically_increasing_id())
# Break the continuous increasing sequence and returns the ntile
# group id for partitioned columns using ntile function
updated_df = updated_df.withColumn('_partition', ntile(
number_of_partitions).over(Window.orderBy(updated_df._row_id)))
# Return the split data frame according to row index and partitions
return [updated_df.filter(updated_df._partition == i+1).drop(
'_row_id', '_partition') for i in range(number_of_partitions)]
# Call the function and collect each split data frame
[i.collect() for i in split_by_row_index(data_frame)]
Output:
[[Row(value='0'),
Row(value='1'),
Row(value='2'),
Row(value='3'),
Row(value='4')],
[Row(value='5'),
Row(value='6'),
Row(value='7'),
Row(value='8'),
Row(value='9')],
[Row(value='10'),
Row(value='11'),
Row(value='12'),
Row(value='13'),
Row(value='14')],
[Row(value='15'),
Row(value='16'),
Row(value='17'),
Row(value='18'),
Row(value='19')]]
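Each element of the returned list is an ordinary PySpark data frame, so you can keep working with a single chunk instead of collecting everything. A short follow-up sketch, assuming the data_frame and function from this example:
# Keep the chunks as data frames and inspect only the first one
chunks = split_by_row_index(data_frame)
chunks[0].show()
print('rows in first chunk:', chunks[0].count())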
Example 2:
In this example, we have read a CSV file, i.e., a dataset of 5 rows and 5 columns, and then split it by row index into four partitions, assigning a group id to each partition.
Python3
# Python program to split data frame by row index
# Import the libraries SparkSession, Window,
# monotonically_increasing_id, and ntile libraries
from pyspark.sql import SparkSession
from pyspark.sql.window import Window
from pyspark.sql.functions import monotonically_increasing_id, ntile
# Create a spark session using getOrCreate() function
spark_session = SparkSession.builder.getOrCreate()
# Read the CSV file
data_frame = spark_session.read.csv(
    '/content/class_data.csv', sep=',', inferSchema=True,
    header=True)
# Create a function which, when called, splits the
# PySpark data frame by row index
def split_by_row_index(df, number_of_partitions=4):
# Assign a row_id column that has the row order for the data frame
updated_df = df.withColumn('_row_id', monotonically_increasing_id())
# Break the continuous increasing sequence and returns the ntile
# group id for partitioned columns using ntile function
updated_df = updated_df.withColumn('_partition', ntile(
number_of_partitions).over(Window.orderBy(updated_df._row_id)))
# Return the split data frame according to row index and partitions
return [updated_df.filter(updated_df._partition == i+1).drop(
'_row_id', '_partition') for i in range(number_of_partitions)]
# Call the function and collect each split data frame
[i.collect() for i in split_by_row_index(data_frame)]
Output:
[[Row(name='Arun', subject='Maths', class=10, fees=12000, discount=400),
Row(name='Aniket', subject='Social Science', class=11, fees=15000, discount=600)],
[Row(name='Ishita', subject='English', class=9, fees=9000, discount=0)],
[Row(name='Pranjal', subject='Science', class=12, fees=18000, discount=1000)],
[Row(name='Vinayak', subject='Computer', class=12, fees=18000, discount=500)]]
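A common follow-up is to persist each chunk separately, for example one CSV output per chunk. The sketch below is only illustrative; the output directory /content/class_data_chunks is a hypothetical path, not part of the original example:
# Write every chunk produced from the CSV data frame to its own folder
for index, chunk in enumerate(split_by_row_index(data_frame)):
    chunk.write.mode('overwrite').csv(
        '/content/class_data_chunks/chunk_{}'.format(index), header=True)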