How to slice a PySpark DataFrame into two row-wise DataFrames?
Last Updated : 26 Jan, 2022
In this article, we are going to learn how to slice a PySpark DataFrame into two row-wise DataFrames. Slicing a DataFrame means taking a subset that contains all rows from one index to another.
Method 1: Using limit() and subtract() functions
In this method, we first make a PySpark DataFrame with hard-coded data using createDataFrame(). We then use the limit() function to take a fixed number of rows from the DataFrame and store them in a new variable. The syntax of limit() is :
Syntax : DataFrame.limit(num)
Returns : A new DataFrame limited to num rows.
We will then use the subtract() function to get the remaining rows from the initial DataFrame. The syntax of subtract() is :
Syntax : DataFrame1.subtract(DataFrame2)
Returns : A new DataFrame containing rows in DataFrame1 but not in DataFrame2.
Python
# Importing PySpark
import pyspark
from pyspark.sql import SparkSession
# Session Creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()
# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']
# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)
# Getting the slices
# The first slice has 3 rows
df1 = df.limit(3)
# Getting the second slice by removing df1 from df
df2 = df.subtract(df1)
# Printing the first slice
df1.show()
# Printing the second slice.
df2.show()
Output:
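A quick sanity check on this method (an illustrative sketch, not part of the original example) is to confirm that the two slices are disjoint and together cover the whole DataFrame. The snippet below reuses df, df1 and df2 from the code above and assumes the data has no duplicate rows, since subtract() operates on distinct rows.
Python
# The slices should together account for every row of df
assert df1.count() + df2.count() == df.count()
# Removing both slices from df should leave nothing behind
assert df.subtract(df1).subtract(df2).count() == 0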

Method 2: Using randomSplit() function
In this method, we are first going to make a PySpark DataFrame using createDataFrame(). We will then use the randomSplit() function to get two slices of the DataFrame, specifying the fractions of rows that will go into each slice. The rows are split up RANDOMLY.
Syntax : DataFrame.randomSplit(weights, seed)
Parameters :
- weights : list of double values according to which the DataFrame is split.
- seed : the seed for sampling. This parameter is optional.
Returns : List of split DataFrames
Python
# Importing PySpark
import pyspark
from pyspark.sql import SparkSession
# Session Creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()
# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']
# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)
# the first slice gets roughly 20% of the rows
# the second slice gets roughly 80% of the rows
# the rows in each slice are chosen randomly
df1, df2 = df.randomSplit([0.20, 0.80])
# Showing the first slice
df1.show()
# Showing the second slice
df2.show()
Output:
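Keep in mind that randomSplit() assigns each row to a slice probabilistically, so the actual slice sizes only approximate the requested weights, especially on a DataFrame this small, and the result can differ between runs. Passing the optional seed makes the split reproducible. The following is a small illustrative sketch that reuses df from the code above; the 0.5/0.5 weights and the seed value are arbitrary choices for the example.
Python
# Reproducible split: the same seed gives the same row assignment
# for the same data, though slice sizes are still only approximate
df1, df2 = df.randomSplit([0.5, 0.5], seed=42)
df1.show()
df2.show()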

Method 3: Using collect() function
In this method, we will first make a PySpark DataFrame using createDataFrame(). We will then get a list of Row objects of the DataFrame using :
DataFrame.collect()
We will then use Python list slicing to get two lists of Row objects. Finally, we convert these two lists back to PySpark DataFrames using createDataFrame().
Python
# Importing PySpark
import pyspark
from pyspark.sql import SparkSession
# Session Creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()
# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']
# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)
# getting the list of Row objects
row_list = df.collect()
# Slicing the Python List
part1 = row_list[:1]
part2 = row_list[1:]
# Converting the slices to PySpark DataFrames
slice1 = Spark_Session.createDataFrame(part1)
slice2 = Spark_Session.createDataFrame(part2)
# Printing the first slice
print('First DataFrame')
slice1.show()
# Printing the second slice
print('Second DataFrame')
slice2.show()
Output:
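Two things are worth noting about this method. First, collect() brings every row to the driver, so it is only practical for DataFrames that fit in driver memory. Second, when rebuilding the slices you can pass the original schema so the column names and types are preserved exactly; inference from the Row objects usually works, but being explicit avoids ambiguity. The sketch below is an illustrative variant that reuses df, part1 and part2 from the code above.
Python
# Rebuild the slices with the original schema so the column
# names and types match df exactly
slice1 = Spark_Session.createDataFrame(part1, schema=df.schema)
slice2 = Spark_Session.createDataFrame(part2, schema=df.schema)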

Method 4: Converting PySpark DataFrame to a Pandas DataFrame and using iloc[] for slicing
In this method, we will first make a PySpark DataFrame using createDataFrame(). We will then convert it into a pandas DataFrame using toPandas(). We then slice the pandas DataFrame using iloc[] with the syntax :
DataFrame.iloc[start_index:end_index]
The row at end_index is NOT included. Finally, we will convert our DataFrame slices back to PySpark DataFrames using createDataFrame().
Python
# Importing PySpark and Pandas
import pyspark
from pyspark.sql import SparkSession
import pandas as pd
# Session Creation
Spark_Session = SparkSession.builder.appName(
    'Spark Session'
).getOrCreate()
# Data filled in our DataFrame
rows = [['Lee Chong Wei', 69, 'Malaysia'],
        ['Lin Dan', 66, 'China'],
        ['Srikanth Kidambi', 9, 'India'],
        ['Kento Momota', 15, 'Japan']]
# Columns of our DataFrame
columns = ['Player', 'Titles', 'Country']
# DataFrame is created
df = Spark_Session.createDataFrame(rows, columns)
# Converting DataFrame to pandas
pandas_df = df.toPandas()
# First DataFrame formed by slicing
df1 = pandas_df.iloc[:2]
# Second DataFrame formed by slicing
df2 = pandas_df.iloc[2:]
# Converting the slices to PySpark DataFrames
df1 = Spark_Session.createDataFrame(df1)
df2 = Spark_Session.createDataFrame(df2)
# Printing the first slice
print('First DataFrame')
df1.show()
# Printing the second slice
print('Second DataFrame')
df2.show()
Output:
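The same idea can be wrapped in a small helper that splits a DataFrame at an arbitrary row index. The function split_at_index below is an illustrative sketch (it is not part of the original article) and, like Method 4 itself, it assumes the DataFrame is small enough for toPandas() to collect onto the driver.
Python
def split_at_index(spark_df, n, spark_session):
    # Collect to pandas, cut at row n, and convert both
    # parts back to PySpark DataFrames
    pdf = spark_df.toPandas()
    first = spark_session.createDataFrame(pdf.iloc[:n])
    second = spark_session.createDataFrame(pdf.iloc[n:])
    return first, second

# Example: the first slice holds rows 0-2, the second the rest
df1, df2 = split_at_index(df, 3, Spark_Session)
df1.show()
df2.show()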