Split Spark DataFrame based on condition in Python
Last Updated: 28 Apr, 2025
In this article, we are going to learn how to split data frames based on conditions using PySpark in Python.
Spark data frames are a powerful tool for working with large datasets in Apache Spark. They let you manipulate and analyze data in a structured way, using SQL-like operations. Sometimes, we may want to split a Spark DataFrame based on a specific condition. For example, we may want to split a DataFrame into two separate DataFrames based on whether a column value is greater than or less than a certain threshold.
Why split a data frame based on a condition?
There are a few common reasons to split a data frame based on a condition:
- Data preparation: We may want to split a data frame into separate data frames based on certain conditions in order to prepare the data for further analysis or modeling. For example, we may want to split a DataFrame into two data frames based on whether a column value is missing or not (see the sketch after this list).
- Data exploration: We may want to split a data frame based on a condition in order to explore different subsets of the data and understand their characteristics. For example, we may want to split a data frame into two data frames based on the values in a categorical column in order to compare the distribution of other columns between the two groups.
- Data cleaning: We may want to split a data frame based on a condition in order to identify and fix errors or inconsistencies in the data. For example, we may want to split a data frame into two data frames based on whether a column value is within a certain range in order to identify and fix any values that are outside of the range.
- Data analysis: We may want to split a data frame based on a condition in order to perform different analyses on different subsets of the data. For example, we may want to split a data frame into two data frames based on the values in a categorical column in order to perform separate analyses on each group.
- Data reporting: We may want to split a data frame based on a condition in order to create separate reports or visualizations for different subsets of the data. For example, we may want to split a data frame into two data frames based on the values in a categorical column in order to create separate reports for each group.
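As a quick illustration of the data-preparation case above, here is a minimal sketch that splits a data frame on whether a column value is missing. The toy data and the age column are made up for illustration only.
Python3
# Minimal sketch: split a DataFrame on missing values in a
# hypothetical 'age' column (toy data, for illustration only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Split on nulls").getOrCreate()

# Toy data; None marks a missing age
df = spark.createDataFrame(
    [("Alice", 25), ("Bob", None), ("Cara", 31)],
    ["name", "age"])

missing_df = df.filter(df["age"].isNull())     # rows with a missing age
present_df = df.filter(df["age"].isNotNull())  # rows with a known age

missing_df.show()
present_df.show()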
Splitting the PySpark data frame using the filter() method
The filter() method returns a new data frame that contains only the rows satisfying the condition passed to filter() as a parameter.
Syntax:
df.filter(condition)
- Here df is the name of the DataFrame and condition is a boolean column expression; rows for which the expression evaluates to True are kept.
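For instance, returning to the threshold case from the introduction, the condition can be any boolean column expression. Below is a minimal sketch of a threshold split; the toy data, the age column, and the threshold of 30 are made up for illustration.
Python3
# Minimal sketch: split a DataFrame on a numeric threshold
# (hypothetical 'age' column and toy data, for illustration only)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Threshold split").getOrCreate()

df = spark.createDataFrame(
    [("Alice", 25), ("Bob", 34), ("Cara", 31)],
    ["name", "age"])

older_df = df.filter(df["age"] >= 30)   # rows with age at least 30
younger_df = df.filter(df["age"] < 30)  # rows with age below 30

older_df.show()
younger_df.show()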
Problem statement: Given a CSV file containing information about people, such as their name, age, and gender, the task is to split the data into two data frames based on the gender of the person. The first data frame should contain the rows where the gender is male, and the second data frame should contain rows where the gender is female. Below is the stepwise implementation to perform this task:
Step 1: The first step is to create a SparkSession, which is the entry point to using Spark functionality. We give it the name "Split DataFrame" for reference.
Step 2: Next, we use the spark.read.csv() method to load the data from the "number.csv" file into a data frame. We specify that the file has a header row and that we want Spark to infer the schema of the data.
Step 3: We then use the filter() method on the data frame to split it into two new data frames based on a certain condition. In this case, we use the condition df['gender'] == 'Male' to filter the data frame and create a new data frame called males_df containing only rows with a gender of 'Male'. Similarly, we use the condition df['gender'] == 'Female' to filter the data frame and create a new data frame called females_df containing only rows with a gender of 'Female'.
Step 4: Finally, we use the show() method to print the contents of the males_df and females_df data frames.
Dataset: number.csv
Python3
# Import required modules
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Split DataFrame").getOrCreate()

# Load the data into a DataFrame
df = spark.read.csv("number.csv",
                    header=True,
                    inferSchema=True)
df.show()

# Split the DataFrame into two
# DataFrames based on a condition
males_df = df.filter(df['gender'] == 'Male')
females_df = df.filter(df['gender'] == 'Female')

# Print the dataframes
males_df.show()
females_df.show()
Output before split:
Output after split:
Alternatively, we can use the where() method, which is an alias for filter() and behaves identically:
males_df = df.where(df['gender'] == 'Male')
females_df = df.where(df['gender'] == 'Female')
Handling the Split DataFrames
Once we have split the data frame, we can perform further operations on the resulting data frames, such as aggregating the data, joining them with other tables, or saving them to new files. Here is an example that uses the count() method to get the number of rows in each of the split data frames:
Python3
# Import required modules
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder.appName("Split DataFrame").getOrCreate()

# Load the data into a DataFrame
df = spark.read.csv("number.csv",
                    header=True,
                    inferSchema=True)

# Split the DataFrame into two
# DataFrames based on a condition
males_df = df.filter(df['gender'] == 'Male')
females_df = df.filter(df['gender'] == 'Female')

# Print the data frames
males_df.show()
females_df.show()

# Print the row count of each split
print("Males:", males_df.count())
print("Females:", females_df.count())
Output:
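Beyond counting rows, the same pattern extends to the other operations mentioned above. The following is a sketch, assuming number.csv also contains a numeric age column; the output path "females_output" and the avg_age alias are made up for illustration.
Python3
# Sketch: aggregate one split and save the other (assumes a numeric
# 'age' column in number.csv; the output path is illustrative)
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("Split DataFrame").getOrCreate()
df = spark.read.csv("number.csv", header=True, inferSchema=True)

males_df = df.filter(df['gender'] == 'Male')
females_df = df.filter(df['gender'] == 'Female')

# Average age of the male split
males_df.agg(F.avg("age").alias("avg_age")).show()

# Write the female split to a new CSV file
females_df.write.csv("females_output", header=True, mode="overwrite")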