Split Spark DataFrame based on condition in Python
Last Updated: 28 Apr, 2025
In this article, we are going to learn how to split data frames based on conditions using PySpark in Python.
Spark data frames are a powerful tool for working with large datasets in Apache Spark. They allow us to manipulate and analyze data in a structured way, using SQL-like operations. Sometimes, we may want to split a Spark DataFrame based on a specific condition. For example, we may want to split a DataFrame into two separate DataFrames based on whether a column value is greater than or less than a certain threshold.
Why split a data frame based on a condition?
There are a few common reasons to split a data frame based on a condition:
- Data preparation: We may want to split a data frame into separate data frames based on certain conditions in order to prepare the data for further analysis or modeling. For example, we may want to split a DataFrame into two data frames based on whether a column value is missing or not missing (a sketch of this appears after this list).
- Data exploration: We may want to split a data frame based on a condition in order to explore different subsets of the data and understand their characteristics. For example, we may want to split a data frame into two data frames based on the values in a categorical column in order to compare the distribution of other columns between the two groups.
- Data cleaning: We may want to split a data frame based on a condition in order to identify and fix errors or inconsistencies in the data. For example, we may want to split a data frame into two data frames based on whether a column value is within a certain range in order to identify and fix any values that are outside of the range.
- Data analysis: We may want to split a data frame based on a condition in order to perform different analyses on different subsets of the data. For example, we may want to split a data frame into two data frames based on the values in a categorical column in order to perform separate analyses on each group.
- Data reporting: We may want to split a data frame based on a condition in order to create separate reports or visualizations for different subsets of the data. For example, we may want to split a data frame into two data frames based on the values in a categorical column in order to create separate reports for each group.
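As a quick sketch of the data preparation and data cleaning scenarios above, the snippet below splits a hypothetical DataFrame df on a null check and on a value range, using the filter() method described in the next section. The column names salary and age and the 0-120 bounds are illustrative placeholders, not part of the dataset used later in this article.
# Split on missing vs. present values (data preparation); 'salary' is a hypothetical column
with_salary_df = df.filter(df['salary'].isNotNull())
without_salary_df = df.filter(df['salary'].isNull())

# Split on a value range (data cleaning); the 0-120 bounds are illustrative
in_range_df = df.filter((df['age'] >= 0) & (df['age'] <= 120))
out_of_range_df = df.filter((df['age'] < 0) | (df['age'] > 120))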
Splitting the PySpark data frame using the filter() method
The filter() method returns a new data frame that contains only the rows matching the condition passed to filter() as a parameter.
Syntax:
df.filter(condition)
- Where df is the name of the DataFrame and condition is a boolean expression that evaluates to true or false for each row.
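The condition can be written either as a Column expression or as a SQL expression string. For instance, assuming a DataFrame df with a numeric age column (a hypothetical sketch, not the example developed below), both forms are accepted:
# Column expression
over_30_df = df.filter(df['age'] > 30)
# Equivalent SQL expression string
over_30_df = df.filter("age > 30")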
Problem statement: Given a CSV file containing information about people, such as their name, age, and gender, the task is to split the data into two data frames based on the gender of the person. The first data frame should contain the rows where the gender is male, and the second data frame should contain rows where the gender is female. Below is the stepwise implementation to perform this task:
Step 1: The first step is to create a SparkSession, which is the entry point to using Spark functionality. We give it the name "Split DataFrame" for reference.
Step 2: Next, we use the spark.read.csv() method to load the data from the "number.csv" file into a data frame. We specify that the file has a header row and that we want Spark to infer the schema of the data.
Step 3: We then use the filter() method on the data frame to split it into two new data frames based on a certain condition. In this case, we use the condition df['gender'] == 'Male' to filter the data frame and create a new data frame called males_df containing only rows with a gender of 'Male'. Similarly, we use the condition df['gender'] == 'Female' to filter the data frame and create a new data frame called females_df containing only rows with a gender of 'Female'.
Step 4: Finally, we use the show() method to print the contents of the males_df and females_df data frames.
Dataset: number.csv
Python3
# Import required modules
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("Split DataFrame").getOrCreate()
# Load the data into a DataFrame
df = spark.read.csv("number.csv",
                    header=True,
                    inferSchema=True)
df.show()
# Split the DataFrame into two
# DataFrames based on a condition
males_df = df.filter(df['gender'] == 'Male')
females_df = df.filter(df['gender'] == 'Female')
# Print the dataframes
males_df.show()
females_df.show()
Output before split:
Output after split:
Alternatively, we can use the where() method, which is an alias for filter(), for example:
males_df = df.where(df['gender'] == 'Male')
females_df = df.where(df['gender'] == 'Female')
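Conditions can also be combined with the & (and), | (or), and ~ (not) operators, or written with the col() helper from pyspark.sql.functions. The sketch below reuses the gender and age columns from the dataset above to split out adult males versus everyone else; the 18-year threshold is an illustrative assumption:
from pyspark.sql.functions import col

adult_males_df = df.filter((col('gender') == 'Male') & (col('age') >= 18))
everyone_else_df = df.filter(~((col('gender') == 'Male') & (col('age') >= 18)))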
Handling the Split DataFrames
Once we have split the data frame, we can perform further operations on the resulting data frames, such as aggregating the data, joining with other tables, or saving the data to a new file. Here is an example of how to use the count() method to get the number of rows in each of the split data frames:
Python3
# Import required modules
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("Split DataFrame").getOrCreate()
# Load the data into a DataFrame
df = spark.read.csv("number.csv",
                    header=True,
                    inferSchema=True)
# Split the DataFrame into two
# DataFrames based on a condition
males_df = df.filter(df['gender'] == 'Male')
females_df = df.filter(df['gender'] == 'Female')
# Print the data frames
males_df.show()
females_df.show()
# Print the count
print("Males:", males_df.count())
print("Females:", females_df.count())
Output:
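Beyond counting rows, each split can be processed on its own. The sketch below aggregates the average age per group and writes each subset out as CSV; the output paths are placeholders, and the aggregation assumes the age column from the dataset above.
from pyspark.sql.functions import avg

# Aggregate: average age within each split
males_df.agg(avg('age')).show()
females_df.agg(avg('age')).show()

# Save each split to its own folder (paths are placeholders)
males_df.write.mode('overwrite').csv('output/males', header=True)
females_df.write.mode('overwrite').csv('output/females', header=True)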