PySpark DataFrame - Where Filter

Last Updated : 28 Mar, 2022

In this article, we look at the where() filter for PySpark DataFrames. where() is a method that filters the rows of a DataFrame based on a given condition. It is an alias for the filter() method, so the two behave identically. The condition can be a single test on one column or a combination of several tests on multiple columns.

Syntax: DataFrame.where(condition)

Example 1:

The following example shows how to apply a single condition to a DataFrame using the where() method.

Python3

# importing required module
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of Employees data
data = [
    (121, ("Mukul", "Kumar"), 25000, 25),
    (122, ("Arjun", "Singh"), 28000, 23),
    (123, ("Rohan", "Verma"), 30000, 27),
    (124, ("Manoj", "Singh"), 30000, 22),
    (125, ("Robin", "Kumar"), 28000, 23)
]

# specify column names
columns = ['Employee ID', 'Name', 'Salary', 'Age']

# creating a dataframe from the lists of data
df = spark.createDataFrame(data, columns)
print(" Original data ")
df.show()

# filter dataframe based on single condition
df2 = df.where(df.Salary == 28000)
print(" After filter dataframe based on single condition  ")
df2.show()

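Output: the filtered DataFrame keeps only the rows with Employee ID 122 and 125, the two employees whose Salary is 28000.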

Example 2:

The following example shows how to apply multiple conditions to a DataFrame using the where() method. Each condition is wrapped in parentheses and the conditions are combined with the & (and) operator.

Python3

# importing required module
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of Employees data
data = [
    (121, ("Mukul", "Kumar"), 22000, 23),
    (122, ("Arjun", "Singh"), 23000, 22),
    (123, ("Rohan", "Verma"), 24000, 23),
    (124, ("Manoj", "Singh"), 25000, 22),
    (125, ("Robin", "Kumar"), 26000, 23)
]

# specify column names
columns = ['Employee ID', 'Name', 'Salary', 'Age']

# creating a dataframe from the lists of data
df = spark.createDataFrame(data, columns)
print(" Original data ")
df.show()

# filter dataframe based on multiple conditions
df2 = df.where((df.Salary > 22000) & (df.Age == 22))
print(" After filter dataframe based on multiple conditions  ")
df2.show()

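Output: the filtered DataFrame keeps only the rows with Employee ID 122 and 124, the employees whose Salary is above 22000 and whose Age is 22.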

Example 3:

The following example shows how to filter a DataFrame by passing a Column condition to the where() method. Here the column is referenced with bracket notation, df["Age"], instead of attribute access.

Python3

# importing required module
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of Employees data
data = [
    (121, "Mukul", 22000, 23),
    (122, "Arjun", 23000, 22),
    (123, "Rohan", 24000, 23),
    (124, "Manoj", 25000, 22),
    (125, "Robin", 26000, 23)
]

# specify column names
columns = ['Employee ID', 'Name', 'Salary', 'Age']

# creating a dataframe from the lists of data
df = spark.createDataFrame(data, columns)
print("Original Dataframe")
df.show()

# where() method with a Column condition
df2 = df.where(df["Age"] == 23)
print(" After filter dataframe")
df2.show()

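Output: the filtered DataFrame keeps only the rows with Employee ID 121, 123 and 125, the employees whose Age is 23.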

Example 4:

The following example shows how to use the where() method with a SQL expression passed as a string.

Python3

# importing required module
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of Employees data
data = [
    (121, "Mukul", 22000, 23),
    (122, "Arjun", 23000, 22),
    (123, "Rohan", 24000, 23),
    (124, "Manoj", 25000, 22),
    (125, "Robin", 26000, 23)
]

# specify column names
columns = ['Employee ID', 'Name', 'Salary', 'Age']

# creating a dataframe from the lists of data
df = spark.createDataFrame(data, columns)
print("Original Dataframe")
df.show()

# where() method with SQL Expression
df2 = df.where("Age == 22")
print(" After filter dataframe")
df2.show()