Delete rows in PySpark dataframe based on multiple conditions

Last Updated : 29 Jun, 2021

In this article, we are going to see how to delete rows in a PySpark dataframe based on multiple conditions.

Method 1: Using a logical expression

Here we use a logical expression to filter the rows. The filter() function keeps the rows of an RDD/DataFrame that satisfy the given condition or SQL expression, so to delete rows we keep only the rows that do not match the delete condition.

Syntax: filter(condition)

Parameters:
condition: logical condition or SQL expression

Example 1:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# spark library import
import pyspark.sql.functions

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]

# specify column names
columns = ['student_ID', 'student_NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# keep only the rows whose college is not "IIT",
# i.e. delete the rows where college == "IIT"
dataframe = dataframe.filter(dataframe.college != "IIT")

dataframe.show()

Output:

Example 2:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# spark library import
import pyspark.sql.functions

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]

# specify column names
columns = ['student_ID', 'student_NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# keep only the rows that satisfy both conditions,
# i.e. delete the rows where college == "DU" or student_ID == "3"
dataframe = dataframe.filter(
    (dataframe.college != "DU") & (dataframe.student_ID != "3"))

dataframe.show()

Output:
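Because filter() also accepts a SQL expression given as a string, the same deletion can be written without column objects. The sketch below is not part of the original examples; it assumes the dataframe built from data and columns above and simply repeats the Example 2 condition as a SQL string.

Python3

# same condition as Example 2, expressed as a SQL string:
# keep the rows where college is not "DU" and student_ID is not "3"
dataframe.filter("college != 'DU' AND student_ID != '3'").show()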
Method 2: Using the when() method

when() evaluates a list of conditions and returns a single value: the value attached to the first condition that is true, or null when none of them holds (and no otherwise() is given). By flagging the rows we want to keep and then filtering on that flag, we can delete the remaining rows.

Syntax: when(condition, value)

Parameters:
condition: boolean Column expression
value: literal value

Example:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# spark library import
import pyspark.sql.functions

# spark library import
from pyspark.sql.functions import when

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]

# specify column names
columns = ['student_ID', 'student_NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# flag the rows to keep in a temporary column, keep only the
# flagged rows, then drop the temporary column
dataframe.withColumn('New_col',
                     when(dataframe.student_ID != '5', "True")
                     .when(dataframe.student_NAME != 'gnanesh', "True")
                     ).filter("New_col == True").drop("New_col").show()

Output:
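In the example above, the row where both student_ID is '5' and student_NAME is 'gnanesh' matches neither when() branch, so New_col stays null and the final filter drops it. An equivalent, slightly more explicit sketch (an illustration, not part of the original article) marks every row with otherwise() before filtering:

Python3

# alternative sketch: flag every row explicitly with when().otherwise(),
# then keep only the rows flagged "True" and drop the helper column
from pyspark.sql.functions import when

dataframe.withColumn(
    'New_col',
    when((dataframe.student_ID != '5') |
         (dataframe.student_NAME != 'gnanesh'), "True").otherwise("False")
).filter("New_col == 'True'").drop("New_col").show()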