Delete rows in PySpark dataframe based on multiple conditions

Last Updated : 29 Jun, 2021

In this article, we are going to see how to delete rows in a PySpark dataframe based on multiple conditions.

Method 1: Using a logical expression

Here we use a logical expression to filter the rows. The filter() function keeps the rows of an RDD/DataFrame that satisfy the given condition or SQL expression, so to delete rows we keep only the rows that do not match the delete condition.

Syntax: filter(condition)

Parameters:
condition: logical condition or SQL expression

Example 1:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# spark library import
import pyspark.sql.functions

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]

# specify column names
columns = ['student_ID', 'student_NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# keep only the rows whose college is not "IIT",
# i.e. delete the rows where college == "IIT"
dataframe = dataframe.filter(dataframe.college != "IIT")

dataframe.show()

Output:

Example 2:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# spark library import
import pyspark.sql.functions

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]

# specify column names
columns = ['student_ID', 'student_NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# keep only the rows that satisfy both conditions,
# i.e. delete the rows where college == "DU" or student_ID == "3"
dataframe = dataframe.filter(
    (dataframe.college != "DU") & (dataframe.student_ID != "3"))

dataframe.show()

Output:
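Because filter() also accepts a SQL expression given as a string, the same deletion can be written without column objects. The sketch below is not part of the original examples; it assumes the dataframe built from data and columns above and simply repeats the Example 2 condition as a SQL string.

Python3

# same condition as Example 2, expressed as a SQL string:
# keep the rows where college is not "DU" and student_ID is not "3"
dataframe.filter("college != 'DU' AND student_ID != '3'").show()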
Method 2: Using the when() method

when() evaluates a list of conditions and returns a single value: the value attached to the first condition that is true, or null when none of them holds (and no otherwise() is given). By flagging the rows we want to keep and then filtering on that flag, we can delete the remaining rows.

Syntax: when(condition, value)

Parameters:
condition: boolean Column expression
value: literal value

Example:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# spark library import
import pyspark.sql.functions

# spark library import
from pyspark.sql.functions import when

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of students data
data = [["1", "Amit", " DU"],
        ["2", "Mohit", "DU"],
        ["3", "rohith", "BHU"],
        ["4", "sridevi", "LPU"],
        ["1", "sravan", "KLMP"],
        ["5", "gnanesh", "IIT"]]

# specify column names
columns = ['student_ID', 'student_NAME', 'college']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# flag the rows to keep in a temporary column, keep only the
# flagged rows, then drop the temporary column
dataframe.withColumn('New_col',
                     when(dataframe.student_ID != '5', "True")
                     .when(dataframe.student_NAME != 'gnanesh', "True")
                     ).filter("New_col == True").drop("New_col").show()

Output:
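In the example above, the row where both student_ID is '5' and student_NAME is 'gnanesh' matches neither when() branch, so New_col stays null and the final filter drops it. An equivalent, slightly more explicit sketch (an illustration, not part of the original article) marks every row with otherwise() before filtering:

Python3

# alternative sketch: flag every row explicitly with when().otherwise(),
# then keep only the rows flagged "True" and drop the helper column
from pyspark.sql.functions import when

dataframe.withColumn(
    'New_col',
    when((dataframe.student_ID != '5') |
         (dataframe.student_NAME != 'gnanesh'), "True").otherwise("False")
).filter("New_col == 'True'").drop("New_col").show()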