Python PySpark - DataFrame filter on multiple columns

Last Updated: 14 Sep, 2021

In this article, we are going to filter the dataframe on multiple columns by using the filter() and where() functions in PySpark in Python.

Creating a dataframe for demonstration:

Python3

# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [[1, "sravan", "company 1"],
        [2, "ojaswi", "company 1"],
        [3, "rohith", "company 2"],
        [4, "sridevi", "company 1"],
        [1, "sravan", "company 1"],
        [4, "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display dataframe
dataframe.show()

Output:

Method 1: Using filter() Method

filter() returns a dataframe based on the given condition, removing the rows that do not satisfy it and keeping only the matching rows. We are going to filter the dataframe on multiple columns. It takes a condition and returns a new dataframe.

Syntax: dataframe.filter(condition)

Example 1: The condition can be built from boolean, logical, or relational operators.

Python3

# select rows where ID is less than 3
dataframe.filter(dataframe.ID < 3).show()

Output:

Example 2: Python program to filter data based on two columns. In this example, we select the rows of the dataframe where ID is less than 3 or the name is sridevi.

Python3

# select rows where ID is less than
# 3 or name is sridevi
dataframe.filter((dataframe.ID < 3) |
                 (dataframe.NAME == 'sridevi')).show()

Output:

Example 3: Multiple columns filtering.

Python3

# select rows where ID is less than 3, or
# name is sridevi and company is company 1
dataframe.filter((dataframe.ID < 3) | (
    (dataframe.NAME == 'sridevi') &
    (dataframe.Company == 'company 1'))).show()

Output:

Method 2: Using where() Method

where() is similar to the filter() function: it returns a dataframe based on the given condition, removing the rows that do not satisfy it and keeping only the matching rows. It takes a condition and returns a new dataframe.

Syntax: dataframe.where(condition)

Example 1: Python program to filter on multiple columns.

Python3

# select rows where ID is less than 3, or
# name is sridevi and company is company 1
dataframe.where((dataframe.ID < 3) | (
    (dataframe.NAME == 'sridevi') &
    (dataframe.Company == 'company 1'))).show()

Output:
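Note: the examples above reference columns through attribute access (dataframe.ID). The same multi-column conditions can also be written with pyspark.sql.functions.col() or as a SQL expression string, both of which filter() and where() accept. A minimal sketch, assuming the dataframe created in the demonstration above:

Python3

from pyspark.sql.functions import col

# same multi-column filter written with col()
# instead of attribute access
dataframe.filter((col('ID') < 3) | (
    (col('NAME') == 'sridevi') &
    (col('Company') == 'company 1'))).show()

# filter()/where() also accept a SQL expression string
dataframe.where(
    "ID < 3 OR (NAME = 'sridevi' AND Company = 'company 1')").show()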
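filter() and where() calls can also be chained, with each call narrowing the previous result; this is equivalent to joining the conditions with &. A short sketch, again assuming the dataframe from the demonstration:

Python3

# chained calls narrow the result step by step; this is
# equivalent to (Company == 'company 1') & (ID < 3)
dataframe.filter(dataframe.Company == 'company 1') \
         .where(dataframe.ID < 3) \
         .show()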