Filtering a row in PySpark DataFrame based on matching values from a list
Last Updated :
28 Jul, 2021
In this article, we are going to filter rows in a PySpark dataframe based on matching values from a list, using the isin() method.
isin(): returns a boolean Column that is True for rows whose column value matches one of the given elements, so it can be used as a filtering condition.
Syntax: isin([element1, element2, ..., element n])
Create Dataframe for demonstration:
Python3
# importing module
import pyspark
# importing sparksession
from pyspark.sql import SparkSession
# creating sparksession
# and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# list of students data
data = [[1, "sravan", "vignan"],
[2, "ramya", "vvit"],
[3, "rohith", "klu"],
[4, "sridevi", "vignan"],
[5, "gnanesh", "iit"]]
# specify column names
columns = ['ID', 'NAME', 'college']
# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)
dataframe.show()
Output:
+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1| sravan| vignan|
|  2|  ramya|   vvit|
|  3| rohith|    klu|
|  4|sridevi| vignan|
|  5|gnanesh|    iit|
+---+-------+-------+

Method 1: Using filter() method
filter() checks the given condition and returns only the rows that satisfy it; it behaves exactly like where(), described in Method 2.
Syntax: dataframe.filter(condition)
where, condition is the filtering condition.
Combined with isin(), the overall syntax is:
Syntax: dataframe.filter((dataframe.column_name).isin([list_of_elements])).show()
where,
- column_name is the column to match against
- list_of_elements are the values to look for in that column
- show() displays the resulting dataframe
Example 1: Get particular IDs with the filter() clause.
Python3
# keep rows whose ID is 1, 2 or 3
dataframe.filter(dataframe.ID.isin([1, 2, 3])).show()
Output:
+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
|  2| ramya|   vvit|
|  3|rohith|    klu|
+---+------+-------+
Example 2: Get IDs not present in the list [1, 3].
Python3
# keep rows whose ID is not 1 or 3; ~ negates the condition
dataframe.filter(~(dataframe.ID).isin([1, 3])).show()
Output:
+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  2|  ramya|   vvit|
|  4|sridevi| vignan|
|  5|gnanesh|    iit|
+---+-------+-------+
Example 3: Get rows by NAME.
Python3
# keep rows where NAME is sravan
dataframe.filter(dataframe.NAME.isin(['sravan'])).show()
Output:
+---+------+-------+
| ID|  NAME|college|
+---+------+-------+
|  1|sravan| vignan|
+---+------+-------+

Method 2: Using where() method
where() checks the given condition and returns only the rows that satisfy it; it is an alias of filter().
Syntax: dataframe.where(condition)
where, condition is the filtering condition.
Overall syntax with the where() clause:
dataframe.where((dataframe.column_name).isin([list_of_elements])).show()
where,
- column_name is the column to match against
- list_of_elements are the values to look for in that column
- show() displays the resulting dataframe
Example: Get particular colleges with the where() clause.
Python3
# keep rows where college is vignan
dataframe.where(dataframe.college.isin(['vignan'])).show()
Output:
+---+-------+-------+
| ID|   NAME|college|
+---+-------+-------+
|  1| sravan| vignan|
|  4|sridevi| vignan|
+---+-------+-------+