Filtering rows based on column values in PySpark dataframe
Last Updated: 29 Jun, 2021
In this article, we are going to filter rows based on column values in a PySpark DataFrame.
Creating a DataFrame for demonstration:
Python3
# importing module
import pyspark

# importing sparksession from pyspark.sql module
from pyspark.sql import SparkSession

# creating sparksession and giving an app name
spark = SparkSession.builder.appName('sparkdf').getOrCreate()

# list of employee data
data = [["1", "sravan", "company 1"],
        ["2", "ojaswi", "company 1"],
        ["3", "rohith", "company 2"],
        ["4", "sridevi", "company 1"],
        ["1", "sravan", "company 1"],
        ["4", "sridevi", "company 1"]]

# specify column names
columns = ['ID', 'NAME', 'Company']

# creating a dataframe from the lists of data
dataframe = spark.createDataFrame(data, columns)

# display the dataframe
dataframe.show()
Output:
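+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  1| sravan|company 1|
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  1| sravan|company 1|
|  4|sridevi|company 1|
+---+-------+---------+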
Method 1: Using where() function
where() evaluates a boolean condition against each row and returns a new DataFrame containing only the rows that satisfy it.
Syntax: dataframe.where(condition)
We filter rows by building the condition from column values; the condition is a boolean column expression such as dataframe.ID == '1'.
Example 1: Filter rows where ID = 1.
Python3
# get the rows where ID is 1
dataframe.where(dataframe.ID == '1').show()
Output:
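+---+------+---------+
| ID|  NAME|  Company|
+---+------+---------+
|  1|sravan|company 1|
|  1|sravan|company 1|
+---+------+---------+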
Example 2: Filter rows where NAME is not 'sravan'.
Python3
# get the rows where NAME is not 'sravan'
dataframe.where(dataframe.NAME != 'sravan').show()
Output:
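+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  2| ojaswi|company 1|
|  3| rohith|company 2|
|  4|sridevi|company 1|
|  4|sridevi|company 1|
+---+-------+---------+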
Example 3: Filtering on multiple column values with where().
Python program to filter rows where ID is greater than 2 and Company is 'company 1':
Python3
# filter rows where ID is greater than 2
# and Company is 'company 1'
dataframe.where((dataframe.ID > '2') & (dataframe.Company == 'company 1')).show()
Output:
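+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  4|sridevi|company 1|
|  4|sridevi|company 1|
+---+-------+---------+
where() also accepts the condition as a SQL expression string instead of a column expression. A minimal sketch equivalent to the example above (note the quoted '2', since ID is stored as a string in this DataFrame):
Python3
# same filter written as a SQL expression string
dataframe.where("ID > '2' AND Company = 'company 1'").show()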
Method 2: Using filter() function
filter() evaluates a boolean condition and returns only the matching rows; in PySpark, where() is simply an alias for filter(), so the two are interchangeable.
Syntax: dataframe.filter(condition)
Example 1: Python code to get the rows where Company is 'company 2'.
Python3
# get the rows where Company is 'company 2'
dataframe.filter(dataframe.Company == 'company 2').show()
Output:
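+---+------+---------+
| ID|  NAME|  Company|
+---+------+---------+
|  3|rohith|company 2|
+---+------+---------+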
Example 2: Filter the rows where ID > 3. Note that ID is stored as a string here, so this is a lexicographic comparison; cast the column to an integer type if you need numeric ordering.
Python3
# get the rows where ID > '3'
dataframe.filter(dataframe.ID > '3').show()
Output:
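+---+-------+---------+
| ID|   NAME|  Company|
+---+-------+---------+
|  4|sridevi|company 1|
|  4|sridevi|company 1|
+---+-------+---------+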
Example 3: Filtering on multiple column values.
Python program to filter rows where ID is greater than 2 and Company is 'company 2':
Python3
# filter rows where ID is greater than 2
# and Company is 'company 2'
dataframe.filter((dataframe.ID > '2') &
                 (dataframe.Company == 'company 2')).show()
Output:
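+---+------+---------+
| ID|  NAME|  Company|
+---+------+---------+
|  3|rohith|company 2|
+---+------+---------+
The same condition can also be written with the col() helper from pyspark.sql.functions, which avoids attribute access on the DataFrame and works even when a column name clashes with a DataFrame method; a minimal equivalent sketch:
Python3
# same filter using the col() helper
from pyspark.sql.functions import col

dataframe.filter((col('ID') > '2') & (col('Company') == 'company 2')).show()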