Filter PySpark DataFrame Columns with None or Null Values
Last Updated: 25 Jan, 2023
While working with PySpark SQL DataFrames, the columns often contain NULL/None values. In many cases these values have to be handled before any operation is performed on the DataFrame in order to get the desired output, which means filtering the NULL/None values out of the DataFrame.
In this article, we are going to learn how to filter a PySpark DataFrame column with NULL/None values.
For filtering out the NULL/None values, the PySpark API provides the filter() function, which we use together with the isNotNull() function on a column.
Syntax:
- df.filter(condition) : returns a new DataFrame containing only the rows that satisfy the given condition.
- df.column_name.isNotNull() : builds the condition that keeps only the rows whose value in that column is not NULL/None. A minimal sketch combining the two calls is shown right after this list.
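Put together, the two calls look like the snippet below. This is only an illustrative sketch: df and the Name column are placeholders for whatever DataFrame and column you are working with, matching the data used in the examples that follow.
Python
# keep only the rows where the "Name" column is not NULL/None
# (df and Name are placeholder names for your own DataFrame and column)
not_null_df = df.filter(df.Name.isNotNull())
not_null_df.show()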
Example 1: Filtering PySpark dataframe column with None value
In the code below we first create the SparkSession and then a DataFrame that contains some None values in every column. We then filter out the None values present in the Name column by calling filter() with the condition df.Name.isNotNull().
Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Filter_values.com") \
        .getOrCreate()
    return spk

# function to create dataframe
def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [("Shivansh", "Data Scientist", "Noida"),
                  (None, "Software Developer", None),
                  ("Swati", "Data Analyst", "Hyderabad"),
                  (None, None, "Noida"),
                  ("Arpit", "Android Developer", "Banglore"),
                  (None, None, None)]
    schema = ["Name", "Job Profile", "City"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # filtering the columns with None values
    df = df.filter(df.Name.isNotNull())

    # visualizing the dataframe
    df.show()
Output:
Original Dataframe
Dataframe after filtering NULL/None values
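The same condition can also be written with the col() helper from pyspark.sql.functions instead of referencing the DataFrame variable inside the condition. This is a small optional variant of the example above, not part of the original code.
Python
from pyspark.sql.functions import col

# equivalent to df.filter(df.Name.isNotNull())
df = df.filter(col("Name").isNotNull())
df.show()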
Example 2: Filtering PySpark dataframe column with NULL/None values using filter() function
In the code below we again create the SparkSession and a DataFrame that contains some None values in every column. This time we filter out the None values present in the City column by passing the condition to filter() in plain SQL form, i.e. "City is Not NULL".
Note: in this form the condition is passed to filter() as a string containing a SQL expression, rather than as a Column object.
Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create new SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Filter_values.com") \
        .getOrCreate()
    return spk

def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [("Shivansh", "Data Scientist", "Noida"),
                  (None, "Software Developer", None),
                  ("Swati", "Data Analyst", "Hyderabad"),
                  (None, None, "Noida"),
                  ("Arpit", "Android Developer", "Banglore"),
                  (None, None, None)]
    schema = ["Name", "Job Profile", "City"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # filtering the columns with None values
    df = df.filter("City is Not NULL")

    # visualizing the dataframe
    df.show()
Output:
Original Dataframe
After filtering NULL/None values from the city column
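As a side note, PySpark DataFrames also expose where() as an alias of filter(), so the SQL-expression form used above can equally be written with where(). This is an optional variant, not part of the original example.
Python
# where() is an alias of filter(), so this is equivalent to the call above
df = df.where("City is Not NULL")
df.show()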
Example 3: Filter columns with None values using filter() when the column name has a space
In the code below we create the SparkSession and a DataFrame that contains some None values in every column. We then filter out the None values present in the 'Job Profile' column by calling filter() with the condition df["Job Profile"].isNotNull().
Note: a column whose name contains a space cannot be accessed with dot notation; it has to be accessed with square brackets on the DataFrame, i.e. df["column name"].
Python
# importing necessary libraries
from pyspark.sql import SparkSession

# function to create SparkSession
def create_session():
    spk = SparkSession.builder \
        .master("local") \
        .appName("Filter_values.com") \
        .getOrCreate()
    return spk

def create_df(spark, data, schema):
    df1 = spark.createDataFrame(data, schema)
    return df1

if __name__ == "__main__":

    # calling function to create SparkSession
    spark = create_session()

    input_data = [("Shivansh", "Data Scientist", "Noida"),
                  (None, "Software Developer", None),
                  ("Swati", "Data Analyst", "Hyderabad"),
                  (None, None, "Noida"),
                  ("Arpit", "Android Developer", "Banglore"),
                  (None, None, None)]
    schema = ["Name", "Job Profile", "City"]

    # calling function to create dataframe
    df = create_df(spark, input_data, schema)

    # filtering the Job Profile with None values
    df = df.filter(df["Job Profile"].isNotNull())

    # visualizing the dataframe
    df.show()
Output:
Original Dataframe
After filtering NULL/None values from the Job Profile column
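Finally, if the goal is to drop every row that has a None value in any of the columns, the per-column conditions can be combined with the & operator. The sketch below builds on the same DataFrame and is added here for illustration rather than taken from the original examples.
Python
# keep only the rows where none of the three columns is NULL/None
df = df.filter(df.Name.isNotNull() &
               df["Job Profile"].isNotNull() &
               df.City.isNotNull())
df.show()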