How to drop all columns with null values in a PySpark DataFrame?
Last Updated :
01 May, 2022
The pyspark.sql.DataFrameNaFunctions class in PySpark has several methods for dealing with NULL/None values. One of them is drop(), which removes rows that contain NULL values in DataFrame columns. It is called as df.na.drop(), and df.dropna() is an equivalent alias. Depending on its arguments, drop() removes rows that have a NULL in any column, in all columns, or in a single or chosen set of columns, which makes it useful for sanitizing data before processing. When a file is read into a PySpark DataFrame, any empty value appears as NULL. To drop such rows in RDBMS SQL you must check each column for null values yourself, whereas drop() examines all columns and drops the matching rows in one call. Note that drop() works on rows. To drop columns that contain nulls, which is the goal of this article, we build a small helper function on top of select() and DataFrame.drop().
PySpark drop() Syntax
The drop() method takes three optional arguments that control which rows are removed, based on NULL values in a single column, any column, all columns, or a chosen set of columns. Because drop() is a transformation, it returns a new DataFrame with the rows removed and leaves the current DataFrame unchanged.
drop(how='any', thresh=None, subset=None)
All of these settings are optional.
- how – Accepts 'any' or 'all'. With 'any', a row is dropped if it contains a NULL in any column. With 'all', a row is dropped only if every column is NULL. The default value is 'any'.
- thresh – An int. Rows with fewer than thresh non-null values are dropped. When thresh is set, it overrides how. The default is None.
- subset – An optional list of column names to check for NULL values. Columns outside the subset are ignored. The default is None, which means all columns are checked.
Implementation
Before we begin, let's read a CSV file into a DataFrame. When a row has no value for a column, PySpark stores null in that column, whether the column is a String or an Integer.
CSV Used:
Python3
# Initialise findspark first so that pyspark can be located on import
import findspark
findspark.init()

import pyspark.sql.functions as sqlf
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

# Read the CSV with a header row and let Spark infer the column types
filePath = "example1.csv"
df = spark.read.options(header='true', inferSchema='true') \
    .csv(filePath)
df.printSchema()
df.show(truncate=False)
This prints the schema and the contents of the DataFrame. In the output, the name and city columns contain null values.
Drop Columns with NULL Values
Python3
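def dropNullColumns(df):
    """
    This function drops every column that contains at least one null value.
    :param df: A PySpark DataFrame
    :return: A new DataFrame without the columns that contain nulls
    """
    # Count the nulls in each column and collect the counts into a dict
    null_counts = df.select([sqlf.count(sqlf.when(sqlf.col(c).isNull(), c)).alias(
        c) for c in df.columns]).collect()[0].asDict()  # 1
    # Keep the names of the columns that have one or more nulls
    col_to_drop = [k for k, v in null_counts.items() if v > 0]  # 2
    df = df.drop(*col_to_drop)  # 3
    return df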
In the statement marked # 1, we use PySpark's select method, which evaluates a list of expressions and returns a new DataFrame. Each expression in the list counts the null values in one column: when() yields a non-null value only where the column is null, and count() counts those non-null results. The collect method then brings the single result row back to the driver, and asDict() turns it into a dict that maps each column name to its number of nulls.
In the statement marked # 2, we keep the names of the columns whose null count is greater than 0, that is, every column that contains at least one null value.
In the statement marked # 3, we pass those column names to the DataFrame's drop function and return the resulting DataFrame.
Example:
CSV Used:
Python3
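# Initialise findspark first so that pyspark can be located on import
import findspark
findspark.init()

import pyspark.sql.functions as sqlf
from pyspark.sql import SparkSession

spark: SparkSession = SparkSession.builder \
    .master("local[1]") \
    .appName("SparkByExamples.com") \
    .getOrCreate()

filePath = "/content/swimming_pool.csv"
df = spark.read.options(header='true', inferSchema='true') \
    .csv(filePath)
df.printSchema()
df.show(truncate=False)

# Apply dropNullColumns (defined in the previous section) and show the result
df = dropNullColumns(df)
df.show(truncate=False)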
After applying the dropNullColumns function, the DataFrame keeps only the columns that contain no null values.