
Handling Null Values & Missing Data in PySpark

Karthik Kondpak


1. Checking for Null Values
Before handling null values, it's good practice to check
for their presence in a DataFrame.
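
The snippets in this guide assume an active SparkSession named spark and a DataFrame df with a nullable column named column_name; both names are placeholders. A minimal toy DataFrame for trying the examples could look like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy data: None becomes null in the resulting DataFrame
df = spark.createDataFrame(
    [(1, "a"), (2, None), (3, "c"), (4, None)],
    ["id", "column_name"],
)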

Using isNull() and isNotNull()

from pyspark.sql.functions import col

df.filter(col("column_name").isNull()).show()
df.filter(col("column_name").isNotNull()).show()

Checking All Columns at Once

# One boolean column per input column, True where the value is null
df.select([col(c).isNull().alias(c) for c in df.columns]).show()



Counting Null Values

from pyspark.sql.functions import count, when

# Number of nulls in each column
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
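
With the toy df sketched earlier, this prints 0 under id and 2 under column_name.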



2. Dropping Null Values
PySpark provides the .dropna() method (equivalently df.na.drop()) to remove rows that contain null values.

df.na.drop()

Dropping Rows Based on Conditions

Drop rows if any column has null:

df.na.drop("any")



Drop rows only if all columns are null:

df.na.drop("all")

Drop rows based on a subset of columns:

df.na.drop(subset=["column1", "column2"])



3. Filling Null Values
Instead of dropping null values, you can replace
them using .fillna().

Filling with a Specific Value

df.na.fill("default_value").show()
df.na.fill(0).show()

Note that a string value only fills string columns, and a numeric value only fills numeric columns.

Filling with Different Values for Different Columns

df.na.fill({"column1": "unknown", "column2": 0}).show()



4. Replacing Null with
Mean/Median/Mode
To replace null values with the column mean,
median, or mode:

Filling with Mean

from pyspark.sql.functions import mean

mean_value = df.select(mean(col("column_name"))).collect()[0][0]
df = df.na.fill(mean_value, subset=["column_name"])



Filling with Median

# Approximate median (50th percentile) with 1% relative error
median_value = df.approxQuantile("column_name", [0.5], 0.01)[0]
df = df.na.fill(median_value, subset=["column_name"])

Filling with Mode

from pyspark.sql.functions import col

# Most frequent non-null value
mode_value = (
    df.filter(col("column_name").isNotNull())
    .groupBy("column_name")
    .count()
    .orderBy(col("count").desc())
    .first()[0]
)
df = df.na.fill(mode_value, subset=["column_name"])



5. Replacing Null Values with Forward Fill
& Backward Fill

PySpark doesn't have built-in functions for forward fill (ffill) or backward fill (bfill), but you can achieve them using window functions.

Forward Fill (ffill)

import sys

from pyspark.sql.window import Window
from pyspark.sql.functions import last

# Carry the most recent non-null value forward (ignorenulls=True)
window_spec = Window.orderBy("some_column").rowsBetween(-sys.maxsize, 0)
df = df.withColumn("column_name", last("column_name", True).over(window_spec))



Backward Fill (bfill)

from pyspark.sql.functions import first

# Take the next non-null value looking forward (ignorenulls=True)
window_spec = Window.orderBy("some_column").rowsBetween(0, sys.maxsize)
df = df.withColumn("column_name", first("column_name", True).over(window_spec))



6. Handling Null Values in Aggregations

By default, PySpark ignores null values in aggregations.

Count Excluding Nulls (the default when counting a column)

from pyspark.sql.functions import count

df.agg(count("column_name")).show()

Count Including Nulls

from pyspark.sql.functions import lit

# Counting a literal counts every row, nulls included
df.agg(count(lit(1))).show()
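
The same default applies to other aggregates such as avg and sum, which simply skip null entries. A minimal sketch (the nums DataFrame is made up for illustration):

from pyspark.sql.functions import avg, sum as spark_sum

nums = spark.createDataFrame([(10,), (None,), (20,), (30,)], "x INT")

# Both skip the null row: avg = 20.0, sum = 60
nums.agg(avg("x"), spark_sum("x")).show()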



7. Handling Null Values in Joins
In SQL semantics null never compares equal to null, so rows whose join key is null do not match anything in an ordinary equi-join and silently drop out of inner joins (sketched below).
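
A minimal sketch of the problem, using two made-up single-column DataFrames:

left = spark.createDataFrame([("a",), (None,)], "join_column STRING")
right = spark.createDataFrame([("a",), (None,)], "join_column STRING")

# Only the "a" row survives; the null keys never match each other
left.join(right, on="join_column", how="inner").show()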

Using fillna() Before Joins

df1 = df1.na.fill("unknown", subset=["join_column"])
df2 = df2.na.fill("unknown", subset=["join_column"])
joined_df = df1.join(df2, on="join_column", how="inner")



Using coalesce() to Handle Null Joins

from pyspark.sql.functions import coalesce, lit

df1 = df1.withColumn("join_column", coalesce(col("join_column"), lit("default_value")))
df2 = df2.withColumn("join_column", coalesce(col("join_column"), lit("default_value")))
joined_df = df1.join(df2, "join_column", "inner")
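
Alternatively, if you want null keys to match each other rather than be overwritten, a null-safe equality condition via Column.eqNullSafe (the SQL <=> operator) can be used; a brief sketch:

# Null join keys on both sides now match; note both join_column columns are kept
joined_df = df1.join(df2, df1["join_column"].eqNullSafe(df2["join_column"]), "inner")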



8. Handling Nulls in Complex Data Types

Handling Null in Arrays

array_remove() cannot match null elements (comparisons with null yield null), so use a higher-order filter expression to strip them:

from pyspark.sql.functions import expr

df = df.withColumn("array_column", expr("filter(array_column, x -> x IS NOT NULL)"))

Handling Null in Structs

To fill nulls inside a struct, rebuild it with coalesce() applied to each field (the field names below are illustrative):

from pyspark.sql.functions import struct, coalesce, lit

df = df.withColumn("struct_column", struct(
    coalesce(col("struct_column.field1"), lit("unknown")).alias("field1"),
    coalesce(col("struct_column.field2"), lit(0)).alias("field2"),
))



If you find this helpful, like and share it with your friends.

https://www.seekhobigdata.com/
