Data Skew and Remedies in Spark Programming
What is Skew
“Data Skew” is a Condition in which a Table’s data is “Unevenly Distributed” among
the Partitions in the Cluster. Operations, such as “Join”, perform very slowly on such
Partitions.
Initially, data is typically read in “Partitions” of 128 MB each, and evenly
distributed throughout the “Cluster of Computers”.
However, as the data is “Transformed”, for example “Aggregated”, it is possible to
have significantly more records in one “Partition” than another. This is called “Skew”.
To some degree, a small amount of “Skew” is ignorable, perhaps in the 10% range. But
a large “Skew” can result in “Spill”, or even worse, hard-to-diagnose “Out of
Memory Errors”.
As seen from the above image, after the “Aggregation”, if the data in City D is 2
times larger than the data in City A, City B, or City C -
➢ The data in City D is going to take 2 times as long to process as the data in
City A, City B, or City C, assuming it takes “n” seconds per Record.
➢ The data in City D requires twice as much RAM to process as the data in City
A, City B, or City C, assuming it takes “n” MB per Record.
The ramification of “Skew” -
➢ The entire “Stage” is going to take as long as the “Longest-Running Task”.
• Purely from an Execution-Time Standpoint, if all the four Partitions are
processed simultaneously, the entire Job will take as long as the
“Longest-Running Task”, i.e., the Task processing the data of City D.
➢ The more dire problem would be to not have enough RAM for the “Skewed
Partitions”.
# Count the number of records per "city_id" to visualize any Skew
dfVisualize = (spark.read
    .format("delta")
    .load(transactionPath)
    .groupBy("city_id")
    .count()
    .orderBy("count"))

display(dfVisualize)
The resultant data, in the created DataFrame, can be visually displayed in the following
way -
From the above image, it could be seen that it took almost “2 minutes”
to process the Join operation between the two Tables/Views without
“Skew Hint”.
Spark SQL Query of a Join between two Tables/Views with “Skew Hint” -
%sql
SELECT /*+ SKEW('A') */ A.city_id, B.city,
B.country, A.transacted_at, A.trx_id, A.retailer_id, A.description,
A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
From the above image, it could be seen that it took “8.13 minutes” to
process the Join operation between a Table/View and a Subquery
without “Skew Hint”.
From the above image, it could be seen that it took “8.68 minutes” to
process the Join operation between a Table/View and a Subquery with
“Skew Hint”.
In this case, using the “Skew Hint” on a Relation (Table/View/Subquery)
mitigates the Data Skew but introduces a Performance Hit.
➢ Configure Skew Hint with Relation Name and Column Names - There might be
multiple Joins on a “Relation” and only some of the Joins will suffer from Skew.
“Skew Join Optimization” has some overhead. So, it is better to use it only when
needed. For this purpose, the “Skew Hint” accepts Column Names. Only the
Joins with the provided Columns use “Skew Join Optimization”.
✓ Skew on a Single Column - Spark SQL Query of a Join on a Single Column
between two Tables/Views with “Skew Hint” -
%sql
SELECT /*+ SKEW('A', 'city_id') */ A.city_id, B.city,
B.country, A.transacted_at, A.trx_id, A.retailer_id, A.description,
A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
➢ Configure Skew Hint with Relation Name, Column Names and Skew Values - It
is possible to specify the Skew Values in the “Skew Hint”. Depending on the Query
and the Data, the Skew Values might be known or might be easy to find out, since
the Skew Values never change. Doing this reduces the overhead of the “Skew Join
Optimization”. Otherwise, “Delta Lake” detects the Skew Values automatically.
✓ Skew on a Single Column and a Single Skew Value - Spark SQL Query of a Join
on a Single Column and a Single Skew Value between two Tables/Views
with “Skew Hint” -
%sql
SELECT /*+ SKEW('A', 'city_id', 28424447) */
A.city_id, B.city, B.country, A.transacted_at, A.trx_id, A.retailer_id,
A.description, A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
✓ Skew on Multiple Columns and Multiple Skew Values - Spark SQL Query
of a Join on Multiple Columns and Multiple Skew Values between two
Tables/Views with “Skew Hint” -
%sql
SELECT /*+ SKEW('A', ('city_id', 'city'), ((28424447,
'Albany'), (559832710, 'Jackson'))) */ A.city_id, B.city, B.country,
A.transacted_at, A.trx_id, A.retailer_id, A.description, A.amount
FROM global_temp.tblTransaction A
INNER JOIN global_temp.tblCity B
ON A.city_id = B.city_id
# Disable "Adaptive Query Execution" and its automatic "Skew Join" handling,
# so that the manual "Salting" technique can be demonstrated in isolation
spark.conf.set("spark.sql.adaptive.enabled", False)
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", False)
Step 2 - Cross Join the Salt DataFrame “saltDf” with the Dimension DataFrame to
create the “Salted Dimension DataFrame”. Re-Partitioning can help mitigate the Spills
and evenly re-distribute the new Dimension Table across all Partitions.
In this case “Re-Partition” is used to reduce the Impact of “Cross Join”, because the
“Cross Join” would be an “Expensive Operation” if the “Skew Factor” is high.
Step 3 - For the Fact DataFrame, randomly assign a “Salt” to each record.
sc.setJobDescription("Create Salted Fact DataFrame")

# Join the two Salted DataFrames on "salted_city_id" and execute a "noop" Write
(dfSaltedTransaction
    .join(dfSaltedCity,
          dfSaltedTransaction["salted_city_id"] == dfSaltedCity["salted_city_id"])
    .write
    .format("noop")
    .mode("overwrite")
    .save())