From Query Plan to Query Performance: Supercharging Your Spark Queries Using the Spark UI SQL Tab
[Diagram: how a query becomes work on the cluster. A Dataframe “action” is sent through the Catalyst Optimizer, which produces a query stage (= the physical plan); the query stage is executed as one or more Spark jobs, each job is split into stages, and each stage runs as tasks.]
A simple example (3)

dfItemSales = (dfSalesSample
    .filter(f.col("item_id") >= 600000)
    .groupBy("item_id")
    .agg(f.sum(f.col("sales")).alias("itemSales")))

== Physical Plan ==
OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$@dc93aa9, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap@1f
+- *(2) HashAggregate(keys=[item_id#232L], functions=[finalmerge_sum(merge sum#1247L) AS sum(cast(sales#233 as bigint))#1210L], output=[item_id#232L, itemSales#1211L])
   +- Exchange hashpartitioning(item_id#232L, 8), true, [id=#1268]
      +- *(1) HashAggregate(keys=[item_id#232L], functions=[partial_sum(cast(sales#233 as bigint)) AS sum#1247L], output=[item_id#232L, sum#1247L])
         +- *(1) Filter (isnotnull(item_id#232L) AND (item_id#232L >= 600000))
            +- InMemoryTableScan [item_id#232L, sales#233], [isnotnull(item_id#232L), (item_id#232L >= 600000)]
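To reproduce a plan like this yourself, one option (a minimal sketch, assuming the same dfSalesSample DataFrame exists) is to print the plans directly and trigger execution against Spark 3.x's built-in no-op sink, which is what produces the OverwriteByExpression/NoopTable root shown above:

import pyspark.sql.functions as f

# Assumption: dfSalesSample already exists (e.g. a cached sales DataFrame).
dfItemSales = (dfSalesSample
    .filter(f.col("item_id") >= 600000)
    .groupBy("item_id")
    .agg(f.sum(f.col("sales")).alias("itemSales")))

# Print the parsed / analyzed / optimized / physical plans without running the query.
dfItemSales.explain(True)

# Trigger an action against the "noop" sink so the query runs and shows up
# in the SQL tab, but nothing is actually written anywhere.
dfItemSales.write.format("noop").mode("overwrite").save()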
An Overview of Common Components of the Physical Plan
The physical plan under the hood
What is the physical plan represented by in the Spark code?

The physical plan is a tree of SparkPlan nodes: the physical operators extend SparkPlan (152 subclasses). Query inputs are leaf nodes (LeafExecNode), transformations with a single child and the output node are UnaryExecNode, and transformations that combine two children are BinaryExecNode.
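As a rough way to poke at this tree yourself (a sketch: explain(mode=...) is public API in Spark 3.x, while the _jdf handle is internal and may change between versions), you can list the physical operators of the query above:

# Public API: "formatted" explain lists every physical operator of the plan.
dfItemSales.explain(mode="formatted")

# Internal sketch via the JVM handle: look at the SparkPlan object itself.
plan = dfItemSales._jdf.queryExecution().executedPlan()
print(plan.treeString())                # the tree the SQL tab visualizes
print(plan.getClass().getSimpleName())  # class name of the root operator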
(spark
    .read
    .format("csv")
    .option("header", True)
    .load("/databricks-datasets/airlines")
    .write
    .format("delta")
    .save("/tmp/airlines_delta"))

[Screenshot of the SQL tab for this statement: two queries (Q1, Q2) with numbered callouts (1-4) on the plan graph.]
Columnar Scan Delta Lake and Write to Delta Lake
High level

(spark
    .read
    .format("delta")
    .load("...path...")
    .write
    .format("delta")
    .save("/tmp/..._delta"))

[Screenshot of the SQL tab plan graph for this query: anything inside the WholeStageCodegen box supports codegen.]
Statistics on WSCG + ColumnarToRow

[Same query as above. Screenshot: the metrics reported on the WholeStageCodegen and ColumnarToRow nodes in the SQL tab (numbered callouts in the slide).]
Common Narrow Transformations
Filter / Project

(spark
    .read
    .format("delta")
    .load("...path...")
    .filter(col("item_id") < 1000)
    .withColumn("doubled_item_id", col("item_id") * 2)
    .write
    .format("delta")
    .save("/tmp/..._delta"))

filter → Filter
withColumn / select → Project
Common Narrow Transformations
Range / Sample / Union / Coalesce

df2 = spark.range(1000000)

(df1
    .sample(0.1)
    .union(df2)
    .coalesce(1)
    .write
    .format("delta")
    .save("/tmp/..._delta"))

sample → Sample
union → Union
coalesce → Coalesce
Special Case! Local Sorting
sortWithinPartitions

sortWithinPartitions / partitionBy → Sort (global=False)

[Diagram: each partition is sorted independently, e.g. Partition X (33) is unchanged and Partition Y (1, 34, 66, 4, 8) becomes (1, 4, 8, 34, 66); no rows move between partitions.]
Special Case! Global Sorting
orderBy

orderBy → Sort (global=True), preceded by an Exchange (range partitioning)

[Diagram: rows are first range-partitioned into new partitions and each new partition is then sorted, so the result is globally ordered, e.g. (4, 8, 33, 34, 66) across the new partitions.]
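To see the two cases side by side, here is a minimal sketch (assuming any DataFrame with an item_id column; one is built with spark.range here) comparing the plans:

# Minimal sketch: compare a local sort with a global sort.
df = spark.range(1_000_000).withColumnRenamed("id", "item_id")

# Local sort: the plan contains Sort (global=False) and no Exchange.
df.sortWithinPartitions("item_id").explain()

# Global sort: the plan contains Exchange rangepartitioning(...) followed by Sort (global=True).
df.orderBy("item_id").explain()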
Wide Transformations
What are wide transformations?

Every physical operator declares a requiredChildDistribution. Narrow operators require UnspecifiedDistribution and simply retain the child's outputPartitioning, while wide operators require a specific distribution: Sort (global=True), for example, requires OrderedDistribution (ASC/DESC). The "ensure the requirements" step compares each operator's requirement with its child's outputPartitioning and, where it is not satisfied, inserts a Shuffle Exchange; for the global Sort the resulting outputPartitioning is RangePartitioning.
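A small way to watch the "ensure the requirements" step in action (a sketch; the exact plan text varies with the Spark version and whether AQE is enabled):

from pyspark.sql import functions as F

df = spark.range(1_000_000).withColumn("sales", F.lit(1))

# The aggregate requires the data to be clustered by the grouping key, and the child
# (a plain Range scan) does not satisfy that, so EnsureRequirements inserts
# Exchange hashpartitioning(id, ...) into the plan.
df.groupBy("id").agg(F.sum("sales")).explain()

# If the child's outputPartitioning already satisfies the requirement (an explicit
# repartition by the same key), no additional Exchange is inserted before the aggregate.
df.repartition("id").groupBy("id").agg(F.sum("sales")).explain()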
Shuffle Exchange
What are the metrics in the Shuffle Exchange?
When is it used? Before any operation that requires the same keys to be on the same partitions, e.g. groupBy + aggregation, and for joins (sortMergeJoin).
Broadcast Exchange
The only metric reported for a broadcast is the number of output rows.
When is it used? Before any operation that requires copying the same data to all nodes. Usually: BroadcastHashJoin, BroadcastNestedLoopJoin.
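For example (a sketch with made-up tables), forcing a broadcast join makes the BroadcastExchange visible in the plan:

from pyspark.sql import functions as F

# Hypothetical data: a large "fact" side and a small "dimension" side.
dfSales = spark.range(1_000_000).withColumnRenamed("id", "item_id")
dfItemDim = spark.range(1_000).withColumnRenamed("id", "item_id")

# broadcast() ships the small side to every executor: the plan shows
# BroadcastExchange + BroadcastHashJoin instead of a shuffle on both sides.
dfSales.join(F.broadcast(dfItemDim), "item_id").explain()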
Zooming in on Aggregates
Aggregates

(df
    .groupBy("item_id")
    .agg(F.sum("sales")))

groupBy / agg → HashAggregate

[Diagram: the aggregate has a distribution requirement on the grouping key. Input (item_id, sales) rows such as (A, 10), (B, 5), (A, 3), (B, 1), (B, 1), (B, 2) are shuffled by the Exchange so that all rows with the same item_id land in the same new partition, and the final HashAggregate then emits one row per key, e.g. (A, 13).]
Aggregate implementations

HashAggregateExec (Dataframe API)
- Based on a HashTable structure
- Supports codegen
- When hitting memory limits, spills to disk and starts a new HashTable
- Merges all HashTables using the sort based aggregation method

ObjectHashAggregateExec (Dataset API)
- Same as HashAggregateExec, but for JVM objects
- Does not support codegen
- Immediately falls back to the sort based aggregation method when hitting memory limits

SortAggregateExec
- Sort based aggregation
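To compare implementations yourself (a sketch; which operator appears depends on your Spark version and on spark.sql.execution.useObjectHashAggregateExec), contrast a declarative aggregate such as sum with an object-backed one such as collect_list:

from pyspark.sql import functions as F

df = spark.range(1_000_000).withColumn("sales", F.lit(1))

# sum() is a declarative aggregate: the plan shows HashAggregate.
df.groupBy("id").agg(F.sum("sales")).explain()

# collect_list() is backed by a JVM object: with default settings the plan
# typically shows ObjectHashAggregate (falling back to SortAggregate under memory pressure).
df.groupBy("id").agg(F.collect_list("sales")).explain()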
Aggregates Metrics
Zooming in on Joins
Joins
Example of a “standard” join (sort merge join)

# Basic aggregation + join
dfJoin = dfSalesSample.join(dfItemDim, "item_id")

[Diagram: the physical plan for this join, annotated with each operator's requiredChildDistribution.]
Join Implementations & Requirements
Different joins have different complexities

BroadcastHashJoinExec
- Required child distribution: one side BroadcastDistribution, the other UnspecifiedDistribution
- Required child ordering: none
- Performs a local hash join between the broadcast side and the other side
- Complexity: O(n)

SortMergeJoinExec
- Required child distribution: both sides HashClusteredDistribution
- Required child ordering: both sides ordered (asc) by the join keys
- Compares the keys of the sorted data sets and merges them if they match
- Complexity: O(n log n)

BroadcastNestedLoopJoinExec
- Required child distribution: one side BroadcastDistribution, the other UnspecifiedDistribution
- Required child ordering: none
- For each row of the [left/right] data set, compares all rows of the [right/left] data set
- Complexity: O(n * m), for small m
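To compare these strategies on a toy example (a sketch using the Spark 3.x join strategy hints; the DataFrames are made up):

dfSales = spark.range(1_000_000).withColumnRenamed("id", "item_id")
dfItemDim = spark.range(1_000).withColumnRenamed("id", "item_id")

# Force a broadcast hash join.
dfSales.join(dfItemDim.hint("broadcast"), "item_id").explain()

# Force a sort merge join: note the Exchange + Sort inserted on both sides.
dfSales.join(dfItemDim.hint("merge"), "item_id").explain()

# A non-equi condition cannot use either hash- or sort-based equi-joins,
# so Spark falls back to BroadcastNestedLoopJoin / CartesianProduct.
dfSales.join(dfItemDim, dfSales["item_id"] > dfItemDim["item_id"]).explain()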
Join Strategy
How does Catalyst choose which join?

[Decision flowchart, simplified for the non-equi-join case: if one side is small enough to broadcast → BroadcastNestedLoopJoinExec; otherwise, if it is an inner join → CartesianProductExec; otherwise → BroadcastNestedLoopJoinExec.]
Ordering requirements
Example for SortMergeJoinExec

[Diagram: SortMergeJoin (left.id = right.id, Inner) declares requiredChildOrdering = [left.id, right.id] (ASC). The "ensure the requirements" step inserts a Sort on any child that does not already satisfy that ordering. The join's outputOrdering is [left.id, right.id] ASC here, but in general it depends on the join type.]
Revisiting our join
Example of a “standard” join (sort merge join)

# Basic aggregation + join
dfJoin = dfSalesSample.join(dfItemDim, "item_id")

equi-join? Yes. Broadcastable? No → sortMergeJoin
requiredChildOrdering → Sort inserted on both sides
Supercharge your Spark Queries
Scenario 1: Filter + Union anti-pattern
E.g. apply different logic based on a category the data belongs to.

final_df = functools.reduce(DataFrame.union,
    [
        logic_cat_0(df.filter(F.col("category") == 0)),
        logic_cat_1(df.filter(F.col("category") == 1)),
        logic_cat_2(df.filter(F.col("category") == 2)),
        logic_cat_3(df.filter(F.col("category") == 3))
    ]
)

def logic_cat_0(df: DataFrame) -> DataFrame:
    return df.withColumn("output", F.col("sales") * 2)
…

Repeated reads of the data!
Scenario 1: Filter + Union anti-pattern FIXED
Rewrite the code with CASE WHEN :)

final_df = (
    df
    .filter((F.col("category") >= 0) & (F.col("category") <= 3))
    .withColumn("output",
        F.when(F.col("category") == 0, logic_cat_0())
        .when(F.col("category") == 1, logic_cat_1())
        .when(F.col("category") == 2, logic_cat_2())
        .otherwise(logic_cat_3())
    )
)

def logic_cat_0() -> Column:
    return F.col("sales") * 2

Only one read of the data!
Scenario 2: Partial Aggregations
Partial aggregations do not help with high-cardinality grouping keys

# This doesn't help!
itemDF.groupBy("itemID").agg(sum(col("sales")).alias("sales"))

# Skip the partial aggregation instead:
spark.conf.set("spark.sql.aggregate.partialaggregate.skip.enabled", True)
itemDF.groupBy("itemID").agg(sum(col("sales")).alias("sales"))
Scenario 3: Join Strategy

ship_ports = dfPorts.alias("p").join(
    dfShips.alias("s"),
    (col("s.lat") >= col("p.min_lat")) &
    (col("s.lat") <= col("p.max_lat")) &
    (col("s.lon") >= col("p.min_lon")) &
    (col("s.lon") <= col("p.max_lon")))

Slow!
Scenario 3: Join Strategy FIXED
Use a geohash to convert to an equi-join

ship_ports = dfPorts.alias("p").join(
    dfShips.alias("s"),
    (col("s.lat") >= col("p.min_lat")) &
    (col("s.lat") <= col("p.max_lat")) &
    (col("s.lon") >= col("p.min_lon")) &
    (col("s.lon") <= col("p.max_lon")) &
    (substring(col("s.geohash"), 1, 2) == substring(col("p.geohash"), 1, 2)))

Fast!
In Summary
What we covered
- The SQL Tab provides insights into how the Spark query is executed.
- We can use the SQL Tab to reason about query execution time.
- We can answer important questions:
  - What part of my Spark query takes the most time?
  - Is my Spark query choosing the most efficient Spark operators for the task?