From Query Plan To Query Performance: Supercharging Your Spark Queries Using The Spark UI SQL Tab

Filter and Project are examples of narrow transformations. A narrow transformation operates on each row independently and does not require a shuffle. Filter selects or filters rows based on a predicate. Project selects a subset of columns. These are very common and help optimize queries by reducing data before wider transformations like joins that require data movement between executors.
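As a quick illustration (a minimal PySpark sketch; the paths and column names are hypothetical, not from the deck):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

dfSales = spark.read.format("delta").load("/tmp/sales_delta")   # hypothetical input
dfItems = spark.read.format("delta").load("/tmp/items_delta")   # hypothetical input

# Narrow transformations first: Filter and Project shrink each partition independently...
dfSmall = (dfSales
           .filter(f.col("item_id") >= 600000)    # -> Filter
           .select("item_id", "sales"))           # -> Project

# ...before the wide transformation (the join) forces data movement between executors.
dfJoined = dfSmall.join(dfItems, "item_id")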


From Query Plan to Query

Performance:
Supercharging your Spark Queries using the Spark UI
SQL Tab

Max Thone - Resident Solutions Architect


Stefan van Wouw - Sr. Resident Solutions Architect
Agenda

▪ Introduction to the Spark SQL Tab
▪ The Most Common Components of the Query Plan
▪ Supercharge your Spark queries


Introduction to Spark SQL Tab
Why should you know about the SQL Tab?

▪ Shows how the Spark query is executed


▪ Can be used to reason about query execution time.
What is a Query Plan?
▪ A Spark SQL/Dataframe/Dataset query goes through the Spark Catalyst Optimizer before
being executed by the JVM
▪ By “query plan” we mean the “Selected Physical Plan”: the output of Catalyst

Catalyst Optimizer

From the Databricks glossary (https://databricks.com/glossary/catalyst-optimizer)


Hierarchy: From Spark Dataframe to Spark task
One dataframe “action” can spawn multiple queries (physical plans), each query can spawn multiple Spark jobs, each job consists of stages, and each stage consists of tasks.

Dataframe “action”
  └─ Query (= physical plan)
       └─ Spark Job
            └─ Stage
                 └─ Tasks
A simple example (1)

Screenshot callouts: (1) dataframe “action”, (2) Query (physical plan), (3) Job, (4) Two Stages, (5) Nine tasks

# dfSalesSample is some cached dataframe
dfItemSales = (dfSalesSample
    .filter(f.col("item_id") >= 600000)
    .groupBy("item_id")
    .agg(f.sum(f.col("sales")).alias("itemSales")))

# Trigger the query
dfItemSales.write.format("noop").mode("overwrite").save()

A simple example (2)
# dfSalesSample is some cached dataframe

dfItemSales = (dfSalesSample
.filter(f.col("item_id") >= 600000)
.groupBy("item_id")
.agg(f.sum(f.col("sales")).alias("itemSales")))

# Trigger the query


dfItemSales.write.format("noop").mode("overwrite").save()

== Physical Plan ==
OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$@dc93aa9, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap@1f
+- *(2) HashAggregate(keys=[item_id#232L], functions=[finalmerge_sum(merge sum#1247L) AS sum(cast(sales#233 as bigint))#1210L], output=[item_id#232L, itemSales#1211L])
+- Exchange hashpartitioning(item_id#232L, 8), true, [id=#1268]
+- *(1) HashAggregate(keys=[item_id#232L], functions=[partial_sum(cast(sales#233 as bigint)) AS sum#1247L], output=[item_id#232L, sum#1247L])
+- *(1) Filter (isnotnull(item_id#232L) AND (item_id#232L >= 600000))
+- InMemoryTableScan [item_id#232L, sales#233], [isnotnull(item_id#232L), (item_id#232L >= 600000)]
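The same physical plan can also be printed without opening the UI; a small sketch (output format differs per Spark version, and mode="formatted" requires Spark 3.0+):

# Classic output, matching the "== Physical Plan ==" block above
dfItemSales.explain()

# Spark 3.0+: a formatted listing of the physical operators plus their details
dfItemSales.explain(mode="formatted")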
A simple example (3)

▪ What other operators can appear in the physical plan?

▪ How should we interpret the “details” in the SQL plan?
▪ How can we use this knowledge to optimise our query?

== Physical Plan ==
OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$@dc93aa9, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap@1f
+- *(2) HashAggregate(keys=[item_id#232L], functions=[finalmerge_sum(merge sum#1247L) AS sum(cast(sales#233 as bigint))#1210L], output=[item_id#232L, itemSales#1211L])
+- Exchange hashpartitioning(item_id#232L, 8), true, [id=#1268]
+- *(1) HashAggregate(keys=[item_id#232L], functions=[partial_sum(cast(sales#233 as bigint)) AS sum#1247L], output=[item_id#232L, sum#1247L])
+- *(1) Filter (isnotnull(item_id#232L) AND (item_id#232L >= 600000))
+- InMemoryTableScan [item_id#232L, sales#233], [isnotnull(item_id#232L), (item_id#232L >= 600000)]
An Overview of Common Components of the Physical Plan

The physical plan under the hood
What is the physical plan represented by in the Spark code?

▪ The physical plan is represented by the SparkPlan class

▪ SparkPlan is a recursive data structure:
  ▪ It represents a physical operator in the physical plan, AND the whole plan itself (1)

▪ SparkPlan is the base class, or “blueprint”, for these physical operators

▪ These physical operators are “chained” together

(1) From Jacek Laskowski’s Mastering Spark SQL (https://jaceklaskowski.gitbooks.io/mastering-spark-sql/content/spark-sql-SparkPlan.html#contract)
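For the curious, the SparkPlan tree behind a dataframe can be reached from PySpark through the JVM bridge; this is an internal, version-dependent API, shown here only to make the “recursive data structure” point concrete (a sketch, not a stable interface):

# Access the QueryExecution of the dataframe via py4j (internal API!)
qe = dfItemSales._jdf.queryExecution()

print(qe.executedPlan().toString())   # the whole SparkPlan tree (the physical plan)
print(qe.executedPlan().getClass())   # the class of the root physical operator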


Physical operators of SparkPlan
Extending SparkPlan (152 subclasses)

      Query Input          Query Input
     (LeafExecNode)       (LeafExecNode)
              \               /
            Binary Transformation
              (BinaryExecNode)
                     |
            Unary Transformation
              (UnaryExecNode)
                     |
                  Output
              (UnaryExecNode)

▪ LeafExecNode (27 subclasses)
  ▪ All file sources, cache reads, construction of dataframes from RDDs, the range generator, and reused exchanges & subqueries.

▪ BinaryExecNode (8 subclasses)
  ▪ Operations with two dataframes as input (joins, unions, etc.)

▪ UnaryExecNode (82 subclasses)
  ▪ Operations with one dataframe as input, e.g. sorts, aggregates, exchanges, filters, projects, limits.

▪ Other (32 traits/abstract/misc classes)
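A tiny query that touches all three operator categories can make this concrete (a sketch; the exact operators chosen may differ per Spark version and configuration):

import pyspark.sql.functions as f

left = spark.range(1000)                                     # Range   -> LeafExecNode
right = spark.range(500).withColumnRenamed("id", "item_id")  # Range   -> LeafExecNode

query = (left
         .filter(f.col("id") > 10)                 # Filter  -> UnaryExecNode
         .join(right, left.id == right.item_id)    # join    -> BinaryExecNode
         .select("id"))                            # Project -> UnaryExecNode

query.explain()   # inspect which physical operators were planned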
The Most Common Components of the Physical Plan

Parts we will cover:
▪ Common Narrow Transformations
▪ Distribution Requirements (Exchange)
▪ Common Wide Transformations
  ▪ Aggregates
  ▪ Joins
▪ Ordering Requirements (Sort)

Parts we will NOT cover:
▪ Adaptive Query Execution
▪ Streaming
▪ Datasource V2 specifics
▪ Command specifics (Hive metastore related)
▪ Dataset API specifics
▪ Caching / Reuse
▪ UDFs
Let’s start with the basics: Read/Write
Row-based Scan CSV and Write to Delta Lake
No dataframe transformations apart from read/write

(spark
    .read
    .format("csv")
    .option("header", True)
    .load("/databricks-datasets/airlines")
    .write
    .format("delta")
    .save("/tmp/airlines_delta"))

[Screenshot: the SQL tab shows two queries (Q1, Q2) for this write, with numbered callouts 1-4 on the plan]
Columnar Scan Delta Lake and Write to Delta Lake
High level

(spark
    .read
    .format("delta")
    .load("...path...")
    .write
    .format("delta")
    .save("/tmp/..._delta"))

[Screenshot callouts: Q1, Q2; “Anything in this box supports codegen”]

Parquet is columnar, while Spark is row-based
Columnar Scan Delta Lake and Write to Delta Lake
Statistics on the Columnar Parquet Scan (Q2)

(spark
    .read
    .format("delta")
    .load("...path...")
    .write
    .format("delta")
    .save("/tmp/..._delta"))
Columnar Scan Delta Lake and Write to Delta Lake
Statistics on WholeStageCodegen (WSCG) + ColumnarToRow (Q2)

(spark
    .read
    .format("delta")
    .load("...path...")
    .write
    .format("delta")
    .save("/tmp/..._delta"))
Common Narrow Transformations

Common Narrow Transformations
Filter / Project

(spark
    .read
    .format("delta")
    .load("...path...")
    .filter(col("item_id") < 1000)                        # filter -> Filter
    .withColumn("doubled_item_id", col("item_id") * 2)    # withColumn/select -> Project
    .write
    .format("delta")
    .save("/tmp/..._delta"))
Common Narrow Transformations
Range / Sample / Union / Coalesce

df1 = spark.range(1000000)        # spark.range -> Range
df2 = spark.range(1000000)

(df1
    .sample(0.1)                  # sample -> Sample
    .union(df2)                   # union -> Union
    .coalesce(1)                  # coalesce -> Coalesce
    .write
    .format("delta")
    .save("/tmp/..._delta"))
Special Case! Local Sorting
sortWithinPartitions

df.sortWithinPartitions("item_id")

sortWithinPartitions / partitionBy -> Sort (global=False)

Input (item_id)               Result of Sort       Global result (unsorted!)
Partition X:  33              33                   33
Partition Y:  34, 66, 4, 8    4, 8, 34, 66         4, 8, 34, 66
Special Case! Global Sorting
orderBy

df.orderBy("item_id")

orderBy -> Sort (global=True)

Input (item_id)               Result of Exchange (example)      Result of Sort / Global result (sorted!)
Partition X:  8, 33           New Partition X:  8, 4, 8         New Partition X:  4, 8, 8
Partition Y:  34, 66, 4, 8    New Partition Y:  66, 33, 34      New Partition Y:  33, 34, 66
Wide Transformations
What are wide transformations?

▪ Transformations for which re-distribution of the data is required

▪ e.g. joins, global sorting, and aggregations
▪ These requirements are captured as “distribution” requirements
Distribution requirements
Each node in the physical plan can specify how it expects data to be distributed over the Spark cluster.

Every SparkPlan operator (e.g. Filter) declares:
▪ requiredChildDistribution (default: UnspecifiedDistribution)
▪ outputPartitioning (default: UnknownPartitioning)

Required Distribution           Satisfied by (roughly) this partitioning of the child    Example operator
UnspecifiedDistribution         All                                                      Scan
AllTuples                       All with 1 partition only                                Flatmap in Pandas
OrderedDistribution             RangePartitioning                                        Sort (global)
(Hash)ClusteredDistribution     HashPartitioning                                         HashAggregate / SortMergeJoin
BroadcastDistribution           BroadcastPartitioning                                    BroadcastHashJoin
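As an illustration of how these requirements interact with an existing partitioning (a sketch; exact plans depend on the Spark version and AQE settings): if the child already provides a HashPartitioning on the grouping key, the aggregate's clustered-distribution requirement is satisfied and no extra Exchange is inserted.

import pyspark.sql.functions as f

df = spark.range(1000000).withColumn("sales", f.lit(1))

# Plan contains an Exchange hashpartitioning(id) before the final HashAggregate:
df.groupBy("id").agg(f.sum("sales")).explain()

# The explicit repartition already provides HashPartitioning on "id", so only that
# single Exchange (from the repartition itself) shows up in the plan:
df.repartition("id").groupBy("id").agg(f.sum("sales")).explain()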


Distribution requirements
Example for Local Sort (global=False)

Sort (global=False)
▪ requiredChildDistribution = UnspecifiedDistribution
▪ outputPartitioning = retain child’s

“Ensure the requirements”: nothing needs to be inserted; the plan remains Sort (global=False).
Distribution requirements
Example for Global Sort (global=True)

Sort (global=True)
▪ requiredChildDistribution = OrderedDistribution (ASC/DESC)
▪ outputPartitioning = RangePartitioning (instead of retaining the child’s)

“Ensure the requirements”: an Exchange (rangepartitioning) is inserted below the Sort (global=True).
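Both cases are easy to verify from the plan output (a small sketch):

df = spark.range(1000000)

# Local sort: only a Sort node with global=False, no Exchange required.
df.sortWithinPartitions("id").explain()

# Global sort: Sort (global=True) preceded by an Exchange rangepartitioning(id ASC, ...).
df.orderBy("id").explain()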
Shuffle Exchange
What are the metrics in the Shuffle Exchange?

▪ Size of serialised data read from the “local” executor

▪ Size of serialised data read from “remote” executors

▪ Size of shuffle bytes written

When is it used? Before any operation that requires the same keys to be on the same partitions, e.g. groupBy + aggregation, and joins (SortMergeJoin).
Broadcast Exchange
Only output rows are shown as a metric with broadcasts.

Metrics on the BroadcastExchange:
▪ time to build the broadcast table
▪ time to collect all the data
▪ size of broadcasted data (in memory)
▪ number of rows in the broadcasted data

When is it used? Before any operation in which copying the same data to all nodes is required. Usually: BroadcastHashJoin, BroadcastNestedLoopJoin.
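A broadcast can also be requested explicitly with the broadcast hint (a sketch reusing the dataframes from the join example later in the deck):

from pyspark.sql.functions import broadcast

# Hint that dfItemDim is small enough to broadcast -> BroadcastExchange + BroadcastHashJoin
dfJoin = dfSalesSample.join(broadcast(dfItemDim), "item_id")
dfJoin.explain()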
Zooming in on Aggregates

Aggregates

df.groupBy("item_id").agg(F.sum("sales"))

groupBy/agg -> HashAggregate (the Exchange in front of it comes from its distribution requirement)

Input (item_id, sales)            Result of Exchange                    Result of HashAggregate 2
Partition X:  (A, 10), (B, 5)     New Partition X: (A, 10), (A, 3)      (A, 13)
Partition Y:  (A, 3), (B, 1),     New Partition Y: (B, 5), (B, 1),      (B, 9)
              (B, 1), (B, 2)                       (B, 1), (B, 2)
Aggregate implementations

df.groupBy("item_id").agg(F.sum("sales"))

HashAggregateExec (Dataframe API)
- Based on a HashTable structure
- Supports codegen
- When hitting memory limits, spills to disk and starts a new HashTable
- Merges all HashTables using a sort-based aggregation method

ObjectHashAggregateExec (Dataset API)
- Same as HashAggregateExec, but for JVM objects
- Does not support codegen
- Immediately falls back to the sort-based aggregation method when hitting memory limits

SortAggregateExec
- Sort-based aggregation
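To see which implementation Catalyst picks, compare the plans of a declarative aggregate with an object-based one such as collect_list (a sketch; the chosen operator can vary with the Spark version and configuration):

import pyspark.sql.functions as F

df = (spark.range(1000)
      .withColumn("item_id", F.col("id") % 10)
      .withColumn("sales", F.lit(1)))

# sum -> typically planned as HashAggregate
df.groupBy("item_id").agg(F.sum("sales")).explain()

# collect_list -> typically planned as ObjectHashAggregate
df.groupBy("item_id").agg(F.collect_list("sales")).explain()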
Aggregates Metrics

▪ Some metrics (e.g. spill size) are only populated in case of fallback to sorting (too many distinct keys to keep in memory)
Partial Aggregation
Extra HashAggregate: pre-aggregates within each input partition before the Exchange.

Input (item_id, sales)            Result of HashAggregate 1     Result of Exchange                  Result of HashAggregate 2
Partition X:  (A, 10), (B, 5)     (A, 10), (B, 5)               New Partition X: (A, 10), (A, 3)    (A, 13)
Partition Y:  (A, 3), (B, 1),     (A, 3), (B, 4)                New Partition Y: (B, 5), (B, 4)     (B, 9)
              (B, 1), (B, 2)
Zooming in on Joins

Joins
A “standard join” example (sort merge join)

# Basic aggregation + join
dfJoin = dfSalesSample.join(dfItemDim, "item_id")

▪ What kinds of join algorithms exist?

▪ How does Spark choose which join algorithm to use?
▪ Where are the sorts and filters coming from?
  ▪ (We already know the Exchanges come from requiredChildDistribution)
Join Implementations & Requirements
Different joins have different complexities.

BroadcastHashJoinExec
- Required child distribution: one side BroadcastDistribution, other side UnspecifiedDistribution
- Required child ordering: none
- Performs a local hash join between the broadcast side and the other side.
- Complexity (ballpark): O(n)

SortMergeJoinExec
- Required child distribution: both sides HashClusteredDistribution
- Required child ordering: both sides ordered (asc) by the join keys
- Compares keys of the sorted data sets and merges on match.
- Complexity (ballpark): O(n log n)

BroadcastNestedLoopJoinExec
- Required child distribution: one side BroadcastDistribution, other side UnspecifiedDistribution
- Required child ordering: none
- For each row of the left/right data set, compares all rows of the right/left data set.
- Complexity (ballpark): O(n * m), small m

CartesianProductExec
- Required child distribution: none
- Required child ordering: none
- Cartesian product / “cross join” + filter.
- Complexity (ballpark): O(n * m), bigger m
Join Strategy
How does Catalyst choose which join?

▪ Equi-join?
  ▪ Yes: is one side small enough?
    ▪ Yes -> BroadcastHashJoinExec
    ▪ No  -> SortMergeJoinExec
  ▪ No: is one side small enough?
    ▪ Yes -> BroadcastNestedLoopJoinExec
    ▪ No: is it an inner join?
      ▪ Yes -> CartesianProductExec
      ▪ No  -> BroadcastNestedLoopJoinExec (danger zone: OOM)
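These choices are easy to observe with explain (a sketch; thresholds such as spark.sql.autoBroadcastJoinThreshold influence the outcome):

big = spark.range(10000000)
small = spark.range(100).withColumnRenamed("id", "item_id")
other = spark.range(10000000).withColumnRenamed("id", "other_id")

# Equi-join with one small side -> typically BroadcastHashJoin
big.join(small, big.id == small.item_id).explain()

# Equi-join between two large sides (broadcast disabled here) -> SortMergeJoin
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
big.join(other, big.id == other.other_id).explain()

# Non-equi join -> BroadcastNestedLoopJoin or CartesianProduct
big.join(small, big.id > small.item_id).explain()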
Ordering requirements
Example for SortMergeJoinExec

SortMergeJoin (left.id = right.id, Inner)
▪ requiredChildOrdering = [left.id] and [right.id], ascending
▪ outputOrdering = depends on the join type; here [left.id, right.id] ASC

“Ensure the requirements”: a Sort ([left.id], ASC) and a Sort ([right.id], ASC) are inserted below the SortMergeJoin.
Revisiting our join
A “standard join” example (sort merge join)

# Basic aggregation + join
dfJoin = dfSalesSample.join(dfItemDim, "item_id")

▪ Inner join -> adds an isNotNull filter on the join keys (a logical plan step, not a physical plan step)

▪ requiredChildDistribution -> Shuffle Exchange

▪ requiredChildOrdering -> Sort

▪ Equi-join? Yes. Broadcastable? No.  } SortMergeJoin
Supercharge your Spark Queries

Scenario 1: Filter + Union anti-pattern
E.g. apply different logic based on the category the data belongs to.

import functools
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

final_df = functools.reduce(DataFrame.union,
    [
        logic_cat_0(df.filter(F.col("category") == 0)),
        logic_cat_1(df.filter(F.col("category") == 1)),
        logic_cat_2(df.filter(F.col("category") == 2)),
        logic_cat_3(df.filter(F.col("category") == 3)),
    ]
)

def logic_cat_0(df: DataFrame) -> DataFrame:
    return df.withColumn("output", F.col("sales") * 2)

Repeated read of data!
Scenario 1: Filter + Union anti-pattern FIXED
Rewrite the code with CASE WHEN :)

from pyspark.sql import Column
import pyspark.sql.functions as F

final_df = (
    df
    .filter((F.col("category") >= 0) & (F.col("category") <= 3))
    .withColumn("output",
        F.when(F.col("category") == 0, logic_cat_0())
         .when(F.col("category") == 1, logic_cat_1())
         .when(F.col("category") == 2, logic_cat_2())
         .otherwise(logic_cat_3())
    )
)

def logic_cat_0() -> Column:
    return F.col("sales") * 2

One read!
Scenario 2: Partial Aggregations
Partial aggregations do not help with high-cardinality grouping keys

transaction_dim = 100000000  # 100 million transactions
item_dim = 90000000          # 90 million itemIDs

itemDF.groupBy("itemID").agg(sum(col("sales")).alias("sales"))   # <- This doesn’t help!

Query duration: 23 seconds
Scenario 2: Partial Aggregations FIXED
Partial aggregations do not help with high-cardinality grouping keys

transaction_dim = 100000000  # 100 million transactions
item_dim = 90000000          # 90 million itemIDs

spark.conf.set("spark.sql.aggregate.partialaggregate.skip.enabled", True)
itemDF.groupBy("itemID").agg(sum(col("sales")).alias("sales"))

Query duration: 18 seconds (22% reduction)

PR for enabling partial aggregation skipping
Scenario 3: Join Strategy
Compare coordinates to check if a ship is in a port.

ship_ports = dfPorts.alias("p").join(
    dfShips.alias("s"),
    (col("s.lat") >= col("p.min_lat")) &
    (col("s.lat") <= col("p.max_lat")) &
    (col("s.lon") >= col("p.min_lon")) &
    (col("s.lon") <= col("p.max_lon")))

Query duration: 3.5 minutes. Slow!
Scenario 3: Join Strategy FIXED
Use a geohash to convert the join to an equi-join.

ship_ports = dfPorts.alias("p").join(
    dfShips.alias("s"),
    (col("s.lat") >= col("p.min_lat")) &
    (col("s.lat") <= col("p.max_lat")) &
    (col("s.lon") >= col("p.min_lon")) &
    (col("s.lon") <= col("p.max_lon")) &
    (substring(col("s.geohash"), 1, 2) == substring(col("p.geohash"), 1, 2)))

Query duration: 6 seconds. Fast!
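The fixed query assumes both dataframes already carry a geohash column. One way to add it (a sketch using the third-party pygeohash package via a UDF; column names as above, precision picked arbitrarily, and border effects of prefix matching are ignored here):

import pygeohash                                   # pip install pygeohash
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType

@udf(returnType=StringType())
def geohash_udf(lat, lon):
    # Encode a (lat, lon) pair into a geohash string (precision 5 ~ a few km).
    return pygeohash.encode(lat, lon, precision=5)

dfShips = dfShips.withColumn("geohash", geohash_udf(col("lat"), col("lon")))
# For ports, hash a representative point, e.g. the centre of the bounding box.
dfPorts = dfPorts.withColumn(
    "geohash",
    geohash_udf((col("min_lat") + col("max_lat")) / 2,
                (col("min_lon") + col("max_lon")) / 2))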
In Summary
What we covered

▪ The SQL Tab provides insight into how a Spark query is executed.
▪ We can use the SQL Tab to reason about query execution time.
▪ We can answer important questions:
  ▪ What part of my Spark query takes the most time?
  ▪ Is my Spark query choosing the most efficient Spark operators for the task?

Want to practice / know more?

▪ Mentally visualise what the physical plan might look like for a Spark query, then check the SQL tab to see whether you were correct.
▪ Check out the source code of SparkPlan.
Feedback
Your feedback is important to us.
Don’t forget to rate and review the sessions.
