From Query Plan to Query Performance: Supercharging Your Spark Queries Using the Spark UI SQL Tab
[Diagram: how a query becomes work on the cluster. A Dataframe “action” is sent through the Catalyst Optimizer, which produces a query stage (= the physical plan); the query stage is executed as one or more Spark jobs, each job is split into stages, and each stage runs as tasks.]
A simple example (3)

dfItemSales = (dfSalesSample
    .filter(f.col("item_id") >= 600000)
    .groupBy("item_id")
    .agg(f.sum(f.col("sales")).alias("itemSales")))

== Physical Plan ==
OverwriteByExpression org.apache.spark.sql.execution.datasources.noop.NoopTable$@dc93aa9, [AlwaysTrue()], org.apache.spark.sql.util.CaseInsensitiveStringMap@1f
+- *(2) HashAggregate(keys=[item_id#232L], functions=[finalmerge_sum(merge sum#1247L) AS sum(cast(sales#233 as bigint))#1210L], output=[item_id#232L, itemSales#1211L])
   +- Exchange hashpartitioning(item_id#232L, 8), true, [id=#1268]
      +- *(1) HashAggregate(keys=[item_id#232L], functions=[partial_sum(cast(sales#233 as bigint)) AS sum#1247L], output=[item_id#232L, sum#1247L])
         +- *(1) Filter (isnotnull(item_id#232L) AND (item_id#232L >= 600000))
            +- InMemoryTableScan [item_id#232L, sales#233], [isnotnull(item_id#232L), (item_id#232L >= 600000)]
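To reproduce a plan like this yourself, one option (a minimal sketch, assuming the same dfSalesSample DataFrame exists) is to print the plans directly and trigger execution against Spark 3.x's built-in no-op sink, which is what produces the OverwriteByExpression/NoopTable root shown above:

import pyspark.sql.functions as f

# Assumption: dfSalesSample already exists (e.g. a cached sales DataFrame).
dfItemSales = (dfSalesSample
    .filter(f.col("item_id") >= 600000)
    .groupBy("item_id")
    .agg(f.sum(f.col("sales")).alias("itemSales")))

# Print the parsed / analyzed / optimized / physical plans without running the query.
dfItemSales.explain(True)

# Trigger an action against the "noop" sink so the query runs and shows up
# in the SQL tab, but nothing is actually written anywhere.
dfItemSales.write.format("noop").mode("overwrite").save()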
An Overview of Common Components of the Physical Plan
The physical plan under the hood
What is the physical plan represented by in the Spark code?

The physical plan is a tree of SparkPlan nodes: the physical operators extend SparkPlan (152 subclasses). Query inputs are leaf nodes (LeafExecNode), transformations with a single child and the output node are UnaryExecNode, and transformations that combine two children are BinaryExecNode.
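As a rough way to poke at this tree yourself (a sketch: explain(mode=...) is public API in Spark 3.x, while the _jdf handle is internal and may change between versions), you can list the physical operators of the query above:

# Public API: "formatted" explain lists every physical operator of the plan.
dfItemSales.explain(mode="formatted")

# Internal sketch via the JVM handle: look at the SparkPlan object itself.
plan = dfItemSales._jdf.queryExecution().executedPlan()
print(plan.treeString())                # the tree the SQL tab visualizes
print(plan.getClass().getSimpleName())  # class name of the root operator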
(spark
    .read
    .format("csv")
    .option("header", True)
    .load("/databricks-datasets/airlines")
    .write
    .format("delta")
    .save("/tmp/airlines_delta"))

[Screenshot of the SQL tab for this statement: two queries (Q1, Q2) with numbered callouts (1-4) on the plan graph.]
Columnar Scan Delta Lake and Write to Delta Lake
High level

(spark
    .read
    .format("delta")
    .load("...path...")
    .write
    .format("delta")
    .save("/tmp/..._delta"))

[Screenshot of the SQL tab plan graph for this query: anything inside the WholeStageCodegen box supports codegen.]
Statistics on WSCG + ColumnarToRow

[Same query as above. Screenshot: the metrics reported on the WholeStageCodegen and ColumnarToRow nodes in the SQL tab (numbered callouts in the slide).]
Common Narrow Transformations
Filter / Project

(spark
    .read
    .format("delta")
    .load("...path...")
    .filter(col("item_id") < 1000)
    .withColumn("doubled_item_id", col("item_id") * 2)
    .write
    .format("delta")
    .save("/tmp/..._delta"))

filter → Filter
withColumn / select → Project
Common Narrow Transformations
Range / Sample / Union / Coalesce

df2 = spark.range(1000000)

(df1
    .sample(0.1)
    .union(df2)
    .coalesce(1)
    .write
    .format("delta")
    .save("/tmp/..._delta"))

sample → Sample
union → Union
coalesce → Coalesce
Special Case! Local Sorting
sortWithinPartitions

sortWithinPartitions / partitionBy → Sort (global=False)

[Diagram: each partition is sorted independently, e.g. Partition X (33) is unchanged and Partition Y (1, 34, 66, 4, 8) becomes (1, 4, 8, 34, 66); no rows move between partitions.]
Special Case! Global Sorting
orderBy

orderBy → Sort (global=True), preceded by an Exchange (range partitioning)

[Diagram: rows are first range-partitioned into new partitions and each new partition is then sorted, so the result is globally ordered, e.g. (4, 8, 33, 34, 66) across the new partitions.]
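To see the two cases side by side, here is a minimal sketch (assuming any DataFrame with an item_id column; one is built with spark.range here) comparing the plans:

# Minimal sketch: compare a local sort with a global sort.
df = spark.range(1_000_000).withColumnRenamed("id", "item_id")

# Local sort: the plan contains Sort (global=False) and no Exchange.
df.sortWithinPartitions("item_id").explain()

# Global sort: the plan contains Exchange rangepartitioning(...) followed by Sort (global=True).
df.orderBy("item_id").explain()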
Wide Transformations
What are wide transformations?

Every physical operator declares a requiredChildDistribution. Narrow operators require UnspecifiedDistribution and simply retain the child's outputPartitioning, while wide operators require a specific distribution: Sort (global=True), for example, requires OrderedDistribution (ASC/DESC). The "ensure the requirements" step compares each operator's requirement with its child's outputPartitioning and, where it is not satisfied, inserts a Shuffle Exchange; for the global Sort the resulting outputPartitioning is RangePartitioning.
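A small way to watch the "ensure the requirements" step in action (a sketch; the exact plan text varies with the Spark version and whether AQE is enabled):

from pyspark.sql import functions as F

df = spark.range(1_000_000).withColumn("sales", F.lit(1))

# The aggregate requires the data to be clustered by the grouping key, and the child
# (a plain Range scan) does not satisfy that, so EnsureRequirements inserts
# Exchange hashpartitioning(id, ...) into the plan.
df.groupBy("id").agg(F.sum("sales")).explain()

# If the child's outputPartitioning already satisfies the requirement (an explicit
# repartition by the same key), no additional Exchange is inserted before the aggregate.
df.repartition("id").groupBy("id").agg(F.sum("sales")).explain()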
Shuffle Exchange
What are the metrics in the Shuffle Exchange?
When is it used? Before any operation that requires the same keys to be on the same partitions, e.g. groupBy + aggregation, and for joins (sortMergeJoin).
Broadcast Exchange
The only metric reported for a broadcast is the number of output rows.
When is it used? Before any operation that requires copying the same data to all nodes. Usually: BroadcastHashJoin, BroadcastNestedLoopJoin.
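For example (a sketch with made-up tables), forcing a broadcast join makes the BroadcastExchange visible in the plan:

from pyspark.sql import functions as F

# Hypothetical data: a large "fact" side and a small "dimension" side.
dfSales = spark.range(1_000_000).withColumnRenamed("id", "item_id")
dfItemDim = spark.range(1_000).withColumnRenamed("id", "item_id")

# broadcast() ships the small side to every executor: the plan shows
# BroadcastExchange + BroadcastHashJoin instead of a shuffle on both sides.
dfSales.join(F.broadcast(dfItemDim), "item_id").explain()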
Zooming in on Aggregates
Aggregates

(df
    .groupBy("item_id")
    .agg(F.sum("sales")))

groupBy / agg → HashAggregate

[Diagram: the aggregate has a distribution requirement on the grouping key. Input (item_id, sales) rows such as (A, 10), (B, 5), (A, 3), (B, 1), (B, 1), (B, 2) are shuffled by the Exchange so that all rows with the same item_id land in the same new partition, and the final HashAggregate then emits one row per key, e.g. (A, 13).]
Aggregate implementations

HashAggregateExec (Dataframe API)
- Based on a HashTable structure
- Supports codegen
- When hitting memory limits, spills to disk and starts a new HashTable
- Merges all HashTables using the sort based aggregation method

ObjectHashAggregateExec (Dataset API)
- Same as HashAggregateExec, but for JVM objects
- Does not support codegen
- Immediately falls back to the sort based aggregation method when hitting memory limits

SortAggregateExec
- Sort based aggregation
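To compare implementations yourself (a sketch; which operator appears depends on your Spark version and on spark.sql.execution.useObjectHashAggregateExec), contrast a declarative aggregate such as sum with an object-backed one such as collect_list:

from pyspark.sql import functions as F

df = spark.range(1_000_000).withColumn("sales", F.lit(1))

# sum() is a declarative aggregate: the plan shows HashAggregate.
df.groupBy("id").agg(F.sum("sales")).explain()

# collect_list() is backed by a JVM object: with default settings the plan
# typically shows ObjectHashAggregate (falling back to SortAggregate under memory pressure).
df.groupBy("id").agg(F.collect_list("sales")).explain()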
Aggregates Metrics
Zooming in on Joins
Joins
Example of a “standard” join (sort merge join)

# Basic aggregation + join
dfJoin = dfSalesSample.join(dfItemDim, "item_id")

[Diagram: the physical plan for this join, annotated with each operator's requiredChildDistribution.]
Join Implementations & Requirements
Different joins have different complexities

BroadcastHashJoinExec
- Required child distribution: one side BroadcastDistribution, the other UnspecifiedDistribution
- Required child ordering: none
- Performs a local hash join between the broadcast side and the other side
- Complexity: O(n)

SortMergeJoinExec
- Required child distribution: both sides HashClusteredDistribution
- Required child ordering: both sides ordered (asc) by the join keys
- Compares the keys of the sorted data sets and merges them if they match
- Complexity: O(n log n)

BroadcastNestedLoopJoinExec
- Required child distribution: one side BroadcastDistribution, the other UnspecifiedDistribution
- Required child ordering: none
- For each row of the [left/right] data set, compares all rows of the [right/left] data set
- Complexity: O(n * m), for small m
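To compare these strategies on a toy example (a sketch using the Spark 3.x join strategy hints; the DataFrames are made up):

dfSales = spark.range(1_000_000).withColumnRenamed("id", "item_id")
dfItemDim = spark.range(1_000).withColumnRenamed("id", "item_id")

# Force a broadcast hash join.
dfSales.join(dfItemDim.hint("broadcast"), "item_id").explain()

# Force a sort merge join: note the Exchange + Sort inserted on both sides.
dfSales.join(dfItemDim.hint("merge"), "item_id").explain()

# A non-equi condition cannot use either hash- or sort-based equi-joins,
# so Spark falls back to BroadcastNestedLoopJoin / CartesianProduct.
dfSales.join(dfItemDim, dfSales["item_id"] > dfItemDim["item_id"]).explain()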
Join Strategy
How does Catalyst choose which join?

[Decision flowchart, simplified for the non-equi-join case: if one side is small enough to broadcast → BroadcastNestedLoopJoinExec; otherwise, if it is an inner join → CartesianProductExec; otherwise → BroadcastNestedLoopJoinExec.]
Ordering requirements
Example for SortMergeJoinExec

[Diagram: SortMergeJoin (left.id = right.id, Inner) declares requiredChildOrdering = [left.id, right.id] (ASC). The "ensure the requirements" step inserts a Sort on any child that does not already satisfy that ordering. The join's outputOrdering is [left.id, right.id] ASC here, but in general it depends on the join type.]
Revisiting our join
Example of a “standard” join (sort merge join)

# Basic aggregation + join
dfJoin = dfSalesSample.join(dfItemDim, "item_id")

equi-join? Yes. Broadcastable? No → sortMergeJoin
requiredChildOrdering → Sort inserted on both sides
Supercharge your Spark Queries
Scenario 1: Filter + Union anti-pattern
E.g. apply different logic based on a category the data belongs to.

final_df = functools.reduce(DataFrame.union,
    [
        logic_cat_0(df.filter(F.col("category") == 0)),
        logic_cat_1(df.filter(F.col("category") == 1)),
        logic_cat_2(df.filter(F.col("category") == 2)),
        logic_cat_3(df.filter(F.col("category") == 3))
    ]
)

def logic_cat_0(df: DataFrame) -> DataFrame:
    return df.withColumn("output", F.col("sales") * 2)
…

Repeated reads of the data!
Scenario 1: Filter + Union anti-pattern FIXED
Rewrite the code with CASE WHEN :)

final_df = (
    df
    .filter((F.col("category") >= 0) & (F.col("category") <= 3))
    .withColumn("output",
        F.when(F.col("category") == 0, logic_cat_0())
        .when(F.col("category") == 1, logic_cat_1())
        .when(F.col("category") == 2, logic_cat_2())
        .otherwise(logic_cat_3())
    )
)

def logic_cat_0() -> Column:
    return F.col("sales") * 2

Only one read of the data!
Scenario 2: Partial Aggregations
Partial aggregations do not help with high-cardinality grouping keys

# This doesn't help!
itemDF.groupBy("itemID").agg(sum(col("sales")).alias("sales"))

# Skip the partial aggregation instead:
spark.conf.set("spark.sql.aggregate.partialaggregate.skip.enabled", True)
itemDF.groupBy("itemID").agg(sum(col("sales")).alias("sales"))
Scenario 3: Join Strategy

ship_ports = dfPorts.alias("p").join(
    dfShips.alias("s"),
    (col("s.lat") >= col("p.min_lat")) &
    (col("s.lat") <= col("p.max_lat")) &
    (col("s.lon") >= col("p.min_lon")) &
    (col("s.lon") <= col("p.max_lon")))

Slow!
Scenario 3: Join Strategy FIXED
Use a geohash to convert to an equi-join

ship_ports = dfPorts.alias("p").join(
    dfShips.alias("s"),
    (col("s.lat") >= col("p.min_lat")) &
    (col("s.lat") <= col("p.max_lat")) &
    (col("s.lon") >= col("p.min_lon")) &
    (col("s.lon") <= col("p.max_lon")) &
    (substring(col("s.geohash"), 1, 2) == substring(col("p.geohash"), 1, 2)))

Fast!
In Summary
What we covered
- The SQL Tab provides insights into how the Spark query is executed.
- We can use the SQL Tab to reason about query execution time.
- We can answer important questions:
  - What part of my Spark query takes the most time?
  - Is my Spark query choosing the most efficient Spark operators for the task?