Spark SQL Optimization
pm jat @ daiict
Query Optimization in Spark-SQL?
Do you remember: 𝑅 ⋈_𝑐 𝑆 ≡ 𝜎_𝑐(𝑅 × 𝑆)?
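As a reminder of why this equivalence matters to an optimizer, here is a minimal sketch in plain Python (no Spark involved). The relations `R` and `S`, their column names, and the helper functions are hypothetical and only serve the illustration: both strategies return the same rows, but the hash-based one never materialises the full cross product.

```python
from itertools import product

# Hypothetical toy relations (not taken from the slides).
R = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
S = [{"rid": 1, "amount": 10}, {"rid": 1, "amount": 20}, {"rid": 3, "amount": 30}]


def c(r, s):
    """Join condition c: R.id = S.rid."""
    return r["id"] == s["rid"]


def select_over_product(left, right, cond):
    """sigma_c(R x S): enumerate every pair, then keep the matching ones."""
    return [{**r, **s} for r, s in product(left, right) if cond(r, s)]


def hash_join(left, right, left_key, right_key):
    """Equi-join: build a hash table on `left`, then probe it with `right`."""
    table = {}
    for r in left:
        table.setdefault(r[left_key], []).append(r)
    return [{**r, **s} for s in right for r in table.get(s[right_key], [])]


def canonical(rows):
    """Order-independent form of a result, so two results can be compared."""
    return sorted(tuple(sorted(row.items())) for row in rows)


via_product = select_over_product(R, S, c)
via_hash = hash_join(R, S, "id", "rid")
same_result = canonical(via_product) == canonical(via_hash)
```

The two plans are logically equivalent, so the optimizer is free to pick whichever one it expects to be cheaper.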
DAG of RDDs
• The rule says: the smaller of the two tables, R and L, is to be hashed!
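• If only the rule is used (without considering the size of intermediate results), the wrong side may be hashed, for example when a filter shrinks the larger table below the size of the smaller one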
• Estimating the size of an intermediate result requires some more information, called statistical information
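The difference between the rule alone and the rule combined with size estimates can be sketched as follows. This is a simplified illustration, not Spark's actual implementation; the function names, row counts, and the filter selectivity are assumptions made up for the example.

```python
def pick_build_side_rule_only(rows_l, rows_r):
    """Rule only: hash the side whose base table is smaller."""
    return "L" if rows_l <= rows_r else "R"


def pick_build_side_with_stats(rows_l, rows_r, sel_l=1.0, sel_r=1.0):
    """Rule applied to estimated sizes after filtering.

    sel_l / sel_r are the assumed filter selectivities (fraction of rows
    that survive the filter) on each input; 1.0 means no filter.
    """
    est_l = rows_l * sel_l
    est_r = rows_r * sel_r
    return "L" if est_l <= est_r else "R"


# Hypothetical inputs: L is large but heavily filtered, R is small and unfiltered.
rows_l, rows_r = 1_000_000, 50_000
sel_l = 0.001  # assumed: the filter on L keeps about 1,000 rows

rule_only = pick_build_side_rule_only(rows_l, rows_r)
with_stats = pick_build_side_with_stats(rows_l, rows_r, sel_l=sel_l)
```

With these made-up numbers the rule alone hashes R (50,000 rows), whereas the estimate-aware version hashes the filtered L (about 1,000 rows).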
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
12-Sep-23 Spark SQL – Optimization 21
Statistical Information in CBO [4]
• Uses the notions of “Filter Selectivity” and “Join Selectivity”
• Selectivities are often estimated from histograms, counts of distinct values, cardinalities of the operand relations, etc.
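A minimal sketch of how such estimates can be computed, using the common textbook formulas (uniform-distribution assumption) rather than Spark's exact code. All function names and the bucket layout are assumptions for illustration.

```python
def eq_filter_selectivity(ndv):
    """Selectivity of `col = constant`, assuming values are uniformly
    spread over `ndv` distinct values."""
    return 1.0 / ndv


def range_selectivity(buckets, v):
    """Selectivity of `col < v` from a histogram.

    buckets: list of (low, high, row_count); rows are assumed to be
    uniformly distributed inside each bucket.
    """
    total = sum(count for _, _, count in buckets)
    matched = 0.0
    for low, high, count in buckets:
        if v >= high:            # bucket lies entirely below v
            matched += count
        elif v > low:            # bucket is cut by v: take a proportional share
            matched += count * (v - low) / (high - low)
    return matched / total


def join_cardinality(card_r, card_s, ndv_r, ndv_s):
    """Estimated rows of an equi-join R.a = S.b:
    |R| * |S| / max(ndv(R.a), ndv(S.b))."""
    return card_r * card_s / max(ndv_r, ndv_s)


sel_eq = eq_filter_selectivity(4)
sel_range = range_selectivity([(0, 10, 100), (10, 20, 300)], 15)
est_join = join_cardinality(1000, 500, 100, 50)
```

The estimated output size of a filter is then the input cardinality multiplied by the filter selectivity, which is the quantity the optimizer needs for the intermediate results.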
https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
Cost-Based Optimization
(Example #1)
Here we see how the join order changes when join selectivity is taken into account
• df.explain prints the plan chosen for a DataFrame
• In Scala you can also call df.queryExecution.logical or df.queryExecution.optimizedPlan
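The effect of join selectivity on join order can be illustrated without Spark. The sketch below is a toy cost model, not Spark's optimizer: the tables `A`, `B`, `C`, their row counts, and the join selectivities are all hypothetical, and the cost of a plan is simply taken to be the total size of its intermediate results.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical cardinalities and join selectivities (exact fractions
# are used to avoid floating-point noise in the comparison).
card = {"A": 1_000_000, "B": 1_000, "C": 100}
join_sel = {
    frozenset("AB"): Fraction(1, 1_000),  # predicate between A and B
    frozenset("BC"): Fraction(1, 1_000),  # predicate between B and C
}


def estimated_rows(tables):
    """Estimated size of joining `tables`: product of the cardinalities
    times the selectivity of every join predicate that applies."""
    tables = set(tables)
    rows = Fraction(1)
    for t in tables:
        rows *= card[t]
    for pair, sel in join_sel.items():
        if pair <= tables:  # both sides of the predicate are present
            rows *= sel
    return rows


def plan_cost(order):
    """Cost of a left-deep plan: total size of its intermediate results
    (the final result has the same size for every order)."""
    return sum(estimated_rows(order[:k]) for k in range(2, len(order)))


best_order = min(permutations(card), key=plan_cost)
cost_a_first = plan_cost(("A", "B", "C"))
cost_bc_first = plan_cost(("B", "C", "A"))
```

With these assumed numbers, joining A and B first produces an intermediate result of 1,000,000 rows, while joining B and C first produces only 100 rows, so the cheaper plan starts with B ⋈ C.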