
Spark SQL – Optimization

pm jat @ daiict
Query Optimization in Spark-SQL?

• What do you understand by “Query Optimization”?

Query Execution and Optimization
• A query can be expressed in different ways, for example (here × denotes cross product and * denotes natural join):
(1) π fname, dname, salary ( σ salary >= 30000 AND employee.dno = department.dno (employee × department) )
(2) π fname, dname, salary ( σ salary >= 30000 (employee) * department )
(3) σ salary >= 30000 ( π fname, dname, salary (employee * department) )
• Which query is better, and why?
• Should a programmer have to take care of this? (See the sketch below.)
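
To make this concrete in Spark terms, here is a minimal sketch. It assumes an active SparkSession and two DataFrames, employee(fname, salary, dno) and department(dno, dname); all of these names are illustrative, not from the slides.

    // Form (1): cross product followed by a selection
    val q1 = employee.crossJoin(department)
      .filter(employee("salary") >= 30000 && employee("dno") === department("dno"))
      .select("fname", "dname", "salary")

    // Form (2): filter early, then join on dno
    val q2 = employee.filter(employee("salary") >= 30000)
      .join(department, "dno")
      .select("fname", "dname", "salary")

    // Both return the same rows; compare q1.explain(true) with q2.explain(true)
    // to see Catalyst rewrite the cross product + predicate as a join.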

Concept of Query Optimization?
• Let the user write a query that is mathematically correct (not necessarily the most efficient in terms of execution).
• A DBMS provides a run-time module called the “query optimizer”.
• The query optimizer reads the query and generates the most “efficient plan” for executing it.
• What do we mean by a “plan”?

Query Execution Steps
• Query Parsing
• Query Optimization
– Logical: the query is rewritten by reordering the relational operations
– Physical: a physical plan is created by identifying the actual “physical operations”, in algorithmic form
• Query Execution
– the generated physical plan is actually executed!

Logical Plan and Physical Plan in the “Relational World”
• Logical Plan:
– a relational algebra tree representing the user query
• Physical Plan:
– a sequence of lower-level (physical) operations on data files that executes the user query
– examples of lower-level operations: sequential scan of a file, index traversal, sequential scan of the leaf nodes of a “B+-tree index file”, sort-merge, hash-join, etc.
– our SQL expressions are ultimately executed in terms of these operations

How does Logical Optimization work?
• The following heuristics are easy to justify:
– If possible, selections should be executed early in the operation order
→ this reduces the size of the intermediate “operand relations” for the following operations
→ which reduces the overall cost of execution.
– By the same logic, early projection also leads to “faster execution” of the query.
– If a user has submitted a query in which a JOIN is expressed in terms of a CROSS PRODUCT and a selection, it should be rewritten using the JOIN.

Do you remember: R ⋈_c S ≡ σ_c (R × S) ?

How does Logical Optimization work?
• The said “preferred approach” can be defined in terms of rules.
• If the query optimizer finds that the input query does not comply with these rules, the query is re-expressed along those lines.
• That means the query is “rewritten” (this is called “query rewriting”).
• In other words, the evaluation tree parsed from the input query is transformed into a “better one” according to the said rules.
• So, how is it done? (a sketch follows this list)
– “operation reordering”
– pushing “predicates” down, pushing “projections” down
– combining or splitting operations
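
A minimal sketch of rewriting in action, using the same illustrative employee/department DataFrames as the earlier sketch (assumes an active SparkSession):

    // The filter is written after the join...
    val joined = employee.join(department, "dno")
      .filter(employee("salary") >= 30000)

    // ...but the optimized logical plan shows the Filter pushed below the Join:
    println(joined.queryExecution.optimizedPlan)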

Physical Optimization
• A physical plan is basically a sequence of physical operations that are actually performed to execute the query.
• A typical set of physical operations is: table scan, index scan, hashing, sort-merge, hash-join, and so forth.
• For a given logical plan we often have multiple options for producing a physical plan; the choice depends on the “data file organization” and on data statistics.
• The query optimizer uses the concept of “cost” to choose the optimal query plan.
• A cost function typically gives an estimate of the time taken by the query; the plan with minimum cost is chosen.

Cost-Based Optimization
• For example, the following is a simple estimate of the cost of a “single-loop join” (an index-based nested-loop join); the same idea underlies our broadcast join:

b_R + ( |R| × (h_BS + 1) ) + ( js × |R| × |S| ) / bfr_RS

where b_R is the number of blocks of R, |R| and |S| are the record counts of R and S, h_BS is the height of the B+-tree index on the join attribute of S, js is the join selectivity, and bfr_RS is the blocking factor of the result file.
• A similar cost formula exists for each other join approach; the optimizer chooses the plan that has minimum cost.
• For details, you can refer to any database textbook; this formula comes from Elmasri/Navathe. (A worked example with assumed numbers follows.)
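
As a quick worked illustration (all statistics below are assumed for illustration, not from the slides): let b_R = 2,000 blocks, |R| = 10,000 records, |S| = 50,000 records, h_BS = 2, js = 1/50,000, and bfr_RS = 10. Then:

    cost = 2,000 + 10,000 × (2 + 1) + ((1/50,000) × 10,000 × 50,000) / 10
         = 2,000 + 30,000 + 1,000
         = 33,000 block accesses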

Cost-Based Optimization (factors)
• File organization includes
– whether records are sorted, and if so, on what attribute
– whether indexes are available; if yes, on what attributes, and whether the index method is “B+-tree” based or “hash” based
• Metadata
– record size, block size, cardinality, selectivity (ratio of distinct values of an attribute), join selectivity

“explain” of SQL (RDBMS)
• SQL’s EXPLAIN statement shows you the final “optimized physical” plan of query execution! The snapshot here is from PostgreSQL.
[Screenshot: PostgreSQL EXPLAIN output; not reproduced here]

Spark SQL Optimizer Catalyst [1]
• All statements are represented as abstract syntax trees (ASTs).
• Lazy evaluation of the AST enables optimization of the expressed operations.
• The diagram here depicts the optimization pipeline:
[Diagram: Catalyst pipeline per [1]: Analysis → Logical Optimization → Physical Planning → Code Generation, producing a DAG of RDDs]

Catalyst - Analysis Phase
• SQL/DataFrame → AST
• Resolves attributes: checks that referenced columns and tables exist, resolves name ambiguities, etc.
• Performs “type coercion”
• Takes meta-information from the “Catalog” (see the small example below)
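
A tiny sketch of analysis-time resolution (illustrative; assumes an active SparkSession named spark):

    val df = spark.range(10).toDF("id")

    df.select("id")              // resolves against the schema: fine
    df.select("no_such_column")  // fails at analysis time with an AnalysisException,
                                 // since the attribute cannot be resolved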

An example of a “query optimization”

[Figure from the book Learning Spark [4]; not reproduced here]


Example

[Figure from the book Learning Spark [4]; not reproduced here]
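
Since the book’s figures are not reproduced here, a comparable sketch (paths, tables, and column names are illustrative assumptions, not the book’s exact example):

    val users  = spark.read.parquet("/data/users")   // hypothetical input
    val events = spark.read.parquet("/data/events")  // hypothetical input

    val q = users.join(events, users("uid") === events("uid"))
      .filter(events("date") > "2023-01-01")
      .select(users("name"), events("date"))

    // Prints the parsed, analyzed, and optimized logical plans plus the physical plan:
    q.explain(true)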


Logical Optimization in SparkSQL-Catalyst
• Logical operations: the Spark SQL operations expressed in SQL or via the DataFrame API
• The logical optimization phase applies standard rule-based optimizations to the logical plan. The article [1] reports the following rule-based techniques that can help produce a better query plan (a small sketch follows the list):
– Constant folding: evaluate constant subexpressions once, at planning time (constant propagation)
– Predicate pushdown: move predicates as early as possible
– Projection pruning: drop columns unnecessary for answering the query
– null propagation,
– Boolean expression simplification, and other rules.
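
A minimal sketch of one of these rules, constant folding (illustrative; assumes an active SparkSession named spark):

    import org.apache.spark.sql.functions.{col, lit}

    val df = spark.range(5).select(col("id"), (lit(1) + lit(2)).as("three"))

    // The optimized plan projects the literal 3 directly:
    // Catalyst folded 1 + 2 at planning time, not once per row.
    println(df.queryExecution.optimizedPlan)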

Cost-Based Optimization (Motivating Example #1)

• Here is an example: a simplified version of Q11 from the TPC-DS benchmark.
[Figure: the query and its candidate join orders; not reproduced here]
• Join order makes a difference, and the most optimal order cannot be determined unless we have some estimate of the sizes of intermediate results.
• This requires some additional information.

Cost-Based Optimizer in Apache Spark 2.2, Ron Hu & Sameer Agarwal, Spark Summit 2017: https://www.youtube.com/watch?v=qS_aS99TjCM
Physical Optimization in SparkSQL-Catalyst
• The physical plan is basically an RDD DAG.
• In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans.
• It then selects one plan using a cost model (cost-based optimization, CBO).
• An example of a cost comparison is choosing how to perform a given join by looking at the physical attributes of a given table (how big the table is, or how big its partitions are):
– say, which join approach to use: a “broadcast hash join” or a “shuffle sort-merge join” (see the sketch below)
• The physical planner also performs rule-based physical optimizations, such as “pipelining” projections or filters into one Spark map operation.
• In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown.
• The final phase of query optimization involves generating Java bytecode to run on each machine.
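
A small sketch of steering this choice. The broadcast hint and the threshold setting are real Spark APIs; the table names and paths are illustrative:

    import org.apache.spark.sql.functions.broadcast

    val small = spark.read.parquet("/data/dim_date")  // hypothetical small table
    val big   = spark.read.parquet("/data/sales")     // hypothetical large table

    // Hint that `small` fits in memory on every executor:
    big.join(broadcast(small), "date_key").explain()
    // Expect BroadcastHashJoin in the physical plan rather than SortMergeJoin.

    // Spark also broadcasts automatically below this size threshold (in bytes):
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)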
Cost-Based Optimization (Motivating Example #2)

• The rule says: the smaller of the two join inputs, R and L, is to be hashed (used as the build side).
• If only the rule is used (without considering the sizes of intermediate results), the choice can be wrong.
• Estimating the size of an intermediate result requires some more information, called statistical information.
[Figure: join trees compared; not reproduced here]

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
Statistical Information in CBO [4]

• Uses the notions of “filter selectivity” and “join selectivity”.
• Selectivities are often estimated from histograms of distinct values, the cardinalities of the operand relations, etc.
[Figure: selectivity estimation; not reproduced here]

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
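
A short sketch of supplying such statistics to Spark’s CBO. The config key and ANALYZE TABLE commands are real Spark SQL features; the table and column names are illustrative:

    spark.conf.set("spark.sql.cbo.enabled", "true")

    // Table-level statistics (row count, size in bytes):
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

    // Column-level statistics (distinct counts, min/max, etc.):
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, date_key")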
Cost-Based Optimization (Example #1)
• Here we see how changing the join order, guided by the estimated sizes of intermediate results (computed from filter selectivity and join selectivity), turns out to be faster!
[Figure: plans with different join orders compared; not reproduced here]

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
.explain(example)

[Figure from the book Learning Spark [4]: .explain output; not reproduced here]


.explain(example)

[Figure from the book Learning Spark [4]: .explain output, continued; not reproduced here]


SparkSQL-Catalyst Features [1]
• Supports both: “rule-based” and “cost-based” optimization.
• Catalyst is extensible.
• Its extensibility is said to serve the following two purposes
– Different types of optimization rules for the different problems associated with “big data” (e.g., semi-structured data and advanced analytics).
– Ability to add data source-specific rules that can push filtering or aggregation
into external storage systems.
• More features
– Schema inference,
– Query federation to external databases

SparkSQL-Catalyst “Extensibility”
• The article says: “In general, we have found it extremely simple to add rules for a wide variety of situations.”
– For example, aggregate operations on fixed-precision decimals are done by converting the values to 64-bit integers, aggregating, and finally converting the result back to decimals.
– Another rule executes simple SQL LIKE patterns through “String.startsWith” or “String.contains” calls instead of full pattern matching, which makes a measurable difference. (A toy sketch of such a rule follows.)
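
A minimal, self-contained sketch in the spirit of such a rule. The expression classes below are toy definitions made up for this slide, not Spark’s internal Catalyst API:

    sealed trait Expr
    case class Col(name: String) extends Expr
    case class Like(input: Expr, pattern: String) extends Expr
    case class StartsWith(input: Expr, prefix: String) extends Expr

    // Rewrite rule: `x LIKE 'abc%'` (one trailing wildcard, no other
    // wildcards) can run as a cheap string prefix test instead of a full match.
    def simplifyLike(e: Expr): Expr = e match {
      case Like(in, p)
          if p.endsWith("%") && !p.dropRight(1).exists(c => c == '%' || c == '_') =>
        StartsWith(in, p.dropRight(1))
      case other => other
    }

    // simplifyLike(Like(Col("name"), "abc%"))  ==>  StartsWith(Col("name"), "abc")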

SparkSQL Optimization - API Notes

• df.explain (df.explain(true) additionally prints the parsed, analyzed, and optimized logical plans)
• In Scala you can also call df.queryExecution.logical or df.queryExecution.optimizedPlan
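
A small usage sketch (illustrative; assumes an active SparkSession named spark):

    import org.apache.spark.sql.functions.col

    val demo = spark.range(100).filter(col("id") % 2 === 0)

    demo.explain(true)                          // parsed, analyzed, optimized + physical
    println(demo.queryExecution.logical)        // the raw logical plan
    println(demo.queryExecution.optimizedPlan)  // after Catalyst's logical rules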

References/Further Reading
[1] Armbrust, Michael, et al. "Spark SQL: Relational Data Processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[2] Ron Hu, Zhenhua Wang, Wenchen Fan, and Sameer Agarwal. "Cost Based Optimizer in Apache Spark 2.2." Databricks blog, August 2017. https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html. Video talk: https://www.youtube.com/watch?v=qS_aS99TjCM
[3] Baldacci, L., and Golfarelli, M. "A Cost Model for Spark SQL." IEEE Transactions on Knowledge and Data Engineering. 2019;31(5):819-832. doi:10.1109/TKDE.2018.2850339
[4] (Book) Damji, Jules S., et al. Learning Spark: Lightning-Fast Big Data Analytics. O'Reilly Media, 2020.
[5] Armbrust, M., et al. "Deep Dive into Spark SQL's Catalyst Optimizer." Databricks blog, April 2015. https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
