
Spark SQL – Optimization

pm jat @ daiict
Query Optimization in Spark-SQL?

• What do you understand by “Query Optimization”?

Query Execution and Optimization
• A query can be expressed in different ways, for example (here × denotes cross product and * denotes natural join):
(1) π fname, dname, salary ( σ salary >= 30000 AND employee.dno = department.dno (employee × department) )
(2) π fname, dname, salary ( σ salary >= 30000 (employee) * department )
(3) σ salary >= 30000 ( π fname, dname, salary (employee * department) )
• Which query is better, and why?
• Should a programmer have to take care of this? (See the sketch below.)
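
To make this concrete in Spark terms, here is a minimal sketch. It assumes an active SparkSession and two DataFrames, employee(fname, salary, dno) and department(dno, dname); all of these names are illustrative, not from the slides.

    // Form (1): cross product followed by a selection
    val q1 = employee.crossJoin(department)
      .filter(employee("salary") >= 30000 && employee("dno") === department("dno"))
      .select("fname", "dname", "salary")

    // Form (2): filter early, then join on dno
    val q2 = employee.filter(employee("salary") >= 30000)
      .join(department, "dno")
      .select("fname", "dname", "salary")

    // Both return the same rows; compare q1.explain(true) with q2.explain(true)
    // to see Catalyst rewrite the cross product + predicate as a join.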

Concept of Query Optimization?
• Let the user write a query that is mathematically correct (not necessarily the most efficient in terms of execution).
• A DBMS provides a run-time module called the “query optimizer”.
• The query optimizer reads the query and generates the most “efficient plan” for executing it.
• What do we mean by a “plan”?

Query Execution Steps
• Query Parsing
• Query Optimization
– Logical: the query is rewritten by reordering the relational operations
– Physical: a physical plan is created by identifying the actual “physical operations”, in algorithmic form
• Query Execution
– the generated physical plan is actually executed!

Logical Plan and Physical Plan in the “Relational World”
• Logical Plan:
– a relational algebra tree representing the user query
• Physical Plan:
– a sequence of lower-level (physical) operations on data files that executes the user query
– examples of lower-level operations: sequential scan of a file, index traversal, sequential scan of the leaf nodes of a “B+-tree index file”, sort-merge, hash-join, etc.
– our SQL expressions are ultimately executed in terms of these operations

How does Logical Optimization work?
• The following heuristics are easy to justify:
– If possible, selections should be executed early in the operation order
→ this reduces the size of the intermediate “operand relations” for the following operations
→ which reduces the overall cost of execution.
– By the same logic, early projection also leads to “faster execution” of the query.
– If a user has submitted a query in which a JOIN is expressed in terms of a CROSS PRODUCT and a selection, it should be rewritten using the JOIN.

Do you remember: R ⋈_c S ≡ σ_c (R × S) ?

How does Logical Optimization work?
• The said “preferred approach” can be defined in terms of rules.
• If the query optimizer finds that the input query does not comply with these rules, the query is re-expressed along those lines.
• That means the query is “rewritten” (this is called “query rewriting”).
• In other words, the evaluation tree parsed from the input query is transformed into a “better one” according to the said rules.
• So, how is it done? (a sketch follows this list)
– “operation reordering”
– pushing “predicates” down, pushing “projections” down
– combining or splitting operations
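
A minimal sketch of rewriting in action, using the same illustrative employee/department DataFrames as the earlier sketch (assumes an active SparkSession):

    // The filter is written after the join...
    val joined = employee.join(department, "dno")
      .filter(employee("salary") >= 30000)

    // ...but the optimized logical plan shows the Filter pushed below the Join:
    println(joined.queryExecution.optimizedPlan)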

Physical Optimization
• A physical plan is basically a sequence of physical operations that are actually performed to execute the query.
• A typical set of physical operations is: table scan, index scan, hashing, sort-merge, hash-join, and so forth.
• For a given logical plan we often have multiple options for producing a physical plan; the choice depends on the “data file organization” and on data statistics.
• The query optimizer uses the concept of “cost” to choose the optimal query plan.
• A cost function typically gives an estimate of the time taken by the query; the plan with minimum cost is chosen.

Cost-Based Optimization
• For example, the following is a simple estimate of the cost of a “single-loop join” (an index-based nested-loop join); the same idea underlies our broadcast join:

b_R + ( |R| × (h_BS + 1) ) + ( js × |R| × |S| ) / bfr_RS

where b_R is the number of blocks of R, |R| and |S| are the record counts of R and S, h_BS is the height of the B+-tree index on the join attribute of S, js is the join selectivity, and bfr_RS is the blocking factor of the result file.
• A similar cost formula exists for each other join approach; the optimizer chooses the plan that has minimum cost.
• For details, you can refer to any database textbook; this formula comes from Elmasri/Navathe. (A worked example with assumed numbers follows.)
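
As a quick worked illustration (all statistics below are assumed for illustration, not from the slides): let b_R = 2,000 blocks, |R| = 10,000 records, |S| = 50,000 records, h_BS = 2, js = 1/50,000, and bfr_RS = 10. Then:

    cost = 2,000 + 10,000 × (2 + 1) + ((1/50,000) × 10,000 × 50,000) / 10
         = 2,000 + 30,000 + 1,000
         = 33,000 block accesses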

Cost-Based Optimization (factors)
• File organization includes
– whether records are sorted, and if so, on what attribute
– whether indexes are available; if yes, on what attributes, and whether the index method is “B+-tree” based or “hash” based
• Metadata
– record size, block size, cardinality, selectivity (ratio of distinct values of an attribute), join selectivity

“explain” of SQL (RDBMS)
• SQL’s EXPLAIN statement shows you the final “optimized physical” plan of query execution! The snapshot here is from PostgreSQL.
[Screenshot: PostgreSQL EXPLAIN output; not reproduced here]

Spark SQL Optimizer Catalyst [1]
• All statements are represented as abstract syntax trees (ASTs).
• Lazy evaluation of the AST enables optimization of the expressed operations.
• The diagram here depicts the optimization pipeline:
[Diagram: Catalyst pipeline per [1]: Analysis → Logical Optimization → Physical Planning → Code Generation, producing a DAG of RDDs]

Catalyst - Analysis Phase
• SQL/DataFrame → AST
• Resolves attributes: checks that referenced columns and tables exist, resolves name ambiguities, etc.
• Performs “type coercion”
• Takes meta-information from the “Catalog” (see the small example below)
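
A tiny sketch of analysis-time resolution (illustrative; assumes an active SparkSession named spark):

    val df = spark.range(10).toDF("id")

    df.select("id")              // resolves against the schema: fine
    df.select("no_such_column")  // fails at analysis time with an AnalysisException,
                                 // since the attribute cannot be resolved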

An example of a “query optimization”

[Figure from the book Learning Spark [4]; not reproduced here]


Example

[Figure from the book Learning Spark [4]; not reproduced here]
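
Since the book’s figures are not reproduced here, a comparable sketch (paths, tables, and column names are illustrative assumptions, not the book’s exact example):

    val users  = spark.read.parquet("/data/users")   // hypothetical input
    val events = spark.read.parquet("/data/events")  // hypothetical input

    val q = users.join(events, users("uid") === events("uid"))
      .filter(events("date") > "2023-01-01")
      .select(users("name"), events("date"))

    // Prints the parsed, analyzed, and optimized logical plans plus the physical plan:
    q.explain(true)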


Logical Optimization in SparkSQL-Catalyst
• Logical operations: the Spark SQL operations expressed in SQL or via the DataFrame API
• The logical optimization phase applies standard rule-based optimizations to the logical plan. The article [1] reports the following rule-based techniques that can help produce a better query plan (a small sketch follows the list):
– Constant folding: evaluate constant subexpressions once, at planning time (constant propagation)
– Predicate pushdown: move predicates as early as possible
– Projection pruning: drop columns unnecessary for answering the query
– null propagation,
– Boolean expression simplification, and other rules.
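
A minimal sketch of one of these rules, constant folding (illustrative; assumes an active SparkSession named spark):

    import org.apache.spark.sql.functions.{col, lit}

    val df = spark.range(5).select(col("id"), (lit(1) + lit(2)).as("three"))

    // The optimized plan projects the literal 3 directly:
    // Catalyst folded 1 + 2 at planning time, not once per row.
    println(df.queryExecution.optimizedPlan)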

Cost-Based Optimization (Motivating Example #1)

• Here is an example: a simplified version of Q11 from the TPC-DS benchmark.
[Figure: the query and its candidate join orders; not reproduced here]
• Join order makes a difference, and the most optimal order cannot be determined unless we have some estimate of the sizes of intermediate results.
• This requires some additional information.

Cost-Based Optimizer in Apache Spark 2.2, Ron Hu & Sameer Agarwal, Spark Summit 2017: https://www.youtube.com/watch?v=qS_aS99TjCM
Physical Optimization in SparkSQL-Catalyst
• The physical plan is basically an RDD DAG.
• In the physical planning phase, Spark SQL takes a logical plan and generates one or more physical plans.
• It then selects one plan using a cost model (cost-based optimization, CBO).
• An example of a cost comparison is choosing how to perform a given join by looking at the physical attributes of a given table (how big the table is, or how big its partitions are):
– say, which join approach to use: a “broadcast hash join” or a “shuffle sort-merge join” (see the sketch below)
• The physical planner also performs rule-based physical optimizations, such as “pipelining” projections or filters into one Spark map operation.
• In addition, it can push operations from the logical plan into data sources that support predicate or projection pushdown.
• The final phase of query optimization involves generating Java bytecode to run on each machine.
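
A small sketch of steering this choice. The broadcast hint and the threshold setting are real Spark APIs; the table names and paths are illustrative:

    import org.apache.spark.sql.functions.broadcast

    val small = spark.read.parquet("/data/dim_date")  // hypothetical small table
    val big   = spark.read.parquet("/data/sales")     // hypothetical large table

    // Hint that `small` fits in memory on every executor:
    big.join(broadcast(small), "date_key").explain()
    // Expect BroadcastHashJoin in the physical plan rather than SortMergeJoin.

    // Spark also broadcasts automatically below this size threshold (in bytes):
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)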
Cost-Based Optimization (Motivating Example #2)

• The rule says: the smaller of the two join inputs, R and L, is to be hashed (used as the build side).
• If only the rule is used (without considering the sizes of intermediate results), the choice can be wrong.
• Estimating the size of an intermediate result requires some more information, called statistical information.
[Figure: join trees compared; not reproduced here]

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
Statistical Information in CBO [4]

• Uses the notions of “filter selectivity” and “join selectivity”.
• Selectivities are often estimated from histograms of distinct values, the cardinalities of the operand relations, etc.
[Figure: selectivity estimation; not reproduced here]

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
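
A short sketch of supplying such statistics to Spark’s CBO. The config key and ANALYZE TABLE commands are real Spark SQL features; the table and column names are illustrative:

    spark.conf.set("spark.sql.cbo.enabled", "true")

    // Table-level statistics (row count, size in bytes):
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS")

    // Column-level statistics (distinct counts, min/max, etc.):
    spark.sql("ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS price, date_key")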
Cost-Based Optimization (Example #1)
• Here we see how changing the join order, guided by the estimated sizes of intermediate results (computed from filter selectivity and join selectivity), turns out to be faster!
[Figure: plans with different join orders compared; not reproduced here]

https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html
.explain(example)

[Figure from the book Learning Spark [4]: .explain output; not reproduced here]


.explain(example)

[Figure from the book Learning Spark [4]: .explain output, continued; not reproduced here]


SparkSQL-Catalyst Features [1]
• Supports both: “rule-based” and “cost-based” optimization.
• Catalyst is extensible.
• Its extensibility is said to serve the following two purposes
– Different types of optimization rules for the different problems associated with “big data” (e.g., semi-structured data and advanced analytics).
– Ability to add data source-specific rules that can push filtering or aggregation
into external storage systems.
• More features
– Schema inference,
– Query federation to external databases

SparkSQL-Catalyst “Extensibility”
• The article says: “In general, we have found it extremely simple to add rules for a wide variety of situations.”
– For example, aggregate operations on fixed-precision decimals are done by converting the values to 64-bit integers, aggregating, and finally converting the result back to decimals.
– Another rule executes simple SQL LIKE patterns through “String.startsWith” or “String.contains” calls instead of full pattern matching, which makes a measurable difference. (A toy sketch of such a rule follows.)
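
A minimal, self-contained sketch in the spirit of such a rule. The expression classes below are toy definitions made up for this slide, not Spark’s internal Catalyst API:

    sealed trait Expr
    case class Col(name: String) extends Expr
    case class Like(input: Expr, pattern: String) extends Expr
    case class StartsWith(input: Expr, prefix: String) extends Expr

    // Rewrite rule: `x LIKE 'abc%'` (one trailing wildcard, no other
    // wildcards) can run as a cheap string prefix test instead of a full match.
    def simplifyLike(e: Expr): Expr = e match {
      case Like(in, p)
          if p.endsWith("%") && !p.dropRight(1).exists(c => c == '%' || c == '_') =>
        StartsWith(in, p.dropRight(1))
      case other => other
    }

    // simplifyLike(Like(Col("name"), "abc%"))  ==>  StartsWith(Col("name"), "abc")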

SparkSQL Optimization - API Notes

• df.explain (df.explain(true) additionally prints the parsed, analyzed, and optimized logical plans)
• In Scala you can also call df.queryExecution.logical or df.queryExecution.optimizedPlan
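
A small usage sketch (illustrative; assumes an active SparkSession named spark):

    import org.apache.spark.sql.functions.col

    val demo = spark.range(100).filter(col("id") % 2 === 0)

    demo.explain(true)                          // parsed, analyzed, optimized + physical
    println(demo.queryExecution.logical)        // the raw logical plan
    println(demo.queryExecution.optimizedPlan)  // after Catalyst's logical rules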

References/Further Reading
[1] Armbrust, Michael, et al. "Spark SQL: Relational Data Processing in Spark." Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data. ACM, 2015.
[2] Ron Hu, Zhenhua Wang, Wenchen Fan, and Sameer Agarwal. "Cost Based Optimizer in Apache Spark 2.2." Databricks blog, August 2017. https://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html. Video talk: https://www.youtube.com/watch?v=qS_aS99TjCM
[3] Baldacci, L., and Golfarelli, M. "A Cost Model for Spark SQL." IEEE Transactions on Knowledge and Data Engineering. 2019;31(5):819-832. doi:10.1109/TKDE.2018.2850339
[4] (Book) Damji, Jules S., et al. Learning Spark: Lightning-Fast Big Data Analytics. O'Reilly Media, 2020.
[5] Armbrust, M., et al. "Deep Dive into Spark SQL's Catalyst Optimizer." Databricks blog, April 2015. https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
