Spark SQL: A Compiler from Queries to RDDs
Sameer Agarwal
Spark Summit | Boston | February 9th 2017
About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)
Background: What is an RDD?
• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]
• Opaque computation: the compute function is a black box to Spark
• Opaque data: Spark does not know the structure of the elements of type T
RDD Programming Model
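A minimal sketch of the RDD style (assuming a SparkContext named sc and a hypothetical text file of "id,value" lines): the user spells out how to compute the result, and Spark sees only opaque functions over opaque elements.

// Hypothetical input: a text file of "id,value" lines.
val pairs = sc.textFile("events.csv").map { line =>
  val Array(id, value) = line.split(",")
  (id.toInt, value.toInt)
}

// The user spells out *how* to compute the answer, step by step.
// Spark cannot look inside these closures, so it cannot, for example,
// pre-compute 1 + 2: the computation and the data are opaque.
val total = pairs
  .filter { case (id, _) => id > 50 * 1000 }
  .map { case (_, value) => 1 + 2 + value }
  .reduce(_ + _)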
SQL/Structured Programming Model
• High-level APIs (SQL, DataFrame/Dataset): programs describe what data operations are needed without specifying how to execute those operations
• More efficient: an optimizer can automatically find the most efficient plan to execute a query (a DataFrame sketch follows below)
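A minimal DataFrame sketch of the same computation (assuming a SparkSession named spark and a registered table t1 with integer columns id and value); here the expression 1 + 2 is visible to the optimizer and can be folded before execution.

import org.apache.spark.sql.functions._
import spark.implicits._

// Declarative: describe *what* is needed; Catalyst decides how to run it.
val total = spark.table("t1")
  .where($"id" > 50 * 1000)
  .select((lit(1) + lit(2) + $"value").as("v"))
  .agg(sum($"v"))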
Spark SQL Overview
[Architecture diagram: SQL AST, DataFrame, and Dataset programs become a Query Plan; Catalyst's Transformations produce an Optimized Query Plan; Tungsten turns it into RDDs]
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
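The same query written against the DataFrame API goes through the same compilation pipeline; a sketch, assuming t1 and t2 are registered tables with columns id and value:

import org.apache.spark.sql.functions._

val t1 = spark.table("t1")
val t2 = spark.table("t2")

val tmp = t1.join(t2, t1("id") === t2("id"))
  .where(t2("id") > 50 * 1000)
  .select(t1("id"), (lit(1) + lit(2) + t1("value")).as("v"))

val result = tmp.agg(sum("v"))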
Trees: Abstractions of Users' Programs
Expression
• An expression represents a new value, computed based on input values
• e.g. 1 + 2 + t1.value (from the query above)
Trees: Abstractions of Users' Programs
Query Plan

Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Join
        Scan (t1)
        Scan (t2)
Logical Plan
• A Logical Plan describes computation on datasets without defining how to conduct the computation

Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Join
        Scan (t1)
        Scan (t2)
Physical Plan
• A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation

Hash-Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Sort-Merge Join
        Scan (t1)
        Scan (t2)
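In Spark you can print these trees directly; a small sketch (assuming the SQL above is held in a string named query and the tables are registered): explain(true) shows the parsed, analyzed, and optimized logical plans followed by the physical plan.

val df = spark.sql(query)
df.explain(true)
// == Parsed Logical Plan ==
// == Analyzed Logical Plan ==
// == Optimized Logical Plan ==
// == Physical Plan ==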
[Architecture diagram, revisited: SQL AST, DataFrame, and Dataset (Java/Scala) programs are abstracted as trees; Catalyst's Transformations turn the Query Plan into an Optimized Query Plan and finally into RDDs]
Transform
• A function associated with every tree used to implement a single rule

1 + 2 + t1.value, evaluating 1 + 2 for every row:
Add(Add(Literal(1), Literal(2)), Attribute(t1.value))

Evaluate 1 + 2 once, giving 3 + t1.value:
Add(Literal(3), Attribute(t1.value))
Transform
• A transform is defined as a Partial Function
• Partial Function: a function that is defined for a subset of its possible arguments

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case clause determines whether the partial function is defined for a given input.
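Catalyst's Expression classes live inside Spark, but the mechanism is easy to model; a self-contained toy sketch (the names Expr, Lit, Attr, and this simplified bottom-up transform are made up for illustration, not the real Catalyst API):

// Toy expression tree, not the real Catalyst classes.
sealed trait Expr {
  def children: Seq[Expr]
  def withChildren(newChildren: Seq[Expr]): Expr

  // Rewrite bottom-up: transform the children first, then try the rule on the result
  // (Catalyst offers both top-down and bottom-up variants).
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten = withChildren(children.map(_.transform(rule)))
    if (rule.isDefinedAt(rewritten)) rule(rewritten) else rewritten
  }
}

case class Lit(value: Int) extends Expr {
  def children = Nil
  def withChildren(c: Seq[Expr]) = this
}
case class Attr(name: String) extends Expr {
  def children = Nil
  def withChildren(c: Seq[Expr]) = this
}
case class Add(left: Expr, right: Expr) extends Expr {
  def children = Seq(left, right)
  def withChildren(c: Seq[Expr]) = Add(c(0), c(1))
}

// 1 + 2 + t1.value
val expr = Add(Add(Lit(1), Lit(2)), Attr("t1.value"))

// Constant folding as a partial function: only defined for Add of two literals.
val folded = expr.transform { case Add(Lit(x), Lit(y)) => Lit(x + y) }
// folded == Add(Lit(3), Attr("t1.value")), i.e. 3 + t1.value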
Transform

Applying the rule to 1 + 2 + t1.value:
Add(Add(Literal(1), Literal(2)), Attribute(t1.value))
rewrites to 3 + t1.value:
Add(Literal(3), Attribute(t1.value))
Combining Multiple Rules
Predicate Pushdown + Constant Folding

Original plan:
Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Join
        Scan (t1)
        Scan (t2)

After predicate pushdown, the filter t2.id>50*1000 moves below the Join and is applied to the scan of t2, while the join condition t1.id=t2.id moves into the Join; after constant folding, the projection becomes t1.id, 3+t1.value as v.
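Individual rules compose: an optimizer keeps applying a batch of them until the plan stops changing. A toy sketch of that fixed-point loop, reusing the Expr model from the earlier sketch (the fixedPoint helper and the "x + 0" rule are made up for illustration):

// Apply a batch of rules repeatedly until the tree reaches a fixed point,
// mirroring how Catalyst runs batches of rules over a plan.
def fixedPoint(plan: Expr,
               rules: Seq[PartialFunction[Expr, Expr]],
               maxIterations: Int = 100): Expr = {
  var current = plan
  var changed = true
  var i = 0
  while (changed && i < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => p.transform(rule))
    changed = next != current
    current = next
    i += 1
  }
  current
}

// Two example rules: constant folding, and dropping a useless "+ 0".
val rules = Seq[PartialFunction[Expr, Expr]](
  { case Add(Lit(x), Lit(y)) => Lit(x + y) },
  { case Add(e, Lit(0))      => e }
)

val optimized = fixedPoint(Add(Attr("t1.value"), Add(Lit(1), Lit(-1))), rules)
// optimized == Attr("t1.value")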
[Architecture diagram, revisited: Tungsten takes the Optimized Query Plan produced by Catalyst and turns it into RDDs]
Volcano Iterator Model
(G. Graefe, Volcano - An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering, 1994)
• Standard for 30 years: almost all databases do it
• Each operator is an "iterator" that consumes records from its input operator

class Filter(
    child: Operator,
    predicate: (Row => Boolean))
  extends Operator {

  def next(): Row = {
    var current = child.next()
    // Keep pulling rows from the child until one satisfies the predicate
    // (or the child is exhausted and returns null).
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}
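To make the snippet above self-contained, a toy sketch of the missing pieces (Row, Operator, and a Scan over an in-memory collection are simplified stand-ins, not Spark's actual classes); note that every single row is pulled through a chain of virtual next() calls:

type Row = Array[Any]          // simplified stand-in for an internal row

trait Operator {
  def next(): Row              // returns null when the input is exhausted
}

// Leaf operator: hands out rows from an in-memory collection.
class Scan(rows: Seq[Row]) extends Operator {
  private val it = rows.iterator
  def next(): Row = if (it.hasNext) it.next() else null
}

// Scan -> Filter: one virtual next() call per operator for every row produced.
val data = (1 to 100).map(i => Array[Any](i): Row)
val filter = new Filter(new Scan(data), row => row(0).asInstanceOf[Int] > 90)

var row = filter.next()
while (row != null) {
  println(row(0))              // prints 91 .. 100
  row = filter.next()
}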
Downside of the Volcano Model
1. Too many virtual function calls
   o at least 3 calls for each row in Aggregate

Operator chain: Scan -> Filter -> Project -> Aggregate

Equivalent hand-written code:
long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
Whole-stage Codegen
• Fusing operators together so the generated code looks like hand-optimized code:
  - Identify chains of operators ("stages")
  - Compile each stage into a single function
  - Functionality of a general-purpose execution engine; performance as if the system had been hand-built just to run your query

(T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware. In VLDB 2011)
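To see which operators were fused, a sketch (assuming the earlier query string and registered tables t1/t2): in Spark 2.x, explain() marks operators that were collapsed into a single generated function with a leading '*', and the debug helpers print the generated Java source.

val df = spark.sql(query)

df.explain()
// Operators prefixed with '*' in the physical plan were fused by whole-stage codegen.

// Dump the generated Java source (Spark 2.x debug helper).
import org.apache.spark.sql.execution.debug._
df.debugCodegen()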
Putting it All Together
Operator Benchmarks: Cost/Row (ns)
[Benchmark charts; callouts from the slides:]
• 5-30x speedups
• Radix sort: 10-100x speedups
• Shuffling still the bottleneck
• 10x speedup
TPC-DS (Scale Factor 1500, 100 cores)
[Benchmark chart: runtime per TPC-DS query; lower is better]
What’s Next?
Spark 2.2 and beyond
1. SPARK-16026: Cost-Based Optimizer
   - Leverage table/column-level statistics to optimize joins and aggregates
   - Statistics Collection Framework (Spark 2.1) - see the sketch after this list
   - Cost-Based Optimizer (Spark 2.2)
2. Boosting Spark's Performance on Many-Core Machines
   - In-memory / single-node shuffle
3. Improving the quality of generated code and better integration with the in-memory columnar format in Spark
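The statistics collection framework referenced in item 1 is driven through ANALYZE TABLE; a sketch (assuming a registered table t1 with columns id and value):

// Table-level statistics: row count and size in bytes.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

// Column-level statistics (Spark 2.1+), used by the cost-based optimizer.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value")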
Thank you.