Spark SQL uses an optimizer framework called Catalyst to compile SQL queries into efficient execution plans over RDDs. Catalyst represents a query as a tree (a logical query plan), applies transformations that match patterns and replace subtrees to implement optimizations such as constant folding and predicate pushdown, and then generates a physical query plan that executes as RDDs.

Spark SQL: A Compiler from Queries to RDDs

Sameer Agarwal
Spark Summit | Boston | February 9th 2017
About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)
Background: What is an RDD?
• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]

To Spark, the compute function is opaque computation over opaque data: the engine cannot look inside either the user's function or the records it produces, so it has no way to optimize them.
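For concreteness, a minimal sketch of the abstraction those three bullets describe, with simplified signatures (the real org.apache.spark.rdd.RDD carries much more machinery):

trait Partition { def index: Int }

abstract class RDD[T](val dependencies: Seq[RDD[_]]) {
  // How the dataset is split across the cluster.
  def partitions: Array[Partition]
  // Opaque user computation: given one partition, produce its records.
  def compute(split: Partition): Iterator[T]
}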
RDD Programming Model

Construct the execution DAG using low-level RDD operators.
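For example, a word-count-style job wires the DAG by hand (a sketch; the input path is hypothetical and a SparkContext named sc is assumed):

// Each operator adds a node to the execution DAG; Spark only sees
// opaque functions, so it cannot rearrange or fuse these steps.
val counts = sc.textFile("hdfs://host/path/input")  // RDD[String]
  .flatMap(line => line.split(" "))                 // RDD[String]
  .map(word => (word, 1))                           // RDD[(String, Int)]
  .reduceByKey(_ + _)                               // RDD[(String, Int)]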
SQL/Structured Programming Model
• High-level APIs (SQL, DataFrame/Dataset): programs describe what data operations are needed without specifying how to execute them
• More efficient: an optimizer can automatically find the most efficient plan to execute a query
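The same intent expressed declaratively, leaving the "how" to the optimizer (a sketch; assumes a SparkSession named spark with tables t1 and t2 registered):

// Declares *what* to compute; Catalyst decides join strategy,
// filter placement, and expression evaluation order.
val result = spark.sql("""
  SELECT sum(v)
  FROM (
    SELECT t1.id, 1 + 2 + t1.value AS v
    FROM t1 JOIN t2 ON t1.id = t2.id
    WHERE t2.id > 50 * 1000) tmp
""")
result.show()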
Spark SQL Overview

[Figure: the Spark SQL pipeline. SQL AST, DataFrame, and Dataset programs (abstractions of users' programs, represented as trees) become a Query Plan; Catalyst transformations rewrite it into an Optimized Query Plan, and Tungsten turns that plan into RDDs.]
How Catalyst Works: An Overview

[Figure: the Catalyst half of the pipeline. SQL AST, DataFrame, and Dataset programs are represented as trees; Catalyst transformations turn the Query Plan into an Optimized Query Plan, which is then mapped to RDDs.]
Trees: Abstractions of Users' Programs

SELECT sum(v)
FROM (
  SELECT
    t1.id,
    1 + 2 + t1.value AS v
  FROM t1 JOIN t2
  WHERE
    t1.id = t2.id AND
    t2.id > 50 * 1000) tmp
Trees: Abstractions of Users' Programs

Expression
• An expression represents a new value, computed based on input values
• e.g. 1 + 2 + t1.value in the query above
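As a tree, that expression might be modeled with case classes like these (a simplified sketch; real Catalyst expression nodes live in org.apache.spark.sql.catalyst.expressions and also carry data types):

// 1 + 2 + t1.value, built bottom-up as an expression tree.
sealed trait Expression
case class Literal(value: Int) extends Expression
case class Attribute(name: String) extends Expression
case class Add(left: Expression, right: Expression) extends Expression

val expr = Add(Add(Literal(1), Literal(2)), Attribute("t1.value"))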
Trees: Abstractions of Users' Programs

Query Plan (for the query above):

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)
Logical Plan
• A logical plan describes computation on datasets without defining how to conduct the computation

Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)
Physical Plan
• A physical plan describes computation on datasets with specific definitions of how to conduct the computation

HashAggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      SortMergeJoin
        Parquet Scan (t1)
        JSON Scan (t2)
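You can inspect both kinds of plan for any query (a sketch reusing the result DataFrame from the earlier example):

// extended = true prints the parsed, analyzed, and optimized logical
// plans, followed by the physical plan that will actually run.
result.explain(true)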
Transform
• A function associated with every tree, used to implement a single rule

Example: evaluate 1 + 2 once, instead of once for every row:

  1 + 2 + t1.value  =>  3 + t1.value

  Add(Add(Literal(1), Literal(2)), Attribute(t1.value))  =>  Add(Literal(3), Attribute(t1.value))
Transform
• A transform is defined as a partial function
• Partial function: a function that is defined for only a subset of its possible arguments

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case clause determines whether the partial function is defined for a given input.
Transform

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

Walking 1 + 2 + t1.value, the pattern does not match the outer Add (its right child is an Attribute) or the leaves, but it does match the inner Add(Literal(1), Literal(2)) subtree, which is replaced by Literal(3):

  Add(Add(Literal(1), Literal(2)), Attribute(t1.value))  =>  Add(Literal(3), Attribute(t1.value))

i.e. 1 + 2 + t1.value becomes 3 + t1.value.
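Putting the pieces together, a self-contained, runnable version of this mechanism on the simplified case classes from earlier (a sketch; real Catalyst threads data types such as IntegerType through Literal and generalizes transform over arbitrary child nodes):

sealed trait Expression {
  // Rewrite children first, then apply the rule here if it is defined.
  def transform(rule: PartialFunction[Expression, Expression]): Expression = {
    val rewritten = this match {
      case Add(l, r) => Add(l.transform(rule), r.transform(rule))
      case leaf      => leaf
    }
    rule.applyOrElse(rewritten, identity[Expression])
  }
}
case class Literal(value: Int) extends Expression
case class Attribute(name: String) extends Expression
case class Add(left: Expression, right: Expression) extends Expression

val expression = Add(Add(Literal(1), Literal(2)), Attribute("t1.value"))
val folded = expression.transform {
  case Add(Literal(x), Literal(y)) => Literal(x + y)
}
// folded == Add(Literal(3), Attribute("t1.value")), i.e. 3 + t1.value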
Combining Multiple Rules
Predicate Pushdown

Before:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)

After:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50 * 1000]
        Scan (t2)
Combining Multiple Rules
Constant Folding

Before:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50 * 1000]
        Scan (t2)

After:
Aggregate [sum(v)]
  Project [t1.id, 3 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50000]
        Scan (t2)
Combining Multiple Rules
Column Pruning

Before:
Aggregate [sum(v)]
  Project [t1.id, 3 + t1.value AS v]
    Join [t1.id = t2.id]
      Scan (t1)
      Filter [t2.id > 50000]
        Scan (t2)

After:
Aggregate [sum(v)]
  Project [t1.id, 3 + t1.value AS v]
    Join [t1.id = t2.id]
      Project [t1.id, t1.value]
        Scan (t1)
      Filter [t2.id > 50000]
        Project [t2.id]
          Scan (t2)
Combining Multiple Rules

Before transformations:
Aggregate [sum(v)]
  Project [t1.id, 1 + 2 + t1.value AS v]
    Filter [t1.id = t2.id AND t2.id > 50 * 1000]
      Join
        Scan (t1)
        Scan (t2)

After transformations:
Aggregate [sum(v)]
  Project [t1.id, 3 + t1.value AS v]
    Join [t1.id = t2.id]
      Project [t1.id, t1.value]
        Scan (t1)
      Filter [t2.id > 50000]
        Project [t2.id]
          Scan (t2)
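Conceptually, Catalyst gets this combined effect by running batches of rules over and over until the plan stops changing. A minimal sketch of that driver loop (a hypothetical helper; the real fixed-point machinery is Catalyst's RuleExecutor):

// Apply every rule in order, repeatedly, until a fixed point
// (no rule changes the plan) or an iteration budget is exhausted.
def fixedPoint[Plan](plan: Plan, rules: Seq[Plan => Plan], maxIterations: Int = 100): Plan = {
  var current = plan
  var changed = true
  var i = 0
  while (changed && i < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => rule(p))
    changed = next != current
    current = next
    i += 1
  }
  current
}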
Spark SQL Overview

[Figure: the same pipeline again, now highlighting the Tungsten half: turning the Catalyst-optimized query plan into RDDs.]
select count(*) from store_sales where ss_item_sk = 1000

Aggregate
  Project
    Filter
      Scan

G. Graefe, Volcano — An Extensible and Parallel Query Evaluation System. In IEEE Transactions on Knowledge and Data Engineering, 1994.
Volcano Iterator Model
• Standard for 30 years: almost all databases do it
• Each operator is an "iterator" that consumes records from its input operator

class Filter(
    child: Operator,
    predicate: Row => Boolean) extends Operator {

  def next(): Row = {
    var current = child.next()
    // Pull from the child until a row passes the predicate;
    // null means the input is exhausted.
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}
Downsides of the Volcano Model
1. Too many virtual function calls
   - at least 3 calls for each row in Aggregate
2. Extensive memory access
   - a "row" is a small segment in memory (or in the L1/L2/L3 cache)
3. Can't take advantage of modern CPU features
   - SIMD, pipelining, prefetching, branch prediction, ILP, instruction cache, …
Whole-stage Codegen: Spark as a "Compiler"

For the plan Aggregate -> Project -> Filter -> Scan above, Spark generates a single fused loop:

long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
Whole-stage Codegen
• Fuse operators together so the generated code looks like hand-optimized code:
  - Identify chains of operators ("stages")
  - Compile each stage into a single function
  - Get the functionality of a general-purpose execution engine, with performance as if the system had been hand-built to run just your query

T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware. In VLDB 2011.
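You can see which operators were fused (a sketch; assumes the store_sales table from the example above is registered):

// In Spark 2.x explain() output, operators running inside a
// whole-stage-codegen'd function are prefixed with an asterisk (*).
spark.sql("SELECT count(*) FROM store_sales WHERE ss_item_sk = 1000").explain()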
Putting it All Together
Operator Benchmarks: Cost/Row (ns)

[Charts: per-operator cost per row in nanoseconds, Spark 1.6 vs. Spark 2.0. Callouts: 5-30x speedups; radix sort: 10-100x speedups; shuffling is still the bottleneck; 10x speedup.]
TPC-DS (Scale Factor 1500, 100 cores)

[Chart: per-query runtime, Spark 2.0 vs. Spark 1.6, across TPC-DS queries (query time by query #; lower is better).]
What's Next?
Spark 2.2 and beyond
1. SPARK-16026: Cost-Based Optimizer (see the sketch after this list)
   - Leverage table/column-level statistics to optimize joins and aggregates
   - Statistics collection framework (Spark 2.1)
   - Cost-based optimizer (Spark 2.2)
2. Boosting Spark's performance on many-core machines
   - In-memory/single-node shuffle
3. Improving the quality of generated code and better integrating it with the in-memory column format in Spark
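A hedged sketch of feeding the cost-based optimizer (statistics collection shipped in Spark 2.1; the spark.sql.cbo.enabled flag arrived in Spark 2.2; the table and column names here are hypothetical):

// Collect table-level and column-level statistics, then enable the CBO.
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS")
spark.sql("ANALYZE TABLE store_sales COMPUTE STATISTICS FOR COLUMNS ss_item_sk")
spark.conf.set("spark.sql.cbo.enabled", "true")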
Thank you.
