Spark SQL: A Compiler from Queries to RDDs
Sameer Agarwal
Spark Summit | Boston | February 9th 2017
About Me
• Software Engineer at Databricks (Spark Core/SQL)
• PhD in Databases (AMPLab, UC Berkeley)
• Research on BlinkDB (Approximate Queries in Spark)
Background: What is an RDD?
• Dependencies
• Partitions
• Compute function: Partition => Iterator[T]
• Opaque computation: the compute function is a black box to Spark
• Opaque data: Spark does not know the structure of the elements of type T
RDD Programming Model
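A minimal sketch of the RDD style (assuming a SparkContext named sc and a hypothetical text file of "id,value" lines): the user spells out how to compute the result, and Spark sees only opaque functions over opaque elements.

// Hypothetical input: a text file of "id,value" lines.
val pairs = sc.textFile("events.csv").map { line =>
  val Array(id, value) = line.split(",")
  (id.toInt, value.toInt)
}

// The user spells out *how* to compute the answer, step by step.
// Spark cannot look inside these closures, so it cannot, for example,
// pre-compute 1 + 2: the computation and the data are opaque.
val total = pairs
  .filter { case (id, _) => id > 50 * 1000 }
  .map { case (_, value) => 1 + 2 + value }
  .reduce(_ + _)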
SQL/Structured Programming Model
• High-level APIs (SQL, DataFrame/Dataset): programs describe what data operations are needed without specifying how to execute those operations
• More efficient: an optimizer can automatically find the most efficient plan to execute a query (a DataFrame sketch follows below)
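A minimal DataFrame sketch of the same computation (assuming a SparkSession named spark and a registered table t1 with integer columns id and value); here the expression 1 + 2 is visible to the optimizer and can be folded before execution.

import org.apache.spark.sql.functions._
import spark.implicits._

// Declarative: describe *what* is needed; Catalyst decides how to run it.
val total = spark.table("t1")
  .where($"id" > 50 * 1000)
  .select((lit(1) + lit(2) + $"value").as("v"))
  .agg(sum($"v"))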
Spark SQL Overview
[Architecture diagram: SQL AST, DataFrame, and Dataset programs become a Query Plan; Catalyst's Transformations produce an Optimized Query Plan; Tungsten turns it into RDDs]
SELECT sum(v)
FROM (
SELECT
t1.id,
1 + 2 + t1.value AS v
FROM t1 JOIN t2
WHERE
t1.id = t2.id AND
t2.id > 50 * 1000) tmp
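The same query written against the DataFrame API goes through the same compilation pipeline; a sketch, assuming t1 and t2 are registered tables with columns id and value:

import org.apache.spark.sql.functions._

val t1 = spark.table("t1")
val t2 = spark.table("t2")

val tmp = t1.join(t2, t1("id") === t2("id"))
  .where(t2("id") > 50 * 1000)
  .select(t1("id"), (lit(1) + lit(2) + t1("value")).as("v"))

val result = tmp.agg(sum("v"))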
Trees: Abstractions of Users' Programs
Expression
• An expression represents a new value, computed based on input values
• e.g. 1 + 2 + t1.value (from the query above)
Trees: Abstractions of Users' Programs
Query Plan

Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Join
        Scan (t1)
        Scan (t2)
Logical Plan
• A Logical Plan describes computation on datasets without defining how to conduct the computation

Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Join
        Scan (t1)
        Scan (t2)
Physical Plan
• A Physical Plan describes computation on datasets with specific definitions on how to conduct the computation

Hash-Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Sort-Merge Join
        Scan (t1)
        Scan (t2)
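In Spark you can print these trees directly; a small sketch (assuming the SQL above is held in a string named query and the tables are registered): explain(true) shows the parsed, analyzed, and optimized logical plans followed by the physical plan.

val df = spark.sql(query)
df.explain(true)
// == Parsed Logical Plan ==
// == Analyzed Logical Plan ==
// == Optimized Logical Plan ==
// == Physical Plan ==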
[Architecture diagram, revisited: SQL AST, DataFrame, and Dataset (Java/Scala) programs are abstracted as trees; Catalyst's Transformations turn the Query Plan into an Optimized Query Plan and finally into RDDs]
Transform
• A function associated with every tree used to implement a single rule

1 + 2 + t1.value, evaluating 1 + 2 for every row:
Add(Add(Literal(1), Literal(2)), Attribute(t1.value))

Evaluate 1 + 2 once, giving 3 + t1.value:
Add(Literal(3), Attribute(t1.value))
Transform
• A transform is defined as a Partial Function
• Partial Function: a function that is defined for a subset of its possible arguments

val expression: Expression = ...
expression.transform {
  case Add(Literal(x, IntegerType), Literal(y, IntegerType)) =>
    Literal(x + y)
}

The case clause determines whether the partial function is defined for a given input.
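Catalyst's Expression classes live inside Spark, but the mechanism is easy to model; a self-contained toy sketch (the names Expr, Lit, Attr, and this simplified bottom-up transform are made up for illustration, not the real Catalyst API):

// Toy expression tree, not the real Catalyst classes.
sealed trait Expr {
  def children: Seq[Expr]
  def withChildren(newChildren: Seq[Expr]): Expr

  // Rewrite bottom-up: transform the children first, then try the rule on the result
  // (Catalyst offers both top-down and bottom-up variants).
  def transform(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten = withChildren(children.map(_.transform(rule)))
    if (rule.isDefinedAt(rewritten)) rule(rewritten) else rewritten
  }
}

case class Lit(value: Int) extends Expr {
  def children = Nil
  def withChildren(c: Seq[Expr]) = this
}
case class Attr(name: String) extends Expr {
  def children = Nil
  def withChildren(c: Seq[Expr]) = this
}
case class Add(left: Expr, right: Expr) extends Expr {
  def children = Seq(left, right)
  def withChildren(c: Seq[Expr]) = Add(c(0), c(1))
}

// 1 + 2 + t1.value
val expr = Add(Add(Lit(1), Lit(2)), Attr("t1.value"))

// Constant folding as a partial function: only defined for Add of two literals.
val folded = expr.transform { case Add(Lit(x), Lit(y)) => Lit(x + y) }
// folded == Add(Lit(3), Attr("t1.value")), i.e. 3 + t1.value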
Transform

Applying the rule to 1 + 2 + t1.value:
Add(Add(Literal(1), Literal(2)), Attribute(t1.value))
rewrites to 3 + t1.value:
Add(Literal(3), Attribute(t1.value))
Combining Multiple Rules
Predicate Pushdown + Constant Folding

Original plan:
Aggregate  sum(v)
  Project  t1.id, 1+2+t1.value as v
    Filter  t1.id=t2.id AND t2.id>50*1000
      Join
        Scan (t1)
        Scan (t2)

After predicate pushdown, the filter t2.id>50*1000 moves below the Join and is applied to the scan of t2, while the join condition t1.id=t2.id moves into the Join; after constant folding, the projection becomes t1.id, 3+t1.value as v.
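Individual rules compose: an optimizer keeps applying a batch of them until the plan stops changing. A toy sketch of that fixed-point loop, reusing the Expr model from the earlier sketch (the fixedPoint helper and the "x + 0" rule are made up for illustration):

// Apply a batch of rules repeatedly until the tree reaches a fixed point,
// mirroring how Catalyst runs batches of rules over a plan.
def fixedPoint(plan: Expr,
               rules: Seq[PartialFunction[Expr, Expr]],
               maxIterations: Int = 100): Expr = {
  var current = plan
  var changed = true
  var i = 0
  while (changed && i < maxIterations) {
    val next = rules.foldLeft(current)((p, rule) => p.transform(rule))
    changed = next != current
    current = next
    i += 1
  }
  current
}

// Two example rules: constant folding, and dropping a useless "+ 0".
val rules = Seq[PartialFunction[Expr, Expr]](
  { case Add(Lit(x), Lit(y)) => Lit(x + y) },
  { case Add(e, Lit(0))      => e }
)

val optimized = fixedPoint(Add(Attr("t1.value"), Add(Lit(1), Lit(-1))), rules)
// optimized == Attr("t1.value")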
[Architecture diagram, revisited: Tungsten takes the Optimized Query Plan produced by Catalyst and turns it into RDDs]
Volcano Iterator Model
(G. Graefe, Volcano - An Extensible and Parallel Query Evaluation System. IEEE Transactions on Knowledge and Data Engineering, 1994)
• Standard for 30 years: almost all databases do it
• Each operator is an "iterator" that consumes records from its input operator

class Filter(
    child: Operator,
    predicate: (Row => Boolean))
  extends Operator {

  def next(): Row = {
    var current = child.next()
    // Keep pulling rows from the child until one satisfies the predicate
    // (or the child is exhausted and returns null).
    while (current != null && !predicate(current)) {
      current = child.next()
    }
    current
  }
}
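To make the snippet above self-contained, a toy sketch of the missing pieces (Row, Operator, and a Scan over an in-memory collection are simplified stand-ins, not Spark's actual classes); note that every single row is pulled through a chain of virtual next() calls:

type Row = Array[Any]          // simplified stand-in for an internal row

trait Operator {
  def next(): Row              // returns null when the input is exhausted
}

// Leaf operator: hands out rows from an in-memory collection.
class Scan(rows: Seq[Row]) extends Operator {
  private val it = rows.iterator
  def next(): Row = if (it.hasNext) it.next() else null
}

// Scan -> Filter: one virtual next() call per operator for every row produced.
val data = (1 to 100).map(i => Array[Any](i): Row)
val filter = new Filter(new Scan(data), row => row(0).asInstanceOf[Int] > 90)

var row = filter.next()
while (row != null) {
  println(row(0))              // prints 91 .. 100
  row = filter.next()
}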
Downside of the Volcano Model
1. Too many virtual function calls
   o at least 3 calls for each row in Aggregate

Operator chain: Scan -> Filter -> Project -> Aggregate

Equivalent hand-written code:
long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
Whole-stage Codegen
• Fusing operators together so the generated code looks like hand-optimized code:
  - Identify chains of operators ("stages")
  - Compile each stage into a single function
  - Functionality of a general-purpose execution engine; performance as if the system had been hand-built just to run your query

(T. Neumann, Efficiently Compiling Efficient Query Plans for Modern Hardware. In VLDB 2011)
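To see which operators were fused, a sketch (assuming the earlier query string and registered tables t1/t2): in Spark 2.x, explain() marks operators that were collapsed into a single generated function with a leading '*', and the debug helpers print the generated Java source.

val df = spark.sql(query)

df.explain()
// Operators prefixed with '*' in the physical plan were fused by whole-stage codegen.

// Dump the generated Java source (Spark 2.x debug helper).
import org.apache.spark.sql.execution.debug._
df.debugCodegen()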
Putting it All Together
Operator Benchmarks: Cost/Row (ns)
[Benchmark charts; callouts from the slides:]
• 5-30x speedups
• Radix sort: 10-100x speedups
• Shuffling still the bottleneck
• 10x speedup
TPC-DS (Scale Factor 1500, 100 cores)
[Benchmark chart: runtime per TPC-DS query; lower is better]
What’s Next?
Spark 2.2 and beyond
1. SPARK-16026: Cost-Based Optimizer
   - Leverage table/column-level statistics to optimize joins and aggregates
   - Statistics Collection Framework (Spark 2.1) - see the sketch after this list
   - Cost-Based Optimizer (Spark 2.2)
2. Boosting Spark's Performance on Many-Core Machines
   - In-memory / single-node shuffle
3. Improving the quality of generated code and better integration with the in-memory columnar format in Spark
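The statistics collection framework referenced in item 1 is driven through ANALYZE TABLE; a sketch (assuming a registered table t1 with columns id and value):

// Table-level statistics: row count and size in bytes.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS")

// Column-level statistics (Spark 2.1+), used by the cost-based optimizer.
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR COLUMNS id, value")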
Thank you.