Mark Raasveldt

Parallel Quacking

▸ When building DuckDB we have mostly focused on building a functional system
▸ Avoid premature optimization
▸ Avoid adding optimizations that prevent adding features
Parallel Quacking

▸ Suddenly people are benchmarking our system


▸ Including benchmarks in research papers
▸ Yikes!

▸ We haven’t exactly spent a lot of time optimizing…
Parallel Quacking

▸ We are now pretty happy with functionality
▸ Window functions, subqueries, collations, (recursive) CTEs, Parquet/Pandas/CSV readers, …
▸ Maybe we should start optimizing!
Parallel Quacking

▸ DuckDB is currently single-threaded


▸ Parallelism is an obvious performance boost

▸ More importantly: parallelism requires a structural change to the code
▸ Optimizations need to account for parallelism
▸ Optimizing a single-threaded HT is pointless if we have to throw it away once we add parallelism!
Parallel Quacking

▸ Parallelism is actually our oldest open issue!

▸ Created one month after the initial commit

▸ So it’s about time :)


DBMS Parallelism

▸ Short intro to DBMS parallelism


▸ DBMS have two types of parallelism
▸ Inter-query and intra-query parallelism

▸ Inter-query: multiple different queries can be executed in parallel
▸ Intra-query: a single query can be parallelized
DBMS Parallelism

▸ Most systems have inter-query parallelism
▸ We already had this
▸ Most useful for OLTP systems
▸ Many concurrent client requests, etc.
DBMS Parallelism

▸ Intra-query is not part of most OLTP systems
▸ e.g. MySQL/PostgreSQL/SQLite
▸ Not useful for small queries

▸ Only useful for complex queries


▸ Aka OLAP systems
DBMS Parallelism

▸ Exchange operator: original way of doing parallelism
▸ Parallelism is encapsulated in the exchange operator
▸ All other ops are unaware of parallelism
▸ Easy to bolt onto existing systems

[1993] Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution
Goetz Graefe et al.


DBMS Parallelism

▸ MonetDB uses a system similar to the exchange operator
▸ Individual ops are parallelism-unaware

▸ Data is partitioned by mitosis (mergetable?)


▸ Ops execute sequentially on partitions
▸ Result is combined by mat.pack
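
A minimal sketch of this exchange-style scheme, written in Python for brevity. The names `partition`, `unaware_sum` and `pack` are made up for illustration and are not MonetDB's API. The point is only that the operator in the middle never learns that it runs in parallel:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, n_parts):
    """Split the input into at most n_parts slices (the partitioning step)."""
    size = max(1, -(-len(rows) // n_parts))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def unaware_sum(part):
    """A parallelism-unaware operator: it only ever sees its own partition."""
    return sum(part)

def pack(partials):
    """Merge the per-partition results into the final answer."""
    return sum(partials)

def exchange_sum(rows, n_parts=4):
    parts = partition(rows, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(unaware_sum, parts))
    return pack(partials)

total = exchange_sum(list(range(1000)))
```

An ungrouped aggregate merges cheaply because each partition produces a single value. Operators whose partial results are large pay more for the final merge.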
DBMS Parallelism

▸ Exchange operator works to parallelize queries
▸ It is nice to bolt on to an existing system
▸ Don’t need to change any operators!
▸ But has partitioning/merging overhead…
▸ Works well for certain queries*, not for many others
▸ * Ungrouped aggregates or aggregates with a low number of groups
Morsel-Driven Parallelism

▸ Alternative: Morsel-driven parallelism


▸ Parallelism-aware operators
▸ Query is divided into pipelines
▸ Those pipelines are executed in parallel

[2014] Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
Viktor Leis et al.


Morsel-Driven Parallelism

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

1: HT Build “T”
2: HT Build “S”
3: Probe HTs and output result (depends on 1 and 2)
Morsel-Driven Parallelism

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

HT Build “T”

HT Build “S”

▸ HT builds of S and T can be trivially parallelized


▸ No shared data
▸ Limited parallelizability: depends on Q complexity…
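
As a rough illustration of this pipeline-level parallelism, the sketch below builds the two hash tables concurrently and only then runs the probe. It is in Python, the tables are toy lists of dicts, and `build_ht` and `probe` are hypothetical helpers, not DuckDB functions:

```python
from concurrent.futures import ThreadPoolExecutor

def build_ht(rows, key):
    """Build pipeline: hash a whole table on its join key."""
    ht = {}
    for row in rows:
        ht.setdefault(row[key], []).append(row)
    return ht

def probe(r_rows, ht_s, ht_t):
    """Probe pipeline: can only start once both hash tables exist."""
    out = []
    for r in r_rows:
        for s in ht_s.get(r["A"], []):
            for t in ht_t.get(r["B"], []):
                out.append({**r, **s, **t})
    return out

S = [{"A": 1, "s": "s1"}, {"A": 2, "s": "s2"}]
T = [{"B": 10, "t": "t10"}]
R = [{"A": 1, "B": 10}, {"A": 2, "B": 99}, {"A": 3, "B": 10}]

# The two builds share no data, so they can run side by side.
with ThreadPoolExecutor(max_workers=2) as pool:
    fut_s = pool.submit(build_ht, S, "A")
    fut_t = pool.submit(build_ht, T, "B")
    ht_s, ht_t = fut_s.result(), fut_t.result()

# The probe depends on both builds and runs afterwards.
result = probe(R, ht_s, ht_t)
```

With only two independent builds, at most two threads are ever busy, which is the limitation the last bullet points at.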
Morsel-Driven Parallelism

▸ Need to parallelize inside a pipeline


▸ How to do that?
▸ Contention happens at endpoints
▸ Scan of T
▸ HT build at join

▸ Use parallelism-aware operators at endpoints


▸ The rest of the operators (HT probe, projection, filter, etc…) don't need to be aware
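
A small Python sketch of the idea, under the assumption that a morsel is simply a fixed-size range of rows. `MorselSource` and the worker loop are illustrative names, not taken from HyPer or DuckDB. Only the two endpoints take a lock; the filter and projection in between are plain functions:

```python
import threading

class MorselSource:
    """Parallelism-aware scan: hands out fixed-size row ranges under a lock."""
    def __init__(self, table, morsel_size):
        self.table = table
        self.morsel_size = morsel_size
        self.next_row = 0
        self.lock = threading.Lock()

    def next_morsel(self):
        with self.lock:
            start = self.next_row
            if start >= len(self.table):
                return None  # scan exhausted
            self.next_row = min(start + self.morsel_size, len(self.table))
            return self.table[start:self.next_row]

def is_even(x):
    """Parallelism-unaware filter."""
    return x % 2 == 0

def square(x):
    """Parallelism-unaware projection."""
    return x * x

def worker(source, out, out_lock):
    local = []
    while True:
        morsel = source.next_morsel()
        if morsel is None:
            break
        local.extend(square(x) for x in morsel if is_even(x))
    with out_lock:  # parallelism-aware endpoint at the other end of the pipeline
        out.extend(local)

source = MorselSource(list(range(10_000)), morsel_size=1_000)
out, out_lock = [], threading.Lock()
threads = [threading.Thread(target=worker, args=(source, out, out_lock))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because threads pull the next morsel whenever they finish the previous one, a slow thread simply processes fewer morsels instead of holding the others up.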
Morsel-Driven Parallelism

[Figure: TPC-H SF100, 32 cores, 64 hardware threads]

[2014] Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age
Viktor Leis et al.
Morsel-Driven Vegetable Soup

▸ Morsel-driven parallelism seems like the way to go

▸ How can we add it to our vegetable soup?


Parallelism in DuckDB

▸ DuckDB uses a pull-based volcano execution model


▸ “Vector Volcano”

▸ Every operator implements a GetChunk method


▸ Recursively calls GetChunk on children
▸ Until we reach a data source (e.g. table scan)
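
The pull model can be sketched in a few lines of Python. This is not DuckDB's C++ interface: `get_chunk` stands in for GetChunk, a chunk is a plain list, and an empty list signalling exhaustion is an assumption of the sketch:

```python
class TableScan:
    """Data source: returns the next chunk of rows, or [] when exhausted."""
    def __init__(self, rows, chunk_size=3):
        self.rows, self.chunk_size, self.pos = rows, chunk_size, 0

    def get_chunk(self):
        chunk = self.rows[self.pos:self.pos + self.chunk_size]
        self.pos += len(chunk)
        return chunk

class Filter:
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate

    def get_chunk(self):
        # Keep pulling from the child until some rows survive or it is done.
        while True:
            chunk = self.child.get_chunk()
            if not chunk:
                return []
            kept = [row for row in chunk if self.predicate(row)]
            if kept:
                return kept

class Projection:
    def __init__(self, child, fn):
        self.child, self.fn = child, fn

    def get_chunk(self):
        return [self.fn(row) for row in self.child.get_chunk()]

plan = Projection(
    Filter(TableScan(list(range(10))), lambda x: x % 2 == 1),
    lambda x: x * 10,
)

result = []
while True:
    chunk = plan.get_chunk()  # the root pulls, which recursively pulls the children
    if not chunk:
        break
    result.extend(chunk)
```

Control flow starts at the root and reaches the scan last, which is why a single caller thread ends up driving the whole tree.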
Parallelism in DuckDB

▸ BuildHashTable: pull everything from RHS (build-side)


▸ ProbeHashTable: pull single chunk from LHS (probe side)
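
A hedged Python sketch of how a hash join behaves in such a pull model. The class and method names are illustrative, and `ChunkSource` is a stand-in child operator. The first call has to drain the entire build side before any output appears, so building and probing are fused inside one operator:

```python
class ChunkSource:
    """Stand-in child operator: yields pre-made chunks, then []."""
    def __init__(self, chunks):
        self.chunks = list(chunks)

    def get_chunk(self):
        return self.chunks.pop(0) if self.chunks else []

class HashJoin:
    def __init__(self, lhs, rhs, key):
        self.lhs, self.rhs, self.key = lhs, rhs, key
        self.ht = None  # built lazily on the first get_chunk call

    def build_hash_table(self):
        # Pull everything from the build side (RHS) before producing output.
        self.ht = {}
        while True:
            chunk = self.rhs.get_chunk()
            if not chunk:
                break
            for row in chunk:
                self.ht.setdefault(row[self.key], []).append(row)

    def probe_hash_table(self):
        # Pull one chunk at a time from the probe side (LHS). The loop only
        # exists because this sketch uses [] to mean "exhausted".
        while True:
            chunk = self.lhs.get_chunk()
            if not chunk:
                return []
            out = [{**l, **r} for l in chunk for r in self.ht.get(l[self.key], [])]
            if out:
                return out

    def get_chunk(self):
        if self.ht is None:
            self.build_hash_table()
        return self.probe_hash_table()

lhs = ChunkSource([[{"A": 1, "x": "a"}, {"A": 2, "x": "b"}], [{"A": 3, "x": "c"}]])
rhs = ChunkSource([[{"A": 1, "y": "p"}], [{"A": 3, "y": "q"}]])
join = HashJoin(lhs, rhs, "A")
first = join.get_chunk()
second = join.get_chunk()
third = join.get_chunk()
```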
Parallelism in DuckDB

▸ Have to split up building from probing


▸ Create individual pipelines
▸ Design interface that allows for parallel-aware execution
Parallelism in DuckDB

▸ Contention is in the source and sink of a pipeline


▸ Most difficult contention is in the sink
▸ Splitting up a scan is relatively simple
Parallelism in DuckDB

▸ Sink Interface
▸ Sink has two states
▸ Global state: single state per sink
▸ Local state: single state per thread
▸ Actual content depends on the operator
▸ Sink takes as input the two states + a DataChunk
▸ Called repeatedly until the source data is exhausted
▸ Combine is called after the source of a single thread is exhausted
▸ Combine is the final chance to merge any changes in the local sink state into the global state
▸ Finalize is called after all tasks related to the sink are completed
Parallelism in DuckDB

▸ Example: Ungrouped Aggregate


▸ Global state holds the aggregate result, and a lock
▸ Local state holds a thread-local aggregate, and some intermediates
▸ Sink: Aggregate into thread-local aggregation
▸ Combine: Merge local state into global state
▸ Finalize: Nothing, we are done
▸ (both Combine and Finalize are optional)
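
The code from these slides did not survive extraction, so here is a Python sketch of what an ungrouped SUM could look like against a sink interface of this shape. The class names, method signatures and the static split of chunks over threads are assumptions made for illustration, not DuckDB's actual code:

```python
import threading

class GlobalState:
    """One per sink: the final aggregate result plus a lock."""
    def __init__(self):
        self.total = 0
        self.lock = threading.Lock()

class LocalState:
    """One per thread: a thread-local partial aggregate."""
    def __init__(self):
        self.total = 0

class UngroupedSum:
    def sink(self, gstate, lstate, chunk):
        # Called once per chunk; touches only local state, so no lock is needed.
        lstate.total += sum(chunk)

    def combine(self, gstate, lstate):
        # Called once per thread, after that thread's source is exhausted.
        with gstate.lock:
            gstate.total += lstate.total

    def finalize(self, gstate):
        # Called once, after all tasks are done. Nothing to do for a plain SUM.
        pass

def run_thread(op, gstate, chunks):
    lstate = LocalState()
    for chunk in chunks:
        op.sink(gstate, lstate, chunk)
    op.combine(gstate, lstate)

op, gstate = UngroupedSum(), GlobalState()
data = list(range(1, 101))
chunks = [data[i:i + 10] for i in range(0, len(data), 10)]
per_thread = [chunks[0::2], chunks[1::2]]  # static split, for simplicity

threads = [threading.Thread(target=run_thread, args=(op, gstate, c))
           for c in per_thread]
for t in threads:
    t.start()
for t in threads:
    t.join()
op.finalize(gstate)
```

The lock is taken once per thread in `combine` rather than once per chunk in `sink`, so contention stays low no matter how much data flows through.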
Parallelism in DuckDB

▸ Splitting up scans
▸ Splitting up scans is generally not very difficult
▸ But we have multiple types of scans
▸ Base table, parquet, CSV, aggregate HT, etc…
▸ How to split up depends on scan type
Parallelism in DuckDB

▸ Interface for parallel scans:

▸ One task is created for every invoked callback

▸ Implementation is optional
▸ No implementation -> scan will not be parallelized
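
The interface itself is missing from the extracted text. One possible shape, sketched in Python with hypothetical names (`parallel_scan`, `create_tasks`), is an optional hook that invokes a callback once per unit of work, with a single-task fallback when the hook is not implemented:

```python
class Scan:
    """Base scan: does not know how to split itself."""
    def __init__(self, rows):
        self.rows = rows

    def parallel_scan(self, callback):
        """Optional hook. Returns False when the scan cannot be split."""
        return False

class SplittableScan(Scan):
    """A scan that implements the hook by cutting itself into row ranges."""
    def __init__(self, rows, rows_per_task):
        super().__init__(rows)
        self.rows_per_task = rows_per_task

    def parallel_scan(self, callback):
        for start in range(0, len(self.rows), self.rows_per_task):
            callback(start, min(start + self.rows_per_task, len(self.rows)))
        return True

def create_tasks(scan):
    """One task per invoked callback; one task in total if there is no hook."""
    tasks = []
    if not scan.parallel_scan(lambda lo, hi: tasks.append((lo, hi))):
        tasks.append((0, len(scan.rows)))  # fallback: scan is not parallelized
    return tasks

plain_tasks = create_tasks(Scan(list(range(10))))
split_tasks = create_tasks(SplittableScan(list(range(10)), rows_per_task=4))
```

Keeping the hook optional means a new scan type works immediately, just without parallelism, until someone decides how it should be split.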
Parallelism in DuckDB

▸ Currently only implemented for base table


▸ One task for every 100 vectors (102,400 tuples)

▸ Parquet/Pandas is not very complicated


▸ CSV can also benefit…
▸ Future work!
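
The numbers on this slide imply 1,024 tuples per vector (102,400 / 100). A short Python sketch of the resulting split; the 1,000,000-row table is just an example size and `task_ranges` is a hypothetical helper:

```python
VECTORS_PER_TASK = 100
TUPLES_PER_TASK = 102_400
VECTOR_SIZE = TUPLES_PER_TASK // VECTORS_PER_TASK  # tuples per vector

def task_ranges(row_count):
    """Half-open [start, end) row ranges, one per task."""
    return [(start, min(start + TUPLES_PER_TASK, row_count))
            for start in range(0, row_count, TUPLES_PER_TASK)]

ranges = task_ranges(1_000_000)
```

For one million rows this gives ten tasks: nine full ones of 102,400 tuples and a final one of 78,400.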
Parallelism in DuckDB

▸ Creating the pipelines


▸ Created by a single traversal of the query tree
▸ Encounter a pipeline breaker: create a new pipeline
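
A Python sketch of such a traversal, restricted to scans and hash joins. The `Pipeline`, `Scan` and `HashJoin` classes are illustrative, and the plan shape (build on S and T, probe with R) is an assumption of this example:

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    sink: str
    operators: list = field(default_factory=list)   # source first
    depends_on: list = field(default_factory=list)

@dataclass
class Scan:
    table: str

@dataclass
class HashJoin:
    probe: object
    build: object
    key: str

def build_pipelines(node, current, all_pipelines):
    """Single recursive traversal; `current` is the pipeline being extended."""
    if isinstance(node, Scan):
        current.operators.insert(0, f"scan {node.table}")
    elif isinstance(node, HashJoin):
        # Pipeline breaker: the build side gets its own pipeline...
        build = Pipeline(sink=f"build HT on {node.key}")
        build_pipelines(node.build, build, all_pipelines)
        all_pipelines.append(build)
        # ...and the current pipeline depends on it and probes the result.
        current.depends_on.append(build)
        current.operators.insert(0, f"probe HT on {node.key}")
        build_pipelines(node.probe, current, all_pipelines)

plan = HashJoin(
    probe=HashJoin(probe=Scan("R"), build=Scan("S"), key="A"),
    build=Scan("T"),
    key="B",
)
main = Pipeline(sink="result")
pipelines = []
build_pipelines(plan, main, pipelines)
pipelines.append(main)
```

The main pipeline ends up as scan R, probe on A, probe on B, with a dependency on each of the two build pipelines.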
Parallelism in DuckDB

Encounter hash join: create build pipeline in RHS* and create a dependency in main pipeline

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

* This image is taken from HyPer, which builds on the LHS; we build on the RHS.
Is there a standard? Should we switch this? Is it even important?
Parallelism in DuckDB

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

Another hash join: create another build pipeline and dependency
Parallelism in DuckDB

TPC-H Q1

[Figure: TPC-H Q1 split into two pipelines]
▸ P1 (depends on P2): scans the aggregate HT!
▸ P2
▸ The 0 in the figure is a bug in our profiler with parallel execution atm, TODO


Parallelism in DuckDB

▸ Notes on parallelism
▸ The final pipeline (i.e. the one that outputs results) is not parallelized
▸ Doesn’t matter for TPC-H (there is always a Top-N or ORDER BY…)
▸ But can definitely matter for other queries!
▸ We can push a “materialize” operator that materializes in parallel

▸ Future work!
Parallelism in DuckDB

▸ Notes on load balancing


▸ Pipelines are split into tasks
▸ Tasks are scheduled in a concurrent queue
▸ Worker threads work on these tasks in scheduled order
▸ Except the calling thread: this thread works on its own query

▸ Short queries will not have to wait for long queries


▸ Every query has at least one thread working on it
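
One way to get this behaviour, sketched in Python. The token scheme below (the shared queue holds one reference to the query per task, and each query keeps its own task list) is an assumption of this sketch, not a description of DuckDB's scheduler:

```python
import queue
import threading
from collections import deque

class Query:
    def __init__(self, tasks):
        self.tasks = deque(tasks)            # this query's pending tasks
        self.remaining = len(self.tasks)     # tasks not yet finished
        self.lock = threading.Lock()
        self.done = threading.Event()
        if self.remaining == 0:
            self.done.set()

    def run_one(self):
        """Run one pending task of this query; return False if none are left."""
        with self.lock:
            if not self.tasks:
                return False
            task = self.tasks.popleft()
        task()
        with self.lock:
            self.remaining -= 1
            if self.remaining == 0:
                self.done.set()
        return True

scheduled = queue.Queue()  # concurrent queue shared by all worker threads

def worker():
    while True:
        query = scheduled.get()
        if query is None:      # shutdown signal
            return
        query.run_one()        # workers serve queries in scheduled order

def execute(query):
    """Runs on the calling (client) thread."""
    for _ in range(query.remaining):
        scheduled.put(query)   # one token per task
    while query.run_one():     # the caller only ever works on its own query
        pass
    query.done.wait()          # wait for tasks still running on workers

workers = [threading.Thread(target=worker) for _ in range(3)]
for w in workers:
    w.start()

results, results_lock = [], threading.Lock()

def make_task(i):
    def task():
        with results_lock:
            results.append(i)
    return task

execute(Query([make_task(i) for i in range(20)]))

for _ in workers:
    scheduled.put(None)
for w in workers:
    w.join()
```

Since the caller drains its own task list directly, a short query finishes even when every worker is busy with tokens from a long-running query.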
Parallelism in DuckDB

▸ NUMA Awareness
▸ TODO :)
Preliminary Results

▸ Results
▸ Before we implemented splitting of scans we were curious
▸ How much does TPC-H benefit from inter-pipeline parallelization?
Preliminary Results

▸ Small speedup in some queries


▸ Most queries are dominated by a single pipeline!

* Actually 3 threads, due to an off-by-one :)


Preliminary Results

▸ Preliminary results (including splitting of pipelines)


▸ Notes
▸ We did not implement a good aggregate HT yet!
▸ Currently global HT that is locked on every sink

▸ Join HT/scan also have a (low) amount of contention


▸ Did not have much time to look at it yet
▸ This was all finished last Thursday :)
Preliminary Results

▸ Preliminary results
▸ [Figures: Q1 (parallel and sequential), Q18 sequential, Q18 parallel]
Future Work

▸ Future Work
▸ Rework aggregate hash table
▸ More profiling of contention (specifically in scans)
▸ Parallel window functions, ORDER BY, Top N…
▸ Parallel Parquet/CSV/Pandas scans
▸ Expand profiler to better display parallelism/pipelines
