DuckDB Parallelism

Mark Raasveldt

Parallel Quacking
Parallel Quacking

▸ When building DuckDB we have mostly focused on building a functional system
▸ Avoid premature optimization
▸ Avoid adding optimizations that prevent adding features
Parallel Quacking

▸ Suddenly people are benchmarking our system

▸ Including benchmarks in research papers
▸ Yikes!

▸ We haven’t exactly spent a lot of time optimizing…
Parallel Quacking

▸ We are now pretty happy with functionality

▸ Window functions, subqueries, collations, (recursive) CTEs, Parquet/Pandas/CSV readers, …
▸ Maybe we should start optimizing!
Parallel Quacking

▸ DuckDB is currently single-threaded

▸ Parallelism is an obvious performance boost

▸ More importantly: parallelism requires a structural change to the code
▸ Optimizations need to account for parallelism
▸ Optimizing a single-threaded HT is pointless if we have to throw it away once we add parallelism!
Parallel Quacking

▸ Parallelism is actually our oldest open issue!

▸ Created one month after the initial commit

▸ So it’s about time :)


DBMS Parallelism

▸ Short intro to DBMS parallelism

▸ DBMSs have two types of parallelism
▸ Inter-query and intra-query parallelism

▸ Inter-query: multiple different queries can be executed in parallel
▸ Intra-query: a single query can be parallelized
DBMS Parallelism

▸ Most systems have inter-query parallelism

▸ We already had this

▸ Most useful for OLTP systems

▸ Many concurrent client requests, etc.
DBMS Parallelism

▸ Intra-query parallelism is not part of most OLTP systems
▸ e.g. MySQL/PostgreSQL/SQLite
▸ Not useful for small queries

▸ Only useful for complex queries

▸ Aka OLAP systems
DBMS Parallelism

▸ Exchange operator: the original way of doing parallelism
▸ Parallelism is encapsulated in the exchange operator
▸ All other ops are unaware of parallelism
▸ Easy to bolt onto existing systems

[1993] Encapsulation of Parallelism and Architecture-Independence in Extensible Database Query Execution

Goetz Graefe et al.


DBMS Parallelism

▸ MonetDB uses a system similar to the exchange operator
▸ Individual ops are parallelism-unaware

▸ Data is partitioned by mitosis (mergetable?)

▸ Ops execute sequentially on partitions
▸ Result is combined by mat.pack
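The partition / sequential-op / merge flow described above can be sketched as follows. This is a minimal illustration of the exchange-operator idea, not MonetDB's or anyone's actual code; the function and partitioning scheme are made up for the example:

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Exchange-style parallelism sketch: partition the input (cf. mitosis),
// run a parallelism-unaware operator on each partition in its own thread,
// then merge the partial results (cf. mat.pack).
std::vector<int> ExchangeStyle(const std::vector<int> &input, size_t n_partitions,
                               int (*op)(int)) {
    // Partition the input round-robin (illustrative partitioning scheme).
    std::vector<std::vector<int>> partitions(n_partitions);
    for (size_t i = 0; i < input.size(); i++) {
        partitions[i % n_partitions].push_back(input[i]);
    }
    // Each partition is processed by a sequential operator that knows
    // nothing about parallelism.
    std::vector<std::thread> workers;
    for (auto &part : partitions) {
        workers.emplace_back([&part, op]() {
            for (auto &v : part) {
                v = op(v);
            }
        });
    }
    for (auto &w : workers) {
        w.join();
    }
    // Merge: concatenate the partial results back into one result.
    std::vector<int> result;
    for (auto &part : partitions) {
        result.insert(result.end(), part.begin(), part.end());
    }
    return result;
}
```

Note that the partitioning and merging steps are exactly where the overhead mentioned on the next slide comes from: they materialize and copy data that parallelism-aware operators would not need to touch.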
DBMS Parallelism

▸ The exchange operator works to parallelize queries

▸ It is nice to bolt onto an existing system
▸ Don’t need to change any operators!

▸ But has partitioning/merging overhead…

▸ Works well for certain queries1, not for many others
▸ 1 ungrouped aggregates or aggregates with a low number of groups
Morsel-Driven Parallelism

▸ Alternative: Morsel-driven parallelism

▸ Parallelism-aware operators
▸ Query is divided into pipelines
▸ Those pipelines are executed in parallel

[2014] Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age

Viktor Leis et al.

Morsel-Driven Parallelism

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

1: HT Build “T”
2: HT Build “S”
3: Probe HTs and output result (depends on 1 and 2)
Morsel-Driven Parallelism

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

HT Build “T”
HT Build “S”

▸ HT builds of S and T can be trivially parallelized

▸ No shared data
▸ Limited parallelizability: depends on query complexity…
Morsel-Driven Parallelism

▸ Need to parallelize inside a pipeline

▸ How to do that?
▸ Contention happens at endpoints
▸ Scan of T
▸ HT build at join

▸ Use parallelism-aware operators at endpoints

▸ The rest of the operators (HT probe, projection, filter, etc…) don't need to be aware
Morsel-Driven Parallelism

TPC-H SF100, 32 cores (64 hardware threads)

[2014] Morsel-Driven Parallelism: A NUMA-Aware Query Evaluation Framework for the Many-Core Age

Viktor Leis et al.

Morsel-Driven Vegetable Soup

▸ Morsel-driven parallelism seems like the way to go

▸ How can we add it to our vegetable soup?


Parallelism in DuckDB

▸ DuckDB uses a pull-based Volcano execution model

▸ “Vector Volcano”

▸ Every operator implements a GetChunk method

▸ Recursively calls GetChunk on children
▸ Until we reach a data source (e.g. a table scan)
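The pull-based model can be sketched like this. The types here are stand-ins (DuckDB's real DataChunk and operator classes are much richer); only the recursive GetChunk shape matches the description above:

```cpp
#include <algorithm>
#include <cassert>
#include <memory>
#include <vector>

// Illustrative stand-in for DuckDB's DataChunk: a small batch of values.
using DataChunk = std::vector<int>;

// Every operator implements GetChunk; an empty chunk signals exhaustion.
struct PhysicalOperator {
    virtual ~PhysicalOperator() = default;
    virtual DataChunk GetChunk() = 0;
};

// A data source: emits its data in fixed-size chunks, then an empty chunk.
struct TableScan : PhysicalOperator {
    std::vector<int> data;
    size_t offset = 0;
    static constexpr size_t CHUNK_SIZE = 4;  // tiny, for illustration

    explicit TableScan(std::vector<int> d) : data(std::move(d)) {}

    DataChunk GetChunk() override {
        size_t end = std::min(offset + CHUNK_SIZE, data.size());
        DataChunk chunk(data.begin() + offset, data.begin() + end);
        offset = end;
        return chunk;
    }
};

// A streaming operator: recursively pulls chunks from its child.
struct Filter : PhysicalOperator {
    std::unique_ptr<PhysicalOperator> child;

    explicit Filter(std::unique_ptr<PhysicalOperator> c) : child(std::move(c)) {}

    DataChunk GetChunk() override {
        while (true) {
            DataChunk input = child->GetChunk();
            if (input.empty()) return {};  // child exhausted
            DataChunk output;
            for (int v : input) {
                if (v % 2 == 0) output.push_back(v);  // keep even values
            }
            if (!output.empty()) return output;
        }
    }
};
```

Pulling from the root operator drives the whole tree: each GetChunk call recurses down until it reaches the scan.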
Parallelism in DuckDB

▸ BuildHashTable: pull everything from the RHS (build side)

▸ ProbeHashTable: pull a single chunk from the LHS (probe side)
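The asymmetry between the two sides can be sketched as follows. This is an illustration of the pull semantics only (single-column semi-join, made-up helper names), not DuckDB's hash join:

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <unordered_set>
#include <vector>

using DataChunk = std::vector<int>;
// A child operator modeled as a callable yielding the next chunk
// (empty chunk = exhausted), mirroring recursive GetChunk calls.
using ChunkSource = std::function<DataChunk()>;

// Build phase: exhaust the RHS up front and materialize the hash table
// before any probing happens -- this is what makes it a pipeline breaker.
std::unordered_set<int> BuildHashTable(const ChunkSource &rhs) {
    std::unordered_set<int> ht;
    for (DataChunk chunk = rhs(); !chunk.empty(); chunk = rhs()) {
        ht.insert(chunk.begin(), chunk.end());
    }
    return ht;
}

// Probe phase: pull a *single* chunk from the LHS per call and emit the
// matching values -- streaming, nothing is materialized.
DataChunk ProbeHashTable(const std::unordered_set<int> &ht, const ChunkSource &lhs) {
    DataChunk input = lhs(), output;
    for (int v : input) {
        if (ht.count(v)) output.push_back(v);
    }
    return output;
}

// Helper (illustrative): wrap pre-chunked data as a pull source.
ChunkSource MakeSource(std::vector<DataChunk> chunks) {
    auto index = std::make_shared<size_t>(0);
    return [chunks, index]() -> DataChunk {
        return *index < chunks.size() ? chunks[(*index)++] : DataChunk{};
    };
}
```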
Parallelism in DuckDB

▸ Have to split up building from probing

▸ Create individual pipelines
▸ Design an interface that allows for parallelism-aware execution
Parallelism in DuckDB

▸ Contention is in the source and sink of a pipeline

▸ The most difficult contention is in the sink
▸ Splitting up a scan is relatively simple
Parallelism in DuckDB

▸ Sink Interface
▸ Sink has two states
▸ Global state: single state per sink
▸ Local state: single state per thread
▸ Actual content depends on the operator
Parallelism in DuckDB

▸ Sink Interface
▸ Sink takes as input the two states + a DataChunk
▸ Called repeatedly until the source data is exhausted
Parallelism in DuckDB

▸ Sink Interface
▸ Combine is called after the source of a single thread is exhausted

▸ Combine is the final chance to merge any changes in the local sink state into the global state
Parallelism in DuckDB

▸ Sink Interface
▸ Finalize is called after all tasks related to the sink
are completed
Parallelism in DuckDB

▸ Example: Ungrouped Aggregate


▸ Global state holds the aggregate result, and a lock
Parallelism in DuckDB

▸ Example: Ungrouped Aggregate


▸ Local state holds a thread-local aggregate, and some intermediates
Parallelism in DuckDB

▸ Example: Ungrouped Aggregate


▸ Sink: aggregate into the thread-local aggregate
Parallelism in DuckDB

▸ Example: Ungrouped Aggregate


▸ Combine: Merge local state into global state
Parallelism in DuckDB

▸ Example: Ungrouped Aggregate


▸ Finalize: Nothing, we are done
▸ (both Combine and Finalize are optional)
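Putting the pieces together, the ungrouped aggregate can be sketched end to end. This is a minimal illustration of the Sink/Combine scheme using free functions and SUM; the names and the driver are invented for the example and simplify DuckDB's actual operator classes:

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

using DataChunk = std::vector<long>;

// Global state: one per sink; holds the aggregate result and a lock.
struct GlobalSinkState {
    std::mutex lock;
    long sum = 0;
};

// Local state: one per thread; holds a thread-local partial aggregate.
struct LocalSinkState {
    long local_sum = 0;
};

// Sink: aggregate into the thread-local state only -- no contention.
void Sink(GlobalSinkState &, LocalSinkState &lstate, const DataChunk &chunk) {
    for (long v : chunk) lstate.local_sum += v;
}

// Combine: called once per thread after its source is exhausted; merges
// the local partial aggregate into the global state under the lock.
void Combine(GlobalSinkState &gstate, LocalSinkState &lstate) {
    std::lock_guard<std::mutex> guard(gstate.lock);
    gstate.sum += lstate.local_sum;
}

// Finalize would run once after all sink tasks complete; for an
// ungrouped SUM there is nothing left to do.

// Driver: one worker thread per source partition, each with its own
// local state, calling Sink repeatedly and then Combine.
long ParallelSum(const std::vector<std::vector<DataChunk>> &per_thread_chunks) {
    GlobalSinkState gstate;
    std::vector<std::thread> workers;
    for (const auto &chunks : per_thread_chunks) {
        workers.emplace_back([&gstate, &chunks]() {
            LocalSinkState lstate;
            for (const auto &chunk : chunks) Sink(gstate, lstate, chunk);
            Combine(gstate, lstate);
        });
    }
    for (auto &w : workers) w.join();
    return gstate.sum;
}
```

The key property: the lock is taken once per thread (in Combine), not once per chunk, so the hot Sink path is contention-free.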
Parallelism in DuckDB

▸ Splitting up scans
▸ Splitting up scans is generally not very difficult
▸ But we have multiple types of scans
▸ Base table, parquet, CSV, aggregate HT, etc…
▸ How to split up depends on scan type
Parallelism in DuckDB

▸ Interface for parallel scans:

▸ One task is created for every invoked callback

▸ Implementation is optional
▸ No implementation -> scan will not be parallelized
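The one-task-per-callback idea can be sketched like this; the names and the fixed row ranges are illustrative assumptions, not DuckDB's actual scan API:

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// A scan task covers a contiguous range of rows.
struct ScanTask {
    size_t start_row;
    size_t end_row;  // exclusive
};

// Split a scan of `row_count` rows into tasks of `rows_per_task` rows,
// invoking `create_task` once per task (cf. one task per 100 vectors).
// Returns the number of tasks created.
size_t SplitScan(size_t row_count, size_t rows_per_task,
                 const std::function<void(ScanTask)> &create_task) {
    size_t tasks = 0;
    for (size_t start = 0; start < row_count; start += rows_per_task) {
        create_task({start, std::min(start + rows_per_task, row_count)});
        ++tasks;
    }
    return tasks;
}
```

A scan type that cannot split itself simply never invokes the callback more than once (or not at all), and the pipeline falls back to sequential execution.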
Parallelism in DuckDB

▸ Currently only implemented for the base table

▸ One task for every 100 vectors (102,400 tuples)

▸ Parquet/Pandas is not very complicated

▸ CSV can also benefit…
▸ Future work!
Parallelism in DuckDB

▸ Creating the pipelines


▸ Created by a single traversal of the query tree
▸ Encounter a pipeline breaker: create a new pipeline
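The single traversal can be sketched as follows; the tree and pipeline types are simplified stand-ins for illustration (DuckDB's real pipelines hold operators, not names):

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <vector>

// Minimal operator tree: children[0] is the LHS (probe side);
// for a pipeline breaker such as a hash join, children[1] is the
// RHS (build side).
struct Op {
    std::string name;
    bool is_pipeline_breaker = false;
    std::vector<std::unique_ptr<Op>> children;
};

struct Pipeline {
    std::vector<std::string> ops;
    std::vector<size_t> dependencies;  // pipelines that must finish first
};

// Helper to build the illustrative tree.
std::unique_ptr<Op> MakeOp(std::string name, bool breaker,
                           std::unique_ptr<Op> lhs = nullptr,
                           std::unique_ptr<Op> rhs = nullptr) {
    auto op = std::make_unique<Op>();
    op->name = std::move(name);
    op->is_pipeline_breaker = breaker;
    if (lhs) op->children.push_back(std::move(lhs));
    if (rhs) op->children.push_back(std::move(rhs));
    return op;
}

// Single traversal: streaming operators join the current pipeline; the
// build side of a pipeline breaker spawns a new pipeline, recorded as a
// dependency of the current one.
void CreatePipelines(const Op &op, size_t current, std::vector<Pipeline> &pipelines) {
    pipelines[current].ops.push_back(op.name);
    for (size_t i = 0; i < op.children.size(); i++) {
        if (op.is_pipeline_breaker && i > 0) {
            pipelines.emplace_back();  // new build pipeline for the RHS
            size_t child = pipelines.size() - 1;
            pipelines[current].dependencies.push_back(child);
            CreatePipelines(*op.children[i], child, pipelines);
        } else {
            CreatePipelines(*op.children[i], current, pipelines);
        }
    }
}
```

For the S ⋈ R ⋈ T query from the earlier slides this yields one main (probe) pipeline plus two build pipelines that it depends on.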
Parallelism in DuckDB

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

Encounter hash join: create build pipeline in RHS* and create a dependency in main pipeline

* This image is taken from HyPer, which builds on the LHS - we build on the RHS.
Is there a standard? Should we switch this? Is it even important?
Parallelism in DuckDB

SELECT …
FROM S
JOIN R USING (A)
JOIN T USING (B);

Another hash join: create another build pipeline and dependency
Parallelism in DuckDB

TPC-H Q1

P1 (depends on P2)
Scans the aggregate HT!

P2

This 0 is a bug in our profiler with parallel execution atm, TODO


Parallelism in DuckDB

▸ Notes on parallelism
▸ The final pipeline (i.e. the one that outputs results) is not parallelized

▸ Doesn’t matter for TPC-H (there is always a Top-N or ORDER BY…)

▸ But can definitely matter for other queries!

▸ We can push a “materialize” operator that materializes in parallel

▸ Future work!
Parallelism in DuckDB

▸ Notes on load balancing

▸ Pipelines are split into tasks
▸ Tasks are scheduled in a concurrent queue
▸ Worker threads work on these tasks in scheduled order

▸ Except the calling thread: this thread works on its own query

▸ Short queries will not have to wait for long queries

▸ Every query has at least one thread working on it
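The queue-plus-workers scheme can be sketched as follows. A mutex-guarded queue keeps the sketch short where a real scheduler (DuckDB included) would use a proper concurrent queue; the class and its methods are illustrative:

```cpp
#include <cassert>
#include <atomic>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Tasks go into a shared queue; worker threads pop them in scheduled
// order until the queue is drained.
class TaskScheduler {
public:
    void Schedule(std::function<void()> task) {
        std::lock_guard<std::mutex> guard(lock_);
        tasks_.push(std::move(task));
    }

    // Run `n_threads` workers until no tasks remain.
    void Run(size_t n_threads) {
        std::vector<std::thread> workers;
        for (size_t i = 0; i < n_threads; i++) {
            workers.emplace_back([this]() {
                while (auto task = Pop()) {
                    task();
                }
            });
        }
        for (auto &w : workers) {
            w.join();
        }
    }

private:
    std::function<void()> Pop() {
        std::lock_guard<std::mutex> guard(lock_);
        if (tasks_.empty()) return nullptr;  // empty function = stop
        auto task = std::move(tasks_.front());
        tasks_.pop();
        return task;
    }

    std::mutex lock_;
    std::queue<std::function<void()>> tasks_;
};
```

In the scheme above, the calling thread would additionally keep pulling only its own query's tasks, which is what guarantees every query at least one worker.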
Parallelism in DuckDB

▸ NUMA Awareness
▸ TODO :)
Preliminary Results

▸ Results
▸ Before we implemented splitting of scans, we were curious:

▸ How much does TPC-H benefit from inter-pipeline parallelization?
Preliminary Results

▸ Small speedup in some queries


▸ Most queries are dominated by a single pipeline!

* Actually 3 threads, due to an off-by-one :)


Preliminary Results

▸ Preliminary results (including splitting of pipelines)


▸ Notes
▸ We did not implement a good aggregate HT yet!
▸ Currently a global HT that is locked on every Sink call

▸ The join HT/scan also have a (low) amount of contention

▸ Did not have much time to look at it yet
▸ This was all finished last Thursday :)
Preliminary Results

▸ Preliminary results
Preliminary Results

▸ Preliminary results
Preliminary Results
▸ Q1
Parallel Sequential
Preliminary Results
▸ Q18 Sequential
Preliminary Results
▸ Q18 Parallel
Future Work

▸ Future Work
▸ Rework aggregate hash table
▸ More profiling of contention (specifically in scans)
▸ Parallel window functions, ORDER BY, Top-N…
▸ Parallel Parquet/CSV/Pandas scans
▸ Expand profiler to better display parallelism/pipelines
