Query Execution: Intro To Database Systems Andy Pavlo

The document discusses parallel query execution in databases. It describes three process models for parallel execution: process per worker, process pool, and thread per worker. It also discusses inter-query and intra-query parallelism. Intra-query parallelism can be achieved through intra-operator parallelism, where operators are decomposed into fragments that operate on different data subsets, or through inter-operator parallelism by pipelining operators.


Query Execution

13 Part II

Intro to Database Systems
15-445/15-645
Fall 2019

Andy Pavlo
Computer Science
Carnegie Mellon University

ADMINISTRIVIA

Homework #3 is due Today @ 11:59pm

Mid-Term Exam is Wed Oct 16th @ 12:00pm

Project #2 is due Sun Oct 20th @ 11:59pm

CMU 15-445/645 (Fall 2019)

QUERY EXECUTION

We discussed last class how to compose operators together to execute a query plan.

We assumed that the queries execute with a single worker (e.g., thread).

We now need to talk about how to execute with multiple workers…

SELECT R.id, S.cdate
  FROM R JOIN S
    ON R.id = S.id
 WHERE S.value > 100

Query plan (root first):
π R.id, S.cdate
  ⨝ R.id=S.id
    R
    σ value>100
      S
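As a point of reference before adding more workers, the plan above can be sketched as a chain of Python generators that a single worker pulls rows through. This is only an illustrative sketch: the operator functions (`scan`, `select`, `hash_join`, `project`, `run_query`) and the dictionary row layout are assumptions made for this example, not part of the lecture.

```python
def scan(table):
    """Leaf operator: emit every row of a table."""
    yield from table

def select(rows, predicate):
    """Selection: keep only the rows that satisfy the predicate."""
    return (row for row in rows if predicate(row))

def hash_join(outer, inner, key):
    """Join: build a hash table on the outer input, then probe it with the inner."""
    table = {}
    for r in outer:
        table.setdefault(r[key], []).append(r)
    for s in inner:
        for r in table.get(s[key], []):
            yield r, s

def project(pairs):
    """Projection: emit R.id and S.cdate for each joined pair."""
    for r, s in pairs:
        yield r["id"], s["cdate"]

def run_query(R, S):
    """Single worker: one thread of control drains the whole operator tree."""
    filtered_s = select(scan(S), lambda s: s["value"] > 100)
    return list(project(hash_join(scan(R), filtered_s, "id")))
```

Every operator here runs on the calling thread, which is the single-worker assumption the rest of the lecture relaxes.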

WHY CARE ABOUT PARALLEL EXECUTION?

Increased performance.
→ Throughput
→ Latency

Increased responsiveness and availability.

Potentially lower total cost of ownership (TCO).


PARALLEL VS. DISTRIBUTED

Database is spread out across multiple resources to improve different aspects of the DBMS.

Appears as a single database instance to the application.
→ SQL query for a single-resource DBMS should generate the same result on a parallel or distributed DBMS.


PARALLEL VS. DISTRIBUTED

Parallel DBMSs:
→ Resources are physically close to each other.
→ Resources communicate with high-speed interconnect.
→ Communication is assumed to be cheap and reliable.

Distributed DBMSs:
→ Resources can be far from each other.
→ Resources communicate using slow(er) interconnect.
→ Communication cost and problems cannot be ignored.


TODAY'S AGENDA

Process Models
Execution Parallelism
I/O Parallelism


PROCESS MODEL

A DBMS’s process model defines how the system is architected to support concurrent requests from a multi-user application.

A worker is the DBMS component that is responsible for executing tasks on behalf of the client and returning the results.


PROCESS MODELS

Approach #1: Process per DBMS Worker

Approach #2: Process Pool

Approach #3: Thread per DBMS Worker


PROCESS PER WORKER

Each worker is a separate OS process.

→ Relies on OS scheduler.
→ Uses shared memory for global data structures.
→ A process crash doesn’t take down entire system.
→ Examples: IBM DB2, Postgres, Oracle

[Figure: Dispatcher, Worker]
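A minimal sketch of the process-per-worker idea, using Python's `multiprocessing` module. It assumes a POSIX system where the "fork" start method is available. The `worker` and `dispatch` functions and the toy "query" (count the values above 100) are hypothetical names chosen for illustration, and no real DBMS is claimed to be structured exactly this way.

```python
import multiprocessing as mp

# Assumption: POSIX system, so the "fork" start method exists.
ctx = mp.get_context("fork")

def worker(query_id, rows, queries_done, results):
    """Hypothetical worker: executes one query inside its own OS process."""
    matches = sum(1 for value in rows if value > 100)
    # Global state lives in shared memory and must be synchronized explicitly.
    with queries_done.get_lock():
        queries_done.value += 1
    results.put((query_id, matches))

def dispatch(queries):
    """Hypothetical dispatcher: starts one new process per incoming query."""
    queries_done = ctx.Value("i", 0)   # integer counter in shared memory
    results = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(qid, rows, queries_done, results))
        for qid, rows in queries.items()
    ]
    for p in procs:
        p.start()
    # Drain the result queue before joining so no worker blocks on a full pipe.
    answers = dict(results.get() for _ in procs)
    for p in procs:
        p.join()
    return answers, queries_done.value
```

The OS schedules the processes, and a crash in one worker would not by itself corrupt the address space of the others. The price is that anything shared, such as the counter here, has to be placed in shared memory deliberately.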

PROCESS POOL

A worker uses any process that is free in a pool.

→ Still relies on OS scheduler and shared memory.
→ Bad for CPU cache locality.
→ Examples: IBM DB2, Postgres (2015)

[Figure: Dispatcher, Worker Pool]
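The pool variant can be sketched by starting a fixed number of processes up front and letting whichever one is free pick up the next task from a shared queue. As before, this assumes a POSIX "fork" start method. The names `pool_process` and `run_with_pool` and the toy query are illustrative assumptions.

```python
import multiprocessing as mp

# Assumption: POSIX system, so the "fork" start method exists.
ctx = mp.get_context("fork")

def pool_process(tasks, results):
    """Hypothetical pool member: loops, taking whatever task is available next."""
    while True:
        task = tasks.get()
        if task is None:               # sentinel: no more work for this process
            break
        query_id, rows = task
        results.put((query_id, sum(1 for value in rows if value > 100)))

def run_with_pool(queries, pool_size=2):
    """Hypothetical dispatcher: hands queries to a fixed-size pool of processes."""
    tasks, results = ctx.Queue(), ctx.Queue()
    pool = [ctx.Process(target=pool_process, args=(tasks, results))
            for _ in range(pool_size)]
    for p in pool:
        p.start()
    for item in queries.items():
        tasks.put(item)
    for _ in pool:
        tasks.put(None)                # one sentinel per pool process
    answers = dict(results.get() for _ in queries)
    for p in pool:
        p.join()
    return answers
```

Because consecutive tasks of the same client may land on different processes, nothing guarantees that a task finds its data warm in the cache of the core it runs on, which is the locality concern the slide mentions.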

THREAD PER WORKER

Single process with multiple worker threads.

→ DBMS manages its own scheduling.
→ May or may not use a dispatcher thread.
→ Thread crash (may) kill the entire system.
→ Examples: IBM DB2, MSSQL, MySQL, Oracle (2014)

[Figure: Worker Threads]
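For contrast, a sketch of thread per worker: one process, several worker threads, and a task queue filled by the caller acting as dispatcher. The names `worker_thread` and `run_with_threads` are hypothetical. In CPython the global interpreter lock limits CPU parallelism between threads, so this illustrates the structure only, not a speedup.

```python
import queue
import threading

def worker_thread(tasks, results, results_lock):
    """Hypothetical worker: a thread inside the single DBMS process."""
    while True:
        task = tasks.get()
        if task is None:               # sentinel: shut this worker down
            break
        query_id, rows = task
        count = sum(1 for value in rows if value > 100)
        # Threads share one address space, so a plain dict is enough.
        with results_lock:
            results[query_id] = count

def run_with_threads(queries, num_workers=2):
    """The caller plays the dispatcher and feeds tasks to the worker threads."""
    tasks = queue.Queue()
    results = {}
    results_lock = threading.Lock()
    workers = [
        threading.Thread(target=worker_thread, args=(tasks, results, results_lock))
        for _ in range(num_workers)
    ]
    for t in workers:
        t.start()
    for item in queries.items():
        tasks.put(item)
    for _ in workers:
        tasks.put(None)                # one sentinel per worker thread
    for t in workers:
        t.join()
    return results
```

No shared-memory segment has to be set up here, since every thread already sees the same `results` dictionary. The flip side is that an unhandled fault in one thread can affect the whole process.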

PROCESS MODELS

Using a multi-threaded architecture has several advantages:
→ Less overhead per context switch.
→ Do not have to manage shared memory.

The thread per worker model does not mean that the DBMS supports intra-query parallelism.

Andy is not aware of any new DBMS from the last 10 years that doesn’t use threads unless they are Postgres forks.

SCHEDULING

For each query plan, the DBMS decides where, when, and how to execute it.
→ How many tasks should it use?
→ How many CPU cores should it use?
→ What CPU core should the tasks execute on?
→ Where should a task store its output?

The DBMS always knows more than the OS.


INTER- VS. INTRA-QUERY PARALLELISM

Inter-Query: Different queries are executed concurrently.
→ Increases throughput & reduces latency.

Intra-Query: Execute the operations of a single query in parallel.
→ Decreases latency for long-running queries.

INTER-QUERY PARALLELISM

Improve overall performance by allowing multiple queries to execute simultaneously.

If queries are read-only, then this requires little coordination between queries.

If multiple queries are updating the database at the same time, then this is hard to do correctly…
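The difference between read-only and updating queries can be illustrated with a small sketch. The `accounts` table, the two query functions, and `run_concurrently` are hypothetical, and the single lock is only a stand-in for real concurrency control, which is far more involved.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

accounts = {"A": 100, "B": 200}        # hypothetical shared table
accounts_lock = threading.Lock()       # only the updating query needs it

def read_balance(name):
    """Read-only query: needs no coordination while nobody is writing."""
    return accounts[name]

def deposit(name, amount):
    """Updating query: the read-modify-write must not interleave with another."""
    with accounts_lock:
        accounts[name] = accounts[name] + amount

def run_concurrently(queries, num_workers=4):
    """Inter-query parallelism: each query runs as its own task."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(fn, *args) for fn, args in queries]
        return [f.result() for f in futures]
```

For example, `run_concurrently([(read_balance, ("A",)), (read_balance, ("B",))])` runs two reads side by side without any locking. Removing the lock from `deposit` would allow concurrent deposits to overwrite each other's result.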

INTRA-QUERY PARALLELISM

Improve the performance of a single query by executing its operators in parallel.

Think of the organization of operators in terms of a producer/consumer paradigm.

There are parallel algorithms for every relational operator.
→ Can either have multiple threads access centralized data structures or use partitioning to divide work up.
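The partitioning option can be sketched as follows: the input is split into disjoint partitions, each worker runs the same operator fragment on its own partition, and a final step consumes the partial results. The function names and the toy query (count the values above 100) are assumptions for illustration. Threads are used for brevity, and CPython's global interpreter lock means this shows the structure rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def partition(rows, num_partitions):
    """Divide the input into disjoint subsets (round-robin here)."""
    return [rows[i::num_partitions] for i in range(num_partitions)]

def count_fragment(rows):
    """Producer: one fragment of the filter-and-count operator on one partition."""
    return sum(1 for value in rows if value > 100)

def parallel_count(rows, num_workers=3):
    """Consumer: combine the partial counts produced by the fragments."""
    parts = partition(rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_fragment, parts))
    return sum(partial_counts)
```

Since the partitions are disjoint, the fragments need no coordination until the final combine step. The centralized alternative would have all workers update one shared counter under a lock.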

PARALLEL GRACE HASH JOIN

Use a separate worker to perform the join for each level of buckets for R and S after partitioning.

[Figure: R(id,name) and S(id,value,cdate) are each partitioned with hash function h1 into buckets 0, 1, 2, …, max of hash tables HT_R and HT_S.]

[Figure, continued: workers 1, 2, 3, …, n are each assigned one bucket level (0, 1, 2, …, max) of HT_R and HT_S.]
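A rough sketch of the idea in Python: both relations are partitioned on the join key with the same hash function `h1`, and then one worker joins each pair of matching buckets. The helper names, the dictionary row layout, and the use of threads are assumptions for illustration. A real Grace hash join would also spill partitions to disk, which is omitted here.

```python
from concurrent.futures import ThreadPoolExecutor

def h1(key, num_buckets):
    """Partitioning hash function shared by both relations."""
    return hash(key) % num_buckets

def partition(rows, key, num_buckets):
    """Phase 1: scatter rows into buckets by h1 of the join key."""
    buckets = [[] for _ in range(num_buckets)]
    for row in rows:
        buckets[h1(row[key], num_buckets)].append(row)
    return buckets

def join_bucket(r_bucket, s_bucket):
    """Phase 2, one worker: join a single level of buckets from R and S."""
    table = {}
    for r in r_bucket:
        table.setdefault(r["id"], []).append(r)
    return [(r, s) for s in s_bucket for r in table.get(s["id"], [])]

def parallel_grace_hash_join(R, S, num_buckets=4):
    ht_r = partition(R, "id", num_buckets)
    ht_s = partition(S, "id", num_buckets)
    with ThreadPoolExecutor(max_workers=num_buckets) as pool:
        # Bucket i of R can only match bucket i of S, so levels are independent.
        per_bucket = pool.map(join_bucket, ht_r, ht_s)
        return [pair for bucket in per_bucket for pair in bucket]
```

Matching keys always hash to the same bucket number in both relations, so the workers never need to look at each other's buckets.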

INTRA-QUERY PARALLELISM