Parallel Databases
Introduction
Parallel machines are becoming quite common and affordable
Prices of microprocessors, memory and disks have dropped
sharply
Recent desktop computers feature multiple processors and this
trend is projected to accelerate
Databases are growing increasingly large
large volumes of transaction data are collected and stored for later
analysis.
multimedia objects like images are increasingly stored in
databases
Large-scale parallel database systems are increasingly used for:
storing large volumes of data
processing time-consuming decision-support queries
providing high throughput for transaction processing
Parallelism in Databases
Data can be partitioned across multiple disks for parallel I/O.
Individual relational operations (e.g., sort, join, aggregation) can be
executed in parallel
Queries are expressed in a high-level language (SQL, translated to
relational algebra),
which makes parallelization easier.
Different queries can be run in parallel with each other.
Concurrency control takes care of conflicts.
Partitioning
Types of partitioning
Horizontal partitioning – tuples of a relation are divided among many
disks such that each tuple resides on one disk.
Vertical partitioning – the schema of a relation is divided among many disks
such that the data fields of each tuple are split and stored on
different disks.
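The two partitioning types can be contrasted with a minimal sketch. The `account` relation, the function names, and the choice to repeat the key column in every vertical fragment are assumptions made for illustration; they do not come from the slides.

```python
# Hypothetical relation: each tuple is (id, name, balance).
account = [
    (1, "Ann", 500),
    (2, "Bob", 300),
    (3, "Cho", 900),
    (4, "Dev", 100),
]

def horizontal_partition(tuples, n):
    """Spread whole tuples over n disks; each tuple lands on exactly one disk."""
    disks = [[] for _ in range(n)]
    for i, t in enumerate(tuples):
        # The placement rule (position modulo n) is arbitrary for this sketch.
        disks[i % n].append(t)
    return disks

def vertical_partition(tuples, column_groups):
    """Split the fields of each tuple into column groups, one group per disk.

    Assumption: column 0 is a key and is repeated in every fragment so the
    original tuple can be reassembled later.
    """
    disks = []
    for cols in column_groups:
        disks.append([(t[0],) + tuple(t[c] for c in cols) for t in tuples])
    return disks

h = horizontal_partition(account, 2)          # whole tuples, two disks
v = vertical_partition(account, [[1], [2]])   # names on disk 0, balances on disk 1
```

Here `h[0]` holds complete tuples, while `v[0]` holds only the (id, name) fragment of every tuple.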
Partitioning
Partitioning techniques (number of disks = n):
Round-robin:
Send the i-th tuple inserted in the relation to disk i mod n.
Hash partitioning:
Choose one or more attributes as the partitioning attributes.
Choose hash function h with range 0…n - 1
Let i denote result of hash function h applied to the partitioning
attribute value of a tuple. Send tuple to disk i.
Range partitioning:
Choose an attribute as the partitioning attribute.
A partitioning vector [v_0, v_1, ..., v_{n-2}] is chosen.
Let v be the partitioning attribute value of a tuple. Tuples such that v_i ≤ v < v_{i+1} go to
disk i + 1. Tuples with v < v_0 go to disk 0 and tuples with v ≥ v_{n-2} go to disk n - 1.
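The three techniques can each be written as a function that maps a tuple to a disk number. This is a sketch under stated assumptions: the function names are hypothetical, tuples are counted from 0, and Python's built-in `hash()` stands in for the hash function h.

```python
import bisect

def round_robin_disk(i, n):
    """Disk for the i-th inserted tuple (i counted from 0)."""
    return i % n

def hash_disk(value, n):
    """Disk for a tuple whose partitioning attribute value is `value`.

    Assumption: hash() plays the role of h. Note that Python randomizes
    string hashes per process, so a real system would use a stable function.
    """
    return hash(value) % n

def range_disk(value, vector):
    """Disk for `value` given a sorted partitioning vector [v_0, ..., v_{n-2}]."""
    # bisect_right counts the vector entries <= value, which gives:
    #   0      when value < v_0
    #   i + 1  when v_i <= value < v_{i+1}
    #   n - 1  when value >= v_{n-2}
    return bisect.bisect_right(vector, value)
```

For instance, with the vector `[5, 11]` (three disks), `range_disk(2, [5, 11])` is 0, `range_disk(8, [5, 11])` is 1, and `range_disk(20, [5, 11])` is 2.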
Interquery Parallelism
Queries/transactions execute in parallel with one another.
Increases transaction throughput; used primarily to scale up a transaction
processing system to support a larger number of transactions per second.
Easiest form of parallelism to support, particularly in a shared-memory
parallel database, because even sequential database systems support
concurrent processing.
Intraquery Parallelism
Execution of a single query in parallel on multiple processors/disks;
important for speeding up long-running queries.
Two complementary forms of intraquery parallelism:
Intraoperation Parallelism – parallelize the execution of each individual
operation in the query.
Interoperation Parallelism – execute the different operations in a query
expression in parallel.
The first form scales better with increasing parallelism because
the number of tuples processed by each operation is typically larger than the
number of operations in a query.
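As an illustration of intraoperation parallelism, a single aggregation can be split across the partitions of a relation. The sketch below assumes the data is already partitioned and uses Python threads only to show the structure; the names `partial_sum` and `parallel_sum` are hypothetical, and in CPython threads would not speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(partition):
    """Aggregate one partition; runs independently of the others."""
    return sum(partition)

def parallel_sum(partitions):
    """Sum all values by aggregating each partition in its own worker."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        partials = list(pool.map(partial_sum, partitions))
    # Combine the per-partition results into the final answer.
    return sum(partials)

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]  # one list per disk
total = parallel_sum(partitions)
```

The degree of parallelism here equals the number of partitions, which can grow with the data, whereas the number of operations in a query stays small.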
Interoperation Parallelism
Pipelined parallelism
Consider a join of four relations
r1 ⋈ r2 ⋈ r3 ⋈ r4
Set up a pipeline that computes the three joins in parallel
Let P1 be assigned the computation of
temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = temp1 ⋈ r3
And P3 be assigned the computation of temp2 ⋈ r4
Each of these operations can execute in parallel, sending result
tuples it computes to the next operation even as it is computing
further results
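The pipeline can be sketched with one thread per join and queues between them. This is an illustration under assumptions, not a description of any particular system: relations are lists of tuples joined on their first field, and `join_stage`, `pipelined_join`, and the `DONE` sentinel are hypothetical names.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a tuple stream

def join_stage(inbox, inner, outbox):
    """Join each incoming tuple with `inner` on the first field.

    Result tuples are passed on as soon as they are produced, so the next
    stage can start before this one has finished.
    """
    index = {}
    for key, value in inner:
        index.setdefault(key, []).append(value)
    while True:
        t = inbox.get()
        if t is DONE:
            outbox.put(DONE)
            return
        for value in index.get(t[0], []):
            outbox.put(t + (value,))

def pipelined_join(r1, r2, r3, r4):
    q1, q2, q3, q4 = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=join_stage, args=(q1, r2, q2)),  # P1: r1 join r2
        threading.Thread(target=join_stage, args=(q2, r3, q3)),  # P2: temp1 join r3
        threading.Thread(target=join_stage, args=(q3, r4, q4)),  # P3: temp2 join r4
    ]
    for s in stages:
        s.start()
    for t in r1:          # stream r1 into the first stage
        q1.put(t)
    q1.put(DONE)
    result = []
    while True:           # collect the output of the last stage
        t = q4.get()
        if t is DONE:
            break
        result.append(t)
    for s in stages:
        s.join()
    return result
```

With three joins, at most three stages run at once, which is why pipelining alone gives only a limited degree of parallelism.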
Independent Parallelism
Independent parallelism
Consider a join of four relations
r1 ⋈ r2 ⋈ r3 ⋈ r4
Let P1 be assigned the computation of
temp1 = r1 ⋈ r2
And P2 be assigned the computation of temp2 = r3 ⋈ r4
And P3 be assigned the computation of temp1 ⋈ temp2
P1 and P2 can work independently in parallel
P3 has to wait for input from P1 and P2
Can pipeline output of P1 and P2 to P3, combining
independent parallelism and pipelined parallelism
Does not provide a high degree of parallelism:
useful with a lower degree of parallelism,
less useful in a highly parallel system.
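The plan above can be sketched with two workers that compute temp1 and temp2 at the same time, followed by a final join that waits for both. The tuple layout (join on the first field) and the names `hash_join` and `independent_join` are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor

def hash_join(left, right):
    """Join two lists of tuples on their first field."""
    index = {}
    for t in right:
        index.setdefault(t[0], []).append(t[1:])
    return [l + rest for l in left for rest in index.get(l[0], [])]

def independent_join(r1, r2, r3, r4):
    with ThreadPoolExecutor(max_workers=2) as pool:
        f1 = pool.submit(hash_join, r1, r2)  # P1: temp1 = r1 join r2
        f2 = pool.submit(hash_join, r3, r4)  # P2: temp2 = r3 join r4
        # P3 cannot start until both of its inputs are available.
        temp1, temp2 = f1.result(), f2.result()
    return hash_join(temp1, temp2)           # P3: temp1 join temp2
```

Only the first two joins overlap, so the degree of parallelism is bounded by the shape of the query rather than by the size of the data.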
Design of Parallel Systems
Some issues in the design of parallel systems:
Parallel loading of data from external sources is needed in order
to handle large volumes of incoming data.
Resilience to failure of some processors or disks.
Probability of some disk or processor failing is higher in a parallel
system.
Operation (perhaps with degraded performance) should be possible
in spite of failure.
Redundancy is achieved by storing an extra copy of every data item at
another processor.
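One simple way to place the extra copy is to put it on the next site in numerical order. The slides do not prescribe a placement rule, so the "next site" choice and the names `replica_sites` and `site_to_read` are assumptions of this sketch.

```python
def replica_sites(primary, n):
    """Return (primary, backup) sites for a data item stored at `primary`.

    Assumption: the backup copy goes to the next site, wrapping around at n.
    """
    if n < 2:
        raise ValueError("need at least two sites to hold a second copy")
    return primary, (primary + 1) % n

def site_to_read(primary, n, failed):
    """Pick a live site holding the item, or None if both copies are lost."""
    for site in replica_sites(primary, n):
        if site not in failed:
            return site
    return None
```

With four sites, an item whose primary copy is on site 3 is still readable from site 0 when site 3 fails, at the cost of extra load on site 0.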