
Distributed Databases

By
Sudarshan

MCA Sem V 12/10/2007


Distributed Database Design

• Three key issues:


– Fragmentation
• Relation may be divided into a number of sub-
relations, which are then distributed.

– Allocation
• Each fragment is stored at the site with “optimal”
distribution.

– Replication.
• Copy of fragment may be maintained at several
sites.
Data Allocation

• An allocation schema describes the


allocation of fragments to sites of the
DDBs.
• It is a mapping that specifies, for each
fragment, the site at which it is stored.
• If a fragment is stored at more than one
site it is said to be replicated.
Distributed Catalog Management

• Must keep track of how data is


distributed across sites.
• Must be able to name each replica of
each fragment. To preserve local
autonomy:
– <local-name, birth-site>
• Site Catalog: Describes all objects
(fragments, replicas) at a site + Keeps
track of replicas of relations created at
this site.
– To find a relation, look up its birth-site
catalog.
Data Replication

• Fully replicated : each fragment at


each site
• Partially replicated : each fragment
at some of the sites
• Types of replication
– Synchronous Replication
– Asynchronous Replication

• Rule of thumb:

– If reads are much more frequent than updates,
replication is advantageous,

– otherwise replication may cause more problems
than it solves.


Synchronous Replication

• All copies of a modified relation


(fragment) must be updated before the
modifying transaction commits.
– Data distribution is made transparent to
users.
• 2 techniques for synchronous replication
– Voting
– Read-any Write-all
Asynchronous Replication

• Allows modifying transaction to commit


before all copies have been changed (and
readers nonetheless look at just one copy).
• Copies of a modified relation are only
periodically updated; different copies may get
out of synch in the meantime.
– Users must be aware of data distribution.
– Current products follow this approach.
• 2 techniques for asynchronous replication
– Primary Site Replication
– Peer-to-Peer Replication
• Difference lies in how many copies are
“updatable” or “master copies”.
Techniques for Synchronous Replication

• Voting: Transactions must write a majority of


copies to modify an object; must read enough
copies to be sure of seeing at least one most
recent copy.
– E.g., 10 copies; 7 written for update; 4 copies read.
– Each copy has version number.
– Not attractive usually because reads are common.
• Read-any Write-all: Writes are slower and
reads are faster, relative to Voting.
– Most common approach to synchronous
replication.
• Choice of technique determines which locks to
set.
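
A minimal sketch (not from the slides) of the quorum arithmetic behind Voting, assuming N copies with per-copy version numbers: any write quorum W with W > N/2 and read quorum R with R + W > N guarantees that every read set contains at least one most recent copy.

# Hypothetical quorum check for the Voting technique (illustrative only).
def quorums_are_valid(n_copies, write_quorum, read_quorum):
    # A write must reach a majority, and read + write quorums must overlap,
    # so every read set contains at least one copy with the latest version.
    return write_quorum > n_copies // 2 and read_quorum + write_quorum > n_copies

def latest_copy(copies_read):
    # copies_read: list of (version_number, value); pick the newest version.
    return max(copies_read, key=lambda c: c[0])

assert quorums_are_valid(10, 7, 4)        # the 10-copy example from the slide
assert not quorums_are_valid(10, 7, 3)    # 7 + 3 = 10 does not guarantee overlap
print(latest_copy([(3, "old"), (5, "new"), (4, "mid")]))   # -> (5, 'new')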
Primary Site Replication

• Exactly one copy of a relation is designated


the primary or master copy. Replicas at other
sites cannot be directly updated.
– The primary copy is published.
– Other sites subscribe to (fragments of) this
relation; these are secondary copies.

• Main issue: How are changes to the primary


copy propagated to the secondary copies?
– Done in two steps.
• First, capture changes made by committed transactions;
• Then apply these changes.
Implementing the Capture Step

• Log-Based Capture: The log (kept for


recovery) is used to generate a Change
Data Table (CDT).
– If this is done when the log tail is written to
disk, it must somehow remove changes due
to subsequently aborted transactions.
• Procedural Capture: A procedure that is
automatically invoked (trigger) does the
capture; typically, just takes a snapshot.
• Log-Based Capture is better (cheaper,
faster) but relies on proprietary log
details.
Implementing the Apply Step

• The Apply process at the secondary site


periodically obtains (a snapshot or) changes
to the CDT table from the primary site, and
updates the copy.
– Period can be timer-based or user/application
defined.
• Replica can be a view over the modified
relation!
– If so, the replication consists of incrementally
updating the materialized view as the relation
changes.

• Log-Based Capture plus continuous Apply


minimizes delay in propagating changes.
• Procedural Capture plus application-driven
Apply is the most flexible way to process changes.
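
A minimal, hypothetical sketch of an application-driven Apply step, using an in-memory SQLite table as the secondary copy; the change-data rows, table schema, and function names are invented for illustration and do not reflect any particular product.

# Illustrative Apply loop for primary-site replication (sketch only).
import sqlite3

def apply_changes(secondary, cdt_rows):
    # cdt_rows: iterable of (op, sid, sname) captured at the primary site.
    for op, sid, sname in cdt_rows:
        if op == "INSERT":
            secondary.execute("INSERT INTO Sailors(sid, sname) VALUES (?, ?)", (sid, sname))
        elif op == "DELETE":
            secondary.execute("DELETE FROM Sailors WHERE sid = ?", (sid,))
    secondary.commit()

secondary = sqlite3.connect(":memory:")
secondary.execute("CREATE TABLE Sailors(sid INTEGER PRIMARY KEY, sname TEXT)")
# One timer-based round of Apply with a hand-made change batch:
apply_changes(secondary, [("INSERT", 1, "Dustin"), ("INSERT", 2, "Lubber")])
print(secondary.execute("SELECT * FROM Sailors").fetchall())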
Peer-to-Peer Replication

• More than one of the copies of an object


can be a master in this approach.
• Changes to a master copy must be
propagated to other copies somehow.
• If two master copies are changed in a
conflicting manner, this must be
resolved. (e.g., Site 1: Joe’s age
changed to 35; Site 2: to 36)
• Best used when conflicts do not arise:
– E.g., Each master site owns a disjoint
fragment.
– E.g., Updating rights owned by exactly one master at a time.
Distributed Query Processing
Sailors (sid : int, sname : str, rating : int, age : int)
Reserves (sid : int, bid : int, day : date, rname : string)

SELECT AVG(S.age)
FROM Sailors S
WHERE S.rating > 3 AND S.rating < 7
• Horizontally Fragmented: Tuples with rating < 5 at
Mumbai, >= 5 at Delhi.
– Must compute SUM(age), COUNT(age) at both sites.
– If WHERE contained just S.rating>6, just one site.
• Vertically Fragmented: sid and rating at Mumbai,
sname and age at Delhi, tid at both.
– Must reconstruct relation by join on tid, then evaluate the
query.
• Replicated: Sailors copies at both sites.
– Choice of site based on local costs, shipping costs.
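
A small illustrative sketch (data invented) of why each site returns SUM(age) and COUNT(age) rather than AVG(age): the partial aggregates from the Mumbai and Delhi fragments combine into the global average.

# Illustrative only: combine partial aggregates from horizontally fragmented Sailors.
mumbai_ages = [22.0, 31.5]        # tuples with rating 4 (rating < 5 stored at Mumbai)
delhi_ages = [25.0, 40.0, 33.0]   # tuples with rating 5 or 6 (rating >= 5 at Delhi)

# Each site evaluates the WHERE clause locally and returns (SUM(age), COUNT(age)).
partials = [(sum(mumbai_ages), len(mumbai_ages)), (sum(delhi_ages), len(delhi_ages))]

total_sum = sum(s for s, _ in partials)
total_cnt = sum(c for _, c in partials)
print(total_sum / total_cnt)      # global AVG(S.age) for 3 < rating < 7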
Distributed Query Optimization

• Cost-based approach; consider all plans,


pick cheapest; similar to centralized
optimization.
– Difference 1: Communication costs must be
considered.
– Difference 2: Local site autonomy must be
respected.
– Difference 3: New distributed join methods.
• Query site constructs global plan, with
suggested local plans describing
processing at each site.
Distributed Query Processing – Example

• This example query retrieves, for each
employee, the employee name and the name
of the employee’s department; it returns
10,000 records (every employee belongs to
one department).
• Each record will be 40 bytes long
(FNAME + LNAME + DNAME = 15 + 15
+ 10 = 40).
• Thus the result set will be 400,000
bytes.
• Assume cost to transfer query text
between nodes can be safely ignored.
Distributed Query Processing – Example

• Three alternatives:
– Copy all EMPLOYEE and DEPARTMENT records to
node 3. Perform the join and display the results.
Total Cost = 1,000,000 + 3,500 = 1,003,500 bytes
– Copy all EMPLOYEE records (1,000,000 bytes) from
node 1 to node 2. Perform the join, then ship the
results (400,000 bytes) to node 3.
Total cost = 1,000,000 + 400,000 = 1,400,000
bytes
– Copy all DEPARTMENT records (3,500) from node 2
to node 1. Perform the join. Ship the results from
node 1 to node 3 (400,000).
Total cost = 3,500 + 400,000 = 403,500 bytes
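
The three totals can be checked with a few lines of arithmetic, using the sizes implied by the slide (EMPLOYEE 1,000,000 bytes at node 1, DEPARTMENT 3,500 bytes at node 2, join result 400,000 bytes, answer needed at node 3):

# Bytes transferred for each strategy (query result is needed at node 3).
EMP, DEPT, RESULT = 1_000_000, 3_500, 400_000

ship_both_to_3 = EMP + DEPT      # join at node 3
ship_emp_to_2  = EMP + RESULT    # join at node 2, ship result to node 3
ship_dept_to_1 = DEPT + RESULT   # join at node 1, ship result to node 3

print(ship_both_to_3, ship_emp_to_2, ship_dept_to_1)   # 1003500 1400000 403500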
Distributed Query Processing – Example

• For each department, retrieve the department
name and the name of the department manager:
∏ FNAME,LNAME,DNAME (DEPARTMENT ⋈ MGRSSN=SSN EMPLOYEE)
• The result contains 100 tuples (one per department), each 40 bytes long.

• Transfer both the relations to site 3 and perform join


Total size = 1,000,000 + 3500 = 1,003,500 bytes
• Transfer EMPLOYEE relation to site 2 and perform join. Send the
results to site 3.
Total size = 1,000,000 + 4000 = 1,004,000 bytes
• Transfer DEPARTMENT relation to site 1 and perform join. Send
the results to site 3.
Total size = 3500 + 4000 = 7500 bytes
Distributed Query Processing – Example

• Taking the same example:


– Copy just the FNAME, LNAME and DNO
columns from Site 1 to Site 3 (cost = 34
bytes times 10,000 records = 340,000
bytes)
– Copy just the DNUMBER and DNAME
columns from site 2 to site 3 (cost = 14
bytes times 100 records = 1,400 bytes)
– Perform the join at site 3 and display the
results.
Total cost = 341,400
Semi-Join

• The semijoin of r1 with r2 is denoted by:

r1 ⋉ r2
• The idea is to reduce the number of tuples in
a relation before transferring it to another
site.
• Send the joining column of one relation (say
r1) to the site where the other relation (say r2)
is located, and perform a join with r2.
• Then the join attribute, along with the other
required attributes, is projected and sent back
to the original site.
• A join operation is performed at this site.
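
A minimal in-memory sketch of the semijoin steps just listed, using the Sailors/Reserves schema from the example that follows; the tuples and the simulated “shipping” are invented for illustration.

# Illustrative semijoin: reduce Reserves by Sailors before shipping it.
sailors  = [(22, "Dustin", 7, 45.0), (31, "Lubber", 8, 55.5)]              # at London
reserves = [(22, 101, "2007-12-10", "r1"), (95, 103, "2007-12-11", "r2")]  # at Paris

# Step 1: project Sailors onto the join column (sid) and "ship" it to Paris.
sailor_sids = {s[0] for s in sailors}

# Step 2: at Paris, keep only Reserves tuples that join -- the reduction of
# Reserves wrt Sailors -- and ship this (smaller) set back to London.
reduced_reserves = [r for r in reserves if r[0] in sailor_sids]

# Step 3: at London, join Sailors with the reduced Reserves.
result = [(s, r) for s in sailors for r in reduced_reserves if s[0] == r[0]]
print(result)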
Semi-Join Example

London Site: Sailors (sid: int, sname: str, rating:


int, age: int)
Paris Site: Reserves (sid: int, bid: int, day: date,
rname: string)

• At London, project Sailors onto join columns


and ship this to Paris.
• At Paris, join Sailors projection with Reserves.
– Result is called reduction of Reserves wrt Sailors.
• Ship reduction of Reserves to London.
• At London, join Sailors with reduction of
Reserves.
• Especially useful if there is a selection on
Sailors, and the answer is desired at London.
Semi-Join Example
• Project the join attribute of DEPARTMENT at site 2,
and transfer it to site 1.
F = ∏ DNUMBER (DEPARTMENT)
Size = 4 * 100 = 400 bytes
• Join the transferred file with the EMPLOYEE relation at site
1, and transfer the required attributes from the
resulting file to site 2.
R = ∏ DNO,FNAME,LNAME (F ⋈ DNUMBER=DNO EMPLOYEE)
Size = 34 * 10,000 = 340,000 bytes
• Execute a join of the transferred file R with DEPARTMENT,
and ship the result to site 3.
Size = 400,000 bytes

Total Size = 400+340,000+400,000 = 740,400 bytes


Semi-Join Example

• Project the join attribute of DEPARTMENT at site 2,
and transfer it to site 1.
F = ∏ MGRSSN (DEPARTMENT)
Size = 9 * 100 = 900 bytes
• Join the transferred file with the EMPLOYEE relation at site
1, and transfer the required attributes from the
resulting file to site 2.
R = ∏ MGRSSN,FNAME,LNAME (F ⋈ MGRSSN=SSN EMPLOYEE)
Size = 39 * 100 = 3,900 bytes
• Execute a join of the transferred file R with DEPARTMENT,
and ship the result to site 3.
Size = 4,000 bytes

Total Size = 900+3,900+4,000 = 8,800 bytes


Joins - Fetch as Needed

• Perform a page-oriented Nested Loops Join in
London with Sailors as the outer: for each
Sailors page, fetch all Reserves pages from Paris.
• Alternatively, cache all the fetched pages.
• Fetch as Needed:
– Cost: 500 D + 500 * 1000 (D+S)
– D is cost to read/write page; S is cost to ship page.
– If query was not submitted at London, must add
cost of shipping result to query site.
• Can also do Indexed Nested Loop Join at
London, fetching matching Reserves tuples to
London as needed.
Joins - Ship to One Site

• Transfer Reserves to London.


– Cost: 1000 S + 4500 D

• Transfer Sailors to Paris.


– Cost: 500 S + 4500 D

• If result size is very large, may be better


to ship both relations to result site and
then join them!
Bloomjoins

• At London, compute a bit-vector of some size


k:
– Hash join column values into range 0 to k-1.
– If some tuple hashes to i, set bit i to 1 (i from 0 to
k-1).
– Ship bit-vector to Paris.
• At Paris, hash each tuple of Reserves
similarly, and discard tuples that hash to 0 in
Sailors bit-vector.
– Result is called reduction of Reserves wrt Sailors.
• Ship bit-vector reduced Reserves to London.
• At London, join Sailors with reduced Reserves.
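
A small sketch of the Bloomjoin bit-vector step; the vector size k, the use of Python’s built-in hash, and the sample sid values are arbitrary choices for illustration.

# Illustrative Bloomjoin: build a k-bit vector over Sailors.sid at London.
k = 16
sailor_sids  = [22, 31, 58]            # join column values at London
reserve_sids = [22, 96, 58, 71]        # join column values at Paris

bits = [0] * k
for sid in sailor_sids:
    bits[hash(sid) % k] = 1            # set bit i if some sid hashes to i

# At Paris: discard Reserves tuples whose sid hashes to a 0 bit.
reduced = [sid for sid in reserve_sids if bits[hash(sid) % k] == 1]
print(reduced)   # -> [22, 58] here; false positives are possible, false negatives are not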
Distributed Transactions

• Distributed Concurrency Control:


– How can locks for objects stored across
several sites be managed?
– How can deadlocks be detected in a
distributed database?
• Distributed Recovery:
– Transaction atomicity must be ensured.
Distributed Transactions
Concurrency Control and Recovery

• Dealing with multiple copies of the data


items
• Failure of individual sites
• Failure of communication links
• Distributed commit
• Distributed deadlock
Distributed Locking

• How do we manage locks for objects across


many sites?
– Centralized: One site does all locking.
• Vulnerable to single site failure.
– Primary Copy: All locking for an object done at
the primary copy site for this object.
• Reading requires access to locking site as well as site
where the object is stored.
– Fully Distributed: Locking for a copy done at site
where the copy is stored.
• Locks at all sites while writing an object.
• Obtaining and releasing of locks is determined
by the concurrency control protocol.
Deadlock Handling

Consider the following two transactions and history, with
item X and transaction T1 at site 1, and item Y and
transaction T2 at site 2:

T1: write(X); write(Y)
T2: write(Y); write(X)

At site 1: T1 obtains an X-lock on X and writes X; T2 then waits for an X-lock on X.
At site 2: T2 obtains an X-lock on Y and writes Y; T1 then waits for an X-lock on Y.

Result: deadlock which cannot be detected locally at either site


Local and Global Wait-For Graphs

[Figure: local wait-for graphs at each site and the corresponding global wait-for graph]
Distributed Deadlock – Solution

• Three solutions:
– Centralized (send all local graphs to one
site);
– Hierarchical (organize sites into a hierarchy
and send local graphs to parent in the
hierarchy);
– Timeout (abort transaction if it waits too
long).
Centralized Approach
• A global wait-for graph is constructed and
maintained in a single site; the deadlock-
detection coordinator
– Real graph: Real, but unknown, state of the
system.
– Constructed graph: Approximation generated by
the controller during the execution of its algorithm.
• The global wait-for graph can be constructed
when:
– a new edge is inserted in or removed from one of
the local wait-for graphs.
– a number of changes have occurred in a local
wait-for graph.
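
A minimal sketch of what the deadlock-detection coordinator does with the merged information: union the local wait-for edges into one constructed global graph and search it for a cycle. The graph contents and function names here are hypothetical.

# Illustrative centralized deadlock detection on a global wait-for graph.
def has_cycle(edges):
    # edges: dict mapping a waiting transaction to the transactions it waits for.
    visiting, done = set(), set()
    def dfs(t):
        if t in visiting:
            return True                # back edge => cycle => deadlock
        if t in done:
            return False
        visiting.add(t)
        if any(dfs(u) for u in edges.get(t, [])):
            return True
        visiting.discard(t)
        done.add(t)
        return False
    return any(dfs(t) for t in list(edges))

site1 = {"T2": ["T1"]}                 # local wait-for graph at site S1
site2 = {"T1": ["T2"]}                 # local wait-for graph at site S2
global_graph = {}
for local in (site1, site2):           # coordinator merges the local graphs
    for t, waits in local.items():
        global_graph.setdefault(t, []).extend(waits)
print(has_cycle(global_graph))         # True: T1 -> T2 -> T1 spans both sites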
Example Wait-For Graph for False Cycles
Initial state: [figure of the local wait-for graphs and the coordinator's constructed global graph omitted]
False Cycles (Cont.)
• Suppose that starting from the state shown in
figure,
1. T2 releases resources at S1
• resulting in a remove T1 → T2 message from
the Transaction Manager at site S1 to the coordinator

2. And then T2 requests a resource held by T3 at


site S2
• resulting in a message insert T2 → T3 from S2 to the
coordinator
• Suppose further that the insert message reaches the
coordinator before the remove message (this can happen
due to network delays)
• The coordinator would then find a false cycle
T1 → T2 → T3 → T1
Unnecessary Rollbacks

• Unnecessary rollbacks may result when


deadlock has indeed occurred and a
victim has been picked, and meanwhile
one of the transactions was aborted for
reasons unrelated to the deadlock.
• Unnecessary rollbacks can result from
false cycles in the global wait-for graph;
however, likelihood of false cycles is
low.
Distributed Recovery

• Two new issues:


– New kinds of failure, e.g., links and remote
sites.
– If “sub-transactions” of a transaction
execute at different sites, all or none must
commit. Need a commit protocol to achieve
this.
• A log is maintained at each site, as in a
centralized DBMS, and commit protocol
actions are additionally logged.
Coordinator Selection

• Backup coordinators
– site which maintains enough information locally to
assume the role of coordinator if the actual
coordinator fails
– executes the same algorithms and maintains the
same internal state information as the actual
coordinator
– allows fast recovery from coordinator failure but
involves overhead during normal processing.
• Election algorithms
– used to elect a new coordinator in case of failures
– Example: Bully Algorithm - applicable to systems
where every site can send a message to every
other site.
Bully Algorithm

• If site Si sends a request that is not answered by the


coordinator within a time interval T, assume that the
coordinator has failed; Si then tries to elect itself as the
new coordinator.
• Si sends an election message to every site with a
higher identification number, Si then waits for any of
these processes to answer within T.
• If no response within T, assume that all sites with
number greater than i have failed, Si elects itself the
new coordinator.
• If an answer is received, Si begins time interval T’,
waiting to receive a message that a site with a higher
identification number has been elected.
Bully Algorithm

• If no message is sent within T’, assume the


site with a higher number has failed; Si
restarts the algorithm.
• After a failed site recovers, it immediately
begins execution of the same algorithm.
• If there are no active sites with higher
numbers, the recovered site forces all
processes with lower numbers to let it
become the coordinator site, even if there is a
currently active coordinator with a lower
number.
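
A toy, single-process sketch of the Bully election rule with messages and the timeouts T and T’ abstracted away; the set of alive sites is simply passed in, so this only illustrates the “highest responding number wins” idea.

# Illustrative Bully election: among the sites that respond, the one with
# the highest identification number becomes the new coordinator.
def elect_coordinator(initiator, alive_sites):
    # Sites with a higher number than the initiator are challenged first.
    higher = [s for s in alive_sites if s > initiator]
    if not higher:
        return initiator          # no higher-numbered site answered within T
    return max(higher)            # a higher-numbered site takes over the election

print(elect_coordinator(2, {1, 2, 3, 5}))   # -> 5
print(elect_coordinator(4, {1, 2, 4}))      # all higher-numbered sites failed -> 4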
Distributed Concurrency Control

• Idea is to designate a particular copy of each


data item as a distinguished copy.
• The locks for this data item are associated
with the distinguished copy and all the locking
and unlocking requests are sent to the site
that contains the copy.
• Methods for concurrency control
– Primary Site Technique
– Primary Site with Backup Site
– Primary Copy Technique
– Voting
Distributed Concurrency Control

• Primary site technique


– A single site is designated to be coordinator site for
all database items
– All locks are kept at this site.
– All requests are sent to this site.
• Advantages
– Simple extension of centralized approach
• Disadvantages
– Performance bottleneck
– Failure of primary site
Distributed Concurrency Control

• Primary site with backup site


– Overcomes the second disadvantage of
primary site technique
– All locking information maintained at the
primary as well as backup site.
– In case of failure of the primary site, the
backup site takes control and becomes the
new primary site.
– It then chooses another site as the new
backup site and copies the lock information
to it.
Distributed Concurrency Control

• Primary copy technique


– Attempts to distribute load of lock
coordination by having distinguished copies
of different data items stored at different
sites.
– Failure of a site affects transactions that
access locks on that particular site.
– Other transactions can continue to run.
– Can use the method of backup to increase
availability and reliability.
Distributed Concurrency Control

• Based on voting
– To lock a data item:
• Send a message to all nodes that maintain a
replica of this item.
• If a node can safely lock the item, then vote
"Yes", otherwise, vote "No".
• If a majority of participating nodes vote "Yes"
then the lock is granted.
• Send the results of the vote back out to all
participating sites.
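
A minimal sketch of the majority rule used by voting-based locking; vote collection from the replica sites is simulated by a plain list of booleans.

# Illustrative voting-based lock request: grant iff a majority of replica
# sites vote "Yes".
def lock_granted(votes):
    # votes: list of booleans, one per node holding a replica of the item.
    yes = sum(1 for v in votes if v)
    return yes > len(votes) // 2

print(lock_granted([True, True, False]))         # 2 of 3 -> granted
print(lock_granted([True, False, False, True]))  # 2 of 4 is not a majority -> denied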
Normal Execution and Commit Protocols

• Commit protocols are used to ensure


atomicity across sites
– a transaction which executes at multiple sites must
either be committed at all the sites, or aborted at
all the sites.
– not acceptable to have a transaction committed at
one site and aborted at another
• The two-phase commit (2PC) protocol is
widely used
• The three-phase commit (3PC) protocol is
more complicated and more expensive, but
avoids some drawbacks of two-phase commit
protocol. This protocol is not used in practice.
Two-Phase Commit (2PC)

• Site at which transaction originates is


coordinator; other sites at which it executes
are subordinates.
• When a transaction wants to commit:
– Coordinator sends prepare msg to each
subordinate.
– Subordinate force-writes an abort or prepare log
record and then sends a no or yes msg to
coordinator.
– If coordinator gets unanimous yes votes, force-
writes a commit log record and sends commit
msg to all subs. Else, force-writes abort log rec,
and sends abort msg.
– Subordinates force-write an abort/commit log rec
based on the msg they get, then send an ack msg to
the coordinator.
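
A condensed, illustrative sketch of the coordinator’s side of 2PC under the happy path (no failures, no timeouts); force-writes are stood in for by list appends and messaging by a callback, both hypothetical.

# Illustrative 2PC coordinator (no networking, no failures): collect votes,
# force-write the decision, then tell the subordinates.
def two_phase_commit(subordinate_votes, coordinator_log, send):
    # Phase 1: "prepare" has been sent; subordinate_votes maps site -> "yes"/"no".
    if all(v == "yes" for v in subordinate_votes.values()):
        coordinator_log.append("commit")          # force-write commit log record
        decision = "commit"
    else:
        coordinator_log.append("abort")           # force-write abort log record
        decision = "abort"
    for site in subordinate_votes:                # Phase 2: send decision to subs
        send(site, decision)
    return decision

log, sent = [], []
d = two_phase_commit({"S1": "yes", "S2": "yes"}, log, lambda s, m: sent.append((s, m)))
print(d, log, sent)    # commit ['commit'] [('S1', 'commit'), ('S2', 'commit')]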
Two-Phase Commit (2PC)

• Two rounds of communication: first,


voting; then, termination. Both initiated by
coordinator.
• Any site can decide to abort an
transaction.
• Every message reflects a decision by the
sender; to ensure that this decision
survives failures, it is first recorded in the
local log.
• All commit protocol log records for a
transaction contain Transaction_id and
Coordinator_id. The coordinator’s abort/commit
record also includes the ids of all subordinates.
Handling of Failures - Site Failure

When site Sk recovers, it examines its log to determine
the fate of
transactions active at the time of the failure.
• Log contains <commit T> record: site executes redo
(T)
• Log contains <abort T> record: site executes undo
(T)
• Log contains <ready T> record: site must consult Ci
to determine the fate of T.
– If T committed, redo (T)
– If T aborted, undo (T)
• The log contains no control records concerning T: this
implies that Sk failed before responding to the
prepare T message from Ci.
– Since the failure of Sk precludes the sending of such a
response, Ci must have aborted T, so Sk executes undo (T).
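
A small sketch of the restart decision a participant site makes from its own log, following the cases above; the log encoding and the coordinator lookup are simulated for illustration.

# Illustrative restart logic for a participant site Sk after a failure.
def recovery_action(log_records, ask_coordinator):
    # log_records: e.g. ["<ready T>", "<commit T>"]; ask_coordinator() returns
    # the coordinator's decision ("commit" or "abort") for in-doubt transactions.
    if "<commit T>" in log_records:
        return "redo(T)"
    if "<abort T>" in log_records:
        return "undo(T)"
    if "<ready T>" in log_records:
        return "redo(T)" if ask_coordinator() == "commit" else "undo(T)"
    # No control records: Sk failed before voting, so T cannot have committed.
    return "undo(T)"

print(recovery_action(["<ready T>", "<commit T>"], lambda: "commit"))  # redo(T)
print(recovery_action(["<ready T>"], lambda: "abort"))                 # undo(T)
print(recovery_action([], lambda: "commit"))                           # undo(T)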
Handling of Failures- Coordinator Failure

• If coordinator fails while the commit protocol for T is


executing then participating sites must decide on T’s
fate:
★ If an active site contains a <commit T> record in its log,
then T must be committed.
★ If an active site contains an <abort T> record in its log, then
T must be aborted.
★ If some active participating site does not contain a <ready
T> record in its log, then the failed coordinator Ci cannot
have decided to commit T. Can therefore abort T.
★ If none of the above cases holds, then all active sites must
have a <ready T> record in their logs, but no additional
control records (such as <abort T> or <commit T>). In this
case active sites must wait for Ci to recover, to find decision.
• Blocking problem : active sites may have to wait for
failed coordinator to recover.
Handling of Failures - Network Partition

• If the coordinator and all its participants


remain in one partition, the failure has no
effect on the commit protocol.
• If the coordinator and its participants belong
to several partitions:
– Sites that are not in the partition containing the
coordinator think the coordinator has failed, and
execute the protocol to deal with failure of the
coordinator.
• No harm results, but sites may still have to wait for
decision from coordinator.
• The coordinator and the sites that are in the same
partition as the coordinator think that the
sites in the other partition have failed, and
follow the usual commit protocol.
• Again, no harm results
Recovery and Concurrency Control

• In-doubt transactions have a <ready T>, but


neither a
<commit T>, nor an <abort T> log record.
• The recovering site must determine the commit-abort
status of such transactions by contacting other sites;
this can be slow and can potentially block recovery.
• Recovery algorithms can note lock information in the
log.
– Instead of <ready T>, write out <ready T, L>, where L = list of
locks held by T when the log record is written (read locks can be
omitted).
– For every in-doubt transaction T, all the locks noted in the
<ready T, L> log record are reacquired.
• After lock reacquisition, transaction processing can
resume; the commit or rollback of in-doubt
transactions is performed concurrently with the
execution of new transactions.
Restart after a Failure

• If we have a commit or abort log record for


transaction T, but not an end record, must
redo/undo T.
– If this site is the coordinator for T, keep sending
commit/abort msgs to subs until acks received.
• If we have a prepare log record for transaction
T, but not commit/abort, this site is a
subordinate for T.
– Repeatedly contact the coordinator to find status
of T, then write commit/abort log record;
redo/undo T; and write end log record.
• If we don’t have even a prepare log record for
T, unilaterally abort and undo T.
– This site may be coordinator! If so, subs may send
msgs.
Observations on 2PC

• Ack msgs used to let coordinator know


when it can “forget” a transaction;
until it receives all acks, it must keep T
in the transaction Table.
• If coordinator fails after sending prepare
msgs but before writing commit/abort
log records, when it comes back up it
aborts the transaction .
• If a sub-transaction does no updates, its
commit or abort status is irrelevant.
2PC with Presumed Abort

• When coordinator aborts T, it undoes T and


removes it from the transaction Table
immediately.
– Doesn’t wait for acks; “presumes abort” if
transaction not in transaction Table. Names of subs
not recorded in abort log rec.
• Subordinates do not send acks on abort.
• If a sub-transaction does not do updates, it
responds to the prepare msg with reader instead
of yes/no.
• Coordinator subsequently ignores readers.
• If all sub-transactions are readers, the 2nd phase is
not needed.
Three Phase Commit (3PC)
• Assumptions:
– No network partitioning
– At any point, at least one site must be up.
– At most K sites (participants as well as
coordinator) can fail
• Phase 1: Obtaining Preliminary
Decision: Identical to 2PC Phase 1.
– Every site is ready to commit if instructed
to do so
Three-Phase Commit (3PC)

• Phase 2 of 2PC is split into 2 phases, Phase 2 and


Phase 3 of 3PC
– In phase 2 coordinator makes a decision as in 2PC (called
the pre-commit decision) and records it in multiple (at
least K) sites
– In phase 3, coordinator sends commit/abort message to all
participating sites,
• Under 3PC, knowledge of pre-commit decision
can be used to commit despite coordinator failure
– Avoids blocking problem as long as < K sites fail
• Drawbacks:
– higher overheads
– assumptions may not be satisfied in practice
