5-Transaction Processing

The document outlines the principles of distributed database systems, focusing on distributed transaction processing, concurrency control, and reliability. It discusses transaction characteristics, including atomicity, consistency, isolation, and durability, as well as various concurrency control algorithms such as locking-based and timestamp-based methods. Additionally, it covers deadlock detection techniques and the importance of ensuring serializability in distributed environments.

Principles of Distributed Database Systems
M. Tamer Özsu
Patrick Valduriez

© 2020, M.T. Özsu & P. Valduriez 1


Outline
◼ Distributed Transaction Processing
❑ Distributed Concurrency Control
❑ Distributed Reliability

© 2020, M.T. Özsu & P. Valduriez 2


Transaction

A transaction is a collection of actions that make consistent transformations of system states while preserving system consistency. It provides:
❑ concurrency transparency
❑ failure transparency

© 2020, M.T. Özsu & P. Valduriez 3


Transaction Characterization

Begin_transaction

Read
Read

Write
Read

Commit
◼ Read set (RS)
❑ The set of data items that are read by a transaction
◼ Write set (WS)
❑ The set of data items whose values are changed by this transaction
◼ Base set (BS)
❑ RS ∪ WS (the union of the read set and the write set)

© 2020, M.T. Özsu & P. Valduriez 4


Principles of Transactions

ATOMICITY
❑ all or nothing

CONSISTENCY
❑ no violation of integrity constraints

ISOLATION
❑ concurrent changes invisible ⇒ serializable

DURABILITY
❑ committed updates persist

© 2020, M.T. Özsu & P. Valduriez 5


Transactions Provide…

◼ Atomic and reliable execution in the presence of failures
◼ Correct and fast execution in the presence of multiple user accesses
◼ Correct management of replicas (if replication is supported)

© 2020, M.T. Özsu & P. Valduriez 6


Distributed TM Architecture

© 2020, M.T. Özsu & P. Valduriez 7


Outline
◼ Distributed Transaction Processing
❑ Distributed Concurrency Control

© 2020, M.T. Özsu & P. Valduriez 8


Concurrency Control

◼ The problem of synchronizing concurrent transactions such that the consistency of the database is maintained while, at the same time, the maximum degree of concurrency is achieved.
◼ Enforce isolation property
◼ Anomalies:
❑ Lost updates
◼ The effects of some transactions are not reflected on the database.
❑ Inconsistent retrievals
◼ A transaction, if it reads the same data item more than once, should
always read the same value.

© 2020, M.T. Özsu & P. Valduriez 9


Conflict Operations

◼ Two actions are said to be in conflict (conflicting pair) if:


1. The actions belong to different transactions.

2. At least one of the actions is a write operation.

3. The actions access the same object (read or write).

◼ The following set of actions is conflicting:


❑ R1(X), W2(X), W3(X) (3 conflicting pairs)

◼ While the following sets of actions are not:


❑ R1(X), R2(X), R3(X)

❑ R1(X), W2(Y), R3(X)

© 2020, M.T. Özsu & P. Valduriez 10
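To make the three conditions concrete, here is a minimal sketch (not from the slides; the tuple encoding of operations and the function name are assumptions) that tests whether two operations conflict and counts the conflicting pairs in the first example above.

```python
# Sketch of the conflict test described on the previous slide (assumed encoding).
def conflicts(op1, op2):
    """op = (transaction_id, action, item), e.g. (1, 'W', 'x') for W1(x)."""
    t1, a1, x1 = op1
    t2, a2, x2 = op2
    return (t1 != t2                 # different transactions
            and x1 == x2             # same data item
            and 'W' in (a1, a2))     # at least one of them is a write

# R1(X), W2(X), W3(X): every pair involving a write on the same item conflicts.
ops = [(1, 'R', 'x'), (2, 'W', 'x'), (3, 'W', 'x')]
pairs = [(a, b) for i, a in enumerate(ops) for b in ops[i+1:] if conflicts(a, b)]
print(len(pairs))  # 3 conflicting pairs
```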


Example of Conflict Equivalence

S1: r1(x) r2(x) w2(x) r1(y) w1(y)
S2: r1(x) r2(x) r1(y) w2(x) w1(y)
S3: r1(x) r1(y) r2(x) w2(x) w1(y)
S4: r1(x) r1(y) r2(x) w1(y) w2(x)
S5: r1(x) r1(y) w1(y) r2(x) w2(x)

All five schedules have the same set of conflicting operations. In S1 and S5 the conflicting operations are ordered in the same way, so:
  S1 is equivalent to S5
  S5 is the serial schedule T1, T2
  S1 is serializable
  S1 is not equivalent to the serial schedule T2, T1
11
Serializable Schedules
A schedule is said to be serializable when the schedule is conflict-equivalent to one or more serial schedules.

T1: begin transaction; read(x, X); X = X + 4; write(x, X); commit
T2: begin transaction; read(x, Y); write(y, Y); commit

Starting from the initial state x=1, y=3, each of the following schedules produces the final state x=5, y=1:
  r2(x) w2(y) r1(x) w1(x)   (the serial schedule T2, T1)
  r1(x) r2(x) w2(y) w1(x)
  r2(x) r1(x) w2(y) w1(x)
  r2(x) r1(x) w1(x) w2(y)
  r1(x) r2(x) w1(x) w2(y)
12
Serialization Graph of a Schedule, S

◼ Nodes represent transactions


◼ There is a directed edge from node Ti to node Tj if Ti
has an operation pi,k that conflicts with an operation pj,r
of Tj and pi,k precedes pj,r in S
◼ Theorem - A schedule is serializable if and only if its
serialization graph has no cycles

13
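The theorem suggests a direct serializability check: build the serialization graph from a schedule and test it for cycles. The sketch below is illustrative only; the schedule encoding and function names are assumptions, not the book's notation.

```python
from collections import defaultdict

def serialization_graph(schedule):
    """schedule: list of (transaction_id, action, item) in execution order."""
    edges = defaultdict(set)
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i+1:]:
            if ti != tj and xi == xj and 'W' in (ai, aj):
                edges[ti].add(tj)   # Ti's operation precedes a conflicting operation of Tj
    return edges

def has_cycle(edges):
    WHITE, GREY, BLACK = 0, 1, 2
    color = defaultdict(int)
    def visit(n):
        color[n] = GREY
        for m in edges[n]:
            if color[m] == GREY or (color[m] == WHITE and visit(m)):
                return True
        color[n] = BLACK
        return False
    return any(color[n] == WHITE and visit(n) for n in list(edges))

# S1: r1(x) r2(x) w2(x) r1(y) w1(y)  -> acyclic graph, hence serializable
S1 = [(1, 'R', 'x'), (2, 'R', 'x'), (2, 'W', 'x'), (1, 'R', 'y'), (1, 'W', 'y')]
print(has_cycle(serialization_graph(S1)))  # False
```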
Example of Serialization Graph

S: ..., p1,i, ..., p2,j, ...  (an edge Ti → Tj is drawn when an operation of Ti precedes a conflicting operation of Tj in S)

Graph 1 (nodes T1–T7, acyclic): S is serializable, e.g. in the order T1, T2, T3, T4, T5, T6, T7.
Graph 2 (nodes T1–T7, containing the cycle T2 → T6 → T7 → T2): S is not serializable.
14
Serializability in Distributed DBMS

◼ Two histories have to be considered:


❑ local histories
❑ global history

◼ For global transactions (i.e., global history) to be


serializable, two conditions are necessary:
❑ Each local history should be serializable → local serializability
❑ Two conflicting operations should be in the same relative order
in all of the local histories where they appear together →
global serializability

© 2020, M.T. Özsu & P. Valduriez 15


Global Serializability

T1: Read(x); x ← x - 100; Write(x); Read(y); y ← y + 100; Write(y); Commit
T2: Read(x); Read(y); Commit

◼ x stored at Site 1, y stored at Site 2


◼ LH1, LH2 are individually serializable (in fact serial), and the
two transactions are globally serializable.
LH1={R1(x), W1(x), R2(x)}
LH2={R1(y), W1(y), R2(y)}
© 2020, M.T. Özsu & P. Valduriez 16
Global Non-serializability

T1: Read(x); x ← x - 100; Write(x); Read(y); y ← y + 100; Write(y); Commit
T2: Read(x); Read(y); Commit

◼ x stored at Site 1, y stored at Site 2


◼ LH1, LH2 are individually serializable (in fact serial), but the
two transactions are not globally serializable.
LH1={R1(x),W1(x), R2(x)}
LH2={R2(y), R1(y),W1(y)}
© 2020, M.T. Özsu & P. Valduriez 17
Concurrency Control Algorithms

◼ Pessimistic Algorithms
❑ Locking-based Algorithms
◼ Centralized (primary site) 2PL (Two-Phase
Locking)
◼ Distributed 2PL
❑ Timestamp-based Algorithms
◼ Basic TO (Timestamp Ordering)
◼ Conservative TO
❑ Multiversion Concurrency Control

◼ Optimistic Algorithms

© 2020, M.T. Özsu & P. Valduriez 18


Locking-Based Algorithms

◼ Transactions indicate their intentions by requesting locks


from the scheduler (called lock manager).
◼ Locks are either read lock (rl) [also called shared lock] or
write lock (wl) [also called exclusive lock]
◼ Read locks and write locks conflict (because Read and
Write operations are incompatible):

        rl    wl
  rl    yes   no
  wl    no    no

◼ Locking works nicely to allow concurrent processing of
transactions (a small compatibility-check sketch follows below).
© 2020, M.T. Özsu & P. Valduriez 19
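As a concrete illustration of the rl/wl compatibility matrix, here is a minimal sketch of the check a lock manager could perform before granting a request. The table encoding and the function name are assumptions made for illustration, not the book's code.

```python
# Compatibility matrix from the slide: only two read locks are compatible.
COMPATIBLE = {('rl', 'rl'): True,  ('rl', 'wl'): False,
              ('wl', 'rl'): False, ('wl', 'wl'): False}

def can_grant(requested_mode, held_modes):
    """Grant only if the requested lock is compatible with every lock already held on the item."""
    return all(COMPATIBLE[(held, requested_mode)] for held in held_modes)

print(can_grant('rl', ['rl', 'rl']))  # True: multiple readers can share the item
print(can_grant('wl', ['rl']))        # False: a write lock conflicts with an existing read lock
```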
Two-Phase Locking (2PL)
◼ Transaction does not release a lock until it
has all the locks it will ever require.
◼ Transaction has a locking phase followed by
an unlocking phase

Ts first unlock


Number
of locks T commits
held by T

time

◼ Guarantees serializability when locking is


done in this way
20
Centralized 2PL (C2PL)
◼ There is only one Coordinating TM, the lock manager at the central site,
and the data processors (DP) at the other participating sites.
◼ The participating sites are those that store the data items on which the
operation is to be carried out.
◼ Lock requests are issued to the Coordinating TM.

© 2020, M.T. Özsu & P. Valduriez 21


Distributed 2PL (D2PL)
◼ Lock managers are placed at each site. Each scheduler
handles lock requests for data at that site. The
distributed 2PL is similar to the C2PL, with two major
modifications.
◼ The messages that are sent to the central site lock
manager in C2PL are sent to the lock managers at all
participating sites in D2PL.
◼ The second difference is that the operations are not
passed to the data processors by the coordinating
transaction manager, but by the participating lock
managers.
❑ This means that the coordinating transaction manager does not
wait for a “lock request granted” message.
© 2020, M.T. Özsu & P. Valduriez 22
Distributed 2PL Execution

© 2020, M.T. Özsu & P. Valduriez 23


Deadlock
◼ A transaction is deadlocked if it is blocked and will
remain blocked until there is intervention.
◼ Locking-based CC algorithms may cause deadlocks.
◼ TO-based algorithms that involve waiting may cause
deadlocks.
◼ Wait-for graph
❑ If transaction Ti waits for another transaction Tj to release a lock
on an entity, then Ti → Tj in WFG.


© 2020, M.T. Özsu & P. Valduriez 24


Local versus Global WFG
◼ T1 and T2 run at site 1, T3 and T4 run at site 2.
◼ T3 waits for a lock held by T4 which waits for a lock held by T1 which
waits for a lock held by T2 which, in turn, waits for a lock held by T3.

Local WFG

Global WFG

© 2020, M.T. Özsu & P. Valduriez 25


Deadlock Detection

◼ Transactions are allowed to wait freely.


◼ Wait-for graphs and cycles.
◼ Topologies for deadlock detection algorithms
❑ Centralized
❑ Hierarchical
❑ Distributed

© 2020, M.T. Özsu & P. Valduriez 26


Centralized Deadlock Detection
◼ One site is designated as the deadlock detector for the
system.
◼ Each scheduler periodically sends its local WFG to the
central site which merges them to a global WFG to
determine cycles.
◼ How often to transmit?
❑ Too often ⇒ higher communication cost but lower
delays due to undetected deadlocks
❑ Too seldom ⇒ higher delays due to undetected deadlocks, but lower
communication cost
◼ Would be a reasonable choice if the concurrency control
algorithm is also centralized.
◼ Proposed for Distributed INGRES
© 2020, M.T. Özsu & P. Valduriez 27
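A sketch of what the central deadlock detector does: merge the local WFGs it receives into a global WFG and search it for a cycle. The graph representation and function names are assumptions; the example edges reproduce the four-transaction deadlock from the "Local versus Global WFG" slide.

```python
# Merge local wait-for graphs and look for a cycle (a deadlock).
def merge_wfgs(local_wfgs):
    gwfg = {}
    for wfg in local_wfgs:                      # each wfg: {Ti: set of Tj that Ti waits for}
        for t, waits_for in wfg.items():
            gwfg.setdefault(t, set()).update(waits_for)
    return gwfg

def find_deadlock(gwfg):
    """Return a cycle (list of transactions) if one exists, else None."""
    def dfs(node, path, on_path):
        for nxt in gwfg.get(node, ()):
            if nxt in on_path:
                return path[path.index(nxt):] + [nxt]
            found = dfs(nxt, path + [nxt], on_path | {nxt})
            if found:
                return found
        return None
    for start in gwfg:
        cycle = dfs(start, [start], {start})
        if cycle:
            return cycle
    return None

# T1, T2 run at site 1; T3, T4 run at site 2; T1 -> T2 -> T3 -> T4 -> T1.
site1 = {'T1': {'T2'}, 'T2': {'T3'}}
site2 = {'T3': {'T4'}, 'T4': {'T1'}}
print(find_deadlock(merge_wfgs([site1, site2])))  # ['T1', 'T2', 'T3', 'T4', 'T1']
```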
Hierarchical Deadlock Detection
• An alternative to centralized deadlock detection is the building of a
hierarchy of deadlock detectors (see Fig. below).
• Deadlocks that are local to a single site would be detected at that
site using the LWFG.
• Each site also sends its LWFG to the deadlock detector at the next
level.
• For example, a deadlock at site 1
would be detected by the local
deadlock detector (DD) at site 1
(denoted DD21, 2 for level 2, 1 for
site 1).
• If, however, the deadlock
involves sites 1 and 2, then DD11
detects it.
• Finally, if the deadlock involves
sites 1 and 4, DD0x detects it,
where x is one of 1, 2, 3, or 4.
© 2020, M.T. Özsu & P. Valduriez 28
Distributed Deadlock Detection
◼ There are local deadlock detectors at each site that communicate
their LWFGs with one another. The LWFG at each site is formed and
is modified as follows:
1. Since each site receives the potential deadlock cycles from other
sites, these edges are added to the LWFGs.
2. The edges in the LWFG show that local transactions are waiting
for transactions at other sites.

© 2020, M.T. Özsu & P. Valduriez 29


Distributed Deadlock Detection

◼ If there is a cycle that does not include the external edges,


there is a local deadlock that can be handled locally.
◼ If, on the other hand, there is a cycle involving these external
edges, there is a potential distributed deadlock and this cycle
information has to be communicated to other deadlock
detectors.
◼ In the case of Example, the possibility of such a distributed
deadlock is detected by both sites.

© 2020, M.T. Özsu & P. Valduriez 30


Concurrency Control Algorithms

◼ Pessimistic Algorithms
❑ Locking-based Algorithms
◼ Centralized (primary site) 2PL (Two-Phase
Locking)
◼ Distributed 2PL
❑ Timestamp-based Algorithms
◼ Basic TO (Timestamp Ordering)
◼ Conservative TO
❑ Multiversion Concurrency Control

◼ Optimistic Algorithms

© 2020, M.T. Özsu & P. Valduriez 31


Timestamp Ordering

◼ Transaction Ti is assigned a globally unique timestamp ts(Ti) (using the system clock).
◼ The transaction manager attaches the timestamp to all operations issued by the transaction.
◼ Each data item is assigned a write timestamp (wts) and a read timestamp (rts):
❑ rts(x) = largest timestamp of any read on x
❑ wts(x) = largest timestamp of any write on x
◼ Conflicting operations are resolved by timestamp order.

© 2020, M.T. Özsu & P. Valduriez 32


Basic Timestamp Ordering

◼ Two conflicting operations Oij of Ti and Okl of Tk →


❑ Oij executed before Okl iff ts(Ti) < ts(Tk).
❑ Ti is called older transaction
❑ Tk is called younger transaction

for Ri(x):
    if ts(Ti) < wts(x) then reject Ri(x)
    else accept Ri(x) and set rts(x) ← ts(Ti)

for Wi(x):
    if ts(Ti) < rts(x) or ts(Ti) < wts(x) then reject Wi(x)
    else accept Wi(x) and set wts(x) ← ts(Ti)

(A small code sketch of these rules follows below.)

© 2020, M.T. Özsu & P. Valduriez 33
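The accept/reject rules above translate almost directly into code. A minimal sketch follows, with assumed data structures; note that it updates rts(x) with max(...) so the read timestamp only ever grows, whereas the slide shows a plain assignment.

```python
# Basic TO accept/reject rules (illustrative sketch).
rts, wts = {}, {}   # read and write timestamps per data item

def read(ts_ti, x):
    if ts_ti < wts.get(x, -1):
        return 'reject'                       # Ti is too old: x was already overwritten by a younger txn
    rts[x] = max(rts.get(x, -1), ts_ti)
    return 'accept'

def write(ts_ti, x):
    if ts_ti < rts.get(x, -1) or ts_ti < wts.get(x, -1):
        return 'reject'                       # a younger transaction already read or wrote x
    wts[x] = ts_ti
    return 'accept'

print(write(5, 'x'), read(3, 'x'))  # accept reject: T3 tries to read a value written by younger T5
```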


Conservative Timestamp Ordering

◼ Basic timestamp ordering tries to execute an operation


as soon as it is accepted
❑ progressive
❑ too many restarts since there is no delaying
◼ Conservative timestamping delays each operation until
no operation with a smaller timestamp can arrive at that
scheduler.
◼ If this condition can be guaranteed, the scheduler will
never reject an operation.
◼ However, this delay introduces the possibility of
deadlocks.

© 2020, M.T. Özsu & P. Valduriez 34


Concurrency Control Algorithms

◼ Pessimistic Algorithms
❑ Locking-based Algorithms
◼ Centralized (primary site) 2PL (Two-Phase
Locking)
◼ Distributed 2PL
❑ Timestamp-based Algorithms
◼ Basic TO (Timestamp Ordering)
◼ Conservative TO
❑ Multiversion Concurrency Control

◼ Optimistic Algorithms

© 2020, M.T. Özsu & P. Valduriez 35


Multiversion Concurrency Control
(MVCC)
◼ Do not modify the values in the database, create new
values.
◼ Implemented in a number of systems: IBM DB2, Oracle,
SQL Server, SAP HANA, BerkeleyDB, PostgreSQL
◼ MVCC techniques typically use timestamps to maintain
transaction isolation
◼ Each version of a data item that is created is labeled with
the timestamp of the transaction that creates it.
◼ The idea is that each read operation accesses the
version of the data item that is appropriate for its
timestamp, thus reducing transaction aborts and restarts.
© 2020, M.T. Özsu & P. Valduriez 36
MVCC Reads

◼ A Ri(x) is translated into a read on one version of x.


❑ Find a version of x (say xv) such that ts(xv) is the largest
timestamp less than ts(Ti).

© 2020, M.T. Özsu & P. Valduriez 37


MVCC Writes
◼ A Wi(x) is translated into Wi(xw) so that ts(xw) = ts(Ti)
❑ Accepted if and only if no other transaction with a timestamp greater than ts(Ti) has read the value of a version of x (say, xr); in other words, accepted if ts(xr) < ts(xw)
❑ Rejected if the scheduler has already processed some Rj(xr) such that ts(xw) < ts(xr)
◼ The reason is that, had Wi(x) been accepted, it would create a version (xc) that Rj should have read, but did not, since the version was not available when Rj was executed
(A version-selection sketch follows below.)
© 2020, M.T. Özsu & P. Valduriez 38
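A small sketch of the version-selection logic described on the last two slides, using an assumed in-memory layout of versions and per-version read timestamps. It is illustrative only, not the book's algorithm verbatim.

```python
# Assumed layout: item -> {version timestamp: value}, and the largest reader timestamp per version.
versions = {'x': {1: 'v1', 4: 'v4', 9: 'v9'}}
read_ts  = {'x': {1: 1, 4: 6}}                 # version 4 of x was last read by a txn with ts 6

def mvcc_read(ts_ti, x):
    """Read the version of x with the largest timestamp not exceeding ts(Ti)."""
    older = [v for v in versions[x] if v <= ts_ti]
    return versions[x][max(older)] if older else None

def mvcc_write_ok(ts_ti, x):
    """Reject Wi(x) if a transaction younger than Ti already read the version Ti would overwrite."""
    older = [v for v in versions[x] if v <= ts_ti]
    if not older:
        return True
    latest = max(older)
    return read_ts.get(x, {}).get(latest, -1) <= ts_ti

print(mvcc_read(6, 'x'))       # 'v4': version 4 is the largest version timestamp <= 6
print(mvcc_write_ok(5, 'x'))   # False: version 4 was already read at ts 6 > 5
```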
Concurrency Control Algorithms

◼ Pessimistic Algorithms
❑ Locking-based Algorithms
◼ Centralized (primary site) 2PL (Two-Phase
Locking)
◼ Distributed 2PL
❑ Timestamp-based Algorithms
◼ Basic TO (Timestamp Ordering)
◼ Conservative TO
❑ Multiversion Concurrency Control

◼ Optimistic Algorithms

© 2020, M.T. Özsu & P. Valduriez 39


Optimistic Concurrency Control
Algorithms

Pessimistic execution:  Validate → Read → Compute → Write

Optimistic execution:   Read → Compute → Validate → Write

© 2020, M.T. Özsu & P. Valduriez 40


Optimistic Concurrency Control
Algorithms
◼ Transaction execution model: divide into subtransactions
each of which execute at a site
❑ Tks: transaction Tk that executes at site s
◼ Transactions run independently at each site until they
reach the end of their read phases
◼ All subtransactions are assigned a timestamp at the end
of their read phase
◼ A validation test is performed during the validation phase. If
one subtransaction fails validation, all of them are rejected.

© 2020, M.T. Özsu & P. Valduriez 41


Optimistic CC Validation Test

◼ If all transactions Tks where ts(Tks) < ts(Tis) have completed their write phase before Tis has started its read phase, then validation succeeds
❑ Transaction executions are in serial order

© 2020, M.T. Özsu & P. Valduriez 42


Optimistic CC Validation Test

◼ If there is any transaction Tks such that ts(Tks) < ts(Tis) and which completes its write phase while Tis is in its read phase, then validation succeeds if WS(Tks) ∩ RS(Tis) = Ø
❑ Read and write phases overlap, but Tis does not read data items written by Tks

© 2020, M.T. Özsu & P. Valduriez 43


Optimistic CC Validation Test

◼ If there is any transaction Tks such that ts(Tks) < ts(Tis) and which completes its read phase before Tis completes its read phase, then validation succeeds if WS(Tks) ∩ RS(Tis) = Ø and WS(Tks) ∩ WS(Tis) = Ø
❑ They overlap, but do not access any common data items.

© 2020, M.T. Özsu & P. Valduriez 44
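The three validation tests can be combined into a single check of Tis against every older transaction. The sketch below uses assumed field names for the read/write sets and phase boundaries; it is an illustration of the rules, not the book's implementation.

```python
# Combined optimistic validation check (illustrative; field names are assumptions).
def validate(Ti, older):
    """Ti and each Tk in 'older' (ts(Tk) < ts(Ti)) carry read/write sets and phase boundaries."""
    for Tk in older:
        if Tk['write_end'] <= Ti['read_start']:
            continue                                          # rule 1: Tk finished writing before Ti began reading
        if Tk['write_end'] <= Ti['read_end']:
            if Tk['WS'] & Ti['RS']:                           # rule 2: Ti must not read anything Tk wrote
                return False
            continue
        if (Tk['WS'] & Ti['RS']) or (Tk['WS'] & Ti['WS']):    # rule 3: no common data items at all
            return False
    return True

Ti = {'read_start': 10, 'read_end': 20, 'RS': {'x'}, 'WS': {'y'}}
Tk = {'write_end': 15, 'WS': {'z'}}
print(validate(Ti, [Tk]))   # True: Tk wrote only z, which Ti neither reads nor writes
```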


Assignment #3

© 2020, M.T. Özsu & P. Valduriez 45


Outline
◼ Distributed Transaction Processing

❑ Distributed Reliability

© 2020, M.T. Özsu & P. Valduriez 46


Reliability

Problem: how to maintain the
❑ atomicity
❑ durability
properties of transactions

© 2020, M.T. Özsu & P. Valduriez 47


Types of Failures
◼ Transaction failures
❑ Transaction aborts (unilaterally or due to deadlock)
◼ System (site) failures
❑ Failure of processor, main memory, power supply, …
❑ Main memory contents are lost, but secondary storage
contents are safe
❑ Partial vs. total failure
◼ Media failures
❑ Failure of secondary storage devices → stored data is lost
❑ Head crash/controller failure
◼ Communication failures
❑ Lost/undeliverable messages
❑ Network partitioning

© 2020, M.T. Özsu & P. Valduriez 48


Distributed Reliability Protocols
◼ Distributed reliability protocols aim to maintain the atomicity and
durability of distributed transactions.
◼ Commit protocols
❑ How to execute commit command for distributed transactions.
❑ Issue: how to ensure atomicity and durability?
◼ Termination protocols
❑ If a failure occurs, how can the remaining operational sites deal with it?
❑ Non-blocking: the occurrence of failures should not force the sites to
wait until the failure is repaired to terminate the transaction.
◼ Recovery protocols
❑ When a failure occurs, how do the sites where the failure occurred deal
with it?
❑ Independent: a failed site can determine the outcome of a transaction
without having to obtain remote information.
◼ Independent recovery ⇒ non-blocking termination

© 2020, M.T. Özsu & P. Valduriez 49


Two-Phase Commit (2PC)
◼ It is a very simple and elegant protocol that ensures the
atomic commitment of distributed transactions.
◼ Coordinator: the process at the site where the transaction originates and which controls the execution
◼ Participant: the process at the other sites that participate in executing the transaction
Phase 1: The coordinator gets the participants ready to write the results into the database
Phase 2: Everybody writes the results into the database
Global Commit Rule:
❑ The coordinator aborts a transaction if and only if at least one participant votes to abort it.
❑ The coordinator commits a transaction if and only if all of the participants vote to commit it.
© 2020, M.T. Özsu & P. Valduriez 50
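The global commit rule itself is a one-line decision. The following sketch (message names are assumptions) shows the choice the coordinator makes at the start of the second phase from the votes collected in the first phase.

```python
# Global commit rule applied by the 2PC coordinator (illustrative sketch).
def global_decision(votes):
    """votes: one 'vote-commit' or 'vote-abort' per participant."""
    if all(v == 'vote-commit' for v in votes):
        return 'global-commit'     # commit only if every participant voted to commit
    return 'global-abort'          # abort if at least one participant voted to abort

print(global_decision(['vote-commit', 'vote-commit']))  # global-commit
print(global_decision(['vote-commit', 'vote-abort']))   # global-abort
```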
State Transitions in 2PC

Coordinator Participant


© 2020, M.T. Özsu & P. Valduriez 51


2PC Protocol Actions

© 2020, M.T. Özsu & P. Valduriez 52


Centralized 2PC

© 2020, M.T. Özsu & P. Valduriez 53


Linear 2PC

V-C: Vote-Commit, V-A: Vote-Abort, G-C: Global-commit, G-A: Global-abort

© 2020, M.T. Özsu & P. Valduriez 54


Distributed 2PC
◼ The coordinator sends the
prepare message to all
participants.
◼ Each participant then
sends its decision to all
the other participants (and
to the coordinator) by
means of either a “vote-
commit” or a “vote-abort”
message.

© 2020, M.T. Özsu & P. Valduriez 55


Distributed 2PC
◼ Each participant waits for
messages from all the
other participants and
makes its termination
decision according to the
global-commit rule.
◼ Obviously, there is no
need for the second
phase of the protocol,
since each participant has
independently reached
that decision at the end of
the first phase.

© 2020, M.T. Özsu & P. Valduriez 56


Dealing with Site Failures
Our aim is to develop nonblocking termination and
independent recovery protocols.
◼ Termination Protocol for 2PC
❑ It handles timeouts for both the coordinator and the participant
processes.
❑ A timeout occurs at a destination site when it cannot get an
expected message from a source site within the expected time
period.
❑ In this section, we consider that this is due to the failure of the
source site.
◼ Recovery Protocol for 2PC
❑ It handles failures of both the coordinator and the participant
processes.
◼ 3PC Protocol
© 2020, M.T. Özsu & P. Valduriez 57
Site Failures - 2PC Termination
◼ Timeout in WAIT
❑ Coordinator is waiting for the local
decisions of the participants.
❑ Cannot unilaterally commit since the
global-commit rule has not been
satisfied.
❑ Can unilaterally abort

◼ Timeout in ABORT or COMMIT


❑ Not certain that the commit or abort
procedures have been completed by
the participant sites.
❑ Thus the coordinator repeatedly
sends the “global-commit” or “global-
abort” commands to the sites that
have not yet responded, and waits
for their acknowledgement.

© 2020, M.T. Özsu & P. Valduriez 58


Site Failures - 2PC Termination
◼ Timeout in INITIAL
❑ Participant is waiting for a “prepare”
message.
❑ Coordinator must have failed in
INITIAL state
❑ Unilaterally abort

◼ Timeout in READY
❑ Participant has voted to commit the
transaction but does not know the
global decision of the coordinator.
❑ The participant cannot unilaterally
reach a decision.
❑ Stay blocked until it can learn from
someone (either the coordinator or
some other participant) the ultimate
fate of the transaction
© 2020, M.T. Özsu & P. Valduriez 59
Site Failures - 2PC Recovery

◼ Failure in INITIAL
❑ Start the commit process upon
recovery
◼ Failure in WAIT
❑ Restart the commit process
upon recovery
◼ Failure in ABORT or COMMIT
❑ Nothing special if all the acks
have been received
❑ Otherwise the termination
protocol is involved

© 2020, M.T. Özsu & P. Valduriez 60


Site Failures - 2PC Recovery

◼ Failure in INITIAL
❑ Unilaterally abort upon recovery
◼ Failure in READY
❑ The coordinator has been
informed about the local decision
❑ Treat as timeout in READY state
and invoke the termination
protocol
◼ Failure in ABORT or COMMIT
❑ These states represent the
termination conditions, so, upon
recovery, the participant does not
need to take any special action.

© 2020, M.T. Özsu & P. Valduriez 61


2PC Recovery Protocols –
Additional Cases
A site failure may occur after the
coordinator or a participant has
written a log record but before it
can send a message
◼ Coordinator site fails after
writing “begin_commit” log and
before sending “prepare”
command
❑ treat it as a failure in WAIT
state; send “prepare”
command upon recovery

© 2020, M.T. Özsu & P. Valduriez 62


2PC Recovery Protocols –
Additional Cases
A site failure may occur after the
coordinator or a participant has
written a log record but before it can
send a message
◼ Participant site fails after writing
“ready” record in log but before
“vote-commit” is sent
❑ treat it as failure in READY
state
❑ alternatively, can send “vote-
commit” upon recovery

© 2020, M.T. Özsu & P. Valduriez 63


2PC Recovery Protocols –
Additional Cases
A site failure may occur after the
coordinator or a participant has
written a log record but before it can
send a message
◼ Participant site fails after writing
“abort” record in log but before
“vote-abort” is sent
❑ no need to do anything upon
recovery

© 2020, M.T. Özsu & P. Valduriez 64


2PC Recovery Protocols –
Additional Case
◼ Coordinator site fails
after logging its final
decision record but
before sending its
decision to the
participants
❑ coordinator treats it as
a failure in COMMIT
or ABORT state
❑ participants treat it as
timeout in the READY
state
© 2020, M.T. Özsu & P. Valduriez 65
2PC Recovery Protocols –
Additional Case
◼ Participant site fails after
writing “abort” or
“commit” record in log
but before
acknowledgement is sent
❑ participant treats it as
failure in COMMIT or
ABORT state
❑ coordinator will handle
it by timeout in
COMMIT or ABORT
state
© 2020, M.T. Özsu & P. Valduriez 66
Problem With 2PC
◼ Blocking
❑ Ready implies that the
participant waits for the
coordinator
❑ If coordinator fails, site is
blocked until recovery
❑ Blocking reduces availability

◼ Independent recovery is not possible


◼ However, it is known that:
❑ Independent recovery protocols
exist only for single site failures;
❑ no independent recovery protocol
exists which is resilient to
multiple-site failures.
◼ So we search for these protocols –
3PC
© 2020, M.T. Özsu & P. Valduriez 67
Three-Phase Commit

◼ 3PC is non-blocking.
◼ A commit protocol is non-blocking iff
❑ it is synchronous within one state transition, and

❑ its state transition diagram contains

◼ no state which is “adjacent” to both a commit and an


abort state, and
◼ no non-committable state which is “adjacent” to a
commit state
◼ Adjacent: possible to go from one state to another with a single
state transition
◼ Committable: all sites have voted to commit a transaction
❑ e.g.: COMMIT state

© 2020, M.T. Özsu & P. Valduriez 68


State Transitions in 3PC
Coordinator                              Participant

◼ Add another state between the WAIT (and READY) and COMMIT states which serves as a buffer state where the process is ready to commit (if that is the final decision) but has not yet committed.
2PC Protocol Actions

© 2020, M.T. Özsu & P. Valduriez 70


3PC Protocol Actions

© 2020, M.T. Özsu & P. Valduriez 71


Network Partitioning

◼ Simple partitioning
❑ Only two partitions
◼ Multiple partitioning
❑ More than two partitions
◼ Formal bounds:
❑ There exists no non-blocking protocol that is resilient to a
network partition if messages are lost when partition occurs.
❑ There exist non-blocking protocols which are resilient to a single
network partition if all undeliverable messages are returned to
sender.
❑ There exists no non-blocking protocol that is resilient to multiple
partitions.

© 2020, M.T. Özsu & P. Valduriez 72


Independent Recovery Protocols for
Network Partitioning

◼ No general solution possible


❑ allow one group to terminate while the other is blocked
❑ improve availability
◼ How to determine which group may proceed?
❑ The group with a majority
◼ How does a group know if it has a majority?
❑ Centralized
◼ Whichever partition contains the central site should terminate the
transaction
❑ Voting-based (quorum)

© 2020, M.T. Özsu & P. Valduriez 73


Quorum Protocols
◼ The network partitioning problem is handled by the commit
protocol.
◼ Every site is assigned a vote Vi.
◼ Total number of votes in the system V
◼ Abort quorum Va, commit quorum Vc
1. Va + Vc > V where 0 ≤ Va , Vc ≤ V
2. Before a transaction commits, it must obtain a commit quorum
Vc
3. Before a transaction aborts, it must obtain an abort quorum Va

◼ The first rule ensures that a transaction cannot be committed and


aborted at the same time.
◼ The next two rules indicate the votes that a transaction has to
obtain before it can terminate one way or the other.
© 2020, M.T. Özsu & P. Valduriez 74
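A tiny numeric illustration of the quorum rules (the vote assignment is an assumption): with V = 4, Vc = 3, and Va = 2, the rule Va + Vc > V holds, and no two partitions can reach opposite decisions.

```python
# Quorum rule check with a concrete vote assignment (illustrative).
V, Vc, Va = 4, 3, 2          # 4 sites with one vote each; Va + Vc = 5 > V = 4

def can_commit(votes_collected): return votes_collected >= Vc
def can_abort(votes_collected):  return votes_collected >= Va

# A partition holding 3 of the 4 votes can commit; the 1-vote partition can neither
# commit (1 < 3) nor abort (1 < 2), so conflicting terminations are impossible.
print(can_commit(3), can_abort(1), can_commit(1))  # True False False
```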
Paxos Consensus Protocol

◼ 2PC has blocking, and to overcome it, we have 3PC


which is expensive and not resilient to network
partitioning
◼ General problem: how to reach an agreement
(consensus) among TMs about the fate of a transaction
◼ General idea: If a majority reaches a decision, the global
decision is reached (like voting)
◼ Paxos is a family of protocols for solving consensus in a
network of unreliable or fallible processors.
◼ Consensus is the process of agreeing on one result
among a group of participants.

© 2020, M.T. Özsu & P. Valduriez 75


Paxos
◼ Roles:
❑ Proposer: recommends a decision

❑ Acceptor: decides whether to accept the proposed decision

❑ Learner: discovers the agreed-upon decision, either by asking for it or by having it pushed to it
◼ Naïve Paxos: one proposer
❑ Operates like a 2PC

◼ In the first round, the proposer suggests a value for the variable and
acceptors send their responses (accept/not accept).
◼ If the proposer gets accepts from a majority of the acceptors, then it
determines that particular value to be the value of the variable and
notifies the acceptors, who now record that value as the final one.
◼ A learner can, at any point, ask an acceptor what the value of the
variable is and learn the latest value.

© 2020, M.T. Özsu & P. Valduriez 76
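A sketch of the single-proposer ("naïve Paxos") round described above, with acceptors modeled as simple accept/reject functions. This illustrates only the majority rule; it is not a full Paxos implementation (no ballot numbers, no failure handling).

```python
# Single-proposer round: a value is chosen once a majority of acceptors accept it.
def propose(value, acceptors):
    accepts = sum(1 for acceptor in acceptors if acceptor(value))
    if accepts > len(acceptors) // 2:          # majority reached
        return value                           # value is chosen; learners can now read it
    return None                                # no decision in this round

acceptors = [lambda v: True, lambda v: True, lambda v: False]
print(propose('commit T1', acceptors))         # 'commit T1': 2 of 3 acceptors accepted
```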


Paxos & Complications
◼ Multiple proposers can put forward a value for the same
variable. Therefore, an acceptor needs to pick one of the
proposed values.
❑ using a ballot number so that acceptors can
differentiate different proposals
◼ Given multiple proposals, it is possible to get split votes
on multiple proposals with no proposed value receiving a
majority.
❑ running multiple consensus rounds—if no proposal
achieves a majority, then another round is run and
this is repeated until one value achieves majority

© 2020, M.T. Özsu & P. Valduriez 77


Paxos & Complications
◼ It is possible that some of the acceptors fail after they
accept a value. If the remaining acceptors who accepted
that value do not constitute a majority, this causes a
problem.
❑ this could be treated as the second issue and a new
round can be started.
❑ However, the complication is that some learners may
have learned the accepted value from acceptors in the
previous round, and if a different value is chosen in the
new round, we have inconsistency.
❑ Paxos deals with this again by using ballot numbers.

© 2020, M.T. Özsu & P. Valduriez 78


Basic Paxos with Failures

◼ Some acceptors fail but there is quorum


❑ Not a problem

◼ Enough acceptors fail to eliminate quorum


❑ Run a new ballot

◼ Proposer/leader fails
❑ Choose a new leader and start a new ballot

© 2020, M.T. Özsu & P. Valduriez 79
