
the morning paper

an interesting/influential/important paper from the world of CS every weekday morning, as selected by Adrian Colyer

Amazon Aurora: on avoiding distributed consensus for I/Os, commits, and membership changes

MARCH 27, 2019
tags: Datastores, Distributed Systems

Amazon Aurora: on avoiding distributed consensus for I/Os, commits, and membership changes (https://dl.acm.org/citation.cfm?id=3183713.3196937), Verbitski et al., SIGMOD’18

This is a follow-up to the paper we looked at earlier this week on the design of Amazon Aurora (https://blog.acolyer.org/2019/03/25/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases/). I’m going to assume a level of background knowledge from that work and skip over the parts of this paper that recap those key points. What is new and interesting here are the details of how quorum membership changes are dealt with, the notion of heterogeneous quorum set members, and more detail on the use of consistency points and the redo log.

Changing quorum membership

“ Managing quorum failures is complex. Traditional mechanisms cause I/O stalls while membership is being changed.

As you may recall though, Aurora is designed for a world with a constant background level of failure. So once a quorum member is suspected faulty we don’t want to have to wait to see if it comes back, but nor do we want to throw away the benefits of all the state already present on a node that might in fact come back quite quickly.
Aurora’s membership change protocol is designed to support continued processing during the change, to
tolerate additional failures while changing membership, and to allow member re-introduction if a suspected
faulty member recovers.

Each membership change is made via at least two transitions. Say we start out with a protection group with six
segment members, A-F. A write quorum is any 4 of 6, and a read quorum is any 3 of 6. We miss some
heartbeats for F and suspect it of being faulty.

Move one is to increment the membership epoch and introduce a new node G. All read and write requests, and any gossip messages, carry the epoch number. Any request with a stale epoch number will be rejected.
Making the epoch change requires a write quorum, just like any other write. The new membership epoch
established through this process now requires a write set to be any four of ABCDEF and any four of ABCDEG.
Notice that whether we ultimately choose to reinstate F, or we stick with G, we have valid quorums at all points
under both paths. For a read set we need any 3 of ABCDEF or any 3 of ABCDEG.

You can probably see where this is going. If F comes back before G has finished hydrating from its peers, then we make a second membership transition back to the ABCDEF formation. If it doesn’t, we can make
a transition to the ABCDEG formation.

Additional failures are handled in a similar manner. Suppose we’re in the transition state with a write quorum of
(4/6 of ABCDEF) AND (4/6 of ABCDEG) and wouldn’t you know it, now there’s a problem with E! Meet H.
We can transition to a write quorum that is (4/6 of ABCDEF and 4/6 of ABCDEG) AND (4/6 of ABCDHF and
4/6 of ABCDHG). Note that even here, simply writing to ABCD fulfils all four conditions.
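To make the arithmetic concrete, here is a minimal sketch (illustrative Python, not Aurora’s code) that checks a set of write acknowledgements against the transitional quorums just described:

def satisfies(write_acks, members, threshold=4):
    # True if at least `threshold` members of the given membership acknowledged the write.
    return len(write_acks & members) >= threshold

# Transition one: F is suspected faulty and G is being hydrated.
first_move = [set("ABCDEF"), set("ABCDEG")]

# A further failure of E during the transition introduces H as well.
second_move = [set("ABCDEF"), set("ABCDEG"), set("ABCDHF"), set("ABCDHG")]

def write_is_durable(write_acks, candidate_memberships):
    # During a membership change a write must reach a write quorum in every
    # candidate membership, so it stays durable whichever way the change resolves.
    return all(satisfies(write_acks, m) for m in candidate_memberships)

# Acknowledgements from A, B, C and D alone satisfy all four conditions.
assert write_is_durable(set("ABCD"), second_move)
assert not write_is_durable(set("ABEF"), first_move)   # only 3 of ABCDEG

Because ABCD appears in every candidate membership, writes keep flowing throughout the transition, whichever way it resolves.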

Heterogeneous quorums

“ Quorums are generally thought of as a collection of like members, grouped together to transparently handle failures. However, there is nothing in the quorum model to prevent unlike members with differing latency, cost, or durability characteristics.

Aurora exploits this to set up protection groups with three full segments, which store both redo log records and
materialized data blocks, and three tail segments which store just the redo logs. Since data blocks typically take
more space than redo logs the costs stay closer to traditional 3x replication than 6x.

With the split into full and tail segments a write quorum becomes any 4/6 segments, or all 3 full segments. (“In
practice, we write log records to the same 4/6 quorum as we did previously. At least one of these log records
arrives at a full segment and generates a data block”). A read quorum becomes 3/6 segments, to include at least
one full segment. In practice though, data is read directly from a full segment avoiding the need for a quorum
read – an optimisation we’ll look at next.
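Those two rules can be sketched as follows, with illustrative segment names (F1–F3 for the full segments, T1–T3 for the tail segments):

FULL = {"F1", "F2", "F3"}   # store redo log records and materialized data blocks
TAIL = {"T1", "T2", "T3"}   # store redo log records only

def is_write_quorum(acked):
    # Any 4 of the 6 segments, or all 3 full segments.
    return len(acked) >= 4 or FULL <= acked

def is_read_quorum(segments):
    # Any 3 of the 6 segments, provided at least one is a full segment.
    return len(segments) >= 3 and bool(segments & FULL)

assert is_write_quorum({"F1", "F2", "F3"})           # all full segments suffice
assert is_write_quorum({"F1", "T1", "T2", "T3"})     # as does any 4 of 6
assert not is_read_quorum({"T1", "T2", "T3"})        # tail-only: no data blocks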
“ There are many options available once one moves to quorum sets of unlike members. One can combine local disks to reduce latency and remote disks for durability and availability. One can combine SSDs for performance and HDDs for cost. One can span quorums across regions to improve disaster recovery. There are numerous moving parts that one needs to get right, but the payoffs can be significant. For Aurora, the quorum set model described earlier lets us achieve storage prices comparable to low-cost alternatives, while providing high durability, availability, and performance.

Efficient reads

So we spent the previous paper and much of this one nodding along with read quorums that must overlap with
write quorums, understanding the 4/6 and 3/6 requirements and so on, only to uncover the bombshell that
Aurora doesn’t actually use read quorums in practice at all! What!? (Ok, it does use read quorums, but only
during recovery).

The thing is there can be a lot of reads, and there’s an I/O amplification effect as a function of the size of the
read quorum. Whereas with write amplification we’re sending compact redo log records, with reading we’re
looking at full data blocks too. So Aurora avoids quorum reads.

“ Aurora does not do quorum reads. Through its bookkeeping of writes and consistency points, the database instance knows which segments have the last durable version of a data block and can request it directly from any of those segments… The database will usually issue a request to the segment with the lowest measured latency, but occasionally also query one of the others in parallel to ensure up to date read latency response times.

If a read is taking a long time, Aurora will issue a read to another storage node and go with whichever node
returns first.
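Roughly, the read path looks like the sketch below; read_block, the per-segment latency map, and the hedging threshold are hypothetical stand-ins, not Aurora’s actual API:

import concurrent.futures

def read_block(segment, block_id):
    # Hypothetical RPC to a storage segment holding the latest durable
    # version of the block (stubbed out here).
    raise NotImplementedError

def read_latest_block(block_id, candidate_segments, latency_ms, hedge_after_s=0.01):
    # Prefer the segment with the lowest measured latency.
    ordered = sorted(candidate_segments, key=lambda s: latency_ms[s])
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(read_block, ordered[0], block_id)
        done, _ = concurrent.futures.wait([primary], timeout=hedge_after_s)
        if done:
            return primary.result()
        # Primary is slow: hedge with a second segment and take whichever
        # read returns first.
        backup = pool.submit(read_block, ordered[1], block_id)
        done, _ = concurrent.futures.wait(
            [primary, backup], return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()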

The bookkeeping that supports this is based on read views that maintain snapshot isolation using Multi-Version
Concurrency Control (MVCC). When a transaction commits, the log sequence number (LSN) of its commit
redo record is called the System Commit Number or SCN. When a read view is established we remember the
SCN of the most recent commit, and the list of transactions active as of that LSN.

“ Data blocks seen by a read request must be at or after the read view LSN and back out any transactions either active as of that LSN or started after that LSN… Snapshot isolation is straightforward in a single-node database instance by having a transaction read the last durable version of a database block and apply undo to rollback any changes.
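Here is a simplified sketch of that visibility rule; the structures and names are illustrative rather than Aurora’s internal ones:

from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class ReadView:
    scn: int                                             # LSN of the most recent commit when the view was taken
    active_txns: Set[int] = field(default_factory=set)   # transactions in flight as of that LSN

def change_is_visible(view: ReadView, txn_id: int, commit_lsn: Optional[int]) -> bool:
    # A change is visible only if its transaction had committed at or before
    # the view's SCN and was not still active when the view was taken;
    # anything else must be backed out by applying undo.
    if txn_id in view.active_txns:
        return False          # active as of the view: roll back via undo
    if commit_lsn is None or commit_lsn > view.scn:
        return False          # uncommitted, or committed after the view
    return True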

Storage consistency points

“ Aurora is able to avoid much of the work of consensus by recognizing that, during normal forward processing of a system, there are local oases of consistency. Using backward chaining of redo records, a storage node can tell if it is missing data and gossip with its peers to fill in gaps. Using the advancement of segment chains, a database instance can determine whether it can advance durable points and reply to clients requesting commits. Coordination and consensus is rarely required…

Recall that the only writes which cross the network from the database instance to the storage node are log redo
records. Redo log application code is run within the storage nodes to materialize blocks in the background or
on-demand to satisfy read requests.
Log records form logical chains. Each log record stores the LSN of the previous log record in the volume, the
previous LSN for the segment, and the previous LSN for the block being modified.

- The block chain is used to materialise individual blocks on demand.
- The segment chain is used by each storage node to identify records it has not received and fill those holes via gossip.
- The full log chain provides a fallback path to regenerate storage volume metadata in case of a “disastrous loss of metadata state.”
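A sketch of the back-pointers each redo record carries, and of how a storage node might walk its segment chain to find a hole to fill via gossip (field and function names are illustrative):

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RedoLogRecord:
    lsn: int
    block_id: int
    prev_volume_lsn: Optional[int]   # previous log record anywhere in the volume
    prev_segment_lsn: Optional[int]  # previous log record on this segment
    prev_block_lsn: Optional[int]    # previous log record touching this block
    payload: bytes = b""

def first_gap_in_segment_chain(records: Dict[int, RedoLogRecord],
                               latest_lsn: Optional[int]) -> Optional[int]:
    # Walk the segment chain backwards from the most recent record; the
    # first LSN we cannot find locally is a hole to request from peers.
    lsn = latest_lsn
    while lsn is not None:
        rec = records.get(lsn)
        if rec is None:
            return lsn               # missing record: fill via gossip
        lsn = rec.prev_segment_lsn
    return None                      # chain is complete back to the start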

“ …all log writes, including those for commit redo log records, are sent asynchronously to storage nodes, processed asynchronously at the storage node, and asynchronously acknowledged back to the database instance.

Each storage node advances a Segment Complete LSN (SCL), representing the latest point in time for which it has received all log records. The SCL is sent as part of a write acknowledgement. Once the database instance has seen the SCL advance at 4/6 members of a protection group, it advances the Protection Group Complete LSN (PGCL) – the point at which the protection group has made all writes durable. In the paper’s example figure, the PGCL for PG1 is 103 (because 105 has not met quorum), and the PGCL for PG2 is 104.

The database advances a Volume Complete LSN (VCL) once there are no pending writes preventing the PGCL from advancing for one of its protection groups. No consensus is required to advance the SCL, PGCL, or VCL; this can all be done via local bookkeeping.

“ This is possible because storage nodes do not have a vote in determining whether to accept a write, they must do so. Locking, transaction management, deadlocks, constraints, and other conditions that influence whether an operation may proceed are all resolved at the database tier.

Commits are acknowledged by the database once the VCL has advanced beyond the System Commit Number
of the transaction’s commit redo log record.
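Putting the bookkeeping together as a simplified sketch – treating the VCL, for illustration only, as the minimum PGCL across protection groups, with SCL values invented so that the PGCLs come out as 103 and 104 as in the example above:

def protection_group_complete_lsn(segment_scls):
    # PGCL = highest LSN that at least 4 of the 6 segments in the
    # protection group have acknowledged as complete (their SCL).
    return sorted(segment_scls, reverse=True)[3]     # the 4th-highest SCL

def volume_complete_lsn(pgcl_by_group):
    # Simplification: the volume is complete up to the point every
    # protection group is complete.
    return min(pgcl_by_group.values())

def can_ack_commit(commit_lsn, vcl):
    # A commit is acknowledged once the VCL has advanced past the LSN of
    # its commit redo log record.
    return vcl >= commit_lsn

pg1_scls = [105, 103, 105, 103, 103, 103]   # LSN 105 durable on only 2 segments
pg2_scls = [104, 104, 102, 104, 104, 102]
assert protection_group_complete_lsn(pg1_scls) == 103
assert protection_group_complete_lsn(pg2_scls) == 104
assert volume_complete_lsn({"PG1": 103, "PG2": 104}) == 103
assert can_ack_commit(commit_lsn=103, vcl=103) and not can_ack_commit(105, 103)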
Comments

mhannum72 (March 29, 2019, 9:59 pm): do you only serve pages less than the PGCL?

mhannum72 (March 29, 2019, 10:12 pm): How do you prevent a node from serving a page that isn’t durable? Without a read quorum you could only do that by serving pages less than the PGCL… the node serving the page has to know what that is.
