
the morning paper

an interesting/influential/important paper from the world of CS every weekday morning, as selected by Adrian Colyer

Amazon Aurora: on avoiding distributed consensus for I/Os, commits, and membership changes

MARCH 27, 2019
tags: Datastores, Distributed Systems

Amazon Aurora: on avoiding distributed consensus for I/Os, commits, and membership changes (https://dl.acm.org/citation.cfm?id=3183713.3196937), Verbitski et al., SIGMOD’18

This is a follow-up to the paper we looked at earlier this week on the design of Amazon Aurora (https://blog.acolyer.org/2019/03/25/amazon-aurora-design-considerations-for-high-throughput-cloud-native-relational-databases/). I’m going to assume a level of background knowledge from that work and skip over the parts of this paper that recap those key points. What is new and interesting here are the details of how quorum membership changes are dealt with, the notion of heterogeneous quorum set members, and more detail on the use of consistency points and the redo log.

Changing quorum membership

“ Managing quorum failures is complex. Traditional mechanisms cause I/O stalls while membership is being changed.

As you may recall though, Aurora is designed for a world with a constant background level of failure. So once a quorum member is suspected faulty we don’t want to have to wait to see if it comes back, but nor do we want to throw away the benefits of all the state already present on a node that might in fact come back quite quickly.
Aurora’s membership change protocol is designed to support continued processing during the change, to
tolerate additional failures while changing membership, and to allow member re-introduction if a suspected
faulty member recovers.

Each membership change is made via at least two transitions. Say we start out with a protection group with six
segment members, A-F. A write quorum is any 4 of 6, and a read quorum is any 3 of 6. We miss some
heartbeats for F and suspect it of being faulty.

Move one is to increment the membership epoch and introduce a new node G. All read and write requests, and any gossip messages, carry the epoch number. Any request with a stale epoch number will be rejected.
Making the epoch change requires a write quorum, just like any other write. The new membership epoch
established through this process now requires a write set to be any four of ABCDEF and any four of ABCDEG.
Notice that whether we ultimately choose to reinstate F, or we stick with G, we have valid quorums at all points
under both paths. For a read set we need any 3 of ABCDEF or any 3 of ABCDEG.

You can probably see where this is going. If F comes back before G has finished hydrating from its peers, then we make a second membership transition back to the ABCDEF formation. If it doesn’t, we can make
a transition to the ABCDEG formation.

Additional failures are handled in a similar manner. Suppose we’re in the transition state with a write quorum of
(4/6 of ABCDEF) AND (4/6 of ABCDEG) and wouldn’t you know it, now there’s a problem with E! Meet H.
We can transition to a write quorum that is (4/6 of ABCDEF and 4/6 of ABCDEG) AND (4/6 of ABCDHF and
4/6 of ABCDHG). Note that even here, simply writing to ABCD fulfils all four conditions.
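To make the arithmetic concrete, here is a minimal sketch (illustrative Python, not Aurora’s code) that checks a set of write acknowledgements against the transitional quorums just described:

def satisfies(write_acks, members, threshold=4):
    # True if at least `threshold` members of the given membership acknowledged the write.
    return len(write_acks & members) >= threshold

# Transition one: F is suspected faulty and G is being hydrated.
first_move = [set("ABCDEF"), set("ABCDEG")]

# A further failure of E during the transition introduces H as well.
second_move = [set("ABCDEF"), set("ABCDEG"), set("ABCDHF"), set("ABCDHG")]

def write_is_durable(write_acks, candidate_memberships):
    # During a membership change a write must reach a write quorum in every
    # candidate membership, so it stays durable whichever way the change resolves.
    return all(satisfies(write_acks, m) for m in candidate_memberships)

# Acknowledgements from A, B, C and D alone satisfy all four conditions.
assert write_is_durable(set("ABCD"), second_move)
assert not write_is_durable(set("ABEF"), first_move)   # only 3 of ABCDEG

Because ABCD appears in every candidate membership, writes keep flowing throughout the transition, whichever way it resolves.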

Heterogeneous quorums

“ Quorums are generally thought of as a collection of like members, grouped together to transparently handle failures. However, there is nothing in the quorum model to prevent unlike members with differing latency, cost, or durability characteristics.

Aurora exploits this to set up protection groups with three full segments, which store both redo log records and
materialized data blocks, and three tail segments which store just the redo logs. Since data blocks typically take
more space than redo logs the costs stay closer to traditional 3x replication than 6x.

With the split into full and tail segments a write quorum becomes any 4/6 segments, or all 3 full segments. (“In
practice, we write log records to the same 4/6 quorum as we did previously. At least one of these log records
arrives at a full segment and generates a data block”). A read quorum becomes 3/6 segments, to include at least
one full segment. In practice though, data is read directly from a full segment avoiding the need for a quorum
read – an optimisation we’ll look at next.
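Those two rules can be sketched as follows, with illustrative segment names (F1–F3 for the full segments, T1–T3 for the tail segments):

FULL = {"F1", "F2", "F3"}   # store redo log records and materialized data blocks
TAIL = {"T1", "T2", "T3"}   # store redo log records only

def is_write_quorum(acked):
    # Any 4 of the 6 segments, or all 3 full segments.
    return len(acked) >= 4 or FULL <= acked

def is_read_quorum(segments):
    # Any 3 of the 6 segments, provided at least one is a full segment.
    return len(segments) >= 3 and bool(segments & FULL)

assert is_write_quorum({"F1", "F2", "F3"})           # all full segments suffice
assert is_write_quorum({"F1", "T1", "T2", "T3"})     # as does any 4 of 6
assert not is_read_quorum({"T1", "T2", "T3"})        # tail-only: no data blocks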
“ There are many options available once one moves to quorum sets of unlike members. One can combine local disks to reduce latency and remote disks for durability and availability. One can combine SSDs for performance and HDDs for cost. One can span quorums across regions to improve disaster recovery. There are numerous moving parts that one needs to get right, but the payoffs can be significant. For Aurora, the quorum set model described earlier lets us achieve storage prices comparable to low-cost alternatives, while providing high durability, availability, and performance.

Efficient reads

So we spent the previous paper and much of this one nodding along with read quorums that must overlap with
write quorums, understanding the 4/6 and 3/6 requirements and so on, only to uncover the bombshell that
Aurora doesn’t actually use read quorums in practice at all! What!? (Ok, it does use read quorums, but only
during recovery).

The thing is there can be a lot of reads, and there’s an I/O amplification effect as a function of the size of the
read quorum. Whereas with write amplification we’re sending compact redo log records, with reading we’re
looking at full data blocks too. So Aurora avoids quorum reads.

“ Aurora does not do quorum reads. Through its bookkeeping of writes and consistency points, the database instance knows which segments have the last durable version of a data block and can request it directly from any of those segments… The database will usually issue a request to the segment with the lowest measured latency, but occasionally also query one of the others in parallel to ensure up to date read latency response times.

If a read is taking a long time, Aurora will issue a read to another storage node and go with whichever node
returns first.
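Roughly, the read path looks like the sketch below; read_block, the per-segment latency map, and the hedging threshold are hypothetical stand-ins, not Aurora’s actual API:

import concurrent.futures

def read_block(segment, block_id):
    # Hypothetical RPC to a storage segment holding the latest durable
    # version of the block (stubbed out here).
    raise NotImplementedError

def read_latest_block(block_id, candidate_segments, latency_ms, hedge_after_s=0.01):
    # Prefer the segment with the lowest measured latency.
    ordered = sorted(candidate_segments, key=lambda s: latency_ms[s])
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(read_block, ordered[0], block_id)
        done, _ = concurrent.futures.wait([primary], timeout=hedge_after_s)
        if done:
            return primary.result()
        # Primary is slow: hedge with a second segment and take whichever
        # read returns first.
        backup = pool.submit(read_block, ordered[1], block_id)
        done, _ = concurrent.futures.wait(
            [primary, backup], return_when=concurrent.futures.FIRST_COMPLETED)
        return done.pop().result()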

The bookkeeping that supports this is based on read views that maintain snapshot isolation using Multi-Version
Concurrency Control (MVCC). When a transaction commits, the log sequence number (LSN) of its commit
redo record is called the System Commit Number or SCN. When a read view is established we remember the
SCN of the most recent commit, and the list of transactions active as of that LSN.

“ Data blocks seen by a read request must be at or after the read view LSN and back out any transactions either active as of that LSN or started after that LSN… Snapshot isolation is straightforward in a single-node database instance by having a transaction read the last durable version of a database block and apply undo to rollback any changes.
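Here is a simplified sketch of that visibility rule; the structures and names are illustrative rather than Aurora’s internal ones:

from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class ReadView:
    scn: int                                             # LSN of the most recent commit when the view was taken
    active_txns: Set[int] = field(default_factory=set)   # transactions in flight as of that LSN

def change_is_visible(view: ReadView, txn_id: int, commit_lsn: Optional[int]) -> bool:
    # A change is visible only if its transaction had committed at or before
    # the view's SCN and was not still active when the view was taken;
    # anything else must be backed out by applying undo.
    if txn_id in view.active_txns:
        return False          # active as of the view: roll back via undo
    if commit_lsn is None or commit_lsn > view.scn:
        return False          # uncommitted, or committed after the view
    return True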

Storage consistency points

“ Aurora is able to avoid much of the work of consensus by recognizing that, during normal forward processing of a system, there are local oases of consistency. Using backward chaining of redo records, a storage node can tell if it is missing data and gossip with its peers to fill in gaps. Using the advancement of segment chains, a database instance can determine whether it can advance durable points and reply to clients requesting commits. Coordination and consensus is rarely required…

Recall that the only writes which cross the network from the database instance to the storage node are log redo
records. Redo log application code is run within the storage nodes to materialize blocks in the background or
on-demand to satisfy read requests.
Log records form logical chains. Each log record stores the LSN of the previous log record in the volume, the
previous LSN for the segment, and the previous LSN for the block being modified.

- The block chain is used to materialise individual blocks on demand.
- The segment chain is used by each storage node to identify records it has not received and fill those holes via gossip.
- The full log chain provides a fallback path to regenerate storage volume metadata in case of a “disastrous loss of metadata state.”
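A sketch of the back-pointers each redo record carries, and of how a storage node might walk its segment chain to find a hole to fill via gossip (field and function names are illustrative):

from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class RedoLogRecord:
    lsn: int
    block_id: int
    prev_volume_lsn: Optional[int]   # previous log record anywhere in the volume
    prev_segment_lsn: Optional[int]  # previous log record on this segment
    prev_block_lsn: Optional[int]    # previous log record touching this block
    payload: bytes = b""

def first_gap_in_segment_chain(records: Dict[int, RedoLogRecord],
                               latest_lsn: Optional[int]) -> Optional[int]:
    # Walk the segment chain backwards from the most recent record; the
    # first LSN we cannot find locally is a hole to request from peers.
    lsn = latest_lsn
    while lsn is not None:
        rec = records.get(lsn)
        if rec is None:
            return lsn               # missing record: fill via gossip
        lsn = rec.prev_segment_lsn
    return None                      # chain is complete back to the start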

“ …all log writes, including those for commit redo log records, are sent asynchronously to storage nodes, processed asynchronously at the storage node, and asynchronously acknowledged back to the database instance.

Each storage node advances a Segment Complete LSN (SCL), representing the latest point in time for which it has received all log records. The SCL is sent as part of a write acknowledgement. Once the database instance has seen the SCL advance at 4/6 members of a protection group, it advances the Protection Group Complete LSN (PGCL) – the point at which the protection group has made all writes durable. In the paper’s example figure, the PGCL for PG1 is 103 (because 105 has not met quorum), and the PGCL for PG2 is 104.

The database advances a Volume Complete LSN (VCL) once there are no pending writes preventing the PGCL from advancing for one of its protection groups. No consensus is required to advance the SCL, PGCL, or VCL; this can all be done via local bookkeeping.

“ This is possible because storage nodes do not have a vote in determining whether to accept a write, they must do so. Locking, transaction management, deadlocks, constraints, and other conditions that influence whether an operation may proceed are all resolved at the database tier.

Commits are acknowledged by the database once the VCL has advanced beyond the System Commit Number
of the transaction’s commit redo log record.
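Putting the bookkeeping together as a simplified sketch – treating the VCL, for illustration only, as the minimum PGCL across protection groups, with SCL values invented so that the PGCLs come out as 103 and 104 as in the example above:

def protection_group_complete_lsn(segment_scls):
    # PGCL = highest LSN that at least 4 of the 6 segments in the
    # protection group have acknowledged as complete (their SCL).
    return sorted(segment_scls, reverse=True)[3]     # the 4th-highest SCL

def volume_complete_lsn(pgcl_by_group):
    # Simplification: the volume is complete up to the point every
    # protection group is complete.
    return min(pgcl_by_group.values())

def can_ack_commit(commit_lsn, vcl):
    # A commit is acknowledged once the VCL has advanced past the LSN of
    # its commit redo log record.
    return vcl >= commit_lsn

pg1_scls = [105, 103, 105, 103, 103, 103]   # LSN 105 durable on only 2 segments
pg2_scls = [104, 104, 102, 104, 104, 102]
assert protection_group_complete_lsn(pg1_scls) == 103
assert protection_group_complete_lsn(pg2_scls) == 104
assert volume_complete_lsn({"PG1": 103, "PG2": 104}) == 103
assert can_ack_commit(commit_lsn=103, vcl=103) and not can_ack_commit(105, 103)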
Comments

mhannum72 (March 29, 2019, 9:59 pm): do you only serve pages less than the PGCL?

mhannum72 (March 29, 2019, 10:12 pm): How do you prevent a node from serving a page that isn’t durable? Without a read quorum you could only do that by serving pages less than the PGCL… the node serving the page has to know what that is.
