Distributed Systems
book.mixu.net/distsys/ebook.html
Introduction
I wanted a text that would bring together the ideas behind many of the more recent
distributed systems - systems such as Amazon's Dynamo, Google's BigTable and
MapReduce, Apache's Hadoop and so on.
In this text I've tried to provide a more accessible introduction to distributed systems.
To me, that means two things: introducing the key concepts that you will need in order
to have a good time reading more serious texts, and providing a narrative that covers
things in enough detail that you get a gist of what's going on without getting stuck on
details. It's 2013, you've got the Internet, and you can selectively read more about the
topics you find most interesting.
In other words, the core of distributed programming is dealing with distance (duh!)
and having more than one thing (duh!). These constraints define a space of
possible system designs, and my hope is that after reading this you'll have a better
sense of how distance, time and consistency models interact.
This text is focused on distributed programming and systems concepts you'll need to
understand commercial systems in the data center. It would be madness to attempt to
cover everything. You'll learn many key protocols and algorithms (covering, for
example, many of the most cited papers in the discipline), including some new exciting
ways to look at eventual consistency that haven't yet made it into college textbooks -
such as CRDTs and the CALM theorem.
I hope you like it! If you want to say thanks, follow me on Github (or Twitter). And if you
spot an error, file a pull request on Github.
1. Basics
The first chapter covers distributed systems at a high level by introducing a number of
important terms and concepts. It covers high level goals, such as scalability, availability,
performance, latency and fault tolerance; how those are hard to achieve, and how
abstractions and models as well as partitioning and replication come into play.
2. Up and down the level of abstraction
The second chapter dives deeper into abstractions and impossibility results. It starts
with a Nietzsche quote, and then introduces system models and the many
assumptions that are made in a typical system model. It then discusses the CAP
theorem and summarizes the FLP impossibility result. It then turns to the implications
of the CAP theorem, one of which is that one ought to explore other consistency
models. A number of consistency models are then discussed.
Appendix
The appendix covers recommendations for further reading.
There are two basic tasks that any computer system needs to accomplish:
storage and
computation
Distributed programming is the art of solving the same problem that you can solve on
a single computer using multiple computers - usually, because the problem no longer
fits on a single computer.
Nothing really demands that you use distributed systems. Given infinite money and
infinite R&D time, we wouldn't need distributed systems. All computation and storage
could be done on a magic box - a single, incredibly fast and incredibly reliable system
that you pay someone else to design for you.
However, few people have infinite resources. Hence, they have to find the right place
on some real-world cost-benefit curve. At a small scale, upgrading hardware is a viable
strategy. However, as problem sizes increase you will reach a point where either the
hardware upgrade that allows you to solve the problem on a single node does not
exist, or becomes cost-prohibitive. At that point, I welcome you to the world of
distributed systems.
It is a current reality that the best value is in mid-range, commodity hardware - as long
as the maintenance costs can be kept down through fault-tolerant software.
Computations primarily benefit from high-end hardware to the extent to which they
can replace slow network accesses with internal memory accesses. The performance
advantage of high-end hardware is limited in tasks that require large amounts of
communication between nodes.
As the figure above from Barroso, Clidaras & Hölzle shows, the performance gap
between high-end and commodity hardware decreases with cluster size assuming a
uniform memory access pattern across all nodes.
Ideally, adding a new machine would increase the performance and capacity of the
system linearly. But of course this is not possible, because there is some overhead that
arises due to having separate computers. Data needs to be copied around,
computation tasks have to be coordinated and so on. This is why it's worthwhile to
study distributed algorithms - they provide efficient solutions to specific problems, as
well as guidance about what is possible, what the minimum cost of a correct
implementation is, and what is impossible.
The focus of this text is on distributed programming and systems in a mundane, but
commercially relevant setting: the data center. For example, I will not discuss
specialized problems that arise from having an exotic network configuration, or that
arise in a shared-memory setting. Additionally, the focus is on exploring the system
design space rather than on optimizing any specific design - the latter is a topic for a
much more specialized text.
Most things are trivial at a small scale - and the same problem becomes much harder
once you surpass a certain size, volume or other physically constrained thing. It's easy
to lift a piece of chocolate, it's hard to lift a mountain. It's easy to count how many
people are in a room, and hard to count how many people are in a country.
Scalability
is the ability of a system, network, or process, to handle a growing amount of work in a
capable manner or its ability to be enlarged to accommodate that growth.
What is it that is growing? Well, you can measure growth in almost any terms (number
of people, electricity usage etc.). But there are three particularly interesting things to
look at:
Size scalability: adding more nodes should make the system linearly faster;
growing the dataset should not increase latency
Geographic scalability: it should be possible to use multiple data centers to
reduce the time it takes to respond to user queries, while dealing with cross-data
center latency in some sensible manner.
Administrative scalability: adding more nodes should not increase the
administrative costs of the system (e.g. the administrators-to-machines ratio).
A scalable system is one that continues to meet the needs of its users as scale
increases. There are two particularly relevant aspects - performance and availability -
which can be measured in various ways.
Performance (and latency)
Performance
is characterized by the amount of useful work accomplished by a computer system
compared to the time and resources used.
Depending on the context, this may involve achieving one or more of the following:
short response time / low latency for a given piece of work
high throughput (rate of processing work)
low utilization of computing resources
There are tradeoffs involved in optimizing for any of these outcomes. For example, a
system may achieve a higher throughput by processing larger batches of work thereby
reducing operation overhead. The tradeoff would be longer response times for
individual pieces of work due to batching.
I find that low latency - achieving a short response time - is the most interesting aspect
of performance, because it has a strong connection with physical (rather than
financial) limitations. It is harder to address latency using financial resources than the
other aspects of performance.
There are a lot of really specific definitions for latency, but I really like the idea that the
etymology of the word evokes:
Latency
The state of being latent; delay, a period between the initiation of something and the
occurrence.
Latent
From Latin latens, latentis, present participle of lateo ("lie hidden"). Existing or present
but concealed or inactive.
This definition is pretty cool, because it highlights how latency is really the time
between when something happened and the time it has an impact or becomes visible.
For example, imagine that you are infected with an airborne virus that turns people
into zombies. The latent period is the time between when you became infected, and
when you turn into a zombie. That's latency: the time during which something that has
already happened is concealed from view.
Let's assume for a moment that our distributed system does just one high-level task:
given a query, it takes all of the data in the system and calculates a single result. In
other words, think of a distributed system as a data store with the ability to run a single
deterministic computation (function) over its current content:
Then, what matters for latency is not the amount of old data, but rather the speed at
which new data "takes effect" in the system. For example, latency could be measured
in terms of how long it takes for a write to become visible to readers.
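To make that concrete, here is a toy sketch (my own, not from the original text) that measures this kind of latency, assuming hypothetical write(key, value) and read(key) functions that return promises against the system:

function writeVisibilityLatency(key, value) {
  var start = Date.now();
  // poll the read path until the value we wrote becomes visible
  function poll() {
    return read(key).then(function(current) {
      return current === value ? Date.now() - start : poll();
    });
  }
  return write(key, value).then(poll);
}

The result is the number of milliseconds between issuing the write and first observing it.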
The other key point based on this definition is that if nothing happens, there is no
"latent period". A system in which data doesn't change doesn't (or shouldn't) have a
latency problem.
How much the minimum latency imposed by physics impacts your queries depends on the
nature of those queries and the physical distance the information needs to travel.
Availability
the proportion of time a system is in a functioning condition. If a user cannot access
the system, it is said to be unavailable.
Distributed systems can take a bunch of unreliable components, and build a reliable
system on top of them.
Availability from a technical perspective is mostly about being fault tolerant. Because
the probability of a failure occurring increases with the number of components, the
system should be able to compensate so as to not become less reliable as the number
of components increases.
For example, here is how availability maps to allowed downtime, using
availability = uptime / (uptime + downtime):

Availability %            Downtime allowed per year
90% ("one nine")          More than a month
99% ("two nines")         Less than 4 days
99.9% ("three nines")     Less than 9 hours
99.99% ("four nines")     Less than an hour
99.999% ("five nines")    About 5 minutes
99.9999% ("six nines")    About 31 seconds
Availability is in some sense a much wider concept than uptime, since the availability of
a service can also be affected by, say, a network outage or the company owning the
service going out of business (which would be a factor which is not really relevant to
fault tolerance but would still influence the availability of the system). But without
knowing every single specific aspect of the system, the best we can do is design for
fault tolerance.
Fault tolerance
ability of a system to behave in a well-defined manner once faults occur
Fault tolerance boils down to this: define what faults you expect and then design a
system or an algorithm that is tolerant of them. You can't tolerate faults you haven't
considered.
Distributed systems are constrained by two physical factors:
the number of nodes (which increases with the required storage and
computation capacity)
the distance between nodes (information travels, at best, at the speed of light)
Beyond the tendencies that these physical constraints create lies the world of system
design options.
Both performance and availability are defined by the external guarantees the system
makes. On a high level, you can think of the guarantees as the SLA (service level
agreement) for the system: if I write data, how quickly can I access it elsewhere? After
the data is written, what guarantees do I have of durability? If I ask the system to run a
computation, how quickly will it return results? When components fail, or are taken out
of operation, what impact will this have on the system?
There is another criterion, which is not explicitly mentioned but implied: intelligibility.
How understandable are the guarantees that are made? Of course, there are no
simple metrics for what is intelligible.
I was kind of tempted to put "intelligibility" under physical limitations. After all, it is a
hardware limitation in people that we have a hard time understanding anything that
involves more moving things than we have fingers. That's the difference between an
error and an anomaly - an error is incorrect behavior, while an anomaly is unexpected
behavior. If you were smarter, you'd expect the anomalies to occur.
A good abstraction makes a system easier to understand and work with, while capturing
the factors that are relevant for a particular purpose.
There is a tension between the reality that there are many nodes and our desire for
systems that "work like a single system". Often, the most familiar model (for
example, implementing a shared memory abstraction on a distributed system) is too
expensive.
A system that makes weaker guarantees has more freedom of action, and hence
potentially greater performance - but it is also potentially hard to reason about.
People are better at reasoning about systems that work like a single system, rather
than a collection of nodes.
One can often gain performance by exposing more details about the internals of the
system. For example, in columnar storage, the user can (to some extent) reason about
the locality of the key-value pairs within the system and hence make decisions that
influence the performance of typical queries. Systems which hide these kinds of details
are easier to understand (since they act more like a single unit, with fewer details to
think about), while systems that expose more real-world details may be more
performant (because they correspond more closely to reality).
Several types of failures make writing distributed systems that act like a single system
difficult. Network latency and network partitions (e.g. total network failure between
some nodes) mean that a system needs to sometimes make hard choices about
whether it is better to stay available but lose some crucial guarantees that cannot be
enforced, or to play it safe and refuse clients when these types of failures occur.
The CAP theorem - which I will discuss in the next chapter - captures some of these
tensions. In the end, the ideal system meets both programmer needs (clean semantics)
and business needs (availability/consistency/latency).
There are two basic techniques that can be applied to a data set. It can be split over
multiple nodes (partitioning) to allow for more parallel processing. It can also be copied
or cached on different nodes to reduce the distance between the client and the server
and for greater fault tolerance (replication).
The picture below illustrates the difference between these two: partitioned data (A and
B below) is divided into independent sets, while replicated data (C below) is copied to
multiple locations.
This is the one-two punch for solving any problem where distributed computing plays
a role. Of course, the trick is in picking the right technique for your concrete
implementation; there are many algorithms that implement replication and
partitioning, each with different limitations and advantages which need to be assessed
against your design objectives.
Partitioning
Partitioning is dividing the dataset into smaller distinct independent sets; this is used
to reduce the impact of dataset growth since each partition is a subset of the data.
Partitioning is mostly about defining your partitions based on what you think the
primary access pattern will be, and dealing with the limitations that come from having
independent partitions (e.g. inefficient access across partitions, different rate of growth
etc.).
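As a rough illustration (the function and names below are hypothetical, not from the text), hash partitioning keyed on the primary access pattern might look like this:

// Route a key to one of partitionCount partitions; all lookups by this key
// will land on the same partition.
function partitionFor(key, partitionCount) {
  var hash = 0;
  for (var i = 0; i < key.length; i++) {
    hash = (hash * 31 + key.charCodeAt(i)) | 0; // simple 32-bit string hash
  }
  return Math.abs(hash) % partitionCount;
}
partitionFor('user:42', 4); // always the same partition index in 0..3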
Replication
Replication is making copies of the same data on multiple machines; this allows more
servers to take part in the computation.
Replication is about providing extra bandwidth, and caching where it counts. It is also
about maintaining consistency in some way according to some consistency model.
Replication is also the source of many of the problems, since there are now
independent copies of the data that have to be kept in sync on multiple machines - this
means ensuring that the replication follows a consistency model.
The choice of a consistency model is crucial: a good consistency model provides clean
semantics for programmers (in other words, the properties it guarantees are easy to
reason about) and meets business/design goals such as high availability or strong
consistency.
Only one consistency model for replication - strong consistency - allows you to
program as-if the underlying data was not replicated. Other consistency models
expose some internals of the replication to the programmer. However, weaker
consistency models can provide lower latency and higher availability - and are not
necessarily harder to understand, just different.
2. Up and down the level of abstraction
If you've done any programming, the idea of levels of abstraction is probably familiar
to you. You'll always work at some level of abstraction, interface with a lower level
layer through some API, and probably provide some higher-level API or user interface
to your users. The seven-layer OSI model of computer networking is a good example of
this.
Distributed programming is, I'd assert, in large part dealing with consequences of
distribution (duh!). That is, there is a tension between the reality that there are many
nodes and our desire for systems that "work like a single system". That means
finding a good abstraction that balances what is possible with what is understandable
and performant.
What do we mean when we say X is more abstract than Y? First, that X does not introduce
anything new or fundamentally different from Y. In fact, X may remove some aspects
of Y or present them in a way that makes them more manageable. Second, that X is in
some sense easier to grasp than Y, assuming that the things that X removed from Y
are not important to the matter at hand.
As Nietzsche wrote:
Every concept originates through our equating what is unequal. No leaf ever wholly equals
another, and the concept "leaf" is formed through an arbitrary abstraction from these
individual differences, through forgetting the distinctions; and now it gives rise to the idea that
in nature there might be something besides the leaves which would be "leaf" - some kind of
original form after which all leaves have been woven, marked, copied, colored, curled, and
painted, but by unskilled hands, so that no copy turned out to be a correct, reliable, and faithful
image of the original form.
Abstractions, fundamentally, are fake. Every situation is unique, as is every node. But
abstractions make the world manageable: simpler problem statements - free of reality
- are much more analytically tractable and provided that we did not ignore anything
essential, the solutions are widely applicable.
Indeed, if the things that we kept around are essential, then the results we can derive
will be widely applicable. This is why impossibility results are so important: they take
the simplest possible formulation of a problem, and demonstrate that it is impossible
to solve within some set of constraints or assumptions.
All abstractions ignore something in favor of equating things that are in reality unique.
The trick is to get rid of everything that is not essential. How do you know what is
essential? Well, you probably won't know a priori.
Every time we exclude some aspect of a system from our specification of the system,
we risk introducing a source of error and/or a performance issue. That's why
sometimes we need to go in the other direction, and selectively introduce some
aspects of real hardware and the real-world problem back. It may be sufficient to
reintroduce some specific hardware characteristics (e.g. physical sequentiality) or
other physical characteristics to get a system that performs well enough.
With this in mind, what is the least amount of reality we can keep around while still
working with something that is still recognizable as a distributed system? A system
model is a specification of the characteristics we consider important; having specified
one, we can then take a look at some impossibility results and challenges.
A system model
A key property of distributed systems is distribution. More specifically, programs in a
distributed system:
run concurrently on independent nodes
are connected by a network that may introduce nondeterminism and message loss
and have no shared memory or shared clock
System model
a set of assumptions about the environment and facilities on which a distributed
system is implemented
System models vary in their assumptions about the environment and facilities. These
assumptions include:
what capabilities the nodes have and how they may fail
how communication links operate and how they may fail and
properties of the overall system, such as assumptions about time and order
A robust system model is one that makes the weakest assumptions: any algorithm
written for such a system is very tolerant of different environments, since it makes very
few and very weak assumptions.
On the other hand, we can create a system model that is easy to reason about by
making strong assumptions. For example, assuming that nodes do not fail means that
our algorithm does not need to handle node failures. However, such a system model is
unrealistic and hence hard to apply in practice.
Let's look at the properties of nodes, links and time and order in more detail.
Nodes execute deterministic algorithms: the local computation, the local state after
the computation, and the messages sent are determined uniquely by the message
received and local state when the message was received.
There are many possible failure models which describe the ways in which nodes can
fail. In practice, most systems assume a crash-recovery failure model: that is, nodes
can only fail by crashing, and can (possibly) recover after crashing at some later point.
Another alternative is to assume that nodes can fail by misbehaving in any arbitrary
way. This is known as Byzantine fault tolerance. Byzantine faults are rarely handled in
real world commercial systems, because algorithms resilient to arbitrary faults are
more expensive to run and more complex to implement. I will not discuss them here.
Some algorithms assume that the network is reliable: that messages are never lost
and never delayed indefinitely. This may be a reasonable assumption for some real-
world settings, but in general it is preferable to consider the network to be unreliable
and subject to message loss and delays.
A network partition occurs when the network fails while the nodes themselves remain
operational. When this occurs, messages may be lost or delayed until the network
partition is repaired. Partitioned nodes may be accessible by some clients, and so must
be treated differently from crashed nodes. The diagram below illustrates a node
failure vs. a network partition:
Timing assumptions are a convenient shorthand for capturing assumptions about the
extent to which we take this reality into account. The two main alternatives are:
Synchronous system model: processes execute in lock-step; there is a known upper
bound on message transmission delay; each process has an accurate clock
Asynchronous system model: no timing assumptions - processes execute at
independent rates; there is no bound on message transmission delay; useful clocks
do not exist
The synchronous system model imposes many constraints on time and order. It
essentially assumes that the nodes have the same experience: that messages that are
sent are always received within a particular maximum transmission delay, and that
processes execute in lock-step. This is convenient, because it allows you as the system
designer to make assumptions about time and order, while the asynchronous system
model doesn't.
Asynchronicity is a non-assumption: it just assumes that you can't rely on timing (or a
"time sensor").
Of course, assuming the synchronous system model is not particularly realistic. Real-
world networks are subject to failures and there are no hard bounds on message
delay. Real world systems are at best partially synchronous: they may occasionally
work correctly and provide some upper bounds, but there will be times where
messages are delayed indefinitely and clocks are out of sync. I won't really discuss
algorithms for synchronous systems here, but you will probably run into them in many
other introductory books because they are analytically easier (but unrealistic).
Next, we'll look at how varying two system properties:
whether or not network partitions are included in the failure model, and
synchronous vs. asynchronous timing assumptions
influence the system design choices, by discussing two impossibility results (FLP and
CAP). But first, we need a concrete problem to anchor the discussion: the consensus
problem.
Several computers (or nodes) achieve consensus if they all agree on some value. More
formally:
1. Agreement: every correct process must agree on the same value.
2. Integrity: every correct process decides at most one value, and if it decides some
value, then it must have been proposed by some process.
3. Termination: all processes eventually reach a decision.
4. Validity: if all correct processes propose the same value V, then all correct
processes decide V.
The consensus problem is at the core of many commercial distributed systems. After
all, we want the reliability and performance of a distributed system without having to
deal with the consequences of distribution (e.g. disagreements / divergence between
nodes), and solving the consensus problem makes it possible to solve several related,
more advanced problems such as atomic broadcast and atomic commit.
Two impossibility results
The first impossibility result, known as the FLP impossibility result, is an impossibility
result that is particularly relevant to people who design distributed algorithms. The
second - the CAP theorem - is a related result that is more relevant to practitioners;
people who need to choose between different system designs but who are not directly
concerned with the design of algorithms.
Under the assumptions of the asynchronous system model, the FLP result states that "there does not exist a
(deterministic) algorithm for the consensus problem in an asynchronous system
subject to failures, even if messages can never be lost, at most one process may fail,
and it can only fail by crashing (stopping executing)".
This result means that there is no way to solve the consensus problem under a very
minimal system model in a way that cannot be delayed forever. The argument is that if
such an algorithm existed, then one could devise an execution of that algorithm in
which it would remain undecided ("bivalent") for an arbitrary amount of time by
delaying message delivery - which is allowed in the asynchronous system model. Thus,
such an algorithm cannot exist.
The CAP theorem, initially a conjecture made by computer scientist Eric Brewer, states
that of these three properties:
Consistency: all nodes see the same data at the same time.
Availability: node failures do not prevent survivors from continuing to operate.
Partition tolerance: the system continues to operate despite message loss due to
network and/or node failure
only two can be satisfied simultaneously. We can even draw this as a pretty diagram:
picking two properties out of three gives us three types of systems that correspond to
different intersections.
Note that the theorem states that the middle piece (having all three properties) is not
achievable. Then we get three different system types:
CA (consistency + availability). Examples include full strict quorum protocols, such as
two-phase commit.
CP (consistency + partition tolerance). Examples include majority quorum protocols
in which minority partitions are unavailable, such as Paxos.
AP (availability + partition tolerance). Examples include protocols using conflict
resolution, such as Dynamo.
The CA and CP system designs both offer the same consistency model: strong
consistency. The only difference is that a CA system cannot tolerate any node failures;
a CP system can tolerate up to f faults given 2f+1 nodes in a non-Byzantine failure
model (in other words, it can tolerate the failure of a minority of f nodes as long as
the majority of f+1 nodes stays up). The reason is simple:
A CA system does not distinguish between node failures and network failures,
and hence must stop accepting writes everywhere to avoid introducing
divergence (multiple copies). It cannot tell whether a remote node is down, or
whether just the network connection is down: so the only safe thing is to stop
accepting writes.
A CP system prevents divergence (e.g. maintains single-copy consistency) by
forcing asymmetric behavior on the two sides of the partition. It only keeps the
majority partition around, and requires the minority partition to become
unavailable (e.g. stop accepting writes), which retains a degree of availability (the
majority partition) and still ensures single-copy consistency.
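As a small aside (my own illustration, not from the text), the 2f+1 arithmetic mentioned above is easy to make explicit:

// With n nodes, a majority quorum needs floor(n/2) + 1 acknowledgements,
// so the system tolerates f = floor((n - 1) / 2) failed or partitioned nodes.
function majoritySize(n) {
  return Math.floor(n / 2) + 1;
}
function faultsTolerated(n) {
  return Math.floor((n - 1) / 2);
}
majoritySize(5);    // 3
faultsTolerated(5); // 2 (5 = 2f + 1 with f = 2)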
I'll discuss this in more detail in the chapter on replication when I discuss Paxos. The
important thing is that CP systems incorporate network partitions into their failure
model and distinguish between a majority partition and a minority partition using an
algorithm like Paxos, Raft or viewstamped replication. CA systems are not partition-
aware, and are historically more common: they often use the two-phase commit
algorithm and are common in traditional distributed relational databases.
Assuming that a partition occurs, the theorem reduces to a binary choice between
availability and consistency.
I think there are four conclusions that should be drawn from the CAP theorem:
First, that many system designs used in early distributed relational database systems did
not take into account partition tolerance (e.g. they were CA designs). Partition tolerance
is an important property for modern systems, since network partitions become much
more likely if the system is geographically distributed (as many large systems are).
Second, that there is a tension between strong consistency and high availability during
network partitions. The CAP theorem is an illustration of the tradeoffs that occur
between strong guarantees and distributed computation.
Strong consistency guarantees require us to give up availability during a partition. This
is because one cannot prevent divergence between two replicas that cannot
communicate with each other while continuing to accept writes on both sides of the
partition.
Third, that there is a tension between strong consistency and performance in normal
operation.
If you can live with a consistency model other than the classic one, a consistency model
that allows replicas to lag or to diverge, then you can reduce latency during normal
operation and maintain availability in the presence of partitions.
When fewer messages and fewer nodes are involved, an operation can complete
faster. But the only way to accomplish that is to relax the guarantees: let some of the
nodes be contacted less frequently, which means that nodes can contain old data.
This also makes it possible for anomalies to occur. You are no longer guaranteed to
get the most recent value. Depending on what kinds of guarantees are made, you
might read a value that is older than expected, or even lose some updates.
Fourth - and somewhat indirectly - that if we do not want to give up availability during a
network partition, then we need to explore whether consistency models other than strong
consistency are workable for our purposes.
For example, even if user data is georeplicated to two datacenters, and the link between
those datacenters is temporarily down, in many cases we'll still
want to allow the user to use the website / service. This means reconciling two
divergent sets of data later on, which is both a technical challenge and a business risk.
But often both the technical challenge and the business risk are manageable, and so it
is preferable to provide high availability.
Consistency and availability are not really binary choices, unless you limit yourself to
strong consistency. But strong consistency is just one consistency model: the one
where you, by necessity, need to give up availability in order to prevent more than a
single copy of the data from being active. As Brewer himself points out, the "2 out of 3"
interpretation is misleading.
If you take away just one idea from this discussion, let it be this: "consistency" is not a
singular, unambiguous property. Remember:
ACID consistency !=
CAP consistency !=
Oatmeal consistency
Instead, a consistency model is a guarantee - any guarantee - that a data store gives to
programs that use it.
Consistency model
a contract between programmer and system, wherein the system guarantees that if
the programmer follows some specific rules, the results of operations on the data
store will be predictable
The "C" in CAP is "strong consistency", but "consistency" is not a synonym for "strong
consistency".
Strong consistency models guarantee that the apparent order and visibility of updates
is equivalent to a non-replicated system. Weak consistency models, on the other hand,
do not make such guarantees. The models discussed here fall roughly into two groups:
Strong consistency models (capable of maintaining a single copy): linearizable
consistency; sequential consistency
Weak consistency models (not strong): client-centric consistency models; causal
consistency; eventual consistency models
Note that this is by no means an exhaustive list. Again, consistency models are just
arbitrary contracts between the programmer and system, so they can be almost
anything.
Linearizable consistency: Under linearizable consistency, all operations appear to
have executed atomically in an order that is consistent with the global real-time
ordering of operations. (Herlihy & Wing, 1991)
Sequential consistency: Under sequential consistency, all operations appear to
have executed atomically in some order that is consistent with the order seen at
individual nodes and that is equal at all nodes. (Lamport, 1979)
The key difference is that linearizable consistency requires that the order in which
operations take effect is equal to the actual real-time ordering of operations.
Sequential consistency allows for operations to be reordered as long as the order
observed on each node remains consistent. The only way someone can distinguish
between the two is if they can observe all the inputs and timings going into the
system; from the perspective of a client interacting with a node, the two are equivalent.
The difference seems immaterial, but it is worth noting that sequential consistency
does not compose.
Strong consistency models allow you as a programmer to replace a single server with a
cluster of distributed nodes and not run into any problems.
All the other consistency models have anomalies (compared to a system that
guarantees strong consistency), because they behave in a way that is distinguishable
from a non-replicated system. But often these anomalies are acceptable, either
because we don't care about occasional issues or because we've written code that
deals with inconsistencies after they have occurred in some way.
Note that there really aren't any universal typologies for weak consistency models,
because "not a strong consistency model" (e.g. "is distinguishable from a non-
replicated system in some way") can be almost anything.
Client-centric consistency models involve the notion of a client or session in some way;
for example, they may guarantee that a client will never see older versions of a data
item than it has already seen. Clients may still see older versions of the data, if the
replica node they are on does not contain the latest version, but they will never see
anomalies where an older version of a value resurfaces (e.g. because they connected
to a different replica). Note that there are many kinds of consistency models that are
client-centric.
Eventual consistency
The eventual consistency model says that if you stop changing values, then after some
undefined amount of time all replicas will agree on the same value. It is implied that
before that time results between replicas are inconsistent in some undefined manner.
Since it is trivially satisfiable (liveness property only), it is useless without supplemental
information.
Saying something is merely eventually consistent is like saying "people are eventually
dead". It's a very weak constraint, and we'd probably want to have at least some more
specific characterization of two things:
First, how long is "eventually"? It would be useful to have a strict lower bound, or at
least some idea of how long it typically takes for the system to converge to the same
value.
Second, how do the replicas agree on a value? A system that always returns "42" is
eventually consistent: all replicas agree on the same value. It just doesn't converge to a
useful value since it just keeps returning the same fixed value. Instead, we'd like to
have a better idea of the method. For example, one way to decide is to have the value
with the largest timestamp always win.
So when vendors say "eventual consistency", what they mean is some more precise
term, such as "eventually last-writer-wins, and read-the-latest-observed-value in the
meantime" consistency. The "how?" matters, because a bad method can lead to writes
being lost - for example, if the clock on one node is set incorrectly and timestamps are
used.
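For illustration, here is a minimal sketch (mine, not from the text) of last-writer-wins reconciliation between two replicas, where each write carries a timestamp:

// Pick whichever replica's value has the larger timestamp. If a node's clock
// is set incorrectly, its inflated timestamps can silently discard newer
// writes - exactly the risk described above.
function lastWriterWins(a, b) {
  return (b.ts > a.ts) ? b : a;
}
lastWriterWins({ value: 'old', ts: 100 }, { value: 'new', ts: 170 }); // { value: 'new', ts: 170 }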
I will look into these two questions in more detail in the chapter on replication
methods for weak consistency models.
Further reading
Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-
Tolerant Web Services - Gilbert & Lynch, 2002
Impossibility of distributed consensus with one faulty process - Fischer, Lynch
and Paterson, 1985
Perspectives on the CAP Theorem - Gilbert & Lynch, 2012
CAP Twelve Years Later: How the "Rules" Have Changed - Brewer, 2012
Uniform consensus is harder than consensus - Charron-Bost & Schiper, 2000
Replicated Data Consistency Explained Through Baseball - Terry, 2011
Life Beyond Distributed Transactions: an Apostate's Opinion - Helland, 2007
If you have too much data, then 'good enough' is good enough - Helland, 2011
Building on Quicksand - Helland & Campbell, 2009
3. Time and order
I mean, why are we so obsessed with order in the first place? Why do we care whether
A happened before B? Why don't we care about some other property, like "color"?
Well, my crazy friend, let's go back to the definition of distributed systems to answer
that.
As you may remember, I described distributed programming as the art of solving the
same problem that you can solve on a single computer using multiple computers.
This is, in fact, at the core of the obsession with order. Any system that can only do one
thing at a time will create a total order of operations. Like people passing through a
single door, every operation will have a well-defined predecessor and successor. That's
basically the programming model that we've worked very hard to preserve.
The traditional model is: a single program, one process, one memory space running on
one CPU. The operating system abstracts away the fact that there might be multiple
CPUs and multiple programs, and that the memory on the computer is actually shared
among many programs. I'm not saying that threaded programming and event-oriented
programming don't exist; it's just that they are special abstractions on top of the
"one/one/one" model. Programs are written to be executed in an ordered fashion: you
start from the top, and then go down towards the bottom.
Order as a property has received so much attention because the easiest way to define
"correctness" is to say "it works like it would on a single machine". And that usually
means that a) we run the same operations and b) that we run them in the same order
- even if there are multiple machines.
The nice thing about distributed systems that preserve order (as defined for a single
system) is that they are generic. You don't need to care about what the operations are,
because they will be executed exactly like on a single machine. This is great because
you know that you can use the same system no matter what the operations are.
In reality, a distributed program runs on multiple nodes; with multiple CPUs and
multiple streams of operations coming in. You can still assign a total order, but it
requires either accurate clocks or some form of communication. You could timestamp
each operation using a completely accurate clock then use that to figure out the total
order. Or you might have some kind of communication system that makes it possible
to assign sequential numbers as in a total order.
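As a sketch of the second option (a hypothetical sequencer, not something from the text), a single process can hand out consecutive sequence numbers, imposing a total order at the cost of a round of communication per operation:

function Sequencer() {
  this.next = 0;
}
// Attach the next sequence number to an operation; operations are then
// totally ordered by their seq field.
Sequencer.prototype.assign = function(operation) {
  operation.seq = this.next++;
  return operation;
};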
A total order is a binary relation that defines an order for every element in some set.
Two distinct elements are comparable when one of them is greater than the other. In
a partially ordered set, some pairs of elements are not comparable and hence a partial
order doesn't specify the exact order of every item.
Both total order and partial order are transitive and antisymmetric. The following
statements hold in both a total order and a partial order for all a, b and c in X:
If a ≤ b and b ≤ a then a = b (antisymmetry)
If a ≤ b and b ≤ c then a ≤ c (transitivity)
However, a total order is also total: a ≤ b or b ≤ a for all a, b in X (totality), while a
partial order is only reflexive: a ≤ a for all a in X (reflexivity).
Note that totality implies reflexivity; so a partial order is a weaker variant of total order.
For some elements in a partial order, the totality property does not hold - in other
words, some of the elements are not comparable.
Git branches are an example of a partial order. As you probably know, the git revision
control system allows you to create multiple branches from a single base branch - e.g.
from a master branch. Each branch represents a history of source code changes
derived based on a common ancestor:
The branches A and B were derived from a common ancestor, but there is no definite
order between them: they represent different histories and cannot be reduced to a
single linear history without additional work (merging). You could, of course, put all
the commits in some arbitrary order (say, sorting them first by ancestry and then
breaking ties by sorting A before B or B before A) - but that would lose information by
forcing a total order where none existed.
In a system consisting of one node, a total order emerges by necessity: instructions are
executed and messages are processed in a specific, observable order in a single
program. We've come to rely on this total order - it makes executions of programs
predictable. This order can be maintained on a distributed system, but at a cost:
communication is expensive, and time synchronization is difficult and fragile.
What is time?
Time is a source of order - it allows us to define the order of operations - which
coincidentally also has an interpretation that people can understand (a second, a
minute, a day and so on).
In some sense, time is just like any other integer counter. It just happens to be
important enough that most computers have a dedicated time sensor, also known as a
clock. It's so important that we've figured out how to synthesize an approximation of
the same counter using some imperfect physical system (from wax candles to cesium
atoms). By "synthesize", I mean that we can approximate the value of the integer
counter in physically distant places via some physical property without communicating
it directly.
Timestamps really are a shorthand value for representing the state of the world from
the start of the universe to the current moment - if something occurred at a particular
timestamp, then it was potentially influenced by everything that happened before it.
This idea can be generalized into a causal clock that explicitly tracks causes
(dependencies) rather than simply assuming that everything that preceded a
timestamp was relevant. Of course, the usual assumption is that we should only worry
about the state of the specific system rather than the whole world.
Assuming that time progresses at the same rate everywhere - and that is a big
assumption which I'll return to in a moment - time and timestamps have several useful
interpretations when used in a program. The three interpretations are:
Order
Duration
Interpretation
Order. When I say that time is a source of order, what I mean is that:
we can attach timestamps to unordered events to order them
we can use timestamps to enforce a specific ordering of operations or the delivery
of messages (for example, by delaying an operation if it arrives out of order)
we can use the value of a timestamp to determine whether something happened
chronologically before something else
Interpretation - time as a universally comparable value. The absolute value of a
timestamp can be interpreted as a date, which is meaningful to people.
Duration - durations measured in time have some relation to the real world.
Algorithms generally don't care about the absolute value of a clock or its interpretation
as a date, but they might use durations to make some judgment calls. In particular, the
amount of time spent waiting can provide clues about whether a system is partitioned
or merely experiencing high latency.
Imposing (or assuming) order is one way to reduce the space of possible executions
and possible occurrences. Humans have a hard time reasoning about things when
things can happen in any order - there just are too many permutations to consider.
Does time progress at the same rate everywhere?
We all have an intuitive concept of time based on our own experience as individuals.
Unfortunately, that intuitive notion of time makes it easier to picture total order rather
than partial order. It's easier to picture a sequence in which things happen one after
another, rather than concurrently. It is easier to reason about a single order of
messages than to reason about messages arriving in different orders and with
different delays.
There are three common answers to the question "does time progress at the same
rate everywhere?". These are:
"Global clock": yes
"Local clock": yes, but
"No clock": no!
These correspond roughly to the three timing assumptions that I mentioned in the
second chapter: the synchronous system model has a global clock, the partially
synchronous model has a local clock, and in the asynchronous system model one
cannot use clocks at all. Let's look at these in more detail.
The global clock is basically a source of total order: it gives the exact order of every
operation on all nodes, even if those nodes have never communicated. In reality,
however, clocks can only be synchronized to a limited degree of accuracy.
Nevertheless, there are some real-world systems that make this assumption.
Facebook's Cassandra is an example of a system that assumes clocks are
synchronized. It uses timestamps to resolve conflicts between writes - the write with
the newer timestamp wins. This means that if clocks drift, new data may be ignored or
overwritten by old data; again, this is an operational challenge (and from what I've
heard, one that people are acutely aware of). Another interesting example is Google's
Spanner: the paper describes their TrueTime API, which synchronizes time but also
estimates worst-case clock drift.
The local clock assumption corresponds more closely to the real world. It assigns a
partial order: events on each system are ordered, but events cannot be ordered across
systems by only using a clock.
The third option is to not use clocks at all, and to rely on logical counters and
communication instead. This way, we can determine the order of events between
different machines, but cannot say anything about intervals and cannot use timeouts
(since we assume that there is no "time sensor"). This is a partial order: events can be
ordered on a single system using a counter and no communication, but ordering
events across systems requires a message exchange.
One of the most cited papers in distributed systems is Lamport's paper on time, clocks
and the ordering of events. Vector clocks, a generalization of that concept (which I will
cover in more detail), are a way to track causality without using clocks. Cassandra's
cousins Riak (Basho) and Voldemort (Linkedin) use vector clocks rather than assuming
that nodes have access to a global clock of perfect accuracy. This allows those systems
to avoid the clock accuracy issues mentioned earlier.
When clocks are not used, the maximum precision at which events can be ordered
across distant machines is bound by communication latency.
A global clock would allow operations on two different machines to be ordered without
the two machines communicating directly. Without a global clock, we need to
communicate in order to determine order.
Time can also be used to define boundary conditions for algorithms - specifically, to
distinguish between "high latency" and "server or network link is down". This is a very
important use case; in most real-world systems timeouts are used to determine
whether a remote machine has failed, or whether it is simply experiencing high
network latency. Algorithms that make this determination are called failure detectors;
and I will discuss them fairly soon.
Lamport clocks and vector clocks are replacements for physical clocks which rely on
counters and communication to determine the order of events across a distributed
system. These clocks provide a counter that is comparable across different nodes.
A Lamport clock is simple. Each process maintains a counter using the following rules:
Whenever a process does work, increment the counter
Whenever a process sends a message, include the counter
When a message is received, set the counter to max(local_counter, received_counter) + 1
Expressed as code:
function LamportClock() {
  this.value = 1;
}

// read the current value of the clock
LamportClock.prototype.get = function() {
  return this.value;
};

// tick the clock when the process does local work or sends a message
LamportClock.prototype.increment = function() {
  this.value++;
};

// on receiving a message, jump past the highest timestamp seen so far
LamportClock.prototype.merge = function(other) {
  this.value = Math.max(this.value, other.value) + 1;
};
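A minimal usage sketch of the LamportClock above (the message passing here is simulated, not part of the original example):

var a = new LamportClock();
var b = new LamportClock();

a.increment();                  // process A does some local work: a.get() == 2
var message = { ts: a.get() };  // A includes its timestamp when sending

b.increment();                  // B does unrelated local work: b.get() == 2
b.merge({ value: message.ts }); // B receives A's message: b.get() == max(2, 2) + 1 == 3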
This is known as the clock consistency condition: if one event comes before another,
then that event's logical clock comes before the other's. If a and b are from the same
causal history - e.g. either both timestamp values were produced on the same process,
or b is a response to the message sent in a - then we know that a happened before
b.
Intuitively, this is because a Lamport clock can only carry information about one
timeline / history; hence, comparing Lamport timestamps from systems that never
communicate with each other may cause concurrent events to appear to be ordered
when they are not.
Imagine a system that after an initial period divides into two independent subsystems
which never communicate with each other.
For all events in each independent system, if a happened before b, then ts(a) <
ts(b) ; but if you take two events from the different independent systems (e.g. events
that are not causally related) then you cannot say anything meaningful about their
relative order. While each part of the system has assigned timestamps to events, those
timestamps have no relation to each other. Two events may appear to be ordered even
though they are unrelated.
However - and this is still a useful property - from the perspective of a single machine,
any message sent with ts(a) will receive a response with ts(b) which is > ts(a) .
A vector clock is an extension of the Lamport clock that maintains an array [ t1, t2,
... ] of N logical clocks - one per node. Rather than incrementing a common
counter, each node increments its own logical clock in the vector by one on each
internal event. Hence the update rules are:
Whenever a process does work, increment the logical clock value of the node in
the vector
Whenever a process sends a message, include the full vector of logical clocks
When a message is received:
update each element in the vector to be max(local, received)
increment the logical clock value representing the current node in the
vector
function VectorClock(value) {
  // expressed as a hash keyed by node id: e.g. { node1: 1, node2: 3 }
  this.value = value || {};
}

VectorClock.prototype.get = function() {
  return this.value;
};

VectorClock.prototype.increment = function(nodeId) {
  if (typeof this.value[nodeId] == 'undefined') {
    this.value[nodeId] = 1;
  } else {
    this.value[nodeId]++;
  }
};

VectorClock.prototype.merge = function(other) {
  var result = {}, last,
      a = this.value,
      b = other.value;
  // Take the union of the node ids in both clocks, filtering out duplicate keys,
  // and keep the maximum value seen for each node.
  (Object.keys(a)
    .concat(Object.keys(b)))
    .sort()
    .filter(function(key) {
      var isDuplicate = (key == last);
      last = key;
      return !isDuplicate;
    }).forEach(function(key) {
      result[key] = Math.max(a[key] || 0, b[key] || 0);
    });
  this.value = result;
};
Each of the three nodes (A, B, C) keeps track of the vector clock. As events occur, they
are timestamped with the current value of the vector clock. Examining a vector clock
such as { A: 2, B: 4, C: 1 } lets us accurately identify the messages that
(potentially) influenced that event.
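To make the "potentially influenced" check concrete, here is a small helper (my own sketch, not from the text) that compares two vector clocks: a happened before b if every entry of a is less than or equal to the matching entry of b and at least one is strictly smaller; if neither happened before the other, the events are concurrent.

function happenedBefore(a, b) {
  var keys = Object.keys(a).concat(Object.keys(b));
  var somewhereSmaller = false;
  for (var i = 0; i < keys.length; i++) {
    var key = keys[i];
    var av = a[key] || 0, bv = b[key] || 0;
    if (av > bv) return false; // a is ahead of b somewhere, so a is not before b
    if (av < bv) somewhereSmaller = true;
  }
  return somewhereSmaller;
}
happenedBefore({ A: 2, B: 4, C: 1 }, { A: 3, B: 4, C: 1 }); // true
happenedBefore({ A: 2, B: 1 }, { A: 1, B: 2 });             // false (concurrent)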
The issue with vector clocks is mainly that they require one entry per node, which
means that they can potentially become very large for large systems. A variety of
techniques have been applied to reduce the size of vector clocks (either by performing
periodic garbage collection, or by reducing accuracy by limiting the size).
We've looked at how order and causality can be tracked without physical clocks. Now,
let's look at how time durations can be used for cutoff.
Given a program running on one node, how can it tell that a remote node has failed?
In the absence of accurate information, we can infer that an unresponsive remote
node has failed after some reasonable amount of time has passed.
But what is a "reasonable amount"? This depends on the latency between the local and
remote nodes. Rather than explicitly specifying algorithms with specific values (which
would inevitably be wrong in some cases), it would be nicer to deal with a suitable
abstraction.
A failure detector is a way to abstract away the exact timing assumptions. Failure
detectors are implemented using heartbeat messages and timers. Processes exchange
heartbeat messages. If a message response is not received before the timeout occurs,
then the process suspects the other process.
A failure detector based on a timeout will carry the risk of being either overly
aggressive (declaring a node to have failed) or being overly conservative (taking a long
time to detect a crash). How accurate do failure detectors need to be for them to be
usable?
Chandra et al. (1996) discuss failure detectors in the context of solving consensus - a
problem that is particularly relevant since it underlies most replication problems
where the replicas need to agree in environments with latency and network partitions.
They characterize failure detectors using two properties, completeness and accuracy:
Strong completeness.
Every crashed process is eventually suspected by every correct process.
Weak completeness.
Every crashed process is eventually suspected by some correct process.
Strong accuracy.
No correct process is suspected ever.
Weak accuracy.
Some correct process is never suspected.
Avoiding incorrectly suspecting non-faulty processes is hard unless you are able to
assume that there is a hard maximum on the message delay. That assumption can be
made in a synchronous system model - and hence failure detectors can be strongly
accurate in such a system. Under system models that do not impose hard bounds on
message delay, failure detection can at best be eventually accurate.
Chandra et al. show that even a very weak failure detector - the eventually weak failure
detector ⋄W (eventually weak accuracy + weak completeness) - can be used to solve
the consensus problem. The diagram below (from the paper) illustrates the
relationship between system models and problem solvability:
As you can see above, certain problems are not solvable without a failure detector in
asynchronous systems. This is because without a failure detector (or strong
assumptions about time bounds e.g. the synchronous system model), it is not possible
to tell whether a remote node has crashed, or is simply experiencing high latency. That
distinction is important for any system that aims for single-copy consistency: failed
nodes can be ignored because they cannot cause divergence, but partitioned nodes
cannot be safely ignored.
How can one implement a failure detector? Conceptually, there isn't much to a simple
failure detector, which simply detects failure when a timeout expires. The most
interesting part relates to how the judgments are made about whether a remote node
has failed.
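Conceptually, a simple timeout-based detector might look like this sketch (mine, not from the text): record when each heartbeat arrives, and suspect a node once nothing has been heard from it within the timeout.

function FailureDetector(timeoutMs) {
  this.timeoutMs = timeoutMs;
  this.lastHeartbeat = {}; // node id -> time of last heartbeat received
}
FailureDetector.prototype.heartbeat = function(nodeId) {
  this.lastHeartbeat[nodeId] = Date.now();
};
FailureDetector.prototype.isSuspected = function(nodeId) {
  var last = this.lastHeartbeat[nodeId];
  if (typeof last == 'undefined') return false; // no heartbeat seen yet: no opinion
  return (Date.now() - last) > this.timeoutMs;
};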
Ideally, we'd prefer the failure detector to be able to adjust to changing network
conditions and to avoid hardcoding timeout values into it. For example, Cassandra
uses an accrual failure detector, which is a failure detector that outputs a suspicion
level (a value between 0 and 1) rather than a binary "up" or "down" judgment. This
allows the application using the failure detector to make its own decisions about the
tradeoff between accurate detection and early detection.
If you're writing a distributed system, you presumably own more than one computer.
The natural (and realistic) view of the world is a partial order, not a total order. You can
transform a partial order into a total order, but this requires communication, waiting
and imposes restrictions that limit how many computers can do work at any particular
point in time.
All clocks are mere approximations bound by either network latency (logical time) or
by physics. Even keeping a simple integer counter in sync across multiple nodes is a
challenge.
While time and order are often discussed together, time itself is not such a useful
property. Algorithms don't really care about time as much as they care about more
abstract properties:
the causal ordering of events
failure detection (e.g. approximations of upper bounds on message delivery)
consistent snapshots (e.g. the ability to examine the state of a system at some point
in time)
Imposing a total order is possible, but expensive. It requires you to proceed at the
common (lowest) speed. Often the easiest way to ensure that events are delivered in
some defined order is to nominate a single (bottleneck) node through which all
operations are passed.
Is time / order / synchronicity really necessary? It depends. In some use cases, we want
each intermediate operation to move the system from one consistent state to another.
For example, in many cases we want the responses from a database to represent all of
the available information, and we want to avoid dealing with the issues that might
occur if the system could return an inconsistent result.
But in other cases, we might not need that much time / order / synchronization. For
example, if you are running a long running computation, and don't really care about
what the system does until the very end - then you don't really need much
synchronization as long as you can guarantee that the answer is correct.
Synchronization is often applied as a blunt tool across all operations, when only a
subset of cases actually matter for the final outcome. When is order needed to
guarantee correctness? The CALM theorem - which I will discuss in the last chapter -
provides one answer.
In other cases, it is acceptable to give an answer that only represents the best known
estimate - that is, is based on only a subset of the total information contained in the
system. In particular, during a network partition one may need to answer queries with
only a part of the system being accessible. In other use cases, the end user cannot
really distinguish between a relatively recent answer that can be obtained cheaply and
one that is guaranteed to be correct and is expensive to calculate. For example, is the
Twitter follower count for some user X, or X+1? Or are movies A, B and C the
absolutely best answers for some query? Doing a cheaper, mostly correct "best effort"
can be acceptable.
In the next two chapters we'll examine replication for fault-tolerant strongly consistent
systems - systems which provide strong guarantees while being increasingly resilient
to failures. These systems provide solutions for the first case: when you need to
guarantee correctness and are willing to pay for it. Then, we'll discuss systems with
weak consistency guarantees, which can remain available in the face of partitions, but
that can only give you a "best effort" answer.
Further reading
Failure detection
Snapshots
Consistent global states of distributed systems: Fundamental concepts and
mechanisms, Ozalp Babaoglu and Keith Marzullo, 1993
Distributed snapshots: Determining global states of distributed systems, K. Mani
Chandy and Leslie Lamport, 1985
Causality
Detecting Causal Relationships in Distributed Computations: In Search of the
Holy Grail - Schwarz & Mattern, 1994
Understanding the Limitations of Causally and Totally Ordered Communication -
Cheriton & Skeen, 1993
4. Replication
The replication problem is one of many problems in distributed systems. I've chosen to
focus on it over other problems such as leader election, failure detection, mutual
exclusion, consensus and global snapshots because it is often the part that people are
most interested in. One way in which parallel databases are differentiated is in terms
of their replication features, for example. Furthermore, replication provides a context
for many subproblems, such as leader election, failure detection, consensus and
atomic broadcast.
Again, there are many ways to approach replication. The approach I'll take here just
looks at high level patterns that are possible for a system with replication. Looking at
this visually helps keep the discussion focused on the overall pattern rather than the
specific messaging involved. My goal here is to explore the design space rather than to
explain the specifics of each algorithm.
Let's first define what replication looks like. We assume that we have some initial
database, and that clients make requests which change the state of the database.
The arrangement and communication pattern can then be divided into several stages:

1. (Request) The client sends a request to a server
2. (Sync) The synchronous portion of the replication takes place
3. (Response) A response is returned to the client
4. (Async) The asynchronous portion of the replication takes place
This model is loosely based on this article. Note that the pattern of messages
exchanged in each portion of the task depends on the specific algorithm: I am
intentionally trying to get by without discussing the specific algorithm.
Given these stages, what kind of communication patterns can we create? And what are
the performance and availability implications of the patterns we choose?
Synchronous replication
The first pattern is synchronous replication (also known as active, or eager, or push, or
pessimistic replication). Let's draw what that looks like:
Here, we can see three distinct stages: first, the client sends the request. Next, what we
called the synchronous portion of replication takes place. The term refers to the fact
that the client is blocked - waiting for a reply from the system.
During the synchronous phase, the first server contacts the two other servers and
waits until it has received replies from all the other servers. Finally, it sends a response
to the client informing it of the result (e.g. success or failure).
All this seems straightforward. What can we say of this specific arrangement of
communication patterns, without discussing the details of the algorithm during the
synchronous phase? First, observe that this is a write N-of-N approach: before a
response is returned, it has to be seen and acknowledged by every server in the
system.
From a performance perspective, this means that the system will be as fast as the
slowest server in it. The system will also be very sensitive to changes in network
latency, since it requires every server to reply before proceeding.
Given the N-of-N approach, the system cannot tolerate the loss of any servers. When a
server is lost, the system can no longer write to all the nodes, and so it cannot
proceed. It might be able to provide read-only access to the data, but modifications are
not allowed after a node has failed in this design.
This arrangement can provide very strong durability guarantees: the client can be
certain that all N servers have received, stored and acknowledged the request when
the response is returned. In order to lose an accepted update, all N copies would need
to be lost, which is about as good a guarantee as you can make.
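Here is a minimal sketch of the N-of-N write path (toy in-process replicas standing in for blocking network calls; not any particular system's API):

class Replica:
    def __init__(self):
        self.log = []

    def apply(self, update):               # stand-in for a blocking network call
        self.log.append(update)
        return True                        # acknowledge

def replicate_sync(update, replicas):
    """Write N-of-N: the client only hears 'success' once every single
    replica has stored and acknowledged the update."""
    for replica in replicas:
        if not replica.apply(update):      # any failure or timeout fails the write
            return "failure"
    return "success"

print(replicate_sync("x = 1", [Replica(), Replica(), Replica()]))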
Asynchronous replication
Let's contrast this with the second pattern - asynchronous replication (a.k.a. passive
replication, or pull replication, or lazy replication). As you may have guessed, this is the
opposite of synchronous replication:
Here, the master (/leader / coordinator) immediately sends back a response to the
client. It might at best store the update locally, but it will not do any significant work
synchronously and the client is not forced to wait for more rounds of communication
to occur between the servers.
At some later stage, the asynchronous portion of the replication task takes place.
Here, the master contacts the other servers using some communication pattern, and
the other servers update their copies of the data. The specifics depend on the
algorithm in use.
What can we say of this specific arrangement without getting into the details of the
algorithm? Well, this is a write 1-of-N approach: a response is returned immediately
and update propagation occurs sometime later.
From a performance perspective, this means that the system is fast: the client does not
need to spend any additional time waiting for the internals of the system to do their
work. The system is also more tolerant of network latency, since fluctuations in internal
latency do not cause additional waiting on the client side.
Given the 1-of-N approach, the system can remain available as long as at least one
node is up (at least in theory, though in practice the load will probably be too high). A
purely lazy approach like this provides no durability or consistency guarantees; you
may be allowed to write to the system, but there are no guarantees that you can read
back what you wrote if any faults occur.
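And a correspondingly minimal sketch of the 1-of-N write path, again with toy stand-ins (plain lists instead of remote replicas) rather than a real replication protocol:

import threading

def replicate_async(update, master_log, backup_logs):
    """Write 1-of-N: acknowledge as soon as the master has stored the
    update locally, then propagate to the backups in the background."""
    master_log.append(update)              # only the local copy is durable now

    def propagate():
        for log in backup_logs:            # happens some time later - or never,
            log.append(update)             # e.g. if the master crashes first

    threading.Thread(target=propagate, daemon=True).start()
    return "success"                       # durability rests on a single node

master, backups = [], [[], []]
print(replicate_async("x = 1", master, backups))   # returns immediately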
Finally, it's worth noting that passive replication cannot ensure that all nodes in the
system always contain the same state. If you accept writes at multiple locations and do
not require that those nodes synchronously agree, then you will run the risk of
divergence: reads may return different results from different locations (particularly
after nodes fail and recover), and global constraints (which require communicating
with everyone) cannot be enforced.
I haven't really mentioned the communication patterns during a read (rather than a
write), because the pattern of reads really follows from the pattern of writes: during a
read, you want to contact as few nodes as possible. We'll discuss this a bit more in the
context of quorums.
We've only discussed two basic arrangements and none of the specific algorithms. Yet
we've been able to figure out quite a bit about the possible communication patterns
as well as their performance, durability guarantees and availability characteristics.
There are many, many different ways to categorize replication techniques. The second
distinction (after sync vs. async) I'd like to introduce is between:

replication methods that prevent divergence (single copy systems) and
replication methods that risk divergence (multi-master systems)
The first group of methods has the property that they "behave like a single system". In
particular, when partial failures occur, the system ensures that only a single copy of
the system is active. Furthermore, the system ensures that the replicas are always in
agreement. This is known as the consensus problem.
Several processes (or computers) achieve consensus if they all agree on some value.
More formally:

1. Agreement: every correct process must agree on the same value.
2. Integrity: every correct process decides at most one value, and if it decides some
value, then that value must have been proposed by some process.
3. Termination: all processes eventually reach a decision.
4. Validity: if all correct processes propose the same value V, then all correct processes
decide V.
Mutual exclusion, leader election, multicast and atomic broadcast are all instances of
the more general problem of consensus. Replicated systems that maintain single copy
consistency need to solve the consensus problem in some way.
These algorithms vary in their fault tolerance (e.g. the types of faults they can tolerate).
I've classified these simply by the number of messages exchanged during an execution
of the algorithm, because I think it is interesting to try to find an answer to the
question "what are we buying with the added message exchanges?"
The diagram below, adapted from Ryan Barrett at Google, describes some of the
aspects of the different options:
The consistency, latency, throughput, data loss and failover characteristics in the
diagram above can really be traced back to the two different replication methods:
synchronous replication (e.g. waiting before responding) and asynchronous
replication. When you wait, you get worse performance but stronger guarantees. The
throughput difference between 2PC and quorum systems will become apparent when
we discuss partition (and latency) tolerance.
In that diagram, algorithms enforcing weak (/eventual) consistency are lumped into
one category ("gossip"). However, I will discuss replication methods for weak
consistency - gossip and (partial) quorum systems - in more detail. The "transactions"
row really refers more to global predicate evaluation, which is not supported in
systems with weak consistency (though local predicate evaluation can be supported).
It is worth noting that systems enforcing weak consistency requirements have fewer
generic algorithms, and more techniques that can be selectively applied. Since systems
that do not enforce single-copy consistency are free to act like distributed systems
consisting of multiple nodes, there are fewer obvious objectives to fix and the focus is
more on giving people a way to reason about the characteristics of the system that
they have.
For example, such a system may characterize its behavior probabilistically (as with
partial quorums), use data types that converge regardless of operation order (CRDTs),
or analyze which computations are safe to run without coordination (the CALM
theorem). I'll talk about these a bit further on; first, let's look at the replication
algorithms that maintain single-copy consistency.
Primary/backup replication
Primary/backup replication (also known as primary copy replication, master-slave
replication or log shipping) is perhaps the most commonly used replication method,
and the most basic algorithm. All updates are performed on the primary, and a log of
operations (or alternatively, changes) is shipped across the network to the backup
replicas. There are two variants:

asynchronous primary/backup replication and
synchronous primary/backup replication
P/B is very common. For example, by default MySQL replication uses the asynchronous
variant. MongoDB also uses P/B (with some additional procedures for failover). All
operations are performed on one master server, which serializes them to a local log,
which is then replicated asynchronously to the backup servers.
The synchronous variant of primary/backup replication ensures that writes have been
stored on other nodes before returning back to the client - at the cost of waiting for
responses from other replicas. However, it is worth noting that even this variant can
only offer weak guarantees. Consider the following simple failure scenario:

1. The primary receives a write and sends it to the backup
2. The backup persists the write and responds with an ACK
3. The primary fails before sending an ACK to the client
The client now assumes that the commit failed, but the backup committed it; if the
backup is promoted to primary, it will be incorrect. Manual cleanup may be needed to
reconcile the failed primary or divergent backups.
What is key in the log-shipping / primary/backup based schemes is that they can only
offer a best-effort guarantee (e.g. they are susceptible to lost updates or incorrect
updates if nodes fail at inopportune times). Furthermore, P/B schemes are susceptible
to split-brain, where the failover to a backup kicks in due to a temporary network issue
and causes both the primary and backup to be active at the same time.
Two phase commit (2PC)

Two phase commit (2PC) is the classic protocol for making an update atomic across
multiple nodes: a coordinator drives each update through two phases.
In the first phase (voting), the coordinator sends the update to all the participants.
Each participant processes the update and votes whether to commit or abort. When
voting to commit, the participants store the update onto a temporary area (the write-
ahead log). Until the second phase completes, the update is considered temporary.
In the second phase (decision), the coordinator decides the outcome and informs
every participant about it. If all participants voted to commit, then the update is taken
from the temporary area and made permanent.
Having a second phase in place before the commit is considered permanent is useful,
because it allows the system to roll back an update when a node fails. In contrast, in
primary/backup ("1PC"), there is no step for rolling back an operation that has failed on
some nodes and succeeded on others, and hence the replicas could diverge.
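A minimal sketch of the coordinator side of 2PC (in-process participants, and none of the write-ahead logging or recovery machinery a real implementation needs):

class Participant:
    def __init__(self):
        self.staged = None            # write-ahead / temporary area
        self.committed = []

    def prepare(self, update):        # phase 1: vote
        self.staged = update
        return True                   # vote "commit" (a real participant may vote abort)

    def commit(self):                 # phase 2: make the staged update permanent
        self.committed.append(self.staged)
        self.staged = None

    def abort(self):
        self.staged = None

def two_phase_commit(update, participants):
    # Phase 1 (voting): everyone stages the update and votes.
    votes = [p.prepare(update) for p in participants]
    # Phase 2 (decision): commit only on a unanimous "yes".
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"

print(two_phase_commit("x = 1", [Participant(), Participant()]))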
2PC is prone to blocking, since a single node failure (participant or coordinator) blocks
progress until the node has recovered. Recovery is often possible thanks to the second
phase, during which other nodes are informed about the system state. Note that 2PC
assumes that the data in stable storage at each node is never lost and that no node
crashes forever. Data loss is still possible if the data in the stable storage is corrupted
in a crash.
The details of the recovery procedures during node failures are quite complicated so I
won't get into the specifics. The major tasks are ensuring that writes to disk are
durable (e.g. flushed to disk rather than cached) and making sure that the right
recovery decisions are made (e.g. learning the outcome of the round and then redoing
or undoing an update locally).
2PC strikes a decent balance between performance and fault tolerance, which is why it
has been popular in relational databases. However, newer systems often use a
partition tolerant consensus algorithm, since such an algorithm can provide automatic
recovery from temporary network partitions as well as more graceful handling of
increased between-node latency.
Partition tolerant consensus algorithms

Partition tolerant consensus algorithms are as far as we'll go here in terms of fault
tolerance. There is an even stronger class of algorithms - Byzantine fault tolerant
algorithms - which can also tolerate nodes that fail by acting maliciously. Such
algorithms are rarely used in commercial systems, because they are more expensive
to run and more complicated to implement - and hence I will leave them out.
Network partitions are tricky because during a network partition, it is not possible to
distinguish between a failed remote node and the node being unreachable. If a
network partition occurs but no nodes fail, then the system is divided into two
partitions which are simultaneously active. The two diagrams below illustrate how a
network partition can look similar to a node failure.
A system that enforces single-copy consistency must have some method to break
symmetry: otherwise, it will split into two separate systems, which can diverge from
each other and can no longer maintain the illusion of a single copy.
For a system that enforces single-copy consistency, partition tolerance requires that
only one partition of the system remains active during a network partition, because it
is not possible to prevent divergence between two active partitions (e.g. CAP theorem).
Majority decisions
This is why partition tolerant consensus algorithms rely on a majority vote. Requiring a
majority of nodes - rather than all of the nodes (as in 2PC) - to agree on updates allows
a minority of the nodes to be down, or slow, or unreachable due to a network
partition. As long as (N/2 + 1)-of-N nodes are up and accessible, the system can
continue to operate.
Partition tolerant consensus algorithms use an odd number of nodes (e.g. 3, 5 or 7).
With just two nodes, it is not possible to have a clear majority after a failure. For
example, if the number of nodes is three, then the system is resilient to one node
failure; with five nodes the system is resilient to two node failures.
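The arithmetic behind these numbers is easy to check with a small sketch:

def majority(n):
    return n // 2 + 1                 # quorum size

def tolerated_failures(n):
    return n - majority(n)            # nodes that may be down or unreachable

for n in (2, 3, 5, 7):
    print(n, "nodes: quorum =", majority(n),
          ", tolerates", tolerated_failures(n), "failures")
# 2 nodes tolerate 0 failures, 3 tolerate 1, 5 tolerate 2, 7 tolerate 3.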
When a network partition occurs, the partitions behave asymmetrically. One partition
will contain the majority of the nodes. Minority partitions will stop processing
operations to prevent divergence during a network partition, but the majority partition
can remain active. This ensures that only a single copy of the system state remains
active.
Majorities are also useful because they can tolerate disagreement: if there is a
perturbation or failure, the nodes may vote differently. However, since there can be
only one majority decision, a temporary disagreement can at most block the protocol
from proceeding (giving up liveness) but it cannot violate the single-copy consistency
criterion (safety property).
Roles
There are two ways one might structure a system: all nodes may have the same
responsibilities, or nodes may have separate, distinct roles.
Consensus algorithms for replication generally opt for having distinct roles for each
node. Having a single fixed leader or master server is an optimization that makes the
system more efficient, since we know that all updates must pass through that server.
Nodes that are not the leader just need to forward their requests to the leader.
Note that having distinct roles does not preclude the system from recovering from the
failure of the leader (or any other role). Just because roles are fixed during normal
operation doesn't mean that one cannot recover from failure by reassigning the roles
after a failure (e.g. via a leader election phase). Nodes can reuse the result of a leader
election until node failures and/or network partitions occur.
Both Paxos and Raft make use of distinct node roles. In particular, they have a leader
node ("proposer" in Paxos) that is responsible for coordination during normal
operation. During normal operation, the rest of the nodes are followers ("acceptors" or
"voters" in Paxos).
Epochs
Each period of normal operation in both Paxos and Raft is called an epoch ("term" in
Raft). During each epoch only one node is the designated leader (a similar system is
used in Japan where era names change upon imperial succession).
After a successful election, the same leader coordinates until the end of the epoch. As
shown in the diagram above (from the Raft paper), some elections may fail, causing
the epoch to end immediately.
Epochs act as a logical clock, allowing other nodes to identify when an outdated node
starts communicating - nodes that were partitioned or out of operation will have a
smaller epoch number than the current one, and their commands are ignored.
All nodes start as followers; one node is elected to be a leader at the start. During
normal operation, the leader maintains a heartbeat which allows the followers to
detect if the leader fails or becomes partitioned.
When a node detects that a leader has become non-responsive (or, in the initial case,
that no leader exists), it switches to an intermediate state (called "candidate" in Raft)
where it increments the term/epoch value by one, initiates a leader election and
competes to become the new leader.
In order to be elected a leader, a node must receive a majority of the votes. One way to
assign votes is to simply assign them on a first-come-first-served basis; this way, a
leader will eventually be elected. Adding a random amount of waiting time between
attempts at getting elected will reduce the number of nodes that are simultaneously
attempting to get elected.
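A sketch of that timeout-and-candidacy logic, loosely modeled on Raft's randomized election timeouts (the constants and the message format here are illustrative, not Raft's actual values):

import random

ELECTION_TIMEOUT_MS = (150, 300)      # each node picks its own random timeout

class Follower:
    def __init__(self, node_id):
        self.node_id = node_id
        self.term = 0
        self.reset_timeout()

    def reset_timeout(self):
        # Called whenever a heartbeat from the current leader arrives.
        self.timeout_ms = random.uniform(*ELECTION_TIMEOUT_MS)

    def on_timeout(self):
        # No heartbeat arrived in time: become a candidate, bump the term
        # and ask everyone for their vote. Randomized timeouts make it
        # unlikely that many nodes do this at exactly the same moment.
        self.term += 1
        return {"type": "RequestVote", "term": self.term,
                "candidate": self.node_id}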
Normal operation
During normal operation, all proposals go through the leader node. When a client
submits a proposal (e.g. an update operation), the leader contacts all nodes in the
quorum. If no competing proposals exist (based on the responses from the followers),
the leader proposes the value. If a majority of the followers accept the value, then the
value is considered to be accepted.
Since it is possible that another node is also attempting to act as a leader, we need to
ensure that once a single proposal has been accepted, its value can never change.
Otherwise a proposal that has already been accepted might for example be reverted
by a competing leader. Lamport states this as:
P2: If a proposal with value v is chosen, then every higher-numbered proposal that is chosen
has value v .
Ensuring that this property holds requires that both followers and proposers are
constrained by the algorithm from ever changing a value that has been accepted by a
majority. Note that "the value can never change" refers to the value of a single
execution (or run / instance / decision) of the protocol. A typical replication algorithm
will run multiple executions of the algorithm, but most discussions of the algorithm
focus on a single run to keep things simple. We want to prevent the decision history
from being altered or overwritten.
In order to enforce this property, the proposers must first ask the followers for their
(highest numbered) accepted proposal and value. If the proposer finds out that a
proposal already exists, then it must simply complete this execution of the protocol,
rather than making its own proposal. Lamport states this as:
P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by
any proposer has value v .
More specifically:
P2c. For any v and n , if a proposal with value v and number n is issued [by a leader],
then there is a set S consisting of a majority of acceptors [followers] such that either (a) no
acceptor in S has accepted any proposal numbered less than n , or (b) v is the value of the
highest-numbered proposal among all proposals numbered less than n accepted by the
followers in S .
This is the core of the Paxos algorithm, as well as algorithms derived from it. The value
to be proposed is not chosen until the second phase of the protocol. Proposers must
sometimes simply retransmit a previously made decision to ensure safety (e.g. clause
b in P2c) until they reach a point where they know that they are free to impose their
own proposal value (e.g. clause a).
To ensure that no competing proposals emerge between the time the proposer asks
each acceptor about its most recent value, the proposer asks the followers not to
accept proposals with lower proposal numbers than the current one.
Putting the pieces together, reaching a decision using Paxos requires two rounds of
communication:

1. Prepare / promise: the proposer sends a prepare message with a proposal number
n; each follower responds with a promise not to accept proposals numbered lower
than n, along with the highest-numbered proposal (and its value) that it has already
accepted, if any.
2. Accept request / accepted: the proposer asks the followers to accept a proposal with
number n, carrying either its own value or the value of the highest-numbered proposal
reported in the promises; the value is decided once a majority has accepted it.
The prepare stage allows the proposer to learn of any competing or previous
proposals. The second phase is where either a new value or a previously accepted
value is proposed. In some cases - such as if two proposers are active at the same time
(dueling); if messages are lost; or if a majority of the nodes have failed - then no
proposal is accepted by a majority. But this is acceptable, since the decision rule for
what value to propose converges towards a single value (the one with the highest
proposal number in the previous attempt).
Indeed, according to the FLP impossibility result, this is the best we can do: algorithms
that solve the consensus problem must either give up safety or liveness when the
guarantees regarding bounds on message delivery do not hold. Paxos gives up
liveness: it may have to delay decisions indefinitely until a point in time where there
are no competing leaders, and a majority of nodes accept a proposal. This is
preferable to violating the safety guarantees.
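To make the prepare/promise and accept/accepted structure concrete, here is a heavily simplified single-decree sketch - no networking, retries or stable storage, so a teaching toy rather than a usable Paxos implementation:

class Acceptor:
    def __init__(self):
        self.promised_n = 0
        self.accepted_n = 0
        self.accepted_value = None

    def prepare(self, n):
        # Promise to ignore proposals numbered below n, and report any
        # previously accepted proposal.
        if n > self.promised_n:
            self.promised_n = n
            return True, self.accepted_n, self.accepted_value
        return False, None, None

    def accept(self, n, value):
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted_n = n
            self.accepted_value = value
            return True
        return False

def propose(n, my_value, acceptors):
    majority = len(acceptors) // 2 + 1

    # Phase 1: prepare / promise.
    promises = [a.prepare(n) for a in acceptors]
    granted = [(an, av) for ok, an, av in promises if ok]
    if len(granted) < majority:
        return None                               # retry with a higher n

    # P2c: if some acceptor already accepted a value, propose the value of
    # the highest-numbered accepted proposal instead of our own.
    prev = max(granted, key=lambda p: p[0])
    value = prev[1] if prev[0] > 0 else my_value

    # Phase 2: accept request / accepted.
    accepts = sum(1 for a in acceptors if a.accept(n, value))
    return value if accepts >= majority else None

acceptors = [Acceptor() for _ in range(3)]
print(propose(1, "x = 1", acceptors))             # -> 'x = 1'
print(propose(2, "x = 2", acceptors))             # -> 'x = 1' (the decision is stable)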
Of course, implementing this algorithm is much harder than it sounds. There are many
small concerns which add up to a fairly significant amount of code even in the hands
of experts. These are issues such as:
practical optimizations:
avoiding repeated leader election via leadership leases (rather than
heartbeats)
avoiding repeated propose messages when in a stable state where the
leader identity does not change
ensuring that followers and proposers do not lose items in stable storage and
that results stored in stable storage are not subtly corrupted (e.g. disk
corruption)
enabling cluster membership to change in a safe manner (e.g. base Paxos
depends on the fact that majorities always intersect in one node, which does not
hold if the membership can change arbitrarily)
procedures for bringing a new replica up to date in a safe and efficient manner
after a crash, disk loss or when a new node is provisioned
procedures for snapshotting and garbage collecting the data required to
guarantee safety after some reasonable period (e.g. balancing storage
requirements and fault tolerance requirements)
Paxos. Paxos is one of the most important algorithms when writing strongly consistent
partition tolerant replicated systems. It is used in many of Google's systems, including
the Chubby lock manager used by BigTable/Megastore, the Google File System as well
as Spanner.
Paxos is named after the Greek island of Paxos, and was originally presented by Leslie
Lamport in a paper called "The Part-Time Parliament" in 1998. It is often considered to
be difficult to implement, and there have been a series of papers from companies with
considerable distributed systems expertise explaining further practical details (see the
further reading). You might want to read Lamport's commentary on this issue here
and here.
The issues mostly relate to the fact that Paxos is described in terms of a single round
of consensus decision making, but an actual working implementation usually wants to
run multiple rounds of consensus efficiently. This has led to the development of many
extensions on the core protocol that anyone interested in building a Paxos-based
system still needs to digest. Furthermore, there are additional practical challenges
such as how to facilitate cluster membership change.
ZAB. ZAB - the Zookeeper Atomic Broadcast protocol is used in Apache Zookeeper.
Zookeeper is a system which provides coordination primitives for distributed systems,
and is used by many Hadoop-centric distributed systems for coordination (e.g. HBase,
Storm, Kafka). Zookeeper is basically the open source community's version of Chubby.
Technically speaking atomic broadcast is a problem different from pure consensus,
but it still falls under the category of partition tolerant algorithms that ensure strong
consistency.
Primary/Backup
Single, static master
Replicated log, slaves are not involved in executing operations
No bounds on replication delay
Not partition tolerant
Manual/ad-hoc failover, not fault tolerant, "hot backup"
2PC
Unanimous vote: commit or abort
Static master
2PC cannot survive simultaneous failure of the coordinator and a node during a
commit
Not partition tolerant, tail latency sensitive
Paxos
Majority vote
Dynamic master
Robust to (n-1)/2 simultaneous failures as part of the protocol
Less sensitive to tail latency
Further reading
Paxos
5. Replication: weak consistency model protocols

The implication that follows from the limitation on the speed at which information
travels is that nodes experience the world in different, unique ways. Computation on a
single node is easy, because everything happens in a predictable global total order.
Computation on a distributed system is difficult, because there is no global total order.
For the longest while (e.g. decades of research), we've solved this problem by
introducing a global total order. I've discussed the many methods for achieving strong
consistency by creating order (in a fault-tolerant manner) where there is no naturally
occurring total order.
Of course, the problem is that enforcing order is expensive. This breaks down in
particular with large scale internet systems, where a system needs to remain available.
A system enforcing strong consistency doesn't behave like a distributed system: it
behaves like a single system, which is bad for availability during a partition.
Furthermore, for each operation, often a majority of the nodes must be contacted -
and often not just once, but twice (as you saw in the discussion on 2PC). This is
particularly painful in systems that need to be geographically distributed to provide
adequate performance for a global user base.
Perhaps what we want is a system where we can write code that doesn't use expensive
coordination, and yet returns a "usable" value. Instead of having a single truth, we will
allow different replicas to diverge from each other - both to keep things efficient but
also to tolerate partitions - and then try to find a way to deal with the divergence in
some manner.
Eventual consistency expresses this idea: that nodes can for some time diverge from
each other, but that eventually they will agree on the value.
Within the set of systems providing eventual consistency, there are two types of
system designs:
Eventual consistency with probabilistic guarantees. This type of system can detect
conflicting writes at some later point, but does not guarantee that the results are
equivalent to some correct sequential execution. In other words, conflicting updates
will sometimes result in overwriting a newer value with an older one and some
anomalies can be expected to occur during normal operation (or during partitions).
In recent years, the most influential system design offering eventual consistency with
probabilistic guarantees is Amazon's Dynamo, which I will discuss as an example of a
system that offers eventual consistency with probabilistic guarantees.
Eventual consistency with strong guarantees. This type of system guarantees that the
results converge to a common value equivalent to some correct sequential execution.
In other words, such systems do not produce any anomalous results; without any
coordination you can build replicas of the same service, and those replicas can
communicate in any pattern and receive the updates in any order, and they will
eventually agree on the end result as long as they all see the same information.
CRDT's (convergent replicated data types) are data types that guarantee convergence
to the same value in spite of network delays, partitions and message reordering. They
are provably convergent, but the data types that can be implemented as CRDT's are
limited.
Perhaps the most obvious characteristic of systems that do not enforce single-copy
consistency is that they allow replicas to diverge from each other. This means that
there is no strictly defined pattern of communication: replicas can be separated from
each other and yet continue to be available and accept writes.
Let's imagine a system of three replicas, each of which is partitioned from the others.
For example, the replicas might be in different datacenters and for some reason
unable to communicate. Each replica remains available during the partition, accepting
both reads and writes from some set of clients:
After some time, the partitions heal and the replica servers exchange information.
They have received different updates from different clients and have diverged from
each other, so some sort of reconciliation needs to take place. What we would like to
happen is that all of the replicas converge to the same result.
[A] \
     --> [merge]
[B] /       |
            |
[C] ------[merge]---> result
Another way to think about systems with weak consistency guarantees is to imagine a
set of clients sending messages to two replicas in some order. Because there is no
coordination protocol that enforces a single total order, the messages can get
delivered in different orders at the two replicas:
This is, in essence, the reason why we need coordination protocols. For example,
assume that we are trying to concatenate a string and the operations in messages 1, 2
and 3 are:

1: { operation: concat(previous, 'Hello ') }
2: { operation: concat(previous, 'World') }
3: { operation: concat(previous, '!') }

Then, without coordination, A will produce "Hello World!", and B will produce
"World!Hello ":
A: concat(concat(concat('', 'Hello '), 'World'), '!') = 'Hello World!'
B: concat(concat(concat('', 'World'), '!'), 'Hello ') = 'World!Hello '
This is, of course, incorrect. Again, what we'd like to happen is that the replicas
converge to the same result.
Keeping these two examples in mind, let's look at Amazon's Dynamo first to establish a
baseline, and then discuss a number of novel approaches to building systems with
weak consistency guarantees, such as CRDT's and the CALM theorem.
Amazon's Dynamo
Amazon's Dynamo system design (2007) is probably the best-known system that offers
weak consistency guarantees but high availability. It is the basis for many other real
world systems, including LinkedIn's Voldemort, Facebook's Cassandra and Basho's
Riak.
Dynamo is an eventually consistent, highly available key-value store. A key value store
is like a large hash table: a client can set values via set(key, value) and retrieve
them by key using get(key) . A Dynamo cluster consists of N peer nodes; each node
has a set of keys which it is responsible for storing.
Since Dynamo is a complete system design, there are many different parts to look at
beyond the core replication task. The diagram below illustrates some of the tasks;
notably, how a write is routed to a node and written to multiple replicas.
[ Client ]
    |
( Mapping keys to nodes )
    |
    V
[ Node A ]
    |      \
( Synchronous replication task: minimum durability )
    |      \
[ Node B ]  [ Node C ]
    A
    |
( Conflict detection; asynchronous replication task:
  ensuring that partitioned / recovered nodes recover )
    |
    V
[ Node D ]
After looking at how a write is initially accepted, we'll look at how conflicts are
detected, as well as the asynchronous replica synchronization task. This task is needed
because of the high availability design, in which nodes may be temporarily unavailable
(down or partitioned). The replica synchronization task ensures that nodes can catch
up fairly rapidly even after a failure.
Consistent hashing
Whether we are reading or writing, the first thing that needs to happen is that we need
to locate where the data should live on the system. This requires some type of key-to-
node mapping.
In Dynamo, keys are mapped to nodes using a hashing technique known as consistent
hashing (which I will not discuss in detail). The main idea is that a key can be mapped
to a set of nodes responsible for it by a simple calculation on the client. This means
that a client can locate keys without having to query the system for the location of each
key; this saves system resources as hashing is generally faster than performing a
remote procedure call.
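A minimal sketch of such a key-to-node mapping (a bare hash ring, without the virtual nodes and weighting that real systems such as Dynamo add):

import bisect
import hashlib

class HashRing:
    """Map keys to nodes by hashing both onto the same ring; each key is
    owned by the first nodes clockwise from the key's position."""

    def __init__(self, nodes, replicas=3):
        self.replicas = replicas                  # number of nodes per key
        self.ring = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def nodes_for(self, key):
        start = bisect.bisect(self.ring, (self._hash(key), ""))
        picked = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in picked:
                picked.append(node)
            if len(picked) == self.replicas:
                break
        return picked

ring = HashRing(["node-a", "node-b", "node-c", "node-d"])
print(ring.nodes_for("user:42"))   # any client computes the same answer locally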
Partial quorums
Once we know where a key should be stored, we need to do some work to persist the
value. This is a synchronous task; the reason why we will immediately write the value
onto multiple nodes is to provide a higher level of durability (e.g. protection from the
immediate failure of a node).
Just like Paxos or Raft, Dynamo uses quorums for replication. However, Dynamo's
quorums are sloppy (partial) quorums rather than strict (majority) quorums.
Informally, a strict quorum system is a quorum system with the property that any two
quorums (sets) in the quorum system overlap. Requiring a majority to vote for an
update before accepting it guarantees that only a single history is admitted since each
majority quorum must overlap in at least one node. This was the property that Paxos,
for example, relied on.
Partial quorums do not have that property; what this means is that a majority is not
required and that different subsets of the quorum may contain different versions of
the same data. The user can choose the number of nodes to write to and read from:
the user can choose some number W-of-N nodes required for a write to succeed;
and
the user can specify the number of nodes (R-of-N) to be contacted during a read.
W and R specify the number of nodes that need to be involved in a write or a read.
Writing to more nodes makes writes slightly slower but increases the probability that
the value is not lost; reading from more nodes increases the probability that the value
read is up to date.
The usual recommendation is that R + W > N , because this means that the read and
write quorums overlap in at least one node - making it less likely that a stale value is
returned.
A typical configuration is N = 3 (e.g. a total of three replicas for each value); this
means that the user can choose between:
R = 1, W = 3;
R = 2, W = 2 or
R = 3, W = 1
N is rarely more than 3, because keeping that many copies of large amounts of data
around gets expensive!
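A sketch of how N, W and R interact (toy in-process replicas, no failure handling or conflict resolution):

class Replica:
    def __init__(self):
        self.store = {}                        # key -> (version, value)

    def put(self, key, version, value):
        self.store[key] = (version, value)
        return True                            # ack

    def get(self, key):
        return self.store.get(key)

def quorum_write(replicas, key, version, value, w):
    # Send to all, but declare success as soon as W replicas have acked.
    acks = sum(1 for r in replicas if r.put(key, version, value))
    return acks >= w

def quorum_read(replicas, key, r):
    # Contact R replicas and keep the newest version seen.
    responses = [rep.get(key) for rep in replicas[:r]]
    responses = [resp for resp in responses if resp is not None]
    return max(responses) if responses else None

N = 3
replicas = [Replica() for _ in range(N)]
W, R = 2, 2                                    # R + W > N: quorums overlap
quorum_write(replicas, "k", version=1, value="v1", w=W)
print(quorum_read(replicas, "k", r=R))         # -> (1, 'v1')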
As I mentioned earlier, the Dynamo paper has inspired many other similar designs.
They all use the same partial quorum based replication approach, but with different
defaults for N, W and R:
There is another detail: when sending a read or write request, are all N nodes asked
to respond (Riak), or only a number of nodes that meets the minimum (e.g. R or W;
Voldemort)? The "send-to-all" approach is faster and less sensitive to latency (since it
only waits for the fastest R or W nodes of N) but also less efficient, while the "send-to-
minimum" approach is more sensitive to latency (since latency communicating with a
single node will delay the operation) but also more efficient (fewer messages /
connections overall).
What happens when the read and write quorums overlap, e.g. ( R + W > N )?
Specifically, it is often claimed that this results in "strong consistency".
Is R + W > N the same as "strong consistency"?
No.
It's not completely off base: a system where R + W > N can detect read/write
conflicts, since any read quorum and any write quorum share a member. E.g. at least
one node is in both quorums:
  1      2      N/2+1     N/2+2      N
[...]   [R]    [R + W]     [W]     [...]
This guarantees that a previous write will be seen by a subsequent read. However, this
only holds if the nodes in N never change. Hence, Dynamo doesn't qualify, because in
Dynamo the cluster membership can change if nodes fail.
Furthermore, Dynamo doesn't handle partitions in the manner that a system enforcing
a strong consistency model would: namely, writes are allowed on both sides of a
partition, which means that for at least some time the system does not act as a single
copy. So calling R + W > N "strongly consistent" is misleading; the guarantee is
merely probabilistic - which is not what strong consistency refers to.
Conflict detection and read repair

Systems that allow replicas to diverge must have some way of detecting and resolving
conflicting versions; this is generally done by tracking the causal history of a piece of
data in metadata attached to it. We've already encountered a method for doing this:
vector clocks can be used to represent the history of a value. Indeed, this is what the
original Dynamo design uses for detecting conflicts.
However, using vector clocks is not the only alternative. If you look at many practical
system designs, you can deduce quite a bit about how they work by looking at the
metadata that they track.
No metadata. When a system does not track metadata, and only returns the value (e.g.
via a client API), it cannot really do anything special about concurrent writes. A
common rule is that the last writer wins: in other words, if two writers write at the
same time, only one of the values - whichever happens to be written last - is kept
around.
Timestamps. Nominally, the value with the higher timestamp value wins. However, if
time is not carefully synchronized, many odd things can happen where old data from a
system with a faulty or fast clock overwrites newer values. Facebook's Cassandra is a
Dynamo variant that uses timestamps instead of vector clocks.
Version numbers. Version numbers may avoid some of the issues related to using
timestamps. Note that the smallest mechanism that can accurately track causality
when multiple histories are possible is a vector clock, not a version number.
Vector clocks. Using vector clocks, concurrent and out of date updates can be detected.
Performing read repair then becomes possible, though in some cases (concurrent
changes) we need to ask the client to pick a value. This is because if the changes are
concurrent and we know nothing more about the data (as is the case with a simple
key-value store), then it is better to ask than to discard data arbitrarily.
When reading a value, the client contacts R of N nodes and asks them for the latest
value for a key. It takes all the responses, discards the values that are strictly older
(using the vector clock value to detect this). If there is only one unique vector clock +
value pair, it returns that. If there are multiple vector clock + value pairs that have been
edited concurrently (e.g. are not comparable), then all of those values are returned.
As is obvious from the above, read repair may return multiple values. This means that
the client / application developer must occasionally handle these cases by picking a
value based on some use-case specific criterion.
In addition, a key component of a practical vector clock system is that the clocks cannot
be allowed to grow forever - so there needs to be a procedure for occasionally garbage
collecting the clocks in a safe manner to balance fault tolerance with storage
requirements.
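To make the read-repair comparison concrete, here is a sketch of vector clocks as plain dictionaries, where "concurrent" means that neither clock dominates the other; the data in the example is invented:

def dominates(a, b):
    """True if vector clock a has seen everything b has (a >= b)."""
    return all(a.get(node, 0) >= counter for node, counter in b.items())

def reconcile(versions):
    """Given (vector_clock, value) pairs from R replicas, drop the values
    that are strictly older and return the rest (the 'siblings')."""
    siblings = []
    for clock, value in versions:
        if any(dominates(other, clock) and other != clock
               for other, _ in versions):
            continue                      # strictly older: discard
        if (clock, value) not in siblings:
            siblings.append((clock, value))
    return siblings                       # one entry, or several if concurrent

v1 = ({"a": 2, "b": 1}, "cart=[beer]")
v2 = ({"a": 1, "b": 1}, "cart=[]")            # older than v1: dropped
v3 = ({"a": 1, "b": 2}, "cart=[wine]")        # concurrent with v1: kept
print(reconcile([v1, v2, v3]))                # the client must pick or merge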
Replica synchronization: gossip and Merkle trees

Replica synchronization is used to bring nodes up to date after a failure, and for
periodically synchronizing replicas with each other.
Dynamo uses a gossip protocol for replica synchronization: every t seconds, each node
picks another node to communicate with. This provides an additional mechanism
beyond the synchronous task (e.g. the partial quorum writes) which brings the replicas
up to date.
Gossip is scalable, and has no single point of failure, but can only provide probabilistic
guarantees.
To make comparing replica contents efficient, Dynamo uses Merkle (hash) trees: each
node keeps a tree of hashes over its key range, in which the leaves are hashes of
individual keys and each internal node is a hash of its children. By maintaining this
fairly granular hashing, nodes can compare their data store content much more
efficiently than with a naive technique: if two subtree hashes match, that part of the
key range is identical and can be skipped. Once the nodes have identified which keys
have different values, they exchange the necessary information to bring the replicas
up to date.
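A sketch of the comparison idea in its simplest, one-level form - hash each key range and only exchange ranges whose hashes differ; a real Merkle tree applies this recursively:

import hashlib

def range_hashes(store, num_ranges=16):
    """Group keys into ranges and hash each range's contents.
    (A real system uses a stable hash and a full tree, not Python's hash().)"""
    buckets = [[] for _ in range(num_ranges)]
    for key in sorted(store):
        buckets[hash(key) % num_ranges].append((key, store[key]))
    return [hashlib.sha1(repr(bucket).encode()).hexdigest() for bucket in buckets]

def ranges_to_sync(store_a, store_b):
    # Only ranges whose hashes differ contain keys that need to be exchanged.
    return [i for i, (ha, hb) in enumerate(zip(range_hashes(store_a),
                                               range_hashes(store_b)))
            if ha != hb]

a = {"k1": "v1", "k2": "v2", "k3": "v3"}
b = {"k1": "v1", "k2": "stale", "k3": "v3"}
print(ranges_to_sync(a, b))    # only the range containing k2 needs exchanging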
PBS: probabilistically bounded staleness

How might we characterize the behavior of such a system? A fairly recent paper from
Bailis et al. (2012) describes an approach called PBS (probabilistically bounded
staleness), which uses simulation and data collected from a real world system to
characterize the expected behavior of such a system.
PBS estimates the degree of inconsistency by using information about the anti-entropy
(gossip) rate, the network latency and local processing delay to estimate the expected
level of consistency of reads. It has been implemented in Cassandra, where timing
information is piggybacked on other messages and an estimate is calculated based on
a sample of this information in a Monte Carlo simulation.
Based on the paper, during normal operation eventually consistent data stores are
often faster and can read a consistent state within tens or hundreds of milliseconds.
The table below illustrates the amount of time required for a 99.9% probability of
consistent reads, given different R and W settings, based on empirical timing data
from LinkedIn (SSD and 15k RPM disks) and Yammer:
For example, going from R=1 , W=1 to R=2 , W=1 in the Yammer case reduces the
inconsistency window from 1352 ms to 202 ms - while keeping the read latencies lower
(32.6 ms) than the fastest strict quorum ( R=3 , W=1 ; 219.27 ms).
For more details, have a look at the PBS website and the associated paper.
Disorderly programming
Let's look back at the examples of the kinds of situations that we'd like to resolve. The
first scenario consisted of three different servers behind partitions; after the partitions
healed, we wanted the servers to converge to the same value. Amazon's Dynamo
made this possible by reading from R out of N nodes and then performing read
reconciliation.
The second scenario - where replicas receive the same operations in different orders -
is harder: there, what we'd like is for the operations themselves to commute:

... operation-centric work can be made commutative (with the right operations and the right
semantics) where a simple READ/WRITE semantic does not lend itself to commutativity.
For example, consider a system that implements a simple accounting system with the
debit and credit operations in two different ways:

by recording the resulting account balance after each operation (a register that is
overwritten by every write), or
by recording the individual debit and credit operations themselves

The latter implementation knows more about the internals of the data type, and so it
can preserve the intent of the operations in spite of the operations being reordered.
Debiting or crediting can be applied in any order, and the end result is the same.
However, writing a fixed value cannot be done in any order: if the writes are reordered,
one of the writes will overwrite the other, as the sketch below illustrates.
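A sketch of the difference, with a toy register and a toy operation ledger:

class Register:
    """Stores the resulting balance; replaying writes in a different
    order gives a different final state."""
    def __init__(self, balance=0):
        self.balance = balance
    def set(self, value):
        self.balance = value

class Ledger:
    """Stores the operations; debits and credits commute, so any
    delivery order converges to the same balance."""
    def __init__(self, balance=0):
        self.balance = balance
    def credit(self, amount):
        self.balance += amount
    def debit(self, amount):
        self.balance -= amount

ops = [("credit", 10), ("debit", 5), ("credit", 20)]
for order in (ops, list(reversed(ops))):
    ledger = Ledger(100)
    for op, amount in order:
        getattr(ledger, op)(amount)
    print(ledger.balance)            # 125 both times: order does not matter

reg_a, reg_b = Register(100), Register(100)
reg_a.set(110); reg_a.set(130)       # -> 130
reg_b.set(130); reg_b.set(110)       # -> 110: the last write overwrites the other
print(reg_a.balance, reg_b.balance)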
Let's take the example from the beginning of this chapter, but use a different
operation. In this scenario, clients are sending messages to two nodes, which see the
operations in different orders:
Instead of string concatenation, assume that we are looking to find the largest value
(e.g. MAX()) for a set of integers. The messages 1, 2 and 3 are:
1: { operation: max(previous, 3) }
2: { operation: max(previous, 5) }
3: { operation: max(previous, 7) }
In both cases, the two replicas see the updates in a different order, but we are able to
merge the results so that the outcome does not depend on that order. The result
converges to the same answer in both cases because of the merge procedure ( max )
we used.
It is likely not possible to write a merge procedure that works for all data types. In
Dynamo, a value is a binary blob, so the best that can be done is to expose it and ask
the application to handle each conflict.
However, if we know that the data is of a more specific type, handling these kinds of
conflicts becomes possible. CRDT's are data structures designed to provide data types
that will always converge, as long as they see the same set of operations (in any order).
It turns out that these structures are already known in mathematics; they are known
as join or meet semilattices.
A lattice is a partially ordered set with a distinct top (least upper bound) and a distinct
bottom (greatest lower bound). A semilattice is like a lattice, but one that only has a
distinct top or bottom. A join semilattice is one with a distinct top (least upper bound)
and a meet semilattice is one with a distinct bottom (greatest lower bound).
For example, here are two lattices: one drawn for a set, where the merge operator is
union(items) and one drawn for a strictly increasing integer counter, where the
merge operator is max(values) :
   { a, b, c }               7
  /     |     \             / \
{a, b} {b,c} {a,c}         5   7
  |  \  / \  / |          / \ / \
 {a}   {b}   {c}         3   5   7
With data types that can be expressed as semilattices, you can have replicas
communicate in any pattern and receive the updates in any order, and they will
eventually agree on the end result as long as they all see the same information. That is
a powerful property that can be guaranteed as long as the prerequisites hold.
This means that several familiar data types have more specialized implementations as
CRDT's which make a different tradeoff in order to resolve conflicts in an order-
independent manner. Unlike a key-value store which simply deals with registers (e.g.
values that are opaque blobs from the perspective of the system), someone using
CRDTs must use the right data type to avoid anomalies.
Counters
Grow-only counter (merge = max(values); payload = single integer)
Positive-negative counter (consists of two grow counters, one for
increments and another for decrements)
Registers
Last-write-wins register (timestamps or version numbers; merge = max(ts);
payload = blob)
Multi-valued register (vector clocks; merge = take both)
Sets
Grow-only set (merge = union(items); payload = set; no removal)
Two-phase set (consists of two sets, one for adding, and another for
removing; elements can be added once and removed once)
Unique set (an optimized version of the two-phase set)
Last write wins set (merge = max(ts); payload = set)
Positive-negative set (consists of one PN-counter per set item)
Observed-remove set
Graphs and text sequences (see the paper)
To ensure anomaly-free operation, you need to find the right data type for your
specific application - for example, if you know that you will only remove an item once,
then a two-phase set works; if you will only ever add items to a set and never remove
them, then a grow-only set works.
Not all data structures have known implementations as CRDTs, but there are CRDT
implementations for booleans, counters, sets, registers and graphs in the recent (2011)
survey paper from Shapiro et al.
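To give a flavor of these data types, here is a sketch of a grow-only counter and a grow-only set (merge = element-wise max and merge = union, matching the list above; no removals and no networking):

class GCounter:
    """Grow-only counter: each replica increments its own slot; the value
    is the sum, and merge takes the element-wise max."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}
    def increment(self, amount=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + amount
    def value(self):
        return sum(self.counts.values())
    def merge(self, other):
        for rid, count in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), count)

class GSet:
    """Grow-only set: merge is just set union."""
    def __init__(self):
        self.items = set()
    def add(self, item):
        self.items.add(item)
    def merge(self, other):
        self.items |= other.items

a, b = GCounter("a"), GCounter("b")
a.increment(2); b.increment(3)
a.merge(b); b.merge(a)
print(a.value(), b.value())   # both 5, regardless of merge order or repetition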
The CALM theorem

In a conventional program, statements execute in a specific order, and that order
matters for the result. However, there are many programming models in which the
order of statements does not play a significant role. For example, in the MapReduce
model, both the Map and
the Reduce tasks are specified as stateless tuple-processing tasks that need to be run
on a dataset. Concrete decisions about how and in what order data is routed to the
tasks is not specified explicitly, instead, the batch job scheduler is responsible for
scheduling the tasks to run on the cluster.
Similarly, in SQL one specifies the query, but not how the query is executed. The query
is simply a declarative description of the task, and it is the job of the query optimizer to
figure out an efficient way to execute the query (across multiple machines, databases
and tables).
It should be clear from these two examples that there are many kinds of
data processing tasks which are amenable to being expressed in a declarative
language where the order of execution is not explicitly specified. Programming models
which express a desired result while leaving the exact order of statements up to an
optimizer to decide often have semantics that are order-independent. This means that
such programs may be possible to execute without coordination, since they depend on
the inputs they receive but not necessarily the specific order in which the inputs are
received.
The key point is that such programs may be safe to execute without coordination.
Without a clear rule that characterizes what is safe to execute without coordination,
and what is not, we cannot implement a program while remaining certain that the
result is correct.
This is what the CALM theorem is about. The CALM theorem is based on a recognition
of the link between logical monotonicity and useful forms of eventual consistency (e.g.
confluence / convergence). It states that logically monotonic programs are guaranteed
to be eventually consistent.
If we know that some computation is logically monotonic, then we know that it is also
safe to execute without coordination.
Monotonicity
if sentence φ is a consequence of a set of premises Γ , then it can also be inferred
from any set Δ of premises extending Γ
Most standard logical frameworks are monotonic: any inferences made within a
framework such as first-order logic, once deductively valid, cannot be invalidated by
new information. A non-monotonic logic is a system in which that property does not
hold - in other words, if some conclusions can be invalidated by learning new
knowledge.
Within the artificial intelligence community, non-monotonic logics are associated with
defeasible reasoning - reasoning, in which assertions made utilizing partial information
can be invalidated by new knowledge. For example, if we learn that Tweety is a bird,
we'll assume that Tweety can fly; but if we later learn that Tweety is a penguin, then
we'll have to revise our conclusion.
Monotonicity concerns the relationship between premises (or facts about the world)
and conclusions (or assertions about the world). Within a monotonic logic, we know
that our results are retraction-free: monotone computations do not need to be
recomputed or coordinated; the answer gets more accurate over time. Once we know
that Tweety is a bird (and that we're reasoning using monotonic logic), we can safely
conclude that Tweety can fly and that nothing we learn can invalidate that conclusion.
Both basic Datalog and relational algebra (even with recursion) are known to be
monotonic. More specifically, computations expressed using a certain set of basic
operators are known to be monotonic (selection, projection, natural join, cross
product, union and recursive Datalog without negation), and non-monotonicity is
introduced by using more advanced operators (negation, set difference, division,
universal quantification, aggregation).
This means that computations expressed using a significant number of operators (e.g.
map, filter, join, union, intersection) in those systems are logically monotonic; any
computations using those operators are also monotonic and thus safe to run without
coordination. Expressions that make use of negation and aggregation, on the other
hand, are not safe to run without coordination.
and:
This idea can be seen from the other direction as well. Coordination protocols are themselves
aggregations, since they entail voting: Two-Phase Commit requires unanimous votes, Paxos
consensus requires majority votes, and Byzantine protocols require a 2/3 majority. Waiting
requires counting.
If we can express our computation in a manner in which it is possible to test for
monotonicity, then we can perform a whole-program static analysis that detects which
parts of the program are eventually consistent and safe to run without coordination
(the monotonic parts) - and which parts are not (the non-monotonic ones).
Note that this requires a different kind of language, since these inferences are hard to
make for traditional programming languages where sequence, selection and iteration
are at the core. Which is why the Bloom language was designed.
How does a computation differ from an assertion? Let's consider the query "is pizza a
vegetable?". To answer that, we need to get at the core: when is it acceptable to infer
that something is (or is not) true?
                      | OWA +                | OWA +
                      | Monotonic logic      | Non-monotonic logic
Can derive P(true)    | Can assert P(true)   | Cannot assert P(true)
Can derive P(false)   | Can assert P(false)  | Cannot assert P(false)
Cannot derive P(true) | Unknown              | Unknown
or P(false)           |                      |
When making the open world assumption, we can only safely assert something we can
deduce from what is known. Our information about the world is assumed to be
incomplete.
Let's first look at the case where we know our reasoning is monotonic. In this case, any
(potentially incomplete) knowledge that we have cannot be invalidated by learning
new knowledge. So if we can infer that a sentence is true based on some deduction,
such as "things that contain two tablespoons of tomato paste are vegetables" and
"pizza contains two tablespoons of tomato paste", then we can conclude that "pizza is
a vegetable". The same goes for if we can deduce that a sentence is false.
However, if we cannot deduce anything - for example, the set of knowledge we have
contains customer information and nothing about pizza or vegetables - then under the
open world assumption we have to say that we cannot conclude anything.
However, within the database context, and within many computer science applications
we prefer to make more definite conclusions. This means assuming what is known as
the closed-world assumption: that anything that cannot be shown to be true is false.
This means that no explicit declaration of falsehood is needed. In other words, the
database of facts that we have is assumed to be complete (minimal), so that anything
not in it can be assumed to be false.
For example, under the CWA, if our database does not have an entry for a flight
between San Francisco and Helsinki, then we can safely conclude that no such flight
exists.
We need one more thing to be able to make definite assertions: logical circumscription.
Circumscription is a formalized rule of conjecture. Domain circumscription conjectures
that the known entities are all there are. We need to be able to assume that the known
entities are all there are in order to reach a definite conclusion.
                      | CWA +                | CWA +
                      | Circumscription +    | Circumscription +
                      | Monotonic logic      | Non-monotonic logic
Can derive P(true)    | Can assert P(true)   | Can assert P(true)
Can derive P(false)   | Can assert P(false)  | Can assert P(false)
Cannot derive P(true) | Can assert P(false)  | Can assert P(false)
or P(false)           |                      |
What does this mean in practice? First, monotonic logic can reach definite conclusions
as soon as it can derive that a sentence is true (or false). Second, nonmonotonic logic
requires an additional assumption: that the known entities are all there are.
So why are two operations that are on the surface equivalent different? Why is adding
two numbers monotonic, but calculating an aggregation over two nodes not? Because
the aggregation does not only calculate a sum but also asserts that it has seen all of
the values. And the only way to guarantee that is to coordinate across nodes and
ensure that the node performing the calculation has really seen all of the values within
the system.
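A small sketch of that distinction: the selection below only ever gains results as new facts arrive, while the count implicitly claims to have seen everything (the flight data is invented):

facts_seen_so_far = [("flight", "SFO", "JFK"), ("flight", "JFK", "HEL")]
late_arriving_fact = ("flight", "SFO", "OAK")

def flights_from(facts, origin):
    # Monotone (selection): more facts can only add results, never retract
    # one, so partial inputs give answers that are safe but incomplete.
    return [f for f in facts if f[0] == "flight" and f[1] == origin]

def count_flights_from(facts, origin):
    # Non-monotone (aggregation): the count implicitly asserts "I have seen
    # every flight", which only coordination can guarantee.
    return len(flights_from(facts, origin))

print(flights_from(facts_seen_so_far, "SFO"))          # a subset of the truth
print(count_flights_from(facts_seen_so_far, "SFO"))    # 1 - wrong once the
print(count_flights_from(facts_seen_so_far +           # late fact arrives: 2
                         [late_arriving_fact], "SFO"))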
Purely monotone systems are rare. It seems that most applications operate under the
closed-world assumption even when they have incomplete data, and we humans are
fine with that. When a database tells you that a direct flight between San Francisco and
Helsinki does not exist, you will probably treat this as "according to this database, there
is no direct flight", but you do not rule out the possibility that that in reality such a
flight might still exist.
Really, this issue only becomes interesting when replicas can diverge (e.g. during a
partition or due to delays during normal operation). Then there is a need for a more
specific consideration: whether the answer is based on just the current node, or the
totality of the system.
The Bloom language

In Bloom, each node has a database consisting of collections and lattices. Programs
are expressed as sets of unordered statements which interact with collections (sets of
facts) and lattices (CRDTs). Statements are order-independent by default, but one can
also write non-monotonic functions.
Have a look at the Bloom website and tutorials to learn more about Bloom.
Further reading
CRDTs
Marc Shapiro's talk @ Microsoft is a good starting point for understanding CRDT's.
If you liked the book, follow me on Github (or Twitter). I love seeing that I've had some
kind of positive impact. "Create more value than you capture" and all that.
Many many thanks to: logpath, alexras, globalcitizen, graue, frankshearar, roryokane,
jpfuentes2, eeror, cmeiklejohn, stevenproctor, eos2102 and steveloughran for their
help! Of course, any mistakes and omissions that remain are my fault!
It's worth noting that my chapter on eventual consistency is fairly Berkeley-centric; I'd
like to change that. I've also skipped one prominent use case for time: consistent
snapshots. There are also a couple of topics which I should expand on: namely, an
explicit discussion of safety and liveness properties and a more detailed discussion of
consistent hashing. However, I'm off to Strange Loop 2013, so whatever.
If this book had a chapter 6, it would probably be about the ways in which one can
make use of and deal with large amounts of data. It seems that the most common
type of "big data" computation is one in which a large dataset is passed through a
single simple program. I'm not sure what the subsequent chapters would be (perhaps
high performance computing, given that the current focus has been on feasibility), but
I'll probably know in a couple of years.
Seminal papers
Each year, the Edsger W. Dijkstra Prize in Distributed Computing is given to
outstanding papers on the principles of distributed computing. Check out the link for
the full list, which includes classics such as:
Microsoft Academic Search has a list of top publications in distributed & parallel
computing ordered by number of citations - this may be an interesting list to skim for
more classics.
Systems