The CAP Theorem and The Design of Large Scale Distributed Systems: Part I
Silvia Bonomi
University of Rome “La Sapienza”
www.dis.uniroma1.it/~bonomi
[Figure: evolution of distributed platforms, from mainframe-based information systems, through the first Internet-based systems for military purposes and client-server architectures, to wireless and mobile ad-hoc networks and cloud computing platforms.]
Relational Databases History
} Relational Databases – mainstay of business
} Web-based applications caused spikes in load
} Especially true for public-facing e-Commerce sites
} Developers began to front the RDBMS with memcached or to integrate other caching mechanisms within the application
Scaling Up
} Issues with scaling up when the dataset is just too big
} RDBMS were not designed to be distributed
} Began to look at multi-node database solutions
} Known as ‘scaling out’ or ‘horizontal scaling’
} Different approaches include:
} Master-slave
} Sharding
Scaling RDBMS – Master/Slave
} Master-Slave
} All writes go to the master; all reads are performed against the replicated slave databases
} Critical reads may be stale because writes may not have propagated down yet (see the sketch below)
} Large data sets can pose problems, as the master needs to duplicate data to all slaves
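To make the read/write split concrete, here is a minimal, hypothetical sketch of master/slave routing; the `MasterSlaveStore` class and its fixed replication lag are illustrative assumptions, not part of the original slides.

```python
import time

class MasterSlaveStore:
    """Toy master/slave replication: writes go to the master,
    reads go to a replica that lags behind by `lag` seconds."""

    def __init__(self, lag=0.5):
        self.master = {}      # authoritative copy
        self.slave = {}       # read replica
        self.pending = []     # (apply_at, key, value) not yet replicated
        self.lag = lag

    def write(self, key, value):
        # All writes are applied to the master immediately.
        self.master[key] = value
        self.pending.append((time.time() + self.lag, key, value))

    def read(self, key):
        # Reads are served by the slave, which may be stale.
        self._replicate()
        return self.slave.get(key)

    def _replicate(self):
        # Apply only the writes whose replication delay has elapsed.
        now = time.time()
        still_pending = []
        for apply_at, key, value in self.pending:
            if apply_at <= now:
                self.slave[key] = value
            else:
                still_pending.append((apply_at, key, value))
        self.pending = still_pending


store = MasterSlaveStore(lag=0.5)
store.write("cart", ["book"])
print(store.read("cart"))   # likely None: the write has not reached the slave yet
time.sleep(0.6)
print(store.read("cart"))   # ['book'] once the replication lag has passed
```

The demo shows the "critical reads may be stale" point: a read issued right after a write can miss it because replication to the slave has not completed.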
Scaling RDBMS - Sharding
} Partitioning or sharding
} Scales well for both reads and writes
} Not transparent: the application needs to be partition-aware (see the sketch below)
} Can no longer have relationships/joins across partitions
} Loss of referential integrity across shards
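As an illustration of what "partition-aware" means, here is a minimal, hypothetical sketch of hash-based shard routing; the shard count and the `put`/`get` helpers are assumptions added for illustration, not part of the original slides.

```python
import hashlib

NUM_SHARDS = 4
# One dict per shard; in a real deployment each would be a separate database node.
shards = [{} for _ in range(NUM_SHARDS)]

def shard_for(key: str) -> int:
    """Route a key to a shard by hashing it; the application must do this itself."""
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key: str, value) -> None:
    shards[shard_for(key)][key] = value

def get(key: str):
    return shards[shard_for(key)].get(key)

put("user:42", {"name": "Alice"})
put("user:99", {"name": "Bob"})
print(get("user:42"))   # {'name': 'Alice'}
# A join across users living on different shards cannot be pushed to the database:
# the application would have to fetch from each shard and combine the rows itself.
```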
Other ways to scale RDBMS
} Multi-Master replication
} INSERT only, not UPDATES/DELETES
} No JOINs, thereby reducing query time
} This involves de-normalizing data
} In-memory databases
Today…
Context
} Networked Shared-data Systems
[Figure: clients accessing a networked shared-data system that replicates a data item A across several nodes.]
Fundamental Properties
} Consistency
} (informally) "every request receives the right response"
} E.g., if I fetch my shopping list on Amazon, I expect it to contain all the previously selected items
} Availability
} (informally) "each request eventually receives a response"
} E.g., eventually I can access my shopping list
CAP Theorem
} 2000: Eric Brewer, PODC conference keynote
} 2002: Seth Gilbert and Nancy Lynch, ACM SIGACT News 33(2)
Proof Intuition
[Figure: proof intuition. A networked shared-data system keeps two replicas of a data item A on opposite sides of a partition; one client performs Write(v1, A) on one side while another client performs Read(A) on the other side.]
Fox & Brewer "CAP Theorem": C-A-P, choose two
} Claim: every distributed system is on one side of the triangle
} [Figure: triangle with vertices Consistency, Availability, Tolerance to network Partitions]
Forfeit Partitions (Consistency + Availability): the "inside"
} Examples: single-site databases, cluster databases, LDAP, fiefdoms
} Traits: 2-phase commit, cache validation protocols
Observations
} CAP states that in case of failures you can have at most two of
these three properties for any shared-data system
Forfeit Availability (Consistency + tolerance to network Partitions)
} Examples: distributed databases, distributed locking, majority protocols
} Traits: pessimistic locking, make minority partitions unavailable
Forfeit Consistency (Availability + tolerance to network Partitions): the "outside"
} Examples: Coda, web caching, DNS, emissaries
} Traits: expirations/leases, conflict resolution, optimistic
Consistency Boundary Summary
} We can have consistency & availability within a cluster
} No partitions within the boundary!
} Basically Available: the system is available most of the time, although some subsystems may be temporarily unavailable
} Soft State: data are "volatile" in the sense that their persistence is in the hands of the user, who must take care of refreshing them
CAP, ACID and BASE
} The relation between ACID and CAP is more complex
Warning!
} What CAP says:
} When there is a partition in the network, you cannot have both C and A
2 of 3 is misleading
} Partitions are rare events
} There is little reason to forfeit C or A by design
Consistency/Latency Trade Off
} CAP does not force designers to give up A or C, so why do so many systems trade off C?
Consistency/Latency Trade Off
} High availability is a strong requirement of modern shared-data systems
PACELC
} Abadi proposes to revise CAP as PACELC: if there is a Partition (P), how does the system trade off Availability and Consistency (A and C); Else (E), when the system runs normally without partitions, how does it trade off Latency (L) and Consistency (C)?
Partitions Management
[Figure 1: partition mode and recovery. The state starts out consistent and remains so until a partition starts. To stay available, both sides enter partition mode and continue to execute operations, creating concurrent states S1 and S2, which are inconsistent. When the partition ends, the truth becomes clear and partition recovery starts. During recovery, the system merges S1 and S2 into a consistent state S' and also compensates for any mistakes made during the partition.]
Partition Detection
} CAP does not explicitly talk about latencies
} However…
} To keep the system live, time-outs must be set
} When a time-out expires, the system must make a decision: is a partition happening? If NO, continue to wait; if YES, go on with the execution (see the sketch below)
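A minimal, hypothetical sketch of this time-out decision; the `PARTITION_TIMEOUT` value, the host name, and the `request_with_timeout` helper are illustrative assumptions, not part of the slides.

```python
import socket

PARTITION_TIMEOUT = 2.0  # seconds to wait before suspecting a partition

def request_with_timeout(host: str, port: int, payload: bytes):
    """Send a request to a replica; return None if the time-out expires.

    A None result does not prove a partition exists: the replica may just be
    slow. The caller must nevertheless decide whether to keep waiting
    (preserving consistency) or to proceed locally (preserving availability).
    """
    try:
        with socket.create_connection((host, port), timeout=PARTITION_TIMEOUT) as sock:
            sock.sendall(payload)
            sock.settimeout(PARTITION_TIMEOUT)
            return sock.recv(4096)
    except OSError:   # covers timeouts and connection errors
        return None

response = request_with_timeout("replica.example.org", 9000, b"read cart")
if response is None:
    # Time-out expired: suspect a partition and enter partition mode,
    # i.e. go on with the execution and log operations for later recovery.
    print("no answer within the time-out: entering partition mode")
else:
    print("replica answered:", response)
```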
Partition Detection
} Partition detection is not global
} One interacting party may detect the partition while the other does not
} Different processes may thus be in different states (partition mode vs. normal mode)
Which Operations Should Proceed?
} Selecting which operations may stay live is a hard task
} It requires knowledge of the severity of an invariant violation
} Example: every key in a DB must be unique
¨ Managing a violation of key uniqueness is simple
¨ Merge the elements with the same key, or update one of the keys (see the sketch below)
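A minimal, hypothetical illustration of that repair step; the merge policy (a field-wise union of the two records) is an assumption for illustration, the slides do not prescribe one.

```python
def merge_duplicate_keys(side1: dict, side2: dict) -> dict:
    """Merge two key/value maps produced independently during a partition.

    If both sides created a record under the same key, the uniqueness
    invariant was violated; here we repair it by merging the two records,
    which is one of the options mentioned in the slides.
    """
    merged = dict(side1)
    for key, record in side2.items():
        if key in merged and merged[key] != record:
            # Duplicate key created on both sides: combine the two records.
            merged[key] = {**merged[key], **record}
        else:
            merged[key] = record
    return merged

s1 = {"order:7": {"item": "book"}}
s2 = {"order:7": {"qty": 2}, "order:8": {"item": "pen"}}
print(merge_duplicate_keys(s1, s2))
# {'order:7': {'item': 'book', 'qty': 2}, 'order:8': {'item': 'pen'}}
```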
Partition Recovery
} When a partition is repaired, the logs kept by the two sides may be used to recover consistency
Basic Techniques: Version Vector
} A version vector has an entry for each node that updates the state
} Each node has an identifier
} Each operation is stored in the log with a pair <nodeId, timeStamp> attached
Version Vectors: example
} If ts(A) < ts(B) (e.g. ts(A) = [1, 0, 0] and ts(B) = [1, 1, 0]), then A → B: A causally precedes B
} If ts(A) and ts(B) are incomparable (e.g. ts(A) = [0, 0, 1] and ts(B) = [1, 0, 0]), then A || B: the operations are concurrent and POTENTIALLY INCONSISTENT!
Basic Techniques: Version Vector
} Using version vectors it is always possible to determine whether two operations are causally related or concurrent (and therefore dangerous), as the sketch below shows
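A minimal, hypothetical sketch of the comparison rule; the list-based representation and the helper names are assumptions.

```python
def happened_before(a, b) -> bool:
    """True if version vector a causally precedes b (a <= b component-wise, a != b)."""
    return all(x <= y for x, y in zip(a, b)) and a != b

def compare(a, b) -> str:
    if happened_before(a, b):
        return "A -> B"              # A causally precedes B
    if happened_before(b, a):
        return "B -> A"              # B causally precedes A
    return "A || B (concurrent)"     # incomparable: potentially inconsistent

print(compare([1, 0, 0], [1, 1, 0]))   # A -> B
print(compare([0, 0, 1], [1, 0, 0]))   # A || B (concurrent)
```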
Basic Techniques: CRDT
} Commutative Replicated Data Types (CRDTs) are data structures that provably converge after a partition (e.g. a set)
} Characteristics:
} All the operations performed during a partition are commutative (e.g. add(a) and add(b) commute), or
} Values are represented on a lattice and all operations performed during a partition are monotonically increasing with respect to the lattice (which gives an order among them)
} Approach taken by Amazon with the shopping cart
} Allows designers to choose A while still ensuring convergence after the partition is repaired (see the sketch below)
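A minimal, hypothetical sketch of the simplest such structure, a grow-only set, whose only operation (add) is commutative so replicas can converge by taking the union of their states; the class name and `merge` method are assumptions, not Amazon's implementation.

```python
class GrowOnlySet:
    """Grow-only set: the only operation is add, and adds commute,
    so any two replicas converge by taking the union of their elements."""

    def __init__(self):
        self.elements = set()

    def add(self, item):
        self.elements.add(item)

    def merge(self, other: "GrowOnlySet") -> None:
        # Union is commutative, associative and idempotent, so the merged
        # state is the same regardless of the order in which replicas
        # exchange their states after the partition heals.
        self.elements |= other.elements

# Two replicas diverge during a partition...
replica1, replica2 = GrowOnlySet(), GrowOnlySet()
replica1.add("book")
replica2.add("pen")
# ...and converge to the same state once the partition is repaired.
replica1.merge(replica2)
replica2.merge(replica1)
print(replica1.elements == replica2.elements)   # True: {'book', 'pen'}
```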
Basic Techniques:
Mistake Compensation
} When selecting A and forfeiting C, mistakes may be made
} E.g., invariant violations, which must later be compensated for
What is NoSQL?
} Stands for Not Only SQL
} Class of non-relational data storage systems
} Usually do not require a fixed table schema nor do they use
the concept of joins
} All NoSQL offerings relax one or more of the ACID
properties (will talk about the CAP theorem)
Why NoSQL?
} For data storage, an RDBMS cannot be the be-all/end-all
} Just as there are different programming languages, need to have
other data storage tools in the toolbox
} A NoSQL solution is more acceptable to a client now than
even a year ago
How did we get here?
} Explosion of social media sites (Facebook, Twitter) with
large data needs
} Rise of cloud-based solutions such as Amazon S3 (Simple Storage Service)
} Just as moving to dynamically-typed languages (Ruby/
Groovy), a shift to dynamically-typed data with frequent
schema changes
} Open-source community
Dynamo and BigTable
} Three major papers were the seeds of the NoSQL movement (BigTable, Dynamo, and the CAP theorem)
} BigTable (Google)
} Dynamo (Amazon)
} Gossip protocol (discovery and error detection)
} Distributed key-value data store
} Eventual consistency
Thank You!
Questions?!