Chapter 7 Consistency and Replication
An important issue in distributed systems is the replication of data. Data are
generally replicated to enhance reliability or improve performance. One of the
major problems is keeping replicas consistent.
This means that when one copy is updated we need to ensure that the other
copies are updated as well; otherwise the replicas will no longer be the same.
In this chapter, we take a detailed look at what consistency of replicated data
actually means and the various ways that consistency can be achieved.
More on Replication
• Replicas allow remote sites to continue working in
the event of local failures.
• It is also possible to protect against data corruption.
• Replicas allow data to reside close to where
it is used.
• Even a large number of replicated “local” systems
can improve performance: think of clusters.
• This directly supports the distributed systems goal
of enhanced scalability.
Replication and Scalability
• Replication is a widely-used scalability technique: think of
Web clients and Web proxies.
• When systems scale, the first problems to surface are those
associated with performance – as the systems get bigger
(e.g., more users), they often get slower.
• Replicating the data and moving it closer to where it is
needed helps to solve this scalability problem.
• A problem remains: how to efficiently synchronize all of
the replicas created to solve the scalability issue?
• Dilemma: adding replicas improves scalability, but incurs
the (oftentimes considerable) overhead of keeping the
replicas up-to-date!!!
• As we shall see, the solution often results in a relaxation of
any consistency constraints.
Replication and Consistency
• But if there are many replicas of the same thing,
how do we keep all of them up-to-date? How
do we keep the replicas consistent?
• Consistency can be achieved in a number of
ways. We will study a number of consistency
models, as well as protocols for implementing
the models.
• So, what’s the catch?
– It is not easy to keep all those replicas consistent.
Reasons for Replication
1. Performance enhancement
Copies of the data are placed at multiple locations, so a client can get the data
from a nearby location. This decreases the time taken to access the data and
enhances the performance of the distributed system. Multiple servers located at
different locations can provide the same service, allowing a client's requests to
be processed in parallel.
2. Increased availability
Replication is a technique for automatically maintaining the availability of data
despite server failures.
3. Fault tolerance
If a server fails, the data can still be accessed from other servers.
Challenges in Replication
1. Placement (where to place replicas)
• Permanent replicas: clusters of servers that may be
geographically dispersed.
• Server-initiated replicas: caches placed at the initiative of the
hosting servers, including server caches.
• Client-initiated replicas: caches created by clients, such as
Web browser caches.
2. Propagation of updates among replicas
3. Keeping the replicas consistent
Data-centric Consistency Models
Strict Consistency
Causal Consistency (1)
Causal Consistency (2)
More Client-Centric Consistency
• How fast should updates (writes) be made
available to read-only processes?
– Think of most database systems: mainly read.
– Think of the DNS: write-write conflicts do not
occur, only read-write conflicts.
– Think of WWW: as with DNS, except that heavy
use of client-side caching is present: even the
return of stale pages is acceptable to most users.
• These systems all exhibit a high degree of
acceptable inconsistency … with the replicas
gradually becoming consistent over time.
Toward Eventual Consistency
• The only requirement is that all replicas will
eventually be the same.
• All updates must be guaranteed to propagate to
all replicas … eventually!
• This works well if every client always updates
the same replica.
• Things are a little more difficult if the clients are
mobile.
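The behavior described above can be illustrated with a small sketch. This is not from the slides: the `Replica` class, its `merge` method, and the use of last-writer-wins timestamps for conflict resolution are all illustrative assumptions. The point is only that updates accepted locally propagate to the other copies eventually, via lazy pairwise synchronization.

```python
import time

class Replica:
    """Illustrative replica: accepts local writes, lazily syncs with peers."""

    def __init__(self):
        # item -> (timestamp, value); last-writer-wins on merge (an assumption)
        self.store = {}

    def write(self, key, value):
        self.store[key] = (time.time(), value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def merge(self, other):
        # Anti-entropy: adopt any newer versions held by a peer replica.
        for key, (ts, val) in other.store.items():
            if key not in self.store or self.store[key][0] < ts:
                self.store[key] = (ts, val)

# Two replicas diverge, then converge after a merge.
a, b = Replica(), Replica()
a.write("x", 1)          # update applied only at replica a
b.merge(a)               # propagation happens "eventually"
assert b.read("x") == 1  # after merging, the replicas agree
```

Note that nothing bounds *when* `merge` runs; eventual consistency only promises convergence once updates stop and propagation completes.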
Eventual Consistency
Monotonic Reads (2)
Monotonic Writes (1)
• In a monotonic-write consistent
store, the following condition holds:
– A write operation by a process on a
data item x is completed before any
successive write operation on x by the same
process.
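The condition above can be sketched in code. This is a minimal illustration, not an implementation from the slides: the `MWStore` class and the per-process sequence numbers are assumptions. A replica buffers a write that arrives out of order and applies it only after all earlier writes by the same process have been applied.

```python
class MWStore:
    """Illustrative store that applies each process's writes in issue order."""

    def __init__(self):
        self.value = {}      # data item -> current value
        self.applied = {}    # process id -> last applied sequence number
        self.pending = []    # writes that arrived out of order

    def receive(self, pid, seq, item, value):
        self.pending.append((pid, seq, item, value))
        self._drain()

    def _drain(self):
        # Apply any buffered write whose predecessor (by the same process)
        # has already completed; repeat until no more progress is possible.
        progress = True
        while progress:
            progress = False
            for w in list(self.pending):
                pid, seq, item, value = w
                if self.applied.get(pid, 0) == seq - 1:
                    self.value[item] = value
                    self.applied[pid] = seq
                    self.pending.remove(w)
                    progress = True

# Write #2 from process "p" arrives first; it is held until write #1 lands.
s = MWStore()
s.receive("p", 2, "x", "new")
assert s.value == {}           # out-of-order write buffered, not applied
s.receive("p", 1, "x", "old")
assert s.value["x"] == "new"   # both applied, in per-process order
```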
Monotonic Writes (2)
Read Your Writes (1)
Writes Follow Reads (1)
Content Replication and Placement
• Regardless of which consistency model is
chosen, we need to decide where, when and by
whom copies of the data-store are to be placed.
Replica Placement Types
• There are three types of replica:
1. Permanent replicas: tend to be small in number,
organized as COWs (Clusters of Workstations) or
mirrored systems.
2. Server-initiated replicas: used to enhance
performance at the initiation of the owner of the
data-store. Typically used by web hosting
companies to geographically locate replicas close
to where they are needed most. (Often referred to
as “push caches”).
3. Client-initiated replicas: created as a result of client
requests – think of browser caches. Works well
assuming, of course, that the cached data does not
go stale too soon.
Content Distribution
• When a client initiates an update to a distributed
data-store, what gets propagated?
• There are three possibilities:
1. Propagate a notification of the update to the other
replicas – this is an "invalidation protocol", which
indicates that the replica's data is no longer up-to-
date. Works well when there are many writes relative to reads.
2. Transfer the data from one replica to another –
works well when there are many reads.
3. Propagate the update operation to the other replicas – this is
"active replication", and it shifts the workload to each
of the replicas upon an "initial write".
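The first option, an invalidation protocol, can be sketched as follows. This is an illustrative toy, not a real protocol implementation: the `Origin` and `InvalidationReplica` classes are hypothetical names. The key idea is that a write pushes only a small "your copy is stale" notification; the actual data moves only when a replica next needs it.

```python
class InvalidationReplica:
    """Replica that drops an invalidated entry and refetches on the next read."""

    def __init__(self, origin):
        self.origin = origin
        self.cache = {}

    def read(self, key):
        if key not in self.cache:
            # Cache miss (or invalidated copy): fetch fresh data on demand.
            self.cache[key] = self.origin.data[key]
        return self.cache[key]

    def invalidate(self, key):
        self.cache.pop(key, None)

class Origin:
    """Origin server that pushes invalidations (not data) on each write."""

    def __init__(self):
        self.data = {}
        self.replicas = []

    def write(self, key, value):
        self.data[key] = value
        for r in self.replicas:
            r.invalidate(key)   # notification only, no data transferred

origin = Origin()
replica = InvalidationReplica(origin)
origin.replicas.append(replica)

origin.write("x", 1)
assert replica.read("x") == 1   # first read pulls the data
origin.write("x", 2)            # only an invalidation is propagated
assert replica.read("x") == 2   # stale entry was dropped; read refetches
```

This is why invalidation suits write-heavy workloads: repeated writes cost only repeated notifications, and data is shipped at most once per subsequent read.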
Push vs. Pull Protocols
• Another design issue is whether updates are
pushed or pulled.
1. Push-based/Server-based Approach: sent
“automatically” by server, the client does not request
the update. This approach is useful when a high
degree of consistency is needed. Often used between
permanent and server-initiated replicas.
2. Pull-based/Client-based Approach: used by client
caches (e.g., browsers), updates are requested by the
client from the server. No request, no update!
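A pull-based client cache can be sketched with a simple expiry rule. This is an illustrative assumption, not something prescribed by the slides: the `PullCache` class and its TTL (time-to-live) policy are hypothetical, but TTL-style expiry is a common way for a pull-based cache to decide when to ask the server again.

```python
import time

class PullCache:
    """Illustrative client cache: refetches when an entry's TTL has expired."""

    def __init__(self, server, ttl):
        self.server = server   # dict standing in for the origin server
        self.ttl = ttl         # seconds an entry is served without revalidation
        self.cache = {}        # key -> (fetched_at, value)

    def read(self, key, now=None):
        now = time.time() if now is None else now
        entry = self.cache.get(key)
        if entry is None or now - entry[0] > self.ttl:
            # The server pushed nothing; the client pulls on demand.
            self.cache[key] = (now, self.server[key])
        return self.cache[key][1]

server = {"page": "v1"}
cache = PullCache(server, ttl=10)
assert cache.read("page", now=0) == "v1"
server["page"] = "v2"
assert cache.read("page", now=5) == "v1"   # within TTL: stale copy served
assert cache.read("page", now=20) == "v2"  # TTL expired: client pulls update
```

The stale read at `now=5` shows the trade-off: pull-based caching tolerates a window of inconsistency in exchange for not burdening the server with push traffic.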
Pull versus Push Protocols
Primary-Based Protocols
Remote-Write Protocols
Local-Write Protocols
• In this protocol, a single copy of the data item
is still maintained.
• Upon a write, the data item gets transferred to
the replica that is writing.
• That is, the status of primary for a data item is
transferable.
• This is also called a “fully migrating approach”.
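The migration step can be sketched in a few lines. This is a toy illustration, not a full protocol: the `Item` class and `local_write` function are hypothetical names, and real local-write protocols must also coordinate the transfer with other replicas. The essential behavior is that primary status moves to the writer before the write is performed locally.

```python
class Item:
    """A data item whose primary-copy status migrates to the writing replica."""

    def __init__(self, value, primary):
        self.value = value
        self.primary = primary   # name of the replica currently holding primary

def local_write(item, replica, new_value):
    # First migrate primary status to the replica performing the write...
    if item.primary != replica:
        item.primary = replica
    # ...then the write completes locally at the (new) primary.
    item.value = new_value

x = Item(value=0, primary="A")
local_write(x, "B", 42)
assert x.primary == "B"   # primary status moved to the writer
assert x.value == 42
```

Subsequent writes by the same replica then proceed entirely locally, which is what makes the approach attractive for mobile or bursty writers.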
Cache-coherence protocols
Caches form a special case of replication, in the sense that they are generally
controlled by clients instead of servers. However, cache-coherence protocols,
which ensure that a cache is consistent with the server-initiated replicas are, in
principle, not very different from the consistency protocols discussed so far.
First, caching solutions may differ in their coherence detection strategy, that
is, when inconsistencies are actually detected. In static solutions, a compiler is
assumed to perform the necessary analysis prior to execution, and to determine
which data may actually lead to inconsistencies because they may be cached.
Another design issue for cache-coherence protocols is the coherence en-
forcement strategy, which determines how caches are kept consistent with the
copies stored at servers. The simplest solution is to disallow shared data to be
cached at all. Instead, shared data are kept only at the servers, which maintain
consistency using one of the primary-based or replicated-write protocols
discussed above. Clients are allowed to cache only private data. Obviously, this
solution can offer only limited performance improvements.
Caching and replication in the Web
The Web is arguably the largest distributed system ever built. Originating from
a relatively simple client-server architecture, it is now a sophisticated system
consisting of many techniques to ensure stringent performance and availability
requirements. These requirements have led to numerous proposals for caching
and replicating Web content.
Proxy.
A Web proxy accepts requests from local clients and passes these to Web
servers. When a response comes in, the result is passed to the client. The
advantage of this approach is that the proxy can cache the result and return that
result to another client, if necessary.
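The benefit of a shared proxy cache can be shown in a short sketch. This is an illustrative model only: the `Proxy` class and `origin_server` function are hypothetical, and a real proxy would speak HTTP and honor cache-control headers. The point is that one client's fetch can serve later clients without contacting the origin again.

```python
class Proxy:
    """Illustrative Web proxy: forwards requests and caches the responses."""

    def __init__(self, fetch):
        self.fetch = fetch       # function that contacts the origin server
        self.cache = {}
        self.origin_hits = 0     # how often the origin was actually contacted

    def get(self, url):
        if url not in self.cache:
            self.cache[url] = self.fetch(url)
            self.origin_hits += 1
        return self.cache[url]

def origin_server(url):
    # Stand-in for a remote Web server.
    return f"<html>content of {url}</html>"

proxy = Proxy(origin_server)
# Two different local clients request the same page through the proxy.
page1 = proxy.get("http://example.com/")
page2 = proxy.get("http://example.com/")
assert page1 == page2
assert proxy.origin_hits == 1   # the second client was served from the cache
```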
In addition to caching at browsers and proxies, ISPs generally also place caches
in their networks. Such schemes are mainly used to reduce network traffic
(which is good for the ISP) and to improve performance (which is good for end
users).