DSs Lecture Notes

Last Chapters
Replication, Consistency and Fault Tolerance

Data replication is one of the main issues discussed in distributed systems; in general it enhances reliability and improves performance. One major problem with replication is keeping the replicas consistent. This means that when one copy is updated, we need to ensure that the other copies are updated as well; otherwise the replicas will no longer be the same.
1. Reasons for Replication and the Problems
Recall that this course began by introducing the basic concepts of distributed systems and describing their characteristics, together with the main challenges raised in attaining their goals. For instance, it gave an overview of concepts such as distribution and replication of data (resources) and their importance in achieving better performance, fault tolerance, availability and reliability.
a) Increasing Availability, Reliability and Fault Tolerance
One major reason to replicate data is to increase the reliability of a system. If a file system has
been replicated it may be possible to continue working after one replica crashes by simply
switching to one of the other replicas. Also, by maintaining multiple copies, it becomes possible
to provide better protection against corrupted data. For example, imagine there are three copies
of a file and every read and write operation is performed on each copy. We can safeguard
ourselves against a single failing write operation by considering the value that is returned by at
least two copies as being the correct one.
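As a minimal sketch of this voting idea (not part of the original notes; the replica interface is a made-up assumption), a read can be accepted only when a majority of the copies agree:

from collections import Counter

def majority_read(replicas):
    """Read from all replicas and accept the value that a majority
    (e.g. 2 out of 3) of them agree on.

    `replicas` is assumed to be a list of callables, each returning
    the value stored at one copy (a hypothetical interface)."""
    values = [read() for read in replicas]
    value, count = Counter(values).most_common(1)[0]
    if count >= (len(values) // 2) + 1:
        return value          # at least two of the three copies agree
    raise RuntimeError("no majority: the replicas disagree too much")

# Example: one copy returns a corrupted value, the other two agree.
replicas = [lambda: 42, lambda: 42, lambda: 7]
print(majority_read(replicas))   # -> 42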
b) Enhancing Performance
The other reason for replicating data is to enhance performance. Replication for performance matters when a distributed system needs to scale in numbers (size) or in geographical area. For example, when an increasing number of processes needs to access data managed by a single server, we can remove the bottleneck by duplicating the server, so that the work is divided among the copies and performance improves.
Scaling with respect to the size of a geographical area may also require replication. The basic idea
is that by placing a copy of data in the proximity of the process using them, the time to access
the data decreases. As a consequence, the performance as perceived by that process increases.
This example also illustrates that the benefits of replication for performance may be hard to
evaluate. Although a client process may perceive better performance, it may also be the case that
more network bandwidth is now consumed keeping all replicas up to date.
Unfortunately, there is a price to be paid when data are replicated: having multiple copies may
lead to consistency problems. Whenever a copy is modified, that copy becomes different from
the rest. Consequently, modifications have to be
carried out on all copies to ensure consistency. Exactly when and how those modifications need
to be carried out determines the price of replication.
To understand the problem, consider improving access times to Web pages. If no special
measures are taken, fetching a page from a remote Web server may sometimes even take
seconds to complete. To improve performance, Web browsers often locally store a copy of a
previously fetched Web page (that is, they cache a Web page). If a user requires that page again,
the browser automatically returns the local copy. The access time as perceived by the user is
excellent. However, if the user always wants the latest version of a page, they may be out of
luck. The problem is that if the page has been modified in the meantime, the modification
will not have been propagated to the cached copies, making those copies out-of-date.
One solution to the problem of returning a stale copy to the user is to forbid the browser to keep
local copies in the first place, effectively letting the server be fully in charge of replication.
However, this solution may still lead to poor access times if no replica is placed near the user.
Another solution is to let the Web server invalidate or update each cached copy, but this requires
that the server keep track of all caches and send those messages. This, in turn, may degrade
the overall performance of the server. We return to performance versus scalability issues below.
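To make the caching discussion more concrete, the following is a rough sketch (illustrative only; the class and the fetch callback are assumptions, not a real browser API) of a client-side cache in which the server drives invalidation:

class PageCache:
    """A very small client-side page cache. The origin server can
    invalidate entries, modelling the 'server invalidates each cached
    copy' option described above."""

    def __init__(self, fetch_from_server):
        self._fetch = fetch_from_server   # callable: url -> page content
        self._pages = {}                  # url -> cached copy

    def get(self, url):
        if url not in self._pages:        # cache miss: go to the server
            self._pages[url] = self._fetch(url)
        return self._pages[url]           # cache hit: fast, but may be stale

    def invalidate(self, url):
        """Called (conceptually) when the server notifies the client
        that the page changed; the next get() refetches it."""
        self._pages.pop(url, None)

# Usage sketch with a fake origin server.
origin = {"/index.html": "v1"}
cache = PageCache(lambda url: origin[url])
print(cache.get("/index.html"))   # fetched from server -> 'v1'
origin["/index.html"] = "v2"
print(cache.get("/index.html"))   # stale cached copy  -> 'v1'
cache.invalidate("/index.html")   # server-driven invalidation
print(cache.get("/index.html"))   # refetched          -> 'v2'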
2. Replication as a Scaling Technique
Replication and caching for performance are widely applied as scaling techniques. Scalability
issues generally appear in the form of performance problems. Placing copies of data close to the
processes using them can improve performance through reduction of access time and thus solve
scalability problems.
A trade-off that needs to be made is that keeping copies up to date may require more network
bandwidth. If the copies are refreshed more often than they are read (a low access-to-update
ratio), the cost in bandwidth outweighs the benefit, since many propagated updates are never actually used.
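A back-of-the-envelope calculation, with made-up numbers, shows when keeping a replica fresh stops paying off:

# Hypothetical rates: how often a replica is read versus how often the
# master copy changes (and must therefore be pushed to the replica).
reads_per_hour   = 10      # local accesses that benefit from the replica
updates_per_hour = 50      # refresh messages needed to keep it up to date

access_to_update_ratio = reads_per_hour / updates_per_hour
print(access_to_update_ratio)   # 0.2: most refreshed copies are never read

if access_to_update_ratio < 1:
    print("Replica is refreshed more often than it is read: "
          "the bandwidth cost likely outweighs the benefit.")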
Replication itself may be subject to serious scalability problems. Intuitively, a read operation made on
any copy should return the same value (the copies should always be the same). Thus, when an update
operation is performed on one copy, it should be propagated to all copies before a subsequent
operation takes place: this is sometimes called tight consistency (a write is performed at all
copies in a single atomic operation or transaction). Tight consistency is difficult to implement, since all
replicas first need to reach agreement on when exactly an update is to be performed locally (for
example, by deciding on a global ordering of operations, which takes a lot of communication time).
Keeping copies consistent also requires global synchronization, which is generally costly in terms
of performance. The solution is therefore to loosen the consistency constraints, for instance:
- Updates do not need to be executed as atomic operations (no more instantaneous global
synchronization), but then copies may not always be the same everywhere (a small sketch of
this asynchronous style follows this list).
- To what extent the consistency can be loosened depends on the specific application (the
purpose of the data as well as its access and update patterns).
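To contrast tight consistency with these loosened constraints, here is a small illustrative sketch (not an actual protocol implementation) in which a write returns after updating one copy and the change reaches the other copies asynchronously:

import threading, time

class LazyReplicatedStore:
    """Sketch of loosened consistency: a write is applied at one copy
    immediately and propagated to the other copies asynchronously, so
    for a short while the copies may differ."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value):
        self.replicas[0][key] = value            # update the local copy only
        threading.Thread(target=self._propagate,
                         args=(key, value)).start()
        # the write returns immediately: no global synchronization

    def _propagate(self, key, value):
        time.sleep(0.1)                           # simulated network delay
        for replica in self.replicas[1:]:
            replica[key] = value                  # other copies catch up later

store = LazyReplicatedStore()
store.write("x", 1)
print(store.replicas)   # for a moment, only the first copy holds x = 1
time.sleep(0.2)
print(store.replicas)   # eventually all copies agree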
3. Fault Tolerance
A characteristic feature of distributed systems that distinguishes them from single-machine
systems is the notion of partial failure. A partial failure may happen when one component in a
distributed system fails. This failure may affect the proper operation of other components, while
at the same time leaving yet other components totally unaffected. In contrast, a failure in non-
distributed systems is often total in the sense that it affects all components, and may easily bring
down the entire system.
An important goal in distributed systems design is to construct the system in such a way that it
can automatically recover from partial failures without seriously affecting the overall
performance. In particular, whenever a failure occurs, the distributed system should continue to
operate in an acceptable way while repairs are being made, that is, it should tolerate faults and
continue to operate to some extent even in their presence.
3.1. Fault Tolerance Basic Concepts
Fault tolerance is strongly related to dependable systems; dependability in turn covers the
following four basic concepts (a small availability calculation follows the list):
a) Availability: Refers to the probability that the system is operating correctly at any given
time; it is defined in terms of an instant in time.
b) Reliability: The property that a system can run continuously without failure; it is defined in
terms of a time interval.
c) Safety: Refers to the situation that even if a system temporarily fails to operate correctly,
nothing catastrophic happens.
d) Maintainability: Refers to how easily a failed system can be repaired.
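One common way to quantify availability, not spelled out in these notes, is the ratio MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair; the figures below are made-up examples:

# Availability as the long-run fraction of time the system is up:
#   availability = MTTF / (MTTF + MTTR)
mttf_hours = 1000.0    # hypothetical mean time to failure
mttr_hours = 2.0       # hypothetical mean time to repair

availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"availability = {availability:.4f}")   # about 0.9980, i.e. ~99.8% uptime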
Often, dependable systems are also required to provide a high degree of security, especially
when it comes to issues such as integrity.
A system is said to fail when it cannot meet its promises, for instance when it fails to provide its
users with one or more of the services it promises. An error is a part of a system's state that may
lead to a failure, for example packets that are damaged in transmission across a network before
they arrive at the receiver.
The cause of an error is called a fault, and finding out what caused an error is important. A fault
may be due to a wrong or bad transmission medium, in which case it is relatively easy to remove,
or due to transmission errors caused by bad weather conditions, as in wireless networks. Building
dependable systems therefore closely relates to controlling faults (i.e., preventing, removing, and
forecasting faults).
Generally, a fault tolerant system is a system that can provide its services even in the presence
of faults. Faults are classified into three types:
a) Transient: Occurs once and then disappears; if the operation is repeated, the fault goes
away (a small retry sketch follows this list). For example, a bird flying through the beam of a
microwave transmitter may cause some lost bits.
b) Intermittent: Occurs, then vanishes of its own accord, then reappears, and so on (e.g., a
loose connection). It is difficult to diagnose, much like a symptom that has already
disappeared by the time you reach the clinic.
c) Permanent: One that continues to exist until the faulty component is repaired, for example
disk head crash, software bug, etc.
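Because a transient fault disappears when the operation is repeated, a simple retry loop is often enough to mask it. The sketch below is illustrative only; the operation being retried and its failure mode are assumptions:

import time

def retry(operation, attempts=3, delay=0.5):
    """Repeat `operation` a few times; a transient fault that goes away
    on retry is masked, while a permanent one is eventually reported."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as error:      # e.g. a lost message or dropped connection
            last_error = error
            time.sleep(delay)         # give the transient condition time to pass
    raise last_error

# Usage sketch: send_request is some unreliable call defined elsewhere.
# result = retry(send_request)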
3.2. Failure Models
A system that fails is not adequately providing the services it was designed for. If we consider a
distributed system as a collection of servers that communicate with one another and with their
clients, not adequately providing services means that servers, communication channels, or
possibly both, are not doing what they are supposed to do. However, a malfunctioning server
itself may not always be the fault we are looking for. If such a server depends on other servers to
adequately provide its services, the cause of an error may need to be searched for somewhere
else. Such dependency relations appear in abundance in distributed systems.
There are several classification schemes for failures; five common failure types are listed below (a small timeout-based detection sketch follows the list):
a) Crash Failure: A server halts, but was working correctly until it stopped.
b) Omission Failure: A server fails to respond to incoming requests.
- Receive Omission: A server fails to receive incoming messages, for example because no
thread is listening.
- Send Omission: A server fails to send messages.
c) Timing Failure: A server's response lies outside the specified time interval; for example, it
may be too fast, flooding the receiver, or too slow.
d) Response Failure: The server's response is incorrect.
- Value Failure: The value of the response is wrong, for example a search engine
returning wrong Web pages as a result of a search.
- State Transition Failure: The server deviates from the correct flow of control, for
example taking default actions when it fails to understand the request.
e) Arbitrary Failure (Byzantine Failure): A server may produce arbitrary responses at arbitrary
times (most serious).
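In practice a client often cannot distinguish a crash failure from an omission or timing failure: all it observes is that no (timely) reply arrives. A minimal timeout-based sketch of that suspicion (illustrative names, not a real RPC library):

import queue, threading

def call_with_timeout(request_fn, timeout_seconds=2.0):
    """Run `request_fn` in a worker thread and wait at most
    `timeout_seconds` for its reply. A missing reply may mean a crash,
    an omission, or merely a timing failure; the client cannot tell which."""
    replies = queue.Queue()
    worker = threading.Thread(target=lambda: replies.put(request_fn()),
                              daemon=True)
    worker.start()
    try:
        return replies.get(timeout=timeout_seconds)
    except queue.Empty:
        raise TimeoutError("no reply: server suspected of having failed")

# Usage sketch: wrap a (hypothetical) remote call.
# reply = call_with_timeout(lambda: rpc_client.lookup("key"))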
