DSs Lecture Notes

Last Chapters
Replication, Consistency and Fault Tolerance

Data replication is one of the main issues discussed in distributed systems; in general it enhances reliability and improves performance. One major problem with replication is keeping the replicas consistent. This means that when one copy is updated, we need to ensure that the other copies are updated as well; otherwise the replicas will no longer be the same.
1. Reasons for Replication and the Problems
Recall that this course began by introducing the basic concepts of distributed systems and describing their characteristics, together with the main challenges raised in attaining their goals. For instance, it gave an overview of concepts such as distribution and replication of data (resources) and their importance in achieving better performance, fault tolerance, availability and reliability.
a) Increasing Availability, Reliability and Fault Tolerance
One major reason to replicate data is to increase the reliability of a system. If a file system has
been replicated it may be possible to continue working after one replica crashes by simply
switching to one of the other replicas. Also, by maintaining multiple copies, it becomes possible
to provide better protection against corrupted data. For example, imagine there are three copies
of a file and every read and write operation is performed on each copy. We can safeguard
ourselves against a single failing write operation by considering the value that is returned by at
least two copies as being the correct one.
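As a minimal sketch of this voting idea (not part of the original notes; the replica interface is a made-up assumption), a read can be accepted only when a majority of the copies agree:

from collections import Counter

def majority_read(replicas):
    """Read from all replicas and accept the value that a majority
    (e.g. 2 out of 3) of them agree on.

    `replicas` is assumed to be a list of callables, each returning
    the value stored at one copy (a hypothetical interface)."""
    values = [read() for read in replicas]
    value, count = Counter(values).most_common(1)[0]
    if count >= (len(values) // 2) + 1:
        return value          # at least two of the three copies agree
    raise RuntimeError("no majority: the replicas disagree too much")

# Example: one copy returns a corrupted value, the other two agree.
replicas = [lambda: 42, lambda: 42, lambda: 7]
print(majority_read(replicas))   # -> 42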
b) Enhancing Performance
The other reason for replicating data is to enhance performance. Replication for performance matters when a distributed system needs to scale in numbers (size) or in geographical area. For example, when an increasing number of processes needs to access data managed by a single server, we can remove the bottleneck by duplicating the server, so that the work is divided among the copies and performance improves.
Scaling with respect to the size of a geographical area may also require replication. The basic idea
is that by placing a copy of data in the proximity of the process using them, the time to access
the data decreases. As a consequence, the performance as perceived by that process increases.
This example also illustrates that the benefits of replication for performance may be hard to
evaluate. Although a client process may perceive better performance, it may also be the case that
more network bandwidth is now consumed keeping all replicas up to date.
Unfortunately, there is a price to be paid when data are replicated: having multiple copies may
lead to consistency problems. Whenever a copy is modified, that copy becomes different from
the rest. Consequently, modifications have to be
carried out on all copies to ensure consistency. Exactly when and how those modifications need
to be carried out determines the price of replication.
To understand the problem, consider improving access times to Web pages. If no special
measures are taken, fetching a page from a remote Web server may sometimes even take
seconds to complete. To improve performance, Web browsers often locally store a copy of a
previously fetched Web page (that is, they cache a Web page). If a user requires that page again,
the browser automatically returns the local copy. The access time as perceived by the user is
excellent. However, if the user always wants the latest version of a page, they may be out of
luck. The problem is that if the page has been modified in the meantime, the modification
will not have been propagated to the cached copies, making those copies out-of-date.
One solution to the problem of returning a stale copy to the user is to forbid the browser to keep
local copies in the first place, effectively letting the server be fully in charge of replication.
However, this solution may still lead to poor access times if no replica is placed near the user.
Another solution is to let the Web server invalidate or update each cached copy, but this requires
that the server keep track of all caches and send those messages. This, in turn, may degrade
the overall performance of the server. We return to performance versus scalability issues below.
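To make the caching discussion more concrete, the following is a rough sketch (illustrative only; the class and the fetch callback are assumptions, not a real browser API) of a client-side cache in which the server drives invalidation:

class PageCache:
    """A very small client-side page cache. The origin server can
    invalidate entries, modelling the 'server invalidates each cached
    copy' option described above."""

    def __init__(self, fetch_from_server):
        self._fetch = fetch_from_server   # callable: url -> page content
        self._pages = {}                  # url -> cached copy

    def get(self, url):
        if url not in self._pages:        # cache miss: go to the server
            self._pages[url] = self._fetch(url)
        return self._pages[url]           # cache hit: fast, but may be stale

    def invalidate(self, url):
        """Called (conceptually) when the server notifies the client
        that the page changed; the next get() refetches it."""
        self._pages.pop(url, None)

# Usage sketch with a fake origin server.
origin = {"/index.html": "v1"}
cache = PageCache(lambda url: origin[url])
print(cache.get("/index.html"))   # fetched from server -> 'v1'
origin["/index.html"] = "v2"
print(cache.get("/index.html"))   # stale cached copy  -> 'v1'
cache.invalidate("/index.html")   # server-driven invalidation
print(cache.get("/index.html"))   # refetched          -> 'v2'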
2. Replication as a Scaling Technique
Replication and caching for performance are widely applied as scaling techniques. Scalability
issues generally appear in the form of performance problems. Placing copies of data close to the
processes using them can improve performance through reduction of access time and thus solve
scalability problems.
A trade-off that needs to be made is that keeping copies up to date may require more network
bandwidth. If the copies are refreshed more often than they are read (a low access-to-update
ratio), the cost in bandwidth outweighs the benefit, since many propagated updates are never actually used.
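A back-of-the-envelope calculation, with made-up numbers, shows when keeping a replica fresh stops paying off:

# Hypothetical rates: how often a replica is read versus how often the
# master copy changes (and must therefore be pushed to the replica).
reads_per_hour   = 10      # local accesses that benefit from the replica
updates_per_hour = 50      # refresh messages needed to keep it up to date

access_to_update_ratio = reads_per_hour / updates_per_hour
print(access_to_update_ratio)   # 0.2: most refreshed copies are never read

if access_to_update_ratio < 1:
    print("Replica is refreshed more often than it is read: "
          "the bandwidth cost likely outweighs the benefit.")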
Replication itself may be subject to serious scalability problems. Intuitively, a read operation made on
any copy should return the same value (the copies should always be the same). Thus, when an update
operation is performed on one copy, it should be propagated to all copies before a subsequent
operation takes place: this is sometimes called tight consistency (a write is performed at all
copies in a single atomic operation or transaction). Tight consistency is difficult to implement, since all
replicas first need to reach agreement on when exactly an update is to be performed locally (for
example, by deciding on a global ordering of operations, which takes a lot of communication time).
Keeping copies consistent also requires global synchronization, which is generally costly in terms
of performance. The solution is therefore to loosen the consistency constraints, for instance:
- Updates do not need to be executed as atomic operations (no more instantaneous global
synchronization), but then copies may not always be the same everywhere (a small sketch of
this asynchronous style follows this list).
- To what extent the consistency can be loosened depends on the specific application (the
purpose of the data as well as its access and update patterns).
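To contrast tight consistency with these loosened constraints, here is a small illustrative sketch (not an actual protocol implementation) in which a write returns after updating one copy and the change reaches the other copies asynchronously:

import threading, time

class LazyReplicatedStore:
    """Sketch of loosened consistency: a write is applied at one copy
    immediately and propagated to the other copies asynchronously, so
    for a short while the copies may differ."""

    def __init__(self, n_replicas=3):
        self.replicas = [{} for _ in range(n_replicas)]

    def write(self, key, value):
        self.replicas[0][key] = value            # update the local copy only
        threading.Thread(target=self._propagate,
                         args=(key, value)).start()
        # the write returns immediately: no global synchronization

    def _propagate(self, key, value):
        time.sleep(0.1)                           # simulated network delay
        for replica in self.replicas[1:]:
            replica[key] = value                  # other copies catch up later

store = LazyReplicatedStore()
store.write("x", 1)
print(store.replicas)   # for a moment, only the first copy holds x = 1
time.sleep(0.2)
print(store.replicas)   # eventually all copies agree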
3. Fault Tolerance
A characteristic feature of distributed systems that distinguishes them from single-machine
systems is the notion of partial failure. A partial failure may happen when one component in a
distributed system fails. This failure may affect the proper operation of other components, while
at the same time leaving yet other components totally unaffected. In contrast, a failure in non-
distributed systems is often total in the sense that it affects all components, and may easily bring
down the entire system.
An important goal in distributed systems design is to construct the system in such a way that it
can automatically recover from partial failures without seriously affecting the overall
performance. In particular, whenever a failure occurs, the distributed system should continue to
operate in an acceptable way while repairs are being made, that is, it should tolerate faults and
continue to operate to some extent even in their presence.
3.1. Fault Tolerance Basic Concepts
Fault tolerance is strongly related to dependable systems; dependability in turn covers the
following four basic concepts (a small availability calculation follows the list):
a) Availability: Refers to the probability that the system is operating correctly at any given
time; it is defined in terms of an instant in time.
b) Reliability: The property that a system can run continuously without failure; it is defined in
terms of a time interval.
c) Safety: Refers to the situation that even if a system temporarily fails to operate correctly,
nothing catastrophic happens.
d) Maintainability: Refers to how easily a failed system can be repaired.
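One common way to quantify availability, not spelled out in these notes, is the ratio MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR the mean time to repair; the figures below are made-up examples:

# Availability as the long-run fraction of time the system is up:
#   availability = MTTF / (MTTF + MTTR)
mttf_hours = 1000.0    # hypothetical mean time to failure
mttr_hours = 2.0       # hypothetical mean time to repair

availability = mttf_hours / (mttf_hours + mttr_hours)
print(f"availability = {availability:.4f}")   # about 0.9980, i.e. ~99.8% uptime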
Often, dependable systems are also required to provide a high degree of security, especially
when it comes to issues such as integrity.
A system is said to fail when it cannot meet its promises, for instance when it fails to provide its
users with one or more of the services it promises. An error is a part of a system's state that may
lead to a failure, for example packets that are damaged in transmission across a network before
they arrive at the receiver.
The cause of an error is called a fault, and finding out what caused an error is important. A fault
may be due to a wrong or bad transmission medium, in which case it is relatively easy to remove,
or due to transmission errors caused by bad weather conditions, as in wireless networks. Building
dependable systems therefore closely relates to controlling faults (i.e., preventing, removing, and
forecasting faults).
Generally, a fault tolerant system is a system that can provide its services even in the presence
of faults. Faults are classified into three types:
a) Transient: Occurs once and then disappears; if the operation is repeated, the fault goes
away (a small retry sketch follows this list). For example, a bird flying through the beam of a
microwave transmitter may cause some lost bits.
b) Intermittent: Occurs, then vanishes of its own accord, then reappears, and so on (e.g., a
loose connection). It is difficult to diagnose, much like a symptom that has already
disappeared by the time you reach the clinic.
c) Permanent: One that continues to exist until the faulty component is repaired, for example
disk head crash, software bug, etc.
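Because a transient fault disappears when the operation is repeated, a simple retry loop is often enough to mask it. The sketch below is illustrative only; the operation being retried and its failure mode are assumptions:

import time

def retry(operation, attempts=3, delay=0.5):
    """Repeat `operation` a few times; a transient fault that goes away
    on retry is masked, while a permanent one is eventually reported."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as error:      # e.g. a lost message or dropped connection
            last_error = error
            time.sleep(delay)         # give the transient condition time to pass
    raise last_error

# Usage sketch: send_request is some unreliable call defined elsewhere.
# result = retry(send_request)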
3.2. Failure Models
A system that fails is not adequately providing the services it was designed for. If we consider a
distributed system as a collection of servers that communicate with one another and with their
clients, not adequately providing services means that servers, communication channels, or
possibly both, are not doing what they are supposed to do. However, a malfunctioning server
itself may not always be the fault we are looking for. If such a server depends on other servers to
adequately provide its services, the cause of an error may need to be searched for somewhere
else. Such dependency relations appear in abundance in distributed systems.
There are several classification schemes for failures; five common failure types are listed below (a small timeout-based detection sketch follows the list):
a) Crash Failure: A server halts, but was working correctly until it stopped.
b) Omission Failure: A server fails to respond to incoming requests.
- Receive Omission: A server fails to receive incoming messages, for example because no
thread is listening.
- Send Omission: A server fails to send messages.
c) Timing Failure: A server's response lies outside the specified time interval; for example, it
may be too fast, flooding the receiver, or too slow.
d) Response Failure: The server's response is incorrect.
- Value Failure: The value of the response is wrong, for example a search engine
returning wrong Web pages as a result of a search.
- State Transition Failure: The server deviates from the correct flow of control, for
example taking default actions when it fails to understand the request.
e) Arbitrary Failure (Byzantine Failure): A server may produce arbitrary responses at arbitrary
times (most serious).
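In practice a client often cannot distinguish a crash failure from an omission or timing failure: all it observes is that no (timely) reply arrives. A minimal timeout-based sketch of that suspicion (illustrative names, not a real RPC library):

import queue, threading

def call_with_timeout(request_fn, timeout_seconds=2.0):
    """Run `request_fn` in a worker thread and wait at most
    `timeout_seconds` for its reply. A missing reply may mean a crash,
    an omission, or merely a timing failure; the client cannot tell which."""
    replies = queue.Queue()
    worker = threading.Thread(target=lambda: replies.put(request_fn()),
                              daemon=True)
    worker.start()
    try:
        return replies.get(timeout=timeout_seconds)
    except queue.Empty:
        raise TimeoutError("no reply: server suspected of having failed")

# Usage sketch: wrap a (hypothetical) remote call.
# reply = call_with_timeout(lambda: rpc_client.lookup("key"))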
