Bahir Dar Institute of Technology
Faculty of Computing Network and Internet Chair
Introduction to Distributed Systems
Content
Fault Tolerance
Introduction to Fault Tolerance
Process Resilience
Reliable Client-Server Communication
Reliable Group Communication
Distributed Commit Recovery
Basic Concepts
Fault Tolerance is closely related to the notion of
“Dependability”.
In Distributed Systems, this is characterized under a number of
headings:
Availability – the system is ready to be used immediately.
Reliability – the system can run continuously without failure.
Safety – if a system fails, nothing catastrophic will happen.
Maintainability – when a system fails, it can be repaired easily.
What Is a Fault?
A fault is a defect or flaw in a system's hardware, software,
or design that has the potential to cause an error.
Faults can be classified into several types, including:
Hardware Faults
Software Faults
Design Faults
Operational Faults
Errors in Distributed Systems
An error is the manifestation of a fault.
It is an incorrect state or condition within the system that can
potentially lead to a failure.
Errors can be transient, intermittent, or permanent:
Transient Errors: Temporary errors that disappear without
intervention.
Intermittent Errors: Errors that occur sporadically and
unpredictably.
Permanent Errors: Persistent errors that continue until
corrective action is taken.
Failure in Distributed Systems
A system is said to fail when it cannot meet its promises.
A failure is brought about by the existence of errors in the
system.
The cause of an error is called a fault.
Types of Fault
There are three main types of fault:
Transient Fault – appears once, then disappears.
Intermittent Fault – occurs, vanishes, then reappears, following
no real pattern (the worst kind, because it is hard to diagnose).
Permanent Fault – once it occurs, only the replacement or repair
of the faulty component will allow the distributed system to
function normally.
Failure Models
Crash Failure: the system stops functioning and does not
respond to any inputs.
Omission Failure: a server fails to respond to incoming requests.
It takes two forms:
Receive omission: a server fails to receive incoming
messages; e.g., perhaps no thread is listening.
Send omission: a server fails to send messages.
Timing Failure: the system's response is either too early or too
late, violating timing constraints.
Response Failure: the system produces an incorrect response or
output. One form is:
Value failure: the value of the response is wrong; e.g., a
search engine returning the wrong Web pages as the result of a
search.
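As a rough illustration of how these models differ from the client's point of view, the sketch below maps one observed request onto a failure model. The `Observation` record, the `classify` function, and the timing bounds are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class FailureModel(Enum):
    NONE = "no failure observed"
    OMISSION = "omission failure (or crash)"
    TIMING = "timing failure"
    VALUE = "response failure (value)"


@dataclass
class Observation:
    """What a client saw for one request (hypothetical record type)."""
    reply: Optional[str]   # None means no reply arrived at all
    elapsed_ms: float      # time until the reply, or until the client gave up


def classify(obs: Observation, expected: str,
             min_ms: float, max_ms: float) -> FailureModel:
    """Map a single observation onto one of the failure models."""
    if obs.reply is None:
        # From one missing reply a client cannot tell a crash from an omission.
        return FailureModel.OMISSION
    if obs.elapsed_ms < min_ms or obs.elapsed_ms > max_ms:
        return FailureModel.TIMING
    if obs.reply != expected:
        return FailureModel.VALUE
    return FailureModel.NONE
```

Note that a real client rarely knows the expected value in advance; the comparison here only stands in for whatever correctness check is available.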
Failure Masking by Redundancy
If a system is to be fault tolerant, the best it can do is to try to
hide the occurrence of failures from other processes. The key
technique is redundancy, which comes in three forms:
Information redundancy: add extra bits to allow recovery from
garbled bits (error correction).
Time redundancy: an action is performed more than once if
needed, e.g., retrying an aborted transaction; useful for
transient and intermittent faults.
Physical redundancy: add extra equipment (hardware) or
extra processes (software).
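The three kinds of redundancy can each be sketched in a few lines. This is a minimal illustration, not a production scheme: the repetition code stands in for real error-correcting codes, and all function names are chosen for this example.

```python
from collections import Counter
from typing import Callable, List, TypeVar

T = TypeVar("T")


def encode_repetition(bits: List[int], copies: int = 3) -> List[int]:
    """Information redundancy: send every bit `copies` times."""
    return [b for b in bits for _ in range(copies)]


def decode_repetition(coded: List[int], copies: int = 3) -> List[int]:
    """Recover each bit by majority vote over its copies."""
    out = []
    for i in range(0, len(coded), copies):
        chunk = coded[i:i + copies]
        out.append(1 if sum(chunk) * 2 > len(chunk) else 0)
    return out


def retry(action: Callable[[], T], attempts: int = 3) -> T:
    """Time redundancy: repeat an action until it succeeds or attempts run out."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:  # a real system would catch narrower errors
            last_error = exc
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def majority_vote(outputs: List[T]) -> T:
    """Physical redundancy: replicas compute the same result; a voter picks the majority."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among replicas")
    return value
```

With three copies per bit, one flipped copy per bit is corrected; with three replicas, one faulty replica is outvoted. Retrying helps only if the fault is transient or intermittent, since a permanent fault fails on every attempt.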
Process Resilience
Processes can be made fault tolerant by arranging to have a
group of processes, with each member of the group being
identical.
A message sent to the group is delivered to all of the “copies”
of the process (the group members), and then only one of
them performs the required service.
If one of the processes fails, it is assumed that one of the others
will still be able to function (and service any pending request
or operation).
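The idea of a group of identical processes can be sketched with in-memory objects. `Replica` and `ProcessGroup` are hypothetical classes for this example; a real group would use network messaging and a failure detector rather than an `alive` flag.

```python
from typing import Callable, List


class Replica:
    """One member of a process group (in-memory stand-in for a process)."""

    def __init__(self, name: str, handler: Callable[[str], str]):
        self.name = name
        self.handler = handler
        self.alive = True
        self.inbox: List[str] = []

    def deliver(self, msg: str) -> None:
        # A crashed member receives nothing.
        if self.alive:
            self.inbox.append(msg)


class ProcessGroup:
    def __init__(self, replicas: List[Replica]):
        self.replicas = replicas

    def send(self, msg: str) -> str:
        # Deliver the message to every member of the group.
        for r in self.replicas:
            r.deliver(msg)
        # Only one member (here: the first live one) performs the service.
        for r in self.replicas:
            if r.alive:
                return f"{r.name}: {r.handler(msg)}"
        raise RuntimeError("all group members have failed")
```

A short usage example: if the first member is marked as failed, the next request is served by the second member, which has already seen every message sent to the group.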
Flat vs. Hierarchical Groups
a) Communication in a flat group – all the processes are equal and
decisions are made collectively. Note: there is no single point of
failure, but decision making is complicated because consensus is
required.
b) Communication in a simple hierarchical group – one of the
processes is elected coordinator, and it selects another
process (a worker) to perform the operation. Note: the coordinator
is a single point of failure, but decisions are made easily and
quickly because the coordinator does not first have to get
consensus.
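The trade-off between the two organizations can be made concrete with a small sketch. Both functions are hypothetical simplifications: a strict majority vote stands in for a real consensus protocol, and coordinator election is not modelled.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def flat_decide(votes: Dict[str, str]) -> Optional[str]:
    """Flat group: every member votes; a value wins only with a strict majority."""
    if not votes:
        return None
    value, count = Counter(votes.values()).most_common(1)[0]
    return value if count * 2 > len(votes) else None


def hierarchical_assign(coordinator: str, workers: List[str],
                        alive: Set[str]) -> Optional[str]:
    """Hierarchical group: the coordinator alone picks a live worker."""
    if coordinator not in alive:
        # Single point of failure: without the coordinator nothing is decided.
        return None
    for w in workers:
        if w in alive:
            return w
    return None
```

The flat version needs input from the members before it can answer and may fail to reach a decision, while the hierarchical version answers immediately but returns nothing once the coordinator is down.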
Reliable Client-Server Communication
Fault tolerance in distributed systems concentrates on faulty
processes, but communication failures also have to be considered.
A communication channel may exhibit failures in the form of:
crash
omission
timing
arbitrary (e.g., duplicate messages as a result of buffering at
nodes and the sender retransmitting)
Point-to-Point Communication
Reliable transport protocols such as TCP mask most
communication failures, such as omissions (lost messages), by
using acknowledgements and retransmissions.
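The acknowledge-and-retransmit idea can be sketched as a stop-and-wait loop. This is a simplified model and not how TCP is implemented: `channel_delivers` is a hypothetical stand-in for the network, and ACKs are assumed never to be lost.

```python
from typing import Callable, List, Tuple


def send_reliably(messages: List[str],
                  channel_delivers: Callable[[int], bool],
                  max_tries: int = 5) -> Tuple[List[str], int]:
    """Resend each message until it is acknowledged.

    `channel_delivers(n)` says whether the n-th transmission (counting
    from 0) gets through. Returns the messages the receiver accepted and
    the total number of transmissions used.
    """
    received: List[str] = []
    transmissions = 0
    for seq, msg in enumerate(messages):
        for _ in range(max_tries):
            delivered = channel_delivers(transmissions)
            transmissions += 1
            if delivered:
                # The receiver accepts the message and returns an ACK.
                received.append(msg)
                break
            # No ACK before the timeout: retransmit.
        else:
            raise TimeoutError(f"message {seq} not acknowledged")
    return received, transmissions
```

The omission failures are masked from the caller, who sees all messages arrive in order; the cost shows up only as extra transmissions.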
RPC Semantics in the Presence of Failures
The goal of RPC is to hide communication by making remote
procedure calls look like local ones.
Five different classes of failures can occur in RPC systems, each
requiring a different solution:
The client cannot locate the server, so no request can be sent.
The client’s request to the server is lost, so no response is
returned by the server to the waiting client.
The server crashes after receiving the request, and the service
request is left acknowledged, but undone.
The server’s reply is lost on its way to the client: the service
has completed, but the results never arrive at the client.
The client crashes after sending its request, and the server
sends a reply to a newly-restarted client that may not be
expecting it.
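One common way to handle a lost reply is for the client to resend the request with the same identifier and for the server to filter duplicates. The sketch below assumes this approach; the `Server` class, its `deposit` operation, and the `reply_lost` callback are hypothetical and only simulate the network.

```python
from typing import Callable, Dict


class Server:
    """Hypothetical server that remembers replies by request ID."""

    def __init__(self) -> None:
        self.balance = 0
        self.replies: Dict[str, int] = {}  # request ID -> cached reply

    def deposit(self, request_id: str, amount: int) -> int:
        if request_id in self.replies:
            # Duplicate of a request already executed: resend the old reply.
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


def call_with_retry(server: Server, request_id: str, amount: int,
                    reply_lost: Callable[[int], bool],
                    max_tries: int = 3) -> int:
    """Client stub: resend the same request ID until a reply arrives."""
    for attempt in range(max_tries):
        reply = server.deposit(request_id, amount)  # executes or replays
        if not reply_lost(attempt):
            return reply
        # Reply lost: the client times out and retransmits.
    raise TimeoutError("no reply from server")
```

Without the duplicate filter, a retransmitted deposit would be applied twice, because the operation is not idempotent. The sketch does not cover a server crash, which would also lose the reply cache unless it is kept on stable storage.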
Reliable Group Communication
Reliable multicast services guarantee that all messages are
delivered to all members of a process group.
A simple solution to reliable multicasting exists when all
receivers are known and are assumed not to fail.
The sending process assigns a sequence number to outgoing
messages (making it easy to spot when a message is
missing).
a) Message transmission – note that the third receiver is
expecting 24.
b) Reporting feedback – the third receiver informs the sender.
c) But, how long does the sender keep its history-buffer
populated?
d) Also, such schemes perform poorly as the group grows …
there are too many ACKs.
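The sequence-number scheme can be sketched as follows. This is a minimal illustration in which receivers report only the messages they are missing, rather than acknowledging every message; the `Sender` and `Receiver` classes are hypothetical, and the history buffer here is never pruned.

```python
from typing import Dict, List, Tuple


class Sender:
    def __init__(self) -> None:
        self.next_seq = 0
        self.history: Dict[int, str] = {}  # history buffer for retransmissions

    def multicast(self, payload: str) -> Tuple[int, str]:
        seq = self.next_seq
        self.history[seq] = payload
        self.next_seq += 1
        return seq, payload

    def retransmit(self, seq: int) -> Tuple[int, str]:
        return seq, self.history[seq]


class Receiver:
    def __init__(self) -> None:
        self.expected = 0
        self.delivered: List[str] = []
        self.pending: Dict[int, str] = {}  # out-of-order messages held back

    def receive(self, seq: int, payload: str) -> List[int]:
        """Accept a message; return the sequence numbers found to be missing."""
        if seq < self.expected:
            return []  # duplicate, already delivered
        self.pending[seq] = payload
        # Deliver in order for as long as the next expected message is present.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            self.expected += 1
        return [s for s in range(self.expected, seq) if s not in self.pending]
```

A receiver that gets message 2 while still expecting message 1 reports `[1]`, the sender retransmits it from the history buffer, and delivery then resumes in order. The open question of when the sender may discard old entries from `history` is exactly the one raised in point c).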
Reading Assignment
What is Distributed Commit Recovery in a distributed system?
What is three-phase commit in a distributed system?
Thank You!!!