Ds Chapter 7

The document provides an introduction to distributed systems with a focus on fault tolerance, outlining key concepts such as types of faults, errors, and failures. It discusses methods for achieving fault tolerance, including redundancy and process resilience, as well as reliable communication strategies in client-server and group communication contexts. Additionally, it touches on failure models and the challenges associated with remote procedure calls (RPC) in the presence of failures.

Bahir Dar Institute of Technology

Faculty of Computing Network and Internet Chair

Introduction to Distributed Systems


Content
 Fault Tolerance
 Introduction to Fault Tolerance
 Process Resilience
 Reliable Client-Server Communication
 Reliable Group Communication
 Distributed Commit Recovery
Basic Concepts

 Fault tolerance is closely related to the notion of “dependability”.
 In distributed systems, dependability is characterized under a number of headings:
 Availability – the system is ready to be used immediately.
 Reliability – the system can run continuously without failure.
 Safety – if the system fails, nothing catastrophic will happen.
 Maintainability – when the system fails, it can be repaired easily.
What Is a Fault?

 A fault is a defect or flaw in a system's hardware, software, or design that has the potential to cause an error.
 Faults can be classified into several types, including:
 Hardware Faults
 Software Faults
 Design Faults
 Operational Faults
Errors in Distributed Systems

 An error is the manifestation of a fault.
 It is an incorrect state or condition within the system that can potentially lead to a failure.
 Errors can be transient, intermittent, or permanent:
 Transient errors: temporary errors that disappear without intervention.
 Intermittent errors: errors that occur sporadically and unpredictably.
 Permanent errors: persistent errors that continue until corrective action is taken.
Failures in Distributed Systems
 A system is said to fail when it cannot meet its promises.
 A failure is brought about by the existence of errors in the system.
 The cause of an error is called a fault.


Types of Fault
There are three main types of fault:
 Transient fault – appears once, then disappears.
 Intermittent fault – occurs, vanishes, and reappears, following no real pattern (the worst kind).
 Permanent fault – once it occurs, only the replacement or repair of the faulty component will allow the distributed system (DS) to function normally.
Failure Models

 Crash failure: the system stops functioning and does not respond to any inputs.
 Omission failure: a server fails to respond to incoming requests.
   Receive omission: a server fails to receive incoming messages, e.g., because no thread is listening.
   Send omission: a server fails to send messages.
 Timing failure: the system's response is either too early or too late, violating timing constraints.
 Response failure: the system produces an incorrect response or output.
   Value failure: the value of the response is wrong, e.g., a search engine returning the wrong Web pages as the result of a search.
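The failure models above can be told apart, to a degree, by what a client observes. The following is a minimal sketch, assuming the client knows the expected value and an acceptable response-time window; the names `Observation` and `classify` are hypothetical and chosen only for illustration. A missing reply is reported as an omission, although from a single request it cannot be distinguished from a crash.

```python
from enum import Enum


class Observation(Enum):
    OK = "ok"
    OMISSION = "omission failure"
    TIMING = "timing failure"
    VALUE = "value failure"


def classify(reply, elapsed, expected, earliest, latest):
    """Classify one observed reply from the client's point of view.

    reply    -- the value received, or None if nothing arrived
    elapsed  -- seconds between sending the request and getting the reply
    expected -- the correct value (assumed known here, rarely so in practice)
    earliest, latest -- bounds of the acceptable response-time window
    """
    if reply is None:
        # No reply at all: an omission, or possibly a crash.
        return Observation.OMISSION
    if elapsed < earliest or elapsed > latest:
        # The reply arrived outside the agreed time window.
        return Observation.TIMING
    if reply != expected:
        # The reply arrived on time but carries the wrong value.
        return Observation.VALUE
    return Observation.OK
```

Real systems seldom know the expected value in advance, which is why value failures are usually detected by comparing the answers of several replicas instead.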
Failure Masking by Redundancy
 If a system is to be fault tolerant, the best it can do is to try to
hide the occurrence of failures from other processes.
 Information redundancy: add extra bits to allow recovery from garbled bits (error correction).
 Time redundancy: an action is performed more than once if needed, e.g., redoing an aborted transaction; useful for transient and intermittent faults.
 Physical redundancy: add extra equipment (HW) or processes (SW).
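The three kinds of redundancy can each be illustrated in a few lines. This is a sketch only: the triple-repetition code stands in for any error-correcting code, and the function names (`encode_repetition`, `decode_repetition`, `retry`, `majority_vote`) are hypothetical helpers, not part of any library.

```python
from collections import Counter


def encode_repetition(bits):
    """Information redundancy: send every bit three times."""
    return [b for b in bits for _ in range(3)]


def decode_repetition(coded):
    """Recover each bit by majority over its three copies (corrects one flip per triple)."""
    return [1 if sum(coded[i:i + 3]) >= 2 else 0 for i in range(0, len(coded), 3)]


def retry(action, attempts=3):
    """Time redundancy: repeat an action until it succeeds or attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as exc:  # a transient fault may vanish on the next try
            last_error = exc
    raise last_error


def majority_vote(replies):
    """Physical redundancy: accept the value returned by a strict majority of replicas."""
    value, count = Counter(replies).most_common(1)[0]
    if count <= len(replies) // 2:
        raise ValueError("no majority among replicas")
    return value
```

For example, `majority_vote([7, 7, 9])` returns `7`, masking the one replica that produced a wrong value.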
Process Resilience
 Processes can be made fault tolerant by organizing them into a group in which every member is identical.
 A message sent to the group is delivered to all of the “copies” of the process (the group members), and then only one of them performs the required service.
 If one of the processes fails, it is assumed that one of the others will still be able to function (and service any pending request or operation).
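A process group of this kind can be sketched in memory as follows. The classes `Replica` and `ProcessGroup` are hypothetical, crashes are simulated with an `alive` flag, and the rule "the first live member serves the request" is an assumption made for brevity rather than a prescribed policy.

```python
class Replica:
    """One member of the group; all members are identical."""

    def __init__(self, name):
        self.name = name
        self.alive = True
        self.inbox = []

    def deliver(self, message):
        # A crashed replica receives nothing.
        if self.alive:
            self.inbox.append(message)

    def serve(self, message):
        if not self.alive:
            raise RuntimeError(f"{self.name} has crashed")
        return f"{self.name} handled {message}"


class ProcessGroup:
    def __init__(self, replicas):
        self.replicas = list(replicas)

    def send(self, message):
        # Deliver the message to every member of the group.
        for replica in self.replicas:
            replica.deliver(message)
        # Only one member performs the service: the first one still alive.
        for replica in self.replicas:
            try:
                return replica.serve(message)
            except RuntimeError:
                continue
        raise RuntimeError("all replicas have failed")
```

If the first replica has crashed, `send` falls through to the next one, so the caller never sees the failure as long as one member survives.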
Flat vs. Hierarchical Groups
a) Communication in a flat group – all the processes are equal, and decisions are made collectively. Note: there is no single point of failure; however, decision making is complicated because consensus is required.
b) Communication in a simple hierarchical group – one of the processes is elected to be the coordinator, which selects another process (a worker) to perform the operation. Note: the coordinator is a single point of failure; however, decisions are made easily and quickly by the coordinator without first having to reach consensus.
Reliable Client-Server Communication
 Fault tolerance in distributed systems concentrates on faulty processes, but communication failures also have to be considered.
 A communication channel may exhibit failures in the form of:
 crash
 omission
 timing
 arbitrary (e.g., duplicate messages as a result of buffering at nodes and the sender retransmitting)
Point-to-Point Communication
 Reliable transport protocols such as TCP can be used to mask most communication failures, such as omissions (lost messages), using acknowledgements and retransmissions.
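The idea of masking omissions with acknowledgements and retransmissions can be shown with a stop-and-wait sketch. This is not how TCP is implemented (TCP adds timers, sliding windows, and sequence numbers); the functions `make_lossy_channel` and `send_reliably` are hypothetical, and the loss pattern is fixed in advance so the example is deterministic.

```python
def make_lossy_channel(loss_pattern, delivered):
    """Build a simulated channel.

    loss_pattern -- iterable of booleans, True meaning that transmission is lost
    delivered    -- list that collects the payloads which actually arrive
    """
    losses = iter(loss_pattern)

    def channel(payload):
        if next(losses, False):
            return False          # lost: the sender never sees an ACK
        delivered.append(payload)
        return True               # arrived: the receiver acknowledges

    return channel


def send_reliably(payload, channel, max_tries=10):
    """Retransmit until an ACK arrives; return the number of transmissions used."""
    for attempt in range(1, max_tries + 1):
        if channel(payload):
            return attempt
    raise TimeoutError("no acknowledgement received")
```

With a loss pattern of `[True, True, False]`, the third transmission gets through, so the omission is hidden from the application at the cost of extra delay.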
RPC Semantics in the Presence of Failures

The goal of RPC is to hide communication by making remote procedure calls look like local ones.
Five different classes of failure can occur in RPC systems, each requiring a different solution:
 The client cannot locate the server, so no request can be sent.
 The client’s request to the server is lost, so no response is returned by the server to the waiting client.
 The server crashes after receiving the request, so the service request is left acknowledged but undone.
 The server’s reply is lost on its way to the client: the service has completed, but the results never arrive at the client.
 The client crashes after sending its request, and the server sends a reply to a newly restarted client that may not be expecting it.
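The lost-reply case is a useful one to sketch, because a client that simply resends its request may cause the operation to run twice. One common remedy, shown here as an assumption rather than as the method these slides prescribe, is to tag each request with an identifier and have the server cache its replies. The names `Server`, `handle`, and `call_with_retry` are hypothetical.

```python
class Server:
    """A toy account server whose operation (a deposit) is not idempotent."""

    def __init__(self):
        self.balance = 0
        self.replies = {}  # request id -> reply already computed

    def handle(self, request_id, amount):
        # A retransmitted request is answered from the cache, not re-executed.
        if request_id in self.replies:
            return self.replies[request_id]
        self.balance += amount
        self.replies[request_id] = self.balance
        return self.balance


def call_with_retry(server, request_id, amount, reply_lost):
    """Issue one RPC, resending while the reply is lost.

    reply_lost -- list of booleans, one per attempt; True means the reply to
                  that attempt was lost on its way back to the client.
    """
    for lost in reply_lost:
        reply = server.handle(request_id, amount)  # the server did the work
        if not lost:
            return reply
    raise TimeoutError("no reply received")
```

Calling `call_with_retry(server, "r1", 100, [True, False])` sends the request twice, yet the balance ends at 100 rather than 200 because the second copy is recognized as a duplicate.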
Reliable Group Communication
 Reliable multicast services guarantee that all messages are delivered to all members of a process group.
 A simple solution to reliable multicasting exists when all receivers are known and are assumed not to fail.
 The sending process assigns a sequence number to each outgoing message (making it easy to spot when a message is missing).
a) Message transmission – note that the third receiver is expecting 24.
b) Reporting feedback – the third receiver informs the sender.
c) But how long does the sender keep its history buffer populated?
d) Also, such schemes perform poorly as the group grows: there are too many ACKs.
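The scheme can be sketched as follows, assuming a fixed group of receivers that never crash. The classes `Sender` and `Receiver` are hypothetical, message loss is simulated through the `lost_for` argument, and one possible answer to the history-buffer question is shown: the sender keeps a message until every receiver has acknowledged it.

```python
class Receiver:
    def __init__(self, name):
        self.name = name
        self.expected = 0      # next sequence number to deliver
        self.pending = {}      # out-of-order messages held back
        self.delivered = []

    def receive(self, seq, payload, sender):
        if seq < self.expected or seq in self.pending:
            return                                   # duplicate, ignore
        self.pending[seq] = payload
        # A gap in the sequence numbers reveals a missing message.
        for missing in range(self.expected, seq):
            if missing not in self.pending:
                sender.retransmit(missing, self)     # report feedback
        # Deliver everything that is now in order, acknowledging each one.
        while self.expected in self.pending:
            self.delivered.append(self.pending.pop(self.expected))
            sender.ack(self.expected, self.name)
            self.expected += 1


class Sender:
    def __init__(self, receivers):
        self.receivers = receivers
        self.next_seq = 0
        self.history = {}      # seq -> payload, kept until all have ACKed
        self.acks = {}         # seq -> names of receivers that ACKed

    def multicast(self, payload, lost_for=()):
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        self.acks[seq] = set()
        for receiver in self.receivers:
            if receiver.name not in lost_for:        # simulate message loss
                receiver.receive(seq, payload, self)
        return seq

    def retransmit(self, seq, receiver):
        receiver.receive(seq, self.history[seq], self)

    def ack(self, seq, name):
        self.acks[seq].add(name)
        if len(self.acks[seq]) == len(self.receivers):
            del self.history[seq]                    # safe to forget
            del self.acks[seq]
```

Every message costs one ACK per receiver, which makes the scaling problem visible: with N receivers the sender handles N acknowledgements for each multicast.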
Read Assignment

 Distributed commit and recovery in distributed systems
 What is three-phase commit in a distributed system?
Thank You!!!
