The document provides an introduction to distributed systems with a focus on fault tolerance, outlining key concepts such as types of faults, errors, and failures. It discusses methods for achieving fault tolerance, including redundancy and process resilience, as well as reliable communication strategies in client-server and group communication contexts. Additionally, it touches on failure models and the challenges associated with remote procedure calls (RPC) in the presence of failures.
Ds Chapter 7
Bahir Dar Institute of Technology
Faculty of Computing Network and Internet Chair
Introduction to Distributed Systems
Content
Fault Tolerance
Introduction to Fault Tolerance
Process Resilience
Reliable Client-Server Communication
Reliable Group Communication
Distributed Commit
Recovery
Basic Concepts
Fault tolerance is closely related to the notion of “dependability”. In distributed systems, dependability is characterized under a number of headings:
Availability: the system is ready to be used immediately.
Reliability: the system can run continuously without failure.
Safety: if the system fails, nothing catastrophic will happen.
Maintainability: when the system fails, it can be repaired easily.
What Is a Fault?
A fault is a defect or flaw in a system's hardware, software, or design that has the potential to cause an error. Faults can be classified into several types, including:
Hardware faults
Software faults
Design faults
Operational faults
Error in a Distributed System
An error is the manifestation of a fault. It is an incorrect state or condition within the system that can potentially lead to a failure. Errors can be transient, intermittent, or permanent:
Transient errors: temporary errors that disappear without intervention.
Intermittent errors: errors that occur sporadically and unpredictably.
Permanent errors: persistent errors that continue until corrective action is taken.
Failure in a Distributed System
A system is said to fail when it cannot meet its promises.
A failure is brought about by the existence of errors in the system.
The cause of an error is called a fault.
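Types of Fault
There are three main types of fault:
Transient fault: appears once, then disappears.
Intermittent fault: occurs, vanishes of its own accord, and then reappears, in no predictable pattern.
Permanent fault: once it occurs, only the replacement or repair of the faulty component will allow the distributed system (DS) to function normally.
Failure Models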
Crash failure: the system stops functioning and does not respond to any inputs.
Omission failure: a server fails to respond to incoming requests.
Timing failure: the system's response is either too early or too late, violating timing constraints.
Response failure: the system produces an incorrect response or output.
Receive omission (a kind of omission failure): a server fails to receive incoming messages; for example, there may be no thread listening.
Send omission (a kind of omission failure): a server fails to send messages.
Value failure (a kind of response failure): the value of the response is wrong; for example, a search engine returns the wrong Web pages as the result of a search.
Failure Masking by Redundancy
If a system is to be fault tolerant, the best it can do is to try to hide the occurrence of failures from other processes.
Information redundancy: add extra bits to allow recovery from garbled bits (error correction).
Time redundancy: an action is performed more than once if needed; for example, an aborted transaction can be redone. This is useful for transient and intermittent faults.
Physical redundancy: add extra equipment (HW) or ...
Process Resilience
Processes can be made fault tolerant by arranging them into a group of processes, with each member of the group being identical.
A message sent to the group is delivered to all of the “copies” of the process (the group members), and then only one of them performs the required service.
If one of the processes fails, it is assumed that one of the others will still be able to function (and service any pending request or operation).
Flat vs. Hierarchical Groups
a) Communication in a flat group: all the processes are equal, and decisions are made collectively. Note: there is no single point of failure, but decision making is complicated because consensus is required.
b) Communication in a simple hierarchical group: one of the processes is elected to be the coordinator, which selects another process (a worker) to perform the operation. Note: the coordinator is a single point of failure, but decisions are made easily and quickly by the coordinator without first having to reach consensus.
Reliable Client-Server Communication
Fault tolerance in distributed systems concentrates on faulty processes, but communication failures also have to be considered.
A communication channel may exhibit failures in the form of:
crash
omission
timing
arbitrary (for example, duplicate messages as a result of buffering at nodes and the sender retransmitting)
Point-to-Point Communication
Reliable transport protocols such as TCP can be used; they mask most communication failures, such as omissions (lost messages), by using acknowledgements and retransmissions.
RPC Semantics in the Presence of Failures
The goal of RPC is to hide communication by making remote procedure calls look like local ones.
Five different classes of failure can occur in RPC systems, each requiring a different solution:
1. The client cannot locate the server, so no request can be sent.
2. The client’s request to the server is lost, so no response is returned by the server to the waiting client.
3. The server crashes after receiving the request, and the service request is left acknowledged but undone.
4. The server’s reply is lost on its way to the client: the service has completed, but the results never arrive at the client.
5. The client crashes after sending its request, and the server sends a reply to a newly restarted client that may not be expecting it.
Reliable Group Communication
Reliable multicast services guarantee that all messages are delivered to all members of a process group.
A simple solution to reliable multicasting exists when all receivers are known and are assumed not to fail. The sending process assigns a sequence number to each outgoing message, which makes it easy to spot when a message is missing.
a) Message transmission: note that the third receiver is expecting 24.
b) Reporting feedback: the third receiver informs the sender.
c) But how long does the sender keep its history buffer populated?
d) Also, such schemes perform poorly as the group grows, because there are too many ACKs.
Read Assignment
Distributed commit and recovery in a distributed system?
What is three-phase commit in a distributed system?
Thank You!!!
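Appendix sketch: sequence numbers in reliable multicast

The sequence-number scheme described under Reliable Group Communication can be illustrated with a small in-memory model. This is only a sketch under stated assumptions, not the scheme from any particular system: the class names Sender and Receiver, the method names, and the demo function are hypothetical; there is no real network; feedback to the sender is modelled as a return value; and the history buffer is never trimmed, which sidesteps question c) above rather than answering it.

```python
from dataclasses import dataclass, field


@dataclass
class Sender:
    """Hypothetical sender: numbers messages and keeps a history buffer."""
    next_seq: int = 0
    history: dict = field(default_factory=dict)  # seq -> payload, never trimmed here

    def multicast(self, payload):
        # Assign the next sequence number and remember the message.
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = payload
        return seq, payload

    def retransmit(self, seq):
        # Resend a message that a receiver reported as missing.
        return seq, self.history[seq]


@dataclass
class Receiver:
    """Hypothetical receiver: delivers in order and reports gaps."""
    expected: int = 0
    delivered: list = field(default_factory=list)
    held_back: dict = field(default_factory=dict)  # out-of-order messages

    def receive(self, seq, payload):
        """Accept one message; return the sequence numbers still missing."""
        if seq < self.expected:
            return []  # duplicate of an already delivered message
        self.held_back[seq] = payload
        # Deliver every message that is now in order.
        while self.expected in self.held_back:
            self.delivered.append(self.held_back.pop(self.expected))
            self.expected += 1
        # Anything between 'expected' and the highest held-back message is a gap.
        highest = max(self.held_back, default=self.expected)
        return [s for s in range(self.expected, highest) if s not in self.held_back]


def demo():
    """Receiver has everything up to 23, loses 24, then receives 25."""
    sender = Sender(next_seq=23)
    receiver = Receiver(expected=23)

    receiver.receive(*sender.multicast("m23"))   # arrives normally
    sender.multicast("m24")                      # lost on the way to this receiver
    missing = receiver.receive(*sender.multicast("m25"))

    # The receiver informs the sender, which retransmits from its history buffer.
    for seq in missing:
        receiver.receive(*sender.retransmit(seq))
    return missing, receiver.delivered
```

Calling demo() walks through the situation in items a) and b): the receiver is expecting 24, sees 25 instead, reports the gap, and delivers both messages in order once the retransmission arrives. Because each receiver here reports only what it is missing, the sketch uses less feedback than a scheme in which every receiver acknowledges every message, which is the scaling concern raised in item d).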