08 Failures
Chapter 7
Introduction to fault tolerance
** Fault Tolerance
This is the easiest one to deal with: just have the operating
system or client stub start a timer when sending the request.
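The timer can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `call_with_timer` and `make_lossy_send` are hypothetical, the lossy channel is simulated in memory, and retransmitting when the timer expires is assumed here as the usual follow-up action rather than taken from the text above.

```python
import queue


def call_with_timer(send, replies, request, timeout=0.05, max_tries=3):
    """Send `request` and wait for a reply, resending if the timer expires."""
    for attempt in range(1, max_tries + 1):
        send(request)
        try:
            # The blocking get() with a timeout plays the role of the timer.
            return replies.get(timeout=timeout), attempt
        except queue.Empty:
            continue  # timer expired: assume the request was lost, try again
    raise TimeoutError(f"no reply after {max_tries} attempts")


def make_lossy_send(replies, drop_first=1):
    """Hypothetical channel that silently drops the first `drop_first` requests."""
    state = {"sent": 0}

    def send(request):
        state["sent"] += 1
        if state["sent"] > drop_first:
            # The request got through; the (simulated) server replies.
            replies.put(("reply", request))

    return send


replies = queue.Queue()
send = make_lossy_send(replies, drop_first=1)
result, attempts = call_with_timer(send, replies, "read(block 7)")
```

In this run the first request is dropped, the timer expires, and the second attempt gets the reply, so `attempts` ends up as 2.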
(i) Before the client stub sends an RPC message, it makes a log entry telling what it is
about to do. After a reboot, the log is checked and any orphan is explicitly killed.
(ii) When a client reboots, it broadcasts a message to all machines declaring the start of a
new epoch. All old computations on behalf of that client are then killed.
(iii) When an epoch broadcast comes in, each machine checks to see if it has any
remote computations, and if so, tries to locate their owner. Only if the owner
cannot be found is the computation killed.
(iv) Each RPC is given a standard amount of time to do the job. If the client waits
that long after a crash before rebooting, all orphans are sure to be gone.
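Strategies (ii) and (iii) can be contrasted with a small Python sketch. It is a simplified model, not a real RPC runtime: the `Server` class, its method names, and the `locate_owner` callback are all hypothetical, and the broadcast is modelled as a direct method call.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    # Maps a computation id to (client id, epoch in which it was started).
    computations: dict = field(default_factory=dict)

    def start(self, comp_id, client, epoch):
        self.computations[comp_id] = (client, epoch)

    def reincarnation(self, client, new_epoch):
        """Strategy (ii): kill everything `client` started in an older epoch."""
        killed = [cid for cid, (owner, epoch) in self.computations.items()
                  if owner == client and epoch < new_epoch]
        for cid in killed:
            del self.computations[cid]
        return killed

    def gentle_reincarnation(self, locate_owner):
        """Strategy (iii): kill only computations whose owner cannot be found.

        `locate_owner(comp_id)` is an assumed callback returning True when
        the owner of the computation could be located.
        """
        killed = [cid for cid in self.computations if not locate_owner(cid)]
        for cid in killed:
            del self.computations[cid]
        return killed


# Client A reboots and announces epoch 2: its old computation c1 is killed.
server = Server()
server.start("c1", client="A", epoch=1)
server.start("c2", client="B", epoch=1)
killed = server.reincarnation("A", new_epoch=2)

# Gentle variant: only c1's owner cannot be located, so only c1 is killed.
server2 = Server()
server2.start("c1", client="A", epoch=1)
server2.start("c2", client="B", epoch=1)
killed_gentle = server2.gentle_reincarnation(lambda comp_id: comp_id == "c2")
```

The difference is in who decides: in (ii) the epoch number alone condemns a computation, while in (iii) each machine first tries to find the owner and kills only what is truly orphaned.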
Reliable group communication
** Basic Reliable-Multicasting Schemes
Reliable multicasting means that a message that is sent
to a process group should be delivered to each
member of that group.
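One simple way to meet this definition is sketched below in Python. It assumes an acknowledgement-based scheme in which the sender keeps each message in a history buffer until every group member has received it. That scheme, the class names `Sender` and `Receiver`, and the `lost_by` parameter used to simulate message loss are assumptions made for illustration, not something stated above.

```python
class Receiver:
    def __init__(self, name):
        self.name = name
        self.delivered = []  # (sequence number, message) pairs, in arrival order

    def receive(self, seq, message):
        self.delivered.append((seq, message))


class Sender:
    def __init__(self, group):
        self.group = group   # Receiver objects forming the process group
        self.history = {}    # seq -> (message, names of members still missing it)
        self.next_seq = 0

    def multicast(self, message, lost_by=()):
        """Send to the whole group; members named in `lost_by` miss this copy."""
        seq = self.next_seq
        self.next_seq += 1
        self.history[seq] = (message, {r.name for r in self.group})
        self._transmit(seq, skip=set(lost_by))
        return seq

    def _transmit(self, seq, skip=frozenset()):
        message, pending = self.history[seq]
        for r in self.group:
            if r.name in pending and r.name not in skip:
                r.receive(seq, message)
                pending.discard(r.name)  # models the receiver's acknowledgement
        if not pending:
            del self.history[seq]        # everyone has it: safe to forget

    def retransmit_pending(self):
        """Resend every buffered message to the members that still lack it."""
        for seq in sorted(self.history):
            self._transmit(seq)


group = [Receiver("p1"), Receiver("p2"), Receiver("p3")]
sender = Sender(group)
sender.multicast("update x", lost_by={"p2"})  # p2 misses the first copy
buffered_after_loss = sorted(sender.history)   # message 0 is still buffered
sender.retransmit_pending()                    # p2 now receives it
```

After the retransmission every member has delivered message 0 and the history buffer is empty, which is the property the definition asks for.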