
UNIT III

DISTRIBUTED MUTEX & DEADLOCK


Distributed mutual exclusion algorithms: Introduction – Preliminaries – Lamport's algorithm – Ricart–Agrawala
algorithm – Maekawa's algorithm – Suzuki–Kasami's broadcast algorithm. Deadlock detection in distributed
systems: Introduction – System model – Preliminaries – Models of deadlocks – Knapp's classification –
Algorithms for the single resource model, the AND model and the OR model.

DISTRIBUTED MUTUAL EXCLUSION ALGORITHMS


 Mutual exclusion is a concurrency control property which is introduced to prevent
race conditions.
 It is the requirement that a process cannot access a shared resource while another
concurrent process is currently using the same resource.

Mutual exclusion in a distributed system states that only one process is allowed to execute the
critical section (CS) at any given time.

 Message passing is the sole means for implementing distributed mutual exclusion.
 The decision as to which process is allowed access to the CS next is arrived at by
message passing, in which each process learns about the state of all other processes in
some consistent way.
 There are three basic approaches for implementing distributed mutual exclusion:
1. Token-based approach:
 A unique token is shared among all the sites.
 If a site possesses the unique token, it is allowed to enter its critical section.
 This approach uses sequence numbers to order requests for the critical section.
 Each request for the critical section contains a sequence number. This sequence
number is used to distinguish old and current requests.
 This approach ensures mutual exclusion as the token is unique.
 Eg: Suzuki–Kasami's broadcast algorithm
2. Non-token-based approach:
 A site communicates with other sites in order to determine which site should
execute the critical section next. This requires the exchange of two or more successive
rounds of messages among sites.
 This approach uses timestamps instead of sequence numbers to order requests
for the critical section.
 Whenever a site makes a request for the critical section, it gets a timestamp.
The timestamp is also used to resolve any conflict between critical section requests.
 All algorithms that follow the non-token-based approach maintain a logical
clock. Logical clocks are updated according to Lamport's scheme.
 Eg: Lamport's algorithm, Ricart–Agrawala algorithm

3. Quorum-based approach:
 Instead of requesting permission to execute the critical section from all other
sites, each site requests permission only from a subset of sites, which is called a quorum.
 Any two quorums contain a common site.
 This common site is responsible for ensuring mutual exclusion.
 Eg: Maekawa's algorithm

Preliminaries
 The system consists of N sites, S1, S2, S3, …, SN.
 Assume that a single process is running on each site.
 The process at site Si is denoted by pi. All these processes communicate
asynchronously over an underlying communication network.
 A process wishing to enter the CS requests all other or a subset of processes by
sending REQUEST messages, and waits for appropriate replies before entering the
CS.
 While waiting the process is not allowed to make further requests to enter the CS.
 A site can be in one of the following three states: requesting the CS, executing the CS,
or neither requesting nor executing the CS (the idle state).
 In the requesting-the-CS state, the site is blocked and cannot make further requests for
the CS.
 In the idle state, the site is executing outside the CS.
 In token-based algorithms, a site can also be in a state where a site holding the
token is executing outside the CS. Such a state is referred to as the idle token state.
 At any instant, a site may have several pending requests for CS. A site queues up
these requests and serves them one at a time.
 N denotes the number of processes or sites involved in invoking the critical section, T
denotes the average message delay, and E denotes the average critical section
execution time.

Requirements of mutual exclusion algorithms


 Safety property:

The safety property states that at any instant, only one process can execute the
critical section. This is an essential property of a mutual exclusion algorithm.
 Liveness property:
This property states the absence of deadlock and starvation. Two or more sites
should not endlessly wait for messages that will never arrive. In addition, a site must
not wait indefinitely to execute the CS while other sites are repeatedly executing the
CS. That is, every requesting site should get an opportunity to execute the CS in finite
time.
 Fairness:
Fairness in the context of mutual exclusion means that each process gets a fair
chance to execute the CS. In mutual exclusion algorithms, the fairness property
generally means that the CS execution requests are executed in order of their arrival in
the system.

Performance metrics
 Message complexity: This is the number of messages that are required per CS
execution by a site.
 Synchronization delay: the time between one site leaving the CS and the next
site entering the CS (Figure 3.1).
 Response time: This is the time interval a request waits for its CS execution to be
over after its request messages have been sent out. Thus, response time does not
include the time a request waits at a site before its request messages have been sent
out. (Figure 3.2)
 System throughput: This is the rate at which the system executes requests for the
CS. If SD is the synchronization delay and E is the average critical section execution
time, then the system throughput is given by
system throughput = 1 / (SD + E)

Figure 3.1 Synchronization delay

Figure 3.2 Response Time


Low and High Load Performance:
 The performance of mutual exclusion algorithms is usually studied under two special
loading conditions, viz., "low load" and "high load".
 The load is determined by the arrival rate of CS execution requests.
 Under low load conditions, there is seldom more than one request for the critical
section present in the system simultaneously.
 Under heavy load conditions, there is always a pending request for the critical section
at a site.

Best and worst case performance


 In the best case, prevailing conditions are such that a performance metric attains the
best possible value. For example, the best value of the response time is a round-trip
message delay plus the CS execution time, 2T + E.

 For example, the best and worst values of the response time are achieved when load
is, respectively, low and high.
 The best and the worst message traffic is generated at low and heavy load conditions,
respectively.

LAMPORT’S ALGORITHM
 Lamport's Distributed Mutual Exclusion Algorithm is a permission-based algorithm
proposed by Lamport as an illustration of his synchronization scheme for distributed
systems.
 In permission-based algorithms, a timestamp is used to order critical section requests
and to resolve any conflict between requests.
 In Lamport's algorithm, critical section requests are executed in increasing order of
timestamps, i.e., a request with a smaller timestamp is given permission to execute the
critical section before a request with a larger timestamp.
 Three types of messages (REQUEST, REPLY and RELEASE) are used and
communication channels are assumed to follow FIFO order.
 A site sends a REQUEST message to all other sites to get their permission to enter the
critical section.
 A site sends a REPLY message to the requesting site to give its permission to enter the
critical section.
 A site sends a RELEASE message to all other sites upon exiting the critical section.
 Every site Si keeps a queue to store critical section requests ordered by their
timestamps; request_queuei denotes the queue of site Si.
 A timestamp is given to each critical section request using Lamport's logical clock.
 The timestamp is used to determine the priority of critical section requests: a smaller
timestamp gets higher priority than a larger timestamp. Critical section requests are
always executed in the order of their timestamps.

Fig 3.1: Lamport’s distributed mutual exclusion algorithm


To enter Critical section:


 When a site Si wants to enter the critical section, it sends a request message
REQUEST(tsi, i) to all other sites and places the request on request_queuei. Here, tsi
denotes the timestamp of site Si's request.
 When a site Sj receives the request message REQUEST(tsi, i) from site Si, it returns a
timestamped REPLY message to site Si and places the request of site Si on
request_queuej.

To execute the critical section:


 A site Si can enter the critical section when the following two conditions hold:
L1: Si has received a message with timestamp larger than (tsi, i) from all other sites.
L2: Si's own request is at the top of request_queuei.

To release the critical section:


 When a site Si exits the critical section, it removes its own request from the top of its
request queue and sends a timestamped RELEASE message to all other sites. When a
site Sj receives the timestamped RELEASE message from site Si, it removes the request
of Si from its request queue.
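To make the message flow concrete, below is a minimal single-site sketch of Lamport's algorithm in Python. It is illustrative only: the send(dest, msg) callback and the message-dispatch wiring are hypothetical stand-ins for a real network layer.

import heapq

class LamportMutex:
    """One site S_i in Lamport's algorithm (sketch; `send(dest, msg)` is a
    hypothetical callback supplied by the surrounding network layer)."""

    def __init__(self, site_id, all_sites, send):
        self.id = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.send = send
        self.clock = 0                      # Lamport logical clock
        self.queue = []                     # request_queue_i: min-heap of (ts, site)
        self.last_ts = {}                   # latest timestamp received from each site
        self.my_request = None

    def tick(self, received_ts=0):
        self.clock = max(self.clock, received_ts) + 1
        return self.clock

    def request_cs(self):
        ts = self.tick()
        self.my_request = (ts, self.id)
        heapq.heappush(self.queue, self.my_request)
        for s in self.others:
            self.send(s, ("REQUEST", ts, self.id))

    def on_request(self, ts, sender):
        heapq.heappush(self.queue, (ts, sender))
        self.send(sender, ("REPLY", self.tick(ts), self.id))

    def on_reply(self, ts, sender):
        self.tick(ts)
        self.last_ts[sender] = max(self.last_ts.get(sender, 0), ts)

    def can_enter_cs(self):
        if self.my_request is None:
            return False
        # L1: a message with larger timestamp has arrived from every other site.
        l1 = all(self.last_ts.get(s, 0) > self.my_request[0] for s in self.others)
        # L2: own request is at the top of the request queue.
        l2 = bool(self.queue) and self.queue[0] == self.my_request
        return l1 and l2

    def release_cs(self):
        heapq.heappop(self.queue)           # own request is at the top
        self.my_request = None
        for s in self.others:
            self.send(s, ("RELEASE", self.tick(), self.id))

    def on_release(self, ts, sender):
        self.tick(ts)
        self.last_ts[sender] = max(self.last_ts.get(sender, 0), ts)
        self.queue = [r for r in self.queue if r[1] != sender]
        heapq.heapify(self.queue)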

Correctness
Theorem: Lamport’s algorithm achieves mutual exclusion.
Proof: Proof is by contradiction.
 Suppose two sites Si and Sj are executing the CS concurrently. For this to happen,
conditions L1 and L2 must hold at both sites concurrently.
 This implies that at some instant in time, say t, both Si and Sj have their own requests
at the top of their request queues and condition L1 holds at them. Without loss of
generality, assume that Si's request has a smaller timestamp than the request of Sj.
 From condition L1 and the FIFO property of the communication channels, it is clear that at
instant t the request of Si must be present in request_queuej when Sj was executing its
CS. This implies that Sj's own request is at the top of its own request queue when a
smaller timestamp request, Si's request, is present in request_queuej – a
contradiction!

Theorem: Lamport’s algorithm is fair.


Proof: The proof is by contradiction.
 Suppose a site Si's request has a smaller timestamp than the request of another site Sj
and Sj is able to execute the CS before Si.
 For Sj to execute the CS, it has to satisfy conditions L1 and L2. This implies that
at some instant in time, say t, Sj has its own request at the top of its queue and it has also
received a message with timestamp larger than the timestamp of its request from all
other sites.
 But the request queue at a site is ordered by timestamp, and according to our assumption
Si has the lower timestamp. So Si's request must be placed ahead of Sj's request in
request_queuej. This is a contradiction!

Message Complexity:
Lamport's algorithm requires invocation of 3(N – 1) messages per critical section execution.
These 3(N – 1) messages involve:
 (N – 1) request messages
 (N – 1) reply messages
 (N – 1) release messages

Drawbacks of Lamport’s Algorithm:
 Unreliable approach: failure of any one of the processes will halt the progress of the
entire system.
 High message complexity: Algorithm requires 3(N-1) messages per critical section
invocation.

Performance:
Synchronization delay is equal to the maximum message transmission time. The algorithm
requires 3(N – 1) messages per CS execution. It can be optimized to 2(N – 1) messages by
omitting the REPLY message in some situations.

RICART–AGRAWALA ALGORITHM
 The Ricart–Agrawala algorithm is an algorithm for mutual exclusion in a distributed
system proposed by Glenn Ricart and Ashok Agrawala.
 This algorithm is an extension and optimization of Lamport's Distributed Mutual
Exclusion Algorithm.
 It follows the permission-based approach to ensure mutual exclusion.
 Two types of messages (REQUEST and REPLY) are used and communication
channels are assumed to follow FIFO order.
 A site sends a REQUEST message to all other sites to get their permission to enter the
critical section.
 A site sends a REPLY message to another site to give its permission to enter the critical
section.
 A timestamp is given to each critical section request using Lamport's logical clock.
 The timestamp is used to determine the priority of critical section requests.
 A smaller timestamp gets higher priority than a larger timestamp.
 The execution of critical section requests is always in the order of their timestamps.


Fig 3.2: Ricart–Agrawala algorithm

To enter Critical section:


 When a site Si wants to enter the critical section, it sends a timestamped REQUEST
message to all other sites.
 When a site Sj receives a REQUEST message from site Si, it sends a REPLY message
to site Si if site Sj is neither requesting nor currently executing the critical
section.
 It also sends a REPLY if site Sj is requesting but the timestamp of site Si's request is
smaller than that of its own request.
 Otherwise, the request is deferred by site Sj.

To execute the critical section:


Site Si enters the critical section if it has received the REPLY message from all other
sites.

To release the critical section:


Upon exiting the critical section, site Si sends a REPLY message to all the deferred requests.
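A compact sketch of this deferred-reply logic in Python follows; the class and the send callback are hypothetical illustrations under the same assumptions as the earlier sketch, not a production implementation.

class RicartAgrawala:
    """One site in the Ricart-Agrawala algorithm (sketch; `send` is hypothetical)."""

    def __init__(self, site_id, all_sites, send):
        self.id = site_id
        self.others = [s for s in all_sites if s != site_id]
        self.send = send
        self.clock = 0
        self.requesting = False
        self.my_ts = None
        self.replies_pending = set()
        self.deferred = []                 # requests answered only after exiting the CS

    def request_cs(self):
        self.clock += 1
        self.my_ts = self.clock
        self.requesting = True
        self.replies_pending = set(self.others)
        for s in self.others:
            self.send(s, ("REQUEST", self.my_ts, self.id))

    def on_request(self, ts, sender):
        self.clock = max(self.clock, ts) + 1
        # Reply immediately unless our own pending request has priority
        # (smaller timestamp, ties broken by site id).
        if self.requesting and (self.my_ts, self.id) < (ts, sender):
            self.deferred.append(sender)
        else:
            self.send(sender, ("REPLY", self.id))

    def on_reply(self, sender):
        self.replies_pending.discard(sender)

    def can_enter_cs(self):
        return self.requesting and not self.replies_pending

    def release_cs(self):
        self.requesting = False
        for s in self.deferred:            # answer every deferred request
            self.send(s, ("REPLY", self.id))
        self.deferred = []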

Theorem: Ricart-Agrawala algorithm achieves mutual exclusion.
Proof: Proof is by contradiction.
 Suppose two sites Si and Sj are executing the CS concurrently and Si's request has
higher priority than the request of Sj. Clearly, Si received Sj's request after it had made
its own request.
 Thus, Sj can concurrently execute the CS with Si only if Si returns a REPLY to Sj (in
response to Sj's request) before Si exits the CS.
 However, this is impossible because Sj's request has lower priority. Therefore, the
Ricart–Agrawala algorithm achieves mutual exclusion.

Message Complexity:
The Ricart–Agrawala algorithm requires invocation of 2(N – 1) messages per critical section
execution. These 2(N – 1) messages involve:
 (N – 1) request messages
 (N – 1) reply messages

Drawbacks of Ricart–Agrawala algorithm:


 Unreliable approach: failure of any one node in the system can halt the progress
of the system. In this situation, the process will starve forever. The problem of node
failure can be solved by detecting the failure after some timeout.

Performance:
Synchronization delay is equal to the maximum message transmission time. The algorithm
requires 2(N – 1) messages per critical section execution.

MAEKAWA‘s ALGORITHM
 Maekawa's algorithm is a quorum-based approach to ensure mutual exclusion in
distributed systems.

Fig 3.3: Maekawa‘s Algorithm


 In permission-based algorithms like Lamport's algorithm and the Ricart–Agrawala
algorithm, a site requests permission from every other site, but in the quorum-based
approach, a site does not request permission from every other site but only from a
subset of sites, which is called its quorum.
 Three types of messages (REQUEST, REPLY and RELEASE) are used.
 A site sends a REQUEST message to all other sites in its request set or quorum to get
their permission to enter the critical section.
 A site sends a REPLY message to a requesting site to give its permission to enter the
critical section.
 A site sends a RELEASE message to all other sites in its request set or quorum upon
exiting the critical section.

The following are the conditions for Maekawa's algorithm (for some constant K):
M1: for all i, j: Ri ∩ Rj ≠ ∅ (any two request sets have a common site)
M2: for all i: Si ∈ Ri (a site belongs to its own request set)
M3: for all i: |Ri| = K (all request sets are of equal size)
M4: every site Si is contained in exactly K request sets

Maekawa used the theory of projective planes and showed that N = K(K – 1) + 1. This
relation gives |Ri| = √N.

To enter Critical section:


 When a site Si wants to enter the critical section, it sends a request message
REQUEST(i) to all other sites in its request set Ri.
 When a site Sj receives the request message REQUEST(i) from site Si, it returns a
REPLY message to site Si if it has not sent a REPLY message to any site since it
received the last RELEASE message. Otherwise, it queues up the request.

To execute the critical section:


 A site Si can enter the critical section if it has received the REPLY message from all the
sites in its request set Ri.

To release the critical section:


 When a site Si exits the critical section, it sends a RELEASE(i) message to all other
sites in its request set Ri.
 When a site Sj receives the RELEASE(i) message from site Si, it sends a REPLY
message to the next site waiting in the queue and deletes that entry from the queue.
 In case the queue is empty, site Sj updates its status to show that it has not sent any
REPLY message since the receipt of the last RELEASE message.
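The locking behaviour of a quorum member can be sketched as below (Python; the send callback is a hypothetical stand-in). Note that this basic version inherits the deadlock-proneness discussed later, since queued requests are not ordered by timestamp.

class Maekawa:
    """One site in Maekawa's algorithm (sketch).
    request_set is the quorum R_i of this site."""

    def __init__(self, site_id, request_set, send):
        self.id = site_id
        self.request_set = list(request_set)
        self.send = send
        self.voted = False                 # True after granting a REPLY, until RELEASE
        self.wait_queue = []               # requests queued while this site is locked
        self.grants = set()

    def request_cs(self):
        self.grants = set()
        for s in self.request_set:
            self.send(s, ("REQUEST", self.id))

    def on_request(self, sender):
        if not self.voted:
            self.voted = True              # lock this site for the requester
            self.send(sender, ("REPLY", self.id))
        else:
            self.wait_queue.append(sender) # queue it until a RELEASE arrives

    def on_reply(self, sender):
        self.grants.add(sender)

    def can_enter_cs(self):
        return self.grants == set(self.request_set)

    def release_cs(self):
        for s in self.request_set:
            self.send(s, ("RELEASE", self.id))

    def on_release(self, sender):
        if self.wait_queue:
            nxt = self.wait_queue.pop(0)   # grant the oldest queued request
            self.send(nxt, ("REPLY", self.id))
        else:
            self.voted = False             # no pending request: unlock this site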

Correctness
Theorem: Maekawa’s algorithm achieves mutual exclusion.
Proof: Proof is by contradiction.
 Suppose two sites Si and Sj are concurrently executing the CS.

 This means site Si received a REPLY message from all sites in Ri and, concurrently,
site Sj was able to receive a REPLY message from all sites in Rj.
 Since Ri ∩ Rj ≠ ∅, there is some site Sk ∈ Ri ∩ Rj. Then site Sk must have sent REPLY
messages to both Si and Sj concurrently, which is a contradiction.

Message Complexity:
Maekawa's algorithm requires invocation of 3√N messages per critical section execution, as
the size of a request set is √N. These 3√N messages involve:
 √N request messages
 √N reply messages
 √N release messages

Drawbacks of Maekawa’s Algorithm:


This algorithm is deadlock prone because a site is exclusively locked by other sites
and requests are not prioritized by their timestamps.

Performance:
Synchronization delay is equal to twice the message propagation delay time. The algorithm
requires 3√N messages per critical section execution.

SUZUKI–KASAMI‘s BROADCAST ALGORITHM


 The Suzuki–Kasami algorithm is a token-based algorithm for achieving mutual exclusion
in distributed systems.
 This is a modification of the Ricart–Agrawala algorithm, a permission-based (non-token-
based) algorithm which uses REQUEST and REPLY messages to ensure mutual
exclusion.
 In token-based algorithms, a site is allowed to enter its critical section if it possesses
the unique token.
 Non-token-based algorithms use timestamps to order requests for the critical section,
whereas sequence numbers are used in token-based algorithms.
 Each request for the critical section contains a sequence number. This sequence number
is used to distinguish old and current requests.

Fig 3.4: Suzuki–Kasami‘s broadcast algorithm

To enter Critical section:
 When a site Si wants to enter the critical section and it does not have the token, it
increments its sequence number RNi[i] and sends a request message REQUEST(i, sn)
to all other sites in order to request the token.
 Here sn is the updated value of RNi[i].
 When a site Sj receives the request message REQUEST(i, sn) from site Si, it sets
RNj[i] to the maximum of RNj[i] and sn, i.e., RNj[i] = max(RNj[i], sn).
After updating RNj[i], site Sj sends the token to site Si if it has the token and RNj[i] =
LN[i] + 1.

To execute the critical section:


 Site Si executes the critical section if it has acquired the token.

To release the critical section:


After finishing the execution, site Si exits the critical section and does the following:
 sets LN[i] = RNi[i] to indicate that its critical section request RNi[i] has been executed;
 for every site Sj whose ID is not present in the token queue Q, it appends Sj's ID to Q if
RNi[j] = LN[j] + 1, to indicate that site Sj has an outstanding request;
 after the above updates, if the queue Q is non-empty, it pops a site ID from Q and
sends the token to the site indicated by the popped ID;
 if the queue Q is empty, it keeps the token.
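A minimal sketch of this token bookkeeping in Python is shown below; the send callback and the encoding of the token as a dictionary are assumptions made for illustration.

from collections import deque

class SuzukiKasami:
    """One site in the Suzuki-Kasami algorithm (sketch; `send` is hypothetical)."""

    def __init__(self, site_id, n_sites, send, has_token=False):
        self.id = site_id
        self.n = n_sites
        self.send = send
        self.RN = [0] * n_sites            # highest request number seen per site
        # The token carries LN (last executed request per site) and a queue Q.
        self.token = {"LN": [0] * n_sites, "Q": deque()} if has_token else None
        self.in_cs = False

    def request_cs(self):
        if self.token is not None:
            self.in_cs = True              # idle token held: enter with 0 messages
            return
        self.RN[self.id] += 1
        for s in range(self.n):
            if s != self.id:
                self.send(s, ("REQUEST", self.id, self.RN[self.id]))

    def on_request(self, i, sn):
        self.RN[i] = max(self.RN[i], sn)
        # Pass the idle token if site i has an outstanding (not yet served) request.
        if self.token is not None and not self.in_cs:
            if self.RN[i] == self.token["LN"][i] + 1:
                tok, self.token = self.token, None
                self.send(i, ("TOKEN", tok))

    def on_token(self, tok):
        self.token = tok
        self.in_cs = True

    def release_cs(self):
        self.in_cs = False
        LN, Q = self.token["LN"], self.token["Q"]
        LN[self.id] = self.RN[self.id]     # mark own request as executed
        for j in range(self.n):            # enqueue sites with outstanding requests
            if j != self.id and j not in Q and self.RN[j] == LN[j] + 1:
                Q.append(j)
        if Q:                              # hand the token to the next site in Q
            nxt = Q.popleft()
            tok, self.token = self.token, None
            self.send(nxt, ("TOKEN", tok))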

Correctness
Mutual exclusion is guaranteed because there is only one token in the system and a site holds
the token during the CS execution.
Theorem: A requesting site enters the CS in finite time.
Proof: Token request messages of a site Si reach other sites in finite time.
Since one of these sites will have the token in finite time, site Si's request will be placed in the
token queue in finite time.
Since there can be at most N − 1 requests in front of this request in the token queue, site Si
will get the token and execute the CS in finite time.

Message Complexity:
The algorithm requires no messages if the site already holds the idle token at the
time of its critical section request, or a maximum of N messages per critical section execution.
These N messages involve:
 (N – 1) request messages
 1 message to transfer the token

Drawbacks of Suzuki–Kasami Algorithm:


 Non-symmetric algorithm: A site retains the token even if it has not requested
the critical section.

Performance:
Synchronization delay is 0 and no message is needed if the site holds the idle token at the
time of its request. If the site does not hold the idle token, the maximum synchronization
delay is equal to the maximum message transmission time, and a maximum of N messages
are required per critical section invocation.

DEADLOCK DETECTION IN DISTRIBUTED SYSTEMS
Deadlock can neither be prevented nor avoided in a distributed system, because no single site
has accurate knowledge of the current global state. Therefore, only deadlock detection can be
implemented. The techniques of deadlock detection in a distributed system require the following:
 Progress: The method should be able to detect all the deadlocks in the system.
 Safety: The method should not detect false or phantom deadlocks.

There are three approaches to detect deadlocks in distributed systems.


Centralized approach:
 Here only one node is responsible for detecting deadlocks.
 The advantage of this approach is that it is simple and easy to implement, while the
drawbacks include excessive workload at one node and a single point of failure, which
in turn makes the system less reliable.

Distributed approach:
 In the distributed approach, different nodes work together to detect deadlocks. There is
no single point of failure, as the workload is equally divided among all nodes.
 The speed of deadlock detection also increases.

Hierarchical approach:
 This approach is the most advantageous approach.
 It is the combination of both centralized and distributed approaches of deadlock
detection in a distributed system.
 In this approach, some selected nodes or cluster of nodes are responsible for deadlock
detection and these selected nodes are controlled by a single node.

System Model
 A distributed program is composed of a set of n asynchronous processes p1, p2, . . .,
pi, . . ., pn that communicate by message passing over the communication
network.
 Without loss of generality we assume that each process is running on a different
processor.
 The processors do not share a common global memory and communicate solely by
passing messages over the communication network.
 There is no physical global clock in the system to which processes have
instantaneous access.
 The communication medium may deliver messages out of order; messages may be
lost, garbled or duplicated due to timeout and retransmission; processors may fail;
and communication links may go down.
We make the following assumptions:
 The systems have only reusable resources.
 Processes are allowed to make only exclusive access to resources.
 There is only one copy of each resource.
 A process can be in two states: running or blocked.
 In the running state (also called active state), a process has all the needed
resources and is either executing or is ready for execution.
 In the blocked state, a process is waiting to acquire some resource.
Wait-for graph (WFG)
This is used for deadlock detection. A graph is drawn based on the requests and
acquisitions of resources. If the graph has a closed loop or a cycle, then there is a
deadlock.
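As an illustration, the following Python sketch detects a cycle in a wait-for graph represented as an adjacency mapping; in the single-resource and AND models (discussed later), such a cycle indicates a deadlock.

def has_deadlock(wfg):
    """Detect a cycle in a wait-for graph given as {process: set of waited-for
    processes} using a depth-first search with three node colours."""
    WHITE, GREY, BLACK = 0, 1, 2
    color = {p: WHITE for p in wfg}

    def visit(p):
        color[p] = GREY
        for q in wfg.get(p, ()):
            if color.get(q, WHITE) == GREY:     # back edge: cycle found
                return True
            if color.get(q, WHITE) == WHITE and visit(q):
                return True
        color[p] = BLACK
        return False

    return any(color[p] == WHITE and visit(p) for p in list(wfg))

# P1 waits for P2, P2 waits for P3, P3 waits for P1: deadlock.
print(has_deadlock({"P1": {"P2"}, "P2": {"P3"}, "P3": {"P1"}}))  # True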


Fig 3.5: Wait for graph

Preliminaries

Deadlock Handling Strategies


Handling of deadlock becomes highly complicated in distributed systems because no
site has accurate knowledge of the current state of the system and because every inter-site
communication involves a finite and unpredictable delay. There are three strategies for
handling deadlocks:
 Deadlock prevention:
 This is achieved either by having a process acquire all the needed resources
simultaneously before it begins executing or by preempting a process which
holds the needed resource.
 This approach is highly inefficient and impractical in distributed systems.
 Deadlock avoidance:
 A resource is granted to a process if the resulting global system state is safe.
This is impractical in distributed systems.
 Deadlock detection:
 This requires examination of the status of process-resource interactions for
presence of cyclic wait.
 Deadlock detection in distributed systems seems to be the best approach to
handle deadlocks in distributed systems.

Issues in deadlock Detection


Deadlock handling faces two major issues
1. Detection of existing deadlocks
2. Resolution of detected deadlocks
Deadlock Detection
 Detection of deadlocks involves addressing two issues, namely maintenance of the
WFG and searching the WFG for the presence of cycles or knots.
 In distributed systems, a cycle or knot may involve several sites, so the search for cycles
greatly depends upon how the WFG of the system is represented across the system.
 Depending upon the way WFG information is maintained and the search for cycles is
carried out, there are centralized, distributed, and hierarchical algorithms for deadlock
detection in distributed systems.

Correctness criteria
A deadlock detection algorithm must satisfy the following two conditions:
1. Progress-No undetected deadlocks:
The algorithm must detect all existing deadlocks in finite time. In other words, after all
wait-for dependencies for a deadlock have formed, the algorithm should not wait for any
more events to occur to detect the deadlock.
2. Safety - No false deadlocks:
The algorithm should not report deadlocks which do not exist. Such reported
deadlocks are called phantom or false deadlocks.

Resolution of a Detected Deadlock


 Deadlock resolution involves breaking existing wait-for dependencies between the
processes to resolve the deadlock.
 It involves rolling back one or more deadlocked processes and assigning their
resources to blocked processes so that they can resume execution.
 The deadlock detection algorithms propagate information regarding wait-for
dependencies along the edges of the wait-for graph.
 When a wait-for dependency is broken, the corresponding information should be
immediately cleaned from the system.
 If this information is not cleaned in a timely manner, it may result in detection of
phantom deadlocks.

MODELS OF DEADLOCKS
The models of deadlocks are explained based on their hierarchy. The diagrams illustrate the
working of the deadlock models. Pa, Pb, Pc, Pd are passive processes that have already acquired
the resources. Pe is an active process that is requesting the resource.

Single Resource Model


 A process can have at most one outstanding request for only one unit of a resource.
 Since the maximum out-degree of a node in the WFG for the single resource model is 1,
the presence of a cycle in the WFG indicates that there is a deadlock.

Fig 3.6: Deadlock in single resource model

AND Model
 In the AND model, a passive process becomes active (i.e., its activation condition is
fulfilled) only after a message from each process in its dependent set has arrived.
 In the AND model, a process can request more than one resource simultaneously and the
request is satisfied only after all the requested resources are granted to the process.
 The requested resources may exist at different locations.
 The out degree of a node in the WFG for AND model can be more than 1.
 The presence of a cycle in the WFG indicates a deadlock in the AND model.
 Each node of the WFG in such a model is called an AND node.

 In the AND model, if a cycle is detected in the WFG, it implies a deadlock but not vice
versa. That is, even if a process is not a part of a cycle, it can still be deadlocked (for
example, if it waits for a resource held by a process that is part of a cycle).

Fig 3.7: Deadlock in AND model

OR Model

 A process can make a request for numerous resources simultaneously and the request
is satisfied if any one of the requested resources is granted.
 Presence of a cycle in the WFG of an OR model does not imply a deadlock
in the OR model.
 In the OR model, the presence of a knot indicates a deadlock.

Deadlock in OR model: a process Pi is blocked if it has a pending OR request to be satisfied.

 With every blocked process, there is an associated set of processes called dependent
set.
 A process shall move from an idle to an active state on receiving a grant message
from any of the processes in its dependent set.
 A process is permanently blocked if it never receives a grant message from any of the
processes in its dependent set.
 A set of processes S is deadlocked if all the processes in S are permanently blocked.
 In short, a process is deadlocked or permanently blocked if the following conditions
are met:
1. Each of the processes in the set S is blocked.
2. The dependent set for each process in S is a subset of S.
3. No grant message is in transit between any two processes in set S.
 A blocked process P in the set S becomes active only after receiving a grant message
from a process in its dependent set, which is a subset of S.

Fig 3.8: OR Model

p out of q Model


 This is a variation of AND-OR model.

Downloaded from EnggTree.com


CS3551 – DISTRIBUTED COMPUTING
M.A.M COLLEGE OF ENGINEERING
 This allows a request to obtain any k available resources from a pool of n resources.
Both models have the same expressive power.
 This favours a more compact formulation of a request.
 Every request in this model can be expressed in the AND-OR model and vice versa.

 Note that an AND request for p resources can be stated as a p-out-of-p request and an
OR request for p resources can be stated as a 1-out-of-p request.

Fig 3.9: p out of q Model

Unrestricted model
 No assumptions are made regarding the underlying structure of resource requests.
 In this model, only one assumption that the deadlock is stable is made and hence it is
the most general model.
 This model helps separate concerns: Concerns about properties of the problem (stability
and deadlock) are separated from underlying distributed systems computations (e.g.,
message passing versus synchronous communication).

KNAPP’S CLASSIFICATION OF DISTRIBUTED DEADLOCK DETECTION
ALGORITHMS
The four classes of distributed deadlock detection algorithm are:
1. Path-pushing
2. Edge-chasing
3. Diffusion computation
4. Global state detection

Path Pushing algorithms

 In path-pushing algorithms, distributed deadlocks are detected by maintaining an
explicit global wait-for graph.
 The basic idea is to build a global WFG (Wait-For Graph) at each site of the
distributed system.
 At each site, whenever deadlock computation is performed, the site sends its local WFG to
all the neighbouring sites.
 After the local data structure of each site is updated, this updated WFG is then passed
along to other sites, and the procedure is repeated until some site has a sufficiently
complete picture of the global state to announce deadlock or to establish that no
deadlocks are present.
 This feature of sending around the paths of the global WFG has led to the term path-
pushing algorithms.
Examples: Menasce–Muntz, Gligor and Shattuck, Ho and Ramamoorthy, Obermarck

Edge Chasing Algorithms


 The presence of a cycle in a distributed graph structure is verified by propagating
special messages called probes along the edges of the graph.
 These probe messages are different from the request and reply messages.
 The formation of a cycle can be detected by a site if it receives the matching probe sent
by it previously.
 Whenever a process that is executing receives a probe message, it discards this
message and continues.
 Only blocked processes propagate probe messages along their outgoing edges.
 The main advantage of edge-chasing algorithms is that probes are fixed-size messages,
which are normally very short.
Examples: Chandy et al., Choudhary et al., Kshemkalyani–Singhal, Sinha–Natarajan
algorithms.

Diffusing Computation Based Algorithms


 In diffusion computation based distributed deadlock detection algorithms, deadlock
detection computation is diffused through the WFG of the system.
 These algorithms make use of echo algorithms to detect deadlocks.
 This computation is superimposed on the underlying distributed computation.
 If this computation terminates, the initiator declares a deadlock.
 To detect a deadlock, a process sends out query messages along all the outgoing edges in
the WFG.
 These queries are successively propagated (i.e., diffused) through the edges of the WFG.
 When a blocked process receives the first query message for a particular deadlock detection
initiation, it does not send a reply message until it has received a reply message for
every query it sent.

 For all subsequent queries for this deadlock detection initiation, it immediately sends
back a reply message.
 The initiator of a deadlock detection detects a deadlock when it receives reply for every
query it had sent out.
Examples: Chandy–Misra–Haas algorithm for the OR model, Chandy–Herman algorithm

Global state detection-based algorithms


Global state detection based deadlock detection algorithms exploit the following facts:
1. A consistent snapshot of a distributed system can be obtained without freezing the
underlying computation.
2. If a stable property holds in the system before the snapshot collection is initiated, this
property will still hold in the snapshot.

Therefore, distributed deadlocks can be detected by taking a snapshot of the system and
examining it for the condition of a deadlock.

MITCHELL AND MERRITT'S ALGORITHM FOR THE SINGLE-RESOURCE MODEL
 This deadlock detection algorithm assumes a single-resource model.
 It detects both local and global deadlocks. Each process maintains two labels, a private
label and a public label; each label contains the process id. The algorithm guarantees
that only one process will detect a deadlock.
 Probes are sent in the opposite direction to the edges of the WFG.
 When a probe initiated by a process comes back to it, the process declares deadlock.

Features:
1. Only one process in a cycle detects the deadlock. This simplifies deadlock
resolution – this process can abort itself to resolve the deadlock. The algorithm can
be improvised by including priorities, so that the lowest-priority process in a cycle
detects the deadlock and aborts.
2. In this algorithm, a process that is detected in deadlock is aborted spontaneously, even
though under this assumption phantom deadlocks cannot be excluded. It can be
shown, however, that only genuine deadlocks will be detected in the absence of
spontaneous aborts.

Each node of the WFG has two local variables, called labels:
1. a private label, which is unique to the node at all times, though it is not constant.
2. a public label, which can be read by other processes and which may not be unique.

Each process is represented as u/v, where u and v are the public and private labels,
respectively. Initially, the private and public labels are equal for each process. A global WFG
is maintained and it defines the entire state of the system.

 The algorithm is defined by the four state transitions shown in Fig. 3.10, where z =
inc(u, v), and inc(u, v) yields a unique label greater than both u and v. Labels that are
not shown do not change.
 The transitions defined by the algorithm are block, activate, transmit and
detect.
 Block creates an edge in the WFG.
 Two messages are needed: one resource request and one message back to the blocked
process to inform it of the public label of the process it is waiting for.
 Activate denotes that a process has acquired the resource from the process it was
waiting for.
 Transmit propagates larger labels in the opposite direction of the edges by sending a
probe message.

Fig 3.10: Four possible state transitions

 Detect means that the probe with the private label of some process has returned to it,
indicating a deadlock.
 This algorithm can easily be extended to include priorities, so that whenever a
deadlock occurs, the lowest-priority process gets aborted.
 This priority-based algorithm has two phases.
1. The first phase is almost identical to the algorithm above.
2. In the second phase, the smallest priority is propagated around the cycle. The
propagation stops when one process recognizes the propagated priority as its
own.

Message Complexity:
If we assume that a deadlock persists long enough to be detected, the worst-case complexity
of the algorithm is s(s – 1)/2 transmit steps, where s is the number of processes in the cycle.
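A rough Python sketch of the four label transitions is given below. The inc counter and the in-memory waiting_for pointer are simplifications for illustration; in the real algorithm the public labels are read via probe messages rather than direct references.

class MMProcess:
    """Sketch of Mitchell and Merritt's label transitions (single-resource model)."""

    _counter = 0

    def __init__(self):
        MMProcess._counter += 1
        self.private = MMProcess._counter  # unique private label
        self.public = self.private         # public label, readable by other processes
        self.waiting_for = None

    @staticmethod
    def inc(u, v):
        """Yield a unique label greater than both u and v."""
        MMProcess._counter = max(MMProcess._counter, u, v) + 1
        return MMProcess._counter

    def block(self, other):
        """Block: create a WFG edge and take a label larger than both ends."""
        z = MMProcess.inc(self.public, other.public)
        self.private = self.public = z
        self.waiting_for = other

    def activate(self):
        """Activate: the awaited resource was acquired; the edge disappears."""
        self.waiting_for = None

    def transmit(self):
        """Transmit: adopt a larger public label from the process waited for
        (labels move opposite to the WFG edge direction)."""
        other = self.waiting_for
        if other is not None and other.public > self.public:
            self.public = other.public     # only the public label changes

    def detect(self):
        """Detect: the label this process created has travelled the whole cycle
        and come back, so it sees its own private label ahead of it."""
        other = self.waiting_for
        return other is not None and other.public == self.public == self.private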


CHANDY–MISRA–HAAS ALGORITHM FOR THE AND MODEL


 This is considered an edge-chasing, probe-based algorithm.
 It is also considered one of the best deadlock detection algorithms for distributed
systems.
 If a process makes a request for a resource which fails or times out, the process
generates a probe message and sends it to each of the processes holding one or more
of its requested resources.
 This algorithm uses a special message called a probe, which is a triplet (i, j, k), denoting
that it belongs to a deadlock detection initiated for process Pi and is being sent by the
home site of process Pj to the home site of process Pk.
 Each probe message contains the following information:
 the id of the process that is blocked (the one that initiates the probe message);
 the id of the process that is sending this particular version of the probe message;
 the id of the process that should receive this probe message.
 A probe message travels along the edges of the global WFG, and a deadlock is
detected when a probe message returns to the process that initiated it.
 A process Pj is said to be dependent on another process Pk if there exists a sequence of
processes Pj, Pi1, Pi2, . . ., Pim, Pk such that each process except Pk in the sequence is
blocked and each process, except Pj, holds a resource for which the previous process
in the sequence is waiting.
 Process Pj is said to be locally dependent upon process Pk if Pj is dependent upon
Pk and both processes are on the same site.
 When a process receives a probe message, it checks to see if it is also waiting for
resources.
 If not, it is currently using the needed resource and will eventually finish and release
the resource.
 If it is waiting for resources, it passes on the probe message to all processes it knows
to be holding resources it has itself requested.
 The process first modifies the probe message, changing the sender and receiver ids.
 If a process receives a probe message that it recognizes as having initiated, it knows
there is a cycle in the system and thus a deadlock.

Data structures
Each process Pi maintains a boolean array dependenti, where dependenti(j) is true only if Pi
knows that Pj is dependent on it. Initially, dependenti(j) is false for all i and j.

Fig 3.11: Chandy–Misra–Haas algorithm for the AND model
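The probe handling shown in the figure can be sketched as follows in Python. The send callback, the holders set, and the printed detection message are hypothetical stand-ins for the real resource-management layer.

class CMHSiteAND:
    """Probe handling for one process in the Chandy-Misra-Haas AND model (sketch).
    A probe is the triplet (init, sender, receiver)."""

    def __init__(self, pid, send):
        self.pid = pid
        self.send = send
        self.blocked = False
        self.holders = set()               # processes holding resources this one awaits
        self.dependent = {}                # dependent[i] True if P_i depends on us

    def initiate(self):
        """Called when this blocked process suspects a deadlock."""
        for k in self.holders:
            self.send(k, ("PROBE", self.pid, self.pid, k))

    def on_probe(self, init, sender, receiver):
        if not self.blocked:
            return                          # running processes discard probes
        if init == self.pid:
            print(f"P{self.pid} is deadlocked")  # the probe came back: cycle
            return
        if not self.dependent.get(init, False):
            self.dependent[init] = True     # remember: init depends on this process
            for k in self.holders:          # forward the probe to every holder
                self.send(k, ("PROBE", init, self.pid, k))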

Performance analysis
 In the algorithm, one probe message is sent on every edge of the WFG which
connects processes on two sites.
 The algorithm exchanges at most m(n − 1)/2 messages to detect a deadlock that
involves m processes and spans over n sites.
 The size of messages is fixed and is very small (only three integer words).
 The delay in detecting a deadlock is O(n).

Advantages:
 It is easy to implement.
 Each probe message is of fixed length.
 There is very little computation.
 There is very little overhead.
 There is no need to construct a graph, nor to pass graph information to other sites.
 This algorithm does not find false (phantom) deadlock.
 There is no need for special data structures.

CHANDY–MISRA–HAAS ALGORITHM FOR THE OR MODEL


 A blocked process determines if it is deadlocked by initiating a diffusion computation.
 Two types of messages are used in a diffusion computation:
 query(i, j, k)
 reply(i, j, k)

denoting that they belong to a diffusion computation initiated by process pi and are being
sent from process pj to process pk.
 A blocked process initiates deadlock detection by sending query messages to all
processes in its dependent set.
 If an active process receives a query or reply message, it discards it.
 When a blocked process Pk receives a query(i, j, k) message, it takes the following
actions:
1. If this is the first query message received by Pk for the deadlock detection
initiated by Pi, then it propagates the query to all the processes in its dependent
set and sets a local variable numk (i) to the number of query messages sent.
2. If this is not the engaging query, then Pk returns a reply message to it
immediately provided Pk has been continuously blocked since it received the
corresponding engaging query. Otherwise, it discards the query.
 Process Pk maintains a boolean variable waitk(i) that denotes the fact that it
has been continuously blocked since it received the last engaging query from
process Pi.
 When a blocked process Pk receives a reply(i, j, k) message, it decrements
numk(i) only if waitk(i) holds.
 A process sends a reply message in response to an engaging query only after it
has received a reply to every query message it has sent out for this engaging
query.
 The initiator process detects a deadlock when it has received reply messages to
all the query messages it has sent out.

Fig 3.12: Chandy–Misra–Haas algorithm for the OR model
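A compact sketch of this query/reply bookkeeping is given below (Python); send, dep_set, and the engaging-query variable names are illustrative assumptions, not the textbook notation verbatim.

class CMHSiteOR:
    """One process in the Chandy-Misra-Haas OR-model diffusion computation (sketch)."""

    def __init__(self, pid, send):
        self.pid = pid
        self.send = send
        self.blocked = False
        self.dep_set = set()               # processes this one waits on (OR request)
        self.num = {}                      # num[i]: outstanding replies per initiator
        self.wait = {}                     # wait[i]: continuously blocked since engaging query
        self.engager = {}                  # who sent the engaging query for initiator i

    def initiate(self):
        self.num[self.pid] = len(self.dep_set)
        self.wait[self.pid] = True
        for k in self.dep_set:
            self.send(k, ("QUERY", self.pid, self.pid, k))

    def on_query(self, init, sender, receiver):
        if not self.blocked:
            return                          # active processes discard queries
        if init not in self.wait:           # engaging query: diffuse it further
            self.engager[init] = sender
            self.wait[init] = True
            self.num[init] = len(self.dep_set)
            for k in self.dep_set:
                self.send(k, ("QUERY", init, self.pid, k))
        elif self.wait[init]:               # non-engaging query: reply immediately
            self.send(sender, ("REPLY", init, self.pid, sender))

    def on_reply(self, init, sender, receiver):
        if not self.wait.get(init, False):
            return
        self.num[init] -= 1
        if self.num[init] == 0:
            if init == self.pid:
                print(f"P{self.pid} is deadlocked")
            else:                           # answer our own engaging query
                self.send(self.engager[init],
                          ("REPLY", init, self.pid, self.engager[init]))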

Performance analysis
 For every deadlock detection, the algorithm exchanges e query messages and e reply
messages, where e = n(n – 1) is the number of edges.


UNIT IV
CONSENSUS AND RECOVERY
Consensus and agreement algorithms: Problem definition – Overview of results – Agreement in
a failure-free system (synchronous and asynchronous) – Agreement in synchronous systems
with failures. Checkpointing and rollback recovery: Introduction – Background and definitions
– Issues in failure recovery – Checkpoint-based recovery – Coordinated checkpointing algorithm
– Algorithm for asynchronous checkpointing and recovery.

CONSENSUS PROBLEM IN ASYNCHRONOUS SYSTEMS.

Table: Overview of results on agreement.

f denotes the number of failure-prone processes; n is the total number of processes.

Failure mode      | Synchronous system            | Asynchronous system
                  | (message-passing and          | (message-passing and
                  | shared memory)                | shared memory)
------------------+-------------------------------+------------------------------
No failure        | agreement attainable;         | agreement attainable;
                  | common knowledge attainable   | concurrent common knowledge
                  |                               | attainable
Crash failure     | agreement attainable;         | agreement not attainable
                  | f < n processes               |
Byzantine failure | agreement attainable;         | agreement not attainable
                  | f ≤ ⌊(n - 1)/3⌋ Byzantine     |
                  | processes                     |
In a failure-free system, consensus can be attained in a straightforward manner.

Consensus Problem (all processes have an initial value)

Agreement: All non-faulty processes must agree on the same (single) value.

Validity: If all the non-faulty processes have the same initial value, then the agreed upon value
by all the non-faulty processes must be that same value.

Termination: Each non-faulty process must eventually decide on a value.

Consensus Problem in Asynchronous Systems.

The overhead bounds are for the given algorithms, and not necessarily tight bounds for the
problem.

Solvable variant | Failure model and overhead       | Definition
-----------------+----------------------------------+----------------------------------
Reliable         | Crash failure, n > f (MP)        | Validity, Agreement,
broadcast        |                                  | Integrity conditions
k-set            | Crash failure, f < k < n         | size of the set of values agreed
consensus        | (MP and SM)                      | upon must be less than k
ɛ-agreement      | Crash failure, n ≥ 5f + 1 (MP)   | values agreed upon are within ɛ
                 |                                  | of each other
Renaming         | up to f fail-stop processes,     | select a unique name from
                 | n ≥ 2f + 1 (MP);                 | a set of names
                 | Crash failure, f ≤ n - 1 (SM)    |
Circumventing the impossibility results for consensus in asynchronous
systems:

STEPS FOR BYZANTINE GENERALS (ITERATIVE FORMULATION), SYNCHRONOUS,
MESSAGE-PASSING:


Byzantine Agreement (single source has an initial value)

Agreement: All non-faulty processes must agree on the same value.

Validity: If the source process is non-faulty, then the agreed upon value by all the non- faulty
processes must be the same as the initial value of the source.


STEPS FOR BYZANTINE GENERALS (RECURSIVE FORMULATION),


SYNCHRONOUS, MESSAGE-PASSING:


CODE FOR THE PHASE KING ALGORITHM:

Each phase has a unique "phase king" derived, say, from the PID. Each phase has two rounds:

 1. In the first round, each process sends its estimate to all other processes.

 2. In the second round, the "phase king" process arrives at an estimate based on the values it
received in the first round, and broadcasts its new estimate to all others.

Fig. Message pattern for the phase-king algorithm.



PHASE KING ALGORITHM CODE:

The algorithm has (f + 1) phases and (f + 1)[(n - 1)(n + 1)] messages, and can tolerate up to f < ⌈n/4⌉ malicious processes.

Correctness Argument

 1. Among the f + 1 phases, there is at least one phase k in which the phase king is non-malicious.

 2. In phase k, all non-malicious processes Pi and Pj will have the same estimate of the consensus
value as Pk does. There are three cases:

 Pi and Pj use their own majority values. (Pi's mult > n/2 + f.)

 Pi uses its majority value; Pj uses the phase king's tie-breaker value. (Pi's mult > n/2 + f,
Pj's mult > n/2 for the same value.)

 Pi and Pj use the phase king's tie-breaker value. (In the phase in which Pk is non-
malicious, it sends the same value to Pi and Pj.)

In all three cases, Pi and Pj end up with the same value as their estimate.

 If all non-malicious processes have the value x at the start of a phase, they will continue
to have x as the consensus value at the end of the phase.
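The following Python sketch simulates the two rounds of each phase; the faulty hook is a hypothetical test aid for corrupting claimed values, and real processes would of course exchange messages rather than share lists.

def phase_king(values, f, faulty=lambda pid, v: v):
    """Sketch of the phase-king algorithm for n processes tolerating f < n/4
    malicious faults. `values` are the initial estimates; `faulty(pid, v)` lets a
    test corrupt what process pid claims (the identity means all are honest)."""
    n = len(values)
    est = list(values)
    for phase in range(f + 1):             # f + 1 phases; king = process `phase`
        # Round 1: all-to-all exchange of current estimates.
        received = [[faulty(j, est[j]) for j in range(n)] for _ in range(n)]
        majority, mult = [], []
        for i in range(n):
            cnt1 = received[i].count(1)
            majority.append(1 if cnt1 > n // 2 else 0)
            mult.append(max(cnt1, n - cnt1))
        # Round 2: the phase king broadcasts its majority as a tie-breaker.
        tiebreak = faulty(phase, majority[phase])
        for i in range(n):
            # Keep own majority only if it is strong enough; else trust the king.
            est[i] = majority[i] if mult[i] > n // 2 + f else tiebreak
    return est

print(phase_king([1, 0, 1, 1, 1], f=1))    # all honest: converges to all 1s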

CODE FOR THE EPSILON CONSENSUS (MESSAGE-PASSING, ASYNCHRONOUS):

Agreement: All non-faulty processes must make a decision and the values decided upon by any
two non-faulty processes must be within ɛ of each other.

Validity: If a non-faulty process Pi decides on some value vi , then that value must be within the
range of values initially proposed by the processes.

Termination: Each non-faulty process must eventually decide on a value. The algorithm for the
message-passing model assumes n ≥ 5f + 1, although the problem is solvable for n > 3f + 1.

 Main loop simulates sync rounds.

 Main lines (1d)-(1f): processes perform an all-to-all message exchange.

 Each process broadcasts its estimate of the consensus value, and awaits n - f similar
messages from other processes.


 The processes' estimates of the consensus value converge at a particular rate,
until each estimate is within ɛ of every other process's estimate.

 The number of rounds is determined by lines (1a)-(1c).


TWO-PROCESS WAIT-FREE CONSENSUS USING FIFO QUEUE, COMPARE &


SWAP:

Wait-free Shared Memory Consensus using Shared Objects:

It is not possible to go from a bivalent to a univalent state if even a single failure is allowed. The
difficulty is in not being able to read and write a variable atomically.

 It is not possible to reach consensus in an asynchronous shared memory system using
Read/Write atomic registers, even if a single process can fail by crashing.

 There is no wait-free consensus algorithm for reaching consensus in an asynchronous
shared memory system using Read/Write atomic registers.

To overcome these negative results:

 Weakening the consensus problem, e.g., k-set consensus, approximate consensus, and
renaming using atomic registers.

 Using memory that is stronger than atomic Read/Write memory to design wait- free
consensus algorithms. Such a memory would need corresponding access primitives.

Are there objects (with supporting operations) using which there is a wait-free (i.e., (n - 1)-crash
resilient) algorithm for reaching consensus in an n-process system? Yes, e.g., Test&Set, Swap,
Compare&Swap. The crash failure model requires the solutions to be wait-free.

TWO-PROCESS WAIT-FREE CONSENSUS USING FIFO QUEUE:


WAIT-FREE CONSENSUS USING COMPARE & SWAP:
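Since Python has no user-visible hardware CAS, the sketch below emulates an atomic Compare&Swap with a lock to show the decision logic; the class and names are illustrative assumptions only.

import threading

class CASConsensus:
    """Wait-free consensus from a Compare&Swap object (sketch). The lock makes
    the compare-and-swap atomic; hardware CAS plays this role in practice."""

    EMPTY = object()

    def __init__(self):
        self._value = CASConsensus.EMPTY
        self._lock = threading.Lock()

    def _compare_and_swap(self, expected, new):
        with self._lock:                   # atomic compare-and-swap step
            old = self._value
            if old is expected:
                self._value = new
            return old

    def decide(self, my_input):
        # The first CAS wins; every later process adopts the winner's value.
        old = self._compare_and_swap(CASConsensus.EMPTY, my_input)
        return my_input if old is CASConsensus.EMPTY else old

obj = CASConsensus()
results = []
threads = [threading.Thread(target=lambda v=v: results.append(obj.decide(v)))
           for v in ("a", "b", "c")]
for t in threads: t.start()
for t in threads: t.join()
print(results)                             # all three decide the same value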


NONBLOCKING UNIVERSAL ALGORITHM:

Universality of Consensus Objects

An object is defined to be universal if that object along with read/write registers can simulate
any other object in a wait-free manner. In any system containing up to k processes, an object X
such that CN(X) = k is universal.

For any system with up to k processes, the universality of objects X with consensus number k is
shown by giving a universal algorithm to wait-free simulate any object using objects of type X
and read/write registers.

This is shown in two steps.

 1. A universal algorithm to wait-free simulate any object whatsoever using read/write
registers and arbitrary k-processor consensus objects is given. This is the main step.

 2. Then, the arbitrary k-process consensus objects are simulated with objects of type X,
having consensus number k. This trivially follows after the first step.

Any object X with consensus number k is universal in a system with n ≤ k processes.

A nonblocking operation, in the context of shared memory operations, is an operation that may
not complete itself but is guaranteed to complete at least one of the pending operations in a
finite number of steps.

Nonblocking Universal Algorithm:

The linked list stores the linearized sequence of operations and states following each operation.

Operations to the arbitrary object Z are simulated in a nonblocking way using an arbitrary
consensus object (the field op.next in each record) which is accessed via the Decide call.

Each process attempts to thread its own operation next into the linked list.

 There are as many universal objects as there are operations to thread.

 A single pointer/counter cannot be used instead of the array Head. Because reading and
updating the pointer cannot be done atomically in a wait-free manner.

 Linearization of the operations is given by the sequence number. The algorithm is
nonblocking.


Checkpointing and rollback recovery: Introduction


 Rollback recovery protocols restore the system back to a consistent state after a failure.
 They achieve fault tolerance by periodically saving the state of a process during failure-
free execution.
 They treat a distributed system application as a collection of processes that communicate
over a network.
Checkpoints
The saved state is called a checkpoint, and the procedure of restarting from a previously
checkpointed state is called rollback recovery. A checkpoint can be saved on either stable
storage or volatile storage.
Why is rollback recovery of distributed systems complicated?
Messages induce inter-process dependencies during failure-free operation
Rollback propagation
The dependencies among messages may force some of the processes that did not fail to roll back.
This phenomenon of cascaded rollback is called the domino effect.
Uncoordinated checkpointing
If each process takes its checkpoints independently, then the system cannot avoid the domino
effect – this scheme is called independent or uncoordinated checkpointing.
Techniques that avoid the domino effect
1. Coordinated checkpointing rollback recovery - Processes coordinate their checkpoints to
form a system-wide consistent state.
2. Communication-induced checkpointing rollback recovery - Forces each process to take
checkpoints based on information piggybacked on application messages.
3. Log-based rollback recovery - Combines checkpointing with logging of non-
deterministic events; it relies on the piecewise deterministic (PWD) assumption.
Background and definitions
System model
 A distributed system consists of a fixed number of processes, P1, P2, ..., PN, which
communicate only through messages.
 Processes cooperate to execute a distributed application and interact with the outside world
by receiving and sending input and output messages, respectively.
 Rollback-recovery protocols generally make assumptions about the reliability of the
inter-process communication.
 Some protocols assume that the communication uses first-in-first-out (FIFO) order, while
other protocols assume that the communication subsystem can lose, duplicate, or reorder
messages.
 Rollback-recovery protocols therefore must maintain information about the internal
interactions among processes and also the external interactions with the outside world.

An example of a distributed system with three processes.

A local checkpoint
 All processes save their local states at certain instants of time.
 A local checkpoint is a snapshot of the state of the process at a given instant.
 Assumptions:
– A process stores all local checkpoints on stable storage.
– A process is able to roll back to any of its existing local checkpoints.

 Ci,k – the kth local checkpoint at process Pi


 Ci,0 – a process Pi takes a checkpoint Ci,0 before it starts execution
Consistent states
 A global state of a distributed system is a collection of the individual states of all
participating processes and the states of the communication channels
 Consistent global state
– a global state that may occur during a failure-free execution of the distributed computation
– if a process's state reflects a message receipt, then the state of the corresponding sender must reflect the sending of the message
 A global checkpoint is a set of local checkpoints, one from each process

 A consistent global checkpoint is a global checkpoint such that no message is sent by a process after taking its local checkpoint that is received by another process before taking its checkpoint.


 For instance, the figure shows two examples of global states.
 The state in Figure (a) is consistent and the state in Figure (b) is inconsistent.
 Note that the consistent state in Figure (a) shows message m1 to have been sent but not
yet received, but that is alright.
 The state in Figure (a) is consistent because it represents a situation in which, for every message that has been received, there is a corresponding message send event.
 The state in Figure (b) is inconsistent because process P2 is shown to have received m2
but the state of process P1 does not reflect having sent it.
 Such a state is impossible in any failure-free, correct computation. Inconsistent states
occur because of failures.
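To make the consistency condition concrete, the following is a minimal sketch (not from the notes; the counter layout is an assumption) that checks a global checkpoint for orphan messages using cumulative per-channel counters, where sent[i][j] is the number of messages Pi's checkpoint records as sent to Pj and rcvd[j][i] is the number Pj's checkpoint records as received from Pi:

def is_consistent(sent, rcvd):
    n = len(sent)
    for i in range(n):
        for j in range(n):
            # An orphan exists if Pj recorded a receipt that Pi's
            # checkpoint does not record as sent.
            if rcvd[j][i] > sent[i][j]:
                return False
    return True

# Example: P0 sent 2 messages to P1, but P1's checkpoint records 3
# receipts from P0, so one receipt has no matching send event.
sent = [[0, 2], [1, 0]]
rcvd = [[0, 1], [3, 0]]
print(is_consistent(sent, rcvd))  # False

A state in which rcvd never exceeds sent on any channel corresponds to Figure (a); the violation in the example corresponds to Figure (b).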
Interactions with outside world
A distributed system often interacts with the outside world to receive input data or deliver the
outcome of a computation. If a failure occurs, the outside world cannot be expected to roll back.
For example, a printer cannot roll back the effects of printing a character
Outside World Process (OWP)


 It is a special process that interacts with the rest of the system through message passing.
 It is therefore necessary that the outside world see a consistent behavior of the system
despite failures.
 Thus, before sending output to the OWP, the system must ensure that the state from which
the output is sent will be recovered despite any future failure.
A common approach is to save each input message on the stable storage before allowing the
application program to process it.
An interaction with the outside world to deliver the outcome of a computation is shown on the
process-line by the symbol “||”.
Different types of Messages

1. In-transit messages
 messages that have been sent but not yet received
2. Lost messages
 messages whose "send" is done but "receive" is undone due to rollback
3. Delayed messages
 messages whose "receive" is not recorded because the receiving process was either down or the message arrived after rollback
4. Orphan messages
 messages with "receive" recorded but message "send" not recorded
 do not arise if processes roll back to a consistent global state
5. Duplicate messages
 arise due to message logging and replaying during process recovery

In-transit messages
In the figure, the global state {C1,8, C2,9, C3,8, C4,8} shows that message m1 has been sent but not yet received. We call such a message an in-transit message. Message m2 is also an in-transit message.

Delayed messages
Messages whose receive is not recorded because the receiving process was either down or the
message arrived after the rollback of the receiving process, are called delayed messages. For
example, messages m2 and m5 in Figure are delayed messages.
Lost messages
Messages whose send is not undone but receive is undone due to rollback are called lost messages.
This type of messages occurs when the process rolls back to a checkpoint prior to reception of the
message while the sender does not roll back beyond the send operation of the message. In the figure,
message m1 is a lost message.
Duplicate messages
 Duplicate messages arise due to message logging and replaying during process
recovery. For example, in Figure, message m4 was sent and received before the
rollback. However, due to the rollback of process P4 to C4,8 and process P3 to C3,8,
both send and receipt of message m4 are undone.

 When process P3 restarts from C3,8, it will resend message m4.


 Therefore, P4 should not replay message m4 from its log.
 If P4 replays message m4, then message m4 is called a duplicate message.
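The classification above can be summarized in a small decision procedure. This is a simplified sketch (the flags are assumptions, not part of the notes): send_in_state and recv_in_state say whether the restored global state still records the send and receive events, and recv_happened says whether the receive had actually occurred before the rollback:

def classify_message(send_in_state, recv_in_state, recv_happened):
    if send_in_state and not recv_in_state:
        # Receive undone by rollback -> lost; never received -> in-transit
        # (an in-transit message that arrives after the rollback, or while
        # the receiver is down, is a delayed message).
        return "lost" if recv_happened else "in-transit"
    if recv_in_state and not send_in_state:
        return "orphan"  # cannot occur in a consistent global state
    if send_in_state and recv_in_state:
        return "normal"
    # Both send and receive undone: if the receiver replays the message
    # from its log while the sender also resends it, a duplicate arises.
    return "undone (duplicate if both replayed and resent)"

print(classify_message(True, False, True))   # lost
print(classify_message(False, True, True))   # orphan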

Issues in failure recovery


In a failure recovery, we must not only restore the system to a consistent state, but also
appropriately handle messages that are left in an abnormal state due to the failure and recovery

The computation comprises of three processes Pi, Pj , and Pk, connected through a communication
network. The processes communicate solely by exchanging messages over fault- free, FIFO
communication channels.

Processes Pi, Pj , and Pk have taken checkpoints

 The rollback of process 𝑃𝑖 to checkpoint 𝐶𝑖,1 created an orphan message H


 Orphan message I is created due to the roll back of process 𝑃𝑗 to checkpoint 𝐶𝑗,1
 Messages C, D, E, and F are potentially problematic
– Message C: a delayed message
– Message D: a lost message since the send event for D is recorded in the
restored state for 𝑃𝑗, but the receive event has been undone at process 𝑃𝑖.
– Lost messages can be handled by having processes keep a message log of all
the sent messages
– Messages E, F: delayed orphan messages. After resuming execution from their
checkpoints, processes will generate both of these messages

Checkpoint-based recovery
Checkpoint-based rollback-recovery techniques can be classified into three categories:
1. Uncoordinated checkpointing
2. Coordinated checkpointing
3. Communication-induced checkpointing

1. Uncoordinated Checkpointing
 Each process has autonomy in deciding when to take checkpoints
 Advantages
The lower runtime overhead during normal execution

 Disadvantages
1. Domino effect during a recovery
2. Recovery from a failure is slow because processes need to iterate to find a
consistent set of checkpoints
3. Each process maintains multiple checkpoints and periodically invokes a garbage collection algorithm
4. Not suitable for application with frequent output commits
 The processes record the dependencies among their checkpoints caused by message
exchange during failure-free operation
 The following direct dependency tracking technique is commonly used in uncoordinated
checkpointing.
Direct dependency tracking technique
 Assume each process 𝑃𝑖 starts its execution with an initial checkpoint 𝐶𝑖,0
 𝐼𝑖,𝑥 : checkpoint interval, interval between 𝐶𝑖,𝑥−1 and 𝐶𝑖,𝑥
 When 𝑃𝑗 receives a message m during 𝐼𝑗,𝑦 , it records the dependency from 𝐼𝑖,𝑥 to 𝐼𝑗,𝑦,
which is later saved onto stable storage when 𝑃𝑗 takes 𝐶𝑗,𝑦

 When a failure occurs, the recovering process initiates rollback by broadcasting a


dependency request message to collect all the dependency information maintained by each
process.

 When a process receives this message, it stops its execution and replies with the
dependency information saved on the stable storage as well as with the dependency
information, if any, which is associated with its current state.
 The initiator then calculates the recovery line based on the global dependency information
and broadcasts a rollback request message containing the recovery line.

 Upon receiving this message, a process whose current state belongs to the recovery line
simply resumes execution; otherwise, it rolls back to an earlier checkpoint as indicated by
the recovery line. 
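A minimal sketch of the dependency-recording step described above (class and method names are assumptions for illustration): each message piggybacks the sender's id and current checkpoint-interval index, and the receiver records the dependency, which is frozen onto stable storage at its next checkpoint:

class Process:
    def __init__(self, pid):
        self.pid = pid
        self.interval = 0      # current interval index, after C_{pid,0}
        self.deps = set()      # dependencies gathered in this interval
        self.stable_deps = {}  # checkpoint index -> frozen dependency set

    def send(self, msg):
        # Piggyback (sender id, sender's current interval) on the message.
        return (self.pid, self.interval, msg)

    def receive(self, packet):
        sender, sender_interval, msg = packet
        # Records the dependency from I_{sender,x} to the current interval.
        self.deps.add((sender, sender_interval))
        return msg

    def take_checkpoint(self):
        # Taking the checkpoint saves the recorded dependencies on stable storage.
        self.interval += 1
        self.stable_deps[self.interval] = frozenset(self.deps)
        self.deps = set()

p1, p2 = Process(1), Process(2)
p2.receive(p1.send("m"))
p2.take_checkpoint()
print(p2.stable_deps)  # {1: frozenset({(1, 0)})}

On a failure, the initiator would gather these stable dependency sets from all processes to compute the recovery line.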
2. Coordinated Checkpointing
In coordinated checkpointing, processes orchestrate their checkpointing activities so that all
local checkpoints form a consistent global state
Types
1. Blocking Checkpointing: After a process takes a local checkpoint, to prevent orphan
messages, it remains blocked until the entire checkpointing activity is complete
Disadvantages: The computation is blocked during the checkpointing
2. Non-blocking Checkpointing: The processes need not stop their execution while taking
checkpoints. A fundamental problem in coordinated checkpointing is to prevent a process
from receiving application messages that could make the checkpoint inconsistent.

Example (a) : Checkpoint inconsistency


 Message m is sent by 𝑃0 after receiving a checkpoint request from the checkpoint
coordinator
 Assume m reaches 𝑃1 before the checkpoint request
 This situation results in an inconsistent checkpoint since checkpoint 𝐶1,𝑥 shows the receipt
of message m from 𝑃0, while checkpoint 𝐶0,𝑥 does not show m being sent from
𝑃0
Example (b) : A solution with FIFO channels
 If channels are FIFO, this problem can be avoided by preceding the first post-checkpoint
message on each channel by a checkpoint request, forcing each process to take a checkpoint
before receiving the first post-checkpoint message


Impossibility of min-process non-blocking checkpointing


 A min-process, non-blocking checkpointing algorithm is one that forces only a minimum
number of processes to take a new checkpoint, and at the same time it does not force any
process to suspend its computation.

Algorithm
 The algorithm consists of two phases. During the first phase, the checkpoint initiator
identifies all processes with which it has communicated since the last checkpoint and sends
them a request.
 Upon receiving the request, each process in turn identifies all processes it has
communicated with since the last checkpoint and sends them a request, and so on, until
no more processes can be identified.
 During the second phase, all processes identified in the first phase take a checkpoint. The
result is a consistent checkpoint that involves only the participating processes.

 In this protocol, after a process takes a checkpoint, it cannot send any message until the
second phase terminates successfully, although receiving a message after the checkpoint
has been taken is allowable.
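The two phases can be condensed into the following sketch (the process interface and names are assumptions): phase 1 transitively discovers everyone the initiator has communicated with since the last checkpoint, and phase 2 has exactly that set take checkpoints:

from collections import deque

def min_process_checkpoint(initiator, comm):
    # comm[p]: processes p has communicated with since its last checkpoint.
    # Phase 1: transitively identify all processes to involve.
    participants = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        for q in comm.get(p, set()):
            if q not in participants:
                participants.add(q)  # q receives a checkpoint request
                queue.append(q)
    # Phase 2: all identified processes take checkpoints; the resulting
    # consistent checkpoint involves only these participants.
    return participants

comm = {"Pi": {"Pj"}, "Pj": {"Pk"}, "Pk": set(), "Pl": set()}
print(min_process_checkpoint("Pi", comm))  # {'Pi', 'Pj', 'Pk'} (Pl excluded)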
3. Communication-induced Checkpointing
Communication-induced checkpointing is another way to avoid the domino effect, while allowing processes to take some of their checkpoints independently. Processes may be forced to take additional checkpoints.
Two types of checkpoints
1. Autonomous checkpoints
2. Forced checkpoints
The checkpoints that a process takes independently are called local checkpoints, while those that
a process is forced to take are called forced checkpoints.
 Communication-induced check pointing piggybacks protocol-related information on each application message
 The receiver of each application message uses the piggybacked information to determine
if it has to take a forced checkpoint to advance the global recovery line
 The forced checkpoint must be taken before the application may process the contents of
the message
 In contrast with coordinated check pointing, no special coordination messages are
exchanged
Two types of communication-induced checkpointing
1. Model-based checkpointing
2. Index-based checkpointing.

Model-based checkpointing
 Model-based checkpointing prevents patterns of communications and checkpoints
that could result in inconsistent states among the existing checkpoints.
 No control messages are exchanged among the processes during normal operation. All
information necessary to execute the protocol is piggybacked on application messages

 There are several domino-effect-free checkpoint and communication model.


 The MRS (mark, send, and receive) model of Russell avoids the domino effect by
ensuring that within every checkpoint interval all message receiving events precede all
message-sending events.
Index-based checkpointing.
 Index-based communication-induced checkpointing assigns monotonically increasing
indexes to checkpoints, such that the checkpoints having the same index at different
processes form a consistent state.
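A minimal sketch of this indexing idea (the class and method names are assumptions, in the spirit of index-based protocols): each message piggybacks the sender's checkpoint index, and a receiver whose index lags takes a forced checkpoint before processing the message, so that checkpoints with equal indexes form a consistent state:

class Process:
    def __init__(self):
        self.index = 0  # index of the latest local checkpoint

    def checkpoint(self):
        self.index += 1  # autonomous (basic) checkpoint

    def send(self, payload):
        return (self.index, payload)  # piggyback the current index

    def deliver(self, packet):
        sender_index, payload = packet
        if sender_index > self.index:
            # Forced checkpoint: catch up to the sender's index before
            # the application processes the message.
            self.index = sender_index
        return payload

p, q = Process(), Process()
p.checkpoint()          # p's index becomes 1
q.deliver(p.send("m"))  # q is forced to index 1 before processing "m"
print(q.index)          # 1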
Log-based rollback recovery
A log-based rollback recovery makes use of deterministic and nondeterministic events in a
computation.

Deterministic and non-deterministic events


 Log-based rollback recovery exploits the fact that a process execution can be modeled
as a sequence of deterministic state intervals, each starting with the execution of a non-
deterministic event.
 A non-deterministic event can be the receipt of a message from another process or an
event internal to the process.
 Note that a message send event is not a non-deterministic event.
 For example, in Figure, the execution of process P0 is a sequence of four deterministic
intervals. The first one starts with the creation of the process, while the remaining three
start with the receipt of messages m0, m3, and m7, respectively.
 Send event of message m2 is uniquely determined by the initial state of P0 and by the
receipt of message m0, and is therefore not a non-deterministic event.
 Log-based rollback recovery assumes that all non-deterministic events can be identified
and their corresponding determinants can be logged into the stable storage.
 Determinant: the information needed to "replay" the occurrence of a non-deterministic event (e.g., message reception).


 During failure-free operation, each process logs the determinants of all non-
deterministic events that it observes onto the stable storage. Additionally, each process
also takes checkpoints to reduce the extent of rollback during recovery.


The no-orphans consistency condition


Let e be a non-deterministic event that occurs at process p. We define the following:
• Depend(e): the set of processes that are affected by a non-deterministic event e.
• Log(e): the set of processes that have logged a copy of e’s determinant in their volatile
memory.
• Stable(e): a predicate that is true if e’s determinant is logged on the stable storage.

Suppose a set of processes crashes. A process p becomes an orphan when p itself does not fail and p's state depends on the execution of a non-deterministic event e whose determinant cannot be recovered from the stable storage or from the volatile memory of a surviving process. Formally, the always-no-orphans condition can be stated as follows:
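∀e: ¬Stable(e) ⇒ Depend(e) ⊆ Log(e)

That is, if e's determinant is not yet logged on the stable storage, then every process whose state depends on e must hold a copy of e's determinant in its volatile log; otherwise some process in Depend(e) could become an orphan. The pessimistic protocols below strengthen this condition to |Depend(e)| = 0.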

Types

1. Pessimistic Logging
 Pessimistic logging protocols assume that a failure can occur after any non-deterministic
event in the computation. However, in reality failures are rare
 Pessimistic protocols implement the following property, often referred to as synchronous logging, which is stronger than the always-no-orphans condition:
 Synchronous logging
– ∀e: ¬Stable(e) ⇒ |Depend(e)| = 0
 That is, if an event has not been logged on the stable storage, then no process can depend on it.
Example:
 Suppose processes 𝑃1 and 𝑃2 fail as shown, restart from checkpoints B and C, and roll
forward using their determinant logs to deliver again the same sequence of messages as in
the pre-failure execution
 Once the recovery is complete, both processes will be consistent with the state of 𝑃0
that includes the receipt of message 𝑚7 from 𝑃1
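The essence of synchronous logging can be shown in a toy sketch (the structure is assumed for illustration): the determinant of each non-deterministic event is forced to stable storage before the event is allowed to affect the computation, so no process can ever depend on an unlogged event:

stable_log = []  # stands in for stable storage

def deliver(process_state, msg, delivery_order):
    determinant = {"msg": msg, "order": delivery_order}
    stable_log.append(determinant)  # synchronous: logged before delivery
    process_state.append(msg)       # only now may the event be applied

state = []
deliver(state, "m1", 1)
deliver(state, "m2", 2)
print(stable_log)  # both determinants logged before the messages took effect

The performance penalty noted below comes precisely from this blocking write to stable storage on every non-deterministic event.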

• Disadvantage: performance penalty for synchronous logging


• Advantages:
• immediate output commit
• restart from most recent checkpoint
• recovery limited to failed process(es)
• simple garbage collection
• Some pessimistic logging systems reduce the overhead of synchronous logging without
relying on hardware. For example, the sender-based message logging (SBML) protocol
keeps the determinants corresponding to the delivery of each message m in the volatile
memory of its sender.
• The sender-based message logging (SBML) protocol
Two steps.
1. First, before sending m, the sender logs its content in volatile memory.
2. Then, when the receiver of m responds with an acknowledgment that includes the order in which the message was delivered, the sender adds the ordering information to the determinant.
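A minimal sketch of these two SBML steps (class names are assumptions): the sender logs the content in volatile memory before sending, and completes the determinant when the acknowledgment carrying the delivery order arrives:

class Sender:
    def __init__(self):
        self.volatile_log = {}  # msg_id -> {"content", "recv_order"}

    def send(self, msg_id, content):
        # Step 1: log the content in volatile memory before sending.
        self.volatile_log[msg_id] = {"content": content, "recv_order": None}
        return (msg_id, content)

    def on_ack(self, msg_id, recv_order):
        # Step 2: the ack carries the order in which m was delivered.
        self.volatile_log[msg_id]["recv_order"] = recv_order

class Receiver:
    def __init__(self):
        self.next_order = 0

    def deliver(self, packet):
        msg_id, content = packet
        self.next_order += 1
        return (msg_id, self.next_order)  # acknowledgment with the order

s, r = Sender(), Receiver()
s.on_ack(*r.deliver(s.send("m1", "hello")))
print(s.volatile_log)  # m1 with its content and delivery order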
2. Optimistic Logging
• Processes log determinants asynchronously to the stable storage
• Optimistically assume that logging will be complete before a failure occurs
• Do not implement the always-no-orphans condition
• To perform rollbacks correctly, optimistic logging protocols track causal dependencies
during failure free execution
• Optimistic logging protocols require a non-trivial garbage collection scheme
• Pessimistic protocols need only keep the most recent checkpoint of each process, whereas
optimistic protocols may need to keep multiple checkpoints for each process

• Consider the example shown in the figure. Suppose process P2 fails before the determinant for
m5 is logged to the stable storage. Process P1 then becomes an orphan process and must
roll back to undo the effects of receiving the orphan message m6. The rollback of P1
further forces P0 to roll back to undo the effects of receiving message m7.
• Advantage: better performance in failure-free execution
• Disadvantages:
• coordination required on output commit
• more complex garbage collection
• Since determinants are logged asynchronously, output commit in optimistic logging
protocols requires a guarantee that no failure scenario can revoke the output. For example,
if process P0 needs to commit output at state X, it must log messages m4 and m7 to the
stable storage and ask P2 to log m2 and m5. In this case, if any process fails, the
computation can be reconstructed up to state X.

3. Causal Logging
• Combines the advantages of both pessimistic and optimistic logging at the expense of a more
complex recovery protocol
• Like optimistic logging, it does not require synchronous access to the stable storage except
during output commit
• Like pessimistic logging, it allows each process to commit output independently and never
creates orphans, thus isolating processes from the effects of failures at other processes
• Make sure that the always-no-orphans property holds
• Each process maintains information about all the events that have causally affected its state

• Consider the example in the figure. Messages m5 and m6 are likely to be lost on the failures of P1 and P2 at the indicated instants.
• Process P0 at state X will have logged the determinants of the nondeterministic events that causally precede its state according to Lamport's happened-before relation.
• These events consist of the delivery of messages m0, m1, m2, m3, and m4.
• The determinant of each of these non-deterministic events is either logged on the stable
storage or is available in the volatile log of process P0.
• The determinant of each of these events contains the order in which its original receiver
delivered the corresponding message.

• The message sender, as in sender-based message logging, logs the message content. Thus,
process P0 will be able to “guide” the recovery of P1 and P2 since it knows the order in
which P1 should replay messages m1 and m3 to reach the state from which P1 sent message
m4.
• Similarly, P0 has the order in which P2 should replay message m2 to be consistent with
both P0 and P1.
• The content of these messages is obtained from the sender log of P0 or regenerated
deterministically during the recovery of P1 and P2.
• Note that information about messages m5 and m6 is lost due to failures. These messages
may be resent after recovery possibly in a different order.
• However, since they did not causally affect the surviving process or the outside world, the resulting state is consistent.


KOO AND TOUEG COORDINATED CHECKPOINTING AND RECOVERY


TECHNIQUE:
 Koo and Toueg coordinated check pointing and recovery technique takes a consistent set
of checkpoints and avoids the domino effect and livelock problems during the recovery.
• Includes 2 parts: the check pointing algorithm and the recovery algorithm

A. The Checkpointing Algorithm


The checkpoint algorithm makes the following assumptions about the distributed system:
 Processes communicate by exchanging messages through communication channels.
 Communication channels are FIFO.
 Assume that end-to-end protocols (such as the sliding window protocol) exist to cope with message loss due to rollback recovery and communication failure.
 Communication failures do not divide the network.
The checkpoint algorithm takes two kinds of checkpoints on the stable storage: Permanent and
Tentative.
A permanent checkpoint is a local checkpoint at a process and is a part of a consistent global
checkpoint.
A tentative checkpoint is a temporary checkpoint that is made a permanent checkpoint on
the successful termination of the checkpoint algorithm.
The algorithm consists of two phases.

First Phase
1. An initiating process Pi takes a tentative checkpoint and requests all other processes to take
tentative checkpoints. Each process informs Pi whether it succeeded in taking a tentative
checkpoint.
2. A process says “no” to a request if it fails to take a tentative checkpoint
3. If Pi learns that all the processes have successfully taken tentative checkpoints, Pi decides
that all tentative checkpoints should be made permanent; otherwise, Pi decides that all the
tentative checkpoints should be thrown-away.
Second Phase
1. Pi informs all the processes of the decision it reached at the end of the first phase.
2. A process, on receiving the message from Pi will act accordingly.
3. Either all or none of the processes advance the checkpoint by taking permanent
checkpoints.
4. The algorithm requires that after a process has taken a tentative checkpoint, it cannot
send messages related to the basic computation until it is informed of Pi’s decision.
Correctness: for two reasons
i. Either all or none of the processes take permanent checkpoint
ii. No process sends message after taking permanent checkpoint
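The two phases amount to an all-or-nothing agreement, sketched below (the process interface is an assumption, not part of the original algorithm text):

def koo_toueg_checkpoint(processes):
    # Phase 1: every process tries to take a tentative checkpoint and
    # reports success ("yes") or failure ("no") to the initiator.
    votes = [p.take_tentative_checkpoint() for p in processes]
    commit = all(votes)
    # Phase 2: the initiator's decision is broadcast; each process makes
    # its tentative checkpoint permanent or discards it. (Between the
    # phases a process must not send computation messages.)
    for p in processes:
        p.finalize(commit)
    return commit

class Proc:
    def __init__(self, name, can_checkpoint=True):
        self.name, self.ok, self.permanent = name, can_checkpoint, False

    def take_tentative_checkpoint(self):
        return self.ok  # says "no" if it cannot take a tentative checkpoint

    def finalize(self, commit):
        self.permanent = commit  # all or none become permanent

procs = [Proc("Pi"), Proc("Pj"), Proc("Pk", can_checkpoint=False)]
print(koo_toueg_checkpoint(procs))  # False: all tentatives are discarded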
An Optimization
The above protocol may cause a process to take a checkpoint even when it is not necessary for consistency. Since taking a checkpoint is an expensive operation, such unnecessary checkpoints should be avoided.

B. The Rollback Recovery Algorithm


The rollback recovery algorithm restores the system state to a consistent state after a failure. The
rollback recovery algorithm assumes that a single process invokes the algorithm. It assumes that
the checkpoint and the rollback recovery algorithms are not invoked concurrently. The rollback
recovery algorithm has two phases.
First Phase
1. An initiating process Pi sends a message to all other processes to check if they all are
willing to restart from their previous checkpoints.
2. A process may reply “no” to a restart request due to any reason (e.g., it is already
participating in a check pointing or a recovery process initiated by some other process).
3. If Pi learns that all processes are willing to restart from their previous checkpoints, Pi
decides that all processes should roll back to their previous checkpoints. Otherwise,
4. Pi aborts the roll back attempt and it may attempt a recovery at a later time.
Second Phase
1. Pi propagates its decision to all the processes.
2. On receiving Pi’s decision, a process acts accordingly.
3. During the execution of the recovery algorithm, a process cannot send messages related
to the underlying computation while it is waiting for Pi’s decision.
Correctness: Resume from a consistent state
Optimization: May not to recover all, since some of the processes did not change anything

In the event of failure of process X, the above protocol will require processes X, Y, and Z to restart from checkpoints x2, y2, and z2, respectively.
Process Z need not roll back because there has been no interaction between process Z and the
other two processes since the last checkpoint at Z.

ALGORITHM FOR ASYNCHRONOUS CHECKPOINTING AND RECOVERY:


This section presents the algorithm of Juang and Venkatesan for recovery in a system that uses asynchronous check pointing.
A. System Model and Assumptions
The algorithm makes the following assumptions about the underlying system:
 The communication channels are reliable, deliver the messages in FIFO order and have
infinite buffers.
 The message transmission delay is arbitrary, but finite.
 Underlying computation/application is event-driven: process P is at state s, receives message m, processes the message, moves to state s' and sends messages out. So the triplet (s, m, msgs_sent) represents the state of P
Two types of log storage are maintained:
– Volatile log: short time to access, but lost if the processor crashes; moved to the stable log periodically.
– Stable log: longer time to access, but retained even after a crash.
B. Asynchronous Check pointing
– After executing an event, the triplet is recorded without any synchronization with
other processes.
– A local checkpoint consists of a set of such records; they are first stored in the volatile log and then moved to the stable log.
C. The Recovery Algorithm
Notations and data structure
The following notations and data structure are used by the algorithm:
• RCVDi←j(CkPti) represents the number of messages received by processor pi from processor
pj , from the beginning of the computation till the checkpoint CkPti.

• SENTi→j(CkPti) represents the number of messages sent by processor pi to processor pj , from


the beginning of the computation till the checkpoint CkPti.

Basic idea
 Since the algorithm is based on asynchronous check pointing, the main issue in the
recovery is to find a consistent set of checkpoints to which the system can be restored.
 The recovery algorithm achieves this by making each processor keep track of both the

M.A.M COLLEGE OF ENGINEERING

Downloaded from EnggTree.com


EnggTree.com
CS3551 DISTRIBUTED COMPUTING

number of messages it has sent to other processors as well as the number of messages it
has received from other processors.
 Whenever a processor rolls back, it is necessary for all other processors to find out if any
message has become an orphan message. Orphan messages are discovered by comparing
the number of messages sent to and received from neighboring processors.
For example, if RCVDi←j(CkPti) > SENTj→i(CkPtj) (that is, the number of messages received by processor pi from processor pj is greater than the number of messages sent by processor pj to processor pi, according to the current states of the processors), then one or more messages at processor pj are orphan messages.
The Algorithm
When a processor restarts after a failure, it broadcasts a ROLLBACK message announcing that it had failed.
Procedure RollBack_Recovery
processor pi executes the following:
STEP (a)
if processor pi is recovering after a failure then
CkPti := latest event logged in the stable storage
else
CkPti := latest event that took place in pi {The latest event at pi can be either in stable or in
volatile storage.}
end if
STEP (b)
for k = 1 to N {N is the number of processors in the system} do
for each neighboring processor pj do
compute SENTi→j(CkPti)
send a ROLLBACK(i, SENTi→j(CkPti)) message to pj
end for
for every ROLLBACK(j, c) message received from a neighbor j do
if RCVDi←j(CkPti) > c {Implies the presence of orphan messages} then
find the latest event e such that RCVDi←j(e) = c {Such an event e may be in the volatile storage
or stable storage.}
CkPti := e
end if

end for
end for{for k}
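The loop above can be condensed into the following sketch (the data layout is an assumption): checkpoints[p] lists p's logged events oldest to latest, each carrying cumulative SENT/RCVD counters per neighbor, and ckpt[p] is the index of p's current recovery candidate:

def one_round(ckpt, checkpoints, neighbors):
    changed = False
    for i in neighbors:  # every processor broadcasts ROLLBACK(i, SENT)
        for j in neighbors[i]:
            c = checkpoints[i][ckpt[i]]["sent"].get(j, 0)  # SENT i->j
            # j compares its RCVD j<-i at its candidate with the received c.
            while checkpoints[j][ckpt[j]]["rcvd"].get(i, 0) > c:
                ckpt[j] -= 1  # roll back past the orphan messages
                changed = True
    return changed

def recover(ckpt, checkpoints, neighbors):
    # With N processors, at most N iterations are needed.
    while one_round(ckpt, checkpoints, neighbors):
        pass
    return ckpt

checkpoints = {
    "X": [{"sent": {"Y": 0}, "rcvd": {"Y": 0}},
          {"sent": {"Y": 1}, "rcvd": {"Y": 2}},
          {"sent": {"Y": 2}, "rcvd": {"Y": 3}}],
    "Y": [{"sent": {"X": 2}, "rcvd": {"X": 1}}],
}
neighbors = {"X": ["Y"], "Y": ["X"]}
print(recover({"X": 2, "Y": 0}, checkpoints, neighbors))  # {'X': 1, 'Y': 0}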
D. An Example
Consider an example shown in Figure 2 consisting of three processors. Suppose processor Y
fails and restarts. If event ey2 is the latest checkpointed event at Y, then Y will restart from the
state corresponding to ey2.


Figure 2: An example of the Juang-Venkatesan algorithm.


 Because of the broadcast nature of ROLLBACK messages, the recovery algorithm is
initiated at processors X and Z.
 Initially, X, Y, and Z set CkPtX ← ex3, CkPtY ← ey2 and CkPtZ ← ez2, respectively,
and X, Y, and Z send the following messages during the first iteration:
 Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z;
 X sends ROLLBACK(X,2) to Y and ROLLBACK(X,0) to Z;
 Z sends ROLLBACK(Z,0) to X and ROLLBACK(Z,1) to Y.
Since RCVDX←Y (CkPtX) = 3 > 2 (2 is the value received in the ROLLBACK(Y,2) message from Y), X will set CkPtX to ex2 satisfying RCVDX←Y (ex2) = 1 ≤ 2.

Since RCVDZ←Y (CkPtZ) = 2 > 1, Z will set CkPtZ to ez1 satisfying RCVDZ←Y (ez1) = 1 ≤
1.
At Y, RCVDY←X(CkPtY) = 1 < 2 and RCVDY←Z(CkPtY) = 1 = SENTZ→Y(CkPtZ).
Y need not roll back further.
In the second iteration, Y sends ROLLBACK(Y,2) to X and ROLLBACK(Y,1) to Z;

Z sends ROLLBACK(Z,1) to Y and ROLLBACK(Z,0) to X;


X sends ROLLBACK(X,0) to Z and ROLLBACK(X, 1) to Y.
If Y rolls back beyond ey3 and loses the message from X that caused ey3, X can resend this message to Y, because ex2 is logged at X and this message is available in the log. The second and third iterations will progress in the same manner. The set of recovery points chosen at the end of the first iteration, {ex2, ey2, ez1}, is consistent, and no further rollback occurs.

UNIT V
CLOUD COMPUTING
Definition of Cloud Computing – Characteristics of Cloud – Cloud Deployment Models –
Cloud Service Models – Driving Factors and Challenges of Cloud – Virtualization – Load
Balancing – Scalability and Elasticity – Replication – Monitoring – Cloud Services and
Platforms: Compute Services – Storage Services – Application Services

Definition of Cloud Computing

Cloud computing is on-demand access, via the internet, to computing resources: applications, servers (physical servers and virtual servers), data storage, development tools, networking capabilities, and more, hosted at a remote data center managed by a cloud services provider (or CSP). The CSP makes these resources available for a monthly subscription fee or bills them according to usage.

Cloud computing is a virtualization-based technology that allows us to create,


configure, and customize applications via an internet connection. The cloud technology
includes a development platform, hard disk, software application, and database.

The term cloud refers to a network or the internet. It is a technology that uses remote
servers on the internet to store, manage, and access data online rather than local drives. The
data can be anything such as files, images, documents, audio, video, and more.

Cloud Computing is defined as storing and accessing of data and computing services
over the internet. It doesn’t store any data on your personal computer. It is the on-demand
availability of computer services like servers, data storage, networking, databases, etc. The
main purpose of cloud computing is to give access to data centers to many users. Users can
also access data from a remote server.

Cloud computing decreases the hardware and software demand from the user's side. The only thing the user must be able to run is the cloud computing system's interface software, which can be as simple as a Web browser, and the cloud network takes care of the rest. We all have experienced cloud computing at some instant of time; some of the popular cloud services we have used or are still using are mail services like Gmail, Hotmail, or Yahoo.
Examples of Cloud Computing Services: AWS, Azure, and Google Cloud Platform.

Characteristics of Cloud

The characteristics of cloud computing are given below:


1) Agility
The cloud works in a distributed computing environment. It shares resources among users and
works very fast.

2) High availability and reliability


The availability of servers is high and more reliable because the chances of infrastructure
failure are minimum.

3) High Scalability
Cloud offers "on-demand" provisioning of resources on a large scale, without having engineers
for peak loads.

4) Multi-Sharing
With the help of cloud computing, multiple users and applications can work more efficiently
with cost reductions by sharing common infrastructure.

5) Device and Location Independence
Cloud computing enables the users to access systems using a web browser regardless of their
location or what device they use e.g. PC, mobile phone, etc. As infrastructure is off-site
(typically provided by a third-party) and accessed via the Internet, users can connect from
anywhere.

6) Maintenance
Maintenance of cloud computing applications is easier, since they do not need to be installed
on each user's computer and can be accessed from different places. So, it reduces the cost also.

7) Low Cost
By using cloud computing, the cost will be reduced because to take the services of cloud
computing, IT company need not to set its own infrastructure and pay-as-per usage of
resources.

8) Services in the pay-per-use mode


Application Programming Interfaces (APIs) are provided to the users so that they can access
services on the cloud by using these APIs and pay the charges as per the usage of services.

Cloud Deployment Models

The cloud deployment model identifies the specific type of cloud environment based
on ownership, scale, access, and the cloud’s nature and purpose. There are various deployment
models are based on the location and who manages the infrastructure.

Type of Cloud Deployment Model

Here are some important types of Cloud Deployment models:

 Private Cloud: Resource managed and used by the organization.


 Public Cloud: Resource available for the general public under the Pay as you go
model.
 Community Cloud: Resource shared by several organizations, usually in the same
industry.
 Hybrid Cloud: This cloud deployment model is partly managed by the service
provided and partly by the organization.

Public Cloud
The public cloud is available to the general public, and resources are shared between
all users. They are available to anyone, from anywhere, using the Internet. The public cloud
deployment model is one of the most popular types of cloud.

This computing model is hosted at the vendor’s data center. The public cloud model
makes the resources, such as storage and applications, available to the public over the
WWW. It serves all the requests; therefore, resources are almost infinite.

Characteristics of Public Cloud

Here are the essential characteristics of the Public Cloud:

 Uniformly designed Infrastructure


 Works on the Pay-as-you-go basis
 Economies of scale
 SLA guarantees that all users have a fair share with no priority
 It is a multi-tenant architecture, so data is at a higher risk of being leaked

Advantages of Public Cloud Deployments

Here are the pros/benefits of the Public Cloud Deployment Model:

 Highly available anytime and anywhere, with robust permission and authentication mechanisms.
 There is no need to maintain the cloud.
 Does not have any limit on the number of users.
 The cloud service providers fully subsidize the entire Infrastructure. Therefore, you
don’t need to set up any hardware.
 Does not cost you any maintenance charges as the service provider does it.
 It works on the Pay as You Go model, so you don’t have to pay for items you don’t
use.
 There is no significant upfront fee, making it excellent for enterprises that require
immediate access to resources.

Disadvantages of Public Cloud Deployments

Here are the cons/drawbacks of the Public Cloud Deployment Model:

 It has lots of issues related to security.


 Privacy and organizational autonomy are not possible.
 You don’t control the systems hosting your business applications.

Private Cloud Model

The private cloud deployment model is a dedicated environment for one user or customer.
You don’t share the hardware with any other users, as all the hardware is yours. It is a one-to-
one environment for single use, so there is no need to share your hardware with anyone else.
The main difference between private and public cloud deployment models is how you handle
the hardware. It is also referred to as an "internal cloud," which refers to the ability to access systems and services within an organization's boundary.


Characteristics of Private Cloud

Here are the essential characteristics of the Private Cloud:

 It has a non-uniformly designed infrastructure.


 Very low risk of data leaks.
 Provides End-to-End Control.
 Weak SLA, but you can apply custom policies.
 Internal Infrastructure to manage resources easily.

Advantages of Private Cloud Deployments

Here are the pros/benefits of the Private Cloud Deployment Model:

 You have complete command over service integration, IT operations, policies, and user
behavior.
 Companies can customize their solution according to market demands.
 It offers exceptional reliability in performance.
 A private cloud enables the company to tailor its solution to meet specific needs.
 It provides higher control over system configuration according to the company’s
requirements.
 Private cloud works with legacy systems that cannot access the public cloud.
 This Cloud Computing Model is small, and therefore it is easy to manage.
 It is suitable for storing corporate information that only permitted staff can access.
 You can incorporate as many security services as possible to secure your cloud.

Disadvantages of Private Cloud Deployments

Here are the cons/drawbacks of the Private Cloud Deployment Model:

 It is a fully on-premises-hosted cloud that requires significant capital to purchase and


maintain the necessary hardware.
 Companies that want extra computing power must take extra time and money to scale
up their Infrastructure.
 Scalability depends on the choice of hardware.

Hybrid Cloud Model

A hybrid cloud deployment model combines public and private clouds. Creating a
hybrid cloud computing model means that a company uses the public cloud but owns on-
premises systems and provides a connection between the two. They work as one system, which
is a beneficial model for a smooth transition into the public cloud over an extended period.

Some companies cannot operate solely in the public cloud because of security concerns
or data protection requirements. So, they may select the hybrid cloud to combine the
requirements with the benefits of a public cloud. It enables on-premises applications with
sensitive data to run alongside public cloud applications.

Characteristics of Hybrid Cloud

Here are the Characteristics of the Hybrid Cloud:

 Provides betters security and privacy


 Offers improved scalability
 Cost-effective Cloud Deployment Model
 Simplifies data and application portability

Advantages of Hybrid Cloud Deployments

Here are the pros/benefits of the Hybrid Cloud Deployment Model:

 It gives the power of both public and private clouds.


 It offers better security than the Public Cloud.
 Public clouds provide scalability. Therefore, you can only pay for the extra capacity if
required.
 It enables businesses to be more flexible and to design personalized solutions that meet
their particular needs.
 Data is separated correctly, so the chances of data theft by attackers are considerably
reduced.
 It provides robust setup flexibility so that customers can customize their solutions to fit
their requirements.

Disadvantages of Hybrid Cloud Deployments

Here are the cons/drawbacks of the Hybrid Cloud Deployment Model:

 It is applicable only when a company has varied use or demand for managing the
workloads.
 Managing a hybrid cloud is complex, so if you use a hybrid cloud, you may spend too
much.
 Its security features are not as good as those of the Private Cloud.

Community Cloud Model

Community clouds are cloud-based infrastructure models that enable multiple


organizations to share resources and services based on standard regulatory requirements. It
provides a shared platform and resources for organizations to work on their business
requirements. This Cloud Computing model is operated and managed by community members,
third-party vendors, or both. The organizations that share standard business requirements make
up the members of the community cloud.


Advantages of Community Cloud Deployments

Here are the pros/benefits of the Community Cloud Deployment Model:

 You can establish a low-cost private cloud.


 It helps you to do collaborative work on the cloud.
 It is cost-effective, as multiple organizations or communities share the cloud.
 You can share resources, Infrastructure, etc., with multiple organizations.
 It is a suitable model for both collaboration and data sharing.
 Gives better security than the public cloud.
 It offers a collaborative space that allows clients to enhance their efficiency.

Disadvantages of Community Cloud Deployments

Here are the cons/drawbacks of the Community Cloud Deployment Model:

 Because of its restricted bandwidth and storage capacity, community resources often
pose challenges.
 It is not a very popular and widely adopted cloud computing model.
 Security and segmentation are challenging to maintain.

Multi-cloud Model

Multi-cloud computing refers to using public cloud services from many cloud service
providers. A company must run workloads on IaaS or PaaS in a multi-cloud configuration from
multiple vendors, such as Azure, AWS, or Google Cloud Platform.

There are many reasons an organization selects a multi-cloud strategy. Some use it to
avoid vendor lock-in problems, while others combat shadow IT through multi-cloud

deployments. So, employees can still benefit from a specific public cloud service even if it does not meet strict IT policies.

Benefits of Multi-Cloud Deployment Model

Here are the pros/benefits of the Multi-Cloud Deployment Model:

 A multi-cloud deployment model helps organizations choose the specific services that
work best for them.
 It provides a reliable architecture.
 With multi-cloud models, companies can choose the best Cloud service provider based
on contract options, flexibility with payments, and customizability of capacity.
 It allows you to select cloud regions and zones close to your clients.

Disadvantages of Multi-Cloud Deployments

Here are the cons/drawbacks of the Multi-Cloud Deployment Model:

 Multi-cloud adoption increases the complexity of your business.


 Finding developers, engineers, and cloud security experts who know multiple clouds is difficult.
Comparison of Top Cloud Deployment Models

Parameters                   | Public         | Private                                 | Community                      | Hybrid
Setup and use                | Easy           | Need help from a professional IT team   | Require a professional IT team | Require a professional IT team
Scalability and Elasticity   | Very High      | Low                                     | Moderate                       | High
Data Control                 | Little to none | Very High                               | Relatively High                | High
Security and privacy         | Very low       | Very high                               | High                           | Very high
Reliability                  | Low            | High                                    | Higher                         | High
Demand for in-house software | No             | Very high in-house software requirement | No                             | In-house software is not a must
How to select the suitable Cloud Deployment Models

Companies are extensively using these cloud computing models all around the world.
Each of them solves a specific set of problems. So, finding the right Cloud Deployment Model
for you or your company is important.

Here are points you should remember for selecting the right Cloud Deployment Model:

 Scalability: You need to check if your user activity is growing quickly or unpredictably
with spikes in demand.
 Privacy and security: Select a service provider that protects your privacy and the
security of your sensitive data.
 Cost: You must decide how many resources you need for your cloud solution. Then
calculate the approximate monthly cost for those resources with different cloud
providers.
 Ease of use: You must select a model with no steep learning curve.
 Legal Compliance: You need to check whether any relevant law stops you from selecting any specific cloud deployment model.
Cloud Service Models

SaaS, PaaS, and IaaS are the three main cloud computing service model categories.
You can access all three via an Internet browser or online apps available on different devices.
The cloud service model enables teams to collaborate online instead of creating work offline and then sharing it.


Software as a Service (SaaS)

Software as a Service (SaaS) is a web-based deployment model that makes the software
accessible through a web browser. SaaS software users don’t need to care where the software
is hosted, which operating system it uses, or even which programming language it is written
in. The SaaS software is accessible from any device with an internet connection.

This cloud service model ensures that consumers always use the most current version
of the software. The SaaS provider handles maintenance and support. In the SaaS model, users
don’t control the infrastructure, such as storage, processing power, etc.

Characteristics of SaaS

There are the following characteristics of SaaS:

 It is managed from a central location.


 Hosted directly on a remote server.
 It is accessible over the Internet.
 SaaS users are not responsible for hardware and software updates.
 The services are purchased on a pay-as-per-use basis.

Advantages SaaS

Here are the important advantages/pros of SaaS:

 The biggest benefit of using SaaS is that it is easy to set up, so you can start using it
instantly.
 Compared with on-premises software, it is more cost-effective.
 You don’t need to manage or upgrade the software, as it is typically included in a SaaS
subscription or purchase.
 It won’t use your local resources, such as the hard disk typically required to install
desktop software.
 It is a cloud computing service category that provides a wide range of hosted
capabilities and services.
 Developers can easily build and deploy web-based software applications.
 You can easily access it through a browser.

Disadvantages SaaS

Here are the important cons/drawbacks of SaaS:

 Integrations are up to the provider, so it’s impossible to “patch” an integration on your


end.
 SaaS tools may become incompatible with other tools and hardware already used in
your business.
 You depend on the SaaS company’s security measures, so your data may be
compromised if any leaks occur.

Consider Before SaaS Implementation

Need to consider before SaaS implementation:

 It would help if you opted for configuration over customization within a SaaS-based
delivery model.
 You must carefully understand the usage rates and set clear objectives to achieve the
SaaS adoption.
 You can complement your SaaS solution with integrations and security options to make
it more user-oriented.

Platform as a Service (PaaS)

Platform-as-a-Service (PaaS) provides a cloud computing framework for software


application creation and deployment. It is a platform for the deployment and management of
software apps. This flexible cloud computing model scales up automatically on demand. It also
manages the servers, storage, and networking, while the developers manage only the
application part. It offers a runtime environment for application development and deployment
tools.

This Model provides all the facilities required to support the complex life cycle of
building and delivering web applications and services entirely for the Internet. This cloud
computing model enables developers to rapidly develop, run, and manage their apps without
building and maintaining the infrastructure or platform.

Characteristics of PaaS

There are the following characteristics of PaaS:

 Builds on virtualization technology, so computing resources can easily be scaled up


(Auto-scale) or down according to the organization’s needs.
 Support multiple programming languages and frameworks.
 Integrates with web services and databases.

Advantages PaaS

Here are the important benefits/pros of PaaS:

 Simple, cost-effective development and deployment of apps


 Developers can customize SaaS apps without the headache of maintaining the software
 Provide automation of Business Policy
 Easy migration to the Hybrid Model
 It allows developers to build applications without the overhead of the underlying
operating system or cloud infrastructure
 Offers freedom to developers to focus on the application’s design while the platform
takes care of the language and the database
 It helps developers to collaborate with other developers on a single app

Disadvantages of PaaS

Here are the important cons/drawbacks of PaaS:

 You have control over the app’s code and not its infrastructure.
 The PaaS organization stores your data, so it sometimes poses a security risk to your
app’s users.
 Vendors provide varying service levels, so selecting the right services is essential.
 The risk of lock-in with a vendor may affect the ecosystem you need for your
development environment.

Consider Before PaaS Implementation

Here are essential things you need to consider before PaaS implementation:

 Analyze your business needs, decide the automation levels, and also decide whether you want a self-service or fully automated PaaS model.
 You need to determine whether to deploy on a private or public cloud.
 Plan through the customization and efficiency levels.
Infrastructure as a Service (IaaS)

Infrastructure-as-a-Service (IaaS) is a cloud computing service offering on-demand


computing, storage, and networking resources. It usually works on a pay-as-you-go basis.

Organizations can purchase resources on-demand and as needed instead of buying the
hardware outright.

The IaaS cloud vendor hosts the infrastructure components, including the on-premises
data center, servers, storage, networking hardware, and the hypervisor (virtualization layer).

This Model contains the basic building blocks for your web application. It provides
complete control over the hardware that runs your application (storage, servers, VMs, networks
& operating systems). IaaS model gives you the best flexibility and management control over
your IT resources.

Characteristics of IaaS

There are the following characteristics of IaaS:

 Resources are available as a service


 Services are highly scalable
 Dynamic and flexible Cloud Service Model
 GUI and API-based access
 Automate the administrative tasks

Advantages of IaaS

Here are the important benefits/pros of IaaS:

 Easy to automate the deployment of storage, networking, and servers.
 Hardware purchases can be based on consumption.
 Clients keep complete control of their underlying infrastructure.
 The provider can deploy the resources to a customer’s environment anytime.
 It can be scaled up or downsized according to your needs.

Disadvantages of IaaS

Here are the important Cons/drawbacks of IaaS:

 You should ensure that your apps and operating systems are working correctly and
providing the utmost security.
 You’re in charge of the data, so if any of it is lost, it’s up to you to recover it.
 IaaS firms only provide the servers and API, so you must configure everything else.

Consider Before IaaS Implementation

Here are some specific considerations you should remember before IaaS Implementation:

 You should clearly define your access needs and your network’s bandwidth to
facilitate smooth implementation and functioning.
 Plan out detailed data storage and security strategy to streamline the business process.
 Ensure that your organization has a proper disaster recovery plan to keep your data
safe and accessible.

How to Select the Best Cloud Service Provider

Here are some essential criteria for selecting the best cloud service provider:

 Financial stability: Look for a well-financed cloud provider that has steady profits
from the infrastructure. If the company shuts down because of monetary issues, your
solutions will also be in jeopardy.
 Industries that prefer the solution: Before finalizing cloud services, examine its
existing clients and markets. Your cloud service provider should be popular among
companies in your niche or neighboring ones.
 Datacenter locations: To avoid safety risks, ensure that cloud providers enable your
data’s geographical distribution.
 Encryption standards: You should make sure the cloud provider supports major
encryption algorithms.
 Check accreditation and auditing: The widely used online auditing standard is
SSAE. This procedure helps you to verify the safety of online data storage. ISO
27001 certificate verifies that a cloud provider complies with international safety
standards for data storage.
 Backup: The provider should support incremental backups so that you can store
offsite and quickly restore.

Driving Factors and Challenges of Cloud

Data Security and Privacy

Data security is a major concern when switching to cloud computing. User or


organizational data stored in the cloud is critical and private. Even if the cloud service provider

assures data integrity, it is your responsibility to carry out user authentication and
authorization, identity management, data encryption, and access control. Security issues on the
cloud include identity theft, data breaches, malware infections, and a lot more which eventually
decrease the trust amongst the users of your applications. This can in turn lead to potential loss
in revenue alongside reputation and stature. Also, dealing with cloud computing requires
sending and receiving huge amounts of data at high speed, and therefore is susceptible to data
leaks.

Cost Management

Even though almost all cloud service providers have a “Pay As You Go” model, which
reduces the overall cost of the resources being used, there are times when huge costs are
incurred by the enterprise using cloud computing. When resources are under-optimized, say
when servers are not being used to their full potential, hidden costs add up. Degraded
application performance and sudden spikes or overages in usage also add to the overall cost.
Unused resources are another main reason why costs go up: if you turn on a service or a cloud
instance and forget to turn it off during the weekend or when it is not in use, it will increase
the cost without the resources even being used.

Multi-Cloud Environments

Due to an increase in the options available to companies, enterprises no longer use a
single cloud but depend on multiple cloud service providers. Most of these companies use
hybrid cloud tactics, and close to 84% are dependent on multiple clouds. This often proves
cumbersome and difficult for the infrastructure team to manage. The process frequently ends
up being highly complex for the IT team because of the differences between multiple cloud
providers.

Performance Challenges

Performance is an important factor while considering cloud-based solutions. If the


performance of the cloud is not satisfactory, it can drive away users and decrease profits. Even
a little latency while loading an app or a web page can result in a huge drop in the percentage
of users. This latency can be a product of inefficient load balancing, which means that the
server cannot efficiently split the incoming traffic so as to provide the best user experience.


Challenges also arise in the case of fault tolerance, which means the operations continue as
required even when one or more of the components fail.

Interoperability and Flexibility

When an organization uses a specific cloud service provider and wants to switch to
another cloud-based solution, it often turns out to be a tedious procedure, since applications
written for one cloud, along with the application stack, must be re-written for the other cloud.
There is a lack of flexibility in switching from one cloud to another due to the complexities
involved. Handling data movement and setting up security and networking from scratch also
add to the issues encountered when changing cloud solutions, thereby reducing flexibility.

High Dependence on Network

Since cloud computing deals with provisioning resources in real time, it involves
enormous amounts of data transfer to and from the servers. This is only made possible by the
availability of a high-speed network. Although these data and resources are exchanged over
the network, this can prove to be highly vulnerable in case of limited bandwidth or a sudden
outage. Even when enterprises can cut their hardware costs, they need to ensure that the
internet bandwidth is high and that there are zero network outages, or else it can result in a
potential business loss. It is therefore a major challenge for smaller enterprises that have to
maintain the network bandwidth that comes at a high cost.

Lack of Knowledge and Expertise

Due to its complex nature and the high demand for research, working with the cloud
often ends up being a highly tedious task. It requires immense knowledge and wide expertise
on the subject. Although there are a lot of professionals in the field, they need to constantly
update themselves. Cloud computing is a highly paid job due to the extensive gap between
demand and supply. There are a lot of vacancies but very few talented cloud engineers,
developers, and professionals. Therefore, there is a need for upskilling so that these
professionals can actively understand, manage and develop cloud-based applications with
minimum issues and maximum reliability.


Virtualization

Virtualization is a technique for separating a service from the underlying physical
delivery of that service. It is the process of creating a virtual version of something like
computer hardware. It was initially developed during the mainframe era. It involves using
specialized software to create a virtual or software-created version of a computing resource
rather than the actual version of the same resource. With the help of virtualization, multiple
operating systems and applications can run on the same machine and its same hardware at the
same time, increasing the utilization and flexibility of hardware.


Host Machine: The machine on which the virtual machine is going to be built is known as Host
Machine.
Guest Machine: The virtual machine is referred to as a Guest Machine.

Virtualization has a prominent impact on Cloud Computing. In the case of cloud


computing, users store data in the cloud, but with the help of Virtualization, users have the
extra benefit of sharing the infrastructure. Cloud Vendors take care of the required physical
resources, but these cloud providers charge a huge amount for these services which impacts
every user or organization. Virtualization helps Users or Organisations in maintaining those
services which are required by a company through external (third-party) people, which helps
in reducing costs to the company. This is the way through which Virtualization works in Cloud
Computing.

Benefits of Virtualization

 More flexible and efficient allocation of resources.


 Enhance development productivity.


 It lowers the cost of IT infrastructure.
 Remote access and rapid scalability.
 High availability and disaster recovery.
 Pay-per-use of the IT infrastructure on demand.
 Enables running multiple operating systems.

Drawback of Virtualization

 High Initial Investment: Clouds have a very high initial investment, but it is also true
that it will help in reducing the cost of companies.
 Learning New Infrastructure: As the companies shifted from Servers to Cloud, it
requires highly skilled staff who have skills to work with the cloud easily, and for this,
you have to hire new staff or provide training to current staff.
 Risk of Data: Hosting data on third-party resources can put the data at risk; it has the
chance of being attacked by a hacker or cracker very easily.

Characteristics of Virtualization
 Increased Security: The ability to control the execution of a guest program in a
completely transparent manner opens new possibilities for delivering a secure,
controlled execution environment. All the operations of the guest programs are
generally performed against the virtual machine, which then translates and applies them
to the host programs.
 Managed Execution: In particular, sharing, aggregation, emulation, and isolation are
the most relevant features.
 Sharing: Virtualization allows the creation of a separate computing environment
within the same host.
 Aggregation: It is possible to share physical resources among several guests, but
virtualization also allows aggregation, which is the opposite process.

Types of Virtualization

1. Application Virtualization
2. Network Virtualization
3. Desktop Virtualization


4. Storage Virtualization
5. Server Virtualization
6. Data virtualization

1. Application Virtualization:

Application virtualization helps a user to have remote access to an application from a


server. The server stores all personal information and other characteristics of the application
but can still run on a local workstation through the internet. An example of this would be a
user who needs to run two different versions of the same software. Technologies that use
application virtualization are hosted applications and packaged applications.

2. Network Virtualization:

Network virtualization is the ability to run multiple virtual networks, each with a
separate control and data plane. They co-exist on top of one physical network and can be
managed by individual parties that are potentially confidential to each other. Network
virtualization provides a facility to create and provision virtual networks (logical switches,
routers, firewalls, load balancers, Virtual Private Networks (VPN), and workload security)
within days or even weeks.

3. Desktop Virtualization:

Desktop virtualization allows the users’ OS to be remotely stored on a server in the


data center. It allows the user to access their desktop virtually, from any location by a different
machine. Users who want specific operating systems other than Windows Server will need to
have a virtual desktop. The main benefits of desktop virtualization are user mobility,
portability, and easy management of software installation, updates, and patches.

4. Storage Virtualization:

Storage virtualization is an array of servers that are managed by a virtual storage


system. The servers aren’t aware of exactly where their data is stored and instead function
more like worker bees in a hive. It allows storage from multiple sources to be managed and
utilized as a single repository. Storage virtualization software maintains smooth operations,
consistent performance, and a continuous suite of advanced functions despite changes,
breakdowns, and differences in the underlying equipment.


5. Server Virtualization:

This is a kind of virtualization in which the masking of server resources takes place.
Here, the central server (physical server) is divided into multiple virtual servers by changing
the identity number and processors, so each system can run its operating system in an isolated
manner, while each sub-server knows the identity of the central server. This increases
performance and reduces operating cost by deploying the main server resources into
sub-server resources. It is beneficial for virtual migration, reducing energy consumption,
reducing infrastructural costs, etc.

6. Data Virtualization:

This is the kind of virtualization in which data is collected from various sources and
managed in a single place, without the user needing to know technical details such as how the
data is collected, stored and formatted. The data is then arranged logically so that its virtual
view can be accessed remotely by interested people, stakeholders and users through various
cloud services. Many big companies provide such services, like Oracle, IBM, AtScale,
CData, etc.
Load Balancing

Load balancing is the method that allows you to have a proper balance of the amount
of work being done on different pieces of device or hardware equipment. Typically, what
happens is that the load of the devices is balanced between different servers or between the
CPU and hard drives in a single cloud server.

Load balancing was introduced for various reasons. One of them is to improve the
speed and performance of each single device, and the other is to protect individual devices
from hitting their limits by reducing their performance.

Cloud load balancing is defined as dividing workload and computing properties in


cloud computing. It enables enterprises to manage workload demands or application demands
by distributing resources among multiple computers, networks or servers. Cloud load
balancing involves managing the movement of workload traffic and demands over the Internet.

Traffic on the Internet is growing rapidly, increasing by almost 100% annually.
Therefore, the workload on the servers is increasing rapidly, leading to


overloading of the servers, mainly for the popular web servers. There are two primary solutions
to overcome the problem of overloading on the server-

First is a single-server solution in which the server is upgraded to a higher-performance


server. However, the new server may also be overloaded soon, demanding another upgrade.
Moreover, the upgrading process is arduous and expensive.

The second is a multiple-server solution in which a scalable service system on a cluster


of servers is built. That's why it is more cost-effective and more scalable to build a server
cluster system for network services.

Cloud-based servers can achieve more precise scalability and availability by using
server farm load balancing. Load balancing is beneficial with almost any type of service, such
as HTTP, SMTP, DNS, FTP, and POP/IMAP.

It also increases reliability through redundancy. A dedicated hardware device or


program provides the balancing service.

Different Types of Load Balancing Algorithms in Cloud Computing:


1. Static Algorithm

Static algorithms are built for systems with very little variation in load. The entire
traffic is divided equally between the servers in the static algorithm. This algorithm requires
in-depth knowledge of server resources for better performance of the processor, which is
determined at the beginning of the implementation.

However, the decision of load shifting does not depend on the current state of the
system. One of the major drawbacks of the static load balancing algorithm is that the division
of load is fixed once the tasks have been created; the load cannot later be shifted to other
devices for balancing.

2. Dynamic Algorithm

The dynamic algorithm first finds the lightest server in the entire network and gives it
priority for load balancing. This requires real-time communication with the network which can
help increase the system's traffic. Here, the current state of the system is used to control the
load.


The characteristic of dynamic algorithms is to make load transfer decisions in the


current system state. In this system, processes can move from a highly used machine to an
underutilized machine in real time.

3. Round Robin Algorithm

Round robin load balancing algorithm uses round-robin method to assign jobs. First, it
randomly selects the first node and assigns tasks to other nodes in a round-robin manner. This
is one of the easiest methods of load balancing.

Processors assign each process circularly without defining any priority. It gives fast
response in case of uniform workload distribution among the processes. All processes have
different loading times. Therefore, some nodes may be heavily loaded, while others may
remain under-utilised.
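
As a concrete sketch of the circular assignment just described, here is a minimal Python
illustration; the server names and task names are hypothetical and stand in for real backends:

from itertools import cycle

# Hypothetical pool of servers; round robin cycles through them in order,
# assigning each incoming task to the next server without considering load.
servers = cycle(["server-1", "server-2", "server-3"])

def assign(task):
    server = next(servers)   # pick the next server circularly, no priority
    print(task, "->", server)
    return server

for t in ["task-A", "task-B", "task-C", "task-D"]:
    assign(t)   # task-D wraps around to server-1 again

Note how the rotation ignores each server's current load, which is exactly why some nodes
may end up heavily loaded while others stay under-utilised.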

4. Weighted Round Robin Load Balancing Algorithm

Weighted Round Robin Load Balancing Algorithms have been developed to enhance
the most challenging issues of Round Robin Algorithms. In this algorithm, there are a specified
set of weights and functions, which are distributed according to the weight values.

Processors that have a higher capacity are given a higher value. Therefore, the highest
loaded servers will get more tasks. When the full load level is reached, the servers will receive
stable traffic.
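
A small sketch of the weighted idea in Python, assuming hypothetical integer weights
proportional to server capacity; a weight of 3 simply gives that server three slots per rotation
(production balancers usually interleave the slots more smoothly):

from itertools import cycle

# Hypothetical weights: higher-capacity servers get proportionally more slots.
weights = {"server-1": 3, "server-2": 1}
slots = [s for s, w in weights.items() for _ in range(w)]
rotation = cycle(slots)   # server-1, server-1, server-1, server-2, repeat

for task in ["t1", "t2", "t3", "t4", "t5"]:
    print(task, "->", next(rotation))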

5. Opportunistic Load Balancing Algorithm

The opportunistic load balancing algorithm allows each node to be busy. It never
considers the current workload of each system. Regardless of the current workload on each
node, OLB distributes all unfinished tasks to these nodes.

The processing task will be executed slowly as an OLB, and it does not count the
implementation time of the node, which causes some bottlenecks even when some nodes are
free.


6. Minimum to Minimum Load Balancing Algorithm

Under the minimum-to-minimum (min-min) load balancing algorithm, the expected
completion time of each pending task is computed first. Among all tasks, the one with the
minimum expected completion time is selected, and the work is scheduled on the
corresponding machine according to that minimum time.

The machine’s availability is then updated, and the task is removed from the list. This
process continues until the final assignment is made. This algorithm works best where many
small tasks outweigh large tasks.
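
To make the selection rule concrete, here is a small Python sketch of the min-min idea. It
assumes, purely for illustration, that a task's completion time on a machine is its length
divided by the machine's speed plus the time at which that machine becomes free; all names
and numbers are made up:

# Hypothetical machines (speed) and tasks (length).
machines = {"m1": 2.0, "m2": 1.0}
ready = {m: 0.0 for m in machines}   # time at which each machine is free
tasks = {"t1": 4.0, "t2": 1.0, "t3": 6.0}

while tasks:
    # For every (task, machine) pair compute the completion time, then
    # pick the pair with the minimum of these minimums.
    task, machine, finish = min(
        ((t, m, ready[m] + length / machines[m])
         for t, length in tasks.items() for m in machines),
        key=lambda x: x[2],
    )
    print(task, "->", machine, "finishes at", finish)
    ready[machine] = finish   # update the machine's availability
    del tasks[task]           # remove the scheduled task from the list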

Load balancing solutions can be categorized into two types -

Software-based load balancers: Software-based load balancers run on standard hardware


(desktop, PC) and standard operating systems.

Hardware-based load balancers: Hardware-based load balancers are dedicated boxes that
contain application-specific integrated circuits (ASICs) optimized for a particular use. ASICs
allow network traffic to be promoted at high speeds and are often used for transport-level load
balancing because hardware-based load balancing is faster than a software solution.
Major Examples of Load Balancers

Direct Routing Request Despatch Technique: This method of request dispatch is similar to
that implemented in IBM's NetDispatcher. A real server and load balancer share a virtual IP
address. The load balancer takes an interface built with a virtual IP address that accepts request
packets and routes the packets directly to the selected server.

Dispatcher-Based Load Balancing Cluster: A dispatcher performs smart load balancing


using server availability, workload, capacity and other user-defined parameters to regulate
where TCP/IP requests are sent. The dispatcher module of a load balancer can split HTTP
requests among different nodes in a cluster. The dispatcher divides the load among multiple
servers in a cluster, so services from different nodes act like a virtual service on only one IP
address; Consumers interconnect as if it were a single server, without knowledge of the back-
end infrastructure.

Linux Virtual Load Balancer: This is an open-source enhanced load balancing solution used
to build highly scalable and highly available network services such as HTTP, POP3, FTP,


SMTP, media and caching, and Voice over Internet Protocol (VoIP) is done. It is a simple and
powerful product designed for load balancing and fail-over. The load balancer itself is the
primary entry point to the server cluster system. It can execute Internet Protocol Virtual Server
(IPVS), which implements transport-layer load balancing in the Linux kernel, also known as
layer-4 switching.

Types of Load Balancing

Network Load Balancing

Cloud load balancing takes advantage of network-layer information to decide where
network traffic should be sent. This is accomplished through Layer 4 load balancing, which
handles TCP/UDP traffic. It is the fastest load balancing solution, but it cannot balance the
traffic distribution across servers.

HTTP(S) load balancing

HTTP(S) load balancing is the oldest type of load balancing, and it relies on Layer 7,
which means that load balancing operates at the application layer. It is the most flexible type
of load balancing because it lets you make delivery decisions based on information retrieved
from HTTP requests.

Internal Load Balancing

It is very similar to network load balancing, but is leveraged to balance the


infrastructure internally.

Load balancers can be further divided into hardware, software and virtual load
balancers.

Hardware Load Balancer

It depends on the base and the physical hardware that distributes the network and
application traffic. The device can handle a large traffic volume, but these come with a hefty
price tag and have limited flexibility.


Software Load Balancer

It can be an open source or commercial form and must be installed before it can be
used. These are more economical than hardware solutions.

Virtual Load Balancer

It differs from a software load balancer in that it deploys the software to the hardware
load-balancing device on the virtual machine.

WHY CLOUD LOAD BALANCING IS IMPORTANT IN CLOUD COMPUTING?

Here are some of the importance of load balancing in cloud computing.

Offers better performance

The technology of load balancing is less expensive and also easy to implement. This
allows companies to work on client applications much faster and deliver better results at a
lower cost.

Helps Maintain Website Traffic

Cloud load balancing can provide scalability to control website traffic. By using
effective load balancers, it is possible to manage high-end traffic, which is achieved using
network equipment and servers. E-commerce companies that need to deal with multiple
visitors every second use cloud load balancing to manage and distribute workloads.

Can Handle Sudden Bursts in Traffic

Load balancers can handle any sudden traffic bursts they receive at once. For example,
when university results are published, a website may go down due to too many requests. When
one uses a load balancer, one does not need to worry about the traffic flow. Whatever the size
of the traffic, load balancers will divide the entire load of the website equally across different
servers and provide maximum results in minimum response time.

Greater Flexibility

The main reason for using a load balancer is to protect the website from sudden crashes.
When the workload is distributed among different network servers or units, if a single node


fails, the load is transferred to another node. It offers flexibility, scalability and the ability to
handle traffic better. Because of these characteristics, load balancers are beneficial in cloud
environments. This is to avoid heavy workload on a single server.


Scalability and Elasticity

Cloud Elasticity

Elasticity refers to the ability of a cloud to automatically expand or compress the


infrastructural resources on a sudden up and down in the requirement so that the workload can
be managed efficiently. This elasticity helps to minimize infrastructural costs. This is not
applicable for all kinds of environments, it is helpful to address only those scenarios where the
resource requirements fluctuate up and down suddenly for a specific time interval. It is not
quite practical to use where persistent resource infrastructure is required to handle the heavy
workload.

Elasticity in the cloud is a well-known feature associated with scale-out solutions
(horizontal scaling), which allow resources to be dynamically added or removed when
required. It is mostly associated with public cloud resources and is usually featured in
pay-per-use or pay-as-you-go services.

Elasticity is thus the capacity to grow or shrink infrastructure resources (such as
compute, storage or network) dynamically, as needed, to adapt to workload changes in the
applications in an autonomic manner.

Example: Consider an online shopping site whose transaction workload increases during a
festive season like Christmas. For this specific period of time, the resources need to be scaled
up. To handle this kind of situation, we can go for a Cloud Elasticity service rather than Cloud
Scalability. As soon as the season is over, the deployed resources can be requested for
withdrawal.
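
The expand-or-compress behaviour can be pictured as a simple threshold rule. The sketch
below is a generic Python illustration, not any provider's actual API; the CPU thresholds,
step size and instance limits are assumptions chosen for the example:

def autoscale(current_instances, avg_cpu_percent,
              scale_out_at=70, scale_in_at=30,
              min_instances=1, max_instances=10):
    # Return the new instance count for one elasticity decision.
    if avg_cpu_percent > scale_out_at:      # workload spiked: add capacity
        return min(current_instances + 1, max_instances)
    if avg_cpu_percent < scale_in_at:       # workload dropped: release capacity
        return max(current_instances - 1, min_instances)
    return current_instances                # within band: do nothing

print(autoscale(2, 85))   # 3: scale out during the festive-season spike
print(autoscale(3, 10))   # 2: scale in once the season is over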

Cloud Scalability

Cloud scalability is used to handle the growing workload where good performance is
also needed to work efficiently with software or applications. Scalability is commonly used
where the persistent deployment of resources is required to handle the workload statically.

Example: Consider you are the owner of a company whose database was small in its early
days, but as time passed your business grew and the size of your database increased. In this
case, you just need to request that your cloud service vendor scale up your database capacity
to handle the heavy workload.


It is totally different from what you have read above in Cloud Elasticity. Scalability is used to
fulfill the static needs while elasticity is used to fulfill the dynamic need of the organization.
Scalability is a similar kind of service provided by the cloud where the customers have to pay-
per-use. So, in conclusion, we can say that Scalability is useful where the workload remains
high and increases statically.

Types of Scalability

1. Vertical Scalability (Scale-up)

In this type of scalability, increase the power of existing resources in the working
environment in an upward direction.


2. Horizontal Scalability

In this kind of scaling, the resources are added in a horizontal row.

3. Diagonal Scalability

It is a mixture of both Horizontal and Vertical scalability where the resources are added
both vertically and horizontally.


Difference Between Cloud Elasticity and Scalability

1. Cloud Elasticity: Elasticity is used just to meet the sudden up and down in the workload
   for a small period of time.
   Cloud Scalability: Scalability is used to meet the static increase in the workload.

2. Cloud Elasticity: Elasticity is used to meet dynamic changes, where the resource needs
   can increase or decrease.
   Cloud Scalability: Scalability is always used to address the increase in workload in an
   organization.

3. Cloud Elasticity: Elasticity is commonly used by small companies whose workload and
   demand increase only for a specific period of time.
   Cloud Scalability: Scalability is used by giant companies whose customer circle
   persistently grows, in order to do the operations efficiently.

4. Cloud Elasticity: Elasticity is a short-term plan, adopted just to deal with an unexpected
   increase in demand or seasonal demands.
   Cloud Scalability: Scalability is a long-term plan, adopted just to deal with an expected
   increase in demand.

Replication

The simplest form of data replication in a cloud computing environment is storing a copy
of a file, in expanded form the familiar copying and pasting of any modern operating system.
Replication is the reproduction of the original data in unchanged form. Data-changing accesses
are, in general, expensive under replication. In the frequently encountered master/slave
replication, a distinction is made between the original data (primary data) and the dependent
copies. With peer copies (version control) there must be a merging of the data sets
(synchronization). Sometimes it is important to know which data sets must have replicas.
Depending on the type of replication, a certain period of time lies between the processing and
creation of the primary data and their replication. This period is usually referred to as latency.
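
As a toy illustration of master/slave replication and latency, the Python sketch below copies a
primary (master) file to a list of dependent replicas and measures the elapsed time; the paths
are hypothetical and the replica directories are assumed to exist:

import shutil
import time

def replicate(primary_path, replica_paths):
    # Reproduce the primary data in unchanged form at every replica.
    start = time.time()
    for target in replica_paths:
        shutil.copy2(primary_path, target)
    # The time between writing the primary and finishing the copies
    # corresponds to the replication latency described above.
    return time.time() - start

# latency = replicate("data/primary.db",
#                     ["replica1/primary.db", "replica2/primary.db"])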

Array-Based Data Replication

An array-based data replication strategy uses built-in software to automatically


replicate data. With this type of data replication, the software is used in compatible storage
arrays to copy data between each. Using this method has several advantages and disadvantages.

Advantages:

 More robust
 Requires less coordination when deployed
 The work gets offloaded from the servers to the storage device

Disadvantages:

 Requires homogeneous storage environments: the source and target array have to be
similar
 It is costly to implement

Host-Based Data Replication

Host-based data replication uses the servers to copy data from one site to another site.
Host-based replication software usually includes options like compression, encryption and,
throttling, as well as failover. Using this method has several advantages and disadvantages.

Advantages:

 Flexible: It can leverage existing IP networks


 Can be customized to your business’ needs: You can choose what data to replicate
 Can create a schedule for sending data: allows you to throttle bandwidth
 Can use any combination of storage devices on each end


Disadvantages:

 Difficult to manage with a large group of servers if there is no centralized management


console
 Consumes host resources during replication
 Both storage devices on each end need to be active, which means you will need to
purchase dedicated hardware and OS
 Not all applications can support this type of data replication
 Can be affected by viruses or application failure
 Host-based replication offers the safest option if a business is looking for close to zero
impact on operations after a disaster.

Network-Based Data Replication

Network-based data replication uses a device or appliance that sits on the network in
the path of the data to manage replication. The data is then copied to a second device. These
devices usually have proprietary replication technology but can be used with any host server
and storage hardware.
Advantages

 Effective in large, heterogeneous storage and server environments


 Supports any host platform and works with any array
 Works separately from the servers and the storage devices
 Allows replication between multi-vendor products

Disadvantages:

 Higher initial set-up cost because it requires proprietary hardware, as well as ongoing
operational and management costs
 Requires implementation of a storage area network (SAN)

Monitoring

Cloud monitoring is a method of reviewing, observing, and managing the operational


workflow in a cloud-based IT infrastructure. Manual or automated management techniques
confirm the availability and performance of websites, servers, applications, and other cloud


infrastructure. This continuous evaluation of resource levels, server response times, and speed
predicts possible vulnerability to future issues before they arise.

This technique tracks multiple analytics simultaneously, monitoring storage resources


and processes that are provisioned to virtual machines, services, databases, and applications.
This technique is often used to host infrastructure-as-a-service (IaaS) and software-as-a-service
(SaaS) solutions. For these applications, you can configure monitoring to track performance
metrics, processes, users, databases, and available storage. It provides data to help you focus
on useful features or to fix bugs that disrupt functionality.

Monitoring is a skill, not a full-time job. In today’s world of cloud-based architectures


that are implemented through DevOps projects, developers, site reliability engineers (SREs),
and operations staff must collectively define an effective cloud monitoring strategy. Such a
strategy should focus on identifying when service-level objectives (SLOs) are not being met,
likely negatively affecting the user experience. So, then what are the benefits of leveraging
cloud monitoring tools? With cloud monitoring:

Benefits of cloud monitoring


 Scaling for increased activity is seamless and works in organizations of any size
 Dedicated tools (and hardware) are maintained by the host
 Tools are used across several types of devices, including desktop computers, tablets,
and phones, so your organization can monitor apps from any location
 Installation is simple because infrastructure and configurations are already in place
 Your system doesn’t suffer interruptions when local problems emerge, because
resources aren’t part of your organization’s servers and workstations
 Subscription-based solutions can keep your costs low

Cloud monitoring is primarily part of cloud security and management processes. It is


normally implemented through automated monitoring software that provides central access
and control over cloud infrastructure.


Cloud Services and Platforms

Cloud Reference Model

• Infrastructure & Facilities Layer

Includes the physical infrastructure such as datacenter facilities, electrical and


mechanical equipment, etc.

• Hardware Layer

Includes physical compute, network and storage hardware.

• Virtualization Layer

Partitions the physical hardware resources into multiple virtual resources, enabling
pooling of resources.

• Platform & Middleware Layer

Builds upon the IaaS layers below and provides standardized stacks of services such as
database service, queuing service, application frameworks and run-time environments,
messaging services, monitoring services, analytics services, etc.

• Service Management Layer

Provides APIs for requesting, managing and monitoring cloud resources.

• Applications Layer

Includes SaaS applications such as Email, cloud storage application, productivity


applications, management portals, customer self-service portals, etc.

Compute Service

 Compute services provide dynamically scalable compute capacity in the cloud.


 Compute resources can be provisioned on-demand in the form of virtual machines.
Virtual machines can be created from standard images provided by the cloud service
provider or custom images created by the users.
 Compute services can be accessed from the web consoles of these services that provide
graphical user interfaces for provisioning, managing and monitoring these services.
 Cloud service providers also provide APIs for various programming languages that
allow developers to access and manage these services programmatically.

Compute Service - Amazon EC2

• Amazon Elastic Compute Cloud (EC2) is a compute service provided by Amazon.


• Launching EC2 Instances
To launch a new instance click on the launch instance button. This will open a
wizard where you can select the Amazon machine image (AMI) with which you want


to launch the instance. You can also create your own AMIs with custom applications,
libraries and data. Instances can be launched with a variety of operating systems.
• Instance Sizes
When you launch an instance you specify the instance type (micro, small,
medium, large, extra-large, etc.), the number of instances to launch based on the
selected AMI and availability zones for the instances.
• Key-pairs
When launching a new instance, the user selects a key-pair from existing
keypairs or creates a new keypair for the instance. Keypairs are used to securely
connect to an instance after it launches.
• Security Groups
The security groups to be associated with the instance can be selected from the
instance launch wizard. Security groups are used to open or block a specific network
port for the launched instances.
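
The same launch steps can be performed programmatically; below is a minimal sketch using
the boto3 Python SDK, where the region, AMI ID, key pair name and security group ID are
placeholders to be replaced with your own values:

import boto3

ec2 = boto3.resource("ec2", region_name="us-east-1")

# Launch one micro instance from a chosen AMI, with a key pair for secure
# SSH access and a security group that opens the required network ports.
instances = ec2.create_instances(
    ImageId="ami-0123456789abcdef0",              # placeholder AMI ID
    InstanceType="t2.micro",                      # instance size
    MinCount=1,
    MaxCount=1,
    KeyName="my-keypair",                         # placeholder key pair
    SecurityGroupIds=["sg-0123456789abcdef0"],    # placeholder security group
)
print("Launched:", instances[0].id)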


Compute Services – Google Compute Engine

• Google Compute Engine is a compute service provided by Google.


• Launching Instances

To create a new instance, the user selects an instance machine type, a zone in which
the instance will be launched, a machine image for the instance and provides an instance name,
instance tags and meta-data.

• Disk Resources

Every instance is launched with a disk resource. Depending on the instance type, the
disk resource can be a scratch disk space or persistent disk space. The scratch disk space is
deleted when the instance terminates. Whereas, persistent disks live beyond the life of an
instance.

• Network Options

Network option allows you to control the traffic to and from the instances. By default,
traffic between instances in the same network, over any port and any protocol and incoming
SSH connections from anywhere are enabled.


Compute Services – Windows Azure VMs

• Windows Azure Virtual Machines is the compute service from Microsoft.


• Launching Instances:
o To create a new instance, you select the instance type and the machine image.
o You can either provide a user name and password or upload a certificate file for
securely connecting to the instance.
o Any changes made to the VM are persistently stored and new VMs can be
created from the previously stored machine images.


Storage Services

 Cloud storage services allow storage and retrieval of any amount of data, at any time
from anywhere on the web.
 Most cloud storage services organize data into buckets or containers.


 Scalability

Cloud storage services provide high capacity and scalability. Objects up to several
terabytes in size can be uploaded, and multiple buckets/containers can be created on cloud
storage.

 Replication

When an object is uploaded it is replicated at multiple facilities and/or on multiple


devices within each facility.

 Access Policies

Cloud storage services provide several security features such as Access Control
Lists (ACLs), bucket/container level policies, etc. ACLs can be used to selectively grant
access permissions on individual objects. Bucket/container level policies can also be
defined to allow or deny permissions across some or all of the objects within a single
bucket/container.

 Encryption

Cloud storage services provide Server Side Encryption (SSE) options to encrypt all
data stored in the cloud storage.

 Consistency

Strong data consistency is provided for all upload and delete operations. Therefore,
any object that is uploaded can be immediately downloaded after the upload is complete.

Storage Services – Amazon S3

 Amazon Simple Storage Service(S3) is an online cloud-based data storage


infrastructure for storing and retrieving any amount of data.
 S3 provides highly reliable, scalable, fast, fully redundant and affordable storage
infrastructure.
 Buckets
- Data stored on S3 is organized in the form of buckets. You must create a bucket
before you can store data on S3.


 Uploading Files to Buckets


- S3 console provides simple wizards for creating a new bucket and uploading files.
- You can upload any kind of file to S3.
- While uploading a file, you can specify the redundancy and encryption options and
access permissions.
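
The bucket-then-upload workflow can also be driven from the boto3 SDK; in the minimal
sketch below, the bucket name and file name are placeholders, and the encryption and ACL
options shown are just one illustrative choice:

import boto3

s3 = boto3.client("s3", region_name="us-east-1")

# A bucket must exist before any data can be stored on S3.
s3.create_bucket(Bucket="my-example-bucket")

# Upload a local file; ExtraArgs sets server-side encryption and access.
s3.upload_file(
    "notes.pdf", "my-example-bucket", "notes.pdf",
    ExtraArgs={"ServerSideEncryption": "AES256", "ACL": "private"},
)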

Storage Services – Google Cloud Storage

 GCS is the Cloud storage service from Google


 Buckets
Objects in GCS are organized into buckets.
 Access Control Lists

ACLs are used to control access to objects and buckets. ACLs can be configured
to share objects and buckets with the entire world, a Google group, a Google-hosted
domain, or specific Google account holders.
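
A comparable sketch using the google-cloud-storage Python client, assuming credentials and
a default project are already configured in the environment; the bucket and object names are
placeholders:

from google.cloud import storage

client = storage.Client()

# Objects live inside buckets, so create (or reference) a bucket first.
bucket = client.create_bucket("my-example-bucket")

# Upload a local file as an object in the bucket.
blob = bucket.blob("notes.pdf")
blob.upload_from_filename("notes.pdf")

# ACLs control sharing; this grants read access to the entire world.
blob.make_public()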


Storage Services – Windows Azure Storage

 Windows Azure Storage is the cloud storage service from Microsoft.


 Windows Azure Storage provides various storage services such as blob storage service,
table service and queue service.
 Blob storage service
o The blob storage service allows storing unstructured binary data or binary large
objects (blobs).
o Blobs are organized into containers.
o Block blobs - can be subdivided into some number of blocks. If a failure occurs
while transferring a block blob, retransmission can resume with the most recent
block rather than sending the entire blob again.
o Page blobs - are divided into number of pages and are designed for random
access. Applications can read and write individual pages at random in a page
blob.
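
A minimal sketch with the azure-storage-blob Python client; the connection string, container
name and blob name are placeholders, and the upload below creates a block blob, the default
blob type:

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")

# Blobs are organized into containers, so create one first.
service.create_container("notes")

# Upload unstructured binary data as a block blob.
blob = service.get_blob_client(container="notes", blob="notes.pdf")
with open("notes.pdf", "rb") as data:
    blob.upload_blob(data)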


Application Runtimes & Frameworks

 Cloud-based application runtimes and frameworks allow developers to develop and


host applications in the cloud.
 Support for various programming languages
Application runtimes provide support for programming languages (e.g., Java,
Python, or Ruby).
 Resource Allocation
Application runtimes automatically allocate resources for applications and
handle the application scaling, without the need to run and maintain servers.

Google App Engine

 Google App Engine is the platform-as-a-service (PaaS) from Google, which includes
both an application runtime and web frameworks.
 Runtimes
- App Engine provides runtime environments for Java, Python, PHP and Go
programming language.
 Sandbox
- Applications run in a secure sandbox environment isolated from other applications.
- The sandbox environment provides a limited access to the underlying operating
system.


 Web Frameworks
- App Engine provides a simple Python web application framework called webapp2 (see
the sketch after this list). App Engine also supports any framework written in pure Python
that speaks WSGI, including Django, CherryPy, Pylons, web.py, and web2py.
 Datastore
- App Engine provides a no-SQL data storage service
 Authentication
- App Engine applications can be integrated with Google Accounts for user
authentication.
 URL Fetch service
- URL Fetch service allows applications to access resources on the Internet, such as
web services or other data.
 Other services
- Email service
- Image Manipulation service
- Memcache
- Task Queues
- Scheduled Tasks service
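
As a concrete illustration of the webapp2 framework mentioned in the list above, here is a
minimal handler as used on the Python App Engine runtime; the route and response text are
arbitrary:

import webapp2

class MainPage(webapp2.RequestHandler):
    def get(self):
        # Respond to GET / with a plain-text greeting.
        self.response.headers['Content-Type'] = 'text/plain'
        self.response.write('Hello from App Engine!')

# The WSGI application object that App Engine serves.
app = webapp2.WSGIApplication([('/', MainPage)], debug=True)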

Windows Azure Web Sites

• Windows Azure Web Sites is a Platform-as-a-Service (PaaS) from Microsoft.


• Azure Web Sites allows you to host web applications in the Azure cloud.
• Shared & Standard Options.
- In the shared option, Azure Web Sites run on a set of virtual machines that may
contain multiple web sites created by multiple users.
- In the standard option, Azure Web Sites run on virtual machines (VMs) that belong
to an individual user.
• Azure Web Sites supports applications created in ASP.NET, PHP, Node.js and Python
programming languages.
• Multiple copies of an application can be run in different VMs, with Web Sites
automatically load balancing requests across them.
