
Distributed Computing

Lecture: 02
Farhad Muhammad Riaz
What is Distributed Computing
 A distributed computing system is a set of computer programs,
executing on one or more computers, and coordinating
actions by exchanging messages.
 A computer network is a collection of computers
interconnected by hardware that directly supports message
passing.
 Most distributed computing systems operate over computer
networks, but one can also build a distributed computing
system in which the components execute on a single
multitasking computer, and one can build distributed
computing systems in which information flows between the
components by means other than message passing.
 Parallel Computing vs. Grid Computing
 Both lie within the class of distributed systems.
Working
 In distributed systems, “protocol” refers to an
algorithm governing the exchange of messages, by
which a collection of processes coordinate their
actions and communicate information among
themselves.
 Much as a program is a set of instructions, and a
process denotes the execution of those instructions,
a protocol is a set of instructions governing the
communication in a distributed program, and a
distributed computing system is the result of
executing some collection of such protocols to
coordinate the actions of a collection of processes in
a network.
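As a small illustration of these definitions, the sketch below runs two processes that follow a trivial request/reply protocol over message queues; the “ping”/“pong” message format and the queue-based transport are assumptions made for the example, not part of the lecture material.

```python
# Minimal sketch of a two-process protocol over message passing.
from multiprocessing import Process, Queue

def server(inbox, outbox):
    # The protocol rule: every "ping" received is answered with "pong".
    while True:
        msg = inbox.get()
        if msg == "stop":        # sentinel ends the protocol run
            break
        if msg == "ping":
            outbox.put("pong")

if __name__ == "__main__":
    to_server, to_client = Queue(), Queue()
    p = Process(target=server, args=(to_server, to_client))
    p.start()
    to_server.put("ping")        # the client's half of the protocol
    print(to_client.get())       # -> "pong"
    to_server.put("stop")
    p.join()
```

Here the protocol is the agreed rule “every ping is answered by a pong”; the program is the code itself, and the running processes executing it together form a (tiny) distributed computing system.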
Reliability
 Fault tolerance:
 The ability of a distributed computing system to recover from component failures without performing
incorrect actions.
 High availability:
 In the context of a fault-tolerant distributed computing system, the ability of the system to restore
correct operation, permitting it to resume providing services during periods when some
components have failed. A highly available system may provide reduced service for short
periods of time while reconfiguring itself.
 Continuous availability:
 A highly available system with a very small recovery time, capable of providing uninterrupted service
to its users. The reliability properties of a continuously available system are unaffected or
only minimally affected by failures.
 Recoverability:
 Also in the context of a fault-tolerant distributed computing system, the ability of failed
components to restart themselves and rejoin the system, after the cause of failure has been
repaired.
 Consistency:
 The ability of the system to coordinate related actions by multiple components, often in the
presence of concurrency and failures. Consistency underlies the ability of a distributed
system to emulate a non-distributed system.
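As a hypothetical sketch of fault tolerance and high availability in practice, the client below retries a request against a list of replica servers and fails over when one is unreachable; the replica addresses and the send_request helper are invented for the example.

```python
# Hypothetical failover client: the system keeps providing service
# (high availability) despite the failure of some replicas
# (fault tolerance). Addresses and helper are assumed names.
import socket

REPLICAS = [("replica-a", 9000), ("replica-b", 9000)]  # assumed addresses

def send_request(addr, payload: bytes, timeout: float = 1.0) -> bytes:
    # Stand-in for real request/reply I/O against one server.
    with socket.create_connection(addr, timeout=timeout) as s:
        s.sendall(payload)
        return s.recv(4096)

def fault_tolerant_request(payload: bytes) -> bytes:
    last_error = None
    for addr in REPLICAS:          # fail over to the next replica
        try:
            return send_request(addr, payload)
        except OSError as err:     # timeout or connection failure
            last_error = err
    raise RuntimeError(f"all replicas failed: {last_error}")
```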
Reliability
 Scalability:
 The ability of a system to continue to operate correctly even as some aspect is scaled to a larger
size. For example, we might increase the size of the network on which the system
is running—doing so increases the frequency of such events as network outages
and could degrade a “non-scalable” system. We might increase numbers of users,
or numbers of servers, or load on the system. Scalability thus has many
dimensions; a scalable system would normally specify the dimensions in which it
achieves scalability and the degree of scaling it can sustain.
 Security:
 The ability of the system to protect data, services, and resources against misuse by
unauthorized users.
 Privacy:
 The ability of the system to protect the identity and locations of its users, or the contents of
sensitive data, from unauthorized disclosure.
 Correct specification:
 The assurance that the system solves the intended problem.
 Correct implementation:
 The assurance that the system correctly implements its specification.
Reliability
 Predictable performance:
 The guarantee that a distributed system achieves desired levels of
performance—for example, data throughput from source to
destination, latencies measured for critical paths, requests
processed per second, and so forth.
 Timeliness:
 In systems subject to real-time constraints, the assurance that actions are
taken within the specified time bounds, or are performed with a
desired degree of temporal synchronization between the
components.
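A minimal sketch of how such performance targets might be checked, assuming a placeholder do_request function standing in for a real critical path:

```python
# Hypothetical measurement of predictable-performance metrics:
# per-request latency and requests processed per second.
import time

def do_request() -> None:
    time.sleep(0.001)  # stand-in for a real request's critical path

def measure(n: int = 1000) -> None:
    latencies = []
    start = time.perf_counter()
    for _ in range(n):
        t0 = time.perf_counter()
        do_request()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    print(f"throughput: {n / elapsed:.1f} requests/s")
    print(f"max latency: {max(latencies) * 1000:.2f} ms")

if __name__ == "__main__":
    measure()
```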
Tolerating Failures
 Halting failures:
 In this model, a process or computer either works correctly, or simply stops executing and crashes without
taking incorrect actions, as a result of failure. As the model is normally specified, there is no way to
detect that the process has halted except by timeout: It stops sending “keep alive” messages or
responding to “pinging” messages, and hence other processes can deduce that it has failed (a
timeout-based detector along these lines is sketched after this list).
 Fail-stop failures:
 These are accurately detectable halting failures. In this model, processes fail by halting. However, other
processes that may be interacting with the faulty process also have a completely accurate way
to detect such failures—for example, a fail-stop environment might be one in which timeouts
can be used to monitor the status of processes, and no timeout occurs unless the process being
monitored has actually crashed. Obviously, such a model may be unrealistically optimistic,
representing an idealized world in which the handling of failures is reduced to a pure problem
of how the system should react when a failure is sensed. If we solve problems with this model,
we then need to ask how to relate the solutions to the real world.
 Send-omission failures:
 These are failures to send a message that, according to the logic of the distributed computing
system, should have been sent. Send-omission failures are commonly caused by a lack of buffering
space in the operating system or network interface, which can cause a message to be
discarded after the application program has sent it but before it leaves the sender’s machine.
Perhaps surprisingly, few operating systems report such events to the application.
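The timeout-based detection mentioned under halting failures might look like the following sketch; the timeout threshold is an assumption, and in a real asynchronous network such a detector can falsely suspect a slow process, which is exactly why the fail-stop model above is called unrealistically optimistic.

```python
# Sketch of timeout-based failure detection for halting failures.
# A peer is suspected to have crashed if no "keep alive" message
# arrives within TIMEOUT seconds. The threshold is an assumption.
import time

TIMEOUT = 3.0          # assumed detection threshold (seconds)

class FailureDetector:
    def __init__(self) -> None:
        self.last_heard: dict[str, float] = {}

    def on_keepalive(self, peer: str) -> None:
        # Called whenever a "keep alive" message arrives from peer.
        self.last_heard[peer] = time.monotonic()

    def suspected(self, peer: str) -> bool:
        # Deduce failure by timeout: silence longer than TIMEOUT.
        last = self.last_heard.get(peer)
        return last is None or time.monotonic() - last > TIMEOUT
```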
Tolerating Failures
 Receive-omission failures:
 These are similar to send-omission failures, but they occur when a message is lost near the destination process, often
because of a lack of memory in which to buffer it or because evidence of data corruption has been
discovered.
 Network failures:
 These occur when the network loses messages sent between certain pairs of processes.
 Network partitioning failures:
 These are a more severe form of network failure, in which the network fragments into disconnected sub-networks,
within which messages can be transmitted, but between which messages are lost. When a failure of this sort
is repaired, one talks about merging the network partitions. Network partitioning failures are a common
problem in modern distributed systems.
 Timing failures:
 These occur when a temporal property of the system is violated—for example, when a clock on a computer exhibits a
value that is unacceptably far from the values of other clocks, or when an action is taken too soon or too
late, or when a message is delayed by longer than the maximum tolerable delay for a network connection.
 Byzantine failures:
 This is a term that captures a wide variety of other faulty behaviors, including data corruption, programs that fail to
follow the correct protocol, and even malicious or adversarial behaviors by programs that actively seek to
force a system to violate its reliability properties.
Computation Models
 Real-world networks:
 These are composed of workstations, personal computers,
and other computing devices interconnected by hardware.
 Properties of the hardware and software components will
often be known to the designer, such as speed, delay, and
error frequencies for communication devices; latencies for
critical software and scheduling paths; throughput for data
generated by the system and data distribution patterns; speed
of the computer hardware, accuracy of clocks; and so forth.
 This information can be of tremendous value in designing
solutions to problems that might be very hard—or
impossible—in a completely general sense.
Computation Models
 Asynchronous computing systems:
 This is a very simple theoretical model used to approximate one extreme sort of computer
network. In this model, no assumptions can be made about the relative speed of the
communication system, processors, and processes in the network.
 One message from a process p to a process q may be delivered in zero time, while the
next is delayed by a million years.
 The asynchronous model reflects an assumption about time, but not failures: Given an
asynchronous model, one can talk about protocols that tolerate message loss,
protocols that overcome fail-stop failures in asynchronous networks, and so forth.
 The main reason for using the model is to prove properties about protocols for which
one makes as few assumptions as possible.
 The model is very clean and simple, and it lets us focus on fundamental properties of
systems without cluttering up the analysis by including a great number of practical
considerations.
 If a problem can be solved in this model, it can be solved at least as well in a more
realistic one.
 On the other hand, the converse may not be true:
 We may be able to do things in realistic systems by making use of features not
available in the asynchronous model, and in this way may be able to solve problems in
real systems that are impossible in ones that use the asynchronous model.
Computation Models
 Synchronous computing systems:
 Like the asynchronous systems, these represent an extreme end of the spectrum. In synchronous
systems, there is a very strong concept of time that all processes in the system share.
 One common formulation of the model can be thought of as having a system-wide gong that
sounds periodically; when the processes in the system hear the gong, they run one round of a
protocol, reading messages from one another, sending messages that will be delivered in the
next round, and so forth (a round-based loop of this kind is sketched after this list).
 And these messages always are delivered to the application by the start of the next round, or
not at all.
 Normally, the synchronous model also assumes bounds on communication latency between
processes, clock skew and precision, and other properties of the environment.
 As in the case of an asynchronous model, the synchronous one takes an extreme point of view
because this simplifies reasoning about certain types of protocols.
 Real-world systems are not synchronous—it is impossible to build a system in which actions
are perfectly coordinated as this model assumes.
 However, if one proves the impossibility of solving some problem in the synchronous model, or
proves that some problem requires at least a certain number of messages in this model, one
has established a sort of lower bound.
 In a real-world system, things can only get worse, because we are limited to weaker
assumptions.
 This makes the synchronous model a valuable tool for understanding how hard it will be to
solve certain problems.
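A minimal simulation sketch of this round structure, assuming an in-memory model in which messages sent in round r are delivered at the start of round r+1; the echo behavior of each process is a made-up placeholder:

```python
# Simulation sketch of the synchronous model: a "gong" drives rounds,
# and messages sent in round r are delivered at the start of round
# r+1, or not at all. Each process's behavior is a placeholder.
def run_synchronous(n_procs: int, n_rounds: int) -> None:
    inboxes = [[] for _ in range(n_procs)]
    for r in range(n_rounds):                 # the gong sounds
        outboxes = [[] for _ in range(n_procs)]
        for p in range(n_procs):
            received = inboxes[p]             # delivered from round r-1
            print(f"round {r}, process {p}: got {received}")
            for q in range(n_procs):          # send to every other process
                if q != p:
                    outboxes[q].append(f"hello from {p} in round {r}")
        inboxes = outboxes                    # delivered by the next round

if __name__ == "__main__":
    run_synchronous(n_procs=3, n_rounds=2)
```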
Computation Models
 Parallel-shared memory systems:
 An important family of systems is based on multiple processors
that share memory.
 Unlike for a network, where communication is by message
passing, in these systems communication is by reading and
writing shared memory locations. Clearly, the shared memory
model can be emulated using message passing, and can be used
to implement message communication.
 Nonetheless, because there are important examples of real
computers that implement this model, there is considerable
theoretical interest in the model per se.
 Unfortunately, although this model is very rich and a great deal is
known about it, it would be beyond the scope of these lectures to
attempt to treat the model in any detail.
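A small sketch of the shared-memory style of communication, assuming Python's multiprocessing.Value as the shared location and its built-in lock to coordinate concurrent writers:

```python
# Sketch of shared-memory communication: processes coordinate by
# reading and writing a shared location rather than by exchanging
# messages. The counter increment is an illustrative stand-in.
from multiprocessing import Process, Value

def worker(counter):
    # counter is a shared 32-bit integer with an attached lock.
    for _ in range(1000):
        with counter.get_lock():   # coordinate concurrent writers
            counter.value += 1

if __name__ == "__main__":
    counter = Value("i", 0)        # the shared memory location
    procs = [Process(target=worker, args=(counter,)) for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print(counter.value)           # -> 4000
```

Note that no messages are exchanged: the processes coordinate purely by reading and writing the shared counter.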
Communication Technology
 The most basic communication technology in any distributed system
is the hardware support for message passing.
 Although there are some types of networks that offer special
properties, most modern networks are designed to transmit data in
packets with some fixed, but small, maximum size. Each packet
consists of a header, which is a data structure containing
information about the packet—its destination, route, and so forth.
It contains a body, which consists of the bytes that make up the
content of the packet.
 And it may contain a trailer, which is a second data structure that is
physically transmitted after the header and body and would normally
consist of a checksum for the packet that the hardware computes
and appends to it as part of the process of transmitting the packet.
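To make the header/body/trailer layout concrete, here is a hedged sketch that packs a packet with a small fixed header and a CRC-32 trailer; the field layout is invented for illustration, and real network hardware computes and checks its checksum itself.

```python
# Illustrative packet layout: header (destination, length), body,
# and a trailer carrying a CRC-32 checksum over header + body.
# The field layout is invented; real hardware formats differ.
import struct
import zlib

HEADER_FMT = "!HI"   # destination id (2 bytes), body length (4 bytes)

def make_packet(dest: int, body: bytes) -> bytes:
    header = struct.pack(HEADER_FMT, dest, len(body))
    trailer = struct.pack("!I", zlib.crc32(header + body))
    return header + body + trailer

def check_packet(packet: bytes) -> bool:
    # Recompute the checksum and compare it with the trailer.
    payload, trailer = packet[:-4], packet[-4:]
    (checksum,) = struct.unpack("!I", trailer)
    return checksum == zlib.crc32(payload)

pkt = make_packet(dest=7, body=b"hello")
print(check_packet(pkt))   # -> True
```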
