
Distributed Computing

What is Distributed Computing?

• A distributed computing system, simply put, is a network of independent computers working together to
achieve common computational goals. It is a system where multiple computers, often geographically
dispersed, collaborate to solve a problem that is beyond their individual computing capabilities. Each
system, or 'node', is self-sufficient, meaning it operates independently while also contributing to the
overall goal.

• This is achieved through a process of task division, where a large task is divided into smaller subtasks. Each subtask is then assigned to a different node within the network. The nodes work concurrently, processing their individual subtasks independently, and the partial results are then aggregated into a final result.

Example:

• A cloud computing system is one in which resources such as computing power, storage, and networking are delivered over the Internet and accessed on demand. In this type of system, users can access and use shared resources through a web browser or other client software.
Evolution of Distributed Computing Systems
Evolution of Computing Power

 Early Days of Computing:

o Initial computers had limited processing power.


o Advancement in technology led to faster execution and improved processing capabilities.

 Shift to Minicomputers:

o Minicomputers started to handle various data volumes efficiently.


o Personal computers evolved to process complex tasks and large data efficiently.

 Role of Mainframe Computers:

o Mainframes were used for massive data analysis but were not always ideal for all tasks.
o This limitation sparked research into alternative solutions.
 Introduction of Distributed Operating Systems:

o Distributed architecture allows multiple computers to work together in a network, resembling a single
virtual computer.
o Offers enhanced speed and processing power by pooling resources.

 Enhanced by Remote Procedure Calls (RPC):

o RPCs enable execution of commands on remote computers within the network.


o The TCP/IP protocol facilitated the rise of RPCs and increased the relevance of distributed systems.

 Architectural Framework Advancement:

o Initially designed for batch processing; evolved with the introduction of minicomputers.
o The master/slave concept was implemented for process management.

 Impact of Microcomputers and Networks:

o Reduction in size and increase in power of microcomputers improved distributed computing efficiency.
o Modern networks offer better performance, scalability, and security, enhancing distributed computing
capabilities.
Key components of a Distributed Computing System

 Devices or Systems: The devices or systems in a distributed system have their own processing capabilities
and may also store and manage their own data.

 Network: The network connects the devices or systems in the distributed system, allowing them to
communicate and exchange data.

 Resource Management: Distributed systems often have some type of resource management system in place
to allocate and manage shared resources such as computing power, storage, and networking.
Advantages of Distributed Computing
Scalability

• As the computational needs of a task increase, instead of upgrading a single system to handle the increased
workload, additional nodes can be added to the distributed network. This way, the system can efficiently
handle the growing demands without major modifications or significant costs.

• It also includes the ability to enhance the computational power of existing nodes or to replace older nodes
with more powerful ones.

Availability

• High availability is another significant advantage of distributed computing. Since the system is composed
of multiple independent nodes, the failure of one or a few nodes does not halt the entire system. Other
nodes in the network can continue their operations, ensuring that the system as a whole remains functional.
Efficiency

• Distributed computing systems are highly efficient. By dividing a large task into smaller subtasks and
processing them concurrently, the system can significantly reduce the time required to complete the task.
This parallel processing capability is especially beneficial for complex computational tasks that would
take an unfeasibly long time to complete on a single computer.

Transparency

• Transparency is a key feature of distributed computing systems. Despite being composed of multiple
independent nodes, the system operates as a single entity from the user's perspective. This means that
the complexities of the underlying architecture, such as the division of tasks, the communication
between nodes, and the handling of failures, are hidden from the user.
Types of Distributed Computing Architecture

Client-Server Architecture

• Client-server architecture is a common type of distributed computing architecture. In this model, the system is
divided into two types of nodes: clients and servers. Clients request services, and servers provide them. The
servers are typically powerful computers that host and manage resources, while the clients are usually less
powerful machines that access these resources.
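As an illustration of the request-response interaction described above, the following minimal sketch uses Python's standard socket module; the localhost address, port 9000, and the echo-style service are assumptions made purely for demonstration.

```python
# Minimal client-server sketch: the server answers each request with a response.
# Assumptions: localhost, port 9000, and a simple echo-style service.
import socket
import threading

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("localhost", 9000))
srv.listen()                                       # server is ready before the client starts

def serve_one_request():
    conn, _ = srv.accept()
    with conn:
        request = conn.recv(1024).decode()         # read the client's request
        conn.sendall(f"echo: {request}".encode())  # send back the response

t = threading.Thread(target=serve_one_request)
t.start()

with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as cli:
    cli.connect(("localhost", 9000))               # client contacts the server
    cli.sendall(b"hello")
    print(cli.recv(1024).decode())                 # -> "echo: hello"

t.join()
srv.close()
```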

Three-Tier Architecture

• Three-tier architecture is a type of client-server architecture where the system is divided into three layers: the
presentation layer, the application layer, and the data layer. The presentation layer handles the user interface,
the application layer processes the business logic, and the data layer manages the database. By separating
these functions, the system can achieve greater scalability, flexibility, and maintainability.
N-Tier Architecture

• N-tier architecture is a further extension of the three-tier architecture. In this model, the system is divided into
'n' tiers or layers, where 'n' can be any number greater than three. Each layer is dedicated to a specific
function, such as user interface, business logic, data processing, data storage, etc. This division of labor
allows for greater modularity, making the system more scalable and easier to manage.

Peer-to-Peer Architecture

• Peer-to-Peer (P2P) architecture is a type of distributed computing architecture where all nodes are equal, and
each node can function as both a client and a server. In this model, there is no central server; instead, each
node can request services from and provide services to other nodes. This decentralization makes P2P
architectures highly scalable and resilient, as there is no single point of failure.
Distributed Computing System Models
Types of Distributed Computing System Models

1. Physical Model
• A physical model represents the underlying hardware elements of a distributed system. It encompasses the
hardware composition of a distributed system in terms of computers and other devices and their interconnections.
It is primarily used to design, manage, implement, and determine the performance of a distributed system.

• A physical model majorly consists of the following components:

1.1. Nodes

• Nodes are the end devices that can process data, execute tasks, and communicate with the other nodes. These end
devices are generally the computers at the user end or can be servers, workstations, etc.

 Nodes provision the distributed system with an interface in the presentation layer that enables the user to interact
with other back-end devices, or nodes, that can be used for storage and database services, processing, web
browsing, etc.

 Each node has an Operating System, execution environment, and different middleware requirements that
facilitate communication and other vital tasks.
1.2. Links

• Links are the communication channels between different nodes and intermediate devices. These may be wired
or wireless. Wired links or physical media are implemented using copper wires, fiber optic cables, etc.

• Different connection types that can be implemented are as follows:

 Point-to-point links: Establish a connection and allow data transfer between only two nodes.

 Broadcast links: It enables a single node to transmit data to multiple nodes simultaneously.

 Multi-Access links: Multiple nodes share the same communication channel to transfer data. Protocols are required to avoid interference during transmission.

1.3. Middleware

• Middleware is the software installed and executed on the nodes. By running middleware on each node, the distributed computing system achieves decentralised control and decision-making. It handles various tasks such as communication with other nodes, resource management, fault tolerance, synchronisation of different nodes, and security to prevent malicious and unauthorised access.
1.4 Network Topology

• This defines the arrangement of nodes and links in the distributed computing system. The most common network topologies are bus, star, mesh, ring, and hybrid. The choice of topology depends on the exact use case and its requirements.

1.5. Communication Protocols

• Communication protocols are the set of rules and procedures for transmitting data over the links. Examples of these protocols include TCP, UDP, HTTPS, MQTT, etc. They allow the nodes to communicate and to interpret the data they exchange.
2. Architectural Model

• The architectural model of a distributed computing system describes the overall design and structure of the system, and how its different components are organised to interact with each other and provide the desired functionality. It gives an overview of how development, deployment, and operation will take place. A good architectural model is required for efficient cost usage and greatly improved scalability of the applications.

• The key aspects of architectural model are:

2.1. Client-Server model

• It is a centralised approach in which the clients initiate requests for services and servers respond by providing those services. It works on the request-response model: the client sends a request to the server, the server processes it, and responds to the client accordingly.

 It is typically implemented over transport protocols such as TCP/IP, with application-layer protocols such as HTTP on top.

 This is mainly used in web services, cloud computing, database management systems etc.
2.2. Peer-to-peer model

• It is a decentralised approach in which all the distributed computing nodes, known as peers, are equal in terms of computing capabilities and can both request and provide services to other peers. It is a highly scalable model because peers can join and leave the system dynamically, which makes it an ad-hoc form of network.

 The resources are distributed, and the peers need to look out for the required resources as and when required.

 Communication is done directly amongst the peers, without any intermediaries, according to the rules and procedures defined in the P2P network.
2.3. Layered model

• It involves organising the system into multiple layers, where each layer provisions a specific service. Each layer communicates with the adjacent layers using certain well-defined protocols without affecting the integrity of the system. A hierarchical structure is obtained in which each layer abstracts the complexity of the layers below it.
2.4. Micro-services model

• In this model, a complex application or task is decomposed into multiple independent services running on different servers. Each service performs a single function and is focused on a specific business capability. This makes the overall system more maintainable, scalable, and easier to understand. Services can be independently developed, deployed, and scaled without affecting the other running services.
3. Fundamental Model

• The fundamental model in a distributed computing system is a broad conceptual framework that helps in
understanding the key aspects of the distributed systems. These are concerned with more formal description of
properties that are generally common in all architectural models. Three fundamental models are as follows:

3.1. Interaction Model

• Distributed computing systems are full of many processes interacting with each other in highly complex ways.
Interaction model provides a framework to understand the mechanisms and patterns that are used for
communication and coordination among various processes. Different components that are important in this model
are: –

 Message Passing – It deals with passing messages that may contain data, instructions, a service request, or process synchronisation information between different computing nodes. It may be synchronous or asynchronous depending on the types of tasks and processes.

 Publish/Subscribe Systems – Also known as pub/sub systems. Here, a publishing process publishes a message on a topic, and the processes that are subscribed to that topic pick it up and process it for themselves. This pattern is especially important in event-driven architectures.
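A minimal sketch of the publish/subscribe pattern described above, written in Python; the Broker class, topic name, and callbacks are illustrative assumptions rather than any particular pub/sub product.

```python
# Sketch of a publish/subscribe broker: subscribers register callbacks per topic,
# and a publish on that topic fans the message out to them.
from collections import defaultdict

class Broker:
    def __init__(self):
        self.subscribers = defaultdict(list)   # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for callback in self.subscribers[topic]:
            callback(message)                  # each subscriber handles the event

broker = Broker()
broker.subscribe("sensor/temperature", lambda m: print("logger got", m))
broker.subscribe("sensor/temperature", lambda m: print("alerting got", m))
broker.publish("sensor/temperature", {"value": 42})
```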
3.2. Remote Procedure Call (RPC)

• It is a communication paradigm that provides the ability to invoke a procedure or method in a remote process as if it were a local procedure call. The client process makes a procedure call using RPC, and the call message is passed to the required server process using communication protocols.
Distributed Operating System
• A distributed operating system is an important type of operating system. An operating system is basically a program that acts as an interface between the system hardware and the user. Moreover, it handles all the interactions between the software and the hardware.

• A distributed operating system is one in which several computer systems are connected through a single communication channel. These systems have their own individual processors and memory. Furthermore, these processors communicate through high-speed buses or telephone lines.
Advantages of Distributed Operating System:

• There are several advantages to using a Distributed Operating System:

• Two or more systems can be connected to each other, allowing them to share resources and serve real-time
applications.

• The distributed system can be scaled to meet the needs of the business, making it more flexible and efficient.

• The system can be managed centrally, making it easier to control and monitor.

• Distributed systems are becoming increasingly popular due to the rise of big data and the need for real-time
applications.

Disadvantages of Distributed Operating System:

• There are also a few disadvantages to using a Distributed Operating System:

• The systems can be more difficult to administer and manage.

• There is a greater risk of system failure due to the increased number of systems and points of failure.
Uses of Distributed OS
• The distributed OS has numerous applications. Here are a few examples:

Network Applications

• Many network apps, including the web, multiplayer web-based games, peer-to-peer networks, and virtual
communities, rely on DOS.

Telecommunication Networks

• DOS is useful in cellular networks and phones. In networks like wireless sensor networks, the Internet, and
routing algorithms, a DOS can be found.

Real-Time Process Control

• Aircraft control systems are instances of real-time process control systems that operate on a deadline.

Parallel Computation

• DOS is the foundation for systematic computing, which encompasses cluster and grid computing as well as a
number of volunteer computing projects.
Designing a distributed operating system
The various design issues in the development of distributed systems are stated as follows:

• Transparency

• Flexibility

• Reliability

• Performance

• Scalability
Transparency:
• A distributed system is said to be transparent if its users feel that the collection of machines is a timesharing system that belongs entirely to them. Transparency can be achieved at two different levels. At the first level, the distribution of the system is hidden from the users.

• For example, in UNIX, when a user compiles a program using the make command, compilation takes place in parallel on different machines, which use different file systems. The whole system can be made to look like a single-processor system.

Flexibility
• Flexibility in distributed systems is important because these systems are still relatively new to engineers, so there may be false starts and it may be necessary to backtrack on design decisions. Design choices may prove wrong in later stages of development. There are two different schemes for building distributed systems.

• The first one, called monolithic kernel, states that each machine should run a traditional kernel, which provides most of
the services itself.
Monolithic kernel is the centralized OS, which has networking ability and
remote services. The system calls are made by locking the kernel, then the
desired task is performed and the kernel is released after returning the result
to the user. In this approach, the machines have their own disks and maintain
their own local file system.
The other one, called microkernel, states that the kernel should provide very few services and that most OS services should be provided by user-level servers.

• Most distributed systems are designed to use a microkernel because it performs very few tasks, and as a result these systems are more flexible. The services provided by the microkernel are stated as follows:

• It provides interprocess communication.

• It manages the memory.

• It performs low-level process management and scheduling.

• It also performs low-level I/O


Reliability
• Distributed systems are more reliable than single-processor systems because if one system in a distributed
system stops functioning, other systems can take over. Some other machine can take up the job of the
machine that is not working. There are various issues related to reliability, which are stated as follows:

• a) Availability: It refers to the fraction of time during which the system is usable. Availability can be improved by a design that does not require the simultaneous functioning of a large number of key components. In addition, the resources or files that are used frequently can be replicated.

• b) Security: Data stored on a distributed system can potentially be accessed by anyone, so it must be protected from unauthorized access. This problem also exists in single-processor systems, but there the users are required to log in, so they are authenticated and the system can check their permissions. In a distributed system, the system may have no such provision for determining the user and their permissions.
Performance
• A flexible and reliable system is ineffective if it performs slower than a single-processor system. Performance
is measured using metrics like jobs per hour, system utilization, and network capacity. Different tasks yield
varying performance results; for instance, CPU-intensive tasks differ from file searching tasks. In distributed
systems, communication speed is vital, but waiting for message-handling protocols can slow performance.

• To improve efficiency, multiple tasks should run simultaneously, which necessitates sending many messages. Analyzing computation grain size is therefore crucial. Fine-grained parallelism involves small tasks with a high degree of interaction between them, while coarse-grained parallelism involves large tasks with little interaction; the latter is preferred in distributed systems because it incurs less communication overhead.

• Reliability can be enhanced through server cooperation on requests. If one server fails, another can take over
the task, ensuring completion. This approach improves reliability but requires additional messages across the
network.
Scalability
Distributed systems typically operate efficiently with a few hundred CPUs, but future scenarios may involve much
larger systems. For instance, if a postal and telecommunications authority installs a terminal in every home and
business for online access to a comprehensive database (like telephone numbers), this could eliminate the need for
printed directories. These terminals could also facilitate email, electronic banking, and ticket reservations,
showcasing the potential of such expansive systems.

There are certain bottlenecks in developing such a large system, which are stated as follows:

• Centralized components: There should not be any centralized components in the system. For example, if there is a single centralized mail server for all the users of the system, the traffic over the network will increase, the system will not be able to tolerate faults, and if that server fails the whole system will crash.

• Centralized tables: If the data of the users is stored in the centralized tables, the communication lines will be
blocked. Thus, the system will become prone to faults and failures.

• Centralized algorithms: If the messages in such a large system are routed using a single centralized algorithm, they will take much longer to reach their destinations because of the large number of users and the resulting traffic.
INTRODUCTION TO MESSAGE PASSING
• The communication between the computers in a distributed operating system works as the backbone of the whole system. This communication between the computers is also known as IPC (Inter-Process Communication).

• When a user initiates a command to execute an application, multiple processes are created which need to work in tandem in order to serve the user's request. The processes created to complete the request need to interact with one another either by sharing resources or by passing messages.
Fig. An Example of Message Passing Communication between Two Processes; An Example of Resource Sharing between Two Processes
Issues in IPC By Message Passing in Distributed System

• The design of an inter-process communication mechanism needs to address some basic issues concerning the procedures used for communication between any two or more nodes. Messages are generally transferred from sender to receiver in the form of data packets. The sender node sends a data packet that includes two basic components: a fixed-length header and a variable-length block.

• The fixed-length header includes information such as the sending process address, the receiving process address, a unique message identification number, the type of data, and the number of bytes/elements. The information in the fixed-length header helps a data packet reach the correct destination and gives the receiver correct information about the originating process, since the same information is used for sending the acknowledgement from the receiver.

• Once the process of receiving a data packet is complete then an acknowledgement is sent to the originating
process in order to complete the transfer of message between a sender and a receiver.
Fig. Data Packet which is Transferred from a Sender Node to a Receiver Node
Fig. Parameters Available in “Fixed Length Block” of a Data Packet
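A possible way to picture the fixed-length header plus variable-length block described above is the following Python sketch; the field order and the 4-byte field widths are assumptions for illustration, not an actual wire format.

```python
# Sketch of the fixed-length header described above, packed into bytes for transmission.
# Field widths (4-byte unsigned ints) are illustrative assumptions, not a real wire format.
import struct

HEADER_FORMAT = "!IIIII"   # sender addr, receiver addr, message id, data type, payload length

def build_packet(sender, receiver, msg_id, msg_type, payload: bytes) -> bytes:
    header = struct.pack(HEADER_FORMAT, sender, receiver, msg_id, msg_type, len(payload))
    return header + payload                      # fixed-length header + variable-length block

def parse_packet(packet: bytes):
    size = struct.calcsize(HEADER_FORMAT)
    sender, receiver, msg_id, msg_type, length = struct.unpack(HEADER_FORMAT, packet[:size])
    return (sender, receiver, msg_id, msg_type), packet[size:size + length]

pkt = build_packet(1, 2, 100, 0, b"hello")
print(parse_packet(pkt))   # ((1, 2, 100, 0), b'hello')
```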
The basic challenges that are encountered by a distributed operating system during inter process communication are:
1. Naming and Name Resolution: Every process in a communication system is assigned a unique identification number known as a Process-ID (process identification number). The system should have a naming scheme that allows processes to be named, in order to resolve conflicts and to manage process execution in a distributed operating system or in inter-process communication.
• A naming system can be implemented using either a distributed or a non-distributed approach. The approach selected has a direct impact on the effectiveness and efficiency of the distributed operating system.
2. Routing Strategies: Inter-process communication primarily involves determining how a data packet is transmitted
from a sender to a receiver, including the specific route it takes through various nodes or computers. During this process,
the message is only accessible at the destination node, ensuring confidentiality. The route taken by the data packet is
referred to as routing, and the methods for identifying this route are known as routing strategies. An effective routing
strategy should prioritize efficiency, security, and optimal resource utilization.
3. Connection Strategies: The backbone of communication between a sender and receiver is the physical connection,
established through a connection strategy. A poorly chosen connection strategy can result in communication delays,
message loss, or alterations. In distributed operating systems, three key connection strategies—Circuit Switching,
Message Switching, and Packet Switching—must be selected based on specific system requirements. In inter-process
communication, messages are sent in a structured format containing attributes such as sender and receiver addresses,
sequence numbers, structural information, and the actual data, which is typically located at the end of the message block,
often accompanied by a pointer to the data.
SYNCHRONIZATION
• A distributed operating system relies on cooperation and data exchange between independent nodes, requiring
synchronization between sender and receiver for effective communication. This synchronization ensures that
messages are received accurately and allows for the timing coordination necessary for successful data
transfer. It can be categorized into two types: blocking and non-blocking synchronization.

• In blocking synchronization, the sender remains in a blocked state until the receiver acknowledges receipt of
the message. In contrast, non-blocking synchronization allows the sender to continue its operations
immediately after sending the message, without waiting for acknowledgment. Effective synchronization is
crucial for the integrity and reliability of communication in distributed systems.

a) Blocking:
• In blocking synchronization, communication involves a sender transmitting a message and then waiting for
the receiver's acknowledgment before proceeding. During this waiting period, the sender remains blocked,
and the receiver also waits for a message to continue its process. This method ensures that the sender only
resumes execution after confirming receipt of the message, thus coordinating the communication flow
between the nodes effectively.
When a message is sent from one computer (the
sender) to another (the receiver), sometimes one of
them can get stuck, which may lead to a situation
where neither can continue working. To prevent this
from happening, a timeout is set. This timeout is a
limit on how long the sender will wait for a response
(acknowledgment) from the receiver. If the sender
doesn't get a response before the timeout ends, it will
stop waiting and continue with other tasks, helping to
avoid a deadlock where everything freezes.

If both the sender and receiver use a blocking method to communicate, this is called synchronous communication. Synchronous communication is straightforward and ensures the message is delivered. However, it can also lead to deadlocks if both sides are waiting on each other, causing the entire system to become unresponsive.
(b) Non-Blocking:

• In non-blocking synchronization, when a sender sends a message, it doesn't have to wait for the receiver to
acknowledge it. The sender can continue with its work once the message is placed in a buffer. Similarly, the receiver
can also move on to other tasks after it has executed the receive operation, without waiting for a message.

• In this setup, if both the sender and receiver are using non-blocking methods, it’s called asynchronous communication.
The challenge is how the receiver knows when a message is available in the buffer. There are two main ways to handle
this:

• Polling: This is when the receiver continuously checks the buffer to see if there are any new messages. While this
method works, it can be inefficient because it uses up processing resources.

• Interrupt Method: In this method, the system generates an interrupt signal when a message arrives in the buffer. This
alert informs the receiver that it can retrieve the message. However, implementing this method can be complex and
resource-intensive, especially in distributed systems.
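The following Python sketch contrasts a blocking receive with a non-blocking (polling) receive on a shared buffer; queue.Queue stands in for the message buffer between the two processes, and the timings are arbitrary assumptions.

```python
# Sketch contrasting a blocking receive with non-blocking polling on a shared buffer.
import queue
import threading
import time

buffer = queue.Queue()

def sender():
    time.sleep(0.1)
    buffer.put("message")          # send returns immediately once the message is buffered

def blocking_receive():
    return buffer.get(block=True)  # caller is suspended until a message arrives

def polling_receive():
    while True:                    # non-blocking: keep checking, do other work in between
        try:
            return buffer.get(block=False)
        except queue.Empty:
            time.sleep(0.01)       # "other work" would go here

threading.Thread(target=sender).start()
print(polling_receive())           # -> "message"
```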
BUFFERING
What is buffering?

When a sender sends a message to a receiver, it can use either synchronous or asynchronous communication. To ensure
the message gets delivered, it needs to be temporarily stored somewhere until the receiver is ready to get it. This storage
area can be in the sender's memory or a memory space managed by the operating system. The place where the message
is kept until the receiver retrieves it is called a buffer, and the act of storing the message there is known as buffering.

Different types of buffering are used based on the requirement of a process within a distributed operating system and
some of them are given below:

1. Null Buffering: This type of buffering does not use any buffer; instead, the sending process remains suspended until the receiver node is in a position to receive the message. Once the send operation starts, the receiver starts receiving the message, and an acknowledgement is sent once the message is delivered. On receipt of the acknowledgement, the sender node sends a message to the receiver in order to unblock it for further processing.
2. Single Message Buffering: This type of buffering uses a buffer with capacity for a single message, typically in the receiver node's address space, to ensure that the message is readily available as and when the receiver node is ready to accept it. Single message buffering performs better in some situations because the message is already available in the buffer, which helps the whole system reduce the blocking duration at different nodes.
3. Multiple Message Buffering:Multiple message buffering is commonly used in asynchronous communication
within inter-process communication in distributed systems. This type of buffering acts like a mailbox, where
messages can be stored either in the receiver’s memory or in the operating system’s memory.
• When a sender wants to send a message, it executes the send process, and the message is placed in this
mailbox. The receiver can later check the mailbox and retrieve messages whenever it is ready by
processing the receive operation.
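A small Python sketch of the mailbox idea described above; the queue module stands in for the buffer kept in the receiver's or the operating system's memory, and the capacity of 8 messages is an arbitrary assumption.

```python
# Sketch of a multiple-message buffer (mailbox): the sender deposits several messages
# without waiting, and the receiver drains them whenever it is ready.
import queue

mailbox = queue.Queue(maxsize=8)     # buffer in the receiver's / OS's memory

# Sender side: executes send and moves on immediately (asynchronous).
for i in range(3):
    mailbox.put(f"update-{i}")

# Receiver side, later: retrieves whatever has accumulated.
while not mailbox.empty():
    print("received:", mailbox.get())
```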
MULTI-DATAGRAM MESSAGES
• Inter-process communication in distributed operating systems involves transferring messages between nodes
in the form of packets, which contain various attributes like process identifier, address, and data. A datagram
is an independent packet that facilitates connectionless communication across a packet-switched network and
carries sufficient information for routing.

• Each network has a maximum transfer unit (MTU) specifying the largest datagram size allowed. If a message
exceeds the MTU, it is split into smaller datagrams, referred to as multi-datagrams, which include additional
attributes for sequencing and fragmentation.
• The packet used to carry a fragment together with its control information and data is called a datagram.

 Single-datagram Messages: A message is called a Single-datagram Message if its size is smaller than that
of the Maximum Transfer Unit (MTU) of a network. Therefore, it can be sent in a single packet on a
network.
 Multidatagram Messages: A message is called a Multidatagram Message if its size is larger than that of
the Maximum Transfer Unit (MTU) of a network. Therefore, it is sent in multiple packets on a network
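The fragmentation and reassembly of a multidatagram message can be sketched as follows; the MTU value of 16 bytes and the (sequence number, total, payload) tuple format are assumptions chosen only to keep the example small.

```python
# Sketch of multidatagram handling: a message larger than the MTU is split into
# sequence-numbered fragments and reassembled at the receiver.
MTU = 16  # bytes of payload per datagram (real MTUs are ~1500 bytes for Ethernet)

def fragment(message: bytes):
    chunks = [message[i:i + MTU] for i in range(0, len(message), MTU)]
    total = len(chunks)
    # each datagram carries (sequence number, total fragments, payload)
    return [(seq, total, chunk) for seq, chunk in enumerate(chunks)]

def reassemble(datagrams):
    ordered = sorted(datagrams, key=lambda d: d[0])   # restore original order
    assert len(ordered) == ordered[0][1], "missing fragment"
    return b"".join(chunk for _, _, chunk in ordered)

msg = b"a message that is larger than the maximum transfer unit"
print(reassemble(fragment(msg)) == msg)   # True
```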
ENCODING AND DECODING
• Messages sent from a source to a destination node can be in the form of either single or multi-datagrams. To
ensure that the message is complete and correct upon receipt, the receiver must understand the structure of the
datagram and have information consistent with that at the sender.
• This involves encoding the original data packet into a compatible form for transmission and then decoding it back
into its original form at the receiver. Encoding methods can vary, with two main representations: tagged and
untagged. Tagged representation includes detailed information about the object and data, making decoding
straightforward for the receiver. In contrast, untagged representation only contains the data, requiring the receiver
to know the encoding method in advance to successfully decode the message.
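The difference between tagged and untagged representations can be illustrated with the following Python sketch, where JSON plays the role of a tagged (self-describing) encoding and struct.pack that of an untagged one; the record fields are illustrative assumptions.

```python
# Sketch contrasting tagged and untagged encodings of the same record.
import json
import struct

record = {"msg_id": 7, "length": 5}

tagged = json.dumps(record).encode()     # self-describing: field names travel with the data
untagged = struct.pack("!II", record["msg_id"], record["length"])  # just 8 raw bytes

# Decoding: tagged data can be interpreted directly...
print(json.loads(tagged))
# ...untagged data needs the agreed-upon format string to make sense of the bytes.
print(struct.unpack("!II", untagged))
```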
Distributed Computing
Unit-2
Remote Procedural Call in Distributed Systems
• Remote Procedure Call (RPC) is a protocol used in distributed systems that allows a program to execute a procedure (subroutine) on a remote server or system as if it were a local procedure call.

• RPC enables a client to invoke methods on a server residing in a different address space (often on a different machine) as if they were local procedures.

• The client and server communicate over a network, allowing for remote interaction and computation.
Importance of Remote Procedural Call(RPC) in
Distributed Systems
• Remote Procedure Call (RPC) plays a crucial role in distributed systems by enabling seamless communication
and interaction between different components or services that reside on separate machines or servers. Here’s an
outline of its importance:

• Simplified Communication

Abstraction of Complexity: RPC abstracts the complexity of network communication, allowing developers
to call remote procedures as if they were local, simplifying the development of distributed applications.
Consistent Interface: Provides a consistent and straightforward interface for invoking remote services,
which helps in maintaining uniformity across different parts of a system.
• Enhanced Modularity and Reusability

Decoupling: RPC enables the decoupling of system components, allowing them to interact without being tightly
coupled. This modularity helps in building more maintainable and scalable systems.
Service Reusability: Remote services or components can be reused across different applications or systems,
enhancing code reuse and reducing redundancy.
Remote Procedural Call (RPC) Architecture in Distributed
Systems
• The RPC (Remote Procedure Call) architecture in distributed systems is designed to enable communication between
client and server components that reside on different machines or nodes across a network. Here’s an overview of the
RPC architecture:
1. Client and Server Components

• Client: The client is the component that makes the RPC request. It invokes a procedure or method on the remote
server by calling a local stub, which then handles the details of communication.

• Server: The server hosts the actual procedure or method that the client wants to execute. It processes incoming RPC
requests and sends back responses.

2. Stubs

• Client Stub: Acts as a proxy on the client side. It provides a local interface for the client to call the remote
procedure. The client stub is responsible for marshalling (packing) the procedure arguments into a format suitable for
transmission and for sending the request to the server.

• Server Stub: On the server side, the server stub receives the request, unmarshals (unpacks) the arguments, and
invokes the actual procedure on the server. It then marshals the result and sends it back to the client stub.
3. Marshalling and Unmarshalling

• Marshalling: The process of converting procedure arguments and return values into a format that can be
transmitted over the network. This typically involves serializing the data into a byte stream.

• Unmarshalling: The reverse process of converting the received byte stream back into the original data
format that can be used by the receiving system.
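A minimal sketch of marshalling and unmarshalling as a stub pair might perform it, using JSON as the transmission format; the procedure name "add" and its arguments are assumptions for illustration.

```python
# Sketch of marshalling/unmarshalling as client and server stubs might perform it.
import json

def marshal_call(proc_name, *args) -> bytes:
    # client stub: pack the procedure name and arguments into a byte stream
    return json.dumps({"proc": proc_name, "args": list(args)}).encode()

def unmarshal_call(data: bytes):
    # server stub: recover the procedure name and arguments from the byte stream
    call = json.loads(data)
    return call["proc"], call["args"]

wire_bytes = marshal_call("add", 2, 3)
print(unmarshal_call(wire_bytes))   # ('add', [2, 3])
```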

4. Communication Layer

• Transport Protocol: RPC communication usually relies on a network transport protocol, such as TCP or
UDP, to handle the data transmission between client and server. The transport protocol ensures that data
packets are reliably sent and received.

• Message Handling: This layer is responsible for managing network messages, including routing, buffering,
and handling errors.
Transparency in a Distributed System
Transparency refers to hiding the complexities of the system’s implementation details from users and applications. It
aims to provide a seamless and consistent user experience regardless of the system’s underlying architecture,
distribution, or configuration. Transparency ensures that users and applications interact with distributed resources in
a uniform and predictable manner, abstracting away the complexities of the distributed nature of the system.
Importance of Transparency in Distributed Systems
Transparency is very important in distributed systems because of:

• Simplicity and Abstraction: Allows developers and users to interact with complex distributed systems using
simplified interfaces and abstractions.

• Consistency: Ensures consistent behavior and performance across different parts of the distributed system.

• Ease of Maintenance: Facilitates easier troubleshooting, debugging, and maintenance by abstracting away
underlying complexities.

• Scalability: Supports scalability and flexibility by allowing distributed components to be added or modified
without affecting overall system functionality.
Types of Transparency in Distributed Systems
1. Location Transparency
Location transparency refers to the ability to access distributed resources without knowing their physical or network
locations. It hides the details of where resources are located, providing a uniform interface for accessing them.
• Importance: Enhances system flexibility and scalability by allowing resources to be relocated or replicated without
affecting applications.

• Examples:

• DNS (Domain Name System): Maps domain names to IP addresses, providing location transparency for web
services.
• Virtual Machines (VMs): Abstract hardware details, allowing applications to run without knowledge of the
underlying physical servers.
2. Access Transparency
Access transparency ensures that users and applications can access distributed resources uniformly, regardless of the
distribution of those resources across the network.
• Significance: Simplifies application development and maintenance by providing a consistent method for accessing
distributed services and data.

• Methods:

• Remote Procedure Call (RPC): Allows a program to call procedures located on remote systems as if they were
local.
3. Concurrency Transparency
Concurrency transparency hides the complexities of concurrent access to shared resources in distributed systems
from the application developer. It ensures that concurrent operations do not interfere with each other.
• Challenges: Managing synchronization, consistency, and deadlock avoidance in a distributed environment
where multiple processes or threads may access shared resources simultaneously.

• Techniques:

• Locking Mechanisms: Ensure mutual exclusion to prevent simultaneous access to critical sections of
code or data.
• Transaction Management: Guarantees atomicity, consistency, isolation, and durability (ACID
properties) across distributed transactions.
4. Replication Transparency
Replication transparency ensures that clients interact with a set of replicated resources as if they were a
single resource. It hides the presence of replicas and manages consistency among them.
• Strategies: Maintaining consistency through techniques like primary-backup replication, where one
replica (primary) handles updates and others (backups) replicate changes.

• Applications:

• Content Delivery Networks (CDNs): Replicate content across geographically distributed servers to
reduce latency and improve availability.
5. Failure Transparency
Failure transparency ensures that the occurrence of failures in a distributed system does not disrupt service
availability or correctness. It involves mechanisms for fault detection, recovery, and resilience.
• Approaches:

• Heartbeating: Periodically checks the availability of nodes or services to detect failures.


• Replication and Redundancy : Uses redundant components or data replicas to continue operation
despite failures.
• Examples:

• Load Balancers: Distribute traffic across healthy servers and remove failed ones from the pool
automatically.
6. Performance Transparency
Performance transparency ensures consistent performance levels across distributed nodes despite variations in
workload, network conditions, or hardware capabilities.
• Challenges: Optimizing resource allocation and workload distribution to maintain predictable performance
levels across distributed systems.

• Strategies:

• Load Balancing: Distributes incoming traffic evenly across multiple servers to optimize resource
utilization and response times.
7. Security Transparency
Security transparency ensures that security mechanisms and protocols are integrated into a distributed system
seamlessly, protecting data and resources from unauthorized access or breaches.
• Importance: Ensures confidentiality, integrity, and availability of data and services in distributed environments.

• Techniques:

• Encryption: Secures data at rest and in transit using cryptographic algorithms to prevent eavesdropping or
tampering.
• Access Control: Manages permissions and authentication to restrict access to sensitive resources based on
user roles and policies.
8. Management Transparency
Management transparency simplifies the monitoring, control, and administration of distributed systems by providing
unified visibility and control over distributed resources.
• Methods: Utilizes automation, monitoring tools, and centralized management interfaces to streamline operations
and reduce administrative overhead.

• Examples:

• Cloud Management Platforms (CMPs): Provide unified interfaces for provisioning, monitoring, and
managing cloud resources across multiple providers.
• Configuration Management Tools: Automate deployment, configuration, and updates of software and
infrastructure components in distributed environments.
What is RPC mechanism?
RPC enables a client to communicate with a server by calling procedures in much the same way as a conventional local procedure call, except that the called procedure is executed in a different process, usually on a different computer.
The steps in making a RPC
• Client procedure calls the client stub in a normal way.
• Client stub builds a message and traps to the kernel.
• Kernel sends the message to remote kernel.
• Remote kernel gives the message to server stub.
• Server stub unpacks parameters and calls the server.
• Server computes results and returns it to server stub.
• Server stub packs results in a message to client and traps to kernel.
• Remote kernel sends message to client's kernel.
• Client kernel gives message to client stub.
• Client stub unpacks results and returns to client.
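The steps above can be traced end to end with Python's standard xmlrpc module, in which the ServerProxy object plays the client stub and SimpleXMLRPCServer the server stub; the host, port, and the add procedure are illustrative assumptions.

```python
# End-to-end sketch of the RPC steps using Python's standard xmlrpc stubs.
import threading
from xmlrpc.server import SimpleXMLRPCServer
import xmlrpc.client

def add(x, y):
    return x + y               # the remote procedure hosted by the server

server = SimpleXMLRPCServer(("localhost", 8001), logRequests=False)
server.register_function(add, "add")
threading.Thread(target=server.handle_request, daemon=True).start()  # serve one call

proxy = xmlrpc.client.ServerProxy("http://localhost:8001")  # client stub
print(proxy.add(2, 3))         # looks like a local call; actually crosses the network
```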
RPC Implementation Mechanism
• RPC is an effective mechanism for building
client-server systems that are distributed. RPC
enhances the power and ease of programming
of the client/server computing concept. It’s a
protocol that allows one program to request a
service from another program on another
computer in a network without having to know
about the network. The software that makes
the request is called a client, and the
program that provides the service is called a
server.
• The calling parameters are sent to the remote
process during a Remote Procedure Call, and
the caller waits for a response from the remote
procedure.
When the client process makes a request by calling a local procedure, the procedure packs the arguments/parameters into a request message so that they can be sent to the remote server. The remote server then executes the procedure call (based on the request that arrived from the client machine) and, after execution, returns a response to the client in the form of a message. Until then the client is blocked, but as soon as the response arrives from the server side it can extract the result from the message. In some cases, RPCs can also be executed asynchronously, in which case the client is not blocked waiting for the response.
Parameter passing Semantics in RPC
• When a client sends a procedure call to a server over network, parameters of procedure need to be transmitted to
server. RPC uses different parameter passing methods to transmit these parameters.
• The parameter passing is the only way to share information between clients and servers in the Remote Procedure
Call (RPC).
The following are the various semantics that is used in RPC for passing parameters in distributed applications:
1. Call-by-Value: The client stub copies and packages the value from the client into a message so that it can be
sent to the server through a network.
• Parameter marshaling is like packing a suitcase for a trip. When a program on your computer (the client) wants to
use a function on another computer (the server), it needs to send the necessary information (parameters) over.
• 1. Packing (Client Stub): The client stub takes the parameters (like `x` and `y` for the `add` function) and puts
them into a message, like packing items into a suitcase. It also includes what function needs to be called (`add`).
• 2. Sending: The message is sent to the server.
• 3. Unpacking (Server):The server unpacks the "suitcase" to see what function is needed and what parameters to
use.
• 4. Execution (Server): The server executes the function.
• 5. Repacking (Server): The server puts the result back into a message and sends it back to the client.
• 6. Unpacking (Client Stub): The client stub unpacks the result and gives it back to the program that made the call.
This works smoothly if both computers speak the same language and the data is simple (like numbers or letters).

2. Call-by-Reference: Call-by-reference simply means that pointers to the parameters are transferred from the client to the server. In some RPC techniques, parameters can be passed by reference. This is employed in closed systems in which multiple processes share a single address space.
Pointers (memory addresses) don't translate between computers in a distributed system. Directly passing a pointer from
a client to a server won't work because the address is only valid on the client's machine. The "asking back and forth"
solution is possible, but slow.
• When passing arrays, the variable's address is supplied; the same applies to pointer-based data structures such as lists, trees, stacks, graphs, and so on.
• Call-by-object-reference: Here the RPC mechanism uses object invocation. The value held by a variable is a reference to an object.
• Call-by-move: A parameter is passed by reference, much like in the call-by-object-reference method, but the parameter object is relocated to the target node (the callee) during the call. It is termed call-by-visit if it remains at the caller's node. This allows the argument objects to be packaged in the same network packet as the invocation message, which in turn reduces network traffic and message count.
Call Semantics in RPC
RPC has the same semantics as a local procedure call, the calling process calls the procedure, gives inputs to it,
and then waits while it executes. When the procedure is finished, it can return results to the calling process. If
the distributed system is to achieve transparency, the following problems concerning the properties of remote
procedure calls must be considered in the design of an RPC system:
• Binding: Binding establishes a link between the caller process’s name and the remote procedure’s location.

• Communication Transparency: It should be unknown to the users that the process they are calling is
remote.

• Concurrency: Communication techniques should not mix with concurrency mechanisms. When single-
threaded clients and servers are blocked while waiting for RPC results, considerable delays might occur.
Lightweight processes permit the server to handle concurrent calls from several clients.

• Heterogeneity: Separate machines may have distinct data representations, operate under different operating
systems, or have remote procedures written in different languages.
Types of Call Semantics:
• Perhaps or Possibly Call Semantics: This is the weakest semantics. The caller waits until a predetermined timeout period elapses and then continues with its execution, with no guarantee that the call was executed. It is used in services where periodic updates are required.

• Last-one Call Semantics: Retransmission of the call message is based on a timeout. After the timeout period elapses, the result obtained from the last execution is used by the caller. This approach can generate orphan calls. It finds its application in the design of simple RPC systems.

• Last-of-Many Call Semantics: It is like last-one call semantics, but the difference is that it neglects orphan calls through call identifiers. A new call-id is assigned to the call whenever it is repeated. A result is accepted by the caller only if its call-id matches that of the most recently repeated call.

• At-least-once Call Semantics: The call is executed one or more times, but which execution's results are returned to the caller is not specified. Here too, retransmissions rely on the timeout period without regard to orphan calls. In the case of nested calls, if there are orphan calls, the result is taken from the first response message to arrive and the others are ignored, regardless of whether the accepted response came from an orphan or not.

• Exactly-once Call Semantics: No matter how many times the call is transmitted, the possibility of the procedure being executed more than once is eliminated. Only when the server receives an acknowledgement from the client does it delete the corresponding information from its reply cache.
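A sketch of the server-side reply cache that underlies exactly-once semantics; request identifiers are assumed to be unique per call, and the add-style procedure is purely illustrative.

```python
# Sketch of a server-side reply cache yielding exactly-once behaviour:
# a retransmitted request (same request id) gets the cached reply instead of
# being executed a second time.
reply_cache = {}
execution_count = 0

def handle_request(request_id, x, y):
    global execution_count
    if request_id in reply_cache:        # duplicate: filter it, resend the old reply
        return reply_cache[request_id]
    execution_count += 1                 # execute the procedure exactly once
    result = x + y
    reply_cache[request_id] = result
    return result

print(handle_request("req-1", 2, 3))     # 5, executed
print(handle_request("req-1", 2, 3))     # 5 again, served from the cache
print(execution_count)                   # 1
```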
Communication Protocols For RPCs
The following are the communication protocols that are used in RPC:

• Request Protocol

• Request/Reply Protocol

• The Request/Reply/Acknowledgement-Reply Protocol


Request Protocol:
• The Request Protocol (R protocol) is a way for a client to tell
a server to do something in Remote Procedure Call (RPC).
Once the client sends the request, it doesn't expect anything
back – no result, no confirmation. It's like sending a letter
without needing a reply.
• One-Way Communication: Only one message goes from the
client to the server.
• "Maybe" Semantics: The server might do the request, or
maybe not. The client doesn't retry if it fails, so it's not
guaranteed.
• Asynchronous RPC: This protocol is used in asynchronous
RPC to speed things up because the client doesn't wait for a
response.
• Transport Reliability: Since the protocol itself provides no retransmission, it is best used over a reliable transport protocol like TCP (which handles retransmission automatically), rather than UDP.
• Use Case: This is often used for things like periodic updates,
where it's okay if some updates are missed. An example
application is the Distributed System Window.
Request/Reply Protocol:
• The Request-Reply Protocol is also known as the RR protocol.
• This protocol is based on the idea of using implicit acknowledgements instead of explicit acknowledgements.

• Here, a reply from the server is treated as the acknowledgement (ACK) of the client's request message, and a client's following call is considered an acknowledgement (ACK) of the server's reply to the previous call made by the client.

• To deal with failures such as lost messages, the timeout-based retransmission technique is used with the RR protocol.

• If a client does not get a response message within the predetermined timeout period, it retransmits the request message.

• Exactly-once semantics can be provided by servers that hold responses in a reply cache, which helps filter duplicate request messages; cached replies are retransmitted without processing the request again.

• If there is no mechanism for filtering duplicate messages, then at-least-once call semantics is provided by the RR protocol in combination with timeout-based retransmission.
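A client-side sketch of the timeout-based retransmission used with the RR protocol; send_request here is a stand-in callable, not a real transport API, and the retry and timeout values are arbitrary assumptions.

```python
# Client-side sketch of the RR protocol's timeout-based retransmission.
import socket

def call_with_retries(send_request, retries=3, timeout=2.0):
    """Send a request; retransmit on timeout; the reply doubles as the ACK."""
    for _ in range(retries):
        try:
            return send_request(timeout)
        except socket.timeout:
            continue                  # lost request or lost reply: try again
    raise TimeoutError("no reply after retransmissions")

# Toy transport that fails once and then answers, to exercise the retry path.
attempts = {"n": 0}
def flaky_send(timeout):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise socket.timeout()
    return "reply"

print(call_with_retries(flaky_send))  # "reply" on the second attempt
```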
The Request/Reply/Acknowledgement-Reply Protocol:
• This protocol is also known as the RRA protocol
(request/reply/acknowledge-reply).

• The exactly-once semantics provided by the RR protocol relies on responses being held in the reply cache of servers, which can result in the loss of replies that have not been delivered.

• The RRA (Request/Reply/Acknowledgement-Reply) Protocol is used to get rid of this drawback of the RR (Request/Reply) Protocol.

• In this protocol, the client acknowledges receipt of the reply message, and only when the server gets this acknowledgement back from the client does it delete the information from its cache.

• Because the reply acknowledgement message may at times be lost, the RRA protocol requires uniquely ordered message identifiers. These keep track of the series of acknowledgements that have been sent.
Distributed Shared Memory
• A distributed shared memory (DSM) system is a collection of many nodes/computers which are
connected through some network and all have their local memories. The DSM system manages
the memory across all the nodes. All the nodes/computers transparently interconnect and process.
The DSM also makes sure that all the nodes are accessing the virtual memory independently
without any interference from other nodes. The DSM does not have a single physical memory; instead, a virtual address space is shared among all the nodes, and the transfer of data occurs among them through the main memory.
• In DSM every node has its own memory and provides memory read and write services and it
provides consistency protocols. The distributed shared memory (DSM) implements the shared
memory model in distributed systems but it doesn’t have physical shared memory.
Distributed shared memory
Types of Distributed Shared Memory

1. On-Chip Memory
• The data is present in the CPU portion of the chip.

• Memory is directly connected to address lines.

2. Bus-Based Multiprocessors
• A set of parallel wires called a bus acts as a connection between CPU and memory.

• Simultaneous access to the same memory by multiple CPUs is prevented by using certain algorithms.

3. Ring-Based Multiprocessors
• There is no global centralized memory present in Ring-based DSM.

• All nodes are connected via a token passing ring.

• In ring-based DSM, a single address space is divided into a shared area.
Architecture of Distributed Shared Memory(DSM)
The architecture of a Distributed Shared Memory (DSM) system typically consists of several key components that work together to provide the illusion of a shared memory space across distributed nodes. The components of the DSM architecture are:

1.Nodes: Each node in the distributed system consists of one or more CPUs and a memory unit. These nodes are
connected via a high-speed communication network.

2.Memory Mapping Manager Unit: The memory mapping manager routine in each node is responsible for
mapping the local memory onto the shared memory space. This involves dividing the shared memory space
into blocks and managing the mapping of these blocks to the physical memory of the node.

• Caching is employed to reduce operation latency. Each node uses its local memory to cache portions of the shared memory space. The memory mapping manager treats the local memory as a cache for the shared memory space, with memory blocks as the basic unit of caching.

• 3.Communication Network Unit: This unit facilitates communication between nodes. When a process
accesses data in the shared address space, the memory mapping manager maps the shared memory address to
physical memory. The communication network unit handles the communication of data between nodes,
ensuring that data can be accessed remotely when necessary.
Algorithm for implementing Distributed Shared Memory
1. Central Server Algorithm:
• In this, a central server maintains all shared data. It
services read requests from other nodes by
returning the data items to them and write requests
by updating the data and returning
acknowledgement messages.

• A time-out can be used in case of a failed acknowledgement, while sequence numbers can be used to avoid duplicate write requests.

• It is simpler to implement, but the central server can become a bottleneck; to overcome this, the shared data can be distributed among several servers. This distribution can be by address or by using a mapping function to locate the appropriate server.
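A minimal sketch of the central-server idea under these assumptions (sequence numbers per client, acknowledgement replies); the class and method names are illustrative:

```python
# Central-server DSM (illustrative sketch): all shared data lives at one server.

class CentralServer:
    def __init__(self):
        self.data = {}        # shared data items: address -> value
        self.last_seq = {}    # client_id -> last write sequence number applied

    def read(self, address):
        # Read requests are serviced by returning the data item.
        return self.data.get(address)

    def write(self, client_id, seq_no, address, value):
        # Sequence numbers let the server detect duplicate write requests,
        # e.g. a retransmission sent after a lost acknowledgement.
        if seq_no <= self.last_seq.get(client_id, -1):
            return "ACK"                  # duplicate: acknowledge without reapplying
        self.data[address] = value
        self.last_seq[client_id] = seq_no
        return "ACK"                      # acknowledgement message back to the client

server = CentralServer()
print(server.write("client-1", 0, 0x10, 42))   # applied and acknowledged
print(server.write("client-1", 0, 0x10, 99))   # duplicate seq_no: ignored
print(server.read(0x10))                       # 42
```

A client that sees no acknowledgement before its time-out simply retransmits the write; the sequence-number check keeps the retransmission from being applied twice.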
2. Migration Algorithm:
• In contrast to the central-server algorithm, where every data access request is forwarded to the location of the data, here the data is shipped to the location of the data access request, which allows subsequent accesses to be performed locally.

• It allows only one node to access a shared data item at a time, and the whole block containing the data item migrates instead of only the individual item requested.

• It is susceptible to thrashing, where pages frequently migrate between nodes while servicing only a few requests.

• This algorithm provides an opportunity to integrate DSM with the virtual memory provided by the operating system at individual nodes.
3. Read Replication Algorithm:
• This extends the migration algorithm by
replicating data blocks and allowing multiple
nodes to have read access or one node to have
both read write access.
• It improves system performance by allowing
multiple nodes to access data concurrently.
• The write operation in this algorithm is expensive, as all copies of a shared block at the various nodes will have to be either invalidated or updated with the current value to maintain the consistency of the shared data block.
• The DSM must keep track of the location of all copies of data blocks in this algorithm.
4. Full Replication Algorithm:
• It is an extension of read replication
algorithm which allows multiple nodes to
have both read and write access to shared
data blocks.
• Since many nodes can write shared data concurrently, access to the shared data must be controlled to maintain its consistency.
• To maintain consistency, a gap-free sequencer can be used: all nodes wishing to modify shared data send the modification to the sequencer, which assigns a sequence number and multicasts the modification together with the sequence number to all nodes that have a copy of the shared data item.
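The gap-free sequencer described above might look roughly like the following sketch; the multicast loop and the replica bookkeeping are illustrative assumptions:

```python
# Full-replication DSM (illustrative sketch): a sequencer totally orders writes.

class ReplicaNode:
    def __init__(self):
        self.memory = {}
        self.expected_seq = 0
        self.pending = {}     # out-of-order deliveries held back until the gap closes

    def deliver(self, seq, address, value):
        self.pending[seq] = (address, value)
        # Apply modifications strictly in sequence-number order; a gap means a
        # multicast was lost and would have to be re-requested (not shown here).
        while self.expected_seq in self.pending:
            addr, val = self.pending.pop(self.expected_seq)
            self.memory[addr] = val
            self.expected_seq += 1

class Sequencer:
    def __init__(self, replica_nodes):
        self.next_seq = 0
        self.replicas = replica_nodes       # nodes holding a copy of the shared data

    def submit_modification(self, address, value):
        seq = self.next_seq                 # assign a gap-free sequence number
        self.next_seq += 1
        for node in self.replicas:          # multicast (seq, modification) to all copies
            node.deliver(seq, address, value)
        return seq

nodes = [ReplicaNode(), ReplicaNode()]
sequencer = Sequencer(nodes)
sequencer.submit_modification("x", 1)
sequencer.submit_modification("x", 2)
print(nodes[0].memory, nodes[1].memory)     # both replicas applied writes in the same order
```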
Consistency models in distributed systems
• Consistency models establish criteria for data synchronization and specify how users and
applications should interpret data changes across several nodes. In a distributed system, it
specifically controls how data is accessed and changed across numerous nodes and how clients are
informed of these updates. These models range from strict to relaxed approaches.
Types of Consistency Models
1. Strong Consistency

• All nodes in the system agree on the order in which operations occurred. Reads always return the most recent version of the data: when an update occurs on one server, this model makes sure every other server in the system reflects the change immediately. This model provides the highest level of consistency, but it can be slower and require more resources in a distributed environment since all servers must stay perfectly in sync.
2. Sequential Consistency Model
• It is a consistency model in distributed systems that ensures all operations across processes appear
in a single, unified order. In this model, every read and write operation from any process appears to
happen in sequence, regardless of where it occurs in the system. Importantly, all processes observe
this same sequence of operations, maintaining a sense of consistency and order across the system.

3. Weak Consistency Model


• A weakly consistent system provides no guarantees about the ordering of operations or the state of
the data at any given time. Clients may see different versions of the data depending on which node
they connect to. This model provides the highest availability and scalability but at the cost of
consistency.

4. Session Consistency
• Session Consistency guarantees that all of the data and actions a user engages with within a single
session remain consistent. Consider it similar to online shopping: session consistency ensures that
an item will always be in your cart until you check out or log out, regardless of how you explore the
page.
5. Causal Consistency Model
The Causal Consistency Model is a type of consistency in distributed systems that ensures that related
events happen in a logical order. In simpler terms, if two operations are causally related (like one action
causing another), the system will make sure they are seen in that order by all users. However, if there’s no
clear relationship between two operations, the system doesn’t enforce an order, meaning different users
might see the operations in different sequences.
Replacement strategy for a distributed caching system
• Problem: Existing cache strategies don't fully address the unique temporal and spatial access
patterns of geospatial data, leading to suboptimal cache hit rates.
• Proposed Solution: A new cache replacement strategy that considers both temporal and spatial
locality in geospatial data access.
• Temporal Locality: Tracks access frequency and time intervals using a modified LRU
approach.
• Spatial Locality: Builds access sequences based on an LRU stack to capture spatial
relationships and caching locations.
• Balancing Act: Chooses replacement objects based on access sequence length and caching
resource costs to balance temporal and spatial locality.
• Benefits: Improved cache hit rate, better response performance, and higher system throughput,
making it suitable for cloud-based networked GIS environments.
Thrashing
• Thrashing occurs when the system spends a major portion of its time transferring shared data blocks from one node to another, compared with the time spent doing the useful work of executing application processes. If thrashing is not handled carefully, it degrades system performance considerably.
Why Thrashing Occurs??
1. Ping-pong effect: when processes make interleaved data accesses on two or more nodes, a data block may move back and forth between nodes in quick succession; this is known as the ping-pong effect.

2. When blocks with read-only permission are repeatedly invalidated soon after they are replicated. This is caused by poor locality of reference.

3. When data is being modified by multiple nodes at the same instant.


How to control Thrashing?
1. Providing application-controlled locks
• Data is locked for a short period of time to prevent other nodes from accessing it, which prevents thrashing.

• For this method, an application-controlled lock can be associated with each data block.

2. Nailing a block to the node for a minimum amount of time(t):


• A block is not allowed to be taken away from a node until a minimum amount of time t has passed after it was allocated to that node (see the sketch after this list).

• The time t can be fixed statically or tuned dynamically on the basis of past access patterns.

3. Tailoring the coherence algorithm to the shared data usage patterns


• Different coherence protocols for shared data having different characteristics can be used to minimize
thrashing.
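As an illustration of the second technique, the sketch below "nails" a block to a node for a minimum hold time t; the time source, the table structure, and the value of t are assumptions made for the example:

```python
import time

# "Nailing" a block to a node (illustrative sketch): refuse to migrate a block
# until a minimum hold time t has elapsed since it arrived at the node.

MIN_HOLD_SECONDS = 0.05   # the time t; could also be tuned from past access patterns

class BlockOwnerTable:
    def __init__(self):
        self.allocated_at = {}    # block_id -> time the block arrived at this node

    def on_block_arrival(self, block_id):
        self.allocated_at[block_id] = time.monotonic()

    def may_release(self, block_id):
        # A remote request for the block is honoured only after the minimum
        # hold time has passed, which damps ping-pong migrations.
        held_for = time.monotonic() - self.allocated_at.get(block_id, 0.0)
        return held_for >= MIN_HOLD_SECONDS

table = BlockOwnerTable()
table.on_block_arrival("block-7")
print(table.may_release("block-7"))   # False immediately after arrival
```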
Unit-3
Distributed Computing
RESOURCE MANAGEMENT
Resource management in distributed computing involves the planning, organizing, and controlling of
resources across a network of interconnected systems. It ensures that all resources (both physical and
logical) are efficiently allocated, monitored, and utilized according to user needs and system
demands.
Key Aspects of Resource Management Systems
1. Identification and Location:
1. Recognize available resources by name.

2. Know their current location and features.

2. Resource Attributes:
1. Define functionality, structure, and properties of resources.

2. Use attributes for matching client requirements with available resources.


3. Accessibility and Functionality:
1. Ensure resources are accessible and functioning correctly.

2. Provide failure signals when issues arise.

4. Control and Sharing:


1. Manage access to resources based on allocation and optimization rules.

2. Support fair sharing while maintaining performance and security.

5. Centralized vs. Distributed Approaches:


1. Understand the differences in resource management methods due to the physical distribution
of resources in distributed systems.
Task assignment approach
Task assignment approach is the technique used for scheduling processes of a distributed system. In
this approach each process submitted by a user for processing is viewed as a collection of related
tasks and these tasks are scheduled to suitable nodes so as to improve performance.
A process is considered to be made up of multiple tasks.

Assumptions made in this approach are:


- A process has already been split into pieces called tasks.
- Amount of computation required by each task and the speed of each processor are known.
- Cost of processing each task on every node of the system is known.
- Interprocess Communication (IPC) costs between every pair of tasks are known.
Goals Of Task assignment algorithms

The main goals to be achieved in task assignment algorithms are:

1. Minimization of IPC costs


2. Quick turnaround time
3. High degree of parallelism
4. Efficient utilization of system resources in general
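Under the assumptions listed above (known execution costs and known IPC costs), an assignment can be scored and the cheapest one chosen. The brute-force search and the sample cost numbers below are purely illustrative:

```python
from itertools import product

# Task assignment (illustrative sketch): score an assignment as
# total cost = execution cost + IPC cost between tasks placed on different nodes,
# then pick the cheapest mapping by brute force. All numbers are made up.

exec_cost = {                          # exec_cost[task][node], assumed known
    "t1": {"n1": 5, "n2": 10},
    "t2": {"n1": 2, "n2": 3},
    "t3": {"n1": 4, "n2": 4},
}
ipc_cost = {("t1", "t2"): 6, ("t2", "t3"): 4, ("t1", "t3"): 0}   # assumed known

tasks, nodes = list(exec_cost), ["n1", "n2"]

def total_cost(assignment):
    cost = sum(exec_cost[t][assignment[t]] for t in tasks)
    for (a, b), c in ipc_cost.items():
        if assignment[a] != assignment[b]:   # IPC cost is paid only across nodes
            cost += c
    return cost

best = min((dict(zip(tasks, placement)) for placement in product(nodes, repeat=len(tasks))),
           key=total_cost)
print(best, total_cost(best))
```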
Load Balancing Approach

A local distributed computer system consists of a


number of individual computers connected by a local
network. In such a system it can happen that some computers are idle while more than one user process on other computers is waiting for service. In order to
fully utilize all the computational resources (processors)
and, therefore, to improve the overall performance, load
sharing and balancing are proposed.
Load balancing algorithm tries to balance the total
system load by transparently transferring the workload
from heavily loaded nodes to lightly loaded nodes. It
ensures overall good performance of the system. Basic
goal of almost all load balancing algorithms is to
maximize the system throughput.
Taxonomy of Load Balancing Algorithms
Static Vs Dynamic:- Static algorithms use only information about the average behavior of the system,
ignoring the current system state, while dynamic algorithms react to system-state changes. Static algorithms are simple because there is no need to maintain state information, but they suffer from being unable to react to the current system state; dynamic algorithms are the opposite.
Deterministic Vs probabilistic:-Deterministic algorithms use info about the properties of the nodes and the
characteristics of the processes to be scheduled to deterministically allocate processes to nodes. Probabilistic
uses information regarding static attributes of the system e.g. number of nodes, processing capability of
each node, network topology etc. to formulate simple process placement rules. But deterministic is difficult
to optimize and is more expensive to implement while probabilistic is easier but has poor performance.
Centralized Vs Distributed:- Decisions are centralized (made at one node) in centralized algorithms and distributed among the nodes in distributed algorithms. In the centralized case, the central node can effectively make process assignment decisions because
it knows both the load at each node and the number of processes needing service. Each node is responsible
for updating the central server with its state information, but it suffers from poor reliability. Replication is
thus necessary but with the added cost of maintaining information consistency. For distributed, there is no
master entity i.e. each entity is responsible for making scheduling decision for the processes of its own
node, in either transferring local processes or accepting of remote processes.
Cooperative Vs non-cooperative:- In non-cooperative algorithms, entities act autonomously, making scheduling decisions independently of the others; in cooperative algorithms, the entities coordinate their scheduling decisions. Cooperative algorithms are more complex and incur larger overheads, but display better stability.
Load-sharing Approach
Load sharing approach also aims for efficient resource utilization. It attempts to ensure that no node
is idle while processes wait for service at some other node. In this approach it is sufficient to know
whether a node is busy or idle. Load sharing approach does not attempt to balance the average
workload on all the nodes of the system.
Load balancing has been criticized for:
- The overhead involved in gathering state info is very large.
- It’s not achievable because the number of processes in a node is always fluctuating.
Thus it is necessary and sufficient to prevent some nodes from being idle while others are busy,
which is the essence of sharing. Thus, policies for load sharing include:
a. Load estimation policies:- Since load sharing algorithms simply attempt to ensure that no node is
idle while processes wait for service at some other node, it is sufficient to know whether a node is
busy or idle.
b. Process transfer policies:- Since load sharing algorithms are normally interested only in the busy
or idle states of a node, most of them employ the all-or-nothing strategy. In the all-or-nothing
strategy, a node that becomes idle is unable to immediately acquire new processes to execute even
though processes wait for service at other nodes, resulting in a loss of available processing power
in the system.
c. Location Policies:- Location policy decides the sender node or the receiver node of a process that
is to be moved within the system for load sharing. Depending on the type of the node that takes the
initiative there are two different location policies. They are:
1. Sender-initiated policy - The sender node of the process decides where to send the process.
Heavily loaded nodes search for lightly loaded nodes to which work has to be transferred. When a
node’s load becomes more than the threshold value, it either broadcasts a message or randomly
probes the other nodes one by one to find a lightly loaded node that can accept one or more
processes (see the sketch below).
2. Receiver-initiated policy - The receiver node of the process decides where to get the process.
Lightly loaded nodes search for heavily loaded nodes from which work can be transferred. When a
node’s load falls below the threshold value it either broadcasts a message indicating its willingness
to receive processes, or randomly sends probes to heavily loaded nodes.
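The sender-initiated policy can be sketched as threshold checking plus random probing, as below; the Node class, the threshold value, and the probe limit are illustrative assumptions:

```python
import random

# Sender-initiated location policy (illustrative sketch): a heavily loaded node
# probes a few randomly chosen nodes and transfers a process to the first
# lightly loaded node found.

THRESHOLD = 4      # load level above which a node tries to shed work
PROBE_LIMIT = 3    # how many random nodes to probe before giving up

class Node:
    def __init__(self, name, queue=None):
        self.name = name
        self.queue = list(queue or [])    # processes waiting at this node

    def load(self):
        return len(self.queue)

    def accept(self, process):
        self.queue.append(process)

def try_transfer(sender, others):
    if sender.load() <= THRESHOLD:
        return None                       # not heavily loaded: do nothing
    for candidate in random.sample(others, min(PROBE_LIMIT, len(others))):
        if candidate.load() < THRESHOLD:  # probe found a lightly loaded node
            candidate.accept(sender.queue.pop())
            return candidate
    return None                           # no willing receiver found

busy, idle = Node("n1", ["p1", "p2", "p3", "p4", "p5", "p6"]), Node("n2")
receiver = try_transfer(busy, [idle])
print(receiver.name if receiver else None, idle.queue)
```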
Process Migration
• Process migration is the relocation of a process from its current location (source node) to another
node (destination node).The process can be either a non-preemptive or preemptive process.
Selection of the process to be migrated, selection of the destination node and the actual transfer of
the selected process are the three steps involved in process migration.
Key Steps in Process Migration:

1. State Saving:
1. Freeze the selected process to halt its execution.

2. Save the process control data structure (PCB), ports, and memory space.

2. Data Transfer:
1. Send the state information to the destination computer.

2. The receiving side prepares the execution environment.


3. Execution Resumption:
1. Destroy the original frozen process after successful reception.

2. Allow the migrated process to resume execution.

4. Completion Confirmation:
1. After execution, return results or the process image to the original computer.

2. Confirm receipt of results before cleaning up on the destination computer

5. Process Restoration:
1. Allocate data structures on the receiving computer using the received information.

2. Acknowledge the successful transfer to the sender.
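The steps above can be strung together roughly as in the following sketch; the ProcessState fields, the Node class, and its methods are illustrative assumptions, not a real operating-system interface:

```python
from dataclasses import dataclass

# Process migration (illustrative sketch of freeze -> transfer -> resume).

@dataclass
class ProcessState:
    pid: int
    pcb: dict            # saved control data: program counter, registers, ports, ...
    memory_image: bytes  # address-space contents

class Node:
    def __init__(self, name):
        self.name = name
        self.processes = {}                 # pid -> ProcessState

    def capture_state(self, pid):
        return self.processes[pid]          # freeze the process and save its state

    def destroy(self, pid):
        del self.processes[pid]             # remove the original frozen copy

    def restore_and_resume(self, state):
        self.processes[state.pid] = state   # allocate structures from the received state
        print(f"process {state.pid} resumed on {self.name}")

def migrate(pid, source, destination):
    state = source.capture_state(pid)       # 1. save state on the source node
    destination.restore_and_resume(state)   # 2. transfer, restore and resume on destination
    source.destroy(pid)                     # 3. clean up after successful reception

node_a, node_b = Node("node-A"), Node("node-B")
node_a.processes[7] = ProcessState(7, {"pc": 0x400}, b"\x00" * 16)
migrate(7, node_a, node_b)
```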


Features of a good process migration mechanism
A good process migration mechanism must possess transparency, minimal interferences, minimal
residue dependencies, efficiency, robustness and communication between coprocesses.
Transparency :
Transparency is an important requirement for a system that supports process migration. Following
levels of transparency can be identified:
• Object access level: Transparency at the object access level is the minimum requirement for a
system to support non-preemptive process migration facility. If a system supports transparency at
object access level, access to objects such as files and devices can be done at location independent
manner.
• System calls and inters process communication level: In this case, a migrated process does not
continue to depend upon its originating node after being migrated. It is necessary that all system
calls including inter process communication are location independent.
Minimal Interference
• Migration of a process should cause minimal interference to the progress of the process involved
and to the system as a whole. One method to achieve this is to minimize the freezing time of the
process being executed. Freezing time is defined as the time period for which the execution of the
process is stopped for transferring its information to the destination node.
Minimal Residual Dependencies
A migrated process should not in any way continue to depend on its previous node once it has started
executing on its new node. Otherwise, the following problems occur: the migrated process continues to impose a load on its previous node, and a failure or reboot of the previous node causes the process to fail.
Efficiency
The main sources of inefficiency are: the time required for migrating a process, the cost of locating an object, and the cost of supporting remote execution once the process is migrated.
Robustness
The failure of any node other than the one on which a process is currently running should not in any way affect the accessibility or execution of that process.
Communication between coprocesses of a job
Communication between coprocesses of a job should be carried out at minimal cost, and coprocesses should be able to communicate directly with each other irrespective of their locations.
Benefits of process migration.
• Reducing average response time of processes: The process migration facility is used to reduce the average response time of the processes of a heavily loaded node by migrating some of them to a node that is either idle or whose processing capacity is underutilized.
• Speeding up individual jobs: There are two methods to speed up individual jobs. One method is to migrate
the tasks of a job to different nodes to execute them concurrently. Another method is to migrate a job to a
node having a faster CPU.
• Gaining higher throughput: The throughput is increased by using a suitable load balancing approach. And it
also executes a mixture of I/O and CPU bound processes on a global basis to increase the throughput.
• Utilizing resources effectively: In a distributed system there are various resources such as CPUs, printers, storage devices, etc. Depending upon the nature of the process, it has to be assigned to a node where the suitable resources are available, so that those resources are utilized effectively.
• Reducing network traffic: Migrating a process closer to the resources it is using is a mechanism used to
reduce the network traffic. It can also migrate and cluster two or more processes which frequently
communicate with each other on the same node of the system.
• Improving system reliability: System reliability can be improved in two ways. One way is to migrate a
critical process to a node whose reliability is higher than other nodes in the system. Another way is to migrate
a copy of the critical process to some node and execute both the original and copied processes concurrently on
different nodes.
• Improving system security: A sensitive process may be migrated and run on a secure node that is not
directly accessible to general users thus improving the security of that process.
Threads
Threads are an efficient way to improve application performance through parallelism. Each thread of a process has its own program counter, its own register states, and its own stack, but all the threads share the same address space. Threads are often referred to as lightweight processes. Advantages of threads over multiple processes are:
• Context Switching: Threads are very inexpensive to create and destroy, and they are inexpensive to represent. For example, they require space to store the PC, the SP, and the general-purpose registers, but they do not require space for memory-map information, information about open files or I/O devices in use, etc. In other words, a context switch is relatively cheaper when using threads.
• Sharing: Threads allow the sharing of many resources that cannot be shared between separate processes, for example, the code section, the data section, and operating system resources such as open files.
Key Features of Threads in a Distributed
System
1. Parallelism: Threads enable multiple parts of a task to run simultaneously, improving performance
in multi-core and multi-node environments.

2. Shared Resources: Threads within a process share the same memory and resources, making them
lightweight compared to processes.

3. Concurrency: Threads allow multiple operations to be performed concurrently, such as handling


multiple client requests in a server.

4. Communication and Coordination: Threads often handle inter-node communication,


synchronization, and distributed task execution.
Benefits of Using Threads in Distributed
Systems
• Improved Responsiveness: Applications remain responsive by using threads for background tasks
(e.g., network communication).

• Resource Efficiency: Threads share memory and resources of their parent process, reducing
overhead.

• Enhanced Scalability: Threads allow distributed systems to handle large numbers of requests
simultaneously.

• Simplified Multitasking: Threads enable simultaneous execution of tasks like data processing and
I/O operations.
Use Cases of Threads

• Web Servers: Handle multiple client connections concurrently using threads.

• Parallel Processing: Distribute computational tasks across threads for faster processing.

• Database Operations: Perform concurrent queries and updates using threads.

• Inter-Node Communication: Manage message passing and synchronization between nodes.
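As a concrete illustration of the web-server use case above, the sketch below handles each client connection in its own thread; the port number and the echo behaviour are arbitrary choices:

```python
import socket
import threading

# One thread per client connection: a minimal concurrent echo server.

def handle_client(conn, addr):
    with conn:
        while data := conn.recv(1024):   # read until the client closes the connection
            conn.sendall(data)           # echo the bytes back

def serve(port=9000):
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.bind(("0.0.0.0", port))
        srv.listen()
        while True:
            conn, addr = srv.accept()
            # Threads share the process's address space, so they are cheap to
            # create and can serve many clients concurrently.
            threading.Thread(target=handle_client, args=(conn, addr), daemon=True).start()

if __name__ == "__main__":
    serve()
```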


Distributed Computing
Unit 4
Distributed file system
A distributed file system (DFS) is a networked architecture that
allows multiple users and applications to access and manage files
across various machines as if they were on a local storage device.
Instead of storing data on a single server, a DFS spreads files across
multiple locations, enhancing redundancy and reliability.
• This setup not only improves performance by enabling parallel
access but also simplifies data sharing and collaboration among
users.
• By abstracting the complexities of the underlying hardware, a
distributed file system provides a seamless experience for file
operations, making it easier to manage large volumes of data in a
scalable manner.
Features of DFS
1. Transparency
• Structure transparency: There is no need for the client to know about the
number or locations of file servers and the storage devices. Multiple file
servers should be provided for performance, adaptability, and dependability.
• Access transparency: Both local and remote files should be accessible in the same manner. The file system should automatically locate the accessed file and send it to the client's side.
• Naming transparency: There should not be any hint in the name of the file
to the location of the file. Once a name is given to the file, it should not be
changed during transferring from one node to another.
• Replication transparency: If a file is copied onto multiple nodes, the existence of the multiple copies and their locations should be hidden from the clients.
• User mobility: It will automatically bring the user’s home directory to the node where the
user logs in.

• Performance: Performance is based on the average amount of time needed to satisfy client requests. This time covers CPU time + time taken to access secondary storage + network access time. It is advisable that the performance of the Distributed File System be similar to that of a centralized file system.

• Simplicity and ease of use: The user interface of a file system should be simple and the
number of commands in the file should be small.

• High availability: A Distributed File System should be able to continue functioning in the face of partial failures such as a link failure, a node failure, or a storage drive crash. A highly available and adaptable distributed file system should have different and independent file servers for controlling different and independent storage devices.

• Scalability: Since growing the network by adding new machines or joining two networks together is routine, the distributed system will inevitably grow over time. As a result, a good distributed file system should be built to scale quickly as the number of nodes and users in the system grows.
Working of DFS
There are two ways in which DFS can be implemented:
• Standalone DFS namespace: It allows only for those DFS roots that exist on the local computer and are not using Active Directory. A standalone DFS can only be accessed on the computer on which it is created. It does not provide any fault tolerance and cannot be linked to any other DFS. Standalone DFS roots are rarely encountered because of their limited advantages.

• Domain-based DFS namespace: It stores the configuration of DFS in Active Directory, creating the
DFS namespace root accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>
File Model in Distributed Systems
A file model in distributed systems refers to the way data and files are
organized, accessed, and managed across multiple nodes or locations
within a network. It encompasses the structure, organization, and
methods used to store, retrieve, and manipulate files in a distributed
environment. File models define how data is stored physically, how it
can be accessed, and what operations can be performed on it.
Importance of File Models in Distributed
Systems
The importance of file models in distributed systems lies in their ability to:
• Organize and Structure Data: File models provide a framework for organizing data into
logical units, making it easier to manage and query data across distributed nodes.

• Ensure Data Consistency and Integrity: By defining how data is structured and
accessed, file models help maintain data consistency and integrity, crucial for reliable
operations in distributed environments.

• Support Scalability: Different file models offer varying levels of scalability, allowing
distributed systems to efficiently handle growing amounts of data and increasing user
demands.

• Enable Efficient Access and Retrieval: Depending on the file model chosen, distributed
systems can optimize data access patterns, ensuring that data retrieval operations are
efficient and responsive.

• Facilitate Collaboration and Sharing: File models in distributed systems enable


seamless collaboration and sharing of data among users and applications, regardless of
geographical location or network configuration.
File Accessing Model
In a distributed file accessing model, files are distributed across multiple
servers or nodes, and users can access these files from any node in the network. This model is highly scalable and fault-tolerant, as files are distributed across multiple nodes, reducing the risk of a single point of failure.
Example: Hadoop Distributed File System (HDFS) is an example of a distributed file accessing model. In HDFS, files are distributed across multiple nodes in the network, and users can access these files using the Hadoop File System API.

The file accessing model basically depends on:

• The unit of data access/transfer

• The method utilized for accessing remote files


The unit of data access/Transfer
1. File-level transfer model: In the file-level transfer model, the whole file is transferred when a particular operation requires the file data to be sent across the distributed computing network between client and server. This model has better scalability and is efficient.

2. Block-level transfer model: In the block-level transfer model, file data is transferred between client and server in units of file blocks; thus, the unit of data transfer in the block-level transfer model is the file block. The block-level transfer model may be used in distributed computing environments containing several diskless workstations.

3. Byte-level transfer model: In the byte-level transfer model, file data is transferred between client and server in units of bytes; thus, the unit of data transfer in the byte-level transfer model is the byte.

4. Record-level transfer model: The record-level transfer model may be used with file models where the file contents are structured as records. In the record-level transfer model, file data is transferred between client and server in units of records.
The Method Utilized for Accessing Remote Files
1. Remote service model: Processing of a client's request is performed at the server's node. The client's request for file access is sent across the network as a message to the server, the server machine performs the access, and the result is sent back to the client. The number of messages sent and the overhead per message need to be kept low.

2. Data-caching model: This model attempts to reduce the network traffic of the previous model by caching the data obtained from the server node. This exploits the locality of reference found in file accesses. A replacement policy, for instance LRU, is used to keep the cache size bounded.
File Sharing Semantics
File sharing semantics define the rules and behaviors for accessing and updating files in
distributed systems. These rules ensure that data remains consistent and reliable when
multiple users or processes interact with the same files across different nodes. Proper file
sharing semantics are essential for maintaining data integrity, preventing conflicts, and
ensuring seamless collaboration in distributed environments.

• Access Patterns: File sharing semantics govern how files are accessed by different
users. This includes read and write operations.

• Visibility of Changes: They determine when changes made by one user become visible
to others. Immediate or delayed visibility impacts system behavior.

• Conflict Resolution: File sharing semantics provide mechanisms to resolve conflicts


when multiple users modify the same file. These mechanisms ensure data consistency.

• Performance Impact: Different semantics can affect system performance. Strong


consistency may reduce performance, while weaker consistency models can enhance it.
Types of File Sharing Semantics
1. UNIX Semantics: Each operation on a file is instantly visible to all users.
• Description: Changes to a file are immediately propagated and visible to all
processes in the system.
• Advantages: This approach provides strong consistency, ensuring that all users
always see the most recent version of the file.
• Disadvantages: This can lead to performance bottlenecks in large distributed
systems due to the frequent synchronization required to maintain this level of
consistency.
2. Session Semantics: Changes are made visible only after the session
ends.
• Description: Modifications to a file are only made visible to other users once the
session in which the changes were made is closed.
• Advantages: This reduces contention and improves performance because updates
are batched and applied at the end of a session, rather than immediately.
• Disadvantages: This can lead to inconsistencies if multiple users modify the file in
different sessions and the changes are not properly merged.
3. Immutable Files Semantics: Files, once created, cannot be modified.
• Description: Instead of modifying existing files, any changes require creating a new
version of the file.
• Advantages: This simplifies consistency management because there are no concurrent
writes to the same file. It ensures that data is not altered once it has been written.
• Disadvantages: This approach can be inflexible for applications that require frequent
updates and modifications, as it necessitates the creation of many new versions of files.
4. Close-to-Open Semantics: Changes are visible when the file is reopened.
• Description: Modifications made to a file become visible to other users only when they
reopen the file.
• Advantages: This balances consistency and performance by reducing the need for
constant synchronization, as updates are batched and applied when the file is closed and
reopened.
• Disadvantages: There can be temporary inconsistencies because users might not see
the most recent changes until they reopen the file.
File caching
• File caching is the process of storing frequently accessed files or
data in a temporary storage space called a cache. The cache is
typically located closer to the application or user, such as in the local
memory of a computer or server.
• When a file is requested by an application or user, the distributed file
system checks if the file is already stored in the cache. If the file is
found in the cache, it can be retrieved quickly without the need for a
remote request to the storage system. This reduces the latency and
network traffic associated with remote file access.
• If the file is not found in the cache, the distributed file system
retrieves the file from the storage system and stores it in the cache
for future access. This process is known as caching the file. The
cached file is stored in the cache until it is no longer needed or until
the cache space is needed for other files.
Ways of file caching is implemented in distributed file
systems
Client-side caching: In this approach, the client machine stores a local
copy of frequently accessed files. When the file is requested, the client
checks if the local copy is up-to-date and, if so, uses it instead of
requesting the file from the server. This reduces network traffic and
improves performance by reducing the need for network access.
Server-side caching: In this approach, the server stores frequently
accessed files in memory or on local disks to reduce the need for disk
access. When a file is requested, the server checks if it is in the cache
and, if so, returns it without accessing the disk. This approach can also
reduce network traffic by reducing the need to transfer files over the
network.
Distributed caching: In this approach, the file cache is distributed across
multiple servers or nodes. When a file is requested, the system checks if it is
in the cache and, if so, returns it from the nearest server. This approach
reduces network traffic by minimizing the need for data to be transferred
across the network.
File caching process in distributed file systems
The file caching process typically follows these steps:
• File access request : When an application or user requests access to a file, the
distributed file system checks if the file is already stored in the cache memory. If the file is
found in the cache memory, it can be retrieved quickly without the need for a remote
request to the storage system.
• Cache hit : If the file is found in the cache memory, it is retrieved and returned to the
application or user. This is known as a cache hit. The cache hit reduces the latency and
network traffic associated with remote file access.
• Cache miss : If the file is not found in the cache memory, it is retrieved from the storage
system and stored in the cache memory for future access. This is known as a cache miss.
The file remains in the cache memory until it is no longer needed or until the cache space
is needed for other files.
• Cache replacement : When the cache memory is full and a new file needs to be cached,
the distributed file system must decide which file to remove from the cache memory to
make room for the new file. This process is known as cache replacement. Different cache
replacement policies, such as least recently used (LRU) or least frequently used (LFU),
can be used to determine which file to remove from the cache memory.
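A minimal sketch of the cache-hit / cache-miss / replacement cycle described above, using LRU replacement; the fetch callback and the cache capacity are illustrative assumptions:

```python
from collections import OrderedDict

# Client-side file cache with LRU replacement (illustrative sketch).

class FileCache:
    def __init__(self, capacity, fetch_from_server):
        self.capacity = capacity
        self.fetch = fetch_from_server        # called on a cache miss
        self.entries = OrderedDict()          # path -> data, kept in LRU order

    def read(self, path):
        if path in self.entries:              # cache hit: serve locally, no remote request
            self.entries.move_to_end(path)    # mark as most recently used
            return self.entries[path]
        data = self.fetch(path)               # cache miss: fetch from the storage system
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)  # cache replacement: evict least recently used
        self.entries[path] = data
        return data

cache = FileCache(capacity=2, fetch_from_server=lambda p: f"<contents of {p}>")
print(cache.read("/data/a.txt"))   # miss: fetched from the server and cached
print(cache.read("/data/a.txt"))   # hit: served from the local cache
```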
Importance of file caching
• Improved Read and Write Performance : File caching in distributed file systems
can significantly improve read and write performance by reducing the number of
remote disk accesses. Caching frequently accessed files or data blocks in the local
cache memory can eliminate the need for frequent network trips, thereby reducing
latency and improving throughput.
• Reduced Network Latency and Bandwidth Usage : File caching can also reduce
network latency and bandwidth usage by storing frequently accessed data in local
cache memory. By reducing the amount of data transferred over the network, file
caching can improve network performance and reduce network traffic.
• Better Resource Utilization and Cost-efficiency : File caching can also help
distribute the workload across multiple nodes and reduce the need for expensive
hardware resources. By caching frequently accessed files or data blocks in local
cache memory, distributed file systems can reduce the amount of data transferred
over the network and improve resource utilization and cost-efficiency.
• Enhanced Scalability and Fault Tolerance : File caching can improve the
scalability and fault tolerance of distributed file systems by distributing the workload
across multiple nodes and reducing the risk of data loss. By caching frequently
accessed data in local cache memory, distributed file systems can improve system
responsiveness and reduce the risk of data loss in the event of a node failure.
Fault Tolerance

Fault Tolerance is defined as the


ability of the system to function
properly even in the presence of any
failure. Distributed systems consist of
multiple components due to which
there is a high risk of faults occurring.
Due to the presence of faults, the
overall performance may degrade.
Phases of Fault Tolerance
1. Fault Detection:- Fault detection is the first phase, where the system is monitored continuously and the outcomes are compared with the expected output. If any faults are identified during monitoring, they are reported. These faults can occur due to various reasons such as hardware failure, network failure, and software issues. The main aim of this phase is to detect faults as soon as they occur so that the assigned work is not delayed.

2. Fault Diagnosis:-Fault diagnosis is the process where the fault that is identified in the first phase
will be diagnosed properly in order to get the root cause and possible nature of the faults. Fault
diagnosis can be done manually by the administrator or by using automated techniques in order to
solve the fault and perform the given task.

3. Evidence Generation:-Evidence generation is defined as the process where the report of the fault
is prepared based on the diagnosis done in an earlier phase. This report involves the details of the
causes of the fault, the nature of faults, the solutions that can be used for fixing, and other
alternatives and preventions that need to be considered.

4. Assessment:-Assessment is the process where the damages caused by the faults are analyzed. It
can be determined with the help of messages that are being passed from the component that has
encountered the fault. Based on the assessment further decisions are made.

5. Recovery:- Recovery is the process whose aim is to make the system fault-free and restore it to a consistent state, using either forward recovery or backward recovery. Common recovery techniques such as reconfiguration and resynchronization can be used.
Types of Faults
• Transient Faults: Transient Faults are the type of faults that occur once and then
disappear. These types of faults do not harm the system to a great extent but are
very difficult to find or locate. Processor fault is an example of transient fault.

• Intermittent Faults: Intermittent faults are the type of faults that come again and again: once a fault occurs, it vanishes on its own and then reappears later. An example of an intermittent fault is a working computer that occasionally hangs up.

• Permanent Faults: Permanent Faults are the type of faults that remain in the
system until the component is replaced by another. These types of faults can
cause very severe damage to the system but are easy to identify. A burnt-out chip
is an example of a permanent Fault.
Need for Fault Tolerance in Distributed Systems
Fault Tolerance is required in order to provide below four features.

1. Availability: Availability is defined as the property where the system is readily available for its
use at any time.

2. Reliability: Reliability is defined as the property where the system can work continuously
without any failure.

3. Safety: Safety is defined as the property where the system can remain safe from
unauthorized access even if any failure occurs.

4. Maintainability: Maintainability is defined as the property that states how easily and quickly the failed node or system can be repaired.
Design Principles of Distributed File
System
1. Scalability
• The system must handle increasing amounts of data and users efficiently without degradation in
performance.
• Example: Hadoop Distributed File System (HDFS): HDFS is designed to scale out by adding more DataNodes to the cluster. Each DataNode stores data blocks, and the system can handle petabytes of data across thousands of nodes.
2. Consistency
• Ensuring that all users see the same data at the same time. This can be achieved through different
consistency models
• Example: Google File System (GFS): GFS provides a relaxed consistency model to achieve high availability and performance. It allows concurrent mutations and uses version numbers and timestamps to maintain consistency.
3. Availability
• Ensuring that the system is operational and accessible even during failures.
• Example: Amazon S3: Amazon S3 achieves high availability by replicating data across multiple
Availability Zones (AZs). If one AZ fails, data is still accessible from another, ensuring minimal
downtime and high availability.
4. Performance
Optimizing the system for speed and efficiency in data access.
Example: Ceph: Ceph is designed to provide high performance by using techniques such as object storage, which allows for efficient, parallel data access. It uses a dynamic distributed hashing algorithm called CRUSH.
5. Security
Protecting data from unauthorized access and ensuring data integrity.
Example:
Azure Blob Storage: Azure Blob Storage offers comprehensive security features,
including role-based access control (RBAC).
6. Data Management
Efficiently distributing, replicating, and caching data to ensure optimal performance
and reliability.
Example:
Cassandra: Apache Cassandra is a distributed NoSQL database that uses consistent
hashing to distribute data evenly across all nodes in the cluster.
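Consistent hashing of the kind used for data distribution in systems like Cassandra can be sketched as follows; the hash function, the number of virtual nodes, and the ring representation are illustrative choices:

```python
import bisect
import hashlib

# Consistent hashing ring (illustrative sketch): each key maps to the first node
# clockwise from its hash position, so adding or removing a node only moves a
# small fraction of the keys.

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes_per_node=8):
        ring = []
        for node in nodes:
            for i in range(vnodes_per_node):          # virtual nodes smooth out the load
                ring.append((_hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._positions = [pos for pos, _ in ring]

    def node_for(self, key: str) -> str:
        idx = bisect.bisect_right(self._positions, _hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("customer:42"))
```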
