Distributed Computing
• A distributed computing system, simply put, is a network of independent computers working together to
achieve common computational goals. It is a system where multiple computers, often geographically
dispersed, collaborate to solve a problem that is beyond their individual computing capabilities. Each
system, or 'node', is self-sufficient, meaning it operates independently while also contributing to the
overall goal.
• This is achieved through a process of task division, where a large task is divided into smaller subtasks.
Each subtask is then assigned to a different node within the network. The nodes work concurrently,
processing their individual tasks independently, and finally the results are aggregated into a final result.
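The divide-process-aggregate flow described above can be sketched in a few lines. This is a minimal single-machine illustration using a thread pool to stand in for the nodes; a real distributed system would run the subtasks on separate machines. The function names are illustrative.

```python
# Task division sketch: a large job (summing squares of a range) is
# split into subtasks, each handled by a separate worker, and the
# partial results are aggregated into a final result.
from concurrent.futures import ThreadPoolExecutor

def subtask(chunk):
    # Each "node" processes its own slice independently.
    return sum(n * n for n in chunk)

def distributed_sum_of_squares(numbers, workers=4):
    # Divide the large task into roughly equal subtasks.
    size = max(1, len(numbers) // workers)
    chunks = [numbers[i:i + size] for i in range(0, len(numbers), size)]
    # Process subtasks concurrently, then aggregate the results.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(subtask, chunks))

print(distributed_sum_of_squares(list(range(10))))  # 285
```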
Example:-
• Cloud computing system, where resources such as computing power, storage, and networking are
delivered over the Internet and accessed on demand. In this type of system, users can access and use
shared resources through a web browser or other client software.
Evolution of Distributed Computing Systems
Evolution of Computing Power
Shift to Minicomputers:
o Mainframes were used for massive data analysis but were not always ideal for all tasks.
o This limitation sparked research into alternative solutions.
Introduction of Distributed Operating Systems:
o Distributed architecture allows multiple computers to work together in a network, resembling a single
virtual computer.
o Offers enhanced speed and processing power by pooling resources.
o Initially designed for batch processing; evolved with the introduction of minicomputers.
o The master/slave concept was implemented for process management.
o Reduction in size and increase in power of microcomputers improved distributed computing efficiency.
o Modern networks offer better performance, scalability, and security, enhancing distributed computing
capabilities.
Key components of a Distributed Computing System
Devices or Systems: The devices or systems in a distributed system have their own processing capabilities
and may also store and manage their own data.
Network: The network connects the devices or systems in the distributed system, allowing them to
communicate and exchange data.
Resource Management: Distributed systems often have some type of resource management system in place
to allocate and manage shared resources such as computing power, storage, and networking.
Advantages of Distributed Computing
Scalability
• As the computational needs of a task increase, instead of upgrading a single system to handle the increased
workload, additional nodes can be added to the distributed network. This way, the system can efficiently
handle the growing demands without major modifications or significant costs.
• It also includes the ability to enhance the computational power of existing nodes or to replace older nodes
with more powerful ones.
Availability
• High availability is another significant advantage of distributed computing. Since the system is composed
of multiple independent nodes, the failure of one or a few nodes does not halt the entire system. Other
nodes in the network can continue their operations, ensuring that the system as a whole remains functional.
Efficiency
• Distributed computing systems are highly efficient. By dividing a large task into smaller subtasks and
processing them concurrently, the system can significantly reduce the time required to complete the task.
This parallel processing capability is especially beneficial for complex computational tasks that would
take an unfeasibly long time to complete on a single computer.
Transparency
• Transparency is a key feature of distributed computing systems. Despite being composed of multiple
independent nodes, the system operates as a single entity from the user's perspective. This means that
the complexities of the underlying architecture, such as the division of tasks, the communication
between nodes, and the handling of failures, are hidden from the user.
Types of Distributed Computing Architecture
Client-Server Architecture
• Client-server architecture is a common type of distributed computing architecture. In this model, the system is
divided into two types of nodes: clients and servers. Clients request services, and servers provide them. The
servers are typically powerful computers that host and manage resources, while the clients are usually less
powerful machines that access these resources.
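The client-server request/response cycle can be sketched with Python's standard library: the server hosts a simple "echo" service, and the client connects to request it. The loopback address and the echo service are illustrative assumptions.

```python
# Client-server sketch: the server provides a service (echoing data),
# the client requests it over a TCP connection.
import socket
import socketserver
import threading

class EchoHandler(socketserver.BaseRequestHandler):
    def handle(self):
        # The server receives a request and provides the service.
        data = self.request.recv(1024)
        self.request.sendall(b"echo: " + data)

def run_demo():
    # Port 0 lets the OS pick a free port for this demo.
    server = socketserver.TCPServer(("127.0.0.1", 0), EchoHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    # The client connects to the server and requests the service.
    with socket.create_connection(("127.0.0.1", port)) as sock:
        sock.sendall(b"hello")
        reply = sock.recv(1024)
    server.shutdown()
    server.server_close()
    return reply

print(run_demo())  # b'echo: hello'
```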
Three-Tier Architecture
• Three-tier architecture is a type of client-server architecture where the system is divided into three layers: the
presentation layer, the application layer, and the data layer. The presentation layer handles the user interface,
the application layer processes the business logic, and the data layer manages the database. By separating
these functions, the system can achieve greater scalability, flexibility, and maintainability.
N-Tier Architecture
• N-tier architecture is a further extension of the three-tier architecture. In this model, the system is divided into
'n' tiers or layers, where 'n' can be any number greater than three. Each layer is dedicated to a specific
function, such as user interface, business logic, data processing, data storage, etc. This division of labor
allows for greater modularity, making the system more scalable and easier to manage.
Peer-to-Peer Architecture
• Peer-to-Peer (P2P) architecture is a type of distributed computing architecture where all nodes are equal, and
each node can function as both a client and a server. In this model, there is no central server; instead, each
node can request services from and provide services to other nodes. This decentralization makes P2P
architectures highly scalable and resilient, as there is no single point of failure.
Distributed Computing System Models
Types of Distributed Computing System Models
1. Physical Model
• A physical model represents the underlying hardware elements of a distributed system. It encompasses the
hardware composition of a distributed system in terms of computers and other devices and their interconnections.
It is primarily used to design, manage, implement, and determine the performance of a distributed system.
1.1. Nodes
• Nodes are the end devices that can process data, execute tasks, and communicate with the other nodes. These end
devices are generally the computers at the user end or can be servers, workstations, etc.
Nodes provide the distributed system with an interface in the presentation layer that enables the user to interact
with other back-end devices (nodes) used for storage and database services, processing, web browsing, etc.
Each node has an Operating System, execution environment, and different middleware requirements that
facilitate communication and other vital tasks.
1.2. Links
• Links are the communication channels between different nodes and intermediate devices. These may be wired
or wireless. Wired links or physical media are implemented using copper wires, fiber optic cables, etc.
Point-to-point links: Establish a connection and allow data transfer between only two nodes.
Broadcast links: Enable a single node to transmit data to multiple nodes simultaneously.
Multi-Access links: Multiple nodes share the same communication channel to transfer data. These require protocols
to avoid interference during transmission.
1.3. Middleware
• Middleware is the software installed and executed on the nodes. By running middleware on each node, the
distributed computing system achieves decentralised control and decision-making. It handles various tasks such as
communication with other nodes, resource management, fault tolerance, synchronisation of different nodes, and
security to prevent malicious and unauthorised access.
1.4 Network Topology
• This defines the arrangement of nodes and links in the distributed computing system. The most common network
topologies implemented are bus, star, mesh, ring, and hybrid. The choice of topology is made by determining the
exact use cases and requirements.
• Communication protocols are the sets of rules and procedures for transmitting data over the links. Examples of
these protocols include TCP, UDP, HTTPS, MQTT, etc. They allow the nodes to communicate and to interpret the data they exchange.
2. Architectural Model
• The architectural model of a distributed computing system is the overall design and structure of the system: how its
different components are organised to interact with each other and provide the desired functionality. It is an
overview of how development, deployment, and operations will take place. Constructing a good architectural model is
required for efficient cost usage and highly improved scalability of the applications.
2.1. Client-server model
• It is a centralised approach in which clients initiate requests for services and servers respond by providing
those services. It mainly works on the request-response model, where the client sends a request to the server and
the server processes it and responds to the client accordingly.
This is mainly used in web services, cloud computing, database management systems, etc.
2.2. Microservices model
• In this model, a complex application or task is decomposed into multiple independent services running on
different servers. Each service performs only a single function and is focussed on a specific business capability.
This makes the overall system more maintainable, scalable, and easier to understand, since services can be
independently developed, deployed, and scaled without affecting the other running services.
3. Fundamental Model
• The fundamental model in a distributed computing system is a broad conceptual framework that helps in
understanding the key aspects of the distributed systems. These are concerned with more formal description of
properties that are generally common in all architectural models. Three fundamental models are as follows:
3.1. Interaction Model
• Distributed computing systems are full of processes interacting with each other in highly complex ways. The
interaction model provides a framework for understanding the mechanisms and patterns used for communication and
coordination among the various processes. The important components of this model are:
Message Passing – Deals with passing messages that may contain data, instructions, a service request, or
process-synchronisation information between different computing nodes. It may be synchronous or asynchronous
depending on the types of tasks and processes.
Publish/Subscribe Systems – Also known as pub/sub systems. In this model, a publishing process publishes a
message on a topic, and the processes subscribed to that topic pick it up and act on it. This pattern is
especially important in event-driven architectures.
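The pub/sub pattern above can be sketched as a small in-process broker: subscribers register a callback for a topic, and a published message is delivered to every subscriber of that topic. The `Broker` class and topic names are illustrative; real systems (e.g. MQTT brokers) route messages across the network.

```python
# Minimal publish/subscribe sketch: topic -> list of subscriber callbacks.
from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        # A process registers interest in a topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every process subscribed to this topic.
        for callback in self._subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("sensor/temp", received.append)
broker.subscribe("sensor/temp", lambda m: received.append(m.upper()))
broker.publish("sensor/temp", "21c")
print(received)  # ['21c', '21C']
```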
3.2. Remote Procedure Call (RPC)
• It is a communication paradigm that allows a program to invoke a procedure or method on a remote process as
if it were a local procedure call. The client process makes a procedure call using RPC, and the message is
passed to the required server process using communication protocols.
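Python's standard `xmlrpc` modules give a compact illustration of this paradigm: the client calls `add()` as if it were local, while the call actually executes on the server. The loopback address and the `add` procedure are assumptions for the demo.

```python
# RPC sketch with the standard library: the proxy call looks local,
# but the procedure runs on the server.
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):
    return a + b

# Port 0 lets the OS pick a free port for this demo.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(add)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client-side call reads like an ordinary local procedure call.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
server.shutdown()
print(result)  # 5
```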
Distributed Operating System
• A distributed operating system is an important type of operating system. An operating system is basically a
program that acts as an interface between the system hardware and the user; moreover, it handles all the
interactions between the software and the hardware.
• Two or more systems can be connected to each other, allowing them to share resources and serve real-time
applications.
• The distributed system can be scaled to meet the needs of the business, making it more flexible and efficient.
• The system can be managed centrally, making it easier to control and monitor.
• Distributed systems are becoming increasingly popular due to the rise of big data and the need for real-time
applications.
• There is a greater risk of system failure due to the increased number of systems and points of failure.
Uses of Distributed OS
• The distributed OS has numerous applications. Here are a few examples:
Network Applications
• Many network apps, including the web, multiplayer web-based games, peer-to-peer networks, and virtual
communities, rely on DOS.
Telecommunication Networks
• DOS is useful in cellular networks and phones. In networks like wireless sensor networks, the Internet, and
routing algorithms, a DOS can be found.
Real-Time Process Control
• Aircraft control systems are instances of real-time process control systems that operate on a deadline.
Parallel Computation
• DOS is the foundation for systematic computing, which encompasses cluster and grid computing as well as a
number of volunteer computing projects.
Designing a distributed operating system
The various design issues in the development of distributed systems are stated as follows:
• Transparency
• Flexibility
• Reliability
• Performance
• Scalability
Transparency:
• A distributed system is said to be transparent if the users of the system feel that the collection of machines is a
timesharing system that belongs entirely to them. Transparency can be achieved at two different levels. At the first
level, the distribution of the system is hidden from the users.
• For example, in UNIX, when a user compiles a program using the make command, compilation takes place in parallel
on different machines that use different file systems. The whole system can be made to look like a single-processor
system.
Flexibility
• Flexibility in distributed systems is important because such systems are still new to engineers; there may be false
starts, and it may become necessary to backtrack on the design. Design decisions may prove wrong in the later stages of
development. There are two different schemes for building distributed systems.
• The first one, called monolithic kernel, states that each machine should run a traditional kernel, which provides most of
the services itself.
Monolithic kernel is the centralized OS, which has networking ability and
remote services. The system calls are made by locking the kernel, then the
desired task is performed and the kernel is released after returning the result
to the user. In this approach, the machines have their own disks and maintain
their own local file system.
The other one, called the microkernel, states that the kernel should provide very
few services itself, with most OS services provided by user-level servers.
Reliability
• a) Availability: It refers to the fraction of time during which the system is usable. Availability can be
ensured by a design that does not require the simultaneous functioning of a large number of key components. The
resources or files that are used frequently can be replicated.
• b) Security: The data stored in a distributed system can, in principle, be accessed by anyone, so it should be
protected from unauthorized access. This problem also exists in single-processor systems, but there the users are
required to log in, so they are authenticated and the system can check their permissions. In a distributed
system, there is no comparable provision for determining the user and his or her permissions.
Performance
• A flexible and reliable system is ineffective if it performs slower than a single-processor system. Performance
is measured using metrics like jobs per hour, system utilization, and network capacity. Different tasks yield
varying performance results; for instance, CPU-intensive tasks differ from file searching tasks. In distributed
systems, communication speed is vital, but waiting for message-handling protocols can slow performance.
• To improve efficiency, multiple tasks should run simultaneously, which necessitates sending many messages.
Analyzing computation grain size is crucial: simple tasks should be prioritized over complex ones since they
require fewer CPU cycles. Fine-grained parallelism involves simple tasks with high interaction, while coarse-
grained parallelism involves large tasks with low interaction, with the latter being preferred for better
performance.
• Reliability can be enhanced through server cooperation on requests. If one server fails, another can take over
the task, ensuring completion. This approach improves reliability but requires additional messages across the
network.
Scalability
Distributed systems typically operate efficiently with a few hundred CPUs, but future scenarios may involve much
larger systems. For instance, if a postal and telecommunications authority installs a terminal in every home and
business for online access to a comprehensive database (like telephone numbers), this could eliminate the need for
printed directories. These terminals could also facilitate email, electronic banking, and ticket reservations,
showcasing the potential of such expansive systems.
There are certain bottlenecks in developing such a large system, which are stated as follows:
• Centralized components: There should not be any centralized components in the system. For example, if there
is a single centralized mail server for all the users of the system, the traffic over the network will increase,
the system will not be able to tolerate faults, and if that one server fails, the whole system will crash.
• Centralized tables: If the data of the users is stored in the centralized tables, the communication lines will be
blocked. Thus, the system will become prone to faults and failures.
• Centralized algorithms: If the messages in such a large system are sent using a single algorithm, it will take
much time to reach the destination due to the large number of users and traffic
INTRODUCTION TO MESSAGE PASSING
• The communication between the computers in a distributed operating system works as the backbone of the
whole system. This communication between computers is also known as IPC (Inter-Process Communication).
• The design of inter-process communication needs to address some basic issues in the procedures used for
communication between any two or more nodes. Messages are generally transferred from sender to receiver in the
form of data packets. The sender node sends a data packet that includes two basic components: a fixed-length
header and a variable-length block.
• The fixed-length header includes information such as the sender process address, the receiving process
address, a unique message identification number, the type of data, and the number of bytes or elements. This
information helps a data packet reach the correct destination and gives the receiver correct information about
the originating process, since the same information is used for sending the acknowledgement from the receiver.
• Once the process of receiving a data packet is complete then an acknowledgement is sent to the originating
process in order to complete the transfer of message between a sender and a receiver.
Fig. Data Packet which is Transferred from a Sender Node to a Receiver Node
Fig. Parameters Available in “Fixed Length Block” of a Data Packet
The basic challenges that are encountered by a distributed operating system during inter process communication are:
1. Naming and Name Resolution: Every process in a communication system is assigned a unique identification number
known as a Process-ID (process identification number). The computer network system should have a naming system that
allows a process to be referred to by name, resolving any conflicts and supporting the management of process
execution in a distributed operating system during inter-process communication.
• A naming system can be implemented using either a distributed or a non-distributed approach. The method selected
has a direct impact on the effectiveness and efficiency of the distributed operating system.
2. Routing Strategies: Inter-process communication primarily involves determining how a data packet is transmitted
from a sender to a receiver, including the specific route it takes through various nodes or computers. During this process,
the message is only accessible at the destination node, ensuring confidentiality. The route taken by the data packet is
referred to as routing, and the methods for identifying this route are known as routing strategies. An effective routing
strategy should prioritize efficiency, security, and optimal resource utilization.
3. Connection Strategies: The backbone of communication between a sender and receiver is the physical connection,
established through a connection strategy. A poorly chosen connection strategy can result in communication delays,
message loss, or alterations. In distributed operating systems, three key connection strategies—Circuit Switching,
Message Switching, and Packet Switching—must be selected based on specific system requirements. In inter-process
communication, messages are sent in a structured format containing attributes such as sender and receiver addresses,
sequence numbers, structural information, and the actual data, which is typically located at the end of the message block,
often accompanied by a pointer to the data.
SYNCHRONIZATION
• A distributed operating system relies on cooperation and data exchange between independent nodes, requiring
synchronization between sender and receiver for effective communication. This synchronization ensures that
messages are received accurately and allows for the timing coordination necessary for successful data
transfer. It can be categorized into two types: blocking and non-blocking synchronization.
• In blocking synchronization, the sender remains in a blocked state until the receiver acknowledges receipt of
the message. In contrast, non-blocking synchronization allows the sender to continue its operations
immediately after sending the message, without waiting for acknowledgment. Effective synchronization is
crucial for the integrity and reliability of communication in distributed systems.
a) Blocking:
• In blocking synchronization, communication involves a sender transmitting a message and then waiting for
the receiver's acknowledgment before proceeding. During this waiting period, the sender remains blocked,
and the receiver also waits for a message to continue its process. This method ensures that the sender only
resumes execution after confirming receipt of the message, thus coordinating the communication flow
between the nodes effectively.
When a message is sent from one computer (the sender) to another (the receiver), sometimes one of them can get
stuck, which may lead to a situation where neither can continue working. To prevent this from happening, a timeout
is set: a limit on how long the sender will wait for a response (acknowledgment) from the receiver. If the sender
doesn't get a response before the timeout ends, it will stop waiting and continue with other tasks, helping to
avoid a deadlock where everything freezes.
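The blocking receive with a timeout can be sketched with a `queue.Queue` standing in for the message channel: the receiver blocks waiting for a message but gives up after the timeout instead of waiting forever. Names and the timeout value are illustrative.

```python
# Blocking receive with a timeout, to avoid deadlock when no message
# ever arrives.
import queue

channel = queue.Queue()

def receive_blocking(timeout_seconds):
    try:
        # Blocks until a message arrives or the timeout expires.
        return channel.get(timeout=timeout_seconds)
    except queue.Empty:
        # Timed out: stop waiting and continue with other tasks.
        return None

channel.put("hello")
print(receive_blocking(0.1))  # 'hello'
print(receive_blocking(0.1))  # None (timeout prevents freezing forever)
```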
b) Non-blocking:
• In non-blocking synchronization, when a sender sends a message, it doesn't have to wait for the receiver to
acknowledge it. The sender can continue with its work once the message is placed in a buffer. Similarly, the receiver
can also move on to other tasks after it has executed the receive operation, without waiting for a message.
• In this setup, if both the sender and receiver are using non-blocking methods, it’s called asynchronous communication.
The challenge is how the receiver knows when a message is available in the buffer. There are two main ways to handle
this:
• Polling: This is when the receiver continuously checks the buffer to see if there are any new messages. While this
method works, it can be inefficient because it uses up processing resources.
• Interrupt Method: In this method, the system generates an interrupt signal when a message arrives in the buffer. This
alert informs the receiver that it can retrieve the message. However, implementing this method can be complex and
resource-intensive, especially in distributed systems.
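Both delivery checks can be sketched in a few lines: polling (the receiver checks the buffer without blocking) and an interrupt-style notification, approximated here by a `threading.Event` that wakes the receiver when a message is ready. The names are illustrative; a real interrupt mechanism lives below the OS level.

```python
# Polling vs. interrupt-style notification for non-blocking receive.
import queue
import threading

buffer = queue.Queue()
message_arrived = threading.Event()

def send(message):
    buffer.put(message)
    message_arrived.set()  # "interrupt": signal that a message is ready

def receive_by_polling():
    # Polling: check the buffer without blocking; looping on this
    # wastes CPU, which is the inefficiency noted above.
    try:
        return buffer.get_nowait()
    except queue.Empty:
        return None

send("ping")
message_arrived.wait(timeout=1)  # interrupt-style: sleep until signalled
print(receive_by_polling())  # 'ping'
```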
BUFFERING
What is buffering?
When a sender sends a message to a receiver, it can use either synchronous or asynchronous communication. To ensure
the message gets delivered, it needs to be temporarily stored somewhere until the receiver is ready to get it. This storage
area can be in the sender's memory or a memory space managed by the operating system. The place where the message
is kept until the receiver retrieves it is called a buffer, and the act of storing the message there is known as buffering.
Different types of buffering are used based on the requirement of a process within a distributed operating system and
some of them are given below:
1. Null Buffering: This type of buffering does not use any buffer; rather, the send process remains suspended until
the receiver node is in a position to receive the message. Once sending starts, the receiver begins receiving the
message, and an acknowledgement is sent once the message is delivered. On receipt of the acknowledgement, the
sender sends a message to the receiver to unblock it for further processing.
2. Single Message Buffering: This type of buffering uses a single buffer, typically in the receiver node's address
space, to ensure that the message is readily available as and when the receiver node is ready to accept it. A
single message buffer performs better in some situations, because the message is already available in the buffer,
which helps the whole system reduce blocking duration at different nodes.
3. Multiple Message Buffering: Multiple message buffering is commonly used in asynchronous communication
within inter-process communication in distributed systems. This type of buffering acts like a mailbox, where
messages can be stored either in the receiver's memory or in the operating system's memory.
• When a sender wants to send a message, it executes the send process, and the message is placed in this
mailbox. The receiver can later check the mailbox and retrieve messages whenever it is ready by
processing the receive operation.
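The mailbox behaviour can be sketched with a bounded queue: the sender deposits several messages without blocking per message, and the receiver drains them whenever it is ready. The capacity of 8 is an illustrative assumption.

```python
# Multiple-message buffering: a mailbox holding several messages.
import queue

mailbox = queue.Queue(maxsize=8)

# The sender deposits messages and moves on immediately.
for i in range(3):
    mailbox.put(f"msg-{i}")

# Later, the receiver retrieves everything waiting in the mailbox.
received = []
while not mailbox.empty():
    received.append(mailbox.get())
print(received)  # ['msg-0', 'msg-1', 'msg-2']
```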
MULTI-DATAGRAM MESSAGES
• Inter-process communication in distributed operating systems involves transferring messages between nodes
in the form of packets, which contain various attributes like process identifier, address, and data. A datagram
is an independent packet that facilitates connectionless communication across a packet-switched network and
carries sufficient information for routing.
• Each network has a maximum transfer unit (MTU) specifying the largest datagram size allowed. If a message
exceeds the MTU, it is split into smaller datagrams, referred to as multi-datagrams, which include additional
attributes for sequencing and fragmentation.
• The packet used to carry a fragment, with its control information and data, is itself called a datagram.
Single-datagram Messages: A message is called a Single-datagram Message if its size is smaller than that
of the Maximum Transfer Unit (MTU) of a network. Therefore, it can be sent in a single packet on a
network.
Multidatagram Messages: A message is called a Multidatagram Message if its size is larger than that of
the Maximum Transfer Unit (MTU) of a network. Therefore, it is sent in multiple packets on the network.
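The fragmentation described above can be sketched as splitting a message into MTU-sized pieces, each tagged with a sequence number so the receiver can reassemble them in order even if they arrive out of order. The 4-byte MTU is an artificially small illustrative value.

```python
# Splitting a message into multiple datagrams and reassembling it.
MTU = 4  # illustrative; real MTUs are much larger (e.g. 1500 bytes)

def fragment(message: bytes):
    # Each datagram carries (sequence number, fragment payload).
    return [(seq, message[i:i + MTU])
            for seq, i in enumerate(range(0, len(message), MTU))]

def reassemble(datagrams):
    # Sort by sequence number in case datagrams arrived out of order.
    return b"".join(payload for _, payload in sorted(datagrams))

packets = fragment(b"hello world")
print(len(packets))                          # 3 datagrams for 11 bytes
print(reassemble(list(reversed(packets))))   # b'hello world'
```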
ENCODING AND DECODING
• Messages sent from a source to a destination node can be in the form of either single or multi-datagrams. To
ensure that the message is complete and correct upon receipt, the receiver must understand the structure of the
datagram and have information consistent with that at the sender.
• This involves encoding the original data packet into a compatible form for transmission and then decoding it back
into its original form at the receiver. Encoding methods can vary, with two main representations: tagged and
untagged. Tagged representation includes detailed information about the object and data, making decoding
straightforward for the receiver. In contrast, untagged representation only contains the data, requiring the receiver
to know the encoding method in advance to successfully decode the message.
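The two representations can be contrasted with standard-library encoders: JSON as a tagged encoding (the message carries field names, so it describes itself) and a raw `struct` layout as an untagged one (just the bytes, so the receiver must already know the format string). The record fields and the `"<if"` layout are illustrative assumptions.

```python
# Tagged vs. untagged encoding of the same record.
import json
import struct

record = {"id": 7, "temp": 21.5}

# Tagged: the message describes itself, so decoding is straightforward.
tagged = json.dumps(record).encode()
print(json.loads(tagged))  # {'id': 7, 'temp': 21.5}

# Untagged: just the bytes; the receiver must know the "<if" layout
# (little-endian int + float) in advance to decode correctly.
untagged = struct.pack("<if", record["id"], record["temp"])
ident, temp = struct.unpack("<if", untagged)
print(ident, temp)  # 7 21.5
```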
Distributed Computing
Unit-2
Remote Procedural Call in Distributed Systems
• Remote Procedure Call (RPC) is a
protocol used in distributed
systems that allows a program to
execute a procedure (subroutine)
on a remote server or system as if
it were a local procedure call.
• Simplified Communication
Abstraction of Complexity: RPC abstracts the complexity of network communication, allowing developers
to call remote procedures as if they were local, simplifying the development of distributed applications.
Consistent Interface: Provides a consistent and straightforward interface for invoking remote services,
which helps in maintaining uniformity across different parts of a system.
• Enhanced Modularity and Reusability
Decoupling: RPC enables the decoupling of system components, allowing them to interact without being tightly
coupled. This modularity helps in building more maintainable and scalable systems.
Service Reusability: Remote services or components can be reused across different applications or systems,
enhancing code reuse and reducing redundancy.
Remote Procedural Call (RPC) Architecture in Distributed
Systems
• The RPC (Remote Procedure Call) architecture in distributed systems is designed to enable communication between
client and server components that reside on different machines or nodes across a network. Here’s an overview of the
RPC architecture:
1. Client and Server Components
• Client: The client is the component that makes the RPC request. It invokes a procedure or method on the remote
server by calling a local stub, which then handles the details of communication.
• Server: The server hosts the actual procedure or method that the client wants to execute. It processes incoming RPC
requests and sends back responses.
2. Stubs
• Client Stub: Acts as a proxy on the client side. It provides a local interface for the client to call the remote
procedure. The client stub is responsible for marshalling (packing) the procedure arguments into a format suitable for
transmission and for sending the request to the server.
• Server Stub: On the server side, the server stub receives the request, unmarshals (unpacks) the arguments, and
invokes the actual procedure on the server. It then marshals the result and sends it back to the client stub.
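The stub pair above can be sketched without a network: the client stub marshals the call into bytes, the server stub unmarshals them and invokes the real procedure, then the reply travels back the same way. The direct function call standing in for the transport, and the JSON wire format, are simplifying assumptions.

```python
# Minimal stub pair: marshalling with JSON, transport simulated by a
# direct call. A real system would send the bytes over the network.
import json

def add(a, b):                      # the actual procedure on the server
    return a + b

PROCEDURES = {"add": add}

def server_stub(request_bytes):
    # Unmarshal the request, invoke the procedure, marshal the result.
    request = json.loads(request_bytes)
    result = PROCEDURES[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode()

def client_stub(proc, *args):
    # Marshal the call into bytes, "send" it, and unmarshal the reply.
    request_bytes = json.dumps({"proc": proc, "args": args}).encode()
    reply = json.loads(server_stub(request_bytes))
    return reply["result"]

print(client_stub("add", 2, 3))  # 5
```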
3. Marshalling and Unmarshalling
• Marshalling: The process of converting procedure arguments and return values into a format that can be
transmitted over the network. This typically involves serializing the data into a byte stream.
• Unmarshalling: The reverse process of converting the received byte stream back into the original data
format that can be used by the receiving system.
4. Communication Layer
• Transport Protocol: RPC communication usually relies on a network transport protocol, such as TCP or
UDP, to handle the data transmission between client and server. The transport protocol ensures that data
packets are reliably sent and received.
• Message Handling: This layer is responsible for managing network messages, including routing, buffering,
and handling errors.
Transparency in a Distributed System
Transparency refers to hiding the complexities of the system’s implementation details from users and applications. It
aims to provide a seamless and consistent user experience regardless of the system’s underlying architecture,
distribution, or configuration. Transparency ensures that users and applications interact with distributed resources in
a uniform and predictable manner, abstracting away the complexities of the distributed nature of the system.
Importance of Transparency in Distributed Systems
Transparency is very important in distributed systems because of:
• Simplicity and Abstraction: Allows developers and users to interact with complex distributed systems using
simplified interfaces and abstractions.
• Consistency: Ensures consistent behavior and performance across different parts of the distributed system.
• Ease of Maintenance: Facilitates easier troubleshooting, debugging, and maintenance by abstracting away
underlying complexities.
• Scalability: Supports scalability and flexibility by allowing distributed components to be added or modified
without affecting overall system functionality.
Types of Transparency in Distributed Systems
1. Location Transparency
Location transparency refers to the ability to access distributed resources without knowing their physical or network
locations. It hides the details of where resources are located, providing a uniform interface for accessing them.
• Importance: Enhances system flexibility and scalability by allowing resources to be relocated or replicated without
affecting applications.
• Examples:
• DNS (Domain Name System): Maps domain names to IP addresses, providing location transparency for web
services.
• Virtual Machines (VMs): Abstract hardware details, allowing applications to run without knowledge of the
underlying physical servers.
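As a toy illustration of location transparency, clients can resolve a logical name through a small registry, so a resource can be relocated without any client change (the service name and addresses below are invented):

```python
# A toy name service: logical names hide physical locations.
registry = {"orders-service": "10.0.0.5:8080"}

def resolve(name):
    # Clients look up a logical name, never a physical address.
    return registry[name]

addr_before = resolve("orders-service")
# The resource is relocated; clients are unaffected because they
# keep resolving the same logical name.
registry["orders-service"] = "10.0.7.12:8080"
addr_after = resolve("orders-service")
```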
2. Access Transparency
Access transparency ensures that users and applications can access distributed resources uniformly, regardless of the
distribution of those resources across the network.
• Significance: Simplifies application development and maintenance by providing a consistent method for accessing
distributed services and data.
• Methods:
• Remote Procedure Call (RPC): Allows a program to call procedures located on remote systems as if they were
local.
3. Concurrency Transparency
Concurrency transparency hides the complexities of concurrent access to shared resources in distributed systems
from the application developer. It ensures that concurrent operations do not interfere with each other.
• Challenges: Managing synchronization, consistency, and deadlock avoidance in a distributed environment
where multiple processes or threads may access shared resources simultaneously.
• Techniques:
• Locking Mechanisms: Ensure mutual exclusion to prevent simultaneous access to critical sections of
code or data.
• Transaction Management: Guarantees atomicity, consistency, isolation, and durability (ACID
properties) across distributed transactions.
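The locking technique above can be illustrated with a thread lock; this single-process sketch stands in for a distributed lock service, which a real distributed system would need instead:

```python
import threading

counter = 0
lock = threading.Lock()

def increment(times):
    global counter
    for _ in range(times):
        # Mutual exclusion: only one thread may update the
        # shared counter at a time.
        with lock:
            counter += 1

threads = [threading.Thread(target=increment, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter is now exactly 4000; without the lock, updates could be lost.
```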
4. Replication Transparency
Replication transparency ensures that clients interact with a set of replicated resources as if they were a
single resource. It hides the presence of replicas and manages consistency among them.
• Strategies: Maintaining consistency through techniques like primary-backup replication, where one
replica (primary) handles updates and others (backups) replicate changes.
• Applications:
• Content Delivery Networks (CDNs): Replicate content across geographically distributed servers to
reduce latency and improve availability.
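Primary-backup replication can be sketched as follows; the class and method names are illustrative, not from any particular system:

```python
class Replica:
    def __init__(self):
        self.store = {}

class PrimaryBackup:
    """Clients see one logical store; the replicas are hidden behind it."""
    def __init__(self, num_backups=2):
        self.primary = Replica()
        self.backups = [Replica() for _ in range(num_backups)]

    def write(self, key, value):
        # The primary applies the update, then propagates it to the backups.
        self.primary.store[key] = value
        for backup in self.backups:
            backup.store[key] = value

    def read(self, key):
        # Reads are served by the primary (or any up-to-date backup).
        return self.primary.store[key]

store = PrimaryBackup()
store.write("x", 42)
```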
5. Failure Transparency
Failure transparency ensures that the occurrence of failures in a distributed system does not disrupt service
availability or correctness. It involves mechanisms for fault detection, recovery, and resilience.
• Approaches:
• Load Balancers: Distribute traffic across healthy servers and remove failed ones from the pool
automatically.
6. Performance Transparency
Performance transparency ensures consistent performance levels across distributed nodes despite variations in
workload, network conditions, or hardware capabilities.
• Challenges: Optimizing resource allocation and workload distribution to maintain predictable performance
levels across distributed systems.
• Strategies:
• Load Balancing: Distributes incoming traffic evenly across multiple servers to optimize resource
utilization and response times.
7. Security Transparency
Security transparency ensures that security mechanisms and protocols are integrated into a distributed system
seamlessly, protecting data and resources from unauthorized access or breaches.
• Importance: Ensures confidentiality, integrity, and availability of data and services in distributed environments.
• Techniques:
• Encryption: Secures data at rest and in transit using cryptographic algorithms to prevent eavesdropping or
tampering.
• Access Control: Manages permissions and authentication to restrict access to sensitive resources based on
user roles and policies.
8. Management Transparency
Management transparency simplifies the monitoring, control, and administration of distributed systems by providing
unified visibility and control over distributed resources.
• Methods: Utilizes automation, monitoring tools, and centralized management interfaces to streamline operations
and reduce administrative overhead.
• Examples:
• Cloud Management Platforms (CMPs): Provide unified interfaces for provisioning, monitoring, and
managing cloud resources across multiple providers.
• Configuration Management Tools: Automate deployment, configuration, and updates of software and
infrastructure components in distributed environments.
What is RPC mechanism?
RPC enables a client to communicate with a server by calling procedures in much the same way as a conventional local procedure call, except that the called procedure executes in a different process, and usually on a different computer.
The steps in making a RPC
• Client procedure calls the client stub in a normal way.
• Client stub builds a message and traps to the kernel.
• Kernel sends the message to remote kernel.
• Remote kernel gives the message to server stub.
• Server stub unpacks parameters and calls the server.
• Server computes results and returns it to server stub.
• Server stub packs results in a message to client and traps to kernel.
• Remote kernel sends message to the client's kernel.
• Client kernel gives message to client stub.
• Client stub unpacks results and returns to client.
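The steps above can be simulated in-process; this is a hedged sketch in which the kernel/network hops are reduced to function calls and the message format is invented for illustration:

```python
import json

# The actual remote procedure, living on the "server".
def add(a, b):
    return a + b

PROCEDURES = {"add": add}

def server_stub(request):
    # Unpack parameters and call the server procedure.
    msg = json.loads(request.decode())
    result = PROCEDURES[msg["proc"]](*msg["args"])
    # Pack the result into a reply message for the client.
    return json.dumps({"result": result}).encode()

def client_stub(proc, *args):
    # Build a request message (in a real RPC this traps to the kernel).
    request = json.dumps({"proc": proc, "args": list(args)}).encode()
    # "Send" the message to the server stub and wait for the reply.
    reply = server_stub(request)
    # Unpack the result and return it to the caller.
    return json.loads(reply.decode())["result"]

# The client calls the stub exactly as it would a local procedure.
total = client_stub("add", 2, 3)
```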
RPC Implementation Mechanism
• RPC is an effective mechanism for building distributed client-server systems. RPC enhances the power and ease of programming of the client/server computing concept. It is a protocol that allows one program to request a service from a program on another computer in a network without having to know the details of the network. The program that makes the request is called the client, and the program that provides the service is called the server.
• The calling parameters are sent to the remote process during a Remote Procedure Call, and the caller waits for a response from the remote procedure.
When the client process makes a request by calling a local procedure, the procedure packs the arguments/parameters into a request message so that they can be sent to the remote server. The remote server then executes the procedure (based on the request that arrived from the client machine) and, after execution, returns a response to the client in the form of a message. Until this time the client is blocked, but as soon as the response arrives from the server side it can extract the result from the message. In some cases, RPCs can also be executed asynchronously, in which case the client is not blocked while waiting for the response.
Parameter passing Semantics in RPC
• When a client sends a procedure call to a server over network, parameters of procedure need to be transmitted to
server. RPC uses different parameter passing methods to transmit these parameters.
• The parameter passing is the only way to share information between clients and servers in the Remote Procedure
Call (RPC).
The following are the various semantics used in RPC for passing parameters in distributed applications:
1. Call-by-Value: The client stub copies and packages the value from the client into a message so that it can be
sent to the server through a network.
• Parameter marshaling is like packing a suitcase for a trip. When a program on your computer (the client) wants to
use a function on another computer (the server), it needs to send the necessary information (parameters) over.
• 1. Packing (Client Stub): The client stub takes the parameters (like `x` and `y` for the `add` function) and puts
them into a message, like packing items into a suitcase. It also includes what function needs to be called (`add`).
• 2. Sending: The message is sent to the server.
• 3. Unpacking (Server):The server unpacks the "suitcase" to see what function is needed and what parameters to
use.
• 4. Execution (Server): The server executes the function.
• 5. Repacking (Server): The server puts the result back into a message and sends it back to the client.
• 6. Unpacking (Client Stub): The client stub unpacks the result and gives it back to the program that made the call.
This works smoothly if both computers speak the same language and the data is simple (like numbers or letters).
2. Call-by-Reference: Call-by-Reference means that pointers to the parameters are transferred from the client to the server. In some RPC techniques, parameters can be passed by reference. It is employed in closed systems in which multiple processes share a single address space.
Pointers (memory addresses) don't translate between computers in a distributed system. Directly passing a pointer from
a client to a server won't work because the address is only valid on the client's machine. The "asking back and forth"
solution is possible, but slow.
• In passing arrays, a variable's address is supplied; the same approach is used for handling pointer-based data structures such as lists, trees, stacks, and graphs.
• Call-by-object-reference: Here the RPC mechanism uses object invocation. The value that a variable holds is a reference to an object.
• Call-by-move: A parameter is passed by reference, much like in the call-by-object-reference method, but the parameter object is relocated to the target (callee) node during the call. It is termed call-by-visit if the object returns to the caller's node after the call. Call-by-move allows the argument objects to be packaged in the same network packet as the invocation message, which in turn reduces network traffic and message count.
Call Semantics in RPC
RPC has the same semantics as a local procedure call, the calling process calls the procedure, gives inputs to it,
and then waits while it executes. When the procedure is finished, it can return results to the calling process. If
the distributed system is to achieve transparency, the following problems concerning the properties of remote
procedure calls must be considered in the design of an RPC system:
• Binding: Binding establishes a link between the caller process’s name and the remote procedure’s location.
• Communication Transparency: It should be unknown to the users that the process they are calling is
remote.
• Concurrency: Communication techniques should not mix with concurrency mechanisms. When single-
threaded clients and servers are blocked while waiting for RPC results, considerable delays might occur.
Lightweight processes permit the server to handle concurrent calls from several clients.
• Heterogeneity: Separate machines may have distinct data representations, operate under different operating
systems, or have remote procedures written in different languages.
Types of Call Semantics:
• Perhaps or Possibly Call Semantics: This is the weakest semantic; the caller waits until a predetermined timeout period has elapsed and then continues with its execution. It is used in services where periodic updates are required.
• Last-one Call Semantics: Retransmission of the call message is based on a timeout. After the timeout period has elapsed, the result obtained from the last execution is used by the caller. It can generate orphan calls. It finds its application in designing simple RPC.
• Last-of-Many Call Semantics: Like Last-one Call Semantics, but it neglects orphan calls through call identifiers. A new call-id is assigned to the call whenever it is repeated. A result is accepted by the caller only if its call-id matches that of the most recently repeated call.
• At-least-once Call Semantics: The call is executed one or more times, but which execution's results are delivered to the caller is not specified. Here too, retransmissions rely on the timeout period without regard to orphan calls. In the case of nested calls, the result is taken from the first response message and the others are ignored, irrespective of whether the accepted response came from an orphan call or not.
• Exactly-once Call Semantics: No matter how many times the call is transmitted, the possibility of the procedure being executed more than once is eliminated. Only when the server receives an acknowledgment from the client does it delete the information from its reply cache.
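The reply cache behind exactly-once semantics can be sketched as follows (the names and the doubling "procedure" are illustrative): the server remembers each call's result by call ID, answers a retransmitted request from the cache instead of re-executing, and deletes the entry only after the client's acknowledgment.

```python
executions = 0
reply_cache = {}

def execute(call_id, x):
    """Exactly-once server: duplicate requests return the cached reply."""
    global executions
    if call_id in reply_cache:
        # Retransmission: return the stored reply, do not re-execute.
        return reply_cache[call_id]
    executions += 1
    result = x * 2              # the actual procedure body
    reply_cache[call_id] = result
    return result

def acknowledge(call_id):
    # The server deletes the cached reply only after the client's ack.
    reply_cache.pop(call_id, None)

first = execute("call-7", 21)
retry = execute("call-7", 21)   # duplicate request: served from the cache
acknowledge("call-7")
```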
Communication Protocols For RPCs
The following are the communication protocols that are used in RPC:
• Request Protocol
• Request/Reply Protocol
1. On-Chip Memory
• The data is present in the CPU portion of the chip.
2. Bus-Based Multiprocessors
• A set of parallel wires called a bus acts as a connection between CPU and memory.
• Accessing of same memory simultaneously by multiple CPUs is prevented by using some algorithms.
3. Ring-Based Multiprocessors
• There is no global centralized memory present in Ring-based DSM.
• In ring-based DSM, a single address space is divided, with part of it forming the shared area.
Architecture of Distributed Shared Memory (DSM)
The architecture of a Distributed Shared Memory (DSM) system typically consists of several key components that work together to provide the illusion of a shared memory space across distributed nodes. The components of the DSM architecture are:
1.Nodes: Each node in the distributed system consists of one or more CPUs and a memory unit. These nodes are
connected via a high-speed communication network.
2.Memory Mapping Manager Unit: The memory mapping manager routine in each node is responsible for
mapping the local memory onto the shared memory space. This involves dividing the shared memory space
into blocks and managing the mapping of these blocks to the physical memory of the node.
• Caching is employed to reduce operation latency. Each node uses its local memory to cache portions of the
shared memory space. The memory mapping manager treats the local memory as a cache for the shared
memory space, with memory blocks as the basic unit of caching.
3. Communication Network Unit: This unit facilitates communication between nodes. When a process
accesses data in the shared address space, the memory mapping manager maps the shared memory address to
physical memory. The communication network unit handles the communication of data between nodes,
ensuring that data can be accessed remotely when necessary.
Algorithm for implementing Distributed Shared Memory
1. Central Server Algorithm:
• In this, a central server maintains all shared data. It
services read requests from other nodes by
returning the data items to them and write requests
by updating the data and returning
acknowledgement messages.
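A minimal sketch of the central server algorithm, with message passing reduced to method calls (the class names are invented for illustration):

```python
class CentralServer:
    """Holds all shared data; nodes send it read/write requests."""
    def __init__(self):
        self.data = {}

    def handle_read(self, key):
        # Service a read request by returning the data item.
        return self.data.get(key)

    def handle_write(self, key, value):
        # Service a write request by updating the data and acknowledging.
        self.data[key] = value
        return "ACK"

class Node:
    def __init__(self, server):
        self.server = server   # every shared access goes to the server

    def read(self, key):
        return self.server.handle_read(key)

    def write(self, key, value):
        return self.server.handle_write(key, value)

server = CentralServer()
node_a, node_b = Node(server), Node(server)
ack = node_a.write("x", 10)   # node A writes...
seen = node_b.read("x")       # ...and node B sees the update
```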
4. Session Consistency
• Session Consistency guarantees that all of the data and actions a user engages with within a single
session remain consistent. Consider it similar to online shopping: session consistency ensures that
an item will always be in your cart until you check out or log out, regardless of how you explore the
page.
5. Causal Consistency Model
The Causal Consistency Model is a type of consistency in distributed systems that ensures that related
events happen in a logical order. In simpler terms, if two operations are causally related (like one action
causing another), the system will make sure they are seen in that order by all users. However, if there’s no
clear relationship between two operations, the system doesn’t enforce an order, meaning different users
might see the operations in different sequences.
Replacement strategy for a distributed caching system
• Problem: Existing cache strategies don't fully address the unique temporal and spatial access
patterns of geospatial data, leading to suboptimal cache hit rates.
• Proposed Solution: A new cache replacement strategy that considers both temporal and spatial
locality in geospatial data access.
• Temporal Locality: Tracks access frequency and time intervals using a modified LRU
approach.
• Spatial Locality: Builds access sequences based on an LRU stack to capture spatial
relationships and caching locations.
• Balancing Act: Chooses replacement objects based on access sequence length and caching
resource costs to balance temporal and spatial locality.
• Benefits: Improved cache hit rate, better response performance, and higher system throughput,
making it suitable for cloud-based networked GIS environments.
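The temporal-locality idea (an LRU variant that also tracks access frequency) can be sketched like this; the eviction rule shown, least recent among the least frequent, is a simplification for illustration, not the exact strategy proposed:

```python
from collections import OrderedDict

class FrequencyAwareLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()   # key -> access count; order = recency

    def access(self, key):
        # Record an access: bump the frequency and move the key to the
        # most-recent position; evict first if a new key would overflow.
        count = self.items.pop(key, 0)
        if count == 0 and len(self.items) >= self.capacity:
            self.evict()
        self.items[key] = count + 1

    def evict(self):
        # Victim: lowest frequency, ties broken by least recent use.
        victim = min(self.items,
                     key=lambda k: (self.items[k],
                                    list(self.items).index(k)))
        del self.items[victim]

cache = FrequencyAwareLRU(capacity=2)
cache.access("tile_a")
cache.access("tile_a")   # tile_a now has frequency 2
cache.access("tile_b")
cache.access("tile_c")   # cache full: tile_b (frequency 1) is evicted
```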
Thrashing
• Thrashing occurs when the system spends a major portion of its time transferring shared data blocks from one node to another, compared with the time spent doing the useful work of executing application processes. If thrashing is not handled carefully, it degrades system performance considerably.
Why Thrashing Occurs?
1. Ping-pong effect: It occurs when processes make interleaved data accesses on two or more nodes, which may cause a data block to move back and forth from one node to another in quick succession; this is known as the ping-pong effect.
2. When blocks with read-only permission are repeatedly invalidated soon after they are replicated. This is caused by poor locality of reference.
• One way to handle thrashing is to associate an application-controlled lock with each data block.
• Another is to nail a block to a node for a minimum amount of time t; on the basis of past access patterns, t can be fixed statically or dynamically.
2. Resource Attributes:
1. Define functionality, structure, and properties of resources.
1. State Saving:
1. Freeze the selected process to halt its execution.
2. Save the process control data structure (PCB), ports, and memory space.
2. Data Transfer:
1. Send the state information to the destination computer.
5. Completion Confirmation:
1. After execution, return results or the process image to the original computer.
6. Process Restoration:
1. Allocate data structures on the receiving computer using the received information.
2. Shared Resources: Threads within a process share the same memory and resources, making them
lightweight compared to processes.
• Resource Efficiency: Threads share memory and resources of their parent process, reducing
overhead.
• Enhanced Scalability: Threads allow distributed systems to handle large numbers of requests
simultaneously.
• Simplified Multitasking: Threads enable simultaneous execution of tasks like data processing and
I/O operations.
Use Cases of Threads
• Parallel Processing: Distribute computational tasks across threads for faster processing.
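Distributing a computation across threads can be sketched as follows (for CPU-bound Python work a process pool is usually preferred; threads are used here to match the discussion):

```python
import threading

def partial_sum(numbers, results, index):
    # Each thread computes the sum of its own slice independently.
    results[index] = sum(numbers)

data = list(range(1, 101))
chunks = [data[:50], data[50:]]
results = [0] * len(chunks)

threads = [threading.Thread(target=partial_sum, args=(chunk, results, i))
           for i, chunk in enumerate(chunks)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = sum(results)   # aggregate the per-thread results
```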
• Simplicity and ease of use: The user interface of a file system should be simple and the
number of commands in the file should be small.
• High availability: A Distributed File System should be able to continue in case of any
partial failures like a link failure, a node failure, or a storage drive crash.
A highly reliable and adaptable distributed file system should have multiple and independent file servers controlling multiple and independent storage devices.
• Scalability: Since growing the network by adding new machines or joining two networks
together is routine, the distributed system will inevitably grow over time. As a result, a
good distributed file system should be built to scale quickly as the number of nodes and
users in the system grows.
Working of DFS
There are two ways in which DFS can be implemented:
• Standalone DFS namespace: It allows only those DFS roots that exist on the local computer and do not use Active Directory. A standalone DFS can only be accessed on the computer on which it is created. It does not provide any fault tolerance and cannot be linked to any other DFS. Standalone DFS roots are rarely encountered because of their limited advantage.
• Domain-based DFS namespace: It stores the configuration of DFS in Active Directory, creating the
DFS namespace root accessible at \\<domainname>\<dfsroot> or \\<FQDN>\<dfsroot>
File Model in Distributed Systems
A file model in distributed systems refers to the way data and files are
organized, accessed, and managed across multiple nodes or locations
within a network. It encompasses the structure, organization, and
methods used to store, retrieve, and manipulate files in a distributed
environment. File models define how data is stored physically, how it
can be accessed, and what operations can be performed on it.
Importance of File Models in Distributed
Systems
The importance of file models in distributed systems lies in their ability to:
• Organize and Structure Data: File models provide a framework for organizing data into
logical units, making it easier to manage and query data across distributed nodes.
• Ensure Data Consistency and Integrity: By defining how data is structured and
accessed, file models help maintain data consistency and integrity, crucial for reliable
operations in distributed environments.
• Support Scalability: Different file models offer varying levels of scalability, allowing
distributed systems to efficiently handle growing amounts of data and increasing user
demands.
• Enable Efficient Access and Retrieval: Depending on the file model chosen, distributed
systems can optimize data access patterns, ensuring that data retrieval operations are
efficient and responsive.
• Access Patterns: File sharing semantics govern how files are accessed by different
users. This includes read and write operations.
• Visibility of Changes: They determine when changes made by one user become visible
to others. Immediate or delayed visibility impacts system behavior.
2. Fault Diagnosis:-Fault diagnosis is the process in which the fault identified in the first phase is diagnosed properly in order to determine its root cause and likely nature. Fault diagnosis can be done manually by the administrator or by using automated techniques in order to resolve the fault and perform the given task.
3. Evidence Generation:-Evidence generation is defined as the process where the report of the fault
is prepared based on the diagnosis done in an earlier phase. This report involves the details of the
causes of the fault, the nature of faults, the solutions that can be used for fixing, and other
alternatives and preventions that need to be considered.
4. Assessment:-Assessment is the process where the damages caused by the faults are analyzed. It
can be determined with the help of messages that are being passed from the component that has
encountered the fault. Based on the assessment further decisions are made.
5. Recovery:-Recovery is the process whose aim is to make the system fault free and restore it to a correct state, either by forward recovery (moving to a new correct state) or by backward recovery (rolling back to a previous correct state). Some common recovery techniques such as reconfiguration and resynchronization can be used.
Types of Faults
• Transient Faults: Transient Faults are the type of faults that occur once and then
disappear. These types of faults do not harm the system to a great extent but are
very difficult to find or locate. Processor fault is an example of transient fault.
• Intermittent Faults: Intermittent Faults are the type of faults that come again and
again. Such as once the fault occurs it vanishes upon itself and then reappears
again. An example of intermittent fault is when the working computer hangs up.
• Permanent Faults: Permanent Faults are the type of faults that remain in the
system until the component is replaced by another. These types of faults can
cause very severe damage to the system but are easy to identify. A burnt-out chip
is an example of a permanent Fault.
Need for Fault Tolerance in Distributed Systems
Fault Tolerance is required in order to provide below four features.
1. Availability: Availability is defined as the property where the system is readily available for its
use at any time.
2. Reliability: Reliability is defined as the property where the system can work continuously
without any failure.
3. Safety: Safety is defined as the property where the system can remain safe from
unauthorized access even if any failure occurs.
4. Maintainability: Maintainability is defined as the property that states how easily and quickly a failed node or system can be repaired.
Design Principles of Distributed File
System
1. Scalability
• The system must handle increasing amounts of data and users efficiently without degradation in
performance.
• Example: Hadoop Distributed File System (HDFS) is designed to scale out by adding more DataNodes to the cluster. Each DataNode stores data blocks, and the system can handle petabytes of data across thousands of nodes.
2. Consistency
• Ensuring that all users see the same data at the same time. This can be achieved through different
consistency models
• Example: Google File System (GFS) provides a relaxed consistency model to achieve high availability and performance. It allows concurrent mutations and uses version numbers and timestamps to maintain consistency.
3. Availability
• Ensuring that the system is operational and accessible even during failures.
• Example: Amazon S3 achieves high availability by replicating data across multiple Availability Zones (AZs). If one AZ fails, data is still accessible from another, ensuring minimal downtime and high availability.
4. Performance
Optimizing the system for speed and efficiency in data access.
Example: Ceph is designed to provide high performance by using techniques such as object storage, which allows for efficient, parallel data access. It uses a dynamic distributed hashing algorithm called CRUSH.
5. Security
Protecting data from unauthorized access and ensuring data integrity.
Example:
Azure Blob Storage: Azure Blob Storage offers comprehensive security features, including role-based access control (RBAC).
6. Data Management
Efficiently distributing, replicating, and caching data to ensure optimal performance
and reliability.
Example:
Cassandra: Apache Cassandra is a distributed NoSQL database that uses consistent
hashing to distribute data evenly across all nodes in the cluster.
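Consistent hashing of the kind Cassandra uses can be sketched as a simplified ring without virtual nodes (the node names are illustrative):

```python
import hashlib
from bisect import bisect

def ring_hash(value):
    # Map any string to a position on the hash ring.
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Place each node on the ring at its hash position.
        self.ring = sorted((ring_hash(n), n) for n in nodes)

    def node_for(self, key):
        # A key belongs to the first node clockwise from its hash
        # (wrapping around the end of the ring).
        positions = [pos for pos, _ in self.ring]
        index = bisect(positions, ring_hash(key)) % len(self.ring)
        return self.ring[index][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
owner = ring.node_for("user:42")   # always the same node for this key
```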