
Distributed System

NRB- Assistant-Director-IT

What is a Distributed System?
• A Distributed System is a collection of autonomous (independent) computers (nodes) that are physically separated but connected by a computer network and provided with distributed system software.
• The autonomous computers work together by sharing resources and files and performing a set of functions.
• In a distributed system, each computer is referred to as a node, and the nodes communicate with each other by passing messages over the network.
• Distributed systems are designed to be scalable, fault-tolerant, and highly available.

What is a Distributed System?
• They are used in a wide range of applications, including web servers, cloud
computing, distributed databases, and distributed file systems.
• Middleware: Middleware is a layer of software that acts as an
intermediary between the operating system and application software in
a distributed environment.
• It provides a set of services and abstractions to facilitate communication, coordination, and
integration among different components within the distributed system.

Characteristics of Distributed System:
• Resource Sharing: It is the ability to use any Hardware, Software, or Data anywhere in the
System.
• Openness: Openness in distributed systems refers to the ability of the system to be extended,
modified, and integrated with other systems, promoting interoperability, extensibility, portability,
and reusability.
• Concurrency: Concurrency is naturally present in distributed systems: the same activity or functionality can be performed by separate users who are in remote locations.
• Multiple machines can process the same function at the same time.
• Every local system has its own independent operating system and resources.
• Scalability. The ability to grow as the size of the workload increases is an essential feature of
distributed systems, accomplished by adding additional processing units or nodes to the network as
needed.
• The computing and processing capacity can scale up as needed when extended to additional machines
Characteristics of Distributed System:
• Availability and fault tolerance. If one node fails, the remaining nodes can continue to
operate without disrupting the overall computation.
• Fault tolerance concerns the reliability of the system: if there is a failure in hardware or software, the system continues to operate properly without degraded performance.
• Transparency. The end user sees a distributed system as a single computational unit
rather than as its underlying parts, allowing users to interact with a single logical device
rather than being concerned with the system’s architecture.
• Transparency hides the complexity of the distributed system from users and application programs.
• Heterogeneity: In most distributed systems, the nodes and components are often
asynchronous, with different hardware, middleware, software and operating systems.
• This allows the distributed systems to be extended with the addition of new components.
Advantages of Distributed System:
• Horizontal Scaling - Distributed systems can scale horizontally by adding more nodes to the network,
allowing them to handle increasing workloads and accommodate more users.
• Fault Tolerance - Distributed systems are designed to be fault-tolerant, meaning that they can continue to operate
even if some nodes fail. This is achieved through redundancy and replication of data and services.
• If one server or data center goes down, others can still serve the users of the service, giving higher reliability and availability against component failure.
• High Availability: Distributed systems are designed to be highly available, meaning that they can continue to
operate even if some nodes are unavailable. This is achieved through load balancing and failover mechanisms.
• Performance: Distributed systems can improve performance by distributing tasks across multiple nodes and processing them in parallel.
• Geographical Distribution: Distributed systems can be geographically distributed, allowing them to provide
services to users in different locations.
• Low Latency - placing machines geographically closer to users reduces the time it takes to serve them.

Disadvantages of Distributed System:
• Complexity: Distributed systems are inherently more complex than centralized systems,
as they involve multiple nodes, networks, and communication protocols.
• Consistency: Maintaining consistency of data and services across all nodes in a
distributed system can be challenging and requires careful design and implementation.
• Concurrency: Supporting concurrent access to data and services by multiple users in a
distributed system can be challenging and requires distributed locking and
synchronization mechanisms.
• Latency: Distributed systems can suffer from increased latency due to the need to
communicate over a network, especially in geographically distributed systems.
• Security: Distributed systems can be more vulnerable to security threats, as they involve
multiple nodes and networks that need to be secured.
Transparency in Distributed System
• Transparency in distributed systems refers to the degree to which the system's distributed nature is hidden from its users, so that the system appears as a single unit.
• Access Transparency: Users should not need to know the physical location or
distribution of resources. This is achieved through mechanisms like remote
procedure calls (RPCs) or object-oriented middleware.
• Location Transparency: Users should not need to know where a resource is
located. This is achieved through mechanisms like naming and directory
services.
• Replication Transparency: Users should not need to know that a resource has
been replicated. This is achieved through mechanisms like distributed file
systems.
Transparency in Distributed System
• Concurrency Transparency: Users should not need to know that a resource is being
shared by multiple users. This is achieved through mechanisms like distributed locking.
• Failure Transparency: Users should not need to know that a resource has failed. This is
achieved through mechanisms like fault tolerance and error recovery.
• Performance Transparency: Users should not need to know that a resource is being
accessed over a network. This is achieved through mechanisms like caching and load
balancing.
• Scaling Transparency: Users should not need to know that a resource is being scaled.
This is achieved through mechanisms like distributed databases and distributed file
systems.
• Security Transparency: Users should not need to know that a resource is being accessed
securely. This is achieved through mechanisms like encryption and access control.
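As a sketch of access transparency via RPC, the snippet below uses Python's built-in xmlrpc modules as a stand-in for heavier RPC middleware: the client calls add() as if it were a local function, while the call actually crosses the network. The function name and loopback address are illustrative.

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

def add(a, b):
    return a + b

# The server registers add() and serves it over HTTP in the background.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]        # port 0: let the OS pick a free port
server.register_function(add, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# The client invokes the remote procedure as if it were local.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.add(2, 3)
print(result)  # 5
```

The proxy object is what gives the transparency: the caller never sees the network hop.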
Applications Area of Distributed System:
• Finance and Commerce: Amazon, eBay, Online Banking, E-Commerce websites.
• Information Society: Search Engines, Wikipedia, Social Networking, Cloud Computing.
• Cloud Technologies: AWS, Salesforce, Microsoft Azure, SAP.
• Entertainment: Online Gaming, Music, YouTube.
• Healthcare: Online patient records, Health Informatics.
• Education: E-learning, video conferencing systems
• Transport and logistics: GPS, Google Maps.
• Environment Management: Sensor technologies.
• Cryptocurrency processing systems: (e.g. Bitcoin)
Challenges of Distributed Systems:
• While distributed systems offer many advantages, they also present some
challenges that must be addressed. These challenges include:
• Network latency: The communication network in a distributed system can
introduce latency, which can affect the performance of the system.
• Distributed coordination: Distributed systems require coordination among
the nodes, which can be challenging due to the distributed nature of the system.
• Security: Distributed systems are more vulnerable to security threats than
centralized systems due to the distributed nature of the system.
• Data consistency: Maintaining data consistency across multiple nodes in a
distributed system can be challenging.

Client Server Computing
• Client-server computing is a distributed
computing model where tasks or processes are
divided between client devices and server
systems, and these components communicate
over a network.
• In client-server computing,
• clients request a resource and the server provides that resource.
• A central server may serve multiple clients at the
same time while a client is in contact with only one
server.
• Both the client and server usually communicate via
a computer network.
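The request-response pattern above can be sketched with plain TCP sockets on the loopback interface; the echo-style service and message contents are illustrative.

```python
import socket
import threading

def serve_one(listener):
    # Server side: accept one client, read its request, send a response.
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024)
        conn.sendall(request.upper())   # the "resource" the server provides

listener = socket.socket()
listener.bind(("127.0.0.1", 0))         # port 0: let the OS pick a free port
listener.listen(1)
threading.Thread(target=serve_one, args=(listener,), daemon=True).start()

# Client side: connect, send a request, receive the response.
client = socket.socket()
client.connect(listener.getsockname())
client.sendall(b"hello server")
reply = client.recv(1024)
client.close()
print(reply)  # b'HELLO SERVER'
```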
Building blocks of Client Server Computing
• The building blocks of client-server computing include various
components that contribute to the architecture and functionality of the
system. Here are the key building blocks:
• Client:
• The client is a user's device or application that requests services or
resources from a server.
• It initiates communication by sending requests to the server.
• Clients can be desktop computers, laptops, smartphones, tablets, or
any device capable of making requests.
• Server:
• The server is a powerful computer or software application that
provides services or resources to clients.
• It responds to client requests by processing and delivering the
requested information or services.
• Servers are designed to handle multiple client requests
simultaneously.
Building blocks of Client Server Computing
• Middleware:
• Middleware is software that facilitates communication and data exchange between the client
and server.
• A software that acts as a bridge between different application, system or components.
• It enables them to communicate and work together.
• Middleware acts as a layer between the client and the server
• It’s sometimes called plumbing as it connects two applications together so data and database
can be easily passed between the “Pipe.”- Microsoft
• It acts as an intermediary layer, providing services such as message queuing,
authentication/security, and data transformation.

Building blocks of Client Server Computing
• In the context of client-server computing, middleware plays a crucial role in enabling
communication between the client and server components.
• It abstracts the underlying network and hardware complexities, allowing developers to focus on
building the application logic without worrying about the complexity of network communication.
• Some common functions provided by middleware in client-server computing include:
• Remote Procedure Calls (RPC): Middleware allows clients to invoke procedures or methods on
remote servers as if they were local, abstracting the details of network communication.
• Object Request Brokers (ORBs): Middleware provides ORBs that facilitate communication
between distributed objects, allowing them to interact seamlessly across the network.
• Message Queues: Middleware provides message queuing services that enable asynchronous
communication between clients and servers, improving system responsiveness and reliability.
• Data Access: Middleware provides data access services that allow clients to access and manipulate
data stored on remote servers, ensuring data consistency and integrity.
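Message queuing can be sketched with Python's standard queue module standing in for a real broker (such as RabbitMQ): the client enqueues requests and continues, while a worker thread processes them asynchronously.

```python
import queue
import threading

broker = queue.Queue()      # stands in for the middleware's message queue
results = []

def worker():
    # Server side: consume messages asynchronously, one at a time.
    while True:
        msg = broker.get()
        try:
            if msg is None:          # sentinel: stop the worker
                break
            results.append(msg.upper())
        finally:
            broker.task_done()

threading.Thread(target=worker, daemon=True).start()

# Client side: enqueue requests without waiting for them to be processed.
for text in ["order placed", "payment received"]:
    broker.put(text)
broker.put(None)
broker.join()               # wait until every queued message is handled
print(results)  # ['ORDER PLACED', 'PAYMENT RECEIVED']
```

The asynchronous hand-off is what improves responsiveness: the sender never blocks on the receiver.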

Characteristics of Client Server Computing
• Tasks and responsibilities are divided between clients and servers, enhancing efficiency.
• Clients and servers communicate over a network, enabling data exchange.
• Interaction follows a request-response pattern, facilitating clear communication flow.
• Servers manage resources like databases, applications, or files for consistency and security.
• Client-server architectures scale by adding more clients or servers as needed.
• Security mechanisms, such as authentication and encryption, protect data and ensure
integrity.
• Clients and servers follow a common communication protocol at the application layer, such as HTTP.

Advantages of Client-server networks:
• Centralized: Centralized back-up is possible in client-server networks, i.e., all the
data is stored in a server.
• Security: These networks are more secure as all the shared resources are centrally
administered.
• Performance: The use of the dedicated server increases the speed of sharing
resources. This increases the performance of the overall system.
• Scalability: We can increase the number of clients and servers separately, i.e., the
new element can be added, or we can add a new node in a network at any time.

Disadvantages of Client-Server network:
• Traffic congestion is a big problem in client/server networks.
• When a large number of clients send requests to the same server, traffic congestion can result.
• The network is not robust:
• when the server is down, client requests cannot be met.
• The server is a critical point of a client/server network.
• Sometimes, regular computer hardware cannot serve a certain number of clients. In such situations, specific hardware is required on the server side to complete the work.
How does client-server architecture work?
• Client-server architecture is made up of two elements: one that provides services and one that consumes those services.
• The user enters the uniform resource locator (URL) of the website or file, and the browser sends a request to the domain name system (DNS) server.
• The DNS server is responsible for searching and retrieving the IP address associated with a web server and then initiating actions using that IP address.
• After the DNS server responds, the browser sends an HTTP or HTTPS request to the web server's IP address, which was provided by the DNS server.
• Following the request, the server transmits the required website files.
• Finally, the browser processes the files and the website is displayed.
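The flow above can be sketched offline in Python; localhost is used for the DNS step so the example needs no internet access, and the URL is illustrative.

```python
import socket
from urllib.parse import urlparse

# Step 0: the user enters a URL (localhost here, so no internet is needed).
url = "http://localhost/index.html"
parts = urlparse(url)
host, path = parts.hostname, parts.path

# Step 1: the DNS lookup resolves the host name to an IP address.
ip = socket.gethostbyname(host)

# Step 2: the browser would now open a TCP connection to that IP address
# and send an HTTP request such as:
request = f"GET {path} HTTP/1.1\r\nHost: {host}\r\n\r\n"
print(ip, "-", request.splitlines()[0])
```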

The state of distributed client server infrastructure
• The state of distributed client-server infrastructure refers to the current condition of the networked architecture in which clients and servers interact, including the hardware, software, protocols, and overall design of the infrastructure.
• The state of distributed client-server infrastructure is characterized by several trends and developments:
• Cloud Computing
• Edge Computing
• Microservices Architecture

Cloud Computing
• There is a growing trend towards cloud-based distributed client-server infrastructure,
• where services and resources are hosted on remote servers and accessed over the internet. This
allows for scalability, flexibility, and cost-effectiveness.
• Cloud computing is the delivery of computing services — servers, storage, databases, networking,
software and more — over the Internet (“the cloud”).
• You typically pay only for cloud services you use, helping lower your operating costs, run your
infrastructure more efficiently and scale as your business needs change.
• Cloud Computing is a model for enabling on-demand network access to a shared pool of configurable computing resources (e.g. networks, servers, storage, applications and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction.
• This cloud model is composed of five essential characteristics, three service models and
four deployment models.

Cloud Computing Architecture
• Cloud computing architecture includes:
• front-end platforms, consisting of cloud clients such as desktops, tablets, and mobile devices,
• back-end platforms (servers and storage), which provide data security, traffic control, and middleware for device communication,
• cloud-based delivery, and a network (Internet, Intranet, Intercloud).
• The front end interacts with the back end through middleware or web-browser virtual sessions.
• These components work together to enable cloud computing.
Cloud Computing- Characteristics:
• On-demand self-service: Users can provision computing resources as needed
without requiring human intervention from the service provider.
• Broad network access: Services are accessible over the network and can be
accessed from a variety of devices.
• Resource pooling: Computing resources are pooled together to serve multiple
users, allowing for efficient resource utilization.
• Rapid elasticity: Computing resources can be rapidly scaled up or down to meet
changing demand.
• Measured service: Usage of computing resources is monitored, controlled, and
reported, allowing for transparent and efficient resource management.

Cloud Computing- Service Models:
• Infrastructure as a Service (IaaS): Provides virtualized computing
resources over the internet, such as virtual machines, storage, and
networking.
• Platform as a Service (PaaS): Provides a platform for developing,
deploying, and managing applications without the complexity of
infrastructure management.
• Software as a Service (SaaS): Provides software applications over the
internet on a subscription basis, eliminating the need for local
installation and maintenance.
Cloud Computing- Deployment Models:
• Public cloud: Services are provided over the
internet and are available to the general public.
e.g.: Amazon Web Services (AWS), Microsoft
Azure, and Google Cloud Platform (GCP)
• Private cloud: Services are provided over a
private network and are dedicated to a single
organization. e.g.: company that sets up its own
data center with virtualization technology to
provide computing resources to its internal users.
• Hybrid cloud: Combines public and private cloud
services, allowing for data and applications to be
shared between them. e.g., AWS, Azure
• Community cloud: Services are shared by several
organizations with common concerns, such as
security or compliance requirements. e.g.
healthcare organizations
Advantages of Cloud Computing:
• Cost-Efficiency: Cloud computing eliminates the need for upfront infrastructure
investment, reducing costs.
• Scalability: Cloud services can be easily scaled up or down based on demand,
allowing for flexibility.
• Accessibility: Cloud services can be accessed from anywhere with an internet
connection, enabling remote work.
• Reliability: Cloud providers offer high availability and redundancy, reducing the
risk of downtime.
• Security: Cloud providers invest in security measures, often providing better
security than on-premises solutions.

Disadvantages of Cloud Computing:
• Dependency on Internet: Cloud services require a stable internet connection,
which can be a limitation.
• Data Privacy: Storing data on the cloud raises concerns about data privacy and
security.
• Limited Control: Users have limited control over the infrastructure and services
provided by cloud providers.
• Compliance: Compliance with regulations and standards can be challenging in a
cloud environment.
• Potential Downtime: Cloud services can experience downtime, impacting
business operations.

Edge Computing:
• Edge computing is a distributed
computing paradigm that brings
computation and data storage closer to
the location where it is needed,
improving response times and saving
bandwidth.
• In edge computing, data is processed
by the device itself or by a local
computer or server, rather than being
transmitted to a centralized data center.

Key characteristics of edge computing:
• Lower Latency: Edge computing brings computation and data storage closer to the
location where it is needed, reducing latency and improving response times.
• Bandwidth Savings: Edge computing reduces the amount of data that needs to be
transmitted over the network, saving bandwidth and reducing network congestion.
• Decentralization: Edge computing distributes computation and data storage across
multiple devices or servers, rather than relying on a centralized data center.
• Scalability: Edge computing can be easily scaled up or down based on demand, allowing
for flexibility and cost-efficiency.
• Reliability: Edge computing can provide high availability and redundancy, reducing the
risk of downtime.
• Security: Edge computing can provide better security by keeping sensitive data closer to
the source and reducing the need to transmit data over the network.
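The bandwidth saving can be illustrated with a small sketch: an edge node aggregates hypothetical raw sensor readings locally and uploads only a compact summary instead of every sample.

```python
# Hypothetical raw temperature readings captured on an edge device.
readings = [21.0, 21.2, 35.9, 21.1, 21.3, 36.2]
threshold = 30.0    # readings above this are anomalies worth reporting

# Local processing: aggregate on the device instead of shipping raw data.
summary = {
    "count": len(readings),
    "mean": round(sum(readings) / len(readings), 2),
    "alerts": [r for r in readings if r > threshold],
}
print(summary)  # only this small summary is sent to the data center
```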

Applications Area of Edge computing
• Manufacturing: Edge computing enables real-time analysis of sensor and robotics data,
improving quality control, maintenance, and worker safety.
• For example, in a smart factory, sensors on manufacturing equipment can detect anomalies and
trigger maintenance alerts, while Augmented Reality (AR)/Virtual reality (VR) devices can provide
real-time guidance to workers.
• Field Services: Edge computing allows for local processing of data from industrial
equipment, ensuring continuity of business operations even if internet connectivity is lost.
• For example, in a mining operation, edge devices can process data from sensors on mining
equipment to monitor performance and detect potential failures, reducing downtime and improving
safety.
• Real-time and Near Real-time Processing: Edge computing minimizes network and
bandwidth issues by processing data near its source, reducing reliance on the cloud.
• For example, in a smart city, edge devices can process data from traffic sensors to optimize traffic
flow and reduce congestion, without needing to transmit large amounts of data to a centralized data
center.
What are microservices?
• Microservice architecture is a software design pattern that decomposes a large application into various independent services that interact via APIs.
• As a result, each independent service can be developed and maintained by an autonomous team, so scaling becomes easier.
• Microservices can be seen as an extension of SOA (Service-Oriented Architecture).
• Teams can create independent services using different programming languages and platforms, enabling the rapid, frequent, and reliable delivery of large and complex applications.
• In short, microservices are a collection of services that are:
• Loosely coupled
• Highly maintainable and testable
• Organized around business capabilities
• Independently deployable
• Owned by a small team
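The idea of decomposition into independently owned services can be sketched in a few lines, with plain functions standing in for HTTP endpoints and a dictionary standing in for an API gateway; the service names and data are made up.

```python
def inventory_service(item):
    # Inventory owns its private data store; no other service touches it.
    stock = {"widget": 3, "gadget": 0}
    return {"item": item, "in_stock": stock.get(item, 0) > 0}

def pricing_service(item):
    # Pricing is developed, deployed and scaled independently of inventory.
    prices = {"widget": 9.99, "gadget": 4.5}
    return {"item": item, "price": prices.get(item)}

# A stand-in for an API gateway: routes each request to exactly one service.
services = {"inventory": inventory_service, "pricing": pricing_service}

def call(service_name, item):
    return services[service_name](item)

print(call("inventory", "widget"))  # {'item': 'widget', 'in_stock': True}
print(call("pricing", "gadget"))    # {'item': 'gadget', 'price': 4.5}
```

Because each service is reached only through its API, either one could be rewritten or redeployed without touching the other.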

What are the benefits of microservices?
• Agility – With microservices, small DevOps teams can work independently and act within a well-defined context, leading to faster development and increased throughput.
• Increased resilience and fault tolerance – When appropriately constructed, independent services
do not impact one another. Service independence ensures that the failure of a specific service does
not crash the enterprise application.
• Higher-quality end product – Modularisation of an application into discrete components
helps app development teams concentrate on a tiny part at a time. This approach simplifies the
overall coding and testing process and increases software quality.
• Real-time processing – A publish-subscribe framework of microservices enables data center
processing in real-time. As a result, extensible systems can consume and process large amounts of
events or information in real time.
• Data isolation – Unlike a monolithic application, where different parts of the application might
touch the same data, here, only a single microservice is affected. And so, it is much easier to
perform schema updates.
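The publish-subscribe framework mentioned above can be sketched minimally: subscribers register callbacks for a topic, and a publish call fans each event out to all of them. Topic and service names are illustrative.

```python
from collections import defaultdict

subscribers = defaultdict(list)   # topic -> list of subscriber callbacks

def subscribe(topic, callback):
    subscribers[topic].append(callback)

def publish(topic, event):
    # Fan the event out to every service subscribed to the topic.
    for callback in subscribers[topic]:
        callback(event)

received = []
subscribe("orders", lambda e: received.append(("billing", e)))
subscribe("orders", lambda e: received.append(("shipping", e)))
publish("orders", {"order_id": 42})
print(received)
```

The publisher never names its consumers, which is what lets new services be added without changing existing ones.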

Key Challenges of Microservices Architecture:
• Complexity: Managing a large number of microservices can be complex, requiring
careful coordination and communication between teams.
• Distributed Systems Management: Microservices architecture introduces the challenge
of managing a distributed system, including monitoring, logging, and debugging.
• Communication Overhead: Microservices architecture requires communication between
services, which can introduce overhead and latency.
• Data Management: Microservices architecture can lead to data duplication and
inconsistency, as each service may have its own data store.
• Security: Microservices architecture requires careful consideration of security, as each
service may have its own security requirements and vulnerabilities.

Types of Distributed Computing System Models
Physical Model
• A physical model represents the hardware components of a distributed system, including
computers, other computing devices, and their interconnections.
• It is primarily used to design, manage, implement and determine the performance of a
distributed system.
• A physical model majorly consists of the following components:
• Nodes – Nodes are the end devices that have the ability of processing data, executing tasks and
communicating with the other nodes.
• These end devices are generally the computers at the user end or can be servers, workstations etc.
• Nodes provide the distributed system with an interface in the presentation layer that enables the user to interact with other nodes, which can be used for storage and database services, processing, web browsing, etc.
• Each node has an Operating System, execution environment and different middleware requirements that
facilitate communication and other vital tasks.

Physical Model
• Links – Links are the communication channels between different nodes and intermediate
devices. These may be wired or wireless.
• Wired links or physical media are implemented using copper wires, fibre optic cables etc.
• The choice of the medium depends on the requirements. Generally, physical links are
required for high performance and real-time computing. Different connection types that
can be implemented are as follows:
• Point-to-point links – It establishes a connection and allows data transfer between only two
nodes. Eg: direct Ethernet connection between two computers.
• Broadcast links – It enables a single node to transmit data to multiple nodes simultaneously.
E.g.Wi-Fi network, a single access point broadcasts data to multiple devices
• Multi-Access links – Multiple nodes share the same communication channel to transfer data. Protocols are required to avoid interference during transmission. E.g.: in a shared Ethernet network, multiple computers share the same communication channel to transfer data.
Physical Model
• Middleware – This is the software installed and executed on the nodes. By running middleware on each node, the distributed computing system achieves decentralised control and decision-making.
• It handles various tasks like communication with other nodes, resource management, fault tolerance,
synchronisation of different nodes and security to prevent malicious and unauthorised access.

• Network Topology – This defines the arrangement of nodes and links in the distributed
computing system.
• The most common network topologies that are implemented are bus, star, mesh, ring or hybrid. Choice of
topology is done by determining the exact use cases and the requirements.

• Communication Protocols – Communication protocols are the set of rules and procedures for transmitting data over the links.
• Examples of these protocols include TCP, HTTPS etc. These allow the nodes to communicate and interpret the
data.
Architectural Model
• Architectural model in distributed computing system is the overall design and structure
of the system, and how its different components are organised to interact with each other
and provide the desired functionalities.
• Construction of a good architectural model is required for efficient cost usage, and highly
improved scalability of the applications. The key aspects of architectural model are –
Client-Server model – The clients initiate requests for services and servers respond by providing those services.
• It mainly works on the request-response model where the
client sends a request to the server and the server
processes it, and responds to the client accordingly.
• It can be achieved by using TCP/IP, HTTP protocols on
the transport layer.
• This is mainly used in web services, cloud computing,
database management systems etc.
Architectural Model-Peer-to-peer model
• It is a de-centralised approach in which all the distributed
computing nodes, known as peers, are all the same in terms
of computing capabilities and can both request as well as
provide services to other peers.
• It is a highly scalable model because the peers can join and
leave the system dynamically, which makes it an ad-hoc
form of network.
• The resources are distributed, and the peers must look for the required resources as and when needed.
• The communication is directly done amongst the peers
without any intermediaries according to some set rules and
procedures defined in the P2P networks.
• The best example of this type of computing is BitTorrent.

Architectural Model- Layered model
• It involves organizing the system into multiple
layers, where each layer will provision a specific
service.
• Each layer communicates with the adjacent layers using certain well-defined protocols without affecting the integrity of the system.
• A hierarchical structure is obtained where each
layer abstracts the underlying complexity of
lower layers.
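A minimal sketch of the layered model, with each (hypothetical) layer calling only the layer directly below it:

```python
def storage_layer(key):
    # Lowest layer: raw data access.
    data = {"greeting": "hello"}
    return data[key]

def service_layer(key):
    # Middle layer: business logic, talks only to the layer below.
    return storage_layer(key).upper()

def presentation_layer(key):
    # Top layer: formatting for the user, talks only to the layer below.
    return f"[{service_layer(key)}]"

print(presentation_layer("greeting"))  # [HELLO]
```

Each layer could be replaced (e.g. swapping the dictionary for a real database) without the layers above noticing, which is the abstraction the model describes.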

Architectural Model- Micro-services model
• Micro-services model – In this system, a complex application or task is decomposed into multiple independent services, which run on different servers.
• Each service performs only a single function and is focused on a specific business-capability.
• This makes the overall system more maintainable, scalable and easier to understand.
• Services can be independently developed, deployed and scaled without affecting the ongoing
services.

Fundamental Models of Distributed System
• Distributed systems are characterized by multiple independent entities
(nodes or components) that collaborate and communicate to achieve a
common goal. Several fundamental models exist for distributed systems.
Three fundamental models are as follows:
1. Interaction Model: Governs how nodes collaborate and communicate in a
distributed system, defining communication patterns, protocols, and mechanisms for
achieving a common goal.
2. Failure Model: Describes how the system handles and recovers from different types
of failures, addressing node crashes, and unexpected events to ensure system
durability.
3. Security Model: Focuses on safeguarding data and resources in a distributed
system, implementing measures such as access controls, authentication, and
encryption to ensure confidentiality, integrity, and availability.

Interaction Model
• Distributed computing systems are full of many processes interacting with each other in highly complex
ways.
• Interaction model provides a framework to understand the mechanisms and patterns that are used for
communication and coordination among various processes by passing messages.
• Message Passing – It deals with passing messages that may contain data, instructions, a service request, or process synchronisation signals between different computing nodes. It may be synchronous or asynchronous depending on the types of tasks and processes.
• Publish/Subscribe Systems – Also known as pub/sub systems. Here a publishing process publishes a message on a topic, and the processes subscribed to that topic receive it and act on it. This pattern is especially important in event-driven architectures.
• The following characteristics of communication channels impact the performance of the system:
• Latency - the time between the sending of a message at the source and the receipt of the message at the destination.
• Bandwidth - the total amount of information that can be transmitted over a given time period (e.g., Mbits/second).
• Jitter - "the variation in the time taken to deliver a series of messages."
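The three channel characteristics above can be made concrete with a toy calculation. The sketch below is illustrative Python with made-up timestamps; in particular, computing jitter as the max-minus-min spread of delays is a simplifying assumption (real definitions vary).

```python
from statistics import mean

def channel_metrics(send_times, recv_times, total_bits):
    """Derive the three quantities from per-message timestamps (seconds)."""
    delays = [r - s for s, r in zip(send_times, recv_times)]
    latency = mean(delays)                  # average send-to-receive delay
    jitter = max(delays) - min(delays)      # spread in delivery times
    duration = max(recv_times) - min(send_times)
    bandwidth = total_bits / duration       # bits transferred per second
    return latency, jitter, bandwidth

# Four 1 kB messages over a hypothetical channel:
lat, jit, bw = channel_metrics(
    send_times=[0.0, 1.0, 2.0, 3.0],
    recv_times=[0.10, 1.12, 2.08, 3.10],
    total_bits=4 * 8000)
print(lat, jit, bw)
```

Here the average latency is 0.1 s, but the delays range from 0.08 s to 0.12 s, which is exactly what the jitter figure captures.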
45
Failure Model
• This model addresses the faults and failures that occur in the distributed computing system.
• It provides a framework to identify and rectify the faults that occur or may occur in the system.
• Fault tolerance mechanisms are implemented so as to handle failures by replication and error
detection and recovery methods. Different failures that may occur are:
• Failstop: A process halts and remains halted. Other processes can detect that the process has failed.
• Crash failures – A process or node unexpectedly stops functioning.
• Omission failures – It involves a loss of message, resulting in absence of required communication.
• Timing failures – The process deviates from its expected time quantum and may lead to delays or
unsynchronised response times.
• Arbitrary failures – The process may send malicious or unexpected messages that conflict with the
set protocols.

46
Security Model
• Distributed computing systems may suffer malicious attacks, unauthorised access and data
breaches.
• Security model provides a framework for understanding the security requirements, threats,
vulnerabilities, and mechanisms to safeguard the system and its resources.
• Various aspects that are vital in the security model are –
• Authentication – It verifies the identity of the users accessing the system. It ensures that only
the authorised and trusted entities get access. It involves –
• Password-based authentication – Users provide a unique password to prove their identity.
• Public-key cryptography – Entities possess a private key and a corresponding public key, allowing
verification of their authenticity.
• Multi-factor authentication – Multiple factors, such as passwords, biometrics, or security tokens,
are used to validate identity.
47
Security Model
• There are several potential threats a system designer needs to be aware of:
• Threats to processes - An attacker may send a request or response using a false identity.
• Threats to communication channels - An attacker may eavesdrop (listen to messages) or
inject new messages into a communication channel. An attacker can also save messages and
replay them later.
• Denial of service - An attacker may overload a server by making excessive requests.
• Cryptography and authentication are often used to provide security.
Communication entities can use a shared secret (key) to ensure that they are
communicating with one another and to encrypt their messages so that they cannot
be read by attackers.

48
Security Model
• Encryption – It is the process of transforming data into
a format that is unreadable without a decryption key. It
protects sensitive information from unauthorized access
or disclosure.
• Data Integrity – Data integrity mechanisms protect
against unauthorised modifications or tampering of data.
They ensure that data remains unchanged during
storage, transmission, or processing. Data integrity
mechanisms include:
• Hash functions – Generating a hash value or checksum
from data to verify its integrity.
• Digital signatures – Using cryptographic techniques to
sign data and verify its authenticity and integrity.
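Both integrity mechanisms can be demonstrated with the Python standard library. The example payload and key below are made up, and HMAC is used only as a stdlib-available stand-in for a signature's integrity guarantee: a real digital signature uses an asymmetric key pair, which needs a third-party library.

```python
import hashlib
import hmac

message = b"transfer 100 to account 42"   # illustrative payload

# Hash function: any change to the data changes the digest.
digest = hashlib.sha256(message).hexdigest()
tampered = hashlib.sha256(message + b"0").hexdigest()
assert digest != tampered                  # tampering is detectable

# Keyed integrity tag (HMAC): only holders of the shared key can
# produce a tag that verifies. Real digital signatures use private/
# public key pairs instead; HMAC approximates the integrity property.
key = b"shared-secret"
tag = hmac.new(key, message, hashlib.sha256).digest()
check = hmac.new(key, message, hashlib.sha256).digest()
assert hmac.compare_digest(tag, check)     # verification succeeds
```

Note the constant-time `compare_digest` is used for verification rather than `==`, which avoids leaking information through timing.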

49
Distributed Object Based Communications:
Distributed Objects
• A distributed object is an object that can be accessed remotely.
• This means that a distributed object can be used like a regular object, but from anywhere on the
network.
• Distributed objects might be used :
• To share information across applications or users.
• To synchronize activity across several machines.
• To increase performance associated with a particular task.
• Work together by sharing data and invoking methods.
• This often involves location transparency, where remote objects appear the same as local objects.
• The main method of distributed object communication is with remote method invocation, generally
by message-passing: one object sends a message to another object in a remote machine or process to
perform some task. The results are sent back to the calling object.
50
HOW DISTRIBUTED OBJECT COMMUNICATE ?
• A widely used approach to implementing the communication channel relies on stubs and skeletons.
• They are generated objects whose structure and behavior depend on the chosen communication protocol, but in general they provide additional functionality that ensures reliable communication over the network.
• When a caller wants to perform a remote call on the called object, it delegates the request to its stub, which initiates communication with the remote skeleton.
• Consequently, the stub passes caller arguments over the network to the
server skeleton.
• The skeleton then passes received data to the called object, waits for a
response and returns the result to the client stub. Note that there is no
direct communication between the caller and the called object.

51
HOW DISTRIBUTED OBJECT COMMUNICATE ?
• In more details, the communication consists of several steps:
1. caller calls a local procedure implemented by the stub
2. stub marshalls call type and the input arguments into a request message
3. client stub sends the message over the network to the server and blocks the current execution thread
4. server skeleton receives the request message from the network
5. skeleton unpacks call type from the request message and looks up the procedure on the called object
6. skeleton unmarshalls procedure arguments
7. skeleton executes the procedure on the called object
8. called object performs a computation and returns the result
9. skeleton packs the output arguments into a response message
10. skeleton sends the message over the network back to the client
11. client stub receives the response message from the network
12. stub unpacks output arguments from the message
13. stub passes output arguments to the caller, releases execution thread and caller then continues in execution
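The thirteen steps can be compressed into a small sketch. This is illustrative Python, not a real distributed-object framework: JSON stands in for the wire format, and a direct method call stands in for the network hop between stub and skeleton.

```python
import json

class Skeleton:
    """Server-side half: unmarshals the request, invokes the called object."""
    def __init__(self, target):
        self.target = target

    def handle(self, request_bytes):
        call = json.loads(request_bytes)                   # steps 4-6
        method = getattr(self.target, call["method"])      # look up procedure
        result = method(*call["args"])                     # steps 7-8
        return json.dumps({"result": result}).encode()     # steps 9-10

class Stub:
    """Client-side half: marshals the call, blocks for the response."""
    def __init__(self, skeleton):
        self.skeleton = skeleton      # stands in for the network link

    def call(self, method, *args):
        request = json.dumps({"method": method, "args": args}).encode()  # 1-3
        response = self.skeleton.handle(request)           # the "network" hop
        return json.loads(response)["result"]              # steps 11-13

class Calculator:                     # the remote ("called") object
    def add(self, a, b):
        return a + b

stub = Stub(Skeleton(Calculator()))
print(stub.call("add", 2, 3))         # the caller sees an ordinary local call
```

As in the step list, the caller and the called object never talk directly: every argument and result passes through the stub/skeleton pair.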
52
Remote Procedure Call (RPC)
• Remote Procedure Call (RPC) is a
communication protocol that allows a client to
invoke procedures on a remote server located in
another computer on a network without having to
understand the network details.
• A procedure call is also called a function call or subroutine call.
• RPC uses the client server model.
• The client process makes a procedure call using RPC
and then the message is passed to the required server
process using communication protocols.
• These message passing protocols are abstracted and
the result once obtained from the server process, is
sent back to the client process to continue execution.

53
How RPC Works?
• RPC architecture has mainly five components of the program:
1. Client
2. Client Stub (stub: piece of code which convert the parameters)
3. RPC Runtime
4. Server Stub
5. Server
• Client: The client is the program that initiates the remote procedure call (RPC) by sending
a request to the server.
• Client Stub: The client stub is a local representation of the remote procedure that the
client wants to call.
• It marshals (packs) the parameters of the procedure call into a format that can be transmitted over the network, asks the RPC runtime to send the request to the server, and unmarshals (unpacks) the result when it is received.
54
How RPC Works?
• RPC Runtime: It handles the details of sending and receiving messages over the
network, managing connections, and handling errors.
• Server Stub: The server stub is a local representation of the remote procedure that
the server wants to expose.
• It unmarshals (unpacks) the parameters of the procedure call, calls the actual procedure on the server, marshals (packs) the result into a format that can be transmitted over the network, and asks the RPC runtime to send the result back to the client.
• Server: The server is the program that receives the RPC request from the client,
calls the actual procedure, and sends the result back to the client.
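Python's standard library ships a simple RPC implementation (XML-RPC) in which the runtime plays the stub, skeleton, and transport roles automatically. A minimal sketch, binding to an ephemeral port so the example does not depend on a particular port being free:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# Server side: the XML-RPC runtime handles transport and marshalling.
# Port 0 asks the OS to pick any free port for this sketch.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(lambda a, b: a + b, "add")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy object makes the remote call look local.
client = ServerProxy(f"http://127.0.0.1:{port}")
result = client.add(2, 3)
print(result)                 # computed on the "remote" server
server.shutdown()
```

The client never sees sockets or XML: `client.add(2, 3)` reads exactly like a local procedure call, which is the transparency property described above.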

55
Characteristics of RPC
• Transparency: RPC provides the illusion that the client is calling a local procedure, even though
the procedure is actually executed on a remote server.
• Location Transparency: RPC hides the details of the location of the server from the client.
• Marshalling and Unmarshalling: RPC automatically converts the parameters of a procedure call
into a format that can be transmitted over the network and converts the result back into a format
that the client can understand.
• Asynchronous and Synchronous Calls: RPC supports both synchronous and asynchronous
procedure calls.
• Error Handling: RPC provides mechanisms for handling errors that occur during the procedure
call.
• Scalability: RPC is designed to be scalable, allowing multiple clients to call the same procedure on
a server simultaneously.
• Security: RPC provides mechanisms for securing the communication between the client and server.

56
Remote Method Invocation (RMI)
• The RMI (Remote Method Invocation) is an API (Application Program Interface) that provides a
mechanism to create distributed application in java.
• The RMI allows an object to invoke methods on an object running in another JVM (Java Virtual
Machine).
• The RMI provides remote communication between the applications using two objects: stub and
skeleton.
• Stub: The stub is an object, acts as a gateway for the client side. All the outgoing requests are
routed through it. It resides at the client side and represents the remote object. When the caller
invokes method on the stub object, it does the following tasks:
1. It initiates a connection with remote Virtual Machine (JVM),
2. It writes and transmits (marshals) the parameters to the remote Virtual Machine (JVM),
3. It waits for the result
4. It reads (unmarshals) the return value or exception, and
5. It finally, returns the value to the caller.
57
Remote Method Invocation (RMI)
• Skeleton: The skeleton is an object, acts as a gateway for the server side object.
All the incoming requests are routed through it. When the skeleton receives the
incoming request, it does the following tasks:
1. It reads the parameter for the remote method
2. It invokes the method on the actual remote object, and
3. It writes and transmits (marshals) the result to the caller.

58
Architecture of an RMI Application
• In an RMI application, we write two
programs, a server program (resides on
the server) and a client program
(resides on the client).
• Inside the server program, a remote object
is created and reference of that object is
made available for the client (using the
registry).
• The client program requests the remote
objects on the server and tries to invoke its
methods.
• The following diagram shows the
architecture of an RMI application.
59
Remote Method Invocation (RMI)
• Transport Layer − This layer connects the client and the server. It
manages the existing connection and also sets up new connections.
• Stub − A stub is a representation of the remote object at client. It
resides in the client system; it acts as a gateway for the client program.
• Skeleton − This is the object which resides on the server side. stub
communicates with this skeleton to pass request to the remote object.
• RRL(Remote Reference Layer) − It is the layer which manages the
references made by the client to the remote object.

60
Working of an RMI Application
• The following points summarize how an RMI application works −
• When the client makes a call to the remote object, it is received by the stub
which eventually passes this request to the RRL.
• When the client-side RRL receives the request, it invokes a method called
invoke() of the object remote Ref. It passes the request to the RRL on the
server side.
• The RRL on the server side passes the request to the Skeleton which finally
invokes the required object on the server.
• The result is passed all the way back to the client.

61
Difference between RPC and RMI:
S.N.  RPC                                                RMI
1.    RPC is a library- and OS-dependent platform.       RMI is a Java platform.
2.    RPC supports procedural programming.               RMI supports object-oriented programming.
3.    RPC is less efficient in comparison to RMI.        RMI is more efficient than RPC.
4.    The parameters passed in RPC are ordinary data.    In RMI, objects are passed as parameters.
5.    RPC is the older mechanism.                        RMI is the successor of RPC.
6.    RPC does not provide any security.                 RMI provides client-level security.
7.    Its development cost is huge.                      Its development cost is fair or reasonable.
62
Common Object Request Broker Architecture (CORBA)
• Common Object Request Broker Architecture (CORBA) is a standard
developed by the Object Management Group (OMG) for creating distributed
object-oriented systems.
• It defines a set of specifications for building distributed systems that can
communicate with each other regardless of the programming language or
platform they are written in.
• CORBA provides a framework for creating distributed applications by defining
a set of interfaces and protocols for communication between objects.
• It allows objects to be created and accessed remotely, enabling developers to
build distributed systems that are more flexible, scalable, and maintainable.

63
Key Features of CORBA:
• Object Request Broker (ORB): The ORB is the central component of CORBA
that manages the communication between objects.
• It acts as a middleware layer that handles the marshalling and unmarshalling of data, as well as
the routing of requests between objects.
• Interface Definition Language (IDL): CORBA uses an Interface Definition
Language (IDL) to define the interfaces between objects.
• This allows objects written in different programming languages to communicate with each
other by defining a common set of interfaces.
• Language Independence: CORBA is designed to be language-independent,
meaning that objects written in different programming languages can communicate
with each other using the same set of interfaces.

64
CORBA Architecture
• The CORBA architecture, known as the Object Management Architecture (OMA), is shown in the figure below.
• ORB (Object Request Broker): The
ORB is the middleware component
that enables communication between
distributed objects. It handles the
marshalling and unmarshalling of
method calls and data between clients
and servers.

65
CORBA Architecture
• Interface Definition Language (IDL): CORBA uses IDL to define the interfaces
of distributed objects.
• IDL is a language-independent way to describe the methods and data types that objects
support.

• Object Adapter: The Object Adapter is responsible for managing the lifecycle of
objects and providing access to them.
• It is responsible for creating and destroying objects, as well as handling requests from clients.
• Interface Repository: The Interface Repository is a central repository that stores
the IDL definitions of all the objects in a CORBA system.
• It provides a standardized way of accessing and sharing IDL definitions across different ORBs.
66
CORBA Architecture
• Implementation Repository: Contains all the information regarding object implementations.
• Provides a persistent record of how to activate and invoke operations on object implementations.
• CORBA gives vendors a free hand in handling implementation.
• Dynamic Invocation Interface: Generic interface for making remote invocations.
• Uses the interface repository at run time to discover interfaces.
• No need for pre-compiled stubs.
• Dynamic Skeleton Interface: Allows the ORB and OA to deliver requests to an object without the need for pre-compiled skeletons.
• Implemented via a DIR (Dynamic Implementation Routine).
• The ORB invokes the DIR for every Dynamic Skeleton request it makes.
67
Synchronization in Distributed Systems
• In the world of distributed computing, where multiple systems collaborate to
accomplish tasks ensuring that all the clocks are synchronized plays a crucial role.
• Clock synchronization involves aligning the clocks of computers or nodes which enables
efficient data transfer, smooth communication, and coordinated task execution.
• Clock synchronization in distributed systems aims to establish a reference for time across nodes.
• Imagine a scenario where three distinct systems are part of a distributed environment.
• In order for data exchange and coordinated operations to take place it is essential that these
systems have a shared understanding of time.
• Achieving clock synchronization ensures that data flows seamlessly between them tasks are
executed coherently and communication happens without any ambiguity.
68
Clock Synchronization Challenges in Distributed Systems
• Clock synchronization in distributed systems introduces complexities
compared to centralized ones due to the use of distributed algorithms. Some
notable challenges include:
• Clock Drift: Clocks in distributed systems can drift apart due to differences in hardware,
temperature, and other environmental factors.
• This drift can lead to inconsistencies in timestamps and cause issues in distributed algorithms
that rely on synchronized clocks.
• Network Latency: Network latency can cause delays in message transmission between
nodes, leading to inconsistency in the recognized time at different nodes. This can make it
challenging to achieve precise synchronization.

69
Clock Synchronization Challenges in Distributed Systems
• Dynamic Environments: Distributed systems often operate in dynamic
environments where nodes can join or leave the system at any time.
• Maintaining clock synchronization in such environments is challenging due to the constantly
changing network topology.
• Heterogeneous Environments: Distributed systems often run on heterogeneous hardware and software platforms, which can introduce additional challenges for clock synchronization.
70
Time in DS
• Each machine in a distributed system has its own clock providing the physical
time.
• The distributed system do not have global physical time.
• Time synchronization is essential to know at what time of day a particular event
occurred at a particular computer within a system.

71
Physical Clock:
• Each computer contains an electronic device that counts oscillations in a crystal at a definite frequency and stores the count divided by the frequency in a register to provide the time.
• Such a device is called a physical clock, and the time shown is the physical time.
• Since, different computers in a distributed system have different crystals that run at different
rates, the physical clock gradually get out of synchronization and provide different time values.
• Due to this, it is very difficult to handle and maintain time critical real time systems.
• Consistency of distributed data during any modification is based on time factor.
• The algorithms for synchronization of physical clocks are as follows:
• Cristian's method
• Berkeley's method
• Network Time Protocol (NTP)
72
Cristian's Algorithm
• It is a physical clock synchronization algorithm
used in a distributed system.
• Basic Idea: If a client process wants to correct its time as per the server's time, it makes a request to the time server and corrects its clock accordingly.
• The time interval between the beginning of a
Request and the conclusion of the corresponding
Response is referred to as Round Trip Time in this
context.
• An example mimicking the operation of Cristian's
algorithm is provided below:

73
Cristian's Algorithm
Algorithm:
• The process on the client machine sends the clock server a request at time T0 for the clock time
(time at the server).
• In response to the client process's request, the clock server listens and responds with clock server
time.
• The client process retrieves the response from the clock server at time T1 and uses the formula
below to determine the synchronized client clock time.
TCLIENT = TSERVER + (T1 - T0)/2.
Where,
• TCLIENT denotes the synchronised clock time,
• TSERVER denotes the clock time returned by the server,
• T0 denotes the time at which the client process sent the request,
• and T1 denotes the time at which the client process received the response
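The formula above can be sketched directly; the timestamps below are hypothetical values in seconds:

```python
def cristian_adjust(t0, t1, server_time):
    """Client clock estimate at the moment the reply arrives.

    t0: client clock when the request was sent
    t1: client clock when the reply was received
    server_time: clock value carried in the server's reply
    """
    rtt = t1 - t0                     # round trip time
    return server_time + rtt / 2      # assume symmetric network delay

# Hypothetical run: request out at 10.000 s, reply in at 10.010 s,
# server reported 10.105 s; the client sets its clock to the result.
print(cristian_adjust(10.000, 10.010, 10.105))
```

Adding half the round trip time compensates for the reply's travel back to the client, under the assumption that the outbound and return delays are equal.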
74
Logical Clock:
• Logical clock is a virtual clock that records the relative ordering of events in a
process.
• It is realized whenever relative ordering of events is more important than the
physical time.
• The value of logical clock is used to assign time stamps to the events.
• Lamport clocks and vector clocks are examples of logical clocks used in
distributed systems.

75
Lamport’s Logical Clock
• Lamport’s Logical Clock was created by Leslie Lamport. It is a procedure to determine the order
of events occurring.
• Lamport invented a simple mechanism by which the happened-before ordering can be captured
numerically.
• A Lamport logical clock is an incrementing software counter maintained in each process.
• It provides a basis for the more advanced Vector Clock Algorithm. Due to the absence of
a Global Clock in a Distributed System Lamport Logical Clock is needed.
• Algorithm follows:
• A process increments its counter before each event in that process;
• When a process sends a message, it includes its counter value with the message;
• On receiving a message, the receiver process sets its counter to be greater than the maximum of its
own value and the received value before it considers the message received.
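The three rules above can be sketched as a small counter class; the two processes p and q and the event sequence are illustrative:

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):                # rule 1: increment before each local event
        self.time += 1
        return self.time

    def send(self):                # rule 2: the counter travels with the message
        return self.tick()

    def receive(self, msg_time):   # rule 3: jump past both clocks
        self.time = max(self.time, msg_time) + 1
        return self.time

p, q = LamportClock(), LamportClock()
t_send = p.send()           # p's clock: 1
q.tick(); q.tick()          # q's clock: 2 (two internal events)
t_recv = q.receive(t_send)  # q's clock: max(2, 1) + 1 = 3
assert t_send < t_recv      # send is ordered before the matching receive
```

The receive rule is what resynchronizes q with the sender: whatever the two counters were, the receive event always gets a timestamp larger than the send event's.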
• Conceptually, this logical clock can be thought of as a clock that only has meaning in relation to messages moving between processes. When a process receives a message, it resynchronizes its logical clock with that sender.
76
Lamport’s Algorithm
• The happened-before relation is a partial ordering of events in distributed systems such that
1. If A and B are events in the same process, and A was executed before B, then A → B.
2. If A is the event of sending a message by one process and B is the event of receiving that message by another process, then A → B.
3. If A → B and B → C, then A → C.
• If two events A and B are not related by the → relation, then they are executed concurrently (no causal relationship).

77
Example: Lamport’s Algorithm
• Three processes, each with its own clock. The clocks run at different rates.
• Lamport’s Algorithm corrects the clock.
• Note: ts(A) < ts(B) does not imply A happened before B.
78
Lamport's logical clock in distributed systems
• If two entities do not exchange any messages, then they probably do not need to share a common
clock; events occurring on those entities are termed as concurrent events.
• Among the processes on the same local machine we can order the events based on the local clock
of the system.
• When two entities communicate by message passing, then the send event is said to 'happen before'
the receive event, and the logical order can be established among the events.
• A distributed system is said to have partial order if we can have a partial order relationship among
the events in the system. If 'totality', i.e., causal relationship among all events in the system can be
established, then the system is said to have total order.
Problems:
1. Lamport's logical clock imposes only a partial order on the set of events; pairs of distinct events of different processes can have identical timestamps.
2. Total ordering can be enforced by a global logical timestamp, e.g., by breaking ties with process identifiers.

79
Vector Clock: Algorithm
• Vector Clock is an algorithm that generates partial ordering of events and detects causality
violations in a distributed system.
• These clocks expand on scalar time to facilitate a causally consistent view of the distributed system; they detect whether one event has causally influenced another event in the distributed system.
• It essentially captures all the causal relationships. This algorithm labels every process with a vector (a list of integers), with one integer for the local clock of every process within the system. So for N given processes, each vector/array has size N.
1. Initially all clocks are zero.
2. Each time a process experiences an internal event, it increments its own logical clock in the vector by
one.
3. Each time a process prepares to send a message, it increments its own logical clock in the vector by one
and then sends its entire vector along with the message.
4. Each time a process receives a message, it increments its own logical clock in the vector by one and
updates each element in its vector by taking the maximum of the value in its own vector clock and the
value in the vector in the received message (for every element).

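The four rules above can be sketched as follows, assuming the number of processes is fixed and known in advance; the two-process event sequence at the bottom is illustrative:

```python
class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n                 # one entry per process

    def internal(self):                  # rule 2: tick own entry
        self.v[self.pid] += 1

    def send(self):                      # rule 3: tick, then ship the vector
        self.v[self.pid] += 1
        return list(self.v)

    def receive(self, incoming):         # rule 4: tick, then element-wise max
        self.v[self.pid] += 1
        self.v = [max(a, b) for a, b in zip(self.v, incoming)]

a, b = VectorClock(0, 2), VectorClock(1, 2)
msg = a.send()       # a's vector: [1, 0]
b.internal()         # b's vector: [0, 1]
b.receive(msg)       # b's vector: [1, 2]
print(b.v)
```

After the receive, b's vector dominates a's send vector element-wise, which is how vector clocks record that the send causally precedes the receive.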
80
Vector Clock: Example

• The above example depicts the vector clocks mechanism in which the vector clocks are updated after execution of
internal events, the arrows indicate how the values of vectors are sent in between the processes (A, B, C).
• To sum up, Vector clocks algorithms are used in distributed systems to provide a causally consistent ordering of
events but the entire Vector is sent to each process for every message sent, in order to keep the vector clocks in sync.
81
Distributed Mutual Exclusion
• Mutual exclusion is a concurrency control property which is introduced to prevent race
conditions.
• It is the requirement that a process cannot enter its critical section while another concurrent process is currently present or executing in its critical section,
• i.e., only one process is allowed to execute the critical section at any given instant of time.
• Mutual exclusion in single computer system: In single computer system, memory and other
resources are shared between different processes.
• The status of shared resources and the status of users is easily available in the shared memory so with the help
of shared variable (For example: Semaphores) mutual exclusion problem can be easily solved.
• Mutual exclusion in distributed systems: In distributed systems, we have neither shared memory nor a common physical clock, and therefore we cannot solve the mutual exclusion problem using shared variables.
• To eliminate the mutual exclusion problem in distributed systems, an approach based on message passing is used.
• A site in a distributed system does not have complete information about the state of the system due to the lack of shared memory and a common physical clock.
82
Distributed Mutual Exclusion
Requirements of Mutual exclusion Algorithm:
• No Deadlock: Two or more sites should not endlessly wait for any message that will never arrive.
• No Starvation: Every site that wants to execute the critical section should get an opportunity to execute it in finite time. No site should wait indefinitely to execute the critical section while other sites repeatedly execute the critical section.
• Fairness: Each site should get a fair chance to execute critical section. Any request to execute
critical section must be executed in the order they are made i.e Critical section execution
requests should be executed in the order of their arrival in the system.
• Fault Tolerance: In case of failure, it should be able to recognize it by itself in order to
continue functioning without any disruption.

83
Distributed Mutual Exclusion
• Some points are need to be taken in consideration to understand mutual exclusion
fully :
1. It is an issue/problem which frequently arises when concurrent access to shared resources by
several sites is involved. For example, directory management where updates and reads to a
directory must be done atomically to ensure correctness.
2. It is a fundamental issue in the design of distributed systems.
3. Mutual exclusion techniques for a single computer are not applicable here, since a distributed system involves resource distribution, transmission delays, and lack of global information.
• Solution to distributed mutual exclusion: As we know shared variables or a
local kernel can not be used to implement mutual exclusion in distributed systems.
• Message passing is a way to implement mutual exclusion. Below are the three approaches
based on message passing to implement mutual exclusion in distributed systems:
84
1. Non-token based approach:
• A site communicates with other sites in order to determine which site should execute the critical section next.
• This requires exchange of two or more successive round of messages among sites.
• This approach uses timestamps instead of sequence numbers to order requests for the critical section.
• Whenever a site makes a request for the critical section, it gets a timestamp. The timestamp is also used to resolve any conflict between critical section requests.
• All algorithms that follow the non-token-based approach maintain a logical clock. Logical clocks are updated according to Lamport's scheme.
• Example : Ricart–Agrawala Algorithm

85
Ricart–Agrawala Algorithm in Mutual Exclusion in Distributed System
• The Ricart–Agrawala algorithm is an algorithm for mutual exclusion in a distributed system proposed by Glenn Ricart and Ashok Agrawala.
• In this algorithm:
• Two types of messages (REQUEST and REPLY) are used, and communication channels are assumed to follow FIFO order.
• A site sends a REQUEST message to all other sites to get their permission to enter the critical section.
• A site sends a REPLY message to another site to give its permission to enter the critical section.
• A timestamp is given to each critical section request using Lamport’s logical clock.
• Timestamp is used to determine priority of critical section requests. Smaller timestamp gets high
priority over larger timestamp. The execution of critical section request is always in the order of
their timestamp.

86
Ricart–Agrawala Algorithm
• To enter Critical section:
• When a site Si wants to enter the critical section, it sends a timestamped REQUEST message to all other sites.
• When a site Sj receives a REQUEST message from site Si, It sends a REPLY message to site
Si if and only if
• Site Sj is neither requesting nor currently executing the critical section.
• In case Site Sj is requesting, the timestamp of Site Si‘s request is smaller than its own request.
• To execute the critical section:
• Site Si enters the critical section if it has received the REPLY message from all other sites.
• To release the critical section:
• Upon exiting site Si sends REPLY message to all the deferred requests.
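A simplified single-process simulation of these rules is sketched below. Message delivery is modeled as direct method calls, and the Lamport-clock update on message receipt is omitted for brevity; ties between equal timestamps are broken by site id, as the algorithm requires.

```python
class Site:
    def __init__(self, sid):
        self.sid = sid
        self.clock = 0           # Lamport clock (only ticked on request here)
        self.request_ts = None   # (timestamp, id) of our pending CS request
        self.deferred = []       # sites whose replies we withhold until exit
        self.replies = set()     # ids of sites that granted permission

    def request_cs(self, others):
        self.clock += 1
        self.request_ts = (self.clock, self.sid)
        self.replies.clear()
        for s in others:
            s.on_request(self, self.request_ts)

    def on_request(self, sender, ts):
        # Reply unless our own pending request has priority:
        # the smaller (timestamp, id) pair wins.
        if self.request_ts is not None and self.request_ts < ts:
            self.deferred.append(sender)
        else:
            sender.on_reply(self)

    def on_reply(self, sender):
        self.replies.add(sender.sid)

    def release_cs(self):
        self.request_ts = None
        for s in self.deferred:
            s.on_reply(self)
        self.deferred.clear()

s1, s2, s3 = Site(1), Site(2), Site(3)
s1.request_cs([s2, s3])          # s2, s3 are idle, so both reply at once
s2.request_cs([s1, s3])          # (1, 1) beats (1, 2): s1 defers its reply
assert s1.replies == {2, 3}      # s1 has all replies: it enters the CS
assert s2.replies == {3}         # s2 still waits for s1
s1.release_cs()                  # on exit, s1 sends the deferred reply
assert s2.replies == {1, 3}      # now s2 may enter the CS
```

The deferred-reply list is the heart of the algorithm: a site in (or ahead of us for) the critical section simply delays its REPLY, and releasing the section flushes those replies.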
87
Ricart–Agrawala Algorithm-Example

88
2. Token Based Algorithm:
• A unique token is shared among all the sites.
• If a site possesses the unique token, it is allowed to enter its critical section
• This approach uses sequence number to order requests for the critical section.
• Each requests for critical section contains a sequence number. This sequence
number is used to distinguish old and current requests.
• This approach ensures mutual exclusion, as the token is unique
• Example : Suzuki–Kasami Algorithm

89
Suzuki–Kasami Algorithm for Mutual Exclusion in Distributed System
• The Suzuki–Kasami algorithm is a token-based algorithm for achieving mutual exclusion in distributed systems.
• This is modification of Ricart–Agrawala algorithm, a permission based (Non-
token based) algorithm which uses REQUEST and REPLY messages to ensure
mutual exclusion.
• In token-based algorithms, A site is allowed to enter its critical section if it
possesses the unique token.
• Non-token-based algorithms use timestamps to order requests for the critical section, whereas a sequence number is used in token-based algorithms.
• Each requests for critical section contains a sequence number. This sequence
number is used to distinguish old and current requests.
90
Suzuki–Kasami Algorithm
• To enter Critical section:
• When a site Si wants to enter the critical section and does not have the token, it increments its request sequence number RNi[i] and sends a REQUEST(i, sn) message to all other sites in order to request the token.
• Here sn is the updated value of RNi[i].
• When a site Sj receives REQUEST(i, sn) from site Si, it sets RNj[i] to the maximum of RNj[i] and sn, i.e. RNj[i] = max(RNj[i], sn).
• After updating RNj[i], site Sj sends the token to site Si if it holds the token and RNj[i] = LN[i] + 1.
• To execute the critical section:
• Site Si executes the critical section if it has acquired the token.

91
Suzuki–Kasami Algorithm
• To release the critical section:
After finishing execution, site Si exits the critical section and does the following:
• sets LN[i] = RNi[i] to indicate that its critical-section request RNi[i] has been executed;
• for every site Sj whose ID is not present in the token queue Q, appends Sj's ID to Q if RNi[j] = LN[j] + 1, indicating that site Sj has an outstanding request;
• after these updates, if the queue Q is non-empty, pops a site ID from Q and sends the token to the site indicated by the popped ID;
• if the queue Q is empty, keeps the token.
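The release step above can be sketched as follows, with RN, LN, and the queue Q as plain Python lists (the array names follow the slides; everything else is illustrative):

```python
# Illustrative sketch of the Suzuki-Kasami release step at site i.
# RN: highest request number seen from each site (local array at site i);
# the token carries LN (last executed request number per site) and a FIFO Q.

def release(i, RN, token):
    LN, Q = token
    LN[i] = RN[i]                      # mark own request as executed
    for j in range(len(RN)):           # enqueue sites with outstanding requests
        if j not in Q and RN[j] == LN[j] + 1:
            Q.append(j)
    if Q:
        return Q.pop(0)                # send the token to the next waiting site
    return None                        # queue empty: keep the token

RN = [1, 2, 1]                         # request numbers known at site 0
token = ([0, 1, 1], [])                # (LN, Q)
print(release(0, RN, token))           # 1: site 1 has RN = LN + 1, gets token
```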

92
Suzuki–Kasami Algorithm- Example

93
Suzuki–Kasami Algorithm- Example

94
Suzuki–Kasami Algorithm- Example

95
3. Quorum based approach:
• Instead of requesting permission to execute the critical section from all other sites, each site requests permission only from a subset of sites, called a quorum.
• Any two quorums contain at least one common site.
• This common site is responsible for ensuring mutual exclusion.
• Example : Maekawa’s Algorithm, a quorum-based approach to ensuring mutual exclusion in distributed systems.
• In permission-based algorithms such as Lamport’s algorithm and the Ricart–Agrawala algorithm, a site requests permission from every other site,
• but in the quorum-based approach, a site does not request permission from every other site, only from the subset of sites that forms its quorum.
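The intersection property can be checked directly. The sketch below uses a grid construction, one common way of building Maekawa-style quorums (not mandated by the slides): for 9 sites arranged in a 3x3 grid, a site's quorum is its row plus its column.

```python
# Illustrative sketch: grid quorums for N = 9 sites in a 3x3 grid.
# A site's quorum = all sites in its row plus all sites in its column.
from itertools import combinations

def grid_quorum(site, k=3):
    row, col = divmod(site, k)
    return {row * k + c for c in range(k)} | {r * k + col for r in range(k)}

quorums = [grid_quorum(s) for s in range(9)]
# Every pair of quorums intersects, so a common site can always arbitrate.
print(all(q1 & q2 for q1, q2 in combinations(quorums, 2)))  # True
```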

96
Maekawa’s Algorithm for Mutual Exclusion in Distributed System

• In this algorithm:
• Three types of messages (REQUEST, REPLY and RELEASE) are used.
• A site sends a REQUEST message to all other sites in its request set (quorum) to get their permission to enter the critical section.
• A site sends a REPLY message to a requesting site to give its permission to enter the critical section.
• A site sends a RELEASE message to all other sites in its request set (quorum) upon exiting the critical section.
97
Maekawa’s Algorithm
• To enter Critical section:
• When a site Si wants to enter the critical section, it sends a REQUEST(i) message to all other sites in its request set Ri.
• When a site Sj receives REQUEST(i) from site Si, it returns a REPLY message to site Si if it has not sent a REPLY message to any site since it received the last RELEASE message. Otherwise, it queues up the request.
• To execute the critical section:
• A site Si can enter the critical section once it has received a REPLY message from every site in its request set Ri.
98
Maekawa’s Algorithm
• To release the critical section:
• When a site Si exits the critical section, it sends a RELEASE(i) message to all other sites in its request set Ri.
• When a site Sj receives the RELEASE(i) message from site Si, it sends a REPLY message to the next site waiting in its queue and deletes that entry from the queue.
• If the queue is empty, site Sj updates its status to show that it has not sent any REPLY message since the receipt of the last RELEASE message.
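The behaviour of a single arbiter site above, one outstanding REPLY between RELEASE messages and all other requests queued, can be sketched as follows; the `Arbiter` class and its names are illustrative:

```python
# Illustrative sketch of one Maekawa arbiter site: it grants at most one
# REPLY between RELEASE messages, queuing other requests in the meantime.
from collections import deque

class Arbiter:
    def __init__(self):
        self.locked_for = None   # site currently holding this arbiter's REPLY
        self.queue = deque()

    def on_request(self, site):
        if self.locked_for is None:
            self.locked_for = site
            return site          # REPLY sent immediately to this site
        self.queue.append(site)
        return None              # request queued behind the current holder

    def on_release(self):
        if self.queue:
            self.locked_for = self.queue.popleft()
            return self.locked_for   # REPLY to the next waiting site
        self.locked_for = None       # no REPLY outstanding any more
        return None

a = Arbiter()
print(a.on_request("S1"))  # S1: gets the REPLY
print(a.on_request("S2"))  # None: queued behind S1
print(a.on_release())      # S2: now gets the REPLY
```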
99
Maekawa’s Algorithm- Example

100
Maekawa’s Algorithm- Example

101
Election in DS
• A distributed algorithm is an algorithm that runs on a distributed system.
• A distributed system is a collection of independent computers that do not share memory.
• Each processor has its own memory, and they communicate via communication networks.
• Communication is implemented by a process on one machine exchanging messages with a process on another machine.
• Many algorithms used in distributed systems require a coordinator that performs functions needed by the other processes in the system.
• Election algorithms are designed to choose such a coordinator.
• Election: In distributed computing, leader election is the process of designating a single process as the organizer of some task distributed among several computers (nodes).
• Before the task begins, all network nodes are unaware which node will serve as the "leader," or coordinator, of the task.
• After a leader election algorithm has been run, however, each node throughout the network recognizes a particular, unique node as the task leader.
102
Election Algorithms:
• Election algorithms choose a process from a group of processes to act as a coordinator.
• If the coordinator process crashes for some reason, a new coordinator is elected on another processor.
• An election algorithm basically determines where a new copy of the coordinator should be restarted.
• Election algorithms assume that every active process in the system has a unique priority number.
• The process with the highest priority is chosen as the new coordinator.
• Hence, when a coordinator fails, the algorithm elects the active process with the highest priority number.
• This number is then sent to every active process in the distributed system. We have two election algorithms for two different configurations of a distributed system:
1. The Bully Algorithm
2. The Ring Algorithm
103
1. The Bully Algorithm
• This algorithm applies to systems where every process can send a message to every other process in the system.
• Algorithm – Suppose process P sends a message to the coordinator.
• If the coordinator does not respond within a time interval T, it is assumed that the coordinator has failed.
• Process P then sends an ELECTION message to every process with a higher priority number.
• It waits for responses; if no one responds within time interval T, process P elects itself as the coordinator.
• It then sends a message to all processes with lower priority numbers announcing that it has been elected as their new coordinator.
• However, if an answer is received within time T from some process Q:
• (I) Process P waits for a further time interval T’ to receive a message from Q announcing that Q has been elected coordinator.
• (II) If Q does not respond within time interval T’, Q is assumed to have failed and the algorithm is restarted.
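A simplified simulation of the outcome, in which the highest-priority running process wins, might look like the sketch below. Message timeouts and the takeover rounds are abstracted away, so this illustrates the result of the algorithm rather than the full protocol:

```python
# Illustrative sketch of a bully election: processes have unique priority
# numbers; `alive` records which processes respond to messages.

def bully_election(initiator, alive):
    """Return the elected coordinator given a dict {process: is_alive}."""
    responders = [p for p in alive if p > initiator and alive[p]]
    if not responders:
        return initiator            # nobody higher answered: initiator wins
    # Each responder takes over in turn; effectively the highest-priority
    # running process ends up as coordinator.
    return max(responders)

alive = {1: True, 2: True, 3: True, 4: False,
         5: True, 6: True, 7: True, 8: False}
print(bully_election(5, alive))  # 7: highest-priority running process
```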

104
The Bully Algorithm- Example

Process 5 starts an election and sends the ELECTION message only to those nodes with a higher priority.
Process 5 hands the election over to process 6, which performs the same steps and hands the election over to process 7. Process 7 is elected as the coordinator and sends the COORDINATOR message to all other nodes.
105
2. The Ring Algorithm
• Principle:
• Process priority is obtained by organizing processes into a (logical) ring. Process with the
highest priority should be elected as coordinator.
• When a process notices that the coordinator is not functioning:
• It builds an ELECTION message (containing its own process number).
• It sends the message to its successor (if the successor is down, the sender skips over it and goes to the next member along the ring, or the one after that, until a running process is located).
• At each step, the sender adds its own process number to the list in the message.
• When the message gets back to the process that started it all:
• The message comes back to the initiator.
• In the list, the process with the maximum process ID wins.
• The initiator announces the winner by sending another message around the ring.
106
2. The Ring Algorithm
• This algorithm applies to systems organized as a ring. We assume that the links between processes are unidirectional and that every process can send messages only to the process on its right. The data structure this algorithm uses is the active list, a list holding the priority numbers of all active processes in the system.
• Algorithm –
• If process P1 detects a coordinator failure, it creates a new active list, which is initially empty. It sends an ELECTION message to its neighbour on the right and adds its own number 1 to its active list.
• When process P2 receives the ELECTION message from the process on its left, it responds in one of three ways:
• (I) If the message's active list does not already contain its own number, P2 adds its number 2 to the active list and forwards the message.
• (II) If this is the first ELECTION message it has received or sent, P2 creates a new active list with the numbers 1 and 2, and forwards the ELECTION message for 1 followed by its own for 2.
• (III) If process P1 receives its own ELECTION message back, its active list now contains the numbers of all active processes in the system. Process P1 then selects the highest priority number from the list and elects that process as the new coordinator.
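The ring election can be simulated in a few lines. The message's trip around the ring and the skipping of dead successors are collapsed into a single loop, so this is only a sketch of the outcome:

```python
# Illustrative sketch of a ring election: the ELECTION message makes one
# full trip, collecting the ids of running processes; the maximum id wins.

def ring_election(initiator, ring, alive):
    """ring: process ids in ring order; alive: set of running process ids."""
    active = []
    n = len(ring)
    pos = ring.index(initiator)
    for step in range(n):                 # one full trip around the ring
        p = ring[(pos + step) % n]
        if p in alive:                    # dead successors are skipped
            active.append(p)
    return max(active)                    # highest id becomes coordinator

ring = [2, 3, 4, 5, 6, 7, 8, 1]
print(ring_election(2, ring, alive={1, 2, 3, 4, 5, 6, 7}))  # 7 (8 crashed)
```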

107
2. The Ring Algorithm
Coordinator process 8 has crashed, then process 2 starts an election.
Coordinator process 8 has crashed, then process 5 starts an election.

108
2. The Ring Algorithm

109
What is Replication in Distributed System?
• In a distributed system, data is stored over different computers in a network. Therefore, we need to make sure that data is readily available for the users.
• Or, Replication refers to the process of creating and
maintaining duplicate copies of data or resources
across multiple nodes or systems.
• The primary objective of replication is to improve
data availability, reliability, and performance.
• By having multiple copies of data distributed across
different locations,
• systems can retrieve data more quickly and
efficiently, and in case one copy becomes
unavailable, another copy can still be accessed.
• Replication is the practice of keeping several copies of
data in different places.

110
Why do we require replication?
• The first thing is that it makes our system more stable because of node replication.
It is good to have replicas of a node in a network for the following reasons:
• If a node stops working, the distributed network will still work fine thanks to its replicas. This increases the fault tolerance of the system.
• It also helps in load sharing, where the load on a server is shared among its replicas.
• It enhances the availability of the data. If replicas are created and data is stored near the consumers, it is easier and faster to fetch the data.
• Types of Replication
• Active Replication
• Passive Replication

111
Active Replication:
• The client's request goes to all the replicas.
• It must be ensured that every replica receives the client requests in the same order, otherwise the system becomes inconsistent.
• There is no need for coordination because each copy processes the same requests in the same sequence.
• All replicas respond to the client’s request.
• Advantages:
• It is really simple. The codes in active replication are the same throughout.
• It is transparent.
• Even if a node fails, it will be easily handled by replicas of that node.
• Disadvantages:
• It increases resource consumption. The greater the number of replicas, the more memory is needed.
• It increases time complexity. Any change made on one replica must also be made on all the others.

112
Passive Replication:
• The client's request goes to the primary replica, also called the main replica.
• The other replicas act as backups for the primary replica.
• The primary replica informs all backup replicas of any modification made.
• The response is returned to the client by the primary replica.
• Periodically, the primary replica sends a signal to the backup replicas to let them know that it is working fine.
• In case of failure of the primary replica, a backup replica becomes the primary replica.
• Advantages:
• Resource consumption is lower, as backup servers only come into play when the primary server fails.
• Time complexity is also lower, as updates do not have to be processed independently by every replica, unlike active replication.
• Disadvantages:
• If a failure occurs, the response time is delayed.
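A minimal primary-backup sketch of passive replication follows; the class and function names are illustrative, not from any real library:

```python
# Illustrative sketch of passive (primary-backup) replication: the primary
# applies each update, pushes it to the backups, and answers the client.
# If the primary fails, a backup is promoted without losing updates.

class Replica:
    def __init__(self):
        self.state = {}

def handle(primary, backups, key, value):
    primary.state[key] = value          # primary applies the update
    for b in backups:                   # then informs every backup replica
        b.state[key] = value
    return "ok"                         # the reply comes from the primary

primary, b1, b2 = Replica(), Replica(), Replica()
handle(primary, [b1, b2], "x", 1)
primary = b1                            # primary crashes: promote a backup
print(primary.state["x"])               # 1: the update survived the failure
```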

113
Replication and Fault Tolerant,
• Replication can be implemented in various
ways:
• Master-Slave Replication: In this model, there is
a master node that receives write operations, and
these operations are then replicated to one or
more slave nodes. Slave nodes are typically used
for read operations, improving read scalability.
• Master-Master Replication: In this model,
multiple nodes act as both masters and slaves,
allowing for bidirectional replication. This setup
provides better redundancy and fault tolerance.
• Multi-Master Replication: Similar to master-
master replication, but with more than two nodes
participating in the replication process. This
approach is commonly used in large-scale
distributed systems.

114
Fault Tolerance:
• Fault Tolerance is defined as the ability of the system to
function properly even in the presence of any failure.
• Distributed systems consist of multiple components due to which
there is a high risk of faults occurring. Due to the presence of
faults, the overall performance may degrade.
• Fault tolerance refers to a system's ability to continue
operating properly in the event of component failures or
unexpected errors.
• It involves designing systems to anticipate and recover from
faults automatically without causing disruption to the overall
functionality.
• Techniques for achieving fault tolerance include redundancy,
error detection and correction, graceful degradation, and
failover mechanisms.
• Fault-tolerant systems often employ redundancy by
replicating critical components or data across multiple nodes
to ensure continued operation even if some components fail.
115
Types of Faults
• Transient Faults: Transient faults are faults that occur once and then disappear.
• These faults do not harm the system to a great extent but are very difficult to find or locate.
• A processor fault is an example of a transient fault.
• Intermittent Faults: Intermittent faults are faults that come and go repeatedly:
• the fault occurs, disappears on its own, and then reappears.
• An example of an intermittent fault is a working computer that hangs up occasionally.
• Permanent Faults: Permanent faults are faults that remain in the system until the faulty component is replaced.
• These faults can damage the system but are easy to identify.
• A burnt-out chip is an example of a permanent fault.

116
Need for Fault Tolerance in Distributed Systems
• Fault tolerance is required in order to provide the four features below.
• Availability: the system is readily available for use at any time.
• Reliability: the system can work continuously without failure.
• Safety: the system remains safe from unauthorized access even if a failure occurs.
• Maintainability: how easily and quickly a failed node or system can be repaired.

117
Fault Tolerance in Distributed Systems
• Below are the phases carried out for fault tolerance
in any distributed systems.
1. Fault Detection
• It is the first phase, in which the system is monitored continuously.
• If any faults are identified during monitoring, they are reported.
• These faults can occur for various reasons such as hardware failure, network failure, and software issues.
• The main aim of the first phase is to detect faults as soon as they occur so that the assigned work is not delayed.

118
Fault Tolerance in Distributed Systems
2. Fault Diagnosis
• Fault diagnosis is the process where the fault that is identified in the first phase will be
diagnosed properly in order to get the root cause and possible nature of the faults.
• Fault diagnosis can be done manually by the administrator or by using automated
Techniques in order to solve the fault and perform the given task.
3. Evidence Generation
• Evidence generation is defined as the process where the report of the fault is prepared
based on the diagnosis done in an earlier phase.
• This report involves the details of the causes of the fault, the nature of faults, the solutions
that can be used for fixing, and other alternatives and preventions that need to be
considered.

119
Fault Tolerance in Distributed Systems
4. Assessment
• Assessment is the process where the damages caused by the faults are analyzed.
• It can be determined with the help of messages that are being passed from the component
that has encountered the fault.
• Based on the assessment further decisions are made.
5. Recovery
• Recovery is the process whose aim is to make the system fault-free.
• The system is restored to a fault-free state using forward recovery or backward recovery.
• Common recovery techniques such as reconfiguration and resynchronization can be used.

120
Types of Fault Tolerance in Distributed Systems
• Hardware Fault Tolerance: Hardware Fault Tolerance involves keeping a backup plan for hardware
devices such as memory, hard disk, CPU, and other hardware peripheral devices.
• Hardware Fault Tolerance is a type of fault tolerance that does not examine faults and runtime errors but can only
provide hardware backup.
• The two different approaches that are used in Hardware Fault Tolerance are fault-masking and dynamic recovery.
• Software Fault Tolerance: Software Fault Tolerance is a type of fault tolerance where dedicated
software is used in order to detect invalid output, runtime, and programming errors.
• Software Fault Tolerance makes use of static and dynamic methods for detecting and providing the solution.
• Software Fault Tolerance also consists of additional data points such as recovery rollback and checkpoints.
• System Fault Tolerance: System fault tolerance is a type of fault tolerance that covers the system as a whole.
• It has the advantage that it stores not only the checkpoints but also memory blocks and program checkpoints, and detects errors in applications automatically.
• If the system encounters any type of fault or error, it provides the required mechanism for a solution. Thus system fault tolerance is reliable and efficient.

121
Hardware Fault-tolerance Techniques:
• Making hardware fault-tolerant is simple compared to software. Fault-tolerance techniques make the hardware work properly and give correct results even when a fault occurs in a hardware part of the system.
• There are basically two techniques used for hardware fault-tolerance:
• BIST (Built-In Self-Test) – The system carries out tests of itself repeatedly after a certain period of time; this is the BIST technique for hardware fault-tolerance.
• When the system detects a fault, it switches out the faulty component and switches in its redundant copy. The system basically reconfigures itself when a fault occurs.
• TMR (Triple Modular Redundancy) – Three redundant copies of a critical component are generated and all three copies run concurrently.
• The results of all redundant copies are voted on and the majority result is selected, e.g. by majority voting or median voting. TMR can tolerate the occurrence of a single fault at a time.
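The TMR voting step can be sketched as follows. The replicas here are plain functions, one of which is deliberately faulty, so a single fault is masked by the majority:

```python
# Illustrative sketch of a TMR voter: three replicas compute the same
# function and the majority result is selected, masking one faulty copy.
from collections import Counter

def tmr(replicas, x):
    results = [f(x) for f in replicas]
    value, count = Counter(results).most_common(1)[0]
    if count >= 2:
        return value                 # a majority masks a single fault
    raise RuntimeError("no majority: more than one replica failed")

good = lambda x: x * x
faulty = lambda x: x * x + 1         # one replica with a transient fault
print(tmr([good, good, faulty], 4))  # 16: the faulty result is outvoted
```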

122
Software Fault-tolerance Techniques:
• Software fault-tolerance techniques are used to make the software reliable in the presence of faults and failures.
• There are three techniques used in software fault-tolerance.
• The first two techniques are common and are basically adaptations of hardware fault-tolerance techniques.
N-version Programming – In N-version programming, N versions of the software are developed by N individuals or groups of developers.
• N-version programming is just like TMR in hardware fault-tolerance.
• In N-version programming, all the redundant copies run concurrently, and the results obtained from the different versions are compared.
• The idea of N-version programming is basically to catch the errors introduced during development, since independently developed versions are unlikely to contain the same faults.

123
Software Fault-tolerance Techniques:
• Recovery Blocks – The recovery blocks technique is also like N-version programming, but in the recovery blocks technique the redundant copies are generated using different algorithms.
• In recovery blocks, the redundant copies are not run concurrently; these copies are run one by one.
• The recovery block technique can only be used where the task deadlines are longer than the task computation time.
124
Software Fault-tolerance Techniques:
• Check-pointing and Rollback Recovery – This technique is different from the above two software fault-tolerance techniques.
• In this technique, the system state is checkpointed each time some computation is performed.
• This technique is basically useful when there is a processor failure or data corruption.

125
Recovery Approach in DS
• Recovery from an error is essential to fault tolerance; an error is a part of the system state that could lead to failure.
• The whole idea of error recovery is to replace an erroneous state with an error-free state. Error recovery can be broadly divided into two categories.
1. Backward Recovery:
• Backward recovery involves restoring the system to a consistent state before the occurrence of the failure.
• This approach typically requires maintaining transaction logs or checkpoints that record the state of the system at various points in
time.
• Upon detecting a failure, the system rolls back to a previously consistent state and re-executes transactions or operations from that
point forward.
• Backward recovery ensures that the system remains in a consistent state, but it may suffer higher overhead due to the need for
extensive logging and rollback operations.
2. Forward Recovery:
• In forward recovery, the system attempts to recover from a failure by progressing forward from the point of failure.
• This approach involves detecting the failure, identifying the affected components or data, and then resuming normal operation from a
known checkpoint or intermediate state.
• Forward recovery mechanisms often include techniques such as re-execution of lost or incomplete transactions, retransmission of
lost messages, and incremental synchronization of data.
126
Recovery Approach in DS
Stable Storage :
• Stable storage, which can survive anything but major disasters like floods and earthquakes, is another option.
• A pair of ordinary disks can be used to implement stable storage.
• Each block on drive 2 is an exact duplicate of the corresponding block on drive 1.
• Whenever a block is updated, the block on drive 1 is updated and verified first; then the identical block on drive 2 is updated.
• Suppose the system crashes after drive 1 is updated but before the update on drive 2.
• Upon recovery, the disks can be compared block by block. Since drive 1 is always updated before drive 2, whenever two corresponding blocks differ it is safe to assume that drive 1 holds the correct one, so the newer block is copied from drive 1 to drive 2.
• Both drives will be identical once the recovery process is finished.
• Another potential issue is the spontaneous decay of a block: a previously valid block may suddenly produce a checksum error for no apparent reason.
• When such an error is discovered, the faulty block can be reconstructed from the corresponding block on the other drive.
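The two-drive update and recovery procedure above can be sketched with two dictionaries standing in for the disks; the `crash_between` flag (an illustrative device, not part of any real API) simulates a crash between the two writes:

```python
# Illustrative sketch of stable storage over two disk images: writes go to
# drive 1 first, then drive 2; recovery copies drive 1 over any mismatch.

drive1, drive2 = {}, {}

def stable_write(block, data, crash_between=False):
    drive1[block] = data             # update and verify drive 1 first
    if crash_between:
        return                       # simulated crash before drive 2 is written
    drive2[block] = data             # then the identical block on drive 2

def recover():
    for block in drive1:             # compare the disks block by block
        if drive2.get(block) != drive1[block]:
            drive2[block] = drive1[block]   # drive 1 holds the correct copy

stable_write(0, b"old")
stable_write(0, b"new", crash_between=True)  # crash mid-update
recover()
print(drive2[0])                     # b'new': both drives identical again
```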
127
Recovery Approach in DS
Checkpointing :
• Backward error recovery calls for the system to routinely save its state onto stable storage in a fault-tolerant distributed system.
• In particular, we need to take a distributed snapshot, often known as a consistent global state.
• If a process P has recorded the receipt of a message in a distributed snapshot, then there should also be a process Q that has recorded the sending of that message. It has to originate somewhere, after all.
• In backward error recovery techniques, each process periodically saves its state to locally accessible stable storage.
• To recover from a process or system failure, we must construct a consistent global state from these local states.
• In particular, it is recommended to recover to the most recent distributed snapshot, also known as a recovery line.
• In other words, as depicted in the figure, a recovery line represents the most recent consistent collection of checkpoints.
128
Recovery Approach in DS
Message Logging :
• The core principle of message logging is that if the transmission of messages can be replayed, we can still reach a globally consistent state without having to restore that state entirely from stable storage. Instead, any messages sent since the last checkpoint are simply retransmitted and handled again.
• As the system executes, messages are recorded on stable storage. A message is called logged if its data and the index of the stable interval in which it is stored are both recorded on stable storage. In the figure, logged and unlogged messages are denoted by different arrows. The idea is that if the transmission of messages is replayed, we can still reach a globally consistent state: we recover the logs of messages and continue the execution.
129
