Dist_Sys_Unit_5_Notes

The document discusses distributed coordination-based systems, emphasizing the separation of computation and coordination within distributed systems. It explores various coordination models, such as direct coordination, mailbox coordination, and publish/subscribe systems, while also addressing the architecture, communication, naming, and security aspects of these systems. Additionally, it highlights emerging trends like cloud computing, grid computing, and virtualization, which enhance performance, security, and cost efficiency in distributed systems.
Distributed Systems CSE

Introduction to Distributed Coordination Based Systems

Coordination-based Systems

 Here we consider a generation of distributed systems: coordination-based systems. Key to the approach followed in coordination-based systems is the clean separation between computation and coordination.

 If we view a distributed system as a collection of (possibly multi-threaded) processes, then the computing part of a distributed system is formed by the processes, each concerned with a specific computational activity which, in principle, is carried out independently of the activities of other processes.

In this model, the coordination part of a distributed system handles the communication and
cooperation between processes.

 It forms the glue that binds the activities performed by processes into a whole.

 In distributed coordination-based systems, the focus is on how coordination between the processes takes place.

 Such systems assume that the various components of a system are inherently distributed and that the real problem in developing them lies in coordinating the activities of the different components.

 In other words, instead of concentrating on the transparent distribution of components,


emphasis lies on the coordination of activities between components.

A taxonomy of coordination models for mobile agents can be applied equally well to many other types of distributed systems. Adapting that terminology to distributed systems in general, we make a distinction between models along two different dimensions, temporal and referential:

 Do cooperating/communicating processes know each other explicitly (referential coupling)?

 Are cooperating/communicating processes alive at the same time (temporal coupling)?

1. When processes are temporally and referentially coupled, coordination takes place in a direct
way, referred to as direct coordination.

2. The referential coupling generally appears in the form of explicit referencing in communication.

SUTHOJU GIRIJA RANI, Assistant Professor, CSE, NGIT
3. For example, a process can communicate only-if it knows the name or identifier of the other
processes it wants to exchange information with.

4. Temporal coupling means that processes that are communicating will both have to be up and
running.

5. This coupling is analogous to transient message-oriented communication. A different type of coordination occurs when processes are temporally decoupled but referentially coupled, which we refer to as mailbox coordination.

6. In this case, there is no need for two communicating processes to execute at the same time in
order to let communication take place. Instead, communication takes place by putting
messages in a (possibly shared) mailbox.

7. This situation is analogous to persistent message-oriented communication. It is necessary to explicitly address the mailbox that will hold the messages that are to be exchanged. Consequently, there is a referential coupling.
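The mailbox model above can be sketched in a few lines of Python. The mailbox names and message format here are made up for illustration; the point is that senders name a mailbox explicitly (referential coupling) but need not run at the same time as receivers (temporal decoupling):

```python
from collections import defaultdict, deque

# Hypothetical sketch: each mailbox is a named, persistent queue of messages.
mailboxes = defaultdict(deque)

def send(mailbox_name, message):
    # The sender deposits the message and may terminate immediately.
    mailboxes[mailbox_name].append(message)

def receive(mailbox_name):
    # The receiver may start long after the sender has finished.
    box = mailboxes[mailbox_name]
    return box.popleft() if box else None

send("orders", {"item": "disk", "qty": 2})
print(receive("orders"))   # the stored message: {'item': 'disk', 'qty': 2}
print(receive("orders"))   # None: the mailbox is now empty
```

A real implementation would of course keep the mailbox in stable storage (or a message queue service) so that messages survive process crashes.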

Introduction to Coordination-based Systems Publish/Subscribe

1. The combination of referentially decoupled and temporally coupled systems forms the group of models for meeting-oriented coordination. In referentially decoupled systems, processes do not know each other explicitly.

2. In other words, when a process wants to coordinate its activities with other processes, it
cannot directly refer to another process. Instead, there is a concept of a meeting in which
processes temporarily group together to coordinate their activities.

The model prescribes that the meeting processes are executing at the same time.

1. Meeting-based systems are often implemented by means of events, like the ones supported
by object-based distributed systems. In this chapter, we discuss another mechanism for
implementing meetings, namely publish/subscribe systems.

In these systems, processes can subscribe to messages containing information on specific subjects,
while other processes produce (i.e., publish) such messages. Most publish/subscribe systems require
that communicating processes are active at the same time; hence, there is a temporal coupling.
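A minimal sketch of such a temporally coupled, subject-based publish/subscribe broker follows. The class and subject names are illustrative, not any particular system's API; because nothing is stored, a message published while no subscriber is registered is simply lost:

```python
from collections import defaultdict

# Hypothetical sketch of a subject-based publish/subscribe broker.
# Publishers never name subscribers (referential decoupling), but delivery
# is immediate and unstored, so both parties must be running at
# publication time (temporal coupling).
class Broker:
    def __init__(self):
        self._subs = defaultdict(list)

    def subscribe(self, subject, callback):
        self._subs[subject].append(callback)

    def publish(self, subject, message):
        # Only currently registered subscribers receive the message.
        for callback in self._subs[subject]:
            callback(message)

broker = Broker()
broker.subscribe("weather", lambda m: print("got:", m))
broker.publish("weather", "rain expected")    # delivered to the subscriber
broker.publish("sports", "no one listening")  # silently dropped
```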

Introduction to Coordination-based Systems Linda Tuples

1. The most widely-known coordination model is the combination of referentially and temporally decoupled processes, exemplified by generative communication as introduced in the Linda programming system by Gelernter (1985).

2. The key idea in generative communication is that a collection of independent processes makes use of a shared persistent dataspace of tuples. Tuples are tagged data records consisting of a number (possibly zero) of typed fields. Processes can put any type of record into the shared dataspace (i.e., they generate communication records). Unlike the case with blackboards, there is no need to agree in advance on the structure of tuples.

An interesting feature of these shared dataspaces is that they implement an associative search
mechanism for tuples.

In other words, when a process wants to extract a tuple from the dataspace, it essentially specifies (some of) the values of the fields it is interested in. Any tuple that matches that specification is then removed from the dataspace and passed to the process. If no match can be found, the process can choose to block until there is a matching tuple.

We note that generative communication and shared data spaces are often also considered to be
forms of publish/subscribe systems. In what follows, we shall adopt this commonality as well.
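As a rough illustration of generative communication, the following sketch mimics Linda's out (generate a tuple) and in (associatively extract a tuple, blocking until a match exists). This is a simplification in Python, not Gelernter's actual syntax; None stands in for a wildcard field in a template:

```python
import threading

# Hypothetical sketch of a Linda-style tuple space with associative matching.
class TupleSpace:
    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def _matches(self, template, tup):
        # A template field of None matches any value in that position.
        return (len(template) == len(tup) and
                all(t is None or t == f for t, f in zip(template, tup)))

    def out(self, tup):
        # Generate: put a tuple into the shared dataspace.
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()

    def inp(self, template):
        # Extract: block until a matching tuple exists, then remove and return it.
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._matches(template, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()

space = TupleSpace()
space.out(("temperature", "room4", 21.5))
# Retrieval is by content, not by sender identity:
print(space.inp(("temperature", None, None)))  # ('temperature', 'room4', 21.5)
```

Note how the reader never names the process that produced the tuple, and the producer may long since have terminated: the model is both referentially and temporally decoupled.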

Architecture of Distributed Coordination Based Systems

1. An important aspect of coordination-based systems is that communication takes place by


describing the characteristics of data items that are to be exchanged. As a consequence,
naming plays a crucial role. We return to naming later in this chapter, but for now the
important issue is that in many cases, data items are not explicitly identified by senders and
receivers.

2. Let us first assume that data items are described by a series of attributes. A data item is said to be published when it is made available for other processes to read. To that end, a subscription needs to be passed to the middleware, containing a description of the data items that the subscriber is interested in. Such a description typically consists of some (attribute, value) pairs, possibly combined with (attribute, range) pairs.

3. We are now confronted with a situation in which subscriptions need to be matched against data items. When matching succeeds, there are two possible scenarios. In the first case, the middleware may decide to forward the published data to its current set of subscribers, that is, processes with a matching subscription.

4. As an alternative, the middleware can also forward a notification at which point subscribers
can execute a read operation to retrieve the published data item.
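The matching step described above can be sketched as follows. The attribute names and the convention of encoding a range as a (low, high) tuple are assumptions made for illustration:

```python
# Hypothetical sketch: a subscription mixes exact (attribute, value) pairs
# with (attribute, range) pairs; a published data item matches when every
# constraint in the subscription is satisfied.
def matches(subscription, item):
    for attr, constraint in subscription.items():
        if attr not in item:
            return False
        if isinstance(constraint, tuple):        # (low, high) range pair
            low, high = constraint
            if not (low <= item[attr] <= high):
                return False
        elif item[attr] != constraint:           # exact (attribute, value) pair
            return False
    return True

subscription = {"subject": "stock-quote", "price": (10, 50)}
print(matches(subscription, {"subject": "stock-quote", "price": 42}))  # True
print(matches(subscription, {"subject": "stock-quote", "price": 99}))  # False
```

On a successful match, the middleware would then either forward the item itself or send the subscriber a notification, as the two scenarios above describe.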

The principle of exchanging data items between publishers and subscribers.

1. In those cases in which data items are immediately forwarded to subscribers, the middleware
will generally not offer storage of data. Storage is either explicitly handled by a separate
service, or is the responsibility of subscribers. In other words, we have a referentially
decoupled, but temporally coupled system.

2. This situation is different when notifications are sent so that subscribers need to explicitly
read the published data. Necessarily, the middleware will have to store data items. In these
situations there are additional operations for data management.

3. It is also possible to attach a lease to a data item such that when the lease expires, the data item is automatically deleted.

4. In the model described so far, we have assumed that there is a fixed set of n attributes a1, ..., an that is used to describe data items. In particular, each published data item is assumed to have an associated vector of (attribute, value) pairs.
5. Events complicate the processing of subscriptions. To illustrate, consider a subscription such
as "notify when room R4.20 is unoccupied and the door is unlocked."

6. Typically, a distributed system supporting such subscriptions can be implemented by placing


independent sensors for monitoring room occupancy (e.g., motion sensors) and those for
registering the status of a door lock.

7. Following the approach sketched so far, we would need to compose such primitive events
into a publishable data item to which processes can then subscribe.

8. Clearly, in coordination-based systems such as these, the crucial issue is the efficient and
scalable implementation of matching subscriptions to data items, along with the construction
of relevant data items.

9. From the outside, a coordination approach provides lots of potential for building very large-
scale distributed systems due to the strong decoupling of processes.
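Composing primitive sensor events into one publishable data item, as in the room example above, might look roughly like this. The sensor names and the composite message format are hypothetical:

```python
# Hypothetical sketch for the subscription "notify when room R4.20 is
# unoccupied and the door is unlocked": independent sensors publish
# primitive events, and a composer fires a composite data item only when
# both conditions hold at the same time.
state = {}   # latest primitive event value per (room, sensor)

def on_primitive_event(room, sensor, value, notify):
    state[(room, sensor)] = value
    # Compose: both primitive conditions must hold simultaneously.
    if (state.get((room, "occupied")) is False and
            state.get((room, "locked")) is False):
        notify({"subject": "room-status", "room": room,
                "occupied": False, "locked": False})

alerts = []
on_primitive_event("R4.20", "occupied", False, alerts.append)  # no alert yet
on_primitive_event("R4.20", "locked", False, alerts.append)    # composite fires
print(alerts)
```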

Traditional Architectures

1. The simplest solution for matching data items against subscriptions is to have a centralized
client-server architecture.

2. This is a typical solution currently adopted by many publish/subscribe systems, including IBM's WebSphere (IBM, 2005c) and popular implementations of Sun's JMS (Sun Microsystems, 2004a).
3. Likewise, implementations for the more elaborate generative communication models such as
Jini (Sun Microsystems, 2005b) and JavaSpaces (Freeman et al., 1999) are mostly based on
central servers.

Traditional Architectures Example: Jini and JavaSpaces

The general organization of a JavaSpace in Jini.

Example: TIB/Rendezvous

The principle of a publish/subscribe system as implemented in TIB/Rendezvous


Example: A Gossip-Based Publish/Subscribe System

Grouping nodes for supporting range queries in a peer-to-peer publish/subscribe system

Example: Lime

Transient sharing of local dataspaces in Lime

Communication, Naming & Security in Distributed Coordination Based Systems

1. Communication in many publish/subscribe systems is relatively simple. For example, in


virtually every Java-based system, all communication proceeds through remote method
invocations (RMI).

2. Naming can be handled by means of subject names, i.e., message names.

3. Security can be provided by decoupling publishers from subscribers.

4. One important problem that needs to be handled when publish/subscribe systems are spread
across a wide-area system is that published data should reach only the relevant subscribers.

5. A solution is to deploy content-based routing.

COMMUNICATION - Content-Based Routing

1. In content-based routing, the system is assumed to be built on top of a point-to-point


network in which messages are explicitly routed between nodes.

2. Crucial in this setup is that routers can take routing decisions by considering the content of a
message.

3. More precisely, it is assumed that each message carries a description of its content, and that this description can be used to cut off routes for which it is known that they do not lead to receivers interested in that message.

4. Consider a publish/subscribe system consisting of N servers to which clients (i.e., applications) can send messages, or from which they can read incoming messages. An application will have previously provided its server with a description of the kind of data it is interested in.

5. The server, in turn, will notify the application when relevant data has arrived.

6. Assume a two-layered routing scheme in which the lowest layer consists of a shared broadcast tree connecting the N servers.

7. There are various ways for setting up such a tree, ranging from network-level multicast
support to application-level multicast trees.

8. Here, we also assume that such a tree has been set up with the N servers as end nodes, along
with a collection of intermediate nodes forming the routers.

9. Consider first two extremes for content-based routing, assuming we need to support only
simple subject-based publish/subscribe in which each message is tagged with a unique (non-
compound) keyword.

10. One extreme solution is to send each published message to every server, and subsequently let each server check whether any of its clients has subscribed to the subject of that message. In essence, this is the approach followed in TIB/Rendezvous.

11. The other extreme solution is to let every server broadcast its subscriptions to all other servers. As a result, every server will be able to compile a list of (subject, destination) pairs. Then, whenever an application submits a message on subject s, and that message reaches a router, the latter can use the list to decide on the paths that the message should follow.

Naive content-based routing

Taking this last approach as our starting point, we can refine the capabilities of routers for deciding where to forward messages to. To that end, each server broadcasts its subscription across the network so that routers can compose routing filters. For example, assume that node 3 subscribes to messages for which an attribute a lies in the range [0,3], but that node 4 wants messages with a in [2,5]. In this case, router R2 will create a routing filter as a table with an entry for each of its outgoing links (in this case three: one to node 3, one to node 4, and one toward router R1).

A partially filled routing table.
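A routing filter of the kind just described can be sketched as a table mapping each outgoing link to the subscription ranges reachable over it; a message is forwarded only along links whose ranges contain the message's attribute value. The link names below are made up for illustration:

```python
# Hypothetical routing filter at a router. Each outgoing link carries the
# subscription ranges of the subscribers reachable over that link; routes
# with no interested receivers are cut off.
filter_table = {
    "link-to-node3": [(0, 3)],   # node 3 subscribed to a in [0,3]
    "link-to-node4": [(2, 5)],   # node 4 subscribed to a in [2,5]
    "link-upstream": [],         # no interested subscribers that way (assumed)
}

def route(a):
    # Forward along every link whose subscription ranges cover the value.
    return [link for link, ranges in filter_table.items()
            if any(low <= a <= high for low, high in ranges)]

print(route(1))    # only node 3's link
print(route(2.5))  # both node 3's and node 4's links
print(route(9))    # no links: the message is dropped at this router
```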

Security by Decoupling Publishers from Subscribers

Decoupling publishers from subscribers using an additional trusted service

Emerging Trends of Distributed Systems

Latest Trends in Distributed Systems

Staying updated with the latest trends in distributed systems is crucial for several reasons:

• Performance Optimization: New trends often bring improvements in efficiency and scalability,
helping to enhance system performance and manage growing workloads.

• Security Enhancements: Emerging trends can introduce advanced security measures and
protocols to protect against evolving cyber threats.

• Cost Efficiency: Innovations in distributed systems can lead to more cost-effective solutions by
optimizing resource usage and reducing operational expenses.

• Competitive Edge: Keeping abreast of the latest developments allows organizations to


leverage cutting-edge technologies, maintaining a competitive advantage in the market.

• Adaptability: Understanding new trends helps organizations adapt to changing technology


landscapes and user demands, ensuring systems remain relevant and effective.

Cloud Computing and Distributed Systems

• Integrations between cloud computing and distributed systems involve combining the principles and technologies of both to create efficient, scalable, and resilient computing environments. The following describes how these integrations work and why they matter:

• Cloud computing platforms often use distributed systems principles to deliver their services.
For example:

– Scalable Infrastructure: Cloud providers use distributed systems to manage large-scale


data centers and networks. This allows them to scale resources dynamically based on
demand.
– Load Balancing: Cloud services distribute incoming network traffic across multiple
servers to ensure no single server becomes a bottleneck, improving performance and
reliability.
– Data Replication: Cloud storage solutions replicate data across multiple nodes or
locations to ensure high availability and fault tolerance.
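As a toy illustration of the load-balancing idea, a round-robin dispatcher spreads incoming requests evenly over a pool of servers so no single server becomes a bottleneck. The server names are made up; real cloud load balancers also use strategies such as least-connections and health checks:

```python
import itertools

# Hypothetical round-robin load balancer over an assumed server pool.
servers = ["server-a", "server-b", "server-c"]
next_server = itertools.cycle(servers)

def dispatch(request):
    # Pick the next server in rotation; a real balancer would forward
    # `request` to it over the network.
    return next(next_server)

print([dispatch(f"req-{i}") for i in range(4)])
# Requests wrap around the pool: a, b, c, a, ...
```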
Grid Computing

• Grid Computing can be defined as a network of computers working together to perform a task that would be difficult for a single machine.

• All machines on that network work under the same protocol to act as a virtual supercomputer.

• The tasks that they work on may include analyzing huge datasets or simulating situations that
require high computing power. Computers on the network contribute resources like
processing power and storage capacity to the network.


• Grid Computing is a subset of distributed computing, where a virtual supercomputer comprises machines on a network connected by some bus, mostly Ethernet or sometimes the Internet.
• It can also be seen as a form of parallel computing where, instead of many CPU cores on a single machine, the system comprises multiple cores spread across various locations.

• The concept of grid computing isn’t new, but it is not yet perfected as there are no standard
rules and protocols established and accepted by people.

Virtualization

 Virtualization is the process of making physical resources available as virtual resources.

 It allows us to share a single physical instance of a resource or application among multiple organizations.

 Hardware virtualization: creating a virtual machine on top of the existing OS and hardware.

Host machine: the machine on which the virtual machine is created.

Advantages of virtualization
 More flexible and efficient
 Low cost
 Pay-per-use
 Enables running multiple OSes

Types of virtualization

1. Application virtualization
2. Network virtualization
3. Desktop virtualization
4. Storage virtualization

1. Application virtualization
 The server stores all the personal information and other characteristics of the application, but the application can still run on a local workstation through the internet. This helps a user have remote access to an application from a server.

2. Network virtualization

 The ability to run multiple virtual networks, each with a separate control and data plane.
 E.g.: VPN
3. Desktop virtualization
 It allows the user's OS to be stored on a remote server and accessed from any device or machine connected to it.

4. Storage virtualization

 It is an array of servers that is managed by a virtual storage system. We do not know exactly where the data is stored, but we can read it and perform any operation on it.
 E.g.: AWS, Microsoft Azure
SELF -MANAGEMENT IN DISTRIBUTED SYSTEMS
• These are systems that can adjust themselves when something happens; they adapt automatically by means of
– Self-configuration
– Self-management
– Self-healing
– Self-optimization
Service-Oriented Architecture (SOA)
• Service-Oriented Architecture (SOA) is a stage in the evolution of application development
and/or integration. It defines a way to make software components reusable using the
interfaces.

SOA is different from micro-service architecture.

• SOA allows users to combine a large number of facilities from existing services to form
applications.
• SOA encompasses a set of design principles that structure system development and provide
means for integrating components into a coherent and decentralized system.
• SOA-based computing packages functionalities into a set of interoperable services, which can
be integrated into different software systems belonging to separate business domains.
Characteristics of SOA

• Provides interoperability between services.

• Provides methods for service encapsulation, service discovery, service composition, service reusability, and service integration.

• Facilitates QoS (Quality of Service) through service contracts based on Service Level Agreements (SLAs).

• Provides loosely coupled services.

• Provides location transparency with better scalability and availability.
• Ease of maintenance with reduced cost of application development and deployment.

Future Trends in Distributed Systems

The future of distributed systems is poised to be shaped by several emerging trends and
technological advancements. Here are some key directions where distributed systems are likely to
evolve:

• Ubiquitous Edge Computing

– Edge-AI Integration: Combining edge computing with artificial intelligence to enable


real-time data processing and decision-making at the edge of the network.

– Smart Infrastructure: Development of smart cities and smart infrastructure with


distributed systems handling tasks like traffic management, energy distribution, and
environmental monitoring.

• Enhanced Security and Privacy

– Zero Trust Architectures: Widespread adoption of zero trust models that enforce
strict identity verification and continuous monitoring.

– Advanced Cryptography: Use of quantum-resistant cryptography to safeguard against


future quantum computing threats and enhanced encryption methods for data
security.

• Quantum Computing and Networking

– Quantum-Enhanced Systems: Integration of quantum computing for tasks requiring


high computational power and optimization, potentially revolutionizing problem-
solving in distributed systems.

– Quantum Networks: Development of quantum communication networks that


leverage quantum entanglement for ultra-secure data transmission.

• Interoperability and Standards

– Cross-Domain Interoperability: Increased focus on creating standards and protocols


that allow different distributed systems and applications to work together seamlessly.

– Open Standards and APIs: Growth of open standards and APIs to facilitate easier
integration and communication between diverse systems and platforms.

BigData, Hadoop, MapReduce, HDFS, Hive & Apache Pig

Big Data: Data that is very large in size is called Big Data. Normally we work on data of size MB (Word docs, Excel) or at most GB (movies, code), but data on the scale of petabytes, i.e., 10^15 bytes, is called Big Data.

Issue: Huge amount of unstructured data which needs to be stored, processed and analyzed.

Solution

• Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System), which uses commodity hardware to form clusters and store data in a distributed fashion. It works on the write once, read many times principle.

• Processing: The MapReduce paradigm is applied to data distributed over the network to compute the required output.

• Analyze: Pig, Hive can be used to analyze the data.

• Cost: Hadoop is open source, so cost is no longer an issue.

Hadoop, MapReduce, HDFS & YARN

1. Hadoop is an open source framework.

1. It is provided by Apache to process and analyze very large volumes of data.

2. It is written in Java and currently used by Google, Facebook, LinkedIn, Yahoo, Twitter
etc.

2. MapReduce:

1. A distributed data processing model and execution environment that runs on large
clusters of commodity machines.

2. This is a framework which helps Java programs to do the parallel computation on data
using key value pair. The Map task takes input data and converts it into a data set
which can be computed in Key value pair.

3. The output of Map task is consumed by reduce task and then the output of reducer
gives the desired result.

3. HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop that
runs on large clusters of commodity machines.

1. It has a master/slave architecture, consisting of a single NameNode that performs the role of master and multiple DataNodes that perform the role of slaves.

4. YARN (Yet Another Resource Negotiator): used for job scheduling and for managing the cluster.

5. Hadoop Ecosystems

1. Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a
query language based on SQL (and which is translated by the runtime engine to
MapReduce jobs) for querying the data.

2. Pig: A data flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and MapReduce clusters.

3. Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop
and structured data stores such as relational databases.

4. Flume: Flume is a distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows.

Scalability & Programming Model

Scalability is a key feature of MapReduce, a programming model and computational framework that
processes and generates large datasets:

How MapReduce scales

• MapReduce scales by distributing tasks across many nodes in a cluster. This allows it to handle
massive datasets, making it suitable for Big Data applications.

How MapReduce works

• MapReduce divides tasks into two phases: mapping and reducing. The mapping phase
processes and sorts data, while the reducing phase aggregates and summarizes the results.

How MapReduce is implemented

• A MapReduce job is implemented by reading a large dataset, applying the map function to it, applying the reduce function to the intermediate results, and returning the resulting data.

How Apache Hadoop implements MapReduce

Apache Hadoop is a popular implementation of MapReduce that runs on a distributed cluster of


machines. Hadoop is designed to scale up from a single computer to thousands of clustered
computers.

HDFS

Steps in Map Reduce

• The map takes input data in the form of <key, value> pairs and returns a list of <key, value> pairs. The keys will not be unique in this case.

• Using the output of Map, sort and shuffle are applied by the Hadoop architecture. This sort
and shuffle acts on these list of <key, value> pairs and sends out unique keys and a list of
values associated with this unique key <key, list(values)>.

• The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined function on the list of values for each unique key, and the final <key, value> output is stored/displayed.
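The steps above can be sketched as a single-process word count; a real MapReduce job would distribute the map and reduce tasks across a cluster, but the data flow through the three phases is the same:

```python
from collections import defaultdict

# Sketch of the three MapReduce steps: map emits <key, value> pairs,
# sort-and-shuffle groups them into <key, list(values)>, and reduce
# aggregates each group into the final <key, value> output.
def map_phase(document):
    return [(word, 1) for word in document.split()]     # keys are not unique

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in sorted(pairs):                    # sort, then group by key
        groups[key].append(value)
    return groups                                       # <key, list(values)>

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

pairs = map_phase("the cat saw the dog")
print(reduce_phase(shuffle(pairs)))   # {'cat': 1, 'dog': 1, 'saw': 1, 'the': 2}
```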

Hive

1. Hive is a data warehouse system which is used to analyze structured data. It is built on the top
of Hadoop. It was developed by Facebook.

2. Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL like queries called HQL (Hive query language) which gets
internally converted to MapReduce jobs.

3. Using Hive, we can skip the requirement of the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User Defined Functions (UDF).

Features:

I. Hive is fast and scalable.

II. It provides SQL-like queries (i.e., HQL) that are implicitly transformed to MapReduce or
Spark jobs.

III. It is capable of analyzing large datasets stored in HDFS.

IV. It allows different storage types such as plain text, RCFile, and HBase.

V. It uses indexing to accelerate queries.

VI. It can operate on compressed data stored in the Hadoop ecosystem.

VII. It supports user-defined functions (UDFs) where user can provide its functionality.

Limitations of Hive

I. Hive is not capable of handling real-time data, and it is not designed for online transaction processing.

Apache Pig

1. Apache Pig is a high-level data flow tool/platform for executing MapReduce programs of
Hadoop. The language used for Pig is Pig Latin.
2. The Pig scripts get internally converted to Map Reduce jobs and get executed on data stored in
HDFS. Apart from that, Pig can also execute its job in Apache Tez or Apache Spark.

3. Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores the corresponding results in the Hadoop Distributed File System. Every task that can be achieved using Pig can also be achieved using Java MapReduce programs.

Advantages of Apache Pig

I. Less code - Pig requires fewer lines of code to perform any operation.

II. Reusability - Pig code is flexible enough to be reused.

III. Nested data types - Pig provides useful nested data types like tuple, bag, and map.

Differences between Hive and Pig

– Hive is commonly used by data analysts; Pig is commonly used by programmers.

– Hive follows SQL-like queries; Pig follows a data-flow language.

– Hive can handle structured data; Pig can handle semi-structured data.

– Hive works on the server side of an HDFS cluster; Pig works on the client side.

– Hive is slower; Pig is comparatively faster.
