Dist_Sys_Unit_5_Notes
Coordination-based Systems
In coordination-based systems, the coordination part of a distributed system handles the communication and
cooperation between processes. It forms the glue that binds the activities performed by processes into a whole.
Coordination models assume that the various components of a system are inherently distributed and that the
real problem in developing such systems lies in coordinating the activities of the different components.
A well-known taxonomy of coordination models for mobile agents can be applied equally well to many other
types of distributed systems. Adapting its terminology to distributed systems in general, we make a
distinction between models along two different dimensions: temporal and referential.
1. When processes are temporally and referentially coupled, coordination takes place in a direct
way, referred to as direct coordination.
2. Temporal coupling means that processes that are communicating will both have to be up and
running.
3. This coupling is analogous to transient message-oriented communication. A different type
of coordination occurs when processes are temporally decoupled but referentially coupled,
which we refer to as mailbox coordination.
4. In this case, there is no need for two communicating processes to execute at the same time in
order to let communication take place. Instead, communication takes place by putting
messages in a (possibly shared) mailbox.
5. The combination of referentially decoupled and temporally coupled systems forms the group of
models for meeting-oriented coordination. In referentially decoupled systems, processes do
not know each other explicitly.
6. In other words, when a process wants to coordinate its activities with other processes, it
cannot directly refer to another process. Instead, there is the concept of a meeting in which
processes temporarily group together to coordinate their activities. The model prescribes that
the meeting processes execute at the same time.
7. Meeting-based systems are often implemented by means of events, like the ones supported
by object-based distributed systems. In this unit, we discuss another mechanism for
implementing meetings, namely publish/subscribe systems.
In these systems, processes can subscribe to messages containing information on specific subjects,
while other processes produce (i.e., publish) such messages. Most publish/subscribe systems require
that communicating processes are active at the same time; hence, there is a temporal coupling.
8. The key idea in generative communication is that a collection of independent processes makes
use of a shared persistent dataspace of tuples. Tuples are tagged data records consisting of a
number (possibly zero) of typed fields. Processes can put any type of record into the shared
dataspace (i.e., they generate communication records). Unlike the case with blackboards,
there is no need to agree in advance on the structure of tuples.
In other words, when a process wants to extract a tuple from the dataspace, it essentially specifies
(some of) the values of the fields it is interested in. Any tuple that matches that specification is
then removed from the dataspace and passed to the process. If no match is found, the
process can choose to block until a matching tuple appears.
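To make this concrete, here is a minimal sketch of a Linda-style tuple space in Python. The class and its out/in_ operations are illustrative assumptions, not the API of any particular middleware; None acts as a wildcard field in templates.

```python
import threading

class TupleSpace:
    """A minimal Linda-style shared dataspace (illustrative sketch)."""

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Publish a tuple into the shared dataspace."""
        with self._cond:
            self._tuples.append(tup)
            self._cond.notify_all()   # wake any blocked readers

    def _match(self, template, tup):
        # A template matches when lengths agree and every non-None
        # field equals the corresponding tuple field (None = wildcard).
        return len(template) == len(tup) and all(
            t is None or t == f for t, f in zip(template, tup))

    def in_(self, template):
        """Extract (remove) a matching tuple, blocking until one exists."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._match(template, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()     # block until a new tuple is generated

# One process generates a tuple; another extracts it by partial match.
space = TupleSpace()
space.out(("temperature", "room-101", 21.5))
print(space.in_(("temperature", None, None)))  # -> ('temperature', 'room-101', 21.5)
```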
We note that generative communication and shared dataspaces are often also considered to be
forms of publish/subscribe systems. In what follows, we shall adopt this view as well.
1. Let us first assume that data items are described by a series of attributes. A data item is said to
be published when it is made available for other processes to read. To that end, a subscription
needs to be passed to the middleware, containing a description of the data items that the
subscriber is interested in. Such a description typically consists of some (attribute, value) pairs,
possibly combined with (attribute, range) pairs.
2. We are now confronted with a situation in which subscriptions need to be matched against
data items. When matching succeeds, there are two possible scenarios. In the first case, the
middleware may decide to forward the published data to its current set of subscribers, that is,
processes with a matching subscription.
3. As an alternative, the middleware can also forward a notification, at which point subscribers
can execute a read operation to retrieve the published data item.
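A minimal sketch of how such matching might be implemented follows; the dictionary-based representation of subscriptions and data items, and the function names, are assumptions for illustration only.

```python
def matches(subscription, item):
    """Check a data item (dict of attribute -> value) against a subscription
    containing exact (attribute, value) pairs and, by convention here,
    (attribute, (low, high)) tuples for (attribute, range) pairs."""
    for attr, constraint in subscription.items():
        if attr not in item:
            return False
        if isinstance(constraint, tuple):      # (attribute, range) pair
            low, high = constraint
            if not (low <= item[attr] <= high):
                return False
        elif item[attr] != constraint:         # exact (attribute, value) pair
            return False
    return True

def publish(item, subscribers):
    """Forward the item to every subscriber with a matching subscription."""
    for callback, subscription in subscribers:
        if matches(item=item, subscription=subscription):
            callback(item)

# Subscribe to room-status items whose temperature lies in [18, 24].
subscribers = [(print, {"type": "room-status", "temperature": (18, 24)})]
publish({"type": "room-status", "room": "R4.20", "temperature": 21}, subscribers)
```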
1. In those cases in which data items are immediately forwarded to subscribers, the middleware
will generally not offer storage of data. Storage is either explicitly handled by a separate
service, or is the responsibility of subscribers. In other words, we have a referentially
decoupled, but temporally coupled system.
2. This situation is different when notifications are sent so that subscribers need to explicitly
read the published data. Necessarily, the middleware will have to store data items. In these
situations there are additional operations for data management.
3. It is also possible to attach a lease to a data item such that when the lease expires, the
data item is automatically deleted (see the sketch after this list).
4. In the model described so far, we have assumed that there is a fixed set of n attributes a1, ...,
an that is used to describe data items. In particular, each published data item is assumed to
have an associated vector of (attribute, value) pairs.
5. Events complicate the processing of subscriptions. To illustrate, consider a subscription such
as "notify when room R4.20 is unoccupied and the door is unlocked."
6. Following the approach sketched so far, we would need to compose such primitive events
into a publishable data item to which processes can then subscribe.
7. Clearly, in coordination-based systems such as these, the crucial issue is the efficient and
scalable implementation of matching subscriptions to data items, along with the construction
of relevant data items.
8. Viewed from the outside, a coordination approach offers considerable potential for building very
large-scale distributed systems, due to the strong decoupling of processes.
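As announced in item 3 above, a lease can be modeled as an expiry timestamp stored with each data item; the following is a small illustrative sketch (the class and its operations are assumptions, not a real middleware API).

```python
import time

class LeasedStore:
    """Stores data items together with a lease; expired items are purged."""

    def __init__(self):
        self._items = []                       # list of (expiry_time, item)

    def put(self, item, lease_seconds):
        """Store an item whose lease expires lease_seconds from now."""
        self._items.append((time.time() + lease_seconds, item))

    def read_all(self):
        """Drop items whose lease has expired, then return the rest."""
        now = time.time()
        self._items = [(exp, it) for exp, it in self._items if exp > now]
        return [it for _, it in self._items]

store = LeasedStore()
store.put({"subject": "news"}, lease_seconds=0.1)
time.sleep(0.2)
print(store.read_all())                        # -> [] (lease expired)
```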
Traditional Architectures
1. The simplest solution for matching data items against subscriptions is to have a centralized
client-server architecture.
2. Likewise, implementations of the more elaborate generative communication models, such as
Jini (Sun Microsystems, 2005b) and JavaSpaces (Freeman et al., 1999), are mostly based on
central servers.
Example: TIB/Rendezvous
Example: Lime
1. One important problem that needs to be handled when publish/subscribe systems are spread
across a wide-area network is that published data should reach only the relevant subscribers.
2. Crucial in this setup is that routers can take routing decisions by considering the content of a
message.
3. More precisely, it is assumed that each message carries a description of its content, and that
this description can be used to cut off routes for which it is known that they do not lead to
receivers interested in that message.
4. Each application passes its subscription to a server; the server, in turn, will notify the
application when relevant data has arrived.
5. Such a system can be realized with a two-layered routing scheme, in which the lowest layer
consists of a shared broadcast tree connecting the N servers.
6. There are various ways of setting up such a tree, ranging from network-level multicast
support to application-level multicast trees.
7. Here, we also assume that such a tree has been set up with the N servers as end nodes, along
with a collection of intermediate nodes forming the routers.
8. Consider first two extremes for content-based routing, assuming we need to support only
simple subject-based publish/subscribe, in which each message is tagged with a unique (non-
compound) keyword.
9. One extreme solution is to send each published message to every server, and subsequently let
each server check whether any of its clients have subscribed to the subject of that message. In
essence, this is the approach followed in TIB/Rendezvous. The other extreme is to broadcast
every subscription to all servers, so that published messages need be forwarded only toward
servers with a matching subscription.
Taking this last approach as our starting point, we can refine the capabilities of routers for deciding
where to forward messages. To that end, each server broadcasts its subscription across the
network so that routers can compose routing filters. For example, assume that node 3 subscribes to
messages for which an attribute a lies in the range [0,3], but that node 4 wants messages with a in
[2,5]. In this case, router R1 will create a routing filter as a table with an entry for each of its outgoing
links (in this case three: one to node 3, one to node 4, and one toward router R2).
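The routing filter from this example can be sketched as follows. The link names, the interval representation, and the route function are illustrative assumptions, mirroring the [0,3] and [2,5] subscriptions above.

```python
def overlaps(r1, r2):
    """Two closed intervals [lo, hi] overlap iff neither ends before the other starts."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

# Router R1's filter: for each outgoing link, the ranges of attribute `a`
# that some subscriber reachable via that link is interested in.
routing_filter = {
    "link-to-node3": [(0, 3)],   # node 3 subscribed to a in [0, 3]
    "link-to-node4": [(2, 5)],   # node 4 subscribed to a in [2, 5]
    "link-to-R2":    [],         # no interested subscribers downstream
}

def route(message_a_range, routing_filter):
    """Forward a message only on links whose subscriptions overlap its content."""
    return [link for link, ranges in routing_filter.items()
            if any(overlaps(message_a_range, r) for r in ranges)]

# A message whose attribute a equals 4 is forwarded only toward node 4.
print(route((4, 4), routing_filter))   # -> ['link-to-node4']
```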
Staying updated with the latest trends in distributed systems is crucial for several reasons:
• Performance Optimization: New trends often bring improvements in efficiency and scalability,
helping to enhance system performance and manage growing workloads.
• Security Enhancements: Emerging trends can introduce advanced security measures and
protocols to protect against evolving cyber threats.
• Cost Efficiency: Innovations in distributed systems can lead to more cost-effective solutions by
optimizing resource usage and reducing operational expenses.
• Integration between cloud computing and distributed systems involves combining the
principles and technologies of both to create efficient, scalable, and resilient computing
environments. The following describes how these integrations work and why they matter:
• Cloud computing platforms often use distributed systems principles to deliver their services.
For example:
• Grid computing can be defined as a network of computers working together to perform a task
that would be difficult for a single machine to handle.
• All machines on that network work under the same protocol to act as a virtual supercomputer.
• The tasks that they work on may include analyzing huge datasets or simulating situations that
require high computing power. Computers on the network contribute resources like
processing power and storage capacity to the network.
• Grid computing is a subset of distributed computing, where a virtual supercomputer
comprises machines on a network connected by some bus, mostly Ethernet or sometimes the
Internet.
• It can also be seen as a form of parallel computing where, instead of many CPU cores on a
single machine, multiple cores are spread across various locations.
• The concept of grid computing isn't new, but it is not yet perfected, as no standard
rules and protocols have been established and widely accepted.
Virtualization
Advantages of virtualization
More flexible and efficient
Low cost
Cost per usage
Enables running multiple operating systems
Types of virtualization
1. Application virtualization
2. Network virtualization
3. Desktop virtualization
4. Storage virtualization
1. Application virtualization
It allows an application to run in an encapsulated form, without being installed on the
underlying operating system.
2. Network virtualization
The ability to run multiple virtual networks, each with a separate control and data plane.
E.g.: VPN
3. Desktop virtualization
It allows the user's OS to be stored on a remote server and controlled from any device or
machine connected to it.
4. Storage virtualization
It pools an array of servers that are managed by a virtual storage system. Users do not know
exactly where their data is stored, but they can read it and perform any operation on it.
E.g.: AWS, Microsoft Azure
SELF-MANAGEMENT IN DISTRIBUTED SYSTEMS
• These are systems that can adjust themselves when something happens; they
automatically adapt by means of:
– Self-configuration
– Self-management
– Self-healing
– Self-optimization
Service-Oriented Architecture (SOA)
• Service-Oriented Architecture (SOA) is a stage in the evolution of application development
and/or integration. It defines a way to make software components reusable through well-
defined interfaces.
• SOA allows users to combine a large number of facilities from existing services to form
applications.
• SOA encompasses a set of design principles that structure system development and provide
means for integrating components into a coherent and decentralized system.
• SOA-based computing packages functionalities into a set of interoperable services, which can
be integrated into different software systems belonging to separate business domains.
Characteristics of SOA
Future of Distributed Systems
The future of distributed systems is poised to be shaped by several emerging trends and
technological advancements. Here are some key directions in which distributed systems are likely to
evolve:
– Zero Trust Architectures: Widespread adoption of zero trust models that enforce
strict identity verification and continuous monitoring.
– Open Standards and APIs: Growth of open standards and APIs to facilitate easier
integration and communication between diverse systems and platforms.
Big Data: Data that is very large in size is called Big Data. Normally we work with data of size
MB (Word documents, Excel sheets) or at most GB (movies, code), but data on the order of petabytes,
i.e., 10^15 bytes, is called Big Data.
Issue: Huge amount of unstructured data which needs to be stored, processed and analyzed.
Solution
• Storage: To store this huge amount of data, Hadoop uses HDFS (Hadoop Distributed File System),
which uses commodity hardware to form clusters and stores data in a distributed fashion. It works
on the write-once, read-many-times principle.
• Processing: The MapReduce paradigm is applied to the data distributed over the network to
compute the required output.
1. Hadoop: An open-source framework written in Java, inspired by Google's MapReduce and GFS
papers, and used by companies such as Facebook, LinkedIn, Yahoo, and Twitter.
2. MapReduce:
1. A distributed data-processing model and execution environment that runs on large
clusters of commodity machines.
2. It is a framework that helps Java programs perform parallel computation on data
using key-value pairs. The Map task takes input data and converts it into a data set
that can be computed over as key-value pairs.
3. The output of the Map task is consumed by the Reduce task, and the output of the
reducer gives the desired result (see the word-count sketch after this list).
3. HDFS: The Hadoop Distributed File System (HDFS) is a distributed file system for Hadoop that
runs on large clusters of commodity machines.
4. YARN: Yet Another Resource Negotiator, used for job scheduling and managing the cluster.
5. Hadoop Ecosystem:
1. Pig: A data-flow language and execution environment for exploring very large
datasets. Pig runs on HDFS and MapReduce clusters.
2. Sqoop: A tool designed for efficiently transferring bulk data between Apache Hadoop
and structured data stores such as relational databases.
3. Flume: A distributed, reliable, and available service for efficiently collecting,
aggregating, and moving large amounts of log data. It has a simple and flexible
architecture based on streaming data flows.
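As a concrete illustration of the Map and Reduce tasks referred to in item 2 above, here is the canonical word-count example written in the style of Hadoop Streaming, which lets executables such as Python scripts act as mapper and reducer over tab-separated key-value pairs; the file names are assumptions.

```python
#!/usr/bin/env python3
# mapper.py -- emits one (word, 1) pair per word read from stdin.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- Hadoop delivers reducer input sorted by key,
# so all pairs with equal keys arrive adjacent to each other.
import sys
from itertools import groupby

pairs = (line.rstrip("\n").split("\t", 1) for line in sys.stdin)
for word, group in groupby(pairs, key=lambda kv: kv[0]):
    total = sum(int(count) for _, count in group)
    print(f"{word}\t{total}")
```

Locally, the pipeline can be approximated with `cat input.txt | python3 mapper.py | sort | python3 reducer.py`; on a cluster, Hadoop performs the sort-and-shuffle step between the two phases and distributes the tasks over many machines.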
Scalability is a key feature of MapReduce, a programming model and computational framework that
processes and generates large datasets:
• MapReduce scales by distributing tasks across many nodes in a cluster. This allows it to handle
massive datasets, making it suitable for Big Data applications.
• MapReduce divides tasks into two phases: mapping and reducing. The mapping phase
processes and sorts data, while the reducing phase aggregates and summarizes the results.
• A MapReduce job is implemented by reading a large dataset, applying a user-defined map
function, applying a user-defined reduce function, and returning the resulting data.
MapReduce Data Flow
• The map phase takes data in the form of pairs and returns a list of <key, value> pairs. The keys
need not be unique at this stage.
• Using the output of Map, the Hadoop architecture applies a sort-and-shuffle step. Sort and
shuffle acts on the list of <key, value> pairs and sends out each unique key together with the
list of values associated with it: <key, list(values)>.
• The output of sort and shuffle is sent to the reducer phase. The reducer performs a defined
function on the list of values for each unique key, and the final <key, value> output is
stored or displayed.
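The three phases just described can be simulated on a single machine in a few lines; the functions below are an illustrative model of the data flow, not Hadoop's actual API.

```python
from collections import defaultdict

def map_phase(records, map_fn):
    """Apply the user map function; keys in the output need not be unique."""
    pairs = []
    for record in records:
        pairs.extend(map_fn(record))
    return pairs

def sort_and_shuffle(pairs):
    """Group values by key: [(k, v), ...] -> <key, list(values)>."""
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped, reduce_fn):
    """Apply the user reduce function to each unique key's value list."""
    return {key: reduce_fn(key, values) for key, values in grouped.items()}

# Word count: map each line to (word, 1) pairs; reduce sums the 1s per word.
lines = ["deer bear river", "car car river", "deer car bear"]
pairs = map_phase(lines, lambda line: [(w, 1) for w in line.split()])
grouped = sort_and_shuffle(pairs)              # e.g. 'car' -> [1, 1, 1]
print(reduce_phase(grouped, lambda k, vs: sum(vs)))
# -> {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```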
Hive
1. Hive is a data warehouse system used to analyze structured data. It is built on top
of Hadoop. It was developed by Facebook.
2. Hive provides the functionality of reading, writing, and managing large datasets residing in
distributed storage. It runs SQL-like queries, called HQL (Hive Query Language), which get
internally converted into MapReduce jobs (see the sketch below).
3. Using Hive, we can skip the traditional approach of writing complex
MapReduce programs. Hive supports Data Definition Language (DDL), Data Manipulation
Language (DML), and User-Defined Functions (UDFs).
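As a small illustration of HQL, the following sketch submits a DDL statement and a query to Hive from Python. It assumes a reachable HiveServer2 instance and uses the third-party pyhive client; the host, port, table name, and schema are all assumptions.

```python
from pyhive import hive  # third-party HiveServer2 client (assumed installed)

# Connection details are illustrative assumptions.
conn = hive.Connection(host="localhost", port=10000, database="default")
cursor = conn.cursor()

# DDL: define a table over data already residing in distributed storage.
cursor.execute("""
    CREATE TABLE IF NOT EXISTS logs (ts STRING, level STRING, msg STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
""")

# HQL query: Hive compiles this into MapReduce (or Tez/Spark) jobs.
cursor.execute("SELECT level, COUNT(*) FROM logs GROUP BY level")
for level, count in cursor.fetchall():
    print(level, count)
```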
Features:
I. It provides SQL-like queries (i.e., HQL) that are implicitly transformed into MapReduce or
Spark jobs.
II. It allows different storage types such as plain text, RCFile, and HBase.
III. It supports user-defined functions (UDFs), through which users can provide custom functionality.
Limitations of Hive
I. Hive is not capable of handling real-time data, and it is not designed for online
transaction processing.
Apache Pig
1. Apache Pig is a high-level data flow tool/platform for executing MapReduce programs of
Hadoop. The language used for Pig is Pig Latin.
2. Pig scripts get internally converted into MapReduce jobs and are executed on data stored in
HDFS. Apart from that, Pig can also execute its jobs in Apache Tez or Apache Spark.
3. Pig can handle any type of data, i.e., structured, semi-structured, or unstructured, and stores
the corresponding results in the Hadoop Distributed File System. Every task that can be achieved
using Pig can also be achieved by writing MapReduce programs in Java.
I. Less code - Pig requires fewer lines of code to perform any operation.
II. Nested data types - Pig provides useful nested data types such as tuple, bag, and
map.