
Module – 4

Data-Intensive Computing: MapReduce Programming

Data-intensive computing focuses on a class of applications that deal with a large amount of data. Several
application fields, ranging from computational science to social networking, produce large volumes of data that
need to be efficiently stored, made accessible, indexed, and analyzed.
Distributed computing helps address these challenges by providing more scalable and efficient storage architectures and better performance for data computation and processing.
This chapter characterizes the nature of data-intensive computing, presents an overview of the challenges introduced by the production of large volumes of data, and explains how they are handled by storage systems and computing models. It describes MapReduce, a popular programming model for creating data-intensive applications and deploying them on clouds.

What is data-intensive computing?


Data-intensive computing is concerned with production, manipulation, and analysis of large-scale data in
the range of hundreds of megabytes (MB) to petabytes (PB) and beyond.
The term dataset is commonly used to identify a collection of information elements that is relevant to one or more
applications. Datasets are often maintained in repositories, which are infrastructures supporting the storage,
retrieval, and indexing of large amounts of information.
To facilitate classification and search, relevant bits of information, called metadata, are attached to
datasets. Data-intensive computations occur in many application domains.
Computational science is one of the most popular ones. People conducting scientific simulations and
experiments are often keen to produce, analyze, and process huge volumes of data. Hundreds of gigabytes of
data are produced every second by telescopes mapping the sky; the collection of images of the sky easily
reaches the scale of petabytes over a year.
Bioinformatics applications mine databases that may end up containing terabytes of data.
Earthquake simulators process a massive amount of data, which is produced as a result of recording the
vibrations of the Earth across the entire globe.

Characterizing data-intensive computations


Challenges ahead
Historical perspective
1 The early age: high-speed wide-area networking
2 Data grids
3 Data clouds and “Big Data”
4 Databases and data-intensive computing

Characterizing data-intensive computations


Data-intensive applications not only deal with huge volumes of data but may also exhibit compute-intensive properties.
Figure 8.1 identifies the domain of data-intensive computing in the two upper quadrants of the graph.
Data-intensive applications handle datasets on the scale of multiple terabytes and petabytes.
Challenges ahead
The huge amount of data produced, analyzed, or stored imposes requirements on the supporting infrastructures and middleware that are hardly found in traditional solutions.
Moving terabytes of data becomes an obstacle for high-performance computations.
Data partitioning, content replication, and scalable algorithms help improve performance.
Open challenges in data-intensive computing identified by Ian Gorton et al. are:
1. Scalable algorithms that can search and process massive datasets.
2. New metadata management technologies that can handle complex, heterogeneous, and distributed data sources.
3. Advances in high-performance computing platforms aimed at providing better support for accessing in-memory multiterabyte data structures.
4. High-performance, highly reliable, petascale distributed file systems.
5. Data signature-generation techniques for data reduction and rapid processing.
6. Software mobility that is able to move the computation to where the data are located.
7. Interconnection architectures that provide better support for filtering multigigabyte data streams coming from high-speed networks and scientific instruments.
8. Software integration techniques that facilitate the combination of software modules running on different platforms to quickly form analytical pipelines.

Historical Perspective
Data-intensive computing involves the production, management, and analysis of large volumes of data.
Support for data-intensive computations is provided by harnessing storage, networking technologies, algorithms, and infrastructure software together.
1 The early age: high-speed wide-area networking
In 1989, the first experiments in high-speed networking as a support for remote visualization of scientific
data led the way.
Two years later, the potential of using high-speed wide area networks for enabling high-speed, TCP/IP-based
distributed applications was demonstrated at Supercomputing 1991 (SC91).
The Kaiser project leveraged the Wide Area Large Data Object (WALDO) system to provide the following capabilities:
1. automatic generation of metadata;
2. automatic cataloguing of data and metadata while processing the data in real time;
3. facilitation of cooperative research by providing local and remote users access to data; and
4. mechanisms to incorporate data into databases and other documents.
The Distributed Parallel Storage System (DPSS) was developed and later used to support TerraVision, a terrain visualization application that lets users explore and navigate a three-dimensional, real landscape.
The Clipper project had the goal of designing and implementing a collection of independent, architecturally consistent service components to support data-intensive computing. The challenges addressed by the Clipper project include management of computing resources, generation or consumption of high-rate and high-volume data flows, human interaction management, and aggregation of resources.

2 Data grids
Huge computational power and storage facilities could be obtained by harnessing heterogeneous resources
across different administrative domains.
Data grids emerge as infrastructures that support data-intensive computing.
A data grid provides services that help users discover, transfer, and manipulate large datasets stored in
distributed repositories as well as create and manage copies of them.
Data grids offer two main functionalities
● high-performance and reliable file transfer for moving large amounts of data, and
● scalable replica discovery and management mechanisms.
Data grids mostly provide storage and dataset management facilities as support for scientific experiments that
produce huge volumes of data.
Datasets are replicated by the infrastructure to provide better availability.
Data grids have their own characteristics and introduce new challenges:
1. Massive datasets. The size of datasets can easily be on the scale of gigabytes, terabytes, and beyond. It is
therefore necessary to minimize latencies during bulk transfers, replicate content with appropriate strategies,
and manage storage resources.
2. Shared data collections. Resource sharing includes distributed collections of data. For example,
repositories can be used to both store and read data.
3. Unified namespace. Data grids impose a unified logical namespace in which to locate data collections and resources. Every data element has a single logical name, which is eventually mapped to different physical filenames for the purposes of replication and accessibility.
4. Access restrictions. Even though one of the purposes of data grids is to facilitate sharing of results and
data for experiments, some users might want to ensure confidentiality for their data and restrict access to
them to their collaborators. Authentication and authorization in data grids involve both coarse-grained and
fine-grained access control over shared data collections.
As a result, several scientific research fields, including high-energy physics, biology, and astronomy, leverage
data grids.
3 Data clouds and “Big Data”
Together with the diffusion of cloud computing technologies that support data-intensive computations, the term
Big Data has become popular. Big Data characterizes the nature of data-intensive computations today and
currently identifies datasets that grow so large that they become complex to work with using on-hand database
management tools.
In general, the term Big Data applies to datasets of which the size is beyond the ability of commonly used
software tools to capture, manage, and process within a tolerable elapsed time. Therefore, Big Data sizes are a
constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single dataset.
Cloud technologies support data-intensive computing in several ways:
1. By providing a large number of compute instances on demand, which can be used to process and analyze large datasets in parallel.
2. By providing storage systems optimized for keeping large blobs of data and other distributed data store architectures.
3. By providing frameworks and programming APIs optimized for the processing and management of large amounts of data.
A data cloud is a combination of these components.
Ex 1: MapReduce framework, which provides the best performance for leveraging the Google File System on
top of Google’s large computing infrastructure.
Ex 2: Hadoop system, the most mature, large, and open-source data cloud. It consists of the Hadoop Distributed
File System (HDFS) and Hadoop’s implementation of MapReduce.
Ex 3: Sector, consists of the Sector Distributed File System (SDFS) and a compute service called Sphere that
allows users to execute arbitrary user-defined functions (UDFs) over the data managed by SDFS.
Ex 4: Greenplum uses a shared-nothing massively parallel processing (MPP) architecture based on commodity
hardware.

4 Databases and data-intensive computing


Distributed databases are a collection of data stored at different sites of a computer network. Each site might
expose a degree of autonomy, providing services for the execution of local applications, but also participating
in the execution of a global application.
A distributed database can be created by splitting and scattering the data of an existing database over different
sites or by federating together multiple existing databases. These systems are very robust and provide distributed
transaction processing, distributed query optimization, and efficient management of resources.

Technologies for Data-intensive computing


Data-intensive computing concerns the development of applications that are mainly focused on processing large
quantities of data.
Therefore, storage systems and programming models constitute a natural classification of the technologies
supporting data-intensive computing.

Storage systems
1. High-performance distributed file systems and storage clouds
2. NoSQL systems
Programming platforms
1. The MapReduce programming model.
2. Variations and extensions of MapReduce.
3. Alternatives to MapReduce.
8.2.1 Storage systems
Traditionally, database management systems constituted the de facto storage support for applications.
Due to the explosion of unstructured data in the form of blogs, Web pages, software logs, and sensor readings,
the relational model in its original formulation does not seem to be the preferred solution for supporting data
analytics on a large scale.
Some factors contributing to this change are:
A. Growing popularity of Big Data. The management of large quantities of data is no longer a rare case but has become common in several fields: scientific computing, enterprise applications, media entertainment, natural language processing, and social network analysis.
B. Growing importance of data analytics in the business chain. The management of data is no longer
considered a cost but a key element of business profit. This situation arises in popular social networks such
as Facebook, which concentrate their focus on the management of user profiles, interests, and connections
among people.
C. Presence of data in several forms, not only structured. As previously mentioned, what constitutes
relevant information today exhibits a heterogeneous nature and appears in several forms and formats.
D. New approaches and technologies for computing. Cloud computing promises access to a massive amount of computing capacity on demand. This allows engineers to design software systems that incrementally scale to arbitrary degrees of parallelism.

1. High-performance distributed file systems and storage clouds


Distributed file systems constitute the primary support for data management. They provide an interface with which to store information in the form of files and later access it for read and write operations.
a. Lustre. The Lustre file system is a massively parallel distributed file system that covers needs ranging from a small workgroup of clusters to a large-scale computing cluster. The file system is used by several of the Top 500 supercomputing systems.
Lustre is designed to provide access to petabytes (PBs) of storage to serve thousands of clients with an I/O
throughput of hundreds of gigabytes per second (GB/s). The system is composed of a metadata server that
contains the metadata about the file system and a collection of object storage servers that are in charge of
providing storage.
b. IBM General Parallel File System (GPFS). GPFS is a high-performance distributed file system developed by IBM that provides support for the RS/6000 supercomputer and Linux computing clusters. GPFS is a multiplatform distributed file system built over several years of academic research and provides advanced recovery mechanisms. GPFS is built on the concept of shared disks, in which a collection of disks is attached to the file system nodes by means of a switching fabric. The file system makes this infrastructure transparent to users and stripes large files over the disk array, replicating portions of each file to ensure high availability.
c. Google File System (GFS). GFS is the storage infrastructure that supports the execution of distributed
applications in Google’s computing cloud.
GFS is designed with the following assumptions:
1. The system is built on top of commodity hardware that often fails.
2. The system stores a modest number of large files; multi-GB files are common and should be treated efficiently. Small files must be supported, but there is no need to optimize for them.
3. The workloads primarily consist of two kinds of reads: large streaming reads and small random reads.
4. The workloads also have many large, sequential writes that append data to files.
5. High-sustained bandwidth is more important than low latency.
The architecture of the file system is organized into a single master, which contains the metadata of the entire file system, and a collection of chunk servers, which provide storage space. From a logical point of view, the system is composed of a collection of software daemons, which implement either the master server or the chunk server.
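As a rough illustration of this split, the toy sketch below (not Google's actual protocol or code; the file path, chunk handles, and server names are hypothetical) shows a read being resolved by first asking the master for chunk metadata and then fetching the bytes from a chunk server.

# Toy sketch of the GFS master/chunk-server split (illustrative only):
# the master holds metadata, the chunk servers hold the actual chunk data.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses large fixed-size chunks (64 MB)

master_metadata = {
    # file path -> list of (chunk handle, replica locations)
    "/logs/2024/events.log": [("chunk-0001", ["cs-12", "cs-27", "cs-31"])],
}

chunk_servers = {
    "cs-12": {"chunk-0001": b"...chunk bytes..."},
}

def read(path, offset):
    chunk_index = offset // CHUNK_SIZE
    handle, locations = master_metadata[path][chunk_index]   # 1. ask the master for metadata
    replica = chunk_servers[locations[0]]                     # 2. contact a chunk server directly
    return replica[handle][offset % CHUNK_SIZE:]

print(read("/logs/2024/events.log", 0))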
d. Sector. Sector is the storage cloud that supports the execution of data-intensive applications defined according to the Sphere framework. It is a user-space file system that can be deployed on commodity hardware across a wide-area network. The system's architecture is composed of four kinds of nodes: a security server, one or more master nodes, slave nodes, and client machines. The security server maintains all the information about access control policies for users and files, whereas master servers coordinate and serve the I/O requests of clients, which ultimately interact with slave nodes to access files. The protocol used to exchange data with slave nodes is UDT, a lightweight connection-oriented protocol.
e. Amazon Simple Storage Service (S3). Amazon S3 is the online storage service provided by Amazon.
The system offers a flat storage space organized into buckets, which are attached to an Amazon Web Services
(AWS) account. Each bucket can store multiple objects, each identified by a unique key. Objects are identified
by unique URLs and exposed through HTTP, thus allowing very simple get-put semantics.
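As a hedged illustration of this get/put model, the short Python sketch below uses the boto3 client for AWS; the bucket name and object key are hypothetical placeholders, and credentials are assumed to be already configured in the environment.

# Minimal sketch of S3's bucket/key model with get/put semantics (boto3).
import boto3

s3 = boto3.client("s3")

# Store an object in a bucket under a unique key.
s3.put_object(Bucket="example-bucket", Key="notes/module4.txt",
              Body=b"data-intensive computing notes")

# Retrieve the same object; the response body is a readable stream.
response = s3.get_object(Bucket="example-bucket", Key="notes/module4.txt")
content = response["Body"].read()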

2. NoSQL systems
The term Not Only SQL (NoSQL) was coined in 1998 to identify a set of UNIX shell scripts and commands used to operate on text files containing the actual data. In its original form, NoSQL was not a relational database but a collection of scripts that allowed users to manage most of the simplest and most common database tasks by using text files as information stores.
Two main factors have determined the growth of NoSQL:
1. simple data models are enough to represent the information used by applications, and
2. the quantity of information contained in unstructured formats has grown.

Let us now examine some prominent implementations that support data-intensive applications.
a. Apache CouchDB and MongoDB.
Apache CouchDB and MongoDB are two examples of document stores. Both provide a schema-less store
whereby the primary objects are documents organized into a collection of key-value fields. The value of each
field can be of type string, integer, float, date, or an array of values.
The databases expose a RESTful interface and represent data in JSON format. Both allow querying and
indexing data by using the MapReduce programming model, expose JavaScript as a base language for data
querying and manipulation rather than SQL, and support large files as documents.
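As an illustration of the schema-less, document-oriented model, the sketch below uses the pymongo driver for MongoDB; the database name, collection name, and document fields are hypothetical.

# Minimal sketch of a document store: documents are key-value fields, no fixed schema.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["notes_db"]

db.articles.insert_one({
    "title": "Data-intensive computing",
    "tags": ["mapreduce", "big data"],
    "views": 42,
})

# Fields can be indexed and queried directly.
db.articles.create_index("tags")
for doc in db.articles.find({"tags": "mapreduce"}):
    print(doc["title"], doc["views"])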
b. Amazon Dynamo.
The main goal of Dynamo is to provide an incrementally scalable and highly available storage system. This
goal helps in achieving reliability at a massive scale, where thousands of servers and network components build
an infrastructure serving 10 million requests per day. Dynamo provides a simplified interface based on get/put
semantics, where objects are stored and retrieved with a unique identifier (key).
The architecture of the Dynamo system, shown in Figure 8.3, is composed of a collection of storage peers
organized in a ring that shares the key space for a given application. The key space is partitioned among the
storage peers, and the keys are replicated across the ring, avoiding adjacent peers. Each peer is configured with
access to a local storage facility where original objects and replicas are stored.
Each peer provides facilities for propagating updates around the ring and for detecting failures and unreachable nodes.
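The sketch below is a toy model (not Amazon's implementation) of how keys can be placed on a ring of storage peers: each key is hashed onto the ring and stored on the next N peers found walking clockwise; the real system applies further placement rules, such as skipping peers, to spread replicas.

# Toy hash-ring placement: peers share the key space, keys are replicated on N peers.
import hashlib
from bisect import bisect_right

def ring_hash(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, peers, replicas=3):
        self.replicas = replicas
        self.ring = sorted((ring_hash(p), p) for p in peers)

    def preference_list(self, key):
        # Find the key's position on the ring and walk clockwise to pick N peers.
        positions = [pos for pos, _ in self.ring]
        start = bisect_right(positions, ring_hash(key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1]
                for i in range(self.replicas)]

ring = Ring(["peer-a", "peer-b", "peer-c", "peer-d", "peer-e"])
print(ring.preference_list("user:1234"))  # the peers responsible for this key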
c. Google Bigtable.
Bigtable provides storage support for several Google applications that expose different types of workload: from
throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users.
Bigtable’s key design goals are wide applicability, scalability, high performance, and high availability. To
achieve these goals, Bigtable organizes the data storage in tables of which the rows are distributed over the
distributed file system supporting the middleware, which is the Google File System.
From a logical point of view, a table is a multidimensional sorted map indexed by a key that is represented by a string of arbitrary length. A table is organized into rows and columns; columns can be grouped into column families, which allow specific optimizations for access control, storage, and indexing of data. The Bigtable APIs also allow more complex operations, such as single-row transactions and advanced data manipulation.
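A toy sketch of this logical model is given below (not Google's code; row keys and column names are hypothetical): a map of rows kept in sorted order, with each row holding column families of qualifier/value pairs.

# Toy sketch of Bigtable's logical model: a sorted map indexed by row key,
# with values grouped into column families (timestamps omitted for brevity).
table = {}  # row_key -> {column_family -> {qualifier -> value}}

def put(row, family, qualifier, value):
    table.setdefault(row, {}).setdefault(family, {})[qualifier] = value

def scan(start_row, end_row):
    # Rows are served in sorted order; a contiguous range of rows forms a tablet.
    for row in sorted(table):
        if start_row <= row < end_row:
            yield row, table[row]

put("com.example/page1", "contents", "html", "<html>...</html>")
put("com.example/page2", "anchor", "referrer", "link text")
for row, families in scan("com.example/", "com.example0"):
    print(row, families)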

Figure 8.4 gives an overview of the infrastructure that enables Bigtable.


The service is the result of a collection of processes that coexist with other processes in a cluster-based
environment. Bigtable identifies two kinds of processes: master processes and tablet server processes. A tablet
server is responsible for serving the requests for a given tablet that is a contiguous partition of rows of a table.
Each server can manage multiple tablets (commonly from 10 to 1,000). The master server is responsible for
keeping track of the status of the tablet servers and of the allocation of tablets to tablet servers. The master constantly monitors the tablet servers to check whether they are alive; when a tablet server becomes unreachable, its tablets are reassigned, and possibly repartitioned, to other servers.
d. Apache Cassandra.
The system is designed to avoid a single point of failure and offer a highly reliable service. Cassandra was
initially developed by Facebook; now it is part of the Apache incubator initiative. Currently, it provides storage
support for several very large Web applications such as Facebook itself, Digg, and Twitter.
The data model exposed by Cassandra is based on the concept of a table that is implemented as a distributed
multidimensional map indexed by a key. The value corresponding to a key is a highly structured object and
constitutes the row of a table. Cassandra organizes the rows of a table into columns, and sets of columns can be
grouped into column families. The APIs provided by the system to access and manipulate the data are very
simple: insertion, retrieval, and deletion. The insertion is performed at the row level; retrieval and deletion can
operate at the column level.
e. Hadoop HBase.
HBase is designed by taking inspiration from Google Bigtable; its main goal is to offer real-time read/write
operations for tables with billions of rows and millions of columns by leveraging clusters of commodity
hardware. The internal architecture and logic model of HBase is very similar to Google Bigtable, and the entire
system is backed by the Hadoop Distributed File System (HDFS).
8.2.2 Programming platforms
Programming platforms for data-intensive computing provide higher-level abstractions that focus on the processing of data and move the management of data transfers into the runtime system, thus making the data always available where needed.
This is the approach followed by the MapReduce programming platform, which expresses the computation in the form of two simple functions, map and reduce, and hides the complexity of managing large and numerous data files in the distributed file system supporting the platform. In this section, we discuss the characteristics of MapReduce and present some variations of it.

1. The MapReduce programming model.


MapReduce expresses the computational logic of an application in two simple functions: map and reduce.
Data transfer and management are completely handled by the distributed storage infrastructure (i.e., the Google
File System), which is in charge of providing access to data, replicating files, and eventually moving them
where needed.
The MapReduce model is expressed in the form of two functions, which are defined as follows:

map: (k1, v1) -> list(k2, v2)
reduce: (k2, list(v2)) -> list(v2)
The map function reads a key-value pair and produces a list of key-value pairs of different types. The reduce
function reads a pair composed of a key and a list of values and produces a list of values of the same type. The
types (k1, v1, k2, v2) used in the expression of the two functions provide hints as to how these two functions are connected and are executed to carry out the computation of a MapReduce job: the output of map tasks is aggregated together by grouping the values according to their corresponding keys and constitutes the input of reduce tasks, which, for each of the keys found, reduce the list of attached values to a single value. Therefore, the input of a MapReduce computation is expressed as a collection of key-value pairs <k1, v1>, and the final output is represented by a list of values: list(v2).
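As a minimal word-count sketch of these signatures (illustrative Python, not Google's implementation), map_fn takes a document name and its text and emits (word, 1) pairs, while reduce_fn sums the counts attached to each word:

def map_fn(key, value):
    # key: document name (k1), value: document text (v1)
    return [(word, 1) for word in value.split()]      # list(k2, v2)

def reduce_fn(key, values):
    # key: a word (k2), values: all counts emitted for that word (list(v2))
    return [sum(values)]                               # list(v2)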
Figure 8.5 depicts a reference workflow characterizing MapReduce computations. As shown, the user submits
a collection of files that are expressed in the form of a list of < k1,v1 > pairs and specifies the map and reduce
functions. These files are entered into the distributed file system that supports MapReduce and, if necessary,
partitioned in order to be the input of map tasks. Map tasks generate intermediate files that store collections of <k2, list(v2)> pairs, and these files are saved into the distributed file system. These files constitute the input of reduce tasks, which finally produce output files in the form of list(v2).
The computation model expressed by MapReduce is very straightforward and allows greater productivity for
people who have to code the algorithms for processing huge quantities of data.
In general, any computation that can be expressed in the form of two major stages can be represented in terms
of MapReduce computation.
These stages are:
1. Analysis. This phase operates directly on the data input file and corresponds to the operation performed by
the map task. Moreover, the computation at this stage is expected to be embarrassingly parallel, since map
tasks are executed without any sequencing or ordering.
2. Aggregation. This phase operates on the intermediate results and is characterized by operations that are
aimed at aggregating, summing, and/or elaborating the data obtained at the previous stage to present the data in
their final form. This is the task performed by the reduce function.
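The toy driver below (an in-memory sketch, reusing map_fn and reduce_fn from the earlier word-count example) makes the two stages explicit: map over every input pair, group the intermediate pairs by key, then reduce each group.

from collections import defaultdict

def run_job(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for k1, v1 in inputs:                          # analysis: embarrassingly parallel
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)                  # group intermediate values by key
    return {k2: reduce_fn(k2, vs) for k2, vs in groups.items()}   # aggregation

docs = [("doc1", "big data big compute"), ("doc2", "data everywhere")]
print(run_job(docs, map_fn, reduce_fn))            # e.g. {'big': [2], 'data': [2], ...}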
Figure 8.6 gives a more complete overview of a MapReduce infrastructure, according to the implementation
proposed by Google.
As depicted, the user submits the execution of MapReduce jobs by using the client libraries that are in charge
of submitting the input data files, registering the map and reduce functions, and returning control to the user
once the job is completed. A generic distributed infrastructure (i.e., a cluster) equipped with job-scheduling
capabilities and distributed storage can be used to run MapReduce applications.
Two different kinds of processes are run on the distributed infrastructure:
a master process and
a worker process.
The master process is in charge of controlling the execution of map and reduce tasks, partitioning, and
reorganizing the intermediate output produced by the map task in order to feed the reduce tasks.
The master process generates the map tasks and assigns input splits to each of them by balancing the load.
The worker processes are used to host the execution of map and reduce tasks and provide basic I/O facilities
that are used to interface the map and reduce tasks with input and output files.
Worker processes have input and output buffers that are used to optimize the performance of map and reduce
tasks. In particular, output buffers for map tasks are periodically dumped to disk to create intermediate files.
Intermediate files are partitioned using a user-defined function to evenly split the output of map tasks.
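A sketch of this partitioning step is shown below; the default hash partitioner stands in for the user-defined function mentioned above (in a real system a hash that is stable across processes would be used).

def default_partition(key, num_reducers):
    # Stand-in for a user-defined partitioning function.
    return hash(key) % num_reducers

def partition_output(intermediate_pairs, num_reducers, partition=default_partition):
    buckets = [[] for _ in range(num_reducers)]
    for k2, v2 in intermediate_pairs:
        buckets[partition(k2, num_reducers)].append((k2, v2))
    return buckets   # one "intermediate file" per reduce task

print(partition_output([("big", 1), ("data", 1), ("big", 1)], num_reducers=2))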

2. Variations and extensions of MapReduce


MapReduce constitutes a simplified model for processing large quantities of data and imposes constraints on
the way distributed algorithms should be organized to run over a MapReduce infrastructure.
Therefore, a series of extensions to and variations of the original MapReduce model have been proposed. They
aim at extending the MapReduce application space and providing developers with an easier interface for
designing distributed algorithms.
We briefly present a collection of MapReduce-like frameworks and discuss how they differ from the
original MapReduce model.
A. Hadoop.
B. Pig.
C. Hive.
D. Map-Reduce-Merge.
E. Twister.

A. Hadoop.
Apache Hadoop is a collection of software projects for reliable and scalable distributed computing. The initiative consists mostly of two projects: the Hadoop Distributed File System (HDFS) and Hadoop MapReduce. The former is an implementation of the Google File System; the latter provides the same features and abstractions as Google MapReduce.
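Hadoop MapReduce jobs are normally written in Java, but the Hadoop Streaming utility lets any executable act as mapper or reducer by reading from standard input and writing tab-separated key-value pairs to standard output. A hedged word-count sketch in Python is shown below; the script name and the way it is invoked (one argument selecting map or reduce) are illustrative choices, not part of Hadoop itself.

# wordcount_streaming.py -- a Hadoop Streaming style mapper/reducer sketch.
# The reducer relies on the framework sorting intermediate pairs by key.
import sys

def map_stream():
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reduce_stream():
    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

if __name__ == "__main__":
    map_stream() if sys.argv[1] == "map" else reduce_stream()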
B. Pig.
Pig is a platform for the analysis of large datasets. Developed as an Apache project, Pig consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. Pig's infrastructure layer consists of a compiler for the high-level language that produces a sequence of MapReduce jobs that can be run on top of distributed infrastructures.
C. Hive.
Hive is another Apache initiative that provides a data warehouse infrastructure on top of Hadoop MapReduce. It provides tools for easy data summarization, ad hoc queries, and analysis of large datasets stored in Hadoop files.
Hive’s major advantages reside in the ability to scale out, since it is based on the Hadoop framework, and in the
ability to provide a data warehouse infrastructure in environments where there is already a Hadoop system
running.
D. Map-Reduce-Merge.
Map-Reduce-Merge is an extension of the MapReduce model, introducing a third phase to the standard
MapReduce pipeline—the Merge phase—that allows efficiently merging data already partitioned and sorted (or
hashed) by map and reduce modules. The Map-Reduce-Merge framework simplifies the management of
heterogeneous related datasets and provides an abstraction able to express the common relational algebra
operators as well as several join algorithms.
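A toy sketch of the extra step is given below (not the original framework's code): two reduce outputs that are already sorted by key are merged with a relational-style join; the employee/department data and the assumption of unique keys on each side are illustrative.

def merge(left, right):
    # left, right: lists of (key, value) pairs, each sorted by key (unique keys assumed).
    i, j, joined = 0, 0, []
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk == rk:
            joined.append((lk, (left[i][1], right[j][1])))
            i, j = i + 1, j + 1
        elif lk < rk:
            i += 1
        else:
            j += 1
    return joined

employees = [("d1", "alice"), ("d2", "bob")]           # output of one MapReduce job
departments = [("d1", "sales"), ("d2", "research")]    # output of another job
print(merge(employees, departments))                    # [('d1', ('alice', 'sales')), ...]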
E. Twister.
Twister is an extension of the MapReduce model that allows the creation of iterative executions of MapReduce
jobs. With respect to the normal MapReduce pipeline, the model proposed by Twister proposes the following
extensions:
1. Configure Map
2. Configure Reduce
3. While Condition Holds True Do
a. Run MapReduce
b. Apply Combine Operation to Result
c. Update Condition
4. Close
Twister provides additional features such as the ability for map and reduce tasks to refer to static, in-memory data, and the introduction of an additional phase called combine, run at the end of the MapReduce job, that aggregates the outputs together.
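The loop below is a sketch of this iterative structure, reusing the in-memory run_job sketch from the earlier section; the combine and should_continue functions are hypothetical placeholders chosen by the application.

def iterative_mapreduce(inputs, map_fn, reduce_fn, combine, should_continue):
    state, keep_going = None, True
    while keep_going:                                   # while condition holds true
        results = run_job(inputs, map_fn, reduce_fn)    # run MapReduce
        state = combine(results)                        # combine aggregates the outputs
        keep_going = should_continue(state)             # update condition
    return state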

3. Alternatives to MapReduce.
a. Sphere.
b. All-Pairs.
c. DryadLINQ.

a. Sphere.
Sphere is the distributed processing engine that leverages the Sector Distributed File System (SDFS).
Sphere implements the stream processing model (Single Program, Multiple Data) and allows developers to
express the computation in terms of user-defined functions (UDFs), which are run against the distributed
infrastructure.
Sphere is built on top of Sector’s API for data access.
UDFs are expressed in terms of programs that read and write streams. A stream is a data structure that provides
access to a collection of data segments mapping one or more files in the SDFS.
The execution of UDFs is achieved through Sphere Process Engines (SPEs), each of which is assigned a given stream segment. The Sphere client sends a request for processing to the master node, which returns the list of available slaves; the client then chooses the slaves on which to execute the Sphere processes.

b. All-Pairs.
It provides a simple abstraction—in terms of the All-pairs function—that is common in many scientific
computing domains:
All-pairs(A:set; B:set; F:function) -> M:matrix
Ex 1: the field of biometrics, where similarity matrices are composed as a result of comparing several images that contain pictures of subjects.
Ex 2: applications and algorithms in data mining.
The All-pairs function can be easily solved by the following algorithm:
1. For each $i in A
2. For each $j in B
3. Submit job F $i $j
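A direct, sequential sketch of the abstraction is shown below; in the real engine each F(i, j) evaluation is dispatched as a batch job, and the similarity function and inputs here are hypothetical.

def all_pairs(A, B, F):
    # Returns the matrix M with M[i][j] = F(A[i], B[j]).
    return [[F(a, b) for b in B] for a in A]

similarity = lambda a, b: 1.0 if a == b else 0.0    # e.g., comparing biometric images
M = all_pairs(["img1", "img2"], ["img1", "img3"], similarity)
print(M)   # 2 x 2 matrix of pairwise comparisons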
The execution of a distributed application is controlled by the engine and develops in four stages:
(1) model the system;
(2) distribute the data;
(3) dispatch batch jobs; and
(4) clean up the system.

c. DryadLINQ.
Dryad is a Microsoft Research project that investigates programming models for writing parallel and distributed
programs to scale from a small cluster to a large datacenter.
In Dryad, developers can express distributed applications as a set of sequential programs that are connected by
means of channels.
A Dryad computation is expressed as a directed acyclic graph in which the vertices are the sequential programs and the edges represent the channels connecting them.
Dryad is considered a superset of the MapReduce model, since its application model allows expressing graphs that represent MapReduce computations.
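The toy sketch below (not Dryad's API; names are hypothetical) builds such a graph for a two-stage, MapReduce-like computation: the vertices hold sequential programs and the edges record the channels between them.

from collections import defaultdict

class DataflowGraph:
    def __init__(self):
        self.programs = {}                    # vertex name -> sequential program
        self.channels = defaultdict(list)     # producer vertex -> consumer vertices

    def add_vertex(self, name, program):
        self.programs[name] = program

    def add_channel(self, src, dst):
        self.channels[src].append(dst)

g = DataflowGraph()
g.add_vertex("map-0", lambda text: [(w, 1) for w in text.split()])
g.add_vertex("map-1", lambda text: [(w, 1) for w in text.split()])
g.add_vertex("reduce-0", lambda pairs: sum(v for _, v in pairs))
g.add_channel("map-0", "reduce-0")    # channels feed map outputs into the reducer
g.add_channel("map-1", "reduce-0")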
