Cloud Unit 5
Apache Hadoop is an open-source, Java-based programming framework that supports
the processing of large data sets in a distributed computing environment.
1. Hadoop common – collection of common utilities and libraries that support other Hadoop
modules.
2. Hadoop Distributed File System (HDFS) – Primary distributed storage system used by
Hadoop applications to hold large volumes of data. HDFS is scalable and fault-tolerant and
works closely with a wide variety of concurrent data-access applications.
3. Hadoop YARN (Yet Another Resource Negotiator) – Apache Hadoop YARN is the
resource management and job scheduling technology in the open source Hadoop
distributed processing framework. YARN is responsible for allocating system resources to
the various applications running in a Hadoop cluster and scheduling tasks to be executed
on different cluster nodes.
❖ Even though the file chunks are replicated and distributed across several machines,
they form a single namespace, so their contents are universally accessible.
MAPREDUCE in Hadoop
❖ Hadoop will not run just any program and distribute it across a cluster. Programs must
be written to conform to a particular programming model, named "MapReduce."
❖ In MapReduce, records are processed by tasks called Mappers. The output from the
Mappers is then brought together into a second set of tasks called Reducers, where
results from different mappers can be merged together.
Hadoop Architecture:
InputFormat: Hadoop can process many different types of data formats, from flat text files to
databases. In this section, we explore the different formats available.
How the input files are split up and read is defined by the InputFormat. An InputFormat is a
class that provides the following functionality:
● Selects the files or other objects that should be used for input
● Defines the InputSplits that break a file into tasks
● Provides a factory for RecordReader objects that read the file
Input Splits and Records:
An input split is a chunk of the input that is processed by a single map. Each map processes a
single split. Each split is divided into records, and the map processes each record—a key-value
pair—in turn. Splits and records are logical.
FileInputFormat:
FileInputFormat is the base class for all implementations of InputFormat that use files as their data
source.
It provides two things: a place to define which files are included as the input to a job, and an
implementation for generating splits for the input files. The job of dividing splits into records is
performed by subclasses.
Preventing splitting
Some applications don’t want files to be split, so that a single mapper can process each input
file in its entirety. For example, a simple way to check if all the records in a file are sorted is to
go through the records in order, checking whether each record is not less than the preceding
one. Implemented as a map task, this algorithm will work only if one map processes the
whole file.
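A common way to prevent splitting is to subclass the concrete InputFormat in use and override isSplitable() to return false. The sketch below (a minimal example against the newer org.apache.hadoop.mapreduce API; the class name is illustrative) shows the idea:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

// Forces each input file to be processed by a single mapper in its entirety.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split, regardless of file size
    }
}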
Text Input
Hadoop excels at processing unstructured text.
TextInputFormat :
TextInputFormat is the default InputFormat. Each record is a line of input. The key, a LongWritable,
is the byte offset within the file of the beginning of the line. The value is the contents of the line,
excluding any line terminators (newline, carriage return).
KeyValueTextInputFormat:
TextInputFormat’s keys, being simply the offset within the file, are not normally very useful. It is
common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character.
For example, this is the output produced by TextOutputFormat, Hadoop’s default OutputFormat. To
interpret such files correctly, KeyValueTextInputFormat is appropriate.
NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of
lines of input. The number depends on the size of the split and the length of the lines. If you want
your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat
to use. Like TextInputFormat, the keys are the byte offsets within the file and the values are the lines
themselves.
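A minimal driver sketch showing how a job can be configured to use NLineInputFormat (the class name, input path argument, and the value of 10 lines per mapper are illustrative assumptions):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineJobSetup {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "nline example");
        // Input path supplied on the command line (placeholder).
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // Give every mapper exactly 10 lines of input instead of a whole split.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 10);
        // ... mapper, reducer, and output settings would follow here
    }
}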
XML :
Most XML parsers operate on whole XML documents, so if a large XML document is made up of
multiple input splits, then it is a challenge to parse these individually. Of course, you can process the
entire XML document in one mapper (if it is not too large) using the technique in “Processing a
whole file as a record” .
Binary Input
Hadoop MapReduce is not restricted to processing textual data; it also has support for binary formats.
DBInputFormat is an input format for reading data from a relational database, using JDBC.
Output Formats
Hadoop has output data formats that correspond to the input formats
Text Output
The default output format, TextOutputFormat, writes records as lines of text. Its keys
and values may be of any type, since TextOutputFormat turns them into strings by calling toString()
on them. Each key-value pair is separated by a tab character.
OutputFormat: the (key, value) pairs provided to the OutputCollector are then written to output
files in the form dictated by the configured OutputFormat.
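If the default tab separator is not suitable, it can be changed through a configuration property. A minimal driver sketch, assuming the newer MapReduce API (the property name below is the one used by Hadoop 2.x; older releases used mapred.textoutputformat.separator, and the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class TextOutputSetup {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Change the key-value separator from the default tab to a comma.
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "text output example");
        job.setOutputFormatClass(TextOutputFormat.class);
        // ... input, mapper, reducer, and output path settings would follow
    }
}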
❖ Hadoop can process many different types of data formats, from flat text files to
databases.
❖ If it is flat file, the data is stored using a line-oriented ASCII format, in which each
line is a record.
❖ For example, the NCDC (National Climatic Data Center) data format supports a rich
set of meteorological elements, many of which are optional or have variable data
lengths.
The input to the map phase is the raw NCDC data. We choose a text input format that gives us
each line in the dataset as a text value. The key is the offset of the beginning of the line from
the beginning of the file.
To visualize the way the map works, consider a few sample lines of raw NCDC input data.
These lines are presented to the map function as key-value pairs. The keys are the line offsets
within the file, which we ignore in our map function. The map function merely extracts the
year and the air temperature from each record and emits them as its output (the temperature
values have been interpreted as integers):
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
The output from the map function is processed by the MapReduce framework before
being sent to the reduce function. This processing sorts and groups the key- value pairs by
key. So, continuing the example, our reduce function sees the following input:
(1949, [111, 78])
(1950, [0, 22, −11])
Each year appears with a list of all its air temperature readings. All the reduce function has to
do now is iterate through the list and pick up the maximum reading:
(1949, 111)
(1950, 22)
This is the final output: the maximum global temperature recorded in each year. The whole
data flow is illustrated in the following figure.
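A minimal sketch of the map and reduce functions for this example, written against the newer MapReduce API. The fixed character positions used to pull out the year and temperature are assumptions based on the NCDC fixed-width record layout, and quality/missing-value checks are omitted for brevity:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Extracts (year, temperature) pairs from NCDC records.
class MaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);            // assumed year position
        // Assumed temperature position; the leading sign ('+'/'-') is accepted by parseInt.
        int airTemperature = Integer.parseInt(line.substring(87, 92));
        context.write(new Text(year), new IntWritable(airTemperature));
    }
}

// Picks the maximum temperature for each year.
class MaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}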
Combiner Functions
❖ Hadoop allows the user to specify a combiner function to be run on the map output—
the combiner function’s output forms the input to the reduce function.
❖ Since the combiner function is an optimization, Hadoop does not provide a guarantee
of how many times it will call it for a particular map output record, if at all.
❖ This is best illustrated with an example. Suppose that for the maximum temperature
example, readings for the year 1950 were processed by two maps (because they were
in different splits). Imagine the first map produced the output:
(1950, 0)
(1950, 20)
(1950, 10)
And the second produced:
(1950, 25)
(1950, 15)
The reduce function would be called with a list of all the values: (1950, [0, 20, 10, 25, 15])
with output:
(1950, 25)
since 25 is the maximum value in the list. We could use a combiner function that, just like the
reduce function, finds the maximum temperature for each map output.
The reduce would then be called with: (1950, [20, 25])
and the reduce would produce the same output as before.
In terms of function calls on the temperature values, this can be expressed as:
max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
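A driver sketch showing how a combiner is plugged in; it reuses the reducer class sketched earlier as the combiner, which works here because taking a maximum is commutative and associative (class names and command-line paths are illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "max temperature");
        job.setJarByClass(MaxTemperatureWithCombiner.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input path from command line
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path from command line
        job.setMapperClass(MaxTemperatureMapper.class);
        // The reducer doubles as the combiner: it pre-aggregates map output
        // locally, so less data crosses the network to the reducers.
        job.setCombinerClass(MaxTemperatureReducer.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}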
Design of HDFS
HDFS is a file system designed for storing very large files with streaming data access patterns,
running on clusters of commodity hardware.
“Very large”
Files that are hundreds of megabytes, gigabytes, or terabytes in size. There are Hadoop clusters
running today that store petabytes of data.
HDFS is built around the idea that the most efficient data processing pattern is a
write-once, read-many-times pattern.
Commodity hardware
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to run on
clusters of commodity hardware (commonly available hardware available from multiple
vendors) .
Since the namenode holds filesystem metadata in memory, the limit to the number of files in
a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each
file, directory, and block takes about 150 bytes. So, for example, if you had one million files,
each taking one block, that is two million objects (one file plus one block each), so you would
need at least 2,000,000 × 150 bytes ≈ 300 MB of memory. While storing millions of files is
feasible, billions is beyond the capability of current hardware.
Multiple writers, arbitrary file modifications: files in HDFS may be written to by a single
writer, and writes are always made at the end of the file. There is no support for multiple
writers or for modifications at arbitrary offsets in the file.
HDFS Concepts
The following diagram illustrates the Hadoop concepts
❖ HDFS is a block-structured file system. Each HDFS file is broken into blocks of
fixed size, usually 128 MB, which are stored across various Data Nodes on the cluster.
Each of these blocks is stored as a separate file on the local file system of the Data Nodes
(commodity machines in the cluster).
❖ Thus, to access a file on HDFS, multiple Data Nodes may need to be contacted, and the list
of Data Nodes to contact is determined by the file-system metadata stored on the Name Node.
❖ So, any HDFS client trying to access or read an HDFS file will get block information from
the Name Node first; then, based on the block IDs and locations, data will be read from the
corresponding Data Nodes in the cluster.
❖ HDFS’s fsck command is useful for getting details of the files and blocks in the file system.
❖ Example: the command hdfs fsck / -files -blocks lists the blocks that make up each file in
the file system.
Advantages of Blocks
1. Quick Seek Time:
By default, the HDFS block size is 128 MB, which is much larger than the block size of most
other file systems. In HDFS, a large block size is maintained to reduce the seek time for block
access.
2. Ability to Store Large Files:
Another benefit of this block structure is that there is no need to store all blocks of a file on
the same disk or node. So, a file can be larger than any single disk or node in the cluster.
3. How Fault Tolerance is achieved with HDFS Blocks:
The HDFS block structure fits well with replication for providing fault tolerance and
availability.
By default, each block is replicated to three separate machines. This protects against block
corruption and against disk or machine failure. If a block becomes unavailable, a copy can be
read from another machine, and a block that is no longer available due to corruption or
machine failure can be re-replicated from one of its remaining copies to other live machines
to bring the replication factor back to the normal level (3 by default).
Name Node
❖ Name Node is the single point of contact for accessing files in HDFS, and it determines
the block IDs and locations for data access. So, the Name Node plays the Master role in
the Master/Slaves architecture, whereas Data Nodes act as slaves. File-system
metadata is stored on the Name Node.
❖ File System Metadata contains File names, File Permissions and locations of each
block of files. Thus, Metadata is relatively small in size and fits into Main Memory
of a computer machine. So, it is stored in Main Memory of Name Node to allow fast
access.
Data Node
Data Nodes are the slaves part of Master/Slaves Architecture and on which
actual HDFS files are stored in the form of fixed size chunks of data which are
called blocks.
Data Nodes serve read and write requests of clients on HDFS files and also
perform block creation, replication and deletions.
Hadoop Filesystem
Hadoop has an abstract notion of filesystem, of which HDFS is just one implementation.
The Java abstract class org.apache.hadoop.fs.FileSystem represents a filesystem in Hadoop,
and there are several concrete implementations.
JAVA INTERFACE
Reading Data from a Hadoop URL
One of the simplest ways to read a file from a Hadoop filesystem is by using a java.net.URL
object to open a stream to read the data from.
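A minimal sketch of this approach (the hdfs:// URL is a placeholder supplied on the command line); note that setURLStreamHandlerFactory can be called only once per JVM:

import java.io.InputStream;
import java.net.URL;
import org.apache.hadoop.fs.FsUrlStreamHandlerFactory;
import org.apache.hadoop.io.IOUtils;

public class URLCat {
    static {
        // Teach java.net.URL about the hdfs:// scheme; allowed only once per JVM.
        URL.setURLStreamHandlerFactory(new FsUrlStreamHandlerFactory());
    }

    public static void main(String[] args) throws Exception {
        InputStream in = null;
        try {
            // args[0] is an HDFS URL such as hdfs://namenode/path/to/file
            in = new URL(args[0]).openStream();
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}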
Anatomy of File Read
The following steps are involved in reading a file from HDFS. Suppose a client (an HDFS
client) wants to read a file from HDFS.
Step 1: First the Client will open the file by giving a call to open() method on FileSystem
object, which is an instance of DistributedFileSystem class.
Step 2: DistributedFileSystem calls the NameNode, using RPC (Remote Procedure Call), to
determine the locations of the blocks for the file. For each block, the NameNode returns the
addresses of all the DataNodes that have a copy of that block. The client will interact with the
respective DataNodes to read the file. The NameNode also provides a token to the client, which
it shows to the DataNodes for authentication.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the
DataNode addresses for the first few blocks in the file, then connects to the first closest
DataNode for the first block in the file.
Step 4: Data is streamed from the DataNode back to the client, which calls read() repeatedly
on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to
the DataNode, then find the best DataNode for the next block.
Step 6: Blocks are read in order, with the DFSInputStream opening new connections to
datanodes as the client reads through the stream. It will also call the namenode to retrieve the
datanode locations for the next batch of blocks as needed. When the client has finished
reading, it calls close() on the FSDataInputStream.
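The same read path expressed directly with the FileSystem API (the URI is a placeholder supplied on the command line):

import java.io.InputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileSystemCat {
    public static void main(String[] args) throws Exception {
        String uri = args[0];  // e.g. an hdfs:// path supplied on the command line
        Configuration conf = new Configuration();
        // FileSystem.get() returns the concrete implementation for the URI scheme
        // (DistributedFileSystem for hdfs://); open() returns an FSDataInputStream.
        FileSystem fs = FileSystem.get(URI.create(uri), conf);
        InputStream in = null;
        try {
            in = fs.open(new Path(uri));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
        }
    }
}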
Anatomy of File Write
Step 1: The client creates the file by calling create() method on DistributedFileSystem.
Step 2: DistributedFileSystem makes an RPC call to the namenode to create a new file in the
filesystem’s namespace, with no blocks associated with it. The namenode performs various
checks to make sure the file doesn’t already exist and that the client has the right permissions
to create the file.
Step 3: As the client writes data, DFSOutputStream splits it into packets, which it writes to
an internal queue, called the data queue. The data queue is consumed by the DataStreamer,
which is responsible for asking the namenode to allocate new blocks by picking a list of
suitable datanodes to store the replicas. The list of datanodes forms a pipeline, and here we’ll
assume the replication level is three, so there are three nodes in the pipeline. The DataStreamer
streams the packets to the first datanode
in the pipeline, which stores the packet and forwards it to the second datanode in the pipeline.
Step 4: Similarly, the second datanode stores the packet and forwards it to the third (and last)
datanode in the pipeline.
Step 5: DFSOutputStream also maintains an internal queue of packets that are waiting to be
acknowledged by datanodes, called the ack queue. A packet is removed from the ack queue
only when it has been acknowledged by all the datanodes in the pipeline.
Step 6: When the client has finished writing data, it calls close() on the stream.
This action flushes all the remaining packets to the datanode pipeline and waits for
acknowledgments before contacting the namenode to signal that the file is complete.
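A corresponding client-side write sketch using the FileSystem API (the local source file and HDFS destination are placeholders supplied on the command line):

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class FileCopyToHdfs {
    public static void main(String[] args) throws Exception {
        String localSrc = args[0];  // local file to upload
        String dst = args[1];       // destination HDFS URI
        InputStream in = new BufferedInputStream(new FileInputStream(localSrc));
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create(dst), conf);
        // create() returns an FSDataOutputStream; data written to it is split into
        // packets and pushed down the datanode pipeline as described above.
        OutputStream out = fs.create(new Path(dst));
        IOUtils.copyBytes(in, out, 4096, true);  // true closes both streams
    }
}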
Hadoop Installation (Single Node Cluster)
A single node cluster means only one DataNode is running, with the NameNode, DataNode,
ResourceManager, and NodeManager all set up on a single machine. This is used for study
and testing purposes.
In a multi node cluster, there is more than one DataNode running, and each DataNode runs
on a different machine. The multi node cluster is practically used in organizations for
analyzing Big Data.
Step 1: Download the Java package and save the file in the home directory.
Step 5: Add the Hadoop and Java paths in the bash file (.bashrc). Open the .bashrc file and
add the Hadoop and Java paths as shown below.
Command: gedit .bashrc
Fig: Hadoop Installation – Setting Environment Variable
To apply all these changes to the current terminal, execute the source command
(source .bashrc).
To make sure that Java and Hadoop have been properly installed on the system and
can be accessed through the terminal, execute the java -version and hadoop version
commands.
Step 7: Open core-site.xml and edit the property mentioned below inside the
configuration tag:
core-site.xml informs the Hadoop daemon where the NameNode runs in the cluster. It
contains configuration settings of Hadoop core, such as I/O settings, that are common
to HDFS and MapReduce. The property edited here is typically fs.default.name
(fs.defaultFS in newer releases), set to the NameNode URI, for example hdfs://localhost:9000.
Step 8: Edit hdfs-site.xml and edit the property mentioned below inside the configuration tag
(the permission-checking property is named dfs.permissions.enabled in Hadoop 2.x):
<configuration>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
</property>
</configuration>
Step 10: Edit the mapred-site.xml file and edit the property mentioned below inside the
configuration tag.
In some cases, the mapred-site.xml file is not available, so we have to create it from the
mapred-site.xml.template file. The usual setting here tells MapReduce to run on YARN
(assumed standard configuration):
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
</configuration>
Step 10: Edit yarn-site.xml and edit the property mentioned below inside
configuration tag:
<?xml version="1.0"?>
<configuration>
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
<value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
</configuration>
Step 11: Edit hadoop-env.sh and add the Java Path as mentioned below:
hadoop-env.sh contains the environment variables that are used in the script to run
Hadoop like Java home path, etc.
Command: cd
Command: cd hadoop-2.9.0
Step 12: Format the NameNode.
Command: bin/hadoop namenode -format
This formats HDFS via the NameNode. This command is executed only the first time.
Formatting the file system means initializing the directory specified by the dfs.name.dir
variable.
Step 13: Once the NameNode is formatted, go to the hadoop-2.9.0/sbin directory and start
all the daemons.
Command: cd hadoop-2.9.0/sbin
Command: ./start-all.sh
Step 14: To check that all the Hadoop services are up and running, run the below
command.
Command: jps
Fig: Hadoop Installation – Checking Daemons
a) Eucalyptus
Eucalyptus is an open-source software platform for implementing Infrastructure-as-a-Service
(IaaS) private and hybrid clouds.
● Compatible with Amazon Web Services (AWS) and its Simple Storage Service (S3).
● Works with multiple hypervisors including VMware, Xen, and KVM.
Eucalyptus Architecture
Cloud Controller
The Cloud Controller (CLC) is the entry-point into the cloud for administrators,
developers, project managers, and end-users.
As the interface to the management platform, the CLC is responsible for exposing and
managing the underlying virtualized resources (servers, network, and storage).
Walrus
Walrus allows users to store persistent data, organized as buckets and objects. We can
use Walrus to create, delete, and list buckets, or to put, get, and delete objects, or to set access
control policies.
Cluster Controller
CCs gather information about a set of NCs and schedules virtual machine (VM)
execution on specific NCs. The CC also manages the virtual machine networks. All NCs
associated with a single CC must be in the same subnet.
Storage Controller
The Storage Controller (SC) provides functionality similar to the Amazon Elastic
Block Store (Amazon EBS). The SC is capable of interfacing with various storage systems.
Node Controller
The Node Controller (NC) executes on any machine that hosts VM instances. The NC
controls VM activities, including the execution, inspection, and termination of VM instances.
VM Image Management
⮚ Eucalyptus stores images in Walrus, the bucket-based storage system that is analogous to
the Amazon S3 service.
⮚ This image is uploaded into a user-defined bucket within Walrus, and can be
retrieved anytime from any availability zone.
⮚ The Eucalyptus system is available in a commercial proprietary version as well as
the open source version.
b) OpenNebula
⮚ The OpenNebula core is a centralized component that manages the full VM life cycle,
including setting up networks dynamically for groups of VMs and managing their
storage requirements.
⮚ Another important component is the capacity manager or scheduler.
⮚ The default capacity scheduler is a requirement/rank matchmaker. However, it is also
possible to develop more complex scheduling policies, through a lease model and
advance reservations.
⮚ The last main components are the access drivers. They provide basic functionality of
the monitoring, storage, and virtualization services available in the cluster.
⮚ OpenNebula is not tied to any specific environment and can provide a uniform
management layer regardless of the virtualization platform.
⮚ OpenNebula offers management interfaces to integrate the core's functionality within
other data-center management tools, such as accounting or monitoring frameworks.
⮚ To this end, OpenNebula implements the libvirt API, an open interface for VM
management, as well as a command-line interface (CLI).
⮚ A subset of this functionality is exposed to external users through a cloud interface.
⮚ When the local resources are insufficient, OpenNebula can support a hybrid cloud
model by using cloud drivers to interface with external clouds.
⮚ OpenNebula currently includes an EC2 driver, which can submit requests to Amazon
EC2 and Eucalyptus, as well as an Elastic Hosts driver.
c) OpenStack
▪ OpenStack was introduced by Rackspace and NASA in July 2010.
▪ OpenStack is a set of software tools for building and managing cloud computing
platforms for public and private clouds.
▪ It focuses on the development of two aspects of cloud computing to address compute
and storage aspects with the OpenStack Compute and OpenStack Storage solutions.
▪ “OpenStack Compute for creating and managing large groups of virtual private
servers”
▪ “OpenStack Object Storage software for creating redundant, scalable object
storage using clusters of commodity servers to store terabytes or even petabytes
of data.”
▪ OpenStack is an open-source cloud computing platform for all types of clouds,
which aims to be simple to implement, massively scalable, and feature rich.
▪ OpenStack provides an Infrastructure-as-a-Service (IaaS) solution through a set of
interrelated services. Each service offers an application programming interface (API)
that facilitates integration with the other services.
OpenStack Compute
⮚ OpenStack is developing a cloud computing fabric controller, a component of an
IAAS system, known as Nova.
⮚ Nova is an OpenStack project designed to provide massively scalable, on demand,
self service access to compute resources.
⮚ The architecture of Nova is built on the concepts of shared-nothing and
messaging-based information exchange.
⮚ Hence, most communication in Nova is facilitated by message queues.
⮚ To prevent blocking components while waiting for a response from others, deferred
objects are introduced. Such objects include callbacks that get triggered when a
response is received.
⮚ To achieve the shared-nothing paradigm, the overall system state is kept in a
distributed data system.
⮚ State updates are made consistent through atomic transactions.
OpenStack Object Storage uses the following components to deliver high availability, high
durability, and high concurrency:
Proxy Servers
Proxy servers are the public face of Object Storage and handle all of the incoming API
requests. Once a proxy server receives a request, it determines the storage node based on the
object’s URL.
Proxy servers use a shared-nothing architecture and can be scaled as needed based on
projected workloads. A minimum of two proxy servers should be deployed for redundancy.
If one proxy server fails, the others take over.
Rings
A ring represents a mapping between the names of entities stored on disks and their physical
locations. There are separate rings for accounts, containers, and objects. When other
components need to perform any operation on an object, container, or account, they need to
interact with the appropriate ring to determine their location in the cluster.
The ring maintains this mapping using zones, devices, partitions, and replicas. Each partition
in the ring is replicated, by default, three times across the cluster, and partition locations are
stored in the mapping maintained by the ring. The ring is also responsible for determining
which devices are used for handoff in failure scenarios.
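Conceptually, a ring maps a name to a partition by hashing it and keeping only the top few bits of the hash. The sketch below illustrates that idea only; it is not Swift's actual implementation, and the hash choice, bit count, and method names are assumptions:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class PartitionLookup {
    // Number of bits kept from the hash; 2^PART_POWER partitions exist (illustrative value).
    private static final int PART_POWER = 18;

    // Map an object path such as /account/container/object to a partition number.
    public static long partitionFor(String name) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(name.getBytes(StandardCharsets.UTF_8));
        // Pack the first 8 bytes of the digest into a long...
        long hash = 0;
        for (int i = 0; i < 8; i++) {
            hash = (hash << 8) | (digest[i] & 0xff);
        }
        // ...and keep only the top PART_POWER bits as the partition number.
        return hash >>> (64 - PART_POWER);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(partitionFor("/AUTH_test/photos/cat.jpg"));
    }
}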
Accounts and Containers
Each account and container is an individual SQLite database that is distributed across the
cluster. An account database contains the list of containers in that account. A container
database contains the list of objects in that container.
Partitions
A partition is a collection of stored data. This includes account databases, container databases,
and objects. Partitions are core to the replication system.
d) Nimbus
⮚ Nimbus is a set of open source tools that together provide an IaaS cloud computing
solution.
⮚ The following figure shows the architecture of Nimbus, which allows a client to lease
remote resources by deploying VMs on those resources and configuring them to
represent the environment desired by the user.
⮚ To this end, Nimbus provides a special web interface known as Nimbus Web. Its aim
is to provide administrative and user functions in a friendly interface.
⮚ Nimbus Web is centered around a Python Django web application that is intended to
be deployable completely separately from the Nimbus service.
⮚ As shown in Figure, a storage cloud implementation called Cumulus has been tightly
integrated with the other central services, although it can also be used stand-alone.
⮚ Cumulus is compatible with the Amazon S3 REST API, but extends its capabilities
by including features such as quota management. Therefore, clients such as boto and
s3cmd, which work against the S3 REST API, work with Cumulus.
⮚ On the other hand, the Nimbus cloud client uses the Java Jets3t library to interact with
Cumulus. Nimbus supports two resource management strategies.
⮚ The first is the default “resource pool” mode. In this mode, the service has direct
control of a pool of VM manager nodes and it assumes it can start VMs.
⮚ The other supported mode is called “pilot.” Here, the service makes requests to a
cluster’s Local Resource Management System (LRMS) to get a VM manager available
to deploy VMs.
⮚ Nimbus also provides an implementation of Amazon’s EC2 interface that allows users
to use clients developed for the real EC2 system against Nimbus-based clouds.
e) Aneka
⮚ Aneka is a cloud application development and deployment platform from Manjrasoft.
It includes an extensible set of APIs associated with programming models like
MapReduce.
⮚ These APIs support different cloud models, such as private, public, and hybrid clouds.
⮚ Manjrasoft focuses on creating innovative software technologies to simplify the
development and deployment of private or public cloud applications. Aneka plays the
role of an application platform as a service for multiple cloud computing models.
The Aneka Container:
⮚ The Aneka container constitutes the building blocks of Aneka Clouds and represents
the runtime machinery available to services and applications. The container, the unit
of deployment in Aneka Clouds, is a lightweight software layer designed to host
services and interact with the underlying operating system and hardware. The main
role of the container is to provide a lightweight environment in which to deploy
services and some basic capabilities such as communication channels through which
it interacts with other nodes in the Aneka Cloud. Almost all operations performed
within Aneka are carried out by the services managed by the container. The services
installed in the Aneka container can be classified into three major categories:
• Fabric Services
• Foundation Services
• Application Services
⮚ The services stack resides on top of the Platform Abstraction Layer (PAL),
representing the interface to the underlying operating system and hardware. It provides
a uniform view of the software and hardware environment in which the container is
running. Persistence and security traverse all the services stack to provide a secure and
reliable infrastructure. In the following sections we discuss the components of these
layers in more detail.
⮚ Fast and Simple: Task Programming Model:
⮚ Task Programming Model provides developers with the ability of expressing
applications as a collection of independent tasks. Each task can perform different
operations, or the same operation on different data, and can be executed in any order
by the runtime environment. This is a scenario into which many scientific applications fit,
and it is a very popular model for Grid Computing. Task programming also allows the
parallelization of legacy applications on the Cloud.
⮚ Concurrent Applications: Thread Programming Model
The Thread Programming Model lets developers run multithreaded applications on the Aneka
Cloud, with individual threads executed remotely on distributed resources.
The Task Programming Model is one of the primary programming models supported by Aneka,
enabling the execution of independent tasks in parallel. It is particularly useful for embarrassingly
parallel workloads, where tasks do not need to communicate or share data with each other. Each task
runs independently, and the system manages the distribution and execution across available
resources.
The Aneka Task Programming Model is designed for embarrassingly parallel tasks, making it
simpler than more complex distributed computing models such as:
● MapReduce Model: MapReduce is suited for scenarios where tasks need to be distributed
but may also require aggregation (e.g., data analytics). Unlike MapReduce, the Aneka Task
Model does not require a reduce phase.
● MPI (Message Passing Interface): MPI is used for tasks that require communication
between nodes. In contrast, Aneka’s Task Model is for independent tasks, where
communication is minimal or not required.
Data-Intensive Applications: MapReduce Programming Model
The MapReduce model in Aneka works by dividing a large computational task into two primary
functions:
1. Map Function: This phase involves processing input data and transforming it into a key-
value pair format. Each independent input chunk is processed in parallel by mapper nodes.
2. Reduce Function: This phase takes the output from the Map phase, aggregates, or combines
the results based on the keys generated, and produces a final result. Each reducer node
handles a subset of the intermediate results from the Map phase.
Map Phase:
● The input data is divided into multiple splits or chunks, which can be processed
independently.
● Each split is processed by a mapper node. The mapper processes data and generates key-
value pairs as output.
● The mappers work independently, which allows for parallel execution across different nodes
in the Aneka cloud.
def map(key, line):
    for word in line.split(): emit(word, 1)
In this example, if the input is a text file, the map function tokenizes the text into words and outputs
each word with a count of 1.
Reduce Phase:
● The output from the Map phase (key-value pairs) is shuffled and sorted by the key.
● The reduce function receives all values associated with a specific key and processes them to
generate a final output.
● The reducers aggregate the values for each key, typically by summing, counting, or merging
the values.
def reduce(key, values):
    result = sum(values)
    emit(key, result)
Typical use cases for the MapReduce model include:
● Big Data Analytics: Processing large datasets, such as logs, sensor data, or transaction data,
to generate insights.
● Text Processing: Analyzing large text corpora (e.g., word frequency analysis or natural
language processing).
● Data Aggregation: Performing operations like counting, summing, or averaging large
datasets distributed across cloud resources.
● Scientific Simulations: Running simulations that generate vast amounts of data, where the
intermediate results need to be processed in parallel.
MapReduce Model vs. Task Programming Model:
● MapReduce is suited for applications with large datasets that can be broken down into
smaller tasks that involve mapping and reducing.
● Task Programming Model is more general-purpose and deals with independent tasks that
do not require intermediate data aggregation like MapReduce.