

Module III
Hadoop
Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for any kind of
data, enormous processing power and the ability to handle virtually limitless concurrent tasks
or jobs.
History of Hadoop

Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they began working on the Apache Nutch project. The goal of Apache Nutch was to build a search engine system that could index one billion pages. After a lot of research on Nutch, they concluded that such a system would cost around half a million dollars in hardware, along with a monthly running cost of approximately $30,000, which was very expensive. They realized that their project architecture would not be capable of handling billions of pages on the web, so they began looking for a feasible solution that could reduce the implementation cost and solve the problem of storing and processing large datasets.

In 2003, they came across a paper published by Google describing the architecture of its distributed file system, GFS (Google File System), used for storing very large data sets. They realized that this paper could solve their problem of storing the very large files being generated by the web crawling and indexing processes. But this paper was only half of the solution to their problem.

In 2004, Google published another paper, on the MapReduce technique, which was the solution for processing those large datasets. This paper was the other half of the solution for Doug Cutting and Mike Cafarella's Nutch project. Both techniques (GFS and MapReduce) existed only as papers; Google had not released implementations of them. Doug Cutting knew from his work on Apache Lucene (a free and open-source information retrieval software library, originally written in Java by Doug Cutting in 1999) that open source is a great way to spread a technology to more people. So, together with Mike Cafarella, he started implementing Google's techniques (GFS and MapReduce) as open source within the Apache Nutch project.

In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon realized two problems:

(a) Nutch would not achieve its potential until it ran reliably on larger clusters.

(b) That looked impossible with just two people (Doug Cutting and Mike Cafarella).

The engineering task in the Nutch project was much bigger than he had realized, so he started looking for a company that would be interested in investing in their efforts. He found Yahoo.

Yahoo had a large team of engineers that was eager to work on this project.

So in 2006, Doug Cutting joined Yahoo, bringing the Nutch project with him. He wanted to provide the world with an open-source, reliable, scalable computing framework, with Yahoo's help. At Yahoo he first separated the distributed computing parts from Nutch and formed a new project, Hadoop. (He chose the name Hadoop because it was the name of a yellow toy elephant owned by his son, and because it was easy to pronounce and was a unique word.) He then wanted to make Hadoop work well on thousands of nodes, so he continued to develop Hadoop around GFS and MapReduce.

In 2007, Yahoo successfully tested Hadoop on a 1,000-node cluster and started using it. In January 2008, Yahoo released Hadoop as an open-source project to the ASF (Apache Software Foundation).

In July 2008, the Apache Software Foundation successfully tested Hadoop on a 4,000-node cluster.

In 2009, Hadoop successfully sorted a petabyte (PB) of data in less than 17 hours, handling billions of searches and indexing millions of web pages. Doug Cutting then left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other industries.

In December 2011, the Apache Software Foundation released Apache Hadoop version 1.0. Version 2.0.6 followed in August 2013, and Apache Hadoop version 3.0 was released in December 2017.

To summarize the above history: in 2002 Doug Cutting and Mike Cafarella began work on Apache Nutch; in 2003 and 2004 Google published the GFS and MapReduce papers; in 2006 Cutting joined Yahoo and split Hadoop out of Nutch; in 2007 and 2008 Hadoop was proven on 1,000-node and 4,000-node clusters and became an Apache project; in 2009 it sorted a petabyte of data; and versions 1.0, 2.x and 3.0 followed in 2011, 2013 and 2017.


The Hadoop Distributed File System

HDFS is an effective, scalable, fault tolerant and distributed approach for storing and
managing huge volumes of data.

The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. HDFS employs a Name Node and Data Node architecture to
implement a distributed file system that provides high-performance access to data across
highly scalable Hadoop clusters.


Hadoop itself is an open source distributed processing framework that manages data
processing and storage for big data applications. It provides a reliable means for managing
pools of big data and supporting related big data analytics applications.

WORKING OF HDFS:

HDFS enables the rapid transfer of data between compute nodes. At its outset, it was
closely coupled with Map Reduce, a framework for data processing that filters and divides
up work among the nodes in a cluster, and it organizes and condenses the results into a cohesive
answer to a query. Similarly, when HDFS takes in data, it breaks the information down into
separate blocks and distributes them to different nodes in a cluster.

With HDFS, data is written on the server once, and read and reused numerous times
after that. HDFS has a primary Name Node, which keeps track of where file data is kept in the
cluster.

HDFS also has multiple Data Nodes on a commodity hardware cluster -- typically one
per node in a cluster. The Data Nodes are generally organized within the same rack in the data
centre. Data is broken down into separate blocks and distributed among the various Data Nodes
for storage. Blocks are also replicated across nodes, enabling highly efficient parallel
processing.

The Name Node knows which Data Node contains which blocks and where the Data
Nodes reside within the machine cluster. The Name Node also manages access to the
files, including reads, writes, creates, deletes and the data block replication across the Data
Nodes.

The Name Node operates in conjunction with the Data Nodes. As a result, the cluster
can dynamically adapt to server capacity demand in real time by adding or subtracting nodes
as necessary.

The DataNodes are in constant communication with the NameNode to determine if the
DataNodes need to complete specific tasks. Consequently, the NameNode is always aware of
the status of each DataNode. If the NameNode realizes that one DataNode isn't working
properly, it can immediately reassign that DataNode's task to a different node containing
the same data block. DataNodes also communicate with each other, which enables them
to cooperate during normal file operations.


Moreover, the HDFS is designed to be highly fault-tolerant. The file system replicates -- or copies -- each piece of data multiple times and distributes the copies to individual nodes, placing at least one copy on a different server rack than the other copies.

HDFS architecture, NameNodes and DataNodes

HDFS uses a primary/secondary architecture. The HDFS cluster's NameNode is the primary server that manages the file system namespace and controls client access to files. As the central component of the Hadoop Distributed File System, the NameNode maintains and manages the file system namespace and provides clients with the right access permissions. The system's DataNodes manage the storage that's attached to the nodes they run on.

HDFS exposes a file system namespace and enables user data to be stored in files. A
file is split into one or more of the blocks that are stored in a set of DataNodes. The NameNode
performs file system namespace operations, including opening, closing and renaming files
and directories. The NameNode also governs the mapping of blocks to the DataNodes.
The DataNodes serve read and write requests from the clients of the file system. In
addition, they perform block creation, deletion and replication when the NameNode instructs
them to do so.

HDFS supports a traditional hierarchical file organization. An application or user can create directories and then store files inside these directories. The file system namespace hierarchy is like most other file systems -- a user can create, remove, rename or move files from one directory to another.


The NameNode records any change to the file system namespace or its properties. An
application can stipulate the number of replicas of a file that the HDFS should maintain. The
NameNode stores the number of copies of a file, called the replication factor of that file.
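The replication factor can also be inspected or changed per file through the Java FileSystem API. The sketch below is only an illustration (it is not part of these notes); the file path is an assumption, and the calls used are the standard ones from org.apache.hadoop.fs.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch: read and change the replication factor of a single HDFS file.
    // The path "/user/demo/data.txt" is only an example.
    public class ReplicationExample {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path file = new Path("/user/demo/data.txt");
            short current = fs.getFileStatus(file).getReplication(); // replication factor kept by the NameNode
            System.out.println("Current replication factor: " + current);
            fs.setReplication(file, (short) 3); // ask HDFS to keep three copies of each block
        }
    }

A cluster-wide default is normally set with the dfs.replication configuration property instead of per-file calls.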

COMPONENTS OF HADOOP

Hadoop has three components:

1. HDFS: The Hadoop Distributed File System is a dedicated file system for storing big data on a cluster of commodity (cheaper) hardware with a streaming access pattern. It enables data to be stored at multiple nodes in the cluster, which ensures data security and fault tolerance.

2. Map Reduce: Data stored in HDFS also needs to be processed. Suppose a query is sent to process a data set in HDFS. Hadoop identifies where this data is stored; this is called mapping. The query is then broken into multiple parts, the results of all these parts are combined, and the overall result is sent back to the user; this is called the reduce process. Thus, while HDFS is used to store the data, Map Reduce is used to process the data.

3. YARN: YARN stands for Yet Another Resource Negotiator. It is a dedicated operating system for Hadoop that manages the resources of the cluster and also functions as a framework for job scheduling in Hadoop. The available scheduling types include First Come First Serve, the Fair Scheduler and the Capacity Scheduler. First Come First Serve scheduling is set by default in YARN.

ANALYZING THE DATA WITH HADOOP

Tools used for Analysing are:

Map Reduce

Map Reduce is a programming model built on top of the YARN framework. Its primary purpose is to perform distributed processing in parallel across a Hadoop cluster, which is what makes Hadoop so fast: when dealing with Big Data, serial processing is no longer of any use.

Features of Map-Reduce:

• Scalable


• Fault Tolerance

• Parallel Processing

• Tuneable Replication

• Load Balancing

Apache Hive

Apache Hive is a data warehousing tool built on top of Hadoop; data warehousing here means storing, at a fixed location, data generated from various sources. Hive is one of the best tools for data analysis on Hadoop. Anyone with knowledge of SQL can use Apache Hive comfortably. The query language of Hive is known as HQL or HiveQL.

Features of Hive:

• Queries are similar to SQL queries.

• Hive supports different storage types such as HBase, ORC and plain text.

• Hive has built-in functions for data mining and other tasks.

• Hive operates on compressed data that is present inside Hadoop Ecosystem.

Apache Mahout

The name Mahout is taken from the Hindi word mahavat, which means elephant rider. Apache Mahout runs its algorithms on top of Hadoop, hence the name. Mahout is mainly used for implementing various machine learning algorithms on Hadoop, such as classification, collaborative filtering and recommendation. Apache Mahout can also run its machine learning algorithms without Hadoop integration.

Features of Mahout:

• Used for Machine Learning Application

• Mahout has Vector and Matrix libraries

• Ability to analyze large datasets quickly


Apache Pig

Apache Pig was initially developed by Yahoo to make programming easier. It has the capability to process extensive datasets because it works on top of Hadoop. Apache Pig is used for analyzing massive datasets by representing them as data flows, and it also raises the level of abstraction for processing enormous datasets. Pig Latin is the scripting language that developers use to work with the Pig framework, which runs on the Pig runtime.

Features of Pig:

• Easy to program

• Rich set of operators

• Ability to handle various kinds of data

• Extensibility

HBase

HBase is a non-relational, NoSQL, distributed, column-oriented database. HBase consists of various tables, where each table has multiple rows. These rows have multiple column families, and each column family has columns that contain key-value pairs. HBase works on top of HDFS (Hadoop Distributed File System). We use HBase for searching small pieces of data within much larger datasets.

Features of HBase:

• HBase has Linear and Modular Scalability

• JAVA API can easily be used for client access

• Block cache for real time data queries

SCALING OUT
Scaling alters the size of a system. In the scaling process, we either shrink or expand the system to meet the expected needs. Scaling can be achieved by adding resources to the current system, by adding a new system alongside the existing one, or both.


Scaling can be categorised into 2 types:

1. Vertical Scaling: When new resources are added to the existing system to meet the expected demand, this is known as vertical scaling. Consider a rack of servers and resources that make up the existing system. When the existing system fails to meet the expected needs, and those needs can be met by just adding resources, this is vertical scaling. Vertical scaling is based on the idea of adding more power (CPU, RAM) to existing machines. It is not only easier but also cheaper than horizontal scaling, and it requires less time to implement.
2. Horizontal Scaling: When new server racks are added to the existing system to meet higher demand, this is known as horizontal scaling. When the existing system fails to meet the expected needs, and those needs cannot be met by just adding resources to existing machines, completely new servers must be added. Horizontal scaling is based on the idea of adding more machines to the pool of resources. It is more difficult and costlier than vertical scaling, and it requires more time to implement.

HADOOP STREAMING

Hadoop Streaming is a utility that comes with the Hadoop distribution and allows developers or programmers to write Map-Reduce programs in different programming languages such as Ruby, Perl, Python and C++. Any language that can read from standard input (STDIN) and write to standard output (STDOUT) can be used. The Hadoop framework itself is written entirely in Java, but Hadoop programs do not necessarily need to be coded in Java. The Hadoop Streaming feature has been available since Hadoop version 0.14.1.

How Hadoop Streaming Works

In Hadoop Streaming, an input reader is responsible for reading the input data and producing a list of key-value pairs. The input reader contains the complete logic about the data it is reading. For example, if we want to read an image, we have to specify the logic in the input reader so that it can read the image data and generate key-value pairs for that image.


If we are reading image data, we can generate a key-value pair for each pixel, where the key is the location of the pixel and the value is its colour value (0-255 for a coloured image). This list of key-value pairs is fed to the Map phase, and the Mapper works on each key-value pair of each pixel to generate intermediate key-value pairs. After shuffling and sorting, these are fed to the Reducer, and the final output produced by the reducer is written to HDFS. This is how a simple Map-Reduce job works.

Now let's see how we can use different languages such as Python, C++ and Ruby with Hadoop. We can run such arbitrary languages by running them as separate processes. For that, we create an external mapper and run it as a separate external process. These external map processes are not part of the basic MapReduce flow. The external mapper takes input from STDIN and produces output on STDOUT. As key-value pairs are passed to the internal mapper, the internal mapper process sends them over STDIN to the external mapper, where our code written in some other language, such as Python, runs. The external mapper processes these key-value pairs, generates intermediate key-value pairs, and sends them back to the internal mapper over STDOUT.

The Reducer works in the same way. Once the intermediate key-value pairs have been processed through the shuffle and sort, they are fed to the internal reducer, which sends them to an external reducer process running separately via STDIN; the output generated by the external reducer is gathered via STDOUT, and finally the output is stored in HDFS. This is how Hadoop Streaming works, and it is available by default in Hadoop.
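As a minimal sketch (not taken from these notes), any program that reads lines from STDIN and writes tab-separated key-value lines to STDOUT can act as a streaming mapper. The class below, written in Java to match the rest of this module, emits (word, 1) pairs; the class name is illustrative.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    // Streaming-style word-count mapper: reads lines from STDIN and writes
    // "word<TAB>1" records to STDOUT, one per word.
    public class StreamingWordCountMapper {
        public static void main(String[] args) throws Exception {
            BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            String line;
            while ((line = in.readLine()) != null) {
                for (String word : line.trim().split("\\s+")) {
                    if (!word.isEmpty()) {
                        System.out.println(word + "\t1"); // key<TAB>value format expected by Streaming
                    }
                }
            }
        }
    }

Such a mapper (and a matching reducer) is typically launched with the hadoop-streaming JAR using its -input, -output, -mapper and -reducer options; the exact path of the JAR depends on the installation.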


DESIGN OF HDFS

HDFS is a file system designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.

Let’s examine this statement in more detail:

Very large files

“Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.

Streaming data access

HDFS is built around the idea that the most efficient data processing pattern is
a write-once, read-many-times pattern. A dataset is typically generated or copied from
source, then various analyses are performed on that dataset over time. Each analysis will
involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is
more important than the latency in reading the first record.

Commodity hardware

Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to
run on clusters of commodity hardware (commonly available hardware available from multiple
vendors) for which the chance of node failure across the cluster is high, at least for large
clusters. HDFS is designed to carry on working without a noticeable interruption to the user
in the face of such failure.

These are areas where HDFS is not a good fit today:

Low-latency data access

Applications that require low-latency access to data, in the tens of milliseconds range, will not work well with HDFS. Remember, HDFS is optimized for delivering a high throughput of data, and this may be at the expense of latency. HBase is currently a better choice for low-latency access.

Lots of small files

Since the namenode holds filesystem metadata in memory, the limit to the number of files in a filesystem is governed by the amount of memory on the namenode. As a rule of thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one million files, each taking one block, you would need at least 300 MB of memory. While storing millions of files is feasible, billions is beyond the capability of current hardware.

Multiple writers, arbitrary file modifications

Files in HDFS may be written to by a single writer. Writes are always made at the end
of the file. There is no support for multiple writers, or for modifications at arbitrary
offsets in the file. (These might be supported in the future, but they are likely to be relatively
inefficient.)

JAVA INTERFACES TO HDFS

Hadoop has an abstract notion of file systems, of which HDFS is just one
implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents the client
interface to a filesystem in Hadoop, and there are several concrete implementations.
Hadoop is written in Java, so most Hadoop file system interactions are mediated through the
Java API. The file system shell, for example, is a Java application that uses the Java FileSystem class to provide file system operations. By exposing its file system interface as a Java
API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST
API exposed by the WebHDFS protocol makes it easier for other languages to interact with
HDFS. Note that the HTTP interface is slower than the native Java client, so should be avoided
for very large data transfers if possible.

The Java Interface

In this section, we dig into the Hadoop FileSystem class: the API for interacting with
one of Hadoop’s filesystems.
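As a minimal sketch of this API (assuming a reachable cluster; the NameNode address and file path below are illustrative), the following program opens a file in HDFS and copies its contents to standard output.

    import java.io.InputStream;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    // Read a file from HDFS and print it to standard output.
    public class HdfsCat {
        public static void main(String[] args) throws Exception {
            String uri = "hdfs://namenode:8020/user/demo/input.txt"; // illustrative location
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            InputStream in = null;
            try {
                in = fs.open(new Path(uri)); // returns an FSDataInputStream
                IOUtils.copyBytes(in, System.out, 4096, false);
            } finally {
                IOUtils.closeStream(in);
            }
        }
    }

The open() call and the FSDataInputStream it returns are the same objects that appear in the read path described below.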

Reading Operation in HDFS


1. A client initiates a read request by calling the 'open()' method of the FileSystem object; it is an object of type DistributedFileSystem.

2. This object connects to the NameNode using RPC and gets metadata information such as the locations of the blocks of the file. Note that these addresses are of the first few blocks of the file.

3. In response to this metadata request, the addresses of the DataNodes having a copy of each block are returned.

4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream is returned to the client. FSDataInputStream contains DFSInputStream, which takes care of interactions with the DataNodes and the NameNode. In step 4 shown in the above diagram, the client invokes the 'read()' method, which causes DFSInputStream to establish a connection with the first DataNode holding the first block of the file.

5. Data is read in the form of streams, with the client invoking the 'read()' method repeatedly. This read() process continues until it reaches the end of the block.

6. Once the end of a block is reached, DFSInputStream closes the connection and moves on to locate the next DataNode for the next block.


7. Once the client is done with reading, it calls the close() method.

Writing Operation in HDFS

1. A client initiates the write operation by calling the 'create()' method of the DistributedFileSystem object, which creates a new file (step 1 in the above diagram).

2. The DistributedFileSystem object connects to the NameNode using an RPC call and initiates new file creation. However, this file create operation does not associate any blocks with the file. It is the responsibility of the NameNode to verify that the file being created does not already exist and that the client has the correct permissions to create a new file. If the file already exists or the client does not have sufficient permission to create a new file, an IOException is thrown to the client. Otherwise, the operation succeeds and a new record for the file is created by the NameNode.


3. Once the new record is created in the NameNode, an object of type FSDataOutputStream is returned to the client. The client uses it to write data into HDFS. The data write method is invoked (step 3 in the diagram).

4. FSDataOutputStream contains a DFSOutputStream object, which looks after communication with the DataNodes and the NameNode. While the client continues writing data, DFSOutputStream keeps creating packets from this data. These packets are enqueued into a queue called the DataQueue.

5. There is one more component called DataStreamer which consumes this DataQueue.
DataStreamer also asks NameNode for allocation of new blocks thereby picking
desirable DataNodes to be used for replication.

6. Now, the process of replication starts by creating a pipeline using DataNodes. In our
case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the
pipeline.

7. The DataStreamer pours packets into the first DataNode in the pipeline.

8. Every DataNode in a pipeline stores packet received by it and forwards the same to the
second DataNode in a pipeline.

9. Another queue, 'Ack Queue', is maintained by DFSOutputStream to store packets which are waiting for acknowledgment from DataNodes.

10. Once acknowledgment for a packet in the queue is received from all DataNodes in the
pipeline, it is removed from the ‘Ack Queue’. In the event of any DataNode failure,
packets from this queue are used to reinitiate the operation.

11. After the client is done writing data, it calls the close() method (step 9 in the diagram). The call to close() flushes the remaining data packets to the pipeline and then waits for acknowledgment.

12. Once the final acknowledgment is received, the NameNode is contacted to tell it that the file write operation is complete.
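A minimal write-side sketch, mirroring the read example earlier (the path is again illustrative): create() asks the NameNode to create the file, writes flow through the DataNode pipeline, and close() flushes the remaining packets and waits for acknowledgments.

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Create a new HDFS file and write a line of text into it.
    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            String uri = "hdfs://namenode:8020/user/demo/output.txt"; // illustrative location
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(URI.create(uri), conf);
            FSDataOutputStream out = fs.create(new Path(uri)); // NameNode records the new file
            try {
                out.writeBytes("hello hdfs\n"); // data is packaged into packets for the pipeline
            } finally {
                out.close(); // flush remaining packets and wait for acknowledgments
            }
        }
    }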

DEVELOPING A MAPREDUCE APPLICATION

Data stored in HDFS also needs to be processed. Map Reduce is used to process the data.


Map Reduce:

It is a programming paradigm for processing Big Data in parallel on clusters of commodity hardware with reliability and fault tolerance.

The pattern is about breaking a problem down into small pieces of work. Map, Reduce and Shuffle are the three basic operations of Map Reduce.

 Map: This takes the input data and converts it into a set of data where each line of input is broken down into a key-value pair (tuple).
 Reduce: This task takes the output of the Map phase as its input and combines (aggregates) the data tuples into smaller sets based on keys.
 Shuffle: This is the process of transferring the data from the mappers to the reducers.
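As an illustration of these operations, here is a standard word-count sketch using the org.apache.hadoop.mapreduce API (it is not an example from these notes): the mapper emits (word, 1) pairs and the reducer sums the counts for each word delivered to it after the shuffle.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: break each input line into words and emit (word, 1) for every word.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce: all counts for one word arrive together after shuffle and sort; add them up.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }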

HOW MAPREDUCE WORKS

In Hadoop, Map Reduce works by breaking the data processing into two phases: Map
phase and Reduce phase. The map is the first phase of processing, where we specify all the
complex logic/business rules/costly code. Reduce is the second phase of processing, where
we specify light-weight processing like aggregation/summation.

1. Input Files
The data for a Map Reduce task is stored in input files, and input files typically live in HDFS. The format of these files is arbitrary; line-based log files and binary formats can also be used.
2. Input Format


Now, Input Format defines how these input files are split and read. It selects the
files or other objects that are used for input. Input Format creates Input Split.
3. Input Splits
Input splits are created by the InputFormat and logically represent the data that will be processed by an individual Mapper (the mapper is described below). One map task is created for each split; thus the number of map tasks equals the number of InputSplits. The split is divided into records, and each record is processed by the mapper.
4. RecordReader
It communicates with the InputSplit in Hadoop MapReduce and converts the data into key-value pairs suitable for reading by the mapper. By default, it uses TextInputFormat for converting data into key-value pairs. The RecordReader communicates with the InputSplit until the file reading is completed. It assigns a byte offset (a unique number) to each line present in the file. These key-value pairs are then sent to the mapper for further processing.
5. Mapper
It processes each input record (from the RecordReader) and generates a new key-value pair; this key-value pair generated by the Mapper can be completely different from the input pair. The output of the Mapper, also known as the intermediate output, is written to the local disk. The output of the Mapper is not stored on HDFS, as this is temporary data and writing it to HDFS would create unnecessary copies (HDFS is also a high-latency system). The mapper's output is passed to the combiner for further processing.
6. Combiner
The combiner is also known as ‘Mini-reducer’. Hadoop MapReduce Combiner
performs local aggregation on the mappers’ output, which helps to minimize the data
transfer between mapper and reducer. Once the combiner functionality is executed, the
output is then passed to the partitioner for further work.
7. Partitioner
In Hadoop MapReduce, the Partitioner comes into the picture if we are working with more than one reducer (for a single reducer, the partitioner is not used). The partitioner takes the output from the combiners and performs partitioning. The output is partitioned on the basis of the key and then sorted. A hash function applied to the key (or a subset of the key) is used to derive the partition.


According to the key value in MapReduce, each combiner output is partitioned, a record having the same key value goes into the same partition, and each partition is then sent to a reducer. Partitioning allows even distribution of the map output over the reducers.
8. Shuffling and Sorting
Now the output is shuffled to the reduce node (a normal worker node on which the reduce phase will run, hence called the reducer node). Shuffling is the physical movement of data over the network. Once all the mappers have finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted, and is then provided as input to the reduce phase.
9. Reducer
It takes the set of intermediate key-value pairs produced by the mappers as the input
and then runs a reducer function on each of them to generate the output. The output of
the reducer is the final output, which is stored in HDFS.
10. RecordWriter
It writes the output key-value pairs from the Reducer phase to the output files.
11. OutputFormat
The way these output key-value pairs are written into the output files by the RecordWriter is determined by the OutputFormat. OutputFormat instances provided by Hadoop are used to write files in HDFS or on the local disk. Thus the final output of the reducer is written to HDFS by an OutputFormat instance. This is how a Hadoop MapReduce job works over the cluster.
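A driver class ties the stages above together by configuring the job and submitting it to the cluster. The sketch below reuses the hypothetical WordCountMapper and WordCountReducer classes from the earlier example; the input and output paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver: sets the mapper, combiner, reducer, key/value types and paths, then submits the job.
    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // local aggregation of map output
            job.setReducerClass(WordCountReducer.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path("/user/demo/input"));    // illustrative path
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // must not already exist

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }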


ANATOMY OF A MAPREDUCE JOB RUN


The entities involved in running a job are:
 The client, which submits the MapReduce job.
 The jobtracker, which coordinates the job run. The jobtracker is a Java
application whose main class is JobTracker.
 The tasktrackers, which run the tasks that the job has been split into.
Tasktrackers are Java applications whose main class is TaskTracker.
 The distributed filesystem, which is used for sharing job files between the
other entities.

Job Submission

The runJob() method on JobClient is a convenience method that creates a new JobClient instance and calls submitJob() on it. Having submitted the job, runJob() polls the job's progress once a second and reports the progress to the console if it has changed since the last report. When the job is complete, if it was successful, the job counters are displayed. Otherwise, the error that caused the job to fail is logged to the console.


The job submission process implemented by JobClient's submitJob() method does the following:
 Asks the jobtracker for a new job ID (by calling getNewJobId() on
JobTracker) (step 2).
 Checks the output specification of the job. For example, if the output
directory has not been specified or it already exists, the job is not
submitted and an error is thrown to the MapReduce program.
 Computes the input splits for the job. If the splits cannot be computed, because
the input paths don’t exist, for example, then the job is not submitted and
an error is thrown to the MapReduce program.
 Copies the resources needed to run the job, including the job JAR file, the
configuration file, and the computed input splits, to the jobtracker’s
filesystem in a directory named after the job ID. The job JAR is copied with a
high replication factor (controlled by the mapred.submit.replication property,
which defaults to 10) so that there are lots of copies across the cluster for the
tasktrackers to access when they run tasks for the job (step 3).
 Tells the jobtracker that the job is ready for execution (by calling
submitJob() on JobTracker) (step 4).

Job Initialization

When the JobTracker receives a call to its submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it up and initialize it. Initialization involves
creating an object to represent the job being run, which encapsulates its tasks, and
bookkeeping information to keep track of the tasks’ status and progress (step 5).

To create the list of tasks to run, the job scheduler first retrieves the input splits
computed by the JobClient from the shared filesystem (step 6). It then creates one map task for
each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks
property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.

Task Assignment

Tasktrackers run a simple loop that periodically sends heartbeat method calls to the jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a channel for messages. As a part of the heartbeat, a tasktracker will indicate whether it is ready to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to the tasktracker using the heartbeat return value (step 7).

Before it can choose a task for the tasktracker, the jobtracker must choose a job to select
the task from. There are various scheduling algorithms as explained later in this chapter (see
“Job Scheduling”), but the default one simply maintains a priority list of jobs. Having chosen
a job, the jobtracker now chooses a task for the job.

Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for
example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
(The precise number depends on the number of cores and the amount of memory on the
tasktracker; see “Memory” ) The default scheduler fills empty map task slots before reduce
task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a
map task; otherwise, it will select a reduce task.

To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run
reduce tasks, since there are no data locality considerations. For a map task, however, it takes
account of the tasktracker’s network location and picks a task whose input split is as close as
possible to the tasktracker. In the optimal case, the task is data-local, that is, running on the
same node that the split resides on. Alternatively, the task may be rack-local: on the same rack,
but not the same node, as the split. Some tasks are neither data-local nor rack-local and retrieve
their data from a different rack from the one theyare running on. You can tell the proportion of
each type of task by looking at a job’s counters .

Task Execution

Now that the tasktracker has been assigned a task, the next step is for it to run the task.
First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s
filesystem. It also copies any files needed from the distributed cache by the application to the
local disk; see “Distributed Cache” (step 8). Second, it creates a local working directory for the
task, and un-jars the contents of the JAR into this directory. Third, it creates an instance of
TaskRunner to run the task.

TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10), so that any bugs in the user-defined map and reduce functions don't affect the tasktracker (by causing it to crash or hang, for example). It is, however, possible to reuse the JVM between tasks; see "Task JVM Reuse".

The child process communicates with its parent through the umbilical interface. This
way it informs the parent of the task’s progress every few seconds until the task is complete.

Job Completion

When the jobtracker receives a notification that the last task for a job is complete, it
changes the status for the job to “successful.” Then, when the JobClient polls for status, it learns
that the job has completed successfully, so it prints a message to tell the user and then returns
from the runJob() method.

FAILURES

In the real world, user code is buggy, processes crash, and machines fail. One of the major benefits of using Hadoop is its ability to handle such failures and allow your job to complete.

Task Failure

Consider first the case of the child task failing. The most common way that this happens is when user code in the map or reduce task throws a runtime exception. If this happens, the child JVM reports the error back to its parent tasktracker before it exits. The error ultimately makes it into the user logs. The tasktracker marks the task attempt as failed, freeing up a slot to run another task.

For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is
marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property (the
default is true).

Another failure mode is the sudden exit of the child JVM: perhaps there is a JVM bug that causes the JVM to exit under a particular set of circumstances exposed by the MapReduce user code. In this case, the tasktracker notices that the process has exited and marks the attempt as failed.

Hanging tasks are dealt with differently. The tasktracker notices that it hasn’t
received a progress update for a while and proceeds to mark the task as failed. The child JVM
process will be automatically killed after this period. The timeout period after which tasks are
considered failed is normally 10 minutes and can be configured on a per-job basis (or a
cluster basis) by setting the mapred.task.timeout property to a value in milliseconds.

Hadoop
Module III: Big Data Analytics 23

Setting the timeout to a value of zero disables the timeout, so long-running tasks are
never marked as failed. In this case, a hanging task will never free up its slot, and over time
there may be cluster slowdown as a result. This approach should therefore be avoided, and
making sure that a task is reporting progress periodically will suffice (see “What Constitutes
Progress in MapReduce?” ).

When the jobtracker is notified of a task attempt that has failed (by the
tasktracker’s heartbeat call), it will reschedule execution of the task. The jobtracker will
try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore,
if a task fails four times (or more), it will not be retried further. This value is configurable: the maximum number of attempts to run a task is controlled by the mapred.map.max.attempts property for map tasks and mapred.reduce.max.attempts for reduce tasks. By default, if any
task fails four times (or whatever the maximum number of attempts is configured to), the whole
job fails.

For some applications, it is undesirable to abort the job if a few tasks fail, as it may be
possible to use the results of the job despite some failures. In this case, the maximum percentage
of tasks that are allowed to fail without triggering job failure can be set for the job. Map
tasks and reduce tasks are controlled independently, using the
mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.

If a Streaming process hangs, the tasktracker does not try to kill it (although the JVM
that launched it will be killed), so you should take precautions to monitor for this scenario,
and kill orphaned processes by some other means.

A task attempt may also be killed, which is different from it failing. A task attempt may be killed because it is a speculative duplicate (for more, see "Speculative Execution"), or because the tasktracker it was running on failed and the jobtracker marked all the task attempts running on it as killed. Killed task attempts do not count against the number of attempts to run the task (as set by mapred.map.max.attempts and mapred.reduce.max.attempts), since it wasn't the task's fault that an attempt was killed. Users may also kill or fail task attempts using the web UI or the command line (type hadoop job to see the options). Jobs may also be killed by the same mechanisms.

Tasktracker Failure

Failure of a tasktracker is another failure mode. If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has stopped sending heartbeats (if it hasn't received one for 10 minutes, configured via the mapred.tasktracker.expiry.interval property, in milliseconds) and remove it from its pool of tasktrackers to schedule tasks on. The jobtracker arranges for map tasks that were run and completed successfully on that tasktracker to be rerun if they belong to incomplete jobs, since their intermediate output residing on the failed tasktracker's local filesystem may not be accessible to the reduce task. Any tasks in progress are also rescheduled.

A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not
failed. A tasktracker is blacklisted if the number of tasks that have failed on it is significantly
higher than the average task failure rate on the cluster. Blacklisted tasktrackers can be
restarted to remove them from the jobtracker’s blacklist.

Jobtracker Failure

Failure of the jobtracker is the most serious failure mode. Currently, Hadoop has no mechanism for dealing with failure of the jobtracker: it is a single point of failure, so in this case the job fails. However, this failure mode has a low chance of occurring, since the chance of a particular machine failing is low. It is possible that a future release of Hadoop will remove this limitation by running multiple jobtrackers, only one of which is the primary jobtracker at any time.

JOB SCHEDULING

Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they
ran in order of submission, using a FIFO scheduler. Typically, each job would use the
whole cluster, so jobs had to wait their turn. Although a shared cluster offers great potential
for offering large resources to many users, the problem of sharing resources fairly
between users requires a better scheduler. Production jobs need to complete in a timely
manner, while allowing users who are making smaller ad hoc queries to get results back in a
reasonable time.

Later on, the ability to set a job's priority was added, via the mapred.job.priority property or the setJobPriority() method on JobClient (both of which take one of the values VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW). When the job scheduler is choosing the next job to run, it selects the one with the highest priority. However, with the FIFO scheduler, priorities do not support preemption, so a high-priority job can still be blocked by a long-running, low-priority job that started before the high-priority job was scheduled.

Map Reduce in Hadoop comes with a choice of schedulers. The default is the
original FIFO queue-based scheduler, and there are also multiuser schedulers called the
Fair Scheduler and the Capacity Scheduler.

The Fair Scheduler

The Fair Scheduler aims to give every user a fair share of the cluster capacity over time.
If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots
are given to the jobs in such a way as to give each user a fair share of the cluster.


A short job belonging to one user will complete in a reasonable time even while another
user’s long job is running, and the long job will still make progress. Jobs are placed in pools,
and by default, each user gets their own pool. A user who submits more jobs than a second user
will not get any more cluster resources than the second, on average. It is also possible to define
custom pools with guaranteed minimum capacities defined in terms of the number of map and
reduce slots, and to set weightings for each pool.

The Fair Scheduler supports preemption, so if a pool has not received its fair share for
a certain period of time, then the scheduler will kill tasks in pools running over capacity in
order to give the slots to the pool running under capacity.

The Capacity Scheduler

The Capacity Scheduler takes a slightly different approach to multiuser scheduling. A cluster is made up of a number of queues (like the Fair Scheduler's pools), which may be hierarchical (so a queue may be the child of another queue), and each queue has an allocated capacity. This is like the Fair Scheduler, except that within each queue, jobs are scheduled using FIFO scheduling (with priorities). In effect, the Capacity Scheduler allows users or organizations (defined using queues) to simulate a separate MapReduce cluster with FIFO scheduling for each user or organization. The Fair Scheduler, by contrast (which actually also supports FIFO job scheduling within pools as an option, making it like the Capacity Scheduler), enforces fair sharing within each pool, so running jobs share the pool's resources.


SHUFFLING AND SORTING

Shuffling is the process of transferring the mappers' intermediate output to the reducers. Each reducer gets one or more keys and their associated values, depending on the number of reducers.

The intermediate key-value pairs generated by the mapper are sorted automatically by key. In the sort phase, merging and sorting of the map output take place.

Shuffling and sorting in Hadoop occur simultaneously.

Shuffling in MapReduce

The process of transferring data from the mappers to the reducers is shuffling. It is also the process by which the system performs the sort and then transfers the map output to the reducer as input. This is why the shuffle phase is necessary for the reducers; otherwise, they would not have any input (or input from every mapper). Shuffling can start even before the map phase has finished, which saves some time and completes the tasks in less time.

Sorting in MapReduce

The MapReduce framework automatically sorts the keys generated by the mapper. Thus, before the reducer starts, all intermediate key-value pairs are sorted by key and not by value. The values passed to each reducer are not sorted; they can be in any order.

Sorting in a MapReduce job helps the reducer to easily distinguish when a new reduce task should start. This saves time for the reducer: the reducer starts a new reduce task when the next key in the sorted input data is different from the previous one. Each reduce task takes key-value pairs as input and generates key-value pairs as output.

The important thing to note is that shuffling and sorting in Hadoop MapReduce do not take place at all if you specify zero reducers (setNumReduceTasks(0)), as in the sketch below. If the number of reducers is zero, the MapReduce job stops at the map phase, and the map phase does not include any kind of sorting (so the map phase is faster).
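A minimal sketch of such a map-only configuration using the Job API (the mapper class is the hypothetical one from the earlier word-count example):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    // Map-only job: with zero reduce tasks there is no shuffle or sort, and the
    // mappers' output is written directly to the output path by the OutputFormat.
    public class MapOnlyJobExample {
        public static Job configure() throws Exception {
            Job job = Job.getInstance(new Configuration(), "map-only example");
            job.setMapperClass(WordCountMapper.class); // illustrative mapper
            job.setNumReduceTasks(0);                  // disables shuffle, sort and the reduce phase
            return job;
        }
    }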


TASK EXECUTION

The steps below describe how a task is executed.

 The task tracker copies the job JAR file from the shared file system (HDFS).
 The task tracker creates a local working directory and un-jars the JAR file into the local filesystem.
 The task tracker creates an instance of TaskRunner.
 The task tracker starts TaskRunner in a new JVM to run the map or reduce task.
 The child process communicates its progress to the parent process.
 Each task can perform setup and cleanup actions based on the OutputCommitter.
 Input is provided via stdin and output is collected via stdout from the running process, even when the map or reduce task is run via pipes or sockets in the case of Streaming.

MapReduce Types

Mapping is the core technique of processing a list of data elements that come in
pairs of keys and values. The map function applies to individual elements defined as key-value
pairs of a list and produces a new list.

The general idea of map and reduce function of Hadoop can be illustrated as follows:

map: (K1, V1) -> list (K2, V2)

reduce: (K2, list(V2)) -> list (K3, V3)

The input parameters of the key and value pair, represented by K1 and
V1 respectively, are different from the output pair type: K2 and V2. The reduce function
accepts the same format output by the map, but the type of output again of the reduce
operation is different: K3 and V3. The Java API for this is as follows:

public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {

    void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
        throws IOException;
}

public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {

    void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
        throws IOException;
}

The OutputCollector is the generalized interface of the Map-Reduce framework to facilitate collection of data output either by the Mapper or the Reducer. These outputs are nothing but intermediate output of the job. Therefore, they must be parameterized with their types. The Reporter facilitates the Map-Reduce application to report progress and update counters and status information. If, however, the combine function is used, it has the same form as the reduce function and the output is fed to the reduce function. This may be illustrated as follows:

map: (K1, V1) -> list (K2, V2)

combine: (K2, list(V2)) -> list (K2, V2)

reduce: (K2, list(V2)) -> list (K3, V3)

Note that the combine and reduce functions use the same type, except in
the variable names where K3 is K2 and V3 is V2.

The partition function operates on the intermediate key-value types. It controls the partitioning of the keys of the intermediate map outputs. The key derives the partition using a typical hash function. The total number of partitions is the same as the number of reduce tasks for the job. The partition is determined only by the key, ignoring the value.

public interface Partitioner<K2, V2> extends JobConfigurable {

    int getPartition(K2 key, V2 value, int numberOfPartitions);
}

This is the key essence of MapReduce types in short.
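As a small sketch of the partitioning contract (using the newer org.apache.hadoop.mapreduce API rather than the interface above; the key and value types are the ones from the word-count example), the partitioner below reproduces the behaviour of Hadoop's default HashPartitioner:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Route each key to a partition (and hence to a reduce task) based on its hash.
    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numPartitions) {
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

It would be registered in the driver with job.setPartitionerClass(WordPartitioner.class).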

Hadoop Input Formats

Hadoop InputFormat describes the input specification for the execution of a Map-Reduce job. InputFormat describes how to split up and read input files. In MapReduce job execution, InputFormat is the first step. It is also responsible for creating the input splits and dividing them into records.

Input files store the data for a MapReduce job and reside in HDFS. Although the format of these files is arbitrary, line-based log files and binary formats can also be used. Hence, in MapReduce, the InputFormat class is one of the fundamental classes and provides the functionality below:

 InputFormat selects the files or other objects for input.
 It also defines the data splits. It defines both the size of individual map tasks and their potential execution server.
 Hadoop InputFormat defines the RecordReader, which is responsible for reading actual records from the input files.

Types of InputFormat in MapReduce:

There are different types of MapReduce InputFormat in Hadoop, used for different purposes. Let's discuss the Hadoop InputFormat types below:

1. FileInputFormat

It is the base class for all file-based InputFormats. FileInputFormat also specifies the input directory that contains the data files. When we start a MapReduce job execution, FileInputFormat provides a path containing the files to read. This InputFormat reads all the files and then divides them into one or more InputSplits.

2. TextInputFormat

It is the default InputFormat. This InputFormat treats each line of each input file as a separate record and performs no parsing. TextInputFormat is useful for unformatted data or line-based records like log files. Hence,

Key – It is the byte offset of the beginning of the line within the file (not within the whole file, but within one split), so it is unique only when combined with the file name.

Value – It is the contents of the line, excluding line terminators.

3. KeyValueTextInputFormat


It is similar to TextInputFormat. This InputFormat also treats each line of input as a separate record. The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat breaks the line itself into key and value at a tab character ('\t'). Hence,

Key – Everything up to the tab character.

Value – The remaining part of the line after the tab character.

4. SequenceFileInputFormat

It is an InputFormat that reads sequence files. Sequence files are binary files that store sequences of binary key-value pairs. They are block-compressed and provide direct serialization and deserialization of arbitrary data types. Hence,

Key & Value – Both are user-defined.

5. SequenceFileAsTextInputFormat

It is a variant of SequenceFileInputFormat. This format converts the sequence file keys and values to Text objects; it performs the conversion by calling 'toString()' on the keys and values. Hence, SequenceFileAsTextInputFormat makes sequence files suitable input for Streaming.

6. SequenceFileAsBinaryInputFormat

It is a variant of SequenceFileInputFormat with which we can retrieve the sequence file's keys and values as opaque binary objects.

7. NlineInputFormat

It is another form of TextInputFormat where the keys are the byte offsets of the lines and the values are the contents of the lines. With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and on the length of the lines. If we want our mapper to receive a fixed number of lines of input, we use NLineInputFormat.

N – It is the number of lines of input that each mapper receives.


By default (N=1), each mapper receives exactly one line of input. If N=2, each split contains two lines: one mapper receives the first two key-value pairs and another mapper receives the next two key-value pairs.

8. DBInputFormat

This InputFormat reads data from a relational database, using JDBC. It is best suited to loading relatively small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs. Hence,

Key – LongWritables
Value – DBWritables.
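A small driver fragment, sketched here as an illustration (the path and line count are assumptions), shows how one of these InputFormats is selected for a job; TextInputFormat is used when nothing is set.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    // Choosing an InputFormat in the driver.
    public class InputFormatExamples {
        public static void configure(Job job) throws Exception {
            FileInputFormat.addInputPath(job, new Path("/user/demo/input")); // illustrative path

            // Treat everything before the first tab on each line as the key:
            job.setInputFormatClass(KeyValueTextInputFormat.class);

            // Or give every mapper a fixed number of input lines instead:
            // job.setInputFormatClass(NLineInputFormat.class);
            // NLineInputFormat.setNumLinesPerSplit(job, 100);
        }
    }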

Output Formats

The OutputFormat checks the output specification for the execution of the Map-Reduce job. It describes how a RecordWriter implementation is used to write output to the output files.

Hadoop Output Format:

The RecordWriter takes output data from Reducer. Then it writes this data to
output files. OutputFormat determines the way these output key-value pairs are written in
output files by RecordWriter.

The OutputFormat and InputFormat functions are similar. OutputFormat instances are used to write to files on the local disk or in HDFS. During MapReduce job execution, on the basis of the output specification:

 The Hadoop MapReduce job checks that the output directory does not already exist.
 The OutputFormat of the MapReduce job provides the RecordWriter implementation to be used to write the output files of the job. The output files are then stored in a FileSystem.

The framework uses the FileOutputFormat.setOutputPath() method to set the output directory.

Types of OutputFormat in MapReduce:

There are various types of OutputFormat which are as follows:

1. TextOutputFormat


The default OutputFormat is TextOutputFormat. It writes (key, value) pairs on individual lines of text files. Its keys and values can be of any type, because TextOutputFormat turns them into strings by calling toString() on them.

It separates each key-value pair with a tab character; this can be changed using the mapreduce.output.textoutputformat.separator property.

KeyValueTextInputFormat can be used to read these output text files back.

2. SequenceFileOutputFormat

This OutputFormat writes sequence files as its output. It is also a useful intermediate format for passing data between MapReduce jobs, since it serializes arbitrary data types to the file and the corresponding SequenceFileInputFormat will deserialize the file into the same types. It presents the data to the next mapper in the same manner as it was emitted by the previous reducer. Static methods also control the compression.

3. SequenceFileAsBinaryOutputFormat

It is a variant of SequenceFileOutputFormat that writes keys and values to a sequence file in binary format.

4. MapFileOutputFormat

It is another form of FileOutputFormat that writes the output as map files. The keys in a MapFile must be added in order, so we need to ensure that the reducer emits keys in sorted order.

5. MultipleOutputs

This format allows writing data to files whose names are derived from the output keys
and values.

6. LazyOutputFormat

In MapReduce job execution, FileOutputFormat sometimes creates output files even if they are empty. LazyOutputFormat is a wrapper OutputFormat that ensures the output file is created only when the first record is emitted for a given partition.

7. DBOutputFormat


It is the OutputFormat for writing to relational databases and to HBase. This format sends the reduce output to a SQL table. It accepts key-value pairs in which the key has a type extending DBWritable.
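A matching driver fragment for the output side (again only a sketch with illustrative paths; TextOutputFormat is used when nothing is set):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Choosing an OutputFormat in the driver.
    public class OutputFormatExamples {
        public static void configure(Job job) throws Exception {
            FileOutputFormat.setOutputPath(job, new Path("/user/demo/output")); // must not already exist
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Write binary sequence files instead of plain text:
            job.setOutputFormatClass(SequenceFileOutputFormat.class);

            // Or wrap an output format so that empty output files are not created:
            // LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
        }
    }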

Features of MapReduce

1. Scalability

Apache Hadoop is a highly scalable framework because of its ability to store and distribute huge data across plenty of servers. All these servers are inexpensive and can operate in parallel. We can easily scale the storage and computation power by adding servers to the cluster.

Hadoop MapReduce programming enables organizations to run applications using large sets of nodes, which can involve thousands of terabytes of data.

2. Flexibility

MapReduce programming enables companies to access new sources of data and to operate on different types of data. It allows enterprises to access structured as well as unstructured data, and to derive significant value by gaining insights from multiple sources of data.

Additionally, the MapReduce framework provides support for multiple languages and for data from sources ranging from email and social media to clickstreams.

MapReduce processes data as simple key-value pairs and thus supports data types including metadata, images, and large files. Hence, MapReduce is more flexible in dealing with data than a traditional DBMS.

3. Security and Authentication

The MapReduce programming model uses the HBase and HDFS security platforms, which allow only authenticated users to operate on the data. Thus, it protects system data from unauthorized access and enhances system security.

4. Cost-effective solution


Hadoop's scalable architecture with the MapReduce programming framework allows the storage and processing of large data sets in a very affordable manner.

5. Fast

Hadoop uses a distributed storage method, the Hadoop Distributed File System, which basically implements a mapping system for locating data in a cluster.

The tools used for data processing, such as MapReduce programs, are generally located on the very same servers as the data, which allows for faster processing.

So, even when dealing with large volumes of unstructured data, Hadoop MapReduce takes just minutes to process terabytes of data, and it can process petabytes of data in just an hour.

6. Simple model of programming

Amongst the various features of Hadoop MapReduce, one of the most important is that it is based on a simple programming model. Basically, this allows programmers to develop MapReduce programs that can handle tasks easily and efficiently.

The MapReduce programs can be written in Java, which is not very hard to pick
up and is also used widely. So, anyone can easily learn and write MapReduce programs and
meet their data processing needs.

7. Parallel Programming

One of the major aspects of the working of MapReduce programming is its parallel processing. It divides the tasks in a manner that allows their execution in parallel: the parallel processing allows multiple processors to execute these divided tasks, so the entire program is run in less time.

8. Availability and resilient nature

Whenever the data is sent to an individual node, the same set of data is forwarded
to some other nodes in a cluster. So, if any particular node suffers from a failure, then there are
always other copies present on other nodes that can still be accessed whenever needed. This
assures high availability of data.

One of the major features offered by Apache Hadoop is its fault tolerance. The Hadoop MapReduce framework has the ability to quickly recognize faults that occur.


It then applies a quick and automatic recovery solution. This feature makes it a
game-changer in the world of big data processing.
