Module III Note
Module III
Hadoop
Hadoop is an open-source software framework for storing data and running
applications on clusters of commodity hardware. It provides massive storage for any kind of
data, enormous processing power and the ability to handle virtually limitless concurrent tasks
or jobs.
History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002, when they both began
working on the Apache Nutch project. The Apache Nutch project aimed to build a search
engine system that could index 1 billion pages. After a lot of research on Nutch, they
concluded that such a system would cost around half a million dollars in hardware, along
with a monthly running cost of approximately $30,000, which is very expensive. They
realized that their project architecture would not be capable of handling billions of pages on
the web, so they looked for a feasible solution that could reduce the implementation cost as
well as solve the problem of storing and processing large datasets.
In 2003, they came across a paper published by Google that described the architecture of
Google's distributed file system, called GFS (Google File System), for storing large data
sets. They realized that this paper could solve their problem of storing the very large files
being generated by the web crawling and indexing processes. But this paper was only half
the solution to their problem.
In 2004, Google published one more paper, on the technique MapReduce, which was the
solution for processing those large datasets. This paper was the other half of the solution for
Doug Cutting and Mike Cafarella for their Nutch project. Both techniques (GFS &
MapReduce) existed only as papers; Google had not released an open-source
implementation of either. Doug Cutting knew from his work on Apache Lucene (a free and
open-source information retrieval software library, originally written in Java by Doug Cutting
in 1999) that open source is a great way to spread technology to more people. So, together
with Mike Cafarella, he started implementing Google's techniques (GFS & MapReduce) as
open source in the Apache Nutch project.
In 2005, Cutting found that Nutch was limited to clusters of only 20 to 40 nodes. He soon
realized two problems:
(a) Nutch wouldn't achieve its potential until it ran reliably on larger clusters.
(b) That looked impossible with just two people (Doug Cutting & Mike Cafarella).
The engineering task in the Nutch project was much bigger than he had realized, so he
started looking for a company interested in investing in their efforts. He found Yahoo, which
had a large team of engineers eager to work on the project.
So in 2006, Doug Cutting joined Yahoo along with the Nutch project. He wanted to provide
the world with an open-source, reliable, scalable computing framework, with the help of
Yahoo. So at Yahoo he first separated the distributed computing parts from Nutch and
formed a new project, Hadoop. (He gave it the name Hadoop after a yellow toy elephant
owned by his son; the name was easy to pronounce and unique.) He then wanted to build
Hadoop so that it could work well on thousands of nodes, and with GFS and MapReduce as
the model, he started work on Hadoop.
In 2007, Yahoo successfully tested Hadoop on a 1000 node cluster and started using it. In
January 2008, Yahoo released Hadoop as an open source project to the ASF (Apache
Software Foundation).
And in July of 2008, Apache Software Foundation successfully tested a 4000 node
cluster with Hadoop.
In 2009, Hadoop was successfully tested to sort a petabyte (PB) of data in less than 17
hours, handling billions of searches and indexing millions of web pages. Doug Cutting then
left Yahoo and joined Cloudera to take on the challenge of spreading Hadoop to other
industries.
HDFS is an effective, scalable, fault tolerant and distributed approach for storing and
managing huge volumes of data.
The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. HDFS employs a Name Node and Data Node architecture to
implement a distributed file system that provides high-performance access to data across
highly scalable Hadoop clusters.
Hadoop itself is an open source distributed processing framework that manages data
processing and storage for big data applications. It provides a reliable means for managing
pools of big data and supporting related big data analytics applications.
WORKING OF HDFS:
HDFS enables the rapid transfer of data between compute nodes. At its outset, it was
closely coupled with Map Reduce, a framework for data processing that filters and divides
up work among the nodes in a cluster and then organizes and condenses the results into a
cohesive answer to a query. Similarly, when HDFS takes in data, it breaks the information
down into separate blocks and distributes them to different nodes in a cluster.
With HDFS, data is written on the server once, and read and reused numerous times
after that. HDFS has a primary Name Node, which keeps track of where file data is kept in the
cluster.
HDFS also has multiple Data Nodes on a commodity hardware cluster -- typically one
per node in a cluster. The Data Nodes are generally organized within the same rack in the data
centre. Data is broken down into separate blocks and distributed among the various Data Nodes
for storage. Blocks are also replicated across nodes, enabling highly efficient parallel
processing.
The Name Node knows which Data Node contains which blocks and where the Data
Nodes reside within the machine cluster. The Name Node also manages access to the
files, including reads, writes, creates, deletes and the data block replication across the Data
Nodes.
The Name Node operates in conjunction with the Data Nodes. As a result, the cluster
can dynamically adapt to server capacity demand in real time by adding or subtracting nodes
as necessary.
The DataNodes are in constant communication with the NameNode to determine if the
DataNodes need to complete specific tasks. Consequently, the NameNode is always aware of
the status of each DataNode. If the NameNode realizes that one DataNode isn't working
properly, it can immediately reassign that DataNode's task to a different node containing
the same data block. DataNodes also communicate with each other, which enables them
to cooperate during normal file operations.
The NameNode is the primary server that manages the file system namespace and controls
client access to files. As the central component of the Hadoop Distributed File System, the
NameNode maintains and manages the file system namespace and provides clients with
the right access permissions.
The system's DataNodes manage the storage that's attached to the nodes they run on.
HDFS exposes a file system namespace and enables user data to be stored in files. A
file is split into one or more of the blocks that are stored in a set of DataNodes. The NameNode
performs file system namespace operations, including opening, closing and renaming files
and directories. The NameNode also governs the mapping of blocks to the DataNodes.
The DataNodes serve read and write requests from the clients of the file system. In
addition, they perform block creation, deletion and replication when the NameNode instructs
them to do so.
The NameNode records any change to the file system namespace or its properties. An
application can stipulate the number of replicas of a file that the HDFS should maintain. The
NameNode stores the number of copies of a file, called the replication factor of that file.
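As a small illustration, the replication factor of an existing file can also be changed programmatically through the standard FileSystem API; the sketch below assumes a configured Hadoop client on the classpath, and the file path is only illustrative. The cluster-wide default replication factor normally comes from the dfs.replication property in hdfs-site.xml.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SetReplicationExample {
    public static void main(String[] args) throws Exception {
        // Load the cluster configuration (core-site.xml / hdfs-site.xml on the classpath).
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Ask HDFS to keep 3 copies of this (illustrative) file.
        Path file = new Path("/data/example.txt");
        boolean changed = fs.setReplication(file, (short) 3);
        System.out.println("Replication factor updated: " + changed);

        fs.close();
    }
}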
COMPONENTS OF HADOOP
1. HDFS: Hadoop Distributed File System is a dedicated file system to store big data with a
cluster of commodity hardware or cheaper hardware with streaming access pattern. It enables
data to be stored at multiple nodes in the cluster which ensures data security and fault
tolerance.
2. Map Reduce: Data once stored in HDFS also needs to be processed. Suppose a query
is sent to process a data set in HDFS. Hadoop first identifies where the relevant data is
stored; this is called mapping. The query is then broken into multiple parts, and the results
of all these parts are combined and the overall result is sent back to the user; this is called
the reduce process. Thus, while HDFS is used to store the data, Map Reduce is used to
process the data.
3. YARN: YARN stands for Yet Another Resource Negotiator. It is a dedicated operating
system for Hadoop which manages the resources of the cluster and also functions as a
framework for job scheduling in Hadoop. The various types of scheduling are First Come
First Serve, the Fair Share Scheduler and the Capacity Scheduler. First Come First Serve
scheduling is set by default in YARN.
Map Reduce
Map Reduce is a programming model, built on top of the YARN framework, for breaking a
problem into small pieces of work. The primary feature of Map Reduce is to perform
distributed processing in parallel in a Hadoop cluster, which is what makes Hadoop so fast;
when we are dealing with Big Data, serial processing is no longer of any use.
Features of Map-Reduce:
• Scalable
• Fault Tolerance
• Parallel Processing
• Tuneable Replication
• Load Balancing
Apache Hive
Apache Hive is a data warehousing tool built on top of Hadoop; data warehousing here
simply means storing, at one fixed location, data generated from various sources. Hive is
one of the best tools for data analysis on Hadoop, and anyone who knows SQL can
comfortably use it. The query language of Hive is known as HQL or HiveQL.
Features of Hive:
• Hive supports different storage types such as HBase, ORC, plain text, etc.
Apache Mahout
The name Mahout is taken from the Hindi word "mahavat", which means elephant rider.
Apache Mahout runs its algorithms on top of Hadoop, hence the name. Mahout is mainly
used for implementing various machine learning algorithms on Hadoop, such as
classification, collaborative filtering and recommendation. Apache Mahout can also run its
machine learning algorithms standalone, without Hadoop integration.
Features of Mahout:
Apache Pig
Pig was initially developed by Yahoo to make programming easier. Apache Pig can process
extensive datasets because it works on top of Hadoop. Apache Pig is used for analyzing
massive datasets by representing them as data flows, and it raises the level of abstraction
for processing enormous datasets. Pig Latin is the scripting language that developers use
for working on the Pig framework, which runs on Pig runtime.
Features of Pig:
• Easy to program
• Extensibility
HBase
Features of HBase:
SCALING OUT
Scaling alters the size of a system. In the scaling process, we either compress or expand
the system to meet the expected needs. The scaling operation can be achieved by adding
resources to meet a smaller expectation in the current system, by adding a new system to
the existing one, or both.
1. Vertical Scaling: When new resources are added to the existing system to meet the
expectation, it is known as vertical scaling. Consider a rack of servers and resources
that comprise the existing system. When the existing system fails to meet the
expected needs, and the expected needs can be met by just adding resources, this
is considered vertical scaling. Vertical scaling is based on the idea of adding more
power (CPU, RAM) to existing systems, basically adding more resources. Vertical
scaling is not only easier but also cheaper than horizontal scaling, and it takes less
time to put in place.
2. Horizontal Scaling: When new server racks are added to the existing system to meet
a higher expectation, it is known as horizontal scaling. Consider a rack of servers
and resources that comprise the existing system. When the existing system fails to
meet the expected needs, and the expected needs cannot be met by just adding
resources, we need to add completely new servers; this is considered horizontal
scaling. Horizontal scaling is based on the idea of adding more machines to our pool
of resources. Horizontal scaling is more difficult and also costlier than vertical
scaling, and it takes more time to put in place.
HADOOP STREAMING
Hadoop Streaming is a utility that comes with the Hadoop distribution and allows developers
or programmers to write Map-Reduce programs in different programming languages like
Ruby, Perl, Python, C++, etc. We can use any language that can read from standard input
(STDIN), like keyboard input, and write to standard output (STDOUT). The Hadoop
framework is completely written in Java, but programs for Hadoop do not necessarily need
to be coded in Java. The Hadoop Streaming feature has been available since Hadoop
version 0.14.1.
In that, we have an Input Reader which is responsible for reading the input data and
produces the list of key-value pairs. The input reader contains the complete logic about
the data it is reading. Suppose we want to read an image then we have to specify the logic in
the input reader so that it can read that image data and finally it will generate key-
value pairs for that image data.
If we are reading image data, then we can generate a key-value pair for each pixel, where
the key is the location of the pixel and the value is its colour value (0-255 for a coloured
image). This list of key-value pairs is fed to the Map phase, and the Mapper works on each
of these key-value pairs of each pixel and generates some intermediate key-value pairs,
which are then fed to the Reducer after shuffling and sorting; the final output produced by
the Reducer is written to HDFS. This is how a simple Map-Reduce job works.
Now let's see how we can use different languages like Python, C++ and Ruby with Hadoop
for execution. We can run these arbitrary languages by running them as separate
processes. For that, we create our external mapper and run it as an external, separate
process. These external map processes are not part of the basic MapReduce flow. This
external mapper takes input from STDIN and produces output on STDOUT. As the
key-value pairs are passed to the internal mapper, the internal mapper process sends these
key-value pairs to the external mapper, where we have written our code in some other
language like Python, via STDIN. These external mappers then process the key-value pairs
and generate intermediate key-value pairs, which are sent back to the internal mappers via
STDOUT.
Similarly, the Reducer does the same thing. Once the intermediate key-value pairs have
been processed through the shuffle and sort, they are fed to the internal reducer, which
sends them to an external reducer process, working separately, via STDIN, and gathers the
output generated by the external reducer via STDOUT; finally, the output is stored in HDFS.
This is how Hadoop Streaming works on Hadoop; it is available in Hadoop by default.
DESIGN OF HDFS
HDFS is a file system designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
HDFS is built around the idea that the most efficient data processing pattern is
a write-once, read-many-times pattern. A dataset is typically generated or copied from
source, then various analyses are performed on that dataset over time. Each analysis will
involve a large proportion, if not all, of the dataset, so the time to read the whole dataset is
more important than the latency in reading the first record.
Commodity hardware
Hadoop doesn't require expensive, highly reliable hardware to run on. It's designed to run
on clusters of commodity hardware (commonly available hardware from multiple vendors),
for which the chance of node failure across the cluster is high, at least for large clusters.
HDFS is designed to carry on working without a noticeable interruption to the user in the
face of such failure.
Since the namenode holds filesystem metadata in memory, the limit to the number of
files in a filesystem is governed by the amount of memory on the namenode. As a rule of
thumb, each file, directory, and block takes about 150 bytes. So, for example, if you had one
million files, each taking one block, you would need at least 300 MB of memory (one million
files plus one million blocks, at roughly 150 bytes each). While storing millions of files is
feasible, billions is beyond the capability of current hardware.
Files in HDFS may be written to by a single writer. Writes are always made at the end
of the file. There is no support for multiple writers, or for modifications at arbitrary
offsets in the file. (These might be supported in the future, but they are likely to be relatively
inefficient.)
Hadoop has an abstract notion of file systems, of which HDFS is just one
implementation. The Java abstract class org.apache.hadoop.fs.FileSystem represents the client
interface to a filesystem in Hadoop, and there are several concrete implementations.
Hadoop is written in Java, so most Hadoop file system interactions are mediated through the
Java API. The file system shell, for example, is a Java application that uses the Java
FileSystem class to provide file system operations. By exposing its file system interface as a Java
API, Hadoop makes it awkward for non-Java applications to access HDFS. The HTTP REST
API exposed by the WebHDFS protocol makes it easier for other languages to interact with
HDFS. Note that the HTTP interface is slower than the native Java client, so should be avoided
for very large data transfers if possible.
In this section, we dig into the Hadoop FileSystem class: the API for interacting with
one of Hadoop’s filesystems.
Read Operation in HDFS
1. A client initiates a read request by calling the open() method of the FileSystem object,
which is an object of type DistributedFileSystem.
2. This object connects to the namenode using RPC and gets metadata information such as
the locations of the blocks of the file. Note that these addresses are of the first few
blocks of the file.
3. In response to this metadata request, the addresses of the DataNodes having a copy of
each block are returned.
4. Once the addresses of the DataNodes are received, an object of type FSDataInputStream
is returned to the client. FSDataInputStream contains DFSInputStream, which takes care
of interactions with the DataNodes and the NameNode.
5. Data is read in the form of streams, wherein the client invokes the read() method
repeatedly. This read() operation continues until it reaches the end of the block.
6. Once the end of a block is reached, DFSInputStream closes the connection and moves
on to locate the next DataNode for the next block
7. Once the client is done with the reading, it calls the close() method.
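The same read path can be driven from client code through the FileSystem API described earlier in this note. The sketch below is only illustrative (the path is made up), but it uses the standard open()/read()/close() calls that correspond to the steps above.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);          // typically a DistributedFileSystem

        // open() contacts the NameNode for block locations and returns a stream
        // (FSDataInputStream wrapping DFSInputStream) that reads from the DataNodes.
        InputStream in = null;
        try {
            in = fs.open(new Path("/data/example.txt"));    // illustrative path
            IOUtils.copyBytes(in, System.out, 4096, false); // read() is called repeatedly
        } finally {
            IOUtils.closeStream(in);                        // step 7: close()
        }
    }
}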
Write Operation in HDFS
1. A client initiates the write operation by calling the create() method of the
DistributedFileSystem object, which creates a new file.
2. The DistributedFileSystem object connects to the NameNode using an RPC call and
initiates new file creation. However, this file create operation does not associate any
blocks with the file. It is the responsibility of the NameNode to verify that the file being
created does not already exist and that the client has the correct permissions to create a
new file. If the file already exists or the client does not have sufficient permission, an
IOException is thrown to the client. Otherwise, the operation succeeds and a new record
for the file is created by the NameNode.
3. Once the new record is created in the NameNode, an object of type FSDataOutputStream
is returned to the client, which the client uses to write data into HDFS.
4. FSDataOutputStream contains a DFSOutputStream object, which looks after
communication with the DataNodes and the NameNode. While the client continues writing
data, DFSOutputStream packages the data into packets and enqueues them into a queue
called the 'Data Queue'.
5. There is one more component called DataStreamer which consumes this DataQueue.
DataStreamer also asks NameNode for allocation of new blocks thereby picking
desirable DataNodes to be used for replication.
6. Now, the process of replication starts by creating a pipeline using DataNodes. In our
case, we have chosen a replication level of 3 and hence there are 3 DataNodes in the
pipeline.
7. The DataStreamer pours packets into the first DataNode in the pipeline.
8. Every DataNode in a pipeline stores packet received by it and forwards the same to the
second DataNode in a pipeline.
9. Another queue, the 'Ack Queue', is maintained by DFSOutputStream to store packets that
are waiting for acknowledgment from the DataNodes.
10. Once the acknowledgment for a packet in the queue is received from all DataNodes in
the pipeline, it is removed from the 'Ack Queue'. In the event of any DataNode failure,
packets from this queue are used to reinitiate the operation.
11. After the client is done writing data, it calls the close() method. The call to close()
results in flushing the remaining data packets to the pipeline, followed by waiting for
acknowledgment.
12. Once the final acknowledgment is received, the NameNode is contacted to tell it that the
file write operation is complete.
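A corresponding write from client code looks like the following minimal sketch; again the path and the data written are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // create() asks the NameNode to create the file record, then returns a stream;
        // behind the scenes DFSOutputStream packages writes into packets for the pipeline.
        FSDataOutputStream out = fs.create(new Path("/data/output.txt")); // illustrative path
        out.writeUTF("hello hdfs");

        // close() flushes the remaining packets and waits for acknowledgments,
        // after which the NameNode is told that the write is complete.
        out.close();
        fs.close();
    }
}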
Data stored in HDFS also needs to be processed, and Map Reduce is used to process that
data.
Map Reduce:
The pattern is about breaking down a problem into small pieces of work. Map, Reduce
and Shuffle are three basic operations of Map Reduce.
Map: This takes the input data and converts it into a set of data where each and every line
of input is broken down into a key-value pair (tuple).
Reduce: This task takes the output of the Map phase as input and combines (aggregates)
those data tuples into smaller sets based on keys.
Shuffle: This is the process of transferring the data from the mappers to the reducers.
In Hadoop, Map Reduce works by breaking the data processing into two phases: Map
phase and Reduce phase. The map is the first phase of processing, where we specify all the
complex logic/business rules/costly code. Reduce is the second phase of processing, where
we specify light-weight processing like aggregation/summation.
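As a concrete sketch of these two phases, the classic WordCount example below uses the older org.apache.hadoop.mapred Java API (OutputCollector, Reporter) that this note's later snippets refer to; the class names are illustrative, and in a real project each public class would live in its own source file.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

// Map phase: break each input line into (word, 1) intermediate key-value pairs.
public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);   // emit intermediate key-value pair
        }
    }
}

// Reduce phase: sum the counts for each word (light-weight aggregation).
class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}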
1. Input Files
The data for a Map Reduce task is stored in input files, and input files typically live in
HDFS. The format of these files is arbitrary; line-based log files and binary formats can also
be used.
2. Input Format
The InputFormat defines how these input files are split and read. It selects the files or other
objects that are used for input, and it creates the InputSplits.
3. Input Splits
InputSplits are created by the InputFormat and logically represent the data that will be
processed by an individual Mapper (the mapper is described below). One map task is
created for each split; thus the number of map tasks will be equal to the number of
InputSplits. Each split is divided into records, and each record will be processed by the
mapper.
4. RecordReader
The RecordReader communicates with the InputSplit in Hadoop MapReduce and converts
the data into key-value pairs suitable for reading by the mapper. By default, TextInputFormat
is used for converting data into key-value pairs. The RecordReader communicates with the
InputSplit until the file reading is completed. It assigns a byte offset (a unique number) to
each line present in the file. Further, these key-value pairs are sent to the mapper for
further processing.
5. Mapper
The Mapper processes each input record (from the RecordReader) and generates new
key-value pairs, and these key-value pairs are completely different from the input pairs. The
output of the Mapper is also known as the intermediate output, and it is written to the local
disk. The output of the Mapper is not stored on HDFS, as this is temporary data and writing
it to HDFS would create unnecessary copies (and HDFS is a high-latency system). The
mapper's output is passed to the combiner for further processing.
6. Combiner
The combiner is also known as a 'mini-reducer'. The Hadoop MapReduce combiner
performs local aggregation on the mappers' output, which helps to minimize the data
transfer between mapper and reducer. Once the combiner has executed, its output is
passed to the partitioner for further work.
7. Partitioner
In Hadoop MapReduce, the Partitioner comes into the picture if we are working with more
than one reducer (for a single reducer, the partitioner is not used).
The Partitioner takes the output from the combiners and performs partitioning.
Partitioning of the output takes place on the basis of the key, and the output is then sorted.
A hash function on the key (or a subset of the key) is used to derive the partition.
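A minimal custom partitioner, written against the same older org.apache.hadoop.mapred API, might look like the sketch below; the class name is illustrative, and it simply reproduces hash-based partitioning on the key. It would be registered on the job with conf.setPartitionerClass(WordPartitioner.class).

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Hash-style partitioner: records with the same key always go to the same reducer.
public class WordPartitioner implements Partitioner<Text, IntWritable> {

    @Override
    public void configure(JobConf job) {
        // No configuration needed for this simple example.
    }

    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}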
Job Submission
Job Initialization
When the JobTracker receives a call to its submitJob() method, it puts it into an internal
queue from where the job scheduler will pick it up and initialize it. Initialization involves
creating an object to represent the job being run, which encapsulates its tasks, and
bookkeeping information to keep track of the tasks’ status and progress (step 5).
To create the list of tasks to run, the job scheduler first retrieves the input splits
computed by the JobClient from the shared filesystem (step 6). It then creates one map task for
each split. The number of reduce tasks to create is determined by the mapred.reduce.tasks
property in the JobConf, which is set by the setNumReduceTasks() method, and the scheduler
simply creates this number of reduce tasks to be run. Tasks are given IDs at this point.
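Putting the pieces together, a driver that configures and submits such a job through JobConf and JobClient might look like the following sketch; the mapper and reducer classes are the illustrative WordCount classes sketched earlier, and the input and output paths come from the command line.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.TextInputFormat;
import org.apache.hadoop.mapred.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("word count");

        conf.setMapperClass(WordCountMapper.class);      // classes from the earlier sketch
        conf.setCombinerClass(WordCountReducer.class);
        conf.setReducerClass(WordCountReducer.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        conf.setNumReduceTasks(2);                        // number of reduce tasks

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // runJob() submits the job to the JobTracker, polls for progress,
        // and returns when the job completes.
        JobClient.runJob(conf);
    }
}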
Task Assignment
Tasktrackers run a simple loop that periodically sends heartbeat method calls to the
jobtracker. Heartbeats tell the jobtracker that a tasktracker is alive, but they also double as a
channel for messages. As a part of the heartbeat, a tasktracker will indicate whether it is ready
to run a new task, and if it is, the jobtracker will allocate it a task, which it communicates to
the tasktracker using the heartbeat return value (step 7).
Before it can choose a task for the tasktracker, the jobtracker must choose a job to select
the task from. There are various scheduling algorithms as explained later in this chapter (see
“Job Scheduling”), but the default one simply maintains a priority list of jobs. Having chosen
a job, the jobtracker now chooses a task for the job.
Tasktrackers have a fixed number of slots for map tasks and for reduce tasks: for
example, a tasktracker may be able to run two map tasks and two reduce tasks simultaneously.
(The precise number depends on the number of cores and the amount of memory on the
tasktracker; see “Memory” ) The default scheduler fills empty map task slots before reduce
task slots, so if the tasktracker has at least one empty map task slot, the jobtracker will select a
map task; otherwise, it will select a reduce task.
To choose a reduce task, the jobtracker simply takes the next in its list of yet-to-be-run
reduce tasks, since there are no data locality considerations. For a map task, however, it takes
account of the tasktracker’s network location and picks a task whose input split is as close as
possible to the tasktracker. In the optimal case, the task is data-local, that is, running on the
same node that the split resides on. Alternatively, the task may be rack-local: on the same rack,
but not the same node, as the split. Some tasks are neither data-local nor rack-local and retrieve
their data from a different rack from the one they are running on. You can tell the proportion of
each type of task by looking at a job's counters.
Task Execution
Now that the tasktracker has been assigned a task, the next step is for it to run the task.
First, it localizes the job JAR by copying it from the shared filesystem to the tasktracker’s
filesystem. It also copies any files needed from the distributed cache by the application to the
local disk; see “Distributed Cache” (step 8). Second, it creates a local working directory for the
task, and un-jars the contents of the JAR into this directory. Third, it creates an instance of
TaskRunner to run the task.
TaskRunner launches a new Java Virtual Machine (step 9) to run each task in (step 10),
so that any bugs in the user-defined map and reduce functions don’t affect the tasktracker (by
causing it to crash or hang, for example). It is, however, possible to reuse the JVM between
tasks; see “Task JVM Reuse”
The child process communicates with its parent through the umbilical interface. This
way it informs the parent of the task’s progress every few seconds until the task is complete.
Job Completion
When the jobtracker receives a notification that the last task for a job is complete, it
changes the status for the job to “successful.” Then, when the JobClient polls for status, it learns
that the job has completed successfully, so it prints a message to tell the user and then returns
from the runJob() method.
FAILURES
In the real world, user code is buggy, processes crash, and machines fail. One of the
major benefits of using Hadoop is its ability to handle such failures and allow your job to
complete.
Task Failure
Consider first the case of the child task failing. The most common way
that this happens is when user code in the map or reduce task throws a runtime exception. If
this happens, the child JVM reports the error back to its parent tasktracker, before it exits.
The error ultimately makes it into the user logs. The tasktracker marks the task attempt as
failed, freeing up a slot to run another task.
For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is
marked as failed. This behavior is governed by the stream.non.zero.exit.is.failure property (the
default is true).
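If this behaviour needs to be changed for a particular job, the property quoted above can be set on the job configuration; the class wrapper below is only an illustrative sketch.

import org.apache.hadoop.mapred.JobConf;

public class StreamingFailureConfig {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Treat a nonzero exit code from the Streaming process as a task failure
        // (this is the default behaviour described above).
        conf.setBoolean("stream.non.zero.exit.is.failure", true);
    }
}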
Another failure mode is the sudden exit of the child JVM: perhaps there is a JVM bug that
causes the JVM to exit for a particular set of circumstances exposed by the MapReduce
user code. In this case, the tasktracker notices that the process has exited and marks the
attempt as failed.
Hanging tasks are dealt with differently. The tasktracker notices that it hasn’t
received a progress update for a while and proceeds to mark the task as failed. The child JVM
process will be automatically killed after this period. The timeout period after which tasks are
considered failed is normally 10 minutes and can be configured on a per-job basis (or a
cluster basis) by setting the mapred.task.timeout property to a value in milliseconds.
Setting the timeout to a value of zero disables the timeout, so long-running tasks are
never marked as failed. In this case, a hanging task will never free up its slot, and over time
there may be cluster slowdown as a result. This approach should therefore be avoided, and
making sure that a task is reporting progress periodically will suffice (see “What Constitutes
Progress in MapReduce?” ).
When the jobtracker is notified of a task attempt that has failed (by the
tasktracker’s heartbeat call), it will reschedule execution of the task. The jobtracker will
try to avoid rescheduling the task on a tasktracker where it has previously failed. Furthermore,
if a task fails four times (or more), it will not be retried further. This value is configurable: the
maximum number of attempts to run a task is controlled by the mapred.map.max.attempts
property for map tasks and mapred.reduce.max.attempts for reduce tasks. By default, if any
task fails four times (or whatever the maximum number of attempts is configured to), the whole
job fails.
For some applications, it is undesirable to abort the job if a few tasks fail, as it may be
possible to use the results of the job despite some failures. In this case, the maximum percentage
of tasks that are allowed to fail without triggering job failure can be set for the job. Map
tasks and reduce tasks are controlled independently, using the
mapred.max.map.failures.percent and mapred.max.reduce.failures.percent properties.
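The sketch below shows how these knobs might be set on a job, using the classic property names quoted in this section; the numeric values are just examples.

import org.apache.hadoop.mapred.JobConf;

public class FailureTuningExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Retry a failing map or reduce task up to 4 attempts (the default mentioned above).
        conf.setInt("mapred.map.max.attempts", 4);
        conf.setInt("mapred.reduce.max.attempts", 4);

        // Allow up to 5% of map or reduce tasks to fail without failing the whole job.
        conf.setInt("mapred.max.map.failures.percent", 5);
        conf.setInt("mapred.max.reduce.failures.percent", 5);

        // Consider a task hung (and kill it) after 10 minutes without a progress report.
        conf.setLong("mapred.task.timeout", 600000L);
    }
}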
If a Streaming process hangs, the tasktracker does not try to kill it (although the JVM
that launched it will be killed), so you should take precautions to monitor for this scenario,
and kill orphaned processes by some other means.
A task attempt may also be killed, which is different from it failing. A task attempt may
be killed because it is a speculative duplicate (for more, see “Speculative Execution” ), or
because the tasktracker it was running on failed, and the jobtracker marked all the task
attempts running on it as killed. Killed task attempts do not count against the number of
attempts to run the task (as set by mapred.map.max.attempts and
mapred.reduce.max.attempts), since it wasn’t the task’s fault that an attempt was killed. Users
may also kill or fail task attempts using the web UI or the command line (type hadoop job to
see the options). Jobs may also be killed by the same mechanisms.
Tasktracker Failure
If a tasktracker fails by crashing or running very slowly, it will stop sending heartbeats to the
jobtracker (or send them very infrequently). The jobtracker will notice a tasktracker that has
stopped sending heartbeats (if it hasn't received one for 10 minutes, configured via the
mapred.tasktracker.expiry.interval
property, in milliseconds) and remove it from its pool of tasktrackers to schedule tasks on. The
jobtracker arranges for map tasks that were run and completed successfully on that tasktracker
to be rerun if they belong to incomplete jobs, since their intermediate output residing on the
failed tasktracker’s local filesystem may not be accessible to the reduce task. Any tasks in
progress are also rescheduled.
A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker has not
failed. A tasktracker is blacklisted if the number of tasks that have failed on it is significantly
higher than the average task failure rate on the cluster. Blacklisted tasktrackers can be
restarted to remove them from the jobtracker’s blacklist.
Jobtracker Failure
Failure of the jobtracker is the most serious failure mode. Currently, Hadoop has no
mechanism for dealing with failure of the jobtracker: it is a single point of failure, so in this
case the job fails. However, this failure mode has a low chance of occurring, since the
chance of a particular machine failing is low. It is possible that a future release of Hadoop
will remove this limitation by running multiple jobtrackers, only one of which is the primary
jobtracker at any time.
JOB SCHEDULING
Early versions of Hadoop had a very simple approach to scheduling users’ jobs: they
ran in order of submission, using a FIFO scheduler. Typically, each job would use the
whole cluster, so jobs had to wait their turn. Although a shared cluster offers great potential
for offering large resources to many users, the problem of sharing resources fairly
between users requires a better scheduler. Production jobs need to complete in a timely
manner, while allowing users who are making smaller ad hoc queries to get results back in a
reasonable time.
Later on, the ability to set a job’s priority was added, via the mapred.job.priority
property or the setJobPriority() method on JobClient (both of which take one of the values
VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW). When the job scheduler is
choosing the next job to run, it selects one with the highest priority. However, with the
FIFO scheduler, priorities do not support pre-emption, so a high-priority job can still be
blocked by a long-running low priority job that started before the high-priority job was
scheduled.
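For example, a job could be marked as high priority through the property named above; this is only a sketch, and the value strings match the priorities listed in this section.

import org.apache.hadoop.mapred.JobConf;

public class JobPriorityExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Mark this job as high priority; with the FIFO scheduler it is picked
        // ahead of lower-priority jobs, but a running job is never preempted.
        conf.set("mapred.job.priority", "HIGH");   // VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW
    }
}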
Map Reduce in Hadoop comes with a choice of schedulers. The default is the
original FIFO queue-based scheduler, and there are also multiuser schedulers called the
Fair Scheduler and the Capacity Scheduler.
The Fair Scheduler aims to give every user a fair share of the cluster capacity over time.
If a single job is running, it gets all of the cluster. As more jobs are submitted, free task slots
are given to the jobs in such a way as to give each user a fair share of the cluster.
A short job belonging to one user will complete in a reasonable time even while another
user’s long job is running, and the long job will still make progress. Jobs are placed in pools,
and by default, each user gets their own pool. A user who submits more jobs than a second user
will not get any more cluster resources than the second, on average. It is also possible to define
custom pools with guaranteed minimum capacities defined in terms of the number of map and
reduce slots, and to set weightings for each pool.
The Fair Scheduler supports preemption, so if a pool has not received its fair share for
a certain period of time, then the scheduler will kill tasks in pools running over capacity in
order to give the slots to the pool running under capacity.
Shuffling in MapReduce
The process of transferring data from the mappers to the reducers is called shuffling. It is
also the process by which the system performs the sort and then transfers the map output
to the reducer as input. This is the reason the shuffle phase is necessary for the reducers;
otherwise, they would not have any input (or input from every mapper). Shuffling can start
even before the map phase has finished, which saves some time and completes the tasks
in less time.
Sorting in MapReduce
MapReduce sorts the intermediate keys produced by the mappers before they are
presented to the reducer, and this saves time for the reducer. A reducer in MapReduce
starts a new reduce task when the next key in the sorted input data is different from the
previous one. Each reduce task takes key-value pairs as input and generates key-value
pairs as output.
The important thing to note is that shuffling and sorting in Hadoop MapReduce do not take
place at all if you specify zero reducers (setNumReduceTasks(0)).
If the number of reducers is zero, the MapReduce job stops at the map phase, and the map
phase does not include any kind of sorting (so the map phase is faster).
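A map-only job is therefore configured simply by setting the number of reducers to zero, as in this small sketch.

import org.apache.hadoop.mapred.JobConf;

public class MapOnlyJobExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();

        // Zero reducers: the job stops after the map phase, no shuffle or sort
        // is performed, and the map output is written directly to HDFS.
        conf.setNumReduceTasks(0);
    }
}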
TASK EXECUTION
Task Tracker copies the job jar file from the shared file system (HDFS).
Task tracker creates a local working directory and un-jars the jar file into the local
filesystem.
Task tracker creates an instance of TaskRunner.
Task tracker starts TaskRunner in a new JVM to run the map or reduce task.
The child process communicates its progress to the parent process.
Each task can perform setup and cleanup actions based on the OutputCommitter.
In the case of Streaming, input is provided to the running process via stdin and output is
collected via stdout, with the map or reduce task running via pipes or sockets.
MapReduce Types
Mapping is the core technique of processing a list of data elements that come in
pairs of keys and values. The map function applies to individual elements defined as key-value
pairs of a list and produces a new list.
The general idea of the map and reduce functions of Hadoop can be illustrated as follows:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
The input parameters of the key and value pair, represented by K1 and V1 respectively, are
different from the output pair type: K2 and V2. The reduce function accepts the same format
output by the map, but the type of the output of the reduce operation is again different: K3
and V3. The Java API for this is as follows:
public interface Mapper<K1, V1, K2, V2> extends JobConfigurable, Closeable {
  void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter) throws IOException;
}
public interface Reducer<K2, V2, K3, V3> extends JobConfigurable, Closeable {
  void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter) throws IOException;
}
Note that the combine function has the same form as the reduce function, except that its
output types are the intermediate key and value types (K2 and V2), so that they can feed
the reduce function.
Input Formats
In a MapReduce job execution, the InputFormat is the first step. It is also responsible for
creating the input splits and dividing them into records.
Input files store the data for the MapReduce job, and they reside in HDFS. Although the
format of these files is arbitrary, line-based log files and binary formats can also be used.
Hence, in MapReduce, the InputFormat class is one of the fundamental classes: it selects
the files used for input, defines how they are split and read, and creates the InputSplits.
There are different types of MapReduce InputFormat in Hadoop which are used for different
purposes. Let's discuss the Hadoop InputFormat types below:
1. FileInputFormat
It is the base class for all file-based InputFormats. FileInputFormat also specifies the input
directory, which contains the location of the data files. When we start a MapReduce job
execution, FileInputFormat provides a path containing the files to read. This InputFormat
reads all the files and then divides them into one or more InputSplits.
2. TextInputFormat
It is the default InputFormat. This InputFormat treats each line of each input file as a
separate record and performs no parsing. TextInputFormat is useful for unformatted data or
line-based records like log files. Hence:
Key – It is the byte offset of the beginning of the line within the file (not the whole file, just
one split), so it will be unique if combined with the file name.
Value – It is the contents of the line, excluding line terminators.
3. KeyValueTextInputFormat
4. SequenceFileInputFormat
It is an InputFormat which reads sequence files. Sequence files are binary files that store
sequences of binary key-value pairs. They are block-compressed and provide direct
serialization and deserialization of arbitrary data types. Hence:
5. SequenceFileAsTextInputFormat
6. SequenceFileAsBinaryInputFormat
7. NlineInputFormat
It is another form of TextInputFormat where the keys are the byte offset of the line and the
values are the contents of the line. With TextInputFormat and KeyValueTextInputFormat,
each mapper receives a variable number of lines of input; the number depends on the size
of the split and on the length of the lines. So, if we want our mapper to receive a fixed
number of lines of input, we use NLineInputFormat.
Suppose N=2; then each split contains two lines, so one mapper receives the first two
key-value pairs and another mapper receives the next two key-value pairs.
8. DBInputFormat
This InputFormat reads data from a relational database using JDBC. It loads small datasets,
perhaps for joining with large datasets from HDFS using MultipleInputs. Hence:
Key – LongWritables
Value – DBWritables
Output Formats
The OutputFormat checks the output specification for the execution of the Map-Reduce job.
It describes how the RecordWriter implementation is used to write output to the output files.
The RecordWriter takes the output data from the Reducer and writes it to the output
files. The OutputFormat determines the way these output key-value pairs are written to the
output files by the RecordWriter.
The Hadoop MapReduce job checks that the output directory does not already exist.
The OutputFormat of a MapReduce job provides the RecordWriter implementation to be
used to write the output files of the job; the output files are then stored in a FileSystem.
1. TextOutputFormat
2. SequenceFileOutputFormat
This OutputFormat writes sequence files as its output, and the corresponding
SequenceFileInputFormat will deserialize the file into the same types. It presents the data to
the next mapper in the same manner as it was emitted by the previous reducer. Static
methods also control the compression.
3. SequenceFileAsBinaryOutputFormat
4. MapFileOutputFormat
5. MultipleOutputs
This format allows writing data to files whose names are derived from the output keys
and values.
6. LazyOutputFormat
7. DBOutputFormat
It is the OutputFormat for writing to relational databases and HBase. This format sends the
reduce output to a SQL table. It accepts key-value pairs, in which the key has a type
extending DBWritable.
Features of MapReduce
1. Scalability
2. Flexibility
Additionally, the MapReduce framework provides support for multiple languages and for
data from sources ranging from email and social media to clickstreams.
MapReduce processes data as simple key-value pairs and thus supports data types
including metadata, images, and large files. Hence, MapReduce is more flexible in dealing
with data than a traditional DBMS.
3. Security and Authentication
The MapReduce programming model uses the HBase and HDFS security platforms, which
allow only authenticated users to operate on the data. Thus, it protects against unauthorized
access to system data and enhances system security.
4. Cost-effective solution
5. Fast
The tools that are used for data processing, such as MapReduce programs, are generally
located on the very same servers where the data resides, which allows for faster processing
of data.
6. Simple programming model
Amongst the various features of Hadoop MapReduce, one of the most important is that it is
based on a simple programming model. Basically, this allows programmers to develop
MapReduce programs which can handle tasks easily and efficiently.
MapReduce programs can be written in Java, which is not very hard to pick up and is also
widely used. So, anyone can easily learn and write MapReduce programs and meet their
data processing needs.
7. Parallel Programming
8. Availability and Fault Tolerance
Whenever data is sent to an individual node, the same data is also forwarded to some other
nodes in the cluster. So, if any particular node suffers a failure, there are always other
copies present on other nodes that can still be accessed whenever needed. This assures
high availability of data.
One of the major features offered by Apache Hadoop is its fault tolerance. The Hadoop
MapReduce framework has the ability to quickly recognize faults that occur.
It then applies a quick and automatic recovery solution. This feature makes it a
game-changer in the world of big data processing.