Module 4


1. What are the limitations of classic MapReduce?

2. Compare classic MapReduce with YARN.

3. Write a note on the use of ZooKeeper.
4. What are the differences between HBase and HDFS?
5. List the differences between RDBMS and Cassandra.
6. Write a note on Pig Latin.
7. What is the use of Grunt?
8. Mention the advantages of Hive.
9. Compare relational databases with HBase.
10. When to use HBase?
11. Explain the anatomy of a classic MapReduce job run (16)
12. Explain the YARN architecture in detail (16)
13. Explain in detail the various types of job schedulers (16)
14. Describe how failures are handled in classic MapReduce and YARN (16)
15. Explain the HBase architecture and its data model in detail (16)
16. Explain the Cassandra architecture and its data model in detail (16)
17. Write a note on Cassandra clients (16)
18. Write a note on HBase clients (16)
19. Explain Hive data types and file formats (8)
20. Explain Pig Latin scripts and the Grunt shell (8)
21. Describe HiveQL data definition in detail (8)

UNIT IV MAPREDUCE APPLICATIONS

MapReduce workflows – unit tests with MRUnit – test data and local tests – anatomy of a
MapReduce job run – classic MapReduce – YARN – failures in classic MapReduce and
YARN – job scheduling – shuffle and sort – task execution – MapReduce types – input
formats – output formats

MapReduce Workflows

This section explains how to express a data processing problem in the MapReduce model. When the
processing gets more complex, the complexity is generally manifested by having more
MapReduce jobs, rather than by having more complex map and reduce functions. In other
words, as a rule of thumb, think about adding more jobs rather than adding complexity
to jobs. A MapReduce workflow is divided into two steps:

• Decomposing a Problem into MapReduce Jobs


• Running Jobs

1. Decomposing a Problem into MapReduce Jobs


Let’s look at an example of a more complex problem that we want to translate into a
MapReduce workflow. When we write a MapReduce workflow, we have to create
two scripts:

• the map script, and


• the reduce script.

When we start a map/reduce workflow, the framework splits the input
into segments, passing each segment to a different machine. Each machine
then runs the map script on the portion of data attributed to it.
2. Running Dependent Jobs (a linear chain of jobs or a more complex
directed acyclic graph of jobs)

When there is more than one job in a MapReduce workflow, the question arises: how
do you manage the jobs so they are executed in order? There are several approaches,
and the main consideration is whether you have a linear chain of jobs, or a more
complex directed acyclic graph (DAG) of jobs.

For a linear chain, the simplest approach is to run each job one after another, waiting until
a job completes successfully before running the next:

JobClient.runJob(conf1);
JobClient.runJob(conf2);

For anything more complex than a linear chain, such as a DAG of jobs, there is a class called
JobControl, which represents a graph of jobs to be run.
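
A minimal sketch of a two-job chain with JobControl, using the new-API ControlledJob wrapper; the job configuration details are omitted and the class and workflow names are illustrative, not from the original text:

// Sketch: running a two-step linear chain with JobControl (new MapReduce API).
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.jobcontrol.ControlledJob;
import org.apache.hadoop.mapreduce.lib.jobcontrol.JobControl;

public class WorkflowDriver {
  public static void main(String[] args) throws Exception {
    Job job1 = Job.getInstance();   // configure mapper/reducer/paths for step 1 here
    Job job2 = Job.getInstance();   // configure step 2 here

    ControlledJob step1 = new ControlledJob(job1.getConfiguration());
    ControlledJob step2 = new ControlledJob(job2.getConfiguration());
    step2.addDependingJob(step1);   // step2 runs only after step1 succeeds

    JobControl control = new JobControl("two-step-workflow");
    control.addJob(step1);
    control.addJob(step2);

    // JobControl is a Runnable; run it in its own thread and poll for completion.
    Thread t = new Thread(control);
    t.start();
    while (!control.allFinished()) {
      Thread.sleep(1000);
    }
    control.stop();
  }
}

Note that JobControl runs on the client machine and simply tracks the state of each ControlledJob, which is why the Oozie server-based approach below scales better for large workflows.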

Oozie

Unlike JobControl, which runs on the client machine submitting the jobs, Oozie runs as
a server, and a client submits a workflow to the server. In Oozie, a workflow is a DAG of
action nodes and control-flow nodes. An action node performs a workflow task, such as
moving files in HDFS, running a MapReduce job or running a Pig job. When the
workflow completes, Oozie can make an HTTP callback to the client to inform it of the
workflow status. It is also possible to receive callbacks every time the workflow enters
or exits an action node.
Oozie allows failed workflows to be re-run from an arbitrary point. This is useful for
dealing with transient errors when the early actions in the workflow are time-consuming
to execute.
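
For illustration, a client can submit a previously deployed workflow to an Oozie server through the Java OozieClient API; the sketch below assumes the workflow application has already been copied to HDFS, and the host names and paths are placeholders:

// Sketch: submitting a workflow to an Oozie server (hosts and paths are placeholders).
import java.util.Properties;
import org.apache.oozie.client.OozieClient;
import org.apache.oozie.client.WorkflowJob;

public class OozieSubmit {
  public static void main(String[] args) throws Exception {
    OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

    Properties conf = client.createConfiguration();
    conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/hadoop/my-workflow");
    conf.setProperty("nameNode", "hdfs://namenode:8020");
    conf.setProperty("jobTracker", "jobtracker-host:8021");

    String jobId = client.run(conf);            // submit and start the workflow
    while (client.getJobInfo(jobId).getStatus() == WorkflowJob.Status.RUNNING) {
      Thread.sleep(10 * 1000);                  // poll until the workflow leaves RUNNING
    }
    System.out.println("Workflow finished: " + client.getJobInfo(jobId).getStatus());
  }
}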

Anatomy of a classic MapReduce job run

How does MapReduce work? / Explain the anatomy of a classic MapReduce job run / How does
Hadoop run a MapReduce job?

You can run a MapReduce job with a single line of code: JobClient.runJob(conf). It is
very short, but it conceals a great deal of processing behind the scenes. The whole
process is illustrated in the following figure.
As shown in Figure 1, there are four independent entities in the framework:
- Client, which submits the MapReduce job.
- JobTracker, which coordinates and controls the job run. It is a Java class called
JobTracker.
- TaskTrackers, which run the tasks that the job has been split into, control each specific map or
reduce task, and report to the JobTracker. They are Java classes as well.
- HDFS, which provides distributed data storage and is used to share job files
between the other entities.

As Figure 1 shows, MapReduce processing includes 10 steps, which in short are:
- The clients submit MapReduce jobs to the JobTracker.
- The JobTracker assigns map and reduce tasks to other nodes in the cluster.
- These nodes each run a software daemon, the TaskTracker, in a separate JVM.
- Each TaskTracker actually initiates the map or reduce tasks and reports progress
back to the JobTracker.
The workflow has six detailed phases. They are:

1. Job Submission
2. Job Initialization
3. Task Assignment
4. Task Execution
5. Task Progress and status updates
6. Task Completion

Job Submission

When the client calls submit() on the job object, an internal JobSubmitter object is
created and its submitJobInternal() method is called. If the client calls waitForCompletion(),
the job's progress is polled and reported back to the client until the job completes.
JobSubmitter does the following work:
- Asks the JobTracker for a new job ID.
- Checks the output specification of the job.
- Computes the input splits for the job.
- Copies the resources needed to run the job. Resources include the job JAR file, the
configuration file and the computed input splits. These resources are copied to HDFS
in a directory named after the job ID. The job JAR is copied across the cluster with a high
replication factor (more than the usual three) so that TaskTrackers can access it quickly.
- Tells the JobTracker that the job is ready for execution by calling submitJob() on the
JobTracker.
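
All of these submission steps are triggered from a small driver program; a minimal sketch using the newer Job API (the class name, paths and the commented mapper/reducer placeholders are illustrative, not from the original text):

// Sketch: a minimal driver that triggers the submission steps described above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MyJobDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "my-job");
    job.setJarByClass(MyJobDriver.class);        // the job JAR copied to HDFS at submission
    // job.setMapperClass(...); job.setReducerClass(...);   // illustrative placeholders
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // used to compute input splits
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output spec checked at submission
    // waitForCompletion() calls submit() internally, then polls progress until the job ends.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}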

Job Initialization

When the JobTracker receives the call to submitJob(), it puts the call into an internal
queue from where the job scheduler will pick it up and initialize it. Initialization is
done as follows:
- A job object is created to represent the job being run. It encapsulates its
tasks and bookkeeping information so as to keep track of task progress and
status.
- The JobTracker retrieves the input splits from HDFS and creates the list of tasks, each of which has
a task ID. It creates one map task for each split, and the number of reduce
tasks is determined by the configuration.
- The JobTracker creates a setup task and a cleanup task. The setup task creates the
final output directory for the job and the temporary working space for the task output.
The cleanup task deletes the temporary working space for the task output.
- The JobTracker assigns tasks to free TaskTrackers.

Task Assignment

TaskTrackers periodically send heartbeats to the JobTracker to tell it that they are alive and
whether they are ready to run a new task. The JobTracker then allocates a new task to a ready
TaskTracker. Task assignment works as follows:
- The JobTracker chooses a job to select a task from according to its scheduling
algorithm; a simple approach is to choose from a priority list of jobs. After choosing the job, the
JobTracker chooses a task from that job.
- TaskTrackers have a fixed number of slots for map tasks and for reduce tasks,
which are set independently; the scheduler fills empty map task slots before
reduce task slots.
- To choose a reduce task, the JobTracker simply takes the next one in its list of yet-to-be-run
reduce tasks, because there is no data locality consideration. The choice of map task,
however, depends on data locality and the TaskTracker's network location.

Task Execution

When the TaskTracker has been assigned a task, task execution proceeds as
follows:
- It copies the job JAR file from HDFS and copies any needed files from the distributed cache to the local disk.
- It creates a local working directory for the task and 'un-jars' the JAR file contents into that
directory.
- It creates a TaskRunner to run the task. The TaskRunner launches a new JVM to run
each task, so a TaskRunner that fails because of bugs will not affect the TaskTracker. Multiple tasks on
the node can reuse the JVM created by the TaskRunner.
- Each task run in the JVM created by the TaskRunner performs the setup and cleanup actions.
- The child process created by the TaskRunner informs the parent process of the task's
progress every few seconds until the task is complete.
Progress and Status Updates

MapReduce jobs are long-running batch jobs, so after a client submits a job, progress
reporting is important. The following operations constitute progress in a Hadoop task:
- Reading an input record in a mapper or reducer
- Writing an output record in a mapper or reducer
- Setting the status description on a reporter, using the Reporter's setStatus() method
- Incrementing a counter
- Calling the Reporter's progress() method

As shown in the figure, when a task is running, the TaskTracker notifies the
JobTracker of its task progress via the heartbeat every 5 seconds, and the mapper and reducer
in the child JVM report to the TaskTracker with their progress status every few
seconds. The mapper or reducer sets a flag to indicate a status change that
should be sent to the TaskTracker. The flag is checked in a separate thread every 3
seconds; if the flag is set, the thread notifies the TaskTracker of the current task status.
The JobTracker combines all of the updates to produce a global view, and the client can
use getStatus() to get the job's progress status.
Job Completion

When the JobTracker receives a report that the last task for a job is complete, it
changes the job's status to successful. The JobTracker then sends an HTTP notification to
the client that called waitForCompletion(). The job statistics and counter
information are printed to the client console. Finally, the JobTracker and the
TaskTrackers perform clean-up actions for the job.

MRUnit test

MRUnit is based on JUnit and allows for the unit testing of mappers, reducers and some
limited integration testing of the mapper-reducer interaction, along with combiners,
custom counters and partitioners.
To write your tests you would:

 Testing Mappers
1. Instantiate an instance of the MapDriver class, parameterized exactly as the
mapper under test.
2. Add an instance of the mapper you are testing in the withMapper call.
3. In the withInput call, pass in your key and input value.
4. Specify the expected output in the withOutput call.
5. The last call, runTest, feeds the specified input values into the mapper and
compares the actual output against the expected output set in the withOutput
method.

 Testing Reducers
1. The test starts by creating a list of objects (pairList) to be used as the input
to the reducer.
2. A ReduceDriver is instantiated.
3. Next, we pass an instance of the reducer we want to test in the withReducer call.
4. In the withInput call we pass in the key (of "190101") and the pairList object
created at the start of the test.
5. Next, we specify the output that we expect our reducer to emit.
6. Finally, runTest is called, which feeds our reducer the specified inputs and
compares the output from the reducer against the expected output.

MRUnit testing framework is based on JUnit and it can test Map Reduce programs
written on several versions of Hadoop.
Following is an example of using MRUnit to unit-test a MapReduce program that does SMS
CDR (call details record) analysis.
The records look like

CDRID; CDRType; Phone1; Phone2; SMS Status Code
655209; 1; 796764372490213; 804422938115889; 6
353415; 0; 356857119806206; 287572231184798; 4
835699; 1; 252280313968413; 889717902341635; 0

The MapReduce program analyzes these records, finds all records with CDRType 1,
and notes their corresponding SMS status codes. For example, the mapper outputs are
6, 1
0, 1
The reducer takes these as inputs and outputs the number of times a particular status code
occurs in the CDR records.

The corresponding Mapper and Reducer are

public class SMSCDRMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

  private Text status = new Text();
  private final static IntWritable addOne = new IntWritable(1);

  /** Returns the SMS status code and its count */
  protected void map(LongWritable key, Text value, Context context)
      throws java.io.IOException, InterruptedException {

    // 655209;1;796764372490213;804422938115889;6 is the sample record format
    String[] line = value.toString().split(";");

    // If the record is an SMS CDR
    if (Integer.parseInt(line[1]) == 1) {
      status.set(line[4]);
      context.write(status, addOne);
    }
  }
}
The corresponding Reducer code is

public class SMSCDRReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws java.io.IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    context.write(key, new IntWritable(sum));
  }
}

The MRUnit test class for the Mapper and Reducer is

public class SMSCDRMapperReducerTest {

  MapDriver<LongWritable, Text, Text, IntWritable> mapDriver;
  ReduceDriver<Text, IntWritable, Text, IntWritable> reduceDriver;
  MapReduceDriver<LongWritable, Text, Text, IntWritable, Text, IntWritable> mapReduceDriver;

  @Before
  public void setUp() {
    SMSCDRMapper mapper = new SMSCDRMapper();
    SMSCDRReducer reducer = new SMSCDRReducer();
    mapDriver = MapDriver.newMapDriver(mapper);
    reduceDriver = ReduceDriver.newReduceDriver(reducer);
    mapReduceDriver = MapReduceDriver.newMapReduceDriver(mapper, reducer);
  }

  @Test
  public void testMapper() {
    mapDriver.withInput(new LongWritable(), new Text("655209;1;796764372490213;804422938115889;6"));
    mapDriver.withOutput(new Text("6"), new IntWritable(1));
    mapDriver.runTest();
  }

  @Test
  public void testReducer() {
    List<IntWritable> values = new ArrayList<IntWritable>();
    values.add(new IntWritable(1));
    values.add(new IntWritable(1));
    reduceDriver.withInput(new Text("6"), values);
    reduceDriver.withOutput(new Text("6"), new IntWritable(2));
    reduceDriver.runTest();
  }
}

YARN: Hadoop MapReduce 2, developed to address the various limitations of classic MapReduce.

Current MapReduce (classic) limitations:

 Scalability problems

 Maximum cluster size – 4,000 nodes only

 Maximum concurrent tasks – 40,000 only

 Coarse synchronization in the JobTracker

 The JobTracker is a single point of failure

 When a failure occurs, it kills all queued and running jobs

 Jobs need to be resubmitted by users

 Restart is very tricky due to complex state

For large clusters with more than 4,000 nodes, the classic MapReduce framework hits
scalability problems.

YARN stands for Yet Another Resource Negotiator.

A group at Yahoo! began to design the next-generation MapReduce in 2010, and the Hadoop 2.x
releases of 2013 shipped MapReduce 2 on Yet Another Resource Negotiator (YARN) to remedy
the scalability shortcomings.

What does Yarn do ?

 Provides a cluster level resource manager

 Adds application level resource management

 Provides slots for jobs other than Map / Reduce

 Improves resource utilization

 Improves scaling

 Supports cluster sizes of 6,000–10,000 nodes

 100,000+ concurrent tasks can be executed

 10,000 concurrent jobs can be executed

 Splits the JobTracker into:

1. Resource Manager (RM): performs cluster-level resource management

2. Application Master (AM): performs job scheduling and monitoring


YARN Architecture

As shown in the figure, YARN involves more entities than classic MapReduce 1:
- Client, the same as in classic MapReduce, which submits the MapReduce job.
- Resource Manager, which has the ultimate authority to arbitrate resources among
all the applications in the cluster; it coordinates the allocation of compute resources on
the cluster.
- Node Manager, which is in charge of resource containers, monitoring resource
usage (cpu, memory, disk, network) on the node, and reporting to the Resource
Manager.
- Application Master, which is in charge of the life cycle of an application, such as a
MapReduce job. It negotiates with the Resource Manager for cluster resources, which in
YARN are called containers. The Application Master and the MapReduce tasks in the
containers are scheduled by the Resource Manager, and both of them are managed by
the Node Managers. The Application Master is also responsible for keeping track of task
progress and status.
- HDFS, the same as in classic MapReduce, for sharing files between the different entities.
The Resource Manager consists of two components:
• the Scheduler, and
• the Applications Manager.

The Scheduler is in charge of allocating resources. A resource container incorporates
elements such as memory, cpu, disk and network. The Scheduler performs only resource
allocation; it is not responsible for monitoring job status. The Scheduler is also pluggable and
can be replaced by another scheduler plug-in.

The Applications Manager is responsible for accepting job submissions, negotiating
the first container for executing the application-specific Application Master, and
providing a restart service when that container fails.
A MapReduce job is just one type of application in YARN. Different applications can run
on the same cluster with the YARN framework.

YARN MapReduce

As shown in the figure above, the MapReduce process with YARN has 11 steps,
which we explain in the same six phases as the MapReduce 1 framework: Job
Submission, Job Initialization, Task Assignment, Task Execution, Progress and Status
Updates, and Job Completion.
Job Submission

Clients can submit jobs with the same API as MapReduce 1 in YARN. YARN implements
its own ClientProtocol, so the submission process is similar to MapReduce 1:
- The client calls the submit() method, which initiates the JobSubmitter object
and calls submitJobInternal().
- The Resource Manager allocates a new application ID and returns it to the client.
- The job client checks the output specification of the job.
- The job client computes the input splits.
- The job client copies the resources, including the split data, configuration information
and the job JAR, into HDFS.
- Finally, the job client notifies the Resource Manager that it is ready by calling submitApplication()
on the Resource Manager.
Job Initialization

When the Resource Manager (RM) receives the call to submitApplication(), it hands the
job off to its scheduler. Job initialization proceeds as follows:
- The scheduler allocates a resource container for the job.
- The RM launches the Application Master under the Node Manager's management.
- The Application Master initializes the job. The Application Master is a Java class named
MRAppMaster, which initializes the job by creating a number of bookkeeping objects to
keep track of the job's progress. It will receive progress and completion reports
from the tasks.
- The Application Master retrieves the input splits from HDFS and creates a map task
object for each split. It creates a number of reduce task objects determined by the
mapreduce.job.reduces configuration property.
- The Application Master then decides how to run the job.
For a small job, called an uber job (one with fewer than 10 mappers and only one
reducer, or whose input is smaller than one HDFS block), the Application Master
runs the job sequentially in its own JVM. This policy differs from MapReduce 1,
which never runs small jobs on a single TaskTracker.
For a large job, the Application Master requests new containers, launched by Node
Managers on other nodes, in which to run the tasks; this runs the job in parallel
and gains more performance.
The Application Master calls the job setup method to create the job's output directory.
This differs from MapReduce 1, where the setup task is called by each task's
TaskTracker.
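
As a small illustration of the configuration that this initialization step consults, the sketch below sets the reducer count (mapreduce.job.reduces) and, assuming the standard Hadoop 2.x property name mapreduce.job.ubertask.enable, opts in to uber execution:

// Sketch: properties consulted during job initialization (Hadoop 2.x names assumed).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class InitConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.job.ubertask.enable", true); // let small jobs run in the AM's JVM
    Job job = Job.getInstance(conf, "init-example");
    job.setNumReduceTasks(2);   // equivalent to setting mapreduce.job.reduces=2
    System.out.println(job.getConfiguration().get("mapreduce.job.reduces"));
  }
}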
Task Assignment

When the job is too large to run on the same node as the Application
Master, the Application Master requests more resource containers from the Resource
Manager; these requests are piggybacked on heartbeat calls. Task
assignment proceeds as follows:
- The Application Master makes requests to the Resource Manager in heartbeat calls. A
request includes data locality information, such as the hosts and corresponding racks that
the input splits reside on.
- The Resource Manager hands the request over to the Scheduler, which makes decisions
based on this information. It attempts to place each task as close to its data as possible: a
data-local node is ideal; if that is not possible, rack-local is preferred to non-local.
- The request also specifies the memory requirements, which lie between the minimum
allocation (1 GB by default) and the maximum allocation (10 GB by default). The Scheduler
schedules a container whose memory is a multiple of 1 GB for the task, based on the
mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties set for the task.

This approach is more flexible than MapReduce 1. In MapReduce 1, the TaskTrackers have
a fixed number of slots and each task runs in a slot. Each slot has a fixed memory
allowance, which results in two problems: a small task wastes memory, while a large task
that needs more memory is starved of it. In YARN, the memory allocation is more
fine-grained, which is part of the beauty of YARN.
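
A brief sketch of how a job expresses these per-task memory requests, using the mapreduce.map.memory.mb and mapreduce.reduce.memory.mb properties mentioned above (the values are illustrative):

// Sketch: requesting container memory for map and reduce tasks (illustrative values).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MemoryConfigExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("mapreduce.map.memory.mb", 2048);     // container size requested for each map task
    conf.setInt("mapreduce.reduce.memory.mb", 4096);  // container size requested for each reduce task
    Job job = Job.getInstance(conf, "memory-example");
    // The Scheduler rounds these requests to multiples of the minimum allocation (1 GB by default).
    System.out.println(job.getConfiguration().get("mapreduce.map.memory.mb"));
  }
}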
Task Execution

After the task has been assigned a container by the Resource Manager's scheduler,
the Application Master contacts the Node Manager, which launches the task JVM.
Task execution proceeds as follows:
- A Java application whose main class is YarnChild localizes the resources
that the task needs. YarnChild retrieves the job resources, including the job JAR,
the configuration file, and any needed files from HDFS and the distributed cache,
onto the local disk.
- YarnChild runs the map or reduce task.
Each YarnChild runs in a dedicated JVM, which isolates user code from long-running
system daemons like the Node Manager and the Application Master. Unlike
MapReduce 1, YARN does not support JVM reuse, so each task must run in a new
JVM.
Streaming and Pipes processes and communication work in the same way as in MapReduce 1.
Progress and Status Updates

When the job is running under YARN, the mapper or reducer reports its status and
progress to its Application Master every 3 seconds over the umbilical interface. The
Application Master aggregates these status reports into a view of the task status and
progress. In MapReduce 1, by contrast, the TaskTracker reports status to the JobTracker, which is
responsible for aggregating the status into a global view.
Moreover, the Node Manager sends heartbeats to the Resource Manager every few
seconds. The Node Manager monitors the Application Master and the resource
container usage (cpu, memory, network) and reports to the Resource
Manager. When a Node Manager fails and stops sending heartbeats to the Resource Manager,
the Resource Manager removes the node from its pool of available resource nodes.
The client polls the status by calling getStatus() every second to receive progress
updates, which are printed on the user console. The user can also check the status from the
web UI: the Resource Manager web UI displays all the running applications, with
links to the web UIs that display task status and progress in detail.
Job Completion

Every 5 seconds the client checks for job completion over the HTTP ClientProtocol
by calling waitForCompletion(). When the job is done, the Application Master and the
task containers clean up their working state and the OutputCommitter's job cleanup
method is called. The job information is then archived as history for later interrogation
by the user.

Compare classic MapReduce with YARN MapReduce


 YARN has Fault Tolerance (continue to work in the event of failure) and Availability

 Resource Manager

 No single point of failure – state saved in ZooKeeper

 Application Masters are restarted automatically on RM restart

 Application Master

 Optional failover via application-specific checkpoint

 MapReduce applications pick up where they left off via state saved in HDFS

 YARN has Network Compatibility

 Protocols are wire-compatible


 Old clients can talk to new servers

 Rolling upgrades

 YARN supports for programming paradigms other than MapReduce (Multi tenancy)

 Tez – Generic framework to run a complex MR

 HBase on YARN

 Machine Learning: Spark

 Graph processing: Giraph

 Real-time processing: Storm

 Multi-paradigm support is enabled by allowing the use of paradigm-specific Application Masters

 YARN runs all on the same Hadoop cluster

 YARN’s biggest advantage is multi-tenancy; being able to run multiple paradigms
simultaneously is a big plus.
Job Scheduling in MapReduce
Types of job schedulers
Failures in classic MapReduce

One of the major benefits of using Hadoop is its ability to handle such failures and
allow your job to complete.

1. Task Failure
• Consider first the case of the child task failing. The most common way that this
happens is when user code in the map or reduce task throws a runtime
exception. If this happens, the child JVM reports the error back to its parent
tasktracker, before it exits. The error ultimately makes it into the user logs. The
tasktracker marks the task attempt as failed, freeing up a slot to run another task.
• For Streaming tasks, if the Streaming process exits with a nonzero exit code, it is
marked as failed. This behavior is managed by the stream.non.zero.exit.is.failure
property.
• Another failure mode is the sudden exit of the child JVM. In this case, the tasktracker
notices that the process has exited and marks the attempt as failed.
• A task attempt may also be killed, which is different from its failing. Killed
task attempts do not count against the number of attempts to run the task, since
it wasn’t the task’s fault that the attempt was killed.

2. Tasktracker Failure
• If a tasktracker fails by crashing, or running very slowly, it will stop sending
heartbeats to the jobtracker (or send them very infrequently). The jobtracker will
notice a tasktracker that has stopped sending heartbeats (if it hasn’t received one
for 10 minutes) and remove it from its pool of tasktrackers to schedule tasks on.

• A tasktracker can also be blacklisted by the jobtracker, even if the tasktracker


has not failed. A tasktracker is blacklisted if the number of tasks that have
failed on it is significantly higher than the average task failure rate on the
cluster. Blacklisted tasktrackers can be restarted to remove them from the
jobtracker’s blacklist.

3. Jobtracker Failure
• Failure of the jobtracker is the most serious failure mode. Currently, Hadoop has
no mechanism for dealing with failure of the jobtracker—it is a single point of
failure— so in this case the job fails.
Failures in YARN

Referring to the YARN architecture figure above, container and task failures are handled by the
Node Manager. When a container fails or dies, the Node Manager detects the failure event and
launches a new container to replace the failing one, restarting the task execution in the new
container. In the event of Application Master failure, the Resource Manager detects the failure
and starts a new instance of the Application Master in a new container. The ability to recover the
associated job state depends on the Application Master implementation: the MapReduce
Application Master can recover the state, but this is not enabled by default. Besides the Resource
Manager, the associated client also reacts to the failure: it contacts the Resource Manager to
locate the new Application Master's address.

Upon failure of a Node Manager, the Resource Manager updates its list of
available Node Managers. The Application Master should recover the tasks run on the
failed Node Manager, but this depends on the Application Master implementation.
The MapReduce Application Master has the additional capability to recover failed tasks
and to blacklist Node Managers that fail often.

Failure of the Resource Manager is severe, since clients cannot submit new jobs
and existing running jobs cannot negotiate and request new containers.
Existing Node Managers and Application Masters try to reconnect to the failed Resource
Manager, and job progress is lost when they are unable to reconnect. This loss of
job progress will likely frustrate the engineers and data scientists who use YARN, because
typical production jobs that run on top of YARN are expected to have long running times,
typically on the order of a few hours.
Furthermore, this limitation prevents YARN from being used efficiently in cloud
environments (such as Amazon EC2), where node failures happen often.
Shuffle and Sort

 MapReduce makes the guarantee that the input to every reducer is sorted by key.
 The process by which the system performs the sort—and transfers the map
outputs to the reducers as inputs—is known as the shuffle.
 The shuffle is an area of the codebase where refinements and improvements are
continually being made.

STEPS

1. The Map Side
2. The Reduce Side
3. Configuration Tuning

I. The Map Side


When the map function starts producing output, it is not simply written to disk. The
process is more involved, and takes advantage of buffering writes in memory and doing
some presorting for efficiency reasons.

Fig: Shuffle and sort in MapReduce

The buffer is 100 MB by default, a size which can be tuned by changing the
io.sort.mb property. When the contents of the buffer reaches a certain threshold size a
background thread will start to spill the contents to disk. Map outputs will continue to be
written to the buffer while the spill takes place, but if the buffer fills up during this time, the
map will block until the spill is complete. Spills are written in round-robin fashion to the
directories specified by the mapred.local.dir property, in a job-specific subdirectory.
Before it writes to disk, the thread first divides the data into partitions corresponding to the
reducers that they will ultimately be sent to. Within each partition, the background thread
performs an in-memory sort by key, and if there is a combiner function, it is run on the
output of the sort.

Running the combiner function makes for a more compact map output, so there is
less data to write to local disk and to transfer to the reducer. Each time the memory buffer
reaches the spill threshold, a new spill file is created, so after the map task has written its
last output record there could be several spill files. Before the task is finished, the spill
files are merged into a single partitioned and sorted output file. The configuration property
io.sort.factor controls the maximum number of streams to merge at once; the default is 10.
If there are at least three spill files, the combiner is run again before the output file is
written. Combiners may be run repeatedly over the input without affecting the final result.
If there are only one or two spills, the potential reduction in map output size is not worth
the overhead of invoking the combiner, so it is not run again for this map output.

Compressing the map output as it is written to disk makes the write faster,
saves disk space, and reduces the amount of data to transfer to the reducer. By default,
the output is not compressed, but compression is easy to enable by setting
mapred.compress.map.output to true. The output file's partitions are made available to the
reducers over HTTP. The maximum number of worker threads used to serve the file
partitions is controlled by the tasktracker.http.threads property; the default of 40 may
need increasing for large clusters running large jobs.
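
The map-side properties discussed above can be set programmatically; a minimal sketch using the MapReduce 1 property names from this section (io.sort.spill.percent is an assumed name for the spill threshold, and the values are illustrative):

// Sketch: tuning the map-side shuffle (MapReduce 1 property names, illustrative values).
import org.apache.hadoop.conf.Configuration;

public class MapSideShuffleTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("io.sort.mb", 200);                       // in-memory sort buffer (default 100 MB)
    conf.setFloat("io.sort.spill.percent", 0.80f);        // spill threshold as a fraction of the buffer
    conf.setInt("io.sort.factor", 10);                    // max streams merged at once (default 10)
    conf.setBoolean("mapred.compress.map.output", true);  // compress map output written to disk
    conf.setInt("tasktracker.http.threads", 80);          // worker threads serving map output partitions
    System.out.println("io.sort.mb = " + conf.get("io.sort.mb"));
  }
}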
II. The Reduce Side

The map output file is sitting on the local disk of the machine that ran the
map task. The reduce task needs the map output for its particular partition from
several map tasks across the cluster.

Copy phase of reduce: The map tasks may finish at different times, so the reduce task
starts copying their outputs as soon as each completes. The reduce task has a small
number of copier threads so that it can fetch map outputs in parallel.

The default is five threads, but this number can be changed by setting the
mapred.reduce.parallel.copies property. The map outputs are copied to the reduce task
JVM's memory if they are small enough; otherwise, they are copied to disk. When the in-memory
buffer reaches a threshold size, or reaches a threshold number of map outputs, it is merged and
spilled to disk. If a combiner is specified, it is run during the merge to reduce the amount of data
written to disk. As the copies accumulate on disk, a background thread merges them into
larger, sorted files, which saves some time merging later on.
Any map outputs that were compressed have to be decompressed in memory in
order to perform a merge on them. When all the map outputs have been copied, the
reduce task moves into the sort phase, which merges the map outputs, maintaining their
sort ordering. This is done in rounds. For example, if there were 50 map outputs and the
merge factor was 10, there would be 5 rounds; each round would merge 10 files into
one, so at the end there would be five intermediate files. Rather than merging these five files
into a single sorted file, the final merge saves a trip to disk by feeding the reduce function
directly. This final merge can come from a mixture of in-memory and on-disk segments.

During the reduce phase, the reduce function is invoked for each key in the sorted
output. The output of this phase is written directly to the output filesystem, typically HDFS.
In the case of HDFS, since the tasktracker node is also running a datanode, the first
block replica will be written to the local disk.

III. Configuration Tuning


This section explains how to tune the shuffle to improve MapReduce performance. The general
principle is to give the shuffle as much memory as possible. There is a trade-off, in that
you need to make sure that your map and reduce functions get enough memory to
operate, so write your map and reduce functions to use as little memory as possible; they
should not use an unbounded amount of memory. The amount of memory given to the
JVMs in which the map and reduce tasks run is set by the mapred.child.java.opts property;
make this as large as possible for the amount of memory on your task nodes.

On the map side, the best performance can be obtained by avoiding multiple spills to
disk; one is optimal. If you can estimate the size of your map outputs, you can set the
io.sort.* properties appropriately to minimize the number of spills. There is a MapReduce
counter that counts the total number of records that were spilled to disk over the course of
a job, which can be useful for tuning. The counter includes both map- and reduce-side
spills.
On the reduce side, the best performance is obtained when the intermediate data can
reside entirely in memory. By default this does not happen, since for the general case
all the memory is reserved for the reduce function. But if your reduce function has light
memory requirements, then setting mapred.inmem.merge.threshold to 0 and
mapred.job.reduce.input.buffer.percent to 1.0 may bring a performance boost. Hadoop
uses a buffer size of 4 KB by default, which is low, so you should increase this across
the cluster.
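
A minimal sketch of the reduce-side settings described above for a reduce function with light memory requirements (io.file.buffer.size is assumed to be the 4 KB buffer referred to; values are illustrative):

// Sketch: keeping reduce-side intermediate data in memory (MapReduce 1 property names).
import org.apache.hadoop.conf.Configuration;

public class ReduceSideShuffleTuning {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    conf.setInt("mapred.reduce.parallel.copies", 10);              // copier threads (default 5)
    conf.setInt("mapred.inmem.merge.threshold", 0);                // disable the map-output-count trigger
    conf.setFloat("mapred.job.reduce.input.buffer.percent", 1.0f); // keep map outputs in memory for the reduce
    conf.setInt("io.file.buffer.size", 131072);                    // raise the low 4 KB default buffer size
    System.out.println(conf.get("mapred.job.reduce.input.buffer.percent"));
  }
}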

Input Formats
Hadoop can process many different types of data formats, from flat text files to
databases.
1) Input Splits and Records:
An input split is a chunk of the input that is processed by a single map. Each map
processes a single split. Each split is divided into records, and the map processes each
record—a key-value pair—in turn.
public abstract class InputSplit {
  public abstract long getLength() throws IOException, InterruptedException;
  public abstract String[] getLocations() throws IOException, InterruptedException;
}
 FileInputFormat: FileInputFormat is the base class for all implementations of
InputFormat that use files as their data source. It provides two things: a place to
define which files are included as the input to a job, and an implementation for
generating splits for the input files.
 FileInputFormat input paths: The input to a job is specified as a collection of paths,
which offers great flexibility in constraining the input to a job. FileInputFormat offers
four static convenience methods for setting a Job’s input paths:
public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)
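
For concreteness, a short usage sketch of these convenience methods (the paths are illustrative placeholders):

// Sketch: setting input paths on a job (paths are placeholders).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    FileInputFormat.addInputPath(job, new Path("/data/2023"));    // add a single path
    FileInputFormat.addInputPaths(job, "/data/2024,/data/2025");  // add a comma-separated list
    // setInputPaths() replaces anything added so far:
    // FileInputFormat.setInputPaths(job, new Path("/data/all"));
  }
}

The addInputPath() and addInputPaths() methods add to the list of input paths, whereas the setInputPaths() methods set the entire list in one go, replacing any paths set earlier.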
Fig: InputFormat class hierarchy

Table: Input path and filter properties

 FileInputFormat input splits: FileInputFormat splits only large files. Here “large”
means larger than an HDFS block. The split size is normally the size of an HDFS
block.
Table: Properties for controlling split size
 Preventing splitting: There are a couple of ways to ensure that an existing file is
not split. The first is to increase the minimum split size to be larger than the
largest file in your system. The second is to subclass the concrete subclass of
FileInputFormat that you want to use and override the isSplitable() method to return
false.
 File information in the mapper: A mapper processing a file input split can find
information about the split by calling the getInputSplit() method on the Mapper’s
Context object.
Table: File split properties
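
As a brief illustration, a mapper can retrieve its split like this; the cast to FileSplit assumes a file-based input format, and the mapper input/output types here are arbitrary:

// Sketch: reading split information inside a mapper (assumes a file-based input format).
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class SplitInfoMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    FileSplit split = (FileSplit) context.getInputSplit();
    String fileName = split.getPath().getName();   // name of the file this split came from
    context.write(new Text(fileName + ":" + value), NullWritable.get());
  }
}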

 Processing a whole file as a record: A related requirement that sometimes crops
up is for mappers to have access to the full contents of a file. The listing for
WholeFileInputFormat shows a way of doing this.

Example: An InputFormat for reading a whole file as a record

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {
  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    return false;
  }
}
WholeFileRecordReader is responsible for taking a FileSplit and converting it into a
single record, with a null key and a value containing the bytes of the file.
2) Text Input
 TextInputFormat: TextInputFormat is the default InputFormat. Each record is a line
of input. A file containing the following text:
On the top of the Crumpetty Tree
The Quangle Wangle sat,
But his face you could not see,
On account of his Beaver Hat.
is divided into one split of four records. The records are interpreted as the following key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
Fig: Logical records and HDFS blocks for TextInputFormat

 KeyValueTextInputFormat: You can specify the separator via the
mapreduce.input.keyvaluelinerecordreader.key.value.separator property. It is a
tab character by default. Consider the following input file, where → represents a
(horizontal) tab character:
line1→On the top of the Crumpetty Tree
line2→The Quangle Wangle sat,
line3→But his face you could not see,
line4→On account of his Beaver Hat.
Like in the TextInputFormat case, the input is in a single split comprising four records,
although this time the keys are the Text sequences before the tab in each line:
(line1, On the top of the Crumpetty Tree)
(line2, The Quangle Wangle sat,)
(line3, But his face you could not see,)
(line4, On account of his Beaver Hat.)
 NLineInputFormat: N refers to the number of lines of input that each mapper
receives. With N set to one, each mapper receives exactly one line of input. The
mapreduce.input.lineinputformat.linespermap property controls the value of N.
For example, if N is two, then each split contains two lines. One mapper will receive
the first two key-value pairs:
(0, On the top of the Crumpetty Tree)
(33, The Quangle Wangle sat,)
And another mapper will receive the second two key-value pairs:
(57, But his face you could not see,)
(89, On account of his Beaver Hat.)
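
A short configuration sketch for NLineInputFormat, assuming the new-API class in org.apache.hadoop.mapreduce.lib.input:

// Sketch: using NLineInputFormat so each mapper gets N lines of input.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 2);  // same as mapreduce.input.lineinputformat.linespermap=2
  }
}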
3) Binary Input: Hadoop MapReduce is not just restricted to processing textual data—it has
support for binary formats, too.
 SequenceFileInputFormat: Hadoop’s sequence file format stores sequences of
binary key-value pairs.
 SequenceFileAsTextInputFormat: SequenceFileAsTextInputFormat is a variant of
SequenceFileInputFormat that converts the sequence file’s keys and values to
Text objects.
 SequenceFileAsBinaryInputFormat: SequenceFileAsBinaryInputFormat is a variant
of SequenceFileInputFormat that retrieves the sequence file’s keys and values as
opaque binary objects.
4) Multiple Inputs: Although the input to a MapReduce job may consist of multiple input
files, by default all of the input is interpreted by a single InputFormat and a single Mapper.
The MultipleInputs class lets you specify the InputFormat and Mapper to use on a per-path
basis; it also has an overloaded version of addInputPath() that doesn't take a mapper:
public static void addInputPath(Job job, Path path, Class<? extends InputFormat> inputFormatClass)
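
A brief usage sketch (the paths and the commented-out mapper class are illustrative placeholders):

// Sketch: two input paths with different InputFormats (and, optionally, different mappers).
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class MultipleInputsExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    MultipleInputs.addInputPath(job, new Path("/data/text"), TextInputFormat.class);
    MultipleInputs.addInputPath(job, new Path("/data/seq"), SequenceFileInputFormat.class);
    // Overload with a per-path mapper (MyTextMapper is a placeholder class):
    // MultipleInputs.addInputPath(job, new Path("/data/text"), TextInputFormat.class,
    //                             MyTextMapper.class);
  }
}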
Output Formats

Figure: OutputFormat class hierarchy

1) Text Output: The default output format, TextOutputFormat, writes records as lines of
text. Its keys and values may be of any type, since TextOutputFormat turns them to strings
by calling toString() on them.
2) Binary Output
 SequenceFileOutputFormat: As the name indicates, SequenceFileOutputFormat
writes sequence files for its output. Compression is controlled via the static
methods on SequenceFileOutputFormat.
 SequenceFileAsBinaryOutputFormat: SequenceFileAsBinaryOutputFormat is the
counterpart to SequenceFileAsBinaryInputFormat; it writes keys and values in
raw binary format into a SequenceFile container.
 MapFileOutputFormat: MapFileOutputFormat writes MapFiles as output. The keys
in a MapFile must be added in order, so you need to ensure that your reducers
emit keys in sorted order.
3) Multiple Outputs: FileOutputFormat and its subclasses generate a set of files in the
output directory. There is one file per reducer, and files are named by the partition number:
part-r-00000, part-r-00001, etc. Sometimes there is a need for finer control over output file
names or for more than one file per reducer; MapReduce comes with the MultipleOutputs
class to help you do this.
Zero reducers: There are no partitions, as the application needs to run only map
tasks.
One reducer: It can be convenient to run small jobs to combine the output of
previous jobs into a single file. This should only be attempted when the amount of data is
small enough to be processed comfortably by one reducer.
 MultipleOutputs: MultipleOutputs allows you to write data to files whose names are
derived from the output keys and values, or in fact from an arbitrary
string. MultipleOutputs delegates to the mapper's OutputFormat, which in this
example is a TextOutputFormat, but more complex setups are possible.
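
A minimal sketch of MultipleOutputs in a reducer that names each output file after the key; the naming scheme and the key/value types are illustrative, not prescribed by the text:

// Sketch: writing reducer output to files named after the key (illustrative naming scheme).
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

public class MultipleOutputsReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private MultipleOutputs<Text, IntWritable> multipleOutputs;

  @Override
  protected void setup(Context context) {
    multipleOutputs = new MultipleOutputs<Text, IntWritable>(context);
  }

  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }
    // The third argument is a base output path, relative to the job output directory.
    multipleOutputs.write(key, new IntWritable(sum), key.toString());
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    multipleOutputs.close();   // must be closed so the output files are flushed
  }
}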
 Lazy Output: FileOutputFormat subclasses will create output (part-r-nnnnn) files
even if they are empty. Some applications prefer that empty files not be created,
which is where LazyOutputFormat helps. It is a wrapper output format that ensures
that the output file is created only when the first record is emitted for a given
partition. To use it, call its setOutputFormatClass() method with the JobConf
and the underlying output format.
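
With the newer MapReduce API the call takes a Job rather than a JobConf; a one-line sketch:

// Sketch: suppressing empty part files with LazyOutputFormat (new API takes a Job).
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class LazyOutputExample {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
  }
}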
 Database Output: The output formats for writing to relational databases and to
HBase are DBOutputFormat (via JDBC) and HBase's TableOutputFormat, respectively.
