
Unit 5 MapReduce and YARN

MapReduce and YARN

Big Data Ecosystem

© Copyright IBM Corporation 2019


Course materials may not be reproduced in whole or in part without the written permission of IBM.
Unit objectives
• Describe the MapReduce model v1
• List the limitations of Hadoop 1 and MapReduce 1
• Review the Java code required to handle the Mapper class, the
Reducer class, and the program driver needed to access MapReduce
• Describe the YARN model
• Compare Hadoop 2/YARN with Hadoop 1


Topic: Introduction to MapReduce processing based on MR1


MapReduce: The Distributed File System (DFS)


• Driving principles
▪ data is stored across the entire cluster
▪ programs are brought to the data, not the data to the program
• Data is stored across the entire cluster (the DFS)
▪ the entire cluster participates in the file system
▪ blocks of a single file are distributed across the cluster
▪ a given block is typically replicated as well for resiliency

(Diagram: a logical file is divided into blocks 1-4; the blocks are distributed across the cluster, and each block is replicated on three different nodes.)

MapReduce: The Distributed File System (DFS)


The driving principle of MapReduce is a simple one: spread your data out across a
huge cluster of machines and then, rather than bringing the data to your programs as
you do in traditional programming, you write your program in a specific way that allows
the program to be moved to the data. Thus, the entire cluster is made available in both
reading the data as well as processing the data.
A Distributed File System (DFS) is at the heart of MapReduce. It is responsible for
spreading data across the cluster, by making the entire cluster look like one giant file
system. When a file is written to the cluster, blocks of the file are spread out and
replicated across the whole cluster (in the diagram, notice that every block of the file is
replicated to three different machines).
Adding more nodes to the cluster instantly adds capacity to the file system and, as we'll
see on the next slide, automatically increases the available processing power and
parallelism.


MapReduce v1 explained
• Hadoop computational model
▪ Data stored in a distributed file system spanning many inexpensive
computers
▪ Bring function to the data
▪ Distribute application to the compute resources where the data is stored
• Scalable to thousands of nodes and petabytes of data
public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  public void reduce(Text key, Iterable<IntWritable> val, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : val) {
      sum += v.get();
    }
    . . .

(Diagram: the MapReduce application distributes map tasks to the Hadoop data nodes. 1. Map phase: break the job into small parts. 2. Shuffle: transfer interim output to the reducers for final processing. 3. Reduce phase: boil all output down to a single result set, which is returned to the application.)

MapReduce v1 explained
There are two aspects of Hadoop that are important to understand:
• MapReduce is a software framework introduced by Google to support
distributed computing on large data sets of clusters of computers.
• The Hadoop Distributed File System (HDFS) is where Hadoop stores its data.
This file system spans all the nodes in a cluster. Effectively, HDFS links together
the data that resides on many local nodes, making the data part of one big file
system. Furthermore, HDFS assumes nodes will fail, so it replicates a given
chunk of data across multiple nodes to achieve reliability. The degree of
replication can be customized by the Hadoop administrator or programmer.
By default, however, every chunk of data is replicated across 3 nodes: 2 on the
same rack and 1 on a different rack.
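For illustration only, the replication factor is normally set in hdfs-site.xml; the property name dfs.replication is the standard one, and the value shown is simply the default described above:

<property>
  <name>dfs.replication</name>
  <value>3</value>  <!-- number of replicas kept for each block -->
</property>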
The key to understanding Hadoop lies in the MapReduce programming model. This is
essentially a representation of the divide and conquer processing model, where your
input is split into many small pieces (the map step), and the Hadoop nodes process
these pieces in parallel. Once these pieces are processed, the results are distilled (in
the reduce step) down to a single answer.


MapReduce v1 engine
• Master/Slave architecture
▪ Single master (JobTracker) controls job execution on multiple slaves
(TaskTrackers).
• JobTracker
▪ Accepts MapReduce jobs submitted by clients
▪ Pushes map and reduce tasks out to TaskTracker nodes
▪ Keeps the work as physically close to data as possible
▪ Monitors tasks and TaskTracker status
• TaskTracker
▪ Runs map and reduce tasks
▪ Reports status to JobTracker
▪ Manages storage and transmission of intermediate output
(Diagram: a cluster of five computers; the JobTracker runs on Node 1 and controls TaskTrackers running on Nodes 2 through 5.)

MapReduce v1 engine
If one TaskTracker is very slow, it can delay the entire MapReduce job, especially
towards the end of a job, where everything can end up waiting for the slowest task. With
speculative-execution enabled, however, a single task can be executed on multiple
slave nodes.
For job scheduling, Hadoop uses FIFO (first in, first out) by default, with five
optional scheduling priorities to schedule jobs from a work queue. Other scheduling
algorithms are available as add-ins: the Fair Scheduler, the Capacity Scheduler, and so on.
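As a hedged configuration sketch, an administrator running Hadoop 1 would typically select the Fair Scheduler add-in through mapred-site.xml; the property and class names below follow common MRv1 documentation but should be verified for your distribution:

<property>
  <name>mapred.jobtracker.taskScheduler</name>
  <value>org.apache.hadoop.mapred.FairScheduler</value>
</property>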


The MapReduce programming model


• "Map" step
▪ Input is split into pieces (HDFS blocks or "splits")
▪ Worker nodes process the individual pieces in parallel
(under global control of a Job Tracker)
▪ Each worker node stores its result in its local file system where a reducer
is able to access it
• "Reduce" step
▪ Data is aggregated ("reduced" from the map steps) by worker nodes
(under control of the Job Tracker)
▪ Multiple reduce tasks parallelize the aggregation
▪ Output is stored in HDFS (and thus replicated)


The MapReduce programming model


From Wikipedia on MapReduce (http://en.wikipedia.org/wiki/MapReduce):
MapReduce is a framework for processing huge datasets on certain kinds of
distributable problems using many computers (nodes), collectively referred to as a
cluster (if all nodes use the same hardware) or as a grid (if the nodes use different
hardware). Computational processing can occur on data stored either in a file
system (unstructured) or within a database (structured).
"Map" step: The master node takes the input, chops it up into smaller sub-problems,
and distributes those to worker nodes. A worker node may do this again in turn, leading
to a multi-level tree structure. The worker node processes that smaller problem and
passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and
combines them in some way to get the output - the answer to the problem it was
originally trying to solve.
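Conceptually, the two steps can be summarized by their type signatures; this is a sketch of the model, not the literal Hadoop API:

map:    (K1, V1)          -> list of (K2, V2)
reduce: (K2, list of V2)  -> list of (K3, V3)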


The MapReduce execution environments


• APIs vs. Execution Environment
▪ APIs are implemented by applications and are largely independent of
execution environment
▪ Execution Environment defines how MapReduce jobs are executed
• MapReduce APIs
▪ org.apache.hadoop.mapred:
− Old API, largely superseded; some classes are still used in the new API
− Not changed with YARN
▪ org.apache.hadoop.mapreduce:
− New API, more flexibility, widely used
− Applications may have to be recompiled to use YARN (not binary compatible)
• Execution Environments
▪ Classic JobTracker/TaskTracker from Hadoop v1
▪ YARN (MapReduce v2): Flexible execution environment to run MapReduce
and much more
− No single JobTracker, instead ApplicationMaster jobs for every application

The MapReduce execution environments


There are two aspects of Hadoop that are important to understand:
1. MapReduce is a software framework introduced by Google to support distributed
computing on large data sets of clusters of computers. I'll explain more about
MapReduce in a minute.
2. The Hadoop Distributed File System (HDFS) is where Hadoop stores its data.
This file system spans all the nodes in a cluster. Effectively, HDFS links together
the data that resides on many local nodes, making the data part of one big file
system. I'll explain more about HDFS in a minute, too. You can use other file
systems with Hadoop, but HDFS is quite common.


MapReduce 1 overview

(Diagram: data in blocks is read from the distributed file system (HDFS), flows through the Map, Shuffle, and Reduce stages, and the results are written to HDFS or a database.)

MapReduce 1 overview
This slide provides an overview of the MR1 process.
File blocks (stored on different DataNodes) in HDFS are read and processed by the
Mappers.
The output of the Mapper processes is shuffled (sent) to the Reducers (one output file
from each Mapper to each Reducer); these files are not replicated and are stored
locally on the Mapper node.
The Reducers produce the output, which is stored in HDFS, with one file for
each Reducer.


MapReduce: Map phase


• Mappers
▪ Small program (typically), distributed across the cluster, local to data
▪ Handed a portion of the input data (called a split)
▪ Each mapper parses, filters, or transforms its input
▪ Produces grouped <key,value> pairs
(Diagram: a logical input file of four blocks; a map task runs locally against each block, each map output is sorted, then copied and merged to the reducers; two reduce tasks each write one logical output file to the DFS.)


MapReduce: Map phase


Earlier, you learned that if you write your programs in a special way, the programs can
be brought to the data. This special way is called MapReduce, and it involves breaking
your program down into two discrete parts: Map and Reduce.
A mapper is typically a relatively small program with a relatively simple task: it is
responsible for reading a portion of the input data, interpreting, filtering or transforming
the data as necessary and then finally producing a stream of <key, value> pairs. What
these keys and values are is not of importance in the scope of this topic, but just keep
in mind that these values can be as big and complex as you need.
Notice in the diagram, how the MapReduce environment will automatically take care of
taking your small "map" program (the black boxes) and pushing that program out to
every machine that has a block of the file you are trying to process. This means that the
bigger the file, the bigger the cluster, the more mappers get involved in processing the
data. That's a powerful idea.


MapReduce: Shuffle phase


• The output of each mapper is locally grouped together by key
• One node is chosen to process data for each unique key
• All of this movement (shuffle) of data is transparently orchestrated by
MapReduce

(Diagram: the same pipeline with the Shuffle stage highlighted; the sorted output of each of the four map tasks is copied to the reduce nodes and merged there before the two reduce tasks run.)


MapReduce: Shuffle phase


This next phase is called "Shuffle" and is orchestrated behind the scenes by
MapReduce.
The idea here is that all the data that is being emitted from the mappers is first locally
grouped by the <key> that your program chose, and then for each unique key, a node
is chosen to process all the values from all the mappers for that key.
For example, if you used U.S. state (such as MA, AK, NY, CA, etc.) as the key of your
data, then one machine would be chosen to send all the California data to, and another
chosen for all the New York data, and so on. Each machine would be responsible for
processing the data for its selected state. In the diagram above, the data clearly only
had two keys (shown as white and black boxes), but keep in mind that there may be
many records with the same key coming from a given mapper.


MapReduce: Reduce phase


• Reducers
▪ Small programs (typically) that aggregate all of the values for the key
that they are responsible for
▪ Each reducer writes output to its own file

(Diagram: the same pipeline with the Reduce stage highlighted; each of the two reduce tasks merges the sorted map outputs assigned to it and writes its own logical output file to the DFS.)


MapReduce: Reduce phase


Reducers are the last part of the picture. Again, these are typically small programs
that are responsible for sorting and/or aggregating all the values with the key that
each was assigned to work on. Just as with mappers, the more unique keys you have,
the more parallelism you get.
Once each reducer has completed whatever it was assigned to do, such as adding up the
total sales for the state it was assigned, it in turn emits key/value pairs that get written to
storage and can then be used as the input to the next MapReduce job.
This is a simplified overview of MapReduce.


MapReduce: Combiner (Optional)


• The data that will go to each reduce node is sorted and merged on the map
node before it is sent, pre-doing some of the work of the receiving
reduce node in order to minimize network traffic between map and
reduce nodes.

(Diagram: the same pipeline with a combine step added after each map task's sort; the combined outputs are then copied and merged to the two reduce tasks as before.)


MapReduce: Combiner (Optional)


At the same time as the Sort is done during the Shuffle work on the Mapper node, an
optional Combiner function may be applied.
For each key, all key/values with that key are sent to the same Reducer node (that is
the purpose of the Shuffle phase).
Rather than sending multiple key/value pairs with the same key value to the Reducer
node, the values are combined into one key/value pair. This is only possible where the
reduce function is additive (or does not lose information when combined).
Since only one key/value pair is sent, the file transferred from Mapper node to Reducer
node is smaller and network traffic is minimized.
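A short worked example of why the combine operation must not lose information (the numbers are chosen only for illustration): a sum can safely be combined, because sum(sum(1,2), 6) = sum(3, 6) = 9 = sum(1,2,6); an average cannot, because avg(avg(1,2), 6) = avg(1.5, 6) = 3.75, whereas avg(1,2,6) = 3. To combine an average correctly, the Combiner must emit partial sums and counts rather than averages.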


WordCount example
• In the example of a list of animal names
▪ MapReduce can automatically split files on line breaks
▪ The file has been split into two blocks on two nodes
• To count how often each big cat is mentioned, in SQL you would use:

SELECT NAME, COUNT(NAME) FROM ANIMALS
WHERE NAME IN ('Tiger', 'Lion', …)
GROUP BY NAME;

Node 1 Node 2
Tiger Tiger
Lion Tiger
Lion Wolf
Panther Panther
Wolf …


WordCount example
In a file with two blocks (or "splits") of data, animal names are listed. There is one
animal name per line in the files.
Rather than count the number of mentions of each animal, you are interested only in
members of the cat family.
Since the blocks are held on different nodes, software running on the individual nodes
processes the blocks separately.
If you were using SQL (which MapReduce does not use), the query would be as stated in the slide.


Map task
• There are two requirements for the Map task:
▪ filter out the non big-cat rows
▪ prepare count by transforming to <Text(name), Integer(1)>

Node 1
Tiger      <Tiger, 1>
Lion       <Lion, 1>
Lion       <Lion, 1>
Panther    <Panther, 1>
Wolf       …

Node 2
Tiger      <Tiger, 1>
Tiger      <Tiger, 1>
Wolf       …
Panther    <Panther, 1>

The Map Tasks are executed locally on each split.


Map task
Reviewing the description of MapReduce from Wikipedia
(https://en.wikipedia.org/wiki/MapReduce):
MapReduce is a framework for processing huge datasets on certain kinds of
distributable problems using a large number of computers (nodes), collectively
referred to as a cluster (if all nodes use the same hardware) or as a grid (if the
nodes use different hardware). Computational processing can occur on data stored
either in a file system (unstructured) or within a database (structured).
"Map" step: The master node takes the input, breaks it up into smaller sub-problems,
and distributes those to worker nodes. A worker node may do this again in turn, leading
to a multi-level tree structure. The worker node processes that smaller problem and
passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and
combines them in some way to get the output, the answer to the problem it was
originally trying to solve.


The Map step shown does the following processing:


• each Map node reads its own "split" (block) of data
• the information required (in this case, the names of animals) is extracted from
each record (in this case, one line = one record)
• data is filtered (keeping only the names of cat family animals)
• key-value pairs are created (in this case, key = animal and value = 1)
• key-value pairs are accumulated into locally stored files on the individual nodes
where the Map task is being executed
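A minimal sketch of such a Map task in Java (new API) follows; the class name, the set of cat names, and the other identifiers are illustrative assumptions, not part of the course code:

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BigCatMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // The filter list: only members of the cat family are counted.
  private static final Set<String> BIG_CATS =
      new HashSet<String>(Arrays.asList("Tiger", "Lion", "Panther"));
  private static final IntWritable ONE = new IntWritable(1);
  private final Text name = new Text();

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String animal = value.toString().trim();  // one record = one line
    if (BIG_CATS.contains(animal)) {          // filter out the non big-cat rows
      name.set(animal);
      context.write(name, ONE);               // emit <Text(name), Integer(1)>
    }
  }
}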


Shuffle
• Shuffle moves all values of one key to the same target node
• Distributed by a Partitioner Class (normally hash distribution)
• Reduce Tasks can run on any node - here on Nodes 1 and 3
▪ The numbers of Map and Reduce tasks do not need to be identical
▪ Differences are handled by the hash partitioner
Node 1 (Map output)              Node 1 (Reduce input after Shuffle)
Tiger     <Tiger, 1>             <Panther, <1,1>>
Lion      <Lion, 1>              <Tiger, <1,1,1>>
Lion      <Lion, 1>              …
Panther   <Panther, 1>
Wolf      …

Node 2 (Map output)              Node 3 (Reduce input after Shuffle)
Tiger     <Tiger, 1>             <Lion, <1,1>>
Tiger     <Tiger, 1>             …
Wolf      …
Panther   <Panther, 1>

Shuffle distributes keys using a hash partitioner. Results are stored locally on the machines that run the Reduce tasks.

Shuffle
Shuffle distributes the key-value pairs to the nodes where the Reducer task will run.
Each Mapper task produces one file for each Reducer task. A hash function running on
the Mapper node determines which Reducer task will receive any key-value pair. All
key-value pairs with a given key will be sent to the same Reducer task.
Reduce tasks can run on any node, either different from the set of nodes where the
Map tasks run or on the same DataNodes. In the slide example, Node 1 is used for one
Reduce task, but a new node, Node 3, is used for the second Reduce task.
There is no relation between the number of Map tasks (generally one for each
block of the file(s) being read) and the number of Reduce tasks. Commonly the number
of Reduce tasks is smaller than the number of Map tasks.
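For reference, the default hash partitioner in Hadoop amounts to the following rule (a sketch of the well-known HashPartitioner logic, shown here for the Text/IntWritable pairs of this example):

// Mask off the sign bit of the key's hash code, then take the remainder
// modulo the number of Reduce tasks to pick a partition (0-based).
public int getPartition(Text key, IntWritable value, int numReduceTasks) {
  return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}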


Reduce
• The reduce task computes aggregated values for each key
▪ Normally the output is written to the DFS
▪ Default is one output part-file per Reduce task
▪ Reduce tasks aggregate all values of a specific key - in our case, the count of
the particular animal type
Reducer Tasks running on DataNodes; output files are stored in HDFS

Node 1
<Panther, <1,1>>  ->  <Panther, 2>
<Tiger, <1,1,1>>  ->  <Tiger, 3>
…

Node 3
<Lion, <1,1>>  ->  <Lion, 2>


Reduce
Note that these two Reducer tasks are running on Nodes 1 and 3.
The Reduce node then takes the answers to all the sub-problems and combines them
in some way to get the output - the answer to the problem it was originally trying to
solve.
In this case, the Reduce step shown on this slide does the following processing:
• Each Reduce node reads the data sent to it from the various Map nodes
• This data has been previously sorted (and possibly partially merged)
• The Reduce node aggregates the data; in the case of WordCount, it sums the
counts received for each word (each animal in this case)
• One file is produced for each Reduce task, and that file is written to HDFS, where
the blocks will automatically be replicated


Optional: Combiner
• For performance, an aggregate in the Map task can be helpful
• Reduces the amount of data sent over the network
▪ Also reduces Merge effort, since data is premerged in Map
▪ Done in the Map task, before Shuffle
Map Task (with Combiner) running on each of two DataNodes:

Node 1 input: Tiger, Lion, Panther, Wolf
  Map output:      <Tiger, 1>, <Lion, 1>, <Panther, 1>
  Combined output: <Lion, 1>, <Panther, 1>, <Tiger, 1>

Node 2 input: Tiger, Tiger, Wolf, Panther
  Map output:      <Tiger, 1>, <Tiger, 1>, <Panther, 1>
  Combined output: <Tiger, 2>, <Panther, 1>

Reduce Tasks:
Node 1 receives <Panther, <1,1>> and <Tiger, <1,2>>
Node 3 receives <Lion, 1>


Optional: Combiner
The Combiner phase is optional. When it is used, it runs on the Mapper node and
preprocesses the data that will be sent to Reduce tasks by pre-merging and pre-
aggregating the data in the files that will be transmitted to the Reduce tasks.
The Combiner thus reduces the amount of data that will be sent to the Reducer tasks,
and that speeds up the processing as smaller files need to be transmitted to the
Reducer task nodes.


Source code for WordCount.java (1 of 3)


package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

Source code for WordCount.java


This is a slightly simplified version of WordCount.java for MapReduce 1. The full
program is slightly larger, and there are some recommended differences for compiling
for MapReduce 2 with Hadoop 2.
Code from the Hadoop classes is brought in with the import statements. Like an
iceberg, most of the actual code executed at runtime is hidden from the programmer; it
runs deep down in the Hadoop code itself.
The class of interest here is the Mapper class, Map.
This class reads the file (specified, as you will see on the driver class slide later, by arg[0]) as a string.
The string is tokenized, that is, broken into words separated by spaces.
Note the following shortcomings of the code:
• No lowercasing is done, thus The and the are treated as separate words that
will be counted separately.
• Any adjacent punctuation is appended to the word, thus "the (leading double
quote) and the (no quotes) will be counted separately, and any word followed
by punctuation, for example cow, (trailing comma) is counted separately from
cow (the same word without trailing punctuation).


You will see these shortcomings in the output. Note that this is the standard WordCount
program; at this stage the interest is not in the actual results but only in the process.
The WordCount program is to Hadoop Java programs what the "Hello, world!"
program is to the C language: it is generally the first program that people experience
when coming to the new technology.
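For orientation, a job built from this source is typically packaged as a jar and submitted from the command line; the jar name and HDFS paths below are hypothetical:

hadoop jar wordcount.jar org.myorg.WordCount /user/student/input /user/student/output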


Source code for WordCount.java (2 of 3)


  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
        OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }

The Reducer class, Reduce, is displayed in the slide.


The key-value pairs arrive at this class already sorted (courtesy of the core Hadoop
classes that you do not see), thus adjacent records will have the same key.
While the key does not change, the values are aggregated (in this case, summed)
using: sum += …


Source code for WordCount.java (3 of 3)


  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);
  }
}


The driver routine, embedded in main, does the following work:


• sets the JobName for runtime
• sets the Mapper class to Map
• sets the Reducer class to Reduce
• sets the Combiner class to Reduce
• sets the input file to arg[0]
• sets the output directory to arg[1]
The combiner runs on the Map task and uses the same code as the Reducer task.
The names of the output files are generated inside the Hadoop code.
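For example, with two Reduce tasks the output directory would typically contain the following in MapReduce v1 (file names follow Hadoop's part-NNNNN convention; the empty _SUCCESS marker is normally written when the job completes):

output/
  part-00000   <- output of Reduce task 0
  part-00001   <- output of Reduce task 1
  _SUCCESS     <- marker file written on successful completion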


How does Hadoop run MapReduce jobs?


(Diagram: 1. the MapReduce program, in the client JVM on the client node, tells the JobClient to run the job; 2. the JobClient gets a new job ID from the JobTracker; 3. the JobClient copies the job resources to the distributed file system, e.g. HDFS; 4. the JobClient submits the job to the JobTracker; 5. the JobTracker initializes the job; 6. the JobTracker retrieves the input splits; 7. a TaskTracker heartbeat returns a task; 8. the TaskTracker retrieves the job resources; 9. the TaskTracker launches a child JVM; 10. the child process runs the MapTask or ReduceTask on the TaskTracker node.)

• Client: submits MapReduce jobs
• JobTracker: coordinates the job run; breaks the job down into map and reduce
tasks for each node to work on in the cluster
• TaskTracker: executes the Map and Reduce task functions


How does Hadoop run MapReduce jobs?


The process of running a MapReduce job on Hadoop consists of 10 major steps:
1. The MapReduce program you have written tells the JobClient to run a
MapReduce job.
2. This sends a message to the JobTracker, which produces a unique ID for the
job.
3. The JobClient copies job resources, such as a jar file containing the Java code
you have written to implement the map or the reduce task, to the shared file
system, usually HDFS.
4. Once the resources are in HDFS, the Job Client can tell the JobTracker to start
the job.
5. The JobTracker does its own initialization for the job. It calculates how to split
the data so that it can send each "split" to a different mapper process to
maximize throughput.
6. It retrieves these "input splits" from the distributed file system, not the data itself.


7. The TaskTrackers are continually sending heartbeat messages to the


JobTracker. Now that the JobTracker has work for them, it will return a map task
or a reduce task as a response to the heartbeat.
8. The TaskTrackers need to obtain the code to execute, so they get it from the
shared file system.
9. Then they can launch a Java Virtual Machine with a child process running in it
and this child process runs your map code or your reduce code. The result of
the map operation will remain in the local disk for the given TaskTracker node
(not in HDFS).
10. The output of the Reduce task is stored in HDFS file system using the number
of copies specified by replication factor.


Classes
• There are three main Java classes provided in Hadoop to read data
in MapReduce:
▪ InputSplitter: divides a file into splits
− Splits are normally the block size, but this depends on the number of requested Map
tasks, whether any compression allows splitting, etc.
▪ RecordReader: takes a split and reads the file into records
− For example, one record per line (LineRecordReader)
− But note that a record can be split across splits
▪ InputFormat: takes each record and transforms it into a
<key, value> pair that is then passed to the Map task
• Lots of additional helper classes may be required to handle
compression, e.g. LZO compression, etc.


Classes
The InputSplitter, RecordReader, and InputFormat classes are provided inside the
Hadoop code.
Other helper classes are needed to support Java MapReduce programs. Some of
these are provided inside the Hadoop code itself, but distribution vendors and end
programmers can provide other classes that either override or supplement the standard
code. Thus some vendors provide the LZO compression algorithm to supplement the
standard compression codecs (such as the codec for bzip2, etc.).


Splits
• Files in HDFS are stored in blocks (128 MB by default)
• MapReduce divides data into fragments or splits.
▪ One map task is executed on each split
• Most files have records with defined split points
▪ Most common is the end of line character
• The InputSplitter class is responsible for taking an HDFS file and
transforming it into splits.
▪ The aim is to process as much data as possible locally


RecordReader
• Most of the time a split will not end exactly at a record boundary
• Files are read into Records by the RecordReader class
▪ Normally the RecordReader will start and stop at the split points.
• LineRecordReader will read over the end of the split till the line end.
▪ HDFS will send the missing piece of the last record over the network
• Likewise LineRecordReader for Block2 will disregard the first
incomplete line
Node 1 Node 2
Tiger\n ther\n
Tiger\n Tiger\n
Lion\n Wolf\n
Pan Lion

In this example RecordReader1 will not stop at Pan but will read on until the end of
the line. Likewise RecordReader2 will ignore the first line.

InputFormat
• MapReduce Tasks read files by defining an InputFormat class
▪ Map tasks expect <key, value> pairs
• To read line-delimited text files, Hadoop provides the TextInputFormat
class
▪ It returns one key-value pair per line of the text
▪ The value is the content of the line
▪ The key is the byte offset at which the line starts within the file

Node 1
Tiger <0, Tiger>
Lion <6, Lion>
Lion <11, Lion>
Panther <16, Panther>
Wolf <24, Wolf>
… …

InputFormat
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
• Validate the input-specification of the job.
• Split-up the input file(s) into logical InputSplits, each of which is then assigned to
an individual Mapper.
• Provide the RecordReader implementation to be used to glean input records
from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats, typically sub-classes of
FileInputFormat, is to split the input into logical InputSplits based on the total size, in
bytes, of the input files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size can be set via
mapreduce.input.fileinputformat.split.minsize.
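A hedged sketch of setting that lower bound from a driver written against the new API (the 64 MB value is only an example):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

// Inside a driver's main(String[] args) throws Exception:
// raise the lower bound on split size to 64 MB; this is equivalent to
// setting mapreduce.input.fileinputformat.split.minsize in the configuration.
Job job = Job.getInstance();
FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);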
Clearly, logical splits based on input size are insufficient for many applications, since
record boundaries must be respected. In such cases, the application must also
implement a RecordReader, on which lies the responsibility to respect record
boundaries and present a record-oriented view of the logical InputSplit to the individual
task.


Fault tolerance
(Diagram: failures can occur at three levels: (1) a MapTask or ReduceTask fails in the child JVM on the TaskTracker node; (2) the TaskTracker itself fails; (3) the JobTracker fails. TaskTrackers report to the JobTracker through heartbeat messages.)


Fault tolerance
What happens when something goes wrong?
Failures can happen at the task level (1), TaskTracker level (2), or JobTracker level (3).
The primary way that Hadoop achieves fault tolerance is through restarting tasks.
Individual task nodes (TaskTrackers) are in constant communication with the head
node of the system, called the JobTracker. If a TaskTracker fails to communicate with
the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume
that the TaskTracker in question has crashed. The JobTracker knows which map and
reduce tasks were assigned to each TaskTracker.
If the job is still in the mapping phase, then other TaskTrackers will be asked to re-
execute all map tasks previously run by the failed TaskTracker. If the job is in the
reducing phase, then other TaskTrackers will re-execute all reduce tasks that were in
progress on the failed TaskTracker.


Reduce tasks, once completed, have been written back to HDFS. Thus, if a
TaskTracker has already completed two out of three reduce tasks assigned to it, only
the third task must be executed elsewhere. Map tasks are slightly more complicated:
even if a node has completed ten map tasks, the reducers may not have all copied their
inputs from the output of those map tasks. If a node has crashed, then its mapper
outputs are inaccessible. So any already-completed map tasks must be re-executed to
make their results available to the rest of the reducing machines. All of this is handled
automatically by the Hadoop platform.
This fault tolerance underscores the need for program execution to be side-effect free.
If Mappers and Reducers had individual identities and communicated with one another
or the outside world, then restarting a task would require the other nodes to
communicate with the new instances of the map and reduce tasks, and the re-executed
tasks would need to reestablish their intermediate state. This process is notoriously
complicated and error-prone in the general case. MapReduce simplifies this problem
drastically by eliminating task identities or the ability for task partitions to communicate
with one another. An individual task sees only its own direct inputs and knows only its
own outputs, to make this failure and restart process clean and dependable.


Topic: Issues with and limitations of Hadoop v1 and MapReduce v1


Topic: Issues with and limitations of Hadoop v1 and MapReduce v1


The original Hadoop (v1) and MapReduce (v1) had limitations, and a number of issues
surfaced over time. You will review these in preparation for looking at the differences
and changes introduced with Hadoop 2 and MapReduce v2.


Issues with the original MapReduce paradigm


• Centralized handling of job control flow
• Tight coupling of a specific programming model with the resource
management infrastructure
• Hadoop is now being used for all kinds of tasks beyond its original
design


Issues with the original MapReduce paradigm


These will be reviewed in more detail in this unit.


Limitations of classic MapReduce (MRv1)


The most serious limitations of classic MapReduce are:
▪ Scalability
▪ Resource utilization
▪ Support for workloads other than MapReduce
• In the MapReduce framework, the job execution is controlled by two
types of processes:
▪ A single master process called JobTracker, which coordinates all jobs
running on the cluster and assigns map and reduce tasks to run on the
TaskTrackers
▪ A number of subordinate processes called TaskTrackers, which run
assigned tasks and periodically report the progress to the JobTracker


Limitations of classic MapReduce (MRv1)


This topic is well covered in the article:
• Introduction to YARN: https://hortonworks.com/apache/yarn/


Scalability in MRv1: Busy JobTracker


Scalability in MRv1: Busy JobTracker


In Hadoop MapReduce, the JobTracker is charged with two distinct responsibilities:
• Management of computational resources in the cluster, which involves
maintaining the list of live nodes, the list of available and occupied map and
reduce slots, and allocating the available slots to appropriate jobs and tasks
according to selected scheduling policy
• Coordination of all tasks running on a cluster, which involves instructing
TaskTrackers to start map and reduce tasks, monitoring the execution of the
tasks, restarting failed tasks, speculatively running slow tasks, calculating total
values of job counters, and more
The large number of responsibilities given to a single process caused significant
scalability issues, especially on a larger cluster where the JobTracker had to constantly
keep track of thousands of TaskTrackers, hundreds of jobs, and tens of thousands of
map and reduce tasks. The diagram represents this issue. By contrast, the
TaskTrackers usually ran only a dozen or so tasks, which were assigned to them by the
hard-working JobTracker.


YARN modifies MRv1


• MapReduce has undergone a complete modification with YARN,
splitting up the two major functionalities of JobTracker (resource
management and job scheduling/monitoring) into separate daemons
• ResourceManager (RM)
▪ The global ResourceManager and per-node slave, the NodeManager (NM),
form the data-computation framework
▪ The ResourceManager is the ultimate authority that arbitrates resources
among all the applications in the system
• ApplicationMaster (AM)
▪ The per-application ApplicationMaster is, in effect, a framework specific
library responsible for negotiating resources from the ResourceManager
and working with the NodeManager(s) to execute and monitor the tasks
▪ An application is either a single job in the classical sense of Map-Reduce
jobs or a directed acyclic graph (DAG) of jobs


YARN modifies MRv1


The fundamental idea of YARN/MRv2 is to split up the two major functionalities of the
JobTracker, resource management and job scheduling/monitoring, into separate
daemons. The idea is to have a global ResourceManager (RM) and per-application
ApplicationMaster (AM). An application is either a single job in the classical sense of
Map-Reduce jobs or a DAG of jobs.
The ResourceManager and per-node slave, the NodeManager (NM), form the data-
computation framework. The ResourceManager is the ultimate authority that arbitrates
resources among all the applications in the system.
The per-application ApplicationMaster is, in effect, a framework specific library and is
tasked with negotiating resources from the ResourceManager and working with the
NodeManager(s) to execute and monitor the tasks.
The ResourceManager has two main components: Scheduler and
ApplicationsManager.


The Scheduler is responsible for allocating resources to the various running


applications, subject to familiar constraints of capacities, queues, etc. The Scheduler is
a pure scheduler in the sense that it performs no monitoring or tracking of status for the
application. Also, it offers no guarantees about restarting failed tasks, whether due to
application failure or hardware failures. The Scheduler performs its scheduling function
based on the resource requirements of the applications; it does so based on the abstract
notion of a resource Container, which incorporates elements such as memory, CPU, disk,
network, etc. In the first version, only memory is supported.
The Scheduler has a pluggable policy plug-in, which is responsible for partitioning the
cluster resources among the various queues, applications etc. The current Map-
Reduce schedulers such as the CapacityScheduler and the FairScheduler would be
some examples of the plug-in.
The CapacityScheduler supports hierarchical queues to allow for more predictable
sharing of cluster resources.
The ApplicationsManager is responsible for accepting job-submissions, negotiating
the first container for executing the application specific ApplicationMaster and provides
the service for restarting the ApplicationMaster container on failure.
The NodeManager is the per-machine framework agent who is responsible for
containers, monitoring their resource usage (cpu, memory, disk, network) and reporting
the same to the ResourceManager/Scheduler.
The per-application ApplicationMaster has the responsibility of negotiating appropriate
resource containers from the Scheduler, tracking their status and monitoring for
progress.
MRV2 maintains API compatibility with previous stable release (hadoop-1.x). This
means that all Map-Reduce jobs should still run unchanged on top of MRv2 with just a
recompile.


Hadoop 1 and 2 architectures compared


(Diagram comparing the two stacks:
Hadoop 1: MapReduce (batch parallel computing) runs directly on HDFS (distributed storage) and is the only execution engine.
Hadoop 2: YARN (resource management) runs on HDFS (distributed storage) and supports many execution engines, including MapReduce 2 (batch parallel computing), Spark (in-memory processing), HBase on YARN, and more; MapReduce is now a port that runs as a YARN application.
HDFS is common to both versions. The YARN framework provides work scheduling that is neutral to the nature of the work being performed.)


Hadoop 1 and 2 architectures compared


This diagram is developed from that in Alex Holmes, Hadoop in Practice, 2nd ed.
(Shelter Island, NY: Manning, 2015).


YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability


YARN features: Scalability


• There is one Application Master per job, which is why YARN scales
better than the previous Hadoop v1 architecture
▪ The Application Master for a given job can run on an arbitrary cluster node,
and it runs until the job reaches termination
• Separation of functionality allows the individual operations to be
improved with less effect on other operations
• YARN supports rolling upgrades without downtime

ResourceManager focuses exclusively on scheduling, allowing clusters to


expand to thousands of nodes managing petabytes of data


YARN features: Scalability


YARN lifts the scalability ceiling in Hadoop by splitting the roles of the Hadoop
JobTracker into two processes. A ResourceManager controls access to the cluster's
resources (memory, CPU, etc.), and the ApplicationMaster (one per job) controls task
execution.
YARN can run on larger clusters than MapReduce 1. MapReduce 1 hits scalability
bottlenecks in the region of 4,000 nodes and 40,000 tasks, stemming from the fact that
the JobTracker has to manage both jobs and tasks. YARN overcomes these limitations
by virtue of its split ResourceManager/ApplicationMaster architecture: it is designed to
scale up to 10,000 nodes and 100,000 tasks.
In contrast to the JobTracker, each instance of an application has a dedicated
ApplicationMaster, which runs for the duration of the application. This model is actually
closer to the original Google MapReduce paper, which describes how a master process
is started to coordinate map and reduce tasks running on a set of workers.


YARN features: Multi-tenancy


• YARN allows multiple access engines (either open-source or
proprietary) to use Hadoop as the common standard for batch,
interactive, and real-time engines that can simultaneously access the
same data sets
• YARN uses a shared pool of nodes for all jobs
• YARN allows the allocation of Hadoop clusters of fixed size from the
shared pool

Multi-tenant data processing improves an


enterprise's return on its Hadoop investment


YARN features: Multi-tenancy


Multi-tenancy generally refers to a set of features that enable multiple business users
and processes to share a common set of resources, such as an Apache Hadoop
cluster via policy rather than physical separation, yet without negatively impacting
Service Level Agreements (SLA), violating security requirements, or even revealing the
existence of each party.
What YARN does is essentially de-couple Hadoop workload management from
resource management. This means that multiple applications can share a common
infrastructure pool. While this idea is not new to many, it is new to Hadoop. Earlier
versions of Hadoop consolidated both workload and resource management functions
into a single JobTracker. This approach resulted in limitations for customers hoping to
run multiple applications on the same cluster infrastructure.


To borrow from object-oriented programming terminology, multi-tenancy is an
overloaded term. It means different things to different people depending on their
orientation and context. To say a solution is multi-tenant is not helpful unless you are
specific about the meaning. Some interpretations of multi-tenancy in big data
environments are:
• Support for multiple concurrent Hadoop jobs
• Support for multiple lines of business on a shared infrastructure
• Support for multiple application workloads of different types
(Hadoop and non-Hadoop)
• Provisions for security isolation between tenants
• Contract-oriented service level guarantees for tenants
• Support for multiple versions of applications and application frameworks
concurrently
Organizations that are sophisticated in their view of multi-tenancy will need all these
capabilities and more. YARN promises to address some of these requirements and
does so in large measure. But, you will find in future releases of Hadoop that there will
be other approaches that are being addressed to provide other forms of multi-tenancy.
While an important technology, the world is not suffering from a shortage of resource
managers. Some Hadoop providers are supporting YARN, while others are supporting
Apache Mesos.


YARN features: Compatibility


• To the end user (a developer, not an administrator), the changes are
almost invisible
• Possible to run unmodified MapReduce jobs using the same
MapReduce API and CLI
▪ May require a recompile

There is no reason not to migrate from MRv1 to YARN


YARN features: Compatibility


To ease the transition from Hadoop v1 to YARN, a major goal of YARN and the
MapReduce framework implementation on top of YARN was to ensure that existing
MapReduce applications that were programmed and compiled against previous
MapReduce APIs (we'll call these MRv1 applications) can continue to run with little or
no modification on YARN (we'll refer to these as MRv2 applications).
For the vast majority of users who use the org.apache.hadoop.mapred APIs,
MapReduce on YARN ensures full binary compatibility. These existing applications can
run on YARN directly without recompilation. You can use .jar files from your existing
application that code against mapred APIs and use bin/hadoop to submit them directly
to YARN.
Unfortunately, it was difficult to ensure full binary compatibility to the existing
applications that compiled against MRv1 org.apache.hadoop.mapreduce APIs. These
APIs have gone through many changes. For example, several classes stopped being
abstract classes and changed to interfaces. Therefore, the YARN community
compromised by only supporting source compatibility for
org.apache.hadoop.mapreduce APIs. Existing applications that use MapReduce APIs
are source-compatible and can run on YARN either with no changes, with simple
recompilation against MRv2 .jar files that are shipped with Hadoop 2, or with minor
updates.


YARN features: Higher cluster utilization


• Higher cluster utilization, whereby resources not used by one
framework can be consumed by another
• The NodeManager is a more generic and efficient version of the
TaskTracker.
▪ Instead of having a fixed number of map and reduce slots, the
NodeManager has a number of dynamically created resource containers
▪ The size of a container depends upon the amount of resources assigned to
it, such as memory, CPU, disk, and network IO

YARN's dynamic allocation of cluster resources improves utilization over the
more static MapReduce rules used in early versions of Hadoop (v1)

MapReduce and YARN © Copyright IBM Corporation 2019

YARN features: Higher cluster utilization


The NodeManager is a more generic and efficient version of the TaskTracker. Instead
of having a fixed number of map and reduce slots, the NodeManager has several
dynamically created resource containers. The size of a container depends upon the
amount of resources it contains, such as memory, CPU, disk, and network IO.
Currently, only memory and CPU are supported (YARN-3); cgroups might be used to
control disk and network IO in the future.
The number of containers on a node is a product of configuration parameters and the
total amount of node resources (such as total CPUs and total memory) outside the
resources dedicated to the slave daemons and the OS.

© Copyright IBM Corp. 2019 5-45


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

YARN features: Reliability and availability


• High availability for the ResourceManager
▪ Application recovery is performed after a ResourceManager restart
▪ The ResourceManager stores information about running applications and
completed tasks in HDFS
▪ If the ResourceManager is restarted, it recreates the state of applications
and re-runs only incomplete tasks
• Highly available NameNode, making the Hadoop cluster much more
efficient, powerful, and reliable

High Availability is a work in progress and is close to completion - these
features have been actively tested by the community

MapReduce and YARN © Copyright IBM Corporation 2019

YARN features: Reliability and availability
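The recovery behavior described on the slide is driven by ResourceManager
configuration. A minimal yarn-site.xml sketch, assuming an HDFS-backed state store
(the store URI shown is an invented example, not from the labs):

<property>
  <name>yarn.resourcemanager.recovery.enabled</name>
  <value>true</value>  <!-- recreate application state after an RM restart -->
</property>
<property>
  <name>yarn.resourcemanager.store.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.FileSystemRMStateStore</value>
</property>
<property>
  <name>yarn.resourcemanager.fs.state-store.uri</name>
  <value>hdfs:///rmstore</value>  <!-- invented HDFS path for saved state -->
</property>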

© Copyright IBM Corp. 2019 5-46


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

YARN major features summarized


• Multi-tenancy
▪ YARN allows multiple access engines (either open-source or proprietary) to use
Hadoop as the common standard for batch, interactive, and real-time engines
that can simultaneously access the same data sets
▪ Multi-tenant data processing improves an enterprise's return on its Hadoop
investments.
• Cluster utilization
▪ YARN's dynamic allocation of cluster resources improves utilization over more
static MapReduce rules used in early versions of Hadoop
• Scalability
▪ Data center processing power continues to rapidly expand. YARN's
ResourceManager focuses exclusively on scheduling and keeps pace as
clusters expand to thousands of nodes managing petabytes of data.
• Compatibility
▪ Existing MapReduce applications developed for Hadoop 1 can run on YARN
without any disruption to existing processes that already work

MapReduce and YARN © Copyright IBM Corporation 2019

YARN major features summarized

© Copyright IBM Corp. 2019 5-47


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Topic:
The architecture of YARN

MapReduce and YARN © Copyright IBM Corporation 2019

Topic: The architecture of YARN

© Copyright IBM Corp. 2019 5-48


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Hadoop v1 to Hadoop v2

Hadoop 1.0 - Single Use System (usually batch apps):
▪ Pig, Hive, Other ...
▪ MapReduce (cluster resource management & data processing)
▪ HDFS (redundant, reliable storage)

Hadoop 2.0 - Multi Purpose Platform (batch, interactive, online, streaming):
▪ MR2, Pig, Hive, Other ..., plus RT (HBase), Stream, Graph, Services, etc.
(execution / data processing)
▪ YARN (cluster resource management)
▪ HDFS2 (redundant, reliable storage)

MapReduce and YARN © Copyright IBM Corporation 2019

Hadoop v1 to Hadoop v2
The most notable change from Hadoop v1 to Hadoop v2 is the separation of cluster
and resource management from the execution and data processing environment.
This allows for a new variety of application types to run, including MapReduce v2.

© Copyright IBM Corp. 2019 5-49


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Architecture of MRv1
• Classic version of MapReduce (MRv1)

MapReduce and YARN © Copyright IBM Corporation 2019

Architecture of MRv1
The effect is seen most prominently in overall job control.
In MapReduce v1 there is just one JobTracker, responsible for allocating resources,
assigning tasks to data nodes (as TaskTrackers), and ongoing monitoring
("heartbeat") as each job runs: the TaskTrackers constantly report back to the
JobTracker on the status of each running task.

© Copyright IBM Corp. 2019 5-50


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

YARN architecture
• High level architecture of YARN

MapReduce and YARN © Copyright IBM Corporation 2019

YARN architecture
In the YARN architecture, a global ResourceManager runs as a master daemon,
usually on a dedicated machine, that arbitrates the available cluster resources among
various competing applications. The ResourceManager tracks how many live nodes
and resources are available on the cluster and coordinates what applications submitted
by users should get these resources and when.
The ResourceManager is the single process that has this information, so it can make its
allocation (or rather, scheduling) decisions in a shared, secure, and multi-tenant
manner (for instance, according to an application priority, a queue capacity, ACLs, data
locality, etc.).
When a user submits an application, an instance of a lightweight process called the
ApplicationMaster is started to coordinate the execution of all tasks within the
application. This includes monitoring tasks, restarting failed tasks, speculatively running
slow tasks, and calculating total values of application counters. These responsibilities
were previously assigned to the single JobTracker for all jobs. The ApplicationMaster
and tasks that belong to its application run in resource containers controlled by the
NodeManagers.

© Copyright IBM Corp. 2019 5-51


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

As described for the cluster-utilization feature earlier, the NodeManager is a more
generic and efficient version of the TaskTracker: instead of a fixed number of map and
reduce slots, it offers dynamically created resource containers whose size depends on
the resources they contain, such as memory, CPU, disk, and network IO (currently only
memory and CPU are supported; see YARN-3). The number of containers on a node is
a product of configuration parameters and the total amount of node resources (such as
total CPUs and total memory) outside the resources dedicated to the slave daemons
and the OS.
Interestingly, the ApplicationMaster can run any type of task inside a container. For
example, the MapReduce ApplicationMaster requests a container to launch a map or a
reduce task, while the Giraph ApplicationMaster requests a container to run a Giraph
task. You can also implement a custom ApplicationMaster that runs specific tasks and,
in this way, invent a shiny new distributed application framework that changes the big
data world. I encourage you to read about Apache Twill, which aims to make it easy to
write distributed applications sitting on top of YARN.
In YARN, MapReduce is relegated to the role of one distributed application among
many (but still a very popular and useful one) and is now called MRv2. MRv2 is simply
a re-implementation of the classical MapReduce engine, now called MRv1, that runs on
top of YARN.
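To make the ApplicationMaster's role more concrete, the outline below sketches how
an AM asks the ResourceManager for a container through the AMRMClient library. This
is not lab code: the registration arguments and resource sizes are placeholders, and
the allocate/launch loop is only indicated in comments:

import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

// Outline of an ApplicationMaster's container request; error handling,
// the allocation loop, and container launch are omitted for brevity.
public class AmRequestSketch {
    public static void main(String[] args) throws Exception {
        // Connect this ApplicationMaster to the ResourceManager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();
        rmClient.registerApplicationMaster("", 0, "");  // host, port, tracking URL (unused here)

        // Ask for one container of 2 GB memory and 1 vcore; sizes are illustrative.
        Resource capability = Resource.newInstance(2048, 1);
        Priority priority = Priority.newInstance(0);
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // A real AM would now loop on rmClient.allocate(progress), launch its
        // task in each granted container via an NMClient, and finally call
        // rmClient.unregisterApplicationMaster(...) when the work is done.
    }
}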

© Copyright IBM Corp. 2019 5-52


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Terminology changes from MRv1 to YARN

YARN terminology                               Instead of MRv1 terminology

ResourceManager                                Cluster Manager
ApplicationMaster                              JobTracker
(but dedicated and short-lived)
NodeManager                                    TaskTracker
Distributed Application                        One particular MapReduce job
Container                                      Slot

MapReduce and YARN © Copyright IBM Corporation 2019

Terminology changes from MRv1 to YARN

© Copyright IBM Corp. 2019 5-53


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

YARN
• Acronym for Yet Another Resource Negotiator
• New resource manager included in Hadoop 2.x and later
• Decouples Hadoop workload and resource management
• Introduces a general purpose application container
• Hadoop 2.2.0 includes the first GA version of YARN
• Most Hadoop vendors support YARN

MapReduce and YARN © Copyright IBM Corporation 2019

YARN
YARN is a key component in Hortonworks Data Platform.

© Copyright IBM Corp. 2019 5-54


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

YARN high level architecture


• In Hortonworks Data Platform (HDP), users can take advantage of
YARN and applications written to YARN APIs

Existing MapReduce applications

MapReduce v2 (batch) | Tez (interactive) | HBase (online) | Spark (in memory) | Others (varied)

YARN (cluster resource management)

HDFS

MapReduce and YARN © Copyright IBM Corporation 2019

YARN high level architecture

© Copyright IBM Corp. 2019 5-55


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Running an application in YARN (1 of 7)


Diagram: the ResourceManager runs on node132; NodeManagers run on node133,
node134, node135, and node136. No applications are running yet.

MapReduce and YARN © Copyright IBM Corporation 2019

Running an application in YARN

© Copyright IBM Corp. 2019 5-56


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Running an application in YARN (2 of 7)


Diagram: Application 1 ("Analyze lineitem table") is submitted to the ResourceManager
on node132, which launches Application Master 1 in a container on node135.

MapReduce and YARN © Copyright IBM Corporation 2019

© Copyright IBM Corp. 2019 5-57


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Running an application in YARN (3 of 7)


Diagram: Application Master 1 (on node135) sends a resource request to the
ResourceManager on node132, which replies with the IDs of containers granted on the
cluster's NodeManagers.

MapReduce and YARN © Copyright IBM Corporation 2019

© Copyright IBM Corp. 2019 5-58


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Running an application in YARN (4 of 7)


Diagram: Application Master 1 launches App 1 tasks in the granted containers: two on
node134 and one on node136.

MapReduce and YARN © Copyright IBM Corporation 2019

© Copyright IBM Corp. 2019 5-59


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Running an application in YARN (5 of 7)


Diagram: Application 2 ("Analyze customer table") is submitted while Application 1 is
still running; the ResourceManager launches Application Master 2 in a container on
node136.

MapReduce and YARN © Copyright IBM Corporation 2019

© Copyright IBM Corp. 2019 5-60


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Running an application in YARN (6 of 7)


Diagram: Application Master 2 (on node136) negotiates resources with the
ResourceManager while Application 1's tasks continue to run on node134 and
node136.

MapReduce and YARN © Copyright IBM Corporation 2019

© Copyright IBM Corp. 2019 5-61


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Running an application in YARN (7 of 7)


Diagram: App 2 tasks launch in containers on node133 and node135; both applications
now run side by side, with the ResourceManager arbitrating the shared cluster.

MapReduce and YARN © Copyright IBM Corporation 2019

© Copyright IBM Corp. 2019 5-62


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

How YARN runs an application

MapReduce and YARN © Copyright IBM Corporation 2019

How YARN runs an application


To run an application on YARN, a client contacts the ResourceManager and asks it to
run an ApplicationMaster process (step 1). The ResourceManager then finds a
NodeManager that can launch the ApplicationMaster in a container (steps 2a and 2b).
Precisely what the ApplicationMaster does once it is running depends on the
application. It could simply run a computation in the container it is running in and return
the result to the client, or it could request more containers from the ResourceManager
(step 3) and use them to run a distributed computation (steps 4a and 4b).
This is well described in: White, T. (2015) Hadoop: The definitive guide (4th ed.).
Sebastopol, CA: O'Reilly Media, p. 80.
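A client-side sketch of step 1 and steps 2a/2b, using the YarnClient API, appears
below. The application name, AM command line, and resource sizes are invented for
illustration; a real client would also set up local resources and environment variables
for the ApplicationMaster:

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;
import org.apache.hadoop.yarn.util.Records;

public class SubmitAppSketch {
    public static void main(String[] args) throws Exception {
        // Step 1: the client contacts the ResourceManager.
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());
        yarnClient.start();

        YarnClientApplication app = yarnClient.createApplication();
        ApplicationSubmissionContext ctx = app.getApplicationSubmissionContext();
        ctx.setApplicationName("sketch-app");  // invented name

        // Steps 2a/2b: describe the container in which a NodeManager should
        // launch the ApplicationMaster (the command line is a placeholder).
        ContainerLaunchContext amContainer = Records.newRecord(ContainerLaunchContext.class);
        amContainer.setCommands(Collections.singletonList(
            "$JAVA_HOME/bin/java MyApplicationMaster 1>stdout 2>stderr"));
        ctx.setAMContainerSpec(amContainer);
        ctx.setResource(Resource.newInstance(1024, 1));  // 1 GB, 1 vcore for the AM

        // The ResourceManager now finds a NodeManager and launches the AM;
        // the AM itself requests further containers (step 3 onward).
        yarnClient.submitApplication(ctx);
    }
}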

© Copyright IBM Corp. 2019 5-63


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Container Java command line


• Container JVM command (generally behind the scenes)
▪ Launched by "yarn" user with /bin/bash
yarn 1251527 1199943 0 14:38 ? 00:00:00 /bin/bash -c
/hdp/jdk/bin/java -Djava.net …

▪ If you count "java" process IDs (pids) running as the "yarn" user, you will
see twice as many as there are containers, because each container's java
process is forked from a /bin/bash -c wrapper
00:00:00 /bin/bash -c /hdp/jdk/bin/java
00:00:00 /bin/bash -c /hdp/jdk/bin/java
. . .
00:07:48 /hdp/jdk/bin/java -Djava.net.pr
00:08:11 /hdp/jdk/bin/java -Djava.net.pr
. . .

MapReduce and YARN © Copyright IBM Corporation 2019

Container Java command line


The best references for writing MapReduce programs are on the Hadoop website here:
• http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-
mapreduce-client-core/MapReduceTutorial.html
• http://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-
site/WritingYarnApplications.html.

This is for the more experienced user.


yarn 1251527 1199943 0 14:38 ? 00:00:00 /bin/bash -c /hdp/jdk/bin/java -
Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2000m -Xms1000m -
Xmn100m -Xtune:virtualized -
Xshareclasses:name=mrscc_%g,groupAccess,cacheDir=/var/hdp/hadoop/tmp,nonFatal -Xscmx20m
-Xdump:java:file=/var/hdp/hadoop/tmp/javacore.%Y%m%d.%H%M%S.%pid.%seq.txt -
Xdump:heap:file=/var/hdp/hadoop/tmp/heapdump.%Y%m%d.%H%M%S.%pid.%seq.phd -
Djava.io.tmpdir=/data6/yarn/local/nodemanager -
local/usercache/bigsql/appcache/application_1417731580977_0002/container_1417731580977_0
002_01_000095/tmp -Dlog4j.configuration=container-log4j.properties -
Dyarn.app.container.log.dir=/var/hdp/hadoop/yarn/logs/application_1417731580977_0002/con
tainer_1417731580977_0002_01_000095 -Dyarn.app.container.log.filesize=0 -
Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild 9.30.75.55 51923
attempt_1417731580977_0002_m_000073_0 95
1>/var/hdp/hadoop/yarn/logs/application_1417731580977_0002/container_1417731580977_0002_
01_000095/stdout
2>/var/hdp/hadoop/yarn/logs/application_1417731580977_0002/container_1417731580977_0002_
01_000095/stderr

© Copyright IBM Corp. 2019 5-64


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Provisioning, managing, and monitoring


• The Apache Ambari project is aimed at making Hadoop management simpler by
developing software for provisioning, managing, and monitoring Apache Hadoop
clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI
backed by its RESTful APIs.
• Ambari enables System Administrators to:
▪ Provision a Hadoop Cluster
− Ambari provides a step-by-step wizard for installing Hadoop services across any number of
hosts.
− Ambari handles configuration of Hadoop services for the cluster.
▪ Manage a Hadoop Cluster
− Ambari provides central management for starting, stopping, and reconfiguring Hadoop
services across the entire cluster.
▪ Monitor a Hadoop Cluster
− Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
− Ambari leverages Ganglia for metrics collection.
− Ambari leverages Nagios for system alerting and will send emails when your attention is
needed (for example, a node goes down, remaining disk space is low, etc.).
• Ambari enables Application Developers and System Integrators to:
▪ Easily integrate Hadoop provisioning, management, and monitoring capabilities to their own
applications with the Ambari REST APIs
MapReduce and YARN © Copyright IBM Corporation 2019

Provisioning, managing, and monitoring


This is just a reminder of the role of Ambari in Hadoop 2.
With Hadoop 2 and YARN there is a great need for provisioning, management, and
monitoring because of the greater complexity.
In the final unit you will look at Ambari Slider as a mechanism for dynamically changing
requirements at run time for long running jobs.

© Copyright IBM Corp. 2019 5-65


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Spark with Hadoop 2+


• Spark is an alternative in-memory framework to MapReduce
• Supports general workloads as well as streaming, interactive queries,
and machine learning, providing performance gains
• Spark SQL provides APIs that allow SQL queries to be embedded in
Java, Scala or Python programs in Spark
• MLlib: Spark optimized library support machine learning functions
• GraphX: API for graphs and parallel computation
• Spark streaming: Write applications to process streaming data in Java,
Scala or Python

MapReduce and YARN © Copyright IBM Corporation 2019

Spark with Hadoop 2+
Apache Spark is a newer, in-memory processing framework that serves as an
alternative to MapReduce.
Spark is the subject of the next unit.

© Copyright IBM Corp. 2019 5-66


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Checkpoint
1. List the phases in a MR job.
2. What are the limitations of MR v1?
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
4. What are the major features of YARN?

MapReduce and YARN © Copyright IBM Corporation 2019

Checkpoint

© Copyright IBM Corp. 2019 5-67


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Checkpoint solution
1. List the phases in a MR job.
▪ Map, Shuffle, Reduce, Combiner
2. What are the limitations of MR v1?
▪ Centralized handling of job control flow
▪ Tight coupling of a specific programming model with the resource management
infrastructure
▪ Hadoop is now being used for all kinds of tasks beyond its original design
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
▪ ResourceManager
▪ ApplicationMaster
4. What are the major features of YARN?
▪ Multi-tenancy
▪ Cluster utilization
▪ Scalability
▪ Compatibility
MapReduce and YARN © Copyright IBM Corporation 2019

Checkpoint solution

© Copyright IBM Corp. 2019 5-68


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Unit summary
• Describe the MapReduce model v1
• List the limitations of Hadoop 1 and MapReduce 1
• Review the Java code required to handle the Mapper class, the
Reducer class, and the program driver needed to access MapReduce
• Describe the YARN model
• Compare Hadoop 2/YARN with Hadoop 1

MapReduce and YARN © Copyright IBM Corporation 2019

Unit summary

© Copyright IBM Corp. 2019 5-69


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Lab 1
• Running MapReduce and YARN jobs

MapReduce and YARN © Copyright IBM Corporation 2019

Lab 1: Running MapReduce and YARN jobs

© Copyright IBM Corp. 2019 5-70


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Lab 1:
Running MapReduce and YARN jobs
Purpose:
You will run a sample Java program using Hadoop v2, and related
technologies. You will not need to do any Java programming. The Hadoop
community provides several standard example programs (akin to "Hello,
World" for people learning to write C language programs).
In the real world, the sample/example programs provide opportunities to learn
the relevant technology, and on a newly installed Hadoop cluster, to exercise
the system and the operations environment.
The Java program that you will be compiling and running, WordCount2.java,
resides in the Hadoop samples directory /home/labfiles.

Property                 Location                                    Sample Value

Ambari URL               cluster.service_endpoints.ambari_console   https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud:9443
Hostname (after the @)   cluster.service_endpoints.ssh               chs-gbq-108-mn003.us-south.ae.appdomain.cloud
Password                 cluster.password                            24Z5HHf7NUuy
SSH                      cluster.service_endpoints.ssh               ssh clsadmin@chs-gbq-108-mn003.us-south.ae.appdomain.cloud
Username                 cluster.user                                clsadmin


Task 1. Run a simple MapReduce job from a Hadoop sample
program
1. Using PuTTY, connect to management node mn003 of your cluster.
2. List all MapReduce examples on the system:
[clsadmin@chs-gbq-108-mn003 ~]$ hadoop jar /usr/hdp/current/hadoop-mapreduce-
client/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in
the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram
of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits
of Pi.
dbcount: An example job that count the pageview counts from a database.

© Copyright IBM Corp. 2019 5-71


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of
Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi -Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per
node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the
input files.
wordmedian: A map/reduce program that counts the median length of the words in the
input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of
the length of the words in the input files.
3. To run the sample program, run the following command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/Frankenstein.txt wcount
The last parameter (wcount) is the directory name where the program will write
the final output.
The output should be like the following:
[clsadmin@chs-gbq-108-mn003 ~]$ hadoop jar /usr/hdp/current/hadoop-mapreduce-
client/hadoop-mapreduce-examples.jar wordcount Gutenberg/Frankenstein.txt wcount
18/10/29 09:14:37 INFO client.RMProxy: Connecting to ResourceManager at chs-gbq-108-
mn002.us-south.ae.appdomain.cloud/172.16.162.134:8050
18/10/29 09:14:37 INFO client.AHSProxy: Connecting to Application History server at
chs-gbq-108-mn002.us-south.ae.appdomain.cloud/172.16.162.134:10200
18/10/29 09:14:38 INFO input.FileInputFormat: Total input paths to process : 1
18/10/29 09:14:38 INFO mapreduce.JobSubmitter: number of splits:1
18/10/29 09:14:38 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1540427840896_0006
18/10/29 09:14:39 INFO impl.YarnClientImpl: Submitted application
application_1540427840896_0006
18/10/29 09:14:39 INFO mapreduce.Job: The url to track the job: http://chs-gbq-108-
mn002.us-south.ae.appdomain.cloud:8088/proxy/application_1540427840896_0006/
18/10/29 09:14:39 INFO mapreduce.Job: Running job: job_1540427840896_0006
18/10/29 09:14:49 INFO mapreduce.Job: Job job_1540427840896_0006 running in uber mode
: false
18/10/29 09:14:49 INFO mapreduce.Job: map 0% reduce 0%
18/10/29 09:14:58 INFO mapreduce.Job: map 100% reduce 0%
18/10/29 09:15:08 INFO mapreduce.Job: map 100% reduce 100%
18/10/29 09:15:08 INFO mapreduce.Job: Job job_1540427840896_0006 completed
successfully
18/10/29 09:15:08 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=167616
FILE: Number of bytes written=638315
FILE: Number of read operations=0

© Copyright IBM Corp. 2019 5-72


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

FILE: Number of large read operations=0


FILE: Number of write operations=0
HDFS: Number of bytes read=421667
HDFS: Number of bytes written=122090
HDFS: Number of read operations=6
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=53472
Total time spent by all reduces in occupied slots (ms)=129856
Total time spent by all map tasks (ms)=6684
Total time spent by all reduce tasks (ms)=8116
Total vcore-milliseconds taken by all map tasks=6684
Total vcore-milliseconds taken by all reduce tasks=8116
Total megabyte-milliseconds taken by all map tasks=27377664
Total megabyte-milliseconds taken by all reduce tasks=66486272
Map-Reduce Framework
Map input records=7244
Map output records=74952
Map output bytes=717818
Map output materialized bytes=167616
Input split bytes=163
Combine input records=74952
Combine output records=11603
Reduce input groups=11603
Reduce shuffle bytes=167616
Reduce input records=11603
Reduce output records=11603
Spilled Records=23206
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=114
CPU time spent (ms)=5520
Physical memory (bytes) snapshot=2163986432
Virtual memory (bytes) snapshot=14135001088
Total committed heap usage (bytes)=2055208960
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=421504
File Output Format Counters
Bytes Written=122090
There was just one Reduce task:
Job Counters
Launched map tasks=1
Launched reduce tasks=1
… and hence there would be only one output file.
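Note: the number of reduce tasks is configurable. Because the example programs use
Hadoop's ToolRunner, you can pass a generic option to request more reducers (and
therefore more part-r-* output files); for example (the value 2 is just an illustration):
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount -D mapreduce.job.reduces=2 Gutenberg/Frankenstein.txt wcount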
4. To find what file(s) were produced, run the following command:

© Copyright IBM Corp. 2019 5-73


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

hdfs dfs -ls wcount


The output should be like the following:
[clsadmin@chs-gbq-108-mn003 ~]$ hdfs dfs -ls wcount
Found 2 items
-rw-r--r-- 3 clsadmin biusers 0 2018-10-29 05:40 wcount/_SUCCESS
-rw-r--r-- 3 clsadmin biusers 122090 2018-10-29 05:40 wcount/part-r-00000
5. To review the part-r-00000 file, run the following command:
hadoop fs -cat wcount/part-r-00000 | more
Scroll through the file by pressing Enter and look at pages of output.
Here is a sample from several pages down:
"HE 1
"Have 3
"Having 1
"He 3
"Here 1
"Here, 1
"How 5
"I 71
"If 3
6. Type q to exit from more.
7. Remove the directory where your output file was stored (wcount), by running:
hdfs dfs -rm -R wcount
It is necessary to remove this directory and all files in it to use the command
again without changes. Alternatively, you could run the command again, but
with a different output directory such as wcount2.
Task 2. Check the MapReduce job’s history using Ambari
You can check history of MapReduce jobs and see their logs.
1. Browse to Ambari URL using a Web browser.
Example: https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud:9443.
2. From the services menu on the left, click MapReduce2.
3. In the middle of the page, click Quick Links, then click JobHistory UI.

© Copyright IBM Corp. 2019 5-74


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

4. Click the job ID to open its history.

You can see the status and logs of the job.

© Copyright IBM Corp. 2019 5-75


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Task 3. Run a simple MapReduce job using YARN


1. The same job can be run (and monitored) as a YARN job. To do this, run:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/Frankenstein.txt wcount2
2. Clean up the output directories:
hdfs dfs -rm -R wcount*
3. Re-run the job against all four files in the Gutenberg directory, by running:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/* wcount2
The output should look like the following:
[clsadmin@chs-gbq-108-mn003 ~]$ yarn jar /usr/hdp/current/hadoop-mapreduce-
client/hadoop-mapreduce-examples.jar wordcount Gutenberg/*.txt wcount2
18/10/29 11:21:15 INFO client.RMProxy: Connecting to ResourceManager at chs-gbq-108-
mn002.us-south.ae.appdomain.cloud/172.16.162.134:8050
18/10/29 11:21:15 INFO client.AHSProxy: Connecting to Application History server at
chs-gbq-108-mn002.us-south.ae.appdomain.cloud/172.16.162.134:10200
18/10/29 11:21:16 INFO input.FileInputFormat: Total input paths to process : 4
18/10/29 11:21:16 INFO mapreduce.JobSubmitter: number of splits:4
18/10/29 11:21:16 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1540811761898_0002
18/10/29 11:21:17 INFO impl.YarnClientImpl: Submitted application
application_1540811761898_0002
18/10/29 11:21:17 INFO mapreduce.Job: The url to track the job: http://chs-gbq-108-
mn002.us-south.ae.appdomain.cloud:8088/proxy/application_1540811761898_0002/
18/10/29 11:21:17 INFO mapreduce.Job: Running job: job_1540811761898_0002
18/10/29 11:21:27 INFO mapreduce.Job: Job job_1540811761898_0002 running in uber mode
: false
18/10/29 11:21:27 INFO mapreduce.Job: map 0% reduce 0%
18/10/29 11:21:39 INFO mapreduce.Job: map 25% reduce 0%
18/10/29 11:21:46 INFO mapreduce.Job: map 100% reduce 0%

© Copyright IBM Corp. 2019 5-76


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

18/10/29 11:21:49 INFO mapreduce.Job: map 100% reduce 100%


18/10/29 11:21:50 INFO mapreduce.Job: Job job_1540811761898_0002 completed
successfully
18/10/29 11:21:50 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=752057
FILE: Number of bytes written=2261982
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=2158590
HDFS: Number of bytes written=380289
HDFS: Number of read operations=15
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=4
Launched reduce tasks=1
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=485608
Total time spent by all reduces in occupied slots (ms)=110464
Total time spent by all map tasks (ms)=60701
Total time spent by all reduce tasks (ms)=6904
Total vcore-milliseconds taken by all map tasks=60701
Total vcore-milliseconds taken by all reduce tasks=6904
Total megabyte-milliseconds taken by all map tasks=248631296
Total megabyte-milliseconds taken by all reduce tasks=56557568
Map-Reduce Framework
Map input records=40805
Map output records=381974
Map output bytes=3661058
Map output materialized bytes=752075
Input split bytes=663
Combine input records=381974
Combine output records=51662
Reduce input groups=34444
Reduce shuffle bytes=752075
Reduce input records=51662
Reduce output records=34444
Spilled Records=103324
Shuffled Maps =4
Failed Shuffles=0
Merged Map outputs=4
GC time elapsed (ms)=409
CPU time spent (ms)=31550
Physical memory (bytes) snapshot=7982600192
Virtual memory (bytes) snapshot=29942259712
Total committed heap usage (bytes)=7870611456
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=2157927
File Output Format Counters
Bytes Written=380289

© Copyright IBM Corp. 2019 5-77


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Question 1: How many Mappers were used for your run?


Question 2: How many Reducers were used for your run?
You can answer these questions by going to the YARN service in Ambari, and
opening ResourceManager UI.

4. Clean up the output directories:


hdfs dfs -rm -R wcount*

Results:
You ran Java programs using Hadoop v2, YARN, and related technologies.
You tested an existing sample program from available Hadoop samples,
WordCount. You also checked the status and history of jobs using Ambari.

© Copyright IBM Corp. 2019 5-78


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Lab 2
• Creating and coding a simple MapReduce job

MapReduce and YARN © Copyright IBM Corporation 2019

Lab 2: Creating and coding a simple MapReduce job

© Copyright IBM Corp. 2019 5-79


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

Lab 2:
Creating and coding a simple MapReduce job
Purpose:
You will compile and run a more complete version of WordCount that has
been written specifically for MapReduce2.

Property                 Location                                    Sample Value

Ambari URL               cluster.service_endpoints.ambari_console   https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud:9443
Hostname (after the @)   cluster.service_endpoints.ssh               chs-gbq-108-mn003.us-south.ae.appdomain.cloud
Password                 cluster.password                            24Z5HHf7NUuy
SSH                      cluster.service_endpoints.ssh               ssh clsadmin@chs-gbq-108-mn003.us-south.ae.appdomain.cloud
Username                 cluster.user                                clsadmin


Task 1. Compile and run a more complete version of
WordCount that has been written specifically for
MapReduce2
The program that you will use (WordCount2.java) has been taken from:
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-
mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0.
There is a minor change in the Java code. This has been updated in the version
under folder LabFiles.
This version 2 of WordCount.java is more sophisticated than the one that you
have already run, since it allows for splitting on text in addition to and other than
whitespace and allows you to specify anything that you want to skip when
counting words (such as "to", "the", etc.).
There are some limitations; if you are a competent Java programmer, you may
want to experiment later with other features. For instance, are all words
lowercased when they are tokenized?
Since you are now more familiar with the process of running MR/YARN jobs,
the directions provided here will concentrate on the compilation only.
1. Open Command Prompt.

© Copyright IBM Corp. 2019 5-80


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

2. Set the Analytics Engine server endpoint to Hostname (press Enter to accept
default values when you are asked about Ambari and Knox port numbers):
C:\>bx ae endpoint Hostname
Example:
C:\>bx ae endpoint https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud:9443
Registering endpoint 'https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud'...
Ambari Port Number [Optional: Press enter for default value] (9443)>
Knox Port Number [Optional: Press enter for default value] (8443)>
OK
Endpoint 'https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud' set.
3. Change directory to the LabFiles directory on your local machine. Example:
cd C:\LabFiles
4. Upload sample files to HDFS:
bx ae file-system --user Username --password Password put patternsToSkip
patternsToSkip
bx ae file-system --user Username --password Password put run-wc run-wc
bx ae file-system --user Username --password Password put WordCount2.java
WordCount2.java
(substitute the Username and Password values from the table above)
Example:
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put patternsToSkip
patternsToSkip
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put run-wc run-wc
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put WordCount2.java
WordCount2.java
5. Using PuTTY, connect to management node mn003 of your cluster.
6. Copy the .java file from HDFS to local file system:
hdfs dfs -copyToLocal WordCount2.java ~/
7. Run the following command to get the CLASSPATH environmental variable that
you need for compilation:
hadoop classpath
The output looks like the following:
[clsadmin@chs-gbq-108-mn003 ~]$ hadoop classpath
/usr/hdp/2.6.5.0-292/hadoop/conf:/usr/hdp/2.6.5.0-292/hadoop/lib/*:/usr/hdp/2.6.5.0-
292/hadoop/.//*:/usr/hdp/2.6.5.0-292/hadoop-hdfs/./:/usr/hdp/2.6.5.0-292/hadoop-
hdfs/lib/*:/usr/hdp/2.6.5.0-292/hadoop-hdfs/.//*:/usr/hdp/2.6.5.0-292/hadoop-
yarn/lib/*:/usr/hdp/2.6.5.0-292/hadoop-yarn/.//*:/usr/hdp/2.6.5.0-292/hadoop-
mapreduce/lib/*:/usr/hdp/2.6.5.0-292/hadoop-mapreduce/.//*::mysql-connector-
java.jar:/home/common/lib/dataconnectorStocator/*:/usr/hdp/2.6.5.0 -
292/tez/*:/usr/hdp/2.6.5.0-292/tez/lib/*:/usr/hdp/2.6.5.0-292/tez/conf
8. Compile WordCount2.java with the Hadoop2 API and this CLASSPATH:
javac -cp `hadoop classpath` WordCount2.java
Note the use of back-quotes to substitute the output of the hadoop classpath
command as the classpath (-cp) needed for the compilation.
9. Create a Jar file that can be run in the Hadoop2/YARN environment:

© Copyright IBM Corp. 2019 5-81


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
Unit 5 MapRed uc e and Y A RN

jar cf WC2.jar *.class


10. To run your Jar file successfully, you will need to remove all the class files from
your Linux directory:
rm *.class
11. You are now ready to run your compiled program with appropriate parameters
(see the program logic and the use of these additional parameters on the
website listed above). Type the following command on one line:
hadoop jar WC2.jar WordCount2 -D wordcount.case.sensitive=false ./Gutenberg/*.txt
./wc2out -skip ./patternsToSkip
This can also be run with "yarn" and the MR job can be monitored as previously.
Question 1: How many Mappers were used for your run?
Question 2: How many Reducers were used for your run?
12. Look at your output (in wc2out) and see if it differs from what you generated
previously.
hdfs dfs -cat wc2out/part-r-00000 | more
13. Quit the more command by pressing Q.
Results:
You compiled and ran a more complete version of WordCount that has been
written specifically for MapReduce2.

© Copyright IBM Corp. 2019 5-82


Course materials may not be reproduced in w hole or in part w ithout the prior w ritten permission of IBM.
