05 - MapReduce and YARN
Unit objectives
• Describe the MapReduce model v1
• List the limitations of Hadoop 1 and MapReduce 1
• Review the Java code required to handle the Mapper class, the
Reducer class, and the program driver needed to access MapReduce
• Describe the YARN model
• Compare Hadoop 2/YARN with Hadoop 1
Unit objectives
Topic: Introduction to MapReduce processing based on MR1
[Diagram: a logical file is divided into blocks (1, 2, 3, 4); the blocks are distributed and replicated across the nodes of the cluster.]
MapReduce v1 explained
• Hadoop computational model
▪ Data stored in a distributed file system spanning many inexpensive
computers
▪ Bring function to the data
▪ Distribute application to the compute resources where the data is stored
• Scalable to thousands of nodes and petabytes of data
[Diagram: a MapReduce application submits work to the Hadoop DataNodes - 1. Map phase (break the job into small parts), 2. Shuffle, 3. Reduce phase (boil all output down to a single result set).]

public static class TokenizerMapper
    extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();
  public void map(Object key, Text val, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(val.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();
  ...
}
MapReduce v1 explained
There are two aspects of Hadoop that are important to understand:
• MapReduce is a software framework introduced by Google to support
distributed computing on large data sets across clusters of computers.
• The Hadoop Distributed File System (HDFS) is where Hadoop stores its data.
This file system spans all the nodes in a cluster. Effectively, HDFS links together
the data that resides on many local nodes, making the data part of one big file
system. Furthermore, HDFS assumes nodes will fail, so it replicates a given
chunk of data across multiple nodes to achieve reliability. The degree of
replication can be customized by the Hadoop administrator or programmer.
However, the default is to replicate every chunk of data across three nodes: two on the
same rack and one on a different rack.
The key to understanding Hadoop lies in the MapReduce programming model. This is
essentially a representation of the divide and conquer processing model, where your
input is split into many small pieces (the map step), and the Hadoop nodes process
these pieces in parallel. Once these pieces are processed, the results are distilled (in
the reduce step) down to a single answer.
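To make that flow concrete, the following is a minimal driver sketch for the WordCount job used throughout this unit. It assumes the TokenizerMapper and IntSumReducer classes shown on the earlier slide; the class name and argument handling are illustrative, not the exact sample shipped with Hadoop.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");      // job name reported to the framework
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);          // map phase: break the job into small parts
    job.setReducerClass(IntSumReducer.class);           // reduce phase: boil output down to one result set
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // input file(s) in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}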
MapReduce v1 engine
• Master/Slave architecture
▪ Single master (JobTracker) controls job execution on multiple slaves
(TaskTrackers).
• JobTracker
▪ Accepts MapReduce jobs submitted by clients
▪ Pushes map and reduce tasks out to TaskTracker nodes
▪ Keeps the work as physically close to data as possible
▪ Monitors tasks and TaskTracker status
• TaskTracker
▪ Runs map and reduce tasks
▪ Reports status to JobTracker
▪ Manages storage and transmission of intermediate output
[Diagram: MRv1 cluster - the JobTracker runs on the master node (Computer/Node 1) and coordinates TaskTrackers on the worker nodes.]
MapReduce v1 engine
If one TaskTracker is very slow, it can delay the entire MapReduce job, especially
towards the end of a job, where everything can end up waiting for the slowest task. With
speculative-execution enabled, however, a single task can be executed on multiple
slave nodes.
For job scheduling, Hadoop uses FIFO (first in, first out) by default, with five optional
scheduling priorities to schedule jobs from a work queue. Other scheduling algorithms
are available as add-ins: the Fair Scheduler, the Capacity Scheduler, and so on.
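Speculative execution is controlled per job through configuration properties. The following is a minimal sketch, assuming the Hadoop 2 property names mapreduce.map.speculative and mapreduce.reduce.speculative; the class name and chosen values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfigSketch {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // Re-run unusually slow map tasks on other nodes so one laggard cannot stall the job.
    conf.setBoolean("mapreduce.map.speculative", true);
    // Speculation can also be turned off, for example for reduce tasks that write to external systems.
    conf.setBoolean("mapreduce.reduce.speculative", false);
    return Job.getInstance(conf, "speculative-execution demo");
  }
}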
MapReduce 1 overview
Results can be written to HDFS or a database.
MapReduce 1 overview
This slide provides an overview of the MR1 process.
File blocks (stored on different DataNodes) in HDFS are read and processed by the
Mappers.
The output of the Mapper processes is shuffled (sent) to the Reducers (one output file
from each Mapper to each Reducer); these intermediate files are not replicated and are
stored locally on the Mapper node.
The Reducers produce the output, and that output is stored in HDFS, with one file for
each Reducer.
[Diagram: MR1 data flow - the logical input file is split into blocks; each block is read by a map task, whose output is sorted locally; during the Shuffle the sorted map output is copied to the reduce tasks and merged; each reduce task writes one logical output file to the DFS.]
[Diagram: the same data flow, highlighting the Reduce phase.]
[Diagram: the same data flow with a Combiner - each map task pre-aggregates its output ("map & combine") before the Shuffle, so less data is copied to the reduce tasks.]
WordCount example
• In the example of a list of animal names
▪ MapReduce can automatically split files on line breaks
▪ The file has been split into two blocks on two nodes
• To count how often each big cat is mentioned, in SQL you would use a GROUP BY query (shown on the slide)
Node 1: Tiger, Lion, Lion, Panther, Wolf, …
Node 2: Tiger, Tiger, Wolf, Panther, …
WordCount example
In a file with two blocks (or "splits") of data, animal names are listed. There is one
animal name per line in the files.
Rather than count the number of mentions of each animal, you are interested only in
members of the cat family.
Since the blocks are held on different nodes, software running on the individual nodes
processes the blocks separately.
If you were using SQL (which MapReduce does not), the query would be the one stated on the slide.
Map task
• There are two requirements for the Map task:
▪ Filter out the non-big-cat rows
▪ Prepare the count by transforming each row to <Text(name), Integer(1)>
Node 1 (input → Map output): Tiger → <Tiger, 1>; Lion → <Lion, 1>; Lion → <Lion, 1>; Panther → <Panther, 1>; Wolf → (filtered out); …
Node 2 (input → Map output): Tiger → <Tiger, 1>; Tiger → <Tiger, 1>; Wolf → (filtered out); Panther → <Panther, 1>; …
The Map tasks are executed locally on each split.
Map task
Reviewing the description of MapReduce from Wikipedia
(https://en.wikipedia.org/wiki/MapReduce):
MapReduce is a framework for processing huge datasets on certain kinds of
distributable problems using a large number of computers (nodes), collectively
referred to as a cluster (if all nodes use the same hardware) or as a grid (if the
nodes use different hardware). Computational processing can occur on data stored
either in a file system (unstructured) or within a database (structured).
"Map" step: The master node takes the input, breaks it up into smaller sub-problems,
and distributes those to worker nodes. A worker node may do this again in turn, leading
to a multi-level tree structure. The worker node processes that smaller problem and
passes the answer back to its master node.
"Reduce" step: The master node then takes the answers to all the sub-problems and
combines them in some way to get the output, the answer to the problem it was
originally trying to solve.
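A minimal sketch of the Map task described above: it filters out the non-big-cat rows and emits a <name, 1> pair for each remaining line. The class name BigCatMapper and the set of animal names are hypothetical, chosen only to match the example.

import java.io.IOException;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper for the big-cat example: keep only cat-family rows and emit <name, 1>.
public class BigCatMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final Set<String> BIG_CATS =
      new HashSet<>(Arrays.asList("Tiger", "Lion", "Panther"));
  private static final IntWritable ONE = new IntWritable(1);
  private final Text name = new Text();

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    String animal = line.toString().trim();
    if (BIG_CATS.contains(animal)) {        // filter: drop Wolf and other non-cats
      name.set(animal);
      context.write(name, ONE);             // transform to <Text(name), IntWritable(1)>
    }
  }
}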
Shuffle
• Shuffle moves all values of one key to the same target node
• Distributed by a Partitioner Class (normally hash distribution)
• Reduce Tasks can run on any node - here on Nodes 1 and 3
▪ The number of Map tasks and the number of Reduce tasks do not need to be identical
▪ Differences are handled by the hash partitioner
Node 1 (Map output): <Tiger, 1>, <Lion, 1>, <Lion, 1>, <Panther, 1>, …
Node 2 (Map output): <Tiger, 1>, <Tiger, 1>, <Panther, 1>, …
Node 1 (Reduce input): <Panther, <1,1>>, <Tiger, <1,1,1>>, …
Node 3 (Reduce input): <Lion, <1,1>>, …
Shuffle distributes keys using a hash partitioner. Results are stored in HDFS blocks on the machines that run the Reduce jobs.
Shuffle
Shuffle distributes the key-value pairs to the nodes where the Reducer tasks will run.
Each Mapper task produces one file for each Reducer task. A hash function running on
the Mapper node determines which Reducer task will receive any key-value pair. All
key-value pairs with the same key will be sent to the same Reducer task.
Reduce tasks can run on any node, either different from the set of nodes where the
Map tasks run or on the same DataNodes. In the slide example, Node 1 is used for one
Reduce task, but a new node, Node 3, is used for a second Reduce task.
There is no relation between the number of Map tasks (generally one task for each
block of the file(s) being read) and the number of Reduce tasks. Commonly the number
of Reduce tasks is smaller than the number of Map tasks.
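The routing decision itself is simple: the default hash partitioner maps each key to one of the reduce tasks. A minimal sketch of that idea, modelled on Hadoop's HashPartitioner (the class name here is illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sketch of how a hash partitioner routes every occurrence of a key to the same reduce task.
public class AnimalHashPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    // Mask off the sign bit so the result is non-negative, then take the modulo.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}

Because the partition depends only on the key, every <Tiger, 1> pair lands on the same Reducer regardless of how many Map and Reduce tasks there are.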
Reduce
• The reduce task computes aggregated values for each key
▪ Normally the output is written to the DFS
▪ Default is one output part-file per Reduce task
▪ Reduce tasks aggregate all values of a specific key - in our case, the count of
the particular animal type
Reducer tasks running on DataNodes → output files stored in HDFS
Node 1: <Panther, <1,1>>, <Tiger, <1,1,1>>, … → <Panther, 2>, <Tiger, 3>, …
Node 3: <Lion, <1,1>>, … → <Lion, 2>, …
Reduce
Note that these two Reducer tasks are running on Nodes 1 and 3.
The Reduce node then takes the answers to all the sub-problems and combines them
in some way to get the output - the answer to the problem it was originally trying to
solve.
In this case, the Reduce step shown on this slide does the following processing:
• Each Reduce node receives the data sent to it from the various Map nodes
• This data has been previously sorted (and possibly partially merged)
• The Reduce node aggregates the data; in the case of WordCount, it sums the
counts received for each word (each animal in this case)
• One file is produced for each Reduce task, and that file is written to HDFS, where
its blocks will automatically be replicated
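A minimal sketch of the Reduce task for this example: it sums the 1s received for each animal name. The class name AnimalCountReducer is hypothetical, chosen to match the big-cat example.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Hypothetical reducer: aggregate all values of a key, e.g. <Tiger, <1,1,1>> becomes <Tiger, 3>.
public class AnimalCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text animal, Iterable<IntWritable> counts, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable count : counts) {
      sum += count.get();
    }
    result.set(sum);
    context.write(animal, result);   // one line per key in the part-file written to HDFS
  }
}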
Optional: Combiner
• For performance, an aggregation step in the Map task can be helpful
• Reduces the amount of data sent over the network
▪ Also reduces the merge effort, since data is pre-merged in the Map task
▪ Done in the Map task, before the Shuffle
Map tasks (with Combiner) running on each of two DataNodes → Reduce tasks
Node 1 (input → combined Map output): Tiger, Lion, Panther, Wolf, … → <Lion, 1>, <Panther, 1>, <Tiger, 1>, …
Node 2 (input → combined Map output): Tiger, Tiger, Wolf, Panther, … → <Tiger, 2>, <Panther, 1>, …
Node 1 (Reduce input): <Panther, <1,1>>, <Tiger, <1,2>>, …
Node 3 (Reduce input): <Lion, 1>, …
Optional: Combiner
The Combiner phase is optional. When it is used, it runs on the Mapper node and
preprocesses the data that will be sent to Reduce tasks by pre-merging and pre-
aggregating the data in the files that will be transmitted to the Reduce tasks.
The Combiner thus reduces the amount of data that will be sent to the Reducer tasks,
and that speeds up the processing as smaller files need to be transmitted to the
Reducer task nodes.
Note that this is the standard WordCount program; it has some shortcomings that you
will see in its output, but at this stage the interest is not in the actual results, only in the
process.
The WordCount program is to Hadoop what the "Hello, world!" program is to the
C language: it is generally the first program that people write when coming to the
new technology.
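In the Java API the combiner is just another Reducer class registered on the job. A minimal sketch, assuming the TokenizerMapper and IntSumReducer classes from the earlier code fragment (a combiner must be safe to run zero, one, or many times, which a sum is); the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CombinerSetupSketch {
  public static Job configure() throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count with combiner");
    job.setMapperClass(TokenizerMapper.class);
    // The combiner pre-aggregates map output locally, before the shuffle...
    job.setCombinerClass(IntSumReducer.class);
    // ...and the reducer performs the final aggregation after the shuffle.
    job.setReducerClass(IntSumReducer.class);
    return job;
  }
}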
[Diagram: to run a task, the TaskTracker retrieves the job resources from the distributed file system (e.g., HDFS) (step 8) and launches a child JVM (step 9).]
Classes
• There are three main Java classes provided in Hadoop to read data
in MapReduce:
▪ InputSplitter: divides a file into splits
− Splits are normally the block size, but this depends on the number of requested Map
tasks, whether any compression allows splitting, and so on
▪ RecordReader: takes a split and reads the file into records
− For example, one record per line (LineRecordReader)
− But note that a record can be split across splits
▪ InputFormat: takes each record and transforms it into a
<key, value> pair that is then passed to the Map task
• Lots of additional helper classes may be required to handle
compression, e.g. LZO compression
Classes
The InputSplitter, RecordReader, and InputFormat classes are provided inside the
Hadoop code.
Other helper classes are needed to support Java MapReduce programs. Some of
these are provided inside the Hadoop code itself, but distribution vendors and end
programmers can provide other classes that either override or supplement the standard
code. Thus some vendors provide the LZO compression algorithm to supplement the
standard compression codecs (such as the codec for bzip2).
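As an illustration of how a codec plugs in, a job can compress its final output through the standard FileOutputFormat hooks. A minimal sketch using the built-in GzipCodec (a vendor-supplied LZO codec would be registered the same way); the class name is illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CompressionSetupSketch {
  public static Job configure() throws Exception {
    Job job = Job.getInstance(new Configuration(), "compressed output demo");
    // Compress the reducer output files with the chosen codec class.
    FileOutputFormat.setCompressOutput(job, true);
    FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);
    return job;
  }
}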
Splits
• Files in MapReduce are stored in blocks (128 MB by default)
• MapReduce divides data into fragments or splits
▪ One map task is executed on each split
• Most files have records with defined split points
▪ Most common is the end-of-line character
• The InputSplitter class is responsible for taking an HDFS file and
transforming it into splits
▪ The aim is to process as much data as possible locally
Splits
RecordReader
• Most of the time a record will not end exactly at a block (split) boundary
• Files are read into records by the RecordReader class
▪ Normally the RecordReader will start and stop at the split points
• LineRecordReader will read past the end of the split until the line end
▪ HDFS will send the missing piece of the last record over the network
• Likewise, the LineRecordReader for Block 2 will disregard the first,
incomplete line
Node 1 (Block 1): Tiger\n Tiger\n Lion\n Pan
Node 2 (Block 2): ther\n Tiger\n Wolf\n Lion
In this example, RecordReader 1 will not stop at "Pan" but will read on until the end of
the line. Likewise, RecordReader 2 will ignore the first, incomplete line.
RecordReader
InputFormat
• MapReduce tasks read files by defining an InputFormat class
▪ Map tasks expect <key, value> pairs
• To read line-delimited text files, Hadoop provides the TextInputFormat
class
▪ It returns one <key, value> pair per line in the text
▪ The value is the content of the line
▪ The key is the byte offset of the beginning of the line within the file
Node 1 (line → <key, value>): Tiger → <0, Tiger>; Lion → <6, Lion>; Lion → <11, Lion>; Panther → <16, Panther>; Wolf → <24, Wolf>; …
InputFormat
InputFormat describes the input-specification for a Map-Reduce job.
The Map-Reduce framework relies on the InputFormat of the job to:
• Validate the input-specification of the job.
• Split-up the input file(s) into logical InputSplits, each of which is then assigned to
an individual Mapper.
• Provide the RecordReader implementation to be used to glean input records
from the logical InputSplit for processing by the Mapper.
The default behavior of file-based InputFormats, typically sub-classes of
FileInputFormat, is to split the input into logical InputSplits based on the total size, in
bytes, of the input files. However, the FileSystem blocksize of the input files is treated
as an upper bound for input splits. A lower bound on the split size can be set via
mapreduce.input.fileinputformat.split.minsize.
Clearly, logical splits based on input size are insufficient for many applications, since
record boundaries must be respected. In such cases, the application must also
implement a RecordReader, which has the responsibility to respect record boundaries
and present a record-oriented view of the logical InputSplit to the individual task.
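A minimal driver sketch tying these classes together: it selects TextInputFormat explicitly and raises the lower bound on split size through FileInputFormat, which has the same effect as setting mapreduce.input.fileinputformat.split.minsize. The class name, input path, and split size are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatSetupSketch {
  public static Job configure() throws Exception {
    Job job = Job.getInstance(new Configuration(), "input-format demo");
    // TextInputFormat uses LineRecordReader: key = byte offset of the line, value = line contents.
    job.setInputFormatClass(TextInputFormat.class);
    // Lower bound on split size (64 MB here, purely illustrative).
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.addInputPath(job, new Path("/user/clsadmin/animals"));  // hypothetical path
    return job;
  }
}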
Fault tolerance
[Diagram: failure points in MRv1 - (1) a task running in a child JVM (a MapTask or ReduceTask), (2) the TaskTracker node, and (3) the JobTracker node; TaskTrackers report to the JobTracker through heartbeats.]
Fault tolerance
What happens when something goes wrong?
Failures can happen at the task level (1), TaskTracker level (2), or JobTracker level (3).
The primary way that Hadoop achieves fault tolerance is through restarting tasks.
Individual task nodes (TaskTrackers) are in constant communication with the head
node of the system, called the JobTracker. If a TaskTracker fails to communicate with
the JobTracker for a period of time (by default, 1 minute), the JobTracker will assume
that the TaskTracker in question has crashed. The JobTracker knows which map and
reduce tasks were assigned to each TaskTracker.
If the job is still in the mapping phase, then other TaskTrackers will be asked to re-
execute all map tasks previously run by the failed TaskTracker. If the job is in the
reducing phase, then other TaskTrackers will re-execute all reduce tasks that were in
progress on the failed TaskTracker.
Reduce tasks, once completed, have written their output back to HDFS. Thus, if a
TaskTracker has already completed two out of three reduce tasks assigned to it, only
the third task must be executed elsewhere. Map tasks are slightly more complicated:
even if a node has completed ten map tasks, the reducers may not have all copied their
inputs from the output of those map tasks. If a node has crashed, then its mapper
outputs are inaccessible. So any already-completed map tasks must be re-executed to
make their results available to the rest of the reducing machines. All of this is handled
automatically by the Hadoop platform.
This fault tolerance underscores the need for program execution to be side-effect free.
If Mappers and Reducers had individual identities and communicated with one another
or the outside world, then restarting a task would require the other nodes to
communicate with the new instances of the map and reduce tasks, and the re-executed
tasks would need to reestablish their intermediate state. This process is notoriously
complicated and error-prone in the general case. MapReduce simplifies this problem
drastically by eliminating task identities or the ability for task partitions to communicate
with one another. An individual task sees only its own direct inputs and knows only its
own outputs, to make this failure and restart process clean and dependable.
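Task restarts are bounded by per-job attempt limits. The following is a minimal sketch, assuming the Hadoop 2 property names mapreduce.map.maxattempts and mapreduce.reduce.maxattempts (commonly 4 by default); the class name and values are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class RetryConfigSketch {
  public static Job configure() throws Exception {
    Configuration conf = new Configuration();
    // How many times a failed map or reduce task is retried before the whole job is failed.
    conf.setInt("mapreduce.map.maxattempts", 4);
    conf.setInt("mapreduce.reduce.maxattempts", 4);
    return Job.getInstance(conf, "fault-tolerance demo");
  }
}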
Topic: Issues with and limitations of Hadoop v1 and MapReduce v1
[Diagram: Hadoop v1 stack - MapReduce (batch parallel computing) on top of HDFS (distributed storage); Hadoop v2 stack - YARN (resource management) on top of HDFS (distributed storage).]
YARN features
• Scalability
• Multi-tenancy
• Compatibility
• Serviceability
• Higher cluster utilization
• Reliability/Availability
YARN features
Topic: The architecture of YARN
Hadoop v1 to Hadoop v2
Hadoop v1 to Hadoop v2
The most notable change from Hadoop v1 to Hadoop v2 is the separation of cluster
and resource management from the execution and data processing environment.
This allows for a new variety of application types to run, including MapReduce v2.
Architecture of MRv1
• Classic version of MapReduce (MRv1)
Architecture of MRv1
The effect is most prominently seen in the overall job control.
In MapReduce v1 there is just one JobTracker that is responsible for allocation of
resources, task assignment to data nodes (as TaskTrackers), and ongoing monitoring
("heartbeat") as each job is run (the TaskTrackers constantly report back to the
JobTracker on the status of each running task).
YARN architecture
• High level architecture of YARN
YARN architecture
In the YARN architecture, a global ResourceManager runs as a master daemon,
usually on a dedicated machine, that arbitrates the available cluster resources among
various competing applications. The ResourceManager tracks how many live nodes
and resources are available on the cluster and coordinates what applications submitted
by users should get these resources and when.
The ResourceManager is the single process that has this information, so it can make its
allocation (or rather, scheduling) decisions in a shared, secure, and multi-tenant
manner (for instance, according to an application priority, a queue capacity, ACLs, data
locality, etc.).
When a user submits an application, an instance of a lightweight process called the
ApplicationMaster is started to coordinate the execution of all tasks within the
application. This includes monitoring tasks, restarting failed tasks, speculatively running
slow tasks, and calculating total values of application counters. These responsibilities
were previously assigned to the single JobTracker for all jobs. The ApplicationMaster
and tasks that belong to its application run in resource containers controlled by the
NodeManagers.
The NodeManager is a more generic and efficient version of the TaskTracker. Instead
of having a fixed number of map and reduce slots, the NodeManager has several
dynamically created resource containers. The size of a container depends upon the
amount of resources it contains, such as memory, CPU, disk, and network IO.
Currently, only memory and CPU (YARN-3) are supported. cgroups might be used to
control disk and network IO in the future. The number of containers on a node is a
product of configuration parameters and the total amount of node resources (such as
total CPUs and total memory) outside the resources dedicated to the slave daemons
and the OS.
Interestingly, the ApplicationMaster can run any type of task inside a container. For
example, the MapReduce ApplicationMaster requests a container to launch a map or a
reduce task, while the Giraph ApplicationMaster requests a container to run a Giraph
task. You can also implement a custom ApplicationMaster that runs specific tasks and,
in this way, invent a shiny new distributed application framework that changes the big
data world. I encourage you to read about Apache Twill, which aims to make it easy to
write distributed applications sitting on top of YARN.
In YARN, MapReduce is simply degraded to a role of a distributed application (but still a
very popular and useful one) and is now called MRv2. MRv2 is simply the
re-implementation of the classical MapReduce engine, now called MRv1, that runs on
top of YARN.
YARN component → MRv1 equivalent:
ApplicationMaster → JobTracker (but dedicated and short-lived)
NodeManager → TaskTracker
Container → Slot
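To show this division of labour in code, here is a heavily simplified sketch of how a client asks the ResourceManager to launch an ApplicationMaster, using the YarnClient API. The application name, launch command, and resource sizes are illustrative; a real application (including MRv2) would also supply local resources, environment settings, and security tokens.

import java.util.Collections;
import org.apache.hadoop.yarn.api.records.ApplicationSubmissionContext;
import org.apache.hadoop.yarn.api.records.ContainerLaunchContext;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class YarnSubmitSketch {
  public static void main(String[] args) throws Exception {
    YarnClient yarnClient = YarnClient.createYarnClient();
    yarnClient.init(new YarnConfiguration());
    yarnClient.start();

    // Ask the ResourceManager for a new application id.
    YarnClientApplication app = yarnClient.createApplication();
    ApplicationSubmissionContext context = app.getApplicationSubmissionContext();
    context.setApplicationName("analyze-lineitem");           // illustrative name

    // The command that launches the ApplicationMaster in its first container.
    ContainerLaunchContext amContainer = ContainerLaunchContext.newInstance(
        Collections.emptyMap(),                               // local resources (jars, files)
        Collections.emptyMap(),                               // environment variables
        Collections.singletonList("my-app-master-command"),   // hypothetical launch command
        null, null, null);
    context.setAMContainerSpec(amContainer);

    // Resources requested for the ApplicationMaster container (memory in MB, virtual cores).
    context.setResource(Resource.newInstance(1024, 1));

    // Hand the request to the ResourceManager; NodeManagers will launch the containers.
    yarnClient.submitApplication(context);
  }
}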
YARN
• Acronym for Yet Another Resource Negotiator
• New resource manager included in Hadoop 2.x and later
• De-couples Hadoop workload and resource management
• Introduces a general purpose application container
• Hadoop 2.2.0 includes the first GA version of YARN
• Most Hadoop vendors support YARN
YARN
YARN is a key component in Hortonworks Data Platform.
[Diagram: YARN (cluster resource management) running on top of HDFS - the ResourceManager runs on @node132 and NodeManagers run on @node134, @node135, and @node136.]
[Diagram sequence: Application 1 ("Analyze lineitem table") is submitted; the ResourceManager launches Application Master 1 in a container on a NodeManager; Application Master 1 sends a resource request to the ResourceManager, receives container IDs, and App 1 containers are launched on the NodeManagers.]
[Diagram sequence: Application 2 ("Analyze customer table") is submitted while Application 1 is still running; Application Master 2 and its App 2 containers are launched in the same way, sharing the cluster with Application 1.]
▪ If you count the "java" process IDs (pids) running under the "yarn" user, you will
see 2X: each container appears twice in the listing, once as the /bin/bash -c
wrapper and once as the /hdp/jdk/bin/java process that it launches
00:00:00 /bin/bash -c /hdp/jdk/bin/java
00:00:00 /bin/bash -c /hdp/jdk/bin/java
. . .
00:07:48 /hdp/jdk/bin/java -Djava.net.pr
00:08:11 /hdp/jdk/bin/java -Djava.net.pr
. . .
Spark in Hadoop 2+
Apache Spark is a newer, in-memory framework that is an alternative to MapReduce.
Spark is the subject of the next unit.
Checkpoint
1. List the phases in a MR job.
2. What are the limitations of MR v1?
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
4. What are the major features of YARN?
Checkpoint
Checkpoint solution
1. List the phases in a MR job.
▪ Map, Shuffle, Reduce, and (optionally) Combiner
2. What are the limitations of MR v1?
▪ Centralized handling of job control flow
▪ Tight coupling of a specific programming model with the resource management
infrastructure
▪ Hadoop is now being used for all kinds of tasks beyond its original design
3. The JobTracker in MR1 is replaced by which component(s) in YARN?
▪ ResourceManager
▪ ApplicationMaster
4. What are the major features of YARN?
▪ Multi-tenancy
▪ Cluster utilization
▪ Scalability
▪ Compatibility
Checkpoint solution
Unit summary
• Describe the MapReduce model v1
• List the limitations of Hadoop 1 and MapReduce 1
• Review the Java code required to handle the Mapper class, the
Reducer class, and the program driver needed to access MapReduce
• Describe the YARN model
• Compare Hadoop 2/YARN with Hadoop 1
Unit summary
Lab 1
• Running MapReduce and YARN jobs
Lab 1:
Running MapReduce and YARN jobs
Purpose:
You will run a sample Java program using Hadoop v2 and related
technologies. You will not need to do any Java programming. The Hadoop
community provides several standard example programs (akin to "Hello,
World" for people learning to write C language programs).
In the real world, the sample/example programs provide opportunities to learn
the relevant technology, and on a newly installed Hadoop cluster, to exercise
the system and the operations environment.
The Java program that you will be compiling and running, WordCount2.java,
resides in the Hadoop samples directory /home/labfiles.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of
Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per
node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the
input files.
wordmedian: A map/reduce program that counts the median length of the words in the
input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of
the length of the words in the input files.
3. To run the sample program, run the following command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/Frankenstein.txt wcount
The last parameter (wcount) is the directory name where the program will write
the final output.
The output should be like the following:
[clsadmin@chs-gbq-108-mn003 ~]$ hadoop jar /usr/hdp/current/hadoop-mapreduce-
client/hadoop-mapreduce-examples.jar wordcount Gutenberg/Frankenstein.txt wcount
18/10/29 09:14:37 INFO client.RMProxy: Connecting to ResourceManager at chs-gbq-108-
mn002.us-south.ae.appdomain.cloud/172.16.162.134:8050
18/10/29 09:14:37 INFO client.AHSProxy: Connecting to Application History server at
chs-gbq-108-mn002.us-south.ae.appdomain.cloud/172.16.162.134:10200
18/10/29 09:14:38 INFO input.FileInputFormat: Total input paths to process : 1
18/10/29 09:14:38 INFO mapreduce.JobSubmitter: number of splits:1
18/10/29 09:14:38 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1540427840896_0006
18/10/29 09:14:39 INFO impl.YarnClientImpl: Submitted application
application_1540427840896_0006
18/10/29 09:14:39 INFO mapreduce.Job: The url to track the job: http://chs-gbq-108-
mn002.us-south.ae.appdomain.cloud:8088/proxy/application_1540427840896_0006/
18/10/29 09:14:39 INFO mapreduce.Job: Running job: job_1540427840896_0006
18/10/29 09:14:49 INFO mapreduce.Job: Job job_1540427840896_0006 running in uber mode
: false
18/10/29 09:14:49 INFO mapreduce.Job: map 0% reduce 0%
18/10/29 09:14:58 INFO mapreduce.Job: map 100% reduce 0%
18/10/29 09:15:08 INFO mapreduce.Job: map 100% reduce 100%
18/10/29 09:15:08 INFO mapreduce.Job: Job job_1540427840896_0006 completed
successfully
18/10/29 09:15:08 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=167616
FILE: Number of bytes written=638315
FILE: Number of read operations=0
Results:
You ran Java programs using Hadoop v2, YARN, and related technologies.
You tested an existing sample program from available Hadoop samples,
WordCount. You also checked the status and history of jobs using Ambari.
Lab 2
• Creating and coding a simple MapReduce job
Lab 2:
Creating and coding a simple MapReduce job
Purpose:
You will compile and run a more complete version of WordCount that has
been written specifically for MapReduce2.
2. Set the Analytics Engine server endpoint to Hostname (press Enter to accept
default values when you are asked about Ambari and Knox port numbers):
C:\>bx ae endpoint Hostname
Example:
C:\>bx ae endpoint https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud:9443
Registering endpoint 'https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud'...
Ambari Port Number [Optional: Press enter for default value] (9443)>
Knox Port Number [Optional: Press enter for default value] (8443)>
OK
Endpoint 'https://chs-gbq-108-mn001.us-south.ae.appdomain.cloud' set.
3. Change directory to the LabFiles directory on your local machine. Example:
cd C:\LabFiles
4. Upload sample files to HDFS:
bx ae file-system --user Username --password Password put patternsToSkip patternsToSkip
bx ae file-system --user Username --password Password put run-wc run-wc
bx ae file-system --user Username --password Password put WordCount2.java WordCount2.java
Example:
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put patternsToSkip
patternsToSkip
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put run-wc run-wc
bx ae file-system --user clsadmin --password 24Z5HHf7NUuy put WordCount2.java
WordCount2.java
5. Using PuTTY, connect to management node mn003 of your cluster.
6. Copy the .java file from HDFS to local file system:
hdfs dfs -copyToLocal WordCount2.java ~/
7. Run the following command to get the CLASSPATH environmental variable that
you need for compilation:
hadoop classpath
The output looks like the following:
[clsadmin@chs-gbq-108-mn003 ~]$ hadoop classpath
/usr/hdp/2.6.5.0-292/hadoop/conf:/usr/hdp/2.6.5.0-292/hadoop/lib/*:/usr/hdp/2.6.5.0-
292/hadoop/.//*:/usr/hdp/2.6.5.0-292/hadoop-hdfs/./:/usr/hdp/2.6.5.0-292/hadoop-
hdfs/lib/*:/usr/hdp/2.6.5.0-292/hadoop-hdfs/.//*:/usr/hdp/2.6.5.0-292/hadoop-
yarn/lib/*:/usr/hdp/2.6.5.0-292/hadoop-yarn/.//*:/usr/hdp/2.6.5.0-292/hadoop-
mapreduce/lib/*:/usr/hdp/2.6.5.0-292/hadoop-mapreduce/.//*::mysql-connector-
java.jar:/home/common/lib/dataconnectorStocator/*:/usr/hdp/2.6.5.0-
292/tez/*:/usr/hdp/2.6.5.0-292/tez/lib/*:/usr/hdp/2.6.5.0-292/tez/conf
8. Compile WordCount2.java with the Hadoop2 API and this CLASSPATH:
javac -cp `hadoop classpath` WordCount2.java
Note the use of the back-quote character here: it substitutes the output of the
hadoop classpath command as the classpath (-cp) needed for the compilation.
9. Create a Jar file that can be run in the Hadoop2/YARN environment: