TP 2
Overview
This exercise introduces you to a simple MapReduce program that uses Hadoop v2 and related
technologies. You compile and run the program by using Hadoop and YARN commands. You also
explore the MapReduce job’s history with the Ambari Web UI.
Objectives
After completing this exercise, you will be able to:
• List the sample MapReduce programs provided by the Hadoop community.
• Compile MapReduce programs and run them by using Hadoop and YARN commands.
• Explore the MapReduce job’s history by using the Ambari Web UI.
Introduction
In the MapReduce programming model, the program is written in a special way so that it can be
brought to the data. To accomplish this goal, the program is broken down into two discrete parts:
Map and Reduce.
• A mapper is typically a relatively small program with a relatively simple task. A mapper is
responsible for reading a portion of the input data, interpreting, filtering, or transforming the data
as necessary, and then producing a stream of <key, value> pairs.
• Reducers are the last part of the picture. They are also typically small programs that are
responsible for sorting and aggregating all the values that are associated with the keys they are
assigned to work on. As with mappers, parallelism applies: the more unique keys there are, the
more reducers can run in parallel. After each reducer completes its assigned work, for example,
adding up the total sales for a state, it emits key/value pairs that are written to storage. These
key/value pairs can be used as the input to the next MapReduce job.
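To make this division of work concrete, the following is a minimal sketch of a word-count mapper
and reducer that is written against the org.apache.hadoop.mapreduce API. The class and field names
are illustrative; the sample program that you run later in this exercise is already provided,
compiled, in hadoop-mapreduce-examples.jar.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // Mapper: reads one line of input at a time and emits a <word, 1> pair per token.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);            // emit the <key, value> pair
      }
    }
  }

  // Reducer: receives all the values for one key and emits <word, total count>.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);            // written to the job's output directory
    }
  }
}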
The fundamental idea of YARN/MRv2 is to split the two main functions of the JobTracker, which are
resource management and job scheduling/monitoring, into separate daemons. The idea is to have
a global ResourceManager (RM) and a per-application ApplicationMaster (AM). For more
information about MapReduce 2.0 (MRv2) or YARN, see Apache Hadoop YARN.
The Hadoop community provides several standard example programs. You can use the example
programs to learn the relevant technology, and on a newly installed Hadoop cluster, to verify the
system’s operational environment.
In this exercise, you list the example programs that are provided by the Hadoop community. Then,
you compile the sample Java program wordcount, which is a MapReduce program that counts the
words in the input files. You run wordcount by using Hadoop and YARN commands. You also learn
to explore the MapReduce job’s history with the Ambari Web UI from both the MapReduce2 and
YARN services.
Requirements
• Complete "Exercise 1. Exploring the lab environment".
• Complete "Exercise 3. File access and basic commands with HDFS".
• PuTTY SSH client installed on your workstation.
Exercise instructions
In this exercise, you complete the following tasks:
1. Run a simple MapReduce job from a Hadoop sample program.
2. Explore the MapReduce job’s history with the Ambari Web UI.
3. Run a simple MapReduce job by using YARN.
3. Run the sample program wordcount, which is a MapReduce program that counts the words
in the input files. Enter the following command in one line:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/Frankenstein.txt wcount
Where wcount is the directory where the program writes the final output.
Important
LONG RUNNING command. If a single user submits this command, it usually completes in
approximately 25 seconds. In the current class environment, when tens of students submit jobs at
the same time, expect longer wait times. In our tests, the maximum wait time that was experienced
was 3 minutes. Be patient: do not close or break your session.
Note
If the Gutenberg folder does not exist in your HDFS directory, run the steps in "Exercise 3. File
access and basic commands with HDFS", "Part 2. Exploring basic HDFS commands".
4. Notice that there is only one Reduce task, which is highlighted in bold under the Job
Counters section in the command results. Therefore, the result is produced in one file.
5. List the generated files by running the following command:
hdfs dfs -ls wcount
The result is shown in the following output.
[student0000@dataengineer ~]$ hdfs dfs -ls wcount
Found 2 items
-rw-r--r-- 3 student0000 hdfs 0 2020-09-21 15:39 wcount/_SUCCESS
-rw-r--r-- 3 student0000 hdfs 122090 2020-09-21 15:39 wcount/part-r-00000
As expected, the result is produced in only one file (part-r-00000).
6. To review part-r-00000, run the following command:
hadoop fs -cat wcount/part-r-00000 | more
The word counts are displayed one page at a time.
Note
To run the command again unchanged, you must first remove the wcount directory and all the files
in it. Alternatively, you can run the command again with a different output directory, such as
wcount2.
Part 2: Exploring the MapReduce job’s history with the Ambari Web UI
In this part, you explore the MapReduce job that you submitted in the previous part by using the
Ambari Web UI.
Complete the following steps:
1. Start Ambari Web UI by opening the Ambari URL <hostname:8080> in your browser. Refer
to "Exercise 1. Exploring the lab environment".
2. Log in using the Ambari Username <ambari username> and Ambari Password <ambari
password> from "Exercise 1. Exploring the lab environment".
3. Click MapReduce2 under Services on the left pane.
Hint
Press Ctrl + F and search for your username to find the jobs that you ran. You can also search by
using the Job ID from the results of running wordcount.
5. Click the job ID to open its history to see the status and logs of the job.
Part 3: Running a simple MapReduce job by using YARN
4. Re-run the job with all four files in the Gutenberg directory as input, by using the command:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/* wcount2
Important
LONG RUNNING command. If a single user submits this command, it usually completes in
approximately 25 seconds. In the current class environment, when tens of students submit jobs at
the same time, expect longer wait times. In our tests, the maximum wait time that was experienced
was 3 minutes. Be patient: do not close or break your session.
Note
You can get the application ID from the result of running the job. Find the value of Submitted
application (for example, application_1599487765089_0037, which is highlighted in bold in the
command results).
11. Return to PuTTY to clean up the output directories by running the following command:
hdfs dfs -rm -R wcount*
End of exercise
Overview
In this exercise, you compile and run a new and more complex version of the WordCount program
that was introduced in “Exercise 4. Running MapReduce and YARN jobs”. This new version uses
many of the features that are provided by the MapReduce framework.
Objectives
After completing this exercise, you will be able to:
• Compile and run more complex MapReduce programs.
Introduction
In this exercise, you use a more complex MapReduce program, WordCount2.java, which is
provided as part of the Apache Hadoop MapReduce tutorials at
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/M
apReduceTutorial.html#Example:_WordCount_v2.0.
WordCount v2.0 is more sophisticated than the version that you used in Exercise 4. In this version,
you can specify patterns that you might want to skip when the program counts words, such as "to",
"the", "/", and others.
There are some limitations; if you are an experienced Java programmer, you might want to
experiment later with other features. For instance, are all words lowercased when they are
tokenized?
Because you are now more familiar with the process of running MapReduce and YARN jobs, the
directions that are provided here concentrate only on compiling and running the program.
Requirements
• Complete "Exercise 1. Exploring the lab environment".
• Complete "Exercise 3. File access and basic commands with HDFS".
• PuTTY SSH client installed on your workstation.
Exercise instructions
In this exercise, you complete the following tasks:
1. Compile and run a more complex version of the WordCount program.
Note
At the end of this command, there is a period. It indicates the current directory, which is your home
directory in this case.
4. Display the Hadoop classpath, which you need for the compilation:
hadoop classpath
The result is similar to the following output.
[student0000@dataengineer WordCount2]$ hadoop classpath
/usr/hdp/3.1.4.0-315/hadoop/conf:/usr/hdp/3.1.4.0-315/hadoop/lib/*:/usr/hdp/3.1.4.
0-315/hadoop/.//*:/usr/hdp/3.1.4.0-315/hadoop-hdfs/./:/usr/hdp/3.1.4.0-315/hadoop-
hdfs/lib/*:/usr/hdp/3.1.4.0-315/hadoop-hdfs/.//*:/usr/hdp/3.1.4.0-315/hadoop-mapre
duce/lib/*:/usr/hdp/3.1.4.0-315/hadoop-mapreduce/.//*:/usr/hdp/3.1.4.0-315/hadoop-
yarn/./:/usr/hdp/3.1.4.0-315/hadoop-yarn/lib/*:/usr/hdp/3.1.4.0-315/hadoop-yarn/./
/*:/usr/hdp/3.1.4.0-315/tez/*:/usr/hdp/3.1.4.0-315/tez/lib/*:/usr/hdp/3.1.4.0-315/
tez/conf:/usr/hdp/3.1.4.0-315/tez/conf_llap:/usr/hdp/3.1.4.0-315/tez/doc:/usr/hdp/
3.1.4.0-315/tez/hadoop-shim-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/hadoop-
shim-2.8-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib:/usr/hdp/3.1.4.0-315/t
ez/man:/usr/hdp/3.1.4.0-315/tez/tez-api-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315
/tez/tez-common-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-dag-0.9.1.3.1.4
.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-examples-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.
1.4.0-315/tez/tez-history-parser-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/te
z-javadoc-tools-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-job-analyzer-0.
9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-mapreduce-0.9.1.3.1.4.0-315.jar:/
usr/hdp/3.1.4.0-315/tez/tez-protobuf-history-plugin-0.9.1.3.1.4.0-315.jar:/usr/hdp
/3.1.4.0-315/tez/tez-runtime-internals-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/
tez/tez-runtime-library-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-tests-0
.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-cache-plugin-0.9.1
.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-history-0.9.1.3.1.4.0-
315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-history-with-acls-0.9.1.3.1.4.0
-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-history-with-fs-0.9.1.3.1.4.0-
315.jar:/usr/hdp/3.1.4.0-315/tez/ui:/usr/hdp/3.1.4.0-315/tez/lib/async-http-client
-1.9.40.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-cli-1.2.jar:/usr/hdp/3.1.4.0-315/
tez/lib/commons-codec-1.4.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-collections-3.2
.2.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-collections4-4.1.jar:/usr/hdp/3.1.4.0-
315/tez/lib/commons-io-2.4.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-lang-2.6.jar:/
usr/hdp/3.1.4.0-315/tez/lib/commons-math3-3.1.1.jar:/usr/hdp/3.1.4.0-315/tez/lib/g
cs-connector-1.9.10.3.1.4.0-315-shaded.jar:/usr/hdp/3.1.4.0-315/tez/lib/guava-28.0
-jre.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-aws-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.
1.4.0-315/tez/lib/hadoop-azure-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/
hadoop-azure-datalake-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-hd
fs-client-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-mapreduce-clie
nt-common-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-mapreduce-clie
nt-core-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-yarn-server-time
line-pluginstorage-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/jersey-clien
t-1.19.jar:/usr/hdp/3.1.4.0-315/tez/lib/jersey-json-1.19.jar:/usr/hdp/3.1.4.0-315/
tez/lib/jettison-1.3.4.jar:/usr/hdp/3.1.4.0-315/tez/lib/jetty-server-9.3.24.v20180
605.jar:/usr/hdp/3.1.4.0-315/tez/lib/jetty-util-9.3.24.v20180605.jar:/usr/hdp/3.1.
4.0-315/tez/lib/jsr305-3.0.0.jar:/usr/hdp/3.1.4.0-315/tez/lib/metrics-core-3.1.0.j
ar:/usr/hdp/3.1.4.0-315/tez/lib/protobuf-java-2.5.0.jar:/usr/hdp/3.1.4.0-315/tez/l
ib/RoaringBitmap-0.4.9.jar:/usr/hdp/3.1.4.0-315/tez/lib/servlet-api-2.5.jar:/usr/h
dp/3.1.4.0-315/tez/lib/slf4j-api-1.7.10.jar:/usr/hdp/3.1.4.0-315/tez/lib/tez.tar.g
z
5. Compile WordCount2.java with the Hadoop v2 API and this classpath:
javac -cp `hadoop classpath` WordCount2.java
Note
Notice the back quotation marks, also known as backticks, around `hadoop classpath`. A backtick is
not a quotation mark. It has a special meaning: command substitution. With command substitution,
the shell evaluates (runs) the command that is placed inside the backticks before it runs the main
command, and it supplies the output as an argument to that main command. In this example, the
shell runs hadoop classpath first and passes its output to javac as if you had typed that output at
that place in the command line.
Note
By default, the output of this command goes to your HDFS home directory, which is
/user/<username> in the environment for this course.
9. You are now ready to run the compiled program with the appropriate parameters. Review
the program logic and the use of these additional parameters in the WordCount V2.0 tutorial
at
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client
-core/MapReduceTutorial.html#Example:_WordCount_v2.0.
Run the following command in one line:
hadoop jar WC2.jar WordCount2 -D wordcount.case.sensitive=false Gutenberg/*.txt
wc2out -skip patternsToSkip
Important
LONG RUNNING command. If a single user submits this command, it usually completes in
approximately 25 seconds. In the current class environment, when tens of students submit jobs at
the same time, expect longer wait times. In our tests, the maximum wait time that was experienced
was 3 minutes. Be patient: do not close or break your session.
Note
• If the Gutenberg folder does not exist in your HDFS directory, run the steps in Exercise 3.
• You can run the same program with the following yarn command. But before you run it, clean
up the output directory wc2out by running hdfs dfs -rm -R wc2out.
yarn jar WC2.jar WordCount2 -D wordcount.case.sensitive=false Gutenberg/*.txt
wc2out -skip patternsToSkip
10. Notice how many mappers and reducers run for this job. They are highlighted in bold under
the Job Counters section of the command results.
11. List the generated files by running the command:
hdfs dfs -ls wc2out
The result is similar to the following output.
[student0000@dataengineer ~]$ hdfs dfs -ls wc2out
Found 2 items
-rw-r--r-- 3 student0000 hdfs 0 2020-09-22 01:46 wc2out/_SUCCESS
-rw-r--r-- 3 student0000 hdfs 161660 2020-09-22 01:46 wc2out/part-r-00000
12. Explore the results of the program, which are generated in the file part-r-00000, by running
the following command:
hdfs dfs -cat wc2out/part-r-00000 | more
Scroll through the file by pressing Enter and look at the output pages.
13. Enter q to quit.
14. Clean up the output directory wc2out by running the following command:
hdfs dfs -rm -R wc2out
The result is similar to the following output.
[student0000@dataengineer ~]$ hdfs dfs -rm -R wc2out
Deleted wc2out
End of exercise