
Exercise 4. Running MapReduce and YARN jobs
Estimated time
0:45

Overview
This exercise introduces you to a simple MapReduce program that uses Hadoop v2 and related
technologies. You run the program by using Hadoop and YARN commands. You also
explore the MapReduce job’s history with the Ambari Web UI.

Objectives
After completing this exercise, you will be able to:
• List the sample MapReduce programs provided by the Hadoop community.
• Run MapReduce programs by using Hadoop and YARN commands.
• Explore the MapReduce job’s history by using the Ambari Web UI.

Introduction
In the MapReduce programming model, a program is written in a special way so that the
processing can be brought to the data. To accomplish this goal, the program is broken down into two
discrete parts: Map and Reduce.
• A mapper is typically a relatively small program with a relatively simple task. A mapper is
responsible for reading a portion of the input data; interpreting, filtering, or transforming the data
as necessary; and finally producing a stream of <key, value> pairs.
• Reducers are the last part of the picture. They are also typically small programs, and they are
responsible for sorting and aggregating the values for the keys that they are assigned to work
on. As with mappers, the more unique keys there are, the more parallelism is possible. After each
reducer completes its assigned work, for example, adding up the total sales for a state, it
emits key/value pairs that are written to storage. These key/value pairs can be used as the input
to the next MapReduce job.
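The map → shuffle → reduce flow can be simulated with ordinary shell tools. This is a conceptual sketch only, not how Hadoop executes anything: tr plays the mapper, sort plays the shuffle, and awk plays the reducer.

```shell
# Mapper: split the input into words and emit <word, 1> pairs.
# Shuffle: sort brings identical keys together.
# Reducer: sum the values for each key.
echo "the cat and the hat and the bat" |
  tr ' ' '\n' |
  awk '{print $1 "\t" 1}' |
  sort |
  awk -F'\t' '{sum[$1] += $2} END {for (k in sum) print k "\t" sum[k]}' |
  sort
# and 2, bat 1, cat 1, hat 1, the 3 (tab-separated, one pair per line)
```

Because sort groups identical keys, each key reaches the "reducer" with all of its values together, which is exactly the guarantee the MapReduce framework gives real reducers.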
The fundamental idea of YARN/MRv2 is to split the two main functions of the JobTracker, which are
resource management and job scheduling/monitoring, into separate daemons. The idea is to have
a global ResourceManager (RM) and a per-application ApplicationMaster (AM). For
more information about MapReduce 2.0 (MRv2) or YARN, see Apache Hadoop YARN.
The Hadoop community provides several standard example programs. You can use the example
programs to learn the relevant technology, and on a newly installed Hadoop cluster, to verify the
system’s operational environment.

© Copyright IBM Corp. 2016, 2021 4-1


Course materials may not be reproduced in whole or in part without the prior written permission of IBM.

In this exercise, you list the example programs that are provided by the Hadoop community. Then,
you run the sample program wordcount, which is a MapReduce program that counts the
words in the input files, by using Hadoop and YARN commands. You also learn
to explore the MapReduce job’s history with the Ambari Web UI from both the MapReduce2 and YARN
services.

Requirements
• Complete "Exercise 1. Exploring the lab environment".
• Complete "Exercise 3. File access and basic commands with HDFS".
• PuTTY SSH client installed on your workstation.


Exercise instructions
In this exercise, you complete the following tasks:
1. Run a simple MapReduce job from a Hadoop sample program.
2. Explore the MapReduce job’s history with the Ambari Web UI.
3. Run a simple MapReduce job by using YARN.

Part 1: Running a simple MapReduce job from a Hadoop sample program

In this part, you list the example programs that are provided by the Hadoop community. Then, you
run the sample program wordcount, which is a MapReduce program that counts the
words in the input files, by using Hadoop commands.
Complete the following steps:
1. SSH to the Hadoop host by using PuTTY.
a. Enter the <hostname> or <ip_address> that is provided to you and that you updated in
the table of “Exercise 1: Part 1. Accessing your VM”.
b. Log in with the <username> and <password> that are assigned to you.
2. List all MapReduce examples on the system by running the following command:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar


The results are shown in the following output.


[student0000@dataengineer ~]$ hadoop jar
/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words
in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the
histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact
digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits
of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino
problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data
per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the
input files.
wordmedian: A map/reduce program that counts the median length of the words in
the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of
the length of the words in the input files.


3. Run the sample program wordcount, which is a MapReduce program that counts the words
in the input files. Enter the following command in one line:
hadoop jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/Frankenstein.txt wcount
Where wcount is the directory where the program writes the final output.

Important

LONG RUNNING command. If a single user submits this command, it usually completes in
approximately 25 seconds. In the current class environment, when tens of students submit jobs at the
same time, expect a longer wait. In our tests, the maximum wait time that was
experienced was 3 minutes. Be patient: do not close or interrupt your session.


The results are similar to the following output.


[student0000@dataengineer ~]$ hadoop jar
/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount
Gutenberg/Frankenstein.txt wcount
20/09/21 15:39:00 INFO client.RMProxy: Connecting to ResourceManager at
dataengineer.ibm.com/192.168.122.1:8050
20/09/21 15:39:00 INFO client.AHSProxy: Connecting to Application History server at
dataengineer.ibm.com/192.168.122.1:10200
20/09/21 15:39:01 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /user/student0000/.staging/job_1599487765089_0034
20/09/21 15:39:01 INFO input.FileInputFormat: Total input files to process : 1
20/09/21 15:39:01 INFO mapreduce.JobSubmitter: number of splits:1
20/09/21 15:39:01 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1599487765089_0034
20/09/21 15:39:01 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/09/21 15:39:01 INFO conf.Configuration: found resource resource-types.xml at
file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
20/09/21 15:39:02 INFO impl.YarnClientImpl: Submitted application
application_1599487765089_0034
20/09/21 15:39:02 INFO mapreduce.Job: The url to track the job:
http://dataengineer.ibm.com:8088/proxy/application_1599487765089_0034/
20/09/21 15:39:02 INFO mapreduce.Job: Running job: job_1599487765089_0034
20/09/21 15:39:09 INFO mapreduce.Job: Job job_1599487765089_0034 running in uber
mode : false
20/09/21 15:39:09 INFO mapreduce.Job: map 0% reduce 0%
20/09/21 15:39:18 INFO mapreduce.Job: map 100% reduce 0%
20/09/21 15:39:24 INFO mapreduce.Job: map 100% reduce 100%
20/09/21 15:39:25 INFO mapreduce.Job: Job job_1599487765089_0034 completed
successfully
20/09/21 15:39:25 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=167616
FILE: Number of bytes written=802409
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=421645
HDFS: Number of bytes written=122090
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=32085
Total time spent by all reduces in occupied slots (ms)=20765
Total time spent by all map tasks (ms)=6417


Total time spent by all reduce tasks (ms)=4153


Total vcore-milliseconds taken by all map tasks=6417
Total vcore-milliseconds taken by all reduce tasks=4153
Total megabyte-milliseconds taken by all map tasks=32855040
Total megabyte-milliseconds taken by all reduce tasks=21263360
Map-Reduce Framework
Map input records=7244
Map output records=74952
Map output bytes=717818
Map output materialized bytes=167616
Input split bytes=141
Combine input records=74952
Combine output records=11603
Reduce input groups=11603
Reduce shuffle bytes=167616
Reduce input records=11603
Reduce output records=11603
Spilled Records=23206
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=159
CPU time spent (ms)=6820
Physical memory (bytes) snapshot=2789412864
Virtual memory (bytes) snapshot=12801667072
Total committed heap usage (bytes)=3138387968
Peak Map Physical memory (bytes)=2466148352
Peak Map Virtual memory (bytes)=6387011584
Peak Reduce Physical memory (bytes)=323264512
Peak Reduce Virtual memory (bytes)=6414655488
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=421504
File Output Format Counters
Bytes Written=122090

Note

If the Gutenberg folder does not exist in your HDFS directory, run the steps in "Exercise 3. File
access and basic commands with HDFS", "Part 2. Exploring basic HDFS commands".


4. Notice that there is only one Reduce task (Launched reduce tasks=1 under the Job
Counters section in the command results). Therefore, the result is produced in one file.
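A job can also request more reducers, for example with the generic option -D mapreduce.job.reduces=2 placed right after the program name, which the example drivers accept. Each key is then routed to one reducer by a partitioner, and the output lands in several part files. The routing idea can be sketched locally; this sketch uses cksum as a stand-in hash, not Hadoop's actual HashPartitioner, which uses the key's Java hashCode().

```shell
# Route each word to one of two hypothetical reducers by hashing the key.
# All occurrences of a key always land on the same reducer, so each
# part-r-NNNNN file holds complete counts for its share of the keys.
reducers=2
for w in the cat and hat; do
  h=$(printf '%s' "$w" | cksum | cut -d ' ' -f 1)
  echo "$w -> part-r-0000$(( h % reducers ))"
done
```

Whatever the hash, the key point is determinism: the same key always hashes to the same reducer, so no word's count is split across part files.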
5. List the generated files by running the following command:
hdfs dfs -ls wcount
The result is shown in the following output.
[student0000@dataengineer ~]$ hdfs dfs -ls wcount
Found 2 items
-rw-r--r-- 3 student0000 hdfs 0 2020-09-21 15:39 wcount/_SUCCESS
-rw-r--r-- 3 student0000 hdfs 122090 2020-09-21 15:39 wcount/part-r-00000
As expected, the result is produced in only one file (part-r-00000).
6. To review part-r-00000, run the following command:
hadoop fs -cat wcount/part-r-00000 | more
The output lists each word followed by a tab and its count.
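The part file is plain text, one word<TAB>count pair per line, so ordinary text tools work on it. As a sketch, here is how you could find the most frequent words; the inline sample (with made-up counts) stands in for the real file, which you would instead stream with hadoop fs -cat wcount/part-r-00000.

```shell
# Sample lines in the part-r-00000 format: word<TAB>count.
# Against the real output, replace the printf with:
#   hadoop fs -cat wcount/part-r-00000
printf 'the\t402\nmonster\t33\nof\t508\nice\t24\n' |
  sort -k2,2nr |   # numeric sort on the count column, descending
  head -n 3        # keep the three most frequent words
```

This prints "of", "the", and "monster" with their counts, most frequent first.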


7. Type q or Ctrl + C to exit from the file.


8. Remove the directory wcount where your output file was stored, by running the following
command:
hdfs dfs -rm -R wcount
The result is shown in the following output.
[student0000@dataengineer ~]$ hdfs dfs -rm -R wcount
Deleted wcount

Note

You must remove the wcount directory and all files in it before you can run the same command
again because a MapReduce job fails if its output directory already exists. Alternatively, you could
run the command again with a different output directory, such as wcount2.

Part 2: Exploring the MapReduce job’s history with the Ambari Web UI
In this part, you explore the MapReduce job that you submitted in the previous part by using the
Ambari Web UI.
Complete the following steps:
1. Start the Ambari Web UI by opening the Ambari URL <hostname:8080> in your browser. Refer
to "Exercise 1. Exploring the lab environment".
2. Log in with the Ambari username <ambari username> and password <ambari
password> from "Exercise 1. Exploring the lab environment".
3. Click MapReduce2 under Services on the left pane.

4. Click JobHistory UI in the Quick Links section.


The job history page is displayed as shown in the following figure.


Hint

Press Ctrl + F and search for your username to find the jobs that you ran. You can also search
for the Job ID from the results of running wordcount.

5. Click the job ID to open its history to see the status and logs of the job.

Part 3: Running a simple MapReduce job by using YARN

In this part, you run the same job that you ran in Part 1, but by using YARN commands. You also run
the wordcount sample program with more than one input text file.
Complete the following steps:
1. Return to the PuTTY SSH client that you connected to the VM in Part 1.


Note

If your PuTTY session expired, reconnect.

2. Run the wordcount program by using the following yarn command.


yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/Frankenstein.txt wcount2

Important

LONG RUNNING command. If a single user submits this command, it usually completes in
approximately 25 seconds. In the current class environment, when tens of students submit jobs at the
same time, expect a longer wait. In our tests, the maximum wait time that was
experienced was 3 minutes. Be patient: do not close or interrupt your session.


The result is shown in the following output.


[student0000@dataengineer ~]$ yarn jar
/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount
Gutenberg/Frankenstein.txt wcount2
20/09/21 16:21:29 INFO client.RMProxy: Connecting to ResourceManager at
dataengineer.ibm.com/192.168.122.1:8050
20/09/21 16:21:29 INFO client.AHSProxy: Connecting to Application History server at
dataengineer.ibm.com/192.168.122.1:10200
20/09/21 16:21:29 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /user/student0000/.staging/job_1599487765089_0035
20/09/21 16:21:29 INFO input.FileInputFormat: Total input files to process : 1
20/09/21 16:21:29 INFO mapreduce.JobSubmitter: number of splits:1
20/09/21 16:21:29 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1599487765089_0035
20/09/21 16:21:29 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/09/21 16:21:30 INFO conf.Configuration: found resource resource-types.xml at
file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
20/09/21 16:21:30 INFO impl.YarnClientImpl: Submitted application
application_1599487765089_0035
20/09/21 16:21:30 INFO mapreduce.Job: The url to track the job:
http://dataengineer.ibm.com:8088/proxy/application_1599487765089_0035/
20/09/21 16:21:30 INFO mapreduce.Job: Running job: job_1599487765089_0035
20/09/21 16:21:37 INFO mapreduce.Job: Job job_1599487765089_0035 running in uber
mode : false
20/09/21 16:21:37 INFO mapreduce.Job: map 0% reduce 0%
20/09/21 16:21:45 INFO mapreduce.Job: map 100% reduce 0%
20/09/21 16:21:50 INFO mapreduce.Job: map 100% reduce 100%
20/09/21 16:21:51 INFO mapreduce.Job: Job job_1599487765089_0035 completed
successfully
20/09/21 16:21:51 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=167616
FILE: Number of bytes written=802411
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=421645
HDFS: Number of bytes written=122090
HDFS: Number of read operations=8
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=30780
Total time spent by all reduces in occupied slots (ms)=14920
Total time spent by all map tasks (ms)=6156


Total time spent by all reduce tasks (ms)=2984


Total vcore-milliseconds taken by all map tasks=6156
Total vcore-milliseconds taken by all reduce tasks=2984
Total megabyte-milliseconds taken by all map tasks=31518720
Total megabyte-milliseconds taken by all reduce tasks=15278080
Map-Reduce Framework
Map input records=7244
Map output records=74952
Map output bytes=717818
Map output materialized bytes=167616
Input split bytes=141
Combine input records=74952
Combine output records=11603
Reduce input groups=11603
Reduce shuffle bytes=167616
Reduce input records=11603
Reduce output records=11603
Spilled Records=23206
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=146
CPU time spent (ms)=5870
Physical memory (bytes) snapshot=2794078208
Virtual memory (bytes) snapshot=12812599296
Total committed heap usage (bytes)=3176660992
Peak Map Physical memory (bytes)=2474508288
Peak Map Virtual memory (bytes)=6391799808
Peak Reduce Physical memory (bytes)=319569920
Peak Reduce Virtual memory (bytes)=6420799488
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=421504
File Output Format Counters
Bytes Written=122090
3. Clean up the output directory wcount2 by running the following command:
hdfs dfs -rm -R wcount*
The results are shown in the following output.
[student0000@dataengineer ~]$ hdfs dfs -rm -R wcount*
Deleted wcount2


4. Re-run the job with all four files in the Gutenberg directory as input, by using the command:
yarn jar /usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar
wordcount Gutenberg/* wcount2

Important

LONG RUNNING command. If a single user submits this command, it usually completes in
approximately 25 seconds. In the current class environment, when tens of students submit jobs at the
same time, expect a longer wait. In our tests, the maximum wait time that was
experienced was 3 minutes. Be patient: do not close or interrupt your session.


The result is shown in the following output.


[student0000@dataengineer ~]$ yarn jar
/usr/hdp/current/hadoop-mapreduce-client/hadoop-mapreduce-examples.jar wordcount
Gutenberg/* wcount2
20/09/21 16:38:43 INFO client.RMProxy: Connecting to ResourceManager at
dataengineer.ibm.com/192.168.122.1:8050
20/09/21 16:38:44 INFO client.AHSProxy: Connecting to Application History server at
dataengineer.ibm.com/192.168.122.1:10200
20/09/21 16:38:44 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /user/student0000/.staging/job_1599487765089_0037
20/09/21 16:38:44 INFO input.FileInputFormat: Total input files to process : 4
20/09/21 16:38:44 INFO mapreduce.JobSubmitter: number of splits:4
20/09/21 16:38:44 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1599487765089_0037
20/09/21 16:38:44 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/09/21 16:38:45 INFO conf.Configuration: found resource resource-types.xml at
file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
20/09/21 16:38:45 INFO impl.YarnClientImpl: Submitted application
application_1599487765089_0037
20/09/21 16:38:45 INFO mapreduce.Job: The url to track the job:
http://dataengineer.ibm.com:8088/proxy/application_1599487765089_0037/
20/09/21 16:38:45 INFO mapreduce.Job: Running job: job_1599487765089_0037
20/09/21 16:38:52 INFO mapreduce.Job: Job job_1599487765089_0037 running in uber
mode : false
20/09/21 16:38:52 INFO mapreduce.Job: map 0% reduce 0%
20/09/21 16:39:04 INFO mapreduce.Job: map 50% reduce 0%
20/09/21 16:39:05 INFO mapreduce.Job: map 100% reduce 0%
20/09/21 16:39:10 INFO mapreduce.Job: map 100% reduce 100%
20/09/21 16:39:11 INFO mapreduce.Job: Job job_1599487765089_0037 completed
successfully
20/09/21 16:39:12 INFO mapreduce.Job: Counters: 53
File System Counters
FILE: Number of bytes read=474331
FILE: Number of bytes written=2116635
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1401923
HDFS: Number of bytes written=261426
HDFS: Number of read operations=17
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=4
Launched reduce tasks=1
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=193980
Total time spent by all reduces in occupied slots (ms)=20745


Total time spent by all map tasks (ms)=38796


Total time spent by all reduce tasks (ms)=4149
Total vcore-milliseconds taken by all map tasks=38796
Total vcore-milliseconds taken by all reduce tasks=4149
Total megabyte-milliseconds taken by all map tasks=198635520
Total megabyte-milliseconds taken by all reduce tasks=21242880
Map-Reduce Framework
Map input records=24960
Map output records=246237
Map output bytes=2365551
Map output materialized bytes=474349
Input split bytes=579
Combine input records=246237
Combine output records=32643
Reduce input groups=23934
Reduce shuffle bytes=474349
Reduce input records=32643
Reduce output records=23934
Spilled Records=65286
Shuffled Maps =4
Failed Shuffles=0
Merged Map outputs=4
GC time elapsed (ms)=1229
CPU time spent (ms)=28690
Physical memory (bytes) snapshot=7629279232
Virtual memory (bytes) snapshot=31986794496
Total committed heap usage (bytes)=9484369920
Peak Map Physical memory (bytes)=2478088192
Peak Map Virtual memory (bytes)=6393864192
Peak Reduce Physical memory (bytes)=327307264
Peak Reduce Virtual memory (bytes)=6420643840
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=1401344
File Output Format Counters
Bytes Written=261426
5. Notice how many mappers and reducers ran for this job (Launched map tasks=4 and
Launched reduce tasks=1 under the Job Counters section of the command results).
6. Return to the Ambari Web UI. Click YARN under Services on the left pane.


7. Click ResourceManager UI under Quick Links.

8. Select the Applications tab.


9. Select the application ID of the job that you ran in step 4.

Note

You can get the application ID from the result of running the job. Find the value of Submitted
application (for example, application_1599487765089_0037 in the command results).

10. Click History at the upper right.


Notice the number of Map and Reduce tasks.

11. Return to PuTTY to clean up the output directories by running the following command:
hdfs dfs -rm -R wcount*


The result looks like the following output.


[student0000@dataengineer ~]$ hdfs dfs -rm -R wcount*
Deleted wcount2

End of exercise


Exercise review and wrap-up

In this exercise, you ran the sample program wordcount, which is a MapReduce
program that counts the words in the input files, by using Hadoop and YARN
commands. You also explored the MapReduce job’s history by using the Ambari Web UI.

Exercise 5. Creating and coding a simple MapReduce job
Estimated time
30 minutes

Overview
In this exercise, you compile and run a new and more complex version of the WordCount program
that was introduced in “Exercise 4. Running MapReduce and YARN jobs”. This new version uses
many of the features that are provided by the MapReduce framework.

Objectives
After completing this exercise, you will be able to:
• Compile and run more complex MapReduce programs.

Introduction
In this exercise, you use a more complex MapReduce program, WordCount2.java, which is
provided as part of the Apache Hadoop MapReduce tutorial at
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0.
WordCount2.java is more sophisticated than the version that you used in Exercise 4. In this version,
you can specify patterns that you want to skip when the program counts words, such as "to",
"the", "/", and others.
There are some limitations; if you are an experienced Java programmer, you might want to
experiment later with other features. For instance, are all words lowercased when they are
tokenized?
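You can preview the case question with a quick local check, independent of the Java program: without lowercasing, a capitalized word and its lowercase form are counted separately.

```shell
# Count words without normalizing case: "The" and "the" stay distinct,
# while the two occurrences of "cat" collapse into a single count of 2.
printf 'The cat saw the cat\n' |
  tr ' ' '\n' |
  sort | uniq -c
```

Inserting a `tr '[:upper:]' '[:lower:]'` stage before the sort would merge the two variants, which is the kind of normalization WordCount2 leaves as an experiment.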
Because you are now more familiar with the process of running MapReduce and YARN jobs, the
directions provided here concentrate on compiling and running only.

Requirements
• Complete "Exercise 1. Exploring the lab environment".
• Complete "Exercise 3. File access and basic commands with HDFS".
• PuTTY SSH client installed on your workstation.


Exercise instructions
In this exercise, you complete the following tasks:
1. Compile and run a more complete version of the WordCount program.

Part 1: Compiling and running a more complete version of the WordCount program

In this part, you compile WordCount v2, which is a more complete version of the WordCount
program that you ran in Exercise 4. Then, you run it by using a Hadoop command.
Complete the following steps:
1. Connect to the Hadoop host by using PuTTY, as described in “Exercise 1: Part 1. Accessing
your VM”.
2. Explore the files in the WordCount2 folder by running the following command.
ls /labfiles/WordCount2/
The result is similar to the following output.
[student0001@dataengineer ~]$ ls /labfiles/WordCount2/
patternsToSkip WordCount2.java
As shown, the WordCount2 folder contains the following files:
- WordCount2.java: A more complete version of the WordCount program that was used in
Exercise 4.
- patternsToSkip: A file that contains the patterns to skip, one pattern per line.
3. Copy WordCount2.java to the current directory by running the commands:
cd ~
cp /labfiles/WordCount2/WordCount2.java .

Note

At the end of this command, there is a period. It indicates the current directory, which is your home
directory in this case.

4. Display the Hadoop classpath, which you need for compilation, by running the following command:
hadoop classpath
The result is similar to the following output.
[student0000@dataengineer WordCount2]$ hadoop classpath
/usr/hdp/3.1.4.0-315/hadoop/conf:/usr/hdp/3.1.4.0-315/hadoop/lib/*:/usr/hdp/3.1.4.
0-315/hadoop/.//*:/usr/hdp/3.1.4.0-315/hadoop-hdfs/./:/usr/hdp/3.1.4.0-315/hadoop-
hdfs/lib/*:/usr/hdp/3.1.4.0-315/hadoop-hdfs/.//*:/usr/hdp/3.1.4.0-315/hadoop-mapre
duce/lib/*:/usr/hdp/3.1.4.0-315/hadoop-mapreduce/.//*:/usr/hdp/3.1.4.0-315/hadoop-
yarn/./:/usr/hdp/3.1.4.0-315/hadoop-yarn/lib/*:/usr/hdp/3.1.4.0-315/hadoop-yarn/./


/*:/usr/hdp/3.1.4.0-315/tez/*:/usr/hdp/3.1.4.0-315/tez/lib/*:/usr/hdp/3.1.4.0-315/
tez/conf:/usr/hdp/3.1.4.0-315/tez/conf_llap:/usr/hdp/3.1.4.0-315/tez/doc:/usr/hdp/
3.1.4.0-315/tez/hadoop-shim-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/hadoop-
shim-2.8-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib:/usr/hdp/3.1.4.0-315/t
ez/man:/usr/hdp/3.1.4.0-315/tez/tez-api-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315
/tez/tez-common-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-dag-0.9.1.3.1.4
.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-examples-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.
1.4.0-315/tez/tez-history-parser-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/te
z-javadoc-tools-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-job-analyzer-0.
9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-mapreduce-0.9.1.3.1.4.0-315.jar:/
usr/hdp/3.1.4.0-315/tez/tez-protobuf-history-plugin-0.9.1.3.1.4.0-315.jar:/usr/hdp
/3.1.4.0-315/tez/tez-runtime-internals-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/
tez/tez-runtime-library-0.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-tests-0
.9.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-cache-plugin-0.9.1
.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-history-0.9.1.3.1.4.0-
315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-history-with-acls-0.9.1.3.1.4.0
-315.jar:/usr/hdp/3.1.4.0-315/tez/tez-yarn-timeline-history-with-fs-0.9.1.3.1.4.0-
315.jar:/usr/hdp/3.1.4.0-315/tez/ui:/usr/hdp/3.1.4.0-315/tez/lib/async-http-client
-1.9.40.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-cli-1.2.jar:/usr/hdp/3.1.4.0-315/
tez/lib/commons-codec-1.4.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-collections-3.2
.2.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-collections4-4.1.jar:/usr/hdp/3.1.4.0-
315/tez/lib/commons-io-2.4.jar:/usr/hdp/3.1.4.0-315/tez/lib/commons-lang-2.6.jar:/
usr/hdp/3.1.4.0-315/tez/lib/commons-math3-3.1.1.jar:/usr/hdp/3.1.4.0-315/tez/lib/g
cs-connector-1.9.10.3.1.4.0-315-shaded.jar:/usr/hdp/3.1.4.0-315/tez/lib/guava-28.0
-jre.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-aws-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.
1.4.0-315/tez/lib/hadoop-azure-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/
hadoop-azure-datalake-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-hd
fs-client-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-mapreduce-clie
nt-common-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-mapreduce-clie
nt-core-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/hadoop-yarn-server-time
line-pluginstorage-3.1.1.3.1.4.0-315.jar:/usr/hdp/3.1.4.0-315/tez/lib/jersey-clien
t-1.19.jar:/usr/hdp/3.1.4.0-315/tez/lib/jersey-json-1.19.jar:/usr/hdp/3.1.4.0-315/
tez/lib/jettison-1.3.4.jar:/usr/hdp/3.1.4.0-315/tez/lib/jetty-server-9.3.24.v20180
605.jar:/usr/hdp/3.1.4.0-315/tez/lib/jetty-util-9.3.24.v20180605.jar:/usr/hdp/3.1.
4.0-315/tez/lib/jsr305-3.0.0.jar:/usr/hdp/3.1.4.0-315/tez/lib/metrics-core-3.1.0.j
ar:/usr/hdp/3.1.4.0-315/tez/lib/protobuf-java-2.5.0.jar:/usr/hdp/3.1.4.0-315/tez/l
ib/RoaringBitmap-0.4.9.jar:/usr/hdp/3.1.4.0-315/tez/lib/servlet-api-2.5.jar:/usr/h
dp/3.1.4.0-315/tez/lib/slf4j-api-1.7.10.jar:/usr/hdp/3.1.4.0-315/tez/lib/tez.tar.g
z
5. Compile WordCount2.java with the Hadoop v2 API and this CLASSPATH:
javac -cp `hadoop classpath` WordCount2.java


Note

Notice the back quotation marks, also known as backticks, around `hadoop classpath`. A backtick
is not a quotation mark: it triggers command substitution. The shell first evaluates the command
placed between the backticks (hadoop classpath in this example) and then substitutes that
command's output into the command line, so the outer command (javac in this example) receives
the output exactly as if you had typed it there yourself.
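To see command substitution in action without Hadoop, try a minimal example in any POSIX shell (the variable names here are arbitrary):

```shell
# The shell evaluates the command between backticks first, then splices
# its stdout into the surrounding command line.
greeting=`echo hello`       # legacy backtick form, as in `hadoop classpath`
greeting2=$(echo hello)     # modern $(...) form; equivalent, and nestable
echo "$greeting $greeting2"
```

Both forms print hello hello; the $(...) form is generally preferred today because it nests cleanly.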

6. Create a JAR file that can be run in the Hadoop v2/YARN environment:
jar cf WC2.jar *.class
7. To run your JAR file successfully, remove all the class files from your Linux directory:
rm *.class
8. Copy the patternsToSkip file to your HDFS home directory by running the command:
hdfs dfs -put /labfiles/WordCount2/patternsToSkip

Note

By default, this command copies the file to your HDFS home directory, which is
/user/<username> in the environment for this course.
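In the WordCount v2.0 tutorial, a skip-patterns file holds one pattern per line; words and patterns listed there are filtered out of the count. The file below is purely illustrative (the real contents ship with the lab in /labfiles/WordCount2/patternsToSkip):

```shell
# Hypothetical skip-patterns file: one pattern per line, e.g. punctuation
# escapes and stop words. Do not use this in place of the lab's file.
cat > /tmp/patternsToSkip.example <<'EOF'
\.
\,
\!
to
EOF
wc -l < /tmp/patternsToSkip.example
```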

9. You are now ready to run the compiled program with the appropriate parameters. Review
the program logic and the use of these additional parameters in the WordCount V2.0 tutorial at
http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Example:_WordCount_v2.0.
Run the following command in one line:
hadoop jar WC2.jar WordCount2 -D wordcount.case.sensitive=false Gutenberg/*.txt
wc2out -skip patternsToSkip

Important

LONG RUNNING command. If a single user submits this command, it usually completes in
approximately 25 seconds. In the current class environment, when tens of students submit jobs at
the same time, expect a longer wait. In our tests, the maximum wait time was 3 minutes. Be
patient: do not close or break your session.


The result is similar to the following output.


[student0000@dataengineer ~]$ hadoop jar WC2.jar WordCount2 -D
wordcount.case.sensitive=false Gutenberg/*.txt wc2out -skip patternsToSkip
20/09/22 01:45:58 INFO client.RMProxy: Connecting to ResourceManager at
dataengineer.ibm.com/192.168.122.1:8050
20/09/22 01:45:59 INFO client.AHSProxy: Connecting to Application History server at
dataengineer.ibm.com/192.168.122.1:10200
20/09/22 01:45:59 INFO mapreduce.JobResourceUploader: Disabling Erasure Coding for
path: /user/student0000/.staging/job_1599487765089_0041
20/09/22 01:45:59 INFO input.FileInputFormat: Total input files to process : 4
20/09/22 01:45:59 INFO mapreduce.JobSubmitter: number of splits:4
20/09/22 01:45:59 INFO mapreduce.JobSubmitter: Submitting tokens for job:
job_1599487765089_0041
20/09/22 01:45:59 INFO mapreduce.JobSubmitter: Executing with tokens: []
20/09/22 01:46:00 INFO conf.Configuration: found resource resource-types.xml at
file:/etc/hadoop/3.1.4.0-315/0/resource-types.xml
20/09/22 01:46:00 INFO impl.YarnClientImpl: Submitted application
application_1599487765089_0041
20/09/22 01:46:00 INFO mapreduce.Job: The url to track the job:
http://dataengineer.ibm.com:8088/proxy/application_1599487765089_0041/
20/09/22 01:46:00 INFO mapreduce.Job: Running job: job_1599487765089_0041
20/09/22 01:46:08 INFO mapreduce.Job: Job job_1599487765089_0041 running in uber
mode : false
20/09/22 01:46:08 INFO mapreduce.Job: map 0% reduce 0%
20/09/22 01:46:20 INFO mapreduce.Job: map 25% reduce 0%
20/09/22 01:46:22 INFO mapreduce.Job: map 50% reduce 0%
20/09/22 01:46:23 INFO mapreduce.Job: map 100% reduce 0%
20/09/22 01:46:28 INFO mapreduce.Job: map 100% reduce 100%
20/09/22 01:46:29 INFO mapreduce.Job: Job job_1599487765089_0041 completed
successfully
20/09/22 01:46:29 INFO mapreduce.Job: Counters: 54
File System Counters
FILE: Number of bytes read=314816
FILE: Number of bytes written=1803560
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=1401923
HDFS: Number of bytes written=161660
HDFS: Number of read operations=17
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=4
Launched reduce tasks=1
Data-local map tasks=4
Total time spent by all maps in occupied slots (ms)=211835
Total time spent by all reduces in occupied slots (ms)=25625
Total time spent by all map tasks (ms)=42367
Total time spent by all reduce tasks (ms)=5125
Total vcore-milliseconds taken by all map tasks=42367
Total vcore-milliseconds taken by all reduce tasks=5125
Total megabyte-milliseconds taken by all map tasks=216919040
Total megabyte-milliseconds taken by all reduce tasks=26240000
Map-Reduce Framework
Map input records=24960
Map output records=237985
Map output bytes=2268163
Map output materialized bytes=314834
Input split bytes=579
Combine input records=237985
Combine output records=21832
Reduce input groups=14870
Reduce shuffle bytes=314834
Reduce input records=21832
Reduce output records=14870
Spilled Records=43664
Shuffled Maps =4
Failed Shuffles=0
Merged Map outputs=4
GC time elapsed (ms)=2480
CPU time spent (ms)=29210
Physical memory (bytes) snapshot=10452713472
Virtual memory (bytes) snapshot=31970623488
Total committed heap usage (bytes)=12929990656
Peak Map Physical memory (bytes)=2567954432
Peak Map Virtual memory (bytes)=6394359808
Peak Reduce Physical memory (bytes)=328450048
Peak Reduce Virtual memory (bytes)=6417879040
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
WordCount2$TokenizerMapper$CountersEnum
INPUT_WORDS=237985
File Input Format Counters
Bytes Read=1401344
File Output Format Counters
Bytes Written=161660


Note

• If the Gutenberg folder does not exist in your HDFS directory, run the steps in Exercise 3.
• You can run the same program with the following yarn command. But before you run it, clean
up the output directory wc2out by running hdfs dfs -rm -R wc2out.
yarn jar WC2.jar WordCount2 -D wordcount.case.sensitive=false Gutenberg/*.txt
wc2out -skip patternsToSkip

10. Notice how many mappers and reducers ran for this job: see the Launched map tasks and
Launched reduce tasks lines under the Job Counters section of the command results.
11. List the generated files by running the command:
hdfs dfs -ls wc2out
The result is similar to the following output.
[student0000@dataengineer ~]$ hdfs dfs -ls wc2out
Found 2 items
-rw-r--r-- 3 student0000 hdfs 0 2020-09-22 01:46 wc2out/_SUCCESS
-rw-r--r-- 3 student0000 hdfs 161660 2020-09-22 01:46 wc2out/part-r-00000
12. Explore the results of the program that are generated in the file part-r-00000.
hdfs dfs -cat wc2out/part-r-00000 | more
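Each line of part-r-00000 is one <key><TAB><value> pair: a lowercased word and its total count, sorted by key. As a rough analogy for what the mappers, shuffle, and reducer just computed, the same shape can be produced from a toy input with plain shell tools:

```shell
# Toy analogue of WordCount: tokenize, group, count, and emit
# "<word><TAB><count>" lines sorted by key, like part-r-00000.
printf 'the quick fox the lazy fox the\n' |
  tr ' ' '\n' |   # map: emit one word per line
  sort |          # shuffle/sort: bring identical keys together
  uniq -c |       # reduce: count each run of identical keys
  awk '{printf "%s\t%s\n", $2, $1}'
```

This prints fox 2, lazy 1, quick 1, the 3 (tab-separated); the real job additionally lowercases its input and skips the patterns you uploaded.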


Scroll through the file by pressing Enter and look at the output pages.
13. Enter q to quit.
14. Clean up the output directory wc2out by running the command:
hdfs dfs -rm -R wc2out
The result is similar to the following output.
[student0000@dataengineer ~]$ hdfs dfs -rm -R wc2out
Deleted wc2out

End of exercise


Exercise review and wrap-up


In this exercise, you compiled and ran a more complete version of the WordCount program and
experienced some features offered by the MapReduce framework.

