
Big Data Analytics [15CS82]

MODULE 1
Chapter 2: Running Example Programs and Benchmarks

When using new or updated hardware or software, simple examples and benchmarks help
confirm proper operation. Apache Hadoop includes many examples and benchmarks to aid
in this task. This chapter provides instructions on how to run, monitor, and manage some
basic MapReduce examples and benchmarks.
Running MapReduce Examples

All Hadoop releases come with MapReduce example applications. Running the existing
MapReduce examples is a simple process—once the example files are located, that is. For
example, if you installed Hadoop version 2.6.0 from the Apache sources under
/opt, the examples will be in the following directory:
/opt/hadoop-2.6.0/share/hadoop/mapreduce/

In other versions, the examples may be in /usr/lib/hadoop-mapreduce/ or some other location.


The exact location of the example jar file can be found using the find command:
$ find / -name "hadoop-mapreduce-examples*.jar" -print
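If the jar turns up in an unexpected place, the directory portion of the first match can also be captured into a shell variable. This is only a sketch, assuming GNU find (the -quit option stops after the first match) and a hypothetical variable name EXAMPLES_DIR; permission errors from scanning system directories are discarded:
$ EXAMPLES_DIR=$(dirname "$(find / -name 'hadoop-mapreduce-examples*.jar' -print -quit 2>/dev/null)")
$ echo $EXAMPLES_DIR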

For this chapter the following software environment will be used:


OS: Linux
Platform: RHEL 6.6
Hortonworks HDP 2.2 with Hadoop Version: 2.6
In this environment, the location of the examples is /usr/hdp/2.2.4.2-2/hadoop-mapreduce.
For the purposes of this example, an environment variable called HADOOP_EXAMPLES
can be defined as follows:
$ export HADOOP_EXAMPLES=/usr/hdp/2.2.4.2-2/hadoop-mapreduce

Once you define the examples path, you can run the Hadoop examples using the commands
discussed in the following sections.


Listing Available Examples


A list of the available examples can be found by running the following command. In some
cases, the version number may be part of the jar file (e.g., in the version 2.6 Apache sources,
the file is named hadoop-mapreduce-examples-2.6.0.jar).
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar
Note
In previous versions of Hadoop, the command hadoop jar . . . was used to run MapReduce
programs. Newer versions provide the yarn command, which offers more capabilities. Both
commands will work for these examples.
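For instance, either of the following invocations prints the list of example programs shown below; only the launcher command differs:
$ hadoop jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar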
The possible examples are as follows:
An example program must be given as the first argument. Valid program names are:
aggregatewordcount: An Aggregate-based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate-based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that counts the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets.
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile-laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort.
terasort: Run the terasort.
teravalidate: Check the results of terasort.
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
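Each of the programs above is invoked the same way: its name is passed as the first argument to the examples jar, followed by the program's own arguments. As a sketch, wordcount could be run against hypothetical HDFS directories (the input directory must contain text files and the output directory must not already exist):
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar wordcount /user/hdfs/input /user/hdfs/wordcount-output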

To illustrate several features of Hadoop and the YARN Resource Manager service GUI, the
pi and terasort examples are presented next.

Running the Pi Example


The pi example calculates the digits of pi (π) using a quasi-Monte Carlo method. If you have not
added users to HDFS (see Chapter 10, "Basic Hadoop Administration Procedures"), run
these tests as user hdfs. To run the pi example with 16 maps and 1,000,000 samples per map,
enter the following command:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 1000000
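While the job runs (and after it completes), its state can also be checked from the command line with the yarn application command before turning to the web GUI described next. A minimal sketch, where <application-id> stands for the id reported for your job:
$ yarn application -list
$ yarn application -status <application-id>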

Using the Web GUI to Monitor Examples


This section provides an illustration of using the YARN ResourceManager web GUI to
monitor and find information about YARN jobs. The Hadoop version 2 YARN ResourceManager
web GUI differs significantly from the MapReduce web GUI found in Hadoop
version 1. Figure 2.1 shows the main YARN web interface. The cluster metrics are displayed
in the top row, while the running applications are displayed in the main table. A menu on
the left provides navigation to the nodes table, various job categories (e.g., New, Accepted,
Running, Finished, Failed), and the Capacity Scheduler (covered in Chapter 10, "Basic
Hadoop Administration Procedures"). This interface can be opened directly from the
Ambari YARN service Quick Links menu or by entering http://hostname:8088 into a
local web browser. For this example, the pi application is used. Note that the application can
run quickly and may finish before you have fully explored the GUI. A longer-running
application, such as terasort, may be helpful when exploring all the various links in the GUI.


Figure 2.1 Hadoop RUNNING Applications web GUI for the pi example

For those readers who have used or read about Hadoop version 1, if you look at the Cluster
Metrics table, you will see some new information. First, you will notice that the
“Map/Reduce Task Capacity” has been replaced by the number of running containers. If
YARN is running a MapReduce job, these containers can be used for both map and reduce
tasks. Unlike in Hadoop version 1, the number of mappers and reducers is not fixed. There
are also memory metrics and links to node status. If you click on the Nodes link (left menu
under About), you can get a summary of the node activity and state. For example, Figure 2.2
is a snapshot of the node activity while the pi application is running. Notice the number of
containers, which are used by the MapReduce framework as either mappers or reducers.
Going back to the main Applications/Running window (Figure 2.1), if you click on the
application_14299… link, the Application status window in Figure 2.3 will appear. This
window provides an application overview and metrics, including the cluster node on which
the Application Master container is running.
Clicking the Application Master link next to "Tracking URL:" in Figure 2.3 leads to the
window shown in Figure 2.4. Note that the link to the application's Application Master is
also found in the last column on the main Running Applications screen shown in Figure 2.1.
In the MapReduce Application window, you can see the details of the MapReduce
application and the overall progress of mappers and reducers. Instead of containers, the
MapReduce application now refers to maps and reducers. Clicking job_14299… brings up
the window shown in Figure 2.5. This window displays more detail about the number of
pending, running, completed, and failed mappers and reducers, including the elapsed time
since the job started.

Figure 2.2 Hadoop YARN Resource Manager nodes status window

Figure 2.3 Hadoop YARN application status for the pi example


Figure 2.4 Hadoop YARN Application Master for MapReduce application

Figure 2.5 Hadoop YARN MapReduce job progress


The status of the job in Figure 2.5 will be updated as the job progresses (the window needs
to be refreshed manually). The ApplicationMaster collects and reports the progress of each
mapper and reducer task. When the job is finished, the window is updated to that shown in
Figure 2.6. It reports the overall run time and provides a breakdown of the timing of the key
phases of the MapReduce job (map, shuffle, merge, reduce).
If you click the node used to run the Application Master (n0:8042 in Figure 2.6), the window
in Figure 2.7 opens and provides a summary from the Node Manager on node n0. Again, the
Node Manager tracks only containers; the actual tasks running in the containers are
determined by the Application Master.


Going back to the job summary page (Figure 2.6), you can also examine the logs for the
Application Master by clicking the "logs" link. To find information about the mappers and
reducers, click the numbers under the Failed, Killed, and Successful columns. In this
example, there were 16 successful mappers and one successful reducer. All the numbers in
these columns lead to more information about individual map or reduce processes.

For instance, clicking the "16" under "Successful" in Figure 2.6 displays the table of map
tasks in Figure 2.8. The metrics for the Application Master container are displayed in table
form. There is also a link to the log file for each process (in this case, a map process).
Viewing the logs requires that the yarn.log-aggregation-enable variable in the yarn-site.xml
file be set.
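When log aggregation is enabled, the same logs can also be retrieved from the command line after the application finishes. As a sketch, using the application-id from this example:
$ yarn logs -applicationId application_1429912013449_0044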

Figure 2.6 Hadoop YARN completed MapReduce job summary


Figure 2.7 Hadoop YARN Node Manager for n0 job summary

Figure 2.8 Hadoop YARN MapReduce logs available for browsing

Figure 2.9 Hadoop YARN application summary page

If you return to the main cluster window (Figure 2.1), choose Applications/Finished, and
then select our application, you will see the summary page shown in Figure 2.9.
There are a few things to notice in the previous windows. First, because YARN manages
applications, all information reported by the ResourceManager concerns the resources
provided and the application type (in this case, MAPREDUCE). In Figure 2.1 and Figure 2.4,
the YARN ResourceManager refers to the pi example by its application-id
(application_1429912013449_0044). YARN has no data about the actual application other
than the fact that it is a MapReduce job. Data from the actual MapReduce job are provided by
the MapReduce framework and referenced by a job-id (job_1429912013449_0044) in
Figure 2.6. Thus, two clearly different data streams are combined in the web GUI: YARN
applications and MapReduce framework jobs. If the framework does not provide job
information, then certain parts of the web GUI will not have anything to display.

Another interesting aspect of the previous windows is the dynamic nature of the mapper and
reducer tasks. These tasks are executed as YARN containers, and their number will change as
the application runs. Users may request specific numbers of mappers and reducers, but the
ApplicationMaster uses them in a dynamic fashion. As mappers complete, the
ApplicationMaster will return the containers to the ResourceManager and request a smaller
number of reducer containers. This feature provides for much better cluster utilization
because mappers and reducers are dynamic rather than fixed resources.

Running Basic Hadoop Benchmarks


Many Hadoop benchmarks can provide insight into cluster performance. The best
benchmarks are always those that reflect real application performance. The two benchmarks
discussed in this section, terasort and TestDFSIO, provide a good sense of how well
your Hadoop installation is operating and can be compared with public data published for
other Hadoop systems. The results, however, should not be taken as a single indicator for
system-wide performance on all applications.
The following benchmarks are designed for full Hadoop cluster installations. These tests
assume a multi-disk HDFS environment. Running these benchmarks in the Hortonworks
Sandbox or in the pseudo-distributed single-node install from Chapter 2 is not recommended
because all input and output (I/O) is done using a single system disk drive.
Running the Terasort Test
The terasort benchmark sorts a specified amount of randomly generated data.


This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop
cluster. A full terasort benchmark run consists of the following three steps:
1. Generating the input data via the teragen program.
2. Running the actual terasort benchmark on the input data.
3. Validating the sorted output data via the teravalidate program.
In general, each row is 100 bytes long; thus the total amount of data written is
100 times the number of rows specified as part of the benchmark (i.e., to write 100GB of
data, use 1 billion rows). The input and output directories need to be specified in HDFS. The
following sequence of commands will run the benchmark for 50GB of data as user hdfs.
Make sure the /user/hdfs directory exists in HDFS before running the benchmarks.
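If the directory is missing, it can be created first. A minimal sketch, assuming you can run commands as the HDFS superuser; adjust the ownership to match your site's conventions:
$ hdfs dfs -mkdir -p /user/hdfs
$ hdfs dfs -chown hdfs:hdfs /user/hdfs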
1. Run teragen to generate rows of random data to sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000
/user/hdfs/TeraGen-50GB
2. Run terasort to sort the database.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort
/user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
3. Run teravalidate to validate the sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate
/user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB

To report results, the time for the actual sort (terasort) is measured and the benchmark rate in
megabytes/second (MB/s) is calculated. For best performance, the actual terasort benchmark
should be run with a replication factor of 1. In addition, the default number of terasort reducer
tasks is set to 1. Increasing the number of reducers often helps with benchmark performance.
For example, the following command will instruct terasort to use four reducer tasks:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort
-Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
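One simple way to capture the sort time is the shell's time builtin. A sketch for the 50GB run above (roughly 50,000 MB of data); the MB/s rate is the data size divided by the measured wall-clock seconds:
$ time yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
As a purely hypothetical illustration, a sort that finished in 500 seconds would correspond to about 50,000 MB / 500 s = 100 MB/s.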

Also, do not forget to clean up the terasort data between runs (and after testing is finished).
The following command will perform the cleanup for the previous example:
$ hdfs dfs -rm -r -skipTrash Tera*


Running the TestDFSIO Benchmark


Hadoop also includes an HDFS benchmark application called TestDFSIO. The TestDFSIO
benchmark is a read and write test for HDFS. That is, it will write or read a number of files to
and from HDFS and is designed in such a way that it will use one map task per file. The file
size and number of files are specified as command-line arguments. Similar to the terasort
benchmark, you should run this test as user hdfs.
Similar to terasort, TestDFSIO has several steps. In the following example,
16 files of size 1GB are specified. Note that the TestDFSIO benchmark is part of the
hadoop-mapreduce-client-jobclient.jar. Other benchmarks are also available as part of this jar file.
Running it with no arguments will yield a list. In addition to TestDFSIO, NNBench (load
testing the NameNode) and MRBench (load testing the MapReduce framework) are
commonly used Hadoop benchmarks. Nevertheless, TestDFSIO is perhaps the most widely
reported of these benchmarks. The steps to run TestDFSIO are as follows:
1. Run TestDFSIO in write mode and create data.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar
TestDFSIO -write -nrFiles 16 -fileSize 1000
2. Run TestDFSIO in read mode.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar
TestDFSIO -read -nrFiles 16 -fileSize 1000
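3. When testing is finished, the TestDFSIO data (typically stored under /benchmarks/TestDFSIO in HDFS) can be removed with the benchmark's clean mode. A sketch, assuming the same jar and environment variable as above:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean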

Managing Hadoop MapReduce Jobs


Hadoop MapReduce jobs can be managed using the mapred job command. The most
important options for this command in terms of the examples and benchmarks are
-list, -kill, and -status. In particular, if you need to kill one of the examples or benchmarks,
you can use the mapred job -list command to find the job-id and then use mapred job -kill
<job-id> to kill the job across the cluster. MapReduce jobs can also be controlled at the
application level with the yarn application command.
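As a minimal sketch of such a session, using the job-id and application-id of the pi example shown earlier (your ids will differ):
$ mapred job -list
$ mapred job -status job_1429912013449_0044
$ mapred job -kill job_1429912013449_0044
$ yarn application -kill application_1429912013449_0044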
