Big Data Analytics [15CS82]
MODULE 1
Chapter 2: Running Example Programs and Benchmarks
When using new or updated hardware or software, simple examples and benchmarks help
confirm proper operation. Apache Hadoop includes many examples and benchmarks to aid
in this task. This chapter provides instructions on how to run, monitor, and manage some
basic MapReduce examples and benchmarks.
Running MapReduce Examples
All Hadoop releases come with MapReduce example applications. Running the existing
MapReduce examples is a simple process once the example files are located. For
example, if you installed Hadoop version 2.6.0 from the Apache sources under
/opt, the examples will be in the following directory:
/opt/hadoop-2.6.0/share/hadoop/mapreduce/
Once you define the examples path, you can run the Hadoop examples using the commands
discussed in the following sections.
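For convenience, define an environment variable that points to the examples directory. The path below matches the Hadoop 2.6.0 installation described above; note that in some distributions the examples jar carries a version suffix (e.g., hadoop-mapreduce-examples-2.6.0.jar). Running the jar with no arguments prints the list of available example programs:
$ export HADOOP_EXAMPLES=/opt/hadoop-2.6.0/share/hadoop/mapreduce
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar
Among the examples listed are the following: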
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input
files.
wordmedian: A map/reduce program that counts the median length of the words in the input
files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the
length of the words in the input files.
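Any of these can be run directly once input data exist in HDFS. As a minimal sketch, the following runs wordcount on a hypothetical input directory (the /user/hdfs paths here are illustrative, and the output directory must not already exist):
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar wordcount /user/hdfs/input /user/hdfs/output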
To illustrate several features of Hadoop and the YARN ResourceManager service GUI, the
pi and terasort examples are presented next.
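The pi example estimates the value of pi using a quasi-Monte Carlo method spread across map tasks. The following is a sketch of an invocation consistent with the run shown in the figures below (16 map tasks, as reported later in this chapter; the samples-per-map value is illustrative):
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 1000000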
Figure 2.1 Hadoop RUNNING Applications web GUI for the pi example
For those readers who have used or read about Hadoop version 1, if you look at the Cluster
Metrics table, you will see some new information. First, you will notice that the
“Map/Reduce Task Capacity” has been replaced by the number of running containers. If
YARN is running a MapReduce job, these containers can be used for both map and reduce
tasks. Unlike in Hadoop version 1, the number of mappers and reducers is not fixed. There
are also memory metrics and links to node status. If you click on the Nodes link (left menu
under About), you can get a summary of the node activity and state. For example, Figure 2.2
is a snapshot of the node activity while the pi application is running. Notice the number of
containers, which are used by the MapReduce framework as either mappers or reducers.
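The same node summary can also be pulled from the command line with the YARN CLI:
$ yarn node -list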
Going back to the main Applications/Running window (Figure 2.1), if you click on the
application_14299… link, the Application status window in Figure 2.3 will appear. This
window provides an application overview and metrics, including the cluster node on which
the Application Master container is running.
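A command-line view of the same overview is available through the YARN CLI; for example, using the application-id shown later in this chapter:
$ yarn application -status application_1429912013449_0044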
Clicking the Application Master link next to “Tracking URL:” in Figure 2.3 leads to the
window shown in Figure 2.4. Note that the link to the application’s Application Master is
also found in the last column on the main Running Applications screen shown in Figure 2.1.
In the MapReduce Application window, you can see the details of the MapReduce
application and the overall progress of mappers and reducers. Instead of containers, the
MapReduce application now refers to maps and reducers. Clicking job_14299… brings up
the window shown in Figure 2.5. This window displays more detail about the number of
pending, running, completed, and failed mappers and reducers, including the elapsed time
since the job started.
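The MapReduce framework reports similar job-level progress on the command line; as a sketch, using the job-id referenced below:
$ mapred job -status job_1429912013449_0044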
Figure 2.4 Hadoop YARN Application Master for MapReduce application
Going back to the job summary page (Figure 2.6), you can also examine the logs for the
Application Master by clicking the “logs” link. To find information about the mappers and
reducers, click the numbers under the Failed, Killed, and Successful columns. In this
example, there were 16 successful mappers and one successful reducer. All the numbers in
these columns lead to more information about individual map or reduce processes.
For instance, clicking the “16” under “Successful” in Figure 2.6 displays the table of map
tasks in Figure 2.8. The metrics for the Application Master container are displayed in table
form. There is also a link to the log file for each process (in this case, a map process).
Viewing the logs requires that the yarn.log-aggregation-enable property in the yarn-site.xml
file be set to true.
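When log aggregation is enabled, the aggregated logs can also be retrieved from the command line once the application finishes; for example, using this chapter’s application-id:
$ yarn logs -applicationId application_1429912013449_0044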
If you return to the main cluster window (Figure 2.1), choose Applications/Finished, and
then select our application, you will see the summary page shown in Figure 2.9.
There are a few things to notice in the previous windows. First, because YARN manages
applications, all information reported by the ResourceManager concerns the resources
provided and the application type (in this case, MAPREDUCE). In Figure
2.1 and Figure 2.4, the YARN ResourceManager refers to the pi example by its application-id
(application_1429912013449_0044). YARN has no data about the actual application other
than the fact that it is a MapReduce job. Data from the actual MapReduce job are provided by
the MapReduce framework and referenced by a job-id (job_1429912013449_0044) in
Figure 2.6. Thus, two clearly different data streams are combined in the web GUI: YARN
applications and MapReduce framework jobs. If the framework does not provide job
information, then certain parts of the web GUI will not have anything to display.
Another interesting aspect of the previous windows is the dynamic nature of the mapper and
reducer tasks. These tasks are executed as YARN containers, and their number will change as
the application runs. Users may request specific numbers of mappers and reducers, but the
ApplicationMaster uses them in a dynamic fashion. As mappers complete, the
ApplicationMaster will return the containers to the ResourceManager and request a smaller
number of reducer containers. This feature provides for much better cluster utilization
because mappers and reducers are dynamic—rather than fixed—resources.
Running the terasort Benchmark
This benchmark provides combined testing of the HDFS and MapReduce layers of a Hadoop
cluster. A full terasort benchmark run consists of the following three steps:
1. Generating the input data via the teragen program.
2. Running the actual terasort benchmark on the input data.
3. Validating the sorted output data via the teravalidate program.
In general, each row is 100 bytes long; thus the total amount of data written is
100 times the number of rows specified as part of the benchmark (i.e., to write 100GB of
data, use 1 billion rows). The input and output directories need to be specified in HDFS. The
following sequence of commands will run the benchmark for 50GB of data as user hdfs.
Make sure the /user/hdfs directory exists in HDFS before running the benchmarks.
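If needed, the directory can be created (as the hdfs user) with:
$ hdfs dfs -mkdir -p /user/hdfs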
1. Run teragen to generate rows of random data to sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB
2. Run terasort to sort the database.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
3. Run teravalidate to validate the sort.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraValid-50GB
To report results, the time for the actual sort (terasort) is measured and the benchmark rate in
megabytes/second (MB/s) is calculated. For best performance, the actual terasort benchmark
should be run with a replication factor of 1. In addition, the default number of terasort reducer
tasks is set to 1. Increasing the number of reducers often helps with benchmark performance.
For example, the following command will instruct terasort to use four reducer tasks:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
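To measure the sort time yourself, one simple approach (a sketch using the shell’s built-in time command) is to time the terasort step and divide the data size by the elapsed wall-clock seconds:
$ time yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort -Dmapred.reduce.tasks=4 /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
Taking 50GB as 50,000 MB, a sort that takes, say, 500 seconds corresponds to a rate of 50,000/500 = 100 MB/s.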
Also, do not forget to clean up the terasort data between runs (and after testing is finished).
The following command will perform the cleanup for the previous example:
$ hdfs dfs -rm -r -skipTrash Tera*