Department of CSE
BIG DATA ANALYTICS
Module 1
The Hadoop Distributed File System (HDFS) was designed for Big Data processing. The design
assumes a large file write-once/read-many model that enables other optimizations and relaxes many of
the concurrency and coherence overhead requirements of a true parallel file system.
The design of HDFS is based on the design of the Google File System (GFS). HDFS is
designed for data streaming where large amounts of data are read from disk in bulk. The HDFS block
size is typically 64MB or 128MB. Thus, this approach is entirely unsuitable for standard POSIX file
system use. The large block and file sizes make it more efficient to reread data from HDFS than to
try to cache the data.
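The block size is a cluster-wide setting in hdfs-site.xml. A brief illustration of the property (the 128MB value shown here, 134217728 bytes, is an assumption; actual values depend on the installation):
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>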
A principal design aspect of Hadoop MapReduce is the emphasis on moving the computation
to the data rather than moving the data to the computation. This distinction is reflected in how
Hadoop clusters are implemented. In other high-performance systems, a parallel file system will
exist on hardware separate from the compute hardware. Data is then moved to and from the
compute components via high-speed interfaces to the parallel file system array.
Finally, Hadoop clusters assume node (and even rack) failure will occur at some point. To
deal with this situation, HDFS has a redundant design that can tolerate system failure and still
provide the data needed by the compute part of the program.
Converged data storage and processing happen on the same server nodes.
"Moving computation is cheaper than moving data."
A reliable file system maintains multiple copies of data across the cluster.
A specialized file system is used, which is not designed for general use.
HDFS Components
The design of HDFS is based on two types of nodes: a NameNode and multiple DataNodes. A single
NameNode manages all the metadata needed to store and retrieve the actual data from the DataNodes.
No data is actually stored on the NameNode, however. For a minimal Hadoop installation, there
needs to be a single NameNode daemon and a single DataNode daemon running on at least
one machine.
The design is a master/slave architecture in which the master (NameNode) manages the file
system namespace and regulates access to files by clients. File system namespace operations such as
opening, closing, and renaming files and directories are all managed by the NameNode. The
NameNode also determines the mapping of blocks to DataNodes and handles DataNode failures.
The slaves (DataNodes) are responsible for serving read and write requests from the file
system to the clients. The NameNode manages block creation, deletion, and replication.
An example of the client/NameNode/DataNode interaction is provided in Figure 1.1. When a client
writes data, it first communicates with the NameNode and requests to create a file.
The NameNode determines how many blocks are needed and provides the client with the DataNodes
that will store the data. As part of the storage process, the data blocks are replicated after they are
written to the assigned node. Depending on how many nodes are in the cluster, the NameNode will
attempt to write replicas of the data blocks on nodes that are in other separate racks. If there is only one
rack, then the replicated blocks are written to other servers in the same rack. After the DataNode
acknowledges that the file block replication is complete, the client closes the file and informs the
NameNode that the operation is complete. Note that the NameNode does not write any data directly
to the DataNodes. It does, however, give the client a limited amount of time to complete the
operation. If it does not complete in the time period, the operation is canceled.
Reading data happens in a similar fashion. The client requests a file from the NameNode,
which returns the best DataNodes from which to read the data. The client then accesses the data
directly from the DataNodes.
Thus, once the metadata has been delivered to the client, the NameNode steps back and lets
the conversation between the client and the DataNodes proceed. While data transfer is progressing,
the NameNode also monitors the DataNodes by listening for heartbeats sent from DataNodes. The
lack of a heartbeat signal indicates a potential node failure. In such a case, the NameNode will route
around the failed DataNode and begin re-replicating the now-missing blocks. Because the file system
is redundant, DataNodes can be taken offline (decommissioned) for maintenance by informing the
NameNode of the DataNodes to exclude from the HDFS pool.
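A typical decommissioning sequence adds the DataNode host names to the exclude file referenced by the dfs.hosts.exclude property in hdfs-site.xml and then tells the NameNode to reread its host lists (a sketch; the exclude file path and host name are assumptions):
$ echo "dn-host-04" >> /etc/hadoop/conf/dfs.exclude
$ hdfs dfsadmin -refreshNodes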
The mappings between data blocks and the physical DataNodes are not kept in persistent
storage on the NameNode. For performance reasons, the NameNode stores all metadata in memory.
Upon startup, each DataNode provides a block report (which it keeps in persistent storage) to the
NameNode. The block reports are sent every 10 heartbeats.
In almost all Hadoop deployments, there is a SecondaryNameNode. While not explicitly
required by the NameNode, it is highly recommended. The term "SecondaryNameNode" (now called
CheckPointNode) is somewhat misleading. It is not an active failover node and cannot replace the
primary NameNode in case of its failure.
The purpose of the SecondaryNameNode is to perform periodic checkpoints that evaluate the
status of the NameNode. Recall that the NameNode keeps all system metadata in memory for fast
access. It also has two disk files that track changes to the metadata:
An image of the file system state when the NameNode was started. This file begins with
fsimage_* and is used only at startup by the NameNode.
A series of modifications done to the file system after starting the NameNode. These files
begin with edits_* and reflect the changes made after the fsimage_* file was read.
The location of these files is set by the dfs.namenode.name.dir property in the hdfs-site.xml file.
HDFS Safe Mode
When the NameNode starts, it enters a read-only mode called Safe Mode, during which two important processes take place:
1. The previous file system state is reconstructed by loading the fsimage_* file into memory and
replaying the edit log.
2. The mapping between blocks and DataNodes is created by waiting for enough of the DataNodes to
register so that at least one copy of the data is available. Not all DataNodes are required to register
before HDFS exits from Safe Mode. The registration process may continue for some time.
HDFS may also enter Safe Mode for maintenance using the hdfs dfsadmin -safemode command or
when there is a file system issue that must be addressed by the administrator.
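For example, an administrator can check or toggle Safe Mode from the command line:
$ hdfs dfsadmin -safemode get
$ hdfs dfsadmin -safemode enter
$ hdfs dfsadmin -safemode leave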
Rack Awareness
Rack awareness deals with data locality. One of the main design goals of Hadoop MapReduce is to move
the computation to the data.
Three levels of data locality:
1. Data resides on the local machine (best).
2. Data resides on the same rack (better).
3. Data resides in a different rack (good).
When the YARN scheduler is assigning MapReduce containers to work as mappers, it will try to
place the container first on the local machine, then on the same rack, and finally on another rack.
In addition, the NameNode tries to place replicated data blocks on multiple racks for improved fault
tolerance. In such a case, an entire rack failure will not cause data loss or stop HDFS from working.
Performance may be degraded, however.
HDFS can be made rack-aware by using a user-derived script that enables the master node to
map the network topology of the cluster.
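In Hadoop version 2, the script is registered through the net.topology.script.file.name property in core-site.xml; it receives host names or IP addresses and prints the corresponding rack IDs (the script path shown is an assumption):
<property>
  <name>net.topology.script.file.name</name>
  <value>/etc/hadoop/conf/topology.sh</value>
</property>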
NameNode High Availability
With early Hadoop installations, the NameNode was a single point of failure that could bring down
the entire Hadoop cluster. NameNode hardware often employed redundant power supplies and
storage to guard against such problems, but it was still susceptible to other failures. The solution was
to implement NameNode High Availability (HA) as a means to provide true failover service.
As shown in Figure 1.3, an HA Hadoop cluster has two (or more) separate NameNode machines.
Each machine is configured with exactly the same software. One of the NameNode machines is in
the Active state and the other is in the Standby state. Like a single NameNode cluster, the Active
NameNode is responsible for all client HDFS operations in the cluster. The Standby NameNode
maintains enough state to provide a fast failover.
To guarantee the file system state is preserved, both the Active and Standby NameNodes
receive block reports from the DataNodes. The Active node also sends all file system edits to a
quorum of Journal nodes. At least three physically separate JournalNode daemons are required,
because edit log modifications must be written to a majority of the JournalNodes. This design will
enable the system to tolerate the failure of a single JournalNode machine. The Standby node
continuously reads the edits from the JournalNodes to ensure its namespace is synchronized with that
of the Active node. In the event of an Active NameNode failure, the Standby node reads all
remaining edits from the JournalNodes before promoting itself to the Active state.
Apache ZooKeeper is used to monitor the NameNode health. Zookeeper is a highly available
service for maintaining small amounts of coordination data, notifying clients of changes in that data,
and monitoring clients for failures. HDFS failover relies on Zookeeper for failure detection and for
Standby to Active NameNode election.
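The NameNode states can be inspected, and a manual failover exercised, with the hdfs haadmin command (the service IDs nn1 and nn2 are assumptions that depend on the dfs.ha.namenodes.* configuration):
$ hdfs haadmin -getServiceState nn1
$ hdfs haadmin -getServiceState nn2
$ hdfs haadmin -failover nn1 nn2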
HDFS NameNode Federation
NameNode Federation enables an HDFS cluster to use more than one NameNode, each of which manages a portion of the file system namespace. The key benefits of NameNode Federation are as follows.
1. Namespace scalability: HDFS cluster storage scales horizontally without placing a burden on
the NameNode.
2. Better performance: Adding more NameNodes to the cluster scales the file system read/write
operation throughput by separating the total namespace.
3. System isolation: Multiple NameNodes enable different categories of applications to be
distinguished, and users can be isolated to different namespaces.
Figure 1.4 illustrates how HDFS NameNode Federation is accomplished. NameNode1 manages the
/research and /marketing namespaces and NameNode 2 manages the /data and /project namespaces.
HDFS Snapshots
HDFS snapshots are similar to backups, but are created by administrators using the
hdfs dfs -createSnapshot command. HDFS snapshots are read-only point-in-time copies of the file system.
They offer the following features (example commands are shown after the list):
Snapshots can be taken of a sub-tree of the file system or the entire file system.
Snapshots can be used for data backup, protection against user errors, and disaster recovery.
Snapshot creation is instantaneous.
Blocks on the DataNodes are not copied, because the snapshot files record the block list and
the file size. There is no data copying, although it appears to the user that there are duplicate
files.
Snapshots do not adversely affect regular HDFS operations.
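As a brief illustration, an administrator first marks a directory as snapshottable, after which snapshots can be created and browsed under the read-only .snapshot directory (the path and snapshot name are illustrative):
$ hdfs dfsadmin -allowSnapshot /user/hdfs/war-and-peace-input
$ hdfs dfs -createSnapshot /user/hdfs/war-and-peace-input wapi-snap-1
$ hdfs dfs -ls /user/hdfs/war-and-peace-input/.snapshot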
HDFS NFS Gateway
The HDFS NFS Gateway supports NFSv3 and enables HDFS to be mounted as part of the client's
local file system. Users can browse the HDFS file system through their local file system from any
operating system that provides an NFSv3-compatible client. This feature offers users the following
capabilities (an example mount command follows the list):
Users can easily download/upload files from/to the HDFS file system to/from their local file
system.
Users can stream data directly to HDFS through the mount point. Appending to a file is
supported, but random write capability is not supported.
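A typical client-side mount of the gateway looks like the following sketch (the gateway host name and mount point are assumptions; the options follow the NFSv3 gateway documentation):
$ sudo mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync gateway-host:/ /mnt/hdfs
$ ls /mnt/hdfs/user/hdfs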
Command Reference
The preferred way to interact with HDFS in Hadoop version 2 is through the hdfs command. In
version 1, the hadoop dfs command was used to manage files in HDFS.
Usage: hdfs [--config confdir] COMMAND
where COMMAND is one of:
dfs run a file system command on the file systems supported in Hadoop.
namenode –format format the DFS file system
secondarynamenode run the DFS secondary namenode
namenode run the DFS namenode
journalnode run the DFS journalnode
zkfc run the ZK Failover Controller daemon
datanode run a DFS datanode
dfsadmin run a DFS admin client
haadmin run a DFS HA admin client
fsck run a DFS file system checking utility
balancer run a cluster balancing utility
jmxget get JMX exported values from NameNode or DataNode
mover run a utility to move block replicas across storage types
oiv apply the offline fsimage viewer to an fsimage
oiv_legacy apply the offline fsimage viewer to a legacy fsimage
oev apply the offline edits viewer to an edits file
fetchdt fetch a delegation token from the NameNode
getconf get config values from configuration
HDFS provides a series of commands similar to those found in a standard POSIX file system.
A list of those commands can be obtained by issuing the following command. Several of these
commands will be highlighted here under the user account hdfs.
$ hdfs dfs
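For example, a few commonly used file operations are listed below (the paths are illustrative):
$ hdfs dfs -ls /
$ hdfs dfs -mkdir stuff
$ hdfs dfs -put test stuff
$ hdfs dfs -get stuff/test test-local
$ hdfs dfs -rm -r -skipTrash stuff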
The HDFSClient.java example (adapted from the HadoopDFSFileReadWrite.java example) can be compiled on Linux systems using the following steps.
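The full HDFSClient.java listing is not reproduced here. At its core, such a client uses the Hadoop FileSystem API; a minimal sketch of the write and read calls follows (the class name HDFSClientSketch and the file path are illustrative assumptions, not the actual example source):
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HDFSClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();       // picks up core-site.xml/hdfs-site.xml from the classpath
    FileSystem fs = FileSystem.get(conf);           // handle to the configured file system (HDFS)
    Path file = new Path("/user/hdfs/NOTES.txt");   // illustrative HDFS path

    FSDataOutputStream out = fs.create(file);       // "add": create the file in HDFS
    out.writeUTF("Hello HDFS");
    out.close();

    FSDataInputStream in = fs.open(file);           // "read": open the file and read it back
    System.out.println(in.readUTF());
    in.close();
    fs.close();
  }
}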
First, create a directory to hold the classes:
$ mkdir HDFSClient-classes
Next, compile the program using `hadoop classpath` to ensure all the class paths are available:
$ javac -cp `hadoop classpath` -d HDFSClient-classes HDFSClient.java
Finally, create a Java archive file:
$ jar -cvfe HDFSClient.jar org/myorg/HDFSClient -C HDFSClient-classes/ .
The program can be run to check for available options as follows:
$hadoop jar ./HDFSClient.jar
Usage: hdfsclient add/read/delete/mkdir [<local_path> <hdfs_path>]
A simple file copy from the local system to HDFS can be accomplished using the following
command:
$ hadoop jar ./HDFSClient.jar add ./NOTES.txt /user/hdfs
For Hadoop version 2.6.0, the example files are located in the following directory:
/opt/hadoop-2.6.0/share/hadoop/mapreduce/
In other versions, they may be found under
/usr/lib/hadoop-mapreduce/
or another location.
The following is a list of the included jobs in the examples JAR file.
aggregatewordcount: An Aggregate-based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate-based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute the exact digits of pi.
dbcount: An example job that counts the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute the exact bits of pi.
grep: A map/reduce program that counts the matches to a regex in the input.
join: A job that effects a join over sorted, equally partitioned data sets.
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile-laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10 GB of random textual data per node.
randomwriter: A map/reduce program that writes 10 GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A Sudoku solver.
teragen: Generate data for the terasort.
terasort: Run the terasort.
teravalidate: Check the results of the terasort.
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
Run Sample MapReduce Examples
To test your installation, run the sample “pi” program that calculates the value of pi using a quasi-
Monte Carlo method and MapReduce. Change to user hdfs and run the following:
# su - hdfs
$ cd /opt/yarn/hadoop-2.2.0/bin
$ export YARN_EXAMPLES=/opt/yarn/hadoop-2.2.0/share/hadoop/mapreduce
$ ./yarn jar
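A representative full invocation, assuming the version 2.2.0 examples jar name, would be the following; the two arguments are the number of map tasks and the number of samples per map:
$ ./yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.2.0.jar pi 16 1000000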
If the program worked correctly, the following should be displayed at the end of the
program output stream:
• Now, we will look at the YARN web GUI to monitor the examples.
• You can monitor the application submission ID, the user who submitted the
application, the name of the application, the queue in which the application is
submitted, the start time and finish time in the case of finished applications, and the
final status of the application, using the ResourceManager UI.
• The ResourceManager web UI differs from the UI of the Hadoop 1.x versions.
• The following screenshot shows the information we could get from the YARN
web UI (http://localhost:8088).
Clicking on job….
The best benchmarks are always those that reflect real application
performance.
The terasort benchmark sorts a specified amount of randomly generated data. This
benchmark provides combined testing of the HDFS and MapReduce layers of a
Hadoop cluster.
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teragen 500000000 /user/hdfs/TeraGen-50GB
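The generated data can then be sorted and checked with the terasort and teravalidate jobs from the same examples jar (the output directory names are assumptions):
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar terasort /user/hdfs/TeraGen-50GB /user/hdfs/TeraSort-50GB
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar teravalidate /user/hdfs/TeraSort-50GB /user/hdfs/TeraSort-50GB-validate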
• The following command will perform the cleanup for the previous
example:
• $ hdfs dfs -rm -r -skipTrash Tera*
Running the TestDFSIO Benchmark
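TestDFSIO is part of the MapReduce job-client tests jar and is normally run as a write test followed by a read test before the results are examined (a sketch; the jar name and sizes are assumptions that vary by version):
1. Run the TestDFSIO write test:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -write -nrFiles 16 -fileSize 1000
2. Run the TestDFSIO read test:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -read -nrFiles 16 -fileSize 1000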
Example results are as follows (date and time prefix removed). The large standard
deviation is due to the placement of tasks in a small four-node cluster.
fs.TestDFSIO: ----- TestDFSIO ----- : read
fs.TestDFSIO: Date & time: Thu May 14 10:44:09 EDT 2015
fs.TestDFSIO: Number of files: 16
fs.TestDFSIO: Total MBytes processed: 16000.0
fs.TestDFSIO: Throughput mb/sec: 32.38643494172466
fs.TestDFSIO: Average IO rate mb/sec: 58.72880554199219
fs.TestDFSIO: IO rate std deviation: 64.60017624360337
fs.TestDFSIO: Test exec time sec: 62.798
3. Clean up the TestDFSIO data.
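A minimal cleanup command, using the same jar as above, is:
$ yarn jar $HADOOP_EXAMPLES/hadoop-mapreduce-client-jobclient-tests.jar TestDFSIO -clean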
• Hadoop MapReduce jobs can be managed using the mapred job command.
• The most important options for this command in terms of the examples and
benchmarks are -list, -kill, and -status.
• In particular, if you need to kill one of the examples or benchmarks, you can use the
mapred job -list command to find the job-id and then use mapred job -kill <job-id> to
kill the job across the cluster.
• MapReduce jobs can also be controlled at the application level with the yarn
application command.
• The possible options for mapred job are as follows:
$ mapred job
Usage: CLI <command> <args>
[-submit <job-file>]
[-status <job-id>]
[-counter <job-id> <group-name> <counter-name>]
[-kill <job-id>]
[-set-priority <job-id> <priority>].
Valid values for priorities are: VERY_HIGH HIGH NORMAL LOW
VERY_LOW
[-events <job-id> <from-event-#> <#-of-events>]
[-history <jobHistoryFile>]
[-list [all]]
[-list-active-trackers]
[-list-blacklisted-trackers]
[-list-attempt-ids <job-id> <task-type> <task-state>].
Valid values for <task-type> are REDUCE MAP. Valid values for <task-state>
are running, completed
-conf <configuration file> specify an application configuration file
-files <comma separated list of files> specify comma separated files to be copied to the
map reduce cluster
-libjars <comma separated list of jars> specify comma separated jar files to include in the
classpath.
If we ignore the grep count (-c) option for the moment, we can reduce the number of
instances to a single number (257) by sending (piping) the results of grep into wc -l:
$ grep " Kutuzov " war-and-peace.txt | wc -l
257
Though not strictly a MapReduce process, this idea is quite similar to and much faster
than the manual process of counting the instances of Kutuzov in the printed book. The
analogy can be taken a bit further by using the two simple (and naive) shell scripts shown
below. We can perform the same operation and tokenize both the Kutuzov and Petersburg
strings in the text:
$ cat war-and-peace.txt | ./mapper.sh | ./reducer.sh
Kutuzov, 315
Petersburg, 128
Notice that more instances of Kutuzov have been found. The mapper inputs a text file
and then outputs data in a (key,value) pair (token-name,count) format. Strictly speaking, the
input to the script is the file and the keys are Kutuzov and Petersburg. The reducer script
takes these key-value pairs and combines the similar tokens and counts the total number of
instances. The result is a new key-value pair (token-name, sum).
Apache Hadoop MapReduce will try to move the mapping tasks to the server that
contains the data slice. Results from each data slice are then combined in the reducer step.
HDFS is not required for Hadoop MapReduce, however. A sufficiently fast parallel file
system can be used in its place. In these designs, each server in the cluster has access to a
high-performance parallel file system that can rapidly provide any data slice. These designs
are typically more expensive than the commodity servers used for many Hadoop clusters.
The first thing MapReduce will do is create the data splits. For simplicity, each
line will be one split. Since each split will require a map task, there are three mapper
processes that count the number of words in the split. On a cluster, the results of each
map task are written to local disk and not to HDFS. Next, similar keys need to be
collected and sent to a reducer process. The shuffle step requires data movement and
can be expensive in terms of processing time. Depending on the nature of the application, the
amount of data that must be shuffled throughout the cluster can vary from small to large.
Once the data have been collected and sorted by key, the reduction step can begin. It
is not necessary and not normally recommended to have a reducer for each key-value pair as
shown in Figure 1.5. The number of reducers is a tunable option for many applications. The
final step is to write the output to HDFS. A combiner step enables some pre-reduction of the
map output data. For instance, in the previous example, one map produced the following
counts:
(run,1)
(spot,1)
(run,1)
As shown in Figure 1.6, the count for run can be combined into (run,2) before the
shuffle. This optimization can help minimize the amount of data transfer needed for
the shuffle phase.
The Hadoop YARN resource manager and the MapReduce framework determine the actual
placement of mappers and reducers. The MapReduce framework will try to place the map
task as close to the data as possible. It will request the placement from the YARN scheduler
but may not get the best placement due to the load on the cluster. In general, nodes can run
both mapper and reducer tasks. Indeed, the dynamic nature of YARN enables the work
containers used by completed map tasks to be returned to the pool of available resources.
Figure 1.7 shows a simple three-node MapReduce process. Once the mapping is
complete, the same nodes begin the reduce process. The shuffle stage makes sure the
necessary data are sent to each reducer. Also note that there is no requirement that all the
mappers complete at the same time or that the mapper on a specific node be complete before
a reducer is started. Reducers can be set to start shuffling based on a threshold of percentage
of mappers that have finished.
Finally, although the examples are simple in nature, the parallel MapReduce
algorithm can be scaled up to extremely large data sizes.
Speculative Execution
One of the challenges with many large clusters is the inability to predict or manage
unexpected system bottlenecks or failures. In theory, it is possible to control and monitor
resources so that network traffic and processor load can be evenly balanced; in practice,
however, this problem represents a difficult challenge for large systems. Thus, it is possible
that a congested network, slow disk controller, failing disk, high processor load, or some
other similar problem might lead to slow performance without anyone noticing.
When one part of a MapReduce process runs slowly, it ultimately slows down
everything else because the application cannot complete until all processes are finished. The
nature of the parallel MapReduce model provides an interesting solution to this problem. It is
possible to start a copy of a running map process without disturbing any other running
mapper processes. For example, suppose that as most of the map tasks are coming to a close,
the ApplicationMaster notices that some are still running and schedules redundant copies of
the remaining jobs on less busy or free servers. Should the secondary processes finish first,
the other first processes are then terminated (or vice versa). This process is known as
speculative execution.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
public class WordCount {
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context ) throws IOException,
InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens()) {
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable> {
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
// sum all of the counts emitted for this word and write the total
int sum = 0;
for (IntWritable val : values) {
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
public static void main(String[] args) throws Exception {
Configuration conf = new Configuration();
Job job = Job.getInstance(conf, "word count");
job.setJarByClass(WordCount.class);
job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
The framework views the input to the job as a set of key-value pairs.
The MapReduce job proceeds as follows:
(input) <k1,v1> -> map -> <k2,v2> -> combine -> <k2,v2> -> reduce -> <k3,v3> (output)
Consider the following two lines of input:
Hello World Bye World
Hello Hadoop Goodbye Hadoop
The WordCount application is quite straight-forward.
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException {
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{ word.set(itr.nextToken());
context.write(word, one);
}}
The map method processes one line at a time, splits the line into tokens separated by
whitespace, and emits a key-value pair of the form < <word>, 1>.
For the given sample input, the first map emits:
< Hello, 1>
< World, 1>
< Bye, 1>
< World, 1>
To compile and run the program from the command line, perform the following steps
1. Make a local wordcount-classes directory
$ mkdir wordcount-classes
2. Compile the WordCount.java program using the hadoop classpath command to include all the required class paths:
$ javac -cp `hadoop classpath` -d wordcount-classes WordCount.java
3. Create the jar file:
$ jar -cvf wordcount.jar -C wordcount-classes/ .
4. To run the program, create an input directory in HDFS and place a text file in the new
directory:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
5. Run the WordCount application:
$ hadoop jar wordcount.jar WordCount war-and-peace-input war-and-peace-output
        current_word = word
# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
The operations of the mapper.py script can be observed by running the command as shown in
the following:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py
foo 1
foo 1
quux 1
labs 1
foo 1
bar 1
quux 1
Piping the results of the map into the sort command can create a simulated shuffle phase:
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1
bar 1
foo 1
foo 1
foo 1
labs 1
quux 1
quux 1
$ echo "foo foo quux labs foo bar quux" | ./mapper.py | sort -k1,1 | ./reducer.py
bar 1
foo 3
labs 1
quux 2
• The output directory is removed from any previous test runs:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
• The following command line will use the mapper.py and reducer.py to do a word count
on the input file.
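A representative invocation uses the Hadoop streaming jar (a sketch; the jar path and name vary by installation):
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
 -file ./mapper.py -mapper ./mapper.py \
 -file ./reducer.py -reducer ./reducer.py \
 -input war-and-peace-input -output war-and-peace-output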
// (excerpt: tail of the reduce() method from the WordCountReduce class in the Pipes wordcount.cpp example)
}
context.emit(context.getInputKey(), HadoopUtils::toString(sum));
}
};
int main(int argc, char *argv[]) {
return HadoopPipes::runTask(HadoopPipes::TemplateFactory<WordCountMap,
WordCountReduce>());
}
• The program can be compiled with the following line
$ g++ wordcount.cpp -o wordcount -L$HADOOP_HOME/lib/native/
Create the war-and-peace-input directory and move the file into HDFS:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
The executable must be placed into HDFS so YARN can find the program:
$ hdfs dfs -put wordcount bin
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
To run the program, enter the following line
$ mapred pipes \
-D hadoop.pipes.java.recordreader=true \
-D hadoop.pipes.java.recordwriter=true \
-input war-and-peace-input \
-output war-and-peace-output \
-program bin/wordcount
Each mapper of the first job takes a line as input and matches the user-provided regular
expression against the line. It extracts all matching strings and emits (matching string, 1)
pairs.
The second job takes the output of the first job as input. The mapper is an inverse map, while
the reducer is an identity reducer. The number of reducers is one, so the output is stored in
one file, and it is sorted by the count in descending order. The output file is text; each line
contains a count and a matching string.
The following describes how to compile and run the Grep.java example. The steps are:
1. Create a directory for the application classes as follows:
$mkdir Grep_classes
2. Compile the Grep.java program using the following line:
$ javac -cp `hadoop classpath` -d Grep_classes Grep.java
3. Create a Java archive using the following command:
$ jar -cvf Grep.jar -C Grep_classes/ .
If needed, create the war-and-peace-input directory and move the file into HDFS:
$ hdfs dfs -mkdir war-and-peace-input
$ hdfs dfs -put war-and-peace.txt war-and-peace-input
Make sure the output directory has been removed by issuing the following command:
$ hdfs dfs -rm -r -skipTrash war-and-peace-output
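The job can then be run against the input directory, with the final argument giving the regular expression to match (a sketch that assumes the compiled class keeps the original org.apache.hadoop.examples package declaration):
$ hadoop jar Grep.jar org.apache.hadoop.examples.Grep war-and-peace-input war-and-peace-output Kutuzov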
Debugging MapReduce
The best approach is to first debug the applications on a simpler system with smaller data sets.
Errors on these systems are much easier to locate and track.
If applications can run successfully on a single system with a subset of real data, then
running in parallel should be a simple task because the MapReduce algorithm is
transparently scalable.
Listing, Killing, and Job Status
The most important options are -list, -kill, and -status. The yarn application command can be
used to control all applications running on the cluster.
Hadoop Log Management
The MapReduce logs provide a comprehensive listing of both mappers and reducers. The
actual log output consists of three files: stdout, stderr, and syslog.
There are two modes for log storage.
The first method is to use log aggregation. In this mode, logs are aggregated in HDFS
and can be displayed in the YARN ResourceManager user interface or examined with
the yarn logs command.
If log aggregation is not enabled, the logs will be placed locally on the cluster nodes
on which the mapper or reducer ran. The location of the unaggregated local logs is
given by the yarn.nodemanager.log-dirs property in the yarn-site.xml file. Without log
aggregation, the cluster nodes used by the job must be noted and then the log files
must be obtained directly from the nodes. Log aggregation is highly recommended.
For example, after running the pi example program, the logs can be examined as
follows:
$ hadoop jar $HADOOP_EXAMPLES/hadoop-mapreduce-examples.jar pi 16 100000
Next, list the finished applications to find the applicationId:
$ yarn application -list -appStates FINISHED
The applicationId will start with application_ and appear under the Application-Id column.
Next, run the following command to produce a dump of all logs for that application.
$ yarn logs -applicationId application_1432667013445_0001 > AppOut
Finally, the results for a single container can be found by entering this line:
$ yarn logs -applicationId application_1432667013445_0001 -containerId
container_1432667013445_0001_01_000023 -nodeAddress n1:45454 | more