Big Data MapReduce and Streaming
The Algorithm
The MapReduce paradigm is generally based on sending the computation to where the
data resides, rather than moving the data to the computation.
A MapReduce program executes in three stages: the map stage, the shuffle stage,
and the reduce stage.
o Map stage − The mapper’s job is to process the input data.
Generally the input data is a file or directory stored in the
Hadoop file system (HDFS). The input file is passed to the mapper function
line by line. The mapper processes the data and creates several small
chunks of data (intermediate key/value pairs).
o Reduce stage − This stage is the combination of the shuffle stage and
the reduce stage. The reducer’s job is to process the data that comes
from the mapper. After processing, it produces a new set of output, which
is stored in HDFS.
During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
The framework manages all the details of data passing, such as issuing tasks,
verifying task completion, and copying data between the nodes of the
cluster.
Most of the computing takes place on nodes that hold the data on local disks, which
reduces network traffic.
After the given tasks complete, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
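The three stages described above can be sketched in plain Java, without a cluster. This is only an in-memory illustration under stated assumptions: the class name MiniMapReduce, the word-count task, and the method names are hypothetical and are not part of the Hadoop API. The map step emits intermediate <word, 1> pairs, the shuffle step groups values by key in sorted key order, and the reduce step sums each group.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of map -> shuffle -> reduce.
// This is NOT the Hadoop API; it only mirrors the flow of the three stages.
class MiniMapReduce {

    // Map stage: turn one input line into intermediate <word, 1> pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    // Shuffle stage: group intermediate values by key, keys kept in sorted order.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce stage: collapse all values of one key into a single result.
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Run the three stages in order over all input lines.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            intermediate.addAll(map(line));
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(intermediate).entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        // prints {big=2, cluster=1, data=1}
        System.out.println(run(Arrays.asList("big data", "big cluster")));
    }
}
```

In a real job the map calls would run in parallel on the nodes that hold the input, and the shuffle would move data across the network; here everything runs in one process so that only the data flow is visible.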
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes have to be serializable by the framework and
hence need to implement the Writable interface. Additionally, the key classes have to
implement the WritableComparable interface to facilitate sorting by the framework.
Input and output types of a MapReduce job − (Input) <k1, v1> → map → <k2, v2> →
reduce → <k3, v3> (Output).
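To make the serialization and sorting requirements concrete, here is a minimal sketch of a key type. So that it compiles without the Hadoop jar, it uses two local stand-in interfaces (SketchWritable and SketchWritableComparable, both hypothetical names) that mirror the write(DataOutput) and readFields(DataInput) methods of Hadoop's Writable, plus Comparable for ordering. The YearKey class and its roundTrip helper are likewise assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Local stand-ins (hypothetical names) mirroring the method shapes of
// Hadoop's Writable and WritableComparable, so no Hadoop jar is needed.
interface SketchWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

interface SketchWritableComparable<T> extends SketchWritable, Comparable<T> {
}

// Hypothetical key type that wraps a year.
class YearKey implements SketchWritableComparable<YearKey> {
    private int year;

    YearKey() {
        // No-argument constructor: an instance is created empty, then filled by readFields.
    }

    YearKey(int year) {
        this.year = year;
    }

    int getYear() {
        return year;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(year); // serialize the key
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        year = in.readInt(); // deserialize in the same field order as write
    }

    @Override
    public int compareTo(YearKey other) {
        return Integer.compare(year, other.year); // ordering used when sorting keys
    }

    // Serialize a key to bytes and read it back into a fresh instance.
    static YearKey roundTrip(YearKey key) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            key.write(new DataOutputStream(bytes));
            YearKey copy = new YearKey();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        YearKey copy = roundTrip(new YearKey(1984));
        System.out.println(copy.getYear());                        // prints 1984
        System.out.println(new YearKey(1981).compareTo(copy) < 0); // prints true
    }
}
```

With the Hadoop jar on the classpath, a real key class would implement org.apache.hadoop.io.WritableComparable instead of the stand-ins used here.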
Terminology
PayLoad − Applications implement the Map and the Reduce functions, which form
the core of the job.
Mapper − Maps the input key/value pairs to a set of intermediate
key/value pairs.
NameNode − Node that manages the Hadoop Distributed File System (HDFS).
DataNode − Node where data is present in advance before any processing
takes place.
MasterNode − Node where the JobTracker runs and which accepts job requests
from clients.
SlaveNode − Node where the Map and Reduce programs run.
JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTracker.
TaskTracker − Tracks the task and reports status to the JobTracker.
Job − An execution of a Mapper and Reducer across a dataset.
Task − An execution of a Mapper or a Reducer on a slice of data.
Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
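The relationship between a Job (one run over the whole dataset) and its Tasks (one run per slice of data) can be sketched as follows. This is a simplified illustration, not how Hadoop computes input splits: the class name SliceDemo, the slices helper, and the fixed slice size are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: one job over a dataset becomes one task per slice.
class SliceDemo {

    // Split the dataset into consecutive slices of at most sliceSize records.
    static <T> List<List<T>> slices(List<T> records, int sliceSize) {
        if (sliceSize <= 0) {
            throw new IllegalArgumentException("sliceSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < records.size(); start += sliceSize) {
            int end = Math.min(start + sliceSize, records.size());
            result.add(new ArrayList<>(records.subList(start, end)));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("r1", "r2", "r3", "r4", "r5");
        List<List<String>> tasks = slices(records, 2);
        // Each slice would be handed to its own task.
        for (int i = 0; i < tasks.size(); i++) {
            System.out.println("task " + i + " -> " + tasks.get(i));
        }
    }
}
```

With five records and a slice size of two, the job is divided into three tasks, the last of which receives the single leftover record.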
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command creates a directory to store the compiled Java classes.
$ mkdir units
Step 2
Download hadoop-core-1.2.1.jar, which is used to compile and execute the
MapReduce program. The jar can be downloaded from mvnrepository.com. Let
us assume the download folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and
creating a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input
directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input
files from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Wait until the job finishes. After execution, the console output
reports the number of input splits, the number of map tasks, the number of reduce
tasks, etc.
Step 8
The following command is used to verify the resultant files in the output folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/
Step 9
The following command is used to see the output in the part-00000 file. This file is
written to HDFS by the MapReduce job.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
Below is the output generated by the MapReduce program.
1981 34
1984 40
1985 45
Step 10
The following command is used to copy the output folder from HDFS to the local file
system for analysis. Replace <localdst> with the local destination directory.
$HADOOP_HOME/bin/hadoop fs -get output_dir <localdst>
Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.
Running the Hadoop script without any arguments prints the description for all
commands.
Usage − hadoop [--config confdir] COMMAND
The following list gives the available commands and their descriptions.
1. namenode -format − Formats the DFS filesystem.
2. secondarynamenode − Runs the DFS secondary namenode.
3. namenode − Runs the DFS namenode.
4. datanode − Runs a DFS datanode.
5. dfsadmin − Runs a DFS admin client.
6. mradmin − Runs a Map-Reduce admin client.
7. fsck − Runs a DFS filesystem checking utility.
8. fs − Runs a generic filesystem user client.
9. balancer − Runs a cluster balancing utility.
10. oiv − Applies the offline fsimage viewer to an fsimage.
11. fetchdt − Fetches a delegation token from the NameNode.
12. jobtracker − Runs the MapReduce JobTracker node.
13. pipes − Runs a Pipes job.
14. tasktracker − Runs a MapReduce TaskTracker node.
15. historyserver − Runs job history servers as a standalone daemon.
16. job − Manipulates the MapReduce jobs.
17. queue − Gets information regarding JobQueues.
18. version − Prints the version.
19. jar <jar> − Runs a jar file.
20. distcp <srcurl> <desturl> − Copies files or directories recursively.
21. distcp2 <srcurl> <desturl> − DistCp version 2.
22. archive -archiveName NAME -p <parent path> <src>* <dest> − Creates a Hadoop archive.
23. classpath − Prints the class path needed to get the Hadoop jar and the required libraries.
24. daemonlog − Gets or sets the log level for each daemon.
How to Interact with MapReduce Jobs
Usage − hadoop job [GENERIC_OPTIONS]
The following are the generic options available for a Hadoop job.
1. -submit <job-file> − Submits the job.
2. -status <job-id> − Prints the map and reduce completion percentage and all job counters.
3. -counter <job-id> <group-name> <countername> − Prints the counter value.
4. -kill <job-id> − Kills the job.
5. -events <job-id> <fromevent-#> <#-of-events> − Prints the details of the events received by the JobTracker for the given range.
6. -history [all] <jobOutputDir> − Prints job details, failed and killed tip details. More details about the job, such as
successful tasks and task attempts made for each task, can be viewed by specifying
the [all] option.
7. -list [all] − -list displays only jobs which are yet to complete; -list all displays all jobs.
8. -kill-task <task-id> − Kills the task. Killed tasks are NOT counted against failed attempts.
9. -fail-task <task-id> − Fails the task. Failed tasks are counted against failed attempts.
10. -set-priority <job-id> <priority> − Changes the priority of the job. Allowed priority values are VERY_HIGH, HIGH,
NORMAL, LOW, VERY_LOW.
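The difference between -kill-task and -fail-task comes down to which counter an attempt lands in. The sketch below models that bookkeeping only; it is not Hadoop's implementation. The class TaskAttemptLedger, its method names, and the limit of 4 failed attempts used in main are assumptions chosen for illustration.

```java
// Hypothetical bookkeeping model of killed versus failed task attempts.
// Not Hadoop's implementation; names and the limit used below are assumptions.
class TaskAttemptLedger {
    private final int maxFailedAttempts;
    private int failedAttempts = 0;
    private int killedAttempts = 0;

    TaskAttemptLedger(int maxFailedAttempts) {
        if (maxFailedAttempts <= 0) {
            throw new IllegalArgumentException("maxFailedAttempts must be positive");
        }
        this.maxFailedAttempts = maxFailedAttempts;
    }

    // Models -kill-task: recorded, but not counted against failed attempts.
    void killAttempt() {
        killedAttempts++;
    }

    // Models -fail-task: counted against failed attempts.
    void failAttempt() {
        failedAttempts++;
    }

    int getFailedAttempts() {
        return failedAttempts;
    }

    int getKilledAttempts() {
        return killedAttempts;
    }

    // Only failed attempts can use up the allowed number of attempts.
    boolean hasExhaustedAttempts() {
        return failedAttempts >= maxFailedAttempts;
    }

    public static void main(String[] args) {
        TaskAttemptLedger ledger = new TaskAttemptLedger(4); // 4 is an assumed limit
        ledger.killAttempt();
        ledger.failAttempt();
        // prints failed=1 killed=1 exhausted=false
        System.out.println("failed=" + ledger.getFailedAttempts()
                + " killed=" + ledger.getKilledAttempts()
                + " exhausted=" + ledger.hasExhaustedAttempts());
    }
}
```

However many attempts are killed, the task in this model is never marked as exhausted; only failed attempts move it toward the limit.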
Important Commands
-inputreader − Optional. For backwards compatibility: specifies a record reader
class (instead of an input format class).