Big Data Mapreduce and Streaming

Uploaded by

Smitha GV


What is MapReduce?

MapReduce is a processing technique and a programming model for distributed computing
based on Java. The MapReduce algorithm contains two important tasks, namely Map
and Reduce. Map takes a set of data and converts it into another set of data, where
individual elements are broken down into tuples (key/value pairs). The Reduce task
takes the output from a map as its input and combines those data tuples
into a smaller set of tuples. As the order of the words in the name MapReduce implies, the
reduce task is always performed after the map job.
The major advantage of MapReduce is that it is easy to scale data processing over
multiple computing nodes. Under the MapReduce model, the data processing primitives
are called mappers and reducers. Decomposing a data processing application
into mappers and reducers is sometimes nontrivial. But, once we write an application in
the MapReduce form, scaling the application to run over hundreds, thousands, or even
tens of thousands of machines in a cluster is merely a configuration change. This simple
scalability is what has attracted many programmers to use the MapReduce model.

The Algorithm
• Generally, the MapReduce paradigm is based on sending the computation to where the
data resides.
• A MapReduce program executes in three stages, namely the map stage, shuffle stage,
and reduce stage.
o Map stage − The map or mapper’s job is to process the input data.
Generally, the input data is in the form of a file or directory and is stored in the
Hadoop file system (HDFS). The input file is passed to the mapper function
line by line. The mapper processes the data and creates several small
chunks of data.
o Reduce stage − This stage is the combination of the Shuffle stage and
the Reduce stage. The Reducer’s job is to process the data that comes
from the mapper. After processing, it produces a new set of output, which
is stored in HDFS.
• During a MapReduce job, Hadoop sends the Map and Reduce tasks to the
appropriate servers in the cluster.
• The framework manages all the details of data-passing, such as issuing tasks,
verifying task completion, and copying data around the cluster between the
nodes.
• Most of the computing takes place on nodes with data on local disks, which reduces
network traffic.
• After completion of the given tasks, the cluster collects and reduces the data to
form an appropriate result, and sends it back to the Hadoop server.
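The three stages described above can be imitated in a single process. The following is only a minimal sketch in plain Python, not Hadoop code: the function names and the word-count task are assumptions chosen for illustration, and everything runs in memory rather than across a cluster.

```python
from collections import defaultdict


def map_stage(lines):
    """Map: turn each input line into (word, 1) key/value pairs."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)


def shuffle_stage(pairs):
    """Shuffle: collect all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reduce_stage(grouped):
    """Reduce: combine the values of each key into one smaller result."""
    return [(key, sum(values)) for key, values in grouped]


def run_job(lines):
    """Chain the three stages in the order the framework would run them."""
    return reduce_stage(shuffle_stage(map_stage(lines)))


# Hypothetical input lines, standing in for a file read line by line.
sample = ["big data map reduce", "map reduce streaming"]
for word, count in run_job(sample):
    print(f"{word}\t{count}")
```

In a real cluster the map and reduce functions run on different nodes and the shuffle moves data between them; here the shuffle is just a dictionary.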
Inputs and Outputs (Java Perspective)
The MapReduce framework operates on <key, value> pairs, that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key,
value> pairs as the output of the job, conceivably of different types.
The key and value classes must be serializable by the framework and
hence need to implement the Writable interface. Additionally, the key classes have to
implement the WritableComparable interface to facilitate sorting by the framework.
Input and Output types of a MapReduce job − (Input) <k1, v1> → map → <k2, v2> →
reduce → <k3, v3> (Output).
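To make the type flow <k1, v1> → map → <k2, v2> → reduce → <k3, v3> concrete, here is a small sketch in Python rather than Java. The helper names (`run_job`, `parse_reading`, `max_reading`) and the input records are hypothetical. The sort step stands in for the framework's key sorting, which is why keys must be comparable.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Map <k1, v1> records, sort and group by k2, then reduce to <k3, v3>."""
    intermediate = [pair for k1, v1 in records for pair in mapper(k1, v1)]
    # Sorting needs comparable keys: the role WritableComparable plays in Java.
    intermediate.sort(key=itemgetter(0))
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reducer(k2, [v2 for _, v2 in group]))
    return output


def parse_reading(offset, line):
    """<k1, v1> = (offset: int, line: str) -> <k2, v2> = (name: str, reading: int)."""
    name, reading = line.split()
    yield name, int(reading)


def max_reading(name, readings):
    """<k2, list of v2> -> <k3, v3> = (name: str, largest reading: int)."""
    yield name, max(readings)


# Hypothetical records keyed by position in the input, as text input often is.
records = [(0, "x 3"), (4, "y 5"), (8, "x 7")]
print(run_job(records, parse_reading, max_reading))  # [('x', 7), ('y', 5)]
```

Note that the input key type (an integer offset) differs from the intermediate and output key type (a string), which is what "conceivably of different types" allows.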

Terminology
• PayLoad − Applications implement the Map and the Reduce functions, and form
the core of the job.
• Mapper − Maps the input key/value pairs to a set of intermediate
key/value pairs.
• NameNode − Node that manages the Hadoop Distributed File System (HDFS).
• DataNode − Node where data is present in advance before any processing
takes place.
• MasterNode − Node where the JobTracker runs and which accepts job requests
from clients.
• SlaveNode − Node where the Map and Reduce programs run.
• JobTracker − Schedules jobs and tracks the jobs assigned to the TaskTracker.
• TaskTracker − Tracks the task and reports status to the JobTracker.
• Job − An execution of a Mapper and Reducer across a dataset.
• Task − An execution of a Mapper or a Reducer on a slice of data.
• Task Attempt − A particular instance of an attempt to execute a task on a
SlaveNode.
Compilation and Execution of Process Units Program
Let us assume we are in the home directory of a Hadoop user (e.g. /home/hadoop).
Follow the steps given below to compile and execute the above program.
Step 1
The following command is to create a directory to store the compiled java classes.
$ mkdir units
Step 2
Download hadoop-core-1.2.1.jar, which is used to compile and execute the
MapReduce program. Visit mvnrepository.com to download the jar. Let
us assume the download folder is /home/hadoop/.
Step 3
The following commands are used for compiling the ProcessUnits.java program and
creating a jar for the program.
$ javac -classpath hadoop-core-1.2.1.jar -d units ProcessUnits.java
$ jar -cvf units.jar -C units/ .
Step 4
The following command is used to create an input directory in HDFS.
$HADOOP_HOME/bin/hadoop fs -mkdir input_dir
Step 5
The following command is used to copy the input file named sample.txt into the input
directory of HDFS.
$HADOOP_HOME/bin/hadoop fs -put /home/hadoop/sample.txt input_dir
Step 6
The following command is used to verify the files in the input directory.
$HADOOP_HOME/bin/hadoop fs -ls input_dir/
Step 7
The following command is used to run the Eleunit_max application by taking the input
files from the input directory.
$HADOOP_HOME/bin/hadoop jar units.jar hadoop.ProcessUnits input_dir output_dir
Wait for a while until the job is executed. After execution, the output
will contain the number of input splits, the number of Map tasks, the number of reducer
tasks, etc.
Step 8
The following command is used to verify the resultant files in the output folder.
$HADOOP_HOME/bin/hadoop fs -ls output_dir/
Step 9
The following command is used to see the output in the part-00000 file. This file is
generated by the MapReduce job and stored in HDFS.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000
Below is the output generated by the MapReduce program.
1981 34
1984 40
1985 45
Step 10
The following command is used to copy the output folder from HDFS to the local file
system for analyzing.
$HADOOP_HOME/bin/hadoop fs -cat output_dir/part-00000/bin/hadoop dfs get output_d
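Once the output of Step 9 is available as text, it can be loaded for analysis with a few lines of Python. This is a sketch under assumptions: the function name `parse_output` is hypothetical, values are taken to be integers, and each line is assumed to hold a key and a value separated by whitespace (a tab is the usual separator for text output).

```python
def parse_output(text):
    """Parse 'key<whitespace>value' lines into a dict of integer values."""
    results = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        key, value = line.split(None, 1)
        results[key] = int(value)
    return results


# The three lines shown as the program output in Step 9.
sample = "1981 34\n1984 40\n1985 45\n"
print(parse_output(sample))
```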

Important Commands
All Hadoop commands are invoked by the $HADOOP_HOME/bin/hadoop command.
Running the Hadoop script without any arguments prints the description for all
commands.
Usage − hadoop [--config confdir] COMMAND
The following table lists the options available and their description.

Sr.No.  Option                        Description
1       namenode -format              Formats the DFS filesystem.
2       secondarynamenode             Runs the DFS secondary namenode.
3       namenode                      Runs the DFS namenode.
4       datanode                      Runs a DFS datanode.
5       dfsadmin                      Runs a DFS admin client.
6       mradmin                       Runs a Map-Reduce admin client.
7       fsck                          Runs a DFS filesystem checking utility.
8       fs                            Runs a generic filesystem user client.
9       balancer                      Runs a cluster balancing utility.
10      oiv                           Applies the offline fsimage viewer to an fsimage.
11      fetchdt                       Fetches a delegation token from the NameNode.
12      jobtracker                    Runs the MapReduce JobTracker node.
13      pipes                         Runs a Pipes job.
14      tasktracker                   Runs a MapReduce TaskTracker node.
15      historyserver                 Runs job history servers as a standalone daemon.
16      job                           Manipulates the MapReduce jobs.
17      queue                         Gets information regarding JobQueues.
18      version                       Prints the version.
19      jar <jar>                     Runs a jar file.
20      distcp <srcurl> <desturl>     Copies files or directories recursively.
21      distcp2 <srcurl> <desturl>    DistCp version 2.
22      archive -archiveName NAME -p <parent path> <src>* <dest>
                                      Creates a hadoop archive.
23      classpath                     Prints the class path needed to get the Hadoop jar and the required libraries.
24      daemonlog                     Gets or sets the log level for each daemon.
How to Interact with MapReduce Jobs
Usage − hadoop job [GENERIC_OPTIONS]
The following are the Generic Options available in a Hadoop job.

Sr.No.  GENERIC_OPTION                                Description
1       -submit <job-file>                            Submits the job.
2       -status <job-id>                              Prints the map and reduce completion percentage
                                                      and all job counters.
3       -counter <job-id> <group-name> <countername>  Prints the counter value.
4       -kill <job-id>                                Kills the job.
5       -events <job-id> <fromevent-#> <#-of-events>  Prints the events' details received by the
                                                      jobtracker for the given range.
6       -history [all] <jobOutputDir>                 Prints job details, and failed and killed tip
                                                      details. More details about the job, such as
                                                      successful tasks and task attempts made for each
                                                      task, can be viewed by specifying the [all] option.
7       -list [all]                                   -list all displays all jobs. -list displays only
                                                      jobs which are yet to complete.
8       -kill-task <task-id>                          Kills the task. Killed tasks are NOT counted
                                                      against failed attempts.
9       -fail-task <task-id>                          Fails the task. Failed tasks are counted against
                                                      failed attempts.
10      -set-priority <job-id> <priority>             Changes the priority of the job. Allowed priority
                                                      values are VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.

To see the status of a job

$ $HADOOP_HOME/bin/hadoop job -status <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -status job_201310191043_0004
To see the history of a job's output directory
$ $HADOOP_HOME/bin/hadoop job -history <DIR-NAME>
e.g.
$ $HADOOP_HOME/bin/hadoop job -history /user/expert/output
To kill a job
$ $HADOOP_HOME/bin/hadoop job -kill <JOB-ID>
e.g.
$ $HADOOP_HOME/bin/hadoop job -kill job_201310191043_0004
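When these job commands are issued from a script rather than typed by hand, it can help to assemble the argument list in one place. The helper below is a hypothetical sketch: `hadoop_job_command` is not part of Hadoop, and the installation path in the example is an assumption. It only builds the list; running it (for instance with `subprocess.run`) requires a working Hadoop installation.

```python
import os


def hadoop_job_command(action, *args, hadoop_home=None):
    """Build the argument list for `hadoop job -<action> <args...>`."""
    home = hadoop_home or os.environ.get("HADOOP_HOME", "")
    # Fall back to whatever `hadoop` is on PATH when HADOOP_HOME is unknown.
    executable = os.path.join(home, "bin", "hadoop") if home else "hadoop"
    return [executable, "job", f"-{action}", *args]


# /usr/local/hadoop is a hypothetical installation directory.
print(hadoop_job_command("status", "job_201310191043_0004",
                         hadoop_home="/usr/local/hadoop"))
```

Passing the arguments as a list avoids shell quoting problems with job IDs or directory names.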

How Streaming Works

In the above example, both the mapper and the reducer are Python scripts that read the input
from standard input and emit the output to standard output. The utility will create a Map/Reduce
job, submit the job to an appropriate cluster, and monitor the progress of the job until it
completes.
When a script is specified for mappers, each mapper task will launch the script as a separate
process when the mapper is initialized. As the mapper task runs, it converts its inputs into lines
and feeds the lines to the standard input (STDIN) of the process. In the meantime, the mapper
collects the line-oriented outputs from the standard output (STDOUT) of the process and
converts each line into a key/value pair, which is collected as the output of the mapper. By
default, the prefix of a line up to the first tab character is the key and the rest of the line
(excluding the tab character) is the value. If there is no tab character in the line, then the
entire line is considered the key and the value is null. However, this can be customized as
needed.
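The default rule for turning an output line into a key/value pair can be written out in a few lines. This is a sketch of the rule as described above, not the framework's own code. The function name `split_key_value` is hypothetical, and the `separator` parameter merely illustrates that the rule can be customized.

```python
def split_key_value(line, separator="\t"):
    """Split a line at the first separator; no separator means a null value."""
    line = line.rstrip("\n")
    key, found, value = line.partition(separator)
    if not found:
        return key, None  # the entire line is the key
    return key, value


print(split_key_value("1981\t34\n"))     # ('1981', '34')
print(split_key_value("a\tb\tc"))        # ('a', 'b\tc'): only the first tab splits
print(split_key_value("no tab here"))    # ('no tab here', None)
```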
When a script is specified for reducers, each reducer task will launch the script as a separate
process when the reducer is initialized. As the reducer task runs, it converts its input key/value
pairs into lines and feeds the lines to the standard input (STDIN) of the process. In the meantime,
the reducer collects the line-oriented outputs from the standard output (STDOUT) of the process and
converts each line into a key/value pair, which is collected as the output of the reducer. By
default, the prefix of a line up to the first tab character is the key and the rest of the line
(excluding the tab character) is the value. However, this can be customized as per specific
requirements.
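The line-oriented contract described above can be illustrated with a pair of Python functions. This is a minimal sketch, not one of the scripts this document refers to: the record format ("name amount"), the function names, and the `main` wrapper are assumptions. The reducer relies on its input lines arriving grouped by key, which the local demonstration imitates with `sorted`.

```python
import sys
from itertools import groupby


def mapper(lines):
    """Turn 'name amount' records into 'name<TAB>amount' output lines."""
    for line in lines:
        fields = line.split()
        if len(fields) == 2:
            yield f"{fields[0]}\t{fields[1]}"


def reducer(lines):
    """Sum the amounts of each key; lines must arrive grouped by key."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for key, group in groupby(parsed, key=lambda kv: kv[0]):
        total = sum(int(value) for _, value in group)
        yield f"{key}\t{total}"


def main(step, stdin=sys.stdin, stdout=sys.stdout):
    """Hypothetical entry point: read STDIN, apply a step, write STDOUT."""
    for out in step(stdin):
        stdout.write(out + "\n")


# Local demonstration: sorted() stands in for the framework's shuffle.
mapped = sorted(mapper(["a 1", "b 2", "a 3"]))
print(list(reducer(mapped)))  # ['a\t4', 'b\t2']
```

To use these as streaming scripts, one file would call `main(mapper)` and another `main(reducer)`, so that each reads from standard input and writes to standard output.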

Important Commands

Parameters                                      Options   Description
-input directory/file-name                      Required  Input location for mapper.
-output directory-name                          Required  Output location for reducer.
-mapper executable or script or JavaClassName   Required  Mapper executable.
-reducer executable or script or JavaClassName  Required  Reducer executable.
-file file-name                                 Optional  Makes the mapper, reducer, or combiner executable
                                                          available locally on the compute nodes.
-inputformat JavaClassName                      Optional  Class you supply should return key/value pairs of
                                                          Text class. If not specified, TextInputFormat is
                                                          used as the default.
-outputformat JavaClassName                     Optional  Class you supply should take key/value pairs of
                                                          Text class. If not specified, TextOutputFormat is
                                                          used as the default.
-partitioner JavaClassName                      Optional  Class that determines which reduce a key is sent to.
-combiner streamingCommand or JavaClassName     Optional  Combiner executable for map output.
-cmdenv name=value                              Optional  Passes the environment variable to streaming
                                                          commands.
-inputreader                                    Optional  For backwards-compatibility: specifies a record
                                                          reader class (instead of an input format class).
-verbose                                        Optional  Verbose output.
-lazyOutput                                     Optional  Creates output lazily. For example, if the output
                                                          format is based on FileOutputFormat, the output file
                                                          is created only on the first call to output.collect
                                                          (or Context.write).
-numReduceTasks                                 Optional  Specifies the number of reducers.
-mapdebug                                       Optional  Script to call when map task fails.
-reducedebug                                    Optional  Script to call when reduce task fails.
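The -partitioner and -numReduceTasks options interact: the partitioner decides which of the reducers receives each key. The sketch below only illustrates the idea of hash partitioning in Python. The function name `partition` and the choice of CRC32 are assumptions, and this is not the implementation Hadoop uses.

```python
import zlib


def partition(key, num_reduce_tasks):
    """Map a key to a reducer index in the range [0, num_reduce_tasks)."""
    if num_reduce_tasks < 1:
        raise ValueError("num_reduce_tasks must be at least 1")
    # CRC32 is stable across runs, unlike Python's built-in hash() for strings.
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


# Every occurrence of the same key lands on the same reducer.
for key in ["1981", "1984", "1985", "1981"]:
    print(key, "->", partition(key, 3))
```

The essential property is determinism: all values for a given key must reach the same reducer so that it can combine them.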
