Assignment 11 DSBDA
Aim:
Write Java code for a simple WordCount application that counts the number of occurrences
of each word in a given input set using the Hadoop MapReduce framework on a local standalone
setup.
Objective:
By completing this task, students will learn the following
Theory:
Map and Reduce tasks in Hadoop: Within a MapReduce job there are two separate tasks, the map task
and the reduce task.
Map task: A MapReduce job splits the input dataset into independent chunks, known in Hadoop as
input splits, which are processed by the map tasks in a completely parallel manner. The Hadoop
framework creates a separate map task for each input split.
Reduce task: The output of the map tasks is sorted by the Hadoop framework and then becomes the
input to the reduce tasks.
The Hadoop MapReduce framework operates exclusively on <key, value> pairs. In a MapReduce job, the
input to the Map function is a set of <key, value> pairs and the output is also a set of <key, value>
pairs. The output <key, value> pair may have a different type from the input <key, value> pair.
The output from the map tasks is sorted by the Hadoop framework. MapReduce guarantees that the
input to every reducer is sorted by key. Input and output of the reduce task can be represented as
follows.
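One way to picture the reduce input is as <key, list of values>, with the keys arriving in sorted order. The following is a minimal plain-Java sketch of that grouping step, not Hadoop code. The class name ShuffleSketch and the method groupByKey are made up for illustration, and a TreeMap stands in for the sorting that the framework performs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only (hypothetical names); this is not Hadoop code.
final class ShuffleSketch {

    // Groups (word, count) pairs by word. TreeMap keeps its keys sorted,
    // which mimics the sorted-by-key guarantee for reducer input.
    static TreeMap<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> mapOutput) {
        TreeMap<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("program.", 1),
                Map.entry("Hello", 1),
                Map.entry("program.", 1));
        // Each printed line is one reduce input: <key, list of values>.
        groupByKey(mapOutput).forEach((key, values) ->
                System.out.println("<" + key + ", " + values + ">"));
    }
}
```

With the three pairs in main, the sketch prints <Hello, [1]> followed by <program., [1, 1]>. A reducer would then turn each list into a single sum.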
The WordCount example reads text files and counts the frequency of the words. Each mapper
takes a line of the input file as input and breaks it into words. It then emits a key/value pair
for each word, in the form (word, 1). Each reducer sums the counts for each word and emits
a single key/value pair with the word and its sum.
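The flow just described can be sketched end to end in plain Java. This is only an illustration of the logic under simplifying assumptions: it runs in a single process, uses no Hadoop classes, and the names WordCountSketch and countWords are hypothetical. An actual job would put the two steps in separate Mapper and Reducer classes.

```java
import java.util.List;
import java.util.TreeMap;

// Plain-Java illustration only (hypothetical names); a real job would use Hadoop's Mapper and Reducer.
final class WordCountSketch {

    static TreeMap<String, Integer> countWords(List<String> lines) {
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            // "Map" step: break the line into words; each word stands for an emitted (word, 1).
            for (String word : line.split(" ")) {
                if (word.isEmpty()) {
                    continue; // skip empty tokens produced by repeated spaces
                }
                // "Reduce" step: add the emitted 1 to the running sum for that word.
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = List.of(
                "Hello wordcount MapReduce Hadoop program.",
                "This is my first MapReduce program.");
        // Prints one "word<TAB>sum" line per distinct word, in sorted key order.
        countWords(lines).forEach((word, sum) -> System.out.println(word + "\t" + sum));
    }
}
```

Because the split is on spaces only, punctuation stays attached to the word, so "program." is counted as its own token.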
In the word count MapReduce code there is a Mapper class (MyMapper) with a map function
and a Reducer class (MyReducer) with a reduce function.
1. Map function
Each line of the wordcount.txt file is passed to the map function in the following
format, where the key is the offset of the line within the file and the value is the line itself.
<0, Hello wordcount MapReduce Hadoop program.>
<41, This is my first MapReduce program.>
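In the map function, the line is split on spaces and each word is written to the context
with the value 1.
So the output from the map function for the two lines will be as follows.
<Hello, 1>
<wordcount, 1>
<MapReduce, 1>
<Hadoop, 1>
<program., 1>
<This, 1>
<is, 1>
<my, 1>
<first, 1>
<MapReduce, 1>
<program., 1>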
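The map step for the two example lines can be sketched in plain Java as shown here. This is a hedged illustration, not the MyMapper class itself: the class name MapFunctionSketch is hypothetical, and the returned list stands in for the Hadoop context that a real mapper writes to.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java illustration only (hypothetical names); the returned list stands in for Hadoop's context.
final class MapFunctionSketch {

    // Mirrors the described map step: the offset key is ignored,
    // the line is split on spaces, and each word is emitted with the value 1.
    static List<String> map(long offset, String line) {
        List<String> emitted = new ArrayList<>();
        for (String word : line.split(" ")) {
            if (!word.isEmpty()) {
                emitted.add("<" + word + ", 1>");
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        // The two input records from the example, passed as (offset, line).
        map(0L, "Hello wordcount MapReduce Hadoop program.").forEach(System.out::println);
        map(41L, "This is my first MapReduce program.").forEach(System.out::println);
    }
}
```

Note that the mapper emits <MapReduce, 1> and <program., 1> twice, once per line. Adding them up is left to the reducer.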
You will also need to add at least the following Hadoop jars so that your code can compile.
You will find these jars inside the /share/hadoop directory of your Hadoop installation. Within
the /share/hadoop path, look in the hdfs, mapreduce and common directories for the required jars.
Once you are able to compile your code, you need to create a jar file.
In the Eclipse IDE, right-click on your Java program and select Export > Java > JAR file.
Once your word count MapReduce program has executed successfully, you can verify the output
file.
Found 2 items
As you can see, the Hadoop framework creates output files using the part-r-xxxxx format. Since
only one reducer is used here, there is only one output file, part-r-00000. You can see the content
of the file using the following command.
Hadoop 1
Hello 1
MapReduce 2
This 1
first 1
is 1
my 1
program. 2
wordcount 1
Conclusion: In this assignment, we have learned what HDFS is and how the Hadoop
MapReduce framework is used to count the number of occurrences of each word in a given
input set.