Word count is a typical example that Hadoop MapReduce developers start with. This sample MapReduce job is intended to count the number of occurrences of each word in the provided input files.
1 of 13 11/06/2015 05:30 PM
Kick Start Hadoop: Word Count - Hadoop Map R... http://kickstarthadoop.blogspot.in/2011/04/word-...
Now, coming to the practical side of the implementation, we need our input file and the MapReduce program jar to run the job. In a common MapReduce program two methods do the key work, namely map and reduce; the main method triggers the map and reduce methods. For convenience and readability it is better to place the map, reduce and main methods in three different class files. We'd look at the three files we require to perform the word count job.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
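Only the imports of the map class survive above. A minimal word-count mapper consistent with those imports, written against the old (0.18-era) mapred API, might look as follows; the class name WordCountMapper is illustrative, not taken from the post:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class WordCountMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    // Called once per input record; key is the byte offset of the line,
    // value is the line itself. Emits <word, 1> for every token.
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}
```

Reusing the single Text and IntWritable objects across calls, rather than allocating new ones per token, is the idiomatic way to cut object churn in a hot map method.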
Let us dive into the details of this source code. We can see the usage of a few deprecated classes and interfaces; this is because the code has been written to be compliant with Hadoop version 0.18 and later. From Hadoop version 0.20 some of these methods are deprecated but still supported.
import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
would be the key with the list of values associated with it. For example, here we have multiple values for a single key from our mapper, like <apple,1>, <apple,1>, <apple,1>, <apple,1>. These key-value pairs would be fed into the reducer as <apple, {1,1,1,1}>.
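The grouping the framework performs between map and reduce can be imitated in plain Java, with no Hadoop required, to make this concrete; the class and method names here are invented for the demo:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ShuffleDemo {
    // Imitates the shuffle/sort phase: gather every value emitted for a key
    // into one list, sorted by key, the way the reducer will see them.
    static Map<String, List<Integer>> group(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapperOutput = List.of(
                Map.entry("apple", 1), Map.entry("apple", 1),
                Map.entry("apple", 1), Map.entry("apple", 1));
        // The reducer for "apple" receives the key once, with the list {1,1,1,1}
        System.out.println(group(mapperOutput)); // prints {apple=[1, 1, 1, 1]}
    }
}
```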
Now let us evaluate our reduce method
reduce(Text key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output, Reporter reporter)
Here all the input parameters hold the same roles as in the mapper; the only difference is the input key-value type. As mentioned earlier, the input to a reducer instance is a key and a list of values, hence 'Text key, Iterator<IntWritable> values'. The next parameter denotes the output collector of the reducer, typed with the output key and value.
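Put together, a reduce implementation matching that signature might look like this; a minimal sketch against the same old (0.18-era) mapred API, with WordCountReducer as an illustrative class name:

```java
import java.io.IOException;
import java.util.Iterator;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCountReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

    // Receives a word and the iterator over the 1s emitted for it;
    // sums the list and emits <word, total>.
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}
```

For the <apple, {1,1,1,1}> input discussed above, this emits <apple, 4>.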
Driver Class
The last class file is the driver class. The driver class is responsible for triggering the MapReduce job in Hadoop; it is in this driver class that we provide the name of our job, the output key-value data types, and the mapper and reducer classes. The source code for the same is as follows:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
JobClient.runJob(conf);
return 0;
}
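Only the tail of the driver appears above. A fuller sketch of such a driver, against the same old mapred API, could look like the following; the class names WordCount, WordCountMapper and WordCountReducer are illustrative:

```java
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class WordCount extends Configured implements Tool {

    public int run(String[] args) throws Exception {
        JobConf conf = new JobConf(getConf(), WordCount.class);
        conf.setJobName("wordcount");                 // job name shown by the JobTracker

        conf.setOutputKeyClass(Text.class);           // output key data type
        conf.setOutputValueClass(IntWritable.class);  // output value data type

        conf.setMapperClass(WordCountMapper.class);
        conf.setReducerClass(WordCountReducer.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // input dir in HDFS
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // output dir (must not already exist)

        JobClient.runJob(conf);
        return 0;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new WordCount(), args));
    }
}
```

Using Tool/ToolRunner rather than a bare main lets standard Hadoop options (-D properties, -fs, etc.) be parsed from the command line before run() sees the remaining arguments.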
Create all three java files in your project. You will now have compilation errors; just get the latest release of Hadoop and add its jars to your class path. Once free from compilation errors, we have to package them into a jar. If you are using Eclipse, right-click on the project and use the export utility. While packaging the jar it is better not to set the main class, because in the future, when you have multiple MapReduce jobs and multiple drivers in the same project, this leaves the option of choosing the main class file at run time through the command line.
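Those steps might look like this on the command line; the jar names, paths and driver class name here are illustrative, not taken from the post:

```shell
# Compile against the Hadoop jars and package the classes.
# No Main-Class is set in the manifest, so the driver is named at run time.
javac -classpath hadoop-core.jar -d bin/ WordCountMapper.java WordCountReducer.java WordCount.java
jar cf wordcount.jar -C bin/ .

# Choose the driver class on the command line when launching the job
hadoop jar wordcount.jar WordCount /projects/wordcount/input/ /projects/wordcount/output/
```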
NOTE: In Hadoop the MapReduce process creates the output directory in HDFS and stores the output files there. If the output directory already exists in HDFS, the MapReduce job won't execute; in that case you either need to change the output directory or delete the provided output directory in HDFS before running the jar again.
6. Once the job shows a success status we can see the output file in the output directory (part-00000):
hadoop fs -ls /projects/wordcount/output/
7. For any further investigation of the output file we can retrieve the data from HDFS to the local file system (LFS) and from there to the desired location:
hadoop fs -copyToLocal /projects/wordcount/output/ /home/training/usecase/wordcount/output/
The key point to note here is that the number of output files is the same as the number of reducers used, as every reducer produces its own output file. All these output files would be available in the HDFS output directory we assigned in the run command. It would be a cumbersome job to combine all these files manually to obtain the result set. For that Hadoop provides the getmerge command:
hadoop fs -getmerge /projects/wordcount/output/ /home/training/usecase/wordcount/output/WordCount.txt
This command combines the contents of all the files available directly within the /projects/wordcount/output/ HDFS directory and writes them to the /home/training/usecase/wordcount/output/WordCount.txt file in the LFS.
You can find a working copy of the word count implementation with the Hadoop 0.20 API at the following location: word count example with hadoop 0.20
66 comments:
Nice example with details, Please add the new api example if possible.
Reply
Replies
For the latest api, a working example with complete source code and
explanation can be found at http://hadooptuts.com
Reply
Can you please explain how the input file is specified for the mapper, and who sends it line by line to the mapper function?
Reply
Hi Arockiaraj
In a MapReduce program, the JobTracker assigns input splits to each map task based on factors like data locality, slot availability etc. A map task actually processes certain HDFS blocks. If you have a large file that comprises 10 blocks, and if your mapred split properties match the HDFS block size, then you'll have 10 map tasks processing 1 block each.
Now, once the mapper has its own share of the input (based on the input format and certain other properties), it is the RecordReader that reads record by record and gives the records as input to each execution of the map() method. With the default TextInputFormat, the record reader reads till a newline character for each record.
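That default behaviour (one record per line, keyed by the offset at which the line starts) can be imitated in plain Java, with no Hadoop involved, just to make the record boundaries concrete; the class name is made up for the demo, and the offsets are character offsets, which match byte offsets for ASCII input:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LineRecordReaderDemo {
    // Imitates TextInputFormat's line record reader: each record is one line,
    // keyed by the offset at which the line starts.
    static Map<Long, String> records(String input) {
        Map<Long, String> out = new LinkedHashMap<>();
        long offset = 0;
        for (String line : input.split("\n")) {
            out.put(offset, line);
            offset += line.length() + 1; // +1 accounts for the newline character
        }
        return out;
    }

    public static void main(String[] args) {
        // Two lines -> two records, starting at offsets 0 and 11
        System.out.println(records("apple ball\ncat"));
    }
}
```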
Reply
hi,
what is the type of KEYIN? What do we call it? A datatype, class, interface, etc.?
Reply
In public class Mapper, what does KEYIN mean? I have searched the source code but am unable to find the declaration of KEYIN.
Reply
The default number of reducers is 1, but you can still change it based on your requirement.
Reply
Hello,
Thanks a lot for a clear overview. I have a question - what happens if I wish to output the result from a reducer to, let's say, two different files, with some logic related to that? Something like - the mapper reads, the reducer accepts those reads, generates two different lists and writes those lists into two different outputs/files - one for each list.
Thanks a lot
David
Reply
Nice article. I need to find out how one can extend this example to doing Word Count on
an xml file.
Reply
nice
Reply
I tried the code; it works for text files both inside and outside HDFS. Is there any difference in terms of speed and architecture? Please assist me. Thanks.
Reply
Hello Dude
I am a fresher in Hadoop. What about future vacancies for Hadoop technology?
Reply
This is really a very nice tutorial for getting a basic understanding of the map reduce function. Thanks a lot.
Reply
Replies
Nice
Reply
Very good document for reference for a newbie in the hadoop world. Counting words using unix scripts is not fun any more :P
Reply
thanks a lot.
Reply
You didn't explain the driver class properly. I'm surprised no one else has said anything
about it. Please add some more information about that.
Reply
Please explain the run method used in Driver class, How is the flow ?
Reply
Great article! Map-Reduce has served a great purpose, though: many, many companies,
research labs and individuals are successfully bringing Map-Reduce to bear on problems
to which it is suited: brute-force processing with an optional aggregation. But more
important in the longer term, to my mind, is the way that Map-Reduce provided the
justification for re-evaluating the ways in which large-scale data processing platforms are
built (and purchased!).
Reply
hello
Very nice explanation, thanks.
I am new to map reduce programming. How do I calculate the total count of words? I want the output as sum/total_count.
Here in this example the total sum is 12, so for each key we divide its count by that total.
Example:
Could anyone tell me how I should achieve this? I am really struggling a lot.
thanks
Reply
Great example, thanks for explaining word count. This post shows you are a very experienced bigdata analyst; please share more tips like this.
Reply
Is there a way of doing this without using imports? Have to use IO inputs.
Reply