Writing An Hadoop MapReduce Program in Python
Michael G. Noll
That said, the ground is now prepared for the purpose of this tutorial: writing a Hadoop
MapReduce program in a more Pythonic way, i.e. in a way you should be familiar with.
What we want to do
We will write a simple MapReduce program (see also the MapReduce article on Wikipedia)
for Hadoop in Python but without using Jython to translate our code to Java jar files.
Our program will mimic the WordCount example, i.e. it reads text files and counts how often words
occur. The input is text files and the output is text files, each line of which contains a word
and the count of how often it occurred, separated by a tab.
Note: You can also use programming languages other than Python such as Perl or Ruby
with the "technique" described in this tutorial.
Prerequisites
You should have an Hadoop cluster up and running because we will get our hands dirty. If
you don’t have a cluster yet, my following tutorials might help you to build one. The tutorials
are tailored to Ubuntu Linux, but the information also applies to other Linux/Unix
variants.
Make sure the file has execution permission ( chmod +x /home/hduser/mapper.py should
do the trick) or you will run into problems.
#!/usr/bin/env python
"""mapper.py"""
import sys
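# (sketch) The rest of the mapper: read lines from STDIN, split each line
# into words, and emit a tab-separated "<word> 1" pair per word to STDOUT --
# the format that reducer.py and the test run further below expect.

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print '%s\t%s' % (word, 1)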
Make sure the file has execution permission ( chmod +x /home/hduser/reducer.py should
do the trick) or you will run into problems.
#!/usr/bin/env python
"""reducer.py"""

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py: "<word><tab><count>"
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
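    # (sketch) The remaining aggregation logic, based on the variables set up
    # above (current_word, current_count, word). It assumes, as described
    # later in this tutorial, that Hadoop sorts the map output by key before
    # it reaches the reducer, so identical words arrive consecutively.
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print '%s\t%s' % (current_word, current_count)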
Here are some ideas on how to test the functionality of the Map and Reduce scripts.
hduser@ubuntu:~$ echo "foo foo quux labs foo bar quux" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
bar 1
foo 3
labs 1
quux 2
Download each ebook as a text file in Plain Text UTF-8 encoding and store the files in a
local temporary directory of your choice, for example /tmp/gutenberg.
hduser@ubuntu:~$ ls -l /tmp/gutenberg/
total 3604
-rw-r--r-- 1 hduser hadoop 674566 Feb 3 10:17 pg20417.txt
-rw-r--r-- 1 hduser hadoop 1573112 Feb 3 10:18 pg4300.txt
-rw-r--r-- 1 hduser hadoop 1423801 Feb 3 10:18 pg5000.txt
hduser@ubuntu:~$
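Before the MapReduce job can read these files, they have to be copied from the local
filesystem into HDFS. A sketch of that step, assuming the HDFS input directory
/user/hduser/gutenberg that the job below reads from and the classic hadoop dfs
command syntax:
hduser@ubuntu:~$ hadoop dfs -copyFromLocal /tmp/gutenberg /user/hduser/gutenberg
hduser@ubuntu:~$ hadoop dfs -ls /user/hduser/gutenberg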
If you want to modify some Hadoop settings on the fly like increasing the number of Reduce
tasks, you can use the -D option:
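A sketch of what such a call could look like; treat the streaming jar path below as a
placeholder, since it depends on your Hadoop version and installation layout:
hduser@ubuntu:~$ hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*streaming*.jar \
    -D mapred.reduce.tasks=16 \
    -file /home/hduser/mapper.py  -mapper /home/hduser/mapper.py \
    -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py \
    -input /user/hduser/gutenberg -output /user/hduser/gutenberg-output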
The job will read all the files in the HDFS directory /user/hduser/gutenberg , process them,
and store the results in the HDFS directory /user/hduser/gutenberg-output . In general,
Hadoop will create one output file per reducer; in our case, however, it will only create a single
file because the input files are very small.
As you can see in the output above, Hadoop also provides a basic web interface for statistics
and information. When the Hadoop cluster is running, open http://localhost:50030/ in a
browser and have a look around. Here’s a screenshot of the Hadoop web interface for the job
we just ran.
Figure 1: A screenshot of Hadoop's JobTracker web interface, showing the details of the
MapReduce job we just ran
You can then inspect the contents of the file with the dfs -cat command:
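For example, assuming the output directory used above and that the hadoop binary is on
your PATH (the file name part-00000 is the single output file Hadoop produced in this run):
hduser@ubuntu:~$ hadoop dfs -cat /user/hduser/gutenberg-output/part-00000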
Note that in this specific output above the quote signs ( " ) enclosing the words have not
been inserted by Hadoop. They are the result of how our Python code splits words, and in
this case it matched the beginning of a quote in the ebook texts. Just inspect the part-00000
file further to see it for yourself.
Generally speaking, iterators and generators (functions that create iterators, for example
with Python’s yield statement) have the advantage that an element of a sequence is not
produced until you actually need it. This can help a lot in terms of computational cost or
memory consumption, depending on the task at hand.
Note: The following Map and Reduce scripts will only work "correctly" when run in the
Hadoop context, i.e. as Mapper and Reducer in a MapReduce job. This means that
running the naive test command "cat DATA | ./mapper.py | sort -k1,1 | ./reducer.py"
will not work correctly anymore, because some functionality is intentionally
outsourced to Hadoop.
Precisely, we compute the sum of a word's occurrences, e.g. ("foo", 4) , only if by chance
the same word ( foo ) appears multiple times in succession. In the majority of cases,
however, we let Hadoop group the (key, value) pairs between the Map and the Reduce
step, because Hadoop is more efficient in this regard than our simple Python scripts.
mapper.py
#!/usr/bin/env python
"""A more advanced Mapper, using Python iterators and generators."""
import sys
def read_input(file):
    for line in file:
        # split the line into words
        yield line.split()

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_input(sys.stdin)
    for words in data:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        for word in words:
            print '%s%s%d' % (word, separator, 1)

if __name__ == "__main__":
    main()
reducer.py
#!/usr/bin/env python
"""A more advanced Reducer, using Python iterators and generators."""

from itertools import groupby
from operator import itemgetter
import sys

def read_mapper_output(file, separator='\t'):
    # split each "<word><separator><count>" line produced by the mapper
    for line in file:
        yield line.rstrip().split(separator, 1)

def main(separator='\t'):
    # input comes from STDIN (standard input)
    data = read_mapper_output(sys.stdin, separator=separator)
    # groupby groups multiple word-count pairs by word,
    # and creates an iterator that returns consecutive keys and their group:
    #   current_word - string containing a word (the key)
    #   group - iterator yielding all ["<current_word>", "<count>"] items
    for current_word, group in groupby(data, itemgetter(0)):
        try:
            total_count = sum(int(count) for current_word, count in group)
            print "%s%s%d" % (current_word, separator, total_count)
        except ValueError:
            # count was not a number, so silently discard this item
            pass

if __name__ == "__main__":
    main()
Comments
One question: you mentioned that Hadoop does the file sorting and splitting. If, in the
example, the split of the map output file is done across the same word, then the reduce step
would get two entries for this word. Does Hadoop take care of this detail when it splits
the final map output?
Example:
w1
w1
------- (if the file split is done here, then the final reduced output file will have "w 2"
followed by "w 3")
w1
w1
w1
w1
x1
x1
....
[...snip...]
I was trying to execute this streaming job example. I am getting the following error when I
run this program.
import sys

def run_map(f):
    for line in f:
        data = line.rstrip().split()
        for word in data:
            print(word)

if __name__ == '__main__':
    run_map(sys.stdin)
[...snip...]
#!/usr/bin/env python
import sys
# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split('\t')
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
[...snip...]
Can you tell me how I can get the input file name "20417-8.txt" in my mapper.py program? I
am trying to write an inverted index program.
I searched the Internet and people have suggested using os.environ["map_input_file"] , but it
doesn't seem to work.
Please help.
I set up a small cluster using multiple virtual machines on my computer. When I run the
map-reduce command, the map task completes, but the reduce task gets stuck. I checked
and rechecked the Python code. There does not seem to be any problem. Any suggestion why
this might be happening?
How can I parse and categorize system application log files in a Hadoop single-node cluster?
Is there any MapReduce code for this?
- I am writing my code in Python. So, can you please suggest how we can introduce
cascading using Hadoop Streaming without actually using the "Cascading" package?
- Do I need to save intermediate files in this case?
I tried searching for this on the internet but could not come up with a definite answer.
I had a question. This is in the context of a distributed setup involving many nodes, and
several large files stored on them with some replication (say, three replicas per block). Now,
when I run a standard Hadoop streaming task like this one, and I don't specify values for the
number of map and reduce tasks through mapred.*.tasks, what is the default behaviour?
Does it create some parallelism on its own, or does it end up spawning a single task to get
the job done?
It seems to me that the absence of the input directory or insufficient permissions might
cause this failure, but the directory does exist in HDFS, and the permissions are rwx for
everyone on the directory and its contents. The same goes for the output directory.
Could the input file be in the wrong format? Is there a place where more error info would
be displayed?
Thanks,
Jeff
I was trying your code for the first easy mapper.py and reducer.py above.
No results come out. No error messages, either. I do not know what's wrong.
Thank you
I changed some code and it runs smoothly! Maybe it will help you.
**mapper.py** file
import sys
import cStringIO
import xml.etree.ElementTree as xml

if __name__ == '__main__':
    buff = None
    intext = False
    for line in sys.stdin:
        line = line.strip()
        if line.find("<row") != -1:
[...snip...]
Left with a single question: how would I sort this output file in descending order of count
(word with the highest count appears first)?
Michael G. Noll
© 2004-2019 Michael G. Noll. All rights reserved. Views expressed here are my own. Privacy.