Hadoop Mapreduce Python Script
____________MAPPER_____________
1> make a file named mapper.py and paste the python code below for the mapper into it
$ nano mapper.py
#!/usr/bin/env python
import sys
# read lines from standard input and emit "word<TAB>1" for every word
for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print '%s\t%s' % (word, 1)
----understanding above code----
#[ for line in sys.stdin: ] says that the input comes from standard input (STDIN); standard input (stdin) is the source of the input data for the python script.
#[ print '%s\t%s' % (word, 1) ] writes the result to standard output (stdout). This output will be the input for the reducer.
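As a quick local sanity check (not part of the Hadoop job itself), the mapper can be tested by piping some text into it; the sample sentence here is only an illustration:
$ echo "foo foo quux labs foo bar quux" | python mapper.py
foo	1
foo	1
quux	1
labs	1
foo	1
bar	1
quux	1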
3> make a file named reducer.py and paste the python code below for the reducer into it
$ nano reducer.py
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# read the sorted "word<TAB>count" pairs produced by the mapper
for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # ignore the line if count was not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print '%s\t%s' % (current_word, current_count)
        current_count = count
        current_word = word

# print the count for the last word
if current_word == word:
    print '%s\t%s' % (current_word, current_count)
----understanding above code----
#The code in reducer.py reads the results of mapper.py from standard input, so the output format of mapper.py and the input format expected by reducer.py must match.
#[ try:
    count = int(count)
except ValueError: ] converts count, which is currently a string, to an int, because count is supposed to be a number, i.e. an int.
#The [ continue ] statement inside the except block skips the line if count was not a number, i.e. not an int.
#[ if current_word == word:
    current_count += count
else:
    if current_word: ] this comparison works because hadoop sorts the map output by key (the word) before it is passed to the reducer, so all counts for the same word arrive one after another.
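As with the mapper, the whole pipeline can be sanity-checked locally before running it on Hadoop; here sort stands in for the shuffle/sort phase that Hadoop performs between map and reduce (the sample sentence is only an illustration):
$ echo "foo foo quux labs foo bar quux" | python mapper.py | sort -k1,1 | python reducer.py
bar	1
foo	3
labs	1
quux	2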
5> first copy the files that have to be processed from our local file system to Hadoop's HDFS
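A minimal sketch of this step, assuming a local text file named pg20417.txt and an HDFS directory /user/hduser/input (both names are placeholders for your own file and path):
$ hdfs dfs -mkdir -p /user/hduser/input
$ hdfs dfs -put pg20417.txt /user/hduser/input
$ hdfs dfs -ls /user/hduser/input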
6> run the hadoop streaming jar file, which allows python code to run on hadoop, followed by the mapper, reducer, input and output arguments
Here -file takes a file/dir to be shipped in the job jar file. -input takes the DFS input path for the map step. -mapper takes the streaming command to run the map step. -reducer takes the streaming command to run the reduce step.
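A sketch of the full command, assuming the input and output paths from the previous step and that the streaming jar sits under $HADOOP_HOME/share/hadoop/tools/lib (the exact jar name varies with the Hadoop version); the scripts are made executable first so the streaming job can run them:
$ chmod +x mapper.py reducer.py
$ hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -file mapper.py -mapper mapper.py \
    -file reducer.py -reducer reducer.py \
    -input /user/hduser/input \
    -output /user/hduser/output
$ hdfs dfs -cat /user/hduser/output/part-00000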