Hadoop Streaming

This document discusses Hadoop streaming using Python. It provides steps to install Python, write mapper and reducer code for a word count problem, and run the program locally and on HDFS. The mapper code takes input and emits each word and count. The reducer code sums the counts for each word. Common errors like bad interpreters are addressed. References for more information on Hadoop streaming and installing Python are also provided.

HADOOP STREAMING

1. Install Python
# apt update
# apt install python-is-python3
# whereis python3
Or, to install Python 3.11 from the deadsnakes PPA:
# apt update && sudo apt upgrade -y
# apt install software-properties-common -y
# add-apt-repository ppa:deadsnakes/ppa -y
# add-apt-repository ppa:deadsnakes/nightly -y
# apt update
# apt install python3.11
# python3.11 --version
2. WordCount Example Using Python
Mapper Phase Code
Create the file mapper.py and make it executable with chmod +x mapper.py
#!/usr/bin/python3
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
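
To see what the mapper emits without piping anything through STDIN, the same logic can be wrapped in a function. This is only a sketch for experimenting in the Python interpreter; the name map_line is hypothetical and is not part of mapper.py.

```python
def map_line(line):
    """Return the (word, 1) pairs that the mapper logic emits for one line."""
    # Same steps as mapper.py: strip whitespace, split into words, emit count 1.
    return [(word, 1) for word in line.strip().split()]


sample = "foo foo quux labs foo bar quux"
for word, count in map_line(sample):
    # Tab-delimited, exactly like the mapper's STDOUT format.
    print('%s\t%s' % (word, count))
```

Each word is emitted once per occurrence, so "foo" appears three times with a count of 1. The summing happens only in the reduce step.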
Reducer Phase Code
Create the file reducer.py and make it executable with chmod +x reducer.py

Compiled by: Lê Thị Minh Châu

#!/usr/bin/python3
"""reducer.py"""

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (checking for None avoids printing "None 0" on empty input and
# still prints the last word if the final line was discarded)
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
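
The same reduce step can also be written with itertools.groupby, which groups consecutive lines that share a key. The following is a sketch of that alternative, not the reducer.py used in this handout; the function names parse_pairs and reduce_sorted are hypothetical. Like reducer.py, it assumes the input is already sorted by word.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) from tab-delimited lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep:
            continue  # no tab in this line, discard it
        try:
            value = int(count)
        except ValueError:
            continue  # count was not a number, discard the line
        yield word, value


def reduce_sorted(lines):
    """Sum the counts per word; the lines must already be sorted by word."""
    pairs = parse_pairs(lines)
    return [(word, sum(count for _, count in group))
            for word, group in groupby(pairs, key=itemgetter(0))]


sample = ["bar\t1", "foo\t1", "foo\t1", "foo\t1"]
for word, total in reduce_sorted(sample):
    print('%s\t%s' % (word, total))
```

groupby only merges adjacent equal keys, so unsorted input would produce the same word more than once. That is the same restriction as the IF-switch in reducer.py.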
3. Run the WordCount program on a local directory
$ echo "foo foo quux labs foo bar quux" |
/home/hadoopminhchau/mapper.py

$ echo "foo foo quux labs foo bar quux" |
/home/hadoopminhchau/mapper.py | sort -k1,1 |
/home/hadoopminhchau/reducer.py

Create a file data.txt containing the input data

$ cat ./data.txt | ./mapper.py

$ cat ./data.txt | ./mapper.py | sort -k1,1 | ./reducer.py
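
To check what the local pipeline should print, the three stages (map, sort, reduce) can be simulated in a single Python function. This is a sketch under the assumption that plain string ordering is an acceptable stand-in for sort -k1,1; the name simulate_wordcount is hypothetical.

```python
def simulate_wordcount(text):
    """Mimic mapper.py | sort -k1,1 | reducer.py for a block of text."""
    # Map: one (word, 1) pair per word occurrence.
    mapped = [(word, 1) for line in text.splitlines() for word in line.split()]
    # Sort by key, standing in for sort -k1,1 (plain string order assumed).
    mapped.sort(key=lambda pair: pair[0])
    # Reduce: add up the counts of consecutive pairs with the same word.
    totals = []
    for word, count in mapped:
        if totals and totals[-1][0] == word:
            totals[-1] = (word, totals[-1][1] + count)
        else:
            totals.append((word, count))
    return totals


for word, total in simulate_wordcount("foo foo quux labs foo bar quux"):
    print('%s\t%s' % (word, total))
# prints: bar 1, foo 3, labs 1, quux 2 (one tab-delimited pair per line)
```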

4. Run the WordCount program on HDFS
Create a directory myinput containing the input data

Copy the myinput directory to HDFS

Run the MapReduce job


$ hadoop jar hadoop-streaming-3.3.4.jar \
    -file mapper.py -mapper mapper.py \
    -file reducer.py -reducer reducer.py \
    -input ./myinput -output ./myoutput

Display the result
$ hdfs dfs -cat ./myoutput/part-00000

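5. Fixing some common errors

If you get the error /usr/bin/env: ‘python\r’: No such file or directory, the script has Windows (CRLF) line endings. Install dos2unix and convert the scripts:

$ sudo apt install dos2unix
$ dos2unix mapper.py reducer.py
If you get the error /usr/bin/python^M: bad interpreter, set the file format to unix in vim:
$ vim mapper.py
then, inside vim, type :set ff=unix and save with :wq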
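
Both errors come from carriage returns (\r) at the end of the shebang line. If neither dos2unix nor vim is available, the same conversion can be sketched in Python. The helper names to_unix_line_endings and convert_file are hypothetical, and the sketch assumes the script uses CRLF line endings throughout.

```python
def to_unix_line_endings(data):
    """Replace Windows CRLF line endings with LF in a bytes object."""
    return data.replace(b'\r\n', b'\n')


def convert_file(path):
    """Rewrite the file at path in place with LF line endings."""
    # Binary mode keeps Python from translating the line endings itself.
    with open(path, 'rb') as handle:
        data = handle.read()
    with open(path, 'wb') as handle:
        handle.write(to_unix_line_endings(data))


broken = b'#!/usr/bin/python3\r\nimport sys\r\n'
print(to_unix_line_endings(broken))
```

Calling convert_file('mapper.py') would then fix the script in place. Make a copy first if the original matters.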

6. References

[1] https://www.tutorialspoint.com/hadoop/hadoop_streaming.htm
[2] https://www.tutsmake.com/how-to-install-python-3-10-on-ubuntu-22-04/
[3] https://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
