Hadoop Streaming

HADOOP STREAMING

1. Install Python
# apt update
# apt install python-is-python3
# whereis python3
Or, to install a specific release (here Python 3.11) from the deadsnakes PPA:
# apt update && apt upgrade -y
# apt install software-properties-common -y
# add-apt-repository ppa:deadsnakes/ppa -y
# add-apt-repository ppa:deadsnakes/nightly -y
# apt update
# apt install python3.11
# python3.11 --version
2. Example Using Python WordCount
Mapper Phase Code
Create the file mapper.py and make it executable with chmod +x mapper.py.
#!/usr/bin/python3
"""mapper.py"""

import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
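The mapper logic can also be written as a small generator function, which makes it easy to check without Hadoop or STDIN. This is only a sketch: the name map_words and the sample lines are hypothetical and not part of the original script.

```python
def map_words(lines):
    """Yield (word, 1) for every whitespace-separated word, like mapper.py."""
    for line in lines:
        for word in line.strip().split():
            yield word, 1


# hypothetical sample input standing in for STDIN
sample = ["foo foo quux", "labs foo bar quux"]
for word, count in map_words(sample):
    # same tab-delimited format that mapper.py writes to STDOUT
    print('%s\t%s' % (word, count))
```

Blank or whitespace-only lines yield nothing, because split() with no arguments returns an empty list for them.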
Reducer Phase Code
Create the file reducer.py and make it executable with chmod +x reducer.py.

Compiled by: Lê Thị Minh Châu


#!/usr/bin/python3
"""reducer.py"""

import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()

    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)

    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue

    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
# (testing current_word alone also avoids printing "None 0" on empty input)
if current_word:
    print('%s\t%s' % (current_word, current_count))
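As an alternative to tracking current_word by hand, the same sorted-input assumption lets itertools.groupby do the grouping. This is a sketch of a different technique, not the reducer used in this handout; the name reduce_counts and the sample pairs are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def reduce_counts(pairs):
    """Sum counts per word; pairs must already be sorted by word."""
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# hypothetical (word, count) pairs, sorted by word as Hadoop delivers them
sorted_pairs = [("bar", 1), ("foo", 1), ("foo", 1), ("foo", 1),
                ("labs", 1), ("quux", 1), ("quux", 1)]
for word, total in reduce_counts(sorted_pairs):
    print('%s\t%s' % (word, total))
```

groupby only merges adjacent equal keys, so unsorted input would produce several partial counts for the same word, exactly as with the hand-written reducer.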
3. Run the WordCount program locally
Test the mapper on its own first:
$ echo "foo foo quux labs foo bar quux" | /home/hadoopminhchau/mapper.py


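Then test the full pipeline; sort -k1,1 stands in for the sort by key that Hadoop performs between the map and reduce phases:
$ echo "foo foo quux labs foo bar quux" | /home/hadoopminhchau/mapper.py | sort -k1,1 | /home/hadoopminhchau/reducer.py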

Create a file data.txt containing the input data, then run the same two checks on it:

$ cat ./data.txt | ./mapper.py

$ cat ./data.txt | ./mapper.py | sort -k1,1 | ./reducer.py
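The map, sort, reduce sequence of the shell pipeline can be mimicked in a few lines of pure Python, which helps when predicting what the job should print. This is a minimal sketch; the function name simulate_wordcount is hypothetical, and Python's list sort stands in for sort -k1,1.

```python
def simulate_wordcount(text):
    """Mimic mapper | sort | reducer for a block of text."""
    # map: emit (word, 1) for every word
    pairs = [(word, 1) for line in text.splitlines() for word in line.split()]
    # sort by key, standing in for `sort -k1,1` or Hadoop's shuffle
    pairs.sort(key=lambda pair: pair[0])
    # reduce: sum runs of adjacent equal words
    totals = []
    for word, count in pairs:
        if totals and totals[-1][0] == word:
            totals[-1] = (word, totals[-1][1] + count)
        else:
            totals.append((word, count))
    return totals


for word, total in simulate_wordcount("foo foo quux labs foo bar quux"):
    print('%s\t%s' % (word, total))
```

For the sample sentence this prints bar 1, foo 3, labs 1 and quux 2, one tab-separated pair per line.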


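4. Run the WordCount program on HDFS
Create a directory myinput containing the input data.

Copy the myinput directory to HDFS, for example with hdfs dfs -put ./myinput ./myinput (this assumes your HDFS home directory already exists).

Run the MapReduce job: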


$ hadoop jar hadoop-streaming-3.3.4.jar \
    -file mapper.py -mapper mapper.py \
    -file reducer.py -reducer reducer.py \
    -input ./myinput -output ./myoutput
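If the job is launched from a script rather than typed by hand, the same command can be assembled as an argument list. This is only a sketch: the helper streaming_command is hypothetical, it uses only the options shown in this section, and it prints the command instead of running it.

```python
import shlex


def streaming_command(jar, mapper, reducer, input_dir, output_dir):
    """Build the argument list for the streaming job in this section."""
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", mapper,
        "-file", reducer, "-reducer", reducer,
        "-input", input_dir, "-output", output_dir,
    ]


cmd = streaming_command("hadoop-streaming-3.3.4.jar", "mapper.py",
                        "reducer.py", "./myinput", "./myoutput")
# print only; to launch the job, pass cmd to subprocess.run(cmd, check=True)
print(shlex.join(cmd))
```

Keeping the arguments in a list avoids shell quoting problems if a path ever contains spaces.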


Display the result:
$ hdfs dfs -cat ./myoutput/part-00000

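5. Troubleshooting

If you get the error /usr/bin/env: ‘python\r’: No such file or directory, the script has Windows (CRLF) line endings. Install dos2unix and convert the scripts:

$ sudo apt install dos2unix
$ dos2unix mapper.py reducer.py
If you get the error /usr/bin/python^M: bad interpreter, the cause is the same. Open the file in vim, set the file format to Unix, and save:
$ vim mapper.py
:set ff=unix
:wq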
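The same line-ending conversion can be sketched in Python when neither dos2unix nor vim is at hand. The helper name crlf_to_lf is hypothetical, and the demonstration runs on a throwaway temporary file rather than on a real script.

```python
import os
import tempfile


def crlf_to_lf(path):
    """Rewrite the file at path with Unix (LF) line endings."""
    with open(path, 'rb') as f:
        data = f.read()
    with open(path, 'wb') as f:
        f.write(data.replace(b'\r\n', b'\n'))


# create a temporary script with Windows line endings
with tempfile.NamedTemporaryFile(delete=False, suffix='.py') as tmp:
    tmp.write(b'#!/usr/bin/python3\r\nprint("hi")\r\n')

crlf_to_lf(tmp.name)
with open(tmp.name, 'rb') as f:
    converted = f.read()
os.remove(tmp.name)  # clean up the throwaway file
print(converted)
```

The file is opened in binary mode so that Python does not translate line endings itself while reading or writing.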

