Hadoop Streaming Using Python – Word Count Problem
Last Updated: 19 Jan, 2022
Hadoop Streaming is a utility that ships with Hadoop and lets developers write MapReduce programs in languages other than Java, such as Python, C++, or Ruby. It supports any language that can read from standard input and write to standard output. In this article we use Python with Hadoop Streaming to implement the word count problem, creating mapper.py and reducer.py to perform the map and reduce tasks.
Let’s create one file which contains multiple words that we can count.
Step 1: Create a file with the name word_count_data.txt and add some data to it.
cd Documents/ # to change the directory to /Documents
touch word_count_data.txt # touch is used to create an empty file
nano word_count_data.txt # nano is a command line editor to edit the file
cat word_count_data.txt # cat is used to see the content of the file

Step 2: Create a mapper.py file that implements the mapper logic. It reads data from STDIN, splits each line into words, and outputs every word with a count of 1.
cd Documents/ # to change the directory to /Documents
touch mapper.py # touch is used to create an empty file
cat mapper.py # cat is used to see the content of the file
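Copy the code below into the mapper.py file.
Python3
#!/usr/bin/env python3
import sys

# Read the input line by line from standard input
for line in sys.stdin:
    # Remove surrounding whitespace and split the line into words
    line = line.strip()
    words = line.split()
    for word in words:
        # Emit a tab-separated (word, 1) pair for the reducer
        print('%s\t%s' % (word, 1))
In the program above, the first line starting with #! is known as the shebang. It tells the system which interpreter to use when the script is executed directly.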

Let’s test mapper.py locally to check that it works.
Syntax:
cat <text_data_file> | python <mapper_code_python_file>
Command (in my case):
cat word_count_data.txt | python mapper.py
The mapper prints each word followed by a tab and the count 1, one pair per line.
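As a rough illustration of what that output looks like, the sketch below reimplements the mapper logic as a function and runs it on two made-up lines. The name map_line and the sample text are assumptions for illustration only; they are not part of mapper.py.

```python
def map_line(line):
    """Return the 'word<TAB>1' strings that the mapper logic emits for one line."""
    return ['%s\t%s' % (word, 1) for word in line.strip().split()]


# Hypothetical sample input; the real word_count_data.txt can hold any text.
sample_lines = ["hello hadoop\n", "hello streaming\n"]

for line in sample_lines:
    for pair in map_line(line):
        print(pair)
```

Note that the mapper does no counting of its own: a word that appears twice is simply emitted twice, each time with the count 1. Adding the counts up is left to the reducer.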

Step 3: Create a reducer.py file that implements the reducer logic. It reads the output of mapper.py from STDIN (standard input), aggregates the occurrences of each word, and writes the final result to STDOUT.
cd Documents/ # to change the directory to /Documents
touch reducer.py # touch is used to create an empty file
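Python3
#!/usr/bin/env python3
import sys

current_word = None
current_count = 0

# The input must be sorted by word so that equal words are adjacent
for line in sys.stdin:
    line = line.strip()
    # Parse the tab-separated (word, count) pair produced by mapper.py
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        # Skip lines whose count is not a number
        continue
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # The word has changed, so emit the total for the previous word
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# Emit the total for the last word (nothing is printed for empty input)
if current_word:
    print('%s\t%s' % (current_word, current_count))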
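The same aggregation can also be written with itertools.groupby, which groups adjacent items that share a key. The sketch below is an alternative formulation, not the reducer used in this article; parse_pairs and reduce_counts are hypothetical names, and the sample pairs are made up.

```python
from itertools import groupby
from operator import itemgetter


def parse_pairs(lines):
    """Yield (word, count) tuples from 'word<TAB>count' lines, skipping malformed ones."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep:
            continue  # no tab separator on this line
        try:
            value = int(count)
        except ValueError:
            continue  # the count was not a number
        yield word, value


def reduce_counts(lines):
    """Sum the counts per word; assumes equal words are adjacent (input sorted by word)."""
    for word, group in groupby(parse_pairs(lines), key=itemgetter(0)):
        yield word, sum(count for _, count in group)


# Hypothetical mapper output that has already been sorted by word
sorted_pairs = ["hadoop\t1", "hello\t1", "hello\t1", "streaming\t1"]
for word, total in reduce_counts(sorted_pairs):
    print('%s\t%s' % (word, total))
```

Like the loop-based reducer, this version only merges neighbouring lines, so it relies on the input being sorted by word first.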
Now let’s check that reducer.py works together with mapper.py using the command below.
cat word_count_data.txt | python mapper.py | sort -k1,1 | python reducer.py

The reducer also works correctly on our local system.
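The sort -k1,1 step in the pipeline matters because the reducer only merges consecutive lines that carry the same word. The sketch below mimics the map and reduce logic in-process to show the difference between unsorted and sorted input. The function names and the sample sentence are assumptions for illustration.

```python
def map_words(lines):
    """Mimic the mapper: emit a (word, 1) pair for every word."""
    return [(word, 1) for line in lines for word in line.split()]


def streaming_reduce(pairs):
    """Mimic the reducer: total the counts of consecutive pairs that share a word."""
    results = []
    current_word, current_count = None, 0
    for word, count in pairs:
        if word == current_word:
            current_count += count
        else:
            if current_word is not None:
                results.append((current_word, current_count))
            current_word, current_count = word, count
    if current_word is not None:
        results.append((current_word, current_count))
    return results


# Hypothetical sample text
pairs = map_words(["to be or not to be"])

print(streaming_reduce(pairs))          # unsorted: repeated words stay split
print(streaming_reduce(sorted(pairs)))  # sorted: one total per word
```

Without sorting, "to" and "be" each appear twice in the result with a count of 1. After sorting, each word appears once with its full count. On the cluster, Hadoop performs this sort between the map and reduce phases, so the sort command is only needed for local testing.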
Step 4: Now let’s start all our Hadoop daemons with the below command.
start-dfs.sh
start-yarn.sh

Now create a directory named word_count_in_python in the root of HDFS to store our word_count_data.txt file, using the command below.
hdfs dfs -mkdir /word_count_in_python
Copy word_count_data.txt to this folder in HDFS with the copyFromLocal command.
The syntax to copy files from the local file system to HDFS is given below:
hdfs dfs -copyFromLocal /path1 /path2 ... /pathN /destination
Actual command (in my case):
hdfs dfs -copyFromLocal /home/dikshant/Documents/word_count_data.txt /word_count_in_python

Our data file has now been copied to HDFS. We can verify this with the commands below or by browsing HDFS manually.
hdfs dfs -ls / # list down content of the root directory
hdfs dfs -ls /word_count_in_python # list down content of /word_count_in_python directory

Let’s give executable permission to mapper.py and reducer.py with the command below.
cd Documents/
chmod 777 mapper.py reducer.py # changing the permission to read, write, execute for user, group and others
We can then observe that the file permissions have changed.

Step 5: Download the latest hadoop-streaming jar file and place it somewhere you can easily access. In my case, I am placing it in the Documents folder, where mapper.py and reducer.py are located.
Now let’s run our Python files with the Hadoop Streaming utility as shown below.
hadoop jar /home/dikshant/Documents/hadoop-streaming-2.7.3.jar \
-input /word_count_in_python/word_count_data.txt \
-output /word_count_in_python/output \
-mapper /home/dikshant/Documents/mapper.py \
-reducer /home/dikshant/Documents/reducer.py

In the command above, -output specifies the HDFS location where the output will be stored. In my case, the result is written to /word_count_in_python/output/part-00000. We can check the results by browsing that location in HDFS or with the cat command, as shown below.
hdfs dfs -cat /word_count_in_python/output/part-00000
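Once the output has been printed or saved locally, it can be processed further with ordinary Python. The sketch below assumes the output text is already available as a string and ranks the words by frequency; parse_counts, top_words, and the sample contents are hypothetical.

```python
def parse_counts(text):
    """Turn 'word<TAB>count' lines into a dict that maps each word to an int count."""
    counts = {}
    for line in text.splitlines():
        word, sep, count = line.partition('\t')
        if sep and count.strip().isdigit():
            counts[word] = int(count)
    return counts


def top_words(counts, n=3):
    """Return the n most frequent words, breaking ties alphabetically."""
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:n]


# Hypothetical contents of part-00000
sample_output = "hadoop\t1\nhello\t2\nstreaming\t1\n"
print(top_words(parse_counts(sample_output), n=2))
```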

Basic options that we can use with Hadoop Streaming
Option | Description
-mapper | The command to be run as the mapper
-reducer | The command to be run as the reducer
-input | The DFS input path for the Map step
-output | The DFS output directory for the Reduce step
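To see how these four options fit together, the sketch below assembles the streaming command as an argument list in Python. The helper build_streaming_command is a hypothetical name, and the paths are the ones used earlier in this article; adjust them for your own setup. The sketch only builds and prints the command, it does not run Hadoop.

```python
def build_streaming_command(jar, input_path, output_path, mapper, reducer):
    """Assemble the 'hadoop jar' argument list for a streaming job."""
    return [
        "hadoop", "jar", jar,
        "-input", input_path,
        "-output", output_path,
        "-mapper", mapper,
        "-reducer", reducer,
    ]


command = build_streaming_command(
    jar="/home/dikshant/Documents/hadoop-streaming-2.7.3.jar",
    input_path="/word_count_in_python/word_count_data.txt",
    output_path="/word_count_in_python/output",
    mapper="/home/dikshant/Documents/mapper.py",
    reducer="/home/dikshant/Documents/reducer.py",
)

# Print the command as a single shell-style line for inspection
print(" ".join(command))

# On a machine with Hadoop installed, the list could be passed to
# subprocess.run(command, check=True) to launch the job.
```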