
Word Count using MapReduce on Hadoop

Do you have a lot of text data that requires you to count the
occurrence of every single unique word? If yes, you've got Hadoop to
process this 'Big Data' of yours.

In this article, we'll try our hand at running MapReduce for a word
count problem on Hadoop. So without wasting any further time, let's
begin.

Things we need to take care of before we start

1. Oracle VM VirtualBox:
Installing Hadoop locally can be a lot of pain, and chances are high
that things will go wrong if it isn't done carefully. So, we'll
instead use a virtual machine instance of Cloudera Quickstart that has
Hadoop pre-installed and sleep peacefully at night.

If you've already installed VirtualBox, you're good to go for the next
step. If not, download it from here.

Note: Preferably install VirtualBox version 6.1.18 to avoid any
issues. The program was tested on this version and ran perfectly.

2. Cloudera Quickstart VM 5.4.2:
Download Cloudera Quickstart VM 5.4.2 from this link. It is a
virtual machine instance with Hadoop pre-installed.

Once the download is finished, extract the .zip file and import
the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf file as an
'appliance' into VirtualBox.

To do so, open VirtualBox; click on File on the menu bar and select
the Import Appliance option. Browse to the location where you have
extracted the Cloudera Quickstart VM.

Configure the settings as per your needs or simply keep the default
settings untouched.

Oracle VM VirtualBox
If everything is configured and setup successfully, we’re good to play
the actual game now.

Step 1: Open Cloudera Quickstart VM on VirtualBox.

Cloudera Quickstart VM

Step 2: Create a .txt data file inside the /home/cloudera directory
that will be passed as input to the MapReduce program. For
simplicity, we name it word_count_data.txt.
Text data file

P.S: Ritson is my friend. :)

Step 3: Create Mapper and Reducer files inside
the /home/cloudera directory.

You’ll get them from the following GitHub repository.

www.github.com/NSTiwari/Hadoop-MapReduce-Programs

a) mapper.py
b) reducer.py
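For reference, the logic of the two scripts typically looks like the following sketch. This is my own illustration of a standard Hadoop Streaming word count, not the exact repository code, and the function names map_words and reduce_counts are mine:

```python
def map_words(lines):
    """Mapper: emit one "word<TAB>1" record per word (mapper.py prints these)."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word


def reduce_counts(pairs):
    """Reducer: sum counts per word. Records for the same word must arrive
    adjacent to each other, which Hadoop's shuffle (or `sort -k1,1` in the
    local test below) guarantees."""
    current_word, current_count = None, 0
    for pair in pairs:
        word, _, count = pair.strip().partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield "%s\t%d" % (current_word, current_count)

# In the actual files, mapper.py runs map_words(sys.stdin) and reducer.py
# runs reduce_counts(sys.stdin), printing each record to stdout.
```

The tab-separated "key<TAB>value" output format is what Hadoop Streaming expects from a mapper by default.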
Step 4: Test the MapReduce program locally to check if everything
works properly before running on Hadoop.

Open terminal on Cloudera Quickstart VM instance and run the


following command:
cat word_count_data.txt | python mapper.py | sort -k1,1 |
python reducer.py

Local check of MapReduce

For the above example, the output obtained is exactly the same as
expected.
If you see all the words correctly mapped, sorted and reduced to
their respective counts, then your program is good to be tested on
Hadoop.
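Conceptually, that shell pipeline computes the same thing as this pure-Python sketch (the sample text and variable names here are mine, for illustration): map emits a (word, 1) pair per word, the sort groups equal keys together, and the reduce sums each group.

```python
text = "hello world hello hadoop world hello"

# Map: one (word, 1) pair per word.
pairs = [(word, 1) for word in text.split()]

# Shuffle/sort: bring identical keys next to each other,
# which is exactly what `sort -k1,1` does in the local test.
pairs.sort(key=lambda kv: kv[0])

# Reduce: sum the 1s for each run of identical words.
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one
```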
Step 5: Configure Hadoop services and settings.

Now, we need to configure certain settings on Hadoop before we run
the MapReduce program for word count.

5a: Login to Cloudera Manager
Open a browser on the Cloudera Quickstart VM and
open quickstart.cloudera:7180/cmf/login. Log in by
entering cloudera as both the username and the password.

Note: If you see the error "Unable to connect" while logging in
to quickstart.cloudera:7180/cmf/login, try restarting the CDH
services with the following command:

sudo /home/cloudera/cloudera-manager --express --force

5b: Start HDFS and YARN services.
Click the dropdown arrow and choose the Start option for the HDFS and
YARN services.

Start HDFS and YARN services

You'll see the following if both HDFS and YARN services are started
successfully.

HDFS service started successfully


YARN service started successfully

Step 6: Create a directory on HDFS
Now, we create a directory named word_count_map_reduce on HDFS,
where our input data and its resulting output will be stored.

Use the following command for it:

sudo -u hdfs hadoop fs -mkdir /word_count_map_reduce

Note: If the directory already exists, then either create a directory
with a new name or delete the existing one using the following
commands:

export HADOOP_USER_NAME=hdfs
hdfs dfs -rmr /word_count_map_reduce

List the HDFS directory contents using the following command:

hdfs dfs -ls /
Deleting/Creating a directory on HDFS

Step 7: Copy input data file on HDFS.

Copy the word_count_data.txt file to
the word_count_map_reduce directory on HDFS using the
following command:

sudo -u hdfs hadoop fs -put /home/cloudera/word_count_data.txt /word_count_map_reduce

Check if the file was copied successfully to the desired location:

hdfs dfs -ls /word_count_map_reduce

Input file copied on HDFS successfully

Step 8: Download hadoop-streaming JAR 2.7.3.
Open a browser on the VM, go to this link and download the
hadoop-streaming JAR 2.7.3 file.

Download hadoop-streaming JAR 2.7.3

Once the file is downloaded, unzip it inside
the /home/cloudera directory. Double-check that the JAR file was
unzipped successfully and is present inside
the /home/cloudera directory:

ls

hadoop-streaming-2.7.3.jar downloaded successfully

Step 9: Configure permissions to run MapReduce on Hadoop.

We're almost ready to run our MapReduce job on Hadoop, but before
that, we need to give permission to read, write and execute the
Mapper and Reducer programs on Hadoop.
We also need to give the default user (cloudera) permission to
write the output file inside HDFS.

Run the following commands to do so:

chmod 777 mapper.py reducer.py
sudo -u hdfs hadoop fs -chown cloudera /word_count_map_reduce

Permission granted to read, write and execute files on HDFS

Step 10: Run MapReduce on Hadoop.

We're at the final step of this program. Run the MapReduce job
on Hadoop using the following command:

hadoop jar /home/cloudera/hadoop-streaming-2.7.3.jar \
  -input /word_count_map_reduce/word_count_data.txt \
  -output /word_count_map_reduce/output \
  -mapper /home/cloudera/mapper.py \
  -reducer /home/cloudera/reducer.py
Execute Hadoop streaming for MapReduce

MapReduce job executed

If you see the output on the terminal as shown in the above two
images, then the MapReduce job was executed successfully.

Step 11: Read the MapReduce output.

Now, finally, run the following command to read the output of
MapReduce for the word count of the input data file you created:

hdfs dfs -cat /word_count_map_reduce/output/part-00000

MapReduce output on Hadoop
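Each line of part-00000 is a tab-separated word/count pair, the same format the reducer printed. If you pull the file down to the local filesystem (e.g. with hdfs dfs -get), it can be parsed back into a dictionary like this (the sample lines below are illustrative, not the actual job output):

```python
# Parse Hadoop Streaming word-count output: "word<TAB>count" per line.
sample_output = "hadoop\t2\nhello\t3\nworld\t1\n"

counts = {}
for line in sample_output.splitlines():
    word, _, count = line.partition("\t")
    counts[word] = int(count)
```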

Congratulations, the output for MapReduce on Hadoop is obtained
exactly as expected. All the words in the input data file have been
mapped, sorted and reduced to their respective counts.

If you'd like to talk more on this, feel free to connect with me
on LinkedIn. Till then, adieu.
