Word Count using MapReduce on Hadoop
Do you have a lot of text data in which you need to count the
occurrences of every unique word? If so, Hadoop has got your back
for processing this ‘Big Data’ of yours.
In this article, we’ll try our hand at running a MapReduce job for a
word count problem on Hadoop. So without wasting any more time, let’s
begin.
1. Oracle VM VirtualBox:
Installing Hadoop locally can be painful, and things are likely to go
wrong unless every step is done carefully. So we’ll instead use a
Cloudera Quickstart virtual machine that has Hadoop pre-installed,
and sleep peacefully at night.
Once the download is finished, extract the .zip file and import
the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf file as an
‘appliance’ into VirtualBox.
To do so, open VirtualBox, click File on the menu bar and select
the Import Appliance option. Browse to the location where you
extracted the Cloudera Quickstart VM.
Configure the settings as per your needs or simply keep the default
settings untouched.
Oracle VM VirtualBox
If everything is configured and set up successfully, we’re ready to
play the actual game.
Cloudera Quickstart VM
For the above example, the output obtained is exactly the same as
expected.
If you see all the words correctly mapped, sorted and reduced to
their respective counts, then your program is ready to be tested on
Hadoop.
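The map, sort and reduce stages mentioned above can also be mimicked in a single Python file, which is handy for checking the expected counts before involving Hadoop at all. This is only a minimal sketch and not the Mapper and Reducer programs used in this article: the function names, the sample sentences and the choice to lowercase every word are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input."""
    for line in lines:
        # Lowercasing is an assumption; drop it to count "The" and "the" apart.
        for word in line.strip().lower().split():
            yield word, 1


def reducer(sorted_pairs):
    """Reduce phase: sum the counts of each run of identical words.

    The pairs must already be sorted by word, which is what Hadoop
    provides between the map and reduce phases.
    """
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map, sort and reduce locally, without Hadoop."""
    mapped = mapper(lines)
    shuffled = sorted(mapped, key=itemgetter(0))  # stands in for the shuffle/sort
    return dict(reducer(shuffled))


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    for word, count in sorted(word_count(sample).items()):
        print(f"{word}\t{count}")
```

The explicit sort between the two phases matters: the reducer only adds up neighbouring pairs, so unsorted input would produce several partial counts for the same word.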
Step 5: Configure Hadoop services and settings.
You’ll see the following if both the HDFS and YARN services have
started successfully.
export HADOOP_USER_NAME=hdfs
hdfs dfs -rm -r /word_count_map_reduce
ls
We’re almost ready to run our MapReduce job on Hadoop, but before
that we need to grant read, write and execute permission on the
Mapper and Reducer programs so that Hadoop can run them.
We also need to give the default user (cloudera) permission to
write the output file inside HDFS.
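For the local half of this step, the effect of granting read, write and execute permission on a script can be sketched in Python. This is an illustration under assumptions rather than the command used in this article: make_executable is a hypothetical helper, it only touches the owner's permission bits, and it works on the local filesystem of a POSIX system. The HDFS-side permission for the cloudera user is a separate step that this sketch does not cover.

```python
import os
import stat
import tempfile


def make_executable(path):
    """Add read, write and execute permission for the owner of `path`.

    Roughly what `chmod u+rwx path` does; all other permission bits
    are left unchanged. Returns the resulting permission bits.
    """
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, current | stat.S_IRWXU)
    return stat.S_IMODE(os.stat(path).st_mode)


if __name__ == "__main__":
    # Demonstrate on a throwaway file instead of a real Mapper or Reducer.
    fd, demo_path = tempfile.mkstemp(suffix=".py")
    os.close(fd)
    print(oct(make_executable(demo_path)))
    os.remove(demo_path)
```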
We’re at the final step of this program. Run the MapReduce job
on Hadoop using the following command.
If you see output on the terminal like that shown in the two images
above, then the MapReduce job was executed successfully.
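For readers who want to script the submission, a job built from separate Mapper and Reducer programs is commonly run through Hadoop Streaming. The sketch below assumes that is the case here, which this article does not state, and it only assembles the command line without running it. The jar location, script names and HDFS paths passed in are hypothetical placeholders that you would replace with your own.

```python
import os
import shlex


def build_streaming_command(streaming_jar, mapper_path, reducer_path,
                            hdfs_input, hdfs_output):
    """Assemble (but do not run) a Hadoop Streaming command line."""
    return [
        "hadoop", "jar", streaming_jar,
        # Generic option, so it comes first: ship both scripts to the nodes.
        "-files", f"{mapper_path},{reducer_path}",
        # Shipped files are referenced by their base name inside each task.
        "-mapper", os.path.basename(mapper_path),
        "-reducer", os.path.basename(reducer_path),
        "-input", hdfs_input,
        "-output", hdfs_output,
    ]


if __name__ == "__main__":
    # Every argument below is a hypothetical placeholder.
    command = build_streaming_command(
        "/path/to/hadoop-streaming.jar",
        "mapper.py",
        "reducer.py",
        "/hypothetical/input",
        "/hypothetical/output",
    )
    print(" ".join(shlex.quote(part) for part in command))
```

Returning a list rather than a single string means the command can be handed to subprocess.run unchanged, with no shell quoting to worry about. Hadoop refuses to start a job whose output directory already exists, so that directory has to be removed or renamed between runs.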