
Word Count using MapReduce on Hadoop

Do you have a lot of text data that requires you to count the
occurrence of every single unique word? If yes, you've got Hadoop to
process this 'Big Data' of yours.

In this article, we'll try our hand at running MapReduce for a word
count problem on Hadoop. So without wasting any further time, let's
begin.

Things we need to take care of before we start

1. Oracle VM VirtualBox:
Installing Hadoop locally can be a lot of pain, and chances are high
that things will go wrong if it isn't done carefully. So, we'll
instead use a virtual machine instance of Cloudera Quickstart that has
Hadoop pre-installed and sleep peacefully at night.

If you've already installed VirtualBox, you're good to go for the next
step. If not, download it from here.

Note: Preferably install VirtualBox version 6.1.18 to avoid any
issues. The program was tested on this version and ran perfectly.

2. Cloudera Quickstart VM 5.4.2:
Download Cloudera Quickstart VM 5.4.2 from this link. It is a
virtual machine instance with Hadoop pre-installed.

Once the download is finished, extract the .zip file and import
the cloudera-quickstart-vm-5.4.2-0-virtualbox.ovf file as an
'appliance' into VirtualBox.

To do so, open VirtualBox; click on File on the menu bar and select
the Import Appliance option. Browse to the location where you have
extracted the Cloudera Quickstart VM.

Configure the settings as per your needs or simply keep the default
settings untouched.

Oracle VM VirtualBox
If everything is configured and setup successfully, we’re good to play
the actual game now.

Step 1: Open Cloudera Quickstart VM on VirtualBox.

Cloudera Quickstart VM

Step 2: Create a .txt data file inside the /home/cloudera directory
that will be passed as input to the MapReduce program. For
simplicity, we name it word_count_data.txt.
Text data file

P.S: Ritson is my friend. :)

Step 3: Create Mapper and Reducer files inside
the /home/cloudera directory.

You’ll get them from the following GitHub repository.

www.github.com/NSTiwari/Hadoop-MapReduce-Programs

a) mapper.py
b) reducer.py
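For reference, the logic of the two scripts typically looks like the following sketch. This is my own illustration of a standard Hadoop Streaming word count, not the exact repository code, and the function names map_words and reduce_counts are mine:

```python
def map_words(lines):
    """Mapper: emit one "word<TAB>1" record per word (mapper.py prints these)."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word


def reduce_counts(pairs):
    """Reducer: sum counts per word. Records for the same word must arrive
    adjacent to each other, which Hadoop's shuffle (or `sort -k1,1` in the
    local test below) guarantees."""
    current_word, current_count = None, 0
    for pair in pairs:
        word, _, count = pair.strip().partition("\t")
        if word == current_word:
            current_count += int(count)
        else:
            if current_word is not None:
                yield "%s\t%d" % (current_word, current_count)
            current_word, current_count = word, int(count)
    if current_word is not None:
        yield "%s\t%d" % (current_word, current_count)

# In the actual files, mapper.py runs map_words(sys.stdin) and reducer.py
# runs reduce_counts(sys.stdin), printing each record to stdout.
```

The tab-separated "key<TAB>value" output format is what Hadoop Streaming expects from a mapper by default.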
Step 4: Test the MapReduce program locally to check if everything
works properly before running on Hadoop.

Open terminal on Cloudera Quickstart VM instance and run the


following command:
cat word_count_data.txt | python mapper.py | sort -k1,1 |
python reducer.py

Local check of MapReduce

For the above example, the output obtained is exactly the same as
expected.
If you see all the words correctly mapped, sorted and reduced to
their respective counts, then your program is good to be tested on
Hadoop.
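Conceptually, that shell pipeline computes the same thing as this pure-Python sketch (the sample text and variable names here are mine, for illustration): map emits a (word, 1) pair per word, the sort groups equal keys together, and the reduce sums each group.

```python
text = "hello world hello hadoop world hello"

# Map: one (word, 1) pair per word.
pairs = [(word, 1) for word in text.split()]

# Shuffle/sort: bring identical keys next to each other,
# which is exactly what `sort -k1,1` does in the local test.
pairs.sort(key=lambda kv: kv[0])

# Reduce: sum the 1s for each run of identical words.
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one
```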
Step 5: Configure Hadoop services and settings.

Now, we need to configure certain settings on Hadoop before we run
the MapReduce program for word count.

5a: Login to Cloudera Manager
Open a browser on the Cloudera Quickstart VM and
open quickstart.cloudera:7180/cmf/login. Log in by
entering cloudera as both the username and the password.

Note: If you see the error "Unable to connect" while logging in
to quickstart.cloudera:7180/cmf/login, try restarting the CDH
services with the following command:

sudo /home/cloudera/cloudera-manager --express --force

5b: Start HDFS and YARN services.
Click the dropdown arrow and choose the Start option for the HDFS and
YARN services.

Start HDFS and YARN services

You'll see the following if both HDFS and YARN services are started
successfully.

HDFS service started successfully


YARN service started successfully

Step 6: Create a directory on HDFS
Now, we create a directory named word_count_map_reduce on HDFS,
where our input data and its resulting output will be stored.

Use the following command for it:

sudo -u hdfs hadoop fs -mkdir /word_count_map_reduce

Note: If the directory already exists, then either create a directory
with a new name or delete the existing one using the following
commands:

export HADOOP_USER_NAME=hdfs
hdfs dfs -rmr /word_count_map_reduce

List the HDFS directory contents using the following command:

hdfs dfs -ls /
Deleting/Creating a directory on HDFS

Step 7: Copy input data file on HDFS.

Copy the word_count_data.txt file to
the word_count_map_reduce directory on HDFS using the
following command:

sudo -u hdfs hadoop fs -put /home/cloudera/word_count_data.txt /word_count_map_reduce

Check if the file was copied successfully to the desired location:

hdfs dfs -ls /word_count_map_reduce

Input file copied on HDFS successfully

Step 8: Download hadoop-streaming JAR 2.7.3.
Open a browser on the VM, go to this link and download the
hadoop-streaming JAR 2.7.3 file.

Download hadoop-streaming JAR 2.7.3

Once the file is downloaded, unzip it inside
the /home/cloudera directory. Double-check that the JAR file was
unzipped successfully and is present inside
the /home/cloudera directory:

ls

hadoop-streaming-2.7.3.jar downloaded successfully

Step 9: Configure permissions to run MapReduce on Hadoop.

We're almost ready to run our MapReduce job on Hadoop, but before
that, we need to give permission to read, write and execute the
Mapper and Reducer programs on Hadoop.
We also need to give the default user (cloudera) permission to
write the output file inside HDFS.

Run the following commands to do so:

chmod 777 mapper.py reducer.py
sudo -u hdfs hadoop fs -chown cloudera /word_count_map_reduce

Permission granted to read, write and execute files on HDFS

Step 10: Run MapReduce on Hadoop.

We're at the final step of this program. Run the MapReduce job
on Hadoop using the following command:

hadoop jar /home/cloudera/hadoop-streaming-2.7.3.jar \
  -input /word_count_map_reduce/word_count_data.txt \
  -output /word_count_map_reduce/output \
  -mapper /home/cloudera/mapper.py \
  -reducer /home/cloudera/reducer.py
Execute Hadoop streaming for MapReduce

MapReduce job executed

If you see the output on the terminal as shown in the above two
images, then the MapReduce job was executed successfully.

Step 11: Read the MapReduce output.

Now, finally, run the following command to read the output of
MapReduce for the word count of the input data file you created:

hdfs dfs -cat /word_count_map_reduce/output/part-00000

MapReduce output on Hadoop
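Each line of part-00000 is a tab-separated word/count pair, the same format the reducer printed. If you pull the file down to the local filesystem (e.g. with hdfs dfs -get), it can be parsed back into a dictionary like this (the sample lines below are illustrative, not the actual job output):

```python
# Parse Hadoop Streaming word-count output: "word<TAB>count" per line.
sample_output = "hadoop\t2\nhello\t3\nworld\t1\n"

counts = {}
for line in sample_output.splitlines():
    word, _, count = line.partition("\t")
    counts[word] = int(count)
```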

Congratulations, the output for MapReduce on Hadoop is obtained
exactly as expected. All the words in the input data file have been
mapped, sorted and reduced to their respective counts.

If you'd like to talk more on this, feel free to connect with me
on LinkedIn. Till then, adieu.
