
DS&BDL

Assignment 11

Title: Hadoop MapReduce Framework - WordCount application

Aim:
Write code in Java for a simple WordCount application that counts the number of occurrences
of each word in a given input set using the Hadoop MapReduce framework on a local standalone
setup.

Objective:
By completing this task, students will learn the following:

1. Hadoop Distributed File System (HDFS).

2. MapReduce Framework.

Software/Hardware Requirements: 64-bit open source OS (Linux), Java, Hadoop.

Theory:
Map and Reduce tasks in Hadoop - Within a MapReduce job there are two separate tasks: the map
task and the reduce task.

Map task- A MapReduce job splits the input dataset into independent chunks, known as input splits in
Hadoop, which are processed by the map tasks in a completely parallel manner. The Hadoop framework
creates a separate map task for each input split.

Reduce task- The output of the maps is sorted by the Hadoop framework and then becomes the input
to the reduce tasks.

The Hadoop MapReduce framework operates exclusively on <key, value> pairs. In a MapReduce job, the
input to the map function is a set of <key, value> pairs and the output is also a set of <key, value>
pairs. The output <key, value> pair may have a different type from the input <key, value> pair.

<K1, V1> -> map -> <K2, V2>

The output from the map tasks is sorted by the Hadoop framework; MapReduce guarantees that the
input to every reducer is sorted by key. The input and output of the reduce task can be represented
as follows.

<K2, list(V2)> -> reduce -> <K3, V3>

1. Creating and copying the input file to HDFS


If you already have a file in HDFS that you want to use as input, you can skip this step.
The first step is to create a file that will be used as input and copy it to HDFS.
Let's say you have a file wordcount.txt with the following content.

Hello wordcount MapReduce Hadoop program.
This is my first MapReduce program.
You want to copy this file to the /user/process directory within HDFS. If that path doesn't exist,
you need to create those directories first.
HDFS commands reference list: https://www.netjstech.com/2018/02/hdfs-commands-reference-list.html

hdfs dfs -mkdir -p /user/process

Then copy the file wordcount.txt to this directory.

hdfs dfs -put /netjs/MapReduce/wordcount.txt /user/process
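
You can confirm that the file landed in HDFS by listing the directory:

hdfs dfs -ls /user/process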

2. Writing your WordCount MapReduce code

The WordCount example reads text files and counts the frequency of words. Each mapper
takes a line of the input file as input and breaks it into words. It then emits a key/value pair
for each word (in the form (word, 1)). Each reducer sums the counts for each word and emits
a single key/value pair with the word and its total count.

In the word count MapReduce code there is a Mapper class (MyMapper) with a map function
and a Reducer class (MyReducer) with a reduce function; sketches of both appear after the
corresponding steps below.

1. Map function
Each line from the wordcount.txt file is passed to the map function in the following
format (byte offset of the line within the file, line content):
<0, Hello wordcount MapReduce Hadoop program.>
<41, This is my first MapReduce program.>
In the map function the line is split on spaces and each word is written to the context along
with the value 1.
So the output from the map function for the two lines will be as follows (a Java sketch of this
map function appears after the output below).

Line 1 <key, value> output:
(Hello, 1)
(wordcount, 1)
(MapReduce, 1)
(Hadoop, 1)
(program., 1)

Line 2 <key, value> output:
(This, 1)
(is, 1)
(my, 1)
(first, 1)
(MapReduce, 1)
(program., 1)
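
The map step above can be written as the following minimal sketch, assuming the
org.apache.hadoop.mapreduce API of Hadoop 2.x and the org.netjs package used later in this
assignment. Splitting on a single space mirrors the description above, which is also why
"program." keeps its trailing period in the output.

// File: src/org/netjs/MyMapper.java (class name as referenced in this assignment)
package org.netjs;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Input key is the byte offset of the line, input value is the line itself.
// For every word in the line the mapper emits the pair (word, 1).
public class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the line on spaces and emit (word, 1) for each word.
        for (String token : value.toString().split(" ")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }
}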
2. Shuffling and sorting by the Hadoop framework
The output of the map function doesn't become the input of the reduce function directly. It
goes through shuffling and sorting by the Hadoop framework, in which the data is sorted
and grouped by keys. After this internal processing the data will be in the following
format. This is the input to the reduce function.
<Hadoop, (1)>
<Hello, (1)>
<MapReduce, (1, 1)>
<This, (1)>
<first, (1)>
<is, (1)>
<my, (1)>
<program., (1, 1)>
<wordcount, (1)>
3. How the reduce task works in Hadoop
As we just saw, the input to the reduce task is in the format <key, list(values)>. In the
reduce function, for each input key, iterate over its list of values and add them up; that
sum is the count for the key.
Write the key and the sum of values to the context; that <key, value> pair is the output of
the reduce function. The final output is shown below, followed by a Java sketch of the
reducer and the driver class.
Hadoop 1
Hello 1
MapReduce 2
This 1
first 1
is 1
my 1
program. 2
wordcount 1
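
A matching reducer and driver can be sketched as follows. The reducer iterates the list of
values for each key and sums them, exactly as described above; the driver wires the two
classes into a job. The class names (MyReducer, org.netjs.WordCount) follow the ones used in
this assignment, but the exact job configuration here is an assumption, not prescribed by it.

// File: src/org/netjs/MyReducer.java
package org.netjs;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Receives (word, list of 1s) after the shuffle and sort phase
// and writes (word, total count).
public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();  // add up the 1s emitted by the mappers
        }
        result.set(sum);
        context.write(key, result);
    }
}

// File: src/org/netjs/WordCount.java
package org.netjs;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Driver: configures and submits the job. args[0] is the input
// directory (e.g. /user/process), args[1] is the output directory.
public class WordCount {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}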
3. Creating a jar of your WordCount MapReduce code

You will also need to add at least the following Hadoop jars to your classpath so that your code
can compile. You will find these jars inside the /share/hadoop directory of your Hadoop
installation. Within the /share/hadoop path, look in the hdfs, mapreduce and common directories
for the required jars.

hadoop-common-2.9.0.jar
hadoop-hdfs-2.9.0.jar
hadoop-hdfs-client-2.9.0.jar
hadoop-mapreduce-client-core-2.9.0.jar
hadoop-mapreduce-client-common-2.9.0.jar
hadoop-mapreduce-client-jobclient-2.9.0.jar
hadoop-mapreduce-client-hs-2.9.0.jar
hadoop-mapreduce-client-app-2.9.0.jar
commons-io-2.4.jar

Once you are able to compile your code, you need to create a jar file.

In the Eclipse IDE, right-click on your Java program and select Export - Java - JAR file.
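
If you prefer the command line to Eclipse, the following sketch compiles and packages the
code. It assumes the source files sit under src/org/netjs and that you run it from your Hadoop
installation directory; bin/hadoop classpath prints the paths of the jars listed above, so you
don't have to type them by hand.

mkdir -p classes
javac -cp "$(bin/hadoop classpath)" -d classes src/org/netjs/*.java
jar cf wordcount.jar -C classes .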

4. Running the MapReduce code


You can use the following command to run the program, assuming you are in your Hadoop
installation directory.

bin/hadoop jar /netjs/MapReduce/wordcount.jar org.netjs.WordCount /user/process /user/out
Explanation of the arguments passed is as follows:

/netjs/MapReduce/wordcount.jar is the path to your jar file.

org.netjs.WordCount is the fully qualified name of your Java program class.

/user/process is the path to the input directory.

/user/out is the path to the output directory.

Once your word count MapReduce program has executed successfully, you can verify the output
files.

hdfs dfs -ls /user/out

Found 2 items

-rw-r--r-- 1 netjs supergroup 0 2018-02-27 13:37 /user/out/_SUCCESS

-rw-r--r-- 1 netjs supergroup 77 2018-02-27 13:37 /user/out/part-r-00000

As you can see, the Hadoop framework creates output files using the part-r-xxxxx format. Since
only one reducer is used here, there is only one output file, part-r-00000. You can see the
content of the file using the following command.

hdfs dfs -cat /user/out/part-r-00000

Hadoop 1
Hello 1
MapReduce 2
This 1
first 1
is 1
my 1
program. 2
wordcount 1

Conclusion: In this assignment, we have learned what HDFS is and how the Hadoop
MapReduce framework is used to count the number of occurrences of each word in a given
input set.
