
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Santhibastawad Road, Machhe


Belagavi - 590018, Karnataka, India

BDA ACTIVITY-1 REPORT


on
“MAP REDUCE PROGRAM”

Submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING

For the Academic Year 2020-21


Submitted by

TRISHALA KUMARI 1JS17IS082

Under the Guidance of


Mr Anil BC
Asst. Professor, Dept. of ISE, JSSATE

2020-2021

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING


JSS ACADEMY OF TECHNICAL EDUCATION
JSS Campus, Dr.Vishnuvardhan Road, Bengaluru-560060

TABLE OF CONTENTS

Chapter 1 Introduction

Chapter 2 Architecture

Chapter 3 Inputs and Outputs

Chapter 4 Word Count Example

Chapter 5 Applications

Chapter 6 Conclusion

References

1. INTRODUCTION

Big Data

Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube must collect and manage
on a daily basis falls under the category of Big Data. However, Big Data is not only about scale
and volume; it also involves one or more of the following aspects: Velocity, Variety, Volume,
and Complexity.

Hadoop

Hadoop is a Big Data framework developed by the Apache Software Foundation. It is an open-
source software utility that runs across a network of computers in parallel to store Big Data and
process it using the MapReduce algorithm.

Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and process data. This
traditional model is not suited to processing huge volumes of data, which cannot be
accommodated by standard database servers. Moreover, the centralized system creates too much
of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck problem with an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected in one
place and integrated to form the result dataset.

MapReduce
Nowadays, with the excessive growth of information and data, analyzing it has become a
burdensome challenge. MapReduce is a fault-tolerant, simple, and scalable framework for data
processing that enables its users to process these massive amounts of data. It is a framework for
efficient large-scale data processing, introduced by Google in 2004 to tackle the problem of
processing large amounts of data for Internet-based applications. These large inputs need to be
indexed, stored, retrieved, analyzed, and mined to allow simple and continuous access to the data
and information. MapReduce is one of the forerunners of the so-called "NoSQL" trend, steering
away from mainstream relational databases. Modern organizations and enterprises face four major
challenges around large data: processing, storing, visualizing, and analyzing it. MapReduce can
automatically run applications on a parallel cluster of hardware and, in addition, can process
terabytes and petabytes of data rapidly and efficiently. Its popularity has therefore grown swiftly
across diverse kinds of enterprises in many fields. It provides a highly effective and efficient
framework for the parallel execution of applications, data allocation in distributed database
systems, and fault-tolerant network communication.

The main objective of MapReduce is to facilitate data parallelization, data distribution, and load
balancing in a simple library. The easy availability and accessibility of MapReduce platforms,
such as Hadoop, make them sufficient for productive parallelization and execution of
data-intensive tasks. Programmers who use the MapReduce library must provide two functions: a
Map and a Reduce function. The Map function receives a key/value pair as input and creates
intermediate key/value pairs for further processing. The Reduce function merges all the
intermediate key/value pairs that share a key and then creates the final output. Google's
MapReduce and its open-source implementation, Hadoop, have become highly popular in recent
years. The Apache Hadoop software environment delivers a distributed implementation of data
storage and MapReduce computing. Lately, Apache Hadoop has drawn strong attention due to its
applicability to big data processing; it executes tasks over a cluster of many machines built from
commodity hardware. MapReduce programs in Hadoop are usually written in Java, although
Hadoop also supports stand-alone Map and Reduce kernels, which can be written as shell scripts
or in other languages. Its popularity has increased because of factors such as simplicity, automatic
parallelizability, accepted scalability, and the ability to run on commodity hardware. Large-scale
data-intensive cloud computing with the MapReduce framework is also becoming pervasive in
many academic establishments, enterprises, governments, and industrial organizations. Cloud
computing, as a progressive Internet-based technology, provides a highly available, scalable, and
flexible computing platform for many kinds of applications. However, the MapReduce
programming model used in cloud computing has many limitations; developers have to master
programming with the MapReduce model and spend adequate time understanding the varied
features of different cloud platforms. With automatic parallelization software, the use of cloud
computing will become practically limitless for many applications.
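
To make this Map/Reduce contract concrete, the sketch below shows how the two user-supplied
functions are typically declared in Java. It uses the newer org.apache.hadoop.mapreduce API
(the complete example in Chapter 4 uses the older org.apache.hadoop.mapred API), and the class
names SketchMapper and SketchReducer are purely illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: invoked once per input record; emits zero or more intermediate (key, value) pairs.
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // here the whole record is used as the intermediate key, paired with a count of 1
        context.write(new Text(record.toString()), new IntWritable(1));
    }
}

// Reduce: invoked once per intermediate key with all of that key's values grouped together;
// merges them into the final output for that key.
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        context.write(key, new IntWritable(total));
    }
}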
2. MapReduce Architecture

Google created MapReduce to process large amounts of unstructured or semi-structured data,
such as web documents and logs of web page requests, on large shared-nothing clusters of
commodity nodes. It produces various kinds of derived data, such as inverted indices or URL
access frequencies. MapReduce has three major parts: the Master, the Map function, and the
Reduce function. The Master is responsible for managing the back-end Map and Reduce functions
and supplying data and procedures to them. A MapReduce application contains a workflow of
jobs, where each job involves two user-specified functions: Map and Reduce. The Map function is
applied to each input record and produces a list of intermediate records. The Reduce function
(also called the Reducer) is applied to each group of intermediate records with the same key and
produces a list of output records. A MapReduce program is expected to run on several computers
and nodes when it is executed on Hadoop.

Therefore, a master node runs all the necessary services to organize the communication between
Mappers and Reducers. An input file (or files) is divided into equal-sized parts called input splits.
These are passed to the Mappers, which work in parallel to process the data contained within each
split. As the data is processed, the Mappers partition their output; each Reducer then gathers the
data partition produced for it by each Mapper, merges it, processes it, and produces the output
file. An example of this data flow is shown in the figure below. The main phases of the
MapReduce architecture are the Mapper, the Reducer, and the shuffle, which are described below:
Figure: MapReduce architecture

Mapper: The Mapper processes the input data assigned to it by the master, performs some
computation on this input, and produces intermediate results in the form of key/value pairs.

Reducer: The Reducer processes the data that comes from the Mappers. The Reduce function
receives an intermediate key and the set of values for that key, and combines these values to form
a smaller set of values. After processing, it produces a new set of output, which is stored in HDFS.

Shuffle: After the Map tasks finish, there is usually a large amount of intermediate data to be
moved from all the Map nodes to all the Reduce nodes. In the shuffle phase, this data is transferred
from the Mappers' disks rather than from their main memories, and the intermediate results are
sorted by key so that all pairs with the same key are grouped together before being transferred
from the local Map nodes to the Reduce nodes over the network.
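
As an illustration of this grouping step, the short sketch below simulates it in plain Java; it is not
Hadoop code, and the class name ShuffleSketch and the sample pairs are made up for this report.

import java.util.*;
import java.util.AbstractMap.SimpleEntry;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Intermediate (word, 1) pairs as they might be emitted by two Mappers
        List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
                new SimpleEntry<>("Hello", 1), new SimpleEntry<>("World", 1),
                new SimpleEntry<>("Hello", 1), new SimpleEntry<>("Hadoop", 1));

        // Shuffle: group the values by key; a TreeMap keeps the keys sorted,
        // mirroring the sort-by-key behaviour described above
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Each entry is what a single reduce() call would receive: a key and all its values
        grouped.forEach((word, counts) -> System.out.println(word + " -> " + counts));
        // prints: Hadoop -> [1], Hello -> [1, 1], World -> [1]
    }
}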
3. Inputs and Outputs

The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs
as the output of the job, conceivably of different types. The key and value classes have to be
serializable by the framework and hence need to implement the Writable interface. Additionally,
the key classes have to implement the WritableComparable interface to facilitate sorting by the
framework.

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
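
For the WordCount job listed in Chapter 4, this general flow is instantiated roughly as follows;
the concrete types are taken from that listing (Text implements WritableComparable, and
IntWritable implements Writable):

(input) <LongWritable byte offset, Text line> -> map -> <Text word, IntWritable 1> -> combine -> <Text word, IntWritable partial count> -> reduce -> <Text word, IntWritable total count> (output)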
4. Example: WordCount v1.0

Let us walk through an example MapReduce application to get a feel for how it works.
WordCount is a simple application that counts the number of occurrences of each word in a
given input set. It works with a local (standalone), pseudo-distributed, or fully-distributed
Hadoop installation (Single Node Setup).

The prototypical MapReduce example counts the appearance of each word in a set of documents:
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)

Here, each document is split into words, and each word is counted by the map function, using
the word as the result key. The framework puts together all the pairs with the same key and feeds
them to the same call to reduce. Thus, this function just needs to sum all of its input values to
find the total appearances of that word.
Source Code - WordCount.java
Mapper
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {

  // Mapper: emits the pair (word, 1) for every token in each line of input
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

WordCount - Reducer and driver

  // Reducer (also used as the combiner): sums the counts for each word
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  // Driver: configures the job and submits it to the cluster
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // the Reducer doubles as a local combiner
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // runs the job and waits for it to finish
  }
}
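
To run this example, the class is typically compiled against the Hadoop libraries, packaged into a
jar, and submitted with the hadoop jar command, for instance along the lines of
hadoop jar wordcount.jar org.myorg.WordCount <input path> <output path>; the exact compilation
steps and jar name depend on the installation, and the two path arguments correspond to args[0]
and args[1] in main above.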
Figure: WordCount workflow

Walk-through
The WordCount application is quite straightforward. The Mapper implementation, via its map
method, processes one line at a time, as provided by the specified TextInputFormat. It then splits
the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair
of < <word>, 1>.

For a sample input of two files containing the lines "Hello World Bye World" and
"Hello Hadoop Goodbye Hadoop" respectively, the first map emits:


< Hello, 1>

< World, 1>

< Bye, 1>

< World, 1>


The second map emits:

< Hello, 1>

< Hadoop, 1>

< Goodbye, 1>

< Hadoop, 1>

WordCount also specifies a combiner. Hence, the output of each map is passed through the local
combiner (which, as per the job configuration, is the same class as the Reducer) for local
aggregation, after being sorted on the keys.

The output of the first map:

< Bye, 1>

< Hello, 1>

< World, 2>

The output of the second map:

< Goodbye, 1>

< Hadoop, 2>

< Hello, 1>

The Reducer implementation, via its reduce method, simply sums up the values, which are the
occurrence counts for each key (i.e. the words in this example).
Thus the output of the job is:

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>


5. MapReduce Applications

MapReduce facilitates many data-parallel applications. It is a key component of many important
applications and can improve system parallelism, and it has received considerable attention for
data-intensive and computation-intensive applications on machine clusters. It is used as an
efficient distributed computation tool for a variety of problems, e.g. search, clustering, log
analysis, different types of join operations, matrix multiplication, pattern matching, and analysis
of social networks, and it allows researchers to investigate problems in many different domains.
MapReduce is used in many big data applications, such as short message mining, genetic
algorithms, k-means clustering, DNA fragment analysis, intelligent transportation systems,
healthcare scientific applications, fuzzy rule-based classification systems, heterogeneous
environments, cuckoo search, extreme learning machines, Random Forest, energy proportionality,
mobile sensor data, the semantic web, and many more. (One iterative variant, iMapReduce, has
been reported to run up to 5 times faster than traditional Hadoop MapReduce for various iterative
applications, although it is not suitable for all kinds of computations.) This section therefore gives
a brief analysis of these applications and a brief look at the various domains in which MapReduce
is applied.
6. Conclusion

MapReduce has shown efficiency and scalability in most of the studies surveyed. It is used for
generating and processing big data in many different applications. The purpose of this report is to
review MapReduce, its architecture, and big data; the appropriate use of the programming model,
together with the applications of MapReduce to big data, has been discussed thoroughly. We have
also surveyed and analyzed implementations of MapReduce. The applications of the MapReduce
framework in different contexts, such as the cloud, multi-core systems, and parallel computation,
have been examined. The report categorizes a number of applications surveyed in the MapReduce
framework, based on graph processing, joins and parallel queries, optimizing frameworks,
multi-core systems, and data allocation. The goal of the MapReduce framework is to provide an
abstraction layer between fault tolerance, data distribution, and other parallel-system tasks on one
side and the implementation details of the specific algorithm on the other. Clearly, the
requirements of MapReduce applications are growing rapidly. This survey gives the reader a
general review of MapReduce applications and should serve as an accessible introductory
reference.
REFERENCES

 https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
 Wang, B., Huang, S., Qiu, J., Liu, Y., Wang, G.: Parallel online sequential extreme learning
machine based on MapReduce. Neurocomputing 149, 224–232 (2015)
 Marozzo, F., Talia, D., Trunfio, P.: P2P-MapReduce: parallel data processing in dynamic
Cloud environments. J. Comput. Syst. Sci. 78, 1382–1402 (2012)
 Mohamed, H., Marchand-Maillet, S.: MRO-MPI: MapReduce overlapping using MPI and an
optimized data exchange policy. Parallel Comput. 39, 851–866 (2013)
