
VISVESVARAYA TECHNOLOGICAL UNIVERSITY

Jnana Sangama, Santhibastawad Road, Machhe


Belagavi - 590018, Karnataka, India

BDA ACTIVITY-1 REPORT


on
“MAP REDUCE PROGRAM”

Submitted in partial fulfilment of the requirements for the award of the degree of

BACHELOR OF ENGINEERING
IN
INFORMATION SCIENCE AND ENGINEERING

For the Academic Year 2020-21


Submitted by

TRISHALA KUMARI 1JS17IS082

Under the Guidance of


Mr Anil BC
Asst. Professor, Dept. of ISE, JSSATE

2020-2021

DEPARTMENT OF INFORMATION SCIENCE AND ENGINEERING


JSS ACADEMY OF TECHNICAL EDUCATION
JSS Campus, Dr.Vishnuvardhan Road, Bengaluru-560060

TABLE OF CONTENTS

Chapter 1 Introduction

Chapter 2 Architecture

Chapter 3 Inputs and Outputs

Chapter 4 Word Count Example

Chapter 5 Applications

Chapter 6 Conclusion

References

1. INTRODUCTION

Big Data

Big Data is a collection of large datasets that cannot be processed using traditional computing
techniques. For example, the volume of data that Facebook or YouTube must collect and manage
on a daily basis falls under the category of Big Data. However, Big Data is not only about scale
and volume; it also involves one or more of the following aspects: Velocity, Variety, Volume,
and Complexity.

Hadoop

Hadoop is a Big Data framework developed by the Apache Software Foundation. It is an open-
source software utility that runs across a network of computers in parallel to store Big Data and
process it using the MapReduce algorithm.

Why MapReduce?
Traditional enterprise systems normally have a centralized server to store and process data. This
traditional model is not suited to processing huge volumes of data, which cannot be
accommodated by standard database servers. Moreover, the centralized system creates too much
of a bottleneck while processing multiple files simultaneously.
Google solved this bottleneck problem with an algorithm called MapReduce. MapReduce divides a
task into small parts and assigns them to many computers. Later, the results are collected in one
place and integrated to form the result dataset.

MapReduce
Nowadays, with the excessive growth of information and data, analyzing it has become a
burdensome challenge. MapReduce is a fault-tolerant, simple, and scalable framework for data
processing that enables its users to process these massive amounts of data. It is a framework for
efficient large-scale data processing, introduced by Google in 2004 to tackle the problem of
processing large amounts of data for Internet-based applications. These large inputs need to be
indexed, stored, retrieved, analyzed, and mined to allow simple and continuous access to the data
and information. MapReduce is one of the forerunners of the so-called "NoSQL" trend, steering
away from mainstream relational databases. Modern organizations and enterprises face four major
challenges around large data: processing, storing, visualizing, and analyzing it. MapReduce can
automatically run applications on a parallel cluster of hardware and, in addition, can process
terabytes and petabytes of data rapidly and efficiently. Its popularity has therefore grown swiftly
across diverse kinds of enterprises in many fields. It provides a highly effective and efficient
framework for the parallel execution of applications, data allocation in distributed database
systems, and fault-tolerant network communication.

The main objective of MapReduce is to facilitate data parallelization, data distribution, and load
balancing in a simple library. The easy availability and accessibility of MapReduce platforms,
such as Hadoop, make them sufficient for productive parallelization and execution of
data-intensive tasks. Programmers who use the MapReduce library must provide two functions: a
Map and a Reduce function. The Map function receives a key/value pair as input and creates
intermediate key/value pairs for further processing. The Reduce function merges all the
intermediate key/value pairs that share a key and then creates the final output. Google's
MapReduce and its open-source implementation, Hadoop, have become highly popular in recent
years. The Apache Hadoop software environment delivers a distributed implementation of data
storage and MapReduce computing. Lately, Apache Hadoop has drawn strong attention due to its
applicability to big data processing; it executes tasks over a cluster of many machines built from
commodity hardware. MapReduce programs in Hadoop are usually written in Java, although
Hadoop also supports stand-alone Map and Reduce kernels, which can be written as shell scripts
or in other languages. Its popularity has increased because of factors such as simplicity, automatic
parallelizability, accepted scalability, and the ability to run on commodity hardware. Large-scale
data-intensive cloud computing with the MapReduce framework is also becoming pervasive in
many academic establishments, enterprises, governments, and industrial organizations. Cloud
computing, as a progressive Internet-based technology, provides a highly available, scalable, and
flexible computing platform for many kinds of applications. However, the MapReduce
programming model used in cloud computing has many limitations; developers have to master
programming with the MapReduce model and spend adequate time understanding the varied
features of different cloud platforms. With automatic parallelization software, the use of cloud
computing will become practically limitless for many applications.
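
To make this Map/Reduce contract concrete, the sketch below shows how the two user-supplied
functions are typically declared in Java. It uses the newer org.apache.hadoop.mapreduce API
(the complete example in Chapter 4 uses the older org.apache.hadoop.mapred API), and the class
names SketchMapper and SketchReducer are purely illustrative.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: invoked once per input record; emits zero or more intermediate (key, value) pairs.
class SketchMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // here the whole record is used as the intermediate key, paired with a count of 1
        context.write(new Text(record.toString()), new IntWritable(1));
    }
}

// Reduce: invoked once per intermediate key with all of that key's values grouped together;
// merges them into the final output for that key.
class SketchReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int total = 0;
        for (IntWritable v : values) {
            total += v.get();
        }
        context.write(key, new IntWritable(total));
    }
}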
2. MapReduce Architecture

Google created MapReduce to process large amounts of unstructured or semi-structured data,
such as web documents and logs of web page requests, on large shared-nothing clusters of
commodity nodes. It produces various kinds of derived data, such as inverted indices or URL
access frequencies. MapReduce has three major parts: the Master, the Map function, and the
Reduce function. The Master is responsible for managing the back-end Map and Reduce functions
and supplying data and procedures to them. A MapReduce application contains a workflow of
jobs, where each job involves two user-specified functions: Map and Reduce. The Map function is
applied to each input record and produces a list of intermediate records. The Reduce function
(also called the Reducer) is applied to each group of intermediate records with the same key and
produces a list of output records. A MapReduce program is expected to run on several computers
and nodes when it is executed on Hadoop.

Therefore, a master node runs all the necessary services to organize the communication between
Mappers and Reducers. An input file (or files) is divided into equal-sized parts called input splits.
These are passed to the Mappers, which work in parallel to process the data contained within each
split. As the data is processed, the Mappers partition their output; each Reducer then gathers the
data partition produced for it by each Mapper, merges it, processes it, and produces the output
file. An example of this data flow is shown in the figure below. The main phases of the
MapReduce architecture are the Mapper, the Reducer, and the shuffle, which are described below:
Figure: MapReduce architecture

Mapper: The Mapper processes the input data assigned to it by the master, performs some
computation on this input, and produces intermediate results in the form of key/value pairs.

Reducer: The Reducer processes the data that comes from the Mappers. The Reduce function
receives an intermediate key and the set of values for that key, and combines these values to form
a smaller set of values. After processing, it produces a new set of output, which is stored in HDFS.

Shuffle: After the Map tasks finish, there is usually a large amount of intermediate data to be
moved from all the Map nodes to all the Reduce nodes. In the shuffle phase, this data is transferred
from the Mappers' disks rather than from their main memories, and the intermediate results are
sorted by key so that all pairs with the same key are grouped together before being transferred
from the local Map nodes to the Reduce nodes over the network.
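
As an illustration of this grouping step, the short sketch below simulates it in plain Java; it is not
Hadoop code, and the class name ShuffleSketch and the sample pairs are made up for this report.

import java.util.*;
import java.util.AbstractMap.SimpleEntry;

public class ShuffleSketch {
    public static void main(String[] args) {
        // Intermediate (word, 1) pairs as they might be emitted by two Mappers
        List<Map.Entry<String, Integer>> mapOutput = Arrays.asList(
                new SimpleEntry<>("Hello", 1), new SimpleEntry<>("World", 1),
                new SimpleEntry<>("Hello", 1), new SimpleEntry<>("Hadoop", 1));

        // Shuffle: group the values by key; a TreeMap keeps the keys sorted,
        // mirroring the sort-by-key behaviour described above
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : mapOutput) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Each entry is what a single reduce() call would receive: a key and all its values
        grouped.forEach((word, counts) -> System.out.println(word + " -> " + counts));
        // prints: Hadoop -> [1], Hello -> [1, 1], World -> [1]
    }
}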
3. Inputs and Outputs

The MapReduce framework operates exclusively on <key, value> pairs; that is, the framework
views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs
as the output of the job, conceivably of different types. The key and value classes have to be
serializable by the framework and hence need to implement the Writable interface. Additionally,
the key classes have to implement the WritableComparable interface to facilitate sorting by the
framework.

Input and Output types of a MapReduce job:

(input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
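
For the WordCount job listed in Chapter 4, this general flow is instantiated roughly as follows;
the concrete types are taken from that listing (Text implements WritableComparable, and
IntWritable implements Writable):

(input) <LongWritable byte offset, Text line> -> map -> <Text word, IntWritable 1> -> combine -> <Text word, IntWritable partial count> -> reduce -> <Text word, IntWritable total count> (output)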
4. Example: WordCount v1.0

Let us walk through an example MapReduce application to get a feel for how it works.
WordCount is a simple application that counts the number of occurrences of each word in a
given input set. It works with a local (standalone), pseudo-distributed, or fully-distributed
Hadoop installation (Single Node Setup).

The prototypical MapReduce example counts the appearance of each word in a set of documents:
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit (w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += ParseInt(pc)
    emit (word, sum)

Here, each document is split into words, and each word is counted by the map function, using
the word as the result key. The framework puts together all the pairs with the same key and feeds
them to the same call to reduce. Thus, this function just needs to sum all of its input values to
find the total appearances of that word.
Source Code - WordCount.java
Mapper
package org.myorg;
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
public class WordCount {

  // Mapper: emits the pair (word, 1) for every token in each line of input
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output,
                    Reporter reporter) throws IOException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        output.collect(word, one);
      }
    }
  }

WordCount - Reducer and driver

  // Reducer (also used as the combiner): sums the counts for each word
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output,
                       Reporter reporter) throws IOException {
      int sum = 0;
      while (values.hasNext()) {
        sum += values.next().get();
      }
      output.collect(key, new IntWritable(sum));
    }
  }
  // Driver: configures the job and submits it to the cluster
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class); // the Reducer doubles as a local combiner
    conf.setReducerClass(Reduce.class);
    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    JobClient.runJob(conf); // runs the job and waits for it to finish
  }
}
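
To run this example, the class is typically compiled against the Hadoop libraries, packaged into a
jar, and submitted with the hadoop jar command, for instance along the lines of
hadoop jar wordcount.jar org.myorg.WordCount <input path> <output path>; the exact compilation
steps and jar name depend on the installation, and the two path arguments correspond to args[0]
and args[1] in main above.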
Figure: WordCount workflow

Walk-through
The WordCount application is quite straightforward. The Mapper implementation, via its map
method, processes one line at a time, as provided by the specified TextInputFormat. It then splits
the line into tokens separated by whitespace, via the StringTokenizer, and emits a key-value pair
of < <word>, 1>.

For a sample input of two files containing the lines "Hello World Bye World" and
"Hello Hadoop Goodbye Hadoop" respectively, the first map emits:


< Hello, 1>

< World, 1>

< Bye, 1>

< World, 1>


The second map emits:

< Hello, 1>

< Hadoop, 1>

< Goodbye, 1>

< Hadoop, 1>

WordCount also specifies a combiner. Hence, the output of each map is passed through the local
combiner (which, as per the job configuration, is the same class as the Reducer) for local
aggregation, after being sorted on the keys.

The output of the first map:

< Bye, 1>

< Hello, 1>

< World, 2>

The output of the second map:

< Goodbye, 1>

< Hadoop, 2>

< Hello, 1>

The Reducer implementation, via its reduce method, simply sums up the values, which are the
occurrence counts for each key (i.e. the words in this example).
Thus the output of the job is:

< Bye, 1>

< Goodbye, 1>

< Hadoop, 2>

< Hello, 2>

< World, 2>


5. MapReduce Applications

MapReduce facilitates many data-parallel applications. It is a key component of many important
applications and can improve system parallelism, and it has received considerable attention for
data-intensive and computation-intensive applications on machine clusters. It is used as an
efficient distributed computation tool for a variety of problems, e.g. search, clustering, log
analysis, different types of join operations, matrix multiplication, pattern matching, and analysis
of social networks, and it allows researchers to investigate problems in many different domains.
MapReduce is used in many big data applications, such as short message mining, genetic
algorithms, k-means clustering, DNA fragment analysis, intelligent transportation systems,
healthcare scientific applications, fuzzy rule-based classification systems, heterogeneous
environments, cuckoo search, extreme learning machines, Random Forest, energy proportionality,
mobile sensor data, the semantic web, and many more. (One iterative variant, iMapReduce, has
been reported to run up to 5 times faster than traditional Hadoop MapReduce for various iterative
applications, although it is not suitable for all kinds of computations.) This section therefore gives
a brief analysis of these applications and a brief look at the various domains in which MapReduce
is applied.
6. Conclusion

MapReduce has shown efficiency and scalability in most of the studies surveyed. It is used for
generating and processing big data in many different applications. The purpose of this report is to
review MapReduce, its architecture, and big data; the appropriate use of the programming model,
together with the applications of MapReduce to big data, has been discussed thoroughly. We have
also surveyed and analyzed implementations of MapReduce. The applications of the MapReduce
framework in different contexts, such as the cloud, multi-core systems, and parallel computation,
have been examined. The report categorizes a number of applications surveyed in the MapReduce
framework, based on graph processing, joins and parallel queries, optimizing frameworks,
multi-core systems, and data allocation. The goal of the MapReduce framework is to provide an
abstraction layer between fault tolerance, data distribution, and other parallel-system tasks on one
side and the implementation details of the specific algorithm on the other. Clearly, the
requirements of MapReduce applications are growing rapidly. This survey gives the reader a
general review of MapReduce applications and should serve as an accessible introductory
reference.
REFERENCES

 https://www.tutorialspoint.com/hadoop/hadoop_mapreduce.htm
 Wang, B., Huang, S., Qiu, J., Liu, Y., Wang, G.: Parallel online sequential extreme learning
machine based on MapReduce. Neurocomputing 149, 224–232 (2015)
 Marozzo, F., Talia, D., Trunfio, P.: P2P-MapReduce: parallel data processing in dynamic
Cloud environments. J. Comput. Syst. Sci. 78, 1382–1402 (2012)
 Mohamed, H., Marchand-Maillet, S.: MRO-MPI: MapReduce overlapping using MPI and an
optimized data exchange policy. Parallel Comput. 39, 851–866 (2013)
