MapReduce
MapReduce is the key algorithm that the Hadoop MapReduce engine uses to distribute work around a cluster.
The Map
A map transform is provided to transform an input data row of key and value into an output list of key/value pairs:
map(key1, value1) -> list<key2, value2>
That is, for a single input pair it returns a list containing zero or more (key, value) pairs.
The Reduce
A reduce transform is provided to take all values for a specific key and generate a new list of the reduced output:
reduce(key2, list<value2>) -> list<value3>
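To make the contract concrete, here is a minimal sketch of both transforms as plain Java methods, using word counting as a hypothetical example (the class and method names are illustrative, not part of any Hadoop API):

import java.util.*;

public class MapReduceContract {

    // map(key1, value1) -> list<(key2, value2)>
    // Here key1 is a byte offset, value1 is a line of text, and each output pair is (word, 1).
    static List<Map.Entry<String, Integer>> map(long offset, String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // reduce(key2, list<value2>) -> list<value3>
    // All the 1s emitted for a word are summed into a single count.
    static List<Integer> reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return Collections.singletonList(sum);
    }
}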
A distributed filesystem spreads multiple copies of the data across different machines. This not only offers reliability without the need for RAID-controlled disks, it also offers multiple locations to run the mapping. If a machine with one copy of the data is busy or offline, another machine can be used.
A job scheduler (in Hadoop, the JobTracker) keeps track of which MR jobs are executing, schedules individual Maps, Reduces or intermediate merging operations to specific machines, monitors the success and failure of these individual Tasks, and works to complete the entire batch job.
The filesystem and job scheduler are both accessible to the people and programs that wish to read and write data, and to submit and monitor MR jobs.
Apache Hadoop is such a MapReduce engine. It provides its own distributed filesystem and runs [HadoopMapReduce] jobs on servers near the data stored on that filesystem, or on any other supported filesystem, of which there is more than one.
Limitations
For maximum parallelism, you need the Maps and Reduces to be stateless, to not depend on any data generated in the same MapReduce job.
You cannot control the order in which the maps run, or the reductions.
It is very inefficient if you are repeating similar searches again and again. A database with an index will always be faster than running an MR job
over unindexed data. However, if that index needs to be regenerated whenever data is added, and data is being added continually, MR jobs may
have an edge. That inefficiency can be measured in both CPU time and power consumed.
In the Hadoop implementation Reduce operations do not take place until all the Maps are complete (or have failed and been skipped). As a result,
you do not get any data back until the entire mapping has finished.
There is a general assumption that the output of the reduce is smaller than the input to the Map. That is, you are taking a large datasource and
generating smaller final values.
It is not a silver bullet for all problems of scale, just a good technique for working on large sets of data when you can work on small pieces of that dataset in parallel.
HadoopMapReduce
How Map and Reduce operations are actually carried out
Introduction
This document describes how MapReduce operations are carried out in Hadoop. If you are not familiar with the Google MapReduce programming model
you should get acquainted with it first.
Map
Because the Map operation is parallelized, the input file set is first split into several pieces called FileSplits. If an individual file is so large that it would affect seek time, it is split into several FileSplits. The splitting does not know anything about the input file's internal logical structure; for example, line-oriented text files are split on arbitrary byte boundaries. A new map task is then created per FileSplit.
When an individual map task starts, it will open a new output writer per configured reduce task. It will then proceed to read its FileSplit using the RecordReader it gets from the specified InputFormat. The InputFormat parses the input and generates key-value pairs. The InputFormat must also handle records that may be split on the FileSplit boundary. For example, TextInputFormat will read the last line of the FileSplit past the split boundary and, when reading anything other than the first FileSplit, it ignores the content up to the first newline.
It is not necessary for the InputFormat to generate both meaningful keys and values. For example, the default output from TextInputFormat consists of input lines as values and, somewhat meaninglessly, the byte offsets at which those lines start as keys; most applications only use the lines and ignore the offsets.
As key-value pairs are read from the RecordReader they are passed to the configured Mapper. The user supplied Mapper does whatever it wants with the
input pair and calls OutputCollector.collect with key-value pairs of its own choosing. The output it generates must use one key class and one value class.
This is because the Map output will be written into a SequenceFile, which has per-file type information, so all the records must have the same type (use subclassing if you want to output different data structures). The Map input and output key-value pairs are not necessarily related, either in type or in cardinality.
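For instance, with the org.apache.hadoop.mapreduce API the map output types are declared separately from the job's final output types when they differ (a sketch, assuming a Job instance named job as in the WordCount example later in this document; the concrete classes are illustrative):

// Types of the intermediate (map output / reduce input) pairs.
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);
// Types of the final pairs written by the reduce.
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(LongWritable.class);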
When Mapper output is collected it is partitioned, which means that it will be written to the output specified by the Partitioner. The default HashPartitioner uses the hashCode method of the key class (which means that this hash function must distribute keys well in order to achieve an even workload across the reduce tasks). See MapTask for details.
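If the default routing is not appropriate, a custom Partitioner can be supplied. A minimal sketch with the org.apache.hadoop.mapreduce API (the class name and the key/value types are illustrative):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Routes each key to a reduce task. This sketch mimics the default hash-based
// partitioning, but it could use any property of the key or value.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the partition number is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It would be registered on the job with job.setPartitionerClass(WordPartitioner.class).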
N input files will generate M map tasks to be run and each map task will generate as many output files as there are reduce tasks configured in the system.
Each output file will be targeted at a specific reduce task and the map output pairs from all the map tasks will be routed so that all pairs for a given key end
up in files targeted at a specific reduce task.
Combine
When the map operation outputs its pairs they are already available in memory. For efficiency reasons, sometimes it makes sense to take advantage of
this fact by supplying a combiner class to perform a reduce-type function. If a combiner is used, the map key-value pairs are not immediately written to the output. Instead they are collected in lists, one list per key. When a certain number of key-value pairs has been collected, this buffer is flushed by passing all the values for each key to the combiner's reduce method and outputting the key-value pairs of the combine operation as if they had been created by the original map operation.
For example, a word count MapReduce application whose map operation outputs (word, 1) pairs as words are encountered in the input can use a
combiner to speed up processing. A combine operation will start gathering the output in in-memory lists (instead of on disk), one list per word. Once a
certain number of pairs is output, the combine operation will be called once per unique word with the list available as an iterator. The combiner then emits (
word, count-in-this-part-of-the-input) pairs. From the viewpoint of the Reduce operation this contains the same information as the original Map output, but
there should be far fewer pairs output to disk and read from disk.
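In the WordCount example later in this document, the reducer itself can serve as the combiner, because summing partial counts and then summing those partial sums gives the same total. Wiring it in is one call (a sketch, assuming a Job named job as in the WordCount source below):

job.setCombinerClass(Reduce.class);   // run the Reduce class over buffered map output before it is written out

Note that the framework may run the combiner zero, one, or many times over any subset of the map output, so the combine function must not change the result when applied repeatedly.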
Reduce
When a reduce task starts, its input is scattered in many files across all the nodes where map tasks ran. If run in distributed mode these need to be first
copied to the local filesystem in a copy phase (see ReduceTaskRunner).
Once all the data is available locally it is appended to one file in an append phase. The file is then merge sorted so that the key-value pairs for a given key
are contiguous (sort phase). This makes the actual reduce operation simple: the file is read sequentially and the values are passed to the reduce method
with an iterator reading the input file until the next key value is encountered. See ReduceTask for details.
At the end, the output will consist of one output file per executed reduce task. The format of the files can be specified with JobConf.setOutputFormat. If SequenceFileOutputFormat is used, the output key and value classes must also be specified.
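For example, with the older org.apache.hadoop.mapred API this might look like the following sketch (conf is an org.apache.hadoop.mapred.JobConf, and the Text/IntWritable classes are illustrative):

// Write the reduce output as a SequenceFile; the key and value classes must then be declared.
conf.setOutputFormat(SequenceFileOutputFormat.class);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(IntWritable.class);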
HowManyMapsAndReduces
Partitioning your job into maps and reduces
Picking the appropriate size for the tasks of your job can radically change the performance of Hadoop. Increasing the number of tasks increases the framework overhead, but also improves load balancing and lowers the cost of failures. At one extreme is the 1 map / 1 reduce case, where nothing is distributed. The other extreme is 1,000,000 maps / 1,000,000 reduces, where the framework runs out of resources for the overhead.
Number of Maps
The number of maps is usually driven by the number of DFS blocks in the input files, which leads people to adjust their DFS block size in order to adjust the number of maps. The right level of parallelism for maps seems to be around 10-100 maps per node, although we have taken it up to 300 or so for very CPU-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.
Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default
InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input
files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input
data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the
number of maps.
The number of map tasks can also be increased manually using the JobConf's conf.setNumMapTasks(int num). This can be used to increase the number
of map tasks, but will not set the number below that which Hadoop determines via splitting the input data.
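A rough sketch of these knobs with the JobConf API mentioned above (the MyJob class and the concrete numbers are only placeholders):

JobConf conf = new JobConf(MyJob.class);                   // MyJob is a hypothetical driver class
conf.setNumMapTasks(500);                                  // a hint to the InputFormat, not a guarantee
conf.setLong("mapred.min.split.size", 64 * 1024 * 1024);   // never create a split smaller than 64 MB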
Number of Reduces
The ideal number of reducers should be the optimal value that gets them closest to: a multiple of the block size, a task time between five and fifteen minutes, and the fewest output files possible.
Anything other than that means there is a good chance your reducers are less than great. There is a tremendous tendency for users to use a REALLY high value ("More parallelism means faster!") or a REALLY low value ("I don't want to blow my namespace quota!"). Both are equally dangerous, resulting in one or more of: terrible performance in the next phase of the workflow, terrible shuffle performance, a NameNode overloaded with objects that are ultimately useless, and disk and network I/O wasted for no sane reason.
Now, there are always exceptions and special cases. One particular special case is that if following this advice makes the next step in the workflow do ridiculous things, then we likely need to 'be an exception' to the above general rules of thumb.
Currently the number of reduces is limited to roughly 1000 by the buffer size for the output files (io.buffer.size * 2 * numReduces << heapSize). This will be
fixed at some point, but until it is it provides a pretty firm upper bound.
The number of reduce tasks can also be increased in the same way as the map tasks, via JobConf's conf.setNumReduceTasks(int num).
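Continuing the JobConf sketch above:

conf.setNumReduceTasks(32);   // unlike the map hint, this is exact: 32 reduce tasks, 32 output files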
WordCount
WordCount Example
The WordCount example reads text files and counts how often words occur. The input is text files and the output is text files, each line of which contains a word and the count of how often it occurred, separated by a tab.
Each mapper takes a line as input and breaks it into words. It then emits a key/value pair of the word and 1. Each reducer sums the counts for each word
and emits a single key/value with the word and sum.
As an optimization, the reducer is also used as a combiner on the map outputs. This reduces the amount of data sent across the network by combining the counts for each word into a single record per map output.
bin/hadoop jar hadoop-*-examples.jar wordcount [-m <#maps>] [-r <#reducers>] <in-dir> <out-dir>
All of the files in the input directory (called in-dir in the command line above) are read and the counts of the words in the input are written to the output directory (called out-dir above). It is assumed that both inputs and outputs are stored in HDFS (see ImportantConcepts). If your input is not already in HDFS, but is rather in a local filesystem somewhere, you need to copy the data into HDFS using a command like this:
bin/hadoop dfs -copyFromLocal <local-dir> <hdfs-dir>
The Java source of the example is as follows:
import java.io.IOException;
import java.util.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  // Tokenizes each input line and emits a (word, 1) pair per word.
  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums the counts collected for each word and emits (word, total).
  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");
    job.setJarByClass(WordCount.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setCombinerClass(Reduce.class);   // the reducer doubles as a combiner
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}
Sort
Sort Example
The Sort example simply uses the map/reduce framework to sort the input directory into the output directory. The inputs and outputs must be SequenceFiles where the keys and values are BytesWritable.
The mapper is the predefined IdentityMapper and the reducer is the predefined IdentityReducer, both of which just pass their inputs directly to the output.
bin/hadoop jar hadoop-*-examples.jar sort [-m <#maps>] [-r <#reduces>] <in-dir> <out-dir>
% bin/hadoop jar hadoop-*-examples.jar randomwriter rand
% bin/hadoop jar hadoop-*-examples.jar sort rand rand-sort
The first command will generate the unsorted data in the rand directory. The second command will read that data, sort it, and write it into the rand-sort directory.
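For reference, a stripped-down driver for such a sort job might look like the following sketch, using the older org.apache.hadoop.mapred API (the in-dir/out-dir paths and the Sort class name are illustrative; the bundled example also parses the -m and -r options and other details):

JobConf conf = new JobConf(Sort.class);          // Sort stands for the driver class
conf.setJobName("sort");
// Inputs and outputs are SequenceFiles of BytesWritable keys and values.
conf.setInputFormat(SequenceFileInputFormat.class);
conf.setOutputFormat(SequenceFileOutputFormat.class);
conf.setOutputKeyClass(BytesWritable.class);
conf.setOutputValueClass(BytesWritable.class);
// Identity mapper and reducer: the framework's shuffle and sort do all the work.
conf.setMapperClass(IdentityMapper.class);
conf.setReducerClass(IdentityReducer.class);
FileInputFormat.setInputPaths(conf, new Path("in-dir"));
FileOutputFormat.setOutputPath(conf, new Path("out-dir"));
JobClient.runJob(conf);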