
Advanced MapReduce

Developing your MapReduce Job


MapReduce
• Job – execution of map and reduce functions to accomplish a task
– Analogous to a Java program's main

• Task – a single Mapper or Reducer
– Performs work on a fragment of data
WordCount Job
1. Configure the Job
– Specify Input, Output, Mapper, Reducer and Combiner

2. Implement Mapper
– Input is text – a line from sample.txt
– Tokenize the text and emit each token with a count of 1 – <token, 1>

3. Implement Reducer
– Sum up the counts for each token
– Write out the result to HDFS

4. Run the job


1. Configure Job
• Job class
– Encapsulates information about a job
– Controls execution of the job
Job job = new Job();

• A job is packaged within a jar file
– The Hadoop framework distributes the jar on your behalf
– It needs to know which jar file to distribute
– The easiest way to specify the jar that your job resides in is by calling job.setJarByClass
job.setJarByClass(WordCount.class);
– Hadoop will locate the jar file that contains the provided class
1. Configure Job – Specify Input
FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
job.setInputFormatClass(TextInputFormat.class);

• Can be a file, directory or a file pattern
– A directory is converted to a list of files as the input
• Input is specified by an implementation of InputFormat – in this case TextInputFormat
– Responsible for creating splits and a record reader
– Controls the input key-value types, in this case LongWritable and Text
• The file is broken into lines; the mapper will receive one line at a time
1. Configure Job – Specify Output
FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
job.setOutputFormatClass(TextOutputFormat.class);

• OutputFormat defines the specification for outputting data from a Map/Reduce job

• The WordCount job utilizes an implementation of OutputFormat – TextOutputFormat
– Defines the output path where the reducer should place its output
• If the path already exists then the job will fail
– Each reducer task writes to its own file
• By default a job is configured to run with a single reducer
– Writes each key-value pair as plain text
1. Configure Job – Specify Output
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);

• Specify the output key and value types for both mapper and reducer functions
– Many times they are the same type
– If the types differ then use
• setMapOutputKeyClass()
• setMapOutputValueClass()
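
For example, if the mapper emitted <Text, IntWritable> pairs but the reducer wrote <Text, DoubleWritable> pairs, the job could be configured as in the hypothetical snippet below (the DoubleWritable output is an assumed scenario, not part of the WordCount job):

// Hypothetical scenario: the map output value type differs from the job's final output value type
job.setMapOutputKeyClass(Text.class);
job.setMapOutputValueClass(IntWritable.class);

job.setOutputKeyClass(Text.class);
job.setOutputValueClass(DoubleWritable.class);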
1. Configure Job
• Specify Mapper, Reducer and Combiner
– At a minimum you will need to implement these classes
– Mapper and Reducer usually have the same output key-value types

job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);
job.setCombinerClass(IntSumReducer.class);
1. Configure Job
• job.waitForCompletion(true)
– Submits and waits for completion
– The boolean parameter specifies whether job progress should be printed to the console
– If the job completes successfully ‘true’ is returned, otherwise ‘false’ is returned

System.exit(job.waitForCompletion(true) ? 0 : 1);
Our Count Job is configured to
• Chop up text files into lines
• Send records to mappers as key-value pairs
– The byte offset of the line (LongWritable) and the line itself (Text)
• Mapper class is TokenizerMapper
– Receives key-value of <LongWritable, Text>
– Outputs key-value of <Text, IntWritable>
• Reducer class is IntSumReducer
– Receives key-value of <Text, IntWritable>
– Outputs key-value of <Text, IntWritable> as text
• Combiner class is IntSumReducer
1. Configure Count Job
public class WordCount {
  public static void main(String[] args) throws Exception {

    try {
      if (args.length != 2) {
        System.out.printf("Usage: wordcount <input dir> <output dir>\n");
        System.exit(-1);
      }

      Job job = new Job();
      job.setJarByClass(WordCount.class);

      job.setMapperClass(TokenizerMapper.class);
      job.setCombinerClass(IntSumReducer.class);
      job.setReducerClass(IntSumReducer.class);

      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);

      FileInputFormat.addInputPath(job, new Path(args[0]));
      FileOutputFormat.setOutputPath(job, new Path(args[1]));

      System.exit(job.waitForCompletion(true) ? 0 : 1);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }
}
2. Implement Mapper Class
• The class has 4 Java generics parameters
– (1) input key (2) input value (3) output key (4) output value
– Input and output utilize Hadoop's IO framework
• org.apache.hadoop.io

• Your job is to implement the map() method
– Input key and value
– Output key and value
– Logic is up to you

• The map() method receives a Context object, used to:
– Write output
– Create your own counters
2. Implement Mapper
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {

    StringTokenizer itr = new StringTokenizer(value.toString());

    while (itr.hasMoreTokens()) {
      context.write(new Text(itr.nextToken()), new IntWritable(1));
    }
  }
}
3. Implement Reducer
• Analogous to Mapper – a generic class with four types
– (1) input key (2) input value (3) output key (4) output value
– The output types of the map function must match the input types of the reduce function
• In this case Text and IntWritable
– The Map/Reduce framework groups the key-value pairs produced by the mappers by key
• For each key there is a set of one or more values
• Input into a reducer is sorted by key
• Known as Shuffle and Sort
– The reduce function accepts a key and its set of values and outputs key-value pairs
• Also utilizes a Context object (similar to Mapper)
3. Implement Reducer
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {

    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }

    context.write(key, new IntWritable(sum));
  }
}
3. Reducer as a Combiner
• Combine data per Mapper task to reduce amount of data
transferred to reduce phase
• Reducer can very often serve as a combiner
– Only works if reducer’s output key-value pair types are the same as
mapper’s output types
• Combiners are not guaranteed to run
– Optimization only
– Not for critical logic
4. Run Count Job
• DEMO – Specify how to run

• Create a JAR file

• Run the jar
hadoop jar wordcount.jar <in> <out>
Output From Your Job
• Provides job id
– Used to identify, monitor and manage the job
• Shows number of generated splits
• Reports the Progress
• Displays Counters – statistics for the job
– Sanity check that the numbers match what you expected
Input and Output
Hadoop IO Classes
• Hadoop uses its own serialization mechanism for reading and writing data to and from the network, databases, or files
– Optimized for network serialization
– A set of basic types is provided
– Easy to implement your own

• Types implement the Writable interface
– The framework's serialization mechanism
– Defines how to read and write fields

• org.apache.hadoop.io package
– LongWritable for Long
– IntWritable for Integer
– Text for String
– Etc...
Key and Value Types
• Keys must implement the WritableComparable interface
– Extends Writable and java.lang.Comparable<T>
– Required because keys are sorted prior to the reduce phase

• Hadoop ships with many default implementations of WritableComparable<T>
– Wrappers for Java primitives and String (IntWritable, Text, etc...)
– Or you can implement your own
WritableComparable<T> Implementations
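
As a sketch of implementing your own key type, the hypothetical class below (not part of the WordCount example) implements WritableComparable so it can be serialized by Hadoop's IO framework and sorted during Shuffle and Sort:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: a year/temperature pair
public class YearTemperaturePair implements WritableComparable<YearTemperaturePair> {
  private int year;
  private int temperature;

  @Override
  public void write(DataOutput out) throws IOException {
    // Serialize the fields in a fixed order
    out.writeInt(year);
    out.writeInt(temperature);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    // Deserialize the fields in the same order they were written
    year = in.readInt();
    temperature = in.readInt();
  }

  @Override
  public int compareTo(YearTemperaturePair other) {
    // Defines the sort order of keys prior to the reduce phase
    int cmp = Integer.compare(year, other.year);
    return cmp != 0 ? cmp : Integer.compare(temperature, other.temperature);
  }
}

A key type used with the default HashPartitioner should also override hashCode() (and equals()) so that equal keys land in the same reduce task.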
Hadoop's Input Format
• The Hadoop eco-system is packaged with many InputFormats
– TextInputFormat
– NLineInputFormat
– DBInputFormat
– TableInputFormat (HBASE)
– StreamInputFormat
– SequenceFileInputFormat
– Etc...

• Configure on the Job object
– job.setInputFormatClass(XXXInputFormat.class);
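
For example, a job could be switched to NLineInputFormat so that each mapper receives a fixed number of input lines; a minimal sketch, where the input path and the 1000-lines-per-split setting are illustrative choices:

job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.addInputPath(job, new Path("input/sample.txt")); // illustrative path
NLineInputFormat.setNumLinesPerSplit(job, 1000);                  // 1000 lines per map task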
TextInputFormat
• Plain text input
• Default format
TableInputFormat
• Converts data in an HBase HTable to a format consumable by MapReduce
• The Mapper must accept the proper key/value types
HashPartitioner
• Calculates the index of the partition:
– Converts the key's hash into a non-negative number
• Logical AND with the maximum integer value
– Modulo by the number of reduce tasks
• In the case of more than 1 reducer
– Records are distributed evenly across the available reduce tasks
• Assuming a good hashCode() function
– Records with the same key will go to the same reduce task
– The code is independent of the number of partitions/reducers specified
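
The calculation described above corresponds to what HashPartitioner's getPartition() does, essentially:

public class HashPartitioner<K, V> extends Partitioner<K, V> {
  public int getPartition(K key, V value, int numReduceTasks) {
    // AND with Integer.MAX_VALUE clears the sign bit (non-negative hash);
    // modulo by the number of reduce tasks picks the target partition
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}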
OutputFormat
• Specification for writing data
– The other side of InputFormat
• Implementation of OutputFormat<K,V>
• TextOutputFormat is the default implementation
– Outputs records as lines of text
– Key and value are tab separated: "key \t value"
• Can be configured via the "mapreduce.output.textoutputformat.separator" property
– Key and value may be of any type – .toString() is called on each
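
For example, the separator could be changed from the default tab to a comma on the job's configuration (the comma is just an illustrative choice):

job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");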
Hadoop’s Output Format
• Hadoop eco-system is packaged with many OutputFormats
– TextOutputFormat
– DBOutputFormat
– TableOutputFormat (HBASE)
– MapFileOutputFormat
– SequenceFileOutputFormat
– NullOutputFormat
– Etc...
• Configure on Job object
– job.setOutputFormatClass(XXXOutputFormat.class);
– job.setOutputKeyClass(XXXKey.class);
– job.setOutputValueClass(XXXValue.class);
MRv2 / YARN

The future of next-gen computing


YARN
Yet Another Resource Negotiator
• YARN: a generic resource-management and distributed application framework
– In Aug 2012, YARN was promoted to be a sub-project of Hadoop in Apache. Before this, YARN was part of the Hadoop MapReduce project.
– MR is not sufficient for all use cases, like PageRank or many ML algorithms
– MR becomes just one of the applications that can be run on YARN
– Future YARN applications: MPI/iterative processing, graph processing, simple services, real time (stream processing, CEP)
– Since all the data in the enterprise is already available in Hadoop HDFS, having multiple paths for processing it is critical
– All client-facing MapReduce interfaces are unchanged, which means that there is no need to make any source code changes to run on top of Hadoop 0.23
MRv1 Quick Recap

JobTracker (JT) responsibilities:
• Resource management (of the TaskTrackers, TTs)
• Tracking resource consumption/availability
• Job life-cycle management
MRv2 Overview

Fundamental idea:
Re-architect the JT's resource management and job scheduling & monitoring into two separate components: ResourceManager & ApplicationMaster
Building Blocks
• ResourceManager: manages the global assignment of compute resources to applications; has a pluggable scheduler for allocating resources to the running applications subject to constraints of capacities and queues. It optimizes for cluster utilization (keep all resources in use all the time) against various constraints such as capacity guarantees, fairness, and SLAs. It does NOT do fault tolerance for resources (the AM does)

• ApplicationMaster: manages the application's scheduling and coordination; negotiates appropriate resource containers from the Scheduler and tracks their progress

• NodeManager: per-machine slave, responsible for launching the applications' containers, monitoring their resource usage (CPU, memory, disk) and reporting it to the ResourceManager

• The AM can request very specific requirements from the RM for the containers, like:
– Resource name: hostname, rack name
– Memory (in MB)
– CPU (in cores), added after March 2012
– Future: disk, network, GPUs, etc.
Building Blocks
• A resource request from the AM to the scheduler in the RM has the following:
– Resource name: hostname, rackname (Future: VMs on a host, networks)
– Priority: priority within the application, not across the cluster
– Resource requirement: memory, CPU (Future: GPUs)
– # of Containers: just a number

• A container is basically a resource allocation that grants an application the right to use a specific amount of resources (memory, CPU) on a specific host
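
As a hedged sketch (not from the slides) of how an ApplicationMaster might express such a request through the YARN client API, assuming amRMClient is an already-started AMRMClient instance and the 2048 MB / 1 core values are illustrative:

Resource capability = Resource.newInstance(2048, 1);   // 2048 MB of memory, 1 virtual core
Priority priority = Priority.newInstance(0);           // priority within this application
String[] nodes = null;                                 // no host preference
String[] racks = null;                                 // no rack preference

AMRMClient.ContainerRequest request =
    new AMRMClient.ContainerRequest(capability, nodes, racks, priority);
amRMClient.addContainerRequest(request);               // ask the RM's scheduler for one container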
Application Execution Sequence
1) A client program submits the application, including the necessary specifications to launch the
application-specific ApplicationMaster itself.

2) The ResourceManager assumes the responsibility to negotiate a specified container in which to start
the ApplicationMaster and then launches the ApplicationMaster.

3) The ApplicationMaster, on boot-up, registers with the ResourceManager – the registration allows the client program to query the ResourceManager for details, which allows it to communicate directly with its own ApplicationMaster.

4) During normal operation the ApplicationMaster negotiates appropriate resource containers via the
resource-request protocol.

5) On successful container allocations, the ApplicationMaster launches the container by providing the
container launch specification to the NodeManager. The launch specification, typically, includes the
necessary information to allow the container to communicate with the ApplicationMaster itself.

6) The application code executing within the container then provides necessary information (progress,
status etc.) to its ApplicationMaster via an application-specific protocol.

7) During the application execution, the client that submitted the program communicates directly with
the ApplicationMaster to get status, progress updates etc. via an application-specific protocol.

8) Once the application is complete, and all necessary work has been finished, the ApplicationMaster
deregisters with the ResourceManager and shuts down, allowing its own container to be repurposed.
Thank You
