MapReduce and YARN
Weather sensors collect readings every hour at many locations across the globe, gathering a large
volume of log data. This data is a good candidate for analysis with MapReduce because the analysis
involves processing the whole dataset, and the data is semi-structured and record-oriented.
Data Format:
The data used in the example is from the National Climatic Data Center, or NCDC.
There is a directory for each year from 1901 to 2001, each containing a gzipped file for each
weather station with its readings for that year.
There are tens of thousands of weather stations, so the whole dataset is made up of a large
number of relatively small files.
Format of a National Climatic Data Center record
0057
332130 # USAF weather station identifier
99999 # WBAN weather station identifier
19500101 # observation date
0300 # observation time
4
+51317 # latitude (degrees x 1000)
+028783 # longitude (degrees x 1000)
FM-12
+0171 # elevation (meters)
99999
V020
320 # wind direction (degrees)
1 # quality code
N
0072
1
00450 # sky ceiling height (meters)
1 # quality code
C
N
010000 # visibility distance (meters)
1 # quality code
N
9
-0128 # air temperature (degrees Celsius x 10)
1 # quality code
-0139 # dew point temperature (degrees Celsius x 10)
1 # quality code
10268 # atmospheric pressure (hectopascals x 10)
1 # quality code
It’s generally easier and more efficient to process a smaller number of relatively large files, so
the data is preprocessed so that each year’s readings are concatenated into a single file.
The query: What’s the highest recorded global temperature for each year in the dataset?
MapReduce works by breaking the processing into two phases: the map phase and the reduce phase.
Each phase has key-value pairs as input and output, the types of which may be chosen by the
programmer.
The map function pulls out the year and the air temperature, because these are the only fields useful for the query.
In this case, the map function is just a data preparation phase, setting up the data in such a
way that the reduce function can do its work on it: finding the maximum temperature for each
year.
The map function is also a good place to drop bad records: filter out temperatures that are
missing, suspect, or erroneous.
To visualize the way the map works, consider the following sample lines of input data:
0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...
These lines are presented to the map function as the key-value pairs:
(0, 0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)
(318, 0043012650999991949032412004...0500001N9+01111+99999999999...)
(424, 0043012650999991949032418004...0500001N9+00781+99999999999...)
The keys are the line offsets within the file, which are ignored by the map function.
The map function merely extracts the year and the air temperature and emits them as its output:
(1950, 0)
(1950, 22)
(1950, −11)
(1949, 111)
(1949, 78)
• The output from the map function is processed by the MapReduce framework before being sent to
the reduce function. This processing sorts and groups the key-value pairs by key, so the reduce
function sees each year with a list of all its temperature readings:
(1949, [111, 78])
(1950, [0, 22, −11])
• The reduce function iterates through each list and picks the maximum reading, giving the final output:
(1949, 111)
(1950, 22)
The Java driver for this MaxTemperature job (imports are omitted; the mapper and reducer classes are sketched after the listing):
public class NewMaxTemperature
{
public static void main(String[] args) throws Exception
{
Job job = Job.getInstance(new Configuration()); // Job.getInstance() replaces the deprecated new Job()
job.setJarByClass(NewMaxTemperature.class);
job.setMapperClass(NewMaxTemperatureMapper.class);
job.setReducerClass(NewMaxTemperatureReducer.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0])); // input path passed on the command line
FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path passed on the command line
System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}
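The mapper and reducer classes referenced by the driver (NewMaxTemperatureMapper and NewMaxTemperatureReducer) are not reproduced in these notes. The following is a minimal sketch of what they might look like, using the org.apache.hadoop.mapreduce API; the substring offsets follow the fixed-width NCDC record layout described earlier and are illustrative.
// Each public class below would live in its own .java file.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class NewMaxTemperatureMapper extends Mapper<LongWritable, Text, Text, IntWritable>
{
    private static final int MISSING = 9999; // sentinel value NCDC uses for a missing reading

    @Override
    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException
    {
        String line = value.toString();
        String year = line.substring(15, 19); // year field of the NCDC record
        int airTemperature;
        if (line.charAt(87) == '+') // parseInt doesn't like leading plus signs
        {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        }
        else
        {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) // drop missing or suspect readings
        {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}

public class NewMaxTemperatureReducer extends Reducer<Text, IntWritable, Text, IntWritable>
{
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException
    {
        int maxValue = Integer.MIN_VALUE; // running maximum for this year
        for (IntWritable value : values)
        {
            maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
    }
}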
WORD COUNT
public class WordCountMR
{
public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>
{
private final static IntWritable one = new IntWritable(1);
private Text word = new Text();
public void map(Object key, Text value, Context context) throws IOException,
InterruptedException
{
StringTokenizer itr = new StringTokenizer(value.toString());
while (itr.hasMoreTokens())
{
word.set(itr.nextToken());
context.write(word, one);
}
}
}
public static class IntSumReducer extends Reducer<Text,IntWritable,Text,IntWritable>
{
private IntWritable result = new IntWritable();
public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException,
InterruptedException
{
// sum all the counts emitted by the mappers for this word
int sum = 0;
for (IntWritable val : values)
{
sum += val.get();
}
result.set(sum);
context.write(key, result);
}
}
}
YARN
YARN provides APIs for requesting and working with cluster resources, but these APIs are not typically
used directly by user code.
Users write to higher-level APIs provided by distributed computing frameworks (MapReduce, Spark, and so
on), which themselves are built on YARN and hide the resource management details from the user.
Pig, Hive, and Crunch are all examples of processing frameworks that run on MapReduce, Spark, or Tez (or
on all three), and don’t interact with YARN directly.
YARN Applications
Anatomy of a YARN Application Run
YARN provides its core services via two types of long-running daemon:
A resource manager (one per cluster) to manage the use of resources across the cluster, and
Node managers running on all the nodes in the cluster to launch and monitor containers.
A container executes an application-specific process with a constrained set of resources (memory, CPU, and
so on).
Resource Requests
An application can make all of its requests up front, or it can take a more dynamic approach,
requesting further resources as its needs change.
Spark takes the first approach, starting a fixed number of executors on the cluster.
MapReduce has two phases: the map task containers are requested up front, but the reduce task
containers are not started until later.
Also, if any tasks fail, additional containers will be requested so the failed tasks can be rerun.
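To make the idea of a resource request concrete, the sketch below shows roughly how an application master might ask YARN for a single container using the AMRMClient helper API. It is illustrative only: the memory and vcore values, priority, and registration arguments are made up, it would only work when run inside a container launched by YARN as an application master, and, as noted above, most applications rely on a framework to do this for them.
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ResourceRequestSketch
{
    public static void main(String[] args) throws Exception
    {
        // Client-side helper an application master uses to talk to the resource manager.
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this application master with the resource manager (host/port/URL are placeholders).
        rmClient.registerApplicationMaster("", 0, "");

        // Ask for one container with 1024 MB of memory and 1 virtual core (illustrative values).
        Resource capability = Resource.newInstance(1024, 1);
        ContainerRequest request = new ContainerRequest(capability, null, null, Priority.newInstance(0));
        rmClient.addContainerRequest(request);

        // Containers are granted asynchronously: the application master picks them up on
        // subsequent allocate() heartbeats and then launches processes in them via the node managers.
        rmClient.allocate(0.0f);
    }
}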
Application Lifespan
Categorizing applications in terms of how they map to the jobs that users run, we have:
The simplest case is one application per user job, which is the approach that MapReduce
takes.
The second model is to run one application per workflow or user session of (possibly
unrelated) jobs.
MapReduce 1 vs. YARN
MapReduce 1      YARN
Slot             Container
In MapReduce 1, there are two types of daemon that control the job execution process:
A jobtracker, which coordinates all the jobs run on the system by scheduling tasks to run on
tasktrackers (this role is taken by the resource manager and the application master in YARN).
Tasktrackers, which run tasks and send progress reports to the jobtracker (the node manager's role in YARN).
Scalability
YARN can run on larger clusters than MapReduce 1.
MapReduce 1 hits scalability bottlenecks in the region of 4,000 nodes and 40,000 tasks.
YARN is designed to scale up to 10,000 nodes and 100,000 tasks.
Availability
With the jobtracker’s responsibilities split between the resource manager and application master in
YARN, making the service highly available is much simpler.
High availability is provided first for the resource manager, and then for YARN applications (on a
per-application basis).
Utilization
In YARN, a node manager manages a pool of resources, rather than a fixed number of designated slots.
Resources in YARN are fine grained, so an application can make a request for what it needs, rather than
for an indivisible slot, which may be too big (which is wasteful of resources) or too small (which may
cause a failure) for the particular task.
Multitenancy
In some ways, the biggest benefit of YARN is that it opens up Hadoop to other types of distributed
application beyond MapReduce. MapReduce is just one YARN application among many.
MAP REDUCE
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single method call: submit() on a Job object.
Alternatively, you can call waitForCompletion(), which submits the job if it hasn't been submitted
already and then waits for it to finish.
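In code, the two entry points look roughly like this (a fragment in the style of the MaxTemperature driver shown earlier; the job name is illustrative and the configuration calls are omitted):
Job job = Job.getInstance(new Configuration(), "example job"); // job name is illustrative
// ... set mapper, reducer, input and output paths ...
job.submit(); // submits the job and returns immediately
// or, more commonly (instead of submit()):
boolean success = job.waitForCompletion(true); // submits if necessary, then blocks and reports progress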
There are five independent entities:
• The client, which submits the MapReduce job.
• The YARN resource manager, which coordinates the allocation of compute resources on the
cluster.
• The YARN node managers, which launch and monitor the compute containers on machines in
the cluster.
• The MapReduce application master, which coordinates the tasks running the MapReduce job.
The application master and the MapReduce tasks run in containers that are scheduled
by the resource manager and managed by the node managers.
• The distributed filesystem , HDFS, which is used for sharing job files between the other entities.
Job Submission
The submit() method on Job creates an internal JobSubmitter instance and calls
submitJobInternal() on it.
Having submitted the job, waitForCompletion() polls the job's progress once per second.
When the job completes successfully, the job counters are displayed.
Otherwise, the error that caused the job to fail is logged to the console.
Among other things, submitJobInternal() checks the job's output specification, computes the input
splits, and copies the resources needed to run the job (the job JAR file, the configuration file, and
the computed input splits) to the shared filesystem in a directory named after the job ID.
It then submits the job by calling submitApplication() on the resource manager.
Job Initialization
When the resource manager receives a call to its submitApplication() method, it hands off
the request to the YARN scheduler.
The scheduler allocates a container, and the resource manager then launches the application
master’s process there, under the node manager’s management.
The application master for MapReduce jobs is a Java application whose main class is
MRAppMaster.
It initializes the job by creating a number of bookkeeping objects to keep track of the job’s
progress.
Next, it retrieves the input splits computed in the client from the shared filesystem.
It then creates a map task object for each split, as well as a number of reduce task objects
determined by the mapreduce.job.reduces property.
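For example (an illustrative fragment in the style of the driver shown earlier), the number of reduce tasks can be set on the Job, which writes the mapreduce.job.reduces property; the value 4 is made up:
job.setNumReduceTasks(4); // sets mapreduce.job.reduces = 4
// equivalently at submission time (requires ToolRunner/GenericOptionsParser in the driver):
//   hadoop jar job.jar Driver -D mapreduce.job.reduces=4 <input> <output>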
Task Assignment
The application master requests containers for all the map and reduce tasks in the job from the
resource manager.
Requests for map tasks are made first and with a higher priority than those for reduce tasks.
Requests for reduce tasks are not made until 5% of the map tasks have completed (a threshold
controlled by the mapreduce.job.reduce.slowstart.completedmaps property).
Task Execution
The application master starts the container by contacting the node manager.
The task is then executed by a Java application whose main class is YarnChild. Before it can run the
task, YarnChild localizes the resources that the task needs, including the job configuration and JAR
file, and any files from the distributed cache.
YarnChild runs in a dedicated JVM, so that any bugs in the user-defined map and reduce
functions (or even in YarnChild itself) don't affect the node manager.
Progress and Status Updates
A job and each of its tasks have a status, which includes state of the job or task (e.g., running,
successfully completed, failed), the progress of maps and reduces, the values of the job’s
counters, and a status message or description.
Job Completion
When the application master receives a notification that the last task for a job is complete, it
changes the status for the job to “successful.”
When the Job polls for status, it prints a message about job completion and returns from the
waitForCompletion() method.
Job statistics and counters are printed to the console at this point.
On job completion, the application master and the task containers clean up their working state
(so intermediate output is deleted).
Failures
In the real world, user code can be buggy, processes can crash, and machines can fail. One of the
major benefits of using Hadoop is its ability to handle such failures and allow your job to complete
successfully.
Task Failure
Sudden exit of the task JVM—perhaps there is a JVM bug that causes the JVM to exit.
In this case, the node manager notices that the process has exited and informs the
application master so it can mark the attempt as failed.
Hanging tasks: the application master notices that it hasn't received a progress update for a
while and proceeds to mark the task as failed.
The timeout is normally 10 minutes (set via the mapreduce.task.timeout property), after which the
task JVM process is killed automatically.
Application Master Failure
YARN imposes a limit on the maximum number of attempts for any YARN application master
running on the cluster.
The default value is 2, and the limit is set by the yarn.resourcemanager.am.max-attempts property.
Node Manager Failure
If a node manager fails by crashing or running very slowly, it will stop sending heartbeats to the
resource manager (or send them very infrequently).
The resource manager will notice a node manager that has stopped sending heartbeats for 10 minutes.
The failed node is then removed from its pool of nodes to schedule containers on.
Any task or application master running on the failed node manager will be recovered.
Node managers may be blacklisted if the number of failures for the application is high, even if
the node manager itself has not failed.
The user may set the threshold with the mapreduce.job.maxtaskfailures.per.tracker job
property.
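A sketch of how these failure-related limits can be expressed as configuration properties (the values shown are the usual defaults, given for illustration; yarn.resourcemanager.am.max-attempts is a cluster-wide setting that normally lives in yarn-site.xml rather than a job configuration):
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
conf.setLong("mapreduce.task.timeout", 600000);              // hanging-task timeout in milliseconds (10 minutes)
conf.setInt("mapreduce.job.maxtaskfailures.per.tracker", 3); // task failures on one node before it is blacklisted for this job
conf.setInt("yarn.resourcemanager.am.max-attempts", 2);      // maximum application master attempts (cluster-wide)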
Resource Manager Failure
Failure of the resource manager is serious, because without it, neither jobs nor task containers
can be launched.
To achieve high availability, a pair of resource managers is run in an active-standby configuration.
Information about all the running applications is stored in a highly available state store (backed
by ZooKeeper or HDFS), so that the standby can recover the core state of the failed active
resource manager.
When the new resource manager starts, it reads the application information from the state
store, then restarts the application masters for all the applications.
The transition of a resource manager from standby to active is handled by a failover controller
(which by default uses ZooKeeper leader election).
Shuffle and Sort
MapReduce makes the guarantee that the input to every reducer is sorted by key.
The process by which the system performs the sort—and transfers the map outputs to the
reducers as inputs—is known as the shuffle.
Matrix Multiplication using MapReduce
To multiply matrix A (elements Aij) by matrix B (elements Bjk), each mapper tags every element with
the coordinates of the result cells it contributes to:
Mapper for Matrix A: (k, v) = ((i, k), (A, j, Aij)) for all columns k of B
Mapper for Matrix B: (k, v) = ((i, k), (B, j, Bjk)) for all rows i of A
Computing the mapper output for Matrix A and Matrix B, we can observe that 4 keys are common to
both: (1, 1), (1, 2), (2, 1), and (2, 2) (the example uses 2 x 2 matrices).
For each common key, make a separate list of the Matrix A and Matrix B values taken from the mapper
step above; the reducer then multiplies the A and B entries that share the same j index and sums the
products to produce one cell of the result.
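Since the example matrices themselves are not reproduced in these notes, the following small worked example uses made-up 2 x 2 values to show the mechanics.
Suppose A = [[1, 2], [3, 4]] and B = [[5, 6], [7, 8]].
The mapper for A emits, for A11 = 1: ((1, 1), (A, 1, 1)) and ((1, 2), (A, 1, 1)); for A12 = 2: ((1, 1), (A, 2, 2)) and ((1, 2), (A, 2, 2)); and similarly for A21 and A22.
The mapper for B emits, for B11 = 5: ((1, 1), (B, 1, 5)) and ((2, 1), (B, 1, 5)); for B21 = 7: ((1, 1), (B, 2, 7)) and ((2, 1), (B, 2, 7)); and similarly for B12 and B22.
After the shuffle, the reducer for key (1, 1) receives the lists A: [(1, 1), (2, 2)] and B: [(1, 5), (2, 7)]. Pairing entries with the same j and summing the products gives 1 x 5 + 2 x 7 = 19, so C11 = 19.
Repeating this for the other three keys gives C12 = 22, C21 = 43, and C22 = 50, i.e. C = [[19, 22], [43, 50]].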