Unit III EBDP 2022
Analyzing Data
MapReduce is a programming model for data
processing.
Hadoop can run MapReduce programs written in
various languages (via Hadoop Streaming).
We shall look at the same program expressed in
Java, Ruby, Python, and C++.
2
A Weather Dataset
Program that mines
weather data
Weather sensors collect data
every hour at many locations
across the globe
They gather a large volume of
log data, which is a good candidate
for analysis with MapReduce
Data Format
Data from the National Climatic
Data Center (NCDC)
Stored using a line-oriented
ASCII format, in which each line
is a record
3
A Weather Dataset
Data Format
Data files are organized by date and weather station.
There is a directory for each year from 1901 to 2001, each containing a gzipped file for
each weather station with its readings for that year.
The whole dataset is made up of a large number of relatively small files, since there are
tens of thousands of weather stations.
The data was therefore preprocessed so that each year's readings were concatenated into a single file.
4
Analyzing the Data with Unix Tools
What’s the highest recorded global temperature for each year in the
dataset?
A Unix shell script with awk, the classic tool for processing line-oriented data
The script loops through the compressed year files,
printing the year and then processing each file using awk.
Awk extracts the air temperature and the quality code from the data.
Beginning of a run:
the maximum temperature for 1901 is 31.7℃.
The complete run for the century took 42 minutes on a single EC2 High-CPU
Extra Large Instance.
Analyzing the Data with Unix Tools
To speed up the processing, run parts of the program in parallel
Problems for parallel processing
Dividing the work into equal-size pieces isn’t always easy or obvious.
• The file size for different years varies
• The whole run is dominated by the longest file
• A better approach is to split the input into fixed-size chunks and assign each chunk to a
process
Combining the results from independent processes may need further processing.
Processing is still limited by the capacity of a single machine; using multiple
machines adds the burden of handling coordination and reliability.
6
Analyzing the Data with Hadoop – Map and Reduce
Map and Reduce
MapReduce works by breaking the processing into two phases: the
map phase and the reduce phase.
Both map and reduce phases have key-value pairs as input and
output.
Programmers have to specify two functions: the map function and
the reduce function.
The input to the map phase is the raw NCDC data.
• Here, the key is the offset of the beginning of the line within the
file, and the value is the line itself.
The map function pulls out the year and the air temperature
from each input value.
The reduce function takes <year, temperature> pairs as input and
produces the maximum temperature for each year as the result.
7
Analyzing the Data with Hadoop – Map and Reduce
Original NCDC Format
Input for the reduce function & Output of the reduce function
8
Analyzing the Data with Hadoop – Map and Reduce
The whole data flow
Input File
9
Analyzing the Data with Hadoop – Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code
A map function, a reduce function, and some code to run the job are needed.
Map function
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override public void map(LongWritable key, Text value,
Context context) throws IOException, InterruptedException {
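    // (Sketch of the method body, continuing the fragment above; the substring
    // offsets assume the fixed-width NCDC record layout used in this example.)
    String line = value.toString();
    String year = line.substring(15, 19);                 // year field of the record
    int airTemperature;
    if (line.charAt(87) == '+') {                         // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}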
• Reduce function
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
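    // (Sketch of the method body, continuing the fragment above.)
    // Find the maximum temperature seen for this year.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));        // emit <year, maximum temperature>
  }
}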
11
Analyzing the Data with Hadoop – Java MapReduce
Main function for running the MapReduce job
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    // Input and output paths come from the command-line arguments.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Analyzing the Data with Hadoop – Java MapReduce
A test run
The output is written to the output directory, which contains one output
file per reducer
13
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write the
map and reduce functions in languages other than Java.
We can therefore use many different languages to write MapReduce programs.
Hadoop Streaming
Map input data is passed over standard input to your map
function.
The map function processes the data line by line and
writes lines to standard output.
A map output key-value pair is written as a single tab-delimited line.
The reduce function reads lines from standard input (sorted by
key) and writes its results to standard output.
14
Hadoop Streaming
Python
Streaming supports any programming language that can read from standard input
and write to standard output.
The map and reduce scripts in Python
Test the programs and run the job in the same way as we did in Ruby.
15
Hadoop Map Reduce
MapReduce is a programming model that processes data
distributed across many computers.
Scalable to thousands of nodes and petabytes of data.
Reduce Phase
• Takes a key and its list of values and generates key-value pairs as output
• Reduce tasks don’t have the advantage of data locality – the input to a single
reduce task is normally the output from all mappers.
• All map outputs are merged across the network and passed to the user-defined
reduce function.
• The output of the reduce is normally stored in HDFS.
17
MR Phases (Cont..)
Sort & Shuffle Phase
• The input to every reducer is sorted by key.
• The process of grouping the map output by sorted key and transferring
it to the reducers as input is known as the shuffle.
Partition Phase
• The partition phase determines the reducer to which each sorted and
shuffled intermediate record is transferred.
Combine Phase
• Each map's intermediate result is combined locally into fewer key-value
pairs, so less data has to be transferred to the reduce phase.
• It reduces network traffic by optimizing the map's intermediate result.
18
19
Data flow
Hadoop moves the MapReduce computation to each machine
hosting a part of the data.
Data Flow
A MapReduce job consists of the input data, the MapReduce program,
and configuration information.
Hadoop runs the job by dividing it into two types of tasks: map tasks
and reduce tasks.
Two types of nodes control the job: one jobtracker and several tasktrackers
• Jobtracker: coordinates and schedules tasks to run on tasktrackers.
• Tasktrackers: run tasks and send progress reports to the jobtracker.
20
Data flow
Data Flow – single reduce task
Reduce tasks don’t have the advantage of data locality – the input to a single
reduce task is normally the output from all mappers.
All map outputs are merged across the network and passed to the user-defined
reduce function.
The output of the reduce is normally stored in HDFS.
21
Data Flow
Data Flow – multiple reduce tasks
The number of reduce tasks is specified independently; it is not governed
by the size of the input.
The map tasks partition their output by key, each creating one partition
for each reduce task (see the sketch below).
There can be many keys and their associated values in each partition,
but the records for any key are all in a single partition.
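As an illustration (a sketch, not on the original slide), the default partitioning hashes the key modulo the number of reduce tasks, which is what Hadoop's HashPartitioner does; a custom partitioner for the <year, temperature> records could be written the same way (YearPartitioner is a hypothetical name):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key always land in the same partition, and therefore
// reach the same reduce task.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered in the driver with job.setPartitionerClass(YearPartitioner.class), and the number of reduce tasks set with, for example, job.setNumReduceTasks(2).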
22
Data Flow
Data Flow – zero reduce tasks
23
MapReduce Types
Map & Reduce function types are as follows:
The map input key and value types (K1 and V1) are different from the
map output types (K2 and V2).
The reduce input must have the same types as the map output, although
the reduce output types may be different again (K3 and V3).
If a combine function is used then it is the same form as the reduce
function (and is an implementation of Reducer), except its output
types are the intermediate key and value types (K2 and V2), so they
can feed the reduce function.
The partition function operates on the intermediate key and value
types (K2 and V2), and returns the partition index. In practice, the
partition is determined solely by the key (the value is ignored).
24
Default Map, Reduce, Partitioner
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
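  // Sketch of the default behaviour (abridged; the nested Context class and the
  // run()/setup()/cleanup() plumbing of the real org.apache.hadoop.mapreduce.Mapper
  // are omitted). The default map() is an identity function: with no mapper class
  // set on the job, every input record is written straight to the map output.
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}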
26
Reduce Class
27
Driver Class
28
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster.
It pays to minimize the data transferred between map and reduce tasks.
Hadoop allows the user to specify a combiner function to be run on the map
output – the combiner function's output forms the input to the reduce function.
The contract for the combiner function constrains the type of function that
may be used.
Example: finding the maximum temperature for 1950, with output from two map tasks.

Without a combiner function, every map output pair is shuffled to the reducer:
  First map output:   <1950, 0>, <1950, 20>, <1950, 10>
  Second map output:  <1950, 25>, <1950, 15>
  Reduce input:       <1950, [0, 20, 10, 25, 15]>   →   Reduce output: <1950, 25>

With a combiner function that finds the maximum temperature for each map output:
  First map output after combining:   <1950, 20>
  Second map output after combining:  <1950, 25>
  Reduce input:       <1950, [20, 25]>   →   Reduce output: <1950, 25>
Scaling Out
Combiner Functions
The function calls on the temperature values can be expressed as follows:
• max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
If we were calculating mean temperatures, we could not use the mean as the
combiner function:
• mean(0, 20, 10, 25, 15) = 14
• but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
The combiner function doesn’t replace the reduce function.
It can help cut down the amount of data shuffled between the maps
and the reduces
30
Scaling Out
Combiner Functions
Specifying a combiner function
• The combiner function is defined using the Reducer interface
• It has the same implementation as the reducer function in MaxTemperatureReducer.
• The only change is to set the combiner class on the job (JobConf in the old API), as sketched below.
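A minimal sketch of the change in the driver shown earlier (setCombinerClass() is the only addition):

job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);   // combiner reuses the reducer implementation
job.setReducerClass(MaxTemperatureReducer.class);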
31
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single line of code:
JobClient.runJob(conf).
• But it conceals a great deal of processing behind the
scenes.
At the highest level, there are four independent entities:
The client, which submits the MapReduce job.
The jobtracker, which coordinates the job run.
The tasktrackers, which run the tasks that the job has been split into.
The distributed filesystem (normally HDFS) used for sharing
job files between the other entities.
32
Anatomy of a MapReduce Job Run
33
Map Reduce Job Run
Jobs are submitted via waitForCompletion() in the driver class.
The client submits the MapReduce job to the jobtracker.
The jobtracker verifies certain properties of the job submission, such
as the input/output file specifications, then initializes the job and
assigns it to a tasktracker.
The tasktrackers run the map and reduce tasks.
Each tasktracker sends progress status to the jobtracker periodically.
The jobtracker periodically updates the overall job progress by combining
the status of the map and reduce tasks received from the tasktrackers.
Finally, the job completion status is returned to the client, indicating
that the job is done.
34
Job Submission
JobClient.runJob()
Creates a new JobClient instance (step 1)
Calls submitJob()
JobClient.submitJob()
Asks the jobtracker for a new job ID, by calling JobTracker.getNewJobId() (step 2)
Checks the output specification of the job
Computes the input splits for the job
Copies the resources needed to run the job to the jobtracker's filesystem (step 3)
• The job JAR file, the configuration file and the computed input splits
Tells the jobtracker that the job is ready for execution, by calling
JobTracker.submitJob() (step 4)
35
Job Initialization
JobTracker.submitJob()
Creates a new JobInProgress instance (step 5)
• Represents the job being run
• Encapsulates its tasks and status information
Puts it into an internal queue
• The job scheduler will pick it up and initialize it from the queue
Job scheduler
Retrieves the input splits from the shared filesystem (step 6)
Creates one map task for each split
Creates reduce tasks to be run
• The # of reduce tasks is determined by the mapred.reduce.tasks property
Gives IDs to the tasks
36
Task Assignment
Tasktrackers
Periodically send heartbeats to the jobtracker (step 7)
• Also send whether they are ready to run a new task
Have a fixed number of slots for map/reduce tasks
Jobtracker
Chooses a job to select the task from
Assigns map/reduce tasks to tasktrackers
using the heartbeat values
• For map tasks, it takes account of the data locality
37
Task Execution
Tasktracker
Copies the job JAR from the shared filesystem (step 8)
Creates a local working directory for the task,
and un-jars the contents of the JAR into this directory
Creates an instance of TaskRunner to run the task
TaskRunner
Launches a new Java Virtual Machine (JVM) (step 9)
• So that any bugs in the user-defined map and reduce functions
don’t affect the tasktracker
Runs each task in the JVM
• Child process informs the parent of the task’s progress every few
seconds
38
Anatomy of a MapReduce Job Run
Streaming and Pipes
• Both Streaming and Pipes run special map and reduce tasks for the purpose of
launching the user-supplied executable and communicating with it (Figure 6-2).
39
Anatomy of a MapReduce Job Run
Progress and Status Updates
When a task is running, it keeps track of its progress, that is, the
proportion of the task completed.
A job and each of its tasks have a status, which includes the state of the
job or task (e.g., running, successfully completed, failed), the progress
of maps and reduces, the values of the job’s counters, and a status
message or description
What Constitutes Progress in MapReduce?
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Incrementing a counter
• When the task status changes, it is reported to the tasktracker.
40
Anatomy of a MapReduce Job Run
Job Completion
When the jobtracker receives a notification that the last task for a job is complete,
it changes the status for the job to “successful.” Then, when the JobClient polls for
status, it learns that the job has completed successfully, so it prints a message to
tell the user, and then returns from the runJob() method.
41
Failures
Task Failure
Child task failing.
• When user code in the map or reduce task throws a runtime exception.
• If a Streaming task exits with a nonzero exit code, it is marked as failed
(governed by the stream.non.zero.exit.is.failure property).
• A hanging task fails when the tasktracker (TT) has not received a progress
update for a while and proceeds to mark the task as failed; the child JVM
process is then automatically killed.
• mapred.map.max.attempts = 4 (default)
• mapred.reduce.max.attempts = 4 (default)
42
Failures
Tasktracker Failure
Failure of a tasktracker is another failure mode.
If a tasktracker(TT) fails by crashing, or running very slowly, it
will stop sending heartbeats to the jobtracker(JT).
The JT removes it from its pool of TTs and arranges the map
tasks on another TT.
A JT can blacklist a TT if more than four tasks of the same job
fail on that particular TT.
Jobtracker Failure
Failure of the jobtracker is the most serious failure mode.
Currently, Hadoop has no mechanism for dealing with failure
of the jobtracker.
43
Job Scheduling
Goals
1. Good throughput or response time for tasks (or jobs)
2. High utilization of resources
Hadoop
A Hadoop job consists of Map tasks and Reduce tasks
If there is only one job in the entire cluster, it occupies the whole cluster.
Multiple Users with multiple jobs
• Users/jobs = “tenants”
• Multi-tenant system
• Need a way to schedule all these jobs (and their constituent
tasks)
• Need to be fair across the different tenants
Hadoop schedulers
FIFO Scheduler (MR V1)
Hadoop Capacity Scheduler
Hadoop Fair Scheduler
44
FIFO Scheduler
• Queue-based : Maintain tasks in a queue in order of arrival
• When processor free, dequeue head and schedule it
• Each job uses the whole cluster, so jobs wait their turn.
• Can set priorities for the jobs in the queue (5 queues with priorities)
• Job priority is set with the mapred.job.priority property
(VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW); see the sketch after this list
• Preemption is not supported
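A small sketch of setting the priority from the driver (assuming the MaxTemperature job from earlier):

// Old-API driver fragment (org.apache.hadoop.mapred.JobConf): raise this job's
// priority in the FIFO queue. The same property can be passed on the command
// line with -D mapred.job.priority=HIGH if the driver uses ToolRunner.
JobConf conf = new JobConf(MaxTemperature.class);
conf.set("mapred.job.priority", "HIGH");   // VERY_HIGH, HIGH, NORMAL, LOW, or VERY_LOW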
45
Capacity Schedulers
Contains multiple queues
Each queue contains multiple jobs
Each queue is guaranteed some portion of the cluster capacity
E.g.,
Queue 1 is given 80% of cluster
Queue 2 is given 20% of cluster
Higher-priority jobs go to Queue 1
For jobs within same queue, FIFO typically used
Administrators can configure queues
Elasticity
A queue is allowed to occupy more of the cluster if resources are free
But if other queues fall below their capacity limit, resources need to be given
back to those queues
Pre-emption not allowed!
Cannot stop a task part-way through
When reducing a queue's share of the cluster, wait until some tasks of that queue have finished
Fair Scheduler
Goal: All jobs get equal share of resources
When only one job present, occupies entire cluster
As other jobs arrive, each job given equal % of cluster
E.g., each job might be given an equal number of cluster-wide slots
Each slot == 1 task of job
Example:
• Mary, John, Peter submit jobs that demand 80, 80, and 120 tasks respectively
• Say the cluster has a limit to allocate 60 tasks at most
• Default behavior: Distribute task fairly among 3 users (each get 20)
48
Fair Scheduler
Minimum share can be set for a pool
In the previous example, say Mary has a minimum share of 40
Mary would be allocated 40, then the rest is distributed evenly
to other pools
When the minimum share is not met in a pool for a while:
Take resources away from other pools
• By pre-empting jobs in those other pools
• By killing tasks in pools running over capacity in order to give more slots
to the pool running under capacity.
49
Shuffle and sort in MapReduce
MapReduce makes the guarantee that the input to every reducer is
sorted by key.
The process by which the system performs the sort—and transfers
the map outputs to the reducers as inputs—is known as the shuffle
50
Shuffle and Sort
The Map Side
The map side of the process is more involved: it takes advantage of buffering
writes in memory and doing some presorting for efficiency. Figure 6-4 shows
what happens.
Each map task has a circular memory buffer that it writes its output to.
When the buffer reaches a threshold (80% by default), a background thread
starts to spill the contents to disk; before writing, the thread partitions the
data and sorts it by key within each partition.
Each spill produces a new spill file; before the task finishes, the spill files
are merged into a single partitioned and sorted output file.
The map output can be compressed to reduce the amount of data transferred
to the reducers (see the sketch below).
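A sketch of the related configuration, using the classic (pre-YARN) property names:

// Driver fragment: tune the map-side buffer and enable map output compression.
Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 100);                           // in-memory sort buffer size, in MB
conf.setFloat("io.sort.spill.percent", 0.80f);            // start spilling at 80% full
conf.setBoolean("mapred.compress.map.output", true);      // compress map output before transfer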
51
Shuffle and Sort
The Reduce Side
The map tasks may finish at different times, so the reduce task starts
copying their outputs as soon as each completes.
How do reducers know which TTs to fetch map output from?
• When a map task completes successfully, it notifies its parent TT of the status
update, which in turn notifies the jobtracker.
• A thread in the reducer periodically asks the jobtracker for map output locations.
• TTs do not delete map outputs from disk as soon as the first reducer has
retrieved them, as the reducer may fail.
When all the map outputs have been copied, the reduce task moves into
the sort phase, which merges the map outputs, maintaining their sort ordering.
During the reduce phase the reduce function is invoked for each key in the
sorted output. The output of this phase is written directly to the output
filesystem, typically HDFS. In the case of HDFS, since the TT node is also
running a datanode, the first block replica will be written to the local disk.
52
Task Execution
Speculative Execution
Because a job is broken into many tasks that run in parallel, job execution
time is sensitive to slow-running tasks: it takes only one slow task to make
the whole job take significantly longer than it would have done otherwise.
Hadoop tries to detect when a task is running slower than expected
and launches another, equivalent, task as a backup. This is termed
speculative execution of tasks.
It’s important to understand that speculative execution does not
work by launching two duplicate tasks at about the same time so
they can race each other.
• if the original task completes before the speculative task then the speculative task
is killed; on the other hand, if the speculative task finishes first, then the original
is killed.
Speculative execution is an optimization, not a feature to make jobs
run more reliably.
53
Task Execution
Task JVM Reuse
Hadoop runs tasks in their own Java Virtual Machine to isolate them
from other running tasks.
The overhead of starting a new JVM for each task can take around a
second, which for jobs that run for a minute or so is insignificant.
However, jobs that have a large number of very short-lived tasks (these
are usually map tasks), or that have lengthy initialization, can see
performance gains when the JVM is reused for subsequent tasks.
The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks:
it specifies the maximum number of tasks to run for a given job for each JVM
launched; the default is 1 (see the sketch below).
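A sketch of enabling unlimited reuse for a job (-1 means no limit, so all of a job's tasks scheduled on a tasktracker can share JVMs):

// Old-API driver fragment.
JobConf conf = new JobConf(MaxTemperature.class);
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);   // default is 1: a fresh JVM per task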
54
Task Execution
Skipping Bad Records
The best way to handle corrupt records is in your mapper or reducer code.
You can detect the bad record and ignore it, or you can abort the job by
throwing an exception. You can also count the total number of bad records in
the job using counters to see how widespread the problem is.
Skipping mode is off by default; you enable it independently for map and
reduce tasks using the SkipBadRecords class.
The Task Execution Environment
Hadoop provides information to a map or reduce task about the environment in which
it is running. (The properties in Table 6-5 can be accessed from the job’s configuration)
55
Distributed and Parallel Processing Technology
56
Mapper and Reducer – data flow figure (files loaded from local HDFS store)
57
Partitioner – data flow figure (files loaded from local HDFS store)
58
Sort
Each Reducer is responsible for reducing the values associated with (several)
intermediate keys.
The set of intermediate keys on a single node is automatically sorted by
MapReduce before they are presented to the Reducer.
(Data flow figure: files loaded from the local HDFS store pass through
InputFormat → input splits → RecordReader (RR) → Map → Partitioner → Sort → Reduce.)
59
Input Files
• Input Splits
• An InputSplit has a length in bytes and a set of storage locations, which are
just hostname strings.
• The storage locations are used by the MapReduce system to place map tasks
as close to the split’s data as possible.
• The size is used to order the splits so that the largest get processed first, in an
attempt to minimize the job runtime.
Input Formats
FileInputFormat
• FileInputFormat is the base class for all implementations of InputFormat
that use files as their data source (see Figure 7-2).
• It provides two things:
1. a place to define which files are included as the input to a job, and
2. an implementation for generating splits for the input files.
62
63
Input Formats
FileInputFormat input paths
• FileInputFormat offers four static convenience methods
for setting a JobConf's input paths, listed below:
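The four methods, with their old-API signatures (the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat takes a Job instead of a JobConf):

public static void addInputPath(JobConf conf, Path path)
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
public static void setInputPaths(JobConf conf, Path... inputPaths)
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)

The add methods append to the existing list of input paths, while the set methods replace it.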
64
Input Formats
FileInputFormat input splits
1. FileInputFormat splits only large files (here, “large” means larger
than an HDFS block).
2. The split size is normally the size of an HDFS block, which is appropriate
for most applications (see the formula below).
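For reference, the split size is computed from three values (the minimum and maximum split sizes are configurable; with the defaults the block size wins, so one split corresponds to one HDFS block):

splitSize = max(minimumSize, min(maximumSize, blockSize))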
65
Input Formats
Small files and CombineFileInputFormat
• Hadoop works better with a small number of large files than a large number
of small files.
• If the files are very small (“small” means significantly smaller than an HDFS
block) and there are a lot of them, then each map task will process very little
input, and there will be a lot of map tasks (one per file), each of which
imposes extra bookkeeping overhead.
• CombineFileInputFormat was designed to work well with small files.
1. CombineFileInputFormat packs many files into each split so that each
mapper has more to process.
2. It takes node and rack locality into account when deciding which
blocks to place in the same split
3. CombineFileInputFormat does not compromise the speed at which it can
process the input in a typical MapReduce job.
66
Input Formats
Text Input Format (Default InputFormat)
• The key, a LongWritable, is the byte offset within the file of the
beginning of the line.
• The value is the contents of the line, excluding any line terminators
(newline, carriage return), and is packaged as a Text object.
• So a file containing four lines of text is divided into one split of four
records; the records are interpreted as key-value pairs, as illustrated below.
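A hypothetical four-line file (an illustration; the original slide's sample file is not reproduced here):

line one
line two
line three
line four

TextInputFormat presents it as the following records, where each key is the byte offset of the start of the line (newlines count toward the offsets but are excluded from the values):

(0,  "line one")
(9,  "line two")
(18, "line three")
(29, "line four")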
67
Input Formats
KeyValueTextInputFormat
Useful in case the key is already present in the input file.
Key and value are separated by a delimiter, by default a tab character.
The separator is set with the mapreduce.input.keyvaluelinerecordreader.key.value.separator property.
NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives
a variable number of lines of input.
The number depends on the size of the split and the length of the lines.
If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use.
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database,
using JDBC.
Because it doesn’t have any sharding capabilities, you need to be careful not to overwhelm
the database you are reading from by running too many mappers. For this reason, it is best
used for loading relatively small datasets, perhaps for joining with larger datasets from
HDFS, using MultipleInputs.
68
Input Formats
Sequence File Input Formats
SequenceFileInputFormat
• Hadoop’s sequence file format stores sequences of binary key-value
pairs.
SequenceFileAsTextInputFormat
• SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat
that converts the sequence file’s keys and values to Text objects.
SequenceFileAsBinaryInputFormat
• SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat
that retrieves the sequence file's keys and values as opaque binary objects.
• The map task should know how to deal with these objects.
69
Input Formats
Multiple Inputs
A job's input may come from several sources in different formats: one might be
tab-separated plain text, the other a binary sequence file. Even if they are in the
same format, they may have different representations, and therefore need to be
parsed differently.
These cases are handled elegantly by using the MultipleInputs class.
• MultipleInputs class, which allows you to specify the InputFormat and
Mapper to use on a per-path basis.
• Example: if we had weather data from the U.K. Met Office that we wanted to
combine with the NCDC data for our maximum temperature analysis, then we
might set up the input as follows (sketched below):
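A sketch with the new-API MultipleInputs (org.apache.hadoop.mapreduce.lib.input.MultipleInputs); MetOfficeMaxTemperatureMapper and the two Path variables are placeholder names for this illustration:

// Each input path gets its own InputFormat and Mapper; both mappers must emit
// the same map output types so that a single reducer can process them.
MultipleInputs.addInputPath(job, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);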
70
OutputFormat – data flow figure (files loaded from local HDFS store)
71
OutputFormat
72
Output Formats
Text Output
The default output format, TextOutputFormat, writes records as
lines of text.
TextOutputFormat keys and values may be of any type.
Each key-value pair is separated by a tab character, although that
may be changed using the mapred.textoutputformat.separator
property.
You can suppress the key or the value (or both, making this output
format equivalent to NullOutputFormat, which emits nothing) from
the output using a NullWritable type.
73
Output Formats
Binary Output
SequenceFileOutputFormat
• As the name indicates, SequenceFileOutputFormat writes sequence
files for its output.
• This is a good choice of output if it forms the input to a further
MapReduce job, since it is compact, and is readily compressed.
SequenceFileAsBinaryOutputFormat
• SequenceFileAsBinaryOutputFormat is the counterpart to
SequenceFileAsBinaryInputFormat.
• SequenceFileAsBinaryOutputFormat writes keys and values in
raw binary format into a SequenceFile container.
MapFileOutputFormat
• MapFileOutputFormat writes MapFiles as output.
74
Output Formats
Multiple Outputs
FileOutputFormat and its subclasses generate a set of files in the
output directory.
• There is one file per reducer
• Files are named by the partition number: part-00000, part-00001, etc.
There is sometimes a need to have more control over the naming
of the files, or to produce multiple files per reducer.
• MapReduce comes with two libraries to help you do this:
MultipleOutputFormat and MultipleOutputs.
MultipleOutputFormat
• MultipleOutputFormat allows you to write data to multiple files
whose names are derived from the output keys and values.
75
Output Formats
MultipleOutputs
• There’s a second library in Hadoop for generating multiple outputs, provided
by the MultipleOutputs class.
• Unlike MultipleOutputFormat, MultipleOutputs can emit different types for
each output. On the other hand, there is less control over the naming of outputs.
• What’s the Difference Between MultipleOutputFormat and MultipleOutputs?
77