Unit III EBDP 2022
Analyzing Data
MapReduce is a programming model for data
processing.
Hadoop can run MapReduce programs written in
various languages (via Hadoop Streaming).
We shall look at the same program expressed in
Java, Ruby, Python, and C++.
2
A Weather Dataset
Program that mines
weather data
Weather sensors collect data
every hour at many locations
across the globe
They gather a large volume of
log data, which is a good candidate
for analysis with MapReduce
Data Format
Data from the National Climatic
Data Center (NCDC)
Stored using a line-oriented
ASCII format, in which each line
is a record
3
A Weather Dataset
Data Format
Data files are organized by date and weather station.
There is a directory for each year from 1901 to 2001, each containing a gzipped file for
each weather station with its readings for that year.
The whole dataset is made up of a large number of relatively small files, since there are
tens of thousands of weather stations.
The data was therefore preprocessed so that each year's readings were concatenated into a single file.
4
Analyzing the Data with Unix Tools
What’s the highest recorded global temperature for each year in the
dataset?
A Unix shell script with awk, the classic tool for processing line-oriented data
The script loops through the compressed year files,
printing the year and then processing each file using awk.
Awk extracts the air temperature and the quality code from the data.
Beginning of a run:
the maximum temperature for 1901 is 31.7℃.
The complete run for the century took 42 minutes on a single EC2 High-CPU
Extra Large Instance.
Analyzing the Data with Unix Tools
To speed up the processing, run parts of the program in parallel
Problems for parallel processing
Dividing the work into equal-size pieces isn’t always easy or obvious.
• The file size for different years varies
• The whole run is dominated by the longest file
• A better approach is to split the input into fixed-size chunks and assign each chunk to a
process
Combining the results from independent processes may need further processing.
Processing is still limited by the capacity of a single machine; using multiple
machines adds the burden of handling coordination and reliability.
6
Analyzing the Data with Hadoop – Map and Reduce
Map and Reduce
MapReduce works by breaking the processing into two phases: the
map phase and the reduce phase.
Both map and reduce phases have key-value pairs as input and
output.
Programmers have to specify two functions: the map function and
the reduce function.
The input to the map phase is the raw NCDC data.
• Here, the key is the offset of the beginning of the line within the
file, and the value is the line itself.
The map function pulls out the year and the air temperature
from each input value.
The reduce function takes <year, temperature> pairs as input and
produces the maximum temperature for each year as the result.
7
Analyzing the Data with Hadoop – Map and Reduce
Original NCDC Format
Input for the reduce function & Output of the reduce function
8
Analyzing the Data with Hadoop – Map and Reduce
The whole data flow
Input File
9
Analyzing the Data with Hadoop – Java MapReduce
Having run through how the MapReduce program works, the next step is to express it in code
A map function, a reduce function, and some code to run the job are needed.
Map function
public class MaxTemperatureMapper
extends Mapper<LongWritable, Text, Text, IntWritable> {
private static final int MISSING = 9999;
@Override public void map(LongWritable key, Text value,
Context context) throws IOException, InterruptedException {
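    // (Sketch of the method body, continuing the fragment above; the substring
    // offsets assume the fixed-width NCDC record layout used in this example.)
    String line = value.toString();
    String year = line.substring(15, 19);                 // year field of the record
    int airTemperature;
    if (line.charAt(87) == '+') {                         // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}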
• Reduce function
public class MaxTemperatureReducer
extends Reducer<Text, IntWritable, Text, IntWritable> {
@Override
public void reduce(Text key, Iterable<IntWritable> values,
Context context)
throws IOException, InterruptedException {
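    // (Sketch of the method body, continuing the fragment above.)
    // Find the maximum temperature seen for this year.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));        // emit <year, maximum temperature>
  }
}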
11
Analyzing the Data with Hadoop – Java MapReduce
Main function for running the MapReduce job
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = new Job();
    job.setJarByClass(MaxTemperature.class);
    job.setJobName("Max temperature");
    // Input and output paths come from the command-line arguments.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
Analyzing the Data with Hadoop – Java MapReduce
A test run
The output is written to the output directory, which contains one output
file per reducer
13
Hadoop Streaming
Hadoop provides an API to MapReduce that allows you to write the
map and reduce functions in languages other than Java.
We can therefore use many different languages to write MapReduce programs.
Hadoop Streaming
Map input data is passed over standard input to your map
function.
The map function processes the data line by line and
writes lines to standard output.
A map output key-value pair is written as a single tab-delimited line.
The reduce function reads lines from standard input (sorted by
key) and writes its results to standard output.
14
Hadoop Streaming
Python
Streaming supports any programming language that can read from standard input
and write to standard output.
The map and reduce scripts in Python
Test the programs and run the job in the same way as we did in Ruby.
15
Hadoop Map Reduce
MapReduce is a programming model that processes data
distributed across many computers.
Scalable to thousands of nodes and petabytes of data.
Reduce Phase
• Takes a key and its list of values and generates key-value pairs as output
• Reduce tasks don’t have the advantage of data locality – the input to a single
reduce task is normally the output from all mappers.
• All map outputs are merged across the network and passed to the user-defined
reduce function.
• The output of the reduce is normally stored in HDFS.
17
MR Phases (Cont..)
Sort & Shuffle Phase
• The input to every reducer is sorted by key.
• The process of grouping the map output by sorted key and transferring
it to the reducers as input is known as the shuffle.
Partition Phase
• The partition phase determines the reducer to which each sorted and
shuffled intermediate record is transferred.
Combine Phase
• Each map's intermediate result is combined locally into fewer key-value
pairs, so less data has to be transferred to the reduce phase.
• It reduces network traffic by optimizing the map's intermediate result.
18
19
Data flow
Hadoop moves the MapReduce computation to each machine
hosting a part of the data.
Data Flow
A MapReduce job consists of the input data, the MapReduce program,
and configuration information.
Hadoop runs the job by dividing it into two types of tasks: map tasks
and reduce tasks.
Two types of nodes control the job: one jobtracker and several tasktrackers
• Jobtracker: coordinates and schedules tasks to run on tasktrackers.
• Tasktrackers: run tasks and send progress reports to the jobtracker.
20
Data flow
Data Flow – single reduce task
Reduce tasks don’t have the advantage of data locality – the input to a single
reduce task is normally the output from all mappers.
All map outputs are merged across the network and passed to the user-defined
reduce function.
The output of the reduce is normally stored in HDFS.
21
Data Flow
Data Flow – multiple reduce tasks
The number of reduce tasks is specified independently; it is not governed
by the size of the input.
The map tasks partition their output by key, each creating one partition
for each reduce task (see the sketch below).
There can be many keys and their associated values in each partition,
but the records for any key are all in a single partition.
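As an illustration (a sketch, not on the original slide), the default partitioning hashes the key modulo the number of reduce tasks, which is what Hadoop's HashPartitioner does; a custom partitioner for the <year, temperature> records could be written the same way (YearPartitioner is a hypothetical name):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key always land in the same partition, and therefore
// reach the same reduce task.
public class YearPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

It would be registered in the driver with job.setPartitionerClass(YearPartitioner.class), and the number of reduce tasks set with, for example, job.setNumReduceTasks(2).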
22
Data Flow
Data Flow – zero reduce tasks
23
MapReduce Types
Map & Reduce function types are as follows:
The map input key and value types (K1 and V1) are different from the
map output types (K2 and V2).
The reduce input must have the same types as the map output, although
the reduce output types may be different again (K3 and V3).
If a combine function is used then it is the same form as the reduce
function (and is an implementation of Reducer), except its output
types are the intermediate key and value types (K2 and V2), so they
can feed the reduce function.
The partition function operates on the intermediate key and value
types (K2 and V2), and returns the partition index. In practice, the
partition is determined solely by the key (the value is ignored).
24
Default Map, Reduce, Partitioner
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
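  // Sketch of the default behaviour (abridged; the nested Context class and the
  // run()/setup()/cleanup() plumbing of the real org.apache.hadoop.mapreduce.Mapper
  // are omitted). The default map() is an identity function: with no mapper class
  // set on the job, every input record is written straight to the map output.
  @SuppressWarnings("unchecked")
  protected void map(KEYIN key, VALUEIN value, Context context)
      throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }
}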
26
Reduce Class
27
Driver Class
28
Combiner Functions
Many MapReduce jobs are limited by the bandwidth available on the cluster.
It pays to minimize the data transferred between map and reduce tasks.
Hadoop allows the user to specify a combiner function to be run on the map
output – the combiner function's output forms the input to the reduce function.
The contract for the combiner function constrains the type of function that
may be used.
Example: finding the maximum temperature for 1950, with output from two map tasks.

Without a combiner function, every map output pair is shuffled to the reducer:
  First map output:   <1950, 0>, <1950, 20>, <1950, 10>
  Second map output:  <1950, 25>, <1950, 15>
  Reduce input:       <1950, [0, 20, 10, 25, 15]>   →   Reduce output: <1950, 25>

With a combiner function that finds the maximum temperature for each map output:
  First map output after combining:   <1950, 20>
  Second map output after combining:  <1950, 25>
  Reduce input:       <1950, [20, 25]>   →   Reduce output: <1950, 25>
Scaling Out
Combiner Functions
The function calls on the temperature values can be expressed as follows:
• max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25
If we were calculating mean temperatures, we could not use the mean as the
combiner function:
• mean(0, 20, 10, 25, 15) = 14
• but mean(mean(0, 20, 10), mean(25, 15)) = mean(10, 20) = 15
The combiner function doesn’t replace the reduce function.
It can help cut down the amount of data shuffled between the maps
and the reduces
30
Scaling Out
Combiner Functions
Specifying a combiner function
• The combiner function is defined using the Reducer interface
• It has the same implementation as the reducer function in MaxTemperatureReducer.
• The only change is to set the combiner class on the job (JobConf in the old API), as sketched below.
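A minimal sketch of the change in the driver shown earlier (setCombinerClass() is the only addition):

job.setMapperClass(MaxTemperatureMapper.class);
job.setCombinerClass(MaxTemperatureReducer.class);   // combiner reuses the reducer implementation
job.setReducerClass(MaxTemperatureReducer.class);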
31
Anatomy of a MapReduce Job Run
You can run a MapReduce job with a single line of code:
JobClient.runJob(conf).
• But it conceals a great deal of processing behind the
scenes.
At the highest level, there are four independent entities:
The client, which submits the MapReduce job.
The jobtracker, which coordinates the job run.
The tasktrackers, which run the tasks that the job has been split into.
The distributed filesystem (normally HDFS) used for sharing
job files between the other entities.
32
Anatomy of a MapReduce Job Run
33
Map Reduce Job Run
Jobs are submitted via waitForCompletion() in the driver class.
The client submits the MapReduce job to the jobtracker.
The jobtracker verifies certain properties of the job submission, such
as the input/output file specifications, then initializes the job and
assigns it to a tasktracker.
The tasktrackers run the map and reduce tasks.
Each tasktracker sends progress status to the jobtracker periodically.
The jobtracker periodically updates the overall job progress by combining
the status of the map and reduce tasks received from the tasktrackers.
Finally, the job completion status is returned to the client, indicating
that the job is done.
34
Job Submission
JobClient.runJob()
Creates a new JobClient instance (step 1)
Calls submitJob()
JobClient.submitJob()
Asks the jobtracker for a new job ID, by calling JobTracker.getNewJobId() (step 2)
Checks the output specification of the job
Computes the input splits for the job
Copies the resources needed to run the job to the jobtracker's filesystem (step 3)
• The job JAR file, the configuration file and the computed input splits
Tells the jobtracker that the job is ready for execution, by calling
JobTracker.submitJob() (step 4)
35
Job Initialization
JobTracker.submitJob()
Creates a new JobInProgress instance (step 5)
• Represents the job being run
• Encapsulates its tasks and status information
Puts it into an internal queue
• The job scheduler will pick it up and initialize it from the queue
Job scheduler
Retrieves the input splits from the shared filesystem (step 6)
Creates one map task for each split
Creates reduce tasks to be run
• The # of reduce tasks is determined by the mapred.reduce.tasks property
Gives IDs to the tasks
36
Task Assignment
Tasktrackers
Periodically send heartbeats to the jobtracker (step 7)
• Also send whether they are ready to run a new task
Have a fixed number of slots for map/reduce tasks
Jobtracker
Chooses a job to select the task from
Assigns map/reduce tasks to tasktrackers
using the heartbeat values
• For map tasks, it takes account of the data locality
37
Task Execution
Tasktracker
Copies the job JAR from the shared filesystem (step 8)
Creates a local working directory for the task,
and un-jars the contents of the JAR into this directory
Creates an instance of TaskRunner to run the task
TaskRunner
Launches a new Java Virtual Machine (JVM) (step 9)
• So that any bugs in the user-defined map and reduce functions
don’t affect the tasktracker
Runs each task in the JVM
• Child process informs the parent of the task’s progress every few
seconds
38
Anatomy of a MapReduce Job Run
Streaming and Pipes
• Both Streaming and Pipes run special map and reduce tasks for the purpose of
launching the user-supplied executable and communicating with it (Figure 6-2).
39
Anatomy of a MapReduce Job Run
Progress and Status Updates
When a task is running, it keeps track of its progress, that is, the
proportion of the task completed.
A job and each of its tasks have a status, which includes the state of the
job or task (e.g., running, successfully completed, failed), the progress
of maps and reduces, the values of the job’s counters, and a status
message or description
What Constitutes Progress in MapReduce?
• Reading an input record (in a mapper or reducer)
• Writing an output record (in a mapper or reducer)
• Incrementing a counter
• When the task status changes, it is reported to the tasktracker.
40
Anatomy of a MapReduce Job Run
Job Completion
When the jobtracker receives a notification that the last task for a job is complete,
it changes the status for the job to “successful.” Then, when the JobClient polls for
status, it learns that the job has completed successfully, so it prints a message to
tell the user, and then returns from the runJob() method.
41
Failures
Task Failure
Child task failing.
• When user code in the map or reduce task throws a runtime exception.
• If a Streaming task exits with a nonzero exit code, it is marked as failed
(governed by the stream.non.zero.exit.is.failure property).
• A hanging task fails when the tasktracker (TT) has not received a progress
update for a while and proceeds to mark the task as failed; the child JVM
process is then automatically killed.
• mapred.map.max.attempts = 4 (default)
• mapred.reduce.max.attempts = 4 (default)
42
Failures
Tasktracker Failure
Failure of a tasktracker is another failure mode.
If a tasktracker(TT) fails by crashing, or running very slowly, it
will stop sending heartbeats to the jobtracker(JT).
The JT removes it from its pool of TTs and arranges the map
tasks on another TT.
A JT can blacklist a TT if more than four tasks of the same job
fail on that particular TT.
Jobtracker Failure
Failure of the jobtracker is the most serious failure mode.
Currently, Hadoop has no mechanism for dealing with failure
of the jobtracker.
43
Job Scheduling
Goals
1. Good throughput or response time for tasks (or jobs)
2. High utilization of resources
Hadoop
A Hadoop job consists of Map tasks and Reduce tasks
If there is only one job in the entire cluster, it occupies the whole cluster.
Multiple Users with multiple jobs
• Users/jobs = “tenants”
• Multi-tenant system
• Need a way to schedule all these jobs (and their constituent
tasks)
• Need to be fair across the different tenants
Hadoop schedulers
FIFO Scheduler (MR V1)
Hadoop Capacity Scheduler
Hadoop Fair Scheduler
44
FIFO Scheduler
• Queue-based : Maintain tasks in a queue in order of arrival
• When processor free, dequeue head and schedule it
• Each job uses the whole cluster, so jobs wait their turn.
• Can set priorities for the jobs in the queue (5 queues with priorities)
• Job priority is set with the mapred.job.priority property
(VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW); see the sketch after this list
• Preemption is not supported
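A small sketch of setting the priority from the driver (assuming the MaxTemperature job from earlier):

// Old-API driver fragment (org.apache.hadoop.mapred.JobConf): raise this job's
// priority in the FIFO queue. The same property can be passed on the command
// line with -D mapred.job.priority=HIGH if the driver uses ToolRunner.
JobConf conf = new JobConf(MaxTemperature.class);
conf.set("mapred.job.priority", "HIGH");   // VERY_HIGH, HIGH, NORMAL, LOW, or VERY_LOW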
45
Capacity Schedulers
Contains multiple queues
Each queue contains multiple jobs
Each queue is guaranteed some portion of the cluster capacity
E.g.,
Queue 1 is given 80% of cluster
Queue 2 is given 20% of cluster
Higher-priority jobs go to Queue 1
For jobs within same queue, FIFO typically used
Administrators can configure queues
Elasticity
A queue is allowed to occupy more of the cluster if resources are free
But if other queues fall below their capacity limit, resources need to be given
back to those queues
Pre-emption not allowed!
Cannot stop a task part-way through
When reducing a queue's share of the cluster, wait until some tasks of that queue have finished
Fair Scheduler
Goal: All jobs get equal share of resources
When only one job present, occupies entire cluster
As other jobs arrive, each job given equal % of cluster
E.g., each job might be given an equal number of cluster-wide slots
Each slot == 1 task of job
Example:
• Mary, John, Peter submit jobs that demand 80, 80, and 120 tasks respectively
• Say the cluster has a limit to allocate 60 tasks at most
• Default behavior: Distribute task fairly among 3 users (each get 20)
48
Fair Scheduler
Minimum share can be set for a pool
In the previous example, say Mary has a minimum share of 40
Mary would be allocated 40, then the rest is distributed evenly
to other pools
When the minimum share is not met in a pool for a while:
Take resources away from other pools
• By pre-empting jobs in those other pools
• By killing tasks in pools running over capacity in order to give more slots
to the pool running under capacity.
49
Shuffle and sort in MapReduce
MapReduce makes the guarantee that the input to every reducer is
sorted by key.
The process by which the system performs the sort—and transfers
the map outputs to the reducers as inputs—is known as the shuffle
50
Shuffle and Sort
The Map Side
The map side of the process is more involved: it takes advantage of buffering
writes in memory and doing some presorting for efficiency. Figure 6-4 shows
what happens.
Each map task has a circular memory buffer that it writes its output to.
When the buffer reaches a threshold (80% by default), a background thread
starts to spill the contents to disk; before writing, the thread partitions the
data and sorts it by key within each partition.
Each spill produces a new spill file; before the task finishes, the spill files
are merged into a single partitioned and sorted output file.
The map output can be compressed to reduce the amount of data transferred
to the reducers (see the sketch below).
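A sketch of the related configuration, using the classic (pre-YARN) property names:

// Driver fragment: tune the map-side buffer and enable map output compression.
Configuration conf = new Configuration();
conf.setInt("io.sort.mb", 100);                           // in-memory sort buffer size, in MB
conf.setFloat("io.sort.spill.percent", 0.80f);            // start spilling at 80% full
conf.setBoolean("mapred.compress.map.output", true);      // compress map output before transfer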
51
Shuffle and Sort
The Reduce Side
The map tasks may finish at different times, so the reduce task starts
copying their outputs as soon as each completes.
How do reducers know which TTs to fetch map output from?
• When a map task completes successfully, it notifies its parent TT of the status
update, which in turn notifies the jobtracker.
• A thread in the reducer periodically asks the jobtracker for map output locations.
• TTs do not delete map outputs from disk as soon as the first reducer has
retrieved them, as the reducer may fail.
When all the map outputs have been copied, the reduce task moves into
the sort phase, which merges the map outputs, maintaining their sort ordering.
During the reduce phase the reduce function is invoked for each key in the
sorted output. The output of this phase is written directly to the output
filesystem, typically HDFS. In the case of HDFS, since the TT node is also
running a datanode, the first block replica will be written to the local disk.
52
Task Execution
Speculative Execution
Because a job is broken into many tasks that run in parallel, job execution
time is sensitive to slow-running tasks: it takes only one slow task to make
the whole job take significantly longer than it would have done otherwise.
Hadoop tries to detect when a task is running slower than expected
and launches another, equivalent, task as a backup. This is termed
speculative execution of tasks.
It’s important to understand that speculative execution does not
work by launching two duplicate tasks at about the same time so
they can race each other.
• if the original task completes before the speculative task then the speculative task
is killed; on the other hand, if the speculative task finishes first, then the original
is killed.
Speculative execution is an optimization, not a feature to make jobs
run more reliably.
53
Task Execution
Task JVM Reuse
Hadoop runs tasks in their own Java Virtual Machine to isolate them
from other running tasks.
The overhead of starting a new JVM for each task can take around a
second, which for jobs that run for a minute or so is insignificant.
However, jobs that have a large number of very short-lived tasks (these
are usually map tasks), or that have lengthy initialization, can see
performance gains when the JVM is reused for subsequent tasks.
The property for controlling task JVM reuse is mapred.job.reuse.jvm.num.tasks:
it specifies the maximum number of tasks to run for a given job for each JVM
launched; the default is 1 (see the sketch below).
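A sketch of enabling unlimited reuse for a job (-1 means no limit, so all of a job's tasks scheduled on a tasktracker can share JVMs):

// Old-API driver fragment.
JobConf conf = new JobConf(MaxTemperature.class);
conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);   // default is 1: a fresh JVM per task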
54
Task Execution
Skipping Bad Records
The best way to handle corrupt records is in your mapper or reducer code.
You can detect the bad record and ignore it, or you can abort the job by
throwing an exception. You can also count the total number of bad records in
the job using counters to see how widespread the problem is.
Skipping mode is off by default; you enable it independently for map and
reduce tasks using the SkipBadRecords class.
The Task Execution Environment
Hadoop provides information to a map or reduce task about the environment in which
it is running. (The properties in Table 6-5 can be accessed from the job’s configuration)
55
Distributed and Parallel Processing Technology
56
Mapper and Reducer – data flow figure (files loaded from local HDFS store)
57
Partitioner – data flow figure (files loaded from local HDFS store)
58
Sort
Each Reducer is responsible for reducing the values associated with (several)
intermediate keys.
The set of intermediate keys on a single node is automatically sorted by
MapReduce before they are presented to the Reducer.
(Data flow figure: files loaded from the local HDFS store pass through
InputFormat → input splits → RecordReader (RR) → Map → Partitioner → Sort → Reduce.)
59
Input Files
• Input Splits
• An InputSplit has a length in bytes and a set of storage locations, which are
just hostname strings.
• The storage locations are used by the MapReduce system to place map tasks
as close to the split’s data as possible.
• The size is used to order the splits so that the largest get processed first, in an
attempt to minimize the job runtime.
Input Formats
FileInputFormat
• FileInputFormat is the base class for all implementations of InputFormat
that use files as their data source (see Figure 7-2).
• It provides two things:
1. a place to define which files are included as the input to a job, and
2. an implementation for generating splits for the input files.
62
63
Input Formats
FileInputFormat input paths
• FileInputFormat offers four static convenience methods
for setting a JobConf's input paths, listed below:
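The four methods, with their old-API signatures (the new-API org.apache.hadoop.mapreduce.lib.input.FileInputFormat takes a Job instead of a JobConf):

public static void addInputPath(JobConf conf, Path path)
public static void addInputPaths(JobConf conf, String commaSeparatedPaths)
public static void setInputPaths(JobConf conf, Path... inputPaths)
public static void setInputPaths(JobConf conf, String commaSeparatedPaths)

The add methods append to the existing list of input paths, while the set methods replace it.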
64
Input Formats
FileInputFormat input splits
1. FileInputFormat splits only large files (here, “large” means larger
than an HDFS block).
2. The split size is normally the size of an HDFS block, which is appropriate
for most applications (see the formula below).
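For reference, the split size is computed from three values (the minimum and maximum split sizes are configurable; with the defaults the block size wins, so one split corresponds to one HDFS block):

splitSize = max(minimumSize, min(maximumSize, blockSize))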
65
Input Formats
Small files and CombineFileInputFormat
• Hadoop works better with a small number of large files than a large number
of small files.
• If the files are very small (“small” means significantly smaller than an HDFS
block) and there are a lot of them, then each map task will process very little
input, and there will be a lot of map tasks (one per file), each of which
imposes extra bookkeeping overhead.
• CombineFileInputFormat was designed to work well with small files.
1. CombineFileInputFormat packs many files into each split so that each
mapper has more to process.
2. It takes node and rack locality into account when deciding which
blocks to place in the same split
3. CombineFileInputFormat does not compromise the speed at which it can
process the input in a typical MapReduce job.
66
Input Formats
Text Input Format (Default InputFormat)
• The key, a LongWritable, is the byte offset within the file of the
beginning of the line.
• The value is the contents of the line, excluding any line terminators
(newline, carriage return), and is packaged as a Text object.
• So a file containing four lines of text is divided into one split of four
records; the records are interpreted as key-value pairs, as illustrated below.
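A hypothetical four-line file (an illustration; the original slide's sample file is not reproduced here):

line one
line two
line three
line four

TextInputFormat presents it as the following records, where each key is the byte offset of the start of the line (newlines count toward the offsets but are excluded from the values):

(0,  "line one")
(9,  "line two")
(18, "line three")
(29, "line four")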
67
Input Formats
KeyValueTextInputFormat
Useful in case the key is already present in the input file.
Key and value are separated by a delimiter, by default a tab character.
The separator is set with the mapreduce.input.keyvaluelinerecordreader.key.value.separator property.
NLineInputFormat
With TextInputFormat and KeyValueTextInputFormat, each mapper receives
a variable number of lines of input.
The number depends on the size of the split and the length of the lines.
If you want your mappers to receive a fixed number of lines of input, then
NLineInputFormat is the InputFormat to use.
Database Input (and Output)
DBInputFormat is an input format for reading data from a relational database,
using JDBC.
Because it doesn’t have any sharding capabilities, you need to be careful not to overwhelm
the database you are reading from by running too many mappers. For this reason, it is best
used for loading relatively small datasets, perhaps for joining with larger datasets from
HDFS, using MultipleInputs.
68
Input Formats
Sequence File Input Formats
SequenceFileInputFormat
• Hadoop’s sequence file format stores sequences of binary key-value
pairs.
SequenceFileAsTextInputFormat
• SequenceFileAsTextInputFormat is a variant of SequenceFileInputFormat
that converts the sequence file’s keys and values to Text objects.
SequenceFileAsBinaryInputFormat
• SequenceFileAsBinaryInputFormat is a variant of SequenceFileInputFormat
that retrieves the sequence file's keys and values as opaque binary objects.
• The map task should know how to deal with these objects.
69
Input Formats
Multiple Inputs
A job's input may come from several sources in different formats: one might be
tab-separated plain text, the other a binary sequence file. Even if they are in the
same format, they may have different representations, and therefore need to be
parsed differently.
These cases are handled elegantly by using the MultipleInputs class.
• MultipleInputs class, which allows you to specify the InputFormat and
Mapper to use on a per-path basis.
• Example: if we had weather data from the U.K. Met Office that we wanted to
combine with the NCDC data for our maximum temperature analysis, then we
might set up the input as follows (sketched below):
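A sketch with the new-API MultipleInputs (org.apache.hadoop.mapreduce.lib.input.MultipleInputs); MetOfficeMaxTemperatureMapper and the two Path variables are placeholder names for this illustration:

// Each input path gets its own InputFormat and Mapper; both mappers must emit
// the same map output types so that a single reducer can process them.
MultipleInputs.addInputPath(job, ncdcInputPath,
    TextInputFormat.class, MaxTemperatureMapper.class);
MultipleInputs.addInputPath(job, metOfficeInputPath,
    TextInputFormat.class, MetOfficeMaxTemperatureMapper.class);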
70
OutputFormat – data flow figure (files loaded from local HDFS store)
71
OutputFormat
72
Output Formats
Text Output
The default output format, TextOutputFormat, writes records as
lines of text.
TextOutputFormat keys and values may be of any type.
Each key-value pair is separated by a tab character, although that
may be changed using the mapred.textoutputformat.separator
property.
You can suppress the key or the value (or both, making this output
format equivalent to NullOutputFormat, which emits nothing) from
the output using a NullWritable type.
73
Output Formats
Binary Output
SequenceFileOutputFormat
• As the name indicates, SequenceFileOutputFormat writes sequence
files for its output.
• This is a good choice of output if it forms the input to a further
MapReduce job, since it is compact, and is readily compressed.
SequenceFileAsBinaryOutputFormat
• SequenceFileAsBinaryOutputFormat is the counterpart to
SequenceFileAsBinaryInputFormat.
• SequenceFileAsBinaryOutputFormat writes keys and values in
raw binary format into a SequenceFile container.
MapFileOutputFormat
• MapFileOutputFormat writes MapFiles as output.
74
Output Formats
Multiple Outputs
FileOutputFormat and its subclasses generate a set of files in the
output directory.
• There is one file per reducer
• Files are named by the partition number: part-00000, part-00001, etc.
There is sometimes a need to have more control over the naming
of the files, or to produce multiple files per reducer.
• MapReduce comes with two libraries to help you do this:
MultipleOutputFormat and MultipleOutputs.
MultipleOutputFormat
• MultipleOutputFormat allows you to write data to multiple files
whose names are derived from the output keys and values.
75
Output Formats
MultipleOutputs
• There’s a second library in Hadoop for generating multiple outputs, provided
by the MultipleOutputs class.
• Unlike MultipleOutputFormat, MultipleOutputs can emit different types for
each output. On the other hand, there is less control over the naming of outputs.
• What’s the Difference Between MultipleOutputFormat and MultipleOutputs?
77