MapReduce Types and Formats
By
Dr. K. Venkateswara Rao
Professor CSE
Prepared using O’Reilly – Hadoop: The Definitive Guide; some slides are adapted from Taikyoung Kim’s presentation
MapReduce Types
• MapReduce has a simple model of data processing: inputs and outputs for the
map and reduce functions are key-value pairs.
• The map and reduce functions in Hadoop MapReduce have the following
general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
• In general, the map input key and value types (K1 and V1) are different from
the map output types (K2 and V2).
• The reduce input must have the same types as the map output, although the
reduce output types may be different again (K3 and V3).
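• As a concrete illustration (a word-count sketch, not code from these slides), the generic parameters of the Mapper and Reducer classes follow exactly this form: Mapper<K1, V1, K2, V2> and Reducer<K2, V2, K3, V3>.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (K1=LongWritable, V1=Text) -> list(K2=Text, V2=IntWritable)
class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), ONE); // emit (K2, V2)
      }
    }
  }
}

// reduce: (K2=Text, list(V2=IntWritable)) -> list(K3=Text, V3=IntWritable)
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // emit (K3, V3)
  }
}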
MapReduce Types
• Combiner Function
map: (K1, V1) → list(K2, V2)
combiner: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
It has the same form as the reduce function, except that its output types are the intermediate types (K2 and V2), so they can feed the reduce function
Often the combiner and reduce functions are the same
• Partition Function
partition: (K2, V2) → integer
Operates on the intermediate key and value types (K2 and V2)
Returns the partition index
In practice, the partition is determined solely by the key (the value is
ignored)
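• For illustration, the default HashPartitioner does exactly this: it hashes only the key and ignores the value. A minimal sketch of the same behaviour:
import org.apache.hadoop.mapreduce.Partitioner;

// Partitions on the key's hash code alone; the value is ignored.
public class KeyHashPartitioner<K2, V2> extends Partitioner<K2, V2> {
  @Override
  public int getPartition(K2 key, V2 value, int numPartitions) {
    // Mask off the sign bit so the partition index is non-negative.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}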
MapReduce Types
• Input types are set by the input format.
• Ex:- setInputFormatClass(TextInputFormat.class)
Generates keys of type LongWritable and values of type Text
• The other types are set explicitly by calling methods on the Job
Ex:- Job job; job.setMapOutputKeyClass(Text.class);
• If the intermediate (map output) types are not set explicitly, they default to the final output types.
The default MapReduce Job
@Override
public int run(String[] args) throws Exception {
  Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);

  job.setInputFormatClass(TextInputFormat.class);

  job.setMapperClass(Mapper.class);
  job.setMapOutputKeyClass(LongWritable.class);
  job.setMapOutputValueClass(Text.class);

  job.setPartitionerClass(HashPartitioner.class);
  job.setNumReduceTasks(1);
  job.setReducerClass(Reducer.class);

  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(Text.class);

  job.setOutputFormatClass(TextOutputFormat.class);

  return job.waitForCompletion(true) ? 0 : 1;
}
MapReduce Types
• Number of Reducers:
• Choosing the number of reducers for a job is more of an art than a science.
Increasing the number of reducers makes the reduce phase shorter, since you
get more parallelism. However, if you take this too far, you can have lots of
small files, which is suboptimal. One rule of thumb is to aim for reducers that
each run for five minutes or so, and which produce at least one HDFS block’s
worth of output.
• The number of map tasks is equal to the number of splits that the input is turned into. The number of reducers, by contrast, is chosen independently of the input; the total number of reduce slots in the cluster is the number of nodes multiplied by the slots per node (mapred.tasktracker.reduce.tasks.maximum).
• It is good to have slightly fewer reducers than the total number of slots.
• By default, there is a single reducer
Hadoop Streaming
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop and the user’s program.
• Difference between Streaming and the Java MapReduce API:
• The Java API is geared toward processing your map function one record at a
time. The framework calls the map() method on your Mapper for each record in
the input, whereas
• With Streaming the map program can decide how to process the input
• for example, it could easily read and process multiple lines at a time since
it’s in control of the reading.
The relationship of the Streaming executable to the node
manager and the task container
• Streaming runs special map and reduce tasks for the purpose of launching the user-supplied executable and communicating with it.
• The Streaming task communicates with the
streaming process (which may be written in any
language) using standard input and output streams.
• During execution of the task, the Java process passes input key-value pairs to the external process, which runs them through the user-defined map or reduce function and passes the output key-value pairs back to the Java process.
MapReduce Types: The default Streaming Job
• $ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop*-streaming.jar \
-input input/sample.txt -output output -mapper /bin/cat
• There is no default identity mapper, so it must be set explicitly. Streaming output keys and values are always Text. Usually the key (the line offset) is not passed to the mapper; only the value is. Spelling out the defaults explicitly, the command is:
• $ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop*-streaming.jar \
-input input/sample.txt -output output \
-inputformat org.apache.hadoop.mapred.TextInputFormat -mapper /bin/cat \
-partitioner org.apache.hadoop.mapred.lib.HashPartitioner -numReduceTasks 1 \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
MapReduce Types: Keys and values in Streaming
• A Streaming application can control the separator that is used when
a key-value pair is turned into a series of bytes and sent to the map or reduce
process over standard input.
• The separator can be configured independently for maps and reducers
• Furthermore, the key from the output can be composed of more than the first field: it can be made up of the first n fields (defined by stream.num.map.output.key.fields or stream.num.reduce.output.key.fields), with the value being the remaining fields after those n fields.
For example, if the output from a Streaming process was a,b,c (with a comma as
the separator), and n was 2, the key would be parsed as a,b and the value as c.
MapReduce Types: Keys and values in Streaming
• The separator can be configured independently for maps and reducers.
• These settings do not have any bearing on the input and output formats.
Use of separators in a Streaming MapReduce job
Input Formats
Input Formats: Input Splits and Records
• Splits and records are logical concepts; they are not necessarily tied to files.
Input Formats
• Hadoop can process many different types of data formats, from flat text files to
databases.
Input Formats: Input Splits and Records
• Input splits are represented by the Java class InputSplit (which is in the
org.apache.hadoop.mapreduce package)
• An InputSplit has a length in bytes and a set of storage locations which are just hostname
strings.
• A split doesn’t contain the input data; it is just a reference to the data.
• The storage locations are used by the MapReduce system to place map tasks as close to the
split’s data as possible
• The size is used to order the splits so that the largest get processed first, in an attempt to
minimize the job runtime.
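• An abridged view of org.apache.hadoop.mapreduce.InputSplit (restated here as a sketch) shows the two pieces of information a split carries:
import java.io.IOException;

// Abridged restatement of the InputSplit abstraction.
public abstract class InputSplitSketch {
  // Size of the split in bytes; used to process the largest splits first.
  public abstract long getLength() throws IOException, InterruptedException;

  // Hostnames where the split's data is stored; used for data-local scheduling.
  public abstract String[] getLocations() throws IOException, InterruptedException;
}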
Input Formats: Input Splits and Records
• MapReduce application developers need not deal with InputSplits directly, as they are created by an InputFormat.
• The client running the job calculates the splits for the job by calling getSplits(), then sends
them to the application master, which uses their storage locations to schedule map tasks that
will process them on the cluster.
• The map task passes the split to the createRecordReader() method on InputFormat to obtain a RecordReader for that split. The RecordReader iterates over the records in the split.
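• Abridged, the InputFormat contract that ties splits and record readers together looks like this (a restated sketch of org.apache.hadoop.mapreduce.InputFormat):
import java.io.IOException;
import java.util.List;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Abridged restatement of the InputFormat abstraction.
public abstract class InputFormatSketch<K, V> {
  // Called on the client to compute the splits for the job.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;

  // Called by each map task to obtain a RecordReader for its split.
  public abstract RecordReader<K, V> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException;
}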
Input Formats: Input Splits and Records
• The map task uses the RecordReader to generate record key-value pairs, which it passes to the map function; the Mapper’s run() method, sketched after this list, drives that loop.
• After running setup(), the nextKeyValue() is called repeatedly on the Context to populate the
key and value objects for the mapper.
• The key and value are retrieved from the RecordReader by way of the Context and are
passed to the map() method for it to do its work.
• When the reader gets to the end of the stream, the nextKeyValue() method returns false, and
the map task runs its cleanup() method and then completes.
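• A simplified sketch of that run() method (recent Hadoop releases additionally wrap the loop in try/finally so cleanup() always runs):
import java.io.IOException;
import org.apache.hadoop.mapreduce.Mapper;

public class RunLoopSketch<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
    extends Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
  @Override
  public void run(Context context) throws IOException, InterruptedException {
    setup(context);
    // nextKeyValue() returns false at the end of the split's records.
    while (context.nextKeyValue()) {
      map(context.getCurrentKey(), context.getCurrentValue(), context);
    }
    cleanup(context);
  }
}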
Input Formats: FileInputFormat
• FileInputFormat is the base class for all implementations of InputFormat that
use files as their data source
• It provides two things:
1. A place to define which files are included as the input to a job,
2. An implementation for generating splits for the input files.
• The job of dividing splits into records is performed by subclasses.
• FileInputFormat offers four static convenience methods for setting a Job’s input
paths:
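• In sketch form, the four methods (from org.apache.hadoop.mapreduce.lib.input.FileInputFormat; the paths below are placeholders) are used like this:
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class InputPathExamples {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Add a single path to the list of inputs.
    FileInputFormat.addInputPath(job, new Path("/data/one"));
    // Add several paths given as a comma-separated string.
    FileInputFormat.addInputPaths(job, "/data/two,/data/three");
    // Replace the input list with the given Path objects...
    FileInputFormat.setInputPaths(job, new Path("/data/a"), new Path("/data/b"));
    // ...or with a comma-separated string of paths.
    FileInputFormat.setInputPaths(job, "/data/c,/data/d");
  }
}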
Input Formats: FileInputFormat input splits
• FileInputFormat splits only large files—here, “large” means larger than an
HDFS block.
Input Formats: FileInputFormat input splits
• The split size is calculated by the following formula (see the computeSplitSize() method in
FileInputFormat):
• max(minimumSize, min(maximumSize, blockSize))
• and by default: minimumSize < blockSize < maximumSize
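• A small sketch of that calculation with the usual defaults (minimum size 1 byte, maximum size Long.MAX_VALUE), showing why a split normally ends up the size of an HDFS block:
public class SplitSizeDemo {
  // Mirrors the formula used by FileInputFormat.computeSplitSize().
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  public static void main(String[] args) {
    long blockSize = 128L * 1024 * 1024; // a 128 MB HDFS block
    long minSize = 1L;                   // default minimum split size
    long maxSize = Long.MAX_VALUE;       // default maximum split size
    // With the defaults, the split size equals the block size (134217728 bytes).
    System.out.println(computeSplitSize(blockSize, minSize, maxSize));
  }
}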
Small Files and CombineFileInputFormat
Input Formats: TextInputFormat
Input Formats: NLineInputFormat
Input Formats: Binary Input
Input Formats: Multiple Inputs
Output Formats
Output Formats: Text Output
Output Formats: Binary Output
Output Formats: Multiple Outputs
Output Formats: Lazy Output
Sorting
• Types of samplers:
1. RandomSampler
2. SplitSampler
3. IntervalSampler
• RandomSampler chooses keys with a uniform probability. It takes the following parameters:
1. the sampling frequency (probability),
2. the maximum number of samples to take,
3. the maximum number of splits to sample.
Sorting
• SplitSampler samples only the first n records in a split. It is not so good for
sorted data because it doesn’t select keys from throughout the split.
• IntervalSampler chooses keys at regular intervals through the split and makes
a better choice for sorted data.
• RandomSampler is a good general-purpose sampler.
• If none of these suits the application, users can write their own implementation of the Sampler interface.
• The objective of sampling is to produce partitions that are approximately equal
in size.
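• A minimal sketch of wiring a RandomSampler to a TotalOrderPartitioner (the key/value types and the partition-file path are hypothetical, and the job would also need its input, output, and map output key type configured):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalSortSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // Sample keys with probability 0.1, up to 10,000 samples from at most 10 splits.
    InputSampler.Sampler<Text, Text> sampler =
        new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);

    // Tell the partitioner where its partition file lives (path is hypothetical),
    // then write the sampled partition boundaries to it.
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path("/tmp/_partitions"));
    InputSampler.writePartitionFile(job, sampler);
  }
}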
Secondary Sort
• The MapReduce framework sorts the records by key before they reach the reducers.
For any particular key, however, the values are not sorted.
• The order in which the values appear is not even stable from one run to the next,
because they come from different map tasks, which may finish at different times
from run to run.
• It is possible to impose an order on the values by sorting and grouping the keys in a
particular way.
• Do the following to get the effect of sorting by value (a configuration sketch follows this list):
Make the key a composite of the natural key and the natural value.
The sort comparator should order by the composite key (i.e., the natural key and
natural value).
The partitioner and grouping comparator for the composite key should consider
only the natural key for partitioning and grouping.
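• One minimal way to wire this up (a sketch, not from these slides): use a tab-separated Text composite key of the form "naturalKey\tnaturalValue", so the default Text byte ordering sorts by natural key and then by natural value (as strings). All class names below are illustrative.
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class SecondarySortSketch {
  // Extract the natural key (the part before the first tab).
  static String naturalKey(Text composite) {
    String s = composite.toString();
    int tab = s.indexOf('\t');
    return tab < 0 ? s : s.substring(0, tab);
  }

  // Partition on the natural key only, so all values for a key reach one reducer.
  public static class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
      return (naturalKey(key).hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
  }

  // Group on the natural key only, so one reduce() call sees every composite key
  // sharing that natural key, already sorted by natural value.
  public static class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
      super(Text.class, true);
    }
    @Override
    @SuppressWarnings("rawtypes")
    public int compare(WritableComparable a, WritableComparable b) {
      return naturalKey((Text) a).compareTo(naturalKey((Text) b));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setMapOutputKeyClass(Text.class);  // composite "naturalKey\tnaturalValue"
    job.setMapOutputValueClass(Text.class);
    job.setPartitionerClass(NaturalKeyPartitioner.class);
    job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);
    // The default Text sort order handles the composite-key ordering itself.
  }
}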
Joins
• MapReduce can perform joins between large datasets
• How the join can be implemented depends on how large the
datasets are and how they are partitioned. If one dataset is large (the
weather records) but the other one is small enough to be distributed
to each node in the cluster (as the station metadata is), the join can
be effected by a MapReduce job that brings the records for each
station together.
• The mapper or reducer uses the smaller dataset to look up the
station metadata for a station ID.
• If the join is performed by the mapper, it is called a map-side join.
• If the join is performed by the reducer, it is called a reduce-side join.
Inner join of two datasets
Map-Side Joins
• If both datasets are too large for either to be copied to each node in the cluster, they
can still be joined using MapReduce with a map-side or reduce-side join, depending
on how the data is structured.
• A map-side join works by performing the join before the data reaches the map
function
• Requirements
• Each input dataset must be divided into the same number of partitions
• It must be sorted by the same key (the join key) in each source
• All the records for a particular key must reside in the same partition
• The above requirements fit the description of the output of a MapReduce job.
• A map-side join can therefore be used to join the outputs of several jobs that had the same number of reducers, the same keys, and output files that are not splittable.
Map-Side Joins
• Use a CompositeInputFormat from the org.apache.hadoop.mapred.join
package to run a map-side join
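• A rough sketch of such a job (assuming the newer org.apache.hadoop.mapreduce.lib.join package; the paths are hypothetical and the exact compose() overloads should be checked against your Hadoop version):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

public class MapSideJoinSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();

    // Both inputs must already be partitioned and sorted identically on the join key.
    String joinExpr = CompositeInputFormat.compose("inner",
        KeyValueTextInputFormat.class,
        new Path("/data/records"), new Path("/data/stations"));

    job.getConfiguration().set(CompositeInputFormat.JOIN_EXPR, joinExpr);
    job.setInputFormatClass(CompositeInputFormat.class);
    // The mapper then receives the join key and a TupleWritable of joined values.
  }
}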
Reduce-Side Joins
• More general than a map-side join
• Input datasets don’t have to be structured in any particular way
• Less efficient as both datasets have to go through the MapReduce shuffle
• Idea
• The mapper tags each record with its source
• Uses the join key as the map output key so that the records with the same key are
brought together in the reducer
• Multiple inputs
• The input sources for the datasets have different formats
• Use the MultipleInputs class to separate the logic for parsing and tagging each source (see the sketch after this list).
• Secondary sort
• To perform the join, it is important that the reducer sees the data from one source before the other, which is arranged with a secondary sort.
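• As a sketch of the MultipleInputs idea mentioned above (the paths and the two tagging mappers are hypothetical placeholders):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class ReduceSideJoinInputs {
  // Hypothetical placeholder mappers; in a real job each would parse its own
  // format and emit (join key, source-tagged record).
  public static class StationMapper extends Mapper<LongWritable, Text, Text, Text> { }
  public static class RecordMapper extends Mapper<LongWritable, Text, Text, Text> { }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Each source gets its own input format and mapper; both use the join key
    // as the map output key so matching records meet in the same reduce call.
    MultipleInputs.addInputPath(job, new Path("/data/stations"),
        TextInputFormat.class, StationMapper.class);
    MultipleInputs.addInputPath(job, new Path("/data/records"),
        TextInputFormat.class, RecordMapper.class);
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(Text.class);
  }
}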
Joins - Map-Side vs Reduce-Side
Side Data Distribution
• Side data can be defined as extra read-only data needed by a job to process the main dataset.
• The challenge is to make side data available to all the map or reduce tasks (which are spread across
the cluster) in a convenient and efficient fashion.
• Using the Job Configuration
• Set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration (JobConf in the old API)
• Useful if one needs to pass a small piece of metadata to tasks
• Don’t use this mechanism for transferring more than a few kilobytes of data
• The job configuration is read by the jobtracker, the tasktracker, and the child JVM, and
each time the configuration is read, all of its entries are read into memory, even if they
are not used
• Distributed Cache
• Instead of serializing side data in the job config, it is preferred to distribute the datasets using
Hadoop’s distributed cache
• Provides a service for copying files and archives to the task nodes in time for the tasks to
use them when they run
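• A minimal sketch of using the distributed cache through the new API (the file path is hypothetical):
import java.net.URI;
import org.apache.hadoop.mapreduce.Job;

public class DistributedCacheSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Copy a small lookup file to every task node before the tasks run.
    job.addCacheFile(new URI("/lookup/stations.txt"));
    // Inside a task, setup() can locate the cached files via context.getCacheFiles().
  }
}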
MapReduce Library Classes
• Hadoop comes with a library of mappers and reducers for commonly used functions.
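• For example, a complete word count can be assembled almost entirely from library classes such as TokenCounterMapper and IntSumReducer (a sketch; the input and output paths are placeholders):
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class LibraryWordCount {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(LibraryWordCount.class);
    job.setMapperClass(TokenCounterMapper.class); // emits (word, 1) for each token
    job.setCombinerClass(IntSumReducer.class);    // sums counts locally on the map side
    job.setReducerClass(IntSumReducer.class);     // sums counts for each word
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}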