
MapReduce Types and Formats
By
Dr. K. Venkateswara Rao
Professor CSE

Prepared using O’Reilly’s Hadoop: The Definitive Guide; some slides
are taken from Taikyoung Kim’s presentation
MapReduce Types
• MapReduce has a simple model of data processing: inputs and outputs for the
map and reduce functions are key-value pairs.
• The map and reduce functions in Hadoop MapReduce have the following
general form:
map: (K1, V1) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
• In general, the map input key and value types (K1 and V1) are different from
the map output types (K2 and V2).
• The reduce input must have the same types as the map output, although the
reduce output types may be different again (K3 and V3).
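In the Java API, these types appear as the generic type parameters of the Mapper and Reducer classes. The following word-count-style sketch is illustrative (not from the slides) and assumes the new org.apache.hadoop.mapreduce API; the class names are made up:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<K1, V1, K2, V2>: here (LongWritable, Text) -> list(Text, IntWritable)
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    for (String word : value.toString().split("\\s+")) {
      if (!word.isEmpty()) {
        context.write(new Text(word), ONE);   // emit (K2, V2)
      }
    }
  }
}

// Reducer<K2, V2, K3, V3>: here (Text, list(IntWritable)) -> list(Text, IntWritable)
class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum)); // emit (K3, V3)
  }
}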
MapReduce Types
• Combiner Function
map: (K1, V1) → list(K2, V2)
combiner: (K2, list(V2)) → list(K2, V2)
reduce: (K2, list(V2)) → list(K3, V3)
It has the same form as the reduce function, except that its output types are the
intermediate types (K2 and V2), so that its output can feed the reduce function
Often the combiner and reduce functions are the same
• Partition Function
partition: (K2, V2) → integer
Operates on the intermediate key and value types (K2 and V2)
Returns the partition index
In practice, the partition is determined solely by the key (the value is
ignored)
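As an illustration (not from the slides), a partitioner that mirrors the behaviour of the default HashPartitioner could be written as follows; Text keys and IntWritable values and the class name are assumptions:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// partition: (K2, V2) -> integer; only the key is used to choose the reducer.
public class KeyHashPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numPartitions) {
    // Mask off the sign bit so the result is always a valid partition index.
    return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}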
MapReduce Types
• Input types are set by the input format.

• Ex:- setInputFormatClass(TextInputFormat.class)
Generates keys of type LongWritable and values of type Text
• The other types are set explicitly by calling the methods on the Job
Ex:- Job job; job.setMapOutputKeyClass(Text.class)
• If not set explicitly, the intermediate (map output) types default to the final output types.
The default MapReduce Job
@Override
public int run(String[] args) throws Exception {
  Job job = JobBuilder.parseInputAndOutput(this, getConf(), args);

  job.setInputFormatClass(TextInputFormat.class);

  job.setMapperClass(Mapper.class);
  job.setMapOutputKeyClass(LongWritable.class);
  job.setMapOutputValueClass(Text.class);

  job.setPartitionerClass(HashPartitioner.class);
  job.setNumReduceTasks(1);
  job.setReducerClass(Reducer.class);

  job.setOutputKeyClass(LongWritable.class);
  job.setOutputValueClass(Text.class);

  job.setOutputFormatClass(TextOutputFormat.class);

  return job.waitForCompletion(true) ? 0 : 1;
}
MapReduce Types
• Number of Reducers:
• Choosing the number of reducers for a job is more of an art than a science.
Increasing the number of reducers makes the reduce phase shorter, since you
get more parallelism. However, if you take this too far, you can have lots of
small files, which is suboptimal. One rule of thumb is to aim for reducers that
each run for five minutes or so, and which produce at least one HDFS block’s
worth of output.
• The number of map tasks is equal to the number of splits that the input is
turned into, whereas the number of reducers is set for the job. The total number
of reduce slots in the cluster is the number of nodes multiplied by the slots per
node (mapred.tasktracker.reduce.tasks.maximum).
• It is good to set slightly fewer reducers than the total number of reduce slots.
• By default, there is a single reducer
Hadoop Streaming
• Hadoop Streaming uses Unix standard streams as the interface between Hadoop
and the user’s program.
• Differences between Streaming and the Java MapReduce API:
• The Java API is geared toward processing your map function one record at a
time: the framework calls the map() method on your Mapper for each record in
the input.
• With Streaming, the map program can decide how to process the input;
• for example, it could easily read and process multiple lines at a time, since
it’s in control of the reading.
The relationship of the Streaming executable to the node
manager and the task container
• Streaming runs special map and reduce tasks for the
purpose of launching the user-supplied executable
and communicating with it.
• The Streaming task communicates with the
streaming process (which may be written in any
language) using standard input and output streams.
• During execution of the task, the Java process
passes input key-value pairs to the external process,
which runs them through the user-defined map or
reduce function and passes the output key-value
pairs back to the Java process.
MapReduce Types: The default Streaming Job
• $ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop*-streaming.jar \
-input input/sample.txt -output output -mapper /bin/cat
• There is no default identity mapper in Streaming, so the mapper must be set
explicitly (here /bin/cat acts as one). Hadoop Streaming output keys and values are
always Text. Usually the key (the line offset) is not passed to the mapper. Spelled out
in full with its defaults, the command is:
• $ hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
-input input/sample.txt -output output \
-inputformat org.apache.hadoop.mapred.TextInputFormat -mapper /bin/cat \
-partitioner org.apache.hadoop.mapred.lib.HashPartitioner -numReduceTasks 1 \
-reducer org.apache.hadoop.mapred.lib.IdentityReducer \
-outputformat org.apache.hadoop.mapred.TextOutputFormat
MapReduce Types: Keys and values in Streaming
• A Streaming application can control the separator that is used when
a key-value pair is turned into a series of bytes and sent to the map or reduce
process over standard input.
• The separator can be configured independently for maps and reducers
• Furthermore, the key from the output can be composed of more than the first field: it
can be made up of the first n fields (defined by
stream.num.map.output.key.fields
or
stream.num.reduce.output.key.fields),
• The value is the remaining fields, after n fields
For example, if the output from a Streaming process was a,b,c (with a comma as
the separator), and n was 2, the key would be parsed as a,b and the value as c.
MapReduce Types: Keys and values in Streaming
• The separator can be configured or set independently for maps and reducers.
• These settings do not have any bearing on the input and output formats.
Use of separators in a Streaming MapReduce job
Input Formats
• Hadoop can process many different types of data formats, from flat text files to
databases.
Input Formats: Input Splits and Records
• Splits and records are logical: an input split is a chunk of the input processed by a
single map task, and each split is divided into records (key-value pairs) that the map
task processes in turn.
Input Formats: Input Splits and Records
• Input splits are represented by the Java class InputSplit (which is in the
org.apache.hadoop.mapreduce package)

• An InputSplit has a length in bytes and a set of storage locations which are just hostname
strings.
• A split doesn’t contain the input data; it is just a reference to the data.
• The storage locations are used by the MapReduce system to place map tasks as close to the
split’s data as possible
• The size is used to order the splits so that the largest get processed first, in an attempt to
minimize the job runtime.
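In outline, InputSplit (in org.apache.hadoop.mapreduce) looks roughly like this (an abridged sketch; other members are omitted):

import java.io.IOException;

public abstract class InputSplit {
  // Length in bytes, used to order splits so that the largest are processed first.
  public abstract long getLength() throws IOException, InterruptedException;
  // Hostnames of the nodes storing the split's data, used for data-local scheduling.
  public abstract String[] getLocations() throws IOException, InterruptedException;
}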
Input Formats: Input Splits and Records
• A MapReduce application developer need not deal with InputSplits directly, as they are created
by an InputFormat.

• The client running the job calculates the splits for the job by calling getSplits(), then sends
them to the application master, which uses their storage locations to schedule map tasks that
will process them on the cluster.
• The map task passes the split to the createRecordReader() method on the InputFormat to obtain a
RecordReader for that split. The RecordReader iterates over the records in the split.
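In abridged form, the InputFormat contract (org.apache.hadoop.mapreduce.InputFormat) is roughly:

import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {
  // Called by the client to compute the splits for the job.
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  // Called for each map task to create a RecordReader for its split.
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;
}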
Input Formats: Input Splits and Records
• The map task uses RecordReader to generate record key-value pairs, which it passes to the
map function. Following is the Mapper’s run() method:
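(Reproduced here in slightly simplified form from org.apache.hadoop.mapreduce.Mapper.)

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  while (context.nextKeyValue()) {
    map(context.getCurrentKey(), context.getCurrentValue(), context);
  }
  cleanup(context);
}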

• After running setup(), the nextKeyValue() is called repeatedly on the Context to populate the
key and value objects for the mapper.
• The key and value are retrieved from the RecordReader by way of the Context and are
passed to the map() method for it to do its work.
• When the reader gets to the end of the stream, the nextKeyValue() method returns false, and
the map task runs its cleanup() method and then completes.
Input Formats: FileInputFormat
• FileInputFormat is the base class for all implementations of InputFormat that
use files as their data source
• It provides two things:
1. A place to define which files are included as the input to a job,
2. An implementation for generating splits for the input files.
• The job of dividing splits into records is performed by subclasses.
• FileInputFormat offers four static convenience methods for setting a Job’s input
paths:
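Their signatures (in org.apache.hadoop.mapreduce.lib.input.FileInputFormat) are, in abridged form:

public static void addInputPath(Job job, Path path)
public static void addInputPaths(Job job, String commaSeparatedPaths)
public static void setInputPaths(Job job, Path... inputPaths)
public static void setInputPaths(Job job, String commaSeparatedPaths)

The add methods append to the list of input paths, while the set methods replace it; a typical call is FileInputFormat.addInputPath(job, new Path(args[0])).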
Input Formats: FileInputFormat input splits
• FileInputFormat splits only large files—here, “large” means larger than an
HDFS block.
Input Formats: FileInputFormat input splits
• The split size is calculated by the following formula (see the computeSplitSize() method in
FileInputFormat):
• max(minimumSize, min(maximumSize, blockSize))
• and by default: minimumSize < blockSize < maximumSize
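In code, the calculation reduces to something like the following sketch of computeSplitSize(); the minimum and maximum come from the split-size configuration properties (mapreduce.input.fileinputformat.split.minsize and mapreduce.input.fileinputformat.split.maxsize):

protected long computeSplitSize(long blockSize, long minSize, long maxSize) {
  // With the defaults (minSize <= blockSize <= maxSize) this returns the block
  // size, so splits line up with HDFS blocks.
  return Math.max(minSize, Math.min(maxSize, blockSize));
}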
Small Files and CombineFileInputFormat
Input Formats: TextInputFormat
Input Formats: NLineInputFormat
Input Formats: Binary Input
Input Formats: Multiple Inputs
Output Formats
Output Formats: Text Output
Output Formats: Binary Output
Output Formats: Multiple Outputs
Output Formats: Lazy Output

• FileOutputFormat subclasses will create output (part-r-nnnnn) files, even if they
are empty.
• LazyOutputFormat helps applications that do not want to create empty files.
• It is a wrapper output format that ensures that the output file is
created only when the first record is emitted for a given partition.
• To use it, call its setOutputFormatClass() method with the
JobConf and the underlying output format.
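For example, with the new-API wrapper in org.apache.hadoop.mapreduce.lib.output, the call is roughly:

// Wrap the real output format so that empty part files are not created.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);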
MapReduce Features
By
Dr. K. Venkateswara Rao
Professor, CSE
(Reference:- Hadoop: The Definitive Guide,
4th Edition, Tom White)
MapReduce Features
• Following are advanced features of MapReduce that help in learning more about the data being
analyzed.
1. Counters
2. Sorting datasets
3. Joining datasets
4. Side Data Distribution
• Benefits of the Advanced features of MapReduce
1. Finding bugs or bad data by counting invalid records
2. Examining different ways of sorting datasets and controlling the sort order in
MapReduce
3. Understanding the concept of joining datasets in MapReduce
4. Making side data available to all the map or reduce tasks (which are spread across the
cluster) in a convenient and efficient fashion. Side data can be defined as extra read-
only data needed by a job to process the main dataset.
Counters
• Counters are a useful channel for gathering statistics about the job
1. for quality control or for application-level statistics
2. for problem diagnosis
Ex) # of invalid records
• Counter values are much easier to retrieve than log output from logfiles for large
distributed jobs
• Types of Counters
1. Built-in Counters such as task counters, Job Counters etc.
2. User-Defined Java Counters
• User can define a set of counters to be incremented in a mapper/reducer
function.
Ex:- Dynamic counters (not defined by Java enum) can be created by the user
Built-in Counters
• Hadoop maintains some built-in counters for every job, and these report
various metrics.
For example, there are counters for the number of bytes and records
processed, which allow you to confirm that the expected amount of input
was consumed and the expected amount of output was produced.
• Counters are divided into groups
• Each group either contains task counters (which are updated as a task
progresses) or job counters (which are updated as a job progresses).
1. Task counters gather information about tasks over the course of their
execution, and the results are aggregated over all the tasks in a job
2. Job counters are maintained by the application master to measure job-
level statistics.
Built-in Counter Groups
Built-in MapReduce task counters
Built-in filesystem task Counters
Built-in File I/O Format Task Counters
Built-in job Counters
User-Defined Java Counters
• MapReduce allows user code to define a set of counters, which are then
incremented as desired in the mapper or reducer.
• Counters are defined by a Java enum, which serves to group related counters.
• A job may define an arbitrary number of enums, each with an arbitrary
number of fields.
• The name of the enum is the group name, and the enum’s fields are the
counter names.
• Counters are global: the MapReduce framework aggregates them across all
maps and reduces to produce a grand total at the end of the job.
• It is also possible to retrieve counter values using the Java API while the job is
running, although it is more usual to get counters at the end of a job run,
when they are stable.
User-Defined Java Counters
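A minimal sketch of the idea (the class name, enum, and counter group names here are illustrative, not taken from the slides):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class RecordQualityMapper extends Mapper<LongWritable, Text, Text, NullWritable> {

  // The enum name becomes the counter group; its fields are the counter names.
  enum RecordQuality { VALID, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    if (value.getLength() == 0) {
      context.getCounter(RecordQuality.MALFORMED).increment(1);
    } else {
      context.getCounter(RecordQuality.VALID).increment(1);
      // A dynamic counter, named at runtime by group and counter strings.
      context.getCounter("RecordLengthBuckets",
          Integer.toString(value.getLength() / 100)).increment(1);
      context.write(value, NullWritable.get());
    }
  }
}

After the job completes, the totals can be read back with, for example, job.getCounters().findCounter(RecordQualityMapper.RecordQuality.MALFORMED).getValue().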
Sorting
• By default, MapReduce will sort input records by their keys.
Sorting
• Suppose we run this program with 30 reducers using the following command:

• This command produces 30 output files, each of which is sorted.


• However, there is no easy way to combine the files to produce a globally sorted file
(Partial Sort).
• Total Sort
• produces a globally-sorted output file
• Produce a set of sorted files that, if concatenated, would form a globally sorted
file
• Use a partitioner that respects the total order of the output
• Ex) Range partitioner
Total Sort
• Although the total sort approach of producing a set of sorted files and concatenating them
works, it requires the partition sizes to be chosen carefully to ensure that they are fairly
even, so that job times aren’t dominated by a single reducer.
• Example of bad partitioning: one partition receives most of the records while the others
are nearly empty, so a single reducer dominates the job time.
• To construct more even partitions, we need to have a better understanding of the key
distribution for the whole dataset.
• It’s possible to get a fairly even set of partitions, by sampling the key space. The idea
behind sampling is that you look at a small subset of the keys to approximate the key
distribution, which is then used to construct partitions. Luckily, we don’t have to
write the code to do this ourselves, as Hadoop comes with a selection of samplers.
Sorting
• The InputSampler class defines a nested Sampler interface whose implementations return a
sample of keys given an InputFormat and Job:
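In abridged form, the nested interface is:

public interface Sampler<K, V> {
  // Return a sample of keys from the job's input, used to compute partition boundaries.
  K[] getSample(InputFormat<K, V> inf, Job job) throws IOException, InterruptedException;
}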

• Types of samplers:
1. RandomSampler
2. SplitSampler
3. IntervalSampler
• RandomSampler chooses keys with a uniform probability; it takes the following
parameters (see the sketch below):
1. the sampling frequency (the uniform probability of picking a key),
2. the maximum number of samples to take, and
3. the maximum number of splits to sample.
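Putting this together for a total sort, the configuration looks roughly like the following sketch; IntWritable keys and Text values are assumed, and the parameter values 0.1, 10000, and 10 are just examples:

// Sample keys with uniform probability 0.1, taking at most 10,000 samples
// from at most 10 splits, and write the resulting cut points to a partition file.
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
InputSampler.writePartitionFile(job, sampler);

// TotalOrderPartitioner reads the partition file to assign each key to a range.
job.setPartitionerClass(TotalOrderPartitioner.class);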
Sorting
• SplitSampler samples only the first n records in a split. It is not so good for
sorted data because it doesn’t select keys from throughout the split.
• IntervalSampler chooses keys at regular intervals through the split and makes
a better choice for sorted data.
• RandomSampler is a good general-purpose sampler.
• If none of these suits the application, users can write their own implementation of
the Sampler interface.
• The objective of sampling is to produce partitions that are approximately equal
in size.
Secondary Sort
• The MapReduce framework sorts the records by key before they reach the reducers.
For any particular key, however, the values are not sorted.
• The order in which the values appear is not even stable from one run to the next,
because they come from different map tasks, which may finish at different times
from run to run.
• It is possible to impose an order on the values by sorting and grouping the keys in a
particular way.
• Do the following to get the effect of sorting by value:
Make the key a composite of the natural key and the natural value.
The sort comparator should order by the composite key (i.e., the natural key and
natural value).
The partitioner and grouping comparator for the composite key should consider
only the natural key for partitioning and grouping.
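In terms of Job configuration, that amounts to plugging in three classes; a sketch follows, where FirstPartitioner, KeyComparator, and GroupComparator are hypothetical names for classes you would write against the composite key:

// Partition on the natural key only, so all composite keys sharing a natural key
// go to the same reducer.
job.setPartitionerClass(FirstPartitioner.class);
// Sort on the full composite key (natural key, then natural value).
job.setSortComparatorClass(KeyComparator.class);
// Group on the natural key only, so one reduce() call sees all the sorted values.
job.setGroupingComparatorClass(GroupComparator.class);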
Joins
• MapReduce can perform joins between large datasets
• How the join can be implemented depends on how large the
datasets are and how they are partitioned. If one dataset is large (the
weather records) but the other one is small enough to be distributed
to each node in the cluster (as the station metadata is), the join can
be effected by a MapReduce job that brings the records for each
station together.
• The mapper or reducer uses the smaller dataset to look up the
station metadata for a station ID.
• If the join is performed by the mapper, it is called a map-side join.
• If the join is performed by the reducer, it is called a reduce-side join.
Inner join of two datasets
Map-Side Joins
• If both datasets are too large for either to be copied to each node in the cluster, they
can still be joined using MapReduce with a map-side or reduce-side join, depending
on how the data is structured.
• A map-side join works by performing the join before the data reaches the map
function
• Requirements
• Each input dataset must be divided into the same number of partitions
• It must be sorted by the same key (the join key) in each source
• All the records for a particular key must reside in the same partition
• The above requirements actually fit the description of the output of a MapReduce job
• A map-side join can be used to join the outputs of several jobs that had the same
number of reducers, the same keys, and output files that are not splittable
Map-Side Joins
• Use a CompositeInputFormat from the org.apache.hadoop.mapred.join
package to run a map-side join
Reduce-Side Joins
• More general than a map-side join
• Input datasets don’t have to be structured in any particular way
• Less efficient as both datasets have to go through the MapReduce shuffle
• Idea
• The mapper tags each record with its source
• Uses the join key as the map output key so that the records with the same key are
brought together in the reducer
• Multiple inputs
• The input sources for the datasets have different formats
• Use the MultipleInputs class to separate the logic for parsing and tagging each source (see the sketch below).
• Secondary sort
• To perform the join, it is important that the data from one source arrives at the reducer before the data from the other.
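A sketch of the MultipleInputs part of such a job (the paths and the mapper and reducer class names are hypothetical):

// Each input source gets its own input format and its own tagging mapper.
MultipleInputs.addInputPath(job, recordsPath, TextInputFormat.class, TagRecordMapper.class);
MultipleInputs.addInputPath(job, stationsPath, TextInputFormat.class, TagStationMapper.class);
// A single reducer performs the join on the records brought together by key.
job.setReducerClass(JoinReducer.class);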
Joins - Map-Side vs Reduce-Side
Side Data Distribution
• Side data can be defined as extra read-only data needed by a job to process the main dataset.
• The challenge is to make side data available to all the map or reduce tasks (which are spread across
the cluster) in a convenient and efficient fashion.
• Using the Job Configuration
• Set arbitrary key-value pairs in the job configuration using the various setter methods on
JobConf
• Useful if one needs to pass a small piece of metadata to tasks
• Don’t use this mechanism for transferring more than a few kilobytes of data
• The job configuration is read by the jobtracker, the tasktracker, and the child JVM, and
each time the configuration is read, all of its entries are read into memory, even if they
are not used
• Distributed Cache
• Instead of serializing side data in the job config, it is preferred to distribute the datasets using
Hadoop’s distributed cache
• Provides a service for copying files and archives to the task nodes in time for the tasks to
use them when they run
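A sketch of typical usage with the Hadoop 2 Job API (the file path and symlink name are hypothetical; files can equally be pushed with the -files option of GenericOptionsParser):

// Copy a read-only lookup file to every task node; the "#stations" fragment
// makes it appear in the task's working directory under that name.
// (URISyntaxException handling omitted in this sketch.)
job.addCacheFile(new URI("hdfs:///metadata/station-fixed-width.txt#stations"));

// In a task's setup() method it can then be opened as a local file, e.g.:
// BufferedReader in = new BufferedReader(new FileReader("stations"));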
MapReduce Library Classes
• Hadoop comes with a library of mappers and reducers for commonly used functions.
