MAPREDUCE TYPES, FORMATS, AND FEATURES
The map input key and value types (K1, V1) differ from the map output types (K2, V2).
The reduce input types must match the map output types (K2, V2).
The reduce output types (K3, V3) can be different again.
Data Flow:
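The general form of the data flow, written in type notation, is:

    map: (K1, V1) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)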
JAVA API REPRESENTATION
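A short sketch of how these type parameters surface in the org.apache.hadoop.mapreduce API; word count is assumed here purely as an illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // K1 = LongWritable (byte offset), V1 = Text (line),
    // K2 = Text (word), V2 = IntWritable (count)
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);       // emits (K2, V2)
                }
            }
        }
    }

    // Input is (K2, list(V2)); here K3 == K2 and V3 == V2
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emits (K3, V3)
        }
    }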
COMBINER FUNCTION
The combiner function has the same general form as the reduce function, except that its output types are the intermediate key and value types (K2 and V2), so that its output can feed the reduce function.
Often the combiner and reduce functions are the same, in which case K3 is the same as K2, and V3 is the same as V2.
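With a combiner in place, the data flow becomes:

    map: (K1, V1) → list(K2, V2)
    combiner: (K2, list(V2)) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)

Because the word-count reducer sketched above is associative and commutative, it could double as the combiner; a minimal driver fragment (job setup assumed elsewhere):

    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // combiner output types are (K2, V2)
    job.setReducerClass(WordCountReducer.class);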
MAPREDUCE INPUT FORMATS
An input split is a chunk of the input that is processed by a single map.
Each map processes a single split.
Each split is divided into records, and the map processes each record (a key-value pair) in turn.
Input splits are represented by the Java class InputSplit.
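A condensed sketch of the class; the two core abstract methods of org.apache.hadoop.mapreduce.InputSplit are:

    import java.io.IOException;

    public abstract class InputSplit {
        // Length of the split in bytes, used to sort splits largest-first
        public abstract long getLength() throws IOException, InterruptedException;
        // Hostnames where the split's data is stored, used for data-local scheduling
        public abstract String[] getLocations() throws IOException, InterruptedException;
    }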
FILE INPUT FORMATS
An InputFormat is responsible for creating the input splits and dividing them into records. Its main file-based types are:
•FileInputFormat: The base class for all implementations of InputFormat that use files as their data source. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files (see the usage sketch after this list).
•CombineFileInputFormat: This packs many files into each split so that each mapper has more to
process. Hadoop works better with a small number of large files than a large number of small
files.
•WholeFileInputFormat: A format where the keys are not used and the values are the file contents. It takes a FileSplit and converts it into a single record.
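A minimal sketch of the FileInputFormat calls in a job driver; the paths are hypothetical:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Job job = Job.getInstance();
    // Add input paths one at a time (files, directories, or globs)
    FileInputFormat.addInputPath(job, new Path("/data/logs/2024"));
    FileInputFormat.addInputPath(job, new Path("/data/logs/2025"));
    // ...or replace the whole list in a single call
    FileInputFormat.setInputPaths(job, new Path("/data/logs/2024"), new Path("/data/logs/2025"));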
TEXT INPUT FORMATS
•TextInputFormat: The default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators, packaged as a Text object.
•KeyValueTextInputFormat: TextInputFormat's keys, being simply the offsets within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character; KeyValueTextInputFormat splits each line into key and value at the first occurrence of that delimiter.
•NLineInputFormat: With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, depending on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use (see the configuration sketch after this list).
•XML: For XML input, Hadoop comes with the class StreamXmlRecordReader (in the org.apache.hadoop.streaming.mapreduce package), which breaks XML documents into records delimited by specified start and end tags.
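A minimal configuration sketch for the two line-oriented formats above; the separator and the line count are assumptions, not defaults:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    Job job = Job.getInstance();

    // Split each line into (key, value) at the first comma instead of the default tab
    job.getConfiguration().set(
        "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, give every mapper exactly 1000 lines of input:
    // NLineInputFormat.setNumLinesPerSplit(job, 1000);
    // job.setInputFormatClass(NLineInputFormat.class);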
OTHER INPUT TYPES
•Binary Input: Hadoop MapReduce is not restricted to processing textual data; it also supports binary formats, such as SequenceFileInputFormat for Hadoop's own sequence files.
•Database Input: DBInputFormat is an input format for reading data from a relational database, using JDBC (a setup sketch follows).
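A sketch of DBInputFormat setup; the driver class, connection URL, credentials, table, and column names are all hypothetical, and MyRecord stands for a user class implementing DBWritable:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

    Job job = Job.getInstance();
    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver",                 // JDBC driver class (assumed)
        "jdbc:mysql://dbhost/mydb", "user", "password");
    job.setInputFormatClass(DBInputFormat.class);
    DBInputFormat.setInput(job, MyRecord.class,  // value class, implements DBWritable
        "employees",                             // table to read (hypothetical)
        null,                                    // WHERE conditions (none)
        null,                                    // ORDER BY (none)
        "id", "name");                           // columns to fetch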
OUTPUT FORMATS
•TextOutputFormat: The default output format. It writes records as lines of text; keys and values are turned into strings and separated, by default, by a tab character.
•Binary Output: SequenceFileOutputFormat writes sequence files as output; this is a good choice when the output is to be consumed by further MapReduce jobs.
•Multiple Output: FileOutputFormat and its subclasses generate a set of files in the output directory; there is one file per reducer, and files are named by the partition number: part-r-00000, part-r-00001, and so on. The MultipleOutputs class gives finer control, allowing file names to be derived from the output keys and values.
•LazyOutput: FileOutputFormat subclasses will create output (part-r-nnnnn) files even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps (see the sketch after this list).
•Database Output: The output formats for writing to relational databases and to HBase are DBOutputFormat (using JDBC) and HBase's TableOutputFormat, respectively.
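A short sketch of enabling lazy output; TextOutputFormat is assumed as the wrapped format:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    Job job = Job.getInstance();
    // Wraps the real output format so that a part file is created only
    // when the first record is actually written to it
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);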
FEATURES
Counters:
• Used for gathering statistical information about a job and for diagnosing problems, if any.
• Task counters gather information about tasks as they run; the results are aggregated over all tasks in a job.
• Job counters are maintained by the application master and measure job-level statistics.
• User-defined counters are created by the application writer, either statically with a Java enum or dynamically by naming the counter with strings.
• Counters can be retrieved via the built-in Java API while a job is still running, giving job-level statistics without having to wait for the job to finish.
• User-defined Streaming functions: a Streaming process can increment counters by sending a specially formatted line (reporter:counter:group,counter,amount) to the standard error stream.
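A minimal sketch of a user-defined enum counter incremented in a mapper and read back in the driver; the enum, counter, and class names are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        // User-defined counter group, declared as a Java enum
        public enum RecordQuality { MALFORMED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().isEmpty()) {
                context.getCounter(RecordQuality.MALFORMED).increment(1);
                return;                          // skip malformed records
            }
            context.write(value, NullWritable.get());
        }
    }

    // In the driver, counters can be retrieved from the Job object,
    // even while the job is still running:
    //     long malformed = job.getCounters()
    //         .findCounter(CountingMapper.RecordQuality.MALFORMED).getValue();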