
MapReduce Types, Formats, and Features

MAP REDUCE GENERAL FORM


The general form of map and reduce functions in Hadoop MapReduce is:
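    map:    (K1, V1)        → list(K2, V2)
    reduce: (K2, list(V2))  → list(K3, V3)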

The map input key and value types (K1, V1) can differ from the map output types (K2, V2)
The reduce input types must match the map output types (K2, V2)
The reduce output types (K3, V3) may be different again
Data Flow: map outputs are partitioned by reducer, then shuffled and sorted by key before being passed to the reduce function
JAVA API REPRESENTATION
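In the Java API these types appear as the generic type parameters of the Mapper and Reducer classes; roughly (method bodies and the Context definitions elided):

    public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      // KEYIN/VALUEIN correspond to K1/V1, KEYOUT/VALUEOUT to K2/V2
      protected void map(KEYIN key, VALUEIN value, Context context)
          throws IOException, InterruptedException { /* ... */ }
    }

    public class Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
      // KEYIN/VALUEIN correspond to K2/V2, KEYOUT/VALUEOUT to K3/V3
      protected void reduce(KEYIN key, Iterable<VALUEIN> values, Context context)
          throws IOException, InterruptedException { /* ... */ }
    }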
COMBINER FUNCTION
The combiner has the same general form as the reduce function, except that its output types are the intermediate key and value types (K2 and V2), so that its output can feed the reduce function.

Often the combiner and reduce functions are the same, in which case K3 is the same as K2,
and V3 is the same as V2.
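A minimal driver sketch for wiring in a combiner; the mapper and reducer class names (MaxTemperatureMapper, MaxTemperatureReducer) are hypothetical, and the reducer can double as the combiner here only because its key and value types line up as described above:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "max temperature");
    job.setMapperClass(MaxTemperatureMapper.class);     // (K1, V1) -> (K2, V2)
    job.setCombinerClass(MaxTemperatureReducer.class);  // runs on map output: (K2, list(V2)) -> (K2, V2)
    job.setReducerClass(MaxTemperatureReducer.class);   // (K2, list(V2)) -> (K3, V3)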
MAP REDUCE INPUT FORMATS
An input split is a chunk of the input that is processed by a single map.
Each map processes a single split.
Each split is divided into records, and the map processes each record—a key-value pair—in
turn.
Input splits are represented by the Java class InputSplit (in the org.apache.hadoop.mapreduce package).
FILE INPUT FORMATS
The Java class InputFormat is responsible for creating the input splits and dividing them into records. Common file-based implementations are:
•FileInputFormat: The base class for all implementations of InputFormat that use files as their data source. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files (see the configuration sketch after this list).
•CombineFileInputFormat: This packs many files into each split so that each mapper has more to
process. Hadoop works better with a small number of large files than a large number of small
files.
•WholeFileInputFormat: A format where the keys are not used and the values are the file contents. It takes a FileSplit and converts it into a single record.
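A sketch of the usual FileInputFormat calls for declaring job input; the paths are placeholders:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    // Add a file, a directory (all the files inside it), or a glob as input
    FileInputFormat.addInputPath(job, new Path("/data/ncdc/2024"));
    // Or replace the whole input path list in one call
    FileInputFormat.setInputPaths(job, new Path("/data/a"), new Path("/data/b"));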
TEXT INPUT FORMATS
•TextInputFormat: The default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators, packaged as a Text object.
•KeyValueTextInputFormat: TextInputFormat’s keys, being simply the offsets within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character; this format splits each line at the delimiter into a key and a value.
•NLineInputFormat: With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input; the number depends on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use (see the sketch after this list).
•XML: For breaking XML documents into records, Hadoop comes with the class StreamXmlRecordReader (in the org.apache.hadoop.streaming.mapreduce package).
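A driver-side sketch of selecting these formats; the separator property name is the one used by the newer MapReduce API, and the line count is arbitrary:

    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    // Treat each line as key<TAB>value (tab is the default separator)
    job.setInputFormatClass(KeyValueTextInputFormat.class);
    job.getConfiguration().set(
        "mapreduce.input.keyvaluelinerecordreader.key.value.separator", "\t");

    // Alternatively, give every mapper exactly 1,000 lines of input
    job.setInputFormatClass(NLineInputFormat.class);
    NLineInputFormat.setNumLinesPerSplit(job, 1000);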
OTHER INPUT TYPES

•Binary Input: Hadoop MapReduce is not restricted to processing textual data. It has support for binary formats, such as SequenceFileInputFormat for reading sequence files.

•Database Input: DBInputFormat is an input format for reading data from a relational database,
using JDBC.
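A rough sketch of configuring DBInputFormat; the JDBC driver, connection details, table, columns, and the DBWritable record class (MyRecord) are all placeholders:

    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver",                  // JDBC driver class
        "jdbc:mysql://dbhost/weather", "user", "password");
    job.setInputFormatClass(DBInputFormat.class);
    DBInputFormat.setInput(job, MyRecord.class,   // MyRecord implements Writable and DBWritable
        "stations",                               // table name
        null,                                     // WHERE conditions (none)
        "station_id",                             // ORDER BY column
        "station_id", "name");                    // columns to read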
OUTPUT TYPE FORMATS
•TextOutputFormat: The default output format. It writes records as lines of text (keys and values are turned into strings).
•Binary Output: SequenceFileOutputFormat writes sequence files as output.

•Multiple Output: FileOutputFormat and its subclasses generate a set of files in the output directory, one file per reducer, named by partition number: part-r-00000, part-r-00001, and so on. The MultipleOutputs class lets a job write additional output files whose names are derived from the keys and values being written.
•LazyOutput: FileOutputFormat subclasses will create output (part-r-nnnnn) files even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps (see the sketch after this list).
•Database Output: The output formats for writing to relational databases and to HBase are DBOutputFormat and HBase's TableOutputFormat, respectively.
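A driver sketch combining these output settings; the output path and key/value classes are assumptions about the job:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    FileOutputFormat.setOutputPath(job, new Path("/results/run1"));
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Wrap the real output format so that empty part-r-nnnnn files are not created
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);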
FEATURES
Counters:
• Used for gathering statistical information about a job and for diagnosing problems.

Counters fall into two groups: built-in counters and user-defined counters.

1. Built-in counters: task counters and job counters

• Task counters gather information about tasks as they run, and the results are aggregated over all the tasks in a job.

• Job counters are maintained by the application master and measure job-level statistics.

2. User-defined Java counters: dynamic counters and counter retrieval

• Application writers can define their own counters, either with a Java enum or dynamically by naming the group and counter as strings (a sketch follows this list).

• Counters can be retrieved through the built-in Java APIs while the job is still running, instead of waiting until the job finishes.

3. User-defined Streaming counters: a Streaming program can increment counters by sending a specially formatted line to the standard error stream.

The line must have the following format: reporter:counter:group,counter,amount
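A minimal sketch of a user-defined enum counter being incremented inside a mapper and read back in the driver; the enum, field names, and the emptiness check are illustrative only:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;

    // Inside a Mapper<LongWritable, Text, ...> subclass
    enum Temperature { MISSING, MALFORMED }

    @Override
    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      if (value.toString().isEmpty()) {
        context.getCounter(Temperature.MISSING).increment(1);  // aggregated across all tasks
        return;
      }
      // ... parse the record and write output ...
    }

    // In the driver, after (or while) the job runs
    long missing = job.getCounters().findCounter(Temperature.MISSING).getValue();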


SORTING
• Sorting is at the heart of MapReduce.
• Some datasets need preparation before sorting: for example, temperatures stored as text sort lexicographically rather than numerically, so the data should be stored as sequence files with a suitable numeric key type instead.
• Different ways of sorting the datasets:
1. Partial Sort: The default sort, where sorting is done on the keys. Each individual output file is sorted, but no globally sorted file is produced.
2. Total Sort: Produces a set of output files that, taken in order, form a globally sorted result. It relies on a partitioner that respects the total order of the keys, and the partition sizes must be fairly even for this to work well (see the sketch after this list).
3. Secondary Sort: Used when the values for each key also need to appear in a particular order; part of the value is promoted into a composite map output key so that the framework sorts on it.
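A hedged sketch of the usual total-sort setup with a sampling partitioner; the sampling parameters are arbitrary, and the key/value types are assumed to match the job's map output types:

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    job.setPartitionerClass(TotalOrderPartitioner.class);
    // Sample the input (10% of records, up to 10,000, from at most 10 splits)
    // to choose partition boundaries that give reasonably even partitions
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<>(0.1, 10000, 10);
    InputSampler.writePartitionFile(job, sampler);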
JOINS
•Joins can be used in MapReduce to combine data from large datasets.
•The implementation depends on how large the datasets are and on how they are partitioned.
•When processing large data sets the need for joining data by a common key can be very useful, if
not essential.
•By joining data you can further gain insights.
MAP-SIDE JOIN
•Map-side join is one of the features of MapReduce where the join is performed by the mapper.
•Works by performing the join before the dataset even reaches the map function.
• For this to work, the inputs to each map must be partitioned and sorted in a particular way.
•To take advantage of map-side joins, our data must meet one of the following criteria:
1. The datasets to be joined are already sorted by the same key and have the same number of
partitions.
2. Of the two datasets to be joined, one is small enough to fit into memory.
•Each input dataset must be divided into the same number of partitions, and it must be sorted by
the same key (the join key) in each source. All the records for a particular key must reside in the
same partition.
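One common way to run a map-side join is with CompositeInputFormat from the org.apache.hadoop.mapreduce.lib.join package; a rough sketch, where the paths, the inner input format, the "inner" join type, and the join-expression property name (newer mapreduce API) are assumptions:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.join.CompositeInputFormat;

    // Both inputs must already be sorted and identically partitioned on the join key
    job.setInputFormatClass(CompositeInputFormat.class);
    String joinExpr = CompositeInputFormat.compose(
        "inner", KeyValueTextInputFormat.class,
        new Path("/data/stations"), new Path("/data/records"));
    job.getConfiguration().set("mapreduce.join.expr", joinExpr);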
REDUCE-SIDE JOIN
•Reduce-side join is one of the features of MapReduce where the join is performed by the
reducer.
•The basic idea is that the mapper tags each record with its source and uses the join key as the
map output key, so that the records with the same key are brought together in the reducer.
•A reduce-side join is more general than a map-side join, in that the input datasets don’t have to
be structured in any particular way, but it is less efficient because both datasets have to go
through the MapReduce shuffle.
What do we need for this to work?
1. Multiple Inputs: The input sources for the datasets generally have different formats, so it is convenient to use the MultipleInputs class to keep the parsing and tagging logic separate for each source (see the sketch after this list).
2. Secondary Sort: The reducer sees the records from both sources that have the same key, but they are not guaranteed to be in any particular order; a secondary sort can be used, for example, to make sure the record from one source arrives first.
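A sketch of the MultipleInputs wiring for a reduce-side join; the paths, mapper classes, and reducer class are hypothetical, and each mapper would tag its records with their source while emitting the join key as the map output key:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

    MultipleInputs.addInputPath(job, new Path("/data/stations"),
        TextInputFormat.class, StationRecordMapper.class);   // hypothetical mapper
    MultipleInputs.addInputPath(job, new Path("/data/records"),
        TextInputFormat.class, WeatherRecordMapper.class);   // hypothetical mapper
    job.setReducerClass(JoinReducer.class);  // receives all records that share a join key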
SIDE DATA DISTRIBUTION
•Side data can be defined as extra read-only data needed by a job to process the main dataset.
•One way to make it available is to use the job configuration setter methods to set key-value pairs; this is suitable only for small amounts of data.
•Hadoop’s distributed cache is another way: it provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run (see the sketch below).
•Files and archives are the two kinds of objects that can be placed in the cache.
•The challenge is making side data available to all the map or reduce tasks conveniently and efficiently.
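A sketch of the distributed-cache route using the Job API; the file URI, the "#stations" symlink name, and the setup logic are placeholders:

    import java.io.IOException;
    import java.net.URI;

    // Driver: ship a lookup file to every task node (assumes the driver method throws Exception)
    job.addCacheFile(new URI("/metadata/station-names.txt#stations"));

    // Mapper or Reducer: the localized copy is available before the task starts
    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
      URI[] cacheFiles = context.getCacheFiles();
      // open the file "stations" (the symlink name after '#') from the task's working directory
    }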
CONCLUSION
• MapReduce provides a simple way to scale your application.
• Some of the important applications: social networking and search engines, PageRank, statistics, and genomics.
•Scales out to more machines, rather than scaling up
•Effortlessly scale from a single machine to thousands
•Fault tolerant & High performance
•If you can fit your use case to its paradigm, scaling is handled by the framework
