MapReduce
Contents
•The introduction of MapReduce,
•MapReduce Architecture,
•Data flow in MapReduce: Splits,
•Mapper,
•Partitioning,
•Sort and shuffle,
•Combiner,
•Reducer,
•Basic Configuration of MapReduce,
•MapReduce life cycle,
•Driver Code,
•Mapper and Reducer,
•How MapReduce Works.
The introduction of MapReduce
MapReduce is a processing technique and a programming model for distributed computing based on Java.
It contains two important tasks, namely Map and Reduce.
Map takes a set of data and converts it into another set of data, where individual elements are broken down into tuples (key/value pairs).
The processing primitive is called a mapper.
The Reduce task takes the output from a map as its input and combines those data tuples into a smaller set of tuples.
The Reduce task is always performed after the map job.
The processing primitive is called a reducer.
The major advantage of MapReduce is that it makes it easy to scale data processing over multiple computing nodes.
Once we write an application in the MapReduce form, we can scale it to run over hundreds, thousands, or even tens of thousands of machines in a cluster with merely a configuration change.
•A MapReduce program executes in three stages:
map stage,
shuffle stage,
reduce stage.
Map stage: The map or mapper's job is to process the input data.
Generally, the input data is in the form of a file or directory and is stored in the Hadoop Distributed File System (HDFS).
The input file is passed to the mapper function line by line.
The mapper processes the data and creates several small chunks of data.
Reduce stage: This stage is the combination of the Shuffle stage and the Reduce stage.
The Reducer's job is to process the data that comes from the mapper.
After processing, it produces a new set of output, which is stored in HDFS.
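To make the three stages concrete, below is a minimal sketch of the classic word-count job (the class names and the whitespace tokenization are illustrative choices, not fixed by the framework):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map stage: emit a (word, 1) pair for every word in the input line.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);  // tuple handed to shuffle/sort
            }
        }
    }
}

// Reduce stage: after shuffle and sort, all counts for one word arrive together.
class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // final output, written to HDFS
    }
}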
Fundamental Principle
•FileInputFormat
•TextInputFormat
•KeyValueTextInputFormat
•SequenceFileInputFormat
•SequenceFileAsTextInputFormat
•SequenceFileAsBinaryInputFormat
•NLineInputFormat
•DBInputFormat
How do we get the data to the mapper?
Two methods of InputFormat are used to get the data to the mapper in MapReduce:
•getSplits()
•createRecordReader()
// From org.apache.hadoop.mapreduce: every InputFormat must describe how the
// input is split (getSplits) and how records are read from a split
// (createRecordReader).
import java.io.IOException;
import java.util.List;

public abstract class InputFormat<K, V> {

    // Logically splits the job's input; one InputSplit is assigned to each map task.
    public abstract List<InputSplit> getSplits(JobContext context)
            throws IOException, InterruptedException;

    // Creates the RecordReader that turns one split into (key, value) records.
    public abstract RecordReader<K, V> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException;
}
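Note that the file-based formats described below all inherit a generic getSplits() implementation from FileInputFormat, so in practice a concrete subclass usually only has to supply the RecordReader.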
Hadoop Input Format
FileInputFormat in Hadoop: FileInputFormat in Hadoop is the base class for all file-based InputFormats.
•FileInputFormat specifies the input directory where the data files are located.
•When we start a Hadoop job, FileInputFormat is provided with a path containing the files to read.
•FileInputFormat reads all of these files and divides them into one or more InputSplits.
•TextInputFormat: TextInputFormat in Hadoop is the default InputFormat of MapReduce.
•TextInputFormat treats each line of each input file as a separate record and performs no parsing.
KeyValueTextInputFormat: KeyValueTextInputFormat in Hadoop is similar to TextInputFormat in that it also treats each line of input as a separate record.
•While TextInputFormat treats the entire line as the value, KeyValueTextInputFormat breaks the line itself into a key and a value at a tab character ('\t').
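Selecting an InputFormat and its separator is a small piece of driver configuration; a minimal sketch, assuming an illustrative input path and a comma separator:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

public class KeyValueInputConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the default tab separator with a comma.
        conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
        Job job = Job.getInstance(conf, "kv-input-example");
        job.setInputFormatClass(KeyValueTextInputFormat.class); // default would be TextInputFormat
        FileInputFormat.addInputPath(job, new Path("/data/input")); // illustrative HDFS path
    }
}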
SequenceFileInputFormat: SequenceFileInputFormat in Hadoop is an InputFormat which reads sequence files.
•Sequence files are binary files that store sequences of binary key-value pairs.
•Sequence files are block-compressed and provide direct serialization and deserialization of several arbitrary data types (not just text).
•Both key and value are user-defined.
SequenceFileAsTextInputFormat: SequenceFileAsTextInputFormat in Hadoop is another form of SequenceFileInputFormat which converts the sequence file's keys and values to Text objects.
•The conversion is performed by calling toString() on the keys and values.
•This InputFormat makes sequence files suitable input for Streaming.
SequenceFileAsBinaryInputFormat: SequenceFileAsBinaryInputFormat in Hadoop is a SequenceFileInputFormat with which we can extract the sequence file's keys and values as opaque binary objects.
NLineInputFormat: NLineInputFormat is another form of TextInputFormat, where the keys are the byte offsets of the lines and the values are the contents of the lines.
•With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input.
•The number depends on the size of the split and the length of the lines.
•If we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat.
Example: N is the number of lines of input that each mapper receives. By default (N=1), each mapper receives exactly one line of input.
•If N=2, then each split contains two lines. One mapper receives the first two key-value pairs and another mapper receives the next two key-value pairs.
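Configuring N is a single call in the driver; a minimal sketch, with N=2 and an illustrative input path:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

public class NLineConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setInputFormatClass(NLineInputFormat.class);
        // Each split, and therefore each mapper, receives exactly two lines.
        NLineInputFormat.setNumLinesPerSplit(job, 2);
        FileInputFormat.addInputPath(job, new Path("/data/input")); // illustrative path
    }
}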
DBInputFormat: DBInputFormat in Hadoop is an InputFormat that reads data from a relational database using JDBC.
•It doesn't have partitioning capabilities, so we need to be careful not to overload the database by reading from too many mappers.
•It is best suited to loading relatively small datasets, for example to join against larger datasets from HDFS using MultipleInputs.
•Here the key is a LongWritable while the value is a DBWritable.
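A minimal driver sketch for DBInputFormat is below; the JDBC driver, connection URL, credentials, table and column names are all illustrative assumptions:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// Hypothetical record type mapping one row of an "employees" table.
class EmployeeRecord implements Writable, DBWritable {
    int id;
    String name;

    public void readFields(ResultSet rs) throws SQLException {
        id = rs.getInt("id");
        name = rs.getString("name");
    }
    public void write(PreparedStatement ps) throws SQLException {
        ps.setInt(1, id);
        ps.setString(2, name);
    }
    public void readFields(DataInput in) throws IOException {
        id = in.readInt();
        name = in.readUTF();
    }
    public void write(DataOutput out) throws IOException {
        out.writeInt(id);
        out.writeUTF(name);
    }
}

public class DBInputConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // JDBC driver, URL, user, and password are illustrative assumptions.
        DBConfiguration.configureDB(conf, "com.mysql.jdbc.Driver",
                "jdbc:mysql://dbhost/payroll", "user", "password");
        Job job = Job.getInstance(conf, "db-input-example");
        job.setInputFormatClass(DBInputFormat.class);
        // Read columns id and name from the employees table, ordered by id.
        DBInputFormat.setInput(job, EmployeeRecord.class,
                "employees", null /* conditions */, "id" /* orderBy */, "id", "name");
    }
}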
Hadoop Output Format
OutputFormat is used to write to files on the local disk or in HDFS.
•OutputFormat describes the output specification of a MapReduce job.
•Based on the output specification, the following things happen:
•The MapReduce job checks that the output directory does not already exist.
•OutputFormat provides the RecordWriter implementation to be used to write the output files of the job.
•Output files are stored in a FileSystem.
•The FileOutputFormat.setOutputPath() method is used to set the output directory.
•Every Reducer writes a separate file in the common output directory.
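A minimal sketch of this output-side configuration (the output path is an illustrative assumption; TextOutputFormat is already the default and is set here only for clarity):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputConfig {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance();
        job.setOutputFormatClass(TextOutputFormat.class);
        // The directory must not already exist, or the job will fail at submission.
        FileOutputFormat.setOutputPath(job, new Path("/data/output")); // illustrative path
    }
}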
Types of Hadoop Output Format
•TextOutputFormat,
•SequenceFileOutputFormat,
•MapFileOutputFormat,
•SequenceFileAsBinaryOutputFormat,
•DBOutputFormat,
•LazyOutputFormat, and
•MultipleOutputs
TextOutputFormat:
•MapReduce's default OutputFormat is TextOutputFormat; it writes (key, value) pairs on individual lines of text files.
•The keys and values can be of any type.
•TextOutputFormat turns them into strings by calling toString().
•Each key-value pair is separated by a tab character.
•This can be changed via the mapreduce.output.textoutputformat.separator property.
•KeyValueTextInputFormat can be used to read these output files back, since it breaks lines into key-value pairs based on a configurable separator.
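Changing the separator is one configuration property; a minimal sketch, assuming a comma is wanted instead of the default tab:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SeparatorConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Emit "key,value" lines instead of the default "key<TAB>value".
        conf.set("mapreduce.output.textoutputformat.separator", ",");
        Job job = Job.getInstance(conf, "separator-example");
    }
}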
MapReduce Join
•It is used to combine two large datasets.
•Joining two datasets begins by comparing the size of each dataset.
•If one dataset is smaller than the other, the smaller dataset is distributed to every datanode in the cluster.
•Once it is distributed, either the Mapper or the Reducer uses the smaller dataset to perform a lookup for matching records in the large dataset, and then combines those records to form the output records (see the sketch below).
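One common way to implement this lookup on the map side is Hadoop's distributed cache; a minimal sketch, where the file name departments.txt and the tab-separated record layouts are illustrative assumptions:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-side join: the small "departments" file is shipped to every node with
// job.addCacheFile(new URI("/data/departments.txt")) in the driver, then
// loaded into memory once per mapper.
public class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    private final Map<String, String> deptById = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // The cached file appears in the task's working directory under its name.
        try (BufferedReader in = new BufferedReader(new FileReader("departments.txt"))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split("\t");   // dept_id <TAB> dept_name
                deptById.put(parts[0], parts[1]);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] rec = line.toString().split("\t");  // emp_name <TAB> dept_id
        String deptName = deptById.get(rec[1]);      // lookup in the small dataset
        if (deptName != null) {
            context.write(new Text(rec[0]), new Text(deptName));
        }
    }
}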
Map-side join - When the join is performed by the mapper, it is called a map-side join.
•In this type, the join is performed before the data is actually consumed by the map function.
•It is mandatory that the input to each map be in the form of a partition and in sorted order.
•Also, there must be an equal number of partitions, and they must be sorted by the join key.
Reduce-side join - When the join is performed by the reducer, it is called a reduce-side join.
•There is no necessity in this join for the datasets to be in a structured form (or partitioned).
•Here, map-side processing emits the join key and the corresponding tuples from both tables.
•As an effect of this processing, all the tuples with the same join key fall into the same reducer, which then joins the records with the same join key.
MapReduce Join Example
Pseudo code (reduce-side join): the map tags each record with its source table and emits it under the join key; the reduce joins records that carry different tags.

map (K table, V rec)
{
    tagged_rec = (tag: table, rec: rec)   // remember which table the record came from
    emit (rec.Dept_Id, tagged_rec)
}
reduce (K dept_id, list<tagged_rec> tagged_recs)
{
    for each pair (r1, r2) in tagged_recs with r1.tag != r2.tag
    {
        joined_rec = join (r1.rec, r2.rec)
        emit (dept_id, joined_rec)
    }
}
1: How is Hadoop different from Data Mining and Data Warehousing?