0% found this document useful (0 votes)
98 views26 pages

S MapReduce Types Formats Features 06

Uploaded by

tvam904
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
98 views26 pages

S MapReduce Types Formats Features 06

Uploaded by

tvam904
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 26

Map Reduce, Types,

Formats and Features


What is Map Reduce ?

● MapReduce is a computational model and an implementation


for processing and generating big data sets with a parallel,
distributed kind of algorithm on a cluster of data. A
MapReduce consists of the following procedures
● Map procedure: Performs a filtering and sorting operation
● Reduce procedure: Performs a summary operation
Overview of Map Reduce

● Data processing is the key in Map Reduce, it consists of


inputs and outputs for the map and reduce functions used as
key-value pairs.
● This presentation consists of MapReduce model in detail,
and in particular at how data in various formats, Types and
features available in the model.
Functions of Map Reduce

● Map Reduce serves two essential functions:


● It filters and distributes work to various nodes within the
cluster or map, a function sometimes referred to as the
mapper.
● It collects, organizes and reduces the results from each node
into a collective answer, referred to as the reducer.
Map Reduce Features

● Counters - They are a useful channel for gathering statistics


about the job like quality control.
● Hadoop maintains some built in counters for every job that
report various metrics for your job.
● Types of counters - Task counters, Job counters, User-Defined
Java Counters.
● Sorting - Ability to sort data is at the heart of MapReduce.
● Types of sorts - Partial sort, Total sort, Secondary sort
● For any particular key, values are not sorted.
Map Reduce Features

● Joins - MapReduce can perform joins between large


datasets, but writing code to do joins from scratch is fairly
involved. Ex: Map-side joins, Reduced-side join.
● Basic idea is that the mapper tags each record with its
source and uses the join key as map output key, so that the
records with same key are brought together in the reducer.
● Side Data Distribution - It can be defined as extra read
only data needed by job to process the main dataset.
● Challenge is to make side data available to all the map or
reduce tasks in convenient and efficient fashion.
Map Reduce Features

● Using the Job Configuration - We can set arbitrary key-


value pairs in the job configuration using various setter
methods on Configuration.
● Distributed Cache - Rather than serializing side data in job
configuration, it is preferable to distribute datasets using
Hadoop's distributed cache mechanism.
● Distributed Cache API - Most applications don’t need to
use distributed cache API as they can use the cache via
GenericOptionsParser.
● MapReduce Library Classes - Hadoop comes with library
of mappers and reducers for commonly used functions.
Input Format

● Input Format takes care about how input file is split and read
by Hadoop.
● It uses input format interface and TextInputFormat is the
default.
● Each Input file is broken into splits and each map processes a
single split.Each Split is further divided into records of
key/value pairs which are processed by map tasks one record
at a time.
● Record reader creates key/value pairs from input splits and
writes on context, which will be shared with Mapper class.
Input Format
Types of Input File Format

● FileInputFormat: It is the base class for all file-based Input


Formats. It specifies input directory where data files are
located. It will read all files and divides these files into one or
more Input Splits.
● TextInputFormat: Each line in the text file is a
record.Key:Byte offset of line Value: Content of the line.
● KeyValueTextInputFormat: Everything before the
separator is the key, and everything after is value.
● SequenceFileInputFormat: To read any sequence files.
Key and values are user defined.
Types of Input File Format

● SequenceFileasTextInputFormat: Similar to
SequenceFileInputFormat. It converts sequence file key
values to text objects.
● SequenceFileasBinaryInputFormat: To read any
sequence files. It is used to extract sequence files keys and
values as opaque binary object.
● NLineInputFormat: Similar to TextInputFormat, But each
split is guaranteed to have exactly N lines.
● DBInputFormat: To read data from RDS. Key is
LongWrittables and values are DB Writable.
Output Formats

● The OutputFormat checks the Output-Specification for


execution of the Map- Reduce job. For e.g check that the
output directory doesn’t already exist.
● It determines how RecordWriter Implementation is used to
write output to output files. Output Files are stored in a File
System.
● The OutputFormat decides the way the output key-value
pairs are written in the output files by RecordWriter.
Output Format
Types of Output Formats

● TextOutputFormat : It is the MapReduce default Hadoop


reducer Output Format which writes key,value pairs on
individual lines of text files.
● SequenceFileOutputFormat : It writes sequences files for
its output and it is the intermediate format use between
MapReduce jobs.
● MapFileOutputFormat: It writes output as map files.The
key in the MapFile must be added in order to ensure that the
reducer emits keys in sorted order.
Types of Output Formats

● MultipleOutputs : It allows writing data to files whose


names are derived from the output keys and values.
● LazyOutputFormat : It is a wrapper OutputFormat which
ensures that the output file will be created only when the
record is emitted for a given partition
● DBOutputFormat : It writes to the relational database and
HBase and sends the reduce output to a SQL Table.
Sorting

● MapReduce Framework automatically sort the keys generated


by the mapper.
● Reducer in MapReduce starts a new reduce task when the
next key in the sorted input data is different than the
previous. Each reduce task takes key value pairs as input and
generates key-value pair as output.
● Secondary Sorting in MapReduce:
○ If we want to sort reducer values, then we use a secondary sorting
technique. This technique enables us to sort the values (in ascending or
descending order) passed to each reducer.
Sorting
Counters

● Counters provides a way to measure the progress or the number of operations that occur within map
reduce.
● There are basically 2 types of MapReduce Counters:

○ Built-In Counters in MapReduce:

■ Hadoop maintains some built-in Hadoop counters for every job and these report various
metrics, like, there are counters for the number of bytes and records, which allow us to
confirm that the expected amount of input is consumed and the expected amount of
output is produced.

○ User-Defined Counters/Custom Counters in Hadoop MapReduce.

■ In addition to MapReduce built-in counters, MapReduce allows user code to define a set of
counters

■ For example, in Java, ‘enum’ is used to define counters.


Joins

● Large datasets can be combined by using joints in


MapReduce.
● When processing large data sets, the need for joining data by
a common key can be very useful.
● There are two types of joins

○ Map side join

○ Reduce side join


Joins
Reduce-side join

● If the join is performed by the reducer it is called a reduce-side join.


● Reduce-side join is easier to implement than the map-side join as
there is no need for us to structure the datasets in a particular way.
But it is less efficient as the datasets have to be shuffled.
● To perform the join, we simply need to cache a key and compare it
to incoming keys. As long as the keys match, we can join the values
from the corresponding keys.
● In reduce-side joins, there are two different scenarios that we can
consider:

○ one-to-one
Map-side join

e● Graph Querying
When the join function is performed by the mapper it is called map
side join.
● Expects a strong prerequisite before joining data at map side. The
prerequisites are:

○ Data should be partitioned and sorted in particular way.

○ Each input data should be divided in same number of partition.

○ Must be sorted with same key.

○ All the records for a particular key must reside in the same
partition.
Conclusion

● The MapReduce programming model is being used at many


places for many functions.
● The model is easy to use by everyone since it hides the
underlying details (parallelization, fault-tolerance,
neighborhood optimization, and load equalization)
● MapReduce is employed for the generation of data like
sorting, data processing, machine learning and so on.
● The implementation makes economical use of the machine
resources. Hence it is apt to be used on large machine
issues.
Conclusion (conti)

● Network information measure is a scarce resource. Various


optimizations are aimed at reducing the quantity of data sent
across the network.
● This optimization allows us to browse information from native
disks.
● Writing one copy of the information to native disk saves
network bandwidth.
Summary

● Topics covered:

○ Map Reduce

○ Functions of Map Reduce

○ Features of Map Reduce

○ Input Formats

○ Output Formats

○ Sorting

○ Counters

○ Joins and Types

○ Conclusion
Thank you

You might also like