S MapReduce Types Formats Features 06

Uploaded by

tvam904

MapReduce: Types, Formats and Features

What is MapReduce?

● MapReduce is a programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster of machines. A MapReduce program consists of the following procedures:
● Map procedure: performs filtering and sorting.
● Reduce procedure: performs a summary operation.
Overview of MapReduce

● Data processing is at the core of MapReduce: the inputs and outputs of the map and reduce functions are key-value pairs.
● This presentation looks at the MapReduce model in detail: how data in various formats can be used with it, and which types and features the model offers.
Functions of MapReduce

● MapReduce serves two essential functions:
● It filters the input and distributes the work to the various nodes within the cluster. This function is referred to as the mapper.
● It collects, organizes and reduces the results from each node into a collective answer. This function is referred to as the reducer.
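The mapper and reducer roles described above can be illustrated with a word count. This is a minimal single-process sketch in plain Python, not Hadoop code; the names `map_fn`, `reduce_fn` and `run_job` are hypothetical, and the in-memory grouping only stands in for the shuffle that a real cluster performs across nodes.

```python
from collections import defaultdict

def map_fn(_key, line):
    """Mapper: emit (word, 1) for every word in one input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_fn(word, counts):
    """Reducer: collapse all counts for one word into a single total."""
    yield word, sum(counts)

def run_job(lines):
    # Map phase: the key is a line index standing in for the byte offset.
    intermediate = []
    for index, line in enumerate(lines):
        intermediate.extend(map_fn(index, line))
    # Shuffle: group the values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce_fn call per distinct key, in sorted key order.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output

result = run_job(["the quick fox", "the lazy dog", "the fox"])
```

Here `result` holds one `(word, total)` pair per distinct word, ordered by word.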
MapReduce Features

● Counters - A useful channel for gathering statistics about a job, whether for quality control or for application-level statistics.
● Hadoop maintains some built-in counters for every job; they report various metrics about the job.
● Types of counters - Task counters, job counters, and user-defined Java counters.
● Sorting - The ability to sort data is at the heart of MapReduce.
● Types of sorts - Partial sort, total sort, and secondary sort.
● By default, the values for any particular key are not sorted.
MapReduce Features

● Joins - MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Examples: map-side joins and reduce-side joins.
● In a reduce-side join, the basic idea is that the mapper tags each record with its source and uses the join key as the map output key, so that records with the same key are brought together in the reducer.
● Side Data Distribution - Side data is extra read-only data needed by a job to process the main dataset.
● The challenge is to make side data available to all the map or reduce tasks in a convenient and efficient fashion.
MapReduce Features

● Using the Job Configuration - We can set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration.
● Distributed Cache - Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism.
● Distributed Cache API - Most applications don't need to use the distributed cache API directly, because they can use the cache via GenericOptionsParser.
● MapReduce Library Classes - Hadoop comes with a library of mappers and reducers for commonly used functions.
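The side-data idea from the slides above (small read-only data made available to every task) can be sketched as follows. This is an illustrative plain-Python sketch under stated assumptions: `LookupMapper`, its `setup`/`map` methods, the `"station.names"` configuration key and the sample records are all hypothetical, and a plain `dict` stands in for the job configuration.

```python
class LookupMapper:
    """Hypothetical mapper that enriches records using read-only side data."""

    def setup(self, config):
        # Runs once per task: copy the side data out of the job configuration.
        self.names = dict(config["station.names"])

    def map(self, _key, value):
        # Each record is assumed to look like "<station id>,<reading>".
        station_id, reading = value.split(",")
        yield self.names.get(station_id, "UNKNOWN"), int(reading)

# A dict standing in for key-value pairs set on the job configuration.
config = {"station.names": {"s1": "Oslo", "s2": "Lima"}}

mapper = LookupMapper()
mapper.setup(config)
records = ["s1,12", "s2,30", "s9,7"]
out = [pair for i, rec in enumerate(records) for pair in mapper.map(i, rec)]
```

Loading the lookup table once in `setup` rather than once per record is the design point: the side data is read-only, so every call to `map` can share it.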
Input Format

● The input format determines how input files are split and read by Hadoop.
● Input formats implement the InputFormat interface; TextInputFormat is the default.
● Each input file is broken into splits, and each map task processes a single split. Each split is further divided into records of key/value pairs, which the map task processes one record at a time.
● The RecordReader creates key/value pairs from an input split and passes them to the Mapper class through the context.
Input Format
Types of Input File Format

● FileInputFormat: The base class for all file-based input formats. It specifies the input directory where the data files are located, reads all the files, and divides them into one or more input splits.
● TextInputFormat: Each line in the text file is a record. Key: byte offset of the line. Value: contents of the line.
● KeyValueTextInputFormat: Everything before the separator (a tab by default) is the key, and everything after it is the value.
● SequenceFileInputFormat: Reads sequence files. The key and value types are user defined.
Types of Input File Format

● SequenceFileAsTextInputFormat: Similar to SequenceFileInputFormat. It converts the sequence file's keys and values to Text objects.
● SequenceFileAsBinaryInputFormat: Reads sequence files and extracts their keys and values as opaque binary objects.
● NLineInputFormat: Similar to TextInputFormat, but each split contains N lines of input (the last split may contain fewer).
● DBInputFormat: Reads data from a relational database. Keys are LongWritables and values are DBWritables.
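To make the record shapes above concrete, the sketch below imitates in plain Python what records look like under TextInputFormat, KeyValueTextInputFormat and NLineInputFormat. It is an illustration only, not the Hadoop implementation; the function names are hypothetical, and splitting is done in memory on a small byte string.

```python
def text_records(data: bytes):
    """TextInputFormat-style records: (byte offset of line, line contents)."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)  # the offset counts the line terminator too

def key_value_records(data: bytes, sep="\t"):
    """KeyValueTextInputFormat-style records: split each line at the first separator."""
    for _offset, line in text_records(data):
        key, _found, value = line.partition(sep)
        yield key, value  # a line without a separator becomes (line, "")

def n_line_splits(lines, n):
    """NLineInputFormat-style splits: n lines each, the last one possibly shorter."""
    return [lines[i:i + n] for i in range(0, len(lines), n)]

data = b"a\t1\nbb\t22\nccc\t333\n"
as_text = list(text_records(data))
as_key_value = list(key_value_records(data))
splits = n_line_splits(["l1", "l2", "l3"], 2)
```

The same bytes yield offset-keyed records in one case and separator-keyed records in the other, which is the practical difference a mapper sees when the input format changes.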
Output Formats

● The OutputFormat checks the output specification of the MapReduce job, for example that the output directory doesn't already exist.
● It provides the RecordWriter implementation used to write the output files. Output files are stored in a file system.
● The OutputFormat thereby decides how the RecordWriter writes the output key-value pairs to the output files.
Output Format
Types of Output Formats

● TextOutputFormat: The default output format in MapReduce. It writes key-value pairs on individual lines of text files.
● SequenceFileOutputFormat: Writes sequence files as its output. It is a suitable intermediate format between chained MapReduce jobs.
● MapFileOutputFormat: Writes output as map files. Keys in a MapFile must be added in order, so the reducer must emit keys in sorted order.
Types of Output Formats

● MultipleOutputs: Allows writing data to files whose names are derived from the output keys and values.
● LazyOutputFormat: A wrapper OutputFormat which ensures that an output file is created only when the first record is emitted for a given partition.
● DBOutputFormat: Writes to a relational database, sending the reduce output to a SQL table (HBase output uses its own TableOutputFormat).
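The behaviour of TextOutputFormat and MultipleOutputs described above can be imitated in a few lines. This is a plain-Python sketch under stated assumptions, not the Hadoop API: `write_text_output`, `write_multiple_outputs` and the `name_for` callback are hypothetical, and "files" are in-memory strings keyed by name.

```python
import io
from collections import defaultdict

def write_text_output(pairs, sep="\t"):
    """TextOutputFormat-style output: one 'key<sep>value' line per pair."""
    buf = io.StringIO()
    for key, value in pairs:
        buf.write(f"{key}{sep}{value}\n")
    return buf.getvalue()

def write_multiple_outputs(pairs, name_for):
    """MultipleOutputs-style output: the file name is derived from each pair."""
    files = defaultdict(list)
    for key, value in pairs:
        files[name_for(key, value)].append((key, value))
    # Only names that received at least one record appear in the result,
    # similar in spirit to what LazyOutputFormat guarantees.
    return {name: write_text_output(rows) for name, rows in files.items()}

pairs = [("oslo", 12), ("lima", 30), ("oslo", 15)]
single = write_text_output(pairs)
by_city = write_multiple_outputs(pairs, lambda key, _value: f"{key}-part")
```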
Sorting

● The MapReduce framework automatically sorts the keys generated by the mapper.
● The reducer starts a new reduce() call when the next key in the sorted input data differs from the previous one. Each reduce() call takes a key and its values as input and generates key-value pairs as output.
● Secondary Sorting in MapReduce:
○ If we want to sort the reducer values, we use the secondary sorting technique. It enables us to sort the values (in ascending or descending order) passed to each reducer.
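The effect of a secondary sort can be shown without Hadoop. In Hadoop the usual approach is a composite key together with a custom partitioner and grouping comparator; the plain-Python sketch below only mimics the result in memory, and the function name `secondary_sort` and the sample data are hypothetical.

```python
from itertools import groupby

def secondary_sort(pairs, descending=False):
    """Group (key, value) pairs by key, with each key's values sorted."""
    # Sort by value first, then by key; Python's sort is stable, so the
    # value order survives within each key.
    by_value = sorted(pairs, key=lambda kv: kv[1], reverse=descending)
    by_key = sorted(by_value, key=lambda kv: kv[0])
    # Group on the natural key only, as a grouping comparator would.
    for key, group in groupby(by_key, key=lambda kv: kv[0]):
        yield key, [value for _key, value in group]

readings = [("1901", 30), ("1900", 12), ("1901", -5), ("1900", 40)]
ascending = list(secondary_sort(readings))
descending = list(secondary_sort(readings, descending=True))
```

Each reducer call would then receive its values already ordered, so finding for example the maximum per key needs only the first or last value.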
Sorting
Counters

● Counters provide a way to measure the progress of a job or the number of operations that occur within it.
● There are basically two types of MapReduce counters:

○ Built-in counters:
■ Hadoop maintains some built-in counters for every job, and these report various metrics. For example, there are counters for the number of bytes and records, which allow us to confirm that the expected amount of input is consumed and the expected amount of output is produced.

○ User-defined (custom) counters:
■ In addition to the built-in counters, MapReduce allows user code to define a set of counters.
■ For example, in Java, an 'enum' is used to define counters.
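The user-defined counter pattern above (an enum naming the counters, incremented from task code) can be sketched in plain Python. This is an analogy rather than the Hadoop API: the `Quality` enum, `map_with_counters` and the use of `collections.Counter` as the counter store are assumptions made for illustration.

```python
from collections import Counter
from enum import Enum

class Quality(Enum):
    """Hypothetical user-defined counter group for input quality control."""
    VALID = "valid"
    MALFORMED = "malformed"

def map_with_counters(lines, counters):
    """Emit ('total', n) for each parseable line and count the rest."""
    for line in lines:
        try:
            value = int(line)
        except ValueError:
            counters[Quality.MALFORMED] += 1  # record is skipped, but counted
            continue
        counters[Quality.VALID] += 1
        yield "total", value

counters = Counter()
emitted = list(map_with_counters(["10", "x", "32", ""], counters))
```

After the run, comparing `counters[Quality.MALFORMED]` with the input size gives a quick quality check without inspecting the output.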

Joins

● Large datasets can be combined by using joins in MapReduce.
● When processing large data sets, it is often useful to join data by a common key.
● There are two types of joins:
○ Map-side join
○ Reduce-side join

Joins
Reduce-side join

● If the join is performed by the reducer, it is called a reduce-side join.
● A reduce-side join is easier to implement than a map-side join, because there is no need to structure the datasets in a particular way. However, it is less efficient, because the datasets have to be shuffled.
● To perform the join, we simply cache a key and compare it to the incoming keys. As long as the keys match, we can join the values from the corresponding keys.
● In reduce-side joins, there are two different scenarios that we can consider:
○ one-to-one
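The reduce-side join described above, where records are tagged with their source and grouped by join key, can be sketched in plain Python. This is a single-process illustration, not Hadoop code: `tag_records`, `reduce_side_join`, the "L"/"R" tags and the sample datasets are hypothetical, and a dictionary stands in for the shuffle.

```python
from collections import defaultdict

def tag_records(source, records):
    """Mapper side: emit (join key, (source tag, record)) for each record."""
    for record in records:
        yield record[0], (source, record)

def reduce_side_join(left, right):
    # Shuffle: bring all tagged records with the same join key together.
    groups = defaultdict(list)
    tagged = list(tag_records("L", left)) + list(tag_records("R", right))
    for key, item in tagged:
        groups[key].append(item)
    # Reducer side: pair every left record with every right record per key.
    joined = []
    for key in sorted(groups):
        lefts = [rec for tag, rec in groups[key] if tag == "L"]
        rights = [rec for tag, rec in groups[key] if tag == "R"]
        for l_rec in lefts:
            for r_rec in rights:
                joined.append((key, l_rec[1], r_rec[1]))
    return joined

stations = [("s1", "Oslo"), ("s2", "Lima")]
readings = [("s2", 30), ("s1", 12), ("s1", 15), ("s3", 9)]
joined = reduce_side_join(stations, readings)
```

Keys that appear in only one dataset (here `"s3"`) produce no output, which makes this an inner join.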
Map-side join

● When the join is performed by the mapper, it is called a map-side join.
● A map-side join has strict prerequisites that must hold before the data can be joined:
○ The data must be partitioned and sorted in a particular way.
○ Each input dataset must be divided into the same number of partitions.
○ Each dataset must be sorted by the same key (the join key).
○ All the records for a particular key must reside in the same partition.
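The prerequisites listed above are what allow a map-side join to work as a simple merge. The plain-Python sketch below illustrates this under stated assumptions: both inputs are partitioned with the same function and sorted by key, keys on the left side are unique, and the names `partition`, `merge_join` and `map_side_join` are hypothetical rather than Hadoop APIs.

```python
import zlib

def partition(records, n):
    """Split records into n partitions by join key and sort each partition."""
    parts = [[] for _ in range(n)]
    for key, value in records:
        # Same deterministic function for both datasets, so equal keys
        # always land in the same partition number.
        parts[zlib.crc32(key.encode("utf-8")) % n].append((key, value))
    return [sorted(part) for part in parts]

def merge_join(left, right):
    """Join two key-sorted partitions in one pass (left keys assumed unique)."""
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        l_key, r_key = left[i][0], right[j][0]
        if l_key < r_key:
            i += 1
        elif l_key > r_key:
            j += 1
        else:
            out.append((l_key, left[i][1], right[j][1]))
            j += 1  # stay on the left record; more right records may match
    return out

def map_side_join(left, right, n=2):
    joined = []
    for l_part, r_part in zip(partition(left, n), partition(right, n)):
        joined.extend(merge_join(l_part, r_part))  # one "map task" per pair
    return joined

stations = [("s1", "Oslo"), ("s2", "Lima")]
readings = [("s2", 30), ("s1", 12), ("s1", 15), ("s3", 9)]
joined = sorted(map_side_join(stations, readings))
```

No shuffle is needed here: each pair of matching partitions is joined independently, which is why the inputs must already satisfy the partitioning and sorting conditions.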
Conclusion

● The MapReduce programming model is used in many places and for many purposes.
● The model is easy to use because it hides the underlying details (parallelization, fault tolerance, locality optimization, and load balancing).
● MapReduce is used for tasks such as sorting, data processing, machine learning and so on.
● The implementation makes efficient use of machine resources, so it is suitable for large computational problems.
Conclusion (contd.)

● Network bandwidth is a scarce resource. Various optimizations are aimed at reducing the quantity of data sent across the network.
● The locality optimization allows us to read data from local disks.
● Writing a single copy of the intermediate data to local disk saves network bandwidth.
Summary

● Topics covered:
○ MapReduce
○ Functions of MapReduce
○ Features of MapReduce
○ Input Formats
○ Output Formats
○ Sorting
○ Counters
○ Joins and Types
○ Conclusion
Thank you
