MAPREDUCE TYPES, FORMATS, AND FEATURES
The map input key and value types (K1, V1) differ from the map output types (K2, V2).
The reduce input types must match the map output types (K2, V2).
The reduce output types (K3, V3) can be different again.
Data Flow:
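The general form of the data flow, written in type notation, is:

    map: (K1, V1) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)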
JAVA API REPRESENTATION
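A short sketch of how these type parameters surface in the org.apache.hadoop.mapreduce API; word count is assumed here purely as an illustration:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // K1 = LongWritable (byte offset), V1 = Text (line),
    // K2 = Text (word), V2 = IntWritable (count)
    public class WordCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);       // emits (K2, V2)
                }
            }
        }
    }

    // Input is (K2, list(V2)); here K3 == K2 and V3 == V2
    class WordCountReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum)); // emits (K3, V3)
        }
    }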
COMBINER FUNCTION
The combiner function has the same general form as the reduce function, except that its output types are the intermediate key and value types (K2 and V2), so that its output can feed the reduce function.
Often the combiner and reduce functions are the same, in which case K3 is the same as K2, and V3 is the same as V2.
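With a combiner in place, the data flow becomes:

    map: (K1, V1) → list(K2, V2)
    combiner: (K2, list(V2)) → list(K2, V2)
    reduce: (K2, list(V2)) → list(K3, V3)

Because the word-count reducer sketched above is associative and commutative, it could double as the combiner; a minimal driver fragment (job setup assumed elsewhere):

    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class); // combiner output types are (K2, V2)
    job.setReducerClass(WordCountReducer.class);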
MAPREDUCE INPUT FORMATS
An input split is a chunk of the input that is processed by a single map.
Each map processes a single split.
Each split is divided into records, and the map processes each record (a key-value pair) in turn.
Input splits are represented by the Java class InputSplit.
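A condensed sketch of the class; the two core abstract methods of org.apache.hadoop.mapreduce.InputSplit are:

    import java.io.IOException;

    public abstract class InputSplit {
        // Length of the split in bytes, used to sort splits largest-first
        public abstract long getLength() throws IOException, InterruptedException;
        // Hostnames where the split's data is stored, used for data-local scheduling
        public abstract String[] getLocations() throws IOException, InterruptedException;
    }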
FILE INPUT FORMATS
An InputFormat is responsible for creating the input splits and dividing them into records. Its main file-based types are:
•FileInputFormat: The base class for all implementations of InputFormat that use files as their data source. It provides two things: a place to define which files are included as the input to a job, and an implementation for generating splits for the input files (see the usage sketch after this list).
•CombineFileInputFormat: This packs many files into each split so that each mapper has more to
process. Hadoop works better with a small number of large files than a large number of small
files.
•WholeFileInputFormat: A format where the keys are not used and the values are the file contents. It takes a FileSplit and converts it into a single record.
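A minimal sketch of the FileInputFormat calls in a job driver; the paths are hypothetical:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    Job job = Job.getInstance();
    // Add input paths one at a time (files, directories, or globs)
    FileInputFormat.addInputPath(job, new Path("/data/logs/2024"));
    FileInputFormat.addInputPath(job, new Path("/data/logs/2025"));
    // ...or replace the whole list in a single call
    FileInputFormat.setInputPaths(job, new Path("/data/logs/2024"), new Path("/data/logs/2025"));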
TEXT INPUT FORMATS
•TextInputFormat: The default InputFormat. Each record is a line of input. The key, a LongWritable, is the byte offset within the file of the beginning of the line. The value is the contents of the line, excluding any line terminators, packaged as a Text object.
•KeyValueTextInputFormat: TextInputFormat's keys, being simply the offsets within the file, are not normally very useful. It is common for each line in a file to be a key-value pair, separated by a delimiter such as a tab character; KeyValueTextInputFormat splits each line into key and value at the first occurrence of that delimiter.
•NLineInputFormat: With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input, depending on the size of the split and the length of the lines. If you want your mappers to receive a fixed number of lines of input, then NLineInputFormat is the InputFormat to use (see the configuration sketch after this list).
•XML: For XML input, Hadoop comes with the class StreamXmlRecordReader (in the org.apache.hadoop.streaming.mapreduce package), which breaks XML documents into records delimited by specified start and end tags.
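A minimal configuration sketch for the two line-oriented formats above; the separator and the line count are assumptions, not defaults:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    Job job = Job.getInstance();

    // Split each line into (key, value) at the first comma instead of the default tab
    job.getConfiguration().set(
        "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
    job.setInputFormatClass(KeyValueTextInputFormat.class);

    // Alternatively, give every mapper exactly 1000 lines of input:
    // NLineInputFormat.setNumLinesPerSplit(job, 1000);
    // job.setInputFormatClass(NLineInputFormat.class);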
OTHER INPUT TYPES
•Binary Input: Hadoop MapReduce is not restricted to processing textual data; it also supports binary formats, such as SequenceFileInputFormat for Hadoop's own sequence files.
•Database Input: DBInputFormat is an input format for reading data from a relational database, using JDBC (a setup sketch follows).
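A sketch of DBInputFormat setup; the driver class, connection URL, credentials, table, and column names are all hypothetical, and MyRecord stands for a user class implementing DBWritable:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
    import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;

    Job job = Job.getInstance();
    DBConfiguration.configureDB(job.getConfiguration(),
        "com.mysql.jdbc.Driver",                 // JDBC driver class (assumed)
        "jdbc:mysql://dbhost/mydb", "user", "password");
    job.setInputFormatClass(DBInputFormat.class);
    DBInputFormat.setInput(job, MyRecord.class,  // value class, implements DBWritable
        "employees",                             // table to read (hypothetical)
        null,                                    // WHERE conditions (none)
        null,                                    // ORDER BY (none)
        "id", "name");                           // columns to fetch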
OUTPUT FORMATS
•TextOutputFormat: The default output format. It writes records as lines of text; keys and values are turned into strings and separated, by default, by a tab character.
•Binary Output: SequenceFileOutputFormat writes sequence files as output; this is a good choice when the output is to be consumed by further MapReduce jobs.
•Multiple Output: FileOutputFormat and its subclasses generate a set of files in the output directory; there is one file per reducer, and files are named by the partition number: part-r-00000, part-r-00001, and so on. The MultipleOutputs class gives finer control, allowing file names to be derived from the output keys and values.
•LazyOutput: FileOutputFormat subclasses will create output (part-r-nnnnn) files even if they are empty. Some applications prefer that empty files not be created, which is where LazyOutputFormat helps (see the sketch after this list).
•Database Output: The output formats for writing to relational databases and to HBase are DBOutputFormat (using JDBC) and HBase's TableOutputFormat, respectively.
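A short sketch of enabling lazy output; TextOutputFormat is assumed as the wrapped format:

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    Job job = Job.getInstance();
    // Wraps the real output format so that a part file is created only
    // when the first record is actually written to it
    LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);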
FEATURES
Counters:
• Used for gathering statistical information about a job and for diagnosing problems, if any.
• Task counters gather information about tasks as they run; the results are aggregated over all tasks in a job.
• Job counters are maintained by the application master and measure job-level statistics.
• User-defined counters are created by the application writer, either statically with a Java enum or dynamically by naming the counter with strings.
• Counters can be retrieved via the built-in Java API while a job is still running, giving job-level statistics without having to wait for the job to finish.
• User-defined Streaming functions: a Streaming process can increment counters by sending a specially formatted line (reporter:counter:group,counter,amount) to the standard error stream.
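A minimal sketch of a user-defined enum counter incremented in a mapper and read back in the driver; the enum, counter, and class names are hypothetical:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class CountingMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {

        // User-defined counter group, declared as a Java enum
        public enum RecordQuality { MALFORMED }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            if (value.toString().isEmpty()) {
                context.getCounter(RecordQuality.MALFORMED).increment(1);
                return;                          // skip malformed records
            }
            context.write(value, NullWritable.get());
        }
    }

    // In the driver, counters can be retrieved from the Job object,
    // even while the job is still running:
    //     long malformed = job.getCounters()
    //         .findCounter(CountingMapper.RecordQuality.MALFORMED).getValue();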