
UNIT-3

Big Data Analytics
(20CSE361)

MapReduce

What is MapReduce?

• MapReduce is the core Hadoop ecosystem component that provides data processing.
• MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in HDFS.
• MapReduce programs are parallel, and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster.
Cont...
• The framework takes the whole job from the user, divides it into smaller tasks, and assigns them to the worker nodes.
• MapReduce programs take a list as input and produce a list as output.
• This parallel processing improves the speed and reliability of the cluster.

Application of MapReduce
Entertainment: Hadoop MapReduce helps discover the most popular movies, based on what you like and what you have watched.
• It mainly focuses on users' logs and clicks.
E-commerce: Numerous e-commerce providers, like Amazon, Walmart, and eBay, use the MapReduce programming model to identify favorite items based on customers' preferences or purchasing behavior.
• This includes building product recommendation mechanisms for e-commerce catalogues by analyzing site records, purchase history, user interaction logs, etc.
Cont….
Data Warehouse: We can use MapReduce to analyze large data volumes in data warehouses while implementing specific business logic for data insights.

Fraud Detection: Hadoop and MapReduce are used in financial enterprises, including banks, insurance providers, and payment companies, for fraud detection, pattern identification, and business analytics through transaction analysis.
How does MapReduce work?
Hadoop Ecosystem component ‘MapReduce’ works by
breaking the processing into two phases:
1. Map phase
2. Reduce phase
• Each phase has key-value pairs as input and output.

Cont…
• Map Phase: In this phase, the data in each split is passed to a mapping function to produce output values.
• The map takes a key/value pair as input.
• The data may be in structured or unstructured form.
• The key is a reference into the input file and the value is the data record itself.
• Users supply custom business logic for their data-processing needs; this logic is applied to every input record (see the sketch below).
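To make the map phase concrete, here is a minimal word-count mapper sketch (illustrative only; the class name WordCountMapper is ours, not from these slides). The input key is the byte offset of a line and the input value is the line itself; the business logic emits a (word, 1) pair for every word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: input key = byte offset of the line, input value = the line text.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The user-supplied business logic, applied to every input record:
        // split the line into words and emit (word, 1) for each word.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}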
Cont…
2. Reduce Phase: In this phase, the reducer takes the intermediate key/value pairs produced by the mapper as input and processes them.
• The key/value pairs provided to the reducer are sorted by key.
• The reducer typically performs aggregation or summation-style computation.
• The reducer produces the final output as a list of key/value pairs (see the sketch below).
• This final output is stored in HDFS, and replication is done as usual.
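Continuing the word-count sketch (again, the class name is ours), the matching reducer receives each word together with its grouped list of counts, sorted by key, and sums them:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: input is (word, [1, 1, ...]) after shuffle and sort;
// output is (word, total count), which is written to HDFS.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;                       // aggregation/summation-style computation
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}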
Cont….

Shuffle and Sort: The reducer task starts with the shuffle and sort step.
• Its job is to consolidate the relevant records from the mapping phase output.
Example:

(figure: a worked MapReduce example)
Anatomy of MapReduce Job Run

(figures: the stages of a MapReduce job run)
Advantages of MapReduce
Fault tolerance: It can handle failures without downtime.
Speed: It splits, shuffles, and reduces unstructured data in a short time.
Cost-effectiveness: Hadoop MapReduce has a scale-out feature that enables users to process or store data in a cost-effective manner.
Scalability: It provides a highly scalable framework. MapReduce allows users to run applications across many nodes.
Advantages of MapReduce
Parallel Processing:
• Multiple job parts of the same dataset can be processed in parallel.
• This reduces the time taken to complete a task.
Data Locality:
• Moving huge volumes of data to the processing unit is costly, and processing by a single unit becomes a bottleneck.
• MapReduce instead moves the computation to the nodes where the data resides.
Limitations of MapReduce

• MapReduce cannot cache intermediate data in memory for further use, which diminishes Hadoop's performance.

• It is suitable only for batch processing of huge amounts of data.
Shuffling and Sorting in MapReduce

Objective

• Shuffling is the intermediate process in which output from the mappers is transferred to the reducers.
• Each reducer gets one or more keys and their associated values, depending on the number of reducers.
• The automatic sorting by key of the intermediate key/value pairs generated by the mapper is what "shuffling and sorting" refers to in Hadoop MapReduce.
Shuffling in MapReduce

• The process of transferring data from the mappers to the reducers is known as shuffling.
• The shuffling procedure groups the data by key so that it can be sorted.
• The shuffling task begins as soon as some of the map tasks are done, which makes the process faster: it does not wait for all map tasks to complete.
Sorting in MapReduce
• The MapReduce framework automatically sorts the mapper's output by key, so before being sent to the reducer, all key/value pairs are sorted.
• The sorted key/value pairs let the reducer easily detect when a new reduce call should start.
• If the user configures no reducer tasks, the shuffling and sorting phases do not take place; the job is over after the map tasks.
Secondary Sorting in MapReduce
• If we want to sort the reducer's values as well, the secondary sorting technique is used; it enables us to sort the values (in ascending or descending order) passed to each reducer. A sketch follows below.
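As an illustration not drawn from these slides, secondary sort is commonly implemented with a composite key plus a custom partitioner and grouping comparator; all class names below are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key = (natural key, value): sorting composite keys sorts the
// values seen by each reduce() call.
public class CompositeKey implements WritableComparable<CompositeKey> {
    final Text naturalKey = new Text();
    final IntWritable value = new IntWritable();

    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        value.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        value.readFields(in);
    }

    // Order by natural key first, then by value (ascending).
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : value.compareTo(other.value);
    }
}

// Partition on the natural key only, so every record for a given key
// reaches the same reducer.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, IntWritable> {
    public int getPartition(CompositeKey key, IntWritable val, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the natural key only, so one reduce() call sees all of a key's
// values, already ordered by compareTo() above.
class NaturalKeyGroupingComparator extends WritableComparator {
    NaturalKeyGroupingComparator() { super(CompositeKey.class, true); }

    public int compare(WritableComparable a, WritableComparable b) {
        return ((CompositeKey) a).naturalKey.compareTo(((CompositeKey) b).naturalKey);
    }
}

In the driver these would be wired up with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).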
Conclusion
• Shuffling and sorting occur simultaneously to summarize the mapper's intermediate output.

• Shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)), as the driver sketch below shows.
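For completeness, a minimal driver sketch for the word-count classes above (class names and paths are placeholders). Uncommenting setNumReduceTasks(0) turns it into a map-only job in which shuffling and sorting are skipped:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // job.setNumReduceTasks(0);  // map-only: no shuffle, no sort;
        //                            // mapper output goes straight to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}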
INPUT FORMATS
● InputFormat describes the input specification for the execution of a MapReduce job.

● InputFormat describes how to split and read input files.

● InputFormat is responsible for splitting the input data files into the records used by the map-reduce operation.
cont…

• InputFormat defines the RecordReader.
• The RecordReader is responsible for reading the actual records from the input files.
TYPES OF INPUT FORMAT
1. FileInputFormat:
• It is the base class for all file-based InputFormats.
• When a MapReduce job execution starts, FileInputFormat provides a path containing the files to read.
• This InputFormat reads all the files and divides them into one or more InputSplits.
TYPES OF INPUT FORMAT
2. TextInputFormat:

• It is the default InputFormat.
• This InputFormat treats each line of each input file as a separate record.
• It performs no parsing. TextInputFormat is useful for unformatted data or line-based records like log files.
TYPES OF INPUT FORMAT
3. KeyValueTextInputFormat:
• It is similar to TextInputFormat.
• This InputFormat also treats each line of input as a separate record.
• The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat breaks the line itself into a key and a value, split at a separator character (a tab by default); see the configuration sketch below.
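A hedged configuration sketch (job is assumed to be the org.apache.hadoop.mapreduce.Job from a driver like the one earlier):

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Split each line into key and value at the first separator character
// (tab by default; here we assume comma-separated input).
job.getConfiguration().set(
        "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);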
TYPES OF INPUT FORMAT
4. SequenceFileInputFormat:

• It is an InputFormat that reads sequence files. Sequence files are binary files.
• These files store sequences of binary key-value pairs.
• They can be block-compressed and provide direct serialization and deserialization of arbitrary data types.
TYPES OF INPUT FORMAT
5. NLineInputFormat:

• Like TextInputFormat, its keys are the byte offsets of the lines and its values are the contents of the lines.
• With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input.
• So, if we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat, configured as sketched below.
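A brief sketch of the configuration, assuming the same job object as before (100 is an arbitrary choice):

import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Give every mapper exactly 100 lines of input.
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100);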
TYPES OF INPUT FORMAT

6. DBInputFormat:
• This InputFormat reads data from a relational database, using JDBC.
• It loads small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs.
OUTPUT FORMATS
• OutputFormat is the output-side counterpart of InputFormat.
• OutputFormat instances are used to write to files on the local disk or in HDFS.
• In MapReduce job execution, on the basis of the output specification:

● It provides the RecordWriter implementation used to write the output files of the job; the output files are then stored in a FileSystem.
TYPES OF OUTPUT FORMAT
1. TextOutputFormat:
• The default OutputFormat is TextOutputFormat.
• It writes (key, value) pairs on individual lines of text files.
• Its keys and values can be of any type, because TextOutputFormat turns them into strings by calling toString() on them.
• It separates each key-value pair with a tab character.
• We can change this with the mapreduce.output.textoutputformat.separator property, as sketched below.
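A one-line sketch of changing the separator, assuming the usual job object:

// Emit "key,value" instead of the default tab-separated "key<TAB>value".
job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");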
TYPES OF OUTPUT FORMAT
2. SequenceFileOutputFormat:
• This OutputFormat writes sequence files as its output. Sequence files are also a useful intermediate format between chained MapReduce jobs.
• It serializes arbitrary data types to the file, and the corresponding SequenceFileInputFormat will deserialize the file into the same types.
• It presents the data to the next mapper in the same manner as it was emitted by the previous reducer. Static methods also control the compression.
TYPES OF OUTPUT FORMAT
3. SequenceFileAsBinaryOutputFormat:
• It is a variant of SequenceFileOutputFormat.
• It writes keys and values to a sequence file in binary format.
4. MapFileOutputFormat:
• It is another form of FileOutputFormat.
• It writes the output as MapFiles.
• The keys in a MapFile must be added in order, so we need to ensure that the reducer emits keys in sorted order.
TYPES OF OUTPUT FORMAT
5. MultipleOutputs:
• This format allows writing data to files whose names are derived from the output keys and values.

6. LazyOutputFormat:
• In MapReduce job execution, FileOutputFormat sometimes creates output files, even if they are empty.
• LazyOutputFormat is a wrapper OutputFormat that delays creating an output file until the first record is actually written to it (see the sketch below).
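A brief sketch of wrapping an OutputFormat lazily, assuming the usual job object:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wrap the real OutputFormat so that empty output files are never created.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);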
TYPES OF OUTPUT FORMAT
7. DBOutputFormat:
• It is the OutputFormat for writing to relational databases and HBase.
• This format sends the reduce output to a SQL table.
• It accepts key-value pairs in which the key has a type extending DBWritable.
Failure in MapReduce

• Failures are the norm on commodity hardware.
• Worker failure
– Detect failure via periodic heartbeats
– Re-execute in-progress map/reduce tasks
• Master failure
– Single point of failure; resume from the execution log
• Robust
– Google's experience: once lost 1,600 of 1,800 machines, but the job finished fine
Job Scheduling in Hadoop

(figures: overview of Hadoop job schedulers)
a) FIFO Scheduler
• First In First Out is the default scheduling policy used in Hadoop.
• The FIFO Scheduler gives preference to applications on a first-come, first-served basis.
• It places the applications in a queue and executes them in the order of their submission (FIFO).
• Here, irrespective of size and priority, the requests of the first application in the queue are allocated first.
• Only once the first application's requests are satisfied is the next application in the queue served.
Cont…
FIFO Advantages:
• It is simple to understand.
• It doesn't need any configuration.
• Jobs are executed in the order of their submission.
FIFO Disadvantages:
• It is not suitable for shared clusters.
• If a large application arrives before a shorter one, the large application will use all the resources in the cluster, and the shorter application has to wait for its turn. This leads to starvation.
b) Capacity Scheduler
• The Capacity Scheduler allows multiple tenants to securely share a large Hadoop cluster.
• It is designed to run a multi-tenant cluster while maximizing the throughput and utilization of the cluster.
• It supports hierarchical queues that share the cluster's resources.
• A queue hierarchy contains three types of queues: root, parent, and leaf.
• The root queue represents the cluster itself, a parent queue represents an organization/group or sub-organization/sub-group, and leaf queues accept application submissions (see the sketch below).
Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• It provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
• It is more complex than the other schedulers.
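As a hedged sketch, a job is routed to a particular Capacity Scheduler leaf queue by name ("analytics" is a hypothetical queue name; job is the usual driver object):

// Submit this job to the "analytics" leaf queue instead of the default queue.
job.getConfiguration().set("mapreduce.job.queuename", "analytics");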
c) Fair Scheduler

FairScheduler

• The FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters.
• It dynamically balances resources between all running applications.
• It assigns resources so that, on average, all applications get an equal share of resources over time.
• By default, the FairScheduler makes its scheduling fairness decisions only on the basis of memory; it can be configured to schedule on both memory and CPU.
• It also supports hierarchical queues.
• When an application is present in a queue, the app gets its minimum share; but when a queue doesn't need its full guaranteed share, the excess is split between the other running applications.
Advantages:
• It provides a reasonable way to share the cluster between a number of users.
• It can work with application priorities, where the priorities are used as weights.
Disadvantage:
• It requires configuration.
Features of MapReduce

1. Scalability:
• Apache Hadoop is a highly scalable framework.
• This is because of its ability to store and distribute huge data across plenty of servers.
• All these servers are inexpensive and can operate in parallel.
• We can easily scale the storage and computation power by adding servers to the cluster.
• This allows it to handle thousands of terabytes of data.
2. Flexibility
• MapReduce programming enables companies to access new sources of data.
• It enables companies to operate on different types of data.
• It allows enterprises to access structured as well as unstructured data, and to derive significant value by gaining insights from multiple sources of data.
• It provides support for multiple languages and for data from email, social media, and clickstream sources.
• MapReduce is more flexible in dealing with such data than a traditional DBMS.
3. Security and Authentication
• The MapReduce programming model uses the HBase and HDFS security platforms, which allow only authenticated users to operate on the data.
• This protects system data from unauthorized access and enhances system security.

4. Cost-effective solution
• It allows the storage and processing of large data sets in a very affordable manner.
5. Fast
• Even when dealing with large volumes of unstructured data, Hadoop MapReduce takes just minutes to process terabytes of data, and it can process petabytes of data in about an hour; this allows for faster processing of data.
6. Simple model of programming
• It allows programmers to develop MapReduce programs that can handle tasks easily and efficiently.
• Anyone can easily learn to write MapReduce programs and meet their data-processing needs, because the framework is written in Java.
7. Parallel Programming
• A major aspect of the working of MapReduce programming is parallel processing.
• It divides the tasks in a way that allows them to execute in parallel.
• It allows multiple processors to execute these divided tasks, so the entire program runs in less time.
8. Availability and resilient nature
• Whenever data is sent to an individual node, the same set of data is forwarded to some other nodes in the cluster.
• So, if any particular node suffers a failure, there are always other copies present on other nodes that can still be accessed whenever needed.
• This assures high availability of data.
• One of the major features offered by Apache Hadoop is fault tolerance. The Hadoop MapReduce framework has the ability to quickly recognize faults that occur.
• It then applies a quick, automatic recovery solution. This feature makes it a game-changer in the world of big data processing.
