
UNIT-3

Big Data Analytics
(20CSE361)

MapReduce

What is MapReduce?

• MapReduce is the core Hadoop ecosystem component that provides data processing.
• MapReduce is a software framework for easily writing applications that process the vast amounts of structured and unstructured data stored in HDFS.
• MapReduce programs are parallel, and are therefore very useful for performing large-scale data analysis using multiple machines in the cluster.
Cont...
• The framework takes the whole job from the user, divides it into smaller tasks, and assigns them to the worker nodes.
• MapReduce programs take a list as input and produce a list as output.
• This parallel processing improves the speed and reliability of the cluster.

Application of MapReduce
Entertainment: Hadoop MapReduce helps discover the most popular movies, based on what you like and what you have watched.
• It mainly focuses on users' logs and clicks.
E-commerce: Numerous e-commerce providers, like Amazon, Walmart, and eBay, use the MapReduce programming model to identify favorite items based on customers' preferences or purchasing behavior.
• This includes building product recommendation mechanisms for e-commerce catalogues by analyzing site records, purchase history, user interaction logs, etc.
Cont….
Data Warehouse: We can use MapReduce to analyze large data volumes in data warehouses while implementing specific business logic for data insights.

Fraud Detection: Hadoop and MapReduce are used in financial enterprises, including banks, insurance providers, and payment companies, for fraud detection, pattern identification, and business analytics through transaction analysis.
How does MapReduce work?
Hadoop Ecosystem component ‘MapReduce’ works by
breaking the processing into two phases:
1. Map phase
2. Reduce phase
• Each phase has key-value pairs as input and output.

Cont…
• Map Phase: In this phase, the data in each split is passed to a mapping function to produce output values.
• The map takes a key/value pair as input.
• The data may be in structured or unstructured form.
• The key is a reference into the input file and the value is the data record itself.
• Users supply custom business logic for their data-processing needs; this logic is applied to every input record (see the sketch below).
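To make the map phase concrete, here is a minimal word-count mapper sketch (illustrative only; the class name WordCountMapper is ours, not from these slides). The input key is the byte offset of a line and the input value is the line itself; the business logic emits a (word, 1) pair for every word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map phase: input key = byte offset of the line, input value = the line text.
public class WordCountMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The user-supplied business logic, applied to every input record:
        // split the line into words and emit (word, 1) for each word.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}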
Cont…
2. Reduce Phase: In this phase, the reducer takes the intermediate key/value pairs produced by the mapper as input and processes them.
• The key/value pairs provided to the reducer are sorted by key.
• The reducer typically performs aggregation or summation-style computation.
• The reducer produces the final output as a list of key/value pairs (see the sketch below).
• This final output is stored in HDFS, and replication is done as usual.
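Continuing the word-count sketch (again, the class name is ours), the matching reducer receives each word together with its grouped list of counts, sorted by key, and sums them:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reduce phase: input is (word, [1, 1, ...]) after shuffle and sort;
// output is (word, total count), which is written to HDFS.
public class WordCountReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;                       // aggregation/summation-style computation
        for (IntWritable count : counts) {
            sum += count.get();
        }
        total.set(sum);
        context.write(word, total);
    }
}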
Cont….

Shuffle and Sort: The reducer task starts with the shuffle and sort step.
• Its job is to consolidate the relevant records from the mapping phase output.
Example:

(figure: a worked MapReduce example)
Anatomy of MapReduce Job Run

(figures: the stages of a MapReduce job run)
Advantages of MapReduce
Fault tolerance: It can handle failures without downtime.
Speed: It splits, shuffles, and reduces unstructured data in a short time.
Cost-effectiveness: Hadoop MapReduce has a scale-out feature that enables users to process or store data in a cost-effective manner.
Scalability: It provides a highly scalable framework. MapReduce allows users to run applications across many nodes.
Advantages of MapReduce
Parallel Processing:
• Multiple job parts of the same dataset can be processed in parallel.
• This reduces the time taken to complete a task.
Data Locality:
• Moving huge volumes of data to the processing unit is costly, and processing by a single unit becomes a bottleneck.
• MapReduce instead moves the computation to the nodes where the data resides.
Limitations of MapReduce

• MapReduce cannot cache intermediate data in memory for further use, which diminishes Hadoop's performance.

• It is suitable only for batch processing of huge amounts of data.
Shuffling and Sorting in MapReduce

Objective

• Shuffling is the intermediate process in which output from the mappers is transferred to the reducers.
• Each reducer gets one or more keys and their associated values, depending on the number of reducers.
• The automatic sorting by key of the intermediate key/value pairs generated by the mapper is what "shuffling and sorting" refers to in Hadoop MapReduce.
Shuffling in MapReduce

• The process of transferring data from the mappers to the reducers is known as shuffling.
• The shuffling procedure groups the data by key so that it can be sorted.
• The shuffling task begins as soon as some of the map tasks are done, which makes the process faster: it does not wait for all map tasks to complete.
Sorting in MapReduce
• The MapReduce framework automatically sorts the mapper's output by key, so before being sent to the reducer, all key/value pairs are sorted.
• The sorted key/value pairs let the reducer easily detect when a new reduce call should start.
• If the user configures no reducer tasks, the shuffling and sorting phases do not take place; the job is over after the map tasks.
Secondary Sorting in MapReduce
• If we want to sort the reducer's values as well, the secondary sorting technique is used; it enables us to sort the values (in ascending or descending order) passed to each reducer. A sketch follows below.
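As an illustration not drawn from these slides, secondary sort is commonly implemented with a composite key plus a custom partitioner and grouping comparator; all class names below are hypothetical:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Composite key = (natural key, value): sorting composite keys sorts the
// values seen by each reduce() call.
public class CompositeKey implements WritableComparable<CompositeKey> {
    final Text naturalKey = new Text();
    final IntWritable value = new IntWritable();

    public void write(DataOutput out) throws IOException {
        naturalKey.write(out);
        value.write(out);
    }

    public void readFields(DataInput in) throws IOException {
        naturalKey.readFields(in);
        value.readFields(in);
    }

    // Order by natural key first, then by value (ascending).
    public int compareTo(CompositeKey other) {
        int cmp = naturalKey.compareTo(other.naturalKey);
        return cmp != 0 ? cmp : value.compareTo(other.value);
    }
}

// Partition on the natural key only, so every record for a given key
// reaches the same reducer.
class NaturalKeyPartitioner extends Partitioner<CompositeKey, IntWritable> {
    public int getPartition(CompositeKey key, IntWritable val, int numPartitions) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the natural key only, so one reduce() call sees all of a key's
// values, already ordered by compareTo() above.
class NaturalKeyGroupingComparator extends WritableComparator {
    NaturalKeyGroupingComparator() { super(CompositeKey.class, true); }

    public int compare(WritableComparable a, WritableComparable b) {
        return ((CompositeKey) a).naturalKey.compareTo(((CompositeKey) b).naturalKey);
    }
}

In the driver these would be wired up with job.setPartitionerClass(NaturalKeyPartitioner.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class).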
Conclusion
• Shuffling and sorting occur simultaneously to summarize the mapper's intermediate output.

• Shuffling and sorting are not performed at all if you specify zero reducers (setNumReduceTasks(0)), as the driver sketch below shows.
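For completeness, a minimal driver sketch for the word-count classes above (class names and paths are placeholders). Uncommenting setNumReduceTasks(0) turns it into a map-only job in which shuffling and sorting are skipped:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // job.setNumReduceTasks(0);  // map-only: no shuffle, no sort;
        //                            // mapper output goes straight to HDFS
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}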
INPUT FORMATS
● InputFormat describes the input specification for the execution of a MapReduce job.

● InputFormat describes how to split and read input files.

● InputFormat is responsible for splitting the input data files into the records used by the map-reduce operation.
cont…

• InputFormat defines the RecordReader.
• The RecordReader is responsible for reading the actual records from the input files.
TYPES OF INPUT FORMAT
1. FileInputFormat:
• It is the base class for all file-based InputFormats.
• When a MapReduce job execution starts, FileInputFormat provides a path containing the files to read.
• This InputFormat reads all the files and divides them into one or more InputSplits.
TYPES OF INPUT FORMAT
2. TextInputFormat:

• It is the default InputFormat.
• This InputFormat treats each line of each input file as a separate record.
• It performs no parsing. TextInputFormat is useful for unformatted data or line-based records like log files.
TYPES OF INPUT FORMAT
3. KeyValueTextInputFormat:
• It is similar to TextInputFormat.
• This InputFormat also treats each line of input as a separate record.
• The difference is that TextInputFormat treats the entire line as the value, whereas KeyValueTextInputFormat breaks the line itself into a key and a value, split at a separator character (a tab by default); see the configuration sketch below.
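A hedged configuration sketch (job is assumed to be the org.apache.hadoop.mapreduce.Job from a driver like the one earlier):

import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;

// Split each line into key and value at the first separator character
// (tab by default; here we assume comma-separated input).
job.getConfiguration().set(
        "mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");
job.setInputFormatClass(KeyValueTextInputFormat.class);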
TYPES OF INPUT FORMAT
4. SequenceFileInputFormat:

• It is an InputFormat that reads sequence files. Sequence files are binary files.
• These files store sequences of binary key-value pairs.
• They can be block-compressed and provide direct serialization and deserialization of arbitrary data types.
TYPES OF INPUT FORMAT
5. NLineInputFormat:

• Like TextInputFormat, its keys are the byte offsets of the lines and its values are the contents of the lines.
• With TextInputFormat and KeyValueTextInputFormat, each mapper receives a variable number of lines of input.
• So, if we want each mapper to receive a fixed number of lines of input, we use NLineInputFormat, configured as sketched below.
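A brief sketch of the configuration, assuming the same job object as before (100 is an arbitrary choice):

import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

// Give every mapper exactly 100 lines of input.
job.setInputFormatClass(NLineInputFormat.class);
NLineInputFormat.setNumLinesPerSplit(job, 100);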
TYPES OF INPUT FORMAT

6. DBInputFormat:
• This InputFormat reads data from a relational database, using JDBC.
• It loads small datasets, perhaps for joining with large datasets from HDFS using MultipleInputs.
OUTPUT FORMATS
• OutputFormat is the output-side counterpart of InputFormat.
• OutputFormat instances are used to write to files on the local disk or in HDFS.
• In MapReduce job execution, on the basis of the output specification:

● It provides the RecordWriter implementation used to write the output files of the job; the output files are then stored in a FileSystem.
TYPES OF OUTPUT FORMAT
1. TextOutputFormat:
• The default OutputFormat is TextOutputFormat.
• It writes (key, value) pairs on individual lines of text files.
• Its keys and values can be of any type, because TextOutputFormat turns them into strings by calling toString() on them.
• It separates each key-value pair with a tab character.
• We can change this with the mapreduce.output.textoutputformat.separator property, as sketched below.
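A one-line sketch of changing the separator, assuming the usual job object:

// Emit "key,value" instead of the default tab-separated "key<TAB>value".
job.getConfiguration().set("mapreduce.output.textoutputformat.separator", ",");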
TYPES OF OUTPUT FORMAT
2. SequenceFileOutputFormat:
• This OutputFormat writes sequence files as its output. Sequence files are also a useful intermediate format between chained MapReduce jobs.
• It serializes arbitrary data types to the file, and the corresponding SequenceFileInputFormat will deserialize the file into the same types.
• It presents the data to the next mapper in the same manner as it was emitted by the previous reducer. Static methods also control the compression.
TYPES OF OUTPUT FORMAT
3. SequenceFileAsBinaryOutputFormat:
• It is a variant of SequenceFileOutputFormat.
• It writes keys and values to a sequence file in binary format.
4. MapFileOutputFormat:
• It is another form of FileOutputFormat.
• It writes the output as MapFiles.
• The keys in a MapFile must be added in order, so we need to ensure that the reducer emits keys in sorted order.
TYPES OF OUTPUT FORMAT
5. MultipleOutputs:
• This format allows writing data to files whose names are derived from the output keys and values.

6. LazyOutputFormat:
• In MapReduce job execution, FileOutputFormat sometimes creates output files, even if they are empty.
• LazyOutputFormat is a wrapper OutputFormat that delays creating an output file until the first record is actually written to it (see the sketch below).
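A brief sketch of wrapping an OutputFormat lazily, assuming the usual job object:

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// Wrap the real OutputFormat so that empty output files are never created.
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);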
TYPES OF OUTPUT FORMAT
7. DBOutputFormat:
• It is the OutputFormat for writing to relational databases and HBase.
• This format sends the reduce output to a SQL table.
• It accepts key-value pairs in which the key has a type extending DBWritable.
Failure in MapReduce

• Failures are the norm on commodity hardware.
• Worker failure
– Detect failure via periodic heartbeats
– Re-execute in-progress map/reduce tasks
• Master failure
– Single point of failure; resume from the execution log
• Robust
– Google's experience: once lost 1,600 of 1,800 machines, but the job finished fine
Job Scheduling in Hadoop

(figures: overview of Hadoop job schedulers)
a) FIFO Scheduler
• First In First Out is the default scheduling policy used in Hadoop.
• The FIFO Scheduler gives preference to applications on a first-come, first-served basis.
• It places the applications in a queue and executes them in the order of their submission (FIFO).
• Here, irrespective of size and priority, the requests of the first application in the queue are allocated first.
• Only once the first application's requests are satisfied is the next application in the queue served.
Cont…
FIFO Advantages:
• It is simple to understand.
• It doesn't need any configuration.
• Jobs are executed in the order of their submission.
FIFO Disadvantages:
• It is not suitable for shared clusters.
• If a large application arrives before a shorter one, the large application will use all the resources in the cluster, and the shorter application has to wait for its turn. This leads to starvation.
b) Capacity Scheduler
• The Capacity Scheduler allows multiple tenants to securely share a large Hadoop cluster.
• It is designed to run a multi-tenant cluster while maximizing the throughput and utilization of the cluster.
• It supports hierarchical queues that share the cluster's resources.
• A queue hierarchy contains three types of queues: root, parent, and leaf.
• The root queue represents the cluster itself, a parent queue represents an organization/group or sub-organization/sub-group, and leaf queues accept application submissions (see the sketch below).
Advantages:
• It maximizes the utilization of resources and throughput in the Hadoop cluster.
• It provides elasticity for groups or organizations in a cost-effective manner.
• It also gives capacity guarantees and safeguards to the organizations utilizing the cluster.
Disadvantage:
• It is more complex than the other schedulers.
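As a hedged sketch, a job is routed to a particular Capacity Scheduler leaf queue by name ("analytics" is a hypothetical queue name; job is the usual driver object):

// Submit this job to the "analytics" leaf queue instead of the default queue.
job.getConfiguration().set("mapreduce.job.queuename", "analytics");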
c) Fair Scheduler

FairScheduler

• The FairScheduler allows YARN applications to fairly share resources in large Hadoop clusters.
• It dynamically balances resources between all running applications.
• It assigns resources so that, on average, all applications get an equal share of resources over time.
• By default, the FairScheduler makes its scheduling fairness decisions only on the basis of memory; it can be configured to schedule on both memory and CPU.
• It also supports hierarchical queues.
• When an application is present in a queue, the app gets its minimum share; but when a queue doesn't need its full guaranteed share, the excess is split between the other running applications.
Advantages:
• It provides a reasonable way to share the cluster between a number of users.
• It can work with application priorities, where the priorities are used as weights.
Disadvantage:
• It requires configuration.
Features of MapReduce

1. Scalability:
• Apache Hadoop is a highly scalable framework.
• This is because of its ability to store and distribute huge data across plenty of servers.
• All these servers are inexpensive and can operate in parallel.
• We can easily scale the storage and computation power by adding servers to the cluster.
• This allows it to handle thousands of terabytes of data.
2. Flexibility
• MapReduce programming enables companies to access new sources of data.
• It enables companies to operate on different types of data.
• It allows enterprises to access structured as well as unstructured data, and to derive significant value by gaining insights from multiple sources of data.
• It provides support for multiple languages and for data from email, social media, and clickstream sources.
• MapReduce is more flexible in dealing with such data than a traditional DBMS.
3. Security and Authentication
• The MapReduce programming model uses the HBase and HDFS security platforms, which allow only authenticated users to operate on the data.
• This protects system data from unauthorized access and enhances system security.

4. Cost-effective solution
• It allows the storage and processing of large data sets in a very affordable manner.
5. Fast
• Even when dealing with large volumes of unstructured data, Hadoop MapReduce takes just minutes to process terabytes of data, and it can process petabytes of data in about an hour; this allows for faster processing of data.
6. Simple model of programming
• It allows programmers to develop MapReduce programs that can handle tasks easily and efficiently.
• Anyone can easily learn to write MapReduce programs and meet their data-processing needs, because the framework is written in Java.
7. Parallel Programming
• A major aspect of the working of MapReduce programming is parallel processing.
• It divides the tasks in a way that allows them to execute in parallel.
• It allows multiple processors to execute these divided tasks, so the entire program runs in less time.
8. Availability and resilient nature
• Whenever data is sent to an individual node, the same set of data is forwarded to some other nodes in the cluster.
• So, if any particular node suffers a failure, there are always other copies present on other nodes that can still be accessed whenever needed.
• This assures high availability of data.
• One of the major features offered by Apache Hadoop is fault tolerance. The Hadoop MapReduce framework has the ability to quickly recognize faults that occur.
• It then applies a quick, automatic recovery solution. This feature makes it a game-changer in the world of big data processing.
