Unit 3
Big Data Analytics
(20CSE361)
MapReduce
1
What is MapReduce?
• MapReduce is a programming model for processing large data sets in parallel across a cluster of commodity hardware.
• A job consists of a map function, which transforms input records into intermediate key-value pairs, and a reduce function, which aggregates all values that share the same key.
Application of MapReduce
Entertainment: To discover the most popular movies based on what you like and what you have watched, Hadoop MapReduce can help you out.
• It mainly focuses on users' logs and clicks.
E-commerce: Numerous e-commerce providers, such as Amazon, Walmart, and eBay, use the MapReduce programming model to identify favourite items based on customers' preferences or purchasing behavior.
• This includes building item-recommendation mechanisms for e-commerce catalogues and analyzing website records, purchase histories, user interaction logs, etc.
5
Cont….
Data Warehouse: We can utilize MapReduce to
analyze large data volumes in data warehouses
while implementing specific business logic for data
insights.
6
How does MapReduce work?
The Hadoop Ecosystem component 'MapReduce' works by
breaking the processing into two phases:
1. Map phase
2. Reduce phase
• Each phase has key-value pairs as input and output.
7
Cont…
1. Map Phase: In this phase, the data in each split is passed to a mapping function to produce output values.
• The map takes a key/value pair as input.
• The data may be in structured or unstructured form.
• Keys are references to the input files, and values are the data records.
• The user can write custom business logic based on their data-processing needs; this logic is applied to every input value.
9
Cont…
2. Reduce Phase: In this phase, the reducer takes the intermediate key/value pairs as input and processes the output of the mapper.
• The key/value pairs provided to the reducer are sorted by key.
• The reducer performs aggregation, summation, or similar computations.
• Reduce produces the final output as a list of key/value pairs.
• This final output is stored in HDFS, and replication is done as usual.
10
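As a minimal sketch of the two phases, here is a plain-Python word-count simulation. The input split, the shuffle helper, and the function names are stand-ins for what the Hadoop framework provides; this is not the Hadoop API.

```python
from collections import defaultdict

def map_phase(split):
    # Mapper: emit a (word, 1) pair for every word in the input split.
    for line in split:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Group intermediate values by key (done by the framework in Hadoop),
    # and sort the keys before they reach the reducer.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    # Reducer: aggregate the list of values for one key.
    return (key, sum(values))

split = ["hello big data", "hello mapreduce"]        # hypothetical input split
intermediate = shuffle(map_phase(split))
result = [reduce_phase(k, vs) for k, vs in intermediate]
print(result)  # [('big', 1), ('data', 1), ('hello', 2), ('mapreduce', 1)]
```

Note how every stage consumes and produces key/value pairs, exactly as the slides describe.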
Anatomy of MapReduce Job Run
• Job submission: the client asks the ResourceManager for a new application ID, copies the job resources (JAR, configuration, input splits) to HDFS, and submits the job.
• Job initialization: the ResourceManager launches the MRAppMaster, which creates one map task per input split and the configured number of reduce tasks.
• Task assignment: the application master requests containers from the ResourceManager, preferring nodes that hold the input data (data locality).
• Task execution: each task runs in a YARN container, reporting progress and status back to the application master.
• Job completion: when the last task finishes, the application master and the task containers clean up, and the client is notified.
Advantages of MapReduce
Fault tolerance: It can handle failures without
downtime.
Speed: It splits, shuffles, and reduces the
unstructured data in a short time.
Cost-effective: Hadoop MapReduce has a scale-out feature that enables users to process or store data in a cost-effective manner.
Scalability: It provides a highly scalable
framework. MapReduce allows users to run
applications from many nodes.
18
Advantages of MapReduce
Parallel Processing:
• Multiple job parts of the same dataset can be processed in parallel.
• This reduces the time taken to complete the task.
Data Locality:
• Moving huge volumes of data to the computation is costly, and a single processing unit handling all the data becomes a bottleneck.
• MapReduce therefore moves the computation to the nodes where the data resides.
19
Limitations of MapReduce
• It is a batch-oriented model with high latency, so it is not suitable for real-time or interactive processing.
• Iterative algorithms are inefficient because intermediate results are written to disk between jobs.
• Every computation must be expressed as map and reduce functions, which makes complex, multi-step workflows awkward.
20
Shuffling and Sorting in MapReduce
21
Shuffling in MapReduce
• Shuffling is the process of transferring the intermediate output of the mappers to the reducers over the network.
• It groups together all values belonging to the same key, so that each reducer receives every value for the keys it is responsible for.
24
Sorting in MapReduce
• The MapReduce framework automatically sorts the mapper's output by key, so before the data is sent to the reducer, all key-value pairs are sorted.
• The sorted key-value pairs make it easy for the reducer to detect when a new reduce call should start: a change of key marks the boundary.
• If the user sets no reducer tasks, the shuffling and sorting phases do not take place; the job ends after the mapper tasks.
25
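The sort-then-group behaviour above can be sketched in a few lines of Python; the sample (key, value) pairs are hypothetical, and the sort stands in for what the framework does between map and reduce.

```python
from itertools import groupby

# Intermediate (key, value) pairs as a mapper might emit them, unordered.
pairs = [("banana", 1), ("apple", 1), ("banana", 1), ("cherry", 1)]

# The framework sorts by key before handing pairs to the reducer...
pairs.sort(key=lambda kv: kv[0])

# ...so each reduce call sees one key together with all of its values,
# and a change of key marks the start of the next reduce call.
reduced = {key: [v for _, v in group]
           for key, group in groupby(pairs, key=lambda kv: kv[0])}
print(reduced)  # {'apple': [1], 'banana': [1, 1], 'cherry': [1]}
```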
Secondary Sorting in MapReduce
• If we want the values seen by a reducer to be sorted, the secondary sorting technique is used: it enables us to sort the values (in ascending or descending order) passed to each reducer.
26
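The idea can be sketched with a composite sort key: sort by the natural key first, then by the value. The sensor names and readings below are hypothetical, and the in-memory sort stands in for the composite-key/partitioner machinery Hadoop uses.

```python
from itertools import groupby

# Hypothetical (sensor, reading) pairs emitted by mappers, unordered.
pairs = [("sensor1", 30), ("sensor2", 12), ("sensor1", 10), ("sensor1", 20)]

# Composite sort: natural key first, then value -- the "secondary" sort.
pairs.sort(key=lambda kv: (kv[0], kv[1]))

# Each reducer now receives its values already in ascending order.
sorted_values = {k: [v for _, v in g]
                for k, g in groupby(pairs, key=lambda kv: kv[0])}
print(sorted_values)  # {'sensor1': [10, 20, 30], 'sensor2': [12]}
```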
Conclusion
• Shuffling and sorting occur simultaneously, grouping and ordering the mapper's intermediate output before it reaches the reducers.
27
INPUT FORMATS
● InputFormat describes the input specification for the execution of a MapReduce job.
28
TYPES OF INPUT FORMAT
1. FileInputFormat:
• It is the base class for all file-based InputFormats.
• When a MapReduce job starts, FileInputFormat provides the path containing the files to read.
• It reads all the files and divides them into one or more InputSplits.
30
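A rough Python sketch of how a file's byte range can be carved into InputSplits. The file size and split size are hypothetical, and real FileInputFormat also takes HDFS block boundaries and the configured minimum/maximum split sizes into account.

```python
def input_splits(file_size, split_size):
    # Carve the file's byte range into (start, length) splits;
    # each split would be handled by one mapper.
    splits = []
    start = 0
    while start < file_size:
        length = min(split_size, file_size - start)
        splits.append((start, length))
        start += length
    return splits

# A 300-byte file with 128-byte splits yields three splits,
# the last one shorter than the rest.
print(input_splits(300, 128))  # [(0, 128), (128, 128), (256, 44)]
```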
TYPES OF INPUT FORMAT
2. TextInputFormat:
• It is the default InputFormat of MapReduce.
• It treats each line of each input file as a separate record.
• The key is the byte offset of the line within the file, and the value is the content of the line.
31
TYPES OF INPUT FORMAT
3. KeyValueTextInputFormat:
• It is similar to TextInputFormat, treating each line as a record.
• Instead of using the byte offset as the key, it splits each line into key and value at the first tab character.
33
TYPES OF INPUT FORMAT
5. NLineInputFormat:
• It is a form of TextInputFormat in which each split (and hence each mapper) receives exactly N lines of input.
6. DBInputFormat:
• This InputFormat reads data from a relational database, using JDBC.
35
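The DBInputFormat idea can be sketched with Python's built-in sqlite3 module standing in for JDBC; the table and rows below are hypothetical. Each database row becomes one input record handed to a mapper.

```python
import sqlite3

# An in-memory database stands in for the relational source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, item TEXT)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(1, "book"), (2, "pen")])

# Emit (row_id, row) pairs, the way DBInputFormat feeds rows to mappers.
records = [(row[0], row)
           for row in conn.execute("SELECT id, item FROM orders ORDER BY id")]
print(records)  # [(1, (1, 'book')), (2, (2, 'pen'))]
```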
OUTPUT FORMATS
• It is the counterpart of InputFormat, but for output.
• OutputFormat instances are used to write to files on the local disk or in HDFS.
• During MapReduce job execution, it validates the job's output specification, for example checking that the output directory does not already exist.
36
TYPES OF OUTPUT FORMAT
1. TextOutputFormat:
• The default OutputFormat is TextOutputFormat.
• It writes (key, value) pairs on individual lines of text files.
• Its keys and values can be of any type, because TextOutputFormat turns them into strings by calling toString() on them.
• It separates the key and the value with a tab character; this can be changed with the mapreduce.output.textoutputformat.separator property.
37
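A minimal Python sketch of what TextOutputFormat produces; the pairs are hypothetical, and the separator parameter mirrors the tab default described above.

```python
def write_text_output(pairs, separator="\t"):
    # Mimic TextOutputFormat: one "key<sep>value" line per pair,
    # converting both parts to strings first.
    return "\n".join(f"{key}{separator}{value}" for key, value in pairs)

lines = write_text_output([("hello", 2), ("world", 1)])
print(lines)  # -> "hello\t2\nworld\t1"
```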
TYPES OF OUTPUT FORMAT
2. SequenceFileOutputFormat:
• This OutputFormat writes sequence files for its output.
• Together with SequenceFileInputFormat, it is an intermediate format often used between chained MapReduce jobs.
• It serializes arbitrary data types to the file, and the corresponding SequenceFileInputFormat deserializes the file into the same types.
• It therefore presents the data to the next mapper in the same form in which the previous reducer emitted it. Static methods also control the compression.
38
TYPES OF OUTPUT FORMAT
3. SequenceFileAsBinaryOutputFormat:
• It is a variant of SequenceFileOutputFormat.
• It writes keys and values to a sequence file in binary format.
4. MapFileOutputFormat:
• It is another form of FileOutputFormat.
• It writes the output as map files.
• The framework requires keys to be added to a MapFile in order, so we need to ensure that the reducer emits keys in sorted order.
39
TYPES OF OUTPUT FORMAT
5. MultipleOutputs:
• This format allows writing data to files whose names are
derived from the output keys and values.
6. LazyOutputFormat:
• It is a wrapper OutputFormat that creates the output file for a partition only when the first record is actually written to it.
40
TYPES OF OUTPUT FORMAT
7. DBOutputFormat:
• It is the OutputFormat for writing to relational databases
and HBase.
• This format also sends the reduce output to a SQL table.
41
Failure in MapReduce
• Task failure: if a map or reduce task fails (for example, with a runtime exception or a hung task), it is rescheduled and retried a configurable number of times before the whole job is declared failed.
• Application master failure: YARN can restart a failed MRAppMaster, which can recover the state of the job's tasks.
• Node manager failure: tasks that were running on the failed node are rescheduled on other nodes.
• Resource manager failure: this is the most serious failure; high availability requires running a standby ResourceManager.
45
a) FIFO Scheduler
• FIFO (First In, First Out) is Hadoop's original default scheduler: jobs are placed in a single queue and run in the order in which they are submitted.
FIFO Advantages:
• It is simple to understand and doesn't need any configuration.
• Jobs are executed in the order of their submission.
FIFO Disadvantages:
• It is not suitable for shared clusters.
• If a large application is submitted before a shorter one, the large application will use all the resources in the cluster, and the shorter application has to wait for its turn. This leads to starvation.
46
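The starvation problem can be illustrated with a toy single-slot FIFO simulation; the job names and runtimes are hypothetical, and a real scheduler allocates containers rather than running whole jobs serially.

```python
def fifo_completion_times(jobs):
    # Toy single-slot FIFO scheduler: jobs run one after another in
    # submission order; returns each job's completion time.
    clock = 0
    finished = {}
    for name, runtime in jobs:
        clock += runtime
        finished[name] = clock
    return finished

# A long job submitted first makes the short one behind it wait.
print(fifo_completion_times([("big_job", 100), ("small_job", 2)]))
# {'big_job': 100, 'small_job': 102}
```

The 2-unit job finishes at time 102 instead of time 2 — exactly the starvation the slide describes.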
b) Capacity Scheduler
• The Capacity Scheduler allows multiple-tenants to
securely share a large Hadoop cluster.
• It is designed to run a multi-tenant cluster while maximizing the throughput and the utilization of the cluster.
• It supports hierarchical queues that partition the cluster resources.
• A queue hierarchy contains three types of queues that
are root, parent, and leaf.
• The root queue represents the cluster itself, a parent queue represents an organization/group or sub-organization/sub-group, and leaf queues accept application submissions.
47
Advantages:
• It maximizes the utilization of resources and
throughput in the Hadoop cluster.
• Provides Elasticity for groups or organizations
in a cost-effective manner.
• It also gives capacity guarantees and
safeguards to the organization utilizing cluster.
Disadvantage:
• It is more complex than the other schedulers.
48
c) Fair Scheduler
• The Fair Scheduler assigns resources to applications so that, on average, every application gets an equal share of the cluster over time.
• By default, the FairScheduler makes its scheduling fairness decisions only on the basis of memory; it can be configured to schedule with both memory and CPU.
• It also supports hierarchical queues.
• When an application is present in a queue, it gets at least its minimum share; but when a queue does not need its full guaranteed share, the excess share is split between the other running applications.
51
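The excess-share redistribution idea can be sketched as follows. This is illustrative only — the app names and demands are hypothetical, and the real FairScheduler also considers weights, minimum shares, and hierarchical queues.

```python
def fair_shares(capacity, demands):
    # Toy fair-share split: every app is entitled to an equal slice, and
    # capacity an app doesn't need is redistributed among the others.
    shares = {app: 0 for app in demands}
    remaining = dict(demands)
    free = capacity
    active = [a for a in demands if demands[a] > 0]
    while free > 0 and active:
        slice_ = free / len(active)
        free = 0
        for app in list(active):
            take = min(slice_, remaining[app])
            shares[app] += take
            remaining[app] -= take
            free += slice_ - take          # unused capacity flows back
        active = [a for a in active if remaining[a] > 0]
    return shares

# App B only needs 20 units, so its unused share is split between A and C.
print(fair_shares(100, {"A": 70, "B": 20, "C": 40}))
```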
Advantages:
• It provides a reasonable way to share the cluster between a number of users.
• It can work with application priorities, where the priorities are used as weights.
Disadvantage:
• It requires configuration.
52
Features of MapReduce
53
1. Scalability:
• Apache Hadoop is a highly scalable framework.
• This is because of its ability to store and distribute huge
data across plenty of servers.
• All these servers are inexpensive and can operate in parallel.
• We can easily scale the storage and computation power by adding servers to the cluster.
• It can handle thousands of terabytes of data.
54
2. Flexibility
• MapReduce programming enables companies to access
new sources of data.
• It enables companies to operate on different types of
data.
• It allows enterprises to access structured as well as
unstructured data, and derive significant value by gaining
insights from the multiple sources of data.
• It provides support for multiple languages, and for data from email and social media to clickstream sources.
• MapReduce is more flexible in dealing with such data than a traditional DBMS.
55
3.Security and Authentication
• The MapReduce programming model uses HBase and
HDFS security platform that allows access only to the
authenticated users to operate on the data.
• It protects the system's data from unauthorized access and enhances system security.
4. Cost-effective solution
• It allows the storage and processing of large data sets in a very affordable manner.
56
5. Fast
• Even when dealing with large volumes of unstructured data, Hadoop MapReduce can process terabytes of data in minutes and petabytes in hours.
• This allows for faster processing of data.
6. Simple model of programming
• It allows programmers to develop MapReduce programs that handle tasks easily and efficiently.
• Anyone can easily learn and write MapReduce programs to meet their data-processing business needs, since the framework is written in Java.
57
7. Parallel Programming
• One of the major aspects of MapReduce programming is parallel processing.
• It divides the tasks in a manner that allows their execution in parallel.
• Multiple processors can execute these divided tasks, so the entire program runs in less time.
58
8. Availability and resilient nature
• Whenever data is sent to an individual node, the same set of data is forwarded to some other nodes in the cluster.
• So, if any particular node suffers a failure, there are always other copies present on other nodes that can still be accessed whenever needed.
• This assures high availability of data.
• One of the major features offered by Apache Hadoop is its fault tolerance: the Hadoop MapReduce framework has the ability to quickly recognize faults that occur and then apply a quick, automatic recovery solution.
• This feature makes it a game-changer in the world of big data processing.
59