This document discusses big data analytics using Hadoop distributions and the MapReduce framework. It provides an overview of top commercial Hadoop vendors such as Amazon EMR, Hortonworks, Cloudera and IBM Infosphere. It also summarizes the latest Hadoop versions, the differences between versions, new features in Hadoop 3 and the difference between YARN and MapReduce. Finally, it explains MapReduce operations using a word count example.
Sathyavathi.S, Department of Information Technology

Overview of this Lecture: Hadoop Distributions

Top Commercial Hadoop Vendors
• 1) Amazon Web Services Elastic MapReduce Hadoop Distribution
• 2) Hortonworks Hadoop Distribution
• 3) Cloudera Hadoop Distribution
• 4) MapR Hadoop Distribution
• 5) IBM Infosphere BigInsights Hadoop Distribution

Hadoop Versions
• 31 May 2018: Release 3.0.3 available. ...
• 15 May 2018: Release 2.8.4 available. This is the next release in the Apache Hadoop 2.8 release line. ...
• 3 May 2018: Release 2.9. ...
• 21 April 2018: Release 3.0. ...
• 16 April 2018: Release 2.7. ...
• 6 April 2018: Release 3.1. ...
• 25 March 2018: Release 3.0. ...
• 14 December 2017: Release 2.7. ...

Stable Version of Hadoop
• Hadoop 2.7.2 is the latest stable iteration and is ready for production use. If you are still on 2.6.x, plan your move to 2.7.
• Apache Spark: hailed as the de facto successor to the already popular Hadoop, Apache Spark is used as a computational engine for Hadoop data. Unlike Hadoop's MapReduce, Spark provides a large increase in computational speed and offers full support for the various applications the tool offers.

Difference Between Hadoop Versions
• Hadoop 1 supports only the MapReduce processing model in its architecture; it does not support non-MapReduce tools.
• Hadoop 2, on the other hand, supports the MapReduce model as well as other distributed computing models such as Spark, Hama, Giraph, MPI (Message Passing Interface) and HBase coprocessors.
• Hadoop 3 creates one parity block for every two blocks of data. This requires only 1.5 times more disk space, compared with the 3 times more required by replication in Hadoop 2.
• The level of fault tolerance in Hadoop 3 remains the same, but less disk space is required for its operations.

New Feature in Hadoop 3
• Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compared to replication while maintaining the same durability guarantees. ... By default, HDFS replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios.
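Erasure coding in HDFS is applied per directory. A minimal command-line sketch, assuming a Hadoop 3.x cluster; the directory /data/cold is an illustrative path, and RS-6-3-1024k is one of the built-in Reed-Solomon policies:

    # list the erasure coding policies available on the cluster
    hdfs ec -listPolicies
    # apply a policy to a directory; files written there afterwards are erasure coded
    hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
    # confirm which policy is now in effect
    hdfs ec -getPolicy -path /data/cold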
Difference Between YARN & MAPREDUCE
• YARN is a generic platform for running any distributed application.
• MapReduce version 2 is a distributed application that runs on top of YARN, whereas MapReduce itself is the processing unit of Hadoop: it processes data in parallel in the distributed environment.

Traditional Approach

Challenges of the Traditional Approach
1. Critical path problem: the amount of time taken to finish the job without delaying the next milestone or the actual completion date. So, if any of the machines delays the job, the whole work gets delayed.
2. Reliability problem: what if any of the machines working on a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: how do I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how do I divide the data equally so that no individual machine is overloaded or underutilized?
4. A single split may fail: if any of the machines fails to provide its output, I will not be able to calculate the result. There should be a mechanism to ensure the fault tolerance of the system.
5. Aggregation of the result: there should be a mechanism to aggregate the results generated by each of the machines to produce the final output.

MAP REDUCE
• To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without bothering about issues like reliability and fault tolerance. Therefore, MapReduce gives you the flexibility to write code logic without caring about the design issues of the system.
• MapReduce is a programming model that allows us to perform parallel and distributed processing on huge data sets.
• MapReduce consists of two distinct tasks: Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
• The output of a Mapper, or map job (key-value pairs), is the input to the Reducer.
• The Reducer receives the key-value pairs from multiple map jobs.
• Then, the Reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.

MAP REDUCE OPERATION
MapReduce majorly has the following three classes:
• Mapper Class: the first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader processes each input record and generates the respective key-value pair. Hadoop's Mapper store saves this intermediate data to the local disk.
• Input Split: the logical representation of data. It represents a block of work that contains a single map task in the MapReduce program.
• RecordReader: it interacts with the Input Split and converts the obtained data into key-value pairs.
• Reducer Class: the intermediate output generated by the Mapper is fed to the Reducer, which processes it and generates the final output, which is then saved in HDFS.
• Driver Class: the major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names. (A sketch of all three classes appears after the word count steps below.)

Example
• A Word Count Example of MapReduce
• Let us understand how MapReduce works by taking an example where I have a text file called sample.txt whose contents are as follows:
• Dear, Bear, River, Car, Car, River, Deer, Car and Bear
• Now, suppose we have to perform a word count on sample.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of each unique word.

STEPS IN MAP REDUCE
• First, we divide the input into three splits as shown in the figure. This distributes the work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
• Now, a list of key-value pairs is created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place in which sorting and shuffling happen, so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each Reducer counts the values present in its list of values. As shown in the figure, the Reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of ones in the list and gives the final output: Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file.
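Putting the Mapper, Reducer and Driver classes together, here is a minimal sketch of the word count job in Java using the standard Hadoop MapReduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative choices, not part of the original slides:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper Class: tokenizes each input line and emits (word, 1) pairs,
      // e.g. (Dear, 1), (Bear, 1), (River, 1) for the first line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer Class: after shuffle and sort it receives (word, [1,1,...])
      // and sums the ones, e.g. (Bear, [1,1]) becomes (Bear, 2).
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver Class: wires the Mapper and Reducer into a Job and sets
      // the output key/value types and the input/output paths.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Setting the combiner to IntSumReducer is optional; it pre-aggregates the (word, 1) pairs on each map node before the shuffle, which reduces network traffic.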
ADVANTAGES OF MAP REDUCE
The two biggest advantages of MapReduce are:
• 1. Parallel Processing: MapReduce is based on the divide-and-conquer paradigm, which helps us process the data using different machines in parallel.
• 2. Data Locality: instead of moving data to the processing unit, in the MapReduce framework we move the processing unit to the data. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew and became very huge, bringing this huge amount of data to the processing unit posed the following issues:
• Moving huge data to processing is costly and deteriorates network performance.
• Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get overburdened and may fail.
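To actually run the word count sketch above, the usual flow is to package it into a jar and submit it with the hadoop launcher. The jar name and HDFS paths below are illustrative, and HADOOP_CLASSPATH is assumed to be set up for compilation:

    # compile against the Hadoop client libraries and package the classes
    hadoop com.sun.tools.javac.Main WordCount.java
    jar cf wc.jar WordCount*.class
    # submit the job; /user/input holds sample.txt and /user/output must not exist yet
    hadoop jar wc.jar WordCount /user/input /user/output
    # inspect the result, e.g. "Bear 2", "Car 3", ...
    hdfs dfs -cat /user/output/part-r-00000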