This document discusses big data analytics using Hadoop distributions and the MapReduce framework. It provides an overview of top commercial Hadoop vendors such as Amazon EMR, Hortonworks, Cloudera and IBM Infosphere. It also summarizes the latest Hadoop versions, the differences between versions, new features in Hadoop 3 and the difference between YARN and MapReduce. Finally, it explains MapReduce operations using a word count example.
Sathyavathi.S, Department of Information Technology

Overview of this Lecture: Hadoop Distributions

Top Commercial Hadoop Vendors
• 1) Amazon Web Services Elastic MapReduce Hadoop Distribution
• 2) Hortonworks Hadoop Distribution
• 3) Cloudera Hadoop Distribution
• 4) MapR Hadoop Distribution
• 5) IBM Infosphere BigInsights Hadoop Distribution

Hadoop Versions
• 31 May 2018: Release 3.0.3 available. ...
• 15 May 2018: Release 2.8.4 available. This is the next release in the Apache Hadoop 2.8 release line. ...
• 3 May 2018: Release 2.9. ...
• 21 April 2018: Release 3.0. ...
• 16 April 2018: Release 2.7. ...
• 6 April 2018: Release 3.1. ...
• 25 March 2018: Release 3.0. ...
• 14 December 2017: Release 2.7. ...

Stable Version of Hadoop
• Hadoop 2.7.2 is the latest stable iteration and is ready for production use. If you are still on 2.6.x, plan your move to 2.7.
• Apache Spark: hailed as the de facto successor to the already popular Hadoop, Apache Spark is used as a computational engine for Hadoop data. Unlike Hadoop's MapReduce, Spark provides a large increase in computational speed and offers full support for the various applications the tool offers.

Difference Between Hadoop Versions
• Hadoop 1 supports only the MapReduce processing model in its architecture; it does not support non-MapReduce tools.
• Hadoop 2, on the other hand, supports the MapReduce model as well as other distributed computing models such as Spark, Hama, Giraph, MPI (Message Passing Interface) and HBase coprocessors.
• Hadoop 3 creates one parity block for every two blocks of data. This requires only 1.5 times more disk space, compared with the 3 times more required by replication in Hadoop 2.
• The level of fault tolerance in Hadoop 3 remains the same, but less disk space is required for its operations.

New Feature in Hadoop 3
• Erasure coding, a new feature in HDFS, can reduce storage overhead by approximately 50% compared to replication while maintaining the same durability guarantees. ... By default, HDFS replicates each block three times. Replication provides a simple and robust form of redundancy to shield against most failure scenarios.
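Erasure coding in HDFS is applied per directory. A minimal command-line sketch, assuming a Hadoop 3.x cluster; the directory /data/cold is an illustrative path, and RS-6-3-1024k is one of the built-in Reed-Solomon policies:

    # list the erasure coding policies available on the cluster
    hdfs ec -listPolicies
    # apply a policy to a directory; files written there afterwards are erasure coded
    hdfs ec -setPolicy -path /data/cold -policy RS-6-3-1024k
    # confirm which policy is now in effect
    hdfs ec -getPolicy -path /data/cold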
Difference Between YARN & MAPREDUCE
• YARN is a generic platform for running any distributed application.
• MapReduce version 2 is a distributed application that runs on top of YARN, whereas MapReduce itself is the processing unit of Hadoop: it processes data in parallel in the distributed environment.

Traditional Approach

Challenges of the Traditional Approach
1. Critical path problem: the amount of time taken to finish the job without delaying the next milestone or the actual completion date. So, if any of the machines delays the job, the whole work gets delayed.
2. Reliability problem: what if any of the machines working on a part of the data fails? Managing this failover becomes a challenge.
3. Equal split issue: how do I divide the data into smaller chunks so that each machine gets an even part of the data to work with? In other words, how do I divide the data equally so that no individual machine is overloaded or underutilized?
4. A single split may fail: if any of the machines fails to provide its output, I will not be able to calculate the result. There should be a mechanism to ensure the fault tolerance of the system.
5. Aggregation of the result: there should be a mechanism to aggregate the results generated by each of the machines to produce the final output.

MAP REDUCE
• To overcome these issues, we have the MapReduce framework, which allows us to perform such parallel computations without bothering about issues like reliability and fault tolerance. Therefore, MapReduce gives you the flexibility to write code logic without caring about the design issues of the system.
• MapReduce is a programming model that allows us to perform parallel and distributed processing on huge data sets.
• MapReduce consists of two distinct tasks: Map and Reduce.
• As the name MapReduce suggests, the reducer phase takes place after the mapper phase has been completed.
• So, the first is the map job, where a block of data is read and processed to produce key-value pairs as intermediate outputs.
• The output of a Mapper, or map job (key-value pairs), is the input to the Reducer.
• The Reducer receives the key-value pairs from multiple map jobs.
• Then, the Reducer aggregates those intermediate data tuples (intermediate key-value pairs) into a smaller set of tuples or key-value pairs, which is the final output.

MAP REDUCE OPERATION
MapReduce majorly has the following three classes:
• Mapper Class: the first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader processes each input record and generates the respective key-value pair. Hadoop's Mapper store saves this intermediate data to the local disk.
• Input Split: the logical representation of data. It represents a block of work that contains a single map task in the MapReduce program.
• RecordReader: it interacts with the Input Split and converts the obtained data into key-value pairs.
• Reducer Class: the intermediate output generated by the Mapper is fed to the Reducer, which processes it and generates the final output, which is then saved in HDFS.
• Driver Class: the major component in a MapReduce job is the Driver Class. It is responsible for setting up a MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes along with the data types and their respective job names. (A sketch of all three classes appears after the word count steps below.)

Example
• A Word Count Example of MapReduce
• Let us understand how MapReduce works by taking an example where I have a text file called sample.txt whose contents are as follows:
• Dear, Bear, River, Car, Car, River, Deer, Car and Bear
• Now, suppose we have to perform a word count on sample.txt using MapReduce. So, we will be finding the unique words and the number of occurrences of each unique word.

STEPS IN MAP REDUCE
• First, we divide the input into three splits as shown in the figure. This distributes the work among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word, in itself, occurs once.
• Now, a list of key-value pairs is created where the key is the individual word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs: Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place in which sorting and shuffling happen, so that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of values corresponding to that key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each Reducer counts the values present in its list of values. As shown in the figure, the Reducer gets the list of values [1,1] for the key Bear. Then, it counts the number of ones in the list and gives the final output: Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file.
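Putting the Mapper, Reducer and Driver classes together, here is a minimal sketch of the word count job in Java using the standard Hadoop MapReduce API. The class names (WordCount, TokenizerMapper, IntSumReducer) are illustrative choices, not part of the original slides:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Mapper Class: tokenizes each input line and emits (word, 1) pairs,
      // e.g. (Dear, 1), (Bear, 1), (River, 1) for the first line.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
          }
        }
      }

      // Reducer Class: after shuffle and sort it receives (word, [1,1,...])
      // and sums the ones, e.g. (Bear, [1,1]) becomes (Bear, 2).
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      // Driver Class: wires the Mapper and Reducer into a Job and sets
      // the output key/value types and the input/output paths.
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Setting the combiner to IntSumReducer is optional; it pre-aggregates the (word, 1) pairs on each map node before the shuffle, which reduces network traffic.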
ADVANTAGES OF MAP REDUCE
The two biggest advantages of MapReduce are:
• 1. Parallel Processing: MapReduce is based on the divide-and-conquer paradigm, which helps us process the data using different machines in parallel.
• 2. Data Locality: instead of moving data to the processing unit, in the MapReduce framework we move the processing unit to the data. In the traditional system, we used to bring the data to the processing unit and process it there. But as the data grew and became very huge, bringing this huge amount of data to the processing unit posed the following issues:
• Moving huge data to processing is costly and deteriorates network performance.
• Processing takes time, as the data is processed by a single unit, which becomes the bottleneck.
• The master node can get overburdened and may fail.
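To actually run the word count sketch above, the usual flow is to package it into a jar and submit it with the hadoop launcher. The jar name and HDFS paths below are illustrative, and HADOOP_CLASSPATH is assumed to be set up for compilation:

    # compile against the Hadoop client libraries and package the classes
    hadoop com.sun.tools.javac.Main WordCount.java
    jar cf wc.jar WordCount*.class
    # submit the job; /user/input holds sample.txt and /user/output must not exist yet
    hadoop jar wc.jar WordCount /user/input /user/output
    # inspect the result, e.g. "Bear 2", "Car 3", ...
    hdfs dfs -cat /user/output/part-r-00000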