Hadoop Karunesh


Q1. Explain the architecture of Apache Hadoop in detail. Justify the presence of input splits before the MapReduce model is applied on the block data in HDFS.

What is Hadoop?

Hadoop is an open-source distributed processing framework that manages data processing and
storage for Big Data applications running on clustered systems. It underpins applications such as
predictive analytics, data mining, and machine learning.

Hadoop can handle various forms of structured and unstructured data, giving users more
flexibility for collecting, processing, and analyzing data than relational databases and data
warehouses provide.

The Hadoop architecture is composed of two main systems:

1. HDFS (Hadoop Distributed File System) – HDFS is a block-structured file system where
each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster
of one or several machines. The Apache Hadoop HDFS architecture follows a Master/Slave
architecture, where a cluster comprises a single NameNode (Master node) and all the other
nodes are DataNodes (Slave nodes). HDFS can be deployed on a broad spectrum of machines
that support Java. Though one can run several DataNodes on a single machine, in practice
these DataNodes are spread across various machines.

NameNode/Master Node

 Stores metadata for the files, like the directory structure of a typical FS.
 The server holding the NameNode instance is quite crucial, as there is only one.
 Maintains a transaction log for file deletes/adds, etc. Transactions cover only metadata,
not whole blocks or file streams.
 Handles creation of more replica blocks when necessary after a DataNode failure.

DataNode/Slave Node

 Stores the actual data in HDFS
 Can run on any underlying filesystem (ext3/ext4, NTFS, etc.)
 Notifies the NameNode of which blocks it holds
 With the default replication factor of 3, the NameNode places two replicas in one rack
and one in a different rack (a client can query these block locations, as sketched below)
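To illustrate this division of labour, a client can ask the NameNode where the blocks of a file live
without reading any data from the DataNodes. Below is a minimal sketch using the HDFS Java API;
the file path and class name are only illustrative, not part of any standard example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path, used only for illustration.
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // A pure metadata query answered by the NameNode; no block data is transferred.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " replicas on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}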

2. MapReduce: MapReduce is a programming framework that allows us to perform
distributed and parallel processing on large data sets across a cluster of machines.

 MapReduce consists of two distinct tasks – Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes place after the mapper phase
has been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.

Mapper Class

The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader
processes each input record and generates the respective key-value pair. Hadoop stores this
intermediate mapper output on the local disk.

 Input Split

It is the logical representation of data. It represents a block of work that is processed by a single
map task in the MapReduce program.

 RecordReader

It interacts with the Input Split and converts the data it reads into key-value pairs.

 Reducer Class

The intermediate output generated by the mapper is fed to the reducer, which processes it and
generates the final output, which is then saved in HDFS.

 Driver Class

The major component in a MapReduce job is the Driver Class. It is responsible for setting up a
MapReduce job to run in Hadoop. Here we specify the names of the Mapper and Reducer classes
along with the data types and the job name.
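A minimal driver sketch for the word count job discussed later in this document. The class names
WordCountDriver, WordCountMapper and WordCountReducer are illustrative; the mapper and
reducer themselves are sketched further below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On newer Hadoop releases this would be Job.getInstance(conf, "word count").
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Names of the Mapper and Reducer classes (sketched later in this document).
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key/value data types of the job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths in HDFS, taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}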

The Process of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part
of the job simultaneously. MapReduce is thus based on the Divide and Conquer paradigm, which
helps us process the data using different machines. As the data is processed in parallel by multiple
machines instead of a single machine, the time taken to process the data is reduced
tremendously.

2. Data Locality:
Instead of moving data to the processing unit, we move the processing unit to the data in
the MapReduce framework.

 It is very cost-effective to move the processing unit to the data.
 The processing time is reduced as all the nodes work on their part of the data in
parallel.
 Every node gets a part of the data to process, so there is no chance of a node
getting overburdened.

Why is the Input Split Important?

Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk of
data. It is a Java class with pointers to the start and end locations within blocks. So when a mapper
reads the data, it knows exactly where to start reading and where to stop. The start of an InputSplit
can lie in one block and its end in another block.

InputSplits respect logical record boundaries, and that is why they are so important. During
MapReduce execution, Hadoop scans through the blocks, creates InputSplits, and assigns each
InputSplit to an individual mapper for processing.
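The number and size of InputSplits can be influenced from the driver. A small sketch, assuming the
standard FileInputFormat of the org.apache.hadoop.mapreduce API; the class name and the 32 MB
and 64 MB bounds are example values only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split size demo");  // illustrative job name

        // Lower and upper bounds (in bytes) on the size of each InputSplit.
        // FileInputFormat derives the actual split size from these bounds and the
        // HDFS block size, and each resulting split is handed to one map task.
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);  // 32 MB
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB
    }
}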

Q3: Discuss the functions of the Driver, Mapper and Reducer in Hadoop. You can take
the example of the word count problem.

 MapReduce consists of two distinct tasks – Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes place after the mapper phase
has been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.

MapReduce mainly involves the following classes and concepts:

Mapper Class

The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader
processes each input record and generates the respective key-value pair. Hadoop stores this
intermediate mapper output on the local disk.

Input Split

It is the logical representation of data. It represents a block of work that is processed by a single
map task in the MapReduce program.

RecordReader

It interacts with the Input Split and converts the data it reads into key-value pairs.

Driver Class

The major component in a MapReduce job is the Driver Class. It is responsible for setting up a
MapReduce job to run in Hadoop. Here we specify the names of the Mapper and Reducer classes
along with the data types and the job name.

The data goes through the following phases:

Input Splits:

The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a
chunk of the input that is consumed by a single map task.

Mapping

This is the very first phase in the execution of a MapReduce program. In this phase, the data in each
split is passed to a mapping function to produce output values. In our example, the job of the mapping
phase is to count the number of occurrences of each word in the input splits (input splits were
described above) and prepare a list in the form of <word, frequency>.
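A minimal mapper sketch for this word count example (the class name is illustrative). Note that the
mapper emits <word, 1> pairs; the final frequencies are produced later, in the Reducing phase.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The RecordReader delivers one line of the split at a time;
        // tokenize it and emit <word, 1> for every word found.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}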

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records
from the Mapping phase output. In our example, the same words are clubbed together along with
their respective frequencies.

Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase combines the values
from the Shuffling phase and returns a single output value. In short, this phase summarizes the
complete dataset.
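A matching reducer sketch (class name illustrative): for each word it receives the shuffled list of 1s
and sums them into the final frequency.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All values for one key arrive together after shuffling and sorting,
        // e.g. Bear -> [1, 1]; summing them gives the final count Bear -> 2.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}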

How Does MapReduce Organize Work?

Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)

as mentioned above.

The complete execution process (of both Map and Reduce tasks) is controlled by two
types of entities:

1. JobTracker: acts like a master (responsible for the complete execution of a submitted job)
2. Multiple TaskTrackers: act like slaves, each of them performing a part of the job

For every job submitted for execution in the system, there is one JobTracker, which resides
on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.

A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where we have a text file called
abc.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

 First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to
each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is
that every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is nothing but the individual
word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs
– Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
 After the mapper phase, a partition process takes place where sorting and shuffling
happen so that all the tuples with the same key are sent to the corresponding reducer (see
the partitioner sketch after this list).
 So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
 Now, each reducer counts the values present in its list of values. As shown in
the figure, the reducer gets the list of values [1,1] for the key Bear. It then counts the
number of ones in the list and gives the final output as – Bear, 2.
 Finally, all the output key-value pairs are collected and written to the output file.
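The partition step mentioned above is what routes every tuple with the same key to the same reducer.
Hadoop's default hash partitioner already behaves this way; the class below is only an illustrative
sketch of the same idea, not the built-in implementation.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Identical keys always hash to the same partition, so every (Bear, 1)
        // pair is sent to the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It could be plugged in from the driver with job.setPartitionerClass(HashLikePartitioner.class),
although the default partitioner already gives the same behaviour.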

Q2: Explain the steps required for the configuration of eclipse with Apache Hadoop.
Support your answer with screenshots.

Eclipse is the most popular Integrated Development Environment (IDE) for developing Java
applications. It is a robust, feature-rich, easy-to-use and powerful IDE that is the first choice of
most Java programmers, and it is free.

1. Download and Install Eclipse IDE

https://www.eclipse.org/downloads/download.php?file=/oomph/epp/2020

After downloading the file, we need to move this file to the home directory.

2. Extracting the file

There are two ways to extract the file: either right-click on the file and choose Extract Here, or
unzip it using the following command:

$ tar -xzvf eclipse-java-inst-1-linux-gtk-x86_64.tar.gz

3. Choose a Workspace Directory:

Eclipse organizes projects by workspaces. A workspace is a group of related projects and it is
actually a directory on your computer. That is why, when you start Eclipse, it asks you to choose a
workspace location like this:

4. Create a Java Project in Package Explorer:
In the Package Explorer on the left side of the interface:
• File > New > Java Project > (Name it – WordCountProject1) > Finish
The new Java project has been created.

o After that, we need to create the package:
o Right Click > New > Package (Name it - WordCountPackage) > Finish
o After creating the package, the next step is to create the class:
 Right Click on Package > New > Class (Name it - WordCount)

5. Download Hadoop Libraries

Add the following reference libraries:
hadoop-core-1.2.1.jar
commons-cli-1.2.jar

To add these libraries:

Right Click on Project > Build Path > Add External Archives

Downloads/hadoop-core-1.2.1.jar

Downloads/commons-cli-1.2.jar
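Once the jars are on the build path, a quick sanity check is that a tiny class touching the Hadoop API
compiles without errors; the class and job names below are illustrative (actually running it from
Eclipse may require additional Hadoop dependency jars beyond these two).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ClasspathCheck {
    public static void main(String[] args) throws Exception {
        // If hadoop-core-1.2.1.jar and commons-cli-1.2.jar were added correctly,
        // this class compiles; running it simply prints the job name.
        Job job = new Job(new Configuration(), "classpath-check");
        System.out.println("Hadoop classes resolved; job name: " + job.getJobName());
    }
}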

