Hadoop Karunesh
What is Hadoop?
Hadoop is an open-source distributed processing framework that manages data processing and
storage for Big Data applications running on clustered systems. It is also used for predictive
analytics, data mining, and machine learning applications.
Hadoop can handle various forms of structured and unstructured data, giving users more
flexibility for collecting, processing, and analyzing data than relational databases and data
warehouses provide.
1. HDFS (Hadoop Distributed File System) – HDFS is a block-structured file system in which
each file is divided into blocks of a predetermined size. These blocks are stored across a cluster
of one or more machines. The Apache Hadoop HDFS architecture follows a master/slave design,
where a cluster comprises a single NameNode (master node) and all the other nodes are
DataNodes (slave nodes). HDFS can be deployed on a broad spectrum of machines that support
Java. Although one can run several DataNodes on a single machine, in practice these DataNodes
are spread across various machines.
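To make the block structure concrete, the following minimal sketch (not part of the original notes) uses the HDFS FileSystem API to print each block of a file and the DataNodes that hold its replicas; the path /data/sample.txt is only an assumed example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);          // handle to the configured file system
        Path file = new Path("/data/sample.txt");      // assumed example path
        FileStatus status = fs.getFileStatus(file);
        // One BlockLocation per block: its offset, length, and the DataNodes holding replicas.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
    }
}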
NameNode/Master Node
• Stores metadata for the files, like the directory structure of a typical file system.
• The server holding the NameNode instance is critical, as there is only one per cluster.
• Maintains a transaction log for file deletes, adds, etc. Transactions cover only metadata, not
whole blocks or file streams.
• Handles the creation of additional replica blocks when necessary after a DataNode failure.
DataNode/Slave Node
• Stores the actual data blocks of HDFS files on the local disks of the machine it runs on.
• Serves read and write requests from clients and performs block creation, deletion, and
replication as instructed by the NameNode.
• Periodically sends heartbeats and block reports to the NameNode so the master knows which
blocks are available.
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader
processes each input record and generates the corresponding key-value pair. Hadoop's Mapper
stores this intermediate data on the local disk.
Input Split
It is the logical representation of the data. It represents a chunk of work that is processed by a
single map task in the MapReduce program.
RecordReader
It interacts with the input split and converts the data it reads into key-value pairs.
Reducer Class
The intermediate output generated by the mapper is fed to the reducer, which processes it and
generates the final output, which is then saved in HDFS.
Driver Class
The major component in a MapReduce job is the Driver class. It is responsible for setting up a
MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes,
along with the data types and the respective job name.
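As a hedged sketch (not part of the original notes), a minimal WordCount driver could look like the one below; WordCountMapper and WordCountReducer are assumed class names (sketched later in these notes), and the input/output paths are taken from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "word count");              // job name
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);          // assumed mapper class
        job.setReducerClass(WordCountReducer.class);        // assumed reducer class
        job.setOutputKeyClass(Text.class);                  // output key data type
        job.setOutputValueClass(IntWritable.class);         // output value data type
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output path
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}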
The Process of MapReduce
1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part of the
job simultaneously. MapReduce is thus based on the divide-and-conquer paradigm, which helps
us process the data using different machines. Because the data is processed by multiple machines
in parallel instead of by a single machine, the time taken to process the data is reduced
tremendously.
2. Data Locality:
Instead of moving data to the processing unit, we are moving the processing unit to the data in
the MapReduce Framework.
Why Is the Input Split Important?
Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk
of data. It is a Java class with pointers to the start and end locations within blocks, so when a
mapper reads the data it knows exactly where to start reading and where to stop. The start of an
InputSplit can lie in one block and its end in another block.
An InputSplit respects logical record boundaries, and that is why it is so important. During
MapReduce execution, Hadoop scans the blocks, creates InputSplits, and assigns each InputSplit
to an individual mapper for processing.
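As a side note (an illustrative assumption, not from the original notes), the size of the InputSplits that Hadoop computes can be bounded from the driver; with FileInputFormat the effective split size is roughly max(minSize, min(maxSize, blockSize)). The numbers below are only examples.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split size demo");
        // Illustrative bounds only: they steer how large each InputSplit (and so each map task) is.
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // at least 64 MB per split
        FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);  // at most 256 MB per split
    }
}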
Q3: Discuss the functions of the Driver, Mapper and Reducer in Hadoop. You can take the
word count problem as an example.
Mapper Class
The first stage in data processing using MapReduce is the Mapper class. Here, the RecordReader
processes each input record and generates the corresponding key-value pair. Hadoop's Mapper
stores this intermediate data on the local disk.
Input Split
It is the logical representation of the data. It represents a chunk of work that is processed by a
single map task in the MapReduce program.
RecordReader
It interacts with the input split and converts the data it reads into key-value pairs.
Driver Class
The major component in a MapReduce job is the Driver class. It is responsible for setting up a
MapReduce job to run in Hadoop. We specify the names of the Mapper and Reducer classes,
along with the data types and the respective job name.
Input Splits:
The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split
is a chunk of the input that is consumed by a single map task.
Mapping
This is the very first phase in the execution of a map-reduce program. In this phase, the data in
each split is passed to a mapping function to produce output values. In our example, the job of
the mapping phase is to count the number of occurrences of each word in the input splits and
prepare a list in the form of <word, frequency>.
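A hedged sketch of the mapper for this phase (not part of the original notes; the class name WordCountMapper matches the driver sketch shown earlier):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1); // hardcoded value 1 per token
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Tokenize the line and emit <word, 1> for every token.
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}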
Shuffling
This phase consumes the output of the mapping phase. Its task is to consolidate the relevant
records from the mapping phase output. In our example, the same words are clubbed together
along with their respective frequencies.
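During shuffling, map output is routed to reducers by a partitioner. Hadoop's default HashPartitioner already behaves like the illustrative sketch below, which is shown here (not in the original notes) only to make the idea concrete:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Every occurrence of the same word hashes to the same reducer,
// which is what lets the shuffle phase club identical keys together.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}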
Reducing
In this phase, the output values from the shuffling phase are aggregated. This phase combines the
values from the shuffling phase and returns a single output value. In short, this phase summarizes
the complete dataset.
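A hedged sketch of the corresponding reducer (not part of the original notes; the class name WordCountReducer matches the driver sketch shown earlier):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the list of 1s for this word, e.g. Bear, [1,1] -> Bear, 2.
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}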
Hadoop divides the job into tasks. There are two types of tasks: Map tasks (splits and mapping)
and Reduce tasks (shuffling and reducing), as described above.
The complete execution process (execution of both Map and Reduce tasks) is controlled by two
types of entities: a JobTracker and multiple TaskTrackers.
For every job submitted for execution in the system, there is one JobTracker, which resides on
the NameNode, and there are multiple TaskTrackers, which reside on DataNodes.
To see how MapReduce works, take an example where we have a text file called abc.txt whose
contents are as follows:
Dear, Bear, River, Car, Car, River, Deer, Car and Bear
• First, we divide the input into three splits as shown in the figure. This will distribute the work
among all the map nodes.
• Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to each of
the tokens or words. The rationale behind giving a hardcoded value equal to 1 is that every word,
in itself, will occur once.
• Now, a list of key-value pairs will be created where the key is nothing but the individual word
and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs –
Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
• After the mapper phase, a partition process takes place where sorting and shuffling happen so
that all the tuples with the same key are sent to the corresponding reducer.
• So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key, for example Bear, [1,1]; Car, [1,1,1]; etc.
• Now, each reducer counts the values present in its list of values. As shown in the figure, the
reducer gets the list of values [1,1] for the key Bear. It then counts the number of ones in the list
and gives the final output as Bear, 2.
• Finally, all the output key-value pairs are collected and written to the output file.
Q2: Explain the steps required for the configuration of Eclipse with Apache Hadoop.
Support your answer with screenshots.
Eclipse is the most popular Integrated Development Environment (IDE) for developing Java
applications. It is a robust, feature-rich, easy-to-use and powerful IDE that is the first choice of
most Java programmers, and it is completely free.
1. Download Eclipse:
https://www.eclipse.org/downloads/download.php?file=/oomph/epp/2020
After downloading the file, we need to move it to the home directory.
2. Extract the file:
There are two ways to extract the file: either right-click on it and choose "Extract Here", or run
$ tar -xzvf eclipse-java-inst-1-linux-gtk-x86_64.tar.gz
3. Create a Java Project in the Package Explorer:
By clicking on the left side of the interface:
• File > New > Java Project > (Name it – WordCountProject1) > Finish
The new Java Project has now been created.
o After that, we need to create the package by clicking:
o Right Click > New > Package > (Name it – WordCountPackage) > Finish
o After creating the package, the third step is to create the class:
o Right Click on Package > New > Class > (Name it – WordCount)
Then add the following libraries to the project's build path:
• Downloads/hadoop-core-1.2.1.jar
• Downloads/commons-cli-1.2.jar