Hadoop Karunesh


Q1. Explain the architecture of Apache Hadoop in detail. Justify the presence of input splits before the MapReduce model is applied on the block data in HDFS.

What is Hadoop?

Hadoop is an open-source distributed processing framework that manages data processing and
storage for Big Data applications running on clustered systems. It underpins applications such as
predictive analytics, data mining, and machine learning.

Hadoop can handle various forms of structured and unstructured data, giving users more
flexibility for collecting, processing, and analyzing data than relational databases and data
warehouses provide.

The Hadoop architecture is composed of two main systems:

1. HDFS (Hadoop Distributed File System) – HDFS is a block-structured file system where
each file is divided into blocks of a pre-determined size. These blocks are stored across a cluster
of one or several machines. The Apache Hadoop HDFS architecture follows a Master/Slave
architecture, where a cluster comprises a single NameNode (Master node) and all the other
nodes are DataNodes (Slave nodes). HDFS can be deployed on a broad spectrum of machines
that support Java. Though one can run several DataNodes on a single machine, in practice
these DataNodes are spread across various machines.

NameNode/Master Node

 Stores metadata for the files, like the directory structure of a typical FS.
 The server holding the NameNode instance is quite crucial, as there is only one.
 Maintains a transaction log for file deletes/adds, etc. Transactions cover only metadata,
not whole blocks or file streams.
 Handles creation of more replica blocks when necessary after a DataNode failure.

DataNode/Slave Node

 Stores the actual data in HDFS
 Can run on any underlying filesystem (ext3/ext4, NTFS, etc.)
 Notifies the NameNode of which blocks it holds
 With the default replication factor of 3, the NameNode places two replicas in one rack
and one in a different rack (a client can query these block locations, as sketched below)
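To illustrate this division of labour, a client can ask the NameNode where the blocks of a file live
without reading any data from the DataNodes. Below is a minimal sketch using the HDFS Java API;
the file path and class name are only illustrative, not part of any standard example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Picks up fs.defaultFS (the NameNode address) from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path, used only for illustration.
        Path file = new Path("/user/demo/sample.txt");
        FileStatus status = fs.getFileStatus(file);

        // A pure metadata query answered by the NameNode; no block data is transferred.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " replicas on: " + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}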

2. MapReduce: MapReduce is a programming framework that allows us to perform
distributed and parallel processing on large data sets across a cluster of machines.

 MapReduce consists of two distinct tasks – Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes place after the mapper phase
has been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.

Mapper Class

The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader
processes each input record and generates the respective key-value pair. Hadoop stores this
intermediate mapper output on the local disk.

 Input Split

It is the logical representation of data. It represents a block of work that is processed by a single
map task in the MapReduce program.

 RecordReader

It interacts with the Input Split and converts the data it reads into key-value pairs.

 Reducer Class

The intermediate output generated by the mapper is fed to the reducer, which processes it and
generates the final output, which is then saved in HDFS.

 Driver Class

The major component in a MapReduce job is the Driver Class. It is responsible for setting up a
MapReduce job to run in Hadoop. Here we specify the names of the Mapper and Reducer classes
along with the data types and the job name.
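A minimal driver sketch for the word count job discussed later in this document. The class names
WordCountDriver, WordCountMapper and WordCountReducer are illustrative; the mapper and
reducer themselves are sketched further below.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // On newer Hadoop releases this would be Job.getInstance(conf, "word count").
        Job job = new Job(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Names of the Mapper and Reducer classes (sketched later in this document).
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Output key/value data types of the job.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths in HDFS, taken from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}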

The Process of MapReduce

The two biggest advantages of MapReduce are:

1. Parallel Processing:
In MapReduce, we divide the job among multiple nodes, and each node works on a part
of the job simultaneously. MapReduce is thus based on the Divide and Conquer paradigm, which
helps us process the data using different machines. As the data is processed in parallel by multiple
machines instead of a single machine, the time taken to process the data is reduced
tremendously.

2. Data Locality:
Instead of moving data to the processing unit, we move the processing unit to the data in
the MapReduce framework.

 It is very cost-effective to move the processing unit to the data.
 The processing time is reduced as all the nodes work on their part of the data in
parallel.
 Every node gets a part of the data to process, so there is no chance of a node
getting overburdened.

Why is the Input Split Important?

Blocks are physical chunks of data stored on disk, whereas an InputSplit is not a physical chunk of
data. It is a Java class with pointers to the start and end locations within blocks. So when a mapper
reads the data, it knows exactly where to start reading and where to stop. The start of an InputSplit
can lie in one block and its end in another block.

InputSplits respect logical record boundaries, and that is why they are so important. During
MapReduce execution, Hadoop scans through the blocks, creates InputSplits, and assigns each
InputSplit to an individual mapper for processing.
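The number and size of InputSplits can be influenced from the driver. A small sketch, assuming the
standard FileInputFormat of the org.apache.hadoop.mapreduce API; the class name and the 32 MB
and 64 MB bounds are example values only.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "split size demo");  // illustrative job name

        // Lower and upper bounds (in bytes) on the size of each InputSplit.
        // FileInputFormat derives the actual split size from these bounds and the
        // HDFS block size, and each resulting split is handed to one map task.
        FileInputFormat.setMinInputSplitSize(job, 32L * 1024 * 1024);  // 32 MB
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);  // 64 MB
    }
}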

Q3: Discuss the functions of the Driver, Mapper and Reducer in Hadoop. You can take
the example of the word count problem.

 MapReduce consists of two distinct tasks – Map and Reduce.
 As the name MapReduce suggests, the reducer phase takes place after the mapper phase
has been completed.
 So, the first is the map job, where a block of data is read and processed to produce key-
value pairs as intermediate outputs.
 The output of a Mapper or map job (key-value pairs) is input to the Reducer.
 The reducer receives the key-value pairs from multiple map jobs.
 Then, the reducer aggregates those intermediate data tuples (intermediate key-value pairs)
into a smaller set of tuples or key-value pairs, which is the final output.

MapReduce mainly involves the following classes and concepts:

Mapper Class

The first stage in data processing using MapReduce is the Mapper Class. Here, the RecordReader
processes each input record and generates the respective key-value pair. Hadoop stores this
intermediate mapper output on the local disk.

Input Split

It is the logical representation of data. It represents a block of work that is processed by a single
map task in the MapReduce program.

RecordReader

It interacts with the Input Split and converts the data it reads into key-value pairs.

Driver Class

The major component in a MapReduce job is the Driver Class. It is responsible for setting up a
MapReduce job to run in Hadoop. Here we specify the names of the Mapper and Reducer classes
along with the data types and the job name.

The data goes through the following phases:

Input Splits:

The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is a
chunk of the input that is consumed by a single map task.

Mapping

This is the very first phase in the execution of a MapReduce program. In this phase, the data in each
split is passed to a mapping function to produce output values. In our example, the job of the mapping
phase is to count the number of occurrences of each word in the input splits (input splits were
described above) and prepare a list in the form of <word, frequency>.
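A minimal mapper sketch for this word count example (the class name is illustrative). Note that the
mapper emits <word, 1> pairs; the final frequencies are produced later, in the Reducing phase.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // The RecordReader delivers one line of the split at a time;
        // tokenize it and emit <word, 1> for every word found.
        StringTokenizer tokens = new StringTokenizer(line.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);
        }
    }
}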

Shuffling

This phase consumes the output of the Mapping phase. Its task is to consolidate the relevant records
from the Mapping phase output. In our example, the same words are clubbed together along with
their respective frequencies.

Reducing

In this phase, output values from the Shuffling phase are aggregated. This phase combines the values
from the Shuffling phase and returns a single output value. In short, this phase summarizes the
complete dataset.
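A matching reducer sketch (class name illustrative): for each word it receives the shuffled list of 1s
and sums them into the final frequency.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        // All values for one key arrive together after shuffling and sorting,
        // e.g. Bear -> [1, 1]; summing them gives the final count Bear -> 2.
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();
        }
        context.write(word, new IntWritable(sum));
    }
}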

How Does MapReduce Organize Work?

Hadoop divides the job into tasks. There are two types of tasks:

1. Map tasks (Splits & Mapping)
2. Reduce tasks (Shuffling, Reducing)

as mentioned above.

The complete execution process (of both Map and Reduce tasks) is controlled by two
types of entities:

1. JobTracker: acts like a master (responsible for the complete execution of a submitted job)
2. Multiple TaskTrackers: act like slaves, each of them performing a part of the job

For every job submitted for execution in the system, there is one JobTracker, which resides
on the NameNode, and there are multiple TaskTrackers, which reside on the DataNodes.

A Word Count Example of MapReduce

Let us understand how MapReduce works by taking an example where we have a text file called
abc.txt whose contents are as follows:

Dear, Bear, River, Car, Car, River, Deer, Car and Bear

 First, we divide the input into three splits as shown in the figure. This will distribute the
work among all the map nodes.
 Then, we tokenize the words in each of the mappers and give a hardcoded value (1) to
each of the tokens or words. The rationale behind giving a hardcoded value equal to 1 is
that every word, in itself, will occur once.
 Now, a list of key-value pairs will be created where the key is nothing but the individual
word and the value is one. So, for the first line (Dear Bear River) we have 3 key-value pairs
– Dear, 1; Bear, 1; River, 1. The mapping process remains the same on all the nodes.
 After the mapper phase, a partition process takes place where sorting and shuffling
happen so that all the tuples with the same key are sent to the corresponding reducer (see
the partitioner sketch after this list).
 So, after the sorting and shuffling phase, each reducer will have a unique key and a list of
values corresponding to that very key. For example, Bear, [1,1]; Car, [1,1,1]; etc.
 Now, each reducer counts the values present in its list of values. As shown in
the figure, the reducer gets the list of values [1,1] for the key Bear. It then counts the
number of ones in the list and gives the final output as – Bear, 2.
 Finally, all the output key-value pairs are collected and written to the output file.
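The partition step mentioned above is what routes every tuple with the same key to the same reducer.
Hadoop's default hash partitioner already behaves this way; the class below is only an illustrative
sketch of the same idea, not the built-in implementation.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Identical keys always hash to the same partition, so every (Bear, 1)
        // pair is sent to the same reducer.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

It could be plugged in from the driver with job.setPartitionerClass(HashLikePartitioner.class),
although the default partitioner already gives the same behaviour.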

Q2: Explain the steps required for the configuration of eclipse with Apache Hadoop.
Support your answer with screenshots.

Eclipse is the most popular Integrated Development Environment (IDE) for developing Java
applications. It is a robust, feature-rich, easy-to-use and powerful IDE that is the first choice of
most Java programmers, and it is free.

1. Download and Install Eclipse IDE

https://www.eclipse.org/downloads/download.php?file=/oomph/epp/2020

After downloading the file, we need to move this file to the home directory.

2. Extracting the file

There are two ways to extract the file: either right-click on the file and choose Extract Here, or
unzip it using the following command:

$ tar -xzvf eclipse-java-inst-1-linux-gtk-x86_64.tar.gz

3. Choose a Workspace Directory:

Eclipse organizes projects by workspaces. A workspace is a group of related projects and it is
actually a directory on your computer. That is why, when you start Eclipse, it asks you to choose a
workspace location like this:

4. Create a Java Project in Package Explorer:
In the Package Explorer on the left side of the interface:
• File > New > Java Project > (Name it – WordCountProject1) > Finish
The new Java project has been created.

o After that, we need to create the package:
o Right Click > New > Package (Name it - WordCountPackage) > Finish
o After creating the package, the next step is to create the class:
 Right Click on Package > New > Class (Name it - WordCount)

5. Download Hadoop Libraries

Add the following reference libraries:
hadoop-core-1.2.1.jar
commons-cli-1.2.jar

To add these libraries:

Right Click on Project > Build Path > Add External Archives

Downloads/hadoop-core-1.2.1.jar

Downloads/commons-cli-1.2.jar
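Once the jars are on the build path, a quick sanity check is that a tiny class touching the Hadoop API
compiles without errors; the class and job names below are illustrative (actually running it from
Eclipse may require additional Hadoop dependency jars beyond these two).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ClasspathCheck {
    public static void main(String[] args) throws Exception {
        // If hadoop-core-1.2.1.jar and commons-cli-1.2.jar were added correctly,
        // this class compiles; running it simply prints the job name.
        Job job = new Job(new Configuration(), "classpath-check");
        System.out.println("Hadoop classes resolved; job name: " + job.getJobName());
    }
}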

