Hadoop Distributed File System


Introduction to the Hadoop Distributed File System (HDFS)

With growing data velocity, data size easily outgrows the storage capacity of a single machine. A solution is to store the data across a network of machines; such filesystems are called distributed filesystems. Since data is stored across a network, all the complications of a network come into play.
This is where Hadoop comes in. It provides one of the most reliable filesystems. HDFS (Hadoop Distributed File System) is a uniquely designed filesystem that provides storage for extremely large files with a streaming data access pattern, and it runs on commodity hardware. Let's elaborate on these terms:
- Extremely large files: Here we are talking about data in the range of petabytes (one petabyte is 1,000 TB).
- Streaming data access pattern: HDFS is designed on the principle of write-once, read-many-times. Once data is written, large portions of the dataset can be processed any number of times (a minimal read sketch follows this list).
- Commodity hardware: Hardware that is inexpensive and readily available in the market. This is one of the features that especially distinguishes HDFS from other filesystems.
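
To make the streaming access pattern concrete, here is a minimal Java sketch that opens a file in HDFS and streams its contents to standard output using the standard FileSystem API; the file path is a hypothetical example, and the cluster address is assumed to come from the configuration files on the classpath.

```java
import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsStreamRead {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical path: replace with a file that exists on your cluster.
    // The file is read as one continuous stream, matching the
    // write-once, read-many-times design.
    try (InputStream in = fs.open(new Path("/data/example.txt"))) {
      IOUtils.copyBytes(in, System.out, 4096, false);
    }
  }
}
```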
Nodes: An HDFS cluster typically follows a master-slave architecture.
1. NameNode (master node):
  - Manages all the slave nodes and assigns work to them.
  - Executes filesystem namespace operations like opening, closing, and renaming files and directories.
  - Should be deployed on reliable, high-specification hardware, not on commodity hardware.
2. DataNode (slave node):
  - The actual worker nodes, which do the actual work like reading, writing, and processing.
  - They also perform block creation, deletion, and replication upon instruction from the master.
  - They can be deployed on commodity hardware.
HDFS daemons: Daemons are the processes running in the background.
- NameNode:
  - Runs on the master node.
  - Stores metadata (data about data) like the file path, the number of blocks, block IDs, etc. (queried in the sketch after this list).
  - Requires a large amount of RAM.
  - Stores the metadata in RAM for fast retrieval, i.e., to reduce seek time, though a persistent copy of it is kept on disk.
- DataNode:
  - Runs on slave nodes.
  - Requires a large amount of disk storage, as the data is actually stored here.
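
To see this metadata concretely, the sketch below asks the NameNode for a file's block size, replication factor, and block locations through the standard FileSystem API; the path is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());

    // Hypothetical path; the metadata comes from the NameNode, not the DataNodes.
    FileStatus status = fs.getFileStatus(new Path("/data/example.txt"));
    System.out.println("block size: " + status.getBlockSize()
        + ", replication: " + status.getReplication());

    // Each BlockLocation lists the offset, length, and the DataNodes holding a block.
    for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println(block);
    }
  }
}
```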
Data storage in HDFS: Now let's see how the data is stored in a distributed manner.

Let's assume a 100 TB file is inserted. The master node (NameNode) will first divide the file into blocks, say ten blocks of 10 TB each for illustration (the actual default block size is 128 MB in Hadoop 2.x and above). These blocks are then stored across different DataNodes (slave nodes). The DataNodes replicate the blocks among themselves, and the information about which blocks they contain is sent to the master. The default replication factor is 3, which means 3 replicas are created for each block (including the original). In hdfs-site.xml we can increase or decrease the replication factor, i.e., we can edit its configuration there, as sketched below.
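
As a sketch of the two usual ways to control the replication factor (the file path is a hypothetical example): the dfs.replication property normally lives in hdfs-site.xml, and the same setting can be made per client or per file through the Java API.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // 1. Override the default replication factor (normally set as
    //    dfs.replication in hdfs-site.xml) for files created by this client.
    conf.setInt("dfs.replication", 3);
    FileSystem fs = FileSystem.get(conf);

    // 2. Change the replication factor of an existing file (hypothetical path).
    fs.setReplication(new Path("/data/example.txt"), (short) 2);
  }
}
```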
Note: The master node has a record of everything: it knows the location and details of each and every DataNode and the blocks it contains, i.e., nothing is done without the permission of the master node.
Why divide the file into blocks?
Answer: Let's assume we don't divide the file. It is very difficult to store a 100 TB file on a single machine, and even if we do, every read and write operation on the whole file incurs very high seek time. With multiple blocks of 128 MB each, it becomes easy to perform various read and write operations on just the parts we need, compared to operating on the whole file at once. So we divide the file to get faster data access, i.e., to reduce seek time. The arithmetic below makes the resulting block count concrete.
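
```java
public class BlockMath {
  public static void main(String[] args) {
    long fileBytes = 100L * 1024 * 1024 * 1024 * 1024; // 100 TB
    long blockBytes = 128L * 1024 * 1024;              // 128 MB default block size
    // 100 TB / 128 MB = 819,200 blocks spread across the DataNodes.
    System.out.println(fileBytes / blockBytes);        // prints 819200
  }
}
```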
Why replicate the blocks in DataNodes while storing?
Answer: Let's assume we don't replicate, and a given block is present only on DataNode D1. If D1 crashes, we lose the block, which makes the overall data inconsistent and faulty. So we replicate the blocks to achieve fault tolerance.
Terms related to HDFS:
- HeartBeat: The signal that a DataNode continuously sends to the NameNode. If the NameNode does not receive a heartbeat from a DataNode, it considers that node dead (a simplified sketch follows this list).
- Balancing: If a DataNode crashes, the blocks on it are gone too, and those blocks become under-replicated compared to the remaining blocks. The master node (NameNode) then signals the DataNodes that contain replicas of the lost blocks to replicate them, so that the overall distribution of blocks stays balanced.
- Replication: It is performed by the DataNodes, upon instruction from the master.
Note: No two replicas of the same block are stored on the same DataNode.
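
Here is a simplified sketch of the heartbeat bookkeeping described above. This is illustrative Java, not Hadoop's actual implementation, and the ten-minute timeout is an assumption modelled on typical HDFS defaults.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HeartbeatMonitor {
  // Assumed timeout; real clusters derive this from configuration.
  private static final long TIMEOUT_MS = 10L * 60 * 1000;

  // Last heartbeat arrival time per DataNode id.
  private final Map<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

  // Called each time a heartbeat arrives from a DataNode.
  public void onHeartbeat(String datanodeId) {
    lastHeartbeat.put(datanodeId, System.currentTimeMillis());
  }

  // A node is considered dead once no heartbeat has arrived within the timeout;
  // the NameNode would then schedule re-replication of the blocks it held.
  public boolean isDead(String datanodeId) {
    Long last = lastHeartbeat.get(datanodeId);
    return last == null || System.currentTimeMillis() - last > TIMEOUT_MS;
  }
}
```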
Features:
- Distributed data storage.
- Blocks reduce seek time.
- The data is highly available, as the same block is present on multiple DataNodes.
- Even if multiple DataNodes are down, we can still do our work, which makes the system highly reliable.
- High fault tolerance.
Limitations: Though HDFS provides many features, there are some areas where it doesn't work well.
- Low-latency data access: Applications that require low-latency access to data, i.e., in the range of milliseconds, will not work well with HDFS, because HDFS is designed for high throughput of data even at the cost of latency.
- Small file problem: Having lots of small files results in lots of seeks and lots of movement from one DataNode to another to retrieve each small file; this is a very inefficient data access pattern.

What is the difference between the NameNode and the secondary NameNode?

The secondary NameNode periodically merges the fsimage and the edits log files, keeping the size of the edits log within a limit. It is usually run on a different machine than the primary NameNode, since its memory requirements are of the same order as the primary NameNode's. This relieves the NameNode from merging the contents of the fsimage with the edits log itself. The secondary NameNode, however, does not take over the functions of the NameNode if the NameNode encounters an issue. It can be manually made the primary NameNode, but this does not happen automatically.

What are the main functions of the secondary NameNode?

The main function of the secondary NameNode is to store the latest copy of the fsimage and the edits log files. How does this help? When the NameNode is restarted, the latest edits log entries are applied to the fsimage file so that the HDFS metadata is up to date.
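
As a toy model of what this checkpoint does (illustrative only; not Hadoop's actual code or file formats): the fsimage is a snapshot of the namespace, the edits log is the sequence of operations applied since that snapshot, and merging means replaying the log onto the snapshot so the log can be truncated.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CheckpointSketch {
  // Fold the edits log into the fsimage to produce a fresh snapshot.
  static Map<String, Long> checkpoint(Map<String, Long> fsimage, List<String[]> editsLog) {
    Map<String, Long> merged = new HashMap<>(fsimage); // start from the old snapshot
    for (String[] edit : editsLog) {                   // replay each logged operation
      if (edit[0].equals("create")) merged.put(edit[1], 0L);
      else if (edit[0].equals("delete")) merged.remove(edit[1]);
    }
    return merged; // the new fsimage; the edits log can now be truncated
  }

  public static void main(String[] args) {
    Map<String, Long> fsimage = new HashMap<>(Map.of("/data/a.txt", 42L));
    List<String[]> edits = List.of(new String[] {"create", "/data/b.txt"},
                                   new String[] {"delete", "/data/a.txt"});
    System.out.println(checkpoint(fsimage, edits)); // {/data/b.txt=0}
  }
}
```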

JobTracker and TaskTracker

JobTracker and TaskTracker are two essential processes involved in MapReduce execution in MRv1 (Hadoop version 1). Both processes are deprecated in MRv2 (Hadoop version 2) and replaced by the ResourceManager, ApplicationMaster, and NodeManager daemons. A minimal MRv1 job driver is sketched at the end of this section.

JobTracker –
1. The JobTracker process runs on a separate node, not usually on a DataNode.

2. The JobTracker is an essential daemon for MapReduce execution in MRv1. It is replaced by the ResourceManager/ApplicationMaster in MRv2.

3. The JobTracker receives requests for MapReduce execution from the client.

4. The JobTracker talks to the NameNode to determine the location of the data.

5. The JobTracker finds the best TaskTracker nodes to execute tasks, based on data locality (proximity of the data) and the slots available to execute a task on a given node.

6. The JobTracker monitors the individual TaskTrackers and submits the overall status of the job back to the client.

7. The JobTracker process is critical to the Hadoop cluster in terms of MapReduce execution.

8. When the JobTracker is down, HDFS is still functional, but MapReduce execution cannot be started and existing MapReduce jobs are halted.

TaskTracker –
1. The TaskTracker runs on DataNodes, usually on all DataNodes.

2. The TaskTracker is replaced by the NodeManager in MRv2.

3. Mapper and Reducer tasks are executed on DataNodes administered by TaskTrackers.

4. TaskTrackers are assigned Mapper and Reducer tasks to execute by the JobTracker.

5. The TaskTracker is in constant communication with the JobTracker, signalling the progress of the task in execution.

6. TaskTracker failure is not considered fatal. When a TaskTracker becomes unresponsive, the JobTracker assigns the tasks it was executing to another node.

What is the difference between TaskTracker and JobTracker?

The TaskTracker performs its tasks while being closely monitored by the JobTracker. If a task fails, the JobTracker simply resubmits it to another TaskTracker. However, the JobTracker itself is a single point of failure, meaning that if it fails, the whole MapReduce system goes down. The JobTracker updates its status when the job completes.
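
To ground these roles, here is the classic WordCount job written against the old org.apache.hadoop.mapred (MRv1) API. The JobClient.runJob call at the end is what submits the job to the JobTracker, which then schedules the map and reduce tasks on TaskTrackers close to the input blocks. Input and output paths are taken from the command line.

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class WordCount {
  // Mapper: emits (word, 1) for every token in its input split.
  public static class Map extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        output.collect(word, ONE);
      }
    }
  }

  // Reducer: sums the counts collected for each word.
  public static class Reduce extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    conf.setMapperClass(Map.class);
    conf.setReducerClass(Reduce.class);
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    // Submits the job to the JobTracker and waits for completion.
    JobClient.runJob(conf);
  }
}
```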
What is Hadoop cluster configuration?

A Hadoop cluster architecture consists of a data centre, racks, and the nodes that actually execute the jobs. The data centre consists of racks, and racks consist of nodes. A medium-to-large cluster has a two- or three-level Hadoop cluster architecture built with rack-mounted servers.

How to create a cluster in Hadoop?

Hadoop Cluster Setup
1. Purpose.
2. Prerequisites.
3. Installation.
4. Configuring Hadoop in Non-Secure Mode: configuring the environment of the Hadoop daemons, and configuring the Hadoop daemons themselves.
5. Monitoring Health of NodeManagers.
6. Slaves File.
7. Hadoop Rack Awareness.
8. Logging.
