Hadoop Distributed File System Basics

Hadoop Distributed File System (HDFS) is designed for distributed storage and processing of large datasets across clusters of computers. HDFS uses a master/slave architecture with a NameNode that manages metadata and DataNodes that store data blocks. Data is replicated across multiple DataNodes for fault tolerance. The MapReduce programming model is used for distributed processing of data in HDFS through parallel mapping and reducing of data blocks.


Hadoop Distributed File System Basics
Contents
• Hadoop Distributed File System Design Features
• HDFS Components
• HDFS Block Replication
• HDFS Safe Mode
• Rack Awareness
• NameNode High Availability
• HDFS Checkpoints and Backups
• HDFS Snapshots
• HDFS NFS Gateway
Hadoop Distributed File System Design Features
• Designed for big data processing (very large files and data sets)
• Write-once/read-many access model
• No local caching of data (the data sets are too large for caching to help)
• Design based on the Google File System (GFS)
• Designed for data streaming (large sequential reads rather than random access)
• Data locality: move the computation to the data rather than the data to the computation
HDFS Components
• Node types:
• NameNode: manages the file system metadata
• DataNode: stores and retrieves the data blocks

• Design: master/slave architecture
• Master (NameNode): manages the file system namespace
• Slave (DataNode): serves client read/write requests
Various system roles in an HDFS deployment
• Disk files:
• fsimage_*
• An image of the file system state, read only at startup by the NameNode.
• Stores the file system metadata.

• edits_*
• A series of modifications made to the file system after the NameNode was started.

• Location: set by the dfs.namenode.name.dir property in the hdfs-site.xml file (see the sketch below).
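A quick way to locate and inspect these files from the command line is sketched below; the directory and fsimage file name shown are assumed examples, and the Offline Image Viewer (hdfs oiv) dump is optional.

    # Ask HDFS which local directory the NameNode uses for its metadata
    hdfs getconf -confKey dfs.namenode.name.dir

    # List the fsimage_* and edits_* files (the path below is an assumed example;
    # use the value returned by the command above)
    ls /var/lib/hadoop-hdfs/namenode/current/

    # Dump an fsimage file to XML with the Offline Image Viewer for inspection
    hdfs oiv -p XML -i /var/lib/hadoop-hdfs/namenode/current/fsimage_0000000000000000042 -o fsimage.xml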
HDFS Block Replication
• HDFS replicates data blocks across the cluster.
• The amount of replication is set by the dfs.replication property in the hdfs-site.xml file (see the commands below).
• Each block is replicated across a number of machines (default = 3).
• The default HDFS block size is 64 MB.
• Splits are based on a logical partitioning of the data.

Typical replication factor by cluster size:
    Cluster size        Replication factor
    >8 nodes            3
    >1 and <8 nodes     2
    Single machine      1
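A minimal sketch of checking and changing replication from the command line; the file path is a hypothetical example.

    # Show the configured default replication factor
    hdfs getconf -confKey dfs.replication

    # Change the replication factor of an existing file to 2 and wait for completion
    hdfs dfs -setrep -w 2 /user/demo/data.txt

    # Report how the file's blocks are distributed across DataNodes
    hdfs fsck /user/demo/data.txt -files -blocks -locations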
HDFS Safe Mode
• When the NameNode starts, it enters a read-only safe mode in which blocks cannot be replicated or deleted.
• Safe mode enables the NameNode to perform two important processes:
• The fsimage file is loaded into memory and the edits log is replayed.
• The mapping between blocks and DataNodes is created; the NameNode waits until at least one copy of the data is available before exiting safe mode.
• Administrators can also enter safe mode for maintenance using the hdfs dfsadmin -safemode command (examples below).
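Typical administrator commands for working with safe mode (a minimal sketch):

    # Check whether the NameNode is currently in safe mode
    hdfs dfsadmin -safemode get

    # Enter safe mode for maintenance
    hdfs dfsadmin -safemode enter

    # Leave safe mode when maintenance is finished
    hdfs dfsadmin -safemode leave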
Rack Awareness
• Data locality: the goal of Hadoop MapReduce is to move the computation to the data.
1. Data resides on the local machine (best).
2. Data resides in the same rack (better).
3. Data resides in a different rack (good).
• Example: the YARN scheduler uses rack awareness when assigning containers to nodes (a sketch of a rack topology script follows below).
• Pros: improved fault tolerance, because block replicas are placed in more than one rack.
• Cons: if an entire rack fails, performance is degraded even though the data remain available.
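Rack awareness is typically enabled by pointing the net.topology.script.file.name property in core-site.xml at a script that maps host names or IP addresses to rack IDs. The sketch below is a hypothetical minimal script; the subnets and rack names are assumptions for illustration only.

    #!/bin/bash
    # Hypothetical rack topology script: print one rack ID per host argument.
    for host in "$@"; do
      case "$host" in
        10.1.1.*) echo "/dc1/rack1" ;;   # assumed subnet for rack 1
        10.1.2.*) echo "/dc1/rack2" ;;   # assumed subnet for rack 2
        *)        echo "/default-rack" ;;
      esac
    done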
NameNode High Availability
• Earlier, the NameNode was a single point of failure.
• NameNode High Availability (HA) was added to provide a true failover service.
• In an HA configuration there are two NameNodes: an Active NameNode and a Standby NameNode.
• The Active NameNode handles all client HDFS operations in the cluster.
• The Standby NameNode maintains enough state to provide a fast failover (see the commands below).
• Apache ZooKeeper is used to monitor the NameNode health.
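With HA enabled, the hdfs haadmin tool reports and controls which NameNode is active. A minimal sketch, assuming the NameNodes are configured with the IDs nn1 and nn2 (these IDs are assumptions; in a real cluster they come from dfs.ha.namenodes.<nameservice> in hdfs-site.xml):

    # Check the current role of each NameNode
    hdfs haadmin -getServiceState nn1    # prints "active" or "standby"
    hdfs haadmin -getServiceState nn2

    # Manually fail over from nn1 to nn2
    # (automatic failover is normally handled through ZooKeeper)
    hdfs haadmin -failover nn1 nn2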
HDFS NameNode Federation
• Earlier, HDFS had a single namespace managed by a single NameNode.
• Federation addresses this limitation by supporting multiple NameNodes/namespaces (see the check below).
• Key benefits:
• Namespace scalability
• Better performance
• System isolation
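A quick, hedged way to see whether a cluster is federated is to list its configured nameservice IDs; a federated cluster returns more than one.

    # List the nameservice IDs defined for this cluster
    hdfs getconf -confKey dfs.nameservices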
HDFS Checkpoints and Backups
• Checkpoints:
• A checkpoint merges the fsimage file (stores the metadata) with the edits log (the file system modifications) to produce an up-to-date fsimage (see the commands below).
• Backups:
• A BackupNode maintains an up-to-date copy of the file system namespace both in memory and on disk.
• The NameNode supports one BackupNode at a time.
• No CheckpointNodes may be registered if a BackupNode is in use.
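An administrator can force the NameNode to write a fresh checkpoint of the namespace to its storage directories; a minimal sketch (the NameNode must be in safe mode for the save to succeed):

    # Enter safe mode, save the current namespace to a new fsimage, then leave safe mode
    hdfs dfsadmin -safemode enter
    hdfs dfsadmin -saveNamespace
    hdfs dfsadmin -safemode leave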
HDFS Snapshots
• HDFS snapshots are read-only point-in-time copies of the file system, created with the hdfs dfs -createSnapshot command after an administrator has made the directory snapshottable (see the workflow below).
• Features:
• Can cover a sub-tree of the file system or the entire file system.
• Used for data backup, protection against user errors, and disaster recovery.
• Instant creation: data are not copied; only the block list and the file size are recorded.
• Does not affect regular HDFS operations.
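A typical snapshot workflow from the command line (a minimal sketch; the directory and snapshot names are hypothetical):

    # An administrator first makes a directory snapshottable
    hdfs dfsadmin -allowSnapshot /user/demo/project

    # Create a read-only point-in-time snapshot of the directory
    hdfs dfs -createSnapshot /user/demo/project before-cleanup

    # Snapshots are visible under the hidden .snapshot directory
    hdfs dfs -ls /user/demo/project/.snapshot

    # Delete the snapshot when it is no longer needed
    hdfs dfs -deleteSnapshot /user/demo/project before-cleanup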
Hadoop MapReduce Framework
Contents
• MapReduce Model
• MapReduce Parallel Data Flow
• Data flow for a word count program
• Process placement during MapReduce
• Fault Tolerance
• Speculative Execution
MapReduce Model
• A simple, powerful programming model
• Two stages:
• Mapping stage
• Reducing stage
• Example: word count
• A serial version in the shell: grep "<<word>>" <<filename>> | wc -l (or grep -c "<<word>>" <<filename>>)
• The mapper and reducer functions are both defined with respect to
data structured in (key, value) pairs.
• The mapper takes one pair of data with a type in one data domain,
and returns a list of pairs in a different domain:
• Map(key1,value1) → list(key2,value2)
• The reducer function is then applied to each key and its list of values, which in turn produces a collection of values in the same domain:
• Reduce(key2, list(value2)) → list(value3)
• Each reducer call typically produces either one value (value3) or an
empty response. Thus, the MapReduce framework transforms a list of
(key, value) pairs into a list of values.
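For word count, these two functions can be sketched as small Hadoop Streaming style shell scripts: the mapper reads lines of text and emits (word, 1) pairs, and the reducer sums the counts for each word, relying on its input arriving sorted by key. This is a minimal illustration under those assumptions, not the Java API most Hadoop programs use.

    #!/bin/bash
    # mapper.sh -- Map(key1, value1) -> list(key2, value2)
    # Split the input text into words and emit one "word<TAB>1" pair per word.
    tr -cs '[:alnum:]' '\n' | awk 'NF { printf "%s\t1\n", $1 }'

    #!/bin/bash
    # reducer.sh -- Reduce(key2, list(value2)) -> list(value3)
    # Input arrives sorted by key (the shuffle), so counts for a word are adjacent.
    awk -F '\t' '
      prev != "" && $1 != prev { printf "%s\t%d\n", prev, sum; sum = 0 }
      { prev = $1; sum += $2 }
      END { if (prev != "") printf "%s\t%d\n", prev, sum }
    '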
• Properties:
• Data flow is in one direction (map to reduce).
• The original input data are preserved (they are immutable).
• Map tasks have no dependencies on one another.
• Hadoop accomplishes parallelism by using a distributed file system (HDFS) to slice and spread data over multiple servers.
• The data slices are then combined in the reduce step.
MapReduce Parallel Data Flow
• Input splits: HDFS distributes and replicates the data over multiple servers; each mapper works on one input split.
• Map step: the mappers run in parallel on their splits (the parallel nature of Hadoop).
• Combiner step (optional): key-value pairs with the same key are combined locally before the shuffle.
• Shuffle step: pairs with the same key are routed to the same reducer.
• Reduce step: the actual reduction happens and the results are written to HDFS (a local shell simulation follows below).
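The full flow can be exercised locally by chaining the sketch scripts from the previous section, with sort standing in for the shuffle, or submitted to a cluster with Hadoop Streaming. The streaming jar location and the HDFS paths below are assumptions that vary by installation.

    # Local simulation of the data flow: map -> shuffle (sort) -> reduce
    cat input.txt | ./mapper.sh | sort | ./reducer.sh

    # Roughly equivalent Hadoop Streaming job on a cluster
    hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
      -input /user/demo/input -output /user/demo/output \
      -mapper mapper.sh -reducer reducer.sh \
      -file mapper.sh -file reducer.sh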
Simple Hadoop MapReduce data flow for a word count program
The count for the word "run" can be combined into (run, 2) before the shuffle. This optimization can help minimize the amount of data transfer needed for the shuffle phase.
• Placement of mappers and reducers is handled by the Hadoop YARN resource manager and the MapReduce framework.
• Nodes can run both mapper and reducer tasks.
• The dynamic nature of YARN enables the work containers used by
completed map tasks to be returned to the pool of available
resources.
• The shuffle stage makes sure the necessary data are sent to each reducer.
• Mappers can complete at any time.
Process placement during MapReduce
Fault Tolerance
• The MapReduce framework keeps strict control of the data flow throughout the execution of the program.
• The design of MapReduce makes it possible to easily recover from the failure of one or many map processes, because map tasks have no dependencies on one another.
• Failed reducers can also be restarted: the MapReduce ApplicationMaster restarts the failed reducer tasks.
• The recovery process is totally transparent to the user and provides a fault-tolerant system for running applications.
Speculative Execution
• A challenge with large clusters is the inability to predict or manage unexpected slowdowns and system failures.
• A congested network, a slow disk controller, a failing disk, high processor load, or similar problems lead to slow performance.
• If one part of a MapReduce job runs slowly, the whole application cannot complete.
• Because the input data are immutable in the MapReduce model, Hadoop can speculatively start a copy of a slow-running map task without disturbing any other running map tasks; whichever copy finishes first is used.
• Because speculative execution can reduce cluster efficiency, it can be turned on and off in the mapred-site.xml configuration file (see the per-job override below).
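The properties involved are mapreduce.map.speculative and mapreduce.reduce.speculative; besides setting them in mapred-site.xml, they can be overridden for a single job at submit time. A minimal sketch using the bundled wordcount example (the jar path and HDFS paths are assumptions that vary by installation):

    # Turn speculative execution off for one job by overriding the properties
    hadoop jar "$HADOOP_HOME"/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount \
      -D mapreduce.map.speculative=false \
      -D mapreduce.reduce.speculative=false \
      /user/demo/input /user/demo/output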
Figure: Speculative execution. The scheduler monitors task progress on each node; when a task on one node (e.g., Node A) runs slowly, a speculative duplicate task is launched on another node (e.g., Node B), and the output of whichever copy finishes first is used.