Introduction To HDFS

This document provides an introduction to HDFS (Hadoop Distributed File System). It describes HDFS as a distributed, fault-tolerant, and scalable file system for Hadoop applications. The key components of HDFS are the NameNode, which manages file system metadata, and DataNodes, which store the actual data blocks. HDFS is designed to handle large data sets with replication across multiple DataNodes for failure tolerance. The document outlines HDFS architecture and features, and provides examples of common HDFS commands for user operations, administration, and advanced usage.

Uploaded by

Samuel temesgen

Introduction to HDFS

What’s HDFS
• HDFS is a distributed file system that is fault tolerant, scalable and easy to expand.
• HDFS is the primary distributed storage for Hadoop applications.
• HDFS provides interfaces for applications to move themselves closer to the data.
• HDFS is designed to ‘just work’; however, a working knowledge helps with diagnostics and improvements.

Components of HDFS
There are two (and a half) types of machines in an HDFS cluster:
• NameNode: the heart of an HDFS filesystem. It maintains and manages the file system metadata, e.g. which blocks make up a file and on which DataNodes those blocks are stored.
• DataNode: where HDFS stores the actual data. There are usually quite a few of these.

HDFS Architecture

Unique features of HDFS
HDFS has several features that make it well suited to distributed systems:

• Failure tolerant - data is duplicated across multiple DataNodes to protect against machine failures. The default replication factor is 3 (every block is stored on three machines).
• Scalability - data transfers happen directly with the DataNodes, so read/write capacity scales fairly well with the number of DataNodes.
• Space - need more disk space? Just add more DataNodes and rebalance.
• Industry standard - other distributed applications are built on top of HDFS (HBase, MapReduce).

HDFS is designed to process large data sets with write-once-read-many semantics; it is not intended for low-latency access.
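
The storage cost of replication is simple arithmetic: every block is stored once per replica, so usable capacity is roughly the raw capacity divided by the replication factor. A minimal shell sketch of that estimate follows. The helper name `usable_capacity_tb` and the 12 TB figure are illustrative assumptions, not taken from the slides, and the estimate ignores non-DFS disk usage.

```shell
# Rough usable capacity of a cluster: raw capacity / replication factor.
# usable_capacity_tb is a hypothetical helper; integer division rounds down.
usable_capacity_tb() {
  raw_tb=$1
  replication=${2:-3}   # falls back to the default replication factor of 3
  echo $(( raw_tb / replication ))
}

# Example: 4 DataNodes with 3 TB each (12 TB raw) at replication factor 3
usable_capacity_tb 12 3   # prints 4
```

Adding DataNodes raises the raw figure, which is why capacity grows simply by adding machines and rebalancing.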

HDFS – Data Organization
• Each file written into HDFS is split into data blocks
• Each block is stored on one or more nodes
• Each copy of a block is called a replica
• Block placement policy
  • First replica is placed on the local node
  • Second replica is placed on a node in a different rack
  • Third replica is placed on a different node in the same rack as the second replica
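
To make the block arithmetic concrete, the sketch below computes how many blocks a file is split into and how many block replicas the cluster stores for it. The helper names `block_count` and `replica_count` are hypothetical. The 256 MiB block size matches the `dfs.blocksize` value used in the configuration sample of this deck, and the 1 GiB file is an illustrative assumption.

```shell
# Number of blocks a file occupies: ceil(file_size / block_size), in bytes.
# block_count and replica_count are hypothetical helper names.
block_count() {
  file_size=$1
  block_size=$2
  echo $(( (file_size + block_size - 1) / block_size ))
}

# Total block replicas stored across the cluster for that file.
replica_count() {
  blocks=$(block_count "$1" "$2")
  echo $(( blocks * $3 ))
}

# A 1 GiB file with 256 MiB blocks (268435456 bytes) and 3 replicas
block_count 1073741824 268435456      # prints 4
replica_count 1073741824 268435456 3  # prints 12
```

The last block of a file is not padded, so a file smaller than one block still counts as one block but only consumes its actual size on disk.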

Read Operation in HDFS

Write Operation in HDFS

HDFS Security
• Authentication to Hadoop
  • Simple – insecure; uses the OS username to determine the Hadoop identity
  • Kerberos – authentication using a Kerberos ticket
  • Set by hadoop.security.authentication=simple|kerberos
• File and directory permissions are similar to POSIX
  • read (r), write (w), and execute (x) permissions
  • each file and directory also has an owner, a group and a mode
  • enabled by default (dfs.permissions.enabled=true)
• ACLs are used to implement permissions that differ from the natural hierarchy of users and groups
  • enabled by dfs.namenode.acls.enabled=true
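
Because the permission model follows POSIX, a mode is usually written as three octal digits for owner, group and others. The sketch below translates such a mode into the rwx notation shown by `hdfs dfs -ls`. The helpers `digit_to_rwx` and `mode_to_rwx` are hypothetical names introduced for illustration, and the sketch assumes exactly three octal digits.

```shell
# Translate a three-digit octal mode (e.g. 640) into rwx notation.
# Hypothetical helpers for illustration; expects exactly three octal digits.
digit_to_rwx() {
  case "$1" in
    0) echo '---' ;;
    1) echo '--x' ;;
    2) echo '-w-' ;;
    3) echo '-wx' ;;
    4) echo 'r--' ;;
    5) echo 'r-x' ;;
    6) echo 'rw-' ;;
    7) echo 'rwx' ;;
  esac
}

mode_to_rwx() {
  mode=$1
  owner=${mode%??}     # first digit
  rest=${mode#?}       # last two digits
  group=${rest%?}      # middle digit
  other=${mode#??}     # last digit
  echo "$(digit_to_rwx "$owner")$(digit_to_rwx "$group")$(digit_to_rwx "$other")"
}

mode_to_rwx 640   # prints rw-r-----
mode_to_rwx 755   # prints rwxr-xr-x
```

A mode like this can be applied with `hdfs dfs -chmod 640 tdata/geneva.csv`. When ACLs are enabled, an exception for a single user can be added with `hdfs dfs -setfacl -m user:alice:r-- tdata/geneva.csv`, where the user `alice` is a made-up example.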

HDFS Configuration
HDFS Defaults

• Block Size – 128 MB in Hadoop 2.x and later (64 MB in Hadoop 1.x)
• Replication Factor – 3
• Web UI Port – 50070 (9870 in Hadoop 3.x)

HDFS configuration file: /etc/hadoop/conf/hdfs-site.xml

<property>
  <name>dfs.namenode.name.dir</name>
  <value>file:///data1/cloudera/dfs/nn,file:///data2/cloudera/dfs/nn</value>
</property>

<property>
  <name>dfs.blocksize</name>
  <value>268435456</value>
</property>

<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>

<property>
  <name>dfs.namenode.http-address</name>
  <value>itracXXX.cern.ch:50070</value>
</property>
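
Values such as `dfs.blocksize` are given in bytes, which is hard to read at a glance. The sketch below converts a byte count to MiB and shows one way to read the effective value with `hdfs getconf -confKey`. The helper names `bytes_to_mib` and `effective_block_size_mib` are hypothetical, and the sketch assumes the configured value is a plain byte count rather than a suffixed size such as `256m`.

```shell
# Convert a size in bytes to MiB (integer division).
bytes_to_mib() {
  echo $(( $1 / 1048576 ))
}

# Read the effective block size from the client configuration and
# report it in MiB. Hypothetical helper; needs the hdfs CLI on the PATH.
effective_block_size_mib() {
  bytes=$(hdfs getconf -confKey dfs.blocksize)
  bytes_to_mib "$bytes"
}

bytes_to_mib 268435456   # prints 256, the block size configured above
```

On a configured client, `effective_block_size_mib` reports the value that new files will actually use, which is a quick way to confirm that an edit to hdfs-site.xml was picked up.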

Interfaces to HDFS
• Java API (DistributedFileSystem)
• C wrapper (libhdfs)
• HTTP protocol
• WebDAV protocol
• Shell commands

However, the command line is one of the simplest and most familiar interfaces.

HDFS – Shell Commands
There are two types of shell commands:
User Commands
hdfs dfs – runs filesystem commands on HDFS
hdfs fsck – runs an HDFS filesystem checking command
Administration Commands
hdfs dfsadmin – runs HDFS administration commands

HDFS – User Commands (dfs)
List directory contents

hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -ls -R /var

Display the disk space used by files

hdfs dfs -du -h /
hdfs dfs -du /hbase/data/hbase/namespace/
hdfs dfs -du -h /hbase/data/hbase/namespace/
hdfs dfs -du -s /hbase/data/hbase/namespace/

HDFS – User Commands (dfs)

Copy data to HDFS

hdfs dfs -mkdir tdata
hdfs dfs -ls
hdfs dfs -copyFromLocal tutorials/data/geneva.csv tdata
hdfs dfs -ls -R

Copy the file back to the local filesystem

cd tutorials/data/
hdfs dfs -copyToLocal tdata/geneva.csv geneva.csv.hdfs
md5sum geneva.csv geneva.csv.hdfs
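
Comparing the two checksums by eye works for a single file, but the check is easy to script. The sketch below wraps the comparison in a small function. The name `files_match` is a hypothetical helper, and the sketch assumes `md5sum` is available, as it is in the commands above.

```shell
# Succeeds (exit status 0) when both files exist and have the same MD5 checksum.
# files_match is a hypothetical helper name.
files_match() {
  [ -f "$1" ] && [ -f "$2" ] || return 1
  sum_a=$(md5sum < "$1" | cut -d' ' -f1)
  sum_b=$(md5sum < "$2" | cut -d' ' -f1)
  [ "$sum_a" = "$sum_b" ]
}

# Usage after the copy above (run from tutorials/data/):
#   files_match geneva.csv geneva.csv.hdfs && echo "round trip OK"
```

Returning an exit status rather than printing the hashes lets the check be chained with `&&` or used in an `if` statement.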

HDFS – User Commands (acls)
List the ACL for a file

hdfs dfs -getfacl tdata/geneva.csv

List the file statistics (%r – replication factor)

hdfs dfs -stat "%r" tdata/geneva.csv

Write to HDFS, reading from stdin

echo "blah blah blah" | hdfs dfs -put - tdataset/tfile.txt
hdfs dfs -ls -R
hdfs dfs -cat tdataset/tfile.txt

HDFS – User Commands (fsck)
Remove a file

hdfs dfs -rm tdataset/tfile.txt
hdfs dfs -ls -R

List the blocks of a file and their locations

hdfs fsck /user/cloudera/tdata/geneva.csv -files -blocks -locations

Print missing blocks and the files they belong to

hdfs fsck / -list-corruptfileblocks

HDFS – Administration Commands
Comprehensive status report of the HDFS cluster

hdfs dfsadmin -report

Print a tree of racks and their nodes

hdfs dfsadmin -printTopology

Get the information for a given DataNode (like ping)

hdfs dfsadmin -getDatanodeInfo localhost:50020

HDFS – Advanced Commands
Get a list of NameNodes in the Hadoop cluster

hdfs getconf -namenodes

Dump the NameNode fsimage to an XML file

cd /var/lib/hadoop-hdfs/cache/hdfs/dfs/name/current
hdfs oiv -i fsimage_0000000000000003388 -o /tmp/fsimage.xml -p XML

The general command line syntax is

hdfs command [genericOptions] [commandOptions]
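
As an illustration of that syntax, the sketch below passes the generic option `-D` to override `dfs.replication` for a single upload, ahead of the command-specific `-put` arguments. The wrapper name `put_with_replication` and the destination file name in the usage comment are hypothetical.

```shell
# Upload a file with a per-command replication factor by passing the
# generic option -D before the command-specific options.
# put_with_replication is a hypothetical wrapper name.
put_with_replication() {
  replication=$1
  src=$2
  dest=$3
  hdfs dfs -D dfs.replication="$replication" -put "$src" "$dest"
}

# Usage: store a copy of the tutorial file with 2 replicas
#   put_with_replication 2 tutorials/data/geneva.csv tdata/geneva_r2.csv
```

Here `dfs` is the command, `-D dfs.replication=2` is the generic option, and `-put` with its two paths forms the command options.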

Other Interfaces to HDFS
HTTP Interface

http://quickstart.cloudera:50070

MountableHDFS – FUSE

mkdir /home/cloudera/hdfs
sudo hadoop-fuse-dfs dfs://quickstart.cloudera:8020 /home/cloudera/hdfs

Once mounted, all operations on HDFS can be performed using standard Unix utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find' and 'grep'.

Q&A

E-mail: [email protected]
Blog: http://prasanthkothuri.wordpress.com
See also: https://db-blog.web.cern.ch/
