Introduction To HDFS
What’s HDFS
• HDFS is a distributed file system that is fault tolerant,
scalable and extremely easy to expand.
• HDFS is the primary distributed storage for Hadoop
applications.
• HDFS provides interfaces for applications to move
themselves closer to data.
• HDFS is designed to ‘just work’; however, a working
knowledge helps with diagnostics and improvements.
Components of HDFS
There are two (and a half) types of machines in an HDFS
cluster:
• NameNode – the heart of an HDFS filesystem; it
maintains and manages the file system metadata, e.g.
which blocks make up a file and on which DataNodes
those blocks are stored.
• DataNode – where HDFS stores the actual data; there
are usually quite a few of these.
HDFS Architecture
Unique features of HDFS
HDFS also has a bunch of unique features that make it ideal for distributed systems:
HDFS – Data Organization
• Each file written into HDFS is split into data blocks
• Each block is stored on one or more nodes
• Each copy of a block is called a replica
• Block placement policy
  • First replica is placed on the local node
  • Second replica is placed on a node in a different rack
  • Third replica is placed on a different node in the same rack as the second replica
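The effect of block size and replication on storage can be sketched with a little shell arithmetic. This is an illustration only: `blocks_needed` and `raw_bytes` are hypothetical helper names, not HDFS commands, and the 256 MiB block size and replication factor 3 are simply the values used in the configuration example in this deck.

```shell
#!/bin/sh
# Hypothetical helpers (not part of HDFS): estimate how a file is laid out.

# Number of blocks = ceiling(file size / block size).
blocks_needed() {
    file_bytes=$1
    block_bytes=$2
    echo $(( (file_bytes + block_bytes - 1) / block_bytes ))
}

# Raw cluster storage = file size * replication factor.
# The last block only occupies as many bytes as it actually holds.
raw_bytes() {
    file_bytes=$1
    replication=$2
    echo $(( file_bytes * replication ))
}

# A 1 GiB file with 256 MiB blocks and replication factor 3.
blocks_needed 1073741824 268435456
raw_bytes 1073741824 3
```

A file that is one byte larger than 1 GiB would need a fifth block, but that block would hold only the single extra byte.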
Read Operation in HDFS
Write Operation in HDFS
HDFS Security
• Authentication to Hadoop
  • Simple – insecure; the OS username determines the Hadoop identity
  • Kerberos – authentication using a Kerberos ticket
  • Set by hadoop.security.authentication=simple|kerberos
• File and directory permissions follow the POSIX model
  • read (r), write (w), and execute (x) permissions
  • each file and directory has an owner, a group and a mode
  • enabled by default (dfs.permissions.enabled=true)
• ACLs implement permissions that differ from the natural
  hierarchy of users and groups
  • enabled by dfs.namenode.acls.enabled=true
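The permission and ACL settings above map onto `hdfs dfs` subcommands. The following is a minimal sketch, assuming the `hdfs` client is on the PATH and ACLs are enabled on the NameNode. The wrapper functions, the `DRY_RUN` switch and the user name `alice` are made up for illustration; only `-chmod`, `-setfacl` and `-getfacl` are actual `hdfs dfs` options.

```shell
#!/bin/sh
# Sketch only: function names, DRY_RUN and the user "alice" are illustrative.
# Set DRY_RUN=1 to print each command instead of running it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# POSIX-style mode: owner rwx, group r-x, others no access.
restrict_dir() {
    run hdfs dfs -chmod 750 "$1"
}

# ACL entry: give one extra user read access without changing owner or group.
grant_read() {
    run hdfs dfs -setfacl -m "user:$1:r--" "$2"
}

# Show the resulting ACL.
show_acl() {
    run hdfs dfs -getfacl "$1"
}
```

For example, `DRY_RUN=1 grant_read alice tdata/geneva.csv` prints the `-setfacl` command that would be run, which is a cheap way to review a change before applying it.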
HDFS Configuration
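HDFS Defaults
• Block Size – 128 MB in Hadoop 2.x and later (64 MB in Hadoop 1.x)
• Replication Factor – 3
• Web UI Port – 50070 (9870 from Hadoop 3.0)
Site-specific values are set in hdfs-site.xml, for example: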
<property>
<name>dfs.blocksize</name>
<value>268435456</value>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<name>dfs.namenode.http-address</name>
<value>itracXXX.cern.ch:50070</value>
</property>
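To check which value a client actually picks up, the configuration can be queried with `hdfs getconf -confKey`. The sketch below assumes the `hdfs` client is on the PATH and that `dfs.blocksize` is stored as a plain byte count, as in the example above (HDFS also accepts suffixed values such as `128m`, which this helper does not handle). The function names are hypothetical.

```shell
#!/bin/sh
# Convert a byte count (as stored in dfs.blocksize) to MiB.
bytes_to_mib() {
    echo $(( $1 / 1024 / 1024 ))
}

# Print the effective block size of the local client configuration in MiB.
# Assumes the hdfs CLI is installed and the value is a plain number of bytes.
effective_blocksize_mib() {
    bytes_to_mib "$(hdfs getconf -confKey dfs.blocksize)"
}

# The value from the example configuration: 268435456 bytes = 256 MiB.
bytes_to_mib 268435456
```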
Interfaces to HDFS
• Java API (DistributedFileSystem)
• C wrapper (libhdfs)
• HTTP protocol
• WebDAV protocol
• Shell Commands
Of these, the command line is one of the simplest
and most familiar.
HDFS – Shell Commands
There are two types of shell commands:
User Commands
hdfs dfs – runs filesystem commands on HDFS
hdfs fsck – runs an HDFS filesystem checking command
Administration Commands
hdfs dfsadmin – runs HDFS administration commands
HDFS – User Commands (dfs)
List directory contents
hdfs dfs -ls
hdfs dfs -ls /
hdfs dfs -ls -R /var
HDFS – User Commands (acls)
List acl for a file
hdfs dfs -getfacl tdata/geneva.csv
HDFS – User Commands (fsck)
Removing a file
hdfs dfs -rm tdataset/tfile.txt
hdfs dfs -ls -R
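For the filesystem check named in the slide title, a minimal sketch follows. It assumes the `hdfs` client is on the PATH; `check_path` is a hypothetical wrapper name, and `/` is used as the default path only as an example. The `-files`, `-blocks` and `-locations` options ask `hdfs fsck` to list each file, its blocks and the DataNodes holding them.

```shell
#!/bin/sh
# Sketch: report health and block locations for one path.
# check_path is an illustrative wrapper; it defaults to the root directory.
check_path() {
    hdfs fsck "${1:-/}" -files -blocks -locations
}

# Example call (requires a configured cluster):
#   check_path tdataset
```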
HDFS – Administration Commands
Comprehensive status report of HDFS cluster
hdfs dfsadmin -report
HDFS – Advanced Commands
Get a list of namenodes in the Hadoop cluster
hdfs getconf -namenodes
Other Interfaces to HDFS
HTTP Interface
http://quickstart.cloudera:50070
MountableHDFS – FUSE
mkdir /home/cloudera/hdfs
sudo hadoop-fuse-dfs dfs://quickstart.cloudera:8020 /home/cloudera/hdfs
Once mounted, files on HDFS can be worked with using standard Unix
utilities such as 'ls', 'cd', 'cp', 'mkdir', 'find' and 'grep'.
Q&A
E-mail: [email protected]
Blog: http://prasanthkothuri.wordpress.com
See also: https://db-blog.web.cern.ch/