Hadoop Intro and HDFS
Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult
to process them using traditional data processing applications.
The Apache Hadoop software library is a framework that allows for the
distributed processing of large data sets across clusters of computers using simple
programming models.
The library itself is designed to detect and handle failures at the application layer,
thereby delivering a highly available service.
Hadoop Distributed File System (HDFS)
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• Fault-tolerant storage system
HDFS Layer:
Stores files across storage nodes in a Hadoop cluster.
Consists of:
• NameNode & DataNodes

NameNode
• Maps blocks to DataNodes
• Controls read/write access to files
• Manages the replication engine for blocks

DataNode
• Responsible for serving read and write requests (block creation, deletion, and replication)

Hadoop Daemons (MRv1)

JobTracker
• Accepts Map-Reduce jobs from users
• Assigns tasks to the TaskTrackers and monitors their status

TaskTracker
• Runs Map-Reduce tasks
• Sends heartbeats to the JobTracker
• Retrieves job resources from HDFS
Limitations of MRv1 (addressed by YARN):
• Scalability
• Batch processing only
• Static partitioning of resources
Hadoop Distributed File System (Hadoop 2.x)
• Distributed, scalable, and portable file system written in Java for the Hadoop framework
• HDFS High Availability
• Fault-tolerant storage system
YARN Engine:
Processes vast amounts of data in parallel on large clusters in a reliable and fault-tolerant manner.
Consists of:
• ResourceManager & NodeManagers
Hadoop Daemons (YARN)

ResourceManager
• Accepts Map-Reduce or application jobs from users
• Assigns tasks to the NodeManagers and monitors their status

NodeManager
• Runs application tasks
• Sends heartbeats to the ResourceManager
• Retrieves application resources from HDFS
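On a live cluster, each daemon runs as its own JVM, so a quick way to verify them is the JDK's jps tool, which lists running Java processes. A sketch for a pseudo-distributed Hadoop 2.x node (the PIDs shown are illustrative):

jps
2401 NameNode
2538 DataNode
2743 SecondaryNameNode
2910 ResourceManager
3057 NodeManager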
HDFS Design Goals
Hardware Failure - Detection of faults and quick, automatic recovery
Rack Awareness - the DNSToSwitchMapping interface maps node addresses to rack IDs (e.g., /rack1 and /rack2); the resulting rack topology drives replica placement.
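The default DNSToSwitchMapping implementation, ScriptBasedMapping, delegates to a user-supplied script named by the net.topology.script.file.name property in core-site.xml (topology.script.file.name in earlier releases). A minimal sketch of such a script (the subnets and rack names are hypothetical):

#!/bin/bash
# topology.sh: receives node names/IPs as arguments and
# prints one rack ID per argument, in order.
for host in "$@"; do
  case "$host" in
    10.0.1.*) echo "/rack1" ;;
    10.0.2.*) echo "/rack2" ;;
    *)        echo "/default-rack" ;;
  esac
done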
Replica Placement - critical to HDFS reliability and performance. With the default replication factor of 3:
a) First replica is placed on the node where the client is running (or on a random node if the client is outside the cluster).
b) Second replica is placed on a node in a different rack.
c) Third replica is placed on the same rack as the second, but on a different node chosen at random.
Replica Selection - HDFS tries to satisfy a read request from a replica that is closest to the
reader.
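To inspect where the replicas of a particular file actually landed, fsck can print block and rack information (the path below is hypothetical):

hadoop fsck /user/alice/data.txt -files -blocks -racks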
FileSystem Image and Edit Logs
The fsimage file is a persistent checkpoint of the filesystem metadata.
When a client performs a write operation, it is first recorded in the edit log.
In an HA deployment, the standby NameNode performs the checkpointing process itself, without recourse to a Secondary NameNode.
Block locations are not persisted by the NameNode; this information resides with the DataNodes, each of which reports the list of blocks it is storing.
Safe mode is needed to give the DataNodes time to check in to the NameNode with their block lists.
Safe mode is exited when the minimal replication condition is reached, plus an extension time of 30 seconds.
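Both files can be examined offline. In Hadoop 2.x, the oiv and oev viewers dump an fsimage or edit log to readable XML (the transaction-ID suffixes below are illustrative):

hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml
hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml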
Administration
HDFS Trash
HDFS Quotas
Safe Mode
FS Shell
dfsadmin Command
HDFS Trash – Recycle Bin
When a file is deleted by a user, it is not immediately removed from HDFS; instead, it is moved into the /trash directory.
File : core-site.xml
Property : fs.trash.interval
Description : Number of minutes after which a trash checkpoint is deleted. If zero, the trash feature is disabled.
A file remains in /trash for this configurable amount of time. After the expiry of its life in /trash, the NameNode deletes the file from the HDFS namespace.
File : core-site.xml
Property : fs.trash.checkpoint.interval
Description : Number of minutes between trash checkpoints. Should be smaller than or equal to fs.trash.interval.
Undelete a file: navigate the /trash directory and retrieve the file with the mv command.
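A sketch of the delete/undelete cycle (the path is hypothetical; note that recent Hadoop releases keep a per-user trash under /user/<username>/.Trash rather than a global /trash):

hadoop fs -rm /user/alice/report.txt
# restore it from trash with mv
hadoop fs -mv /user/alice/.Trash/Current/user/alice/report.txt /user/alice/report.txt
# bypass trash entirely
hadoop fs -rm -skipTrash /user/alice/report.txt
# force-empty the trash
hadoop fs -expunge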
HDFS Quotas
Name Quota - a hard limit on the number of file and directory names in the tree rooted at that directory.
dfsadmin -setQuota <N> <directory>... Set the name quota to be N for each directory.
dfsadmin -clrQuota <directory>... Remove any name quota for each directory.
Space Quota - a hard limit on the number of bytes used by files in the tree rooted at that directory.
dfsadmin -setSpaceQuota <N> <directory>... Set the space quota to be N bytes for each directory.
dfsadmin -clrSpaceQuota <directory>... Remove any space quota for each directory.
Reporting Quota - the count command of the HDFS shell reports quota values and the current count of names and bytes in use. With the -q option, it also reports the name quota value set for each directory, the available name quota remaining, the space quota value set, and the available space quota remaining.
fs -count -q <directory>...
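A short usage sketch (the directory /user/alice is hypothetical; -setSpaceQuota also accepts binary suffixes such as 10g for 10 gigabytes):

hadoop dfsadmin -setQuota 1000 /user/alice
hadoop dfsadmin -setSpaceQuota 10g /user/alice
hadoop fs -count -q /user/alice
hadoop dfsadmin -clrQuota /user/alice
hadoop dfsadmin -clrSpaceQuota /user/alice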
FS Shell – Some Basic Commands
cat
hadoop fs -cat URI [URI …]
Copies source paths to stdout.
chgrp
hadoop fs -chgrp [-R] GROUP URI [URI …]
Change group association of files. With -R, make the change recursively through the directory structure.
chmod
hadoop fs -chmod -R 777 hdfs://nn1.example.com/file1
Change the permissions of files. With -R, make the change recursively through the directory structure.
copyFromLocal / put
hadoop fs -copyFromLocal <localsrc> URI
Copy single src, or multiple srcs from local file system to the destination filesystem
copyToLocal / get
hadoop fs -copyToLocal URI <localdst>
Copy files to the local file system.
FS Shell – Commands Continued…
expunge
hadoop fs -expunge
Empty the Trash.
mkdir
hadoop fs -mkdir <paths>
Takes path uri's as argument and creates directories.
rmr
hadoop fs -rmr /user/hadoop/dir
Recursive version of delete.
touchz
hadoop fs -touchz pathname
Create a file of zero length.
du
hadoop fs -du URI [URI …]
Displays aggregate length of files contained in the directory, or the length of a file in case it is just a file.
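Putting several of these together, a short session (all file and directory names are hypothetical; assumes /user/alice already exists):

hadoop fs -mkdir /user/alice/demo
hadoop fs -put notes.txt /user/alice/demo/
hadoop fs -cat /user/alice/demo/notes.txt
hadoop fs -du /user/alice/demo
hadoop fs -get /user/alice/demo/notes.txt ./notes-copy.txt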
dfsadmin Command
bin/hadoop dfsadmin [Generic Options] [Command Options]

-safemode enter | leave | get | wait
Safe mode maintenance command. Safe mode can also be entered manually, but then it can only be turned off manually as well.

-refreshNodes
Re-read the hosts and exclude files to update the set of DataNodes that are allowed to connect to the NameNode and those that should be decommissioned or recommissioned.

-metasave filename
Save the NameNode's primary data structures to filename in the directory specified by the hadoop.log.dir property. filename is overwritten if it exists, and will contain one line for each of the following:
1. DataNodes heart-beating with the NameNode
2. Blocks waiting to be replicated
3. Blocks currently being replicated
4. Blocks waiting to be deleted
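For example (a hedged sketch; actual output depends on cluster state):

bin/hadoop dfsadmin -safemode get
bin/hadoop dfsadmin -safemode enter
bin/hadoop dfsadmin -metasave meta.log
bin/hadoop dfsadmin -safemode leave
bin/hadoop dfsadmin -refreshNodes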
Modes
Pseudo Distributed
• Daemons run on a single node
• Each Hadoop daemon runs in a separate Java process
Fully Distributed
• Master-Slave architecture
• One machine is designated as the NameNode and another as the ResourceManager (both can run on the same machine as well)
• The rest of the machines in the cluster act as both DataNode and NodeManager
Cluster Setup Steps
(i) Create a dedicated user & group
(ii) Create the Hadoop folder
(iii) Hadoop configuration
(iv) Establish authentication among nodes
(v) Remote-copy the Hadoop folder to the slave nodes
(vi) Start the Hadoop cluster
(vii) Testing
(viii) Run a simple WordCount program
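A hedged command sketch of steps (i)-(viii) for a small cluster. The user, host, and path names (hduser, slave1, /usr/local/hadoop) are hypothetical, and it assumes a Hadoop 2.x installation with the bin and sbin directories on the PATH:

# (i) dedicated user & group
sudo groupadd hadoop
sudo useradd -m -g hadoop hduser
# (ii)-(iii) create the Hadoop folder and edit the config files under etc/hadoop
# (iv) passwordless SSH from the master to each slave
ssh-keygen -t rsa -P ""
ssh-copy-id hduser@slave1
# (v) remote-copy the Hadoop folder to each slave
scp -r /usr/local/hadoop hduser@slave1:/usr/local/
# (vi) format HDFS once, then start the daemons
hdfs namenode -format
start-dfs.sh
start-yarn.sh
# (vii)-(viii) test with the bundled WordCount example
hadoop fs -mkdir -p /input
hadoop fs -put sample.txt /input/
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar wordcount /input /output
hadoop fs -cat /output/part-r-00000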