04-Hadoop Distributed File System
Unit objectives
• Understand the basic need for a big data strategy in terms of parallel
reading of large data files and internode network speed in a cluster
• Describe the nature of the Hadoop Distributed File System (HDFS)
• Explain the function of the NameNode and DataNodes in a Hadoop
cluster
• Explain how files are stored and blocks ("splits") are replicated
What is Hadoop?
• Apache open source software framework for reliable, scalable,
distributed computing over massive amounts of data:
▪ hides underlying system details and complexities from user
▪ developed in Java
• Consists of four subprojects:
▪ MapReduce
▪ Hadoop Distributed File System (HDFS)
▪ YARN
▪ Hadoop Common
What is Hadoop?
Hadoop is an open source project of the Apache Foundation.
It is a framework, written in Java, that was originally developed by Doug Cutting, who
named it after his son's toy elephant.
Hadoop uses Google's MapReduce and Google File System (GFS) technologies as its
foundation.
It is optimized to handle massive amounts of data which could be structured,
unstructured or semi-structured, and uses commodity hardware (relatively inexpensive
computers).
This massive parallel processing is done with great performance. However, it is a batch
operation handling massive amounts of data, so the response time is not immediate. As
of Hadoop version 0.20.2, updates are not possible, but appends were made possible
starting in version 0.21.
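As a quick illustration (a sketch only; the file names are hypothetical), an append to an
existing HDFS file can be issued from the shell on current releases:
hdfs dfs -appendToFile update.log /user/virtuser/app.log   # append local file contents to an existing HDFS file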
What is the value of a system if the information it stores or retrieves is not consistent?
Hadoop replicates its data across different computers, so that if one goes down, the
data is processed on one of the replicated computers.
You may be familiar with OLTP (Online Transaction Processing) workloads, where
data is randomly accessed in structured storage such as a relational database, for
example when you access your bank account.
You may also be familiar with OLAP (Online Analytical Processing) or DSS (Decision
Support Systems) workloads, where data is sequentially accessed in structured storage,
such as a relational database, to generate reports that provide business intelligence.
You may not be as familiar with the concept of big data. Big data is a term used to
describe large collections of data (also known as datasets) that may be unstructured
and that grow so large and so quickly that they are difficult to manage with regular
database or statistics tools.
Hadoop is used for neither OLTP nor OLAP; it is used for big data, and it complements
those two approaches to managing data. Hadoop is not a replacement for an RDBMS.
Timeline:
• 2002-2003: Doug Cutting and Mike Cafarella start the Nutch project to handle
billions of searches and index millions of web pages.
• Oct 2003: Google publishes its GFS (Google File System) paper.
• Dec 2004: Google publishes its MapReduce paper.
• 2005: Nutch adopts the GFS and MapReduce designs to perform its operations.
• 2006: Hadoop is created at Yahoo! (with Doug Cutting and team), based on GFS
and MapReduce.
• 2007: Yahoo! starts using Hadoop on a 1,000-node cluster.
• Jan 2008: Hadoop becomes a top-level Apache project.
• Jul 2008: A 4,000-node cluster is tested successfully with Hadoop.
• 2009: Hadoop successfully sorts a petabyte of data in less than 17 hours.
• Dec 2011: Hadoop releases version 1.0.
For later releases, and the release numbering structure, refer to:
https://wiki.apache.org/hadoop/Roadmap
HDFS architecture
• Master/Slave architecture
• Master: NameNode
▪ manages the file system namespace and metadata
― FsImage
― Edits Log
▪ regulates client access to files
• Slave: DataNode
▪ many per cluster
▪ manages storage attached to the nodes
▪ periodically reports status to NameNode
[Figure: File1 is split into blocks a, b, c, d; the NameNode holds the metadata, while each block is replicated three times across four DataNodes.]
HDFS architecture
Important points still to be discussed:
• Secondary NameNode
• Import checkpoint
• Rebalancing
• SafeMode
• Recovery Mode
The entire file system namespace, including the mapping of blocks to files and file
system properties, is stored in a file called the FsImage. The FsImage is stored as a file
in the NameNode's local file system. It contains the metadata on disk (not an exact copy
of what is in RAM, but a checkpoint copy).
The NameNode uses a transaction log called the EditLog (or edits log) to persistently
record every change that occurs to file system metadata; it is synchronized with the
metadata in RAM after each write.
The NameNode can be a potential single point of failure (this has been resolved in later
releases of HDFS with Secondary NameNode, various forms of high availability, and in
Hadoop v2 with NameNode federation and high availability as out-of-the-box options).
• Use better quality hardware for all management nodes, and in particular do not
use inexpensive commodity hardware for the NameNode.
• Mitigate by backing up to other storage.
In case of a power failure on the NameNode, recovery is performed using the FsImage
and the EditLog.
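As an illustration of that mitigation (a sketch; the backup directory is hypothetical), a
recent copy of the FsImage can be pulled from the NameNode onto other storage:
# download the most recent fsimage from the NameNode to a local backup directory
hdfs dfsadmin -fetchImage /backup/namenode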
HDFS blocks
• HDFS is designed to support very large files
• Each file is split into blocks: the Hadoop default is 128 MB
• Blocks reside on different physical DataNodes
• Behind the scenes, each HDFS block is supported by multiple
operating system blocks
[Figure: a single 64 MB HDFS block mapped onto many smaller OS blocks]
• If a file or a chunk of the file is smaller than the block size, only the
needed space is used. For example, with a 64 MB block size, a 210 MB file is split as:
64 MB | 64 MB | 64 MB | 18 MB
HDFS blocks
Data in a Hadoop cluster is broken down into smaller pieces (called blocks) and
distributed throughout the cluster. In this way, the map and reduce functions can be
executed on smaller subsets of your larger data sets, and this provides the scalability
that is needed for big data processing. See https://hortonworks.com/apache/hdfs.
In earlier versions of Hadoop/HDFS, the default blocksize was often quoted as 64 MB,
but the current default setting for Hadoop/HDFS is noted in
http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/hdfs-
default.xml
• dfs.blocksize = 134217728
Default block size for new files, in bytes. You can use the following suffix (case
insensitive): k(kilo), m(mega), g(giga), t(tera), p(peta), e(exa) to specify the size
(such as 128k, 512m, 1g, etc.), or you can provide the complete size in bytes
(such as 134217728 for 128 MB).
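For example (a minimal sketch; the file and directory names are hypothetical), the block
size can also be overridden per file at write time, and checked afterwards:
# write a file with a 256 MB block size instead of the configured default
hdfs dfs -D dfs.blocksize=268435456 -put largefile.dat /user/virtuser/largefile.dat
# report the block size actually used for the file (%o = block size)
hdfs dfs -stat "%o" /user/virtuser/largefile.dat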
It should be noted that Linux itself has both a logical block size (typically 4 KB) and a
physical or hardware block size (typically 512 bytes).
Linux filesystems:
• For ext2 or ext3, the situation is relatively simple: each file occupies a certain
number of blocks. All blocks on a given filesystem have the same size, usually
one of 1024, 2048 or 4096 bytes.
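To check the logical block size of an ext filesystem (a sketch; the device name is
hypothetical and root privileges are assumed):
sudo tune2fs -l /dev/xvda2 | grep "Block size"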
• What is the physical blocksize?
[clsadmin@chs-gbq-108-mn003 ~]$ lsblk -o NAME,PHY-SEC
NAME PHY-SEC
xvda 512
├─xvda1 512
└─xvda2 512
xvdb 512
└─xvdb1 512
xvdc 512
├─xvdc1 512
├─xvdc2
├─xvdc3
├─xvdc4
└─xvdc5
• Approach:
▪ first replica goes on any node in the cluster
▪ second replica on a node in a different rack
▪ third replica on a different node in the second rack
• This approach cuts inter-rack network bandwidth, which improves write performance
A simple alternative policy is to place each replica on a unique rack; that evenly
distributes replicas in the cluster, which makes it easy to balance load on component
failure. However, this policy increases the cost of writes because a write needs to
transfer blocks to multiple racks.
For the common case, when the replication factor is three, HDFS's placement policy is
to put one replica on one node in the local rack, another on a different node in the local
rack, and the last on a different node in a different rack. This policy cuts the inter-rack
write traffic which generally improves write performance. The chance of rack failure is
far less than that of node failure; this policy does not impact data reliability and
availability guarantees. However, it does reduce the aggregate network bandwidth used
when reading data since a block is placed in only two unique racks rather than three.
With this policy, the replicas of a file do not evenly distribute across the racks. One third
of replicas are on one node, two thirds of replicas are on one rack, and the other third
are evenly distributed across the remaining racks. This policy improves write
performance without compromising data reliability or read performance.
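To see where the blocks of a particular file actually landed (a sketch; the path is
hypothetical), fsck can report files, blocks, replica locations, and racks:
hdfs fsck /user/virtuser/largefile.dat -files -blocks -locations -racks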
Rack awareness is configured by pointing the net.topology.script.file.name property in
core-site.xml at a topology script. A typical script maps each host listed in a
rack_topology.data control file to a rack name:
#!/bin/bash
# Rack topology script: given one or more hostnames/IPs as arguments,
# print the rack for each one, based on a lookup file in the Hadoop conf dir.

RACK_PREFIX=default

# To test, supply one or more hostnames as script input:
if [ $# -gt 0 ]; then
  CTL_FILE=${CTL_FILE:-"rack_topology.data"}
  HADOOP_CONF=${HADOOP_CONF:-"/etc/hadoop/conf"}

  # With no control file, every node is reported in the default rack.
  if [ ! -f ${HADOOP_CONF}/${CTL_FILE} ]; then
    echo -n "/$RACK_PREFIX/rack "
    exit 0
  fi

  while [ $# -gt 0 ] ; do
    nodeArg=$1
    exec< ${HADOOP_CONF}/${CTL_FILE}
    result=""
    # Each line of the control file is "<host> <rack-number>"
    while read line ; do
      ar=( $line )
      if [ "${ar[0]}" = "$nodeArg" ] ; then
        result="${ar[1]}"
      fi
    done
    shift
    if [ -z "$result" ] ; then
      echo -n "/$RACK_PREFIX/rack "
    else
      echo -n "/$RACK_PREFIX/rack_$result "
    fi
  done
else
  echo -n "/$RACK_PREFIX/rack "
fi
Compression of files
• File compression brings two benefits:
▪ reduces the space need to store files
▪ speeds up data transfer across the network or to/from disk
• But is the data splittable? (necessary for parallel reading)
• Use codecs, such as org.apache.hadoop.io.compress.SnappyCodec
Compression format   Algorithm   Filename extension   Splittable?
DEFLATE              DEFLATE     .deflate             No
gzip                 DEFLATE     .gz                  No
bzip2                bzip2       .bz2                 Yes
LZO                  LZO         .lzo                 No (yes, if indexed)
Snappy               Snappy      .snappy              No
Compression of files
gzip
gzip is naturally supported by Hadoop. gzip is based on the DEFLATE algorithm,
which is a combination of LZ77 and Huffman Coding.
bzip2
bzip2 is a freely available, patent free, high-quality data compressor. It typically
compresses files to within 10% to 15% of the best available techniques (the PPM
family of statistical compressors), whilst being around twice as fast at
compression and six times faster at decompression.
LZO
The LZO compression format is composed of many smaller (~256K) blocks of
compressed data, allowing jobs to be split along block boundaries. Moreover, it
was designed with speed in mind: it decompresses about twice as fast as gzip,
meaning it is fast enough to keep up with hard drive read speeds. It does not
compress quite as well as gzip; expect files that are on the order of 50% larger
than their gzipped version. But that is still 20-50% of the size of the files without
any compression at all, which means that IO-bound jobs complete the map phase
about four times faster.
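As an illustration (a sketch; the jar name and paths are hypothetical), output
compression for a MapReduce job can be enabled from the command line with the
standard output-compression properties:
hadoop jar hadoop-mapreduce-examples.jar wordcount \
  -D mapreduce.output.fileoutputformat.compress=true \
  -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
  /user/virtuser/input /user/virtuser/output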
NameNode startup
1. NameNode reads the fsimage into memory
2. NameNode applies the edits log changes
3. NameNode waits for block data from the DataNodes
▪ NameNode does not store the physical-location information of the blocks
▪ NameNode exits SafeMode when 99.9% of blocks have at least one copy
accounted for
[Figure: (1) the fsimage is read into the NameNode; (3) each DataNode (e.g., datanode1 with block1, block2, ... in its data directory) sends its block information to the NameNode.]
NameNode startup
During start up, the NameNode loads the file system state from the fsimage and the
edits log file. It then waits for DataNodes to report their blocks so that it does not
prematurely start replicating blocks, even though enough replicas may already exist in
the cluster.
During this time NameNode stays in SafeMode. SafeMode for the NameNode is
essentially a read-only mode for the HDFS cluster, where it does not allow any
modifications to file system or blocks. Normally the NameNode leaves SafeMode
automatically after the DataNodes have reported that most file system blocks are
available.
If required, HDFS can be placed in SafeMode explicitly using the command hdfs
dfsadmin -safemode. The NameNode front page shows whether SafeMode is on or
off.
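For example, SafeMode can be queried or toggled from the command line:
hdfs dfsadmin -safemode get     # report whether SafeMode is on or off
hdfs dfsadmin -safemode enter   # force the NameNode into read-only SafeMode
hdfs dfsadmin -safemode leave   # return to normal operation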
The client then tells the NameNode that the file is closed. At this point, the NameNode
commits the file creation operation into a persistent store. If the NameNode dies before
the file is closed, the file is lost.
The aforementioned approach was adopted after careful consideration of target
applications that run on HDFS. These applications need streaming writes to files. If a
client writes to a remote file directly without any client side buffering, the network speed
and the congestion in the network impacts throughput considerably. This approach is
not without precedent. Earlier distributed file systems, such as AFS, have used client
side caching to improve performance. A POSIX requirement has been relaxed to
achieve higher performance of data uploads.
Replication Pipelining
When a client is writing data to an HDFS file, its data is first written to a local file as
explained previously. Suppose the HDFS file has a replication factor of three. When the
local file accumulates a full block of user data, the client retrieves a list of DataNodes
from the NameNode. This list contains the DataNodes that will host a replica of that
block. The client then flushes the data block to the first DataNode. The first DataNode
starts receiving the data in small portions (4 KB), writes each portion to its local
repository and transfers that portion to the second DataNode in the list. The second
DataNode, in turn starts receiving each portion of the data block, writes that portion to
its repository and then flushes that portion to the third DataNode. Finally, the third
DataNode writes the data to its local repository. Thus, a DataNode can be receiving
data from the previous one in the pipeline and at the same time forwarding data to the
next one in the pipeline. The data is pipelined from one DataNode to the next.
For good descriptions of the process, see the tutorials at:
• HDFS Users Guide at http://hadoop.apache.org/docs/current/hadoop-project-
dist/hadoop-hdfs/HdfsUserGuide.html
• An introduction to the Hadoop Distributed File System at
https://hortonworks.com/apache/hdfs
• How HDFS works at https://hortonworks.com/apache/hdfs/#section_2
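Related to the replication factor discussed above (a sketch; the path and factor are
illustrative), the replication of an existing file can be changed from the shell, and the -w
flag waits until the change completes:
hdfs dfs -setrep -w 2 /user/virtuser/myfile.txt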
[Figure: Active NameNode and Standby NameNode]
Secondary NameNode
• During operation, the primary NameNode cannot merge the fsImage and edits log
• This is done on the Secondary NameNode
▪ Every couple of minutes, the Secondary NameNode copies the new edits log from the primary NameNode
▪ Merges the edits log into the fsimage
▪ Copies the new merged fsImage back to the primary NameNode
• Not HA, but faster startup time
▪ The Secondary NameNode does not have a complete image; in-flight transactions would be lost
▪ The primary NameNode needs to merge less during startup
• Was temporarily deprecated because of NameNode HA, but has some advantages
▪ (no need for quorum nodes, less network traffic, fewer moving parts)
Secondary NameNode
In the older approach, a Secondary NameNode is used.
The NameNode stores the HDFS filesystem information in a file named fsimage.
Updates to the file system (adding or removing blocks) do not update the fsimage file;
instead they are logged to an edits file, so the I/O is fast, append-only streaming rather
than random file writes. When restarting, the NameNode reads the fsimage and then
applies all the changes from the edits log to bring the filesystem state up to date in
memory. This process takes time.
The job of the Secondary NameNode is not to be a secondary to the NameNode, but
only to periodically read the filesystem change log and apply the changes to the fsimage
file, thus bringing it up to date. This allows the NameNode to start up faster the next time.
Unfortunately, the Secondary NameNode service is not a standby secondary
NameNode, despite its name. Specifically, it does not offer HA for the NameNode. This
is well illustrated in the slide above.
Note that more recent distributions have NameNode High Availability using NFS
(shared storage) and/or NameNode High Availability using a Quorum Journal Manager
(QJM).
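As a side note (a sketch), the checkpoint interval that governs how often the edits log is
merged into the fsimage can be read back from the active configuration:
hdfs getconf -confKey dfs.namenode.checkpoint.period   # seconds between checkpoints
hdfs getconf -confKey dfs.namenode.checkpoint.txns     # or after this many uncheckpointed transactions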
scheme://authority/path
• Scheme:
▪ For the local filesystem, the scheme is file
▪ For HDFS, the scheme is hdfs
• Authority is the hostname and port of the NameNode
hdfs dfs -copyFromLocal file:///myfile.txt hdfs://localhost:9000/user/virtuser/myfile.txt
• Scheme and authority are often optional
▪ Defaults are taken from configuration file core-site.xml
Just as for the ls command, the file system shell commands can take paths as
arguments. These paths can be expressed in the form of uniform resource identifiers or
URIs. The URI format consists of a scheme, an authority, and a path. There are multiple
schemes supported. The local file system has a scheme of "file". HDFS has a scheme
called "hdfs."
For example, if you want to copy a file called "myfile.txt" from your local filesystem to an
HDFS file system on the localhost, you can do this by issuing the command shown.
The copyFromLocal command takes a URI for the source and a URI for the destination.
"Authority" is the hostname of the NameNode. For example, if the NameNode is in
localhost and accessed on port 9000, the authority would be localhost:9000.
The scheme and the authority do not always need to be specified. Instead, you may rely
on the default values, which are taken from (and can be set in) a file named
core-site.xml in the conf directory of your Hadoop installation.
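For example, you can check which default scheme and authority your installation uses:
hdfs getconf -confKey fs.defaultFS    # for example, hdfs://localhost:9000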
HDFS supports many POSIX-like commands. HDFS is not a fully POSIX (Portable
operating system interface for UNIX) compliant file system, but it supports many of the
commands. The HDFS commands are mostly easily-recognized UNIX commands like
cat and chmod. There are also a few commands that are specific to HDFS such as
copyFromLocal.
Note that:
• localsrc and dst are placeholders for your actual file(s)
• localsrc can be a directory or a list of files separated by space(s)
• dst can be a new file name (in HDFS) for a single-file-copy, or a directory (in
HDFS), that is the destination directory
Example:
hdfs dfs -put *.txt ./Gutenberg
…copies all the text files in the local Linux directory with the suffix of .txt to the
directory Gutenberg in the user’s home directory in HDFS
The "direction" implied by the names of these commands (copyFromLocal, put) is
relative to the user, who can be thought to be situated outside HDFS.
Also, you should note there is no cd (change directory) command available for Hadoop.
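A short end-to-end sketch of the commands discussed here (the directory and file
names are hypothetical):
hdfs dfs -mkdir /user/virtuser/Gutenberg                  # create a directory in HDFS
hdfs dfs -put *.txt /user/virtuser/Gutenberg              # copy local .txt files into it
hdfs dfs -ls /user/virtuser/Gutenberg                     # list the copied files
hdfs dfs -cat /user/virtuser/Gutenberg/alice.txt | head   # view the first lines of one file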
The copyToLocal (aka get) command copies files out of the file system you specify
and into the local file system.
get
Usage: hdfs dfs -get [-ignorecrc] [-crc] <src> <localdst>
• Copy files to the local file system. Files that fail the CRC check may be copied
with the -ignorecrc option. Files and CRCs may be copied using the -crc option.
• Example: hdfs dfs -get hdfs:/mydir/file file:///home/hdpadmin/localfile
Another important note: for files in Linux, where you would use the file:// authority, two
slashes represent files relative to your current Linux directory (pwd). To reference files
absolutely, use three slashes (and mentally pronounce as "slash-slash pause slash").
Unit summary
• Understand the basic need for a big data strategy in terms of parallel
reading of large data files and internode network speed in a cluster
• Describe the nature of the Hadoop Distributed File System (HDFS)
• Explain the function of the NameNode and DataNodes in a Hadoop
cluster
• Explain how files are stored and blocks ("splits") are replicated
Checkpoint
1. True or False? Hadoop systems are designed for transaction
processing.
2. List the Hadoop open source projects.
3. What is the default number of replicas in a Hadoop system?
4. True or False? One of the driving principles of Hadoop is that the data is
brought to the program.
5. True or False? At least 2 NameNodes are required for a standalone
Hadoop cluster.
Checkpoint solution
1. True or False? Hadoop systems are designed for transaction
processing.
▪ Hadoop systems are not designed for transaction processing, and would perform
poorly at it. Hadoop systems are designed for batch processing.
2. List the Hadoop open source projects.
▪ To name a few, MapReduce, YARN, Ambari, Hive, HBase, etc.
3. What is the default number of replicas in a Hadoop system?
▪ 3
4. True or False? One of the driving principles of Hadoop is that the data
is brought to the program.
▪ False. The program is brought to the data, to eliminate the need to move large
amounts of data.
5. True or False? At least 2 NameNodes are required for a standalone
Hadoop cluster.
▪ Only 1 NameNode is required per cluster; 2 are required for high-availability.
Lab 1
• File access and basic commands with HDFS
Lab 1:
File access and basic commands with HDFS
Purpose:
This lab is intended to provide you with experience in using the Hadoop
Distributed File System (HDFS). The basic HDFS file system commands
learned here will be used throughout the remainder of the course.
You will also be moving some data into HDFS that will be used in later units of
this course. The files that you will need are stored in the Linux directory
/home/labfiles.
COMMAND_OPTION   Description
path             Start checking from this path.