HDFS Notes

Big data is characterized by the 3 V's: volume, velocity, and variety. Hadoop is a framework for distributed storage and processing of big data across clusters of commodity hardware. It uses HDFS for storage and MapReduce for distributed processing. HDFS stores files reliably across machines as blocks, with a NameNode managing file metadata and DataNodes storing the blocks. HDFS provides high-throughput access to large datasets.


What is BIGDATA?

The 3 V's of BIGDATA:
• Volume: petabyte-scale data
• Velocity: fast-arriving data from sources such as social feeds and sensors
• Variety: structured, semi-structured, and unstructured data
What is Hadoop?
A new hardware and software approach to handling BIGDATA: a new hardware approach (clusters of low-cost commodity machines) combined with a new software approach (distributed storage and processing).

HDFS
A self-healing distributed filesystem running on clusters of commodity hardware, intended for storing large files with streaming data access patterns.
Principles of HDFS
• Highly fault-tolerant
• Designed to be deployed on low-cost hardware
• Highly scalable
• Provides high-throughput access to application data
• Suitable for applications that have large data sets (typically GBs to TBs)
• Portable across heterogeneous hardware and operating system platforms
• No support for random updates, but appends are allowed
HDFS Concepts
• A file is split into blocks for storage in HDFS. Blocks of the same file are distributed across multiple machines in the cluster.
• Concept of a block:
  – The minimum amount of data that can be read or written.
  – Size on a normal filesystem: a few kilobytes (typically 4 KB).
  – Size in HDFS: 64 MB by default (128 MB from Hadoop 2.x onward); the block size is configurable, and raising it to 128 MB is common (see the config sketch below).
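
The block size is set with the dfs.blocksize property in hdfs-site.xml (dfs.block.size in Hadoop 1.x). A minimal sketch, assuming a Hadoop 2.x cluster; the 128 MB value is illustrative, not a recommendation:

<property>
  <name>dfs.blocksize</name>
  <!-- 128 MB, expressed in bytes -->
  <value>134217728</value>
</property>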
Namenode = master
• Manages the filesystem namespace (the filesystem tree and metadata for directories and files) and maintains the EditLog file. The namespace image and EditLog are stored persistently on disk.
• Stores the information about which DataNodes hold the blocks of a given file. This information is kept in RAM; it is rebuilt from DataNode block reports rather than persisted.
Datanode = slave
• Serves as storage for data blocks.
• Responsible for serving read and write requests from clients.
• Sends periodic "heartbeats" to the NameNode and also sends block reports listing the blocks it stores.
Write Path in HDFS
To write a file, the client asks the NameNode to create the file entry and receives a list of DataNodes for each block. The client streams each block to the first DataNode, which forwards it along a replication pipeline to the remaining DataNodes; acknowledgements travel back through the pipeline.

Read Path in HDFS
To read a file, the client asks the NameNode for the block locations, then reads each block directly from the closest DataNode holding a replica; file data never flows through the NameNode. A client-side sketch of both paths follows.
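
A minimal sketch of both paths from the client side, using the HDFS shell (paths and filenames are illustrative):

# Write path: the client streams the file's blocks to a
# pipeline of DataNodes chosen by the NameNode
hdfs dfs -put /tmp/sample.txt /user/cloudera/sample.txt

# Read path: the client gets block locations from the NameNode,
# then reads the blocks directly from the DataNodes
hdfs dfs -cat /user/cloudera/sample.txt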
Fault Tolerance and Self-Healing in HDFS

Detecting DataNode Failures: HeartBeat
Each DataNode sends a periodic heartbeat to the NameNode. If no heartbeat arrives within the configured timeout, the NameNode marks the DataNode as dead and re-replicates its blocks on the remaining DataNodes, restoring the replication factor without manual intervention.
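
Cluster and DataNode health can be checked from the command line; a short sketch, assuming HDFS superuser access:

# Summarizes capacity plus live/dead DataNodes, including each
# DataNode's last heartbeat ("Last contact") time
hdfs dfsadmin -report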
Filesystem Metadata
• The HDFS namespace is stored by the NameNode.
• The NameNode uses a transaction log called the EditLog to record every change that occurs to the filesystem metadata.
  – For example, creating a new file.
  – Changing the replication factor of a file.
  – The EditLog is stored in the NameNode's local filesystem.
• The entire filesystem namespace, including the mapping of blocks to files and filesystem properties, is stored in a file called FsImage, also in the NameNode's local filesystem.
• The EditLog is periodically merged into the FsImage (a checkpoint; in classic HDFS this merge is performed by the Secondary NameNode).
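
Both metadata files can be inspected offline with Hadoop's viewer tools; a sketch, assuming access to the NameNode's metadata directory (the filenames shown are illustrative):

# Dump an FsImage checkpoint to XML with the Offline Image Viewer
hdfs oiv -p XML -i fsimage_0000000000000000042 -o fsimage.xml

# Dump an EditLog segment to XML with the Offline Edits Viewer
hdfs oev -i edits_0000000000000000001-0000000000000000042 -o edits.xml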
HDFS Access
• WebHDFS
HDFS Shell Commands

-ls path
    Lists the contents of the directory specified by path, showing the names, permissions, owner, size and modification date for each entry.

-lsr path
    Behaves like -ls, but recursively displays entries in all subdirectories of path.

-du path
    Shows disk usage, in bytes, for all files which match path; filenames are reported with the full HDFS protocol prefix.

-mv src dest
    Moves the file or directory indicated by src to dest, within HDFS.

-cp src dest
    Copies the file or directory identified by src to dest, within HDFS.

-rm path
    Removes the file or empty directory identified by path.

-rmr path
    Removes the file or directory identified by path. Recursively deletes any child entries (i.e., files or subdirectories of path).

-put localSrc dest
    Copies the file or directory from the local file system identified by localSrc to dest within HDFS.

-copyFromLocal localSrc dest
    Identical to -put.

-moveFromLocal localSrc dest
    Copies the file or directory from the local file system identified by localSrc to dest within HDFS, then deletes the local copy on success.

-get [-crc] src localDest
    Copies the file or directory in HDFS identified by src to the local file system path identified by localDest.
HDFS Shell Commands (continued)

-copyToLocal [-crc] src localDest
    Identical to -get.

-moveToLocal [-crc] src localDest
    Works like -get, but deletes the HDFS copy on success.

-cat filename
    Displays the contents of filename on stdout.

-mkdir path
    Creates a directory named path in HDFS. Creates any parent directories in path that are missing (like mkdir -p in Linux).

-test -[ezd] path
    Tests whether path exists (-e), has zero length (-z), or is a directory (-d). Exits with status 0 if the test succeeds and 1 otherwise.

-stat [format] path
    Prints information about path. format is a string which accepts the file size in blocks (%b), filename (%n), block size (%o), replication (%r), and modification date (%y, %Y).

-tail [-f] file
    Shows the last 1 KB of file on stdout.

-chmod [-R] mode,mode,... path...
    Changes the file permissions associated with one or more objects identified by path.... Performs changes recursively with -R. mode is a 3-digit octal mode, or {augo}+/-{rwxX}. Assumes a if no scope is specified and does not apply a umask.

-chown [-R] [owner][:[group]] path...
    Sets the owning user and/or group for files or directories identified by path.... Sets the owner recursively if -R is specified.

-help cmd
    Returns usage information for one of the commands listed above. You must omit the leading '-' character in cmd.
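
A short worked session combining several of these commands (directory and file names are illustrative):

hdfs dfs -mkdir /user/cloudera/demo           # create a directory
hdfs dfs -put notes.txt /user/cloudera/demo   # upload a local file
hdfs dfs -ls /user/cloudera/demo              # list its contents
hdfs dfs -cat /user/cloudera/demo/notes.txt   # print the file to stdout
hdfs dfs -rm /user/cloudera/demo/notes.txt    # remove the file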
Enable WebHDFS in Your Cluster

Step 1: Add the following property to hdfs-site.xml to enable WebHDFS access:
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>

Location of hdfs-site.xml: /etc/hadoop/conf/hdfs-site.xml

Step 2: Restart the HDFS service from Cloudera Manager.


1) Create a directory called temp under /user/cloudera (HTTP PUT):

http://localhost:50070/webhdfs/v1/user/cloudera/temp?user.name=cloudera&op=MKDIRS

2) Get the status of the directory /user/cloudera (HTTP GET):

http://localhost:50070/webhdfs/v1/user/cloudera?user.name=cloudera&op=GETFILESTATUS

3) Create and write into a file

4) Open and read a file

Operations 3) and 4) are sketched with curl below.
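
A sketch of operations 3) and 4) with curl (the file name is illustrative). Writing is a two-step protocol: the NameNode answers the first PUT with a 307 redirect, and the data is then sent to the DataNode address from the Location header:

# 3) Create and write into a file
# Step A: ask the NameNode; the response carries a 307 Location header
curl -i -X PUT "http://localhost:50070/webhdfs/v1/user/cloudera/temp/notes.txt?user.name=cloudera&op=CREATE"
# Step B: send the file contents to the DataNode URL from that header
curl -i -X PUT -T notes.txt "<Location-URL-from-step-A>"

# 4) Open and read a file (-L makes curl follow the redirect)
curl -i -L "http://localhost:50070/webhdfs/v1/user/cloudera/temp/notes.txt?user.name=cloudera&op=OPEN"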
