1) Discuss the design of Hadoop Distributed File System (HDFS) and its concepts in detail.
Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop
Distributed Filesystem.
The Design of HDFS :
HDFS is a filesystem designed for storing very large files with streaming data access
patterns, running on clusters of commodity hardware.
Very large files:
“Very large” in this context means files that are hundreds of megabytes, gigabytes, or
terabytes in size. There are Hadoop clusters running today that store petabytes of data.
Streaming data access :
HDFS is built around the idea that the most efficient data processing pattern is a write-
once, read-many-times pattern. A dataset is typically generated or copied from source,
then various analyses are performed on that dataset over time.
Commodity hardware :
Hadoop doesn’t require expensive, highly reliable hardware to run on. It’s designed to
run on clusters of commodity hardware (commonly available hardware from multiple
vendors) for which the chance of node failure across the cluster is high, at least for
large clusters. HDFS is designed to carry on working without a noticeable interruption
to the user in the face of such failure.
Concepts in HDFS
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating
system and the namenode software. It is software that can be run on commodity
hardware. The system having the namenode acts as the master server and it does the
following tasks −
It manages the file system namespace.
It regulates clients’ access to files.
It also executes file system operations such as renaming, closing, and opening files
and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and
datanode software. For every node (commodity hardware/system) in a cluster, there
will be a datanode. These nodes manage the data storage of their system.
Datanodes perform read-write operations on the file systems, as per client
request.
They also perform operations such as block creation, deletion, and replication
according to the instructions of the namenode.
Block
The files in HDFS are divided into segments and stored in individual data nodes. These
file segments are called blocks. In other words, the minimum amount of data that
HDFS can read or write is called a block. The default block size is 64 MB in
Hadoop 1.x and 128 MB in Hadoop 2.x onwards, but it can be changed as required
in the HDFS configuration.
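As an illustration of the block concept, here is a minimal Java sketch (not part of the original notes; the path and the 256 MB value are illustrative) showing how a client could override the dfs.blocksize property and query the default block size reported by the filesystem:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override the block size used for files created through this
        // configuration: 268435456 bytes = 256 MB (illustrative value).
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);

        FileSystem fs = FileSystem.get(conf);
        // Print the default block size the filesystem reports for this path.
        System.out.println(fs.getDefaultBlockSize(new Path("/user/input")));
        fs.close();
    }
}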
2) Show how a client reads and writes data in HDFS. Give an example with code.
Inserting Data into HDFS
Assume we have data in a file called file.txt in the local system that needs to be saved
in the HDFS file system. Follow the steps given below to insert the required file into the
Hadoop file system.
Step 1
You have to create an input directory.
$ $HADOOP_HOME/bin/hadoop fs -mkdir /user/input
Step 2
Transfer and store a data file from the local system to the Hadoop file system using the
put command.
$ $HADOOP_HOME/bin/hadoop fs -put /home/file.txt /user/input
Step 3
You can verify the file using the ls command.
$ $HADOOP_HOME/bin/hadoop fs -ls /user/input
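The same steps can also be performed programmatically. Below is a minimal sketch using the Java FileSystem API, assuming the same /home/file.txt and /user/input paths as above (the class name is illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PutExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());

        // Step 1: create the input directory (hadoop fs -mkdir /user/input).
        fs.mkdirs(new Path("/user/input"));

        // Step 2: copy the local file into HDFS (hadoop fs -put).
        fs.copyFromLocalFile(new Path("/home/file.txt"), new Path("/user/input/file.txt"));

        // Step 3: list the directory to verify (hadoop fs -ls /user/input).
        for (FileStatus status : fs.listStatus(new Path("/user/input"))) {
            System.out.println(status.getPath());
        }
        fs.close();
    }
}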
OR
Anatomy of File Read in HDFS
Let’s get an idea of how data flows between the client interacting with HDFS, the name
node, and the data nodes with the help of a diagram. Consider the figure:
Step 1: The client opens the file it wishes to read by calling open() on the FileSystem
object (which for HDFS is an instance of DistributedFileSystem).
Step 2: Distributed File System (DFS) calls the name node, using remote procedure
calls (RPCs), to determine the locations of the first few blocks in the file. For each
block, the name node returns the addresses of the data nodes that have a copy of that
block. The DFS returns an FSDataInputStream to the client for it to read data from.
FSDataInputStream in turn wraps a DFSInputStream, which manages the data node
and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored
the data node addresses for the first few blocks in the file, then connects to the
first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read()
repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream closes the
connection to that data node and then finds the best data node for the next block. This
happens transparently to the client, which from its point of view is simply reading a
continuous stream. Blocks are read in order, with the DFSInputStream opening new
connections to data nodes as the client reads through the stream. It will also call
the name node to retrieve the data node locations for the next batch of blocks as
needed.
Step 6: When the client has finished reading the file, it calls close() on the
FSDataInputStream.
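A minimal Java sketch of this read path, assuming the /user/input/file.txt path used earlier (the class name is illustrative): open() returns the FSDataInputStream described above, and IOUtils.copyBytes() drives the repeated read() calls.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        // For HDFS, FileSystem.get() returns a DistributedFileSystem instance.
        FileSystem fs = FileSystem.get(new Configuration());

        // Steps 1-2: open() asks the name node (over RPC) for block locations
        // and returns an FSDataInputStream wrapping a DFSInputStream.
        FSDataInputStream in = fs.open(new Path("/user/input/file.txt"));
        try {
            // Steps 3-5: read() streams data from the closest data node,
            // block by block, until the end of the file.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close the stream when finished.
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}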
Hadoop's RecordWriter takes the output data from the Reducer and writes it to output
files. The way these output key-value pairs are written to output files by the RecordWriter
is determined by the OutputFormat. OutputFormat is the counterpart of InputFormat:
the OutputFormat instances provided by Hadoop are used to write to files on HDFS or
on the local disk. OutputFormat describes the output specification for a MapReduce
job. On the basis of this output specification:
The MapReduce job checks that the output directory does not already exist.
OutputFormat provides the RecordWriter implementation to be used to write the
output files of the job. Output files are stored in a FileSystem.
Input Formats
1. Text Input:
TextInputFormat - default input format; treats each line of input as a record (the key
is the byte offset of the line, the value is the line contents)
KeyValueTextInputFormat - breaks lines into key-value pairs based on a
configurable separator
NLineInputFormat - lets each mapper receive a fixed number of lines of input
2. Binary Input:
SequenceFileInputFormat - reads sequence files, which store sequences of binary
key-value pairs
SequenceFileAsTextInputFormat - converts sequence file’s keys and values to
Text objects
SequenceFileAsBinaryInputFormat - retrieves the sequence file’s keys and
values as binary objects
FixedLengthInputFormat - for reading fixed-width binary records from a file where the
records are not separated by delimiters
3. Multiple Inputs:
By default, all input is interpreted by a single InputFormat and a single Mapper
MultipleInputs - allows the programmer to specify which InputFormat and Mapper to
use on a per-path basis (see the sketch after this list)
Database Input/Output:
DBInputFormat - input format for reading data from a relational database
DBOutputFormat - output format for writing data out to a relational database
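A minimal sketch, with illustrative paths and connection settings, of wiring the MultipleInputs and database options above into a job. The two configurations are shown together only for brevity; a real job would pick one input strategy and supply its own Mapper and DBWritable record classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;

public class InputFormatWiring {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "input-format-demo");

        // MultipleInputs: a different InputFormat (and, if desired, a different
        // Mapper) per input path. The identity Mapper is used here for brevity.
        MultipleInputs.addInputPath(job, new Path("/data/text"),
                TextInputFormat.class, Mapper.class);
        MultipleInputs.addInputPath(job, new Path("/data/seq"),
                SequenceFileInputFormat.class, Mapper.class);

        // Database input: point the job at a JDBC database. A real job would
        // also call DBInputFormat.setInput(...) with a user-defined DBWritable
        // class and set DBInputFormat as the job's input format.
        DBConfiguration.configureDB(job.getConfiguration(),
                "com.mysql.jdbc.Driver", "jdbc:mysql://dbhost/mydb",
                "user", "password");
    }
}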
Output Formats
1. Text Output:
TextOutputFormat - default output format; writes records as lines of text (keys and
values are turned into strings, separated by a configurable separator)
Its output can be read back with KeyValueTextInputFormat, which breaks lines into
key-value pairs based on the same separator
2. Binary Output:
SequenceFileOutputFormat - writes sequence files as output
SequenceFileAsBinaryOutputFormat - writes keys and values in binary format
into a sequence file container
MapFileOutputFormat - writes map files as output
3. Multiple Outputs:
MultipleOutputs - allows the programmer to write data to multiple files whose names
are derived from the output keys and values
4. Lazy Output:
LazyOutputFormat - wrapper output format that ensures the output file is created
only when the first record is emitted for a given partition
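A minimal sketch, with illustrative paths and names, showing how these output options are wired into a job; SequenceFileOutputFormat and the lazy text output are alternatives, while MultipleOutputs adds an extra named output on top of either.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class OutputFormatWiring {

    // Option A: write the job's output as binary sequence files.
    static void useSequenceFileOutput(Job job) {
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
    }

    // Option B: text output whose part files are created lazily, i.e. only
    // when the first record is actually written for a partition.
    static void useLazyTextOutput(Job job) {
        LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "output-format-demo");
        FileOutputFormat.setOutputPath(job, new Path("/user/output"));
        useSequenceFileOutput(job); // pick one of the two options above

        // MultipleOutputs: an extra named output ("summary") with its own
        // format and key/value types; a reducer writes to it through a
        // MultipleOutputs instance.
        MultipleOutputs.addNamedOutput(job, "summary",
                TextOutputFormat.class, Text.class, IntWritable.class);
    }
}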
6) Write the working procedure of HDFS and also explain the features
of HDFS.
Hadoop File System was developed using distributed file system design. It runs on
commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant
and designed using low-cost hardware.
HDFS holds a very large amount of data and provides easier access. To store such
huge data, the files are stored across multiple machines. These files are stored in a
redundant fashion to rescue the system from possible data losses in case of failure.
HDFS also makes applications available for parallel processing.
Features of HDFS
1. It is suitable for distributed storage and processing.
2. Hadoop provides a command interface to interact with HDFS.
3. The built-in servers of namenode and datanode help users to easily check the
status of the cluster.
4. Streaming access to file system data.
5. HDFS provides file permissions and authentication.
OR
o Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a
single cluster.
o Replication - Due to some unfavorable conditions, the node containing the data
may be lost. To overcome such problems, HDFS always maintains a copy of the
data on a different machine (see the sketch after this list).
o Fault tolerance - In HDFS, fault tolerance signifies the robustness of the
system in the event of failure. HDFS is highly fault-tolerant: if any machine
fails, another machine containing a copy of that data automatically
becomes active.
o Distributed data storage - This is one of the most important features of HDFS
that makes Hadoop very powerful. Here, data is divided into multiple blocks and
stored across the nodes.
o Portable - HDFS is designed in such a way that it can easily be ported from one
platform to another.
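As a small illustration of the replication feature, the Java sketch below (the path is illustrative) reads a file's current replication factor and asks the namenode to keep three copies of each of its blocks:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path("/user/input/file.txt");

        // How many datanodes currently hold a copy of each block of this file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication());

        // Ask the namenode to maintain 3 copies of every block of this file.
        fs.setReplication(file, (short) 3);
        fs.close();
    }
}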
10) Relate crowd sourcing and big data. Justify the relationship with
an example.
11) Write down the aggregate data model in detail with an example.
13) Discuss in detail about the basic building blocks of Hadoop with
a neat sketch.
A Hadoop cluster consists of a single master and multiple slave nodes. The master node
includes Job Tracker, Task Tracker, NameNode, and DataNode whereas the slave node
includes DataNode and TaskTracker.
NameNode: The NameNode is the master of HDFS that directs the slave DataNode
daemons to perform the low-level I/O tasks. It is the bookkeeper of HDFS; it keeps
track of how your files are broken down into file blocks, which nodes store those blocks,
and the overall health of the distributed filesystem.
DataNode: Each slave machine in your cluster will host a DataNode daemon to
perform the grunt work of the distributed filesystem - reading and writing HDFS blocks
to actual files on the local file system. When you want to read or write an HDFS file, the
file is broken into blocks and the NameNode will tell your client which DataNode each
block resides in. Your client communicates directly with the DataNode daemons to
process the local files corresponding to the blocks.
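A minimal Java sketch (the path is illustrative) of this bookkeeping from the client's side: the NameNode reports, for every block of a file, which DataNodes hold a replica.

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/input/file.txt"));

        // For each block: its offset, length, and the hosts storing a replica.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset " + block.getOffset()
                    + " length " + block.getLength()
                    + " hosts " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}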
JobTracker: Once you submit your code to your cluster, the JobTracker determines
the execution plan by determining which files to process, assigns nodes to different
tasks, and monitors all tasks as they are running. Should a task fail, the JobTracker will
automatically relaunch the task, possibly on a different node, up to a predefined limit of
retries.
TaskTracker: As with the storage daemons, the computing daemons also follow a
master/slave architecture: the JobTracker is the master overseeing the overall
execution of a MapReduce job, and the TaskTrackers manage the execution of individual
tasks on each slave node.