
Experiment 10

Aim: To demonstrate a distributed file system.

Theory: A distributed file system (DFS) is a method of storing and accessing files based on
a client/server architecture. In a distributed file system, one or more central servers store
files that can be accessed, with the proper authorization rights, by any number of remote
clients in the network. Distributed file systems are advantageous because they make it easier
to distribute documents to multiple clients, and they provide a centralized storage system so
that client machines do not use their own resources to store files.

Hadoop Distributed File System (HDFS)

Introduction:
The Hadoop Distributed File System (HDFS) is the primary data storage system used
by Hadoop applications. It employs a NameNode and DataNode architecture to implement
a distributed file system that provides high-performance access to data across highly
scalable Hadoop clusters. HDFS is a key part of many Hadoop ecosystem technologies,
as it provides a reliable means of managing pools of big data and supporting related big
data analytics applications. HDFS was developed using distributed file system design and
runs on commodity hardware. Unlike some other distributed systems, HDFS is highly
fault tolerant even though it is built from low-cost hardware. HDFS holds very large
amounts of data and provides easy access. To store such huge data, files are spread across
multiple machines and stored in a redundant fashion to protect the system from possible
data loss in case of failure. HDFS also makes applications available for parallel processing.

HDFS Architecture:
The Hadoop Distributed File System follows a master–slave architecture. Each cluster
comprises a single NameNode that acts as the master server: it manages the file system
namespace and regulates clients' access to files. The other component of an HDFS cluster
is the DataNode, usually one per node, which manages the storage attached to the node it
runs on.
HDFS exposes a file system namespace and allows user data to be stored in files. Internally,
a file is split into one or more blocks, and these blocks are stored in a set of DataNodes. The
NameNode executes file system namespace operations like opening, closing, and renaming
files and directories; it also determines the mapping of blocks to DataNodes. The DataNodes
are responsible for serving read and write requests from the file system's clients, and they
perform block creation, deletion, and replication upon instruction from the NameNode.
HDFS follows a strictly hierarchical file system layout.
Generally, user data is stored in the files of HDFS. A file is divided into one or more
segments, which are stored on individual DataNodes. These file segments are called blocks;
in other words, a block is the minimum amount of data that HDFS reads or writes in one
operation. The default block size is 64 MB (128 MB in Hadoop 2.x and later), and it can be
increased as needed by changing the HDFS configuration.
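The block splitting described above can be sketched in a few lines. This is an illustrative simulation, not real HDFS code; the tiny 64-byte block size stands in for the real 64/128 MB default so the example stays small.

```python
# Illustrative sketch (not real HDFS code): splitting a byte stream into
# fixed-size blocks, the way HDFS chops a file into block-sized units.
BLOCK_SIZE = 64  # bytes here, purely for demonstration; HDFS uses 64/128 MB

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Return the list of blocks that make up a file's contents."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

file_bytes = b"x" * 150          # a 150-byte "file"
blocks = split_into_blocks(file_bytes)
print(len(blocks))               # 3 blocks
print([len(b) for b in blocks])  # [64, 64, 22]: the last block is partial
```

Note that the last block is smaller than the block size; HDFS likewise does not pad a file's final block.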

Working of HDFS:
To store and process any data, the client submits the data and program to the Hadoop
cluster. Hadoop HDFS stores the data, MapReduce processes the data stored in HDFS, and
YARN divides the tasks and assigns resources.

1. HDFS
The data in Hadoop is stored in the Hadoop Distributed File System. Two daemons run in
Hadoop HDFS: the NameNode and the DataNode.

2. MapReduce
MapReduce is the processing layer in Hadoop. It processes data in parallel across
multiple machines in the cluster by dividing a job into independent subtasks and executing
them in parallel across various DataNodes. MapReduce processes the data in two phases,
the Map phase and the Reduce phase, for which the programmer specifies two functions:
the map function and the reduce function.
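The two phases can be sketched with the classic word-count example. This is a single-machine sketch of the programming model only; in a real cluster the map calls run in parallel on different nodes and a shuffle step groups the pairs by key before the reduce.

```python
# Minimal single-machine sketch of the two MapReduce phases (word count).
from collections import defaultdict

def map_phase(line):
    # The map function: emit a (word, 1) pair for every word in the line.
    return [(word, 1) for word in line.split()]

def reduce_phase(pairs):
    # The reduce function: sum the counts for each word.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

lines = ["big data big cluster", "big data"]
intermediate = [pair for line in lines for pair in map_phase(line)]
result = reduce_phase(intermediate)
print(result)  # {'big': 3, 'data': 2, 'cluster': 1}
```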
3. YARN
YARN is the resource management layer in Hadoop. It schedules tasks in the Hadoop
cluster and assigns resources to the applications running in the cluster; it is responsible for
providing the computational resources needed to execute the applications.
Two YARN daemons run in the Hadoop cluster to serve the YARN core services:

They are:
a. ResourceManager
b. NodeManager

In addition, a per-application ApplicationMaster negotiates resources from the
ResourceManager and works with the NodeManagers to execute and monitor the tasks.

The NameNode is the master node, and all requests go through it. The DataNodes, on the
other hand, are the nodes where processing is done on receiving a request from the
NameNode. The NameNode is the daemon running on the master machine and is the
centrepiece of an HDFS file system. It stores the directory tree of all files in the file system
and tracks where across the cluster the file data resides; it does not store the data
contained in these files. When client applications want to add, copy, move, or delete a file,
they interact with the NameNode, which responds by returning a list of the relevant
DataNode servers where the data lives.
The DataNode daemon runs on the slave nodes and stores data in the Hadoop File System.
In a functional file system, data is replicated across many DataNodes. On start-up, a
DataNode connects to the NameNode and then waits for requests from the NameNode to
access data. Once the NameNode provides the location of the data, client applications talk
directly to a DataNode; while replicating data, DataNode instances talk to each other.
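The read path just described can be modelled with two plain dictionaries. All the names here (paths, block IDs, node names) are made up for illustration; the point is only the two-step flow: ask the NameNode for locations, then fetch each block directly from a DataNode.

```python
# Toy model of the HDFS read path (assumed names, not the real API).
namenode = {                       # file path -> list of (block_id, datanode)
    "/user/report.txt": [("blk_1", "datanode-1"), ("blk_2", "datanode-2")],
}
datanodes = {                      # datanode -> {block_id: block contents}
    "datanode-1": {"blk_1": b"first block "},
    "datanode-2": {"blk_2": b"second block"},
}

def read_file(path):
    locations = namenode[path]                 # 1. ask the NameNode
    return b"".join(datanodes[dn][blk]         # 2. read from each DataNode
                    for blk, dn in locations)

print(read_file("/user/report.txt"))  # b'first block second block'
```

Notice that the file contents never pass through the NameNode; it hands out metadata only, which is why it does not become a bandwidth bottleneck.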

Features of HDFS:
 Distributed data storage.
 Blocks reduce seek time.
 The data is highly available, as the same block is present on multiple DataNodes.
 Even if multiple DataNodes are down, we can still do our work, making the system
highly reliable.
 High fault tolerance.
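The availability claim in the list above follows directly from replication: a block stays readable as long as at least one DataNode holding a replica is alive. A minimal sketch, with made-up node names and a replication factor of 3:

```python
# Sketch of why replication makes HDFS highly available.
replicas = {"blk_1": ["dn1", "dn2", "dn3"]}        # replication factor 3
alive = {"dn1": False, "dn2": False, "dn3": True}  # two of three nodes down

def block_available(block_id):
    # The block can be served if any replica sits on a live DataNode.
    return any(alive[dn] for dn in replicas[block_id])

print(block_available("blk_1"))  # True: dn3 still serves the block
```

Only when every replica's node is down does the block become unavailable, which is why the default replication factor of 3 tolerates two simultaneous node failures per block.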

Commands:

Command                                     Description
hadoop fs -ls                               Lists the files.
hadoop version                              Shows the version of Hadoop installed.
hdfs dfs -mkdir <path>                      Takes <path> as an argument and creates
                                            the directory.
hdfs dfs -ls <path>                         Displays the contents of the directory
                                            specified by <path>: the name, permissions,
                                            owner, size, and modification date of each
                                            entry.
hdfs dfs -put <src> <dest>                  Copies a file from the local filesystem to
                                            a file in DFS.
hdfs dfs -copyFromLocal <localsrc> <dest>   Similar to the put command, but the source
                                            must refer to a local file.
hdfs dfs -get <src> <localdest>             Copies the file in HDFS identified by <src>
                                            to the file in the local file system
                                            identified by <localdest>.
hdfs dfs -cat <file-name>                   Displays the contents of a file on the
                                            console (stdout).
hdfs dfs -mv <src> <dest>                   Moves a file from the specified source to
                                            the destination within HDFS.
hdfs dfs -cp <src> <dest>                   Copies a file or directory from the given
                                            source to the destination within HDFS.

Advantages of Hadoop

1. Varied Data Sources
Hadoop accepts a variety of data. Data can come from a range of sources, such as email
conversations and social media, and can be structured or unstructured.
2. Cost-effective
Hadoop is an economical solution, as it uses a cluster of commodity hardware to store data.
Commodity hardware consists of cheap machines, so the cost of adding nodes to the
framework is not very high.
3. Performance
Hadoop, with its distributed processing and distributed storage architecture, processes huge
amounts of data at high speed. It divides the input data file into a number of blocks and
stores these blocks over several nodes.
4. High Throughput
Throughput means work done per unit time. A given job in HDFS gets divided into small
jobs that work on chunks of data in parallel, thereby giving high throughput.
5. Open Source
Hadoop is an open-source technology, i.e. its source code is freely available. We can modify
the source code to suit a specific requirement.
6. Scalable
Hadoop works on the principle of horizontal scalability: to grow the cluster, we add whole
machines to the set of nodes rather than upgrading the configuration of a single machine.
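The throughput advantage (point 4 above) comes from splitting one job into chunk-sized subtasks that run at the same time. A small sketch, with a thread pool standing in for the cluster's worker nodes:

```python
# Sketch of the throughput idea: one job split into parallel subtasks.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 101))                                  # the "input file"
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]  # 4 "blocks"

# Each worker sums one chunk, the way each node processes one block.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(sum, chunks))

print(partial_sums)        # [325, 950, 1575, 2200]
print(sum(partial_sums))   # 5050, same answer as the sequential job
```

The final combination of partial results mirrors the reduce step: the answer is unchanged, but the heavy per-chunk work overlaps in time.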

Disadvantages of Hadoop

1. Issue with Small Files
Hadoop is suitable for a small number of large files, but it performs poorly for applications
that deal with a large number of small files.
2. Vulnerable By Nature
Hadoop is written in Java, a widely used programming language whose weaknesses are well
known to cyber criminals, which makes Hadoop comparatively easy to exploit and
vulnerable to security breaches.
3. Processing Overhead
In Hadoop, data is read from disk and written back to disk, which makes read/write
operations very expensive when dealing with terabytes and petabytes of data.
4. Supports Only Batch Processing
At its core, Hadoop has a batch processing engine, which is not efficient at stream
processing; it cannot produce output in real time with low latency.
5. Security
For security, Hadoop uses Kerberos authentication, which is hard to manage. It lacks
encryption at the storage and network levels, which is a major point of concern.

Conclusion: Thus, we have studied the distributed file system and the working of HDFS.
