
FUDMA Journal of Sciences (FJS)
ISSN: 2616-1370
Vol. 2 No. 2, June, 2018, pp 223 – 233

A COMPARATIVE STUDY OF THE ARCHITECTURES AND APPLICATIONS OF SCALABLE HIGH-PERFORMANCE DISTRIBUTED FILE SYSTEMS

Dada, E. G. and Joseph, S. B.

Department of Computer Engineering, University of Maiduguri


Maiduguri - Borno State, Nigeria
Correspondence: [email protected]

ABSTRACT
Distributed File Systems have enabled the efficient and scalable sharing of data across networks. These systems were designed to handle technical problems associated with networked data, such as the reliability and availability of data, the scalability of the storage infrastructure, the high cost of gaining access to data, and the cost of maintenance and expansion. In this paper, we compare the key technical building blocks that form the mainstay of Distributed File Systems such as Hadoop FS, Google FS, Lustre FS, Ceph FS, Gluster FS, Oracle Cluster FS and TidyFS. The paper elucidates the basic concepts and techniques employed in the above-mentioned file systems, and explains the different architectures and applications of these Distributed File Systems.
Keywords: Distributed File Systems, Google File System, Hadoop File System, Oracle Cluster File System, Ceph File System.

INTRODUCTION
A Distributed File System (DFS) is an extension of the file system concept that manages files and data stored on multiple devices in a computer system. DFSs achieve high performance, scalability and dependability through the use of various state-of-the-art techniques, and outside users see a DFS as one undivided storage medium (Vaidya and Deshpande, 2016). A file system is the concept that allows users to read, manipulate and arrange data (Shyam and Sudarshan, 2015). Usually, the data is kept in storage locations as files in a hierarchical tree in which the nodes are referred to as directories or folders. The file system presents the same view regardless of the underlying storage medium, be it floppy disk, hard disk, flash disk or CD (Suralkar et al., 2013).

A distributed file system permits many users to retrieve, via the network, a file structure stored on one or more machines in remote locations (file servers), using structures similar to those employed to retrieve files stored on the local machine. It uses a client/server model in which data is distributed among many storage locations, usually known as nodes (Elomari et al., 2017). Distributed file systems are occasionally perceived to be a single storage device, but in actual fact they are an interface over a platform that stores data on several machines. The DFS offers location transparency and duplication to enhance data accessibility in situations of failure or heavy load. One of the drawbacks of DFS is bottlenecks, which can lead to congestion and restricted access from some workstations. Such a system can nevertheless be expanded to cover a sizable number of storage locations with only moderate performance degradation, even though hardware failures remain possible.

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)
Hadoop is a parallel, fault-tolerant distributed file system that drew its creative ideas from the Google File System (Suralkar et al., 2013). The architecture was meant to dependably store very large files across machines with a large allocation unit, deployed on low-cost hardware (Elomari et al., 2017). The Hadoop DFS is advantageous because it allows a large volume of data to be processed within a short time, and it is appropriate for applications with huge data sets. HDFS divides large data files into segments distributed across a cluster, and each segment is managed by a different machine in the group (Shvachko et al., 2010). Each segment is duplicated across many machines in the cluster, so that if any machine malfunctions the data does not become inaccessible (Pooja, 2014).

In spite of the several benefits of HDFS, there is still a foremost challenge: data security. Whenever a virus, malware or worm infects the data, it rapidly affects all of it. Another challenge is that HDFS is unsuitable for storing small files. Each file is stored as a sequence of blocks; all blocks in a file except the last are of the same size. Blocks belonging to a file are replicated for fault tolerance, and the block size and replication factor can be configured per file. Files are "write once", and only one writer can write to a file at any time (Hurwitz et al., 2013). Another drawback of Hadoop DFS is centralisation: the system uses a centralised master server, so the Hadoop cluster is unreachable whenever its NameNode is inoperative (Gemayel, 2016). The only recovery approach available so far is to restart the NameNode, although steps are being taken towards the system recovering automatically. Moreover, HDFS has scalability issues: since the NameNode stores the entire namespace and block locations in memory, the size of the NameNode heap limits the number of files and blocks that can be addressed. A possible solution to this problem is to permit the sharing of physical storage by several namespaces and NameNodes within a cluster (Suralkar et al., 2013).

Architecture of HDFS
An HDFS cluster contains one NameNode, a master server, and several DataNodes, known as slaves in the architecture. HDFS stores filesystem metadata and application data independently: metadata is held on an autonomous dedicated server named the NameNode, while application data is stored on separate servers termed DataNodes. All servers are fully connected and communicate using TCP-based protocols. Figure 1 shows the complete architecture of HDFS.

Fig.1: HDFS Architecture (Source: DataFlair Team, 2017).
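To make the block mechanics described above concrete, the following minimal Python sketch models how a NameNode-style metadata map is built when a file is split into fixed-size blocks and each block is assigned to several DataNodes. This is an illustration, not the real HDFS code or API: the function name place_blocks and the round-robin placement are our own simplifications, whereas real HDFS placement is rack-aware.

```python
import itertools

def place_blocks(file_size, block_size=128 * 1024 ** 2, replication=3,
                 datanodes=("dn1", "dn2", "dn3", "dn4")):
    """Toy model of HDFS block placement: split a file into fixed-size
    blocks (all equal except possibly the last) and assign each block
    to `replication` distinct DataNodes. Real HDFS placement is
    rack-aware; round-robin is used here only for illustration."""
    num_blocks = max(1, -(-file_size // block_size))  # ceiling division
    rotation = itertools.cycle(range(len(datanodes)))
    block_map = {}  # NameNode-style metadata: block id -> replica locations
    for block_id in range(num_blocks):
        start = next(rotation)
        replicas = [datanodes[(start + i) % len(datanodes)]
                    for i in range(replication)]
        block_map[block_id] = replicas
    return block_map

# A 300 MB file with the default 128 MB block size yields three blocks:
# two full 128 MB blocks and one final 44 MB block, each on 3 DataNodes.
print(place_blocks(300 * 1024 ** 2))
```

The point of the separation is visible here: the NameNode only ever holds the small block-to-location map, while the DataNodes hold the actual bytes.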

Areas of Application of HDFS
According to Sravanthi and Reddy (2015), the areas of application of HDFS include the following:
1. Agriculture: HDFS has found application in the agricultural sector, for instance in genetic engineering, where sensors are used to monitor plant reactions to various changes in the environment. A large amount of data is collected, and simulation allows the discovery of the most appropriate environmental conditions for various plants.
2. Stock Exchange: HDFS is broadly used through an analytical database to identify illegal trading patterns and to uncover fraudulent activities.
3. Big Data Applications (BDA): HDFS can be used to effectively store big data. A big data application is a software application which analyses big data by processing it in a very large parallel framework, for example data from traffic monitoring, stock market updates, tweet messages and others.
4. Clustering: Data stored in HDFS can be divided into clusters. This is used mainly to identify and address groups of data, typically with a k-means algorithm (a minimal sketch of this idea follows the list).

5. Enterprise: HDFS is used for in-database analytics, helping businesses to make decisions faster than traditional analytic tools.
6. Credit cards: HDFS can assist an organisation to detect possible fraudulent acts in credit card transactions. Companies depend on in-database analytics because of its speed and accuracy; it performs a kind of verification before authorisation wherever there is suspicious activity.
7. Consumer goods: HDFS is used for storing collections of consumers' details and activities, for example the type of goods or products, place of purchase, quantity purchased, online transactions, etc. When analysed, this can help a company understand why customers purchase some goods more than others, and in which areas it has to intensify efforts such as marketing the product.
8. Banking: Banks make use of HDFS for storing customer data, which also assists in detecting any questionable customer activity. It provides easy access to customer information wherever the customer requests it, and it helps banks track their progress and enhance the efficiency of their service.
9. Hadoop is also being used by the giant ISP Yahoo and by popular social media such as Facebook.
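As a concrete illustration of application area 4 above, the sketch below implements a plain k-means pass in Python. It is a self-contained toy on in-memory points; a production cluster would run this kind of computation in parallel over data stored in HDFS.

```python
import random

def kmeans(points, k, iterations=20, seed=42):
    """Plain k-means on 2-D points: repeatedly assign each point to its
    nearest centroid, then move each centroid to its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centroids[i][0]) ** 2 +
                                        (p[1] - centroids[i][1]) ** 2)
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster empties
                centroids[i] = (sum(p[0] for p in members) / len(members),
                                sum(p[1] for p in members) / len(members))
    return centroids, clusters

data = [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9), (0.8, 1.2), (7.9, 8.0)]
centroids, clusters = kmeans(data, k=2)
print(centroids)  # two centroids, one near (0.9, 1.1) and one near (8.0, 8.0)
```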
Google File System
Google File System (GFS) is a distributed file system introduced by Google to manage a huge amount of data spread across different databases (Ghemawat et al., 2003). GFS is mainly intended to provide efficient, dependable and fault-tolerant access to data using huge clusters of commodity servers. A GFS cluster is made up of one master and many chunkservers (nodes) and can be accessed by several clients (Vijayakumari et al., 2014). GFS files are partitioned into chunks of 64 megabytes, and are normally appended to or read; they are overwritten or shrunk only in exceptionally rare cases. Compared with conventional file systems, GFS is designed and optimised to run in data centers, to offer exceptionally high data processing capacity and minimal delay to rapidly changing data, and to continue operating despite individual server failures (Gemayel, 2016). The features that make GFS highly appealing include: continuous operation in the event of system breakdown; automatic and effective duplication of important data; salvaging of corrupted or damaged data; a high aggregate of the data rates delivered to all terminals in the network; reduced client–master communication owing to the large chunk size; namespace organisation and locking; and better accessibility.

Some of the advantages of GFS include its very high availability and fault tolerance through data duplication. Its single-master design makes it very efficient and simple. GFS ensures data integrity by having each chunkserver verify its own copies using checksums, at reduced checksum cost. It achieves increased bandwidth through batched operations such as garbage collection and writes to the operation log. At the client end, no synchronisation is needed because of the append operations. It takes care of caching issues, automatically reclaims orphaned chunks through garbage collection, and continuously monitors chunkservers through regular messaging (Ghemawat et al., 2003). On the other hand, GFS is limited to a special-purpose design and cannot accommodate general-purpose use. It is very inefficient in the case of small files (Gemayel, 2016). Slow garbage collection is a setback when the files are not static. Consistency checks are done by the clients themselves. When the number of writers increases, the performance of the system can degrade. The master's memory also poses a limitation (Kaushiki, 2012).

Architecture of GFS
A GFS cluster is made up of a single master and many chunkservers, and can be accessed by several clients, as shown in Figure 2. Each component is typically a commodity Linux machine running a user-level server process. A chunkserver and a client can easily run on the same computer, provided that the machine's resources allow it and that the lower reliability caused by running possibly flaky application code is acceptable (Ghemawat et al., 2003).


[Figure 2 shows a GFS client sending a (file name, chunk index) request to the GFS master, which holds the file namespace and returns the matching (chunk handle, chunk locations); the client then sends (chunk handle, byte range) requests and exchanges chunk data directly with the GFS chunkservers, each of which stores chunks on a local Linux file system. Control messages flow between client and master; data messages flow between client and chunkservers.]

Fig. 2: Architecture of GFS (Source: Ray, 2015)
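The control path in Figure 2 can be illustrated with a short Python sketch. The client itself computes the chunk index from the byte offset (offset divided by the fixed 64 MB chunk size) and asks the master only for the (chunk handle, chunk locations) pair, caching the reply. The names GFSClientSketch and MASTER_TABLE are hypothetical stand-ins for the master's in-memory metadata and its RPC interface.

```python
CHUNK_SIZE = 64 * 1024 ** 2  # GFS files are partitioned into 64 MB chunks

# Toy stand-in for the master's metadata: (file name, chunk index)
# -> (chunk handle, replica locations). Real GFS keeps this in the
# master's memory and returns it over RPC.
MASTER_TABLE = {
    ("/foo/bar", 0): ("2ef0", ["chunkserver-a", "chunkserver-b", "chunkserver-c"]),
    ("/foo/bar", 1): ("2ef1", ["chunkserver-b", "chunkserver-c", "chunkserver-d"]),
}

class GFSClientSketch:
    def __init__(self):
        self.cache = {}  # clients cache handle/location replies

    def locate(self, path, byte_offset):
        chunk_index = byte_offset // CHUNK_SIZE  # computed by the client
        key = (path, chunk_index)
        if key not in self.cache:                # only a miss contacts the
            self.cache[key] = MASTER_TABLE[key]  # master, keeping its load low
        handle, locations = self.cache[key]
        within_chunk = byte_offset % CHUNK_SIZE
        return handle, locations, within_chunk

client = GFSClientSketch()
# A read at byte 70,000,000 falls in chunk 1 (70e6 // 64 MiB == 1).
print(client.locate("/foo/bar", 70_000_000))
```

This division of labour is why client–master communication stays small: the master answers only metadata lookups, and all chunk data moves directly between clients and chunkservers.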


Areas of Application of GFS
MapReduce: This is the major application area of GFS. MapReduce is a programming model proposed by Google and used with both GFS and HDFS. The principal responsibility of MapReduce is to serve as a platform for the development and execution of large-scale data processing tasks. Hence, MapReduce takes advantage of the high processing power made available by computing clusters while at the same time presenting a programming model that eases the development of such distributed applications. MapReduce makes the breakdown of jobs and the fusion of results stress-free, and it also makes it easy to track jobs and tasks.
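The model is easiest to see in the canonical word-count example. The sketch below simulates the three stages — map, shuffle/group-by-key, and reduce — in a single Python process; a real deployment would run many map and reduce workers in parallel over data stored in GFS or HDFS.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit an intermediate (key, value) pair per word."""
    for word in document.split():
        yield word.lower(), 1

def reduce_phase(key, values):
    """Reduce: fuse all values emitted for one key into a result."""
    return key, sum(values)

def mapreduce(documents):
    # Shuffle: group intermediate values by key, as the framework
    # would do between the map and reduce stages across the cluster.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_phase(doc):
            groups[key].append(value)
    return dict(reduce_phase(k, vs) for k, vs in groups.items())

print(mapreduce(["the cat sat", "the dog sat"]))
# {'the': 2, 'cat': 1, 'sat': 2, 'dog': 1}
```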
LUSTRE FILE SYSTEM
Lustre is a file system designed for efficient storage (Paul et al., 2013). Lustre file systems have the capacity to change in size and can be integrated into computer clusters that have thousands of client nodes and large storage capacity on several servers (Sage et al., 2004). This makes Lustre a widely accepted file system for use in businesses where massive data centers are required. Because Lustre file systems have high performance abilities and open licensing, they are commonly used in supercomputers (Paul et al., 2013). Today, Lustre is completely based on Linux and usually uses kernel-based server modules to produce the required performance, but it can be re-exported with NFS or CIFS to allow use by Windows and OS X clients. Other notable features of Lustre are built-in network diagnosis and performance tracking and fine-tuning mechanisms. Lustre can support several kinds of clients and runs on almost any modern hardware. Scalability is one of the most critical characteristics of Lustre: it can be used to produce a single namespace of what seems to be practically immeasurable capacity (Paul et al., 2013).

Architecture of Lustre File System
The Lustre architecture aims to connect a large number of clients with the data servers in an efficient and foolproof manner. According to Knowledge Base (2018), a Lustre file system is made up of the following components:
• Lustre clients: The DFS client software runs on machines such as desktop nodes and interacts with the file system's servers through the Lustre Network (LNET) layer. In an establishment, Lustre offers clients a combined, all-inclusive namespace for all files and data in the file system. When Lustre is mounted on a client, its users can manage file system data in a way that makes it look as if the data is stored locally; nevertheless, clients will under no circumstances retrieve data directly from the main file storage.
• Management Target (MGT): The MGT keeps file system configuration information for use by the clients and other Lustre components. Even though the MGT's storage requirements are reasonably low, even when the file system it describes is very huge, the information kept in it is essential for accessing the system.
• Management Server (MGS): The MGS manages the configuration data stored on the MGT. Lustre clients communicate with the MGS to access this information.
• Metadata Target (MDT): The MDT keeps filenames, directories, permissions, and other data that describes the namespace.
• Metadata Server (MDS): The MDS is in charge of the namespace data kept on the MDT. Lustre clients contact the MDS to access information from the MDT; the MDS does not participate in file read/write operations.


• Object Storage Targets (OSTs): The OSTs keep user file data in one or more logical objects, which can be striped across several OSTs.
• Object Storage Server (OSS): The OSS supervises read/write operations for (normally) several OSTs.
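The striping performed over OSTs can be pictured with the following illustrative Python sketch. It is a toy model, not Lustre code: stripe_size and stripe_count mirror Lustre's layout parameters, while the round-robin assignment below is a simplification of what the MDT records for each file.

```python
def stripe_layout(file_size, stripe_size=1024 ** 2, stripe_count=4,
                  osts=("OST0", "OST1", "OST2", "OST3", "OST4", "OST5")):
    """Toy model of Lustre-style striping: successive stripe_size
    extents of a file are written round-robin across stripe_count
    OSTs chosen from the pool. The MDT would record this layout;
    clients then read and write the OSTs directly."""
    chosen = osts[:stripe_count]
    layout = []
    offset = 0
    while offset < file_size:
        extent = min(stripe_size, file_size - offset)
        target = chosen[(offset // stripe_size) % stripe_count]
        layout.append((offset, extent, target))
        offset += extent
    return layout

# A 3.5 MB file with 1 MB stripes lands on OST0..OST3, then wraps.
for entry in stripe_layout(int(3.5 * 1024 ** 2)):
    print(entry)
```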
Depicted in figure 3 below is the architecture of the Lustre distributed file system.

Fig. 3: Lustre Architecture (Source: Torben, 2018)


Areas of Application
Lustre file systems are used in various areas of application such as aeronautical engineering, banking, topography, and weather and climate science, and in industries such as meteorology, simulation, oil and gas, life sciences, and finance. Remarkably, a Lustre file system is employed for a wide range of purposes at many sites, from Internet service providers (ISPs) to large businesses that deal with financial and monetary transactions.
Ceph Distributed File System
Ceph is a distributed file system that offers users outstanding performance, dependability, and scalability. Ceph exploits the separation of data from metadata management by replacing the allocation table that an operating system maintains on disk with a pseudo-random data distribution function, CRUSH, designed for diverse and dynamic clusters of unreliable object storage devices (OSDs). Ceph was designed to be a reliable, scalable, fault-tolerant parallel file system. Incorporated into Ceph, the smart and robust CRUSH placement scheme allows a client to pre-calculate object placement and layout while taking failure domains and hierarchical storage tiers into consideration. Ceph is built on top of a unified object management layer, RADOS, and both metadata and file data take advantage of this uniformity. Most Ceph processes reside in user space, which generally makes the system easier to debug and maintain, and the client-side support has been integrated into the Linux mainline kernel, which eases deployment and the out-of-box experience (Feiyi et al., 2013).

CephFS is very attractive because it provides very robust data safety for mission-critical applications and makes practically boundless storage file systems possible. Applications that use file systems can use CephFS with POSIX semantics; moreover, it does not need any integration or customization. CephFS automatically rebalances the file system to provide the best performance. CephFS also enables enhanced scalability: clients executing huge read/write operations scale linearly with the number of object storage devices in the RADOS cluster, although each client is constrained by the bandwidth of its network link. Among the weaknesses of CephFS is that it occasionally endangers the file system by letting unauthorised users become aware of the existence of data, thereby compromising privacy. Features like snapshots are not offered by CephFS, and test coverage in its test suite is inadequate.
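The idea that a client can pre-calculate placement, rather than look it up in a table, can be demonstrated with a simplified stand-in for CRUSH. The sketch below uses rendezvous (highest-random-weight) hashing for brevity; it is not the real CRUSH algorithm, which additionally walks a hierarchy of failure domains and storage tiers.

```python
import hashlib

def place_object(object_id, osds, replicas=3):
    """Simplified stand-in for CRUSH-style placement (not the real
    algorithm): every client can recompute the same replica set from
    the object name and the OSD list alone, with no lookup table.
    Rendezvous hashing ranks OSDs by a deterministic per-object score."""
    def score(osd):
        digest = hashlib.sha256(f"{object_id}:{osd}".encode()).hexdigest()
        return int(digest, 16)
    return sorted(osds, key=score, reverse=True)[:replicas]

osds = [f"osd.{i}" for i in range(8)]
# Any client computes the same answer independently of any server:
print(place_object("10000000abc.00000001", osds))
```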
Architecture of CephFS


CephFS is the file storage component of Ceph. It works as a high-level component within the system, providing file storage on top of RADOS, the object store upon which all Ceph storage solutions are built (Sage et al., 2006). Two types of entities cooperate to provide a file system interface: clients and the metadata server (MDS). All that is required of a client is to open a file (i.e., inform the MDS that it wants to use the file); afterwards, reads and writes proceed immediately, with only intermittent updates to the MDS (Giacinto et al., 2014). The CephFS architecture is depicted in figure 4 below.

Fig 4: CephFS Architecture (Source: John, 2015)

Areas of Application of CephFS
1. CephFS is used for storing large files efficiently.
2. It is used in the field of telecommunication for storage and distribution.
GlusterFS File System
GlusterFS is a distributed network file system with the capability to handle a growing amount of work. It is developed in the C programming language (Shyam and Sudarshan, 2015). Using conventional standard hardware, GlusterFS can produce massive storage solutions that are geographically dispersed. It can be used for media streaming, data transformation and modeling, and other data and jobs bounded by the maximum data transfer rate of a network or Internet connection. GlusterFS is unrestricted, open-source software that anyone can inspect, change, and distribute; its architecture is easy to understand, and users can install it on their own computers for personal use. The idea of a metadata server recording where files are stored does not apply in GlusterFS: the system can compute a file's address and retrieve the file without difficulty. GlusterFS utilises the ideas of bricks and volumes to separate users' data into small allocation units within the file system (Shyam and Sudarshan, 2015). GlusterFS has four major notions:

Bricks – the units in which information is stored. A brick is made up of a server and a path that points to a file system location in the directory tree hierarchy (i.e., server:/export).
Translators – components that are chained together to transport data from point A to point B.
Trusted Storage Pool – a reliable interconnection of servers that serves as the central repository of data and programs shared by users in a network.
Volumes – groups of bricks configured with the same redundancy conditions.

Architecture of GlusterFS File System
The GlusterFS system architecture is easy to comprehend; it is a robust file system written in user space that employs FUSE, which forms an interface between the operating system's kernel and a more concrete file system. The GlusterFS architecture is layered, which makes it possible to add or delete features. GlusterFS works on ext3, ext4, xfs, etc. to store data, and it allows horizontal growth, i.e., the addition of new resources to the network. The conceptual model that defines the structure of GlusterFS is depicted in figure 5 below.


Fig 5: GlusterFS System Architecture (Source: Shyam and Sudarshan, 2015)
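The claim above that GlusterFS needs no metadata server, because clients can compute a file's address, can be illustrated with a toy Python sketch of hash-based placement over bricks. GlusterFS's distribute translator works on this principle, though its actual hash function and range assignment differ; the brick names below are hypothetical.

```python
import hashlib

BRICKS = ["server1:/export/brick1", "server2:/export/brick1",
          "server3:/export/brick1"]

def brick_for(path):
    """Toy version of metadata-free placement: hash the file path into
    a 32-bit value and map equal hash ranges onto the bricks of a
    volume. Every client computes the same answer, so no central
    server has to be asked where a file lives."""
    h = int(hashlib.md5(path.encode()).hexdigest(), 16) % (2 ** 32)
    span = 2 ** 32 // len(BRICKS)
    index = min(h // span, len(BRICKS) - 1)  # last range absorbs remainder
    return BRICKS[index]

print(brick_for("/home/alice/report.txt"))
```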


Areas of Application of GlusterFS
GlusterFS clusters have found relevance in the storage of large and voluminous data, complementing or replacing conventional storage devices (e.g. hard drives, CDs, DVDs, floppy disks, USB drives, ZIP disks, magnetic tapes and SD cards).
The Oracle Cluster File System (OCFS)
OCFS is a shared file system, installed on many servers concurrently. Its shared-disk clustered file system architecture allows Oracle to balance load and to switch automatically and seamlessly to highly reliable backup solutions such as OCFS2 and ACFS (Burleson, 2017). A cluster comprises two or more autonomous, but interconnected, servers: many connected computers or servers that look as though they are one server to end users and applications, as in Oracle Real Application Clusters (Hupfeld et al., 2008). Many hardware vendors have made cluster capability available for some years to satisfy many needs. Some clusters were meant only to offer users superior availability, by permitting the relocation of work to an auxiliary node if the operational node crashes; others were designed to provide scalability, by making it possible for user connections or work to be distributed across the nodes.

Another popular attribute of a cluster is that it should appear to an application as one server; likewise, many servers should be managed, as much as possible, in the same manner as a single server (Burleson, 2017). The software that balances workload to reduce bottlenecks in the cluster makes this transparency possible. In order for the nodes to appear as one server, files need to be kept in a way that lets them be located by the particular node that requests them. Several software systems exist today that define the type and state of each node in the cluster and the relations between them; this solves the data retrieval problem, although it depends on the primary goal of the person who designed the cluster. The interconnect is a computer network topology employed as a way of transmitting information between the nodes of the cluster. Some of the benefits of OCFS are advanced security (POSIX ACLs and SELinux), REFLINK snapshots with copy-on-write, a built-in cluster stack with a Distributed Lock Manager, file size scalability up to 16 TB, and cluster scalability up to 32 nodes.
Architecture of Oracle Cluster File System

Fig. 6: Oracle Cluster File Architecture (Source: Carlos, 2008).
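The Distributed Lock Manager mentioned among the OCFS benefits can be illustrated with a deliberately minimal toy in Python. ToyDLM is our own illustrative class, not the OCFS2 DLM, which is distributed across nodes and supports several lock modes; the sketch only shows the grant/queue bookkeeping that lets many nodes share one disk safely.

```python
from collections import deque

class ToyDLM:
    """Minimal illustration of distributed-lock-manager bookkeeping:
    one exclusive holder per resource (e.g. a shared-disk inode);
    later requesters queue until the holder releases."""
    def __init__(self):
        self.holders = {}   # resource -> node currently granted
        self.waiters = {}   # resource -> queue of blocked nodes

    def acquire(self, node, resource):
        if resource not in self.holders:
            self.holders[resource] = node
            return "granted"
        self.waiters.setdefault(resource, deque()).append(node)
        return "queued"

    def release(self, node, resource):
        assert self.holders.get(resource) == node
        queue = self.waiters.get(resource)
        self.holders.pop(resource)
        if queue:  # hand the lock to the next waiter, FIFO
            self.holders[resource] = queue.popleft()

dlm = ToyDLM()
print(dlm.acquire("node1", "inode:42"))  # granted
print(dlm.acquire("node2", "inode:42"))  # queued until node1 releases
dlm.release("node1", "inode:42")
print(dlm.holders["inode:42"])           # node2
```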


Area of Application of Oracle Cluster File System
According to Burleson (2017), the Oracle Cluster File System is applicable in the areas explained below:
i. Metadata caching.
ii. Metadata journaling.
iii. Asynchronous and direct I/O support for database files, for enhanced database workload, throughput, resource optimization, and contention.
iv. Oracle Database has found application in storing files for software designed to accumulate, manage and disseminate information efficiently, such as CAD, medical images, invoice images, documents, etc. The standard SQL data types BLOB (and CLOB) are used by applications to store files in the database. Oracle Database offers superior security, accessibility, the ability to keep operating despite degraded transactions, and the capacity to handle growth in file size, compared to conventional file systems. Whenever files are stored in the database, they are copied or archived, synchronized to the failure recovery site by means of Data Guard, and recovered together with the relational data in the database (Burleson, 2017).
TidyFS
TidyFS is a small, uncomplicated client/server-based system that permits clients to retrieve and process data stored on the server as if it were their own. TidyFS offers the abstractions required for the concurrent processing of data on clusters (Fetterly et al., 2011). There has been a surge of research in computing with large numbers of readily available computing components, to obtain the greatest amount of useful computation at reduced cost. In such clusters, the high number of reads and writes that causes latency and bottlenecks is generated by software systems that run on a cluster of networked computers and appear as one dependable machine offering a huge collective volume of computational and I/O performance; examples include MapReduce, Hadoop and Dryad. Their workloads are high-throughput, sequential, and read-mostly (Fetterly et al., 2011). Unlike the other DFSs discussed in the previous sections, TidyFS is very simple. The system avoids complicated replication protocols and read/write code paths by taking advantage of workload properties, which include the absence of simultaneous writes to a file by many clients and the presence of end-to-end fault tolerance in the execution engine (Fetterly et al., 2011).

Some of the advantages of TidyFS include: it permits applications to carry out I/O using whatever access patterns and compression techniques suit them, which can save storage capacity, speed up file transfers, and decrease costs for storage hardware and network bandwidth; it makes migrating data and tools from obsolete technologies to modern ones easier; and it removes the need for an additional layer of indirection, ensuring that clients can realise the highest obtainable I/O performance of the native system (Sajjad and Harirbaf, 2013).

Architecture of TidyFS

Fig. 7: TidyFS System Architecture (Source: Fetterly et al., 2011)


The TidyFS storage system is made up of three components: a metadata server; a node service that performs housekeeping tasks, running on each cluster computer that stores data; and the TidyFS Explorer, a GUI which permits users to observe the status of the system.
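The write-once workload assumption that keeps TidyFS simple can be sketched as follows. ToyTidyFSPart is a hypothetical illustration of the discipline described by Fetterly et al. (2011), not TidyFS's actual API: a part is written on one machine through native interfaces, sealed, and only then replicated, so no concurrent-writer protocol is ever needed.

```python
class ToyTidyFSPart:
    """Sketch of a write-once part: written by a single client on one
    machine, then closed; only closed (read-only) parts become
    visible and are replicated (cf. Fetterly et al., 2011)."""
    def __init__(self, part_id, machine):
        self.part_id = part_id
        self.replicas = [machine]  # written natively on one machine first
        self.closed = False

    def append(self, data):
        if self.closed:
            raise IOError("parts are immutable once closed")
        # ... write through the native file system interface ...

    def close(self):
        self.closed = True  # from now on the part is read-only

    def replicate(self, machine):
        assert self.closed, "only read-only parts are replicated"
        self.replicas.append(machine)

part = ToyTidyFSPart("part-0001", "cluster-node-07")
part.append(b"record")
part.close()
part.replicate("cluster-node-12")  # the metadata server schedules this
print(part.replicas)
```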


Area of Application
TidyFS is used mainly by Microsoft's Dryad and DryadLINQ data-parallel computing frameworks (Fetterly et al., 2011).
Table 1 below compares the different distributed file systems studied in this paper.

Table 1: Comparative table of characteristics of the distributed file systems under study

HDFS — Scalability: Yes. Fault tolerance: block replication; secondary NameNode. Data access concurrency: files have strictly one writer at any time. Data striping: block size of 128 MB. Supported OS: Linux and Windows are supported, but BSD, Mac OS X and OpenSolaris are known to work. (Suralkar et al., 2013)

Google FS — Scalability: Yes. Fault tolerance: chunk replication; metadata replication. Data access concurrency: optimised for concurrent appends. Data striping: 64 MB chunks. Supported OS: Linux. (Gemayel, 2016)

Lustre FS — Scalability: Yes. Fault tolerance: metadata replication by a single server; data is stored on reliable nodes. Data access concurrency: many seeks and read-and-write operations of small amounts of data. Data striping: 800 MB/sec of disk bandwidth. Supported OS: Linux; provides a POSIX-compliant UNIX file system interface. (Paul et al., 2013)

CephFS — Scalability: Yes. Fault tolerance: metadata is replicated across multiple MDS nodes. Data access concurrency: allows aggregate I/O performance to scale with the size of the OSD cluster. Data striping: 14-node cluster of OSDs (around 58 MB). Supported OS: Linux VFS and page cache. (Sage et al., 2006)

GlusterFS — Scalability: Yes. Fault tolerance: replication means a single file can be cloned and placed on multiple nodes. Data access concurrency: the lowest-level translator stores and accesses data from the local file system. Data striping: at least 16 KB file size. Supported OS: Linux; supports Red Hat, CentOS, Fedora and Ubuntu flavours of Linux. (Shyam and Sudarshan, 2015)

Oracle Cluster FS — Scalability: Yes. Fault tolerance: Oracle Net connect-time failover and connection load balancing. Data access concurrency: supports row-level locking. Data striping: can handle 4.3 billion fully consistent reads and 1.2 fully transactional writes per minute. Supported OS: Windows, Red Hat Linux and United Linux. (Burleson, 2017)

TidyFS — Scalability: Yes. Fault tolerance: automatic replication of read-only database parts. Data access concurrency: uses native interfaces to read and write data. Data striping: not available. Supported OS: Windows. (Fetterly et al., 2011)


CONCLUSION
A comparative study of some important features of seven distributed file storage systems has been presented in this paper. We first discussed the different distributed file systems, their architectures and their areas of application. At the end, we drew a table of comparison (Table 1), in which each column header is a vital attribute of a DFS and each row corresponds to one of the seven DFSs we studied; at each intersection we indicate whether the attribute is implemented by the system, along with the distinctive features of the implementation. It is obvious from our analysis that the foremost mutual interest of these systems is scalability: they are aimed at efficiently managing volumes of data that keep increasing on a daily basis. Many drawbacks are associated with centralised storage systems, whose maintenance is complex and costly; scalability in any DFS should therefore come at the lowest possible cost and labour.

Moreover, availability of data and fault tolerance continue to be among the main concerns of DFS design. Several systems choose to employ cheap hardware for storage, which leaves them vulnerable to periodic failures. This challenge is addressed by means of replication, versioning, snapshots and similar mechanisms, whose goal is to restore the system state, in most cases automatically, once a fault or total loss occurs at any node. In addition to these mechanisms, data striping and locking are included to control and enhance simultaneous access to the data. Efficient concurrent access is crucial for systems that manage huge files in substantial quantities: locking a whole file in order to modify a portion of it can block access to that file for an indefinite duration, so implementing solutions that lock only the byte range involved in the alteration is paramount. Finally, the ability of a DFS to work on multiple operating systems can be a great plus. Among the seven DFSs we studied, HDFS is the one offering the largest collection of operating systems that can support its implementation.
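As an aside on the byte-range locking advocated above, POSIX systems already expose such locks through fcntl. The short Unix-only Python sketch below locks and updates only a 512-byte region of a shared file, leaving the rest of the file available to other processes; the file name and offsets are arbitrary, chosen for illustration.

```python
import fcntl
import os

# POSIX byte-range locking via fcntl (Unix-only): lock just the region
# being modified instead of the whole file, so writers touching
# disjoint regions of a huge file do not block one another.
fd = os.open("shared.dat", os.O_RDWR | os.O_CREAT)
os.ftruncate(fd, 4096)

# Exclusively lock 512 bytes starting at offset 1024; bytes outside
# this range stay available to other processes.
fcntl.lockf(fd, fcntl.LOCK_EX, 512, 1024, os.SEEK_SET)
try:
    os.pwrite(fd, b"update", 1024)   # modify only the locked region
finally:
    fcntl.lockf(fd, fcntl.LOCK_UN, 512, 1024, os.SEEK_SET)
    os.close(fd)
```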

REFERENCES

Akram Elomari, Larbi Hassouni, Abderrahim Maizate (2017). The Main Characteristics of Five Distributed File Systems Required for Big Data: A Comparative Study. Advances in Science, Technology and Engineering Systems Journal, vol. 2, No. 4, pp. 78-91.

Burleson (2017). Oracle Cluster File System (OCFS) Tips. Available at http://www.dba-oracle.com/disk_ocfs.htm

Carlos Fernando Gamboa (2008). Atlas LCG 3D Oracle cluster migration strategy at BNL, Gris Group, RACF Facility, Brookhaven National Lab, WLCG Collaboration Workshop. Available at http://slideplayer.com/slide/8285174/

D. Fetterly, M. Haridasan, M. Isard, and S. Sundararaman (2011). TidyFS: A Simple and Small Distributed File System. In USENIX ATC'11. Available at http://research.microsoft.com/pubs/148515/tidyfs.pdf

DataFlair Team (2017). Hadoop HDFS Architecture Explanation and Assumptions. HDFS Tutorials. Available at https://data-flair.training/blogs/hadoop-hdfs-architecture/

Feiyi W., Mark N., Sarp O., Dong F. (2013). Ceph Parallel File System Evaluation Report. Oak Ridge National Laboratory, Oak Ridge, Tennessee.

Felix Hupfeld, Toni Cortes, Björn Kolbeck, Jan Stender, Erich Focht, Matthias Hess, Jesus Malo, Jonathan Marti, Eugenio Cesario (2008). The XtreemFS architecture – a case for object-based file systems in Grids. Concurrency and Computation: Practice and Experience, 8:1–12.

Giacinto Donvito, Giovanni Marzulli, Domenico Diacono (2014). Testing of several distributed file systems (HDFS, Ceph and GlusterFS) for supporting the HEP experiments analysis. Journal of Physics: Conference Series 513 (2014) 042014. doi:10.1088/1742-6596/513/4/042014.

Hooman Peiro Sajjad and Mahmoud Hakimzadeh Harirbaf (2013). Maintaining Strong Consistency Semantics in a Horizontally Scalable and Highly Available Implementation of HDFS. Master Thesis, KTH Royal Institute of Technology.

Hurwitz J., Nugent A., Halper F. (2013). Big Data for Dummies. John Wiley and Sons Inc, USA.

IBM (2018). Apache MapReduce. Retrieved May 7, 2018, from https://www.ibm.com/analytics/hadoop/MapReduce

John Spray (2015). CephFS Development Update. Available at events.linuxfoundation.org/sites/events/files/slides/CephFS-Vault.pdf

Kaushiki (2012). Google File System. Available at kaushiki-gfs.blogspot.com/

Knowledge Base (2018). About Lustre file systems. Indiana University. Available at https://kb.iu.edu/d/ayfh

Kuchipudi Sravanthi and Tatireddy Subba Reddy (2015). Applications of Big Data in Various Fields. (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5), 2015, 4629-4632.

Madhavi Vaidya, Shrinivas Deshpande (2016). Comparative Analysis of Various Distributed File Systems & Performance Evaluation using Map Reduce Implementation. IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2016), December 23-25, Jaipur, India.

Nader Gemayel (2016). Analyzing Google File System and Hadoop Distributed File System. Research Journal of Information Technology, 8: 66-74.

Pooja S.H. (2014). The Hadoop Distributed File System. International Journal of Computer Science and Information Technology, vol. 5(5), 6238-6243.

R. Vijayakumari, R. Kirankumar, K. Gangadhara R. (2014). Comparative analysis of Google File System and Hadoop Distributed File System. International Journal of Advanced Trends in Computer Science and Engineering, vol. 3, No. 1, pp. 553–558.

Ray Walshe (2015). Google File System - DCU School of Computing. Available at https://www.computing.dcu.ie/~ray/teaching/CA485/notes/LectGFS.pdf

Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller (2004). Dynamic Metadata Management for Petabyte-Scale File Systems. Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, November 2004.

Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D.E. Long, Carlos Maltzahn (2006). Ceph: A Scalable, High-Performance Distributed File System. Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), Seattle, WA, November 2006.

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung (2003). The Google File System. SOSP '03: Proceedings of the nineteenth ACM symposium on Operating Systems Principles.

Shvachko, Kuang, Radia, Chansler (2010). The Hadoop Distributed File System. Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), Lake Tahoe, NV, pp. 1–10.

Shyam C Deshmukh, Sudarshan S Deshmukh (2015). Simple Application of GlusterFS: Distributed File System for Academics. International Journal of Computer Science and Information Technologies, vol. 6 (3), pp. 2972-2974.

Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni (2013). Review of Distributed File Systems: Case Studies. International Journal of Engineering Research and Applications (IJERA), vol. 3, Issue 1, pp. 1293-1298.

Torben Kling Petersen (2018). Inside The Lustre File System - An introduction to the inner workings of the world's most scalable and popular open source HPC file system. Technology Paper, Seagate, pp. 1-14.

