A Comparative Study of The Architectures and Applications of Scalable High-Performance Distributed File Systems
Dada, E. G. and Joseph, S. B.
FUDMA Journal of Sciences (FJS)
ISSN: 2616-1370
Vol. 2 No. 2, June, 2018, pp 223 – 233
ABSTRACT
Distributed File Systems have enabled the efficient and scalable sharing of data across networks. These systems were designed to handle several technical problems associated with networked data, for instance the reliability and availability of data, the scalability of the storage infrastructure, the high cost of gaining access to data, and the cost of maintenance and expansion. In this paper, we attempt a comparison of the key technical building blocks that are regarded as the mainstay of Distributed File Systems such as Hadoop FS, Google FS, Lustre FS, Ceph FS, Gluster FS, Oracle Cluster FS and TidyFS. This paper aims at elucidating the basic concepts and techniques employed in the above-mentioned file systems. We explain the different architectures and applications of these Distributed File Systems.
Keywords: Distributed File Systems, Google File System, Hadoop File System, Oracle
Cluster File System, Ceph File System.
challenge which is data security: whenever a virus, malware or worm infects the data, it rapidly affects all the data. Another challenge with HDFS is that it is, by its nature, unsuitable for storing small files. Each file is stored as a sequence of blocks; all blocks in a file except the last block are of the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor can be configured for each file. Files are "write once", and only one writer can write to a file at any time (Hurwitz et al., 2013). Other drawbacks of Hadoop DFS include centralisation: the Hadoop system uses a centralized master server, so the Hadoop cluster is unreachable whenever its NameNode is inoperative (Gemayel, 2016). The only suitable recovery approach to this problem so far is to restart the NameNode, although steps are being taken towards giving the system the ability to recover automatically. Moreover, HDFS has scalability issues. Since the NameNode stores the entire namespace and block locations in memory, the size of the NameNode heap limits the number of files and blocks that can be addressed. A possible solution to this problem is to permit the sharing of physical storage by several namespaces and NameNodes within a cluster (Suralkar et al., 2013).

Architecture of HDFS
An HDFS cluster contains one NameNode, acting as the master server, and several DataNodes, known as slaves in the architecture. HDFS stores filesystem metadata and application data independently: metadata is stored on an autonomous dedicated server called the NameNode, while application data are stored on separate servers termed DataNodes. All servers are fully connected and communicate using TCP-based protocols. Figure 1 shows the complete architecture of HDFS.
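To make the block and replication model described above concrete, the short Python sketch below splits a file of a given size into fixed-size blocks and assigns each block to a configurable number of DataNodes. It is only an illustrative model under simple assumptions (a round-robin placement policy and invented node names), not the actual HDFS placement code.

```python
# Illustrative model of HDFS-style block placement; the round-robin policy
# and node names are simplifications, not HDFS internals.
from itertools import cycle

BLOCK_SIZE = 128 * 1024 * 1024   # default HDFS block size (128 MB)
REPLICATION_FACTOR = 3           # default replication factor

def plan_blocks(file_size: int, datanodes: list[str],
                block_size: int = BLOCK_SIZE,
                replication: int = REPLICATION_FACTOR):
    """Return a list of (block_index, [datanodes]) placements for one file.

    All blocks except possibly the last are block_size bytes, and each block
    is replicated on `replication` distinct DataNodes.
    """
    num_blocks = max(1, -(-file_size // block_size))  # ceiling division
    rotation = cycle(range(len(datanodes)))
    plan = []
    for block_index in range(num_blocks):
        start = next(rotation)
        replicas = [datanodes[(start + r) % len(datanodes)]
                    for r in range(replication)]
        plan.append((block_index, replicas))
    return plan

if __name__ == "__main__":
    nodes = ["datanode1", "datanode2", "datanode3", "datanode4"]
    for block, replicas in plan_blocks(file_size=300 * 1024 * 1024,
                                       datanodes=nodes):
        print(f"block {block} -> {replicas}")
```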
5. Enterprise: HDFS is used for in-database analytics that help businesses make decisions faster than traditional analytic tools.
6. Credit cards: HDFS can assist an organisation to detect possible fraudulent acts in credit card transactions. Companies depend on in-database analytics because of its speed and accuracy; it performs a kind of verification before authorization whenever there is any suspicious activity.
7. Consumer goods: HDFS is used for storing collections of consumers' details and activities, for example the type of goods or products, place of purchase, quantity purchased, online transactions, etc. When analysed, this can help a company understand information such as the reason why customers purchase some goods more than others and the areas where they have to intensify their efforts, such as marketing the product.
8. Banking: Banks make use of HDFS for storing customer data; it also assists in detecting any questionable customer activity. It provides easy access to customers' information whenever a customer requests it, and it helps the banks to track their progress and enhance the efficiency of their service.
9. Hadoop is also being used by the giant ISP Yahoo and popular social media such as Facebook.
Google File System
Google File System (GFS) is a distributed file system introduced by Google to manage the huge amounts of data spread across its different databases (Ghemawat et al., 2003). GFS is mainly intended to provide efficient, dependable and fault-tolerant access to data by employing huge clusters of commodity servers. A Google system (GFS cluster) is made up of one master and many chunkservers (nodes) and can be accessed by several clients (Vijayakumari et al., 2014). GFS files are partitioned into chunks of 64 megabytes, and are normally appended to or read; they are overwritten or compressed only in exceptionally rare cases. Compared with other conventional file systems, GFS is designed and optimized to operate in data centers, to offer exceptionally high data processing capacity and minimal delay to rapidly changing data, and to continue operating despite individual server failures (Gemayel, 2016). The features that make GFS highly appealing include: continuous operation in the event of system breakdown, duplication of important data, automatic and effective salvaging of corrupted or damaged data, high aggregate throughput delivered to all terminals in a network, reduced client-to-master communication owing to the large chunk size, namespace organisation and locking, and better accessibility.
Some of the advantages of GFS include its very high availability and high fault tolerance through data duplication. Its single-master design makes it very efficient and simple. GFS ensures data integrity through the verification of each copy by the chunkserver using checksums, and it has a reduced checksum cost. It has increased bandwidth because of its batched operations, such as garbage collection and writing to the operation log. At the client end, no synchronization is needed because of the append operations. It takes care of caching issues, automatically collects garbage chunks using garbage collection, and continuously monitors chunkservers through regular messaging (Ghemawat et al., 2003).
GFS is limited to a special-purpose design and thus cannot accommodate general-purpose use. It is very inefficient in the case of small files (Gemayel, 2016). Slow garbage collection is a setback, i.e. when the files are not static. Consistency checks are done by the clients themselves. When the number of writers increases, there can be degradation in the performance of the system. The master's memory also poses a limitation (Kaushiki, 2012).

Architecture of GFS
A GFS cluster is made up of a single master and many chunkservers, and it can be accessed by several clients, as shown in Figure 2. Each of these components is normally a commodity Linux machine running a user-level server process. A chunkserver and a client can easily be run on the same machine, provided that the machine's resources allow it and that the lower reliability brought about by running possibly flaky application code is acceptable (Ghemawat et al., 2003).
Figure 2: Architecture of GFS. Clients, the GFS master and GFS chunkservers exchange control messages (instructions to chunkservers, chunkserver state) and data messages; clients request chunks by (chunk handle, byte range) and chunkservers return chunk data stored on their local Linux file systems.
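The read path implied by this architecture (and by the chunk handle and byte range exchange summarised in Figure 2) can be sketched in a few lines of Python. The classes, table layouts and names below are assumptions made for illustration; they model the message flow rather than reproduce GFS code.

```python
# Toy model of the GFS read path: the client asks the master for
# (chunk handle, chunkserver locations) and then reads chunk data from a
# chunkserver. Class names and in-memory tables are illustrative only.
CHUNK_SIZE = 64 * 1024 * 1024  # GFS chunk size: 64 MB

class Master:
    """Keeps only metadata: (file name, chunk index) -> (handle, locations)."""
    def __init__(self):
        self.chunk_table = {}

    def lookup(self, filename, chunk_index):
        return self.chunk_table[(filename, chunk_index)]

class ChunkServer:
    """Stores chunk contents on its local file system, keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}  # handle -> bytes

    def read(self, handle, start, length):
        return self.chunks[handle][start:start + length]

def client_read(master, chunkservers, filename, offset, length):
    """Read `length` bytes at `offset`, assuming the range lies in one chunk."""
    chunk_index = offset // CHUNK_SIZE            # which chunk holds the offset
    handle, locations = master.lookup(filename, chunk_index)
    replica = chunkservers[locations[0]]          # a real client picks a nearby replica
    return replica.read(handle, offset % CHUNK_SIZE, length)

# Minimal wiring to show the flow.
master, servers = Master(), {"cs1": ChunkServer()}
master.chunk_table[("/logs/day1", 0)] = ("handle-0001", ["cs1"])
servers["cs1"].chunks["handle-0001"] = b"record-a\nrecord-b\n"
print(client_read(master, servers, "/logs/day1", offset=0, length=8))
```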
The MDS does not participate in file read/write actions.
Object Storage Targets (OSTs): The OSTs keep user file data in one or many logical objects, which can be striped across various OSTs.
Object Storage Server (OSS): The OSS supervises read/write actions for (normally) several OSTs.
Depicted in Figure 3 below is the architecture of the Lustre distributed file system.
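A minimal Python sketch, assuming Lustre's usual round-robin striping in fixed-size stripe units, illustrates how a byte offset in a striped file maps to one OST and an offset within that OST's object. The OST names and parameters are hypothetical.

```python
# Illustrative mapping of a file offset to an OST under round-robin striping.
# stripe_size and stripe_count mirror Lustre's per-file striping parameters;
# the OST numbering here is hypothetical.
def locate_stripe(offset: int, stripe_size: int, stripe_count: int,
                  osts: list[str]) -> tuple[str, int]:
    """Return (ost, offset_within_object) for a byte offset in a striped file."""
    stripe_index = offset // stripe_size          # which stripe unit overall
    ost_index = stripe_index % stripe_count       # round-robin over the OSTs
    stripe_round = stripe_index // stripe_count   # completed rounds so far
    offset_in_object = stripe_round * stripe_size + (offset % stripe_size)
    return osts[ost_index], offset_in_object

# Example: 1 MB stripes over 4 OSTs; byte 5 MB lands on the second OST.
osts = ["OST0000", "OST0001", "OST0002", "OST0003"]
print(locate_stripe(5 * 1024 * 1024, 1024 * 1024, 4, osts))
```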
object is a store upon which all Ceph storage solutions are built (Sage et al., 2006). Two types of entities cooperate to provide a file system interface: clients and the metadata server (MDS). All that is required of a client is to open a file (i.e. you inform the MDS that you want to use the file), and afterwards reads and writes are exchanged directly, with only intermittent updates to the MDS (Giacinto et al., 2014). CephFS is depicted in Figure 4 below.
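The interaction just described (open a file through the MDS, then read and write directly, with only intermittent MDS updates) can be modelled with the hedged Python sketch below. The class names and fields are illustrative stand-ins for the MDS and the object store, not the Ceph client implementation.

```python
# Toy model of the CephFS client/MDS split: the MDS hands out file metadata
# at open time, subsequent I/O goes straight to object storage, and the MDS
# only receives occasional metadata updates.
class MetadataServer:
    def __init__(self):
        self.inodes = {}   # path -> inode number
        self.sizes = {}    # inode -> file size

    def open(self, path):
        """Return the inode the client will address directly."""
        inode = self.inodes.setdefault(path, len(self.inodes) + 1)
        self.sizes.setdefault(inode, 0)
        return inode

    def update_size(self, inode, size):
        """Intermittent metadata update after the client has written data."""
        self.sizes[inode] = max(self.sizes[inode], size)

class ObjectStore:
    """Stands in for the OSD cluster that holds the file's objects."""
    def __init__(self):
        self.objects = {}  # inode -> bytearray

    def write(self, inode, offset, data):
        buf = self.objects.setdefault(inode, bytearray())
        buf[offset:offset + len(data)] = data

    def read(self, inode, offset, length):
        return bytes(self.objects.get(inode, b"")[offset:offset + length])

mds, osds = MetadataServer(), ObjectStore()
ino = mds.open("/reports/2018.csv")   # 1. tell the MDS you want the file
osds.write(ino, 0, b"a,b,c\n")        # 2. I/O goes directly to the object store
mds.update_size(ino, 6)               # 3. only an occasional MDS update
print(osds.read(ino, 0, 6))
```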
Areas of Application of CephFS
1. CephFS is used in storing large files efficiently.
2. It is used in the field of telecommunication for storage and distribution.

GlusterFS File System
GlusterFS is a distributed network file system that has the capability to handle a growing amount of work. It is developed using the C programming language (Shyam and Sudarshan, 2015). By means of conventional standard hardware, GlusterFS can produce massive storage solutions that are geographically dispersed. It can be used for media streaming, data transformation and modeling, and other data and jobs that stress the maximum data transfer rate of a network or Internet connection. GlusterFS is free, open-source software that anyone can inspect, change, and distribute; its architecture is easy to understand, and users can install it on their own computers for personal use. The idea of a dedicated metadata service for locating files on servers does not apply in GlusterFS; it has the ability to compute a file's address and retrieve files without difficulty (a placement sketch is given after this section). GlusterFS utilizes the idea of bricks and volumes to separate various users' data into small allocation units for a file within a file system (Shyam and Sudarshan, 2015). GlusterFS has four major notions:

Bricks - the part of a computer in which information is stored. A brick is made up of a server and a path that points to a file system location by following the directory tree hierarchy (i.e., server:/export).
Translators - components that are joined together to transport data from point A to point B.
Trusted Storage Pool - a reliable interconnection of servers that serves as the central repository for the data and programs shared by users in a network.
Volumes - groups of bricks having the same condition for redundancy.

Architecture of GlusterFS File System
The GlusterFS architecture is easy to comprehend. It is a robust file system written in user space that employs FUSE (an interface between an operating system's kernel and a concrete file system) to attach itself to programs. The GlusterFS architecture is organised in layers, which makes it possible to add or delete features. GlusterFS works on top of ext3, ext4, XFS, etc. to store data. It allows horizontal growth, i.e. the addition of new resources in the network. The conceptual model that defines the structure of GlusterFS is depicted in Figure 5 below.
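Because GlusterFS computes a file's address rather than consulting a metadata service, a client can locate a file from its name alone. The sketch below imitates that idea by hashing the file path over the bricks of a volume; GlusterFS itself uses its own elastic-hashing translator over directory layouts, so treat this purely as an illustration of the concept, not the actual algorithm.

```python
# Illustration of metadata-server-free file placement: the brick that holds a
# file is computed from the file name, so no lookup service is needed.
# The hash function and brick layout here are simplifications.
import hashlib

def pick_brick(path: str, bricks: list[str]) -> str:
    """Deterministically map a file path to one brick of the volume."""
    digest = hashlib.md5(path.encode("utf-8")).hexdigest()
    return bricks[int(digest, 16) % len(bricks)]

# A volume assembled from bricks in "server:/export" form, as described above.
volume = ["server1:/export/brick1", "server2:/export/brick1",
          "server3:/export/brick1"]

for f in ["/projects/report.pdf", "/projects/data.csv"]:
    print(f, "->", pick_brick(f, volume))
```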
Area of Application of Oracle Cluster File System
According to Burleson (2017), Oracle Cluster File System is applicable in the areas explained below:
i. Metadata caching
ii. Metadata journaling
iii. Asynchronous and direct I/O support for database files, for enhanced database workload, throughput, resource optimization, and reduced contention.
iv. Oracle Database has found application in storing files related to software programs designed to accumulate, manage and disseminate information efficiently, such as CAD drawings, medical images, invoice images, documents, etc. Applications store such files in the database using the standard SQL data types BLOB (and CLOB). Compared to conventional file systems, Oracle Database offers superior security, accessibility, the ability to keep operating despite degraded transactions, and the capacity to handle growth in file size. Each time files are stored in the database, they are copied or archived, synchronized to the failure-recovery site by means of Data Guard, and recovered together with the relational data in the database (Burleson, 2017).
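As a small illustration of item (iv), the hedged sketch below stores a file as a BLOB from Python using the python-oracledb driver. The table, column names and connection details are assumptions made for the example; only the general pattern (read the file, bind the bytes to a BLOB column, and commit) reflects the usage described above.

```python
# Hedged sketch of item (iv): storing a file (e.g. an invoice image) as a BLOB.
# Assumes the python-oracledb driver and an illustrative table
#   CREATE TABLE invoice_images (id NUMBER PRIMARY KEY, image BLOB);
# Table, column and connection details are examples, not part of the paper.
import oracledb

def store_invoice_image(conn, invoice_id: int, path: str) -> None:
    with open(path, "rb") as f:
        data = f.read()
    with conn.cursor() as cur:
        # Declare the second bind variable as a BLOB (helps for payloads
        # larger than the default RAW bind size).
        cur.setinputsizes(None, oracledb.DB_TYPE_BLOB)
        cur.execute(
            "INSERT INTO invoice_images (id, image) VALUES (:1, :2)",
            [invoice_id, data],
        )
    conn.commit()

# Usage (placeholder credentials):
# conn = oracledb.connect(user="app", password="secret", dsn="dbhost/orclpdb1")
# store_invoice_image(conn, 42, "invoice_0042.png")
```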
TidyFS
The Tidy File System (TidyFS) is a small, uncomplicated client/server-based system that permits clients to retrieve and process data stored on the server as if it were on their own machine. TidyFS offers the abstractions required for concurrent processing of data on clusters (Fetterly et al., 2010). There has been a sudden increase of research in computing with large numbers of readily available computing components for parallel computing, to obtain the greatest amount of useful computation at reduced cost. Often, the high number of reads and writes that causes latency and bottlenecks in such clusters is generated by software systems that run on a cluster of networked computers and appear as one dependable machine offering a huge collective volume of computational and I/O performance; examples include MapReduce, Hadoop and Dryad. These workloads are high-throughput, sequential, and read-mostly (Fetterly et al., 2010). Unlike the other DFS that we discussed in the previous sections, TidyFS is very simple. The system avoids complicated replication protocols and read/write code paths by taking advantage of workload properties, which include the absence of simultaneous writes to a file by many clients and the presence of end-to-end fault tolerance in the execution engine (Fetterly et al., 2010).
Some of the advantages of TidyFS include: it permits applications to carry out I/O using whatever access patterns and data-compression techniques (reducing the number of bits required to represent data) suit them best, which can save storage capacity, speed up file transfer, and decrease costs for storage hardware and network bandwidth. It makes migrating data and tools from obsolete technologies to modern ones easier. TidyFS removes the need for an additional layer of indirection in its interfaces, ensuring that clients can realize the highest obtainable I/O performance of the native system (Sajjad and Harirbaf, 2013).
Architecture of TidyFS
Area of Application
TidyFS is mostly used with Microsoft's Dryad and DryadLINQ data-parallel computation systems (Fetterly et al., 2010).
Table 1 below is a comparison of the different distributed file systems we studied in this paper.
Table 1: Comparative table of characteristics of distributed file systems under study
HDFS - Data Scalability: Yes. Fault Tolerance: Block replication; secondary NameNode. Data Access Concurrency: Files have strictly one writer at any time. Data Striping: Block size of 128 MB. Supported OS: Linux and Windows are the supported platforms, but BSD, Mac OS/X and OpenSolaris are known to work. Reference: Suralkar et al., 2013.

Google FS - Data Scalability: Yes. Fault Tolerance: Chunk replication; metadata replication. Data Access Concurrency: Optimised for concurrent 'appends'. Data Striping: 64 MB chunks. Supported OS: Linux. Reference: Gemayel, 2016.

Lustre FS - Data Scalability: Yes. Fault Tolerance: Metadata replication by a single server; data is stored on reliable nodes. Data Access Concurrency: Many seeks and read-and-write operations of small amounts of data. Data Striping: 800 MB/sec of disk bandwidth. Supported OS: Linux; provides a POSIX-compliant UNIX file system interface. Reference: Paul et al., 2013.

CephFS - Data Scalability: Yes. Fault Tolerance: Metadata is replicated across multiple MDS nodes. Data Access Concurrency: Allows aggregate I/O performance to scale with the size of the OSD cluster. Data Striping: 14-node cluster of OSDs (around 58 MB). Supported OS: Linux VFS and page cache. Reference: Sage et al., 2006.

GlusterFS - Data Scalability: Yes. Fault Tolerance: Replication means a single file can be cloned and placed on multiple nodes. Data Access Concurrency: Lowest-level translator stores and accesses data from the local file system. Data Striping: At least 16 KB file size. Supported OS: Linux; supports Red Hat, CentOS, Fedora and Ubuntu flavors of Linux. Reference: Shyam and Sudarshan, 2015.

Oracle Cluster FS - Data Scalability: Yes. Fault Tolerance: Oracle Net connect-time failover and connection load balancing. Data Access Concurrency: Supports row-level locking. Data Striping: Can handle 4.3 billion fully consistent reads and 1.2 fully transactional writes per minute. Supported OS: Windows, Red Hat Linux and United Linux. Reference: Burleson, 2017.

TidyFS - Data Scalability: Yes. Fault Tolerance: Automatic replication of read-only database parts. Data Access Concurrency: Uses native interfaces to read and write data. Data Striping: Not available. Supported OS: Windows. Reference: Fetterly et al., 2010.
Kaushiki (2012). Google File System. Available at kaushiki-gfs.blogspot.com/

Knowledge Base (2018). About Lustre file systems. Indiana University. Available at: https://kb.iu.edu/d/ayfh

Kuchipudi Sravanthi and Tatireddy Subba Reddy (2015). Applications of Big Data in Various Fields. (IJCSIT) International Journal of Computer Science and Information Technologies, Vol. 6 (5), 2015, 4629-4632.

Madhavi Vaidya, Shrinivas Deshpande (2016). Comparative Analysis of Various Distributed File Systems & Performance Evaluation using Map Reduce Implementation, IEEE International Conference on Recent Advances and Innovations in Engineering (ICRAIE-2016), December 23-25, Jaipur, India.

Nader Gemayel (2016). Analyzing Google File System and Hadoop Distributed File System. Research Journal of Information Technology, 8: 66-74.

Pooja S.H. (2014). The Hadoop Distributed File System. International Journal of Computer Science and Information Technology, vol. 5(5), 6238-6243.

R. Vijayakumari, R. Kirankumar, K. Gangadhara R. (2014). Comparative analysis of Google File System and Hadoop Distributed File System. International Journal of Advanced Trends in Computer Science and Engineering, vol. 3, No. 1, pp. 553–558.

Ray Walshe (2015). Google File System - DCU School of Computing. Available at https://www.computing.dcu.ie/~ray/teaching/CA485/notes/LectGFS.pdf

Sage A. Weil, Kristal T. Pollack, Scott A. Brandt, and Ethan L. Miller (2004). Dynamic Metadata Management for Petabyte-Scale File Systems, Proceedings of the 2004 ACM/IEEE Conference on Supercomputing (SC '04), Pittsburgh, PA, November 2004.

Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, Carlos Maltzahn (2006). Ceph: A Scalable, High-Performance Distributed File System, Proceedings of the 7th Symposium on Operating Systems Design and Implementation (OSDI), Seattle, WA, November 2006.

Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung (2003). The Google File System, SOSP '03 Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles.

Shvachko, Kuang, Radia, Chansler (2010). The Hadoop Distributed File System, Proceedings of the 26th Symposium on Mass Storage Systems and Technologies (MSST '10), Lake Tahoe, NV, pp. 1–10.

Shyam C Deshmukh, Sudarshan S Deshmukh (2015). Simple Application of GlusterFS: Distributed File System for Academics. International Journal of Computer Science and Information Technologies, vol. 6 (3), pp. 2972-2974.

Sunita Suralkar, Ashwini Mujumdar, Gayatri Masiwal, Manasi Kulkarni (2013). Review of Distributed File Systems: Case Studies, International Journal of Engineering Research and Applications (IJERA), vol. 3, Issue 1, pp. 1293-1298.

Torben Kling Petersen (2018). Inside The Lustre File System - An Introduction to the Inner Workings of the World's Most Scalable and Popular Open Source HPC File System. Technology Paper, Seagate, pp. 1-14.