Analysis of Six Distributed File Systems: Benjamin Depardon, Gaël Le Mahec, Cyril Séguin
Benjamin Depardon
[email protected]
SysFera
Cyril Séguin
[email protected]
Laboratoire MIS, Université de Picardie Jules Verne
Gaël Le Mahec
[email protected]
Laboratoire MIS, Université de Picardie Jules Verne
2.4.1 Architecture
2.4.2 Naming, API and client access
2.4.3 Cache consistency
2.4.4 Replication and synchronisation
2.4.5 Load balancing
2.4.6 Fault detection
2.5 GlusterFS
2.5.1 Architecture
2.5.2 Naming
2.5.3 API and client access
2.5.4 Cache consistency
2.5.5 Replication and synchronisation
2.5.6 Load balancing
2.5.7 Fault detection
2.6 Lustre
2.6.1 Architecture
2.6.2 Naming
2.6.3 API and client access
2.6.4 Cache consistency
2.6.5 Replication and synchronisation
2.6.6 Load balancing
2.6.7 Fault detection
4.3.1 HDFS
4.3.2 MooseFS
4.3.3 iRODS
4.3.4 Ceph
4.3.5 GlusterFS
4.3.6 Lustre
4.4 System performance
5 Conclusion
Abstract
1 http://www.moosefs.org/
Chapter 1
In this section, we recall the basic issues, designs and features of DFSs. These definitions
are based on those of Levy and Silberschatz [15].
• Transparency: users should be able to access the system regardless of where they log in from,
perform the same operations on a DFS as on a local file system, and not have to care
about faults due to the distributed nature of the file system, thanks to fault tolerance
mechanisms. Transparency can also be seen in terms of performance: data
manipulations should be at least as efficient as on conventional file systems. In
short, the complexity of the underlying system must be hidden from users.
• Fault tolerance: a fault tolerant system should not be stopped by transient
or partial failures. The faults considered are network and server failures, which make
data and services unavailable, and data integrity and consistency issues when several users
concurrently access data.
• Scalability: this is the ability to efficiently leverage large numbers of servers which
are dynamically and continuously added to the system, usually tens of
thousands of nodes.
DFSs are designed to address these issues, as discussed in the next three sections.
reached more quickly than the limits of several. Another example is network congestion,
for which scalability depends on how the machines interact: performing a lot of data transfers
or exchanging a lot of messages can lead to congestion. Therefore, some key concepts must
be taken into account before building such a system.
Currently, some systems still adopt a centralised architecture, but provide tools to
push its limits. Multi-threading is one such tool: requests use few resources
and do not block the entire server, contrary to single-threaded systems. Thus,
several requests can be processed in parallel, but the system remains limited by
the computation power of the machine it runs on. Though local parallelism using multi-
threading improves a system's capacity to scale, it is far from enough for
today's volumes of data. Another solution consists in caching data, which can reduce the
number of data transfers. This is discussed in Section 1.3.3.
Following these observations and those of Thanh et al. [16], we now introduce some
DFS architectures:
All of these DFSs can also be parallel: data are divided into several blocks which
are simultaneously distributed across several servers, thus maximising throughput. This
is called striping [16, 17].
1.3 Transparency
In a DFS, the end-user does not need to know how the system is designed, how data
is located and accessed, and how faults are detected. In this section, we present some
features that ensure transparency.
1.3.1 Naming
Naming is the mapping between a logical name and the physical location of a piece of data. For example,
in a classic file system, clients use a logical (textual) name to access a file, which is
mapped to physical disk blocks. In a DFS, the name of the server holding the disk on which the data
is stored must be added to this mapping. DFSs must respect location transparency: the details of how and
where files are stored are hidden from clients. Furthermore, multiple copies of a file (see
Section 1.4.1) may exist, so the mapping must return the set of locations of all the available
copies. DFSs should also be location independent: the logical name should not change
even if the file is moved to another physical location. To do so, allocation tables [1] or
sophisticated algorithms [7] are used to provide a global name space structure, that is, the
same name space for all clients. For more technical details about naming, see [15].
• A command line interface (CLI) is used to access files with traditional Unix commands
(cp, rm, mv . . . ).
• Java, C, C++, other programming languages and REST (web-based) APIs can be
used to design graphical interfaces similar to the Windows explorer.
• Users can be allowed to mount (attach) remote directories to their local file system,
thus accessing remote files as if they were stored on a local device. The FUSE
mechanism or the Unix mount command are examples of this.
1.3.3 Caching
Caching is a technique which consists in temporarily storing requested data in the client's
memory. DFSs use caching to avoid the additional network traffic and CPU consumption
caused by repeated queries on the same file, and thus to increase performance [18]. When
a piece of data is requested for the first time, a copy is made from the server that holds this
data to the client's main memory. For every subsequent request on this data, the client
uses the local copy, avoiding communication with the server and disk accesses. This
feature is related to performance transparency, since with this technique requests can be
performed quickly, hiding data distribution from users. However, when the data is changed,
the modification must be propagated to the server and to any other clients that have
cached the data. This is the cache consistency problem discussed in Section 1.4.3.
if a fault occurs anywhere in the system, data is still available [1]. However, this can lead
to consistency issues, which are discussed in the next section.
1.4.2 Synchronisation
In DFSs, synchronisation between the copies (see Section 1.4.1) of a piece of data must be taken into
account. When data is rewritten, all of its copies must be updated to provide users with
the latest version of the data. Three main approaches exist:
• In the synchronous method, any request on modified data is blocked until all the
copies are updated. This ensures that users access the latest version of the data, but
delays query execution.
• In the second method, called asynchronous, requests on modified data are allowed
even if the copies are not yet updated. This way, requests can be performed in a reasonable
time, but users may access an out-of-date copy.
• The last approach is a trade-off between the first two. In the semi-asynchronous
method, requests are blocked until some, but not all, of the copies are updated. For
example, assuming there are five copies of a piece of data, a request on this data will be
allowed once three copies have been updated. This limits the possibility of accessing
out-of-date data, while reducing the delay in query execution.
• Write Once Read Many (WORM) is a first approach to ensure consistency. Once a
file is created, it cannot be modified. Cached files are in read-only mode, therefore
each read reflects the latest version of the data.
• A second method is transactional locking, which consists in obtaining a read lock
on the requested data, so that no other user can perform a write on this data,
or obtaining a write lock in order to prevent any reads or writes on this data.
Therefore, each read reflects the latest write, and writes are done in order.
• Another approach is leasing. A lease is a contract for a limited period between the server
holding the data and the client requesting this data for writing. The lease is provided
when the data is requested, and during the lease the client is guaranteed that no
other user can modify the data. The data becomes available again when the lease expires or
when the client releases its rights [19]. For future read requests, the cache is updated
if the data has been modified. For future write requests, a lease is provided to the
client if allowed (that is, if no lease exists for this data or the rights have been released).
allow the system to detect server failures and server overloads. To correct these faults,
servers can be added or removed. When a server is removed from the system, the latter
must be able to recover the lost data and to store it on other servers. When a server
is added to the system, tools must be provided for moving data from a hot server to the newly added
one. Users do not have to be aware of this mechanism. Usually, DFSs use a
scheduled list in which they put the data to be moved or recopied. Periodically, an algorithm
iterates over this list and performs the desired action. For example, Ceph uses a function
called Controlled Replication Under Scalable Hashing (CRUSH) to randomly
store new data, move a subset of existing data to new storage resources and uniformly
restore data from removed storage resources [5].
Chapter 2
It is difficult to make an exhaustive study given the number of existing DFSs. In this
paper, we choose to study DFSs that are popular, used in production, and frequently updated:
HDFS [1], MooseFS1 , iRODS [2, 3, 4], Ceph [5, 6, 7], GlusterFS [8, 9] and Lustre [10, 11,
12, 13, 14].
2.1 HDFS
HDFS2 is the Hadoop Distributed File System, under the Apache licence 2.0, developed by the
Apache Software Foundation [1].
2.1.1 Architecture
HDFS is a centralised distributed file system. Metadata are managed by a single server
called the namenode, and data are split into blocks, distributed and replicated across several
datanodes. A secondary namenode is provided as a persistent copy of the namenode.
This allows HDFS, in case of namenode failure, to restart with an up-to-date configuration
by restoring the namespace from the secondary namenode.
2.1.2 Naming
HDFS handles its name space in a hierarchy of files and directories using inodes, which
hold metadata such as permissions, disk space quotas and access times. The name space
and metadata are managed by the namenode, which also performs the mapping between
file names and the file blocks stored on the datanodes.
in user space). This is an interface which presents users with a virtual file system that
corresponds to a physical remote directory. Thus, each client request is relayed to a
remote file through this interface.
decisions for load balancing. The namenode's instructions (to correct faults), such as removing or
replicating a block, are also sent to the datanodes through heartbeats.
2.2 MooseFS
MooseFS3 is an open source (GPL) distributed file system developed by Gemius SA.
2.2.1 Architecture
MooseFS behaves like HDFS: it has a master server managing metadata and several chunk servers
storing and replicating data blocks. A small difference is that MooseFS provides
failover between the master server and the metalogger servers. These are machines which
periodically download metadata from the master in order to be promoted as the new master
in case of failure.
2.2.2 Naming
MooseFS manages the namespace as HDFS does. It stores the metadata (permissions, last
access times . . . ) and the hierarchy of files and directories in the master's main memory, while
keeping a persistent copy on the metalogger. It provides users with a global name
space of the system.
2.3 iRODS
iRODS4 [2, 3, 4] is a highly customisable system developed by the Data Intensive Cyber
Environments (DICE) research group.
2.3.1 Architecture
iRODS, a centralised system, has two major components: the iCat server, which stores
metadata in a database and handles queries to these metadata, and several iRODS servers,
which store data on storage resources. An iCat server and several iRODS servers form a
zone. Compared to the other distributed file systems, iRODS relies on the storage resources'
local file system (Unix file system, NTFS . . . ) and does not format or deploy its own file
system.
2.3.2 Naming
iRODS stores the name space and metadata in a database, and provides SQL-like tools
to query the metadata. Users see the same hierarchy of files and directories
as in a Unix file system (e.g., /home/myname/myfile.txt). iRODS also provides tools to
federate different zones, making files of one zone reachable by clients of another zone.
whenever a server is overloaded, according to configurable parameters (CPU load, used
disk space . . . ), and allows iRODS to choose the appropriate storage resource in a group to place
new data. However, it is possible to tell iRODS to avoid or force a specific resource for a given
piece of data using other rules. Users can also move data from one storage resource to another.
Therefore, just like replication, users can choose how to balance the system.
2.4 Ceph
Ceph5 [5, 6, 7] is an open source (LGPL) distributed file system developed by Sage Weil.
2.4.1 Architecture
Ceph is a totally distributed system. Unlike HDFS, to ensure scalability, Ceph provides
dynamic distributed metadata management using a metadata server cluster (MDS) and stores
data and metadata in Object Storage Devices (OSD). The MDSs manage the namespace, the
security and the consistency of the system and handle metadata queries, while the OSDs
perform the I/O operations.
function according to free disk space and weighted devices, using the same placement
policy as HDFS. Ceph implements three synchronous replication strategies: primary-copy,
chain and splay replication. In primary-copy replication, the first OSD in the PG forwards
the writes to the other OSDs and, once the latter have sent an acknowledgement, applies
its own write; then reads are allowed. In chain replication, writes are applied sequentially
and reads are allowed once the last replica on the last OSD has been written. Finally,
in splay replication, half of the replicas are written sequentially and the rest in
parallel; reads are permitted once all OSDs have applied the write.
2.5 GlusterFS
GlusterFS6 [8, 9] is an open source (GPL) distributed file system developed by the gluster
core team.
2.5.1 Architecture
GlusterFS is different from the other DFSs. It has a client-server design in which there
is no metadata server. Instead, GlusterFS stores data and metadata on several devices
attached to different servers. The set of devices is called a volume which can be configured
to stripe data into blocks and/or replicate them. Thus, blocks will be distributed and/or
replicated across several devices inside the volume.
2.5.2 Naming
GlusterFS does not manage metadata in a dedicated, centralised server; instead, it
locates files algorithmically using the Elastic Hashing Algorithm (EHA) [8] to provide a
global name space. EHA uses a hash function to convert a file's pathname into a fixed-length,
uniform and unique value. Each storage device is assigned a range of values, allowing the
system to store a file based on its value. For example, let us assume there are two storage
devices, disk1 and disk2, which respectively store files with values from 1 to 20 and from
21 to 40. The file myfile.txt is converted to the value 30; therefore, it will be stored on
disk2.
6 http://www.gluster.org/
2.6 Lustre
Lustre7 [10, 11, 12, 13, 14] is a DFS available for Linux, released under the GPL licence.
2.6.1 Architecture
Lustre is a centralised distributed file system which differs from the other DFSs studied here
in that it does not keep any copies of data or metadata. Instead, Lustre relies on shared storage
devices: metadata are stored on a shared storage device called the Metadata Target (MDT),
attached to two Metadata Servers (MDS), thus offering an active/passive failover. The MDSs are
the servers that handle the requests to metadata. Data themselves are managed in the same way:
they are split into objects and distributed across several shared Object Storage Targets (OST),
which can be attached to several Object Storage Servers (OSS) to provide an active/active failover.
The OSSs are the servers that handle I/O requests.
7 http://wiki.lustre.org/index.php/
2.6.2 Naming
The Lustre’s single global name space is provided to user by the MDS. Lustre uses in-
odes, like HDFS or MooseFS, and extended attributes to map file object name to its
corresponding OSTs. Therefore, clients will be informed of which OSTs it should query
for each requested data.
Chapter 3
3.1 Scalability
DFSs must cope with an increasing number of clients performing requests and I/O operations,
and with a growing number of files of various sizes to store. Scalability is the system's
ability to grow to meet these demands without degrading performance.
Here, we discuss the benefits and disadvantages of the different architectures used.
Table 3.2: Input and Output performance
              HDFS        iRODS       Ceph        GlusterFS   Lustre      MooseFS
Input/Output  I     O     I     O     I     O     I     O     I     O     I     O
1 × 20GB      407s  401s  520s  500s  419s  382s  341s  403s  374s  415s  448s  385s
1000 × 1MB    72s   17s   86s   23s   76s   21s   59s   18s   66s   5s    68s   4s
that can be created. Furthermore, it does not separate data and metadata management,
which allows it to scale quickly by just adding one server. Ceph acts like GlusterFS but
distributes the metadata management across several metadata servers, which allows it to cope
with a large number of client requests. However, to increase both the amount of data and
the number of client queries, Ceph needs to add two kinds of servers, metadata and data servers,
which makes the system more complex to scale.
3.2 Transparency
In a DFS, the complexity of the underlying system must be hidden from users. They should
access files and perform operations in the same way as on a local file system, and should
not have to care about faults due to the distributed nature of the file system. We now compare
the different features used to ensure transparency.
is that it is the responsibility of the metadata server to find where data are stored when
a client requests a file, adding more computing pressure on this server. Moreover, HDFS
and MooseFS store metadata in memory, restricting the number of files that can be created.
This is not the case for iRODS and Lustre, since they put metadata on large disk space.
Ceph and GlusterFS use an algorithm to calculate the data's location. This reduces the
metadata servers' workload, because it is the clients that work out where data are located;
the metadata servers only have to provide the information needed to correctly run the algorithm.
Nevertheless, contrary to maintaining an index, with this method clients do not immediately
know where the data is stored when they request a file: they first need to compute the data's
location before accessing it.
using failover: several metadata servers, in standby, periodically save the metadata to be
ready to take control of the system.
run to perform load balancing, whereas HDFS is better since this is done automatically.
Chapter 4
We have performed some tests on the different DFSs on the Grid'5000 platform. In this chapter,
we explain how we set up the DFSs surveyed on this platform, detail how we accessed
the DFSs from an outside network, show the DFSs' behaviour in case of faults,
and finally present the results of some performance tests.
4.1.1 HDFS
Installation
HDFS4 requires Java to be installed beforehand. We then downloaded the Hadoop
package5, put it on all nodes (including clients), and installed it with root
permissions:
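The exact installation commands are not reproduced in this listing. A minimal sketch of this step, assuming the stable Hadoop 1.x tarball (the version number is only an example), could be:
node~: tar xzf hadoop-1.0.4.tar.gz -C /usr/local
node~: ln -s /usr/local/hadoop-1.0.4 /usr/local/hadoop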
Configuration
First we choose a node to be the namenode. Then, for all servers, we edit four files:
hdfs-site.xml, core-site.xml, slaves and hadoop-env.sh. The first includes settings for the
namespace checkpoint’s location and for where the datanodes store filesystem blocks. The
1 https://www.grid5000.fr/gridstatus/oargridmonika.cgi
2 https://www.grid5000.fr/mediawiki/index.php/Category:Portal:Environment
3 https://www.grid5000.fr/mediawiki/index.php/Toulouse:Hardware
4 http://developer.yahoo.com/hadoop/tutorial/
5 http://wwwftp.ciril.fr/pub/apache/hadoop/core/stable/
Table 4.1: HDFS config files

hdfs-site.xml:
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/tmp/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/tmp/dfs/data</value>
  </property>
</configuration>

core-site.xml:
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode_host:port</value>
  </property>
</configuration>

slaves:
datanodes1
datanodes2
datanodes3
...
datanodesN
second specifies which node is the namenode, the third must contain all the datanodes'
hostnames and the last holds the JAVA_HOME variable, which specifies the path to the Java
directory. Note that for HDFS clients, only the JAVA_HOME variable and core-site.xml
must be modified. Table 4.1 shows the config files used in our tests.
Running HDFS
Once connected to the namenode, we can start HDFS and then, from the clients, perform
some operations:
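The command listing is not reproduced here. A typical session, assuming the Hadoop 1.x layout sketched above and a local file named local_file, would look like:
namenode~: /usr/local/hadoop/bin/start-dfs.sh
user~: /usr/local/hadoop/bin/hadoop fs -put local_file /remote_file
user~: /usr/local/hadoop/bin/hadoop fs -get /remote_file local_destination
user~: /usr/local/hadoop/bin/hadoop fs -ls /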
4.1.2 MooseFS
MooseFS6 requires pkg-config and zlib1g-dev to be installed beforehand. We then
downloaded the MooseFS archive7 and the fuse package8; the latter is needed for
MooseFS clients. On all nodes (including clients) we extract the archive and create a
MooseFS group and user:
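The corresponding commands are not shown in the listing. A sketch, assuming the mfs-1.6.25-1 archive used in the build steps below:
node~: tar xzf mfs-1.6.25-1.tar.gz
node~: groupadd mfs
node~: useradd -g mfs mfs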
Depending on the kind of server (master, backup, chunk or client), the installation
differs.
Master server
• Installation:
6 http://www.moosefs.org/tl_files/manpageszip/moosefs-step-by-step-tutorial-v.1.1.pdf
7 http://www.moosefs.org/download.html
8 http://sourceforge.net/projects/fuse/
master~: cd mfs-1.6.25-1
master~: ./configure --prefix=/usr --sysconfdir=/etc \
         --localstatedir=/var/lib --with-default-user=mfs \
         --with-default-group=mfs --disable-mfschunkserver \
         --disable-mfsmount
master~: make; make install
• Configuration:
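The configuration commands for the master are missing from the listing. A sketch, assuming the sample files shipped with MooseFS 1.6 (file names taken from the official tutorial) and the --sysconfdir=/etc and --localstatedir=/var/lib options used above:
master~: cd /etc
master~: cp mfsmaster.cfg.dist mfsmaster.cfg
master~: cp mfsexports.cfg.dist mfsexports.cfg
master~: cp /var/lib/mfs/metadata.mfs.empty /var/lib/mfs/metadata.mfs
master~: echo "ip_master_server mfsmaster" >> /etc/hosts
master~: mfsmaster start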
Backup server
• Installation:
backup~: cd mfs-1.6.25-1
backup~: ./configure --prefix=/usr --sysconfdir=/etc \
         --localstatedir=/var/lib --with-default-user=mfs \
         --with-default-group=mfs --disable-mfschunkserver \
         --disable-mfsmount
backup~: make; make install
• Configuration:
– First add master’s server IP to hosts files:
backup~: echo "ip_master_server mfsmaster" >> /etc/hosts
– Then run the following commands to avoid some errors:
backup~: cd /etc
backup~: cp mfsmetalogger.cfg.dist mfsmetalogger.cfg
• Running backup server:
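The start command is missing from the listing; with the MooseFS 1.6 daemons it would simply be:
backup~: mfsmetalogger start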
Chunk server
• Installation:
chunk~: cd mfs-1.6.25-1
chunk~: ./configure --prefix=/usr --sysconfdir=/etc \
        --localstatedir=/var/lib --with-default-user=mfs \
        --with-default-group=mfs --disable-mfsmaster
chunk~: make; make install
• Configuration:
– First add master’s server IP to hosts files:
chunk~: echo "ip_master_server mfsmaster" >> /etc/hosts
– Then configure the storage which will store data’s blocks:
chunk~: echo "/tmp" >> mfshdd.cfg
chunk~: chown -R mfs:mfs /tmp
– Finally run the following commands to avoid some errors:
chunk~: cd /etc
chunk~: cp mfschunkserver.cfg.dist mfschunkserver.cfg
chunk~: cp mfshdd.cfg.dist mfshdd.cfg
• Running chunk server:
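The start command is missing from the listing; with the MooseFS 1.6 daemons it would simply be:
chunk~: mfschunkserver start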
Client
• Installation:
– FUSE:
user~: cd fuse-2.9.2
user~: ./configure; make; make install
– MooseFS:
user~: ./configure --prefix=/usr --sysconfdir=/etc \
       --localstatedir=/var/lib --with-default-user=mfs \
       --with-default-group=mfs --disable-mfsmaster \
       --disable-mfschunkserver
user~: make; make install
• Configuration:
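The client configuration commands are missing. A sketch, assuming the /tmp/mfs mount point used in the fault-tolerance tests of Section 4.3.2:
user~: echo "ip_master_server mfsmaster" >> /etc/hosts
user~: mkdir -p /tmp/mfs
user~: mfsmount /tmp/mfs -H mfsmaster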
• Perform operations:
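For example, once the MooseFS volume is mounted, files are accessed with ordinary commands:
user~: cp local_file /tmp/mfs/
user~: cp /tmp/mfs/local_file local_destination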
4.1.3 iRODS
The setup of iRODS9,10 is performed as a non-root user and is interactive. When the setup script is
run, some questions are asked to configure iRODS: we can choose, for example, which
server will be the iCat server, or where data will be stored on the iRODS servers. The iCat server needs
PostgreSQL and ODBC to store metadata; their installation is run automatically during
the iRODS setup. However, on the Grid'5000 platform some downloads are blocked, so we
had to retrieve this software manually and put it on the iCat server, so that iRODS can
detect that a download is not needed:
Then, extract the iRODS archive, install and perform some operations:
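The extraction and setup commands are not reproduced. A sketch, assuming an iRODS 3.x archive (the archive name is only an example) and the interactive irodssetup script shipped with it:
user~: tar xzf irods3.2.tgz
user~: cd iRODS
user~: ./irodssetup
user~: cd ..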
user~: cd iRODS/clients/icommands/bin
user~: ./iput local_file irods_destination
user~: ./iget irods_file local_destination
4.1.4 Ceph
Installation
To set up Ceph, download and install the following package11 on all nodes (including
clients) with root permissions:
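The installation command is missing; with the Debian package from the repository above (the version number is only an example), it would be along the lines of:
node~: dpkg -i ceph_0.48-1_amd64.deb
node~: apt-get -f -y install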
Configuration
Ceph12 uses a single config file for all nodes (including clients). Here is the config file
used in our tests:
9 https://www.irods.org/index.php/Downloads
10 https://www.irods.org/index.php/Installation
11 http://ceph.com/debian/pool/main/c/ceph/
12 http://ceph.com/docs/master/start/
[global]
auth supported = none
keyring = /etc/ceph/keyring
[mon]
mon data = /tmp/mon.$id
keyring = /etc/ceph/keyring.$name
[mds]
keyring = /etc/ceph/keyring.$name
[osd]
osd data = /tmp/osd.$id
osd journal = /root/osd.$id.journal
osd journal size = 1000
filestore xattr use omap = true
keyring = /etc/ceph/keyring.$name
[mon."num_mon"]
host = "mon_hostname"
mon addr = "ip_mon":6789
[mds."num_mds"]
host = "mds_hostname"
[osd."num_osd"]
host = "osd_hostname"
For each monitor, metadata server and data server, replace "num_mon", "num_mds"
and "num_osd" with the server's number (1, 2, 3 . . . ). Finally, do not forget to create all the
directories needed on the monitors and data servers:
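For example, on the first monitor and on the first data server, with the data paths defined in the configuration file above:
mon1~: mkdir -p /tmp/mon.1
osd1~: mkdir -p /tmp/osd.1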
Running Ceph
Choose a main monitor and run the following command from it:
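The commands are missing from the listing. A sketch, assuming the mkcephfs and service tools shipped with Ceph 0.4x and the /ceph mount point used below (the monitor address is a placeholder):
mon1~: mkcephfs -a -c /etc/ceph/ceph.conf
mon1~: service ceph -a start
user~: mkdir -p /ceph
user~: mount -t ceph ip_mon:6789:/ /ceph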
user~: cp local_file /ceph
user~: cp /ceph/file local_destination
4.1.5 GlusterFS
Installation
First, download the GlusterFS package13 on all nodes (including clients), edit the sources.list
file and install GlusterFS14 with root permissions:
node~: glusterfs_3.3.0-1_amd64.deb
node~: echo "deb http://ftp.de.debian.org/debian sid main" >> /etc/apt/sources.list
node~: apt-get update
node~: dpkg -i glusterfs_3.3.0-1_amd64.deb
node~: apt-get -f -y install
Finally, on all servers create a directory in which data will be stored and run GlusterFS:
node~: mkdir /tmp/data
node~: /etc/init.d/glusterd start
Configuration
First, choose a main server on which to create a pool of trusted servers (the main server is
automatically included in the pool). Then, from the main server, we can create a replicated
and/or striped volume; note that for n stripes and p replicas, the number of servers needed
is n × p:
main~: gluster peer probe "server1_hostname"
main~: gluster peer probe "server2_hostname"
...
main~: gluster peer probe "serverN_hostname"
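The volume creation and mount commands are missing from the listing. A sketch for the 2-stripe, 2-replica, 4-server setup used in Section 4.3.5 (the volume name testvol is arbitrary, and /tmp/data is the brick directory created above):
main~: gluster volume create testvol stripe 2 replica 2 \
       server1:/tmp/data server2:/tmp/data server3:/tmp/data server4:/tmp/data
main~: gluster volume start testvol
user~: mkdir -p /tmp/gluster
user~: mount -t glusterfs main_hostname:/testvol /tmp/gluster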
4.1.6 Lustre
Lustre requires installing15 a new Linux kernel and rebooting on it. We had to create a new
environment on the Grid'5000 platform, which is detailed here:
13 http://www.gluster.org/download/
14 http://www.gluster.org/community/documentation/index.php/Main_Page
15 http://wiki.debian.org/Lustre#Installation_of_Lustre_2.2_in_Debian_Squeeze
Lustre environment on grid5000 platform
On one node, download the following packages16 17 and install them with root permissions:
Then create an archive of the new environment18 and modify the config file19 :
tarball : archive.tgz|tgz
kernel : /boot/"new_kernel"
initrd : /boot/"new_initrd"
Now we can run the new environment on all nodes (including clients) and boot on the
new kernel:
Running Lustre
On metadata server side (mds), choose a partition to format and mount it:
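The format and mount commands are missing. A sketch, assuming a spare partition /dev/sda5 (the device name is only an example) and the lustrefs file system name used in the client mount below; the object storage servers (oss) are formatted in a similar way:
mds~: mkfs.lustre --fsname=lustrefs --mgs --mdt --index=0 /dev/sda5
mds~: mkdir -p /mnt/mdt
mds~: mount -t lustre /dev/sda5 /mnt/mdt
oss~: mkfs.lustre --fsname=lustrefs --ost --index=0 --mgsnode=ip_mds@tcp0 /dev/sda5
oss~: mkdir -p /mnt/ost
oss~: mount -t lustre /dev/sda5 /mnt/ost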
user~: mkdir /lustre
user~: mount -t lustre "ip_mds"@tcp0:/lustrefs /lustre
4.2.1 HDFS
We run the following command and then modify the core-site.xml file:
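The forwarding command itself is missing. A sketch, assuming the namenode listens on port 9000 (as in the core-site.xml below) and a single SSH hop through the Grid'5000 frontend (hostnames are placeholders):
user~: ssh -L 9000:namenode_host:9000 login@grid5000_frontend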
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>
</configuration>
We are able to communicate with the namenode and thus perform all the operations
related to metadata (ls, stat . . . ). However, to write files, we would need to forward the datanodes'
ports too, which is more complex, and we did not try it.
4.2.2 MooseFS
We modify the hosts file and run the following commands:
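The commands are missing from the listing. A sketch, assuming the default MooseFS client port 9421 and placeholder hostnames:
user~: echo "127.0.0.1 mfsmaster" >> /etc/hosts
user~: ssh -L 9421:master_host:9421 login@grid5000_frontend
user~: mfsmount /tmp/mfs -H mfsmaster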
4.2.3 iRODS
We run the following command and then modify the .irodsEnv file:
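The command and the modified lines are missing. A sketch, assuming the default iRODS port 1247 and the standard ~/.irods/.irodsEnv variables:
user~: ssh -L 1247:icat_host:1247 login@grid5000_frontend
.irodsEnv (excerpt):
irodsHost 'localhost'
irodsPort 1247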
4.2.4 Ceph
We run the following command and then modify the ceph.conf file:
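The forwarding command is missing. A sketch, assuming the monitor port 6789 used in the configuration and placeholder hostnames:
user~: ssh -L 6789:mon_host:6789 login@grid5000_frontend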
[mon.1]
host = localhost
mon addr = 127.0.0.1:6789
Currently, we have not yet succeeded in interacting with the system from an outside client.
4.2.5 GlusterFS
For GlusterFS, it is harder because several TCP and UDP ports are used. We tried to redirect
all of them, without any success.
4.3.1 HDFS
We use 1 namenode, 5 datanodes, 3 replicas.
• Put a data (34MB):
• Detection:
The faulty node is detected and, ten minutes later, it is removed from the system.
• Satisfying the replication:
• Detection:
4.3.2 MooseFS
We use 1 master server, 1 backup node, 5 chunk servers, 3 replicas.
• Put a data (34MB):
cp toto /tmp/mfs/toto
The file is replicated three times across four of the nodes. We do not know why
server3 does not hold any data.
• Crash a node:
• Detection: We have used a REST API which provides users with a global monitoring
of the system. The faulty node is detected and removed from the system.
The 13MB lost are not recovered on other nodes. Replication is not satisfied.
• Get a data:
cp /tmp/mfs/toto toto
• Detection: Using the REST API, we can see that the node is quickly available again.
• Load balancing
Finally, we did not remove the data from the faulty server, which is why the 13MB are
recovered. The system is not automatically balanced.
4.3.3 iRODS
We use 1 iCat, 4 iRODS servers.
• Put a data (34MB):
              Total   Server1  Server2  Server3  Server4  Server5
SD before             197M     197M     197M     197M     197M
Put toto      34M
SD after               231M     197M     197M     197M     197M
Modification  34M     34M      0M       0M       0M       0M
The file is put on one node since iRODS does not split data into blocks.
• Crash a node:
• Detection:
ips -a
ERROR: for at least one of the server failed.
The faulty node is detected and the data is lost since there is no replication.
• Rebooting node:
server1:~#irodsctl istart
Starting iRODS server...
• Detection:
ips -a
Server: server1
28237 rods#tempZone 0:00:00 ips 192.168.159.117
The node is quickly detected and if data are not removed from disk, they are
available again.
4.3.4 Ceph
We use 2 mds, 2 mon, 3 osd, 2 replicas.
• Put a data (34MB):
cp toto /ceph/toto
kapower3 -m server1 --off
The system is going down for system halt NOW!
• Detection:
ceph -s
osd : 3 osds: 2 up, 2 in
cp /ceph/toto toto
• Detection:
ceph -s
osd : 3 osds: 3 up, 3 in
4.3.5 GlusterFS
We use 4 servers, 2 stripes and 2 replicas.
• Put a data (34MB):
cp toto /tmp/gluster/toto
• Detection:
cp /tmp/gluster/toto toto
server1:~#/etc/init.d/glusterd start
Starting glusterd service: glusterd.
• Detection:
Finally, if the data are not removed from the faulty disk, they are available again,
but the system is not automatically balanced.
4.3.6 Lustre
We use 1 mds, 4 data servers, no replica.
• Put a data (34MB):
cp toto /lustre/toto
• Detection:
• Get a data:
cp /lustre/toto toto
Input/Output error
• Detection:
Finally, if data are not removed from disk, they are available again, but the system
is not automatically balanced.
20 https://www.grid5000.fr
21 https://www.grid5000.fr/mediawiki/index.php/Toulouse:Hardware
Chapter 5
Conclusion
DFSs are the principal storage solution used by supercomputers, clusters and datacenters.
In this paper, we have given a presentation and a comparison of six DFSs based on
scalability, transparency and fault tolerance. The DFSs surveyed are Lustre, HDFS, Ceph,
MooseFS, GlusterFS and iRODS. We have seen that these DFSs ensure transparency and
fault tolerance using different methods that provide the same results. The main difference
lies in the design. In theory, decentralised architectures seem to scale better than
centralised ones thanks to distributed workload management. Furthermore, the choice
of a DFS should be made according to its intended use. For performance, asynchronous
replication and the use of an index to maintain the namespace are preferable, whereas
a decentralised architecture is better for managing large amounts of data and requests.
The comparison given in this paper is theoretical. However, we have performed some
simple tests to measure system accessibility and fault tolerance. We tried to access
a cluster in a private network from another one with only an SSH connection. Using
port forwarding, we concluded that only iRODS and MooseFS are easily accessible.
Regarding fault tolerance, we only simulated a crash on a data server. For all DFSs
except Lustre, the faulty server is detected and put in quarantine in a transparent
way. The desired number of replicas is maintained, except for GlusterFS and iRODS. We
hope to perform stronger tests in the future to provide a practical analysis: in particular,
measuring the scalability and limits of the metadata server(s) by stressing them, that is,
by sending many concurrent requests. Asynchronous and synchronous I/O operations can also be
compared. Finally, testing fault tolerance in a more thorough way is needed.
Acknowledgment
This work was developed with financial support from the ANR (Agence Nationale de la
Recherche) through the SOP project referenced 11-INFR-001-04.
Bibliography
[1] Shvachko, K., Kuang, H., Radia, S., Chansler, R.: The hadoop distributed file
system. In: Proceedings of the 2010 IEEE 26th Symp. on Mass Storage Systems and
Technologies (MSST), Washington, DC, USA, IEEE Computer Society
[2] Rajasekar, A., Moore, R., Hou, C.y., Lee, C.A., Marciano, R., de Torcy, A., Wan,
M., Schroeder, W., Chen, S.Y., Gilbert, L., Tooby, P., Zhu, B.: iRODS Primer:
integrated Rule-Oriented Data System. Morgan and Claypool Publishers (2010)
[3] Wan, M., Moore, R., Rajasekar, A.: Integration of cloud storage with data grids.
In: Proc. Third Int. Conf. on the Virtual Computing Initiative. (2009)
[4] Hünich, D., Müller-Pfefferkorn, R.: Managing large datasets with irods - a perfor-
mance analyses. In: Int. Multiconf. on Computer Science and Information Technol-
ogy - IMCSIT 2010, Wisla, Poland, 18-20 October 2010, Proceedings. (2010) 647–654
[5] Weil, S.A., Brandt, S.A., Miller, E.L., Long, D.D.E., Maltzahn, C.: Ceph: A scalable,
high-performance distributed file system. In: In Proceedings of the 7th Symp. on
Operating Systems Design and Implementation (OSDI). (2006) 307–320
[6] Weil, S.A.: Ceph: reliable, scalable, and high-performance distributed storage. PhD
thesis, Santa Cruz, CA, USA (2007)
[7] Weil, S., Brandt, S.A., Miller, E.L., Maltzahn, C.: Crush: Controlled, scalable,
decentralized placement of replicated data. In: Proceedings of SC ’06. (nov 2006)
[11] Lustre: A scalable, high-performance file system. Cluster File Systems Inc. white
paper, version 1.0 (Nov 2002)
[12] Braam, P.J., Others: The Lustre storage architecture. White Paper, Cluster File
Systems, Inc. 23 (2003)
[13] Sun Microsystems, Inc., Santa Clara, CA, USA: Lustre file system - High-
performance storage architecture and scalable cluster file system (2007)
[14] Wang, F., Oral, S., Shipman, G., Drokin, O., Wang, T., Huang, I.: Understand-
ing lustre filesystem internals. Technical Report ORNL/TM-2009/117, Oak Ridge
National Lab., National Center for Computational Sciences (2009)
[15] Levy, E., Silberschatz, A.: Distributed file systems: concepts and examples. ACM
Comput. Surv. 22(4) (dec 1990) 321–374
[16] Thanh, T.D., Mohan, S., Choi, E., Kim, S., Kim, P.: A taxonomy and survey on
distributed file systems. In: Proceedings of the 2008 Fourth Int. Conf. on Networked
Computing and Advanced Information Management. NCM ’08, Washington, DC,
USA, IEEE Computer Society (2008) 144–149
[17] Nicolae, B., Antoniu, G., Bougé, L., Moise, D., Carpen-amarie, R.: Blobseer: Next-
generation data management for large scale infrastructures. J. Parallel Distrib. Com-
put (2011) 169–184
[18] Nelson, M.N., Welch, B.B., Ousterhout, J.K.: Caching in the sprite network file
system. ACM Trans. Comput. Syst. 6(1) (feb 1988) 134–154
[19] Gray, C., Cheriton, D.: Leases: an efficient fault-tolerant mechanism for distributed
file cache consistency. SIGOPS Oper. Syst. Rev. 23(5) (nov 1989) 202–210
[20] Satyanarayanan, M.: A survey of distributed file systems. In: Annual Review of
Computer Science. (1989)
[21] Songlin Bai, H.W.: The performance study on several distributed file systems. In:
Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), 2011
Int. Conf. on. (2011)