1. The client retrieves the archival attribute of the file being accessed.

9. Each local HSM instance responds to its MA to indicate the completion of restoring. For the OSD initiating the restore process, which is the left-most OSD in Figure 4, the requested object is ready in the online storage at this moment. Therefore, the following steps are performed asynchronously with the rest of the main sequence:

(a) The MA notifies the OST to retrieve data from the online storage.

(b) The OST performs a normal access operation on the requested object in the online storage.

10. Each involved MA sends a response to the MC to indicate the completion of restoring.

11. The MC calls SetArchAttr to update the archival attribute of the file in the MDS.
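To make the asynchronous tail of this sequence concrete, the sketch below shows how an initiating OSD's MA might handle the local completion in step 9: the reply to the MC stays in the main sequence, while the notification to the OST (steps (a) and (b)) runs in parallel. This is only an illustrative Python sketch under assumed names (notify_ost, reply_to_mc); it is not the paper's actual implementation.

```python
import threading

def notify_ost(object_id):
    # Placeholder: tell the local OST to serve the pending access from online storage.
    print(f"OST: serving pending access for object {object_id}")

def reply_to_mc(object_id):
    # Placeholder: the MA's reply to the MC in the main restore sequence.
    print(f"MA -> MC: object {object_id} restored")

def on_local_restore_done(object_id, is_initiating_osd):
    """Step 9: the local HSM instance reports that object_id is back online.

    On the initiating OSD the OST notification (steps (a) and (b)) is kicked off
    asynchronously, while the reply to the MC remains part of the main sequence.
    """
    if is_initiating_osd:
        threading.Thread(target=notify_ost, args=(object_id,)).start()
    reply_to_mc(object_id)

on_local_restore_done(object_id=42, is_initiating_osd=True)
```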
Figure 5. Migrating object from online storage to lower-level storage

Figure 5 shows the sequence of steps in which a local HSM instance migrates objects from the online storage to the lower-level storage. The steps are elaborated as follows (a small coordination sketch follows the list):

1. The HSM daemon in the left-most OSD in Figure 5 finds that the available free space in the online storage has decreased below a predefined threshold. It walks through the objects in the online storage to prepare a list of objects to be migrated to the lower-level storage. Then the HSM instance calls ReqArchive to report its intention to its local MA.

2. The left-most MA in Figure 5 calls the ReqArchive RPC to the MC to relay the migration intent of its local HSM instance.

3. The MC calls GetLayout to retrieve from the MDS the layout information of the files containing the requested objects.

4. The MC calls the ArchiveObject RPC, in parallel, to every MA of the OSDs hosting the striped objects of the requested files.

5. The MAs call the HSMArchive functions of their local HSM instances to migrate a list of objects from the online storages to their lower-level storages in parallel.

6. Each local HSM instance migrates the requested objects in parallel.

7. Each local HSM instance responds to its MA to indicate the completion of data migration.

8. Each involved MA sends a response to the MC to indicate the completion of data migration.

9. The MC calls SetArchAttr to update the archival attributes of the files in the MDS.

10. The MC responds to the initiating MA to indicate the completion of the migration.

11. The initiating MA responds to its local HSM instance to indicate the completion of the migrations. The local HSM instance can then check the available free-space level again and initiate more migrations if necessary.
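Steps 4-10 form a simple scatter-gather pattern: the MC fans the ArchiveObject request out to all involved MAs, waits for every reply, and only then updates the archival attributes. The Python sketch below illustrates that coordination; the stub functions (archive_object_rpc, set_arch_attr) and the shape of their arguments are assumptions for illustration, not the prototype's actual interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

def archive_object_rpc(ma_addr, objects):
    # Placeholder for the ArchiveObject RPC to one MA (step 4).  It returns once
    # that MA's local HSM instance has finished its migrations (steps 5-8).
    return {"ma": ma_addr, "archived": list(objects)}

def set_arch_attr(files):
    # Placeholder for the SetArchAttr call to the MDS (step 9).
    print("MDS: archival attributes updated for", sorted(files))

def coordinate_archive(layout):
    """layout maps each involved MA to the striped objects it hosts (from GetLayout)."""
    with ThreadPoolExecutor(max_workers=len(layout)) as pool:
        futures = [pool.submit(archive_object_rpc, ma, objs)
                   for ma, objs in layout.items()]
        replies = [f.result() for f in futures]          # wait for every MA (step 8)
    files = {obj.split(":")[0] for objs in layout.values() for obj in objs}
    set_arch_attr(files)                                 # step 9
    return replies                                       # step 10: answer the initiating MA

coordinate_archive({"osd1": ["fileA:0"], "osd2": ["fileA:1"], "osd3": ["fileA:2"]})
```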
3.4. Comparison to Single-server and Block-based SAN solutions

The obvious drawback of HSM solutions on single-server platforms is that the lower-level storage cannot be shared by other servers. Such a solution cannot scale up well as demands increase. In enterprises, it is also difficult to justify purchasing expensive tape libraries without sharing them among multiple hosts.

HSM solutions in block-based SAN environments, such as SGI DMF [8] and IBM HPSS [9], enable the sharing of expensive tape libraries. However, limited by the capability of the block devices used in these systems, migration operations require data to be transmitted over the SAN twice: once from the online block devices to the memory of the migration hosts, and again from the migration hosts' memory to the lower-level storage devices. Because only a limited number of migration hosts are typically set up in such systems, they are overwhelmed with the responsibility of migrating data for many storage systems. In comparison, our proposed architecture exploits the capability of OSDs: data only need to be transmitted over the SAN once, and migration tasks are more widely distributed among the OSDs.
Another limitation of the existing HSM solutions in block-based SAN environments is that they emphasize the sharing of archiving storage so much that they tend to use only a few big archiving storage subsystems shared throughout the entire system. The problem arises from the limited bandwidth available to access the archiving storage systems, which results in long archive and restore times. Considering today's shrinking backup windows, this becomes more of a concern for organizations using such systems. In comparison, our architecture is motivated by solving this narrow-pipe limitation of archive and restore. Instead of using a few large archive storage systems, we choose to use many smaller ones and properly connect them to the online storages using a switch-based fabric. Multiple archive/restore data paths exist, so archives and restores can be executed in parallel.

4. Consistency and Error Recovery

Like any other distributed system with concurrent operations, we have to guarantee data consistency in spite of concurrent object access operations, archive operations and restore operations. We also have to guarantee data consistency in case of possible component failures. In this section, we first present a specially designed locking mechanism to handle concurrency control of object access, archive and restore operations. Then, we describe a logging-based mechanism for error recovery.

4.1. Migration locking mechanism

In our proposed architecture, the critical data structures that need to be protected against concurrent accesses are the archival attributes of files in the MDS. Clients read archival attributes when they are going to access data objects of the files. The migration coordinator updates archival attributes when it has finished archive or restore migrations of the files' striped objects.

Although this looks like a classic read-write locking scenario, it actually is not. For example, if a client is accessing a file whose data is on lower-level storages, it nevertheless requests a read lock on the archival attribute. However, when the MC tries to request a write lock on the same archival attribute in order to restore the file's objects and serve the request, it will be blocked by the earlier read lock, resulting in a deadlock. In addition, our specific migration semantics make read-write locking inappropriate. Imagine that a file's archival attribute has been read-locked by a client for data access, and that one OSD hosting a striped object of the file tries to archive the object in the meantime. Under traditional locking semantics, the write lock request by the MC would be blocked until the client releases the read lock, and the archive would then continue. However, this is not desirable, since we would be backing up a file that has just been accessed. According to the locality property of data access, the right behavior is to keep the objects of the file on the online storages.

With these properties in mind, we have designed a specialized locking mechanism for concurrency control of archival attributes. There are three types of locks: Access Locks (ALocks), Backup Locks (BLocks) and Restore Locks (RLocks). Table 1 illustrates the compatibilities between locks.

          ALock    BLock    RLock
ALock     CP       EX       EX
BLock     cancel   CB       error
RLock     PR       EX       CB

Table 1. Lock compatibility table

Note that these locks are asymmetric in the sense that their compatibilities and corresponding actions depend on the sequence of locking requests. For a pair (xlock, ylock) in Table 1, the column element ylock is the existing lock type on an archival attribute and the row element xlock is the newly requested lock type. CP indicates that the lock types are compatible in that requesting sequence; (ALock, ALock) has value CP since clients accessing files do not modify archival attributes by themselves. EX indicates that the new xlock is incompatible with the existing ylock on the archival attribute, and the process requesting the xlock should be blocked until the ylock is released. According to Table 1, clients requesting an ALock on an archival attribute are blocked if an existing BLock or RLock is held on it, because clients should not be allowed to access the file during the migration. Likewise, a request for an RLock is blocked if a BLock exists, since a restore should not be allowed while an archive is in progress. The reverse case, i.e., an archive request arriving during a restore, as indicated by (BLock, RLock), is an erroneous case, since it is impossible for an OSD to initiate the archive process for an object that is not on the online storages. PR for (RLock, ALock) indicates that a request for an RLock can preempt existing ALocks on the archival attribute. This avoids the deadlock situation described earlier in this section. The preemption does not cancel the ALocks already granted to clients. Instead, before the RLock is released, a message is sent to every owner of an ALock to tell them that the archival attribute they obtained earlier has been updated in the MDS. CB indicates that the xlock intends to do the same thing as the existing ylock. Therefore, the owner of the ylock is simply allowed to finish the task, and the owner of the xlock is notified through an asynchronous callback. This happens when more than one independent OSD requests to archive or restore the same file objects, as indicated by the elements (BLock, BLock) and (RLock, RLock). The asynchronous callback method is used instead of blocking because archive requests issued by local HSM instances contain a list of objects, and it is unnecessary to block the archive operations of the other files. Finally, we use an optimization based on the locality property in the case of (BLock, ALock): an archive request for a file being accessed is cancelled. This could happen when a client is accessing one of the striped objects of a file and the OSD hosting another of its striped objects wants to archive it to free online space.
This migration locking mechanism is implemented in the MDS. ALocks are requested by clients in Step 1 of Figure 4 and released at the end of the file access. RLocks are requested by the MC in Step 5 of Figure 4 and released as part of the MDS processing of Step 11. Finally, BLocks are requested by the MC in Step 3 of Figure 5 and released in the processing of Step 9 of Figure 5.
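A minimal sketch of how an MDS-side lock manager could encode Table 1 is shown below. The table entries and their meanings (CP, EX, PR, CB, cancel, error) come directly from the table above; the data structure, the function name and the way the decision is returned are illustrative assumptions, not the prototype's actual code.

```python
# Rows are the newly requested lock (xlock), columns the existing lock (ylock),
# exactly as in Table 1.
COMPAT = {
    ("ALock", "ALock"): "CP",      # clients only read the archival attribute
    ("ALock", "BLock"): "EX",      # block: file is being archived
    ("ALock", "RLock"): "EX",      # block: file is being restored
    ("BLock", "ALock"): "cancel",  # locality optimization: skip archiving a hot file
    ("BLock", "BLock"): "CB",      # another OSD already archiving: async callback
    ("BLock", "RLock"): "error",   # cannot archive an object that is not online
    ("RLock", "ALock"): "PR",      # preempt readers; message ALock owners afterwards
    ("RLock", "BLock"): "EX",      # block: restore must wait for the archive
    ("RLock", "RLock"): "CB",      # another restore in flight: async callback
}

def decide(requested, existing):
    """Return the action for a new lock request given the existing lock (or None)."""
    if existing is None:
        return "grant"
    return COMPAT[(requested, existing)]

# Example: the MC asks for an RLock while a client still holds an ALock.
print(decide("RLock", "ALock"))   # -> "PR": preempt, then notify the ALock owners
```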
Component failures can cause the system to enter an inconsistent state without a proper error recovery mechanism. For example, in the sequence diagram of Figure 5, imagine the right-most OSD fails after Step 4 and before Step 8. Even if that OSD can switch to a fail-over processor and resume functioning, the system will not be in a consistent state unless the OSD can finish the archive operation and respond to the migration coordinator as in Step 8.

The scheme that we employ is part of the error recovery framework of the OCFS. We explicitly assume that the OCFS has the capability to discover component failures and system restarts, typically through heartbeat signals and reboot sequence numbers. We also assume that OSDs have fail-over processing components.

Not surprisingly, we use logical logging (also known as journalling in file systems) and replaying to handle error recovery. Generally speaking, a component logs its intention to perform an operation on its permanent storage before actually starting the operation. The logged record is removed after all data related to the operation have been committed to permanent storage. If errors do happen in the middle of an operation, the logged record on the permanent storage is used to replay the uncommitted tasks. Fortunately, our data migration tasks do not generate new data by themselves; therefore, no data will be lost due to memory caching. Due to the complexity of the error recovery scheme and the common understanding of journalling mechanisms, we omit the details of the logging in this paper.
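The intent-logging idea just described can be summarized in a few lines: persist the intended migration before starting it, erase the record only after the data are committed, and replay any record still present after a restart. The following Python sketch illustrates that general write-ahead pattern; the log location, format and helper names are invented for illustration and are not the OCFS framework's actual interface.

```python
import json, os

LOG = "migration.intent"   # hypothetical location of the intent log on permanent storage

def log_intent(objects):
    """Persist the intention to migrate `objects` before touching any data."""
    with open(LOG, "w") as f:
        json.dump({"objects": objects}, f)
        f.flush()
        os.fsync(f.fileno())

def clear_intent():
    """Remove the record once the migrated data are committed to permanent storage."""
    os.remove(LOG)

def replay_if_needed(do_migrate):
    """On restart, finish any migration whose intent record is still on disk."""
    if os.path.exists(LOG):
        with open(LOG) as f:
            pending = json.load(f)["objects"]
        do_migrate(pending)   # migrations generate no new data, so replay is safe
        clear_intent()

def archive(objects, do_migrate):
    log_intent(objects)
    do_migrate(objects)
    clear_intent()

archive(["fileA:0", "fileA:1"], lambda objs: print("migrating", objs))
```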
5. Prototyping and Performance Evaluation

In order to prove the feasibility of our proposed scheme, we have built a prototype on the Lustre file system. In this section, we briefly describe our prototype and then study its performance. We also share some experiences we learnt through the prototyping process.

Figure 6. Lustre modules and prototyping components

5.1. Prototyping on Lustre

Lustre is a scalable, secure, robust, highly-available cluster file system [2] for Linux operating systems. It is open source, which makes it possible for us to develop our add-on functions on top of it. Lustre has exactly the architecture shown in Figure 2. Figure 6 is an anatomy of Lustre nodes into functional modules. We will not explain the detailed functions of each module in this paper; interested readers can refer to [2]. All white blocks are modules already in Lustre, and gray blocks represent new components or functions developed for parallel archiving. Instead of developing separate modules for the migration agent and the migration coordinator, we decided to instrument existing Lustre modules to reuse existing code. We implemented the migration coordinator as part of the mds module that handles metadata queries. In Lustre, all metadata are stored in the local ext3 file system on the MetaData Server node. The migration agent is implemented as part of the obdfilter module that translates object access requests into local file system access requests. The archival attributes are stored as extended attributes of the ext3 file system on the metadata server. All local ext3 file systems on Lustre nodes (MDS and OSD) are accessed through the lvfs module, which is a Lustre-tailored Virtual File System (VFS). Another local file system currently supported by Lustre is ReiserFS. The llite module in client nodes is modified to query and
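Because the archival attributes are kept as ext3 extended attributes on the metadata server, they can in principle be manipulated with the ordinary Linux xattr interface. The snippet below only illustrates that mechanism; the attribute name `user.archival` and its encoding are made up here, and the prototype's real attribute name and namespace are not given in the paper.

```python
import os

def set_archival_attr(path, archived: bool):
    # Store the flag as an extended attribute (Linux-only; attribute name is hypothetical).
    os.setxattr(path, "user.archival", b"1" if archived else b"0")

def get_archival_attr(path) -> bool:
    try:
        return os.getxattr(path, "user.archival") == b"1"
    except OSError:           # attribute not set yet
        return False

# Example on a scratch file (requires a file system with user xattrs enabled):
open("demo.txt", "w").close()
set_archival_attr("demo.txt", True)
print(get_archival_attr("demo.txt"))
```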
Table 2. Configuration of the OSD machines
CPU: Two Intel XEON 2.0GHz w/ HT
Memory: 256MB DDR DIMM
SCSI interface: Ultra160 SCSI (160MBps)
HDD speed: 10,000 RPM
Average seek time: 4.7 ms
NIC: Intel Pro/1000MF

Table 3. Configuration of the iSCSI target machines
CPU: Four Pentium III 500MHz
Memory: 1GB EDO DIMM
SCSI interface: Ultra2/LVD SCSI (80MBps)
HDD speed: 10,000 RPM
Average seek time: 5.2 ms
NIC: Intel Pro/1000MF
Our prototype runs on Lustre version 1.4.0 and Linux kernel 2.4.20 of RedHat 9. In all of our experiments, we run OSDs on up to 3 machines with the configuration described in Table 2. We use the iSCSI reference implementation developed by Intel [10] to set up our backup storage devices. Up to 3 machines run as iSCSI targets in our experiments, and Table 3 contains the configuration of these target machines. When an OSD machine has loaded the iSCSI initiator driver, a new SCSI disk device is registered and becomes available (e.g., /dev/sdc). We then create one or more partitions on it and make an ext3 file system on each partition. When setting up the HSM instance on an OSD, the device name of one partition, such as /dev/sdc1, is specified so that it becomes the backup storage device of that OSD. In all experiments, the client and the MetaData Server run on the same machine, with a configuration similar to Table 3. All machines are connected to a Cisco Catalyst 4000 Series gigabit ethernet switch.
With our available hardware resources, we have designed four different configurations, as illustrated in Figure 7. Configurations (a), (b) and (c) represent scaling up the system by adding more pairs of OSD and backup storage. Configuration (d) represents the single-backup-point setup, where multiple OSDs share the same backup storage. Note that even configuration (d) has multiple parallel migration paths, but they unfortunately collide at the entry point of the backup storage. In all configurations, the OSDs are configured as RAID 0, i.e., files are striped as multiple objects across all OSDs. In the first three configurations, we create one partition on each iSCSI "disk". In configuration (d), we create three partitions, each of which is used by one of the three OSDs as its backup storage. We cannot let the three OSDs work on the same partition, since the file system may get corrupted.

Our prototype allows us to manually trigger the migration of objects between OSDs and backup storages. When data are backed up or recalled, we measure the throughput achieved on each pair of OSD and backup storage. We also measure the latency between the moment when the migration agent initiates the parallel backup/recall operations and the moment when all replies are received. Finally, we measure the latency perceived by applications on client hosts when accessing files on backup storage. We use a simple program that reads the first byte of a specified file. When the
Figure 8. Throughput of iSCSI
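The client-side latency measurement mentioned above (reading the first byte of a file whose objects reside on backup storage) can be as simple as the following sketch. The timing approach and command-line interface are illustrative only; the paper does not give the actual measurement program.

```python
import sys, time

def first_byte_latency(path):
    """Time how long it takes to read the first byte of `path`.

    If the file's objects are on backup storage, this read triggers the
    restore sequence of Figure 4 before any data is returned.
    """
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read(1)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"{first_byte_latency(sys.argv[1]):.3f} s")
```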