1. The client retrieves the archival attribute of the file being accessed.

9. Each local HSM instance responds to its MA to indicate the completion of restoring. For the OSD initiating the restore process, which is the left-most OSD in Figure 4, the requested object is ready in the online storage at this moment. Therefore, the following steps are performed asynchronously with the rest of the main sequence:

(a) The MA notifies the OST to retrieve data from the online storage.

(b) The OST performs a normal access operation on the requested object in the online storage.

10. Each involved MA sends a response to the MC to indicate the completion of restoring.

11. The MC calls SetArchAttr to update the archival attribute of the file in the MDS.
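To make the asynchronous tail of this sequence concrete, the sketch below shows how an initiating OSD's MA might handle the local completion in step 9: the reply to the MC stays in the main sequence, while the notification to the OST (steps (a) and (b)) runs in parallel. This is only an illustrative Python sketch under assumed names (notify_ost, reply_to_mc); it is not the paper's actual implementation.

```python
import threading

def notify_ost(object_id):
    # Placeholder: tell the local OST to serve the pending access from online storage.
    print(f"OST: serving pending access for object {object_id}")

def reply_to_mc(object_id):
    # Placeholder: the MA's reply to the MC in the main restore sequence.
    print(f"MA -> MC: object {object_id} restored")

def on_local_restore_done(object_id, is_initiating_osd):
    """Step 9: the local HSM instance reports that object_id is back online.

    On the initiating OSD the OST notification (steps (a) and (b)) is kicked off
    asynchronously, while the reply to the MC remains part of the main sequence.
    """
    if is_initiating_osd:
        threading.Thread(target=notify_ost, args=(object_id,)).start()
    reply_to_mc(object_id)

on_local_restore_done(object_id=42, is_initiating_osd=True)
```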
Figure 5. Migrating object from online storage to lower-level storage

Figure 5 shows the sequence of steps in which a local HSM instance migrates objects from the online storage to the lower-level storage. The steps are elaborated as follows (a small coordination sketch follows the list):

1. The HSM daemon in the left-most OSD in Figure 5 finds that the available free space in the online storage has decreased below a predefined threshold. It walks through the objects in the online storage to prepare a list of objects to be migrated to the lower-level storage. Then the HSM instance calls ReqArchive to report its intention to its local MA.

2. The left-most MA in Figure 5 calls the ReqArchive RPC to the MC to relay the migration intent of its local HSM instance.

3. The MC calls GetLayout to retrieve from the MDS the layout information of the files containing the requested objects.

4. The MC calls the ArchiveObject RPC, in parallel, to every MA of the OSDs hosting the striped objects of the requested files.

5. The MAs call the HSMArchive functions of their local HSM instances to migrate a list of objects from the online storages to their lower-level storages in parallel.

6. Each local HSM instance migrates the requested objects in parallel.

7. Each local HSM instance responds to its MA to indicate the completion of data migration.

8. Each involved MA sends a response to the MC to indicate the completion of data migration.

9. The MC calls SetArchAttr to update the archival attributes of the files in the MDS.

10. The MC responds to the initiating MA to indicate the completion of the migration.

11. The initiating MA responds to its local HSM instance to indicate the completion of the migrations. The local HSM instance can then check the available free-space level again and initiate more migrations if necessary.
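Steps 4-10 form a simple scatter-gather pattern: the MC fans the ArchiveObject request out to all involved MAs, waits for every reply, and only then updates the archival attributes. The Python sketch below illustrates that coordination; the stub functions (archive_object_rpc, set_arch_attr) and the shape of their arguments are assumptions for illustration, not the prototype's actual interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

def archive_object_rpc(ma_addr, objects):
    # Placeholder for the ArchiveObject RPC to one MA (step 4).  It returns once
    # that MA's local HSM instance has finished its migrations (steps 5-8).
    return {"ma": ma_addr, "archived": list(objects)}

def set_arch_attr(files):
    # Placeholder for the SetArchAttr call to the MDS (step 9).
    print("MDS: archival attributes updated for", sorted(files))

def coordinate_archive(layout):
    """layout maps each involved MA to the striped objects it hosts (from GetLayout)."""
    with ThreadPoolExecutor(max_workers=len(layout)) as pool:
        futures = [pool.submit(archive_object_rpc, ma, objs)
                   for ma, objs in layout.items()]
        replies = [f.result() for f in futures]          # wait for every MA (step 8)
    files = {obj.split(":")[0] for objs in layout.values() for obj in objs}
    set_arch_attr(files)                                 # step 9
    return replies                                       # step 10: answer the initiating MA

coordinate_archive({"osd1": ["fileA:0"], "osd2": ["fileA:1"], "osd3": ["fileA:2"]})
```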
3.4. Comparison to Single-server and Block-based SAN solutions

The obvious drawback of HSM solutions on single-server platforms is that the lower-level storage cannot be shared by other servers. Such a solution cannot scale up well as demands increase. In enterprises, it is also difficult to justify purchasing expensive tape libraries without sharing them among multiple hosts.

HSM solutions in block-based SAN environments, such as SGI DMF [8] and IBM HPSS [9], enable the sharing of expensive tape libraries. However, limited by the capability of the block devices used in these systems, migration operations require data to be transmitted over the SAN twice: once from the online block devices to the memory of the migration hosts, and again from the migration hosts' memory to the lower-level storage devices. Because only a limited number of migration hosts are typically set up in such systems, they are overwhelmed with the responsibility of migrating data for many storage systems. In comparison, our proposed architecture exploits the capability of OSDs: data only need to be transmitted over the SAN once, and migration tasks are more widely distributed among the OSDs.
Another limitation of the existing HSM solutions in block-based SAN environments is that they emphasize the sharing of archiving storage so much that they tend to use only a few big archiving storage subsystems shared throughout the entire system. The problem arises from the limited bandwidth available to access the archiving storage systems, which results in long archive and restore times. Considering today's shrinking backup windows, this becomes more of a concern for organizations using such systems. In comparison, our architecture is motivated by solving this narrow-pipe limitation of archive and restore. Instead of using a few large archive storage systems, we choose to use many smaller ones and properly connect them to the online storages using a switch-based fabric. Multiple archive/restore data paths exist, so archives and restores can be executed in parallel.

4. Consistency and Error Recovery

Like any other distributed system with concurrent operations, we have to guarantee data consistency in spite of concurrent object access operations, archive operations and restore operations. We also have to guarantee data consistency in case of possible component failures. In this section, we first present a specially designed locking mechanism to handle concurrency control of object access, archive and restore operations. Then, we describe a logging-based mechanism for error recovery.

4.1. Migration locking mechanism

In our proposed architecture, the critical data structures that need to be protected against concurrent accesses are the archival attributes of files in the MDS. Clients read archival attributes when they are going to access data objects of the files. The migration coordinator updates archival attributes when it has finished archive or restore migrations of the files' striped objects.

Although this looks like a classic read-write locking scenario, it actually is not. For example, if a client is accessing a file whose data is on lower-level storages, it nevertheless requests a read lock on the archival attribute. However, when the MC tries to request a write lock on the same archival attribute in order to restore the file's objects and serve the request, it will be blocked by the earlier read lock, resulting in a deadlock. In addition, our specific migration semantics make read-write locking inappropriate. Imagine that a file's archival attribute has been read-locked by a client for data access, and that one OSD hosting a striped object of the file tries to archive the object in the meantime. Under traditional locking semantics, the write lock request by the MC would be blocked until the client releases the read lock, and the archive would then continue. However, this is not desirable, since we would be backing up a file that has just been accessed. According to the locality property of data access, the right behavior is to keep the objects of the file on the online storages.

With these properties in mind, we have designed a specialized locking mechanism for concurrency control of archival attributes. There are three types of locks: Access Locks (ALocks), Backup Locks (BLocks) and Restore Locks (RLocks). Table 1 illustrates the compatibilities between locks.

          ALock    BLock    RLock
ALock     CP       EX       EX
BLock     cancel   CB       error
RLock     PR       EX       CB

Table 1. Lock compatibility table

Note that these locks are asymmetric in the sense that their compatibilities and corresponding actions depend on the sequence of locking requests. For a pair (xlock, ylock) in Table 1, the column element ylock is the existing lock type on an archival attribute and the row element xlock is the newly requested lock type. CP indicates that the lock types are compatible in that requesting sequence; (ALock, ALock) has value CP since clients accessing files do not modify archival attributes by themselves. EX indicates that the new xlock is incompatible with the existing ylock on the archival attribute, and the process requesting the xlock should be blocked until the ylock is released. According to Table 1, clients requesting an ALock on an archival attribute are blocked if an existing BLock or RLock is held on it, because clients should not be allowed to access the file during the migration. Likewise, a request for an RLock is blocked if a BLock exists, since a restore should not be allowed while an archive is in progress. The reverse case, i.e., an archive request arriving during a restore, as indicated by (BLock, RLock), is an erroneous case, since it is impossible for an OSD to initiate the archive process for an object that is not on the online storages. PR for (RLock, ALock) indicates that a request for an RLock can preempt existing ALocks on the archival attribute. This avoids the deadlock situation described earlier in this section. The preemption does not cancel the ALocks already granted to clients. Instead, before the RLock is released, a message is sent to every owner of an ALock to tell them that the archival attribute they obtained earlier has been updated in the MDS. CB indicates that the xlock intends to do the same thing as the existing ylock. Therefore, the owner of the ylock is simply allowed to finish the task, and the owner of the xlock is notified through an asynchronous callback. This happens when more than one independent OSD requests to archive or restore the same file objects, as indicated by the elements (BLock, BLock) and (RLock, RLock). The asynchronous callback method is used instead of blocking because archive requests issued by local HSM instances contain a list of objects, and it is unnecessary to block the archive operations of the other files. Finally, we use an optimization based on the locality property in the case of (BLock, ALock): an archive request for a file being accessed is cancelled. This could happen when a client is accessing one of the striped objects of a file and the OSD hosting another of its striped objects wants to archive it to free online space.
This migration locking mechanism is implemented in the MDS. ALocks are requested by clients in Step 1 of Figure 4 and released at the end of the file access. RLocks are requested by the MC in Step 5 of Figure 4 and released as part of the MDS processing of Step 11. Finally, BLocks are requested by the MC in Step 3 of Figure 5 and released in the processing of Step 9 of Figure 5.
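A minimal sketch of how an MDS-side lock manager could encode Table 1 is shown below. The table entries and their meanings (CP, EX, PR, CB, cancel, error) come directly from the table above; the data structure, the function name and the way the decision is returned are illustrative assumptions, not the prototype's actual code.

```python
# Rows are the newly requested lock (xlock), columns the existing lock (ylock),
# exactly as in Table 1.
COMPAT = {
    ("ALock", "ALock"): "CP",      # clients only read the archival attribute
    ("ALock", "BLock"): "EX",      # block: file is being archived
    ("ALock", "RLock"): "EX",      # block: file is being restored
    ("BLock", "ALock"): "cancel",  # locality optimization: skip archiving a hot file
    ("BLock", "BLock"): "CB",      # another OSD already archiving: async callback
    ("BLock", "RLock"): "error",   # cannot archive an object that is not online
    ("RLock", "ALock"): "PR",      # preempt readers; message ALock owners afterwards
    ("RLock", "BLock"): "EX",      # block: restore must wait for the archive
    ("RLock", "RLock"): "CB",      # another restore in flight: async callback
}

def decide(requested, existing):
    """Return the action for a new lock request given the existing lock (or None)."""
    if existing is None:
        return "grant"
    return COMPAT[(requested, existing)]

# Example: the MC asks for an RLock while a client still holds an ALock.
print(decide("RLock", "ALock"))   # -> "PR": preempt, then notify the ALock owners
```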
Component failures can cause the system to enter an inconsistent state without a proper error recovery mechanism. For example, in the sequence diagram of Figure 5, imagine the right-most OSD fails after Step 4 and before Step 8. Even if that OSD can switch to a fail-over processor and resume functioning, the system will not be in a consistent state unless the OSD can finish the archive operation and respond to the migration coordinator as in Step 8.

The scheme that we employ is part of the error recovery framework of the OCFS. We explicitly assume that the OCFS has the capability to discover component failures and system restarts, typically through heartbeat signals and reboot sequence numbers. We also assume that OSDs have fail-over processing components.

Not surprisingly, we use logical logging (also known as journalling in file systems) and replaying to handle error recovery. Generally speaking, a component logs its intention to perform an operation on its permanent storage before actually starting the operation. The logged record is removed after all data related to the operation have been committed to permanent storage. If errors do happen in the middle of an operation, the logged record on the permanent storage is used to replay the uncommitted tasks. Fortunately, our data migration tasks do not generate new data by themselves; therefore, no data will be lost due to memory caching. Due to the complexity of the error recovery scheme and the common understanding of journalling mechanisms, we omit the details of the logging in this paper.
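The intent-logging idea just described can be summarized in a few lines: persist the intended migration before starting it, erase the record only after the data are committed, and replay any record still present after a restart. The following Python sketch illustrates that general write-ahead pattern; the log location, format and helper names are invented for illustration and are not the OCFS framework's actual interface.

```python
import json, os

LOG = "migration.intent"   # hypothetical location of the intent log on permanent storage

def log_intent(objects):
    """Persist the intention to migrate `objects` before touching any data."""
    with open(LOG, "w") as f:
        json.dump({"objects": objects}, f)
        f.flush()
        os.fsync(f.fileno())

def clear_intent():
    """Remove the record once the migrated data are committed to permanent storage."""
    os.remove(LOG)

def replay_if_needed(do_migrate):
    """On restart, finish any migration whose intent record is still on disk."""
    if os.path.exists(LOG):
        with open(LOG) as f:
            pending = json.load(f)["objects"]
        do_migrate(pending)   # migrations generate no new data, so replay is safe
        clear_intent()

def archive(objects, do_migrate):
    log_intent(objects)
    do_migrate(objects)
    clear_intent()

archive(["fileA:0", "fileA:1"], lambda objs: print("migrating", objs))
```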
5. Prototyping and Performance Evaluation

In order to prove the feasibility of our proposed scheme, we have built a prototype on the Lustre file system. In this section, we briefly describe our prototype and then study its performance. We also share some experiences we learnt through the prototyping process.

Figure 6. Lustre modules and prototyping components

5.1. Prototyping on Lustre

Lustre is a scalable, secure, robust, highly-available cluster file system [2] for Linux operating systems. It is open source, which makes it possible for us to develop our add-on functions on top of it. Lustre has exactly the architecture shown in Figure 2. Figure 6 is an anatomy of Lustre nodes into functional modules. We will not explain the detailed functions of each module in this paper; interested readers can refer to [2]. All white blocks are modules already in Lustre, and gray blocks represent new components or functions developed for parallel archiving. Instead of developing separate modules for the migration agent and the migration coordinator, we decided to instrument existing Lustre modules to reuse existing code. We implemented the migration coordinator as part of the mds module that handles metadata queries. In Lustre, all metadata are stored in the local ext3 file system on the MetaData Server node. The migration agent is implemented as part of the obdfilter module that translates object access requests into local file system access requests. The archival attributes are stored as extended attributes of the ext3 file system on the metadata server. All local ext3 file systems on Lustre nodes (MDS and OSD) are accessed through the lvfs module, which is a Lustre-tailored Virtual File System (VFS). Another local file system currently supported by Lustre is ReiserFS. The llite module in client nodes is modified to query and
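Because the archival attributes are kept as ext3 extended attributes on the metadata server, they can in principle be manipulated with the ordinary Linux xattr interface. The snippet below only illustrates that mechanism; the attribute name `user.archival` and its encoding are made up here, and the prototype's real attribute name and namespace are not given in the paper.

```python
import os

def set_archival_attr(path, archived: bool):
    # Store the flag as an extended attribute (Linux-only; attribute name is hypothetical).
    os.setxattr(path, "user.archival", b"1" if archived else b"0")

def get_archival_attr(path) -> bool:
    try:
        return os.getxattr(path, "user.archival") == b"1"
    except OSError:           # attribute not set yet
        return False

# Example on a scratch file (requires a file system with user xattrs enabled):
open("demo.txt", "w").close()
set_archival_attr("demo.txt", True)
print(get_archival_attr("demo.txt"))
```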
Table 2. Configuration of the OSD machines
CPU: Two Intel XEON 2.0GHz w/ HT
Memory: 256MB DDR DIMM
SCSI interface: Ultra160 SCSI (160MBps)
HDD speed: 10,000 RPM
Average seek time: 4.7 ms
NIC: Intel Pro/1000MF

Table 3. Configuration of the iSCSI target machines
CPU: Four Pentium III 500MHz
Memory: 1GB EDO DIMM
SCSI interface: Ultra2/LVD SCSI (80MBps)
HDD speed: 10,000 RPM
Average seek time: 5.2 ms
NIC: Intel Pro/1000MF
Our prototype runs on Lustre version 1.4.0 and Linux kernel 2.4.20 of RedHat 9. In all of our experiments, we run OSDs on up to 3 machines with the configuration described in Table 2. We use the iSCSI reference implementation developed by Intel [10] to set up our backup storage devices. Up to 3 machines run as iSCSI targets in our experiments, and Table 3 contains the configuration of these target machines. When an OSD machine has loaded the iSCSI initiator driver, a new SCSI disk device is registered and becomes available (e.g., /dev/sdc). We then create one or more partitions on it and make an ext3 file system on each partition. When setting up the HSM instance on an OSD, the device name of one partition, such as /dev/sdc1, is specified so that it becomes the backup storage device of that OSD. In all experiments, the client and the MetaData Server run on the same machine, with a configuration similar to Table 3. All machines are connected to a Cisco Catalyst 4000 Series gigabit ethernet switch.
With our available hardware resources, we have designed four different configurations, as illustrated in Figure 7. Configurations (a), (b) and (c) represent scaling up the system by adding more pairs of OSD and backup storage. Configuration (d) represents the single-backup-point setup, where multiple OSDs share the same backup storage. Note that even configuration (d) has multiple parallel migration paths, but they unfortunately collide at the entry point of the backup storage. In all configurations, the OSDs are configured as RAID 0, i.e., files are striped as multiple objects across all OSDs. In the first three configurations, we create one partition on each iSCSI "disk". In configuration (d), we create three partitions, each of which is used by one of the three OSDs as its backup storage. We cannot let the three OSDs work on the same partition, since the file system may get corrupted.

Our prototype allows us to manually trigger the migration of objects between OSDs and backup storages. When data are backed up or recalled, we measure the throughput achieved on each pair of OSD and backup storage. We also measure the latency between the moment when the migration agent initiates the parallel backup/recall operations and the moment when all replies are received. Finally, we measure the latency perceived by applications on client hosts when accessing files on backup storage. We use a simple program that reads the first byte of a specified file. When the
Figure 8. Throughput of iSCSI
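The client-side latency measurement mentioned above (reading the first byte of a file whose objects reside on backup storage) can be as simple as the following sketch. The timing approach and command-line interface are illustrative only; the paper does not give the actual measurement program.

```python
import sys, time

def first_byte_latency(path):
    """Time how long it takes to read the first byte of `path`.

    If the file's objects are on backup storage, this read triggers the
    restore sequence of Figure 4 before any data is returned.
    """
    start = time.perf_counter()
    with open(path, "rb") as f:
        f.read(1)
    return time.perf_counter() - start

if __name__ == "__main__":
    print(f"{first_byte_latency(sys.argv[1]):.3f} s")
```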