Tanimura 2014
TABLE I
A LIST OF THE EXTENDED S3 OPERATIONS FOR THE RESERVATION

Operation                   Description
PUT Bucket                  Create a new bucket tied to the space reservation feature of Papio
GET Object                  Retrieve an object from the Papio storage system
PUT Object                  Put an object in a specified bucket in the Papio storage system
Initiate Multipart Upload   Initiate a multipart upload to the Papio storage system
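As an illustration of how one of the extended operations might look on the wire, the sketch below builds a raw ‘PUT Bucket’ request carrying a reservation identifier. This is a hedged example, not the PapioS3 wire format: the X-Papio-Access-ID header is taken from the implementation description later in this paper, while the host name, bucket name, and identifier value are invented for the example.

```python
# Sketch of a raw "PUT Bucket" request tied to a Papio reservation.
# X-Papio-Access-ID is the identifier PapioS3 embeds in a series of
# requests; the host, bucket, and identifier value here are hypothetical.
def build_put_bucket_request(bucket, access_id, host="papios3.example.org"):
    """Return the request line and headers for an extended PUT Bucket."""
    lines = [
        f"PUT /{bucket} HTTP/1.1",
        f"Host: {host}",
        f"X-Papio-Access-ID: {access_id}",  # ties the bucket to a reservation
        "Content-Length: 0",
        "",  # blank line terminates the header block
        "",
    ]
    return "\r\n".join(lines)

request = build_put_bucket_request("mybucket", "reservation-0001")
```

A real client would additionally sign the request with its S3 credentials before sending it.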
Fig. 5. Implementation of the I/O process in PapioS3

C. User Authentication and Access Control

The original RGW provides a utility tool, radosgw-admin, to manage users. The ‘user create’ operation invoked through radosgw-admin generates a pair consisting of an Access Key and a Secret Key, which is needed for authentication in the S3 interface. The original RGW maps all S3 users to one RADOS account, and the S3 users’ information is stored in RADOS. In PapioS3, in contrast, one S3 account is mapped to one Papio account. The S3 user information, including this mapping, is stored in the local file system on the node where the PapioS3 server runs. Since this is a prototype implementation, a default performance requirement for the implicit reservation is set in each PapioS3 server and shared by all clients accessing the same server. In addition, PapioS3 supports only private permission for buckets and objects, though it is still possible to share an object by using the ‘signing the request’ feature of S3. More flexible configurations and permissions remain a future task in the development of PapioS3.
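The Access Key/Secret Key pair is used for standard S3 request authentication, and the same mechanism underlies the ‘signing the request’ sharing feature. As background, the following is a minimal sketch of the classic S3 signature computation, base64(HMAC-SHA1(SecretKey, StringToSign)), simplified by omitting the canonicalized x-amz headers; the credential values are invented placeholders, not keys from PapioS3.

```python
import base64
import hashlib
import hmac

def sign_s3_request(secret_key, verb, date, resource,
                    content_md5="", content_type=""):
    """Classic S3-style request signature:
    base64(HMAC-SHA1(SecretKey, StringToSign))."""
    string_to_sign = "\n".join([verb, content_md5, content_type, date, resource])
    digest = hmac.new(secret_key.encode(), string_to_sign.encode(),
                      hashlib.sha1).digest()
    return base64.b64encode(digest).decode()

# Placeholder credentials, standing in for a pair generated by the
# radosgw-admin 'user create' operation.
access_key = "EXAMPLEACCESSKEY"
secret_key = "exampleSecretKey"
signature = sign_s3_request(secret_key, "GET",
                            "Tue, 27 Mar 2007 19:36:42 +0000",
                            "/mybucket/myobject")
auth_header = f"Authorization: AWS {access_key}:{signature}"
```

The resulting Authorization header (or, for sharing, the signature embedded in a presigned URL) is what the S3 server verifies against its stored copy of the Secret Key.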
default. In case previous data has already been written, the data is passed directly to the I/O process and written to Papio immediately. When the RGW process receives a complete request (Complete Multipart Upload) or an abort request (Abort Multipart Upload) for the multipart upload, the RGW process terminates the I/O process. Automatic self-stop on I/O errors is also implemented in the I/O process.

2) Download Implementation in PapioS3 Server: Unlike the multipart upload, no explicit operations for initiating and completing the request are defined in the S3 interface for the multipart download. A ‘GET Object’ request in which the Range parameter is set in the request header is simply repeated by the S3 client. However, even when the Range parameter is set, the request is not necessarily part of a multipart download. Therefore, in our implementation, for every read request the RGW process checks whether a corresponding I/O process is running. Since the same X-Papio-Access-ID identifier is embedded in a series of requests, the RGW process can detect the series. Unless an I/O process has already been started for the series, the RGW process launches a new I/O process before passing the request to it. Otherwise, the RGW process simply passes the request to the running I/O process launched by one of the previous requests.

In the I/O process, the received read requests are managed in a queue. The I/O process fetches one request from the queue, reads the requested data from the Papio storage, and writes it to a temporary storage space (e.g., /dev/shm). As the write operation is performed asynchronously, the I/O process can continue to process the next request without pausing. When the RGW process is notified of the completion of the write operation, it reads the data from the temporary space and sends it to the client. The RGW process deletes the data in the temporary space at the end.

Because the I/O process is unaware of the completion of the multipart download, it might remain on the server side after the PapioS3 client finishes the access. To solve this problem, automatic self-stop of the I/O process is implemented: the I/O process checks the end time of the reservation of the access and terminates itself when the reservation is over.

D. Data Integrity

In the S3 interface, Content-MD5 is used for verifying the message integrity of PUT and GET operations. At a ‘PUT Object’ operation, the S3 client calculates an MD5 digest of the object, as a base64-encoded 128-bit MD5 digest, and sends it to the S3 server as the Content-MD5 value in the request header. The S3 server compares the digest of the received object to the Content-MD5 value for verification. The Content-MD5 value is then stored as an ETag (Entity Tag) with the verified object. In the multipart data transfer, an MD5 digest is calculated per chunk and all the digests are concatenated to form an ETag. At a ‘GET Object’ operation, the ETag is sent to the S3 client so that the client can verify the object.

In our implementation of PapioS3, a ‘PUT Object’ is performed as described above. However, the PapioS3 client does not verify the object with the ETag, even though the ETag itself is sent to the client. We assume that such verification should be performed by application users if they consider it necessary.

On the other hand, the ETag feature is also helpful for detecting updates to an object. ‘If-Match’ and ‘If-None-Match’ can be set in the request header, and with these condition parameters the S3 client can upload or download only updated objects or chunks. This feature appears to be available in RGW’s translation layer but has not been tested in PapioS3 yet.

IV. EVALUATION

A. Overview of Experiments

We evaluated PapioS3 with respect to two aspects: performance and QoS capability. For the performance evaluation, we tested the effect of the multipart data transfer in parallel streams and measured the overhead of PapioS3 relative to the performance of direct access to the Papio storage. For the QoS evaluation, we examined the QoS capability of PapioS3 and showed its benefits through a comparison between PapioS3 and the original RGW backed by RADOS, which is referred to as RGW-orig in this paper.
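The multipart data transfer in parallel streams examined in this evaluation can be sketched on the client side as follows. This is a simplified model, not the PapioS3 or JetS3t code: the object is split into fixed-size chunks, each chunk is handed to a worker thread as a stand-in for one transfer stream, and a per-chunk MD5 digest is computed in the spirit of the Content-MD5/ETag scheme described above. The chunk size and worker function are illustrative only.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # toy chunk size; a real transfer would use megabytes per part

def transfer_chunk(part):
    """Stand-in for transferring one part; returns (offset, data, MD5 digest)."""
    offset, data = part
    return offset, data, hashlib.md5(data).hexdigest()

def multipart_transfer(data, n_threads):
    """Split data into chunks and 'transfer' them in n parallel streams."""
    chunks = [(i, data[i:i + CHUNK_SIZE])
              for i in range(0, len(data), CHUNK_SIZE)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = list(pool.map(transfer_chunk, chunks))
    # Reassemble in part order, as the server does at Complete Multipart Upload.
    results.sort(key=lambda r: r[0])
    body = b"".join(r[1] for r in results)
    digests = [r[2] for r in results]
    return body, digests
```

Because the chunks are independent, the N streams can proceed concurrently, which is the mechanism behind the throughput gains reported for the multipart upload and download below.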
TABLE II
EXPERIMENT ENVIRONMENT

B. Performance of PapioS3

We evaluated the performance of the S3-based data transfer of PapioS3 when the benchmark was executed without interference by other I/O workloads. Figure 6 shows the upload performance of PapioS3 when 180MB/s or 360MB/s throughput was reserved for the write access to the backend Papio storage. Inside the Papio storage, one storage server was allocated for achieving 180MB/s throughput and two storage servers were allocated for 360MB/s. In Figure 6, Standard shows the result of the normal, non-multipart upload and N threads shows the result of the multipart upload in N parallel streams. Papio shows the result of direct access to the Papio storage. This result indicates that the multipart upload can hide the overhead of the S3-based operation and achieve more than 80% of the performance that the Papio storage provides, while the Standard operation does not make sufficient use of the performance. Since we confirmed that the Papio storage itself achieved the reserved rate in all cases, the reduced performance relative to Papio, in particular for the 360MB/s request, is likely caused by internal overheads of the PapioS3 server.

Fig. 7. Download performance of PapioS3

Similarly, Figure 7 shows the download performance of PapioS3 when 200MB/s or 400MB/s throughput was reserved for the read access to the Papio storage. Inside the Papio storage, one storage server was allocated for achieving 200MB/s throughput and two storage servers were allocated for 400MB/s. In this result, the multipart download could increase the throughput through parallel streams and achieve more than 98% of the performance that the Papio storage could provide.

These results indicate that the multipart data transfer is essential to achieve high throughput in the S3 interface when the throughput is provided by the backend storage system.

C. QoS Capability of PapioS3 in Concurrent Accesses

1) QoS in PapioS3: First, we examined a basic QoS capability of PapioS3 in a simple configuration. The Papio storage, which serves at the backend of PapioS3, had only one storage server, and thus Papio had to control the I/O throughput rates on the storage server to satisfy the performance demand of each concurrent client. Figures 8 and 9 show the measured performance of 5 concurrent accesses for upload and download, respectively. The 5 clients (A∼E) were each launched on a different node and their throughput requests were set in the ratio of 5:4:3:2:1. The aggregated throughput was set to 180MB/s for the upload test and 200MB/s for the download test.
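With requests set in the ratio 5:4:3:2:1, the per-client target rates follow by simple proportion. The aggregate values (180MB/s and 200MB/s) are from the experiment description; the helper function itself is ours, added only to make the expected shares explicit.

```python
def target_rates(aggregate_mb_s, weights):
    """Split an aggregate throughput among clients in proportion to weights."""
    total = sum(weights)
    return [aggregate_mb_s * w / total for w in weights]

weights = [5, 4, 3, 2, 1]                    # clients A..E
upload_targets = target_rates(180, weights)  # 60, 48, 36, 24, 12 MB/s
download_targets = target_rates(200, weights)
```

For the 180MB/s upload test this gives clients A through E target rates of 60, 48, 36, 24, and 12 MB/s, which is the profile the per-client QoS control must reproduce on a single storage server.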
Fig. 8. QoS test in 5 concurrent uploads
overheads were low enough and PapioS3 effectively provided each targeted rate without a decrease of the total throughput.

c) shows a mixed case of 2 concurrent uploads and 10 downloads. While the performance of each upload was strongly affected by other I/O workloads in RGW-orig, the 2 clients achieved their high requested rates in PapioS3. The total throughput was targeted at a much higher rate in PapioS3 than in RGW-orig, and it was well achieved.

Discussion - The advantage of PapioS3 is obviously to provide the requested throughput to each client. This is more beneficial when the upload and download workloads are mixed. However, a few limitations of the current PapioS3 have also been uncovered through the experiments. We saw a case in which some clients failed because the request patterns did not fit the availability situation of the storage servers. The case did not appear in RGW-orig because all clients continued access even when the performance was low. The problem might be solved by an automatic negotiation mode. The negotiation allows a client to use the maximum available throughput at the reservation request, and a request with this mode would rarely fail due to a resource shortage. In particular, implementation of the mode in the PapioS3 server is essential for the implicit reservation. Another problem is the necessity of QoS inside the PapioS3 server. Currently, separating the PapioS3 servers accessed by clients would be the only solution when this problem appears seriously, but it could be fully solved by an implementation effort.

V. RELATED WORK

Automatic storage reconfiguration and storage tiering based on a dynamic workload analysis have been studied for I/O optimization, including QoS support [18,19]. In these studies, QoS is mostly controlled on a volume basis, where a volume is assigned to each application or user group. In the cloud service, the volume is assigned to one tenant, which consists of a set of virtual machines. In Pisces [20], tenant-based performance isolation and fairness of a shared key-value store are studied. Pisces supports a weighted allocation of performance to the tenants. QoS control based on flash storage arrays is also available in commercial products like SolidFire [21], and is applied to the cloud service.

In contrast, PapioS3 aims at providing a finer-grained, per-access QoS function, with the assumption that each access takes a long time because the accessed object is large enough. PapioS3 does not focus on the volume basis over the block storage device or file system interface. Our study targets the object store interface, and the Amazon S3 interface has been chosen for PapioS3. Additionally, performance in PapioS3 is measured in throughput (MB/s) whereas many systems use IOPS.

In our previous works [10,22], no standard access interface was considered in Papio, and only an MPI-IO based interface was studied for HPC applications. This work is our first attempt to apply the QoS function of Papio to clouds, including support for the widely used interface in the cloud service on top of Papio.

VI. CONCLUSION AND FUTURE WORK

This paper presents a design of the QoS-enabled function in the S3-based object store, keeping S3 compatibility as much as possible. To respond to the demands of individual applications, an explicit performance request is allowed with a minimal extension of the S3 RESTful interface. An implicit performance request is also supported, which allows administrators to set a quota-style QoS, in particular when the S3 interface cannot be changed at all. This QoS feature is implemented based on performance guarantees provided by the Papio storage system running at the backend of the S3 server. Through our evaluation, it has been shown that the QoS capability can respond to the performance demand of each client, which is not possible with a simple load-balancing approach of the existing object store. The result is shown together with the speed-up achieved by our implementation of the multipart data transfer, which attains the high throughput that the backend storage provides.

On the other hand, since this study is an initial step to provide our designed QoS feature in the S3-based object store, there are several remaining issues which would be future work, as follows:

• Our S3 extension does not include the reserve operation itself. A standardization effort for an S3- or CDMI-like [23] interface to request the QoS level against the object store would be important. Cooperation with immediate access that does not require performance guarantees would be needed, too.

• Since the S3 frontend server works in a fair-share manner, achieving target rates would be difficult under high usage even though the backend storage provides an end-to-end QoS. Thus the frontend server should also work in a QoS-based manner.

• In PapioS3, a performance reservation is automatically mapped to a proper stripe pattern of Papio, but the mapping cannot be changed. Automatic conversion to another stripe pattern corresponding to a new reservation for the same object would be useful.

• PapioS3 should support more operations and functions of the S3 interface and be improved through further evaluations with practical applications, so that it can be used in a production environment.

ACKNOWLEDGMENT

This work was partly supported by KAKENHI (23680004).

REFERENCES

[1] “Amazon EC2,” http://aws.amazon.com/ec2/.
[2] “Amazon S3,” http://aws.amazon.com/s3/.
[3] “Swift,” http://swift.openstack.org/.
[4] “RADOS Gateway,” http://ceph.com/docs/master/radosgw/.
[5] “OpenStack,” http://www.openstack.org/.
[6] “Apache CloudStack,” http://cloudstack.apache.org/.
[7] S. L. Garfinkel, “An Evaluation of Amazon’s Grid Computing Services:
EC2, S3 and SQS,” Harvard University, Tech. Rep. TR-08-07, 2007.
[8] A. Iosup, N. Yigitbasi, and D. Epema, “On the Performance Variability
of Production Cloud Services,” in Proceedings of 11th IEEE/ACM
International Symposium on Cluster, Cloud and Grid Computing, 2011,
pp. 104–113.
[9] “Amazon EBS,” http://aws.amazon.com/ebs/.
[10] Y. Tanimura, H. Koie, T. Kudoh, I. Kojima, and Y. Tanaka, “A
Distributed Storage System Allowing Application Users to Reserve I/O
Performance in Advance for Achieving SLA,” in Proceedings of the
11th ACM/IEEE International Conference on Grid Computing, 2010,
pp. 193–200.
[11] S. A. Weil, A. W. Leung, S. A. Brandt, and C. Maltzahn, “RADOS: A
Scalable, Reliable Storage Service for Petabyte-scale Storage Clusters,”
in Proceedings of the 2nd International Workshop on Petascale Data
Storage, 2007, pp. 35–44.
[12] “Amazon S3 API Reference,”
http://awsdocs.s3.amazonaws.com/S3/latest/s3-ug.pdf.
[13] R. Takano, T. Kudoh, Y. Kodama, M. Matsuda, H. Tezuka, and
Y. Ishikawa, “Design and Evaluation of Precise Software Pacing Mech-
anism for Fast Long-Distance Networks,” in Proceedings of the 3rd
International Workshop for Fast Long Distance Networks, 2005.
[14] “GNS-WSI version 3,” http://www.g-lambda.net/.
[15] “Ceph,” http://ceph.com.
[16] “JetS3t,” http://jets3t.s3.amazonaws.com/index.html.
[17] “RFC 2616, Section 8,” http://www.w3.org/Protocols/rfc2616/rfc2616-sec8.html.
[18] E. Anderson, M. Hobbs, K. Keeton, S. Spence, M. Uysal, and
A. Veitch, “Hippodrome: running circles around storage administration,”
in Proceedings of the 1st USENIX Conference on File and Storage
Technologies, 2002.
[19] A. Elnably, H. Wang, A. Gulati, and P. Varman, “Efficient QoS for
Multi-Tiered Storage Systems,” in Proceedings of the 4th USENIX
Workshop on Hot Topics in Storage and File Systems (HotStorage’12),
2012, pp. 1–5.
[20] D. Shue, M. J. Freedman, and A. Shaikh, “Performance Isolation and
Fairness for Multi-Tenant Cloud Storage,” in Proceedings of the 10th
USENIX conference on Operating Systems Design and Implementation,
2012, pp. 349–362.
[21] “SolidFire,” http://www.solidfire.com/.
[22] Y. Tanimura, R. Filgueira, I. Kojima, and M. Atkinson, “MPI Collective
I/O based on Advanced Reservations to Obtain Performance Guarantees
from Shared Storage Systems,” in Proceedings of the 5th Workshop
on Interfaces and Architectures for Scientific Data Storage, held in
conjunction with IEEE Cluster 2013, 2013, pp. 1–5.
[23] “CDMI (Cloud Data Management Interface),”
http://www.snia.org/cdmi.