Azureblast: A Case Study of Developing Science Applications On The Cloud
ABSTRACT
Cloud computing has emerged as a new approach to large-scale computing and is attracting a lot of attention from the scientific and research computing communities. Despite its growing popularity, it is still unclear just how well the cloud model of computation will serve scientific applications. In this paper we analyze the applicability of the cloud to the sciences by investigating an implementation of a well-known and computationally intensive algorithm called BLAST. BLAST is a very popular life sciences algorithm commonly used in bioinformatics research. The BLAST algorithm makes an excellent case study because it is both crucial to many life science applications and representative of many applications important to data-intensive scientific research. We introduce a methodology that we use to study the applicability of cloud platforms to scientific computing and analyze the results from our study. In particular, we examine the best practices for handling large-scale parallelism and large volumes of data. While we carry out our performance evaluation on Microsoft's Windows Azure, the results readily generalize to other cloud platforms.

Categories and Subject Descriptors
D.1.3 [PROGRAMMING TECHNIQUES]: Concurrent Programming - Distributed programming; J.3 [Life and Medical Sciences]: Biology and genetics

General Terms
Performance, Design

Keywords
BLAST, Cloud Computing, Windows Azure

1. INTRODUCTION
Increasingly, scientific breakthroughs will be powered by advanced computing capabilities that help researchers manipulate and explore massive datasets. Today, while the largest and best-funded research projects are able to afford expensive computing infrastructure, most other projects are forced to opt for cheaper resources such as commodity clusters or simply limit the scope of their research. Cloud computing [2] proposes an alternative in which resources are no longer hosted locally, but leased from big data centers only when needed. This offers the promise of “democratizing” research, as a single researcher or small team can have access to the same large-scale compute resources as large, well-funded research organizations without the need to invest in purchasing or hosting their own physical infrastructure. Despite the existence of several cloud computing vendors, such as Amazon AWS, GoGrid, and more recently Microsoft Windows Azure, the potential of cloud platforms for research computing remains largely unexplored.

While cloud computing holds promise for the seemingly insatiable computational demands of the scientific community, there are unanswered questions about the applicability of cloud platforms, including performance, which is the focus of this work. In this paper we present an experimental prototype, AzureBlast, which is designed to assess the applicability of cloud platforms for science applications. BLAST [1] is one of the most widely used bioinformatics algorithms in life science applications. The BLAST algorithm discovers the similarities between two bio-sequences (e.g., proteins). BLAST makes an excellent case study not only because of its popularity but also because its characteristics are representative of many applications important to data-intensive scientific research. AzureBlast is a parallel BLAST engine running on the Windows Azure cloud fabric. Instead of using a high-level programming model or runtime such as MapReduce [8], AzureBlast is built directly on the fundamental services of Windows Azure so that we are able to examine each individual building block. Our ultimate goal is to provide a characterization of science applications appropriate for cloud computing platforms and best practices for deploying science applications on cloud platforms.

The structure of this paper is as follows. In Section 2 we briefly discuss the Windows Azure cloud platform and the capabilities it offers to a computational scientist. In that section we also highlight aspects of modern data centers and their implications for high performance computing. In Section 3 we introduce the background of BLAST, detail our implementation of AzureBLAST, and identify how we matched the requirements of the algorithm to the capabilities and limitations of the cloud platform. Throughout Section 3 we identify general patterns to follow when implementing similar science applications on a cloud platform. In Section 4 we carry out a detailed performance evaluation of AzureBLAST and discuss the implications for which science applications are appropriate for cloud computing platforms. Finally, we list the related work and conclude with a summary of best practices and application patterns for science applications on cloud platforms.
Figure 5: Workflow of AzureBlast tasks

Although the workflow is straightforward, some subtle is-

In AzureBlast, we adopt an indirect scheme that leverages the highly scalable Azure blob storage. A background database updating process, which runs on its own role instance, periodically refreshes the NCBI databases into Azure blob storage. As specified by the user, the database can be staged during the initialization phase of each instance, or it can be staged lazily when the instance is about to execute a BLAST task. In either case, if the timestamp of the local replica has expired, the database will be updated from blob storage. As Azure blobs are designed to provide highly scalable throughput, this indirect solution actually provides the best performance.
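
The staging scheme described above can be illustrated with a minimal Python sketch. It is not the AzureBlast implementation: the blob store is simulated with a plain directory, the class name DatabaseStager and all of its parameters are hypothetical, and the expiry rule (a fixed time-to-live measured from the last local copy) is an assumption, since the text only says that the replica's timestamp can expire.

```python
import shutil
import time
from pathlib import Path


class DatabaseStager:
    """Keeps a local replica of each database in sync with a shared store.

    Assumption: the 'blob store' is a plain directory standing in for
    Azure blob storage; a real worker would use a storage client instead.
    """

    def __init__(self, blob_dir, local_dir, ttl_seconds, lazy=True, clock=time.time):
        self.blob_dir = Path(blob_dir)
        self.local_dir = Path(local_dir)
        self.ttl_seconds = ttl_seconds      # assumed expiry rule: fixed time-to-live
        self.clock = clock                  # injectable so the logic can be tested
        self.staged_at = {}                 # database name -> time of last local copy
        self.downloads = 0                  # number of copies made from the store
        self.local_dir.mkdir(parents=True, exist_ok=True)
        if not lazy:
            # Eager mode: stage every database during instance initialization.
            for blob in sorted(self.blob_dir.iterdir()):
                if blob.is_file():
                    self._download(blob.name)

    def _download(self, name):
        """Copy one database from the store to local disk and stamp it."""
        shutil.copyfile(self.blob_dir / name, self.local_dir / name)
        self.staged_at[name] = self.clock()
        self.downloads += 1

    def is_expired(self, name):
        """A replica is expired if it was never staged or its TTL has elapsed."""
        staged = self.staged_at.get(name)
        return staged is None or self.clock() - staged >= self.ttl_seconds

    def get_database(self, name):
        """Return the local path, refreshing from the store if needed.

        Called just before a task runs, so lazy instances stage on first use
        and both modes pick up refreshed databases once the replica expires.
        """
        if self.is_expired(name):
            self._download(name)
        return self.local_dir / name
```

A worker would call get_database immediately before running a BLAST task. Routing both the eager and the lazy mode through the same expiry check keeps the refresh logic in one place, which mirrors the "in either case" behavior described in the text.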