Excalibur: An Autonomic Cloud Architecture for Executing Parallel Applications

Alessandro Ferreira Leite, Claude Tadonki, Christine Eisenbeis, Tainá Raiol, Maria Emilia Walter, Alba Cristina Alves de Melo

To cite this version:

Alessandro Ferreira Leite, Claude Tadonki, Christine Eisenbeis, Tainá Raiol, Maria Emilia Walter, et al. Excalibur: An Autonomic Cloud Architecture for Executing Parallel Applications. Fourth International Workshop on Cloud Data and Platforms (CloudDP 2014), Apr 2014, Amsterdam, Netherlands. pp. 1-6, 10.1145/2592784.2592786. hal-01087315

HAL Id: hal-01087315
https://hal-mines-paristech.archives-ouvertes.fr/hal-01087315
Submitted on 11 Dec 2014

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Excalibur: An Autonomic Cloud Architecture for Executing Parallel Applications

Alessandro Ferreira Leite (Université Paris-Sud / University of Brasilia, [email protected])
Claude Tadonki (MINES ParisTech / CRI, [email protected])
Christine Eisenbeis (INRIA Saclay / Université Paris-Sud, [email protected])
Tainá Raiol (Institute of Biology, University of Brasilia, [email protected])
Maria Emilia M. T. Walter (Department of Computer Science, University of Brasilia, [email protected])
Alba Cristina Magalhães Alves de Melo (Department of Computer Science, University of Brasilia, [email protected])
Abstract

IaaS providers often allow the users to specify many requirements for their applications. However, users without advanced technical knowledge usually do not provide a good specification of the cloud environment, leading to low performance and/or high monetary cost. In this context, users face the challenge of scaling cloud-unaware applications without re-engineering them. Therefore, in this paper, we propose and evaluate a cloud architecture, named Excalibur, to execute applications in the cloud. In our architecture, the users provide the applications, and the architecture sets up the whole environment and adjusts it at runtime. We executed a genomics workflow in our architecture, deployed on Amazon EC2. The experiments show that the proposed architecture dynamically scales this cloud-unaware application up to 10 instances, reducing the execution time by 73% and the cost by 84% when compared to the execution in the configuration specified by the user.

Categories and Subject Descriptors: C.2.4 [Cloud computing]: Software architecture

Keywords: Cloud computing architecture, parallel execution, autonomic computing

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]. CloudDP'14, April 13, 2014, Amsterdam, Netherlands. Copyright is held by the owner/author(s). Publication rights licensed to ACM. ACM 978-1-4503-2714-5/14/04 $15.00. http://dx.doi.org/10.1145/2592784.2592786

1. Introduction

Nowadays, the cloud infrastructure can be used for high-performance computing (HPC) due to characteristics such as elastic resources, the pay-as-you-go model, and full access to the underlying infrastructure [1]. These characteristics can be used to decrease the cost of ownership, to increase the capacity of a dedicated infrastructure when it runs out of resources, and to respond effectively to changes in demand. However, high-performance computing in the cloud faces challenges such as differences among HPC cloud infrastructures and the lack of cloud-aware applications.

The cloud infrastructure requires a new level of robustness and flexibility from the applications, as hardware failures and performance variations become part of its normal operation. In addition, cloud resources are optimized to reduce the cost to the cloud provider, often without performance guarantees at low cost to the users. Furthermore, cloud providers offer different instance types (e.g., virtual machines (VMs)) and services whose costs and performance are defined according to their purpose of usage. In this scenario, cloud users face many challenges. First, re-engineering existing applications to fit the cloud model requires expertise in both domains, cloud and high-performance computing, as well as considerable time to accomplish it. Second, selecting the resources that fit an application's needs requires data about the application characteristics and about the resources' purpose of usage. Therefore, deploying and executing an application in the cloud is still a complex task [14, 8].

Although some efforts have been made to reduce the cloud's complexity, most of them target software developers [12, 13] and are not straightforward for inexperienced users [8].
Therefore, in this paper, we propose and evaluate an architecture to execute applications in the cloud with three main objectives: (a) provide a platform for high-performance computing in the cloud for users without cloud skills; (b) dynamically scale the applications without user intervention; and (c) meet the users' requirements, such as high performance at reduced cost.

The remainder of this paper is organized as follows. Section 2 presents our cloud architecture. In Section 3, experimental results are discussed. Section 4 presents related work and discusses cloud architectures to perform high-performance computing. Finally, Section 5 presents the conclusion and future work.

2. Design of the Proposed Architecture

Our architecture aims to simplify the use of the cloud and to run applications on it without requiring a re-design of the applications.

We propose an architecture composed of micro-services. A micro-service is a lightweight and independent service that performs a single function and collaborates with other services using a well-defined interface to achieve some objective. Micro-services make our architecture flexible and scalable, since services can be changed dynamically according to the users' objectives. In other words, if a service does not achieve a desirable performance in a given cloud provider, it can be deployed in another cloud provider without requiring a service restart.

In this paper, an application represents a user's demand or work, seen as a single unit by the user. An application is composed of one or more tasks, which represent the smallest work unit to be executed by the system. A partition is a set of independent tasks. The tasks that form an application can be connected by precedence relations, forming a workflow. A workflow is defined to be a set of activities, and these activities can be tasks, as said above, or even other workflows. The terms application and job are used interchangeably in this paper.

The proposed architecture has three layers: the Physical, Application, and User layers (Figure 1). In the Physical layer, there are services responsible for managing the resources (e.g., virtual machines (VMs) and/or storage). A resource may be registered by the cloud providers or by the users through the Service Registry. By default, the resources provided by the public clouds are registered with the following data: resource type (e.g., physical or virtual machine, storage), resource URL, costs, and resource purpose (e.g., whether the resource is optimized for CPU, memory, or I/O). The Resource Management service is responsible for validating these data and keeping them updated. For instance, a resource registered at time ti may not be available at time tj, tj > ti, either because it failed or because its maximum allowed usage was reached. With the Monitoring and Deployment service, we deploy the users' jobs and monitor them. Monitoring is an important activity for many reasons. First, it collects data about the resources' usage. Second, it can be used to detect failures; sometimes the providers terminate the services when they are using/stressing the CPU, the RAM, or both. And finally, it supports the auto-scaling mechanism.
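To make the Physical layer concrete, the sketch below shows what a Service Registry entry and its validation could look like, based on the registration data listed above (resource type, URL, costs, and purpose). This is a minimal illustration in Python; the class and function names are our own assumptions, not Excalibur's actual code.

    # A minimal sketch (assumed names, not Excalibur's code) of a Service
    # Registry entry holding the registration data listed above.
    from dataclasses import dataclass
    from enum import Enum

    class Purpose(Enum):
        CPU = "cpu"        # compute optimized
        MEMORY = "memory"  # memory optimized
        IO = "io"          # I/O optimized

    @dataclass
    class ResourceEntry:
        resource_type: str    # e.g. "virtual-machine", "physical-machine", "storage"
        url: str              # endpoint used to access the resource
        cost_per_hour: float  # monetary cost charged by the provider
        purpose: Purpose
        available: bool = True  # kept updated by the Resource Management service

    def validate(entry: ResourceEntry, reachable: bool, quota_reached: bool) -> ResourceEntry:
        # A resource registered at time ti may be unavailable at tj > ti,
        # either because it failed or because its allowed usage was reached.
        entry.available = reachable and not quota_reached
        return entry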
We provide a uniform view of the cloud providers' APIs by implementing a Communication API. This is necessary because each provider may offer different interfaces to access the resources.

On top of the Physical layer, the Application layer provides micro-services to schedule jobs (Provisioning), to control the data flows (Data Event Workflow), to provide a data streaming service (Data Streaming Processing), and to control the jobs' execution. The architecture uses the MapReduce [3] strategy to distribute and execute the jobs. This does not mean that the users' applications must be composed of MapReduce jobs, but only that they are distributed following this strategy.

The Coordination service manages the resources, which can be distributed across different providers, and provides a uniform view of the system, such as the available resources and the system's workload.

The Provisioning service uses high-level specifications to create a workflow. This workflow contains the tasks which will set up the cloud environment and an execution plan which will be used to execute the application. In fact, the Provisioning service communicates with the Coordination service to obtain data about the resources and to allocate them for a job. After that, it submits the workflow to the Workflow Management service.

An execution plan consists of the application, the data sources, the resources to execute it, a state (initializing, waiting data, ready, executing, and finished), and a characteristic that can be known or unknown by the system. A characteristic represents the application's behavior, such as CPU-, memory-, or I/O-bound.
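Based on this description, an execution plan can be modeled as a small record whose state advances from initializing to finished and whose characteristic may initially be unknown. The sketch below is our illustration under these assumptions, not the paper's implementation.

    # A sketch (assumed names) of an execution plan with the five states and
    # the known/unknown characteristic described above.
    from dataclasses import dataclass, field
    from enum import Enum
    from typing import List, Optional

    class State(Enum):
        INITIALIZING = "initializing"
        WAITING_DATA = "waiting data"
        READY = "ready"
        EXECUTING = "executing"
        FINISHED = "finished"

    class Characteristic(Enum):
        CPU_BOUND = "cpu"
        MEMORY_BOUND = "memory"
        IO_BOUND = "io"

    @dataclass
    class ExecutionPlan:
        application: str         # the job this plan executes
        data_sources: List[str]  # where the input data comes from
        resources: List[str] = field(default_factory=list)  # allocated resources
        state: State = State.INITIALIZING
        # None models a characteristic that is unknown to the system; the
        # scheduler in Section 2.3 probes instance types to discover it.
        characteristic: Optional[Characteristic] = None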
The Workflow Management service coordinates the execution of the workflow and creates the data flows in the Workflow Data Event service.

The Workflow Data Event service is responsible for collecting and moving data for the execution plans. A data flow has a source and a sink, and it can supply data to multiple execution plans. This avoids multiple accesses to the Distributed File System (DFS) to fetch the same data.

The User layer has two micro-services: the Job Submission and Job Stats Processing services. A user submits a job using the Job Submission service. A job has the following data: the tasks which compose it, the constraints, the data definition (input and output), and the data about the cloud providers (name and access key). The users can monitor or get the results of their jobs through the Job Stats Processing service.

Figure 1. The proposed architecture and its micro-services. [Diagram: the User layer (Job Submission, Job Stats Processing); the Application layer (Provisioning, Coordination, Workflow Management, Data Streaming Processing, Scripting Query Processing, Distributed Database Processing, Distributed Job Processing (MapReduce), Workflow Data Event, Distributed File System (DFS)); and the Physical layer (Service Registry, Monitoring and Deployment, Resource Management, Communication API), on top of the Cloud Provider API.]

Scaling a cloud-unaware application without technical skills requires an architecture that abstracts the whole environment, taking into account the users' objectives. In the next subsections, we explain how the proposed architecture achieves these goals.
2.1 Scaling cloud-unaware applications with budget restrictions and resource constraints

The applications considered in this paper are workflows, but some parts of the workflow can be composed of a set of independent tasks that can be run in parallel. These independent tasks are the target of our scaling technique. They are split into P partitions, assigned to different resources. One important problem here is to determine the size and the number of the partitions. Over-partitioning can lead to a great number of short-duration tasks, which may cause a considerable overhead to the system and can result in inefficient resource usage. To avoid this, a partition is estimated by [2]:

    P = (Nq × R) / T    (1)

where Nq is the workload size, T is the estimated CPU time for executing Nq in the partition, and R is a parameter for the maximum execution time for partition P. A partition can be adjusted according to the node characteristics. For instance, if the resource usage by a partition Pi is below a threshold, Pi can be increased.

Partitions exist due to the concept of splittable and static files. It is the user who defines which data are splittable and how to split the data when the system does not know. Splittable data are converted to JavaScript Object Notation (JSON) records and persisted onto the distributed database, so a partition represents a set of JSON records. On the other hand, static data are kept in the local file system.
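To make Equation (1) concrete, the sketch below computes a partition size and applies the threshold-based adjustment mentioned above. The function names, the doubling policy, and the example numbers are our own assumptions; the workload size is assumed to be measured in records.

    # A sketch of Equation (1) and the partition adjustment (assumed policy).
    def partition_size(nq: float, r: float, t: float) -> float:
        # Equation (1): P = (Nq * R) / T, where nq is the workload size,
        # t the estimated CPU time to execute nq, and r the maximum
        # execution time allowed for one partition.
        return nq * r / t

    def adjust(partition: float, usage: float, threshold: float = 0.5) -> float:
        # If the resource usage of a partition Pi is below a threshold,
        # Pi can be increased (here it is doubled, an assumed policy).
        return partition * 2 if usage < threshold else partition

    # Example with the settings later used in Section 3.2: R = 1 hour and
    # T = 9 hours give partitions of one ninth of the workload, hence 9 partitions.
    nq = 900_000                      # hypothetical number of records
    p = partition_size(nq, r=1, t=9)  # -> 100000.0 records per partition
    print(int(nq / p))                # -> 9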
Normally, these works require a program to read, parse,
2.2 Minimizing data movement to reduce cost and execution
time
and filter the data. However, in our solution the users only
have to know the data structure and to use a domain specific
Data movement can increase the total execution time of the language (DSL) to perform their work. Listings 1 and 2 show
application (makespan) and sometimes it can be higher than how those works are defined, where b, P1, P2, T and w are
the computation time due to the differences in networks’ users’ parameters.
bandwidth. In that case, we can invert the direction of the
e x e c u t e T w i t h ( s e l e c t r e a d s from genomic−
logical flow, moving the application as close as possible to
d a t a b a s e where P1 = X and P2 = Y) −s e q = b
the data location. Actually, we distribute the data using a ✆
Distributed File System and the MapReduce strategy. Listing 1. Specification of a genomics analysis application.
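The following sketch illustrates the invalidation protocol just described: the Coordination service tracks which nodes hold each key and, when a record is updated, tells the other holders to drop it. The names are hypothetical, and the notifications are shown synchronously for brevity, whereas the paper describes them as asynchronous.

    # A sketch of the in-memory data policy (assumed names; synchronous calls
    # stand in for the asynchronous notifications described above).
    from collections import defaultdict

    class Coordination:
        # Tracks which nodes currently hold each key in memory.
        def __init__(self):
            self.holders = defaultdict(set)   # key -> nodes holding the record

        def record_read(self, key, node):
            self.holders[key].add(node)

        def record_updated(self, key, updater):
            for node in self.holders[key] - {updater}:
                node.invalidate(key)          # tell other holders to drop it
            self.holders.pop(key, None)       # the updater dropped its copy too

    class Node:
        def __init__(self, coordination):
            self.cache = {}
            self.coordination = coordination

        def read(self, key, fetch):
            if key not in self.cache:         # fetch() reads from the DFS/database
                self.cache[key] = fetch(key)
                self.coordination.record_read(key, self)
            return self.cache[key]

        def update(self, key, value, store):
            store(key, value)                 # persist the new value
            self.cache.pop(key, None)         # remove the record from memory
            self.coordination.record_updated(key, self)

        def invalidate(self, key):
            self.cache.pop(key, None)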
2.3 Minimizing job makespan through workload adjustment

In an environment with incomplete information and an unpredictable usage pattern, as in the cloud, load imbalance can impact the total execution time and the monetary cost. For instance, assigning a CPU-bound task to a memory-optimized node is not a good choice. To tackle this problem, we propose a workload adjustment technique that works as follows. For execution plans in the ready state and with an unknown application characteristic, the scheduler selects similar execution plans, submits one to each available resource type (i.e., CPU, memory, or I/O optimized), and waits. As soon as the first execution finishes, the scheduler checks if there are similar execution plans in the ready state and submits them.

When there are no more ready execution plans, the scheduler assigns one in the executing state. Note that, in this case, the cost can increase, since we have more than one node executing the same task. In fact, we minimize this cost by finishing the slow node according to the difference between the elapsed time and the time at which the node's usage is charged.
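A compact sketch of this probing strategy is shown below. The helper names and the way probes are paired with resource types are assumptions; the duplicate-execution and slow-node-termination details described above are omitted for brevity.

    # A sketch (assumed names) of the workload adjustment technique: plans with
    # an unknown characteristic are probed, one per resource type, and the type
    # that finishes first receives the remaining similar plans.
    RESOURCE_TYPES = ["cpu", "memory", "io"]

    def schedule(ready_plans, submit, wait_first_finished):
        # submit(plan, rtype) starts a plan on a node of that resource type;
        # wait_first_finished() returns the resource type that finished first.
        while ready_plans:
            plan = ready_plans.pop(0)
            if plan.characteristic is None:
                # Probe: submit similar plans, one per available resource type.
                probes = [plan] + [ready_plans.pop(0)
                                   for _ in RESOURCE_TYPES[1:] if ready_plans]
                for p, rtype in zip(probes, RESOURCE_TYPES):
                    submit(p, rtype)
                best = wait_first_finished()   # e.g. "cpu"
                for p in ready_plans:          # similar ready plans now go
                    p.characteristic = best    # to the winning resource type
            else:
                submit(plan, plan.characteristic)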
2.4 Making the cloud transparent for the users

As our architecture aims to make the cloud transparent for the users, it automates the setup process. However, for some users this is not sufficient, since some jobs still require programming skills. For instance, consider the following scenarios: (i) a biologist who wants to search for DNA units that have some properties in a genomics database, and to compare these DNA units with another sequence that he/she has built; (ii) a social media analyst who wants to filter tweets using some keywords.

Normally, these works require a program to read, parse, and filter the data. However, in our solution, the users only have to know the data structure and to use a domain-specific language (DSL) to perform their work. Listings 1 and 2 show how those works are defined, where b, P1, P2, T, and w are users' parameters.

    execute T with (select reads from genomic-database
        where P1 = X and P2 = Y) -seq = b

Listing 1. Specification of a genomics analysis application.

    select tweet from tweets where text contains (w)

Listing 2. Specification of a Twitter analysis application.

In this case, a data structure (e.g., a file) is seen as a table whose fields can be filtered. Although there are similar approaches in the literature, such as BioPig [12] and SeqPig [13], they still require programming skills to register the drivers and to load/store the data. In other words, to use them, the users have to know the system's internals.
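To illustrate how such a declarative query spares the user from writing a filtering program, the sketch below evaluates a Listing 2 style filter over JSON records, the format in which splittable data are stored (Section 2.1). This is our illustration of the idea, not Excalibur's query engine; the field and keyword values are hypothetical.

    # A sketch (not Excalibur's engine) of evaluating a Listing 2 style query,
    # "select tweet from tweets where text contains (w)", over JSON records.
    import json

    def run_query(lines, field="tweet", filter_field="text", keyword="cloud"):
        # Yield the selected field of every record whose filter_field
        # contains the user's keyword w (here the assumed value "cloud").
        for line in lines:
            record = json.loads(line)
            if keyword in record.get(filter_field, ""):
                yield record[field]

    tweets = [
        '{"tweet": "t1", "text": "running HPC in the cloud"}',
        '{"tweet": "t2", "text": "an unrelated message"}',
    ]
    print(list(run_query(tweets)))  # -> ['t1']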
In order to illustrate our architecture, consider the bioinformatics scenario described above. In this case, the biologist submits an XML or a YAML file with the application, the requirements, and the data definition (the genomics database and the built sequence) using a console application (a client of the Job Submission service) at the User layer. The Job Submission service sends the job description to the Provisioning service at the Application layer and waits for the job's ID. When the Provisioning service receives the application, it executes the following steps. First, it creates a workflow with five activities: (i) select the cheapest virtual machine to set up the environment; (ii) get the non-splittable files (e.g., a reference genome) and store them in the local file system; (iii) get the splittable files (the genomics database) and persist them into the DFS; (iv) create a Virtual Machine Image (VMI) of the configured environment; and (v) finish the VM used to configure the environment. Second, it selects the resources returned by the Coordination service that match the users' requirements or the application's characteristics. Third, it creates an execution plan for the application, selects a resource to execute it, and starts the execution. Finally, it returns the job's ID.

In this scenario, a partition has a set of genomics sequences, read from the DFS by the Workflow Data Event service and assigned to an execution plan. During the application's execution, the Provisioning service monitors the application through the Monitoring service and, if the partition's execution time reaches the expected time, it creates more VMs to redistribute the workload. After all tasks have finished, the user receives the output through the Job Submission service. A sketch of what such a job description might contain is given below.
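The paper does not publish a concrete job description format, so the following only illustrates what the submitted XML/YAML content might carry, using the job fields named in Section 2 (tasks, constraints, data definition, and the provider's name and access key). Every key, value, and the client class are assumptions.

    # A hedged sketch of a job description with the fields named in Section 2;
    # all keys and the client class are assumptions, not Excalibur's format.
    job_description = {
        "application": "infernal-segemehl",
        "tasks": [{
            "name": "search",
            "query": ("execute T with (select reads from genomic-database "
                      "where P1 = X and P2 = Y) -seq = b"),  # the Listing 1 DSL
        }],
        "constraints": {"memory": ">= 88GB", "deadline_hours": 9},
        "data": {
            "input": {"splittable": ["genomic-database"],
                      "static": ["reference-genome.fa"]},
            "output": "results/ncrna/",                      # hypothetical path
        },
        "provider": {"name": "aws-ec2", "access_key": "<access-key>"},
    }

    # client = JobSubmissionClient()           # hypothetical client of the
    # job_id = client.submit(job_description)  # Job Submission service;
    #                                          # it returns the job's ID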
3. Experimental Results

We deployed an instance of our architecture on Amazon EC2. Our goal was to evaluate the architecture when instantiated by a user without cloud skills.

We executed a genomics workflow that aims to identify non-coding RNAs (ncRNAs) in the fungus Schizosaccharomyces pombe (S. pombe). This workflow, called Infernal-Segemehl, consists of four phases (Figure 2): (i) first, the tool Infernal [11] maps the S. pombe sequences onto a nucleic acid sequence database (e.g., Rfam [4]); (ii) then, the sequences with no hit or with a low-score hit are processed by the segemehl tool [6]; (iii) SAMtools [10] is used to sort the alignments and convert them to the SAM/BAM format; (iv) finally, the RNAfold tool [5] is used to calculate the minimum free energy of the RNA molecules obtained in step (iii).

Figure 2. The Infernal-Segemehl workflow. [Pipeline: Infernal → segemehl → SAMtools → RNAfold.]

We used Rfam version 11.1 (with 2278 ncRNA families) and S. pombe sequences extracted from EMBL-EBI (1 million reads). Rfam is a database of non-coding RNA families, with a seed alignment for each family and a covariance model profile built on this seed to identify additional members of a family [4].

Although, at its highest level, this workflow executes only four tools, it is data oriented. In other words, each step processes a huge amount of data and, in all tools, each pairwise sequence comparison is independent. So, the data can be split and processed by parallel tasks.

Instance type    CPU                               RAM      Cost ($/hour)
PC               Intel Core 2 Quad CPU 2.40 GHz    4 GB     Not applicable
hs1.8xlarge      Intel Xeon 2.0 GHz, 16 cores      171 GB   4.60
m1.xlarge        Intel Xeon 2.0 GHz, 4 cores       15 GB    0.48
c1.xlarge        Intel Xeon 2.0 GHz, 8 cores       7 GB     0.58
t1.micro         Intel Xeon 2.0 GHz, 1 core        613 MB   0.02

Table 1. Resources used during the experiments.

The Amazon EC2 micro instance (t1.micro) was used to set up the environment (e.g., install the applications and copy the static files to the local file system) and to create a Virtual Machine Image (VMI). We chose it because it is the cheapest and is also eligible for the free quota.

In addition to the cloud executions, we also executed the workflow on a local PC (Table 1) to get an idea of the cloud overhead.

3.1 Case study 1: execution without auto scaling

This experiment aims to simulate the users' preferences, where an instance is selected based either on their knowledge about the application's requirements or on the amount of computational resources offered by an instance. We executed the workflow on the first four instances listed in Table 1. The t1.micro instance was used exclusively to set up the environment and was not used to run the application.

Figure 3 shows the costs and execution times for the four instances. The time was measured from the moment the application was submitted until the time all the results were produced (wall-clock time). Therefore, it includes the cloud overhead (data movement to/from the cloud, VM instantiation, among others). The hs1.8xlarge instance, which was selected based on the application requirements (≥ 88 GB of RAM), outperformed all the other instances. Although it was possible for the user to execute his/her application without any technical cloud skills, the amount paid (USD 78.00) was high. This happened because the user specified that the application would need more than 88 GB of RAM when, in fact, the application used only 3 GB of RAM.
Considering this scenario, the cloud is not an attractive alternative for the users due to its execution times; these were 22% and 31% higher than the local execution (PC, Table 1). Even in the best configuration (hs1.8xlarge), the execution time was only 60% lower, with a high monetary cost. These differences are owing to the multi-tenant model employed by the clouds.

Figure 3. Cost and execution time of the Infernal-Segemehl workflow (Figure 2) in the cloud, allocating the resources based on the users' preferences. [Two bar charts: (a) cost (USD) for the c1.xlarge, hs1.8xlarge, and m1.xlarge instances; (b) execution time (seconds) for those instances and the local PC.]

3.2 Case study 2: execution with auto scaling

This experiment aims to evaluate if the architecture can scale a cloud-unaware application.

Based upon the previous experiment (Figure 3), the system discarded the I/O-optimized instance (hs1.8xlarge) due to its high cost (Table 1) and also because the application did not really require the amount of memory defined by the user. In a normal scenario, this instance is selected only if the monitoring service confirms that the application is I/O intensive.

To scale, the system creates P partitions (P1, P2, ..., Pn) using Equation 1 (Section 2.1) with R equal to 1 hour and T equal to 9 hours. These values represent, respectively, the expected execution time for one partition Pi and for the whole workflow. They were defined because Amazon charges for the resources by the hour and because, in the previous experiment, the best execution took approximately 9 hours to finish (Figure 3). This means that this experiment aims to at least decrease the cost. Since each partition then holds one ninth of the workload, 9 partitions were created in this case.
At the beginning, the system did not have sufficient data to decide whether the workflow was memory- or CPU-bound, so it submitted two similar partitions, one to each of two instance types (m1.xlarge and c1.xlarge), to find out which was the most appropriate for the partition.

Figure 4 shows the execution time for each partition on the selected instance types. As soon as the execution of the partition assigned to the c1.xlarge instance finished, the system created one VM for each partition in the ready state and executed them. Although there were only 7 partitions in the ready state and 1 in execution (executing state), the architecture duplicated the partition in execution, since its execution time on the m1.xlarge instance was unknown. After one hour, three more instances were created to redistribute the tasks, as shown in Figure 5.

Due to the cloud infrastructure, which provided the requested resources in nearly real time, and the auto-scaling mechanism, which selected the resources based on the partitions' characteristics, we decreased both the cost (by a factor of 5) and the makespan (to 10,830 seconds) using 10 c1.xlarge instances (80 vCPUs) and one m1.xlarge (4 vCPUs).

Our strategy differs from the scaling services offered by the cloud providers, such as Amazon CloudWatch (aws.amazon.com/cloudwatch/), since the users do not have to select an instance type or split the work manually.

Figure 4. Cost to execute the workflow (Figure 2) with auto scaling enabled. [Panels: (a) cost (USD) to execute the workflow using 10 c1.xlarge instances and 1 m1.xlarge instance; (b) execution time (seconds) for one partition when executed on the c1.xlarge and m1.xlarge instances. One partition was defined to finish in 1 hour, with a deadline of 9 hours for the workflow.]

Figure 5. Scaling the Infernal-Segemehl workflow. [Plot: number of instances versus time (seconds), growing from 2 to 10 instances within approximately 10,000 seconds.]
4. Related Work

In the last years, many works have described the challenges and opportunities of running high-performance computing in the cloud [1, 8]. Many of the benefits identified by these works, such as easy access to the resources, elasticity, stability, and resource provisioning in nearly real time, as well as the technical skills required to administrate the cloud, confirm the requirement of reducing the complexity for the users, and are consistent with our work in scaling the Infernal-Segemehl workflow using Amazon EC2.

Recently, many works have focused on developing new architectures to execute users' applications in the cloud considering both cost and performance. For instance, the Cloud Virtual Service (CloVR) [2] is a desktop application for automated sequence analyses using cloud computing resources. With CloVR, the users execute a VM on their computer, configure the applications, and insert the data in a special directory; CloVR then deploys an instance of this VM on the cloud to scale and to execute the applications. CloVR scales the application by splitting the workload into P partitions using Equation 1, and it uses the Cunningham BLAST runtime to estimate the CPU time for each BLAST query.

In [9], biological applications are run on Microsoft Windows Azure, showing the required skills and the challenges to accomplish the work. Iordache and colleagues [7] developed Resilin, an architecture to scale MapReduce jobs in the cloud. The solution has different services to provision the resources, to handle the jobs' flow execution, to process the users' requests, and to scale according to the load of the system.

Doing bioinformatics data analysis with Hadoop requires knowledge about the Hadoop internals and considerable effort to implement the data flow. In [12], a tool for bioinformatics data analysis called BioPig is presented. In this case, the users select and register a driver (bioinformatics algorithms) provided by the tool and write their analysis jobs using the Apache Pig (pig.apache.org) data flow language. SeqPig [13] is another tool that has the same objective as BioPig. The difference between them lies in the drivers provided by each tool. These tools reduce the need to know the Hadoop internals to perform bioinformatics data analysis.

The closest works to ours are [2], [12], and [13]. Our work differs from these approaches in the following ways. First, the users do not need to configure a VM on their computers to execute the applications in the cloud. Second, our architecture tries to match the workload to the appropriate instance type. Third, the data flow is defined using an abstract language, freeing the users from writing any code. The language is the same as the one used by BioPig and SeqPig, but with the difference that the users write the data flow considering only the data structure. For instance, to filter the sequences using BioPig or SeqPig, the users have to register the loaders and the drivers and write a script to execute the analysis, which is more appropriate for software developers.

5. Conclusion and Future Work

In this paper, we proposed and evaluated a cloud architecture based on micro-services to execute applications in the cloud. With a user-oriented perspective, we could execute a genomics workflow without requiring programming skills or cloud knowledge from the users. We executed two experiments using Amazon EC2 to evaluate the architecture when instantiated with and without auto scaling. In the first case, the user was responsible for defining an instance type to execute the workflow without auto scaling. In the second case, an instance type was selected based on the application's characteristics, and the work was split to reduce the execution time. Using 11 VMs, we decreased both the cost and the execution time when compared to an execution without auto scaling.

As future work, we intend to instantiate the architecture running other applications in a hybrid cloud. Also, we will consider a dynamic scenario, where the number of tasks is unknown and the resources' usage is restricted. Finally, we intend to incorporate QoS and budget requirements.

Acknowledgments

The authors would like to thank CAPES/Brazil and CNPq/Brazil, through the STIC-AmSud project BioCloud, and INRIA/France for their financial support.

References

[1] M. AbdelBaky et al. "Enabling High-Performance Computing as a Service". In: Computer 45.10 (2012), pp. 72–80.
[2] S. Angiuoli et al. "CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing". In: BMC Bioinformatics 12.1 (2011), pp. 1–15.
[3] J. Dean et al. "MapReduce: simplified data processing on large clusters". In: 6th OSDI. Vol. 6. USENIX, 2004.
[4] S. Griffiths-Jones et al. "Rfam: annotating non-coding RNAs in complete genomes". In: Nucleic Acids Research 33.1 (2005), pp. D121–D124.
[5] I. Hofacker et al. "Fast folding and comparison of RNA secondary structures". In: Chemical Monthly 125.2 (1994), pp. 167–188.
[6] S. Hoffmann et al. "Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures". In: PLoS Computational Biology 5.9 (2009), e1000502.
[7] A. Iordache et al. "Resilin: Elastic MapReduce over Multiple Clouds". In: 13th IEEE/ACM CCGrid (2013), pp. 261–268.
[8] G. Juve et al. "Comparing FutureGrid, Amazon EC2, and Open Science Grid for Scientific Workflows". In: Computing in Science & Engineering 15.4 (2013), pp. 20–29.
[9] J. Karlsson et al. "Enabling Large-Scale Bioinformatics Data Analysis with Cloud Computing". In: 10th IEEE ISPA. 2012.
[10] H. Li et al. "The Sequence Alignment/Map format and SAMtools". In: Bioinformatics 25.16 (Aug. 2009), pp. 2078–2079.
[11] E. P. Nawrocki et al. "Infernal 1.0: inference of RNA alignments". In: Bioinformatics 25.10 (2009), pp. 1335–1337.
[12] H. Nordberg et al. "BioPig: a Hadoop-based analytic toolkit for large-scale sequence data". In: Bioinformatics 29.23 (2013), pp. 3014–3019.
[13] A. Schumacher et al. "SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop". In: Bioinformatics 30.1 (2014), pp. 119–120.
[14] Y. Zhao et al. "Opportunities and Challenges in Running Scientific Workflows on the Cloud". In: CyberC. Oct. 2011, pp. 455–462.
