
MIPRO 2015, 25-29 May 2015, Opatija, Croatia

Cloudflow - A Framework for MapReduce Pipeline Development in Biomedical Research

Lukas Forer (1,2), Enis Afgan (3,4), Hansi Weißensteiner (1,2), Davor Davidović (3), Günther Specht (2), Florian Kronenberg (1), Sebastian Schönherr (1,2)

(1) Division of Genetic Epidemiology, Medical University of Innsbruck, Innsbruck, Austria
(2) Institute of Computer Science, Research Group Databases and Information Systems, Innsbruck, Austria
(3) Center for Informatics and Computing, Ruđer Bošković Institute, Zagreb, Croatia
(4) Department of Biology, Johns Hopkins University, Baltimore MD, USA
[email protected]

Abstract - The data-driven parallelization framework Hadoop MapReduce allows analysing large data sets in a scalable way. Since the development of MapReduce programs can be a time-intensive and challenging task, the application and usage of Hadoop in Biomedical Research is still limited. Here we present Cloudflow, a high-level framework to hide the implementation details of Hadoop and to provide a set of building blocks to create biomedical pipelines in a more intuitive way. We demonstrate the benefit of Cloudflow on three different genetic use cases and show how the framework can be combined with the Hadoop workflow system Cloudgene and the cloud orchestration platform CloudMan to provide Hadoop pipelines as a service to everyone.
The framework is open source and freely available at https://github.com/genepi/cloudflow.

I. INTRODUCTION

Since the advent of high-throughput technologies in the field of molecular biology (i.e. Next Generation Sequencing (NGS)), ever more data is produced and needs to be analysed. Thus, molecular biology has evolved into a big data science, where the bottleneck is no longer the production of raw data in the laboratory, but its analysis and interpretation [1]. To scale with the increasing data volume and the number of available resources, workflows need to be parallelized efficiently. MapReduce and Cloud Computing constitute an attractive alternative to deal with these large datasets [2]. However, writing a MapReduce job can be a challenging task that prevents domain experts from using such models in their daily work. Additionally, the reusability of the mapper and reducer functions is limited, resulting in a use-case-specific implementation for every problem.
Existing high-level languages on top of Hadoop facilitate the development process and allow writing MapReduce jobs in the form of queries. For example, Apache Pig (http://pig.apache.org/) provides a compiler to translate such queries into an execution plan, which is then automatically translated into a sequence of MapReduce jobs. Apache Pig provides interfaces for filter, group and join operations, extendible for application-specific logic using user defined functions (UDFs). However, Apache Pig has been developed to analyze datasets based on the relational model, and the execution of complex calculations across several rows is limited. As a consequence, users have to write complex UDFs. Again, such functions need to be implemented by the end users and can be, similar to native MapReduce jobs, hard to implement, test and maintain. Several projects (e.g. SeqPig [3] or BioPig [4]) make use of UDFs and provide a collection of Pig scripts to researchers in genetics. The goal is to provide ready-to-use workflows to end users, which can then be adapted to their use case. Nevertheless, combining different scripts into a pipeline or reusing existing building blocks is not a key feature of these approaches.
To overcome this issue, FlumeJava [5] proposed a new concept to compose pipelines based on immutable parallel collections, where several operations can be used to process them in a parallel way. This concept was successfully implemented in Apache Crunch (https://crunch.apache.org/), which executes the pipelines as Hadoop MapReduce jobs. However, the utilization of such pipelines in Bioinformatics is still limited.
In this paper we present Cloudflow, a MapReduce pipeline framework based on a similar concept as proposed by [5]. In contrast to existing approaches, Cloudflow was developed to simplify pipeline creation in biomedical research, especially in the field of genetics. For that purpose, Cloudflow supports a variety of NGS data formats and contains a rich collection of built-in operations for analyzing such datasets (e.g. quality checks, mapping reads or variation calling). The main concept behind our approach is to break complex data analysis steps into three basic operations. All further use-case-specific operations are built by implementing or extending one of the basic operations. Pipelines are then composed by creating a sequence of these operations. The framework itself translates the set of pipeline operations into one or more MapReduce jobs and decides which of the operations are executed in the map or in the reduce phase. Thus, Cloudflow hides the complexity and the implementation details of MapReduce jobs, allowing scientists to build pipelines in an intuitive way. Moreover, Cloudflow can be utilised in combination with the workflow system Cloudgene [6]. To validate our approach, we developed three genetic data-analysis pipelines with Cloudflow. The results demonstrate that our contribution (a) helps minimize the development time, (b) increases the reusability of code, and (c) creates only a minimal overhead in terms of execution time compared to an identical MapReduce implementation.

II. MAPREDUCE BACKGROUND

MapReduce is a parallel programming model introduced by Google in 2004 with the aim to develop a simple and scalable method to process large datasets on several machines in parallel. The main idea behind this distributed programming model is that a long-running calculation is split into a map and a reduce phase, which contain all the logic behind the calculation and are specified by the user. The underlying framework itself takes care of parallelization, task scheduling, load balancing and fault tolerance. Due to its simplicity, MapReduce is used to solve many scientific problems where large-scale computing is needed. Moreover, MapReduce is ideal for parallel batch processing of terabytes of input data. The data flow of a MapReduce program consists of several steps, where only the map and reduce function are problem-specific and the other steps are loosely coupled with the problem and generalized. In the first step, the input data set is split into key/value pairs and a user-defined map function is executed for each pair:

map(key, value) -> list((key_i, value_i)), where i = 0 ... n

The map function reads a pair, performs some problem-specific calculations and produces zero to n intermediate key/value pairs for each input pair. In the next step, the intermediate key/value pairs are grouped by similar keys and a merged list of all values for each key is created. Finally, the user-defined reduce function is applied to the intermediate key/list pairs:

reduce(key_i, list(value_i)) -> list(out_value)

The values created by the reduce function are the final outputs of a MapReduce job.
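
To make these two signatures concrete, the following sketch (not part of the original paper) shows how the problem-specific map and reduce functions of a word count look when written directly against the Hadoop Java API; the class names TokenMapper and CountReducer are illustrative. The grouping of the intermediate (word, 1) pairs between the two phases is performed by the framework.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emits an intermediate (word, 1) pair for every word of a line.
class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            word.set(token);
            context.write(word, ONE);
        }
    }
}

// Reduce phase: receives all counts grouped under one word and sums them up.
class CountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable value : values) {
            sum += value.get();
        }
        context.write(key, new IntWritable(sum));
    }
}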
A. Apache Hadoop
Apache Hadoop is an open source project including several sub-projects for distributed computing. It includes the most widely used open-source implementation of Google's MapReduce framework (simply called MapReduce), the distributed file system HDFS, and several other sub-projects. In general, all previously discussed features of the original MapReduce approach are implemented in Apache Hadoop. Writing a MapReduce job with Hadoop can either be done directly in Java by implementing the relevant methods or by using Hadoop's streaming mode, which allows specifying a script or any executable as the mapper or reducer. Two different versions of MapReduce are available. MRv1 includes a namenode (the centerpiece of HDFS), a secondary namenode (merging logs into a file system snapshot), a job-tracker (job assignment) and several task-trackers (workers). MRv2, or Hadoop YARN, splits the work of the job-tracker (resource management, job scheduling) into two different daemons, namely a resource manager and an application master. Additionally, a node manager is introduced to manage the user processes per node. The application master describes a framework that works together with the node manager and the resource manager to monitor and to execute tasks. The idea behind this new model is that applications using frameworks other than MapReduce (e.g. Apache Spark) can be executed via YARN as well.
B. High-Level Languages based on MapReduce
To simplify the implementation of MapReduce jobs,
Apache Pig has been introduced. It is based on HDFS and
MapReduce and allows a fast implementation of data
flows with already available data operations such as join,
filter or group by. Data flows are specified in the Pig Latin
language that describes a directed acyclic graph (DAG)
and defines how data should be processed. A further advantage of Apache Pig is its ability to check the data flow for optimizations, e.g. whether two grouping statements can be combined. The cost of writing code in Apache Pig is lower than setting up a Java project and implementing the functions. But, as stated earlier, Apache Pig includes only a limited number of operators, and writing MapReduce jobs directly in Java yields speed advantages compared to Apache Pig.
C. MapReduce Pipelining
The pipeline framework FlumeJava [5] is based on the concept of immutable parallel collections. This kind of data structure can be used to process and analyze its items in parallel. The end user can either use one of the predefined functions or combine them with their own. Each function is implemented as a parallel for-each loop, which is then translated by the framework into a series of MapReduce jobs. During the translation, the framework itself decides if such a function should be executed locally (i.e. sequentially) or remotely (i.e. in parallel). Apache Crunch is a freely available open source implementation of FlumeJava.
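
As an illustration of this programming model (not taken from the paper), a word count in Apache Crunch could look roughly as follows; the sketch assumes the standard Crunch classes MRPipeline, DoFn and Writables:

import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class CrunchWordCount {
    public static void main(String[] args) {
        // A Crunch pipeline that is compiled into one or more MapReduce jobs.
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class);
        PCollection<String> lines = pipeline.readTextFile(args[0]);

        // parallelDo corresponds to FlumeJava's parallel for-each loop.
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
            @Override
            public void process(String line, Emitter<String> emitter) {
                for (String word : line.split("\\s+")) {
                    emitter.emit(word);
                }
            }
        }, Writables.strings());

        // count() groups equal words and sums their occurrences.
        PTable<String, Long> counts = words.count();
        pipeline.writeTextFile(counts, args[1]);
        pipeline.done();
    }
}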
III. CLOUDFLOW

The overall idea behind Cloudflow is to simplify the creation of analysis pipelines by encapsulating complex data analysis steps in simple operations. This approach helps hide the complexity and the implementation details of complex data-parallel pipelines. Moreover, the concept of using basic operations increases reusability and enables testing the operation logic on a local workstation with existing unit testing frameworks.
Since Cloudflow uses the Hadoop framework for pipeline execution, it offers parallel data processing, data reliability and fault tolerance out of the box. This fact is especially important in the field of Cloud Computing, where infrastructure often relies on commodity hardware and nodes can fail on a regular basis (e.g. due to misconfiguration or hardware failures). At the same time, the architecture of Cloudflow is independent from MapReduce; it provides parallelization constructs and abstraction interfaces that can be used to extend the system by implementing other parallel programming models in the future (e.g. translating the operations into Apache Spark jobs).
Instead of developing a new declarative language for the pipeline composition, we developed a clear Java API. The proposed framework implements different patterns to speed up the pipeline creation, to be extensible and to support test-driven pipeline development. The following section gives an overview of the abstraction, explains the basic operations in detail, and shows how pipelines are created.
A. Data Types and Basic Operations
Cloudflow operates on records consisting of a key/value pair, whereby different record types are available (e.g. TextRecord, IntegerRecord, FastqRecord). A loader class is responsible for loading the input data and converting it into an appropriate record type. As mentioned earlier, Cloudflow supports three different basic operations, which are used to analyze and transform records: transform, summarize, and group operations.
The transform-operation is used to analyze one input record and to create between 0 and n output records. The user implements the computational logic for this operation by extending an abstract class. This class provides a simple function, which is executed by the Cloudflow framework for all input records in parallel:

class MyTransformer extends Transformer {
    public void transform(Record record) {
        // problem-specific logic, executed for every input record
        doSomethingInParallel();
        emit(new Record());
    }
}

The summarizer operates on a list of records, whereby records with the same key are grouped. Thus, the signature of the process method takes the key and a list of records as input:

class MySummarizer extends Summarizer {
    public void summarize(Key key, List<Record> records) {
        // problem-specific logic, executed once per key group
        doSomethingInParallel();
        emit(new Record());
    }
}

The group-operation is a special operation which takes a list of records as input and creates a record group with the same key. Our framework automatically inserts a group-operation between a transform- and a summarize-operation. This ensures that output records of the transform-operation are compatible with the input records of the summarize-operation. The group-operation is realized by using the shuffle phase of a MapReduce job. Based on these three operations, the user defines pipelines by building sequences of operations. A pipeline has to start with a transform-operation; all further operations are optional and can be used in arbitrary order.
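
For illustration (not from the paper), such a sequence could be written with the pipeline builder roughly as follows; the generic summarize() builder method is an assumption, while loadText(), transform(), groupByKey() and save() appear in the listings later in the paper:

pipeline.loadText(input)
    .transform(MyTransformer.class)
    .groupByKey()
    .summarize(MySummarizer.class)   // assumed builder method for a custom summarizer
    .save(output);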
B. Extended Operations
Complex operations are built by combining one or more basic operations. We already implemented several standard operations that are helpful for the analysis of text or numerical data records:

- Filter: this operation is a special transform-operation, which emits the record to the subsequent operation only if a user-defined condition is fulfilled (see the sketch after this list).

- Split: this transform-operation calculates a new split level (i.e. a new key) for each input record. This key is used by the group-by operation to create chunks, which can then be analyzed by a user-defined summarize-operation.

- Aggregation (sum, mean): this defines a group-by operation followed by a summarize-operation to aggregate all values with the same key (e.g. calculating the mean of all values). One record with the aggregated value is then emitted to the subsequent operation.

- Executor: this summarize-operation writes all grouped values into a file on the local disk, which is then used as the input of an external UNIX command line program. Based on the lines of the output file, new records are created and emitted to the subsequent operation.
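
A minimal sketch of such a filter, written here against the Transformer base class from Section III.A (the actual Cloudflow filter base class may differ, so the example is purely illustrative):

class MinLengthFilter extends Transformer {
    public void transform(TextRecord record) {
        // re-emit the record only if it fulfils the condition;
        // otherwise nothing is emitted and the record is discarded
        if (record.getValue().length() >= 50) {
            emit(record);
        }
    }
}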

Since Cloudflow's operations are based on the Composite pattern, all these extended operations can also be used as a basis for new operations. In addition, this makes it possible to split complex operations into several sub-operations, which improves testing and maintenance.
C. Pipeline Composition
The user builds pipelines by connecting several operations with compatible interfaces. For this purpose our framework implements the Builder pattern, which enables (a) building complex pipelines, (b) providing type safety and (c) the implementation of domain-specific builders (see Section III.D). In addition, the Builder pattern ensures that only a valid sequence of operations can be created (i.e. after the group-by operation a summarize-operation has to be added).

class LineToWords extends Transformer {
    public void transform(TextRecord rec) {
        // split the input line into words and emit a (word, 1) record each
        String[] words = rec.getValue().split(" ");
        for (String word : words) {
            emit(new IntegerRecord(word, 1));
        }
    }
}

pipeline.loadText(input)
    .transform(LineToWords.class)
    .sum()
    .save(output);
Listing 1. WordCount Example using Cloudflow

To help the user and accelerate the pipeline composition process, Cloudflow already provides a set of useful operations. This has the advantage that even the default WordCount example can be broken down into a few simple operations and defined in a single statement (see Listing 1). In a first step, the text file is loaded from HDFS (loadText). Then, for each record (i.e. line of input) the application-specific LineToWords operation is executed, which splits the line into words and creates a new record for each word. In the last step a predefined sum operation is executed. It extends the pipeline by a group-by operation and a summarize-operation in order to sum up all the values for a certain key.
For frequently used operations (e.g. sum, mean or
count), we created special builder functions, which extend
the pipeline and improve the code readability by keeping
the code simple.

Figure 1. Cloudflow translates the operation sequence automatically into an executable MapReduce job

D. Pipeline Execution
Before the execution, Cloudflow checks the compatibility of input and output records of consecutive operations. This ensures that only valid and executable pipelines are submitted to a Hadoop cluster.
If the pipeline is valid and executable, the operation sequence is translated into an execution plan that decides whether an operation is executed in the map or in the reduce phase. Based on this plan, Cloudflow creates one or more MapReduce jobs and configures them to execute the user-defined operations in the correct order. In this translation step, Cloudflow tries to minimize the number of MapReduce jobs by combining consecutive transform-operations and by executing all transform-operations after a summarize-operation in the same reducer instance (see Figure 1).
For additive summarize-operations (e.g. sum), Cloudflow takes advantage of Hadoop's combiner functionality. The idea of this improvement is to combine the key/value pairs that are generated by all map tasks on the same machine into fewer pairs. Thus, the number of pairs that are transferred between mapper and reducer is minimized, which has a positive effect on the network bandwidth since unnecessary communication is avoided.
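
Cloudflow configures this automatically; in a hand-written Hadoop job the same optimization would be enabled explicitly, for example by registering a reducer such as the CountReducer sketched earlier as combiner (illustrative code, standard Hadoop job API):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "wordcount");
        job.setJarByClass(WordCountJob.class);
        job.setMapperClass(TokenMapper.class);
        // the combiner pre-aggregates (word, 1) pairs on each mapper machine
        job.setCombinerClass(CountReducer.class);
        job.setReducerClass(CountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}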
IV. CLOUDFLOW FOR BIOINFORMATICS

Cloudflow provides a variety of already implemented utilities which facilitate the creation of pipelines in the field of Bioinformatics (especially for NGS data in genetics). For that purpose, we implemented, based on Hadoop-BAM [7], several record types and loader classes in order to process FASTQ, BAM and VCF files. Moreover, we created several operations and filters for the analysis of biological datasets (see Table I for an overview of all currently implemented operations and filters).
For example, a typical quality control pipeline for VCF files can be implemented by simply combining several built-in operations. First, we apply predefined filters to discard variations that are monomorphic, marked as duplicates, or are insertions or deletions (InDels). For all records passing the filters, Cloudflow applies a summarize-operation that calculates the call rate for each variation (see Listing 2).
class CallRateCalc extends Transformer {
    public void transform(VcfRecord record) {
        VariantContext snp = record.getValue();
        float call = callRate(snp);
        emit(new FloatRecord(snp.getID(), call));
    }
}

pipeline.loadVCF(input)
    .filter(MonomorphicFilter.class)
    .filter(DuplicateFilter.class)
    .filter(InDelFilter.class)
    .transform(CallRateCalc.class)
    .save(output);
Listing 2. VCF Quality Control Pipeline using Cloudflow

V. DEPLOYING PIPELINES AS A SERVICE

Cloudgene [6] is a web-based platform to create and execute workflows consisting of Hadoop MapReduce, Apache Pig and command line-based programs. It can be seen as an additional layer between Hadoop MapReduce and the end user that hides the complexity of the MapReduce framework. Therefore, Cloudgene is the perfect candidate to provide Cloudflow pipelines as a service. Such pipelines can be integrated into the workflow platform by utilizing Cloudgene's plugin interface. No adaptation of the source code is needed; only a simple plain text file including a header, input parameters, output parameters and the definition of the workflow itself needs to be created. When launching Cloudgene, this manifest file is loaded and the client interface is automatically rendered using information from the file. As Cloudgene supports different technologies, it is possible to parallelize the calculations using Cloudflow and to visualize the results using R.
Cloudflow requires a compatible MapReduce cluster for executing pipelines. CloudMan [8] makes it possible to easily procure and configure a functional data analysis platform on a cloud infrastructure. The procured platform delivers a scalable cluster-in-the-cloud and a data analysis environment preconfigured with a number of applications. With its ability to be launched and managed via a web browser on a number of clouds, customized as necessary, and easily shared with collaborators, CloudMan makes it possible to readily utilize cloud resources in a research environment. How Cloudgene and CloudMan can be combined efficiently has already been demonstrated [9].

TABLE I. CURRENTLY SUPPORTED DATA FORMATS AND OPERATIONS

FASTQ
  Split:  split() - Find pairs (for paired-end reads)
  Filter: filter(LowQualityReads.class) - Filters reads by quality
          filter(SequenceLength.class) - Filters reads by sequence length
  Other:  findPairedReads() - Detects read pairs
          align(referenceSequence) - Aligns sequences against a reference (using jBWA for alignment)

BAM
  Split:  split() - Creates fixed-size chunks (e.g. 64 MB)
          split(5, BamChunk.MBASES) - Creates logical chunks (e.g. 5 MBases)
  Filter: filter(UnmappedReads.class) - Filters unmapped reads
          filter(LowQualityReads.class) - Filters reads by mapping quality
  Other:  findVariations() - Finds variations in aligned reads (using samtools)

VCF
  Split:  split() - Creates fixed-size chunks (e.g. 64 MB)
          split(5, VcfChunk.MBASES) - Creates logical chunks (e.g. 5 MBases)
  Filter: filter(MonomorphicFilter.class) - Filters monomorphic sites
          filter(DuplicateFilter.class) - Filters duplicates
          filter(InDelFilter.class) - Filters InDels
          filter(CallRateFilter.class) - Filters by call rate
          filter(MafFilter.class) - Filters by MAF
  Other:  checkAlleleFreq(reference) - Allele frequency check with external reference (e.g. 1000 Genomes)
VI. EVALUATION

To evaluate our approach, we implemented three different Bioinformatics data-analysis pipelines using Cloudflow and integrated them into Cloudgene. The results of our experiments demonstrate that Cloudflow has only a minimal overhead in the execution time compared to an identical pure MapReduce implementation. However, the performance of Cloudflow is better than Apache Crunch's (see Figure 2).

Figure 2. Execution time of a Cloudflow pipeline, an Apache Crunch pipeline and a pure MapReduce implementation of WordCount

A. Preprocessing and Mapping
When working with NGS data, the quality of the raw data needs to be checked before a successful subsequent downstream analysis (e.g. read mapping/alignment) can be achieved. The overall goal of alignment is to dock the vast amount of short reads, mostly in the FASTQ file format, to a reference genome. Factors such as read errors and insertions or deletions of bases must be considered in order to determine the most accurate genome position for each read.
The goal of the following pipeline is the alignment of paired-end data to a reference genome. The FASTQ data is loaded using the FastqLoader, which creates records for each sequence. The records are then filtered by an average base quality of 30. Numerous other quality metrics (such as sequence length or C/G content) can be filtered as well. Since paired-end reads are used, read pairs are detected using a predefined transform-operation. The aligner step is implemented as a summarize-operation that calls a parallelized version of BWA-MEM [10]. This has been achieved by using JNI (https://github.com/lindenb/jbwa). Similar to BWA-MEM, 99,100 reads are aligned to the user-specified reference genome in one batch. Aligned reads are saved in HDFS in the BAM file format (see Listing 3.A).

B. Variation Calling
After the data has been aligned and cleaned (e.g. removing duplicates, quality recalibration), the next step of NGS pipelines is the detection of reliable variants that can be used e.g. in association studies. A widely used pipeline for variant detection is GotCloud, developed at the Center for Statistical Genetics (University of Michigan) and utilized in the 1000 Genomes (1000G) project.

In this example pipeline, the aim is to find variations without a statistical model by implementing a simple counting approach over the four bases. This is only possible when using high-coverage data. Therefore, in the first step the BAM file (created in the previous pipeline) is loaded and chunked into user-specified splits (e.g. 5 MBases). Variations are then detected for each chunk by counting the occurrences of A, C, G, T at each position. Finally, the detected variations are stored in VCF files (see Listing 3.B).
C. Genome-Wide Association Studies
Many genome-wide association studies (GWAS) have identified associations between various phenotypes and common sequence polymorphisms, which might play a role in disease development. Technologies like microarrays have made it possible to measure millions of single nucleotide polymorphisms (SNPs) of one individual simultaneously and at low cost. Since the cost of microarrays is much lower than that of next-generation sequencing (NGS), they are today the cheapest method to genotype large-scale population studies. Such datasets are combined with collected phenotypes (e.g. diseases and measured data) in order to detect whether one of these variations has a high impact on the value of a phenotype. Since the size of such datasets grows rapidly, parallelization at the data level is necessary to analyze the data in an appropriate time.

The parallelization of the association analysis was realized by splitting the list of markers into chunks. In detail, the mapper splits all input SNPs into chunks with a fixed number of SNPs (e.g. 1000). Then, the reducer executes the linear regression model for each chunk by using SNPTest. Finally, the reducer collects the results and merges them into a single file. The corresponding Cloudflow pipeline loads the text input file and automatically creates records for each line. On these records we apply the split operation, which creates chunks containing a fixed number of lines. For the execution of the SNPTest program, we can implement a special operation called BinaryExecutor, which enables us to write the chunks automatically to the POSIX file system. In the next step, we can use this file as the input file for SNPTest. After the execution, the operation creates text records for each line of the results (see Listing 3.C).

//A. Preprocessing and Mapping
pipeline.loadFastq(input)
    .filter(LowQualityReads.class, 30)
    .findPairedReads()
    .align(refSeq)
    .save(output);

//B. Variation Calling
pipeline.loadBam(input)
    .split(5, ChunkSize.MBASES)
    .groupByKey()
    .findVariations(refSeq)
    .save(output);

//C. Genome-Wide Association Study
pipeline.loadText(input)
    .split(1000, ChunkSize.LINES)
    .execute(SnpTestExecutor.class)
    .filter(FilterHeader.class)
    .filter(FilterInvalidSnps.class)
    .save(output);

Listing 3. Complete NGS pipeline using Cloudflow

VII. CONCLUSION

Cloudflow's overall aim is to simplify the development of complex MapReduce pipelines by abstracting the map and the reduce function from end users. Therefore, operations need only be written once and can be re-used in future MapReduce pipelines. The major advantage of Cloudflow lies in the provision of validated operations, especially in the area of genetics, and in its extensibility. Combining Cloudflow with CloudMan (cluster orchestration) and Cloudgene (Hadoop workflow system) allows users to use Hadoop without a deeper knowledge of the internal MapReduce concepts and could boost the adoption of Hadoop in genetics.

ACKNOWLEDGMENT
This work was, in part, supported by the Scalable Big Data Bioinformatics Analysis in the Cloud grant from the Croatian Ministry of Science, Education, and Sport and the Austrian Federal Ministry of Science and Research (BMWF), and by the FP7-PEOPLE programme grant 277144 (AIS-DC).

REFERENCES

[1] V. Marx, "Biology: The big challenges of big data," Nature, vol. 498, pp. 255-260, 2013.
[2] S. Yazar, G. E. C. Gooden, D. A. Mackey, and A. W. Hewitt, "Benchmarking undedicated cloud computing providers for analysis of genomic datasets," PLoS One, vol. 9, no. 9, p. e108490, Jan. 2014.
[3] A. Schumacher, L. Pireddu, M. Niemenmaa, A. Kallio, E. Korpelainen, G. Zanetti, and K. Heljanko, "SeqPig: simple and scalable scripting for large sequencing data sets in Hadoop," Bioinformatics, vol. 30, no. 1, pp. 119-120, Jan. 2014.
[4] H. Nordberg, K. Bhatia, K. Wang, and Z. Wang, "BioPig: a Hadoop-based analytic toolkit for large-scale sequence data," Bioinformatics, p. btt528, Oct. 2013.
[5] C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum, "FlumeJava," ACM SIGPLAN Not., vol. 45, no. 6, p. 363, May 2010.
[6] S. Schönherr, L. Forer, H. Weißensteiner, F. Kronenberg, G. Specht, and A. Kloss-Brandstätter, "Cloudgene: A graphical execution platform for MapReduce programs on private and public clouds," BMC Bioinformatics, vol. 13, no. 1, p. 200, 2012.
[7] M. Niemenmaa, A. Kallio, A. Schumacher, P. Klemelä, E. Korpelainen, and K. Heljanko, "Hadoop-BAM: Directly manipulating next generation sequencing data in the cloud," Bioinformatics, p. bts054, Feb. 2012.
[8] E. Afgan, B. Chapman, and J. Taylor, "CloudMan as a platform for tool, data, and analysis distribution," BMC Bioinformatics, vol. 13, no. 1, p. 315, Jan. 2012.
[9] L. Forer, T. Lipic, S. Schönherr, H. Weißensteiner, D. Davidović, F. Kronenberg, and E. Afgan, "Delivering bioinformatics MapReduce applications in the cloud," in Information and Communication Technology, Electronics and Microelectronics (MIPRO), 2014 37th International Convention on, 2014, pp. 373-377.
[10] H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," pp. 1-3, 2013.
