Cloudflow - A Framework for MapReduce Pipeline Development in Biomedical Research
I. INTRODUCTION
II. MAPREDUCE BACKGROUND
III. CLOUDFLOW
Filter: this operation is a special transform-operation, which emits the record to the subsequent operation if and only if a user-defined condition is fulfilled.
Split: this transform-operation calculates a new split level (i.e. a new key) for each input record. This key is used by the group-by operation to group the records.
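The filter behaviour described above can be sketched as follows. Note that `Record`, `Filter`, and the quality threshold used here are illustrative assumptions for this sketch, not Cloudflow's actual API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type: a key plus a quality score.
class Record {
    final String key;
    final int quality;
    Record(String key, int quality) { this.key = key; this.quality = quality; }
}

// Sketch of a filter operation: a record is emitted to the subsequent
// operation if and only if the user-defined condition holds.
abstract class Filter {
    abstract boolean accept(Record record);

    List<Record> apply(List<Record> input) {
        List<Record> output = new ArrayList<>();
        for (Record r : input) {
            if (accept(r)) {
                output.add(r);  // emit only records fulfilling the condition
            }
        }
        return output;
    }
}

// Example condition: drop reads below an (assumed) quality threshold of 20.
class LowQualityReads extends Filter {
    @Override
    boolean accept(Record record) { return record.quality >= 20; }
}
```

A user then only implements the `accept` condition; the framework decides when and where the filter runs.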
Figure 1. Cloudflow automatically translates the operation sequence into an executable MapReduce job
D. Pipeline Execution
Before execution, Cloudflow checks the compatibility of the input and output records of consecutive operations. This ensures that only valid and executable pipelines are submitted to a Hadoop cluster.
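This compatibility check can be sketched as a type check over the operation sequence; the `Operation` class and the validator below are assumptions for illustration, not Cloudflow's real implementation:

```java
import java.util.Arrays;
import java.util.List;

// Sketch: each operation declares the record type it consumes and the record
// type it produces.
class Operation {
    final Class<?> inputType;
    final Class<?> outputType;
    Operation(Class<?> in, Class<?> out) { inputType = in; outputType = out; }
}

class PipelineValidator {
    // A pipeline is valid iff each operation accepts the record type
    // produced by its predecessor.
    static boolean isValid(List<Operation> ops) {
        for (int i = 1; i < ops.size(); i++) {
            if (!ops.get(i).inputType.isAssignableFrom(ops.get(i - 1).outputType)) {
                return false;
            }
        }
        return true;
    }
}
```

Rejecting an incompatible sequence at submission time is much cheaper than letting a long-running Hadoop job fail on the cluster.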
If the pipeline is valid and executable, the operation sequence is translated into an execution plan, which decides whether an operation is executed in the map phase or in the reduce phase. Based on this plan, Cloudflow creates one or more MapReduce jobs and configures them to execute the user-defined operations in the correct order. In this translation step, Cloudflow tries to minimize the number of MapReduce jobs by combining consecutive transform-operations and by executing all transform-operations after a summarize-operation in the same reducer instance (see Figure 1).
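The combining of consecutive transform-operations might look like the following sketch, where several per-record transforms are fused into one function that runs inside a single map task (the class and method names are assumptions, not Cloudflow's API):

```java
import java.util.function.Function;

// Sketch: consecutive transform-operations are composed into one function,
// so they execute in a single map phase instead of separate MapReduce jobs.
class TransformChain<T> {
    private Function<T, T> chain = Function.identity();

    TransformChain<T> then(Function<T, T> op) {
        chain = chain.andThen(op);  // fuse the next transform into the chain
        return this;
    }

    T run(T input) {
        return chain.apply(input);  // invoked once per record in the mapper
    }
}
```

For example, `new TransformChain<String>().then(String::trim).then(String::toUpperCase)` runs both transforms on each record in one pass, avoiding an extra job and the associated HDFS round trip.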
For additive summarize-operations (e.g. sum), Cloudflow takes advantage of Hadoop's combiner functionality. The idea of this optimization is to combine the key/value pairs generated by all map tasks on the same machine into fewer pairs. Thus, the number of pairs transferred between mapper and reducer is minimized, which reduces the load on the network since unnecessary communication is avoided.
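A minimal sketch of such an additive combine step, using plain Java collections instead of Hadoop's actual Combiner interface:

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Local pre-aggregation as performed by a combiner for an additive operation
// such as sum: the many key/value pairs produced by one machine's map tasks
// collapse into a single pair per key before crossing the network.
class SumCombiner {
    static Map<String, Long> combine(List<Map.Entry<String, Long>> pairs) {
        Map<String, Long> combined = new HashMap<>();
        for (Map.Entry<String, Long> pair : pairs) {
            combined.merge(pair.getKey(), pair.getValue(), Long::sum);
        }
        return combined;  // only these fewer pairs are sent to the reducer
    }
}
```

This is only correct for associative and commutative operations like sum, which is why Cloudflow restricts the optimization to additive summarize-operations.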
IV.
V.
Data  | Type   | Pipeline Operation              | Description
------|--------|---------------------------------|--------------------
FASTQ | Split  | split()                         |
      | Filter | filter(LowQualityReads.class)   |
      | Filter | filter(SequenceLength.class)    |
      | Other  | findPairedReads()               |
      | Other  | align(referenceSequence)        |
BAM   | Split  | split()                         |
      | Split  | split(5, BamChunk.MBASES)       |
      | Filter | filter(UnmappedReads.class)     |
      | Filter | filter(LowQualityReads.class)   |
      | Other  | findVariations()                |
VCF   | Split  | split()                         |
      | Split  | split(5, VcfChunk.MBASES)       |
      | Filter | filter(MonomorphicFilter.class) |
      | Filter | filter(DuplicateFilter.class)   | Filters duplicates
      | Filter | filter(InDelFilter.class)       | Filters InDels
      | Filter | filter(CallRateFilter.class)    |
      | Filter | filter(MafFilter.class)         | Filters by MAF
      | Other  | checkAlleleFreq(reference)      |
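Combining the operations listed above into a pipeline might look like the following sketch; the `Pipeline` builder and its fluent methods are written here for illustration and are not guaranteed to match Cloudflow's actual API:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical filter-condition classes from the table.
class UnmappedReads {}
class LowQualityReads {}

// Hypothetical fluent builder mirroring the operation sequence in the table.
class Pipeline {
    private final List<String> ops = new ArrayList<>();

    Pipeline split() { ops.add("split"); return this; }
    Pipeline filter(Class<?> condition) {
        ops.add("filter:" + condition.getSimpleName());
        return this;
    }
    Pipeline findVariations() { ops.add("findVariations"); return this; }

    List<String> operations() { return ops; }
}
```

A BAM use case from the table would then read as one fluent chain, `new Pipeline().split().filter(UnmappedReads.class).filter(LowQualityReads.class).findVariations()`, which Cloudflow translates into MapReduce jobs as described in Section III.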
EVALUATION
VII. CONCLUSION