A Survey of Whole Genome Alignment Tools and Frameworks Based on Hadoop's MapReduce
978-1-5090-3477-2/16/$31.00
IWBIS 2016, © 2016 IEEE

Abstract: Next-generation DNA sequencing (NGS) projects, which aim to improve the understanding of various genes, appear to boost innovative breakthroughs in whole-genome issues. Dealing with genomic data requires large-scale data storage and processing. Big data technology could be the most appropriate solution for gaining useful knowledge from data comprehensively. This study discusses genome tools and frameworks that implement MapReduce, one of Hadoop's components, in sequence alignment computation. The aim of this discussion is to present an overview of whole genome alignment software tools and their implementation in big data.

Keywords: genome sequence alignment; multiple sequence alignment; MapReduce; Hadoop; big data

I. INTRODUCTION

In recent years, DNA sequencing has become more popular, especially in the field of bioinformatics. Next-generation DNA sequencing (NGS) projects, which aim to improve the understanding of various genes, have gained momentum and attract many researchers in search of more innovative breakthroughs in whole-genome-related issues [1].

Single Nucleotide Polymorphism (SNP) is one of the components that drives the sheer size of such projects. An SNP is a variation at a single position in a DNA sequence among individuals [2]. A single sequence of SNPs can determine the traits an individual might have. The most significant SNPs serve as biomarkers that help scientists locate abnormal genes, usually in association with diseases. There are roughly 10 million SNPs in a single human genome alone. In bioinformatics, where the genomic data of thousands to billions of humans need to be stored, this involves not only terabytes of data but petabytes or even exabytes. One of the world's largest biology-data repositories, the European Bioinformatics Institute (EBI), for instance, stored 20 petabytes of data as of 2013, rising to 75 petabytes in 2015 [3].

Managing such large data, especially in the current era of data explosion, is always a big challenge. Cloud computing has been proposed to solve the issue. It has been well received because it can store large datasets in the cloud. It might still be feasible if a set of genomes only had to be transferred through the Internet a few times, yet the data keep growing and the demand to store data of this enormous size is urgent, so relying only on cloud storage is impractical [3]. Besides the size, the security of cloud storage and computing is yet another issue. These problems urge bioinformatics researchers to harness big data technology that is able not only to securely store massive data in any format but also to provide further understanding of the data in order to gain useful knowledge comprehensively.

A. Big Data Framework

Hadoop, as one of the giants among big data frameworks, is currently being utilized in almost all fields of data mining that involve large data. Hadoop provides data storage and processing for very large datasets. Hadoop processes data in batches, which brings flexibility in accepting all data formats. This makes it quite suitable for bioinformatics or biomedical data processing, since bioinformatics and biomedical data consist of both structured and unstructured data. These include genomic data in the first place, as well as data on diseases, patients' health records, clinical information, molecules, various cells, tissues, etc., which differ greatly in size, unit, and format. Some of them might be imagery data while others are unstructured handwritten data, and so on.

This study discusses genome tools used in sequence alignment. The aim of this discussion is to present an overview of whole genome alignment software tools and their implementation in big data. This article is arranged as follows. The first section briefly introduces some fundamental knowledge of big data and genome alignment.

Figure 1. Single Sequence Alignment. Nucleotides in red show matched pairs. Empty boxes represent gaps.
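The matched pairs and gaps of Figure 1 can be illustrated with a minimal sketch of global pairwise alignment by dynamic programming (the Needleman-Wunsch scheme). The function name global_align and the scores (match +1, mismatch -1, gap -1) are assumptions chosen for illustration; they are not taken from any of the surveyed tools, which rely on more elaborate scoring and heuristics.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; scoring values are illustrative assumptions."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # match or mismatch
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + pair:
                row_a.append(a[i - 1])
                row_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")          # gap in b
            i -= 1
        else:
            row_a.append("-")          # gap in a
            row_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))
```

For example, global_align("ACGT", "AGT") returns a score of 2 with the rows ACGT and A-GT, that is, three matched pairs and one gap.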
In the second section, the most recent whole genome alignment software tools are reviewed. In the third section of this article, some frameworks that utilize Hadoop's MapReduce are presented. In the last section, the conclusion is drawn and suggestions for selecting the most appropriate tools are presented.

II. SINGLE-MULTIPLE SEQUENCE ALIGNMENT

In order to find how similar or distant two individuals are, their whole genome sequences are aligned. A single alignment engages two whole genome sequences to see the similarity between them. Multiple Sequence Alignment (MSA), on the other hand, utilizes many single alignments. MSA aligns the results of each two single alignments. Each aligned position is scored as a match, a mismatch, or a gap, since a position in one sequence may have no counterpart in the other sequence (see Figure 1).

III. GENOME ALIGNMENT TOOLS

In this section, the most recent genome alignment tools are introduced. Some of these tools are implemented in a parallel environment while others are not. Table I lists the general specifications of each tool.

A. ProgressiveMauve

Launched in 2010, this package was quite popular at the time. Darling et al. [4] state that their tool can tackle the alignment of sequences that have undergone rearrangement, gene gain, and loss. The tool implements an objective scoring system called the sum-of-pairs breakpoint score. It is claimed that it can accurately detect rearrangement breakpoints when the sequences have unequal gene content. This is done by identifying local anchors shared by a subset of the target sequences. To remove erroneous alignments of unrelated genome sequences, a probabilistic alignment filtering method is also implemented [4].

B. ALignment of PHAges (Alpha)

This tool mainly proposes multiple sequence alignment of bacteriophages (viruses that attack bacteria), for which similarity in genomic comparison does not necessarily mean similar or identical functionality, unlike in other genes [5]. In other cases, the Alpha tool can be used to predict that genes have similar biological functions even when detectable similarity is lacking. Alpha implements the standard partial order alignment in its algorithm. It also tries to alleviate misalignments that progressiveMauve
might produce, since the combinatorial properties of alignments in progressiveMauve are linearized [5].

C. GEnome Comparison with K-mers Out-of-core (GECKO)

GECKO focuses on tackling the computational inefficiency caused by huge data. GECKO implements an out-of-core algorithm to overcome this issue in pairwise alignment. In a comparison with several state-of-the-art tools, including ProgressiveMauve, GECKO was shown to be considerably more efficient in memory consumption. In practice, GECKO utilizes cloud computing using OpenStack. Owing to this ability, GECKO can be run not only on clusters but also on simple desktop PCs. Furthermore, GECKO is a modular tool; thus, users can add useful features such as K-mer frequency calculation, pre-visualization monitoring tools, and so on [6].

D. ANItools

ANItools calculates the average nucleotide identity (ANI) score of a bacterial species and then identifies the bacteria using this score. ANItools uses the pre-existing and widely used BLAST tool to align genomes. This further contributes to the classification of bacterial species. ANItools provides both a desktop version with a CLI and a web app with a friendly GUI for users with more basic computer skills. However, ANItools is available for academic use only and is licensed under the GNU General Public License version 3.0 [7].

E. REcursiVe Exact-matching ALigner (REVEAL)

REVEAL improves de novo sequence alignment by introducing a de novo approach to infer transcriptome that allows multiple alignment and constructs alignments recursively. As a result, REVEAL can be used to compute high-resolution genomes, to detect structural variations, and, most importantly, to construct MSAs without involving reference genomes [8].

F. HS-BLASTN

High speed nucleotide-nucleotide basic local alignment search tool (HS-BLASTN) proposes high-speed computation of nucleotide sequence alignment [9]. HS-BLASTN retrofits the previously existing MegaBLAST by adding a novel lookup table using the FMD-index [10]. This improvement makes the alignment process more efficient and less time-consuming. HS-BLASTN works in a parallel environment using several CPUs. As a result, it shows much better performance than the original MegaBLAST, running 10-20 times faster.

G. PaSWAS

Parallel Smith-Waterman Algorithm Software (PaSWAS) utilizes the power of the Smith-Waterman (SW) algorithm in aligning genome sequences. PaSWAS tackles almost all of the shortcomings of the original SW, such as being less accurate due to statistical issues, slow in computing large-scale data, and so on. By parallelizing the implementation of the SW algorithm using NVIDIA-based general-purpose GPUs (GPGPUs), PaSWAS is able to perform accurate and fast alignment regardless of the length of the sequences [11]. Currently, however, it can only handle local alignment [12].

As described above, most of the genome alignment tools use the FASTA format as their input. Meanwhile, their output formats vary. This variety depends on the objective of each tool, such as phylogenetic tree analysis, protein function analysis, profile analysis, etc. In addition, these tools mainly focus on proposing novel memory access methods for reducing the computational cost of parallel computing.

The main obstacle for recent genome alignment tools is the computational cost. In the next few years, many researchers will focus on building effective and efficient sequence alignment algorithms. Furthermore, many algorithms will also be proposed to find the significant fragments in a sequence. Computing sequence alignment on significant genome fragments will reduce the computational cost drastically.

IV. GENOME ALIGNMENT FRAMEWORKS

In this section, some genome alignment frameworks are described. SEAL, Crossbow, CloudBurst, and CloudAligner are among the widely used frameworks. These frameworks work with the MapReduce mechanism of Hadoop since they are built in the Hadoop environment [13].

Hadoop consists mainly of two components, i.e., HDFS (Hadoop Distributed File System) and MapReduce. The former is designed to store very large data in a distributed environment.
[Figure 2 diagram. Labels: Preprocessed Genome Sequence Data; Cluster (Node 1, Node 2, ..., Node n); MapReduce (Map, Sort & Shuffle, Reduce); Alignment using Genome Alignment Package; Result stored in HDFS]
Figure 2. Common architecture of genome alignment process in Hadoop framework by utilizing MapReduce
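The Map, Sort & Shuffle, and Reduce stages labeled in Figure 2 can be mimicked on a single machine with a minimal sketch. It uses only the Python standard library and no Hadoop API; the function names (map_phase, shuffle_phase, reduce_phase, count_kmers) and the choice of k-mer counting as the job are assumptions made for illustration, not the implementation of any surveyed framework.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(reads, k):
    """Map: emit a (k-mer, 1) tuple for every k-mer found in every read."""
    for read in reads:
        for i in range(len(read) - k + 1):
            yield read[i:i + k], 1


def shuffle_phase(pairs):
    """Sort & Shuffle: sort tuples by key, then group the values that share a key."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in group]


def reduce_phase(grouped):
    """Reduce: collapse the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped}


def count_kmers(reads, k=3):
    """Chain the three stages, as a real job would across cluster nodes."""
    return reduce_phase(shuffle_phase(map_phase(reads, k)))


if __name__ == "__main__":
    # Two toy reads stand in for preprocessed genome sequence data.
    print(count_kmers(["GATTACA", "ATTAC"], k=3))
```

In a Hadoop deployment the three stages would run on different nodes and exchange data through HDFS and the network; here they are plain generator functions, so only the data flow is represented, not the distribution.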
This distributed environment involves several computers as cluster nodes, while MapReduce executes the data processing.

MapReduce is one of the main components of the Hadoop big data computing system, alongside HDFS. MapReduce runs the jobs in Hadoop. This is the framework in which the data-parallel paradigm takes place. There are two major phases that datasets undergo in MapReduce, i.e., Map and Reduce.

In the Map phase, a set of data records is collected from HDFS and partitioned into several tuples. A tuple consists of a record's value and its key, <k, v>. The keys are not necessarily unique. These keys are used to combine the records whose characteristics are similar, i.e., that have the same key. The combined tuples are the output of this first phase of MapReduce. The merged output tuples from the Map phase are stored in a buffer before being taken for shuffling.

The map output needs to undergo the shuffle process before entering the Reduce phase. In the shuffle process, the merged tuples are sorted by key. This process groups together the data with the same key. The next step is transferring these grouped data to the Reduce phase. In the Reduce phase, since the data are already organized, the required operations such as counting, compression, etc. can be performed. Afterwards, the output from the Reduce phase is stored back in HDFS.

CloudBurst, CloudAligner, Crossbow, and SEAL are reviewed in this section. These frameworks utilize MapReduce in their common architecture, as shown in Figure 2. The general specifications of these frameworks are listed in Table 2.

CloudBurst offers fast computation of millions of short reads using a large remote cloud with 96 cores. Its running time scales linearly as the number of processors increases [14]. CloudAligner later came up with faster and more accurate results. In CloudAligner's architecture, the reduce phase is omitted. By doing so, CloudAligner shows a significant improvement [15]. Moreover, CloudAligner is able to read long sequences, whereas CloudBurst is limited to the computation of short reads.

Crossbow is another framework that implements MapReduce based on Hadoop [16]. This framework has gained much popularity since it shows quite precise results, i.e., almost 100%, on simulated data for human chromosome 22 with simulated SNPs. Crossbow uses Bowtie to align reads. Bowtie [17] is an alignment (and mapping) package for short reads. Bowtie minimizes memory usage by indexing the genome with a Burrows-Wheeler (BW) index. BW Aligner (BWA) [18] is another tool that implements BW algorithms, namely BWA-backtrack, BWA-SW, and BWA-MEM [19]. The BWA-MEM algorithm provides more accurate long-sequence alignment compared to the other algorithms.

Another example of a short read alignment tool is SEAL. SEAL is also capable of removing duplicates during alignment. It yields results consistent with those of BWA while taking less time than BWA when duplicate removal is not performed [20].

V. CONCLUSION

This article provides information about packages and frameworks used in genome sequence alignment. ProgressiveMauve, Alpha, GECKO, ANItools, REVEAL, HS-BLASTN, and PaSWAS are a few of the numerous tools that offer both accurate and fast alignment performance. As these tools have emerged recently, they are as powerful as or more powerful than preceding tools such as BLAST, Bowtie, BWA, etc., which are implemented in the Hadoop-based frameworks described in Section IV. These frameworks that apply MapReduce are presented to show the implementation of alignment packages in the Hadoop environment and to show the possibility of implementing newly developed tools in Hadoop as well.

As a suggestion, in selecting the most suitable tool or framework for one's needs, the following criteria can be considered:
• Pairwise or multiple sequence alignment
• Short or long reads
• Data size (this decides whether Hadoop frameworks are needed or not)
• Implemented algorithm
• Expected output file format
• Programming language
• User (to determine whether to use a CLI or GUI, desktop or web app)

VI. ACKNOWLEDGMENT

This study is part of a larger research project named “Pengembangan Sistem Penilaian Kualitas Embrio pada Bayi”. We would like to express our gratitude to the Ministry of Research, Technology and Higher Education of the Republic of Indonesia for supporting our research by providing the Penelitian Unggulan Perguruan Tinggi (PUPT) Research Grant No: 0523/UN2.R12/HKP.05.00/2015.

REFERENCES

[1] M. I. Sani, M. E. Suryana, M. A. Akbar, A. Noviyanto, W. Jatmiko and A. M. Arymurthy, "Performance Analysis of ECG Signal Compression using SPIHT," International Journal on Smart Sensing and Intelligent Systems, vol. 6, no. 5, 2013.
[2] Scitable, "Scitable by Nature Education," 2014. [Online]. Available: http://www.nature.com/scitable/definition/single-nucleotide-polymorphism-snp-295. [Accessed 3 September 2016].
[3] V. Marx, "Biology: The big challenges of big data," Nature, vol. 498, no. 7453, pp. 255-260, 2013.
[4] A. E. Darling, B. Mau and N. T. Perna, "progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement," PloS one, vol. 5, no. 6, 2010.
[5] S. Berard, A. Chateau, N. Pompidor, P. Guertin, A. Bergeron and K. M. Swenson, "Aligning the unalignable: bacteriophage whole genome alignments," BMC bioinformatics, vol. 17, no. 1, 2016.
[6] W. Zhang, Y. Niu, H. Zou, L. Luo, Q. Liu and W. Wu, "Accurate prediction of immunogenic T-cell epitopes from epitope sequences using the genetic algorithm-based ensemble learning," PloS one, vol. 10, no. 5, p. e0128194, 2015.
[7] N. Han, Y. Qiang and W. Zhang, "ANItools web: a web tool for fast genome," Database, vol. 2016, 2016.
[8] J. Linthorst, M. Hulsman, H. Holstege and M. Reinders, "Scalable multi whole-genome alignment using recursive exact matching," bioRxiv, p. 011715, 2015.
[9] Y. Chen, W. Ye, Y. Zhang and Y. Xu, "High speed BLASTN: an accelerated MegaBLAST," Nucleic acids research, vol. 43, no. 16, pp. 7762-7768, 2015.
[10] A. Morgulis, G. Coulouris, Y. Raytselis, T. L. Madden, R. Agarwala and A. A. Schäffer, "Database indexing for production MegaBLAST searches," Bioinformatics, vol. 24, no. 16, pp. 1757-1764, 2008.
[11] S. A. Manavski and G. Valle, "CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment," BMC bioinformatics, vol. 9, no. 2, p. 1, 2008.
[12] S. Warris, F. Yalcin, K. J. L. Jackson and J. P. Nap, "Flexible, fast and accurate sequence alignment profiling on GPGPU with PaSWAS," PloS one, vol. 10, no. 4, p. e0122524, 2015.
[13] Q. Zou, X. B. Li, W. R. Jiang, Z. Y. Lin, G. L. Li and K. Chen, "Survey of MapReduce frame operation in bioinformatics," Briefings in bioinformatics, vol. 15, no. 4, p. bbs088, 2013.
[14] M. C. Schatz, "CloudBurst: highly sensitive read mapping with MapReduce," Bioinformatics, vol. 25, no. 11, pp. 1363-1369, 2009.
[15] T. Nguyen, W. Shi and D. Ruden, "CloudAligner: A fast and full-featured MapReduce based tool for sequence mapping," BMC research notes, vol. 4, no. 1, p. 1, 2011.
[16] B. Langmead, M. C. Schatz, J. Lin, M. Pop and S. L. Salzberg, "Searching for SNPs with cloud computing," Genome biology, vol. 10, no. 11, p. 1, 2009.
[17] N. A. Fonseca, J. Rung, A. Brazma and J. C. Marioni, "Tools for mapping high-throughput sequencing data," Bioinformatics, vol. 28, no. 24, pp. 3169-3177, 2012.
[18] H. Li and R. Durbin, "A Highly Parallel Next-Generation DNA Sequencing," in IEEE International Conference on Bioinformatics and Biomedicine (BTBM), 2015.
[19] H. Li, "Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM," arXiv preprint arXiv:1303.3997, 2013.
[20] L. Pireddu, S. Leo and G. Zanetti, "SEAL: a distributed short read mapping and duplicate removal tool," Bioinformatics, vol. 27, no. 15, pp. 2159-2160, 2011.
[21] H. Li and R. Durbin, "Fast and accurate short read alignment with Burrows-Wheeler transform," Bioinformatics, vol. 25, no. 14, pp. 1754-1760, 2009.
[22] O. Torreno and O. Trelles, "Breaking the computational barriers of pairwise genome comparison," BMC bioinformatics, vol. 16, no. 1, p. 1, 2015.
[23] W. Zhang, P. Du, H. Zheng, W. Yu, L. Wan and C. Chen, "Whole-genome sequence comparison as a method for improving bacterial species definition," Applied Microbiology, Molecular and Cellular Biosciences Research Foundation, vol. 60, no. 2, pp. 75-78, 2014.