Applications of Supercomputers in Sequence Analysis and Genome Annotation - Full
1 author:
Gerard Dumancas
The University of Scranton
All content following this page was uploaded by Gerard Dumancas on 28 May 2015.
Richard S. Segall
Arkansas State University, USA
Jeffrey S. Cook
Independent Researcher, USA
Qingyu Zhang
Shenzhen University, China
Copyright © 2015 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Research and applications in global supercomputing / Richard S. Segall, Jeffrey S. Cook, and Qingyu Zhang, editors.
pages cm
Includes bibliographical references and index.
Summary: “This book investigates current and emerging research in the field, as well as the application of this technology
to a variety of areas by highlighting a broad range of concepts”-- Provided by publisher.
ISBN 978-1-4666-7461-5 (hardcover) -- ISBN 978-1-4666-7462-2 (ebook) 1. High performance computing 2. Supercomputers. I. Segall, Richard, 1949- II. Cook, Jeffrey S., 1966- III. Zhang, Qingyu, 1970-
QA76.88.R48 2015
004.1’1--dc23
2014045462
This book is published in the IGI Global book series Advances in Systems Analysis, Software Engineering, and High Performance Computing (ASASEHPC) (ISSN: 2327-3453; eISSN: 2327-3461)
All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.
Chapter 6
Applications of Supercomputers in Sequence Analysis and Genome Annotation
Gerard G. Dumancas
Oklahoma Medical Research Foundation, USA
ABSTRACT
In the modern era of science, bioinformatics plays a critical role in unraveling the potential genetic causes of various diseases. Two of the most important areas of bioinformatics today, sequence analysis and genome annotation, are essential for identifying the genes responsible for different diseases. These two emerging areas rely on highly intensive mathematical calculations. Supercomputers carry out such calculations efficiently and quickly, generating high-throughput images. Thus, this chapter discusses the applications of supercomputers in sequence analysis and genome annotation. It also showcases sophisticated software and algorithms used in these two areas of bioinformatics.
DOI: 10.4018/978-1-4666-7461-5.ch006
more convenient manner. Nowadays, because of supercomputers, groundbreaking bioinformatics research is made possible. A good example is the discovery of novel genes associated with different diseases. With the discovery of these genes, scientists have reached a deeper understanding of the etiology of various previously unexplained genetic diseases. Consequently, various drugs and treatments were discovered to counteract such diseases. Within bioinformatics, sequence analysis and genome annotation are among the most important emerging branches. In recent years, supercomputers have played very important roles in the successes of these branches.

The objective of this chapter is to give readers a clear understanding of the specific applications of supercomputers in two of the fastest-emerging areas of bioinformatics, sequence analysis and genome annotation. Though supercomputers play critical roles in such areas, the audience is often not aware of the potential applications that may arise from them. A common understanding that covers both fundamental and experimental methodologies will enhance the development and progress of such areas. Thus, the major motivation of this chapter is to provide that understanding by discussing and analyzing the fundamentals of several examples centered on the various applications of supercomputers in sequence analysis and genome annotation. While the content of this chapter may be technical for some readers, we encourage them to review some basic concepts of genetics and biochemistry, as well as the definitions of terms, to better understand this chapter.

BACKGROUND

Genotype analysis involves studying the association between genotype and phenotype, and the genotype frequencies. Genetic association studies aim primarily to identify genetic variants that explain differences in phenotypes among individuals in a study population. Once an association is found between the gene(s) and the phenotype, scientists can understand the mechanism of action and disease etiology in individuals and consequently characterize its relevance and importance in the general population. The long-term goal of these studies is to identify better treatment and prevention strategies. Association or any genetic analyses usually require highly intensive mathematical calculations. Supercomputers play a critical role in the success of such calculations. Prior to genetic association analyses, any genotype information needs to undergo two critical steps: sequence analysis and genome annotation. Supercomputers have a wide array of applications in genotype analyses, which are discussed for these two areas below.

SUPERCOMPUTERS IN SEQUENCE ANALYSIS

Sequence analysis is the most commonly performed task in bioinformatics. It was one of the first bioinformatics techniques, developed around 1970 (Webb-Roberts, 2004). DNA sequencing is simply any process used to map out the sequence of the nucleotides that comprise a strand of DNA. Since the discovery of the double-helix shape of DNA in 1953, and the recognition that it is composed of a series of ladder-like units known as nucleotides, the primary goal has been to find out how the sequence of those nucleotides leads to the physical characteristics of an organism, such as hair color, skin color, and every other detail from the bone marrow to the tip of the hair. Thus, DNA sequencing is simply a way for scientists to unravel genetics, the study of how we are put together and how we transfer our traits to our offspring.

DNA sequencing first became possible in 1970 with the discovery of restriction enzymes and DNA polymerases. A breakthrough in the rate of sequencing came when the dideoxy chain termination (Sanger, Nicklen, & Coulson, 1977) and chemical degradation (Maxam & Gilbert, 1977) techniques were introduced in 1977. Using the former method, the 16.5 kb human mitochondrial genome (Anderson, 1981) was sequenced, and the latter method was used for the analysis of the 40 kb bacteriophage T7 (Dunn & Studier, 1983). Thus, these methods provide the theoretical and practical background for our modern sequencing technologies (Chen, 1994).

GenBank, an NIH genetic sequence database, is an annotated collection of all publicly available DNA sequences. It contained only 15 million nucleotides in 1987 and nearly doubled in size in each of the subsequent five years. GenBank had reached over 120 million nucleotides in 1992, with progressively more data obtained using automated DNA sequencers (Chen, 1994). As of April 2011, GenBank had approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the whole genome shotgun (WGS) division (Information, 2011).

With the large array of DNA sequences produced by various DNA sequencing technologies, it is necessary to perform sequence alignment or multiple sequence alignment, which is a way of arranging DNA sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences (Mount, 2004). Thus, a wide variety of sequence alignment software is available to assist scientists in this process. Throughout the years, numerous computational tools have facilitated the success of genetic research, specifically in comparing sequences (Table 1).

After auto-assembly and before genomic annotation, the genomic finishing process is executed in a typical sequence analysis procedure. Finishing is the process of turning a rough draft assembly composed of shotgun sequencing reads into a highly accurate finished DNA sequence with a defined maximum allowed error rate. The international publicly funded sequencing community established a standard for considering a sequence finished: it should be completely contiguous, with no gaps in the sequence, and have a final estimated error rate of <1 error in 10,000 bases (Schmutz, Grimwood, & Myers, 2004). Alongside finishing, the analysis involves sequence assembly and sequence alignment. Sequence assembly is used when short DNA fragments (500-1000 bp) are generated by shotgun sequencing, and is widely used for sequencing large genomes, including the human genome (Y. Zhang & Waterman, 2003). Sequence alignment, as defined above (Mount, 2004), may be of two types, pairwise sequence alignment (PSA) and multiple sequence alignment (MSA). PSA is one of the most commonly performed bioinformatics tasks. It is a method to compare two sequences and make inferences about the relationships between them. In other words, it involves searching for homology between two molecules through a one-to-one correspondence between the residues of the two sequences. PSA uses three types of optimization approaches: dynamic programming, heuristic, and Bayesian. Dynamic programming methods solve the problem sequentially and are generally considered slow but optimal. Heuristic methods, on the other hand, are considered fast and provide approximate solutions (S. M. Brown & Joubert). The Bayesian approach formulates sequence alignment as a Bayesian inference problem (Webb-Roberts, 2004).

When the alignment is concerned with finding structural or functional patterns between sequences, MSA is used (Webb-Roberts, 2004).
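To make the dynamic programming approach to pairwise sequence alignment more concrete, the following minimal Python sketch fills a global alignment score matrix in the style of the Needleman-Wunsch algorithm. It is only an illustration under stated assumptions: the function name and the scoring values (match +1, mismatch -1, linear gap penalty -1) are hypothetical choices, not values taken from this chapter or from any of the tools in Table 1.

```python
def global_alignment_score(seq_a, seq_b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of two sequences.

    Scoring values are illustrative assumptions, not from the chapter.
    """
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # score[i][j] holds the best score for aligning seq_a[:i] with seq_b[:j].
    score = [[0] * cols for _ in range(rows)]

    # Aligning a prefix against an empty sequence costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    # Each cell is derived from its upper-left, upper, and left neighbors.
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align the two residues
                score[i - 1][j] + gap,       # gap in seq_b
                score[i][j - 1] + gap,       # gap in seq_a
            )
    return score[-1][-1]
```

Under these assumed scores, `global_alignment_score("ACGT", "ACT")` evaluates to 2 (three matches and one gap). The two nested loops visit every cell of the matrix, so the work grows with the product of the two sequence lengths, which is why dynamic programming is regarded as optimal but slow compared with heuristic methods.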
Table 1. Continued
MSA is a fundamental analysis method in bioinformatics and many comparative genomic applications. It forms the basis of many other tasks such as protein structure prediction, protein function prediction, and phylogenetic analysis (Agrawal, 2008). The computation time for an optimal MSA grows exponentially with respect to the number of sequences. Thus, to keep computation time minimal as the MSA grows, more efficient algorithms and the use of parallel computing resources are necessary (Lloyd, 2010). In general, several types of parallel systems have emerged to address computationally intensive problems in sequence alignment. Over the years, hybrid systems, which may combine multiprocessor, vector, Cell, graphics processing unit (GPU), and field-programmable gate array (FPGA) components, have become more common. Current multiprocessors usually consist of a cluster of nodes connected by a network. Each node typically has several processors or multi-core chips that share on-board memory. Examples of these systems range from clusters of workstations to supercomputers with high-performance networks. Current vector systems, on the other hand, use x86-based processors with vector instructions in the form of streaming single instruction, multiple data (SIMD) extensions, thus reducing the time needed to perform the same operation on several data elements. The multi-core Cell, containing one 64-bit PowerPC reduced instruction set computer (RISC) processor and eight 128-bit vector processors, is also emerging. GPUs containing several hundred processors capable of floating-point and integer operations are also currently used. Lastly, FPGAs, which allow multiple processing elements to execute in parallel at hardware speed on data supplied from the host, are also currently used (Lloyd, 2010).

Some of the alignment software described in the next paragraphs uses specialized parallel MSA algorithms to achieve time efficiency in sequence alignment calculations. Since the MSA software listed in Tables 1 and 2 is updated over time, we suggest that readers visit the corresponding websites for further information on the algorithms currently used.

PRALINE repeatedly chooses the next highest-scoring pair to align until all sequences and groups are aligned to produce the final alignment. The highest-scoring pair is determined by first comparing all sequences with each other, and then comparing the aligned pair with the remaining sequences after each iteration. After parallel implementation, Kleinjung and colleagues (Kleinjung, Douglas, & Heringa, 2002) achieved a speedup of 10 with 25 processors on a distributed system, using a set of 200 random sequences of 200 residues each. In the method, the pairwise sequence alignment stage is parallelized by distributing pairwise sequence alignment tasks to separate processors. In the progressive profile alignment stage, only the comparison of sequences and groups is parallelized.

A parallel version of T-Coffee was implemented by Zola and colleagues using a master-worker architecture and message passing to obtain an overall speedup of about 40 on a system with 80 CPUs (Zola, Yang, Rospondek, & Aluru, 2007). The parallelism comes mostly from distributing pairwise alignment tasks with dynamic scheduling, which gives a near-linear speedup during library generation. A sophisticated dynamic scheduling strategy that follows the guide tree is used, but almost no speedup is seen with more than 16 CPUs in the progressive alignment stage.

When all the DNA sequences are completed, scientists can use them to find genes that may explain the etiology of a specific disease. Nowadays, with alignments executed in a reasonable amount of computation time using various algorithms, DNA sequencing can be more efficient and convenient than ever before.

SUPERCOMPUTERS IN GENOME ANNOTATION

Genome annotation is defined as the process of attaching biological information to sequences, and consists of several steps. The first step involves an extended form of physical mapping, attempting to convert the unknown portions of raw DNA into a set of easily recognized landmarks and reference points. Along with ‘gene finding’, the major purpose of this step of annotation is to identify and place all known landmarks into the genome. The next step involves identifying the regions of genomic DNA that encode genes, otherwise known as ‘gene prediction.’ The last step involves the attachment of biological information to the predicted genes. The ultimate goal of high-quality genome annotation is to identify the key features of the genome, specifically its genes and gene products (Stein, 2001). Similar to sequencing, annotation of a massive amount
Table 2. Continued
Table 1. Continued
Science Institute (STSI) utilized Gordon to launch a project to conduct whole-genome sequencing of 438 patients with rheumatoid arthritis to better understand the disease, as well as to explore the genetic factors behind patient responses to a specific biologic therapy currently marketed by Janssen in the US (Zverina, 2014).

National Energy Research Scientific Computing (NERSC)

The NERSC is a facility operated by the Lawrence Berkeley National Laboratory and the Department of Energy. It recently accepted “Edison,” a new flagship supercomputer designed for scientific productivity. Named in honor of the American inventor Thomas Alva Edison, the Cray XC30 has 332 terabytes of memory, 2.39 petaflop/second peak performance, 124,608 processing cores, and 7.56 petabytes of disk storage. Edison specializes in data analyses, including genome sequencing and molecular screening programs, which involve high-throughput computing (Rath, 2014).

Intel Corporation

Intel houses many of the world’s fastest supercomputers, including the new world’s fastest supercomputer powered by Intel® Xeon Phi™ coprocessors. Intel processors power more than 80% of all systems on the Top500 list of the world’s most powerful supercomputers, including 98% of newly listed systems (Intel, 2013). Intel also houses the Endeavor - Intel Cluster, an Intel Xeon E5-2697v2 12C 2.700GHz system with Infiniband FDR, consisting of 51,392 cores, with 758.9 TFlop/s Linpack performance (Rmax), 387.20 kW power, and 22,528 GB of memory (Top500, 2013).

Department of Energy (DOE)/National Nuclear Security Administration (NNSA)/Los Alamos National Laboratory (LANL)

The DOE/NNSA/LANL hosts Roadrunner, a BladeCenter QS22/LS21 Cluster with 122,400 cores, 1,026.0 TFlop/s Linpack performance, and 2,345.00 kW power. Roadrunner was ranked 1st among the Top500 supercomputers in the world in 2008 (Top500, 2008). It was used to create the largest HIV evolutionary tree. The goal was to identify common features of the transmitted virus and to attempt to create a vaccine that enables recognition of the original transmitted virus before the body’s immune response causes the virus to react and mutate (DOE/LANL, 2009).

Iowa State University Supercomputing Center

The ISU is home to the IBM Blue Gene/L supercomputer with 1024 dual-core PPC 440 CPUs, 5.7 TF peak performance, and 11 TB of data storage (ISU, 2006). It has been used for a wide variety of computational biology research, including assembling the corn genome and studying protein networks (Aluru, 2006).
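The core-count, Linpack, and power figures quoted above for Endeavor and Roadrunner can be put on a common footing with simple arithmetic. The short Python sketch below derives sustained performance per core and per watt from those quoted numbers. The derived metrics and the helper name `derived_metrics` are illustrative additions, not figures reported in this chapter or by Top500.

```python
# Figures as quoted in the text (Top500, 2013; Top500, 2008).
systems = {
    "Endeavor": {"cores": 51_392, "rmax_tflops": 758.9, "power_kw": 387.20},
    "Roadrunner": {"cores": 122_400, "rmax_tflops": 1026.0, "power_kw": 2345.00},
}


def derived_metrics(spec):
    """Convert quoted figures into GFlop/s per core and GFlop/s per watt."""
    gflops = spec["rmax_tflops"] * 1000.0  # TFlop/s -> GFlop/s
    watts = spec["power_kw"] * 1000.0      # kW -> W
    return {
        "gflops_per_core": gflops / spec["cores"],
        "gflops_per_watt": gflops / watts,
    }


for name, spec in systems.items():
    metrics = derived_metrics(spec)
    print(
        f"{name}: {metrics['gflops_per_core']:.2f} GFlop/s per core, "
        f"{metrics['gflops_per_watt']:.2f} GFlop/s per watt"
    )
```

Such ratios are only a rough guide, since Linpack measures dense linear algebra rather than the integer-heavy, memory-bound workloads typical of sequence alignment.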
CHALLENGES AND SOLUTIONS

Datasets of hundreds of genomes are becoming common, and it is believed that their sizes will only increase in the future. MSA of hundreds of genomes is becoming an intractable problem due to the quadratic increases in computation time and memory footprint. The majority of alignment algorithms to date are designed for commodity clusters without parallelism. Thus, it is necessary to develop alignment algorithms that enable comparison of hundreds, instead of a few, genome sequences within a reasonable time. Church and colleagues (Church et al., 2011) implemented a design of MSA algorithms on massively parallel, distributed memory supercomputers to enable researchers to do comparative genomics on large datasets. In their work, they followed the methodology of the sequential progressive Mauve algorithm and designed data structures, including sequences and sorted k-mer lists, on the IBM Blue Gene/P supercomputer (BG/P). Their results show that they can reduce the memory footprint to potentially align over 250 bacterial genomes on a single BG/P compute node. Their results matched those of the original algorithm, but in half the time and with one quarter of the memory footprint for scaffold building (Church et al., 2011).

Vermij (Vermij, 2011), on the other hand, implemented the well-known Smith-Waterman (SW) optimal local alignment algorithm on the HC-1 hybrid supercomputer from Convey Computer. The platform features four FPGAs, which can be used to accelerate the handling of large volumes of data in genetic sequence alignment. The FPGAs, and the CPU that controls them, live in the same virtual memory space and share one large memory. The solution offers sustained peak performance, the ability to align sequences of any length, area-efficient computation on the FPGA, and the cancellation of unnecessary workload. The resulting SW FPGA core can run at 100% utilization over many consecutive alignments. Further, the cores are packed six per FPGA, running at 150 MHz, for a full-system performance of 460 GCUPS (giga cell updates per second, i.e., billions of elementary operations per second). The elementary processing element can also deliver double the work per clock cycle of a naïve implementation, resulting in a better throughput per area ratio (Vermij, 2011).

A further issue in sequence alignment, one that adds to the computational burden, is very large pattern-matching search. Pattern matching in the presence of noise and uncertainty is an important computational problem in a variety of fields. It is widely used in bioinformatics, where DNA or amino acid sequences are typically compared with a genetic database. To achieve efficient parallelism in a single very large pattern-matching search using a supercomputer cluster of GPUs, the SW algorithm was reformulated and modified to reduce inter-GPU communication (Khajeh-Saeed & Blair Perot, 2011).

FUTURE RESEARCH DIRECTIONS

As the uses of supercomputers to address important problems in society, such as research in sequence analysis and genome annotation, continue to grow, and as the place of supercomputing within the overall computing industry continues to change, the value of innovation in supercomputing architecture, modeling systems software, applications software, and algorithms will endure (Academies, 2003). Further optimization of algorithms, for example by combining several steps of a sequence alignment into a single GPU call, is one approach to reducing computation time in sequence alignment. The key is being able to synchronize the threads within the kernel, because each step of an algorithm such as SW must be entirely completed (for all threads) before the next step can be executed (Khajeh-Saeed & Blair Perot, 2011).
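The synchronization requirement described above follows from the data dependencies of the SW recurrence: each cell of the score matrix depends on its left, upper, and upper-left neighbors, so all cells on one anti-diagonal are independent of each other, but an anti-diagonal cannot start until the previous ones are complete. The Python sketch below is a sequential illustration of that wavefront order. It is a generic textbook formulation and not the GPU reformulation of Khajeh-Saeed and Blair Perot or the FPGA design of Vermij; the function name and the scoring values (match +2, mismatch -1, gap -1) are assumptions made for the example.

```python
def smith_waterman_wavefront(seq_a, seq_b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score, computed one anti-diagonal at a time.

    Scoring values are illustrative assumptions, not from the chapter.
    """
    n, m = len(seq_a), len(seq_b)
    # h[i][j] = best score of a local alignment ending at seq_a[i-1], seq_b[j-1].
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0

    # Cells with i + j == d form anti-diagonal d. They read only from
    # anti-diagonals d-1 and d-2, so they do not depend on each other.
    for d in range(2, n + m + 1):
        i_low = max(1, d - m)
        i_high = min(n, d - 1)
        # In a parallel implementation, the iterations of this inner loop
        # are the work that could be shared among threads.
        for i in range(i_low, i_high + 1):
            j = d - i
            pair = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            h[i][j] = max(
                0,                        # start a new local alignment
                h[i - 1][j - 1] + pair,   # extend along the diagonal
                h[i - 1][j] + gap,        # gap in seq_b
                h[i][j - 1] + gap,        # gap in seq_a
            )
            best = max(best, h[i][j])
        # A parallel version would need a barrier here: every cell of
        # anti-diagonal d must be finished before d+1 begins.
    return best
```

For two sequences of lengths n and m there are n + m - 1 such steps, so the number of synchronization points grows with the sequence lengths, which is one reason reducing the cost of each synchronization matters on GPUs.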
Braberg, H., Webb, B. M., Tjioe, E., Pieper, U., Sali, A., & Madhusudhan, M. S. (2012). SALIGN: A web server for alignment of multiple protein sequences and structures. Bioinformatics (Oxford, England), 28(15), 2072–2073. doi:10.1093/bioinformatics/bts302 PMID:22618536
Bright, L. A., Burgess, S. C., Chowdhary, B., Swiderski, C. E., & McCarthy, F. M. (2009). Structural and functional-annotation of an equine whole genome oligoarray. BMC Bioinformatics, 10(Suppl 11), S8. doi:10.1186/1471-2105-10-S11-S8 PMID:19811692
Brodie, R., Roper, R. L., & Upton, C. (2004). JDotter: A Java interface to multiple dotplots generated by dotter. Bioinformatics (Oxford, England), 20(2), 279–281. doi:10.1093/bioinformatics/btg406 PMID:14734323
Brodsky, L. I., & Vasiliev, A. V., Ya, L. K., Osipov, Y. S., Tatuzov, R. L., & Feranchuk, S. I. (1992). GeneBee: The program package for biopolymer structure analysis. Dimacs, 8, 127–139.
Brown, N. (1996). Consensus. Retrieved January 1, 2013, from http://coot.embl.de/Alignment/consensus.html
Brown, S. M., & Joubert, F. (n.d.). Pairwise sequence alignment. Retrieved February 10, 2013, from http://www.med.nyu.edu/rcr/rcr/course/PairAlign.ppt
Bu, J., Chi, X., & Jin, Z. (2013). HSA: A heuristic splice alignment tool. BMC Systems Biology, 7(Suppl 2), S10. doi:10.1186/1752-0509-7-S2-S10 PMID:24564867
Burge, C., & Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268(1), 78–94. doi:10.1006/jmbi.1997.0951 PMID:9149143
Chang, J. M., Di Tommaso, P., Taly, J. F., & Notredame, C. (2012). Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee. BMC Bioinformatics, 13(Suppl 4), S1. doi:10.1186/1471-2105-13-S4-S1 PMID:22536955
Chen, E. Y. (1994). The efficiency of automated DNA sequencing. In Automated DNA sequencing and analysis. London: Academic Press Limited.
Chuang, T. J., Lin, W. C., Lee, H. C., Wang, C. W., Hsiao, K. L., & Wang, Z. H. et al. (2003). A complexity reduction algorithm for analysis and annotation of large genomic sequences. Genome Research, 13(2), 313–322. doi:10.1101/gr.313703 PMID:12566410
Church, P. C., Goscinski, A., Holt, K., Inouye, M., Ghoting, A., Makarychev, K., & Reumann, M. (2011). Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers. In Proceedings of the Institute of Electrical and Electronics Engineers Engineering in Medicine and Biology Society (pp. 924-927). Academic Press. doi:10.1109/IEMBS.2011.6090208
Ciria, R., Abreu-Goodger, C., Morett, E., & Merino, E. (2004). GeConT: Gene context analysis. Bioinformatics (Oxford, England), 20(14), 2307–2308. doi:10.1093/bioinformatics/bth216 PMID:15073003
Consortium, T. E. P. (2011). A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology, 9(4), e1001046. doi:10.1371/journal.pbio.1001046 PMID:21526222
Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16(22), 10881–10890. doi:10.1093/nar/16.22.10881 PMID:2849754
Darling, A. E., Carey, L., & Feng, W. (2003). The design, implementation, and evaluation of mpiBLAST. In Proceedings of 4th International Conference on Linux Clusters: The HPC Revolution 2003. San Jose, CA: mpiBLAST.
Di Tommaso, P., Moretti, S., Xenarios, I., Orobitg, M., Montanyola, A., Chang, J. M., . . . Notredame, C. (2011). T-coffee: A web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Research, 39(Web Server issue), W13-17. doi:10.1093/nar/gkr245
Do, C. B., Mahabhashyam, M. S., Brudno, M., & Batzoglou, S. (2005). ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15(2), 330–340. doi:10.1101/gr.2821705 PMID:15687296
DOE/LANL. (2009). Scientists use world’s fastest supercomputer to create the largest HIV evolutionary tree. Retrieved April 17, 2014, from http://www.sciencedaily.com/releases/2009/10/091027161536.htm
Dunham, I., Shimizu, N., Roe, B. A., Chissoe, S., Hunt, A. R., & Collins, J. E. et al. (1999). The DNA sequence of human chromosome 22. Nature, 402(6761), 489–495. doi:10.1038/990031 PMID:10591208
Dunn, J. J., Studier, F. W., & Gottesman, M. (1983). Complete nucleotide sequence of bacteriophage T7 DNA and the locations of T7 genetic elements. Journal of Molecular Biology, 166(4), 477–535. doi:10.1016/S0022-2836(83)80282-4 PMID:6864790
Duret, L., Gasteiger, E., & Perriere, G. (1996). LALNVIEW: A graphical viewer for pairwise sequence alignments. Computer Applications in the Biosciences, 12(6), 507–510. PMID:9021269
Dutheil, J. Y., Gaillard, S., & Stukenbrock, E. H. (2014). MafFilter: A highly flexible and extensible multiple genome alignment files processor. BMC Genomics, 15(1), 53. doi:10.1186/1471-2164-15-53 PMID:24447531
ENCODE. (2004). The ENCODE (encyclopedia of DNA elements) project. Science, 306(5696), 636–640. doi:10.1126/science.1105136 PMID:15499007
Fan, L., Hui, J. H., Yu, Z. G., & Chu, K. H. (2014). VIP barcoding: Composition vector-based software for rapid species identification based on DNA barcoding. Molecular Ecology Resources, 14(4), 871–881. doi:10.1111/1755-0998.12235 PMID:24479510
Flicek, P., Amode, M. R., Barrell, D., Beal, K., Brent, S., & Chen, Y. et al. (2011). Ensembl 2011. Nucleic Acids Research, 39(Database issue), D800–D806. doi:10.1093/nar/gkq1064 PMID:21045057
Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M., & Dubchak, I. (2004). VISTA: Computational tools for comparative genomics. Nucleic Acids Research, 32(Web Server issue), W273-279. doi:10.1093/nar/gkh458
Frishman, D. (2007). Protein annotation at genomic scale: The current status. Chemical Reviews, 107(8), 3448–3466. doi:10.1021/cr068303k PMID:17658902
Galperin, M. Y., & Koonin, E. V. (2004). ‘Conserved hypothetical’ proteins: Prioritization of targets for experimental study. Nucleic Acids Research, 32(18), 5452–5463. doi:10.1093/nar/gkh885 PMID:15479782
Gaspar, P., Lopes, P., Oliveira, J., Santos, R., Dalgleish, R., & Oliveira, J. L. (2014). Variobox: Automatic detection and annotation of human genetic variants. Human Mutation, 35(2), 202–207. doi:10.1002/humu.22474 PMID:24186831
Gelfand, M. S., Mironov, A. A., & Pevzner, P. A. (1996). Gene recognition via spliced sequence alignment. Proceedings of the National Academy of Sciences of the United States of America, 93(17), 9061–9066. doi:10.1073/pnas.93.17.9061 PMID:8799154
Gille, C., Birgit, W., & Gille, A. (2014). Sequence alignment visualization in HTML5 without Java. Bioinformatics (Oxford, England), 30(1), 121–122. doi:10.1093/bioinformatics/btt614 PMID:24273246
Godzik, A. (2012). FFAS fold and function alignment. Retrieved December 30, 2012, from http://ffas.sanfordburnham.org/ffas-cgi/cgi/ffas.pl
Gouet, P., Courcelle, E., Stuart, D. I., & Metoz, F. (1999). ESPript: Analysis of multiple sequence alignments in PostScript. Bioinformatics (Oxford, England), 15(4), 305–308. doi:10.1093/bioinformatics/15.4.305 PMID:10320398
Gremme, G., Steinbiss, S., & Kurtz, S. (2013). GenomeTools: A comprehensive software library for efficient processing of structured genome annotations. Institute of Electrical and Electronics Engineers/Association for Computing Machinery Transactions on Computational Biology and Bioinformatics, 10(3), 645-656. doi:10.1109/TCBB.2013.68
HPC Service Will be Used for Genome Annotation System. (2006). Retrieved January 20, 2013, from http://www.hpcwire.com
Huang, S., Zhang, J., Li, R., Zhang, W., He, Z., & Lam, T. W. et al. (2011). SOAPsplice: Genome-wide ab initio detection of splice junctions from RNA-Seq data. Frontiers in Genetics, 2, 46. doi:10.3389/fgene.2011.00046 PMID:22303342
Huang, X., Adams, M. D., Zhou, H., & Kerlavage, A. R. (1997). A tool for analyzing and annotating genomic sequences. Genomics, 46(1), 37–45. doi:10.1006/geno.1997.4984 PMID:9403056
Huang, X., & Miller, W. (1991). A time-efficient linear-space local similarity algorithm. Advances in Applied Mathematics, 12(3), 337–357. doi:10.1016/0196-8858(91)90017-D
Huang da, W., Sherman, B. T., Tan, Q., Kir, J., Liu, D., Bryant, D., . . . Lempicki, R. A. (2007). Bioinformatics resources: Expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Research, 35(Web Server issue), W169-175. doi:10.1093/nar/gkm415
Hyatt, D., Snoddy, J., Schmoyer, D., Chen, G., Fischer, K., Parang, M., et al. (2000). Improved analysis and annotation tools for whole-genome computational annotation and analysis: GRAIL-EXP genome analysis toolkit and related analysis tools. In Genome Sequencing & Biology Meeting.
Information, N. C. f. B. Align sequences nucleotide BLAST. Retrieved December 30, 2012, from http://blast.ncbi.nlm.nih.gov/
Information, N. C. f. B. (2011). GenBank. Retrieved December 28, 2012
Intel. (2013). Intel powers the world’s fastest supercomputer, reveals new and future high performance computing technologies. Retrieved April 17, 2014, from http://www.intc.com/releasedetail.cfm?ReleaseID=774058
ISU. (2006). CyBlue - Blue gene supercomputer. Retrieved April 17, 2014, from http://bluegene.ece.iastate.edu
Jager, M., Wang, K., Bauer, S., Smedley, D., Krawitz, P., & Robinson, P. N. (2014). Jannovar: A Java library for exome annotation. Human Mutation, 35(5), 548–555. doi:10.1002/humu.22531 PMID:24677618
Jaroszewski, L., Li, Z., Cai, X. H., Weber, C., & Godzik, A. (2011). FFAS server: Novel features and applications. Nucleic Acids Research, 39(Web Server issue), W38-44. doi:10.1093/nar/gkr441
Jeon, Y. S., Lee, K., Park, S. C., Kim, B. S., Cho, Y. J., Ha, S. M., & Chun, J. (2014). EzEditor: A versatile sequence alignment editor for both rRNA- and protein-coding genes. International Journal of Systematic and Evolutionary Microbiology, 64(Pt 2), 689–691. doi:10.1099/ijs.0.059360-0 PMID:24425826
Junier, T., & Pagni, M. (2000). Dotlet: Diagonal plots in a web browser. Bioinformatics (Oxford, England), 16(2), 178–179. doi:10.1093/bioinformatics/16.2.178 PMID:10842741
Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, D. (2002). The human genome browser at UCSC. Genome Research, 12(6), 996–1006. doi:10.1101/gr.229102
Khajeh-Saeed, A., & Blair Perot, J. (2011). GPU-supercomputer acceleration of pattern matching. In W. W. Hwu (Ed.), GPU computing gems (Vol. 2, pp. 185–198). Morgan Kaufmann. doi:10.1016/B978-0-12-384988-5.00013-9
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36. doi:10.1186/gb-2013-14-4-r36 PMID:23618408
Kleinjung, J., Douglas, N., & Heringa, J. (2002). Parallelized multiple alignment. Bioinformatics (Oxford, England), 18(9), 1270–1271. doi:10.1093/bioinformatics/18.9.1270 PMID:12217922
Koyanagi, R., Takeuchi, T., Hisata, K., Gyoja, F., Shoguchi, E., Satoh, N., & Kawashima, T. (2013). MarinegenomicsDB: An integrated genome viewer for community-based annotation of genomes. Zoological Science, 30(10), 797–800. doi:10.2108/zsj.30.797 PMID:24125644
Krogh, A. (1997). Two methods for improving performance of an HMM and their application for gene finding. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5, 179–186. PMID:9322033
Krogh, A. (1998). An introduction to hidden Markov models for biological sequences. In Computational methods in molecular biology (pp. 45–63). Amsterdam: Elsevier. doi:10.1016/S0167-7306(08)60461-5
Krogh, A. (2000). Using database matches with HMMGene for automated gene detection in Drosophila. Genome Research, 10(4), 523–528. doi:10.1101/gr.10.4.523 PMID:10779492
Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., & McWilliam, H. et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics (Oxford, England), 23(21), 2947–2948. doi:10.1093/bioinformatics/btm404 PMID:17846036
Lin, H., Ma, X., Chandramohan, P., Geist, A., & Samatova, N. (2005). Efficient data access for parallel BLAST. Academic Press.
Liu, J., Xiao, H., Huang, S., & Li, F. (2014). OMIGA: Optimized maker-based insect genome annotation. Molecular Genetics and Genomics, 289(4), 567–573. doi:10.1007/s00438-014-0831-7 PMID:24609470
Lloyd, S. (2010). Parallel multiple sequence alignment: An overview. Retrieved January 6, 2013, from http://dna.cs.byu.edu/msa/overview.pdf
Lohse, M., Nagel, A., Herter, T., May, P., Schroda, M., & Zrenner, R. et al. (2014). Mercator: A fast and simple web server for genome scale functional annotation of plant sequence data. Plant, Cell & Environment, 37(5), 1250–1258. doi:10.1111/pce.12231 PMID:24237261
Loytynoja, A. (2014). Phylogeny-aware alignment with PRANK. Methods in Molecular Biology (Clifton, N.J.), 1079, 155–170. doi:10.1007/978-1-62703-646-7_10 PMID:24170401
Loytynoja, A., & Goldman, N. (2010). webPRANK: A phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics, 11(1), 579. doi:10.1186/1471-2105-11-579 PMID:21110866
Lukashin, A. V., & Borodovsky, M. (1998). GeneMark.hmm: New solutions for gene finding. Nucleic Acids Research, 26(4), 1107–1115. doi:10.1093/nar/26.4.1107 PMID:9461475
Ma, J., Wang, S., Wang, Z., & Xu, J. (2014). MRFalign: Protein homology detection through alignment of Markov random fields. PLoS Computational Biology, 10(3), e1003500. doi:10.1371/journal.pcbi.1003500 PMID:24675572
Mahadevan, P., King, J. F., & Seto, D. (2009a). CGUG: In silico proteome and genome parsing tool for the determination of “core” and unique genes in the analysis of genomes up to ca. 1.9 Mb. BMC Research Notes, 2(1), 168. doi:10.1186/1756-0500-2-168 PMID:19706165
Mahadevan, P., King, J. F., & Seto, D. (2009b). Data mining pathogen genomes using GeneOrder and CoreGenes and CGUG: Gene order, synteny and in silico proteomes. International Journal of Computational Biology and Drug Design, 2(1), 100–114. doi:10.1504/IJCBDD.2009.027586 PMID:20054988
Mahadevan, P., & Seto, D. (2010). Rapid pairwise synteny analysis of large bacterial genomes using web-based GeneOrder4.0. BMC Research Notes, 3(1), 41. doi:10.1186/1756-0500-3-41 PMID:20178631
Maxam, A. M., & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences of the United States of America, 74(2), 560–564. doi:10.1073/pnas.74.2.560 PMID:265521
Mazumder, R., Kolaskar, A., & Seto, D. (2001). GeneOrder: Comparing the order of genes in small genomes. Bioinformatics (Oxford, England), 17(2), 162–166. doi:10.1093/bioinformatics/17.2.162 PMID:11238072
Milanesi, L. K. N. A., Rogozin, I. B., Ischenko, I. V., Kel, A. E., Orlov Yu, L., Ponomarenko, M. P., & Vezzoni, P. (1993). GenView: A computing tool for protein-coding regions prediction in nucleotide sequences. In Proceedings of the Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis. Singapore: World Scientific Publishing. doi:10.1142/9789814503655_0048
Mironov, A. A., Roytberg, M. A., Pevzner, P. A., & Gelfand, M. S. (1998). Performance-guarantee gene predictions via spliced alignment. Genomics, 51(3), 332–339. doi:10.1006/geno.1998.5251 PMID:9721203
Morgenstern, B. (2014). Multiple sequence alignment with DIALIGN. Methods in Molecular Biology (Clifton, N.J.), 1079, 191–202. doi:10.1007/978-1-62703-646-7_12 PMID:24170403
Mount, D. M. (2004). Bioinformatics: Sequence and genome analysis (2nd ed.). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Ning, Z., Cox, A. J., & Mullikin, J. C. (2001). SSAHA: A fast search method for large DNA databases. Genome Research, 11(10), 1725–1729. doi:10.1101/gr.194201 PMID:11591649
Noe, L., & Kucherov, G. (2005). YASS: Enhancing the sensitivity of DNA similarity search. Nucleic Acids Research, 33(Web Server issue), W540–543. doi:10.1093/nar/gki478
Ovcharenko, I., Loots, G. G., Hardison, R. C., Miller, W., & Stubbs, L. (2004). zPicture: Dynamic alignment and visualization tool for analyzing conservation profiles. Genome Research, 14(3), 472–477. doi:10.1101/gr.2129504 PMID:14993211
Pachter, L., Batzoglou, S., Spitkovsky, V. I., Banks, E., Lander, E. S., Kleitman, D. J., & Berger, B. (1999). A dictionary-based approach for gene annotation. Journal of Computational Biology, 6(3–4), 419–430. doi:10.1089/106652799318364 PMID:10582576
Paquete, L., Matias, P., Abbasi, M., & Pinheiro, M. (2014). MOSAL: Software tools for multiobjective sequence alignment. Source Code for Biology and Medicine, 9(1), 2. doi:10.1186/1751-0473-9-2 PMID:24401750
Parra, G., Blanco, E., & Guigo, R. (2000). GeneID in Drosophila. Genome Research, 10(4), 511–515. doi:10.1101/gr.10.4.511 PMID:10779490
Pearson, W. (1991). LALIGN - Find multiple matching subsegments in two sequences. Retrieved December 29, 2012, from http://www.ch.embnet.org/software/LALIGN_form.html
Pearson, W. R. (2006a). FASTA sequence comparison at the University of Virginia. Retrieved December 30, 2012, from http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
Pearson, W. R. (2006b). LALIGN/PLALIGN. Retrieved December 30, 2012, from http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
Pedersen, B. S., Yang, I. V., & De, S. (2013). CruzDB: Software for annotation of genomic intervals with UCSC genome-browser database. Bioinformatics (Oxford, England), 29(23), 3003–3006. doi:10.1093/bioinformatics/btt534 PMID:24037212
Pei, J., & Grishin, N. V. (2007). PROMALS: Towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics (Oxford, England), 23(7), 802–808. doi:10.1093/bioinformatics/btm017 PMID:17267437
Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D., & Pupko, T. (2010). GUIDANCE: A web server for assessing alignment confidence scores. Nucleic Acids Research, 38(Web Server issue), W23–28. doi:10.1093/nar/gkq443
Pevsner, J. (2009). Bioinformatics and functional genomics. Hoboken, NJ: Wiley-Blackwell. doi:10.1002/9780470451496
Plewniak, F., Bianchetti, L., Brelivet, Y., Carles, A., Chalmel, F., & Lecompte, O. et al. (2003). PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Research, 31(13), 3829–3832. doi:10.1093/nar/gkg518 PMID:12824430
Portal, E. B. R. (n.d.). SIM - Alignment tool for protein sequences. Retrieved December 30, 2012, from http://web.expasy.org/sim/
Puckelwartz, M. J., Pesce, L. L., Nelakuditi, V., Dellefave-Castillo, L., Golbus, J. R., & Day, S. M. et al. (2014). Supercomputing for the parallelization of whole genome analysis. Bioinformatics (Oxford, England), 30(11), 1508–1513. doi:10.1093/bioinformatics/btu071 PMID:24526712
Rath, J. (2014). NERSC flips the switch on new Edison supercomputer. Retrieved April 17, 2014, from http://www.datacenterknowledge.com/archives/2014/01/31/nersc-flips-switch-new-edison-supercomputer/
Reese, M. G., Kulp, D., Tammana, H., & Haussler, D. (2000). Genie--gene finding in Drosophila melanogaster. Genome Research, 10(4), 529–538. doi:10.1101/gr.10.4.529 PMID:10779493
Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European molecular biology open software suite. Trends in Genetics, 16(6), 276–277. doi:10.1016/S0168-9525(00)02024-2 PMID:10827456
Roberts, R. J. (2004). Identifying protein function--A call for community action. PLoS Biology, 2(3), E42. doi:10.1371/journal.pbio.0020042 PMID:15024411
Russell, D. J. (2014). GramAlign: Fast alignment driven by grammar-based phylogeny. Methods in Molecular Biology (Clifton, N.J.), 1079, 171–189. doi:10.1007/978-1-62703-646-7_11 PMID:24170402
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M. A., & Barrell, B. (2000). Artemis: Sequence visualization and annotation. Bioinformatics (Oxford, England), 16(10), 944–945. doi:10.1093/bioinformatics/16.10.944 PMID:11120685
Salamov, A. A., & Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Research, 10(4), 516–522. doi:10.1101/gr.10.4.516 PMID:10779491
Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 74(12), 5463–5467. doi:10.1073/pnas.74.12.5463 PMID:271968
Santos, A. R., Barbosa, E., Fiaux, K., Zurita-Turk, M., Chaitankar, V., & Kamapantula, B. et al. (2013). PANNOTATOR: An automated tool for annotation of pan-genomes. Genetics and Molecular Research, 12(3), 2982–2989. doi:10.4238/2013.August.16.2 PMID:24065654
Schmutz, J., Grimwood, J., & Myers, R. M. (2004). Sequence finishing. Methods in Molecular Biology (Clifton, N.J.), 255, 333–342. doi:10.1385/1-59259-752-1:333 PMID:15020836
Schnattinger, T., Schoning, U., Marchfelder, A., & Kestler, H. A. (2013). RNA-Pareto: Interactive analysis of Pareto-optimal RNA sequence-structure alignments. Bioinformatics (Oxford, England), 29(23), 3102–3104. doi:10.1093/bioinformatics/btt536 PMID:24045774
Schuler, G. D. (1997). Sequence mapping by electronic PCR. Genome Research, 7(5), 541–550. PMID:9149949
Schwartz, S., Zhang, Z., Frazer, K. A., Smit, A., Riemer, C., & Bouck, J. et al. (2000). PipMaker--a web server for aligning two genomic DNA sequences. Genome Research, 10(4), 577–586. doi:10.1101/gr.10.4.577 PMID:10779500
SDSC. (2014). San Diego supercomputer center. Retrieved April 17, 2014, from http://www.sdsc.edu/supercomputing/gordon/
Seemann, T. (2014). Prokka: Rapid prokaryotic genome annotation. Bioinformatics (Oxford, England), 30(14), 2068–2069. doi:10.1093/bioinformatics/btu153 PMID:24642063
Shahid, S., & Axtell, M. J. (2013). Identification and annotation of small RNA genes using ShortStack. Methods (San Diego, Calif.). doi:10.1016/j.ymeth.2013.10.004 PMID:24139974
Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., & Li, W. et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(1), 539. doi:10.1038/msb.2011.75 PMID:21988835
Simossis, V. A., & Heringa, J. (2005). PRALINE: A multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Research, 33(Web Server issue), W289–294. doi:10.1093/nar/gki390
Smith, C., Heyne, S., Richter, A. S., Will, S., & Backofen, R. (2010). Freiburg RNA Tools: A web server integrating INTARNA, EXPARNA and LOCARNA. Nucleic Acids Research, 38(Web Server issue), W373–377. doi:10.1093/nar/gkq316
Softberry, I. (2007). SCAN2. Mount Kisco, NY: Softberry, Inc. Retrieved April 17, 2014, from http://linux1.softberry.com/
Solovyev, V. V., Salamov, A. A., & Lawrence, C. B. (1994). Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Research, 22(24), 5156–5163. doi:10.1093/nar/22.24.5156 PMID:7816600
Solovyev, V. V., Salamov, A. A., & Lawrence, C. B. (1995). Identification of human gene structure using linear discriminant functions and dynamic programming. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 3, 367–375. PMID:7584460
Stein, L. (2001). Genome annotation: From sequence to biology. Nature Reviews. Genetics, 2(7), 493–503. doi:10.1038/35080529 PMID:11433356
Sturrock, S., & Collins, J. (1993). MPsrch version 1.3. Biocomputing Research Unit University of Edinburgh. Retrieved April 17, 2014, from http://www.ebi.ac.uk/Tools/MPsrch/
Su, X., Pan, W., Song, B., Xu, J., & Ning, K. (2014). Parallel-META 2.0: Enhanced metagenomic data analysis with functional annotation, high performance computing and advanced visualization. PLoS ONE, 9(3), e89323. doi:10.1371/journal.pone.0089323 PMID:24595159
Subramanian, A. R., Kaufmann, M., & Morgenstern, B. (2008). DIALIGN-TX: Greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms for Molecular Biology; AMB, 3(1), 6. doi:10.1186/1748-7188-3-6 PMID:18505568
Sze, S. H., & Pevzner, P. A. (1997). Las Vegas algorithms for gene recognition: Suboptimal and error-tolerant spliced alignment. Journal of Computational Biology, 4(3), 297–309. doi:10.1089/cmb.1997.4.297 PMID:9278061
Thompson, J. D., Plewniak, F., Thierry, J., & Poch, O. (2000). DbClustal: Rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Research, 28(15), 2919–2926. doi:10.1093/nar/28.15.2919 PMID:10908355
Thorsen, O., Smith, B., Sosa, C. P., Jiang, K., Lin, H., Peters, A., & Feng, W. (2007). Parallel genomic sequence-search on a massively parallel system. New York, NY: Academic Press. doi:10.1145/1242531.1242542
Top500. (2008). Top500 June 2008: Roadrunner - BladeCenter QS22/LS21 cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire infiniband. Retrieved April 17, 2014, from http://www.top500.org/system/176026
Top500. (2013). Top500: Endeavor - Intel cluster. Retrieved April 17, 2014, from http://www.top500.org/system/176908
Troshin, P. V., Procter, J. B., & Barton, G. J. (2011). Java bioinformatics analysis web services for multiple sequence alignment--JABAWS:MSA. Bioinformatics (Oxford, England), 27(14), 2001–2002. doi:10.1093/bioinformatics/btr304 PMID:21593132
Uberbacher, E. C., & Mural, R. J. (1991). Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proceedings of the National Academy of Sciences of the United States of America, 88(24), 11261–11265. doi:10.1073/pnas.88.24.11261 PMID:1763041
Uberbacher, E. C., Xu, Y., & Mural, R. J. (1996). Discovering and understanding genes in human DNA sequence using GRAIL. Methods in Enzymology, 266, 259–281. doi:10.1016/S0076-6879(96)66018-2 PMID:8743689
Varadarajan, S. (2004). System X: Building the Virginia Tech supercomputer. Paper presented at the 13th International Conference on Computer Communications and Networks. New York, NY. doi:10.1109/ICCCN.2004.1401571
Vermij, E. P. (2011). Genetic sequence alignment on a supercomputing platform. Netherlands: TU Delft.
Warren, A. S., Archuleta, J., Feng, W. C., & Setubal, J. C. (2010). Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics, 11(1), 131. doi:10.1186/1471-2105-11-131 PMID:20230630
Webb-Roberts, B.-J. (2004). Protein & DNA sequence analysis. Retrieved February 10, 2013, from http://www.sysbio.org/resources/tutorials/sequence_analysis_webb.pdf
Wood, D. E., & Salzberg, S. L. (2014). Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), R46. doi:10.1186/gb-2014-15-3-r46 PMID:24580807
Xu, Y., Mural, R., Shah, M., & Uberbacher, E. (1994). Recognizing exons in genomic sequence using GRAIL II. Genetic Engineering, 16, 241–253. PMID:7765200
Ye, Y., Wei, B., Wen, L., & Rayner, S. (2013). BlastGraph: A comparative genomics tool based on BLAST and graph algorithms. Bioinformatics (Oxford, England), 29(24), 3222–3224. doi:10.1093/bioinformatics/btt553 PMID:24068035
Yeh, R. F., Lim, L. P., & Burge, C. B. (2001). Computational inference of homologous gene structures in the human genome. Genome Research, 11(5), 803–816. doi:10.1101/gr.175701 PMID:11337476
Zafar, N., Mazumder, R., & Seto, D. (2001). Comparisons of gene colinearity in genomes using GeneOrder2.0. Trends in Biochemical Sciences, 26(8), 514–516. doi:10.1016/S0968-0004(01)01881-3 PMID:11504629
Zafar, N., Mazumder, R., & Seto, D. (2002). CoreGenes: A computational tool for identifying and cataloging “core” genes in a set of small genomes. BMC Bioinformatics, 3(1), 12. doi:10.1186/1471-2105-3-12 PMID:11972896
Zammataro, L., DeMolfetta, R., Bucci, G., Ceol, A., & Muller, H. (2014). AnnotateGenomicRegions: A web application. BMC Bioinformatics, 15(Suppl 1), S8. doi:10.1186/1471-2105-15-S1-S8 PMID:24564446
Zhang, M. Q. (1997). Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proceedings of the National Academy of Sciences of the United States of America, 94(2), 565–568. doi:10.1073/pnas.94.2.565 PMID:9012824
Zhang, Y., & Waterman, M. S. (2003). DNA sequence assembly and multiple sequence alignment by an Eulerian path approach. Cold Spring Harbor Symposia on Quantitative Biology, 68(0), 205–212. doi:10.1101/sqb.2003.68.205 PMID:15338619
Zhao, K., & Chu, X. (2014). G-BLASTN: Accelerating nucleotide alignment by graphics processors. Bioinformatics (Oxford, England), 30(10), 1384–1391. doi:10.1093/bioinformatics/btu047 PMID:24463183
Zola, J., Yang, X., Rospondek, A., & Aluru, S. (2007). Parallel T-coffee: A parallel multiple sequence aligner. In Proceedings of the ISCA 20th International Conference on Parallel and Distributed Computing Systems. Academic Press.
Zuegge, J., Ebeling, M., & Schneider, G. (2001). H-BloX: Visualizing alignment block entropies. Journal of Molecular Graphics & Modelling, 19(3–4), 304–306, 379. doi:10.1016/S1093-3263(00)00074-7 PMID:11449568
Zverina, J. (2014). SDSC assists in whole-genome sequencing analysis under collaboration with Janssen. Retrieved April 17, 2014, from http://ucsdnews.ucsd.edu/pressrelease/sdsc_assists_in_whole_genome_sequencing_analysis_under_collaboration_with_j

KEY TERMS AND DEFINITIONS

Allele Frequency: In a population, the percentage of all the alleles at a locus accounted for by one specific allele.
Bacteriophage: Virus that infects and replicates within bacteria.
Comparative Genomics: Study that involves the comparison of the genomic sequences of different species.
Complementary DNA (cDNA): DNA derived from messenger RNA (mRNA), which can be obtained from prokaryotes or eukaryotes and is often utilized to clone eukaryotic genes in prokaryotes.
Cosmid: A plasmid vector containing a bacteriophage lambda cos site, which directs insertion of DNA into phage particles.
CpG Elements: Genomic regions containing a high frequency of CpG sites. A CpG site is a position where a cytosine nucleotide is immediately followed by a guanine nucleotide in the linear sequence of bases.
Dicodon Statistics: Statistics used for the prediction of splice signals and coding regions.
Discriminant Function: A function of a set of variables that is evaluated for samples of events or objects and used as an aid in classifying them.
DNA: Deoxyribonucleic acid, a molecule that serves as the hereditary material in humans and almost all other organisms.
DNA Polymerases: Enzymes that catalyze the polymerization of deoxyribonucleotides into a DNA strand.
Evolution: Gradual unfolding of new varieties of life from previous forms over long periods of time; from the modern genetic perspective, it is defined as a change in allele frequency from one generation to the next.
Exon: DNA nucleotide sequence that carries the code for the final mRNA and thus determines the amino acid sequence of the encoded protein.
Expressed Sequence Tags (ESTs): Small pieces of DNA sequence (usually 200 to 500 nucleotides long) generated by sequencing either one or both ends of an expressed gene.
Field-Programmable Gate Array (FPGA): An integrated circuit that can be programmed in the field after manufacture.
Functional Annotation: The process of collecting information about a gene and describing its biological identity.
GenBank: The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Benson et al., 2013).
Genetic Association: Statistical phenomenon in which a specific disease is associated with a certain gene or genes.
Genomics: Field of study focusing on genes, their functions, and related techniques.
Genotype Frequency: Number of individuals possessing a given genotype divided by the total number of individuals in the sample.
Genotype: An organism’s entire genetic makeup, or the alleles it carries at a specific genetic locus.
Graphics Processing Unit (GPU): A programmable logic chip that renders animation, images, and video for the computer screen.
Heuristic Methods: Methods that facilitate learning, discovery, or problem-solving through experimental or trial-and-error approaches.
Homology: Relationship between two molecules that share a common ancestor.
Intergenic: A region found between two genes.
Intron: A noncoding sequence between two coding genomic sequences.
Markov Model System: Mathematical model that allows the study of complex systems by establishing a state of the system and then effecting a transition to a new state, with the transition dependent only on the values of the current state and not on the previous history of the system up to that point.
Mendel’s Laws: Three laws of inheritance describing how genes are passed on from parents to offspring.
Meta-Analysis: A method of combining quantitative or qualitative datasets from different studies to reach a single conclusion with greater statistical power.
Multiprocessor: A computer system with more than one central processing unit (CPU) sharing main memory.
Nucleotide: Building block from which DNA and RNA are built.
Open Reading Frame (ORF): A DNA sequence that does not contain a stop codon in a given reading frame.
Parallel Algorithm: An algorithm that can be executed a piece at a time on different processing devices, with the pieces put back together at the end to obtain the correct result.
Parallel Programming: Computational method that allows multiple calculations to be carried out simultaneously.
Pattern Recognition: Method that deals with feature extraction and classification.
Peptide Sequence: Unique amino acid sequence characterizing a given protein.
Population Genetics: The study concerned mainly with the genetic variation within species.
Prokaryote: Organism lacking a cell nucleus.
Promoters: DNA segments usually occurring upstream from a gene coding region and acting as controlling elements in gene expression.
Reduced Instruction Set Computer (RISC): A type of computer architecture that has a relatively small set of computer instructions that it can perform.
Regulatory Elements: DNA sequences that determine the regulation of gene expression.
Sequence Alignment: A method of arranging RNA, protein, or DNA sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Sequence Assembly: A method of determining the order of multiple sequenced DNA fragments.
Shotgun Sequencing: Laboratory technique for determining the DNA sequence of an organism’s genome.
Single Instruction, Multiple Data (SIMD) Processing: Processing technique in which the operation specified by a single instruction is applied to more than one set of data elements at the same time.
Splicing: The process of inserting DNA or RNA fragments to form new genetic combinations or alter an existing genetic structure.
Start Codon: The first codon of an mRNA transcript translated by a ribosome.
Stop Codon: The genetic codon in an mRNA that signals the termination of protein synthesis during translation.
Structural Annotation: The process of localizing the genes in both strands of a genome as well as precisely determining the structural elements of these genes.
Supernode: Any node that also serves as one of the network’s relayers and proxy servers, handling data flow and connections for other users.
Traditional GenBank: Divisions that contain 106 billion nucleotide bases from 108 million individual sequences, with 11 million new sequences added in 2009.
Transposon: DNA segment consisting of an insertion sequence element at each end as a repeat as well as genes specific to some other activity such as resistance to antibiotics.
Whole Genome Shotgun: A method of genome sequence determination based on assembly of the whole genome from numerous sequence reads at high coverage without requiring reference to genetic or physical map locations for those reads.
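
The Open Reading Frame, Start Codon, and Stop Codon entries can be illustrated with a minimal Python sketch. It rests on assumptions that are not part of the definitions themselves: an ORF is reported here under a common working convention, as a stretch that begins at an ATG start codon and runs in the same reading frame to the first stop codon (TAA, TAG, or TGA), and only the three forward-strand frames are scanned. The function name find_orfs and the min_codons parameter are hypothetical names chosen for illustration.

```python
# Illustrative sketch only: scans the three forward reading frames of a DNA
# string for ATG...stop stretches. Names and conventions are assumptions.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def find_orfs(dna, min_codons=1):
    """Return a list of (start, end, frame) tuples.

    start is the 0-based index of the ATG, end is the index just past the
    stop codon, and frame is 0, 1, or 2. min_codons is the minimum number of
    codons (counting the ATG, excluding the stop) required to report an ORF.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START_CODON:
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # close this ORF and look for the next ATG
    return orfs


if __name__ == "__main__":
    for start, end, frame in find_orfs("GGATGCCCTGA"):
        print(start, end, frame)
```

Production gene finders also scan the reverse complement and combine such signals with statistical models; this sketch omits both for brevity.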