
See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/275045742

Applications of Supercomputers in Sequence Analysis and Genome Annotation

Chapter · January 2015


DOI: 10.4018/978-1-4666-7461-5.ch006


1 author:

Gerard Dumancas
The University of Scranton

All content following this page was uploaded by Gerard Dumancas on 28 May 2015.



Research and Applications
in Global Supercomputing

Richard S. Segall
Arkansas State University, USA

Jeffrey S. Cook
Independent Researcher, USA

Qingyu Zhang
Shenzhen University, China

A volume in the Advances in Systems Analysis, Software Engineering, and High Performance
Computing (ASASEHPC) Book Series
Managing Director: Lindsay Johnston
Managing Editor: Austin DeMarco
Director of Intellectual Property & Contracts: Jan Travers
Acquisitions Editor: Kayla Wolfe
Production Editor: Christina Henning
Typesetter: Mike Brehm
Cover Design: Jason Mull

Published in the United States of America by
Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue
Hershey PA, USA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: [email protected]
Web site: http://www.igi-global.com

Copyright © 2015 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in
any form or by any means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or
companies does not indicate a claim of ownership by IGI Global of the trademark or registered trademark.
Library of Congress Cataloging-in-Publication Data
Research and applications in global supercomputing / Richard S. Segall, Jeffrey S. Cook, and Qingyu Zhang, editors.
pages cm
Includes bibliographical references and index.
Summary: “This book investigates current and emerging research in the field, as well as the application of this technology
to a variety of areas by highlighting a broad range of concepts”-- Provided by publisher.
ISBN 978-1-4666-7461-5 (hardcover) -- ISBN 978-1-4666-7462-2 (ebook) 1. High performance computing 2. Supercomputers.
I. Segall, Richard, 1949- II. Cook, Jeffrey S., 1966- III. Zhang, Qingyu, 1970-
QA76.88.R48 2015
004.1’1--dc23
2014045462

This book is published in the IGI Global book series Advances in Systems Analysis, Software Engineering, and
High Performance Computing (ASASEHPC) (ISSN: 2327-3453; eISSN: 2327-3461)

British Cataloguing in Publication Data
A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this book is new, previously-unpublished material. The views expressed in this book are those of the
authors, but not necessarily of the publisher.

For electronic access to this publication, please contact: [email protected].



Chapter 6
Applications of Supercomputers
in Sequence Analysis and
Genome Annotation
Gerard G. Dumancas
Oklahoma Medical Research Foundation, USA

ABSTRACT
In the modern era of science, bioinformatics plays a critical role in unraveling the potential genetic causes
of various diseases. Two of the most important areas of bioinformatics today, sequence analysis and genome
annotation, are essential for identifying the genes responsible for different diseases. These two emerging
areas rely on highly intensive mathematical calculations. Supercomputers carry out such calculations in an
efficient and time-saving manner, generating high-throughput images. Thus, this chapter thoroughly discusses
the applications of supercomputers in sequence analysis and genome annotation. It also showcases
sophisticated software and algorithms used in these two areas of bioinformatics.

INTRODUCTION

Bioinformatics is often regarded as a discipline in its infancy. However, this interdisciplinary field
had its historical start in the 1960s, when computers emerged as a vital tool in molecular biology. With
the notable efforts of Margaret O. Dayhoff, Walter M. Fitch, and Russell F. Doolittle, among others, this
area emerged as an approach to managing and interpreting the massive data generated by genomic
research. Bioinformatics today represents a convergence of various fields, which involve modeling
of biological phenomena, genomics, biotechnology and information technology, analysis and
interpretation of data, and the development of novel algorithms for analyzing biological datasets.
With the emergence of these large biological datasets, scientists are often confronted with the
problem of analyzing and interpreting massive amounts of information in less time, with high
accuracy, and at low cost. In the last few decades, this has been made possible by the emergence
of supercomputers. The wide array of available supercomputers has made it possible to analyze
and interpret biological datasets and systems in a
more convenient manner. Nowadays, because of supercomputers, groundbreaking bioinformatics research is made possible. A good example is the discovery of novel genes associated with different diseases. With the discovery of these genes, scientists have come to a deeper understanding of the etiology of various previously unexplained diseases with genetic causes. Consequently, various drugs and treatments were discovered to counteract such diseases. Within the area of bioinformatics, sequence analysis and genome annotation are two of the emerging and most important branches. In recent years, supercomputers have played very important roles in the successes of these branches.

The objective of this chapter is to give readers a clear understanding of the specific applications of supercomputers in two of the fastest-emerging areas of bioinformatics, sequence analysis and genome annotation. Though supercomputers play critical roles in such areas, the audience is not often aware of the potential applications that may arise from them. A universal understanding that covers both fundamental and experimental methodologies will enhance the development and progress of such areas. Thus, the major motivation of this chapter is to provide that understanding by discussing and analyzing the fundamentals of several examples centered on the various applications of supercomputers in sequence analysis and genome annotation. While the content of this chapter may be technical to some readers, we encourage them to review some basic concepts of genetics and biochemistry as well as to look at the definition of terms to better understand this chapter.

BACKGROUND

Genotype analysis involves studying the association between genotype and phenotype, and the genotype frequencies. Genetic association studies aim primarily to identify genetic variants that explain differences in phenotypes among individuals in a study population. Once an association is found between the gene(s) and the phenotype, scientists can understand the mechanism of action and disease etiology in individuals and consequently characterize its relevance and importance in the general population. The long-term goal of these studies is to identify better treatment and prevention strategies. Association or any genetic analyses usually require highly intensive mathematical calculations. Supercomputers play a critical role in the success of such calculations. Prior to genetic association analyses, any genotype information needs to undergo two critical steps: sequence analysis and genome annotation. Supercomputers have a wide array of applications in genotype analyses, which are discussed for these two areas below.

SUPERCOMPUTERS IN SEQUENCE ANALYSIS

Sequence analysis is the most commonly performed task in bioinformatics. It was one of the first bioinformatics techniques, founded in ~1970 (Webb-Roberts, 2004). DNA sequencing is simply any process used to map out the sequence of the nucleotides that comprise a strand of DNA. After the discovery of the double helix shape of DNA in 1953, and seeing how it is composed of a series of ladder-like units known as DNA nucleotides, the primary goal has been to find out just how the sequence of those nucleotides leads to the physical characteristics of an organism, that is, your hair color, your skin color, and every other detail from your bone marrow to the tip of your hair. Thus, DNA sequencing is simply a way for scientists to unravel genetics, the study of how we are put together and how we transfer our traits to our offspring.

It was in 1970 when DNA sequencing first became possible with the discovery of restriction enzymes and DNA polymerases. Eventually, a breakthrough in the rate of sequencing came when the dideoxy chain termination (Sanger, Nicklen, & Coulson, 1977) and chemical degradation (Maxam & Gilbert, 1977) techniques were introduced in 1977. Consequently, using the former method, the 16.5 kb human mitochondrial genome (Anderson, 1981) was sequenced, and the latter method was used for the analysis of the 40 kb bacteriophage T7 (Dunn & Studier, 1983). Thus, these methods provide the theoretical and practical background for our modern sequencing technologies (Chen, 1994).

GenBank, an NIH genetic sequence database, is an annotated collection of all publicly available DNA sequences. It contained only 15 million nucleotides in 1987 and nearly doubled in size in each of the subsequent five years. GenBank had reached over 120 million in 1992, with progressively more data obtained using automated DNA sequencers (Chen, 1994). As of April 2011, GenBank had approximately 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions and 191,401,393,188 bases in 62,715,288 sequence records in the whole genome shotgun (WGS) division (Information, 2011).

With the large array of DNA sequences produced by various DNA sequencing technologies, it is necessary to perform sequence alignment or multiple sequence alignment, which is a way of arranging DNA sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences (Mount, 2004). Thus, a wide variety of sequence alignment software is available to assist scientists in this process. Throughout the years, numerous computational tools have facilitated the success of genetic research, specifically in comparing sequences (Table 1).

After auto-assembly and before genome annotation, the genomic finishing process is executed in a typical sequence analysis procedure. Finishing is the process of turning a rough draft assembly composed of shotgun sequencing reads into a highly accurate finished DNA sequence with a defined maximum allowed error rate. The international publicly funded sequencing community established a standard for considering a sequence finished: it should be completely contiguous, with no gaps in the sequence, and it should have a final estimated error rate of <1 error in 10,000 bases (Schmutz, Grimwood, & Myers, 2004). The next step in the analysis, after finishing, is sequence assembly or sequence alignment. Sequence assembly is used when short DNA fragments (500-1000 bp) are generated by shotgun sequencing, and it is widely used for sequencing large genomes, including the human genome (Y. Zhang & Waterman, 2003). Sequence alignment, on the other hand, is a way of arranging DNA sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences (Mount, 2004). It may be of two types: pairwise sequence alignment (PSA) and multiple sequence alignment (MSA). PSA is one of the most commonly performed bioinformatics tasks. It is a method to compare two sequences and make inferences about the relationships between them. In other words, it simply involves searching for homology between two molecules through a one-to-one correspondence between the residues of the two sequences. PSA uses three types of optimization approaches: dynamic programming, heuristic, and Bayesian. Dynamic programming methods solve the problem sequentially and are generally considered slow but optimal. Heuristic methods, on the other hand, are considered fast and provide approximate solutions (S. M. Brown & Joubert). The Bayesian approach formulates sequence alignment as a Bayesian inference problem (Webb-Roberts, 2004).

When the alignment is concerned with finding structural or functional patterns between sequences, MSA is used (Webb-Roberts, 2004).

Table 1. Sequence alignment tools for comparing sequences

Name | Description | Authors
Advance PipMaker | Aligns two DNA sequences and returns a percent identity plot of that alignment, together with a traditional textual form of the alignment. This is a tool for sequence comparison between two small genomes. | (Schwartz et al., 2000)
Alignment-To-HTML | HTML-based interactive visualization for annotated multiple sequence alignments. | (Gille, Birgit, & Gille, 2014)
BLAST2 | Useful for DNA sequence comparisons providing a small graphic for both proteins or short DNA sequences. | NCBI (Information)
BlastGraph | An interactive Java program for comparative genome analysis based on Basic Local Alignment Search Tool (BLAST), graph clustering and data visualization. | (Ye, Wei, Wen, & Rayner, 2013)
BOXSHADE | An alternative presentation of alignments accepting a wide variety of file formats and allows the requester considerable flexibility in defining the output appearance (color, arrangement, format) | (Baron)
Clustal Omega | Multiple sequence alignment program that uses seeded guide trees and hidden Markov models profile-profile techniques to generate alignments. | (Sievers et al., 2011)
ClustalW | General purpose multiple sequence alignment program for DNA or proteins; provides with a number of data presentation, homology matrices, and presentation of phylogenetic trees. | (Larkin et al., 2007)
Consensus | Takes CLUSTAL or MSF multiple alignments and calculates the consensus. | (N. Brown, 1996)
ConSurf | Estimates the evolutionary conservation of amino/nucleic acid positions in a protein/DNA/RNA molecule based on the phylogenetic relations between homologous sequences. | (Ashkenazy, Erez, Martz, Pupko, & Ben-Tal, 2010)
CoreGenes | Designed to analyze two to five genomes simultaneously, generating a table of related genes - orthologs and putative orthologs. These entries are linked to their GenBank data with a limit of 0.35 Mb. CoreGenes2.0 has a limit of approx. 2.0 Mb. The upgrade to this program is GeneOrder 4.0 which will compare genomes up to 8 Mb. | (Zafar, Mazumder, & Seto, 2002); (Mazumder, Kolaskar, & Seto, 2001); (Mahadevan & Seto, 2010)
CoreGenes 3 | Tallies the total number of genes in common between the two genomes being compared. It also displays the percent value of genes in common with a specific genome and determines the unique genes contained in a pair of proteomes. | (Zafar et al., 2002); (Mahadevan, King, & Seto, 2009b); (Mahadevan, King, & Seto, 2009a)
DbClustal | Aligns sequences from a BlastP database search with one query sequence. | (Thompson, Plewniak, Thierry, & Poch, 2000)
DiAlign | While standard alignment methods rely on comparing single residues and imposing gap penalties, DIALIGN constructs pairwise and multiple alignments by comparing whole segments of the sequences. | (Subramanian, Kaufmann, & Morgenstern, 2008)
DIALIGN | Software tool for multiple sequence alignment by combining global and local alignment features. | (Morgenstern, 2014)
Dotlet | A program for comparing sequences between two small genomes by the diagonal plot method. | (Junier & Pagni, 2000)
ESPript2.2 | An alternative presentation of alignments requiring to save your alignment as a *.aln file. Good control over output appearance and format is available (ps, tiff and gif). | (Gouet, Courcelle, Stuart, & Metoz, 1999)
EzEditor | Sequence editing software designed for both rRNA and protein-coding genes with the visualization of biologically relevant information; useful in molecular phylogenetic studies | (Jeon et al., 2014)
FASTA | Finds regions of local or global similarity between protein or DNA sequences, either by searching protein or DNA databases, or by identifying local duplications within a sequence; can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families (W. R. Pearson, 2006a). LALIGN/PLALIGN offers the users a graphic “dotplot” output of the alignments (W. R. Pearson, 2006b). | (W. R. Pearson, 2006a)
FFAS | The Fold and Function Assignment System (FFAS); profile of a user’s protein can now be compared with ~20 additional profile databases; features include navigating multiple results pages, and also includes novel functionality, such as dotplot graph viewer, modeling tools, and 3D alignment viewer and links to the database of structural similarities (Jaroszewski, Li, Cai, Weber, & Godzik, 2011). | (Jaroszewski et al., 2011) (Godzik, 2012)
G-BLASTN | A promising software tool that uses a GPU to accelerate nucleotide sequence alignment | (Zhao & Chu, 2014)
Gene Context Tool | Tool for visualizing the genome context of a gene or group of genes. | (Ciria, Abreu-Goodger, Morett, & Merino, 2004)
GeneOrder 3.0 | Ideal for comparing small GenBank genomes (up to 2 Mb). Each gene from the Query sequence is compared to all of the genes from the Reference sequence using BLASTP. There are two display formats: graphical and tabular. Currently the graph is an applet and must be saved as a “SCREEN SHOT”. | (Mazumder et al., 2001); (Zafar, Mazumder, & Seto, 2001)
GramAlign | Progressive alignment algorithm that uses a grammar-based relative complexity distance metric to determine the alignment order; allows for a computationally efficient and scalable program useful for aligning both large numbers of sequences and sets of long sequences quickly | (Russell, 2014)
GUIDANCE | Implements two different algorithms for evaluating confidence scores: (i) the heads-or-tails (HoT) method, which measures alignment uncertainty due to co-optimal solutions; (ii) the GUIDANCE method, which measures the robustness of the alignment to guide-tree uncertainty. | (Penn et al., 2010)
H-BLOX | An alternative presentation of alignments providing information content or the relative entropy within DNA or protein alignment blocks. | (Zuegge, Ebeling, & Schneider, 2001)
HSA | Effective spliced aligner of RNA-seq reads mapping. | (Bu, Chi, & Jin, 2013)
JABAWS 2 | Provides web services for multiple sequence alignment, prediction of protein disorder, and amino acid conservation conveniently packaged to run on your local computer, server or cluster. This program is used for meta-analysis. | (Troshin, Procter, & Barton, 2011)
JDotter | A tool for sequence comparison between two small genomes with a Java Dot Plot Viewer for generating dotplots of large DNA or protein sequences. | (Brodie, Roper, & Upton, 2004)
Kraken | Assigns taxonomic labels to metagenomic DNA sequences. | (Wood & Salzberg, 2014)
LALIGN | Finds multiple matching subsegments in two sequences; provides one with % identity for different subsegments of the sequence; implements the algorithm of Huang and Miller (X. Huang & Miller, 1991) | (W. Pearson, 1991)
LocARNA | Used for multiple alignments of RNA molecules requiring only RNA sequences as input and will simultaneously fold and align the input sequences. It outputs a multiple alignment together with a consensus structure. For the alignment it features RIBOSUM-like similarity scoring and realistic gap cost. | (Smith, Heyne, Richter, Will, & Backofen, 2010)
MafFilter | A highly efficient and flexible tool to analyse multiple genome alignments | (Dutheil, Gaillard, & Stukenbrock, 2014)
MATCHER | Part of the EMBOSS group of programs; finds the best local alignments between two protein sequences. | (Rice, Longden, & Bleasby, 2000)
MegaSeq | Designed to harness the size and memory of the Cray XE6, housed at Argonne National Laboratory, for whole genome analysis in a platform designed to better match current and emerging sequencing volume. | (Puckelwartz et al., 2014)
MOSAL | Provides an open-source implementation and an on-line application for multiobjective pairwise sequence alignment. | (Paquete, Matias, Abbasi, & Pinheiro, 2014)
MPsrch | An alternative presentation of alignments. It is a biological sequence comparison tool that implements the true Smith and Waterman algorithm. It runs a search on a HP/COMPAQ cluster, using single and parallelised versions of the software. It allows a rigorous search in a reasonable computational time. MPsrch utilizes an exhaustive algorithm, which is recognized as the most sensitive sequence comparison method available, whereas BLAST and FASTA utilize a heuristic one. As a consequence, MPsrch is capable of identifying hits in cases where BLAST and FASTA fail and also reports fewer false-positive hits. (This service was retired in 2009) | (Sturrock & Collins, 1993)
MRFalign | Protein homology detection software through alignment of Markov Random Fields. | (Ma, Wang, Wang, & Xu, 2014)
MultAlin | Multiple sequence alignment with hierarchical clustering producing results in colors. | (Corpet, 1988)
Multi-zPicture | Provides nice dotplot graphs and dynamic visualizations for comparing sequences between two small genomes. | (Ovcharenko, Loots, Hardison, Miller, & Stubbs, 2004)
Multiple Align Show | An alternative presentation of alignments allowing considerable choice in coloring alignments. | Bioinformatics Organization (Gouet et al., 1999)
Multiple Alignment | Arranges several protein or nucleic acid sequences with postulated gaps so that similar residues (in one-letter code) are juxtaposed using the GeneBee service. | (Brodsky et al., 1992)
PipeAlign | Offers an integrated approach to protein family analysis through a cascade of different sequence analysis programs such as BALLAST, DbClustal multiple alignment program, Rascal alignment analysis. | (Plewniak et al., 2003)
PRALINE | A multiple sequence alignment program with many options to optimize the information for each of the input sequences: i.e. global or local preprocessing, predicted secondary structure information and iteration capabilities. | (Simossis & Heringa, 2005)
PRANK | Can provide the inferred ancestral sequences as a part of the output and mark the alignment gaps differently depending on their origin in insertion or deletion events. | (Loytynoja, 2014)
PROBCONS | Combination of probabilistic modeling and consistency-based alignment techniques and has achieved the highest accuracies of all multiple alignments of protein sequences methods to date. | (Do, Mahabhashyam, Brudno, & Batzoglou, 2005)
PROMALS | Constructs multiple protein sequence alignments using information from database searches and secondary structure prediction for protein homologs with sequence identity below 10%, aligning close to half of the amino acid residues correctly on average. | (Pei & Grishin, 2007)
RNA-Pareto | Allows a direct inspection of all feasible results to the pairwise RNA sequence-structure alignment problem and greatly facilitates the exploration of the optimal solution set. | (Schnattinger, Schoning, Marchfelder, & Kestler, 2013)
SALIGN | Determines the best alignment procedure based on the inputs automatically, while allowing the user to override default parameter values. Dendrograms are used to guide multiple alignments computed from a matrix of all pairwise alignment scores. When aligning sequences to structures, SALIGN uses structural environment information to place gaps optimally. If two multiple sequence alignments of related proteins are input to the server, a profile-profile alignment is performed. | (Braberg et al., 2012)
SCAN2 | Tool for sequence comparison between two small genomes providing one with a color-coded graphical alignment of genome length DNAs in Java. | (Softberry, 2007)
SIM | Alignment tool between two protein sequences or within a sequence (Portal). Once alignment is calculated, LALNVIEW, a graphical viewer for pairwise alignments can be used for viewing (Duret, Gasteiger, & Perriere, 1996). The PBIL (Pôle Bio-Informatique Lyonnais) server can be used to align nucleic acid sequences with a similar tool (Duret et al., 1996). | (X. Huang & Miller, 1991) (Portal)
SOAPsplice | A robust tool to detect splice junctions using RNA-Seq data without using any information of known splice junctions. | (S. Huang et al., 2011)
SSEA | Secondary Structure Element Assignment (SSEA); computes alignments of protein secondary structures including both global and local structure element alignments. | (Duret et al., 1996)
SUPERMATCHER | Part of the EMBOSS group of programs; calculates approximate local pair-wise alignments of larger protein sequences. | (Rice et al., 2000)
The Coffee Collection | A collection of alignment databases consisting of: T-Coffee (aligns DNA, RNA or proteins using the default T-Coffee); M-Coffee (aligns DNA, RNA or proteins by combining the output of popular aligners); R-Coffee (aligns RNA sequences using predicted secondary structures); Expresso (aligns protein sequences using structural information); PSI-Coffee (aligns distantly related proteins using homology extension); TM-Coffee (aligns transmembrane proteins using homology extension) | (Chang, Di Tommaso, Taly, & Notredame, 2012); (Di Tommaso et al., 2011)
Tophat2 | Accurate alignment tool of transcriptomes in the presence of insertions, deletions, and gene fusions; combines the ability to identify novel splice sites with direct mapping to known transcripts, producing sensitive and accurate alignments, even for highly repetitive genomes or in the presence of pseudogenes. | (Kim et al., 2013)
VIP Barcoding | User-friendly software in graphical user interface for rapid DNA barcoding; able to deal with both large-scale and multilocus barcoding data with accuracy and can contribute to DNA barcoding for modern taxonomy. | (Fan, Hui, Yu, & Chu, 2014)
VISTA | VISualization Tools for Alignments; allows one to align two genome-length sequences. | (Frazer, Pachter, Poliakov, Rubin, & Dubchak, 2004)
webPRANK | Incorporates phylogeny-aware multiple sequence alignment, visualisation and post-processing in an easy-to-use web interface. | (Loytynoja & Goldman, 2010)
YASS | A tool for sequence comparison between two small genomes performing DNA local alignments with results in dotplot and tabular form. | (Noe & Kucherov, 2005)
zPicture | DNA or genome alignment and visualization tool based on blastz alignment program. Alignments can be automatically submitted to rVista 2.0 to identify evolutionary conserved transcription factor binding sites. | (Ovcharenko et al., 2004)
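
Several tools in Table 1 (for example Dotlet, JDotter, and YASS) present sequence comparisons as dot plots. As a rough illustration of the idea only, and not of how any of those programs is implemented, the sketch below marks a dot wherever a short word from one sequence also occurs in the other. The function names and the default word length are assumptions made for this example.

```python
def dot_plot(seq_a, seq_b, window=3):
    """Return (i, j) pairs where seq_a[i:i+window] == seq_b[j:j+window].

    `window` is an illustrative word length, not a value from any tool.
    """
    # Index every word of length `window` in seq_b by its start positions.
    index = {}
    for j in range(len(seq_b) - window + 1):
        index.setdefault(seq_b[j:j + window], []).append(j)
    # Look up each word of seq_a; every hit becomes one dot.
    return [(i, j)
            for i in range(len(seq_a) - window + 1)
            for j in index.get(seq_a[i:i + window], [])]


def render(dots, n_rows, n_cols):
    """Draw the dots as a text grid: rows follow seq_a, columns follow seq_b."""
    grid = [["." for _ in range(n_cols)] for _ in range(n_rows)]
    for i, j in dots:
        grid[i][j] = "*"
    return "\n".join("".join(row) for row in grid)


if __name__ == "__main__":
    a, b, w = "GATTACAGATTACA", "GATTACA", 4
    dots = dot_plot(a, b, w)
    print(render(dots, len(a) - w + 1, len(b) - w + 1))
```

Runs of dots along a diagonal indicate stretches that the two sequences share, and a repeated region shows up as parallel diagonals. Indexing words first, rather than comparing every pair of positions, is the same kind of shortcut that heuristic search tools rely on.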

MSA is a fundamental analysis method in bioinformatics and many comparative genomic applications.
It forms the basis of many other tasks such as protein structure prediction, protein function
prediction, and phylogenetic analysis (Agrawal, 2008). The computation time for an optimal MSA
grows exponentially with the number of sequences. Thus, to keep computation time manageable as
the MSA grows, more efficient algorithms and the use of parallel computing resources are necessary
(Lloyd, 2010). In general, several types of parallel systems have emerged to address computationally
intensive problems in sequence alignment. Over the years, hybrid systems, which may combine
multiprocessors, vector processors, Cell processors, graphics processing units (GPUs), and
field-programmable gate arrays (FPGAs), have become more common. Current
multiprocessors usually consist of a cluster of nodes connected with a network. Each node typically has several processors or multi-core chips that share on-board memory. Examples of these systems range from clusters of workstations to supercomputers with high-performance networks. Current vector systems, on the other hand, use x86-based processors with vector instructions in the form of streaming single instruction, multiple data (SIMD) extensions, thus reducing the time needed to perform the same operation on several data elements. A multi-core Cell containing one 64-bit PowerPC reduced instruction set computer (RISC) processor and eight 128-bit vector processors is also emerging. GPUs containing several hundred processors that are capable of floating-point and integer operations are also currently used. Lastly, FPGAs that allow multiple processing elements to be executed in parallel at hardware speed on data supplied from the host are also currently used (Lloyd, 2010).

Some alignment software, as described in the next paragraphs, uses specialized parallel MSA algorithms to achieve time efficiency in sequence alignment calculations. Since the various MSA software packages listed in Tables 1 and 2 are updated over time, we suggest that readers visit their specific websites for further information on the algorithms the software currently uses.

PRALINE repeatedly chooses the next highest scoring pair to align until all sequences and groups are aligned to produce the final alignment. The highest scoring pair is determined by first comparing all sequences with each other, and then comparing the aligned pair with the remaining sequences after each iteration. After parallel implementation, Kleinjung and colleagues (Kleinjung, Douglas, & Heringa, 2002) achieved a speedup of 10 with 25 processors on a distributed system, using a set of 200 random sequences that are 200 residues in length. In this method, the pairwise sequence alignment stage is parallelized by distributing pairwise sequence alignment tasks to separate processors. In the progressive profile alignment stage, only the comparison of sequences and groups is parallelized.

A parallel version of T-Coffee was implemented by Zola and colleagues using a master-worker architecture and message passing to obtain an overall speedup of about 40 on a system with 80 CPUs (Zola, Yang, Rospondek, & Aluru, 2007). The parallelism comes mostly from distributing pairwise alignment tasks with dynamic scheduling for a near linear speedup during library generation. A sophisticated dynamic scheduling strategy is used that follows the guide tree, but almost no speedup is seen with more than 16 CPUs in the progressive alignment stage.

When all the DNA sequences are completed, these can now be used by scientists to find genes, which may explain the etiology of a specific disease. Nowadays, with the alignments executed in a reasonable amount of computation time using various algorithms, DNA sequencing can be more efficient and convenient than ever before.

SUPERCOMPUTERS IN GENOME ANNOTATION

Genome annotation is simply defined as the process of attaching biological information to sequences, and consists of several steps. The first step involves an extended form of physical mapping, attempting to convert the unknown portions of raw DNA into a set of easily recognized landmarks and reference points. Along with ‘gene finding’, the major purpose of this step of annotation is to identify and place all known landmarks into the genome. The next step involves identifying the genomic DNA regions that encode genes, otherwise known as ‘gene prediction.’ The last step simply involves the attachment of biological information to the predicted genes. The ultimate goal of high-quality genome annotation is to identify the key features of the genome, specifically its genes and gene products (Stein, 2001). Similar to sequencing, annotation of a massive amount
of DNA sequence data requires computational tools for finding genes in DNA sequences. Generally, there are two interrelated types of genome annotation: structural and functional. Structural annotation involves delineating and demarcating the genomic elements (such as genes, promoters, and regulatory elements), while functional annotation involves assigning functions to structural elements (Bright, Burgess, Chowdhary, Swiderski, & McCarthy, 2009).

Genomic landmarks can be identified rapidly using the e-PCR program (Schuler, 1997) for short sequences. For longer sequences, such as restriction-fragment length polymorphism markers, SSAHA (Ning, Cox, & Mullikin, 2001) and BLASTN (Altschul, Gish, Miller, Myers, & Lipman, 1990) are usually used (Stein, 2001). For gene prediction in eukaryotic genomes, several sophisticated software algorithms have been devised, including MZEF, HEXON, Grail, GENSCAN, Genie, and GeneMark.hmm (Stein, 2001).

Genome annotation is considered a crucial step in extracting useful information from genomes. BLAST is used as a basic level of annotation for finding similarities and annotating genomes based on those similarities (Pevsner, 2009). In recent years, annotation platforms have been designed to include more features to de-convolute discrepancies between genes that are given the same annotation. More than a decade of extensive effort has gone into improving annotation tools and obtaining novel experimental results, yet a number of problems still arise from the available genome annotations (Warren, Archuleta, Feng, & Setubal, 2010). Such problems include the possible existence of undetected genes, genes that are mis-annotated or whose annotations are too general to be of any use, and the presence of hypothetical genes without any functional assignment (Frishman, 2007; Galperin & Koonin, 2004; Roberts, 2004; Warren et al., 2010). Specifically, there have been reports that prokaryotic gene finder programs have problems with small genes (either over-predicting or under-predicting). As such, a high-performance computing methodology was developed and used to investigate the problem. The BLASTP search was performed using mpiBLAST on Virginia Tech's System X supercomputer, and the results labeled each query as a missing gene, an absent annotation, a genomic artifact, or an unclassified open reading frame (ORF) (Varadarajan, 2004). mpiBLAST parallelizes BLAST using database fragmentation, query segmentation (Darling, Carey, & Feng, 2003), parallel input-output (Lin, Ma, Chandramohan, Geist, & Samatova, 2005), and advanced scheduling (Thorsen et al., 2007). Using these, the study identified 1,153 candidate genes that are missing from current genome annotations and uncovered 38,895 intergenic ORFs, readily identified as putative genes by similarity to currently annotated genes, with the vast majority of the missing genes being small (fewer than 100 amino acids) (Warren et al., 2010).

A mature web tool for rapid and reliable display of any requested genome at any scale, together with several dozen aligned annotation tracks, is provided at http://genome.ucsc.edu. It addresses the problem of displaying genome annotation effectively by showing assembly contigs and gaps, mRNA and expressed sequence tag alignments, multiple gene predictions, cross-species homologies, single nucleotide polymorphisms, sequence-tagged sites, radiation hybrid data, transposon repeats, and more as a stack of coregistered tracks (Kent et al., 2002).

Genome annotation is based primarily on ab initio and homology methods. The ab initio method predicts genes directly from the genomic sequence using the computational properties of exons, introns, and other signature features, without reference to experimental data. FGENESH (Solovyev, Salamov, & Lawrence, 1995) (Salamov & Solovyev, 2000), GeneID (Parra, Blanco, & Guigo, 2000), GeneMark.hmm (Lukashin & Borodovsky, 1998), GeneView (Milanesi L., 1993), Genie (Reese, Kulp, Tammana, & Haussler, 2000), Grail (Xu, Mural, Shah, & Uberbacher, 1994),
GrailEXP_Perceval (Hyatt, 2000), HMMgene (Krogh, 1998) (Krogh, 2000), and MZEF (M. Q. Zhang, 1997) are some of the ab initio programs extensively used in genome annotation (Chuang et al., 2003).

The homology approach, on the other hand, identifies genes with the aid of experimental data by exploiting sequence alignment between the genomic data and known cDNA or protein databases (Chuang et al., 2003). Examples of this method include GeneBuilder (Milanesi L., 1993), GenomeScan (Yeh, Lim, & Burge, 2001), GeneWise (Birney & Durbin, 2000), Procrustes (Gelfand, Mironov, & Pevzner, 1996) (Sze & Pevzner, 1997) (Mironov, Roytberg, Pevzner, & Gelfand, 1998), GrailEXP_Gawain (Hyatt, 2000), GAIA (Bailey et al., 1998), AAT (X. Huang, Adams, Zhou, & Kerlavage, 1997), FGENESH+ and FGENESH++ (Salamov & Solovyev, 2000), and ICE (Pachter et al., 1999). Table 2 summarizes these genome annotation tools.

Table 2. Software used for genome annotation

| Name | Description | Authors |
| --- | --- | --- |
| Analysis and annotation tools (AAT) | Reduces the labor-intensive work of locating the exons of the query sequence and improves the process of defining the intron-exon boundaries by using the wealth of available protein and cDNA data. | (X. Huang et al., 1997) |
| AnnotateGenomicRegions | A web application that accepts genomic regions as input and outputs a selection of overlapping and/or neighboring genome annotations. | (Zammataro, DeMolfetta, Bucci, Ceol, & Muller, 2014) |
| Artemis | Free genome browser and annotation tool that allows visualization of sequence features, next generation data and the results of analyses within the context of the sequence. | (Rutherford et al., 2000) |
| Complexity Reduction Algorithm for Sequence Analysis (CRASA) | Has large-scale processing capability and is a robust tool for genome annotation with high accuracy, matching expressed sequence tag (EST) sequences precisely to the genomic sequences. | (Chuang et al., 2003) |
| CruzDB | A fast and intuitive programmatic interface to the University of California, Santa Cruz (UCSC) genome browser that facilitates integrative analyses of diverse local and remotely hosted datasets. | (Pedersen, Yang, & De, 2013) |
| Database for Annotation, Visualization and Integrated Discovery (DAVID) | Provides functional interpretation of large lists of genes derived from genomic studies. | (Huang da et al., 2007) |
| Encyclopedia of DNA elements (ENCODE) | Builds a comprehensive parts list of functional elements in the human genome, including elements that act at the protein and RNA levels, and regulatory elements that control the cells and circumstances in which a gene is active. | ("The ENCODE (ENCyclopedia Of DNA Elements) Project," 2004) |
| Ensembl | Automatically annotates genome sequences, integrates these data with other biological information, and makes the data readily available to scientists. | (Flicek et al., 2011) |
| FGENESH+ and FGENESH++ | Predicts protein-coding regions using linear discriminant functions. | (Solovyev et al., 1995) |
| Gene Ontology (GO) | De facto standard for functional annotation; routinely used as a basis for modeling and hypothesis testing on large functional genomic sets. | (Ashburner et al., 2000) |
| GeneBuilder | Based on the prediction of functional signals and coding regions by different approaches in combination with similarity searches in protein and EST databases. | (Milanesi L., 1993) |
| GeneID | Predicts genes in anonymous genomic sequences with a hierarchical structure. | (Parra et al., 2000) |
| GeneMark.hmm | Based on heuristic methods producing fairly inhomogeneous Markov models of protein coding regions; could be used to find genes in small fragments of prokaryotic genomes and genomes of organelles, viruses, phages and plasmids, as well as highly inhomogeneous genomes where adjustment of models to local DNA composition is needed. | (Besemer & Borodovsky, 1999) |
| GeneView | Based on prediction of splice signals by a classification approach and of coding regions by dicodon statistics; constructs potential gene structure using a dynamic programming approach. | (Milanesi L., 1993) |
| GeneWise | Used for combining gene prediction and homology searches, providing reasonably accurate gene predictions. | (Birney & Durbin, 2000) |
| Genie | Robust Markov model system allowing for generalized information integration from different sources such as signal sensors (splice sites, start codon, etc.), content sensors (exons, introns, intergenic) and alignments of mRNA, EST, and peptide sequences; could be effectively used in the genome annotation of higher organisms. | (Reese et al., 2000) |
| Genome Annotation and Information Analysis (GAIA) | Uses a high-throughput, reliable annotation, called framework annotation, designed to provide a foundation for initial biologic characterization of previously unexamined sequence. | (Bailey et al., 1998) |
| GenomeScan | Gene identification algorithm that combines exon-intron and splice signal models with similarity to known protein sequences in an integrated model. | (Yeh et al., 2001) |
| GenomeTools | A convenient and efficient software library and associated software tools for developing bioinformatics software intended to create, process or convert annotation graphs. | (Gremme, Steinbiss, & Kurtz, 2013) |
| GENSCAN | Identifies complete intron/exon structures of genes in genomic DNA; has the capacity to predict multiple genes in a sequence, to deal with partial or complete genes, and to predict consistent sets of genes occurring on either or both DNA strands. | (Burge & Karlin, 1997) |
| GRAIL | Used for evaluating the protein-coding potential of anonymous DNA sequences; creates a comprehensive analysis environment where a host of questions about genes and genome structure can be answered as quickly and accurately as possible. | (Uberbacher & Mural, 1991) (Uberbacher, Xu, & Mural, 1996) |
| GrailEXP | Predicts exons, genes, promoters, poly(A) sites, CpG islands, EST similarities, and repetitive elements within DNA sequence. | (Hyatt, 2000) |
| HEXON | Predicts internal exon sequences in human DNA based on a splice site algorithm that uses a linear discriminant function to combine information about significant triplet frequencies of various functional parts of splice site regions and preferences of oligonucleotides in protein coding and intron regions. | (Solovyev, Salamov, & Lawrence, 1994) |
| HMMGene | Predicts genes in anonymous DNA based on a hidden Markov model; predicts whole genes so the predicted exons always splice correctly; can predict several whole or partial genes in one sequence, so it can be used on whole cosmids or even longer sequences; can also be used to predict splice sites and start/stop codons. | (Krogh, 1997) |
| ICE | A fast and fully-automated dictionary-based approach to gene annotation and exon prediction. | (Pachter et al., 1999) |
| Jannovar | Stand-alone Java application as well as a Java library designed to be used in larger software frameworks for exome and genome analysis. | (Jager et al., 2014) |
| MarinegenomicsDB | Provides open access to published data and a user-friendly environment for community-based manual gene annotation. | (Koyanagi et al., 2013) |
| Mercator | A fast and simple web server for genome scale functional annotation of plant sequence data. | (Lohse et al., 2014) |
| MZEF | Predicts internal coding exons in genomic DNA sequences based on a prediction algorithm that uses the quadratic discriminant function for multivariate statistical pattern recognition. | (M. Q. Zhang, 1997) |
| OMIGA | Predicts protein-coding genes from insect genomes. | (Liu, Xiao, Huang, & Li, 2014) |
| PANNOTATOR | A web-based automated pipeline for the annotation of closely related and well-suited genomes for pan-genome studies, aiming at reducing the manual work to generate reports and corrections of various genome strains. | (Santos et al., 2013) |
| Parallel-META 2.0 | A metagenomic analysis software package for efficient and fast analyses of taxonomical and functional structures for microbial communities. | (Su, Pan, Song, Xu, & Ning, 2014) |
| Procrustes | Uses related proteins and cDNA for gene identification and annotation-quality gene predictions based on a splice alignment algorithm, which explores all possible exon assemblies and finds the multi-exon structure with the best fit to a related protein. | (Gelfand et al., 1996), (Sze & Pevzner, 1997), (Mironov et al., 1998) |
| Prokka | A command line software tool to fully annotate a draft bacterial genome in about 10 min on a typical desktop computer. | (Seemann, 2014) |
| ShortStack | A program that was developed to comprehensively analyze reference-aligned small RNA-seq data, and output detailed and useful annotations of the causal small RNA-producing genes. | (Shahid & Axtell, 2013) |
| Variobox | A desktop tool to annotate, analyze, and compare human genes. | (Gaspar et al., 2014) |

The homology approaches demand high-performance computing and large storage space (Chuang et al., 2003), while the ab initio methods tend to produce more false-positive predictions when annotating long genomic sequences with multiple genes (Dunham et al., 1999). Further, these methods are known to require extensive manual intervention to curate true gene predictions from large sets of matched data. A new method, the Complexity Reduction Algorithm for Sequence Analysis (CRASA), was proposed, which annotates the genomic sequence on top of global alignment (Chuang et al., 2003). The method features a progressive data structure in hierarchical orders to facilitate a fast and efficient search mechanism. Benchmark tests showed that CRASA annotation excelled in both sensitivity and specificity (Chuang et al., 2003).

With the genome sequences of more than a thousand human individuals and several model organisms completed, genome annotation is considered a major challenge for scientists (Abecasis et al., 2012; Consortium, 2011). With the rapidly growing number of sequenced genomes, there is an exponential increase in the computing power needed to identify the genes and determine their functions and relationships. As such, scientists are trying to increase the efficiency of public and proprietary comparative genomics tools by an order of magnitude and to implement them on high-performance secure operating system clusters (www.hpcwire.com, 2006).

Over the years, a number of supercomputing centers have emerged to support sequence analysis and genome annotation. Some examples of these centers are described below.

San Diego Supercomputer Center (SDSC)

The SDSC was founded in 1985 with a self-prescribed mission of developing and using technology to advance science. Located on the campus of the University of California, San Diego, it houses advanced computing and networking resources and conducts research in computing technologies and computational sciences such as biology and chemistry. The center is home to the Gordon Compute Cluster, a unique data-intensive supercomputer sponsored by the NSF XSEDE program, which went into production on January 1, 2012 (SDSC, 2014). The cluster supports a wide variety of research, including de novo genome assembly. The system comprises 1,024 dual-socket Intel Sandy Bridge nodes, each with 64 GB of DDR3-1333 memory. It also has over 300 TB of high-performance Intel flash memory SSDs, accessed via 64 dual-socket Intel Westmere I/O nodes. The large-memory supernodes are capable of presenting over 2 TB of cache-coherent memory (SDSC, 2014). Recently, Janssen Research and Development, LLC (Janssen), in collaboration with SDSC and the Scripps Translational Science Institute (STSI), used Gordon to launch a project to conduct whole-genome sequencing of 438 patients with rheumatoid arthritis, both to better understand the disease and to explore the genetic factors behind patient responses to a specific biologic therapy currently marketed by Janssen in the US (Zverina, 2014).

National Energy Research Scientific Computing Center (NERSC)

The NERSC is a facility operated by the Lawrence Berkeley National Laboratory and the Department of Energy. It recently accepted "Edison," a new flagship supercomputer designed for scientific productivity. Named in honor of the American inventor Thomas Alva Edison, the Cray XC30 has 332 terabytes of memory, 2.39 petaflop/second peak performance, 124,608 processing cores, and 7.56 petabytes of disk storage. Edison specializes in data analyses, including genome sequencing and molecular screening programs, that involve high-throughput computing (Rath, 2014).

Intel Corporation

Intel powers many of the world's fastest supercomputers, including the new world's fastest supercomputer, which uses Intel® Xeon Phi™ coprocessors. Intel processors power more than 80% of all systems on the Top500 list of the world's most powerful supercomputers, including 98% of newly listed systems (Intel, 2013). Intel is also home to Endeavor, an Intel cluster built on Intel Xeon E5-2697v2 12C 2.700GHz processors with Infiniband FDR, comprising 51,392 cores, with 758.9 TFlop/s Linpack performance (Rmax), 387.20 kW power consumption, and 22,528 GB of memory (Top500, 2013).

Department of Energy (DOE)/National Nuclear Security Administration (NNSA)/Los Alamos National Laboratory (LANL)

The DOE/NNSA/LANL is home to Roadrunner, a BladeCenter QS22/LS21 cluster with 122,400 cores, 1,026.0 TFlop/s Linpack performance, and 2,345.00 kW power consumption. Roadrunner was ranked first among the Top500 supercomputers in the world in 2008 (Top500, 2008). It was used to create the largest HIV evolutionary tree. The goal was to identify common features of the transmitted virus and to attempt to create a vaccine that enables recognition of the original transmitted virus before the body's immune response causes the virus to react and mutate (DOE/LANL, 2009).

Iowa State University Supercomputing Center

The ISU is home to an IBM Blue Gene/L supercomputer with 1,024 dual-core PPC 440 CPUs, 5.7 TF peak performance, and 11 TB of data storage (ISU, 2006). It has been used for a wide variety of computational biology research, including assembling the corn genome and studying protein networks (Aluru, 2006).
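
The Linpack figures quoted above for Endeavor and Roadrunner are easier to compare once they are normalized per core and per watt. The following is a minimal sketch of that arithmetic, using only the numbers cited in the text; the dictionary layout and the function names are illustrative assumptions, not part of any Top500 tooling.

```python
# Figures exactly as quoted in the text (Top500, 2013; Top500, 2008).
# The structure and names below are hypothetical, chosen for illustration.
SYSTEMS = {
    "Endeavor": {"cores": 51_392, "rmax_tflops": 758.9, "power_kw": 387.20},
    "Roadrunner": {"cores": 122_400, "rmax_tflops": 1_026.0, "power_kw": 2_345.00},
}


def gflops_per_core(rmax_tflops: float, cores: int) -> float:
    """Linpack performance per core in GFlop/s (1 TFlop/s = 1,000 GFlop/s)."""
    return rmax_tflops * 1_000 / cores


def gflops_per_watt(rmax_tflops: float, power_kw: float) -> float:
    """Energy efficiency in GFlop/s per watt.

    TFlop/s -> GFlop/s and kW -> W both multiply by 1,000, so the factors cancel.
    """
    return rmax_tflops / power_kw


def summarize(systems: dict) -> dict:
    """Return per-core and per-watt figures for every system in the table."""
    return {
        name: {
            "gflops_per_core": gflops_per_core(s["rmax_tflops"], s["cores"]),
            "gflops_per_watt": gflops_per_watt(s["rmax_tflops"], s["power_kw"]),
        }
        for name, s in systems.items()
    }


if __name__ == "__main__":
    for name, row in summarize(SYSTEMS).items():
        print(f"{name}: {row['gflops_per_core']:.2f} GFlop/s per core, "
              f"{row['gflops_per_watt']:.2f} GFlop/s per watt")
```

With the quoted figures, this works out to roughly 14.8 GFlop/s per core and 1.96 GFlop/s per watt for Endeavor, against roughly 8.4 GFlop/s per core and 0.44 GFlop/s per watt for Roadrunner.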

CHALLENGES AND SOLUTIONS

Datasets of hundreds of genomes are becoming common, and their sizes are expected only to increase in the future. MSA of hundreds of genomes is becoming an intractable problem because of the quadratic increase in computation time and memory footprint. The majority of alignment algorithms to date are designed for commodity clusters without parallelism. Thus, alignment algorithms are needed that enable the comparison of hundreds, rather than a few, genome sequences within a reasonable time. Church and colleagues (Church et al., 2011) implemented a design of MSA algorithms on massively parallel, distributed-memory supercomputers to enable researchers to do comparative genomics on large datasets. In their work, they followed the methodology of the sequential progressive Mauve algorithm and designed data structures, including sequences and sorted k-mer lists, on the IBM Blue Gene/P supercomputer (BG/P). Their results show that the memory footprint can be reduced enough to potentially align over 250 bacterial genomes on a single BG/P compute node. Their results matched those of the original algorithm, but in half the time and with one quarter of the memory footprint for scaffold building (Church et al., 2011).

Vermij (Vermij, 2011), on the other hand, implemented the well-known Smith-Waterman (SW) optimal local alignment algorithm on the HC-1 hybrid supercomputer from Convey Computer. The platform features four FPGAs, which can be used to accelerate the processing of large volumes of data in genetic sequence alignment. The FPGAs and the CPU that controls them live in the same virtual memory space and share one large memory. The solution offers sustainable peak performance, the ability to align sequences of any length, FPGA area-efficient computation, and the cancellation of unnecessary workload. The resulting SW FPGA core can run at 100% utilization over many consecutive alignments. Further, the cores are packed six per FPGA running at 150 MHz, resulting in a full-system performance of 460 GCUPS (giga cell updates per second, that is, billions of elementary operations per second). The elementary processing element can also deliver double the work per clock cycle compared with a naïve implementation, resulting in a better throughput-per-area ratio (Vermij, 2011).

Adding to the computational burden of sequence alignment is the issue of very large pattern-matching searches. Pattern matching in the presence of noise and uncertainty is an important computational problem in a variety of fields. It is widely used in bioinformatics, where DNA or amino acid sequences are typically compared with a genetic database. To achieve efficient parallelism in a single very large pattern-matching search using a supercomputer cluster of GPUs, the SW algorithm was reformulated and modified to reduce inter-GPU communication (Khajeh-Saeed & Blair Perot, 2011).

FUTURE RESEARCH DIRECTIONS

As the use of supercomputers to address important problems in society, such as research in sequence analysis and genome annotation, continues to grow, and as the place of supercomputing within the overall computing industry continues to change, the value of innovation in supercomputing architecture, modeling systems software, applications software, and algorithms will endure (Academies, 2003). Further optimization of algorithms, for example combining several steps of sequence alignment into a single GPU call, is one approach to reducing computational time. The key requirement is the ability to synchronize the threads within the kernel, because each step of an algorithm such as SW must be entirely completed (for all threads) before the next step can be executed (Khajeh-Saeed & Blair Perot, 2011).
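
The synchronization requirement can be illustrated with a small sketch of the SW recurrence evaluated one anti-diagonal at a time. This is plain sequential Python standing in for GPU threads, not the reformulation of Khajeh-Saeed and Blair Perot; the function name, the linear gap penalty, and the default scores (match 2, mismatch -1, gap -1) are assumptions made for illustration.

```python
def smith_waterman_wavefront(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score of strings a and b.

    The score matrix is filled one anti-diagonal (i + j == d) at a time.
    Cells on the same anti-diagonal depend only on the two previous
    anti-diagonals, so they are independent of each other.
    """
    n, m = len(a), len(b)
    # H[i][j] = best score of a local alignment ending at a[i-1] and b[j-1]
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for d in range(2, n + m + 1):
        i_lo = max(1, d - m)   # keeps j = d - i within 1..m
        i_hi = min(n, d - 1)
        # On a GPU, each iteration of this inner loop could be one thread.
        for i in range(i_lo, i_hi + 1):
            j = d - i
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                    # start a new local alignment
                H[i - 1][j - 1] + s,  # match or mismatch
                H[i - 1][j] + gap,    # gap in b
                H[i][j - 1] + gap,    # gap in a
            )
            best = max(best, H[i][j])
        # Synchronization point: every cell of anti-diagonal d must be
        # finished before any cell of anti-diagonal d + 1 is computed.
    return best
```

For example, `smith_waterman_wavefront("ACGT", "ACGT")` returns 8 with the default scores (four matches at 2 each), and two sequences with no characters in common score 0. In a GPU kernel, the end of each pass of the outer loop is where a barrier across threads would be needed.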

CONCLUSION

Sequence alignment and genome annotation are among the most commonly used techniques in genetics and bioinformatics. With the emergence of large volumes of genomic sequence data, scientists are confronted with a computational burden in these two fields. Supercomputers ease analyses in these two areas by reducing computational time. Synchronizing threads in supercomputing processes and using massively parallel, distributed-memory supercomputers enable researchers to do comparative genomics on large datasets while reducing computational time.

REFERENCES

Abecasis, G. R., Auton, A., Brooks, L. D., DePristo, M. A., Durbin, R. M., & Handsaker, R. E. et al. (2012). An integrated map of genetic variation from 1,092 human genomes. Nature, 491(7422), 56–65. doi:10.1038/nature11632 PMID:23128226

Academies, T. N. (2003). The future of supercomputing: An interim report (p. 4). Washington, DC: National Academy of Sciences.

Agrawal, A. (2008). A new heuristic for multiple sequence alignment. Paper presented at the Institute of Electrical and Electronics Engineers International Conference, Ames, IA. doi:10.1109/EIT.2008.4554299

Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. Journal of Molecular Biology, 215(3), 403–410. doi:10.1016/S0022-2836(05)80360-2 PMID:2231712

Aluru, S. (2006). A supercomputer for Iowa State University. Retrieved April 17, 2014, from http://www.public.iastate.edu/~nscentral/news/06/jan/supercomputer.shtml

Anderson, S. (1981). Shotgun DNA sequencing using cloned DNase I-generated fragments. Nucleic Acids Research, 9(13), 3015–3027. doi:10.1093/nar/9.13.3015 PMID:6269069

Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., & Cherry, J. M. et al. (2000). Gene ontology: Tool for the unification of biology: The gene ontology consortium. Nature Genetics, 25(1), 25–29. doi:10.1038/75556 PMID:10802651

Ashkenazy, H., Erez, E., Martz, E., Pupko, T., & Ben-Tal, N. (2010). ConSurf 2010: Calculating evolutionary conservation in sequence and structure of proteins and nucleic acids. Nucleic Acids Research, 38(Web Server issue), W529-533. doi:10.1093/nar/gkq399

Bailey, L. C. Jr, Fischer, S., Schug, J., Crabtree, J., Gibson, M., & Overton, G. C. (1998). GAIA: Framework annotation of genomic sequence. Genome Research, 8(3), 234–250. doi:10.1101/gr.8.3.234 PMID:9521927

Baron, H. A. (n.d.). BOXSHADE. Retrieved January 3, 2013, from http://mobyle.pasteur.fr/cgi-bin/portal.py?-forms:boxshade

Benson, D. A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2013). GenBank. Nucleic Acids Research, 41(Database issue), D36–D42. doi:10.1093/nar/gks1195 PMID:23193287

Besemer, J., & Borodovsky, M. (1999). Heuristic approach to deriving models for gene finding. Nucleic Acids Research, 27(19), 3911–3920. doi:10.1093/nar/27.19.3911 PMID:10481031

Birney, E., & Durbin, R. (2000). Using GeneWise in the Drosophila annotation experiment. Genome Research, 10(4), 547–548. doi:10.1101/gr.10.4.547 PMID:10779496

Braberg, H., Webb, B. M., Tjioe, E., Pieper, U., Sali, A., & Madhusudhan, M. S. (2012). SALIGN: A web server for alignment of multiple protein sequences and structures. Bioinformatics (Oxford, England), 28(15), 2072–2073. doi:10.1093/bioinformatics/bts302 PMID:22618536

Bright, L. A., Burgess, S. C., Chowdhary, B., Swiderski, C. E., & McCarthy, F. M. (2009). Structural and functional-annotation of an equine whole genome oligoarray. BMC Bioinformatics, 10(Suppl 11), S8. doi:10.1186/1471-2105-10-S11-S8 PMID:19811692

Brodie, R., Roper, R. L., & Upton, C. (2004). JDotter: A Java interface to multiple dotplots generated by dotter. Bioinformatics (Oxford, England), 20(2), 279–281. doi:10.1093/bioinformatics/btg406 PMID:14734323

Brodsky, L. I., & Vasiliev, A. V., Ya, L. K., Osipov, Y. S., Tatuzov, R. L., & Feranchuk, S. I. (1992). GeneBee: The program package for biopolymer structure analysis. Dimacs, 8, 127–139.

Brown, N. (1996). Consensus. Retrieved January 1, 2013, from http://coot.embl.de/Alignment/consensus.html

Brown, S. M., & Joubert, F. (n.d.). Pairwise sequence alignment. Retrieved February 10, 2013, from http://www.med.nyu.edu/rcr/rcr/course/PairAlign.ppt

Bu, J., Chi, X., & Jin, Z. (2013). HSA: A heuristic splice alignment tool. BMC Systems Biology, 7(Suppl 2), S10. doi:10.1186/1752-0509-7-S2-S10 PMID:24564867

Burge, C., & Karlin, S. (1997). Prediction of complete gene structures in human genomic DNA. Journal of Molecular Biology, 268(1), 78–94. doi:10.1006/jmbi.1997.0951 PMID:9149143

Chang, J. M., Di Tommaso, P., Taly, J. F., & Notredame, C. (2012). Accurate multiple sequence alignment of transmembrane proteins with PSI-Coffee. BMC Bioinformatics, 13(Suppl 4), S1. doi:10.1186/1471-2105-13-S4-S1 PMID:22536955

Chen, E. Y. (1994). The efficiency of automated DNA sequencing. In Automated DNA sequencing and analysis. London: Academic Press Limited.

Chuang, T. J., Lin, W. C., Lee, H. C., Wang, C. W., Hsiao, K. L., & Wang, Z. H. et al. (2003). A complexity reduction algorithm for analysis and annotation of large genomic sequences. Genome Research, 13(2), 313–322. doi:10.1101/gr.313703 PMID:12566410

Church, P. C., Goscinski, A., Holt, K., Inouye, M., Ghoting, A., Makarychev, K., & Reumann, M. (2011). Design of multiple sequence alignment algorithms on parallel, distributed memory supercomputers. In Proceedings of the Institute of Electrical and Electronics Engineers Engineering in Medicine and Biology Society (pp. 924-927). Academic Press. doi:10.1109/IEMBS.2011.6090208

Ciria, R., Abreu-Goodger, C., Morett, E., & Merino, E. (2004). GeConT: Gene context analysis. Bioinformatics (Oxford, England), 20(14), 2307–2308. doi:10.1093/bioinformatics/bth216 PMID:15073003

Consortium, T. E. P. (2011). A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology, 9(4), e1001046. doi:10.1371/journal.pbio.1001046 PMID:21526222

Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Research, 16(22), 10881–10890. doi:10.1093/nar/16.22.10881 PMID:2849754

Darling, A. E., Carey, L., & Feng, W. (2003). The design, implementation, and evaluation of mpiBLAST. In Proceedings of 4th International Conference on Linux Clusters: The HPC Revolution 2003. San Jose, CA: mpiBLAST.
Di Tommaso, P., Moretti, S., Xenarios, I., Orobitg, M., Montanyola, A., Chang, J. M., . . . Notredame, C. (2011). T-coffee: A web server for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension. Nucleic Acids Research, 39(Web Server issue), W13-17. doi:10.1093/nar/gkr245
Do, C. B., Mahabhashyam, M. S., Brudno, M., & Batzoglou, S. (2005). ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Research, 15(2), 330–340. doi:10.1101/gr.2821705 PMID:15687296
DOE/LANL. (2009). Scientists use world’s fastest supercomputer to create the largest HIV evolutionary tree. Retrieved April 17, 2014, from http://www.sciencedaily.com/releases/2009/10/091027161536.htm
Dunham, I., Shimizu, N., Roe, B. A., Chissoe, S., Hunt, A. R., & Collins, J. E. et al. (1999). The DNA sequence of human chromosome 22. Nature, 402(6761), 489–495. doi:10.1038/990031 PMID:10591208
Dunn, J. J., Studier, F. W., & Gottesman, M. (1983). Complete nucleotide sequence of bacteriophage T7 DNA and the locations of T7 genetic elements. Journal of Molecular Biology, 166(4), 477–535. doi:10.1016/S0022-2836(83)80282-4 PMID:6864790
Duret, L., Gasteiger, E., & Perriere, G. (1996). LALNVIEW: A graphical viewer for pairwise sequence alignments. Computer Applications in the Biosciences, 12(6), 507–510. PMID:9021269
Dutheil, J. Y., Gaillard, S., & Stukenbrock, E. H. (2014). MafFilter: A highly flexible and extensible multiple genome alignment files processor. BMC Genomics, 15(1), 53. doi:10.1186/1471-2164-15-53 PMID:24447531
ENCODE. (2004). The ENCODE (encyclopedia of DNA elements) project. Science, 306(5696), 636–640. doi:10.1126/science.1105136 PMID:15499007
Fan, L., Hui, J. H., Yu, Z. G., & Chu, K. H. (2014). VIP barcoding: Composition vector-based software for rapid species identification based on DNA barcoding. Molecular Ecology Resources, 14(4), 871–881. doi:10.1111/1755-0998.12235 PMID:24479510
Flicek, P., Amode, M. R., Barrell, D., Beal, K., Brent, S., & Chen, Y. et al. (2011). Ensembl 2011. Nucleic Acids Research, 39(Database issue), D800–D806. doi:10.1093/nar/gkq1064 PMID:21045057
Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M., & Dubchak, I. (2004). VISTA: Computational tools for comparative genomics. Nucleic Acids Research, 32(Web Server issue), W273-279. doi:10.1093/nar/gkh458
Frishman, D. (2007). Protein annotation at genomic scale: The current status. Chemical Reviews, 107(8), 3448–3466. doi:10.1021/cr068303k PMID:17658902
Galperin, M. Y., & Koonin, E. V. (2004). ‘Conserved hypothetical’ proteins: Prioritization of targets for experimental study. Nucleic Acids Research, 32(18), 5452–5463. doi:10.1093/nar/gkh885 PMID:15479782
Gaspar, P., Lopes, P., Oliveira, J., Santos, R., Dalgleish, R., & Oliveira, J. L. (2014). Variobox: Automatic detection and annotation of human genetic variants. Human Mutation, 35(2), 202–207. doi:10.1002/humu.22474 PMID:24186831


Gelfand, M. S., Mironov, A. A., & Pevzner, P. A. (1996). Gene recognition via spliced sequence alignment. Proceedings of the National Academy of Sciences of the United States of America, 93(17), 9061–9066. doi:10.1073/pnas.93.17.9061 PMID:8799154
Gille, C., Birgit, W., & Gille, A. (2014). Sequence alignment visualization in HTML5 without Java. Bioinformatics (Oxford, England), 30(1), 121–122. doi:10.1093/bioinformatics/btt614 PMID:24273246
Godzik, A. (2012). FFAS fold and function alignment. Retrieved December 30, 2012, from http://ffas.sanfordburnham.org/ffas-cgi/cgi/ffas.pl
Gouet, P., Courcelle, E., Stuart, D. I., & Metoz, F. (1999). ESPript: Analysis of multiple sequence alignments in PostScript. Bioinformatics (Oxford, England), 15(4), 305–308. doi:10.1093/bioinformatics/15.4.305 PMID:10320398
Gremme, G., Steinbiss, S., & Kurtz, S. (2013). GenomeTools: A comprehensive software library for efficient processing of structured genome annotations. Institute of Electrical and Electronics Engineers/Association for Computing Machinery Transactions on Computational Biology and Bioinformatics, 10(3), 645-656. doi:10.1109/TCBB.2013.68
HPC Service Will be Used for Genome Annotation System. (2006). Retrieved January 20, 2013, from http://www.hpcwire.com
Huang, S., Zhang, J., Li, R., Zhang, W., He, Z., & Lam, T. W. et al. (2011). SOAPsplice: Genome-wide ab initio detection of splice junctions from RNA-Seq data. Frontiers in Genetics, 2, 46. doi:10.3389/fgene.2011.00046 PMID:22303342
Huang, X., Adams, M. D., Zhou, H., & Kerlavage, A. R. (1997). A tool for analyzing and annotating genomic sequences. Genomics, 46(1), 37–45. doi:10.1006/geno.1997.4984 PMID:9403056
Huang, X., & Miller, W. (1991). A time-efficient linear-space local similarity algorithm. Advances in Applied Mathematics, 12(3), 337–357. doi:10.1016/0196-8858(91)90017-D
Huang da, W., Sherman, B. T., Tan, Q., Kir, J., Liu, D., Bryant, D., . . . Lempicki, R. A. (2007). Bioinformatics resources: Expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Research, 35(Web Server issue), W169-175. doi:10.1093/nar/gkm415
Hyatt, D., Snoddy, J., Schmoyer, D., Chen, G., Fischer, K., Parang, M., et al. (2000). Improved analysis and annotation tools for whole-genome computational annotation and analysis: GRAIL-EXP genome analysis toolkit and related analysis tools. In Genome Sequencing & Biology Meeting.
Information, N. C. f. B. Align sequences nucleotide BLAST. Retrieved December 30, 2012, from http://blast.ncbi.nlm.nih.gov/
Information, N. C. f. B. (2011). GenBank. Retrieved December 28, 2012
Intel. (2013). Intel powers the world’s fastest supercomputer, reveals new and future high performance computing technologies. Retrieved April 17, 2014, from http://www.intc.com/releasedetail.cfm?ReleaseID=774058
ISU. (2006). CyBlue - Blue gene supercomputer. Retrieved April 17, 2014, from http://bluegene.ece.iastate.edu
Jager, M., Wang, K., Bauer, S., Smedley, D., Krawitz, P., & Robinson, P. N. (2014). Jannovar: A Java library for exome annotation. Human Mutation, 35(5), 548–555. doi:10.1002/humu.22531 PMID:24677618
Jaroszewski, L., Li, Z., Cai, X. H., Weber, C., & Godzik, A. (2011). FFAS server: Novel features and applications. Nucleic Acids Research, 39(Web Server issue), W38-44. doi:10.1093/nar/gkr441


Jeon, Y. S., Lee, K., Park, S. C., Kim, B. S., Cho, Y. J., Ha, S. M., & Chun, J. (2014). EzEditor: A versatile sequence alignment editor for both rRNA- and protein-coding genes. International Journal of Systematic and Evolutionary Microbiology, 64(Pt 2), 689–691. doi:10.1099/ijs.0.059360-0 PMID:24425826
Junier, T., & Pagni, M. (2000). Dotlet: Diagonal plots in a web browser. Bioinformatics (Oxford, England), 16(2), 178–179. doi:10.1093/bioinformatics/16.2.178 PMID:10842741
Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, D. (2002). The human genome browser at UCSC. Genome Research, 12(6), 996-1006. doi:10.1101/gr.229102
Khajeh-Saeed, A., & Blair Perot, J. (2011). GPU-supercomputer acceleration of pattern matching. In W. W. Hwu (Ed.), GPU computing gems (Vol. 2, pp. 185–198). Morgan Kaufmann. doi:10.1016/B978-0-12-384988-5.00013-9
Kim, D., Pertea, G., Trapnell, C., Pimentel, H., Kelley, R., & Salzberg, S. L. (2013). TopHat2: Accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biology, 14(4), R36. doi:10.1186/gb-2013-14-4-r36 PMID:23618408
Kleinjung, J., Douglas, N., & Heringa, J. (2002). Parallelized multiple alignment. Bioinformatics (Oxford, England), 18(9), 1270–1271. doi:10.1093/bioinformatics/18.9.1270 PMID:12217922
Koyanagi, R., Takeuchi, T., Hisata, K., Gyoja, F., Shoguchi, E., Satoh, N., & Kawashima, T. (2013). MarinegenomicsDB: An integrated genome viewer for community-based annotation of genomes. Zoological Science, 30(10), 797–800. doi:10.2108/zsj.30.797 PMID:24125644
Krogh, A. (1997). Two methods for improving performance of an HMM and their application for gene finding. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5, 179–186. PMID:9322033
Krogh, A. (1998). An introduction to hidden Markov models for biological sequences. In Computational methods in molecular biology (pp. 45-63). Amsterdam: Elsevier. doi:10.1016/S0167-7306(08)60461-5
Krogh, A. (2000). Using database matches with HMMGene for automated gene detection in Drosophila. Genome Research, 10(4), 523–528. doi:10.1101/gr.10.4.523 PMID:10779492
Larkin, M. A., Blackshields, G., Brown, N. P., Chenna, R., McGettigan, P. A., & McWilliam, H. et al. (2007). Clustal W and Clustal X version 2.0. Bioinformatics (Oxford, England), 23(21), 2947–2948. doi:10.1093/bioinformatics/btm404 PMID:17846036
Lin, H., Ma, X., Chandramohan, P., Geist, A., & Samatova, N. (2005). Efficient data access for parallel BLAST. Academic Press.
Liu, J., Xiao, H., Huang, S., & Li, F. (2014). OMIGA: Optimized maker-based insect genome annotation. Molecular Genetics and Genomics, 289(4), 567–573. doi:10.1007/s00438-014-0831-7 PMID:24609470
Lloyd, S. (2010). Parallel multiple sequence alignment: An overview. Retrieved January 6, 2013, from http://dna.cs.byu.edu/msa/overview.pdf
Lohse, M., Nagel, A., Herter, T., May, P., Schroda, M., & Zrenner, R. et al. (2014). Mercator: A fast and simple web server for genome scale functional annotation of plant sequence data. Plant, Cell & Environment, 37(5), 1250–1258. doi:10.1111/pce.12231 PMID:24237261


Loytynoja, A. (2014). Phylogeny-aware alignment with PRANK. Methods in Molecular Biology (Clifton, N.J.), 1079, 155–170. doi:10.1007/978-1-62703-646-7_10 PMID:24170401
Loytynoja, A., & Goldman, N. (2010). webPRANK: A phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics, 11(1), 579. doi:10.1186/1471-2105-11-579 PMID:21110866
Lukashin, A. V., & Borodovsky, M. (1998). GeneMark.hmm: New solutions for gene finding. Nucleic Acids Research, 26(4), 1107–1115. doi:10.1093/nar/26.4.1107 PMID:9461475
Ma, J., Wang, S., Wang, Z., & Xu, J. (2014). MRFalign: Protein homology detection through alignment of Markov random fields. PLoS Computational Biology, 10(3), e1003500. doi:10.1371/journal.pcbi.1003500 PMID:24675572
Mahadevan, P., King, J. F., & Seto, D. (2009a). CGUG: In silico proteome and genome parsing tool for the determination of “core” and unique genes in the analysis of genomes up to ca. 1.9 Mb. BMC Research Notes, 2(1), 168. doi:10.1186/1756-0500-2-168 PMID:19706165
Mahadevan, P., King, J. F., & Seto, D. (2009b). Data mining pathogen genomes using GeneOrder and CoreGenes and CGUG: Gene order, synteny and in silico proteomes. International Journal of Computational Biology and Drug Design, 2(1), 100–114. doi:10.1504/IJCBDD.2009.027586 PMID:20054988
Mahadevan, P., & Seto, D. (2010). Rapid pairwise synteny analysis of large bacterial genomes using web-based GeneOrder4.0. BMC Research Notes, 3(1), 41. doi:10.1186/1756-0500-3-41 PMID:20178631
Maxam, A. M., & Gilbert, W. (1977). A new method for sequencing DNA. Proceedings of the National Academy of Sciences of the United States of America, 74(2), 560–564. doi:10.1073/pnas.74.2.560 PMID:265521
Mazumder, R., Kolaskar, A., & Seto, D. (2001). GeneOrder: Comparing the order of genes in small genomes. Bioinformatics (Oxford, England), 17(2), 162–166. doi:10.1093/bioinformatics/17.2.162 PMID:11238072
Milanesi, L. K. N. A., Rogozin, I. B., Ischenko, I. V., Kel, A. E., Orlov Yu, L., Ponomarenko, M. P., & Vezzoni, P. (1993). GenView: A computing tool for protein-coding regions prediction in nucleotide sequences. In Proceedings of the Second International Conference on Bioinformatics, Supercomputing and Complex Genome Analysis. Singapore: World Scientific Publishing. doi:10.1142/9789814503655_0048
Mironov, A. A., Roytberg, M. A., Pevzner, P. A., & Gelfand, M. S. (1998). Performance-guarantee gene predictions via spliced alignment. Genomics, 51(3), 332–339. doi:10.1006/geno.1998.5251 PMID:9721203
Morgenstern, B. (2014). Multiple sequence alignment with DIALIGN. Methods in Molecular Biology (Clifton, N.J.), 1079, 191–202. doi:10.1007/978-1-62703-646-7_12 PMID:24170403
Mount, D. M. (2004). Bioinformatics: Sequence and genome analysis (2nd ed.). Cold Spring Harbor, NY: Cold Spring Harbor Laboratory Press.
Ning, Z., Cox, A. J., & Mullikin, J. C. (2001). SSAHA: A fast search method for large DNA databases. Genome Research, 11(10), 1725–1729. doi:10.1101/gr.194201 PMID:11591649
Noe, L., & Kucherov, G. (2005). YASS: Enhancing the sensitivity of DNA similarity search. Nucleic Acids Research, 33(Web Server issue), W540-543. doi:10.1093/nar/gki478


Ovcharenko, I., Loots, G. G., Hardison, R. C., Miller, W., & Stubbs, L. (2004). zPicture: Dynamic alignment and visualization tool for analyzing conservation profiles. Genome Research, 14(3), 472–477. doi:10.1101/gr.2129504 PMID:14993211
Pachter, L., Batzoglou, S., Spitkovsky, V. I., Banks, E., Lander, E. S., Kleitman, D. J., & Berger, B. (1999). A dictionary-based approach for gene annotation. Journal of Computational Biology, 6(3-4), 419–430. doi:10.1089/106652799318364 PMID:10582576
Paquete, L., Matias, P., Abbasi, M., & Pinheiro, M. (2014). MOSAL: Software tools for multiobjective sequence alignment. Source Code for Biology and Medicine, 9(1), 2. doi:10.1186/1751-0473-9-2 PMID:24401750
Parra, G., Blanco, E., & Guigo, R. (2000). GeneID in Drosophila. Genome Research, 10(4), 511–515. doi:10.1101/gr.10.4.511 PMID:10779490
Pearson, W. (1991). LALIGN - Find multiple matching subsegments in two sequences. Retrieved December 29, 2012, from http://www.ch.embnet.org/software/LALIGN_form.html
Pearson, W. R. (2006a). FASTA sequence comparison at the University of Virginia. Retrieved December 30, 2012, from http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
Pearson, W. R. (2006b). LALIGN/PLALIGN. Retrieved December 30, 2012, from http://fasta.bioch.virginia.edu/fasta_www2/fasta_www.cgi?rm=lalign
Pedersen, B. S., Yang, I. V., & De, S. (2013). CruzDB: Software for annotation of genomic intervals with UCSC genome-browser database. Bioinformatics (Oxford, England), 29(23), 3003–3006. doi:10.1093/bioinformatics/btt534 PMID:24037212
Pei, J., & Grishin, N. V. (2007). PROMALS: Towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics (Oxford, England), 23(7), 802–808. doi:10.1093/bioinformatics/btm017 PMID:17267437
Penn, O., Privman, E., Ashkenazy, H., Landan, G., Graur, D., & Pupko, T. (2010). GUIDANCE: A web server for assessing alignment confidence scores. Nucleic Acids Research, 38(Web Server issue), W23-28. doi:10.1093/nar/gkq443
Pevsner, J. (2009). Bioinformatics and functional genomics. Hoboken, NJ: Wiley-Blackwell. doi:10.1002/9780470451496
Plewniak, F., Bianchetti, L., Brelivet, Y., Carles, A., Chalmel, F., & Lecompte, O. et al. (2003). PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Research, 31(13), 3829–3832. doi:10.1093/nar/gkg518 PMID:12824430
Portal, E. B. R. (n.d.). SIM - Alignment tool for protein sequences. Retrieved December 30, 2012, from http://web.expasy.org/sim/
Puckelwartz, M. J., Pesce, L. L., Nelakuditi, V., Dellefave-Castillo, L., Golbus, J. R., & Day, S. M. et al. (2014). Supercomputing for the parallelization of whole genome analysis. Bioinformatics (Oxford, England), 30(11), 1508–1513. doi:10.1093/bioinformatics/btu071 PMID:24526712
Rath, J. (2014). NERSC flips the switch on new Edison supercomputer. Retrieved April 17, 2014, from http://www.datacenterknowledge.com/archives/2014/01/31/nersc-flips-switch-new-edison-supercomputer/
Reese, M. G., Kulp, D., Tammana, H., & Haussler, D. (2000). Genie--gene finding in Drosophila melanogaster. Genome Research, 10(4), 529–538. doi:10.1101/gr.10.4.529 PMID:10779493


Rice, P., Longden, I., & Bleasby, A. (2000). EMBOSS: The European molecular biology open software suite. Trends in Genetics, 16(6), 276–277. doi:10.1016/S0168-9525(00)02024-2 PMID:10827456
Roberts, R. J. (2004). Identifying protein function--A call for community action. PLoS Biology, 2(3), E42. doi:10.1371/journal.pbio.0020042 PMID:15024411
Russell, D. J. (2014). GramAlign: Fast alignment driven by grammar-based phylogeny. Methods in Molecular Biology (Clifton, N.J.), 1079, 171–189. doi:10.1007/978-1-62703-646-7_11 PMID:24170402
Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M. A., & Barrell, B. (2000). Artemis: Sequence visualization and annotation. Bioinformatics (Oxford, England), 16(10), 944–945. doi:10.1093/bioinformatics/16.10.944 PMID:11120685
Salamov, A. A., & Solovyev, V. V. (2000). Ab initio gene finding in Drosophila genomic DNA. Genome Research, 10(4), 516–522. doi:10.1101/gr.10.4.516 PMID:10779491
Sanger, F., Nicklen, S., & Coulson, A. R. (1977). DNA sequencing with chain-terminating inhibitors. Proceedings of the National Academy of Sciences of the United States of America, 74(12), 5463–5467. doi:10.1073/pnas.74.12.5463 PMID:271968
Santos, A. R., Barbosa, E., Fiaux, K., Zurita-Turk, M., Chaitankar, V., & Kamapantula, B. et al. (2013). PANNOTATOR: An automated tool for annotation of pan-genomes. Genetics and Molecular Research, 12(3), 2982–2989. doi:10.4238/2013.August.16.2 PMID:24065654
Schmutz, J., Grimwood, J., & Myers, R. M. (2004). Sequence finishing. Methods in Molecular Biology (Clifton, N.J.), 255, 333–342. doi:10.1385/1-59259-752-1:333 PMID:15020836
Schnattinger, T., Schoning, U., Marchfelder, A., & Kestler, H. A. (2013). RNA-Pareto: Interactive analysis of Pareto-optimal RNA sequence-structure alignments. Bioinformatics (Oxford, England), 29(23), 3102–3104. doi:10.1093/bioinformatics/btt536 PMID:24045774
Schuler, G. D. (1997). Sequence mapping by electronic PCR. Genome Research, 7(5), 541–550. PMID:9149949
Schwartz, S., Zhang, Z., Frazer, K. A., Smit, A., Riemer, C., & Bouck, J. et al. (2000). PipMaker--a web server for aligning two genomic DNA sequences. Genome Research, 10(4), 577–586. doi:10.1101/gr.10.4.577 PMID:10779500
SDSC. (2014). San Diego supercomputer center. Retrieved April 17, 2014, from http://www.sdsc.edu/supercomputing/gordon/
Seemann, T. (2014). Prokka: Rapid prokaryotic genome annotation. Bioinformatics (Oxford, England), 30(14), 2068–2069. doi:10.1093/bioinformatics/btu153 PMID:24642063
Shahid, S., & Axtell, M. J. (2013). Identification and annotation of small RNA genes using ShortStack. Methods (San Diego, Calif.). doi:10.1016/j.ymeth.2013.10.004 PMID:24139974
Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., & Li, W. et al. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(1), 539. doi:10.1038/msb.2011.75 PMID:21988835
Simossis, V. A., & Heringa, J. (2005). PRALINE: A multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Research, 33(Web Server issue), W289-294. doi:10.1093/nar/gki390


Smith, C., Heyne, S., Richter, A. S., Will, S., & Backofen, R. (2010). Freiburg RNA Tools: A web server integrating INTARNA, EXPARNA and LOCARNA. Nucleic Acids Research, 38(Web Server issue), W373-377. doi:10.1093/nar/gkq316
Softberry, I. (2007). SCAN2. Mount Kisco, NY: Softberry, Inc. Retrieved April 17, 2014, from http://linux1.softberry.com/
Solovyev, V. V., Salamov, A. A., & Lawrence, C. B. (1994). Predicting internal exons by oligonucleotide composition and discriminant analysis of spliceable open reading frames. Nucleic Acids Research, 22(24), 5156–5163. doi:10.1093/nar/22.24.5156 PMID:7816600
Solovyev, V. V., Salamov, A. A., & Lawrence, C. B. (1995). Identification of human gene structure using linear discriminant functions and dynamic programming. Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 3, 367–375. PMID:7584460
Stein, L. (2001). Genome annotation: From sequence to biology. Nature Reviews. Genetics, 2(7), 493–503. doi:10.1038/35080529 PMID:11433356
Sturrock, S., & Collins, J. (1993). MPsrch version 1.3. Biocomputing Research Unit, University of Edinburgh. Retrieved April 17, 2014, from http://www.ebi.ac.uk/Tools/MPsrch/
Su, X., Pan, W., Song, B., Xu, J., & Ning, K. (2014). Parallel-META 2.0: Enhanced metagenomic data analysis with functional annotation, high performance computing and advanced visualization. PLoS ONE, 9(3), e89323. doi:10.1371/journal.pone.0089323 PMID:24595159
Subramanian, A. R., Kaufmann, M., & Morgenstern, B. (2008). DIALIGN-TX: Greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms for Molecular Biology; AMB, 3(1), 6. doi:10.1186/1748-7188-3-6 PMID:18505568
Sze, S. H., & Pevzner, P. A. (1997). Las Vegas algorithms for gene recognition: Suboptimal and error-tolerant spliced alignment. Journal of Computational Biology, 4(3), 297–309. doi:10.1089/cmb.1997.4.297 PMID:9278061
Thompson, J. D., Plewniak, F., Thierry, J., & Poch, O. (2000). DbClustal: Rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Research, 28(15), 2919–2926. doi:10.1093/nar/28.15.2919 PMID:10908355
Thorsen, O., Smith, B., Sosa, C. P., Jiang, K., Lin, H., Peters, A., & Feng, W. (2007). Parallel genomic sequence-search on a massively parallel system. New York, NY: Academic Press. doi:10.1145/1242531.1242542
Top500. (2008). Top500 June 2008: Roadrunner - BladeCenter QS22/LS21 cluster, PowerXCell 8i 3.2 GHz / Opteron DC 1.8 GHz, Voltaire infiniband. Retrieved April 17, 2014, from http://www.top500.org/system/176026
Top500. (2013). Top500: Endeavor - Intel cluster. Retrieved April 17, 2014, from http://www.top500.org/system/176908
Troshin, P. V., Procter, J. B., & Barton, G. J. (2011). Java bioinformatics analysis web services for multiple sequence alignment--JABAWS:MSA. Bioinformatics (Oxford, England), 27(14), 2001–2002. doi:10.1093/bioinformatics/btr304 PMID:21593132


Uberbacher, E. C., & Mural, R. J. (1991). Locating protein-coding regions in human DNA sequences by a multiple sensor-neural network approach. Proceedings of the National Academy of Sciences of the United States of America, 88(24), 11261–11265. doi:10.1073/pnas.88.24.11261 PMID:1763041
Uberbacher, E. C., Xu, Y., & Mural, R. J. (1996). Discovering and understanding genes in human DNA sequence using GRAIL. Methods in Enzymology, 266, 259–281. doi:10.1016/S0076-6879(96)66018-2 PMID:8743689
Varadarajan, S. (2004). System X: Building the Virginia Tech supercomputer. Paper presented at the 13th International Conference on Computer Communications and Networks. New York, NY. doi:10.1109/ICCCN.2004.1401571
Vermij, E. P. (2011). Genetic sequence alignment on a supercomputing platform. Netherlands: TU Delft.
Warren, A. S., Archuleta, J., Feng, W. C., & Setubal, J. C. (2010). Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics, 11(1), 131. doi:10.1186/1471-2105-11-131 PMID:20230630
Webb-Roberts, B.-J. (2004). Protein & DNA sequence analysis. Retrieved February 10, 2013, from http://www.sysbio.org/resources/tutorials/sequence_analysis_webb.pdf
Wood, D. E., & Salzberg, S. L. (2014). Kraken: Ultrafast metagenomic sequence classification using exact alignments. Genome Biology, 15(3), R46. doi:10.1186/gb-2014-15-3-r46 PMID:24580807
Xu, Y., Mural, R., Shah, M., & Uberbacher, E. (1994). Recognizing exons in genomic sequence using GRAIL II. Genetic Engineering, 16, 241–253. PMID:7765200
Ye, Y., Wei, B., Wen, L., & Rayner, S. (2013). BlastGraph: A comparative genomics tool based on BLAST and graph algorithms. Bioinformatics (Oxford, England), 29(24), 3222–3224. doi:10.1093/bioinformatics/btt553 PMID:24068035
Yeh, R. F., Lim, L. P., & Burge, C. B. (2001). Computational inference of homologous gene structures in the human genome. Genome Research, 11(5), 803–816. doi:10.1101/gr.175701 PMID:11337476
Zafar, N., Mazumder, R., & Seto, D. (2001). Comparisons of gene colinearity in genomes using GeneOrder2.0. Trends in Biochemical Sciences, 26(8), 514–516. doi:10.1016/S0968-0004(01)01881-3 PMID:11504629
Zafar, N., Mazumder, R., & Seto, D. (2002). CoreGenes: A computational tool for identifying and cataloging “core” genes in a set of small genomes. BMC Bioinformatics, 3(1), 12. doi:10.1186/1471-2105-3-12 PMID:11972896
Zammataro, L., DeMolfetta, R., Bucci, G., Ceol, A., & Muller, H. (2014). AnnotateGenomicRegions: A web application. BMC Bioinformatics, 15(Suppl 1), S8. doi:10.1186/1471-2105-15-S1-S8 PMID:24564446
Zhang, M. Q. (1997). Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proceedings of the National Academy of Sciences of the United States of America, 94(2), 565–568. doi:10.1073/pnas.94.2.565 PMID:9012824
Zhang, Y., & Waterman, M. S. (2003). DNA sequence assembly and multiple sequence alignment by an Eulerian path approach. Cold Spring Harbor Symposia on Quantitative Biology, 68(0), 205–212. doi:10.1101/sqb.2003.68.205 PMID:15338619


Zhao, K., & Chu, X. (2014). G-BLASTN: Accelerating nucleotide alignment by graphics processors. Bioinformatics (Oxford, England), 30(10), 1384–1391. doi:10.1093/bioinformatics/btu047 PMID:24463183
Zola, J., Yang, X., Rospondek, A., & Aluru, S. (2007). Parallel T-coffee: A parallel multiple sequence aligner. In Proceedings of the ISCA 20th International Conference on Parallel and Distributed Computing Systems. Academic Press.
Zuegge, J., Ebeling, M., & Schneider, G. (2001). H-BloX: Visualizing alignment block entropies. Journal of Molecular Graphics & Modelling, 19(3-4), 304–306, 379. doi:10.1016/S1093-3263(00)00074-7 PMID:11449568
Zverina, J. (2014). SDSC assists in whole-genome sequencing analysis under collaboration with Janssen. Retrieved April 17, 2014, from http://ucsdnews.ucsd.edu/pressrelease/sdsc_assists_in_whole_genome_sequencing_analysis_under_collaboration_with_j

KEY TERMS AND DEFINITIONS

Allele Frequency: In a population, this refers to the percentage of all the alleles at a locus accounted for by one specific allele.
Bacteriophage: Virus that infects and replicates within bacteria.
Comparative Genomics: Study that involves the comparison of the genomic sequences of different species.
Complementary DNA (cDNA): DNA derived from messenger RNA (mRNA), which can be obtained from prokaryotes or eukaryotes and is often utilized to clone eukaryotic genes in prokaryotes.
Cosmid: A plasmid vector containing a bacteriophage lambda cos site, which directs insertion of DNA into phage particles.
CpG Elements: Genomic regions consisting of a high frequency of CpG sites. A CpG site refers to a genomic region where a cytosine nucleotide exists next to a guanine nucleotide in the linear sequence of bases along its length.
Dicodon Statistics: Statistics used for the prediction of splice signals and coding regions.
Discriminant Function: A function of a set of variables that is evaluated for samples of events or objects and used as an aid in classifying them.
DNA: Deoxyribonucleic acid, a molecule that serves as the hereditary material in humans and almost all other organisms.
DNA Polymerases: Enzymes that catalyze the polymerization of deoxyribonucleotides into a DNA strand.
Evolution: Gradual unfolding of new varieties of life from previous forms over long periods of time; from the modern genetic perspective, it is defined as a change in allele frequency from one generation to the next.
Exon: DNA nucleotide sequence that carries the code for the final mRNA and thus determines the amino acid sequence of the encoded protein.
Expressed Sequence Tags (ESTs): Small pieces of DNA sequence (usually 200 to 500 nucleotides long) generated by sequencing either one or both ends of an expressed gene.
Field-Programmable Gate Arrays (FPGA): An integrated circuit that can be programmed in the field after manufacture.
Functional Annotation: The process of collecting information about a gene and describing its biological identity.
GenBank: The NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Benson et al., 2013).
Genetic Association: Statistical phenomenon that associates a specific disease with one or more genes.
Genomics: Field of study focusing on genes, their functions, and related techniques.
Genotype Frequency: Number of individuals possessing a given genotype divided by the total number of individuals in the sample.


Genotype: An organism’s entire genetic makeup, or the alleles it carries at a specific genetic locus.
Graphics Processing Unit (GPU): A programmable logic chip that renders animation, images, and video for the computer screen.
Heuristic Methods: Methods that facilitate learning, discovery, or problem-solving through experimental or trial-and-error approaches.
Homology: The relationship between two molecules that share a common ancestor.
Intergenic: A region found between two genes.
Intron: A noncoding sequence between two coding genomic sequences.
Markov Model System: Mathematical model that allows the study of complex systems by establishing a state of the system and then effecting a transition to a new state, such a transition being dependent only on the values of the current state, and not dependent on the previous history of the system up to that point.
Mendel’s Laws: Three laws of inheritance describing how genes are passed on from parents to offspring.
Meta-Analysis: A method of combining quantitative or qualitative datasets from different studies to determine a single conclusion with greater statistical power.
Multiprocessor: A computer system with more than one central processing unit (CPU) that share main memory.
Nucleotide: Building blocks from which DNA and RNA are built.
Open Reading Frame (ORF): A DNA sequence that does not contain a stop codon in a given reading frame.
Parallel Algorithm: An algorithm that can be executed a piece at a time on different processing devices, with the pieces put back together at the end to determine the correct result.
Parallel Programming: Computational method that allows multiple calculations to be carried out simultaneously.
Pattern Recognition: Method that deals with feature extraction and classification.
Peptide Sequence: Unique amino acid sequence characterizing a given protein.
Population Genetics: The study concerned mainly with the genetic variation within species.
Prokaryote: Organism lacking a cell nucleus.
Promoters: DNA segment usually occurring upstream from a gene coding region and acting as a controlling element in gene expression.
Reduced Instruction Set Computer (RISC): A type of computer architecture that has a relatively small set of computer instructions that it can perform.
Regulatory Elements: DNA sequence that determines the regulation of gene expression.
Sequence Alignment: A method of arranging RNA, protein, or DNA sequences to identify regions of similarity that may be a consequence of functional, structural, or evolutionary relationships between the sequences.
Sequence Assembly: A method of determining the order of multiple sequenced DNA fragments.
Shotgun Sequencing: Laboratory technique for determining the DNA sequence of an organism’s genome.
Single Instruction, Multiple Data (SIMD) Processing: Processing technique in which a single instruction applies one operation to more than one set of data elements at the same time.
Splicing: The process of inserting DNA or RNA fragments to form new genetic combinations or to alter a genetic structure.
Start Codon: The first codon of an mRNA transcript translated by a ribosome.
Stop Codon: The genetic codon in an mRNA that signals the termination of protein synthesis during translation.
Structural Annotation: The process of localizing the genes in both strands of a genome as well as precisely determining the structural elements of these genes.


Supernode: Any node that also serves as one of the network’s relayers and proxy servers, handling data flow and connections for other users.
Traditional GenBank: Divisions that contain 106 billion nucleotide bases from 108 million individual sequences, with 11 million new sequences added in 2009.
Transposon: DNA segment consisting of an insertion sequence element at each end as a repeat as well as genes specific to some other activity such as resistance to antibiotics.
Whole Genome Shotgun: A method of genome sequence determination based on assembly of the whole genome from numerous sequence reads at high coverage without requiring reference to genetic or physical map locations for those reads.
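
As a rough illustration of the Open Reading Frame, Start Codon, and Stop Codon entries above, the following Python sketch splits a DNA sequence into codons in a chosen reading frame and reports the maximal stretches that contain no stop codon. The function name `stop_free_stretches`, the example sequence, and the use of the three stop codons of the standard genetic code (TAA, TAG, TGA) are assumptions made for this illustration and are not taken from the chapter; practical gene finders apply many further criteria, such as requiring a start codon and a minimum length.

```python
# Illustrative sketch only: the function name and example sequence are hypothetical.
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code (DNA alphabet)


def stop_free_stretches(dna, frame=0):
    """Return maximal runs of codons with no stop codon in the given frame (0, 1, or 2)."""
    if frame not in (0, 1, 2):
        raise ValueError("frame must be 0, 1, or 2")
    dna = dna.upper()
    stretches = []
    current = []
    # Walk the sequence three bases at a time, ignoring a trailing partial codon.
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            # A stop codon closes the current stretch, if there is one.
            if current:
                stretches.append("".join(current))
            current = []
        else:
            current.append(codon)
    if current:
        stretches.append("".join(current))
    return stretches


if __name__ == "__main__":
    example = "ATGAAATAGGGGCCC"
    for frame in range(3):
        print(frame, stop_free_stretches(example, frame))
```

In frame 0 the example sequence yields two stop-free stretches, ATGAAA and GGGCCC, separated by the stop codon TAG; shifting the frame changes where the stop codons fall, which is why the glossary definition is tied to "a given reading frame".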
