(Methods in Molecular Biology 1525) Jonathan M. Keith (Eds.) - Bioinformatics - Volume I - Data, Sequence Analysis, and Evolution-Humana Press (2017)
Bioinformatics
Volume I:
Data, Sequence Analysis,
and Evolution
Second Edition
METHODS IN MOLECULAR BIOLOGY
Series Editor
John M. Walker
School of Life and Medical Sciences
University of Hertfordshire
Hatfield, Hertfordshire, AL10 9AB, UK
Edited by
Jonathan M. Keith
Monash University
Melbourne, VIC, Australia
Bioinformatics sits at the intersection of four major scientific disciplines: biology,
mathematics, statistics, and computer science. That’s a very busy intersection, and many volumes
would be required to provide a comprehensive review of the state-of-the-art methodologies
used in bioinformatics today. That is not what this concise two-volume work of contributed
chapters attempts to do; rather, it provides a broad sampling of some of the most useful and
interesting current methods in this rapidly developing and expanding field.
As with other volumes in Methods in Molecular Biology, the focus is on providing
practical guidance for implementing methods, using the kinds of tricks and tips that are
rarely documented in textbooks or journal articles, but are nevertheless widely known and
used by practitioners, and important for getting the most out of a method. The sharing of
such expertise within the community of bioinformatics users and developers is an important
part of the growth and maturation of the subject. These volumes are therefore aimed
principally at graduate students, early career researchers, and others who are in the process
of integrating new bioinformatics methods into their research.
Much has happened in bioinformatics since the first edition of this work appeared in
2008, yet much of the methodology and practical advice contained in that edition remains
useful and current. This second edition therefore aims to complement, rather than
supersede, the first. Some of the chapters are revised and expanded versions of chapters from the
first edition, but most are entirely new, and all are intended to focus on more recent
developments.
Volume 1 comprises three parts: Data and Databases; Sequence Analysis; and
Phylogenetics and Evolution. The first part looks at bioinformatics methodologies of crucial
importance in the generation of sequence and structural data, and its organization into
conceptual categories and databases to facilitate further analyses. The Sequence Analysis part
describes some of the fundamental methodologies for processing the sequences of biological
molecules: techniques that are used in almost every pipeline of bioinformatics analysis,
particularly in the preliminary stages of such pipelines. Phylogenetics and Evolution deals
with methodologies that compare biological sequences for the purpose of understanding
how they evolved. This is a fundamental and interesting endeavor in its own right but is also
a crucial step towards understanding the functions of biological molecules and the nature of
their interactions, since those functions and interactions are essentially products of their
history.
Volume 2 also comprises three parts: Structure, Function, Pathways and Networks;
Applications; and Computational Methods. The first of these parts looks at methodologies
for understanding biological molecules as systems of interacting elements. This is a core task
of bioinformatics and is the aspect of the field that attempts to bridge the vast gap between
genotype and phenotype. The Applications part can only hope to cover a small number of
the numerous applications of bioinformatics. It includes chapters on the analysis of
genome-wide association data, computational diagnostics, and drug discovery. The final part
describes four broadly applicable computational methods, the scope of which far exceeds
that of bioinformatics, but which have nevertheless been crucial to this field. These are
modeling and inference, clustering, parameterized algorithmics, and visualization.
Genome Sequencing
Mansi Verma, Samarth Kulshrestha, and Ayush Puri
Abstract
Genome sequencing is an important step toward correlating genotypes with phenotypic characters.
Sequencing technologies are important in many fields in the life sciences, including functional genomics,
transcriptomics, oncology, evolutionary biology, and forensic sciences. The era of sequencing
has been divided into three generations. First generation sequencing involved sequencing by synthesis
(Sanger sequencing) and sequencing by cleavage (Maxam-Gilbert sequencing). Sanger sequencing led to
the completion of various genome sequences (including human) and provided the foundation for the
development of other sequencing technologies. Since then, various techniques have been developed that
overcome some of the limitations of Sanger sequencing. These techniques are collectively known as
“Next-generation sequencing” (NGS), and are further classified into second and third generation technologies.
Although NGS methods have many advantages in terms of speed, cost, and parallelism, the accuracy and
read length of Sanger sequencing are still superior, which has confined the use of NGS mainly to resequencing
genomes. Consequently, there is a continuing need to develop improved real time sequencing techniques.
This chapter reviews some of the options currently available and provides a generic workflow for sequencing
a genome.
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_1, © Springer Science+Business Media New York 2017
1.1 General Procedure for Genome Sequencing
The first step of any genome sequencing project is generation of an enormous amount of
sequencing data (Fig. 1). The raw data generated by sequencing are processed to convert the
chromatograms into base calls with associated quality values (see Note 1). This procedure is
known as “base calling” or fragment readout. The next step, trimming of vector sequences, is
optional, as it applies only to methods that involve library construction. It is an important
step for those methods because some parts of the vector are also sequenced by the universal
primers, along with the insert region. Further screening of good quality and bad quality
regions is performed to ensure a more accurate sequence assembly. The processed data are then
assembled by searching for fragment overlaps to form “contigs.” However, because of repeat
regions, the fragments may be wrongly aligned, leading to misassembly. Hence, assembly
validation is done both manually and with software to locate the correct position of reads.
The contigs are then oriented one after the other with the help of mate-paired information
to form ordered chains known as “scaffolds” [12]. This step is followed by the final and most
crucial step of genome sequencing: finishing. It includes gap filling (also known as “minimum
tiling path”) and reassessing the assembly [13].
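The overlap-based assembly step described above can be illustrated with a toy greedy assembler. This is a minimal sketch rather than the method used by any particular assembler: the function names, the `min_len` threshold, and the example reads are all assumptions made for illustration, and the reads are error-free and come from a single strand. Real assemblers must also handle sequencing errors, reverse complements, and repeats, which is why the validation step matters.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap.

    Returns the remaining contigs once no pair overlaps by at least
    `min_len` bases. Hypothetical helper for illustration only.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    # Three overlapping reads from the made-up sequence ATGCGTACCTGAGGTTC
    print(greedy_assemble(["TGAGGTTC", "ATGCGTAC", "GTACCTGA"]))
```

With repeat-containing reads the same greedy rule can join fragments in the wrong order, which is the misassembly problem mentioned in the text.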
Fig. 1 Flow diagram to show the series of steps followed for sequencing any
genome: sequence generation, base calling, fragment assembly, assembly validation, and finishing
1.2 Choosing a Sequencing Strategy
Given the rapid growth of sequencing technologies, each with unique advantages and
shortcomings with regard to performance and cost, it is now becoming difficult to choose the
best technology for a particular application. Some parameters that should be considered when
selecting a sequencing method are the following [14, 15]:
1. Read lengths and cost per read: Read length is the number of contiguous bases sequenced in a
given run. It remains one of the most important parameters for the selection of sequencing
platforms. Increasing the read length facilitates assembly and reduces assembly error, because
it provides unique context for repetitive sequences. Moreover, as the cost per read decreases,
sequencing longer reads will become more economical (see Table 1).
2. Raw accuracy: This is an important quality control measure in massive DNA sequencing
projects. Lower error rates result in better assembly, and better quality finished-sequence for
the same coverage.
3. Mate-paired reads: Mate-paired reads are those reads that are separated by a known distance
in the genome of origin [16]. Such reads are crucial for scaffolding de-novo genome assembly,
for resolving repeat regions, and for many other downstream
Table 1
Details of first, second, and third generation sequencing technologies with respect to their cost per megabase, instrument cost, read length, and
accuracy
2 Materials
Table 2 (continued)
Fig. 2 Strategies for sequencing a whole genome (figure label: clone fragments in a suitable vector). The
ordered clone approach is used with a large insert size library for constructing a physical map, whereas the
whole genome shotgun strategy is used for sequencing a small insert size library to generate an enormous
amount of data
3 Methods
3.1 First Generation Technologies
Even after the introduction of novel techniques, a major share of DNA sequence data is still
credited to first generation technologies. Though slower and costlier (even after
parallelization and automation), these older technologies are still used in studies where
accuracy cannot be compromised even to a slight degree, including synthetic oligonucleotides
and gene targets [18, 19]. The initial, first generation sequencing technologies were the
sequencing-by-cleavage method established by Maxam and Gilbert [20] and the
sequencing-by-synthesis method developed by Sanger [2].
3.1.1 Maxam-Gilbert Method
This technique first appeared in 1977 and is also known as the “chemical-degradation” method
[20]. The chemical reagents act on specific bases of existing DNA molecules and subsequent
cleavage occurs (Fig. 3). In this technique, dsDNA is labeled with radiolabeled phosphorus at
the 5′ end or 3′ end. The next step is to obtain ssDNA. This can be done by restriction
digestion leading to sticky ends or denaturation at 90 °C in the presence of DMSO,
Fig. 3 Sequencing an oligonucleotide by the Maxam-Gilbert method (example oligonucleotide GCTACGGCAGCTA;
panel labels: hydrazine + alkali, piperidine; hydrazine + mild acid, piperidine + 2 M NaCl). This method
largely depends upon treatment of oligonucleotides with different chemicals that cleave at specific sites.
The 5′ end of the oligonucleotide is radiolabeled. When cleaved with specific chemicals, only the radiolabeled
part generates signals on an autoradiograph. Hence, by a series of partial cleavages, the nucleotide sequence
can be determined
Table 3
Reagents used in Maxam-Gilbert method for sequencing
3.1.2 Sanger Sequencing
This technique is also known as chain termination sequencing or “dideoxy sequencing.” Sanger
sequencing has played a crucial role in understanding the genetic landscape of the human
genome. It was developed by Frederick Sanger in 1975, but was commercialized in 1977 [21].
The technique is based on the use of dideoxyribonucleoside triphosphates that lack a 3′
hydroxyl group (Fig. 4a). It requires seven components: the ssDNA template to be sequenced,
primers, Taq polymerase to amplify the template strand, buffer, deoxynucleotides (dNTPs),
fluorescently labeled dideoxynucleotides (ddNTPs), and DMSO (to resolve any secondary
structures formed in the template strand) (see Notes 3 and 4). Because the incorporated ddNTP
lacks a 3′ OH group, the phosphodiester bond between the C3′ OH of its sugar moiety and the
C5′ of the next dNTP cannot form, resulting in the termination of the chain at that point
[2, 22].
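The chain-termination principle can be made concrete with a small simulation. This is a simplified sketch under stated assumptions: the function names are hypothetical, fragment lengths are counted from the first base added after the primer, and every possible termination product is assumed to be present. The template used is the one shown in Fig. 4.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def synthesized_strand(template_3_to_5: str) -> str:
    """New strand made by the polymerase, written 5'->3'.

    The template is given 3'->5', so complementing it position by
    position yields the new strand in its direction of synthesis.
    """
    return "".join(COMPLEMENT[base] for base in template_3_to_5)


def termination_fragments(template_3_to_5: str) -> dict[str, list[int]]:
    """For each ddNTP, the lengths of the chains it terminates."""
    lanes: dict[str, list[int]] = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized_strand(template_3_to_5), start=1):
        lanes[base].append(length)  # a chain of this length ends in dd<base>
    return lanes


def read_sequence(lanes: dict[str, list[int]]) -> str:
    """Order fragments by length, as electrophoresis does, and read the
    terminating base of each fragment."""
    by_length = {n: base for base, lengths in lanes.items() for n in lengths}
    return "".join(by_length[n] for n in sorted(by_length))


if __name__ == "__main__":
    lanes = termination_fragments("AGATTATGGCTAGT")  # template from Fig. 4
    print(lanes)
    print(read_sequence(lanes))
```

Reading the terminating base of each fragment from shortest to longest recovers the newly synthesized strand, which is the complement of the template.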
When run on a polyacrylamide gel with the products of each
reaction in four parallel lanes, the fragments of different lengths get
separated (Fig. 4b and c). After its introduction, Sanger sequencing
underwent further modifications for the automated sequencing
protocol. One such advance was dye-terminator sequencing [23].
The main advantage of using this method lies in its greater accuracy
and speed. In 1987, Applied Biosystems, Inc. (ABI) launched the
first automated DNA sequencing machine, ABI model 370,
Fig. 4 Dideoxy chain termination method (panel labels: dNTPs; ddATP, ddTTP, ddGTP, ddCTP; Taq DNA polymerase;
template 3′ AGATTATGGCTAGT 5′). (a) DNA synthesis by addition of a dNTP (which has a free 3′-OH) to the
growing strand. Once a ddNTP is incorporated, strand synthesis stops because no 3′-OH is available to
form a phosphodiester bond with the next dNTP. (b) Formation of a series of chains by incorporation of ddNTPs,
which can be separated by capillary electrophoresis. (c) Electropherogram of a DNA sequence
3.2 The Birth of Next-Generation Sequencing (NGS) Techniques
The advent of next-generation sequencing techniques has made the once herculean task of
sequencing much simpler and faster [25]. Gradually, the rapid and cost-effective
next-generation sequencing technologies have emerged as the popular choice over their slow
and cumbersome first generation counterparts. They have enabled
3.3.1 454 Pyrosequencing
The 454 (Roche) was the first of the NGS platforms, introduced in 2005 [34]. The process of
pyrosequencing, in short, involves generation of light from released pyrophosphate by means
of an enzymatic cascade. This light is released while a polymerase replicates the template
DNA, and its detection is vital in accurately sequencing the DNA [35].
In this technique, the duplex DNA is first sheared into smaller fragments, and adapters
complementary to the primer sequences are then ligated to both ends of each fragment.
These adapters act as the site for primer binding and initiate the sequencing process. Each
DNA fragment is bound to an emulsion microbead such that the ratio of DNA:microbead is 1:1
(Fig. 5). This is followed by amplification of each strand using emulsion PCR and, after a
few cycles, many copies of these DNA molecules are obtained per bead [35].
Immobilized enzymes including DNA polymerase, ATP sulphurylase, luciferase, and apyrase are
added to the wells in a fiber optic plate containing one amplified bead each. A fluidic
assembly supplies the four dNTPs one by one to the wells of the fiber optic slide. These
dNTPs are then incorporated, complementary to the template, with the help of DNA polymerase.
This process releases PPi, which ATP sulphurylase converts into ATP. In the presence of the
generated ATP, luciferase converts luciferin to oxyluciferin and a light signal proportional
to the amount of ATP is produced. The intensity of the signal is captured by the CCD camera.
Once the signal is produced and captured, the enzyme apyrase degrades the remaining
nucleotides and ATP, and the next nucleotide is then added through the fluidic assembly. The
technique has recently been supplemented with “paired-end” sequencing, in which adapters are
bound to both ends of the fragmented DNA, leading to sequencing from both ends [34, 35].
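The cyclic nucleotide flow can be sketched as a toy flowgram simulation. This is an idealized illustration, not the instrument's base-calling software: the flow order `TACG`, the number of cycles, and the function names are assumptions, and the signal is taken to be exactly equal to the number of bases incorporated in a flow, which real light intensities only approximate.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def flowgram(template_3_to_5: str, flow_order: str = "TACG",
             cycles: int = 4) -> list[tuple[str, int]]:
    """Return one (nucleotide, signal) pair per flow.

    The idealized signal is the number of bases incorporated while that
    nucleotide is supplied: 0 if it is not complementary to the next
    template base, more than 1 for a homopolymer run.
    """
    to_add = [COMPLEMENT[base] for base in template_3_to_5]
    position = 0
    flows = []
    for _ in range(cycles):
        for nucleotide in flow_order:
            incorporated = 0
            while position < len(to_add) and to_add[position] == nucleotide:
                incorporated += 1
                position += 1
            flows.append((nucleotide, incorporated))
    return flows


def call_bases(flows: list[tuple[str, int]]) -> str:
    """Rebuild the synthesized strand from the flowgram."""
    return "".join(nucleotide * signal for nucleotide, signal in flows)


if __name__ == "__main__":
    flows = flowgram("AACGT")
    print(flows[:4])
    print(call_bases(flows))
```

Because the signal for a homopolymer run must be resolved from intensity alone, long runs of the same base are harder to call accurately than isolated bases.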
A major advantage of this technique is its long read length. Unlike other NGS technologies,
454 yields read lengths of 400 bases and can generate more than 1,000,000 individual reads
per run. This feature is particularly suited for de novo assembly and metagenomics [36].
Fig. 5 Pyrosequencing. In this method, the double stranded DNA is fragmented into small pieces and then
denatured to get single strands, which are ligated with adaptors at both the ends. Each DNA fragment gets
bound to microbead and is amplified by emulsion PCR, generating many copies of a single molecule
3.3.2 The Illumina (Solexa) Genome Analyzer
The Solexa platform for sequencing DNA was first developed by British chemists Shankar
Balasubramanian and David Klenerman and was later commercialized in 2006. It works using the
principle of “sequencing-by-synthesis.” Also known as cycle reversible termination (CRT)
[37], this technique has been found to generate 50–60 million reads with an average read
length of 40–50 bases in a single run. The genomic DNA is fragmented into small pieces and
adapters are ligated at both the ends. The ligated DNA fragments are then loaded on the glass
slide (flow cell) where one end of the ligated DNA fragment hybridizes to the complementary
oligo (oligo 1) that is covalently attached to the surface. The opposite end of each of the
bound ssDNAs is still free and hybridizes with a nearby complementary oligo (oligo 2) present
on the slide to form a bridge. With the addition of DNA polymerase and dNTPs, the ssDNA are
bridge-amplified as strand synthesis starts in the 5′–3′ direction from oligo 2. For the next
PCR cycle, the template strand and the newly synthesized complementary strand are denatured
to again start the amplification. As a result of repeated cycles of bridge amplification,
millions of dense clusters of duplex DNA are generated in each channel of the flow cell
(Fig. 6). Once such an ultra-high density sequencing flow cell is created, the slide is ready
for sequencing [33, 37–39].
Fig. 6 Solexa Platform (panel labels: ssDNA, oligo 1, oligo 2, bridge formation, bridge amplification, dense
clusters of duplex DNA). The fragmented genomic DNA is ligated to adapters at both the ends. Each ligated DNA
fragment hybridizes to the complementary oligo (oligo 1) that is covalently attached to the glass slide surface.
The opposite end of each of the bound ssDNAs hybridizes with a nearby complementary oligo (oligo 2) to form
a bridge. The ssDNA are bridge-amplified as strand synthesis starts in the 5′–3′ direction from oligo 2. As a
result of repeated cycles of bridge amplification, millions of dense clusters of duplex DNA are generated in
each channel of the flow cell
3.3.3 Applied Biosystems SOLiD Sequencer
The ABI SOLiD platform was described by J. Shendure and colleagues in 2005. This technology
involves a series of hybridization-ligation steps. The di-base probes are actually octamers,
containing two bases of known positions, three degenerate bases, and three more degenerate
bases that are attached to the fluorophore (and which are excised along with fluorophore
cleavage). Sixteen combinations of di-base probes are used, with each combination represented
by one of four fluorescent dyes (Fig. 7a) [40, 41].
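Mapping sixteen di-base combinations onto four dyes can be sketched as follows. The chapter does not state which dinucleotides share a dye; the code below assumes the commonly described two-base encoding, in which each base is given a two-bit value and the color is their XOR. The color indices 0–3 are abstract labels, not actual dye names, and all function names are hypothetical.

```python
from itertools import product

BASE_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_BASE = {value: base for base, value in BASE_BITS.items()}


def dibase_color(first: str, second: str) -> int:
    """Color index (0-3) of a dinucleotide under the assumed XOR scheme."""
    return BASE_BITS[first] ^ BASE_BITS[second]


def encode(sequence: str) -> list[int]:
    """Colors of successive overlapping dinucleotides in a sequence."""
    return [dibase_color(a, b) for a, b in zip(sequence, sequence[1:])]


def decode(first_base: str, colors: list[int]) -> str:
    """Recover the bases from a known first base and the color calls."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_BASE[BASE_BITS[bases[-1]] ^ color])
    return "".join(bases)


if __name__ == "__main__":
    # Group the 16 dinucleotides by color: four dinucleotides per dye.
    groups: dict[int, list[str]] = {}
    for a, b in product("ACGT", repeat=2):
        groups.setdefault(dibase_color(a, b), []).append(a + b)
    print(groups)
    colors = encode("ATGGCA")
    print(colors, decode("A", colors))
```

A color call alone is ambiguous between four dinucleotides, so decoding needs one known base (in practice, the last base of the adaptor) to anchor the chain.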
The sheared genomic DNA is ligated with adaptors on both sides. The adaptor-ligated fragments
get bound to microbeads and each molecule is clonally amplified onto beads in emulsion PCR
(similar to pyrosequencing). After amplification, beads get covalently attached to the glass
slide and are provided with universal primers, ligase, and di-base probes [42].
In the first round, universal primers (of length n) are added that hybridize with adaptors.
Following this, one of the di-base probes (complementary to the sequence) binds to the
template and is
3.3.4 Polonator
This ligation-based sequencer was developed by Dover in collaboration with the Church
Laboratory of Harvard Medical School. In addition to high performance, it has become popular
as a cheap, affordable instrument for DNA sequencing. The system uses open source software
and protocols that can be downloaded for free. Based on ligation detection sequencing
(similar to SOLiD), the Polonator G.007 identifies the nitrogenous base with a single base
probe, unlike SOLiD, which uses a dual base probe. The single base probes, known as
nonanucleotides or nonamers, are tagged with a fluorophore. The genomic DNA is sheared into
many fragments that are used for the construction of a genomic paired tag shotgun library.
Clonal amplification of the templates by emulsion PCR yields beads carrying tens of thousands
of template copies, called polonies. This is followed by enrichment of the beads that have
such amplified templates bound to them. After insertion of these beads in the flow cell, the
Polonator is used to generate millions of short reads in parallel by using the cyclic
sequencing array technology [3, 31].
Fig. 7 (a) Diagrammatic representation of the 16 di-base probes used in ABI-SOLiD: The di-base probes
consist of two known nitrogenous bases and six degenerate bases, three of which are tagged with a fluorophore.
Depending on the combination of known bases used, four different colors are emitted when these probes bind
to the template DNA, signaling the correct sequence of the DNA strand. (b) The first cycle is carried out with a
universal primer (n), which leads to the detection of nucleotides 1–2, 6–7, 11–12, 16–17, 21–22 etc. In the
second cycle, the same universal primer with one base less (n−1) helps in detecting nucleotide positions 0–1,
5–6, 10–11, 15–16 etc.
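The positions interrogated in each primer round follow a simple arithmetic pattern, sketched below. The step of five bases between successive ligations is inferred from the positions 1–2, 6–7, 11–12 listed for primer n (and is consistent with an octamer probe losing three bases on cleavage); the function name and the use of five primer rounds (n to n−4) are assumptions for illustration.

```python
def interrogated_positions(primer_offset: int, ligations: int,
                           step: int = 5) -> list[tuple[int, int]]:
    """Template positions read by the di-base probes in one primer round.

    primer_offset is 0 for primer n, 1 for primer n-1, and so on.
    Each ligation reads two adjacent positions; successive ligations
    are `step` bases apart.
    """
    return [(1 + step * k - primer_offset, 2 + step * k - primer_offset)
            for k in range(ligations)]


if __name__ == "__main__":
    for offset in range(5):  # assumed primer rounds n, n-1, ..., n-4
        print(f"primer n-{offset}:", interrogated_positions(offset, 4))
```

Shifting the primer by one base per round means that, over five rounds, every template position is covered by two different di-base probes, which is the basis of the two-base encoding check.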
3.4.1 The Ion Torrent Sequencing Technology
Ion Torrent, a transitional technique between the second and third generations, is a
high-throughput DNA sequencing technology introduced by Life Technologies in 2010. The
sequencing chemistry of Ion Torrent is based on the principle of covalent bond formation in a
growing DNA strand, catalyzed by DNA polymerase. This results in release of pyrophosphate and
a proton. The release of a proton decreases the pH of the environment, and this change is
detected to determine the sequence [3, 31, 45].
The sequencer contains microwells on a semiconductor chip, each containing a clonally
amplified single-stranded template DNA molecule that needs to be sequenced. The wells are
then sequentially flooded with DNA polymerase and unmodified deoxynucleoside triphosphates
(dNTPs). When flooded with a single species of dNTP, a biochemical reaction will occur if the
nucleotide is
Fig. 8 Ion torrent sequencing technology. With the incorporation of a complementary base, a proton is released
which leads to the change in pH of the environment. Thus, the sequence of incorporated base can be
deciphered
Fig. 9 DNA nanoball sequencing. The fragmented genomic DNA is ligated with four adaptors and the DNA
nanoball is formed by rolling-circle replication. Each DNA nanoball is placed on a spot on the chip and
sequencing is performed by probe-anchor ligation followed by the binding of a detection probe
3.4.3 Pacific Biosciences RS (SMRT Sequencing)
SMRT, developed by Pacific Biosciences, relies on the principle of single molecule real time
sequencing [46]. This technique differs from other techniques in two ways:
1. Instead of labeling the nucleic acid bases themselves, the phosphate end of a nucleotide is
labeled differently for all four bases.
2. The reaction occurs in a nano-photonic visualization chamber called a ZMW (zero mode
waveguide).
The sequencing reaction for a DNA fragment begins with DNA polymerase that resides in the
detection zone at the bottom of each ZMW. The chamber is designed to accommodate a
polymerase, which is fixed at the bottom, and a ssDNA strand as a template. When a nucleotide
is incorporated by the DNA polymerase, the fluorescent tag is cleaved off with the formation
of a phosphodiester bond. The machine records light pulses emitted as a result of nucleotide
incorporation into the target template (Fig. 10). The fluorophore released with the formation
of a phosphodiester bond diffuses out of the ZMW [50–52].
Fig. 10 SMRT sequencing (panel labels: ssDNA; fluorescent-labeled phosphate instead of base; nano-photonic
visualization chamber (ZMW)). DNA polymerase is immobilized at the base of each nano-photonic visualization
chamber (ZMW). The phosphate of each nucleotide species is tagged with a unique fluorophore. As a new base is
incorporated complementary to a ssDNA, the pyrophosphate is cleaved with the liberation of the fluorophore and
the light intensity is captured at each liberation
3.4.4 Nanopore Sequencing
The concept of nanopores and their use in sequencing emerged in the mid-1990s. After years of
advancement and development, Oxford Nanopore Technologies licensed the technology in 2008
[53].
Nanopores are channels of nanometer width, which can be of three types: (1) biological: pores
formed by a pore-forming protein in a membrane, for example alpha-hemolysin; (2) solid-state:
pores formed by synthetic material or derived chemically, for example silicon and graphene;
or (3) hybrid: pores formed by a biological agent such as a pore-forming protein and
encapsulated in a synthetic material [54].
Unlike all the above-mentioned sequencers, a nanopore DNA sequencer does not require the
labeling or detection of nucleotides. This technique is based on the principle of modulation
of the ionic current generated when a DNA molecule passes through the nanopore. This helps
decipher various characteristics of the molecule such as its length, diameter, and
conformation. Initially, an ionic current of known magnitude is allowed to flow through the
nanopore. Since different nucleotides have different resistance, and therefore block the
current for a specific time period, one can determine the sequence of the molecule in
question by measuring this time period (Fig. 11). Further improvements in the technique may
lead to the development of a rapid nanopore-based DNA sequencing technique [55–58].
Fig. 11 Nanopore sequencing (panel labels: enzyme, nanopore, membrane, potential to determine the deflection
caused by each nucleotide while passing through the channel). A nanopore formed in the biological membrane.
Each nucleotide of a single stranded DNA when passing through the nanopore leads to a characteristic change
in the membrane potential, which can be readily recorded to decipher the sequence
4 Applications of Sequencing
4.1 Whole-Genome Sequencing
After the successful completion of the first genome, Haemophilus influenzae, in 1995 [4], all
sequencing techniques have been applied to whole genome sequencing (WGS) of various organisms
[18]. More than 20,000 complete genomes have been sequenced so far (including 11,427
bacterial, 617 archaeal, 5,874 viral, and 2,052 eukaryotic genomes). The rapid rise in the
quantity of WGS data stored in databases is an obvious outcome of NGS technologies [59].
4.2 Comparative Genomics
The availability of whole genomes has led to extensive studies on related organisms [33]. The
impact of comparative genomics is widespread; for example, it has helped distinguish the
pathogenic and nonpathogenic regions in closely related strains. With the availability of
sequencing data, we are able to locate novel genes and other previously unknown functional
elements [60]. The raw data available helps elucidate the structure and functional
relationships of biomolecules and their regulatory processes. It has also provided a wealth
of information regarding DNA-protein interactions [61].
4.3 Evolutionary Biology
With the availability of such a large number of draft and complete genomes, it has become
possible to explore their detailed evolutionary history. Whole genome sequences advanced
studies in evolutionary biology from single gene based studies (16S rRNA and other
house-keeping genes) to genome based studies [62]. Various genomic features have been
explored to shed light on genome evolution, including gene concatenation, gene order, genome
based alignment-free phylogeny, and many others. Sequencing provides valuable information for
studying molecular phylogenies
4.4 Forensic Genomics
Forensic studies were previously conducted using RFLP, PCR, and other techniques. But the
reduction in the cost of sequencing has opened the way for more accurate methods to be used
in forensics, for example in studies related to short tandem repeat (STR) typing [63],
mitochondrial DNA analysis, and single nucleotide polymorphisms (SNPs) [64]. These studies
are proving to be of immense value and are providing additional evidence in criminal cases
[65].
4.5 Drug Development
The availability of genome sequences has advanced drug discovery in two ways. First, computer
aided drug discovery (CADD) has become easier with the availability of reference sequences,
as these help in homology modeling [60]. Second, it aids in analyzing the interaction of a
drug with the host genome (molecular simulations). Nowadays, we can not only identify
pathogenic strains and virulence factors, we can also counter them using appropriate genetic
and protein engineering [66, 67].
4.6 Personal Genomes
The completion of the human genome project in 2001 with 99.99 % accuracy was a landmark that
took more than a decade. But with the availability of NGS platforms, the reduced cost per
read has paved the way for sequencing personal genomes [34]. Although the reference human
genome still contains 0.01 % error, it provides the basis for resequencing individual
genomes. Sequencing personal genomes facilitates the detection of single-nucleotide
polymorphisms (SNPs), insertions and deletions (InDels), large structural variations (SVs),
new variations at the individual level, and variations in genotype/haplotype [68]. Personal
genomics is used to study diseases, genetic adaptations, epigenetic inheritance, and
quantitative genetics [69].
4.7 Cancer Research
Emerging high-throughput sequencing technologies are leading us toward the discovery of novel
diagnostic and therapeutic approaches in the treatment of cancer [67]. In cancer genomics,
current technology enables rapid identification of patient-specific rearrangements, tumor
specific biomarkers, mutations responsible for cancer initiation, and micro-RNAs responsible
for inhibition of translation in tumor cells [70]. Further, the technology is able to provide
“personalized biomarkers”: a set of distinct markers for an individual, used clinically to
select optimal treatments for diseases and oncogenic factors [71].
4.8 Microbial Population Analysis
Low-cost sequencing technologies have paved the way for simultaneous sequencing of viral, bacterial, and other small genomes in environmental samples, owing to their high throughput, depth of sequencing, and the small size of most microbial genomes [18]. This has given birth to a new field of study called “metagenomics” [25]. It is now possible to detect unexpected disease-associated viruses and emerging human viruses, including cancer-related ones [72]. Sequencing data significantly strengthen our understanding of host-pathogen interactions, including through the discovery of new splice variants, mutations, regulatory elements, and epigenetic controls [73].
5 Future Aspects
6 Notes
Table 4
Probability score of phred value. The acceptable error rate for any genome project is 1/10,000 nucleotides (= phred quality score 40)
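The link between a phred quality score Q and the base-calling error probability P is the standard relation Q = -10 log10(P) [76]. The short Python sketch below is illustrative only (the function names are ours, not part of phred); it converts between the two and shows why an error rate of 1/10,000 corresponds to a quality score of 40.

```python
import math

def phred_to_error_probability(q):
    """Error probability P for a phred quality score Q, using Q = -10 * log10(P)."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Phred quality score Q for a per-base error probability P."""
    return -10 * math.log10(p)

# Print the error probability for a few commonly quoted quality scores.
for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: P(error) = {p:g} (about 1 error in {round(1 / p):,} bases)")
```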
Components                      Volume
DNA                             50–150 ng/μl
Primer (specific)               1 μl (10 μM)
BigDye Terminator mix (v3.1)    0.5 μl
Buffer (5×)                     2 μl
DMSO                            0–0.3 μl
Sterile deionized water         To make up the final volume of 10 μl
The PCR can be set up for 25–30 cycles in which the denaturation, annealing, and extension conditions are as follows:
Denaturation: 95 °C for 10 s
Annealing: 50 °C for 5 s
Extension: 60 °C for 4 min
References
1. Mardis EM (2011) A decade’s perspective on DNA sequencing technology. Nature 470:198–203
2. Sanger F, Nicklen S, Coulson AR (1977) DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A 74:5463–5467
3. Pareek CS, Smoczynski R, Tretyn A (2011) Sequencing technologies and genome sequencing. J Appl Genet 52:413–435
4. Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR, Bult CJ, Tomb JF, Dougherty BA, Merrick JM et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512
5. Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:1304–1351
6. Koboldt DC, Steinberg KM, Larson DE, Wilson RK, Mardis EM (2013) The next generation sequencing revolution and its impact on genomics. Cell 155:27–38
7. Bormann Chung CA, Boyd VL, McKernan KJ, Fu YT, Monighetti C, Peckham HE, Barker M (2010) Whole methylome analysis by ultra-deep sequencing using two-base encoding. PLoS One 5:1–8
8. Nowrousian M (2010) Next-generation sequencing techniques for eukaryotic microorganisms: sequencing-based solutions to biological problems. Eukaryot Cell 9:1300–1310
9. Koboldt DC, Larson DE, Chen K, Ding L, Wilson RK (2012) Massively parallel sequencing approaches for characterization of structural variation. Methods Mol Biol 838:369–384
10. Brautigam A, Gowik U (2010) What can next generation sequencing do for you? Next-generation sequencing as a valuable tool in plant research. Plant Biol 12:831–841
11. Thudi M, Li Y, Jackson SA, May GD, Varshney RK (2012) Current state-of-art sequencing technologies for plant genomics research. Brief Funct Genomics 2:3–11
12. Pop M, Kosack D, Salzberg SL (2002) A hierarchical approach to building contig scaffolds. In: Second annual RECOMB satellite meeting on DNA sequencing and characterization. Stanford University
13. Shultz JL, Yesudas C, Yaegashi S, Afzal AJ, Kazi S, Lightfoot DA (2006) Three minimum tile paths from bacterial artificial chromosome libraries of soyabean (Glycine max cv Forrest): tools for structural and functional genomics. Plant Methods 2:9
14. Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M (2012) Comparison of next-generation sequencing systems. J Biomed Biotechnol 2012:1–11
15. Edwards A, Caskey T (1991) Closure strategies for random DNA sequencing. Methods 3:41–47
16. Chaisson MJ, Brinza D, Pevzner PA (2010) De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res 19:336–346
17. Green P (1997) Against a whole-genome shotgun. Genome Res 7:410–417
18. Stranneheim H, Lundeberg J (2012) Stepping stones in DNA sequencing. Biotechnol J 7:1063–1073
19. Hutchison CA (2007) DNA sequencing: bench to bedside and beyond. Nucleic Acids Res 35:6227–6237
20. Maxam MA, Gilbert W (1977) A new method for sequencing DNA. Proc Natl Acad Sci U S A 74(2):560–564
21. Zimmermann J, Voss H, Schwager C, Stegemann J, Ansorge W (1989) Automated Sanger dideoxy sequencing reaction protocol. FEBS Lett 223:432–436
22. Ansorge W, Voss H, Wirkner U, Schwager C, Stegemann J, Pepperkok R, Zimmermann J, Erfle H (1989) Automated Sanger DNA sequencing with one label in less than four lanes on gel. J Biochem Biophys Methods 20:47–52
23. Rosenthal A, Charnock-Jones DS (1992) New protocols for DNA sequencing with dye terminators. DNA Seq 3:61–64
24. Franca LTC, Carrilho E, Kist TBL (2002) A review of DNA sequencing techniques. Q Rev Biophys 35:169–200
25. Bubnoff AV (2008) Next-generation sequencing: the race is on. Cell 132:721–723
26. Metzker ML (2005) Emerging technologies in DNA sequencing. Genome Res 15:1767–1776
27. Shendure J, Ji H (2008) Next-generation DNA sequencing. Nat Biotechnol 26:1135–1145
28. Mardis EA (2013) Next-generation sequencing platforms. Annu Rev Anal Chem 6:287–303
29. Blazej RG, Kumaresan P, Mathies RA (2006) Micro fabricated bioprocessor for integrated nanoliter-scale Sanger DNA sequencing. Proc Natl Acad Sci U S A 103:7240–7245
30. Augustin MA, Ankenbauer W, Angerer B (2001) Progress towards single-molecule sequencing: enzymatic synthesis of nucleotide-specifically labeled DNA. J Biotechnol 86:289–301
31. Hui P (2014) Next-generation sequencing: chemistry, technology and application. Top Curr Chem 336:1–18
32. Hert DG, Fredlake CP, Annelise E (2008) Advantages and limitations of next-generation sequencing technologies: a comparison of electrophoresis and non-electrophoresis methods. Electrophoresis 29:4618–4626
33. Mardis EA (2008) Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 9:387–402
34. Ansorge WJ (2009) Next-generation sequencing techniques. New Biotechnol 25:195–203
35. Ronaghi M, Uhlen M, Nyren P (1998) A sequencing method based on real-time pyrophosphate. Science 281:363–365
36. Keijser BJ, Zaura E, Huse SM, van der Vossen JM, Schuren FH, Montijn RC, ten Cate JM, Crielaard W (2008) Pyrosequencing analysis of the oral microflora of healthy adults. J Dent Res 87:1016–1020
37. Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR et al (2008) Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456:53–59
38. Turcatti G, Romieu A, Fedurco M, Tairi AP (2008) A new class of cleavable fluorescent nucleotides: synthesis and optimization as reversible terminators for DNA sequencing by synthesis. Nucleic Acids Res 36:1–13
39. Pettersson E, Lundeberg J, Ahmadian A (2009) Generations of sequencing technologies. Genomics 93:105–111
40. Niedringhaus TP, Milanova D, Kerby MB, Snyder MP, Barron AE (2011) Landscape of next-generation sequencing technologies. Anal Chem 83:4327–4341
41. Voelkerding KV, Dames SA, Durtschi JD (2009) Next-generation sequencing: from basic research to diagnostic. Clin Chem 55(4):641–658
42. Metzker ML (2010) Sequencing technologies – next generation. Nat Rev Genet 11:31–46
43. Glenn TC (2011) Field guide to next-generation DNA sequencers. Mol Ecol Resour 11:759–769
44. Delsenya M, Han B, Hsing YI (2010) High throughput DNA sequencing: the new sequencing revolution. Plant Sci 179:407–422
45. Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, Leamon JH, Johnson K, Milgrew MJ, Edwards M et al (2011) An integrated semiconductor device enabling non-optical genome sequencing. Nature 475:348–352
46. Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B et al (2010) Real-time DNA sequencing from single polymerase molecules. Science 323:133–138
47. Kaji N, Okamoto Y, Tokeshi M, Baba Y (2010) Nanopillar, nanoball, and nanofibers for highly efficient analysis of biomolecules. Chem Soc Rev 39:948–956
48. Drmanac R, Sparks AB, Callow MJ, Halpern AL, Burns NL, Kermani BG, Carnevali P, Nazarenko I, Nilsen GB, Yeung G et al (2010) Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science 327:78–81
49. Porreca GJ (2010) Genome sequencing on nanoballs. Nat Biotechnol 28:43–44
50. Korlach J, Marks PJ, Cicero RL, Gray JJ, Murphy DL, Roitman DB, Pham TT, Otto GA, Foquet M, Turner SW (2008) Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc Natl Acad Sci U S A 105:1176–1181
51. Korlach J, Bjornson KP, Chaudhuri BP, Cicero RL, Flusberg BA, Gray JJ, Holden D, Saxena R, Wegener J, Turner SW (2010) Real-time DNA sequencing from single polymerase molecules. Methods Enzymol 472:431–455
52. Schadt E, Turner S, Kasarskis A (2010) A window into third-generation sequencing. Hum Mol Genet 19(2):227–240
53. Venkatesan BM, Bashir R (2011) Nanopore sensors for nucleic acid analysis. Nat Nanotechnol 6:615–624
54. Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H (2009) Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol 4:265–270
55. Stoddart D, Heron AJ, Mikhailova E, Maglia G, Bayley H (2009) Single-nucleotide discrimination in immobilized DNA oligonucleotides with a biological nanopore. Proc Natl Acad Sci U S A 106:7702–7707
56. Astier Y, Braha O, Bayley H (2006) Toward single molecule DNA sequencing: direct identification of ribonucleoside and deoxyribonucleoside 5′-monophosphates by using an engineered protein nanopore equipped with a molecular adapter. J Am Chem Soc 128:1705–1710
57. Maitra RD, Kim J, Dunbar WB (2012) Recent advances in nanopore sequencing. Electrophoresis 33:3418–3428
58. Haque F, Li J, Wu HC, Liang XJ, Guo P (2013) Solid state and biological nanopore for real time sensing of single chemical and sequencing of DNA. Nano Today 8:56–74
59. Lim JS, Choi BS, Lee JS, Shin C, Yang TJ, Rhee JS, Lee JS, Choi IY (2012) Survey of the applications of NGS to whole genome sequencing and expression profiling. Genomics Inform 10:1–8
60. Thompson JF, Milos PM (2011) The properties and applications of single-molecule DNA sequencing. Genome Biol 12:217
61. Zhou X, Ren L, Meng Q, Li Y, Yu Y, Yu J (2010) The next generation sequencing technology and application. Protein Cell 1:520–536
62. Buermans HPJ, Dunnen JTD (2014) Next generation sequencing technology: advances and applications. Biochim Biophys Acta 1842:1932–1941
63. Warshauer DH, Lin D, Hari K, Jain R, Davis C, Larue B, King JL, Budowle B (2013) STRait Razor: a length-based forensic STR allele-calling tool for use with second generation sequencing data. Forensic Sci Int Genet 7(4):409–417
64. Kumar S, Banks TW, Cloutier S (2012) SNP discovery through next-generation sequencing and its applications. Int J Plant Genomics 2012:1–15
65. Berglund EC, Anna Kiialainen A, Syvänen AN (2011) Next generation sequencing technologies and applications for human genetic history and forensics. Investigative Genet 2:1–15
66. Ozsolak F (2012) Third generation sequencing techniques and applications to drug discovery. Expert Opin Drug Discov 7:231–243
67. Yadav NK, Shukla P, Omer A, Pareek S, Singh RK (2014) Next-generation sequencing: potential and application to drug discovery. Scientific World J 2014:1–7
68. Snyder M, Du J, Gerstein M (2010) Personal genome sequencing: current approaches and challenges. Genes Dev 23:423–431
69. Yngvadottir B, MacArthur DG, Jin H, Tyler-Smith C (2009) The promise and reality of personal genomics. Genome Biol 10:237.1–237.4
70. Grumbt B, Eck SH, Hinrichsen T, Hirv K (2013) Diagnostic applications of next generation sequencing in immunogenetics and molecular oncology. Transfus Med Hemother 40:196–206
71. Haimovich AD (2011) Methods, challenges and promise of next generation sequencing in cancer biology. Yale J Biol Med 84:439–446
72. Xuan J, Yu Y, Qing T, Guo L, Shi L (2013) Next-generation sequencing in the clinic: promises and challenges. Cancer Lett 340:284–295
73. Hall N (2007) Advanced sequencing technologies and their wider impact in microbiology. J Exp Biol 209:1518–1525
74. Dijk ELV, Auger H, Jaszczyszyn Y, Thermes C (2014) Ten years of next-generation sequencing technology. Trends Genet 30(9):418–426
75. Morey M, Fernández-Marmiesse A, Castiñeiras D, Fraga JM, Couce ML, Cocho JA (2013) A glimpse into past, present, and future DNA sequencing. Mol Genet Metab 110:3–24
76. Ewing B, Green P (1998) Base-calling of automated sequencer traces using phred: II. Error probabilities. Genome Res 8(3):186–194
Chapter 2
Sequence Assembly
Xiaoqiu Huang
Abstract
We describe an efficient method for assembling short reads into long sequences. In this method, a hashing
technique is used to compute overlaps between short reads, allowing base mismatches in the overlaps. Then
an overlap graph is constructed, with each vertex representing a read and each edge representing an overlap.
The overlap graph is explored by graph algorithms to find unique paths of reads representing contigs.
The consensus sequence of each contig is constructed by computing alignments of multiple reads without
gaps. This strategy has been implemented as a short read assembly program called PCAP.Solexa. We also
describe how to use PCAP.Solexa to assemble short reads.
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_2, © Springer Science+Business Media New York 2017
35
2 Materials
2.3 Files
PCAP.Solexa takes the following input: any number of files of short reads in fastq format, with each paired-end dataset provided as two separate fastq files of the same size; a file named fofn listing all short read file names (one per line); a file named fofn.con listing all pairs of paired-end file names (two names, an insert size range, and library names per line); and a file named fofn.lib listing library names with the mean and standard deviation of their insert sizes (one name, one mean, and one standard deviation per line). The files for the example used in this unit are included in the PCAP.Solexa package.
3 Methods
3.1 Algorithm
First we describe a method for computing overlaps between reads.
The method is based on a data structure called a superword array
[22], which is named in a similar way to another data structure
called suffix array [23]. A word of length w is a string of w char-
acters, and a superword with v words is a string of v ∗ w characters,
obtained by concatenating the v words in order. The v word posi-
tions of the superword from left to right are referred to as word
positions 1 through v. For example, the word of the superword at
word position v refers to the rightmost word of the superword. The
positions in the word (with one-based numbering) are divided into
two types called checked and unchecked. Two words of length
w form a match if they have identical bases at each checked word
position. For example, consider two words of length 15: ACCATACCATAGCAC
and ACTATTCCATAACAC. Assume that only
word positions 3, 6, and 12 are unchecked. Then the two words
form a match, because they have identical bases at each of the other
positions. The word length w is often set to 12 or a larger value such
that the number of checked positions in the word is 12, which
ensures that a lookup table for all strings of length 12 can fit into
the main memory. Here the word is turned into a string of length
12 by removing bases at each unchecked position.
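The matching rule can be illustrated with a few lines of Python. This is a minimal sketch written for this text, not code taken from PCAP.Solexa; the function names are hypothetical. It uses the two example words above with positions 3, 6, and 12 unchecked.

```python
def checked_key(word, unchecked):
    """Return the bases of `word` at checked positions only (positions are one-based)."""
    return "".join(base for pos, base in enumerate(word, start=1) if pos not in unchecked)

def words_match(a, b, unchecked):
    """Two words of equal length match if they agree at every checked position."""
    return len(a) == len(b) and checked_key(a, unchecked) == checked_key(b, unchecked)

UNCHECKED = {3, 6, 12}
word_a = "ACCATACCATAGCAC"
word_b = "ACTATTCCATAACAC"

# The two words differ only at the unchecked positions, so they form a match.
print(words_match(word_a, word_b, UNCHECKED))
# The reduced string of 12 checked bases is what would be used for the lookup table.
print(checked_key(word_a, UNCHECKED))
```

Removing the unchecked bases first means that mismatches at those positions never influence the lookup, which is how the method tolerates base mismatches in overlaps.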
The value for the parameter v is selected such that the super-
word length v ∗ w (also called the minimum overlap length) is
smaller than the length r of each read. For example, for reads of
length 150, we can set w to 15, and v to 6, resulting in a superword
length of 90. A read of length r has r − v ∗ w + 1 superwords,
starting at positions 1, 2, . . ., r − v ∗ w + 1. Two superwords form
a match or are identical if they have identical regular bases at each
checked word position. One superword is less (greater) than
another superword if the string of bases at each checked position
of the first superword, in lexicographic order, comes before (after)
the string of bases at each checked position of the second super-
word. Each read is given a unique nonnegative integer, called a read
index. Then each superword in each read can be given a unique
nonnegative integer, called a superword index, computed from the
start position of the superword in the read and the read index.
There is also an inverse function that is efficiently used to produce,
from a superword index, the start position of the superword in the
read and the read index. The superword array for a set of reads is an
array of all superword indexes that are sorted in the lexicographic
order of the superwords.
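The bookkeeping described above can be sketched as follows. The text does not specify how the superword index is computed, so the packing used here (read index times a fixed maximum read length, plus the zero-based start offset) is only one hypothetical choice that has an efficient inverse; the function names are ours.

```python
def superword_count(r, v, w):
    """Number of superwords of length v*w in a read of length r (r - v*w + 1, never negative)."""
    return max(r - v * w + 1, 0)

def superword_index(read_index, start, max_read_length):
    """Pack (read index, one-based start position) into one nonnegative integer.

    Assumed encoding for illustration only: read_index * max_read_length + (start - 1).
    """
    return read_index * max_read_length + (start - 1)

def decode_superword_index(index, max_read_length):
    """Inverse of superword_index: recover (read index, one-based start position)."""
    read_index, offset = divmod(index, max_read_length)
    return read_index, offset + 1

# Reads of length 150 with w = 15 and v = 6, as in the example in the text.
print(superword_count(150, 6, 15))
idx = superword_index(read_index=7, start=61, max_read_length=150)
print(idx, decode_superword_index(idx, 150))
```

A superword array would then be the list of all such indexes, sorted by the checked bases of the superwords they point to.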
Below is an example of eight superwords in a sorted order.
Each superword is made up of four words of length 15, where
each checked position of the word is indicated by the bit 1, and each
unchecked position by the bit 0. The last position of each word in
the superword is marked by a pound sign. The top three super-
words are considered identical because they have identical bases at
each checked position. The middle two superwords are also identi-
cal, and so are the bottom three superwords. Superwords in each
block may have different bases or the undetermined base N at an
unchecked position. The top block comes before the middle block
because the top block has the base G at a checked position marked
by an asterisk and the middle block has the base T at the same
position. Likewise, the middle block comes before the bottom
block, as determined by the bases at another checked position
marked by an asterisk.
110110111110111110110111110111110110111110111110110111110111
              #              #              #              #
ATNAGCCCAGTTATCCTAGTCAGACTCAGGTTNCATCATTCNTCCGANCAGACTGACCAG
ATGAGGCCAGTCATCCTTGTGAGACTTAGGTTACATCATTCATCCGAACAAACTGANCAG
ATTAGNCCAGTAATCCTCGTAAGACTAAGGTTACANCATTCTTCCGACCANACTGATCAG
                  *
ATCAGTCCAGTNATCCTTTTTAGACTTAGGTTGCAACATTCGTCCGAGCATACTGACCAG
ATGAGACCAGTNATCCTATTGAGACTGAGGTTACATCATTCTTCCGACCAAACTGAACAG
                                        *
ATGAGACCAGTNATCCTNTTNAGACTCAGGTTCCAACATTGCTCCGANCATACTGAGCAG
ATNAGTCCAGTAATCCTTTTAAGACTTAGGTTGCAGCATTGGTCCGAGCANACTGACCAG
ATCAGNCCAGTNATCCTATTTAGACTNAGGTTGCATCATTGATCCGATCACACTGANCAG
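The ordering in this example can be reproduced with a short Python sketch. It is an illustration written for this text rather than the PCAP.Solexa implementation: the eight superwords are sorted by their bases at checked positions only, and superwords with the same checked bases are grouped into a block.

```python
from itertools import groupby

# Mask for one word of length 15 (1 = checked, 0 = unchecked), repeated for four words.
MASK = "110110111110111" * 4

def checked_key(superword, mask=MASK):
    """Bases at checked positions only; this string decides both order and identity."""
    return "".join(base for base, bit in zip(superword, mask) if bit == "1")

# The eight superwords from the example, deliberately listed out of order.
superwords = [
    "ATCAGNCCAGTNATCCTATTTAGACTNAGGTTGCATCATTGATCCGATCACACTGANCAG",
    "ATGAGGCCAGTCATCCTTGTGAGACTTAGGTTACATCATTCATCCGAACAAACTGANCAG",
    "ATGAGACCAGTNATCCTATTGAGACTGAGGTTACATCATTCTTCCGACCAAACTGAACAG",
    "ATNAGCCCAGTTATCCTAGTCAGACTCAGGTTNCATCATTCNTCCGANCAGACTGACCAG",
    "ATGAGACCAGTNATCCTNTTNAGACTCAGGTTCCAACATTGCTCCGANCATACTGAGCAG",
    "ATTAGNCCAGTAATCCTCGTAAGACTAAGGTTACANCATTCTTCCGACCANACTGATCAG",
    "ATCAGTCCAGTNATCCTTTTTAGACTTAGGTTGCAACATTCGTCCGAGCATACTGACCAG",
    "ATNAGTCCAGTAATCCTTTTAAGACTTAGGTTGCAGCATTGGTCCGAGCANACTGACCAG",
]

# Sort by checked bases, then collect runs of identical keys into blocks.
ordered = sorted(superwords, key=checked_key)
blocks = [list(group) for _, group in groupby(ordered, key=checked_key)]

for number, block in enumerate(blocks, start=1):
    print(f"block {number}: {len(block)} identical superwords")
    for superword in block:
        print("   ", superword)
```

Reads whose superwords fall in the same block are candidates for an overlap, even though they may differ (or contain N) at unchecked positions.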
that end at the vertex but have no overlap between their long
prefix paths. This is done in the framework of Dijkstra’s algorithm.
Each long overlap path of non-repetitive vertexes is reported as a
contig of reads. The generation of the layout of each contig is
performed on a single processor with a large amount of main
memory. The output of this step is a number of files of contig
layouts.
The files of contig layouts are processed in parallel with each file
handled by a different processor. The reads in each contig are
arranged to form a multiple alignment of reads. Then the multiple
alignment is used to generate a consensus sequence for the contig.
For each file of contig layouts, a file of contig consensus sequences
in fasta format is generated along with a file of contig consensus
base quality scores. In addition, an .ace file of contigs is produced
for viewing and editing in Consed [24]. All the files of contig
consensus sequences are merged into a single file of contig consen-
sus sequences, and a single file of contig consensus base quality
scores is generated similarly.
3.2 Using PCAP.Solexa
Download the PCAP.Solexa package for your computer system at http://seq.cs.iastate.edu.
Unpack the tar file for Linux and move to the pcap.solexa
directory:
tar -xvzf pcap.solexa.linux.tgz
cd pcap.solexa
Enter the full path name in double quotes as the definition for
the variable $CodeDirPath in the pcap.solexa.perl file. For
example, if the path name is
/home/xqhuang/551/pcap.solexa,
Acknowledgements
References
1. Dear S, Staden R (1991) A sequence assembly and editing program for efficient management of large projects. Nucleic Acids Res 19:3907–3911
2. Huang X (1992) A contig assembly program based on sensitive detection of fragment overlaps. Genomics 14:18–25
3. Kececioglu JD, Myers EW (1995) Combinatorial algorithms for DNA sequence assembly. Algorithmica 13:7–51
4. Green P (1995) http://www.phrap.org
5. Huang X, Madan A (1999) CAP3: a DNA sequence assembly program. Genome Res 9:868–877
6. Sutton GG, White O, Adams MD et al (1995) TIGR assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Tech 1:9–19
7. Myers EW, Sutton GG, Delcher AL et al (2000) A whole-genome assembly of Drosophila. Science 287:2196–2204
8. Aparicio S, Chapman J, Stupka E et al (2002) Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes. Science 297:1301–1310
9. Mullikin JC, Ning Z (2003) The Phusion assembler. Genome Res 13:81–90
10. Jaffe DB, Butler J, Gnerre S et al (2003) Whole-genome sequence assembly for mammalian genomes: ARACHNE 2. Genome Res 13:91–96
11. Huang X, Wang J, Aluru S et al (2003) PCAP: a whole-genome assembly program. Genome Res 13:2164–2170
12. Pevzner PA, Tang H, Waterman MS (2001) An Eulerian path approach to DNA fragment assembly. Proc Natl Acad Sci USA 98:9748–9753
13. Chaisson M, Pevzner PA (2008) Short read fragment assembly of bacterial genomes. Genome Res 18:324–330
14. Butler J, MacCallum I, Kleber M et al (2008) ALLPATHS: De novo assembly of whole-genome shotgun microreads. Genome Res 18:810–820
15. Zerbino DR, Birney E (2008) Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 18:821–829
16. Simpson JT, Wong K, Jackman SD et al (2009) ABySS: a parallel assembler for short read sequence data. Genome Res 19:1117–1123
17. Li R, Zhu H, Ruan J et al (2010) De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 20:265–272
18. Boisvert S, Laviolette F, Corbeil J (2010) Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol 17:1519–1533
19. Liu Y, Schmidt B, Maskell DL (2011) Parallelized short read assembly of large genomes using de Bruijn graphs. BMC Bioinform 12:354
20. Bankevich A, Nurk S, Antipov D et al (2012) SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol 19:455–477
21. Compeau PEC, Pevzner PA, Tesler G (2011) How to apply de Bruijn graphs to genome assembly. Nat Biotechnol 29:987–991
22. Huang X, Yang S-P, Chinwalla A et al (2006) Application of a superword array in genome assembly. Nucleic Acids Res 34:201–205
23. Gusfield D (1997) Algorithms on strings, trees, and sequences: computer science and computational biology. Cambridge University Press, New York
24. Gordon D, Abajian C, Green P (1998) Consed: a graphical tool for sequence finishing. Genome Res 8:195–202
Chapter 3
Abstract
Macromolecular crystallography is a powerful tool for structural biology. Solving a protein crystal structure has become much easier than in the past, thanks to developments in computing, automation of crystallization techniques, and high-flux synchrotron sources for collecting diffraction datasets. The aim of this chapter is to provide practical procedures to determine a protein crystal structure, illustrating the new techniques, experimental methods, and software that have made protein crystallography a tool accessible to a larger scientific community.
It is impossible to give more than a taste of what the X-ray crystallographic technique entails in one brief chapter, and there are different ways to solve a protein structure. Since the number of structures available in the Protein Data Bank (PDB) is becoming ever larger (the PDB now contains more than 100,000 entries), and the probability of finding a good model to solve the structure is therefore ever increasing, we focus our attention on the Molecular Replacement method. Indeed, whenever applicable, this method allows macromolecular structures to be solved starting from a single data set and a search model downloaded from the PDB, using computational work alone.
Key words X-ray crystallography, Protein crystallization, Molecular replacement, Coordinates refinement, Model building
1 Introduction
1.1 Protein Crystallization
The first requirement for protein structure determination by X-ray crystallography is to obtain protein crystals diffracting at high resolution. Protein crystallization is mainly a “trial and error” procedure in which the protein is slowly precipitated from its solution. As a general rule, the purer the protein, the better the chances of growing crystals. Growth of protein crystals starts from a super-saturated solution of the macromolecule and evolves toward a thermodynamically stable state in which the protein is partitioned between a solid phase and the solution. The time required to reach equilibrium has a great influence on the final result, which can range from an amorphous or microcrystalline precipitate to large single crystals. The super-saturation conditions can be obtained by the addition of precipitating agents (salts, organic solvents, and polyethylene glycol
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_3, © Springer Science+Business Media New York 2017
47
1.2.3 X-ray Sources
X-rays are produced in the laboratory by accelerating a beam of electrons emitted by a cathode into an anode, the metal of which dictates the wavelength of the resulting X-rays. Monochromatization is carried out either by using a thin metal foil, which absorbs much of the unwanted radiation, or by using the intense low-order diffraction from a graphite crystal. To obtain a brighter source, the anode can be made to revolve (rotating anode generator) and is water-cooled to prevent it from melting. An alternative source of X-rays is obtained when a magnet bends a beam of electrons. This is the principle behind synchrotron radiation sources, which are capable of producing X-ray beams about a thousand times more intense than those of a rotating anode generator. A consequence of this high-intensity radiation source is that data collection time has been drastically reduced. A further advantage is that the X-ray spectrum is continuous from around 0.05 to 0.3 nm (see Note 2).
read-out time (5 ms), good dynamic range (see Note 7), high
detective quantum efficiency, and the possibility of suppressing
fluorescence by an energy threshold. The short readout together
with the fast framing time allows taking diffraction data in continu-
ous mode without opening and closing the shutter for each frame.
1.2.5 Data Measurement and Data Processing
Successful data integration depends on the choice of the experimental parameters during data collection. It is therefore crucial that the diffraction experiment is correctly designed and executed. The essence of the data collection strategy is to collect every unique reflection at least once. The most important issues that have to be considered are:
1. The crystal must be single.
2. In order to have a good signal-to-noise ratio, it is recommended
to measure crystal diffraction at the detector edge.
3. The exposure time has to be chosen carefully: it has to be long
enough to allow collection of high resolution data (see Note 8),
but not so long as to cause overload reflections at low resolution
and radiation damage.
4. The rotation angle per image should be optimized: too large an
angle will result in spatial overlap of spots, too small an angle will
give too many partial spots (see Note 9).
5. High data multiplicity will improve the overall quality of the data
by reducing random errors and facilitating outlier identification.
Data analysis with modern data reduction programs is normally
performed in four stages:
1. (Auto)indexing of one or more images. The program deduces
the lattice type, the crystal unit cell parameters, and crystal
orientation parameters.
2. Indexing of all images. The program compares the diffraction
measurements to the spots predicted on the basis of the autoindexing
parameters and assigns the hkl indices to each spot.
3. Integration of the peaks. The program calculates the diffraction
intensities for each spot in all the collected images.
4. Scaling. The program scales and merges together the reflections
of all the collected images (identified by the indices hkl).
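The merging part of the last stage can be pictured with a small Python sketch. It is deliberately simplified and is not how any particular data reduction program works: real programs also refine scale factors between images, apply crystal symmetry, weight observations, and reject outliers. Here, repeated observations of the same hkl are simply averaged, and the multiplicity of each reflection is reported. The function name and the toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean

def merge_reflections(observations):
    """Average repeated intensity measurements that share the same (h, k, l) indices.

    observations: iterable of ((h, k, l), intensity) pairs.
    Returns a dict mapping (h, k, l) to (mean intensity, multiplicity).
    """
    grouped = defaultdict(list)
    for hkl, intensity in observations:
        grouped[hkl].append(intensity)
    return {hkl: (mean(values), len(values)) for hkl, values in grouped.items()}

# Toy data: reflection (1, 0, 0) was measured three times, (0, 2, 1) twice.
observations = [
    ((1, 0, 0), 100.0),
    ((0, 2, 1), 50.0),
    ((1, 0, 0), 104.0),
    ((0, 2, 1), 54.0),
    ((1, 0, 0), 102.0),
]

for hkl, (intensity, multiplicity) in sorted(merge_reflections(observations).items()):
    print(hkl, f"mean I = {intensity:.1f}", f"multiplicity = {multiplicity}")
```

Averaging several measurements of each reflection is the reason high multiplicity reduces random errors, and a reflection measured many times makes an aberrant observation easier to spot.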
1.3 Structure Determination
The goal of X-ray crystallography is to obtain, starting from the diffraction data, the distribution of the electron density, which is related to the atomic positions in the unit cell. The electron density function has the following expression:
\[
\rho(x, y, z) = \frac{1}{V} \sum_{hkl} F_{hkl}\, e^{-2\pi i (hx + ky + lz)} \tag{1}
\]
where F_hkl are the structure factors, V is the cell volume, and h, k, l are the Miller indices. F is a complex number and can be represented as a vector with a modulus and a phase. The amplitude of F can be calculated directly from the X-ray scattering measurements, but the information on the phase is lost. Different experimental techniques can be used to solve this “phase problem,” allowing the three-dimensional structure of the protein to be built: Multiple Isomorphous Replacement (MIR), Multi-wavelength Anomalous Diffraction (MAD), and Molecular Replacement (MR). The last one can be performed computationally using only the native data set.
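A one-dimensional toy calculation makes the phase problem concrete. The sketch below is an illustration under stated assumptions, not part of any crystallographic package: the "unit cell" is a grid of 16 points with two point atoms, the number of grid points plays the role of V, and the Fourier synthesis is assumed to carry a negative exponent (the usual convention). Summing the full complex structure factors recovers the density; summing the amplitudes alone does not.

```python
import cmath

def structure_factors(rho):
    """F_h = sum_x rho(x) * exp(+2*pi*i*h*x), with fractional coordinates x = j/n."""
    n = len(rho)
    return [sum(rho[j] * cmath.exp(2j * cmath.pi * h * j / n) for j in range(n))
            for h in range(n)]

def electron_density(F):
    """rho(x) = (1/V) * sum_h F_h * exp(-2*pi*i*h*x); V is the number of grid points here."""
    n = len(F)
    return [(sum(F[h] * cmath.exp(-2j * cmath.pi * h * j / n) for h in range(n)) / n).real
            for j in range(n)]

# Toy density: two point atoms with 6 and 8 electrons at grid points 3 and 9.
rho = [0.0] * 16
rho[3] = 6.0
rho[9] = 8.0

F = structure_factors(rho)
amplitudes = [abs(f) for f in F]            # what the diffraction experiment provides

with_phases = electron_density(F)           # amplitudes and phases: density is recovered
without_phases = electron_density(amplitudes)  # phases set to zero: atoms are misplaced

print("strongest peak with phases at grid point", with_phases.index(max(with_phases)))
print("strongest peak without phases at grid point", without_phases.index(max(without_phases)))
```

With all phases set to zero, every term adds constructively at the origin, so the strongest peak moves there regardless of where the atoms really are.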
1.3.2 Rotation Function
As mentioned above, the rotation function is based on the observation that the self-vectors depend only on the orientation of the molecule and not on its position in the unit cell. Thus, the rotation matrix can be found by rotating and superimposing the model Patterson (calculated as the self-convolution function of the electron density, see Note 10) on the observed Patterson (calculated from the experimental intensity). Mathematically, the rotation function can be expressed as a sum of the product of the two Patterson functions at each point:
\[
F(R) = \int_{r} P_{\mathrm{cryst}}(u)\, P_{\mathrm{self}}(Ru)\, du \tag{3}
\]
where Pcryst and Pself are the experimental and the calculated Pat-
terson functions respectively, R is the rotation matrix, and r is the
integration radius. In the integration, the volume around the origin
where the Patterson map has a large peak is omitted. The radius of
integration has a value of the same order of magnitude as the
molecule dimensions because the self-vectors are more concen-
trated near the origin. The programs most frequently used to
solve X-ray structures by Molecular Replacement implement the
fast rotation function developed by Tony Crowther, who realized
that the rotation function can be computed more quickly using the
Fast Fourier Transform, expressing the Patterson maps as spherical
harmonics [9].
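The idea of the rotation function can be sketched in two dimensions with a brute-force search. This is a toy written for this text under simplifying assumptions, and it is not the fast rotation function of Crowther: the molecule is four point atoms, each Patterson is represented by its list of self-vectors (the origin peak is omitted), and the product integral is approximated by summing Gaussian overlaps between the observed vectors and the rotated model vectors. All names and the value of sigma are hypothetical.

```python
import math

def self_vectors(atoms):
    """All interatomic vectors within one molecule (Patterson self-vectors), origin omitted."""
    return [(xb - xa, yb - ya)
            for i, (xa, ya) in enumerate(atoms)
            for j, (xb, yb) in enumerate(atoms) if i != j]

def rotate(vec, degrees):
    """Rotate a 2D vector counterclockwise by the given angle."""
    a = math.radians(degrees)
    x, y = vec
    return (x * math.cos(a) - y * math.sin(a), x * math.sin(a) + y * math.cos(a))

def rotation_function(observed, model, degrees, radius, sigma=0.3):
    """Overlap of observed Patterson peaks with rotated model peaks shorter than `radius`."""
    score = 0.0
    for m in model:
        if math.hypot(*m) > radius:      # integration radius: ignore long vectors
            continue
        mx, my = rotate(m, degrees)
        for ox, oy in observed:
            score += math.exp(-((mx - ox) ** 2 + (my - oy) ** 2) / (2 * sigma ** 2))
    return score

# Search model, and the same molecule as it sits in the "crystal" (rotated by 40 degrees).
model_atoms = [(0.0, 0.0), (3.0, 0.0), (1.0, 2.0), (4.0, 3.0)]
true_angle = 40
crystal_atoms = [rotate(atom, true_angle) for atom in model_atoms]

observed = self_vectors(crystal_atoms)
model = self_vectors(model_atoms)

# A Patterson is centrosymmetric, so angles 180 degrees apart are equivalent: scan 0..179.
scores = {angle: rotation_function(observed, model, angle, radius=6.0) for angle in range(180)}
best = max(scores, key=scores.get)
print("best rotation angle:", best)
```

Because only self-vectors are used, the search finds the orientation without any knowledge of where the molecule sits in the cell, which is the property the rotation function relies on.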
1.3.3 Translation Function
Once the orientation matrix of the molecule in the experimental cell is found, the next step is the determination of the translation vector. This operation is equivalent to finding the absolute position of the molecule. When the molecule (assuming it is correctly rotated in the cell) is translated, all the intermolecular vectors change. Therefore, the Patterson functions’ cross-vectors, calculated using the observed data and the model, superimpose with good agreement only when the molecules in the crystal are in the correct position. The translation function can be described as:
$$T(t) = \int_{v} P_{cryst}(u)\, P_{cross}(u, t)\, du \qquad (4)$$
where Pcryst is the experimental Patterson function, whereas Pcross is
the Patterson function calculated from the probe oriented in the
experimental crystal, t is the translation vector, and u is the inter-
molecular vector between two symmetry-related molecules.
1.4 Structure Refinement Once the phase has been determined (for example with the molecular
replacement method) an electron density map can be calculated
and interpreted in terms of the polypeptide chain. If the major
part of the model backbone can be fitted successfully in the electron
density map, the structure refinement phase can begin.
Refinement is performed by adjusting the model in order to find
a closer agreement between the calculated and the observed struc-
ture factors. The adjustment of the model consists in changing the
three positional parameters (x, y, z) and the isotropic temperature
factors B (see Note 11) for all the atoms in the structure except the
hydrogen atoms. Refinement techniques in protein X-ray crystal-
lography are based on the least squares minimization and depend
greatly on the ratio of the number of independent observations to
variable parameters. Since the protein crystals diffract very weakly,
the errors in the data are often very high and more than five
intensity measurements for each parameter are necessary to refine
protein structures. Generally, the problem is poorly overdeter-
mined (the ratio is around 2) or sometimes under-determined
(the ratio below 1). Different methods are available to solve this
problem. One of the most commonly used is the Stereochemically
Restrained Least Squares Refinement, which increases the number
of the observations by adding stereo-chemical restraints [10].
The function to minimize consists in a crystallographic term and
several stereochemical terms:
$$Q = \sum w_{hkl} \left( |F_{obs}| - |F_{cal}| \right)^2 + \sum w_D (d_{ideal} - d_{model})^2 + \sum w_T (X_{ideal} - X_{model})^2 + \sum w_P (P_{ideal} - P_{model})^2 + \sum w_{NB} (E_{min} - E_{model})^2 + \sum w_C (V_{ideal} - V_{model})^2 \qquad (5)$$
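As a minimal sketch of how such a target is evaluated, the following function sums the crystallographic term and only one of the stereochemical terms (bond distances). The function name, the unit weights, and all numerical values are hypothetical and serve only to show how the restraints enter the sum as additional observations.

```python
def restrained_target(f_obs, f_calc, bonds, w_hkl=1.0, w_d=1.0):
    """Crystallographic term plus the bond-distance restraint term of the target function."""
    xray = sum(w_hkl * (abs(fo) - abs(fc)) ** 2 for fo, fc in zip(f_obs, f_calc))
    geometry = sum(w_d * (d_ideal - d_model) ** 2 for d_ideal, d_model in bonds)
    return xray + geometry

# Hypothetical numbers, for illustration only.
f_obs = [100.0, 50.0, 25.0]            # observed structure factor amplitudes
f_calc = [98.0, 53.0, 25.0]            # amplitudes calculated from the model
bonds = [(1.53, 1.50), (1.33, 1.33)]   # (ideal, model) bond lengths in Å

q = restrained_target(f_obs, f_calc, bonds)
print(q)
```

Each restraint contributes one extra squared residual, which is how the method raises the ratio of observations to refined parameters.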
Protein X-ray Structure Determination 55
2 Materials
2.3 Molecular Replacement
1. One of the following programs: AMoRe, MolRep, CNS, Phaser.
3 Methods
3.1 Crystallization and Crystal Preparation
3.1.1 Crystallization Procedure Precise rules to obtain suitable single-protein crystals have not been
defined yet. For this reason, protein crystallization is mostly a trial
and error procedure. However, there are some rules that should be
followed in order to increase the success probability in protein
crystallization:
1. Check protein sample purity, which has to be around 90–95 %;
2. Slowly increase the precipitating agent concentration (PEGs,
salts, or organic solvents) in order to favor protein aggregation;
3. Change pH and/or temperature.
It is usually necessary to carry out a large number of experi-
ments to determine the best crystallization conditions while using a
minimum amount of protein per experiment (crystallization trials).
In these trials the aliquots of purified protein are mixed with an
equal amount of mother solution containing precipitating agents,
buffers, and other additives. The individual chemical conditions in
which a particular protein aggregates to form crystals are used as a
starting point for further crystallization experiments. The goal is
optimizing the formation of single protein crystals of sufficient size
and quality suitable for diffraction data collection. The protein
concentration should be about 10 mg/ml; therefore, 1 mg of
purified protein is sufficient to perform about 100 crystallization
experiments. Crystallization can be carried out using different
techniques, the most common of which are: liquid-liquid diffusion
methods, crystallization under dialysis, and vapor diffusion tech-
nique. The latter is described in detail since it is easy to set up and
allows the biocrystallographer to utilize a minimum protein
amount. The vapor diffusion technique can be performed in two
ways: the “hanging drop” and the “sitting drop” methods.
1. In the “hanging drop” method, drops are prepared on a silicon-
ized microscope glass cover slip by mixing 1–5 μl of protein
solution with the same volume of precipitant solution. The slip
is placed upside-down over a depression in a tray; the depression
is partly filled (about 1 ml) with the required precipitant solution
(reservoir solution). The chamber is sealed by applying grease to
the circumference of depression before the cover slip is put into
place (Fig. 2a).
Fig. 2 (a) “Hanging drop” crystallization method. A drop of protein solution is suspended from a glass cover
slip above a reservoir solution, containing the precipitant agent. The glass slip is siliconized to prevent
spreading of the drop. (b) “Sitting drop” crystallization method. A drop of protein solution is placed in a plastic
support above the reservoir solution
3.1.3 Crystal Cryo-Protection The most widely used cryo-mounting method consists of the
suspension of the crystal in a film of an “antifreeze” solution, held by
surface tension across a small diameter loop of fiber, and followed by
rapid insertion into a gaseous nitrogen stream. The cryo-protected
solution is obtained by adding cryoprotectant agents such as glycerol,
ethylene glycol, MPD (2-methyl-2,4-pentanediol), or low
molecular weight PEG (polyethylene glycol) to the precipitant solution.
The crystal is immersed in this solution for a few seconds prior
to being flash-frozen. This method places little mechanical stress on
the crystal, so it is excellent for fragile samples. Loops are made from
very fine (~10 μm diameter) fibers of nylon. As some crystals degrade
in growth and harvest solutions, liquid nitrogen storage is an excel-
lent way to stabilize crystals for long periods [14]. This system is
particularly useful when preparing samples for data collection at
synchrotron radiation sources, in that, by minimizing the time
required by sample preparation, it allows using the limited time
available at these facilities to collect data.
3.2 Data Measurement Once the crystal is placed in the fiber loop, the latter must be
attached to a goniometer head. This device has two perpendicular
arcs that allow rotation of the crystal along two perpendicular axes.
Additionally, its upper part can be moved along two perpendicular
sledges for further adjustment and centering of the crystal. The
goniometer head must be screwed onto a detector, making sure
that the crystal is in the X-ray beam.
The data collection parameters used in diffraction experiments
have a strong impact on the data quality and for this reason should
be carefully chosen.
In agreement with Bragg’s law, the crystal-to-detector distance
should be as short as possible to obtain the maximum resolution
while still keeping a good separation between diffraction spots. Generally,
a distance of 150 mm allows collection of high quality data sets
with a good resolution (i.e., better than 2.0 Å) for protein crystals
with unit cell dimensions around 60–80 Å. Long unit cell dimensions
(a, b, and/or c longer than 150 Å), large mosaicity (more than
1.0°) (see Note 13), and large oscillation range (more than 1.0°) are
all factors affecting spot separations and causing potential reflection
overlaps.
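The link between crystal-to-detector distance and resolution can be made concrete with Bragg's law, d = λ / (2 sin θ), where the scattering angle 2θ of a spot at radius r on a flat detector at distance D satisfies tan 2θ = r / D. The sketch below assumes a flat detector perpendicular to the beam; the 100 mm edge radius and the 1.0 Å wavelength are illustrative values, not taken from the text.

```python
import math

def resolution_at_edge(distance_mm, edge_radius_mm, wavelength_A):
    """Resolution (in Å) of a spot recorded at the detector edge, from Bragg's law."""
    two_theta = math.atan(edge_radius_mm / distance_mm)
    return wavelength_A / (2.0 * math.sin(two_theta / 2.0))

d_150 = resolution_at_edge(150.0, 100.0, 1.0)   # detector at 150 mm
d_300 = resolution_at_edge(300.0, 100.0, 1.0)   # detector at 300 mm
print(round(d_150, 2), round(d_300, 2))
```

Moving the detector closer lowers the value of d at the edge (higher resolution), but it also packs the spots closer together, which is the trade-off described above.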
The availability in many beamlines of hybrid pixel detectors
operating in single-photon counting mode (PILATUS), displaying
different characteristics compared with CCD detectors, imposes on
users different data collection strategies. In particular, recent stud-
ies have shown that, if the single photon counting pixel detectors
3.3 Data Processing The basic principles involved in integrating diffraction data from
macromolecules are common to many data integration programs
currently in use. Here, we describe the data processing performed
by the HKL2000 suite [16] and XDS [17]. The currently used data
processing methods exploit automated subroutines for indexing
the X-ray crystal data collection, which means assigning the correct
hkl index to each spot on a diffraction image and for integrating
images, which means measuring spot intensities (Fig. 3).
3.3.1 HKL2000 1. Peak search. The first automatic step is the peak search, which
chooses the most intense spots to be used by the autoindexing
subroutine. Peaks are measured in a single oscillation image,
which for protein crystals typically covers an oscillation of 0.1–1.0°.
2. Autoindexing of one image. If autoindexing succeeds, a good
match between the observed diffraction pattern and predictions
is obtained. The autoindexing permits the identification of the
space group and the determination of cell parameters (see Note
14 and Table 1). Other parameters also have to be refined. The
most important are the crystal and detector orientation para-
meters, the center of the direct beam, and the crystal-to-detector
distance.
3. Autoindexing of all the images. The autoindexing procedure,
together with refinement, is repeated for all diffraction images.
4. Integration of the peaks. In the integration step all spot inten-
sities are measured.
Data are processed using a component program of the
HKL2000 suite called Denzo. The scaling and merging of indexed
data, as well as the global refinement of crystal parameters, is
performed with the program Scalepack that is another HKL2000
suite component. The values of unit-cell parameters refined from a
single image may be quite imprecise. Therefore, a postrefinement
procedure is implemented in the program to allow for separate
refinements of the orientation of each image while using the same
unit cell for the whole data set. The quality of X-ray data is firstly
assessed by statistical parameters reported in the scale.log file. The
first important parameter is the I/σ value (I: intensity of the signal,
Fig. 3 Diffraction oscillation image visualized with the program Xdisp (HKL2000 suite) of the whole human
sorcin collected at the ESRF synchrotron radiation source (Grenoble, FR). The spot distances from the image
center are proportional to the resolution, so the spots at the image edge are the highest resolution spots
Table 1
Output of the Denzo autoindexing routine. The lattice and unit cell distortion table, and the crystal orientation parameters are shown. The present
results are obtained for a human sorcin (soluble Resistance related calcium binding protein) crystal
3.3.2 XDS XDS is one of the few program packages which allows us to process
datasets collected using the single-photon-counting detectors
(PILATUS 2M and 6M).
Data processing with XDS requires an input file XDS.INP
provided with the XDS PACKAGE. It consists of eight program
steps, which in the input file are indicated as JOBS:
1. XYCORR calculates the spatial corrections at each detector
pixel;
2. INIT calculates the detector gain;
3. COLSPOT identifies the strong reflections to be used for
indexing;
4. IDXREF identifies the space group of the crystal and performs
the indexing;
5. DEFPIX identifies the detector surface area used to measure
intensities;
6. XPLAN is a routine that supports the planning of data
collection;
7. INTEGRATE integrates the reflection of the whole dataset;
8. CORRECT scales and merges the symmetry-related reflections.
After completing each individual step, the program writes a .LP
file containing the results obtained running the program steps.
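The eight steps and their log files can be summarized in a short sketch. It assumes that the steps are selected in XDS.INP through a `JOB=` keyword and that each step writes a log file named after the step (as with the IDXREF.LP and CORRECT.LP files mentioned in this section); the helper names are hypothetical.

```python
# Program steps in the order listed above.
XDS_JOBS = ["XYCORR", "INIT", "COLSPOT", "IDXREF",
            "DEFPIX", "XPLAN", "INTEGRATE", "CORRECT"]

def job_line(jobs):
    """Build the line that selects which steps to run (assumed JOB= keyword)."""
    return "JOB= " + " ".join(jobs)

def log_files(jobs):
    """Names of the .LP log files expected after running the given steps."""
    return [job + ".LP" for job in jobs]

# Example: rerun only integration and scaling after indexing has been checked.
rerun = XDS_JOBS[XDS_JOBS.index("DEFPIX"):]
print(job_line(rerun))
print(log_files(rerun))
```

Running the steps in stages like this makes it easy to inspect IDXREF.LP before committing to the integration of the whole dataset.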
As for all the data processing programs, the indexing of the
reflections should be carefully managed.
First, the standard deviation of spot position and of spindle
position should be checked in the IDXREF.LP. The first parameter
should be in the order of 1 pixel whereas the second one, which
depends on both the crystal mosaicity and the data collection Δφ, is
usually between 0.1° and 0.5°. If the standard deviation of the spindle
position is greater than 1°, the indexing procedure has not worked properly.
Then the space group and crystal cell parameters should be
checked. The correct space group is that one with the lowest
QUALITY OF FIT value and the highest symmetry. If the correct
lattice is determined, the reflections of the whole data set can be
integrated and after that the reflections with the same symmetry can
be scaled and merged. The CORRECT step, the last one, produces
a file CORRECT.LP containing all the statistics and a file
XDS_ASCII.HKL containing the integrated and scaled reflections.
The final table of the CORRECT.LP file reports the number of
measured reflections and of the unique reflections, the
3.4 Molecular Replacement
1. Search model. The first operation is searching the databases for
a probe structure similar to the structure to be solved. Since we do
not know the structural identity of our protein with homologous
proteins, we use sequence identity as a guide. Proteins
showing a high degree of sequence similarity with our “query”
protein can be identified in protein sequence databases using
sequence comparison methods such as BLAST [20]. The pro-
tein of known three-dimensional structure showing the highest
sequence identity with our query protein is generally used as the
search model.
2. Files preparation. The PDB file of the search probe has to be
downloaded from the Protein Data Bank (http://www.rcsb.org).
The file has to be manipulated before Molecular Replacement
is performed. The water molecules as well as the ligand
molecules have to be removed from the file. The structure can be
transformed into a polyalanine search probe to avoid model bias
during the Molecular Replacement procedure (see Note 18).
The other file needed to perform the molecular replacement is
the file with extension .mtz resulting from earlier data processing
(see Subheading 3.3), containing information about crystal space
group, cell dimensions, molecules per unit cell, and a list of the
collected experimental reflections.
3. Molecular Replacement. The Molecular Replacement procedure
consists of Rotation and Translation searches to put the probe
structure in the correct position in the experimental cell. This
operation can be done using different programs belonging to
the CCP4 suite [18]: AMoRe [21], Phaser [22], and MolRep
[23]. In this chapter, we describe briefly the use of MolRep,
which is automated and user-friendly.
4. The program performs rotation searches followed by translation
searches. The only input files to upload are the .mtz and the .pdb
files. The values of two parameters have to be chosen: the
integration radius and the resolution range to be used for
Table 2
Example of an XSCALE output table. The table reports parameters that allow the dataset quality to be evaluated, namely
completeness of the dataset, R factor (= Rmerge), I/σ(I), and CC1/2. The parameters are calculated separately for the twenty
resolution shells. The data shells with CC1/2 marked by an asterisk display a correlation significant at the 0.1 % level, and can be used for
structure determination and model refinement
Resolution  Reflections  Reflections  Reflections  Completeness  R-factor  R-factor  Compared  I/Sigma  R-meas  CC(1/2)
limit       observed     unique       possible     of data       observed  expected
7.38 1718 457 469 97.4 % 1.4 % 1.7 % 1716 74.14 1.7 % 100.0*
5.22 3219 851 852 99.9 % 1.7 % 1.9 % 3214 65.40 2.0 % 99.9*
4.26 3629 1091 1100 99.2 % 1.8 % 2.1 % 3604 56.06 2.2 % 99.9*
3.69 4124 1283 1295 99.1 % 1.9 % 2.2 % 4063 49.51 2.3 % 99.9*
3.30 4931 1442 1450 99.4 % 2.2 % 2.3 % 4920 47.24 2.7 % 99.9*
3.01 5748 1627 1632 99.7 % 2.5 % 2.6 % 5735 41.15 3.0 % 99.9*
2.79 6292 1744 1750 99.7 % 2.6 % 2.7 % 6289 40.77 3.0 % 99.9*
2.61 6740 1904 1907 99.8 % 3.0 % 3.0 % 6724 35.62 3.5 % 99.9*
2.46 6559 1998 2007 99.6 % 3.4 % 3.5 % 6506 29.33 4.1 % 99.8*
2.33 6775 2104 2127 98.9 % 3.9 % 3.9 % 6692 26.29 4.6 % 99.8*
2.22 7334 2220 2237 99.2 % 4.6 % 4.6 % 7280 23.84 5.5 % 99.7*
2.13 8112 2359 2372 99.5 % 5.6 % 5.5 % 8075 21.17 6.6 % 99.6*
2.05 8492 2426 2433 99.7 % 7.5 % 7.4 % 8457 16.79 8.8 % 99.5*
1.97 8976 2542 2549 99.7 % 9.3 % 9.3 % 8953 14.38 10.9 % 99.2*
1.91 9292 2621 2627 99.8 % 12.8 % 12.8 % 9274 10.97 15.1 % 98.7*
1.84 9714 2731 2735 99.9 % 18.5 % 18.6 % 9701 8.19 21.8 % 97.7*
1.79 9011 2782 2799 99.4 % 23.9 % 23.8 % 8923 6.25 28.7 % 95.5*
1.74 9586 2862 2876 99.5 % 33.4 % 33.0 % 9513 4.82 39.9 % 93.2*
1.69 9272 2945 2964 99.4 % 39.0 % 39.0 % 9133 3.84 46.9 % 91.1*
1.65 10,092 3051 3063 99.6 % 55.9 % 56.4 % 9994 3.00 66.6 % 83.3*
total 139,616 41,040 41,244 99.5 % 3.3 % 3.4 % 138,766 21.57 3.9 % 99.9*
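Two of the quantities in Table 2 can be recomputed directly from the reflection counts, which is a useful sanity check when reading such output. The sketch below copies three shells and the totals from the table; completeness is the ratio of unique to possible reflections, and the multiplicity (not printed in the table) is the ratio of observed to unique reflections. The function names are illustrative.

```python
# (resolution limit, observed, unique, possible), copied from Table 2
rows = [
    (7.38, 1718, 457, 469),
    (2.05, 8492, 2426, 2433),
    (1.65, 10092, 3051, 3063),
]
total = (139616, 41040, 41244)   # observed, unique, possible for the whole dataset

def completeness(unique, possible):
    """Percentage of the theoretically possible unique reflections that were measured."""
    return 100.0 * unique / possible

def multiplicity(observed, unique):
    """Average number of measurements per unique reflection."""
    return observed / unique

for limit, observed, unique, possible in rows:
    print(limit, round(completeness(unique, possible), 1),
          round(multiplicity(observed, unique), 2))

overall = round(completeness(total[1], total[2]), 1)
print(overall)
```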
Table 3
Output of the MolRep program after rotation and translation searches. The present results (data not
published) have been obtained for the protein Dps (DNA-binding proteins) from Listeria
monocytogenes [32] using as search model the Dps from Listeria innocua (PDB code 1QHG)
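The file preparation described in Subheading 3.4 (removal of waters and ligands, and conversion to a poly-Ala probe as in Note 18) can be sketched with plain string handling on fixed-column PDB records. This is an illustrative sketch, not the procedure used by any particular program: the helper names and the coordinates are hypothetical, and glycine is left unchanged because it has no Cβ atom.

```python
BACKBONE_AND_CB = {"N", "CA", "C", "O", "CB"}

def pdb_line(record, serial, name, res, chain, resseq, x, y, z):
    """Format a minimal fixed-column PDB coordinate record (atom names up to 3 characters)."""
    padded_name = (" " + name).ljust(4)
    return (f"{record:<6}{serial:>5} {padded_name} {res:>3} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def to_polyalanine(lines):
    """Keep only backbone and CB atoms of protein residues and rename them to ALA."""
    kept = []
    for line in lines:
        if not line.startswith("ATOM"):
            continue                       # drops HETATM records (waters, ligands)
        if line[12:16].strip() not in BACKBONE_AND_CB:
            continue                       # drops side-chain atoms beyond CB
        if line[17:20] != "GLY":
            line = line[:17] + "ALA" + line[20:]
        kept.append(line)
    return kept

# Hypothetical records: part of a Trp residue and one water molecule.
lines = [
    pdb_line("ATOM", 1, "N", "TRP", "A", 1, 11.104, 6.134, -6.504),
    pdb_line("ATOM", 2, "CA", "TRP", "A", 1, 11.639, 6.071, -5.147),
    pdb_line("ATOM", 3, "CB", "TRP", "A", 1, 12.150, 7.420, -4.700),
    pdb_line("ATOM", 4, "CD1", "TRP", "A", 1, 13.900, 8.100, -3.200),
    pdb_line("HETATM", 5, "O", "HOH", "A", 101, 20.000, 20.000, 20.000),
]

probe = to_polyalanine(lines)
for line in probe:
    print(line)
```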
3.5 Structure Refinement Several programs can be used to perform structure refinement. The
most common are: CNS written by Brünger [24], which uses
conventional least squares refinement as well as simulated annealing
to refine the structure; REFMAC5 (CCP4i suite) written by Murshudov
[25] that uses maximum likelihood refinement; and PHENIX
that allows multistep complex refinement protocols in which
most of the available refinement strategies can be combined with
each other and applied to any selected part of the model [26].
Although CNS and PHENIX have been used with success, in this
chapter we illustrate the use of REFMAC5 implemented in CCP4i
because it provides a graphic interface to compile the input files;
this feature is particularly helpful for beginners.
1. Rigid body refinement. Firstly, the initial positions of the mole-
cules in the unit cell and the crystal cell provided by MR proce-
dures have to be refined. For this purpose, Rigid Body
refinement should be performed. This method assigns a rigid
geometry to parts of the structure and the parameters of these
constrained parts are refined rather than individual atomic para-
meters. The input files to be uploaded are the MR solution and
the .mtz file containing the experimental reflections. The reso-
lution at which to run rigid body refinement has to be specified
(in general the rigid body refinement should start at the lowest
resolution range) and the rigid entity should be defined (this can
be an entire protein, a protein subunit, or a protein domain). To
define the rigid entities in REFMAC5, simply select the chain
and protein regions that are to be fixed.
2. Output files. The output files are: (a) the .log file that contains a
list of all the operations performed, statistics (Table 4) about the
geometrical parameters after each refinement cycle, crystallo-
graphic R factor and R free factor values, and finally the figure
of merit (see Note 21); (b) the .pdb file containing the refined
coordinates of the model; (c) the .mtz file containing the
observed structure factors (Fobs), the structure factor amplitudes
calculated from the model (Fcalc), and the phase angles calcu-
lated from the model.
3. Coordinates and B factors refinement. The program REFMAC5
refines the x, y, z, and B parameters using the maximum
likelihood method. As for the Rigid Body refinement, the
input files are the .mtz file containing the Fobs and the .pdb file
containing the coordinates of the model. It is also necessary to
restrain the stereochemical parameters using the maximum like-
lihood method. It is possible to choose a numerical value for the
relative weighting terms or, more easily, to choose a single value
for the so-called weight matrix that allows the program to
Table 4
Summary of ten cycles of DpsTe (see Note 25) coordinate refinement using REFMAC5. The Rfact, Rfree,
figure of merit (FOM), and root mean square deviation values of some stereochemical parameters
are shown
3.6 Model Building After the Molecular Replacement and the first cycles of coordinates
refinement, only a partial model has been obtained. In this model,
the side chains are absent, and often parts of the model do not
match the electron density map. Therefore, the building of the
first structural elements is followed by refinement cycles that should
lead to an improvement in the statistics (that is, the R factor has to
decrease and the figure of merit has to increase). The most common
program used for model building is COOT [27], which permits the
direct calculation of electron density maps. Two maps are necessary
to build a model: the 2Fo − Fc map contoured at 1σ, which is
used to trace the model, and the Fo − Fc map contoured at 3σ,
which is necessary to observe the differences between the model
and the experimental data.
1. Starting point. Firstly, a match between protein sequence and
the 2Fo − Fc density map should be found. If the phases are
good, this operation should not be too difficult. The electron
density map should be clear (especially if it has been calculated
Fig. 4 Initial electron density map of Dps from Thermosynechococcus elongatus (see Note 25) calculated
after Molecular Replacement. Cα trace of the model is superimposed on the map. The electron density of a
Trp residue and a Tyr residue are easily recognizable in the map
4 Notes
Fig. 5 Electron density map contoured at 1.0 σ of Dps from Thermosynechococcus elongatus (see Note 25)
calculated after many REFMAC5 refinement cycles. The final structure (thick lines) solved at 1.8 Å resolution is
superimposed on the map
$$\rho_{X,Y} = \frac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$$
18. To avoid model bias often the model is transformed into a
poly-Ala search probe. Only the coordinates of the polypeptide
backbone and of Cβ atoms are conserved, whereas the side
chains atoms are deleted.
19. The MolRep solutions represent the highest superposition
peaks between the experimental Patterson function and the
Patterson function calculated from the search probe, rotated
and translated in the real cell.
20. Correlation coefficient value (CCf) lies between 0 and 1 and
measures the agreement between the structure factors calcu-
lated from the rotated and translated model and the observed
structure factors. The correlation coefficient is calculated by
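REFMAC5 using the following formula:
$$CC_f = \frac{\sum_{hkl} \left( |F_{obs}||F_{calc}| - \langle |F_{obs}| \rangle \langle |F_{calc}| \rangle \right)}{\left[ \sum_{hkl} \left( |F_{obs}|^2 - \langle |F_{obs}| \rangle^2 \right) \sum_{hkl} \left( |F_{calc}|^2 - \langle |F_{calc}| \rangle^2 \right) \right]^{1/2}}$$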
21. Figure of merit. The “figure of merit” m is:
$$m = \frac{|F(hkl)_{best}|}{|F(hkl)|}$$
where:
$$F(hkl)_{best} = \frac{\sum_{\alpha} P(\alpha)\, F_{hkl}(\alpha)}{\sum_{\alpha} P(\alpha)};$$
P(α) is the probability distribution for the phase angle α and
Fhkl(best) represents the best value for the structure factors.
The m value is between 0 and 1 and is a measure of the agree-
ment between the structure factors calculated on the basis of
the model and the observed structure factors. If the model is
correct the figure of merit approaches 1.
22. Noncrystallographic symmetry (NCS) occurs when the asym-
metric unit is formed by two or more identical subunits. The
presence of this additional symmetry could help to improve the
initial phases and obtain interpretable maps for model building
using the so-called density modification techniques [30].
23. Usually, the sequence region that contains the largest number
of aromatic residues is chosen to start the model building.
The aromatic residues (especially tryptophan) contain a high
References
1. Ducruix A, Giegé R (1992) In: Rickwood D, Hames BD (eds) Crystallization of nucleic acids and proteins. Oxford University Press, New York, pp 7–10
2. Hahn T (ed) (2002) International table of crystallography. Kluwer Academic Publishers, Dordrecht
3. McPherson AJ (1990) Current approach to macromolecular crystallization. Eur J Biochem 189:1–23
4. Matthews BW (1968) Solvent content of protein crystals. J Mol Biol 33:491–497
5. Friedrich W, Knipping P, Laue M (1981) In: Glusker JP (ed) Structural crystallography in chemistry and biology. Hutchinson & Ross, Stroudsburg, PA, pp 23–39
6. Bragg WL, Bragg WH (1913) The structure of crystals as indicated by their diffraction of X-ray. Proc R Soc Lond 89:248–277
7. Chothia C, Lesk AM (1986) The relation between the divergence of sequence and structure in proteins. EMBO J 5:823–826
8. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE (2000) The Protein Data Bank. Nucleic Acids Res 28:235–242
9. Crowther RA (1972) In: Rossmann MG (ed) The molecular replacement method. Gordon & Breach, New York, pp 173–178
10. Hendrickson WA (1985) Stereochemically restrained refinement of macromolecular structures. Methods Enzymol 115:252–270
11. Brunger AT, Adams PD, Rice LM (1999) Annealing in crystallography: a powerful optimization tool. Prog Biophys Mol Biol 72:135–155
12. Skarina T, Xu X, Evdokimova E, Savchenko A (2014) High-throughput crystallization screening. Methods Mol Biol 1140:159–168
13. Hui R, Edwards A (2003) High-throughput protein crystallization. J Struct Biol 142:154–161
14. Rodgers DW (1994) Cryocrystallography. Structure 2:1135–1140
15. Mueller M, Wang M, Schulze Briese C (2012) Optimal fine φ-slicing for single-photon-counting pixel detector. Acta Crystallogr D Biol Crystallogr D68:42–56
16. Otwinowski Z, Minor W (1997) Processing of X-ray diffraction data collected in oscillation mode. Methods Enzymol 276:307–326
17. Kabsch W (2010) XDS. Acta Crystallogr D Biol Crystallogr D66:125–132
18. CCP4 (Collaborative Computational Project, number 4) (1994) The CCP4 suite: programs for protein crystallography. Acta Crystallogr D50:760–763
19. French GS, Wilson KS (1978) On the treatment of negative intensity observations. Acta Crystallogr A34:517–525
20. Altschul SF, Koonin EV (1998) Iterated profile searches with PSI-BLAST: a tool for discovery in protein databases. TIBS 23:444–447
21. Navaza G (1994) AMORE: an automated package for molecular replacement. Acta Crystallogr A50:157–163
22. McCoy AJ, Grosse-Kunstleve RW, Adams PD, Winn MD, Storoni LC, Read RJ (2007) Phaser crystallographic software. J Appl Crystallogr 40:658–674
23. Vagin A, Teplyakov A (1997) MOLREP: an automated program for molecular replacement. J Appl Crystallogr 30:1022–1025
24. Brünger AT, Adams PD, Clore GM, DeLano WL, Gros P, Grosse-Kunstleve RW, Jiang JS, Kuszewski J, Nilges M, Pannu NS, Read RJ, Rice LM, Simonson T, Warren GL (1998) Crystallography & NMR system: a new
Abstract
Nucleotide and protein sequences are the foundation for all bioinformatics tools and resources. Researchers
can analyze these sequences to discover genes or predict the function of their products. The INSDC
(International Nucleotide Sequence Database—DDBJ/ENA/GenBank + SRA) is an international, cen-
tralized primary sequence resource that is freely available on the Internet. This database contains all publicly
available nucleotide and derived protein sequences. This chapter discusses the structure and history of the
nucleotide sequence database resources built at NCBI, provides information on how to submit sequences
to the databases, and explains how to access the sequence data.
Key words Sequence database, GenBank, SRA, INSDC, RefSeq, Next generation sequencing
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_4, © Springer Science+Business Media New York 2017
80 Christopher O’Sullivan et al.
Table 1
Links for data retrieval from DDBJ, EMBL-EBI, and NCBI
1.2 SRA/GEO The NCBI SRA stores raw sequencing data and alignment information
from high-throughput sequencing platforms, including
Illumina, Applied Biosystems SOLiD, Complete Genomics, Pacific
Biosciences SMRT, Nanopore, Ion Torrent, and Roche 454. The
data stored in the NCBI SRA are suitable for the reanalysis of data
that support publications as well as data that support assembly,
annotation, variation, and expression data submissions to other
NCBI archives. As of October 2016, SRA contains more than 9
petabases (9 × 10^15 bases) from nearly 49,000 different source
organisms and metagenomes. Approximately half of the total is
controlled-access human clinical sequence data supporting dbGaP
studies. SRA growth is explosive, doubling approximately every 12
months. Statistics are updated daily and available from the SRA
home page (https://trace.ncbi.nlm.nih.gov/Traces/sra/). SRA
stores descriptive metadata and sequence data separately using
distinct accession series: the SRA Experiment and Run. The SRA
Experiment record is used for search and display. It contains details
describing sequence library preparation, molecular and bioinformatics
workflows, and sequencing instruments. The SRA Run contains
sequence, quality, alignment, and statistics from a specific
library preparation for a single biological sample. Every SRA Experiment
references a BioSample and a BioProject.
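The growth figures quoted above imply a simple exponential model. The sketch below is a naive extrapolation only: it assumes that the 12-month doubling time stays constant, which the text does not claim, and the function name is hypothetical.

```python
def projected_bases(months_ahead, current_bases=9e15, doubling_months=12.0):
    """Archive size after a given number of months under constant doubling."""
    return current_bases * 2.0 ** (months_ahead / doubling_months)

one_year = projected_bases(12)    # one doubling
two_years = projected_bases(24)   # two doublings
print(one_year, two_years)
```

Under this assumption the archive would pass 18 petabases one year after the October 2016 figure and 36 petabases after two years.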
GEO [6], the Gene Expression Omnibus, is a public functional
genomics data repository supporting MIAME-compliant
data submissions. An integral part of NCBI’s primary data archives,
GEO stores profiles of gene expression, coding and noncoding
RNA, Chromatin Immunoprecipitation, genome methylation,
genome variation, and SNP arrays derived from microarrays and
next-generation sequencing. Raw sequence data submitted in sup-
port of GEO profiles is stored in SRA. The profiles and underlying
sequence data are linked via BioSample and BioProject references.
Fig. 1 BioProject report from Entrez. This report provides information about an initiative for sequencing
Salmonella enterica, a common foodborne pathogen. The report includes links to the GenBank assemblies, the
SRA reads, the BioSamples, and the publications. There are also links to other related projects
1.6 Genomes The first complete microbial genome was submitted to GenBank in
1995 [7]. Since then, more than 151,000 cellular genomes have
been released into the public archive. These genomes are derived
from organisms from all branches of the tree of life. Microbial
genomes, those from bacteria, archaea and fungi, are relatively
small in size compared with their eukaryotic counterparts, ranging
from hundreds of thousands to millions of bases. Nonetheless,
these genomes contain thousands of genes, coding regions, and
structural RNAs. Many of the genes in microbial genomes have
been identified only by similarity to other genomes and their gene
products are often classified as hypothetical proteins. Each gene
within a genome is assigned a locus_tag: a unique identifier for a
particular gene in a particular genome. Since the function of many
of the genes is unknown, locus_tags have become surrogate gene
identifiers. Submitters of prokaryote genomes are encouraged to
use the NCBI Prokaryotic Genome Annotation Pipeline [8] to
annotate genome submissions. This service provides standardized
annotation for genomes, which can easily be compared across
genomes. This pipeline is also used to annotate prokaryotic gen-
omes for RefSeq.
The first genome sequences were built by sequencing overlapping
clones and then assembling the genome by overlap. Many
bacterial genomes and the first human genome were sequenced in
this manner. In 2001, a new approach for sequencing complete
genomes was introduced: Whole Genome Shotgun (WGS). WGS data are
generated by breaking the genome into random fragments for
sequencing and then computationally assembling the reads to form
contigs, which can be assembled into larger structures called
scaffolds. All of the contig sequences from a single genome assembly,
along with the instructions for building scaffolds, are submitted
together as a single WGS project. WGS has become such a dominant
sequencing technique that more nucleotides of WGS sequence
have been added to INSDC than from all of the other divisions
since the inception of the database.
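The relationship between contigs, scaffold-building instructions, and the resulting scaffold can be illustrated with a minimal Python sketch. The contig names, sequences, and the tuple-based instruction layout below are invented for illustration only; they are not the format used for actual WGS submissions.

```python
# Toy illustration: join contigs into a scaffold by following build instructions.
# All names, sequences, and the instruction layout are hypothetical.

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a nucleotide sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_scaffold(contigs, instructions):
    """Concatenate contigs in order, filling gaps of known length with N.

    contigs: dict mapping contig name -> sequence
    instructions: list of ("contig", name, orientation) or ("gap", length)
    """
    parts = []
    for step in instructions:
        if step[0] == "gap":
            parts.append("N" * step[1])
        elif step[0] == "contig":
            _, name, orientation = step
            seq = contigs[name]
            parts.append(reverse_complement(seq) if orientation == "-" else seq)
        else:
            raise ValueError(f"unknown instruction: {step!r}")
    return "".join(parts)


contigs = {"contig1": "ACGTAC", "contig2": "GGGTTT"}
instructions = [("contig", "contig1", "+"), ("gap", 5), ("contig", "contig2", "-")]
scaffold = build_scaffold(contigs, instructions)
print(scaffold)
```

In this toy case the second contig is placed in reverse orientation, so its reverse complement is appended after a five-base gap.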
For each project, a master record is created that contains infor-
mation that is common among all the records of the sequencing
projects, such as the biological source, submitter information, and
publication information. Each master record includes links to the
range of accession numbers for the individual contigs in the assem-
bly and links to the range of accessions for the scaffolds from the
Fig. 2 The identical protein report that can be accessed from Entrez Protein (https://www.ncbi.nlm.nih.gov/protein)
record. It is a tabular display of all of the protein sequences in GenBank with the identical sequence. It
includes links to coding regions, the genome records, the protein product name as it is cited in the genome
record and the source organisms that have this protein
1.8 The GenBank Sequence Record
A GenBank sequence record is most familiarly viewed as a flat file
in which the data and associated metadata are structured for human
readability. The format specifications are as follows.
1.8.2 Reference Section
The next section of the GenBank flat file contains the bibliographic
and submitter information (Fig. 4):
The REFERENCE section contains published and unpublished
references. Many published references include a link to a
PubMed ID number that allows users to view the abstract of the
cited paper in PubMed. The last REFERENCE cited in a record
reports the names of the submitters of the sequence data and the
location where the work was done.
The COMMENT field may have submitter-provided comments
about the sequence or a table that contains structured metadata
for the record. This example has sequencing and assembly
methodology, but there are other structured comments with
isolation source or other phenotypic information. In addition, if the
sequence has been updated, then the COMMENT will have a link
to the previous version.
1.8.3 Features and Sequence
The FEATURES section contains a source feature, which has
additional information about the source of the sequence and the
organism from which the DNA was isolated. There are approximately 50
different standard qualifiers that can be used to describe the source.
Some examples are /strain, /chromosome, and /host. Following
the source feature are annotations that describe the sequence, such
as gene, CDS (coding region), mRNA, rRNA, variation, and
others. Like the source feature, other features can be further
described with feature-specific qualifiers. For example, a mandatory
qualifier for the CDS feature is /translation, which contains
the protein sequence. Following the Feature section is the
nucleotide sequence itself. The specification for the Feature Table can
be found at http://www.insdc.org/documents/feature-table
(see Note 3). An example is included as Fig. 5.
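To make the feature/qualifier layout concrete, here is a minimal Python sketch that splits a flat-file FEATURES block into feature keys, locations, and qualifiers. The record text is invented for illustration, and the parser is deliberately simplified: it assumes every qualifier fits on one line and ignores continuation lines, so it is not a substitute for a full flat-file parser.

```python
# Minimal sketch: read feature keys, locations, and /qualifiers from a
# FEATURES block. The example text below is made up for illustration.

EXAMPLE_FEATURES = """\
     source          1..1200
                     /organism="Escherichia coli"
                     /strain="K-12"
     CDS             10..400
                     /product="hypothetical protein"
"""


def parse_features(text):
    """Return a list of dicts with 'key', 'location', and 'qualifiers'."""
    features = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("/"):
            # Qualifier line: /name="value" (value quotes are optional)
            name, _, value = stripped[1:].partition("=")
            features[-1]["qualifiers"][name] = value.strip('"')
        else:
            # Feature line: key followed by its location
            key, location = stripped.split(None, 1)
            features.append({"key": key, "location": location, "qualifiers": {}})
    return features


features = parse_features(EXAMPLE_FEATURES)
for feature in features:
    print(feature["key"], feature["location"], feature["qualifiers"])
```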
1.9 Updates and Maintenance of the Database
An INSDC record can be updated by the submitter any time new
information is acquired. Updates can include: adding new
sequence, correcting existing sequence, adding a publication, or
adding new annotation. Information about acceptable update
formats can be found at https://www.ncbi.nlm.nih.gov/genbank/update.
The new, updated record replaces the older one in the
database and in the retrieval and analysis tools. However, since
GenBank is archival, a copy of the older record is maintained in
the database. The sequence in the GenBank record is versioned. For
example, KM527068.1 is the first version of the sequence in
GenBank record KM527068. When the sequence is modified or
updated, the accession version is incremented, so the accession
in the sample record will become KM527068.2 after a sequence
update. The base accession number does not change, as this is a
stable identifier for this record. A COMMENT is added to the
updated GenBank flat file that indicates when the sequence is
Fig. 6 Sequence Revision History page allows users to retrieve older versions of a sequence record prior to it
being updated. Sequence changes are indicated by incrementing the version number. One can view the
modifications that were made during an update for two versions of a sequence record by choosing a version in
columns I and II and then clicking the Show button
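The accession.version convention is simple enough to handle in a few lines of code. The sketch below, in Python, splits an identifier such as KM527068.1 into its stable base accession and its version, and computes the identifier expected after one sequence update. The function names are hypothetical.

```python
# Sketch: work with accession.version identifiers such as KM527068.1.
# Function names are illustrative, not part of any NCBI tool.


def split_accession(accession_version):
    """Split 'KM527068.1' into the stable base accession and integer version."""
    base, _, version = accession_version.partition(".")
    return base, int(version)


def next_version(accession_version):
    """Return the identifier expected after one sequence update."""
    base, version = split_accession(accession_version)
    return f"{base}.{version + 1}"


print(split_accession("KM527068.1"))
print(next_version("KM527068.1"))
```

Only the version suffix changes; the base accession stays the same across updates.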
1.10.2 Sequence Contamination
The GenBank staff actively removes vector and linker contamination
from sequence submissions when it is discovered. GenBank
submissions are screened against a specialized BLAST database,
UniVec (https://www.ncbi.nlm.nih.gov/tools/vecscreen/), to
detect vector contamination. Assembled genomes are also screened
for contamination by sequence from other organisms, for instance,
stretches of human DNA in a bacterial assembly.
2.1 SRA Submissions
SRA processes multiple terabytes of sequence data every day. Like
GenBank, SRA receives submissions from a variety of sources,
including small labs, core facilities, and genome sequencing centers.
Low- to mid-volume submissions are initiated via the NCBI SRA submission
portal (https://submit.ncbi.nlm.nih.gov/subs/sra/), where
submitters enter descriptions of the samples and libraries that they
intend to upload. Data files may be uploaded from the browser
but are typically delivered separately using FTP or the Aspera FASP
protocol (https://downloads.asperasoft.com/connect2/).
High-volume automated submission pipelines use dedicated upload
accounts to deliver data files and submit bulk metadata via
programmatically generated XML files.
NCBI works to help labs that do significant amounts of
sequencing, and are therefore covered under the Genomic Data
Submission policy, comply with that policy. Submissions to SRA, GEO,
dbGaP (https://www.ncbi.nlm.nih.gov/gap), or GenBank qualify as
acceptable submissions. Resource-specific help desks assist
individuals with technical issues regarding those submissions
(see Note 4).
3.2 Simple Entrez Query
To perform a direct Entrez query using SRA metadata search terms,
go to the top of the NCBI home page (https://www.ncbi.nlm.nih.gov/),
select “SRA” from the drop-down list of available databases,
and enter a query (for example, “salmonella”) in the search
box. The SRA display page of results for your query will include
records for each matching study, with links to read/run data,
project descriptions, etc.
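The same simple query can also be expressed as a request to the Entrez Programming Utilities (E-utilities). The Python sketch below only builds the request URL and does not contact NCBI. The ESearch endpoint and the db, term, and retmax parameters are taken from the public E-utilities interface rather than from this chapter, and the helper name is hypothetical.

```python
# Sketch: build an E-utilities ESearch URL equivalent to typing "salmonella"
# into the SRA search box. No network request is made here.
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term, retmax=20):
    """Return an ESearch URL for the given Entrez database and query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)


url = esearch_url("sra", "salmonella")
print(url)
```

Fetching this URL (for example with urllib.request) returns the matching record identifiers; check NCBI's usage guidelines before issuing automated requests.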
3.3 Advanced Entrez Query
Go to the top of the NCBI home page (https://www.ncbi.nlm.nih.gov/)
and select “SRA” from the drop-down list of available databases,
then click the “Advanced” link located just below the search
bar at the top of the page. The SRA Advanced Search Builder page
(https://www.ncbi.nlm.nih.gov/sra/advanced) will appear. On
this page you can construct a complex SRA query by selecting
multiple search terms from a large number of fields and qualifiers
such as accession number, author, organism, text word, publication
date, and properties (paired-end, RNA, DNA, etc.). See the
Advanced Search Builder video tutorial (https://www.youtube.
3.4 SRA Home Page Query
You can also search SRA through the coordinated use of the
“Browse” and “Search” tabs on the SRA home page
(https://www.ncbi.nlm.nih.gov/Traces/sra/). The SRA Web interface
allows the user to:
(a) Access any data type stored in SRA independently of any other
data type (e.g., accessing read and quality data without the
intensity data).
(b) Access reads and quality scores in parallel.
(c) Access related data from other NCBI resources that are
integrated with SRA.
(d) Retrieve data based on ancillary information and/or sequence
comparisons.
(e) Retrieve alignments in “vertical slices” (showing underlying
layered data) by reference sequence location.
(f) Review the descriptions of studies and experiments (metadata)
independently of experimental data.
3.5.1 Putting It All Together with BioProject
BioProject can aggregate several data types (e.g., a genome assembly
and a transcriptome) by study (e.g., NIH grant). You can access
BioProject records by browsing, querying, or downloading in
Entrez, or by following a link from another NCBI database.
3.7 BioProject Query
You can perform a search in BioProject as you would in any other
Entrez database, namely by searching for an organism name, text
word, or BioProject accession (e.g., PRJNA31257), or by using the
Advanced Search page to build a query restricted by multiple fields.
Search results can be filtered by Project Type, Project Attributes,
Organism, or Metagenome Groups, or by the presence or absence
of associated data in one of the data archives.
Table 2 contains some representative searches:
3.8 BioProject Linking
You can also find BioProject records by following links from
archival databases when the data cites a BioProject accession. You can
find links to BioProject in several databases including SRA,
Assembly, BioSample, dbVar, Gene, Genome, GEO, and Nucleotide
(which includes GenBank and RefSeq nucleotide sequences).
Large consortia also use LinkOut (https://www.ncbi.nlm.nih.gov/projects/linkout/)
to link from BioProject to their resources.
Table 2
Some example BioProject searches
3.9 GenBank
GenBank and RefSeq nucleotide sequence records can be retrieved
from Entrez Nucleotide (https://www.ncbi.nlm.nih.gov/nuccore/).
EST and GSS records are searched and retrieved from
https://www.ncbi.nlm.nih.gov/est/ and https://www.ncbi.nlm.nih.gov/gss/
or by choosing the appropriate database from the
pull-down menu at the top of most NCBI Web pages. Entrez
Protein (https://www.ncbi.nlm.nih.gov/protein/) is a collection
of protein sequences from a variety of sources, including
translations from annotated coding regions in INSDC and RefSeq,
UniProt, and PDB. The Entrez retrieval system has a network of links
that join entries from each of the databases. For example, a
GenBank record in Nucleotide can have links to the Taxonomy,
PubMed, PubMed Central, Protein, PopSet, BioProject, Gene,
and Genome databases. Within a GenBank flat file, hyperlinks to the
Taxonomy, BioProject, BioSample, and PubMed databases are
displayed if the links are present. Links to external databases can be
made by LinkOut or by cross-references (db_xrefs) within the
entry. By taking advantage of these links, users can make important
scientific discoveries. These links are critical to discovering the
relationship between a single piece of data and the information
available in other databases.
In Entrez, sequence data can be viewed in a number of different
formats. The default and most readable format is the GenBank flat
file view. The graphical view, which eliminates most of the text and
displays just sequence and biological features, is another display
option. Other displays of the data, for instance XML, ASN.1, or
FASTA formats, are intended to be more computer-readable.
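As an example of how readily the computer-oriented formats can be processed, the following Python sketch reads FASTA-formatted text into a dictionary keyed by the first word of each definition line. The two records are invented for illustration, and the function name is hypothetical.

```python
# Sketch: parse FASTA text into {identifier: sequence}.
# The example records are made up.

EXAMPLE_FASTA = """\
>example1 first made-up record
ACGT
TTGA
>example2 second made-up record
GGCC
"""


def read_fasta(lines):
    """Collect sequences keyed by the first word of each '>' definition line."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect and join them later
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


records = read_fasta(EXAMPLE_FASTA.splitlines())
print(records)
```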
Fig. 7 The NCBI Assembly resource can be searched for taxonomic and sequence information pertaining to
whole genomes and scaffolds submitted to NCBI. Sequences can be downloaded from the GenBank and
RefSeq FTP sites that are accessible from the NCBI Assembly pages
4.1 The SRA Toolkit
The SRA Toolkit is a collection of tools and libraries for using the
SRA archive file format. SRA utilities have the ability to locate and
download data on demand from NCBI servers, removing the need
for a separate download step and, most importantly, downloading
only the required data. This feature can reduce the bandwidth, storage,
and time taken to perform tasks that use less than 100% of the
data contained in a run. Utilities are provided for delivering data in
commonly used text formats such as FASTQ and SAM. Additional
information on using, configuring, and building the toolkit is
maintained on the NCBI GitHub repository (see Note 9).
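Once reads have been delivered as FASTQ text, they can be consumed with very little code. The Python sketch below parses four-line FASTQ records and converts the quality string to numeric scores. It assumes the Sanger (Phred+33) quality encoding and unwrapped records. The reads shown are invented, and the function name is hypothetical.

```python
# Sketch: iterate over four-line FASTQ records and decode Phred+33 qualities.
# Assumes Sanger encoding and one line each for sequence and quality.

EXAMPLE_FASTQ = """\
@read1
ACGT
+
IIII
@read2
GG
+
!5
"""


def read_fastq(lines):
    """Yield (name, sequence, quality_scores) for each FASTQ record."""
    cleaned = (line.rstrip("\n") for line in lines)
    # zip() over the same iterator four times groups lines into records
    for header, seq, _plus, qual in zip(cleaned, cleaned, cleaned, cleaned):
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        scores = [ord(char) - 33 for char in qual]
        yield header[1:], seq, scores


reads = list(read_fastq(EXAMPLE_FASTQ.splitlines()))
for name, seq, scores in reads:
    print(name, seq, scores)
```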
4.2 Programmatic Interaction with the SRA Toolkit
We have developed a new, domain-specific API, called NGS, for
accessing reads, alignments, and pileups produced from Next
Generation Sequencing (see Note 9). The API itself is independent from any
particular back-end implementation, and supports use of multiple
back-ends simultaneously. It also provides a library for building new
back-end “engines.” The engine for accessing SRA data is
contained within the sister repository ncbi-vdb.
The API is currently expressed in the C++, Java, and Python
languages. The design makes it possible to maintain a high degree of
similarity between the code in one language and code in another,
especially between C++ and Java.
4.3 BioProject Download
In addition to the Entrez Web interface and the BioProject browse
page, you can download the entire BioProject database and the
database .xsd schema from the FTP site (ftp://ftp.ncbi.nlm.nih.gov/bioproject/),
or use the Entrez Programming Utilities (E-utilities)
to programmatically access public BioProject records.
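For programmatic access through E-utilities, a search returns an XML document listing matching record identifiers. The Python sketch below parses such a response with the standard library. The sample response is hand-written to mimic the general shape of an ESearch result (Count and IdList/Id elements), and the identifiers in it are placeholders, not real BioProject records.

```python
# Sketch: extract the hit count and record IDs from an ESearch-style XML reply.
# SAMPLE_RESPONSE is hand-written; the IDs are placeholders.
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """\
<eSearchResult>
  <Count>2</Count>
  <RetMax>2</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>111111</Id>
    <Id>222222</Id>
  </IdList>
</eSearchResult>
"""


def parse_esearch(xml_text):
    """Return (total_count, list_of_ids) from an ESearch XML document."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [element.text for element in root.findall("./IdList/Id")]
    return count, ids


count, ids = parse_esearch(SAMPLE_RESPONSE)
print(count, ids)
```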
4.4 GenBank FTP
There is a bimonthly release of GenBank, which is available from
the NCBI FTP site (ftp://ftp.ncbi.nih.gov/genbank/). Between
releases, there is a daily dump of the sequences loaded into the
database to the FTP site (ftp://ftp.ncbi.nih.gov/genbank/daily-nc/).
Genome assemblies are retrievable by organism or by
taxonomic group in a variety of formats on the FTP site
(ftp://ftp.ncbi.nih.gov/genomes/).
The Assembly database has FTP URLs in the upper right-hand
corner of each page (see Fig. 7).
5 Conclusion
6 Notes
Table 3
Traditional taxonomic GenBank divisions
Code  Description
BCT   Bacterial sequences
PRI   Primate sequences
MAM   Other mammalian sequences
VRT   Other vertebrate sequences
INV   Invertebrate sequences
PLN   Plant, fungal, and algal sequences
VRL   Viral sequences
PHG   Bacteriophage sequences
SYN   Synthetic and chimeric sequences
UNA   Unannotated sequences, including some WGS sequences obtained via environmental sampling methods
Nontraditional GenBank divisions
Code  Description
PAT   Patent sequences
EST   EST division sequences, or expressed sequence tags, are short single pass reads of transcribed sequence
STS   STS division sequences include anonymous STSs based on genomic sequence as well as gene-based STSs derived from the 3′ ends of genes. STS records usually include primer sequences, annotations and PCR reaction conditions
GSS   GSS records are predominantly single reads from bacterial artificial chromosomes (“BAC-ends”) used in a variety of clone-based genome sequencing projects
ENV   The ENV division of GenBank, for non-WGS sequences obtained via environmental sampling methods in which the source organism is unknown
HTG   The HTG division of GenBank contains unfinished large-scale genomic records that are in transition to a finished state. These records are designated as Phase 0–3 depending on the quality of the data. Upon reaching Phase 3, the finished state, HTG records are moved into the appropriate taxonomic division of GenBank
HTC   The HTC division of GenBank accommodates high-throughput cDNA sequences. HTCs are of draft quality but may contain 5′ UTRs and 3′ UTRs, partial coding regions, and introns
CON   Large records that are assembled from smaller records, such as eukaryotic chromosomal sequences or WGS scaffolds, are represented in the GenBank “CON” division. CON records contain sets of assembly instructions to allow the transparent display and download of the full record using tools such as NCBI’s Entrez
TSA   Transcriptome shotgun data are transcript sequences assembled from sequences deposited in the NCBI Trace Archive, the Sequence Read Archive (SRA), and the EST division of GenBank
>Feature Sc_16
1 7000 REFERENCE
PubMed 8849441
<1 1050 gene
gene ATH1
<1 1009 CDS
product acid trehalase
product Ath1p
codon_start 2
<1 1050 mRNA
Table 4
SRA experimental enumeration values and definitions
Acknowledgement
References
1. Karsch-Mizrachi I, Nakamura Y, Cochrane G (2012) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res 40(Database issue):D33–D37
2. Benson DA, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2015) GenBank. Nucleic Acids Res 43(Database issue):D30–D35
3. Silvester N, Alako B, Amid C, Cerdeno-Tarraga A, Cleland I, Gibson R et al (2015) Content discovery and retrieval services at the European Nucleotide Archive. Nucleic Acids Res 43(Database issue):D23–D29
4. Kodama Y, Mashima J, Kosuge T, Katayama T, Fujisawa T, Kaminuma E et al (2015) The DDBJ Japanese Genotype-phenotype Archive for genetic and phenotypic human data. Nucleic Acids Res 43(Database issue):D18–D22
5. Mellmann A, Harmsen D, Cummings CA, Zentz EB, Leopold SR, Rico A et al (2011) Prospective genomic characterization of the
Genome Annotation
Imad Abugessaisa, Takeya Kasukawa, and Hideya Kawaji
Abstract
The dynamic structure and functions of genomes are being revealed as genomics technologies and
research progress. Evidence indicating regional characteristics of the genome (genome
annotations in a broad sense) provides the basis for further analyses. Target listing and screening can be
effectively performed in silico using such data. Here, we describe steps to obtain publicly available genome
annotations or to construct new annotations based on your own analyses, as well as an overview of the types
of available genome annotations and corresponding resources.
Key words Genome annotation, Gene functions, RNA-Seq, Epigenetic marks, Genome browser
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_5, © Springer Science+Business Media New York 2017
2 Materials
2.1.3 Evolutionary Conservation and Variation
1. Genome sequences reflect the history of life, and are shaped by
negative selection against disadvantageous genomic variations
(alleles), positive selection of favored ones, and genetic drift on
neutral ones. Genomic segments that are conserved among species are
likely to be important, and indeed many protein-coding
sequences (CDSs) are conserved. Interestingly, promoter
regions of noncoding transcripts (ncRNA) are conserved as
well as promoters of protein-coding transcripts, while ncRNA
exons tend to be less conserved [27]. Ultra-conserved regions,
where an almost complete identity is observed between
orthologous regions of human, rat, and mouse [28], are remarkable
occurrences in terms of evolution. DNA sequence alignments,
conserved genomic segments, and scores indicating the levels of
conservation at a single base-pair resolution [29, 30] are relevant
components of genome annotation.
2. Genetic variations among cells, individuals, and populations may
provide explanations for variations in phenotypes. A large
number of studies have published statistical associations between
specific alleles and diseases [31] and a variety of somatic
mutations in cancers [32]. ClinVar [33] aims to collect genomic
variation of clinical importance, not only for cancers but also for many
other diseases. Besides coordinates and allelic frequencies of
genetic variations [34–37], their associations with specific
phenotypes may provide clues to understand the contribution of each
genomic segment to individual phenotypes at the system level.
2.2 Technical Frameworks to Use Genome Annotations
2.2.1 Data File Formats
One of the common ways to store genome annotations is to use a
tab-delimited text format, since it allows us to inspect the contents
manually and to handle the data files on any computer. Use of
predefined formats permits end-users to use their own annotations
with existing tools, and hence to explore, query, or visualize
the data. Several tab-delimited formats have been
proposed, including General Feature Format (GFF, see Note 1)
[42], Gene Transfer Format (GTF), Browser Extensible Data
(BED, see Note 2), Wiggle Track Format (Wig) [43], Variant Call
Format (VCF [44]), and Sequence Alignment/Map format (SAM).
Most of these formats are easy to understand, parse,
and process using lightweight scripting languages such as Perl,
Python, or Ruby, and several tools dedicated to handling genomic
intervals are freely available, such as BedTools [45] and SAMtools
[46]. Here we introduce BED as an example of a genome annotation
file format. Individual values are simply tab-delimited as in
Table 1, which describes seven regions as genome annotations:
At minimum, a BED file has three mandatory columns and nine
optional ones. The mandatory columns are:
1. The name of the chromosome in the format of chr#, e.g., chr1,
chr3, chr5, and chrM.
2. The starting position in the chromosome, i.e., the starting posi-
tion of the genomic region.
3. The ending position in the chromosome, i.e., the ending posi-
tion of the genomic region.
To describe genome annotations, the following optional col-
umns can be used:
Table 1
CAGE peaks stored in BED file format
4. The name: a label identifying the row (feature) in the
BED file, usually displayed at the left side of the genome
browser.
5. The score value: the range of this field is 0–1000. The
score determines how the genomic feature is displayed
in the genome browser.
6. Strand value, containing either ‘+’ or ‘-’.
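The first six BED columns listed above can be read with a short Python sketch such as the following. The example line is invented and is not taken from Table 1. Two details are assumptions drawn from the common UCSC convention rather than from the text above: start coordinates are treated as 0-based with an exclusive end, and '.' is accepted as a strand value for unstranded features.

```python
# Sketch: parse the three mandatory and first three optional BED columns.
# The example line and the BedRecord name are illustrative.
from typing import NamedTuple, Optional


class BedRecord(NamedTuple):
    chrom: str
    start: int  # assumed 0-based (UCSC convention)
    end: int    # assumed exclusive (UCSC convention)
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None


def parse_bed_line(line):
    """Parse one tab-delimited BED line into a BedRecord."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start, and end")
    name = fields[3] if len(fields) > 3 else None
    score = int(fields[4]) if len(fields) > 4 else None
    strand = fields[5] if len(fields) > 5 else None
    if score is not None and not 0 <= score <= 1000:
        raise ValueError("score must be in the range 0-1000")
    if strand is not None and strand not in ("+", "-", "."):
        raise ValueError("strand must be '+', '-', or '.'")
    return BedRecord(fields[0], int(fields[1]), int(fields[2]), name, score, strand)


record = parse_bed_line("chr17\t1000\t1050\tpeak1\t500\t-")
print(record, "length:", record.end - record.start)
```

Under the half-open convention assumed here, the length of a region is simply end minus start.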
2.2.2 Genome Browsers
Visualization of genomics data is an essential step in data analysis to
explore genomic information, inspect quality of user-defined data,
elucidate the meaning of data and generate hypotheses. A genome
browser is a tool to visualize genomic sequences with annotations
in a schematic view, where users are typically able to perform
multiple operations across the genomic coordinates (e.g., zoom
in, zoom out, search, and download) via an interactive graphical
user interface. A variety of genome browsers have been developed
so far and they can be classified into two groups:
1. Web-based genome browsers, such as the UCSC genome
browser [43], Ensembl genome browser [47], Generic Genome
Browser (GBrowse [48]), and ZENBU [49] (see Figs. 1, 2, and
3), in which services are hosted on a remote computer and
end-users access them with a web browser over the Internet.
Typically a variety of genome annotations are preloaded and
configured. End-users can display and browse annotations and
upload their own data to compare with them.
2. Desktop genome browsers, such as IGV [50] and the NCBI
Genome Workbench (http://www.ncbi.nlm.nih.gov/tools/gbench/),
in which the software has to be installed on a local
computer. Typically only a minimum set of genome annotations
is pre-configured at the beginning, and end-users can browse
the data without a real-time connection to the Internet.
Fig. 1 Ensembl genome browser, displaying BRCA1 locus. BRCA1 locus is displayed in three panes: the top
pane indicates the genetic band of chromosome 17, the middle pane indicates neighboring genes, and the
bottom one displays more details, such as mRNA structure, at the locus. A keyword search box is located
under the middle pane as well as in the top menu
2.2.3 Further Use of Genome Annotations
Genome browsers are the most commonly used tools to inspect
genome annotations, and tab-delimited format is the simplest way
to store them, as described above. Here we introduce several
complementary tools and formats that are used for specific
purposes.
1. End-users may wish to obtain a fraction of the entire genome
annotations for their own analysis. Several systems support batch
queries with graphical user interfaces, including BioMart [51]
and the UCSC table browser [43].
2. Genomic analyses often require handling large volumes of genome
annotations, in particular when using NGS-based approaches.
Several compressed and binary formats have been proposed for
efficient retrieval of individual records by genomic coordinates,
such as BAM (a binary version of SAM [46]), BCF (a
binary version of VCF [44]), and Big Binary Indexed (BBI) files
Fig. 2 ZENBU genome browser, displaying BRCA1 locus. BRCA1 locus is displayed in five tracks: the first one
indicates the genetic band of chromosome 17 and the genomic coordinates of the displayed region, the
second and third show gene and mRNA structure. The fourth and the fifth indicate transcription initiation
activities and their peaks. The bar graph below the tracks indicates the levels of transcription initiation
activities per sample. A keyword search box is located above the tracks
3 Methods
3.1 Scenario I: Search and Obtain Published Genome Annotations
Assume that you find a tumor suppressor gene, BRCA1, in an article
or a list of your analysis results, and you are interested in its RNA
structure (exon/intron), genetic variations, genome conservation,
and epigenetic context around the gene. Genome browsers such as
the Ensembl Genome Browser and ZENBU can be used for this
purpose; here we use the UCSC Genome Browser as an example.
Conceptually, the following steps are common to these genome
browsers. See each browser's manual for details.
1. Access a genome browser with your favorite web browser, and
open a window corresponding to the human genome assembly.
In the case of the UCSC genome browser, open the URL
http://genome.ucsc.edu/, followed by selection of “Genome
Browser” in the left menu. Note that multiple genome assemblies
have been published for human as a consequence of successive
efforts to improve the reference genome sequences. You
have to select an appropriate assembly version (usually the latest
one) from those available.
2. Type the gene name “BRCA1” and press the submit button (the
browser supports text and sequence search). The resulting page
shows a list of annotations including “BRCA1”.
3. Click any of the genes in the above list. This will take you back to
the genome browser displaying the BRCA1 gene. Figure 3
shows some of the annotations that may be displayed, including
several isoforms of BRCA1, SNPs reported in the region, var-
iants reported in cancer, genome conservation across the species,
and epigenetic context such as transcription factor binding
regions and transcription initiation sites. Several interfaces are
provided for exploring neighboring regions, such as zoom in/
out, move to right/left, and display/hide for each of the
genome annotations.
4. Click “tools” in the top menu and select “Table Browser” to
access the UCSC table browser. It enables the user to obtain
genome annotations, whether in the displayed region or the
whole genome, in a tabular form (see Note 3).
3.2 Scenario II: Make and Browse Your Own Annotations
Assume that you have produced your own ChIP-seq data using
an anti-BRCA1 antibody, and you hope to generate your own genome
annotations and inspect them visually. As an example we use a set of
data generated by the ENCODE project [38], consisting of raw
reads, their alignments with the reference genome, and identified
binding sites of the protein in HeLa cells (see Table 2 for URLs of
the example data). BRCA1 has been shown to regulate itself
negatively (autoregulation) in several cell lines [53]; thus we expect the
presence of ChIP-seq signals at the locus.
1. Install a local genome browser, IGV, by following the
instructions at http://www.broadinstitute.org/igv/ (this can be
skipped if you have already installed it).
Fig. 3 The UCSC genome browser, displaying BRCA1 locus. BRCA1 locus is displayed with BRCA1 ChIP-seq
results, somatic mutations in cancers, and other annotations. A keyword search box is placed above the
genetic band. The view of the track is adjustable by right-clicking (click while pressing the control key on Mac)
of the tracks or using the detailed interface below the tracks. The immediate view from the search results is
very dense and gives an overview of the whole chromosome. Users can zoom in and click on one element to
get detailed information about a particular gene, mRNA, etc.
2. Align the raw reads with the reference genome using an
appropriate alignment program (the algorithms are reviewed in Ref.
[54]), and produce alignment results in BAM format with an index
[46]. Note that the aligned results are already available for this
case, as indicated in Table 2.
3. Start the local genome browser IGV, then load the
reference genome (hg19) and specify the prepared BAM file.
Note that the BAM file is located on a remote server in this case,
but it can also be on a local disk. Numerous reads are aligned at the
promoter of BRCA1 (Fig. 4).
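The visual impression that reads accumulate at a locus can be checked numerically. The Python sketch below counts alignments that overlap a region, working on SAM text (the plain-text counterpart of BAM). The alignment lines and the region coordinates are invented and do not correspond to the real BRCA1 promoter. For real BAM files, dedicated tools such as SAMtools [46] are the usual choice; the function names here are hypothetical.

```python
# Sketch: count mapped SAM alignments overlapping a genomic region.
# Example alignments and coordinates are made up.
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

EXAMPLE_SAM = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t0\tchr17\t101\t60\t50M\t*\t0\t0\t*\t*",
    "r2\t16\tchr17\t140\t60\t10M5D10M\t*\t0\t0\t*\t*",
    "r3\t0\tchr17\t500\t60\t50M\t*\t0\t0\t*\t*",
    "r4\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]


def reference_span(pos, cigar):
    """Return the 0-based, half-open reference interval covered by a read."""
    # Only M, D, N, =, and X operations consume reference bases
    length = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
    start = pos - 1  # SAM POS is 1-based
    return start, start + length


def count_reads_in_region(sam_lines, chrom, region_start, region_end):
    """Count mapped alignments overlapping [region_start, region_end)."""
    count = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or rname != chrom or cigar == "*":
            continue  # unmapped, or aligned to another reference sequence
        start, end = reference_span(pos, cigar)
        if start < region_end and end > region_start:
            count += 1
    return count


n_reads = count_reads_in_region(EXAMPLE_SAM, "chr17", 120, 200)
print(n_reads)
```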
Table 2
Resulting files of ChIP-seq on BRCA1 protein
Fig. 4 Screenshots of the IGV genome browser, displaying BRCA1 locus. BRCA1 locus is displayed with BRCA1
ChIP-seq results. The top pane indicates genetic bands, the second indicates an identified binding site by the
ChIP-seq, the third indicates individual reads and their frequency, and the fourth indicates the gene location
4 Notes
Acknowledgments
References
1. Genomes Project Consortium, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM et al (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491(7422):56–65
2. Li W, Manktelow E, von Kirchbach JC, Gog JR, Desselberger U, Lever AM (2010) Genomic analysis of codon, sequence and structural conservation with selective biochemical-structure mapping reveals highly conserved and dynamic structures in rotavirus RNAs with potential cis-acting functions. Nucleic Acids Res 38(21):7718–7735
3. Kageyama Y, Kondo T, Hashimoto Y (2011) Coding vs non-coding: translatability of short ORFs found in putative non-coding transcripts. Biochimie 93(11):1981–1986
4. Abugessaisa I, Saevarsdottir S, Tsipras G, Lindblad S, Sandin C, Nikamo P et al (2014) Accelerating translational research by clinically driven development of an informatics platform–a case study. PLoS One 9(9):e104382
5. Harbers M, Carninci P (2005) Tag-based approaches for transcriptome research and genome annotation. Nat Methods 2(7):495–502
6. Cock PJ, Fields CJ, Goto N, Heuer ML, Rice PM (2010) The Sanger FASTQ file format for sequences with quality scores, and the Solexa/Illumina FASTQ variants. Nucleic Acids Res 38(6):1767–1771
7. Kodzius R, Kojima M, Nishiyori H, Nakamura M, Fukuda S, Tagami M et al (2006) CAGE: cap analysis of gene expression. Nat Methods 3(3):211–222
8. Shiraki T, Kondo S, Katayama S, Waki K, Kasukawa T, Kawaji H et al (2003) Cap analysis gene expression for high-throughput analysis of transcriptional starting point and identification of promoter usage. Proc Natl Acad Sci U S A 100(26):15776–15781
9. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10(1):57–63
10. Forrest AR, Kawaji H, Rehli M et al (2014) A promoter-level mammalian expression atlas. Nature 507(7493):462–470
11. Andersson R, Gebhard C, Miguel-Escalada I, Hoof I, Bornholdt J, Boyd M et al (2014) An atlas of active enhancers across human cell types and tissues. Nature 507(7493):455–461
12. Lockhart DJ, Winzeler EA (2000) Genomics, gene expression and DNA arrays. Nature 405(6788):827–836
13. Cawley S, Bekiranov S, Ng HH, Kapranov P, Sekinger EA, Kampa D et al (2004) Unbiased mapping of transcription factor binding sites along human chromosomes 21 and 22 points to widespread regulation of noncoding RNAs. Cell 116(4):499–509
14. Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G et al (2007) Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 448(7153):553–560
15. Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S et al (2012) ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22(9):1813–1831
16. Rhee HS, Pugh BF (2011) Comprehensive genome-wide protein-DNA interactions detected at single-nucleotide resolution. Cell 147(6):1408–1419
17. Ndlovu MN, Denis H, Fuks F (2011) Exposing the DNA methylome iceberg. Trends Biochem Sci 36(7):381–387
18. Bannister AJ, Kouzarides T (2011) Regulation of chromatin by histone modifications. Cell Res 21(3):381–395
19. Huebert DJ, Bernstein BE (2005) Genomic views of chromatin. Curr Opin Genet Dev 15(5):476–481
20. Lan X, Adams C, Landers M, Dudas M, Krissinger D, Marnellos G et al (2011) High resolution detection and analysis of CpG dinucleotides methylation using MBD-Seq technology. PLoS One 6(7):e22226
21. Aberg KA, McClay JL, Nerella S, Xie LY, Clark SL, Hudson AD et al (2012) MBD-seq as a cost-effective approach for methylome-wide association studies: demonstration in 1500 case–control samples. Epigenomics 4(6):605–621
22. Hoffman MM, Ernst J, Wilder SP, Kundaje A, Harris RS, Libbrecht M et al (2013) Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res 41(2):827–841
23. Li Y, Tollefsbol TO (2011) DNA methylation detection: bisulfite genomic sequencing analysis. Methods Mol Biol 791:11–21
24. Portela A, Liz J, Nogales V, Setien F, Villanueva A, Esteller M (2013) DNA methylation determines nucleosome occupancy in the 5′-CpG islands of tumor suppressor genes. Onco-
37. Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM et al (2001) dbSNP: the
gene 32(47):5421–5428 NCBI database of genetic variation. Nucleic
25. Lieberman-Aiden E, van Berkum NL, Williams Acids Res 29(1):308–311
L, Imakaev M, Ragoczy T, Telling A et al 38. ENCODE Project Consortium (2012) An
(2009) Comprehensive mapping of integrated encyclopedia of DNA elements in
long-range interactions reveals folding princi- the human genome. Nature 489(7414):57–74
ples of the human genome. Science 326 39. Bernstein BE, Stamatoyannopoulos JA, Cost-
(5950):289–293 ello JF, Ren B, Milosavljevic A, Meissner A et al
26. Paulsen J, Rodland EA, Holden L, Holden M, (2010) The NIH Roadmap Epigenomics
Hovig E (2014) A statistical model of ChIA- Mapping Consortium. Nat Biotechnol 28
PET data for accurate detection of chromatin (10):1045–1048
3D interactions. Nucleic Acids Res 42(18): 40. Zhang J, Baran J, Cros A, Guberman JM, Hai-
e143 der S, Hsu J et al (2011) International Cancer
27. Carninci P, Kasukawa T, Katayama S, Gough J, Genome Consortium Data Portal—a one-stop
Frith MC, Maeda N et al (2005) The transcrip- shop for cancer genomics data. Database 2011:
tional landscape of the mammalian genome. bar026
Science 309(5740):1559–1563 41. Cancer Genome Atlas Research Network,
28. Bejerano G, Pheasant M, Makunin I, Stephen Weinstein JN, Collisson EA, Mills GB, Shaw
S, Kent WJ, Mattick JS et al (2004) Ultracon- KR, Ozenberger BA et al (2013) The Cancer
served elements in the human genome. Science Genome Atlas Pan-Cancer analysis project. Nat
304(5675):1321–1325 Genet 45(10):1113–1120
29. Kent WJ, Baertsch R, Hinrichs A, Miller W, 42. Rastogi A, Gupta D (2014) GFF-Ex: a genome
Haussler D (2003) Evolution’s cauldron: feature extraction package. BMC Res Notes
duplication, deletion, and rearrangement in 7:315
the mouse and human genomes. Proc Natl 43. Kuhn RM, Haussler D, Kent WJ (2013) The
Acad Sci U S A 100(20):11484–11489 UCSC genome browser and associated tools.
30. Pollard KS, Hubisz MJ, Rosenbloom KR, Sie- Brief Bioinform 14(2):144–161
pel A (2010) Detection of nonneutral substitu- 44. Danecek P, Auton A, Abecasis G, Albers CA,
tion rates on mammalian phylogenies. Genome Banks E, DePristo MA et al (2011) The variant
Res 20(1):110–121 call format and VCFtools. Bioinformatics 27
31. Marigorta UM, Gibson G (2014) A simulation (15):2156–2158
study of gene-by-environment interactions in 45. Quinlan AR, Hall IM (2010) BEDTools: a
GWAS implies ample hidden effects. Front flexible suite of utilities for comparing genomic
Genet 5:225 features. Bioinformatics 26(6):841–842
32. Forbes SA, Bindal N, Bamford S, Cole C, Kok 46. Li H, Handsaker B, Wysoker A, Fennell T,
CY, Beare D et al (2011) COSMIC: mining Ruan J, Homer N et al (2009) The Sequence
complete cancer genomes in the Catalogue of Alignment/Map format and SAMtools. Bioin-
Somatic Mutations in Cancer. Nucleic Acids formatics 25(16):2078–2079
Res 39(Database issue):D945–D950 47. Stalker J, Gibbins B, Meidl P, Smith J, Spooner
33. Landrum MJ, Lee JM, Riley GR, Jang W, W, Hotz HR et al (2004) The Ensembl Web
Rubinstein WS, Church DM et al (2014) Clin- site: mechanics of a genome browser. Genome
Var: public archive of relationships among Res 14(5):951–955
sequence variation and human phenotype. 48. Donlin MJ (2009) Using the Generic Genome
Nucleic Acids Res 42(Database issue): Browser (GBrowse). Current protocols in bio-
D980–D985 informatics/editoral board, Andreas D. Baxe-
34. Kuehn BM (2008) 1000 Genomes Project pro- vanis [et al.] Chapter 9:Unit 9
mises closer look at variation in human 49. Severin J, Lizio M, Harshbarger J, Kawaji H,
genome. JAMA 300(23):2715 Daub CO, Hayashizaki Y et al (2014) Interac-
35. International HapMap Consortium (2005) A tive visualization and analysis of large-scale
haplotype map of the human genome. Nature sequencing datasets using ZENBU. Nat Bio-
437(7063):1299–1320 technol 32(3):217–219
36. International HapMap Consortium, Altshuler 50. Thorvaldsdottir H, Robinson JT, Mesirov JP
DM, Gibbs RA, Peltonen L, Altshuler DM, (2013) Integrative Genomics Viewer (IGV):
Gibbs RA et al (2010) Integrating common high-performance genomics data visualization
and rare genetic variation in diverse human and exploration. Brief Bioinform 14
populations. Nature 467(7311):52–58 (2):178–192
Genome Annotation 121
51. Kasprzyk A (2011) BioMart: driving a para- autoregulation by BRCA1. Cancer Res 70
digm change in biological data management. (2):532–542
Database 2011:bar049 54. Li H, Homer N (2010) A survey of
52. Raney BJ, Dreszer TR, Barber GP, Clawson H, sequence alignment algorithms for next-
Fujita PA, Wang T et al (2014) Track data hubs generation sequencing. Brief Bioinform 11
enable visualization of user-defined genome- (5):473–483
wide annotations on the UCSC Genome 55. Bailey T, Krajewski P, Ladunga I, Lefebvre C,
Browser. Bioinformatics 30(7):1003–1005 Li Q, Liu T et al (2013) Practical guidelines for
53. De Siervi A, De Luca P, Byun JS, Di LJ, Fufa T, the comprehensive analysis of ChIP-seq data.
Haggerty CM et al (2010) Transcriptional PLoS Comput Biol 9(11):e1003326
Chapter 6
Abstract
Ontologies are powerful and popular tools to encode data in a structured format and manage knowledge. A
large variety of existing ontologies offer users access to biomedical knowledge. This chapter contains a short
theoretical background of ontologies and introduces two notable examples: The Gene Ontology and the
ontology for Biological Pathways Exchange. For both ontologies a short overview and working bioinfor-
matic applications, i.e., Gene Ontology enrichment analyses and pathway data visualization, are provided.
Key words Data management, Knowledge management, Ontologies, BioPAX, rBiopaxParser, Gene
ontology, topGO, GOstat
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_6, © Springer Science+Business Media New York 2017
124 Frank Kramer and Tim Beißbarth
1.1 Theoretical Background
The term “Ontology” originates from philosophy, where it denotes
the study of existence and reality, a branch of metaphysics
founded on the work of the philosopher Aristotle [5]. In
computer science an ontology can be defined as follows:
“A specification of a representational vocabulary for a shared domain of
discourse – definitions of classes, relations, functions and other objects – is
called an ontology [6].”
1.2 Applications
First and foremost, ontologies are used for knowledge and data
management by modeling the knowledge of a specific domain.
Ontologies offer a generic and comprehensive framework for creating
and documenting a specific data model and additionally ease
knowledge exchange between users. A common application in
bioinformatics is the use of statistical tests to associate
experimental results, e.g., lists of differentially expressed genes,
with the functional annotations of genes provided by existing
ontologies [10, 11].
Working with Ontologies 125
2 Materials
2.1 Finding and Accessing Ontologies
Users interested in integrating knowledge from ontologies into
their work need access to this data. Websites provide lists and search
engines that can help in finding and browsing relevant ontologies
(see Note 1). The ChEBI and GO ontologies are part of the Open
Biomedical Ontologies Foundry (OBO, www.obofoundry.org)
[21], a collaboration to standardize the way biomedical ontologies
are developed and to allow cross-ontology referencing between
members of the OBO Foundry. The OBO Foundry currently contains
ten ontologies and lists dozens of candidate ontologies. Several
web sites are available which list and categorize biomedical
ontologies, e.g., BioPortal (bioportal.bioontology.org) and the
EMBL-EBI Ontology Lookup Service
(http://www.ebi.ac.uk/ontology-lookup/), which currently list almost
400 and 100 ontologies, respectively [22–24].
2.2 Encoding Ontologies
Several tools and software solutions exist to help users define new
and edit existing ontologies, and to encode and modify data
according to an existing ontology definition.
Commonly the encoding of ontologies is based on three
specifications: The definition of classes and properties that make up an
ontology can be defined via the Web Ontology Language (OWL), a
World Wide Web Consortium (W3C) standard [25]. These OWL
definitions can be encoded in an XML/RDF file format [26] based
2.3 Editing Ontologies
Protégé is a very popular general software tool that allows definition
of and data entry for ontologies and is widely used in industrial
and academic settings alike [29] (see Note 2). The software is
open-source and available for all platforms. Furthermore, a browser
version is available for facilitating collaboration, online editing
and data entry of ontologies [30]. The web application can be
downloaded and hosted at a local web server and is also available
directly at Stanford University (webprotege.stanford.edu).
Various ontology-specific tools are available for some ontologies,
enabling manipulation [31, 32] and visualization [33, 34] of
specific ontologies.
3 Methods
3.1 Gene Ontology
The Gene Ontology (GO) emerged from a cooperation of three
model organism databases: FlyBase, Mouse Genome Informatics
(MGI), and the Saccharomyces Genome Database (SGD). A major
goal of GO arose from the discovery that there are large amounts of
DNA sequences that are identical between species, as well as
functional conservation within these genes [3]. The desire for a
common site of annotation for genes is a consequence of this finding.
The idea of GO is to model the knowledge about genes and gene
products across species and to provide access to this information.
GO consists of three independent ontologies, each modeling a
different domain: biological process, molecular function, and
cellular component [3]. Aiming for a generalizing model, the cellular
component ontology models the parts and pieces of eukaryotic cells
Fig. 1 Tree-view of the annotation of the apoptosis GO term (GO:0006915). “I” denotes “is-a” relations while
“P” denotes “part-of” relations
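The “is-a” and “part-of” relations shown in Fig. 1 form a directed graph in which every term can be traced upward to more general terms. The following base R code is a minimal sketch of such a traversal. The term names and edges are hypothetical placeholders and are not taken from a GO release; the helper function ancestors is likewise an illustrative name, not part of any GO package.

```r
# Toy ontology: each row is one child -> parent edge (hypothetical terms,
# not taken from a GO release). "I" = is-a, "P" = part-of, as in Fig. 1.
edges <- data.frame(
  child    = c("term_D", "term_D", "term_C", "term_B"),
  parent   = c("term_B", "term_C", "term_A", "term_A"),
  relation = c("I",      "P",      "I",      "I"),
  stringsAsFactors = FALSE
)

# Collect all ancestors of a term by repeatedly following child -> parent edges.
ancestors <- function(term, edges) {
  found <- character(0)
  queue <- term
  while (length(queue) > 0) {
    parents <- edges$parent[edges$child %in% queue]
    queue   <- setdiff(unique(parents), found)  # only visit new terms
    found   <- c(found, queue)
  }
  sort(found)
}

ancestors("term_D", edges)  # all terms reachable from term_D
```

In this toy graph, term_D reaches term_B and term_C directly and term_A through both of them, which mirrors how a specific GO term inherits the more general terms above it.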
3.1.1 Accessing the Gene Ontology
The GO data can be accessed via a multitude of different tools. The
official website (geneontology.org) offers functionality to explore,
search and visualize GO terms. Additionally, the whole GO ontology
can be downloaded in bulk (ftp://ftp.geneontology.org/pub/
go/godatabase/archive/). GO is further available within Cytoscape
[33, 36] and for the R Project for Statistical Computing [37]
via the GO.db and RamiGO packages [38]. The following code will
install the GO data within R and access a single GO term:
# install GO.db via the biocLite installer (current Bioconductor
# releases use BiocManager::install("GO.db") instead)
source("http://bioconductor.org/biocLite.R")
biocLite("GO.db")
library(GO.db)
xx <- as.list(GOTERM) # convert the GO term map to a list
GOID(xx[[1]])
# [1] "GO:0000001"
Term(xx[[1]])
# [1] "mitochondrion inheritance"
Definition(xx[[1]])
# [1] "The distribution of mitochondria, [. . .]"
3.1.2 Working with the Gene Ontology
Being widely used and hierarchical in structure, GO has sparked
numerous new approaches in bioinformatics (see Notes 3 and 4).
For example, semantic similarity measures have been proposed to
assess functional similarity of genes [13, 39] and pathways [12].
Based on these measures a large number of methods have been
proposed, ranging from disease gene identification [40] to drug
repurposing [41]. Furthermore, methods have been developed
which aim at extending or reconstructing GO [42–44].
However, by far the most prominent application of the gene
ontology is enrichment analysis for a list of differentially expressed
genes [11]. The general idea behind this is to interpret the lists of
genes resulting from high-throughput experiments by using
statistical methods to find significantly over- or underrepresented GO
terms within the list of genes. Gene ontology annotations can be
cut at different levels, allowing the definition of gene sets for each
GO term, leading to a hierarchy of gene sets. The subsequent
statistical test for whether a certain biological function (or gene
set) is associated with the experimental results is referred to as
gene set enrichment analysis [11]. This analysis allows testing for
gene sets (or functional groups) that are represented significantly
more frequently in a list of differentially expressed genes than in a
comparable list of genes that would be selected randomly from all
genes. A wide array of different testing strategies and approaches
has been proposed [45–48], most notably those available via the
websites GOstat [10] (gostat.wehi.edu.au) and DAVID [49]
(david.abcc.ncifcrf.gov), within Cytoscape using the plugins clueGO [33]
or BiNGO [50], and within R using the topGO package.
The following code will install the topGO package and perform
an enrichment analysis of example data within R:
source("http://bioconductor.org/biocLite.R")
biocLite("topGO")
library(topGO)
data(GOdata) # load example data set
GOdata # display example data set
res = runTest(GOdata, algorithm = "classic", statistic = "fisher")
head(sort(score(res))) # display top scoring GO terms
# GO:0006091 GO:0022900 GO:0009267 [. . .]
# 0.0002510612 0.0002510612 0.0004261651 [. . .]
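The test behind such an enrichment analysis can be written out for a single GO term in base R. The sketch below uses hypothetical counts (they are not derived from the topGO example data) to build the 2 x 2 table of term membership against differential expression and to test for overrepresentation with a one-sided Fisher's exact test. It also shows that the same p-value is the upper tail of the hypergeometric distribution.

```r
# Hypothetical counts for one GO term (illustration only).
n_genes      <- 10000  # all genes tested (the gene universe)
n_term       <- 200    # genes annotated to the GO term
n_de         <- 500    # differentially expressed (DE) genes
n_de_in_term <- 25     # DE genes that are annotated to the GO term

# 2 x 2 contingency table: GO term membership (rows) by DE status (columns).
# Note that matrix() fills the table column by column.
tab <- matrix(
  c(n_de_in_term,          n_de - n_de_in_term,     # column "DE"
    n_term - n_de_in_term, n_genes - n_term - n_de + n_de_in_term),  # "not_DE"
  nrow = 2,
  dimnames = list(term = c("in_term", "not_in_term"),
                  de   = c("DE", "not_DE"))
)

# One-sided Fisher's exact test for overrepresentation of the term.
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# The same p-value from the hypergeometric tail P(X >= n_de_in_term).
p_hyper <- phyper(n_de_in_term - 1, n_term, n_genes - n_term, n_de,
                  lower.tail = FALSE)

expected <- n_de * n_term / n_genes  # DE genes expected in the term by chance
```

With these counts, 10 differentially expressed genes would be expected in the term by chance while 25 are observed, so the term is reported as overrepresented. In practice one such test is run per GO term, which is why enrichment tools usually also correct for multiple testing.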
[Fig. 2 diagram labels: Entity, Physical Interaction, Dna, Complex, Rna, Small Molecule, Protein, Conversion, Control, Modulation, Biochemical Reaction, Complex Assembly, Catalysis, Transport; arrows denote “Is a”]
Fig. 2 This diagram shows the central classes and their inheritance relationships [4]. Reproduced according to
the BioPAX specification (see www.biopax.org)
3.2.1 Pathway Databases
Many notable pathway databases have been developed and are
actively curated. Pathguide.org, a website listing all types of
pathway databases, currently contains links to over 500 different
pathway data resources [56] (see Notes 5 and 6). The different types of
databases include protein-protein interactions, metabolic pathways,
signaling pathways and transcription factor networks. Well-known
examples of pathway databases are the Kyoto Encyclopedia of
Genes and Genomes (KEGG) database [57], Reactome [58], and
WikiPathways [59, 60]. Reactome is an open-source, manually
curated, and peer-reviewed pathway database including an
interactive website for querying and visualizing data [61]. Reactome is a
joint effort of the European Bioinformatics Institute, the New York
University Medical Center and the Ontario Institute for Cancer
Research. The database is focused on pathways in Homo sapiens;
however, equivalent processes in 22 other species are inferred from
human data [61]. Reactome includes signaling pathways,
information on regulatory interactions as well as metabolic pathways. The
Pathway Interaction Database (PID) was launched as a collaborative
project between the NCI and the Nature Publishing Group in 2006
[62]. It uses Homo sapiens as a model system and offers
well-annotated and curated signaling pathways. Its data include
molecules annotated using UniProt identifiers and posttranslational
modifications [63]. WikiPathways is a community approach to
pathway editing [59, 60]. It allows everyone to join and share
new pathways or curate existing ones. Pathway Commons is a
meta-database aiming at providing a single point of access to
publicly available pathway knowledge [64]. It is a collection of pathway
databases covering many aspects and common model organisms,
intended to ease access to a large number of different sources.
These four databases (Reactome, PID, WikiPathways, and Pathway
Commons) are all freely available for download as BioPAX exports.
3.2.2 Accessing Pathway Knowledge via rBiopaxParser
A number of software tools can be used to access BioPAX-encoded
data, e.g., general tools like Protégé [29] or software written
specifically for BioPAX, for example Paxtools [31] or rBiopaxParser
[34]. The R package rBiopaxParser [34] is specifically implemented
to make pathway data that is encoded using the BioPAX ontology
accessible within the R Project for Statistical Computing [37]. The
software package has been published as open source and has been
released as part of Bioconductor (www.bioconductor.org) [65].
The readBiopax function reads in a BioPAX .owl file and
generates the internal data format used within this package. As this
function has to traverse the whole XML-tree of a database export, it
source("http://bioconductor.org/biocLite.R")
biocLite("rBiopaxParser")
library(rBiopaxParser)
file = downloadBiopaxData("NCI", "biocarta") # download the BioCarta export from the NCI
biopax = readBiopax(file) # parse the BioPAX .owl file
print(biopax) # print a summary of the parsed data
Fig. 3 Generated R plot of the WNT signaling pathway. Green edges denote activations and red edges denote
inhibitions
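A regulatory graph such as the one in Fig. 3 can be thought of as a signed, directed graph. The base R sketch below shows one way to hold such a graph as an edge list and a signed adjacency matrix, and to derive the green/red edge coloring described in the caption. The node names and interactions are hypothetical placeholders; they are not parsed from a BioPAX file and do not reproduce the rBiopaxParser data structures.

```r
# Hypothetical signed interactions (illustration only, not parsed from BioPAX).
interactions <- data.frame(
  from = c("node_A", "node_B", "node_C"),
  to   = c("node_B", "node_C", "node_D"),
  sign = c(-1, 1, -1),  # +1 = activation, -1 = inhibition
  stringsAsFactors = FALSE
)

# Signed adjacency matrix: rows are regulators, columns are targets.
nodes <- sort(unique(c(interactions$from, interactions$to)))
adj <- matrix(0, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(interactions$from, interactions$to)] <- interactions$sign

# Edge colors as in Fig. 3: green for activation, red for inhibition.
edge_colour <- ifelse(interactions$sign > 0, "green", "red")
```

Keeping the sign in the matrix rather than in a separate attribute makes it easy to pass the same object to downstream network algorithms that expect a numeric adjacency matrix.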
4 Notes
development (e.g., Noy et al. [7]) and (b) to research via search
engines whether an already existing ontology can be used or
extended for your own work.
3. The Gene Ontology can be browsed and used for enrichment
analysis online at geneontology.org. This is a good place to get
started; however, it is usually not feasible to use the website in
the (semi-)automated pipelines found in bioinformatic service
facilities, where R scripts and/or Cytoscape visualizations are
commonly run.
4. As shown in “Working with the Gene Ontology”, GO allows a
wide range of analyses. When in doubt, consult articles pub-
lished in your field of research in order to understand the con-
clusions and results that can stem from GO analyses.
5. The advice given in Note 4 is also pertinent to pathway enrich-
ment analyses: research which methods others are using and how
they are applied. Pathway enrichment analyses have gained in
popularity compared to GO analyses. Three commonly used
databases are KEGG [57], Reactome [58], and PID [62].
6. The focus and granularity of knowledge in pathway databases
differ considerably. For example, Reactome pathways are focused
on cellular mechanisms and processes, while KEGG also includes
disease-specific pathways. Furthermore, Reactome has a highly
hierarchical structure; its thousands of pathways are nested into a
total of 23 top-tier pathways. Each of these pathways includes
dozens of sub-pathways, including sub-sub-pathways and so on.
7. Visualizing BioPAX-encoded pathway data is usually easiest
with Cytoscape, while automated analyses are often developed
within R [3].
Acknowledgements
References
1. Gruber TR (1995) Toward principles for the design of ontologies used for knowledge sharing? Int J Hum Comput Stud 43:907–928
2. Berners-Lee T, Hendler J, Lassila O et al (2001) The semantic web. Sci Am 284:28–37
3. Ashburner M, Ball CA, Blake JA et al (2000) Gene ontology: tool for the unification of biology. Nat Genet 25:25–29
4. Demir E, Cary MP, Paley S et al (2010) The BioPAX community standard for pathway data sharing. Nat Biotechnol 28:935–942
5. Burkhardt H, Smith B (1991) Handbook of metaphysics and ontology. Philosophia Verlag, Muenchen
6. Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5(2):199–220
7. Noy NF, McGuinness DL et al (2001) Ontology development 101: a guide to creating your first ontology. Stanford knowledge systems laboratory technical report KSL-01-05 and Stanford medical informatics technical report SMI-2001-0880
8. Hitzler P, Krotzsch M, Rudolph S (2011) Foundations of semantic web technologies. CRC Press, Boca Raton, FL
9. du Plessis L, Škunca N, Dessimoz C (2011) The what, where, how and why of gene ontology—a primer for bioinformaticians. Brief Bioinform 12:723–735
10. Beißbarth T, Speed TP (2004) GOstat: find statistically overrepresented Gene Ontologies within a group of genes. Bioinformatics 20:1464–1465
11. Beißbarth T (2006) Interpreting experimental results using gene ontologies. In: Kimmel A, Oliver B (eds) Methods Enzymol. Academic, Waltham, pp 340–352
12. Guo X, Liu R, Shriver CD et al (2006) Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 22:967–973
13. Fröhlich H, Speer N, Poustka A, Beißbarth T (2007) GOSim—an R-package for computation of information theoretic GO similarities between terms and gene products. BMC Bioinformatics 8:166
14. Cheng L, Li J, Ju P et al (2014) SemFunSim: a new method for measuring disease similarity by integrating semantic and gene functional association. PLoS One 9:e99415
15. Hoehndorf R, Hancock JM, Hardy NW et al (2014) Analyzing gene expression data in mice with the Neuro Behavior Ontology. Mamm Genome Off J Int Mamm Genome Soc 25:32–40
16. Xu Q, Shi Y, Lu Q et al (2008) GORouter: an RDF model for providing semantic query and inference services for Gene Ontology and its associations. BMC Bioinformatics 9(Suppl 1):S6
17. Chi Y-L, Chen T-Y, Tsai W-T (2015) A chronic disease dietary consultation system using OWL-based ontologies and semantic rules. J Biomed Inform 53:208–219
18. Nadkarni PM, Marenco LA (2010) Implementing description-logic rules for SNOMED-CT attributes through a table-driven approach. J Am Med Inform Assoc 17:182–184
19. Rector AL, Brandt S (2008) Why do it the hard way? The case for an expressive description logic for SNOMED. J Am Med Inform Assoc 15:744–751
20. Degtyarenko K, de Matos P, Ennis M et al (2008) ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Res 36:D344–D350
21. Smith B, Ashburner M, Rosse C et al (2007) The OBO Foundry: coordinated evolution of ontologies to support biomedical data integration. Nat Biotechnol 25:1251–1255
22. Noy NF, Shah NH, Whetzel PL et al (2009) BioPortal: ontologies and integrated data resources at the click of a mouse. Nucleic Acids Res 37:W170–W173
23. Rubin DL, Shah NH, Noy NF (2008) Biomedical ontologies: a functional perspective. Brief Bioinform 9:75–90
24. Côté R, Reisinger F, Martens L et al (2010) The Ontology Lookup Service: bigger and better. Nucleic Acids Res 38:W155–W160
25. McGuinness DL, Van Harmelen F et al (2004) OWL web ontology language overview. W3C Recomm 10
26. Beckett D, McBride B (2004) RDF/XML syntax specification (revised). W3C Recomm 10
27. Bray T, Paoli J, Sperberg-McQueen CM et al (1997) Extensible markup language (XML). World Wide Web J 2:27–66
28. Klyne G, Carroll JJ, McBride B (2004) Resource description framework (RDF): concepts and abstract syntax. W3C Recomm 10
29. Gennari JH, Musen MA, Fergerson RW et al (2003) The evolution of Protégé: an environment for knowledge-based systems development. Int J Hum Comput Stud 58:89–123
30. Horridge M, Tudorache T, Nyulas C et al (2014) WebProtégé: a collaborative Web-based platform for editing biomedical ontologies. Bioinformatics 30:2384–2385
31. Demir E, Babur Ö, Rodchenkov I et al (2013) Using biological pathway data with Paxtools. PLoS Comput Biol 9:e1003194
32. The Gene Ontology Consortium (2014) Gene Ontology Consortium: going forward. Nucleic Acids Res 43(Database issue):D1049–D1056
33. Bindea G, Mlecnik B, Hackl H et al (2009) ClueGO: a Cytoscape plug-in to decipher functionally grouped gene ontology and pathway annotation networks. Bioinformatics 25:1091–1093
34. Kramer F, Bayerlová M, Klemm F et al (2013) rBiopaxParser—an R package to parse, modify and visualize BioPAX data. Bioinformatics 29:520–522
35. The Gene Ontology Consortium (2008) The Gene Ontology project in 2008. Nucleic Acids Res 36:D440–D444
36. Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13:2498–2504
37. R Core Team (2013) R: a language and environment for statistical computing, Vienna, Austria
38. Schröder MS, Gusenleitner D, Quackenbush J et al (2013) RamiGO: an R/Bioconductor package providing an AmiGO Visualize interface. Bioinformatics 29:666–668
39. Pesquita C, Faria D, Bastos H et al (2008) Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinformatics 9:S4
40. Jiang R, Gan M, He P (2011) Constructing a gene semantic similarity network for the inference of disease genes. BMC Syst Biol 5:S2
41. Andronis C, Sharma A, Virvilis V et al (2011) Literature mining, ontologies and information visualization for drug repurposing. Brief Bioinform 12:357–368
42. Kramer M, Dutkowski J, Yu M et al (2014) Inferring gene ontologies from pairwise similarity data. Bioinformatics 30:i34–i42
43. Dutkowski J, Ono K, Kramer M et al (2014) NeXO Web: the NeXO ontology database and visualization platform. Nucleic Acids Res 42:D1269–D1274
44. Dutkowski J, Kramer M, Surma MA et al (2013) A gene ontology inferred from molecular networks. Nat Biotechnol 31:38–45
45. Zheng Q, Wang X-J (2008) GOEAST: a web-based software toolkit for Gene Ontology enrichment analysis. Nucleic Acids Res 36:W358–W363
46. Huang DW, Sherman BT, Lempicki RA (2009) Bioinformatics enrichment tools: paths toward the comprehensive functional analysis of large gene lists. Nucleic Acids Res 37:1–13
47. Bauer S, Grossmann S, Vingron M, Robinson PN (2008) Ontologizer 2.0—a multifunctional tool for GO term enrichment analysis and data exploration. Bioinformatics 24:1650–1651
48. Eden E, Navon R, Steinfeld I et al (2009) GOrilla: a tool for discovery and visualization of enriched GO terms in ranked gene lists. BMC Bioinformatics 10:48
49. Huang DW, Sherman BT, Tan Q et al (2007) DAVID Bioinformatics Resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res 35:W169–W175
50. Maere S, Heymans K, Kuiper M (2005) BiNGO: a Cytoscape plugin to assess overrepresentation of Gene Ontology categories in Biological Networks. Bioinformatics 21:3448–3449
51. Hucka M, Finney A, Sauro HM et al (2003) The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics 19:524–531
52. Hermjakob H, Montecchi-Palazzi L, Bader G et al (2004) The HUPO PSI’s Molecular Interaction format—a community standard for the representation of protein interaction data. Nat Biotechnol 22:177–183
53. Strömbäck L, Lambrix P (2005) Representations of molecular pathways: an evaluation of SBML, PSI MI and BioPAX. Bioinformatics 21:4401–4407
54. Cary MP, Bader GD, Sander C (2005) Pathway information for systems biology. FEBS Lett 579:1815–1820
55. Kramer F (2014) Integration of pathway data as prior knowledge into methods for network reconstruction. Georg-August-Universität Göttingen, Göttingen
56. Bader GD, Cary MP, Sander C (2006) Pathguide: a pathway resource list. Nucleic Acids Res 34:D504–D506
57. Ogata H, Goto S, Sato K et al (1999) KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res 27:29–34
58. Joshi-Tope G, Gillespie M, Vastrik I et al (2005) Reactome: a knowledgebase of biological pathways. Nucleic Acids Res 33:D428–D432
59. Kelder T, van Iersel MP, Hanspers K et al (2011) WikiPathways: building research communities on biological pathways. Nucleic Acids Res 40:D1301–D1307
60. Pico AR, Kelder T, van Iersel MP et al (2008) WikiPathways: pathway editing for the people. PLoS Biol 6:e184
61. Vastrik I, D’Eustachio P, Schmidt E et al (2007) Reactome: a knowledge base of biologic pathways and processes. Genome Biol 8:R39
62. Schaefer CF, Anthony K, Krupa S et al (2009) PID: the pathway interaction database. Nucleic Acids Res 37:D674–D679
63. Bauer-Mehren A, Furlong LI, Sanz F (2009) Pathway databases and tools for their exploitation: benefits, current limitations and challenges. Mol Syst Biol 5:290
64. Cerami EG, Gross BE, Demir E et al (2011) Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res 39:D685–D690
65. Gentleman RC, Carey VJ, Bates DM et al (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5:R80
66. Kramer F, Bayerlová M, Beißbarth T (2014) R-based software for the integration of pathway data into bioinformatic algorithms. Biology 3:85–100
67. Shannon PT, Grimes M, Kutlu B et al (2013) RCytoscape: tools for exploratory network analysis. BMC Bioinformatics 14:217
68. Csardi G, Nepusz T (2006) The igraph software package for complex network research. Int J Complex Syst 1695
Chapter 7
Abstract
The significant expansion in protein sequence and structure data that we are now witnessing brings with it a
pressing need to bring order to the protein world. Such order enables us to gain insights into the evolution
of proteins, their function and the extent to which the functional repertoire can vary across the three
kingdoms of life. This has led to the creation of a wide range of protein family classifications that aim to
group proteins based upon their evolutionary relationships.
In this chapter we discuss the approaches and methods that are frequently used in the classification of
proteins, with a specific emphasis on the classification of protein domains. The construction of both domain
sequence and domain structure databases is considered and we show how the use of domain family
annotations to assign structural and functional information is enhancing our understanding of genomes.
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_7, © Springer Science+Business Media New York 2017
138 Natalie Dawson et al.
Fig. 1 The number of complete (black) and incomplete (grey) genome sequencing projects released each year
from 2007 to 2016 as reported by the GOLD (Genomes Online Database) resource [2]. As of October 2016,
26402 complete and 14938 incomplete genome sequencing projects have been reported
Fig. 2 The correlation between structure similarity and sequence identity for all pairs of homologous domain
structures in the CATH domain database. Structural similarity was measured by the SSAP structure comparison
algorithm, which returns a score in the range 0–100, with 100 for identical protein structures. Dark grey circles
represent pairs of domains with the same function, and light grey circles represent those with different
functions
the same protein fold and often share similarities in their function
depending on the degree of relatedness between them (Fig. 2). The
diversity of such relationships can be seen within families of proteins
The Classification of Protein Domains 139
3.1 Automatic Domain Sequence Clustering
Automated domain clustering algorithms are used in the construction of many domain sequence
classifications to generate an initial clustering of domain-like sequences. This method is often
followed by varying levels of expert-driven validation of the proposed domain families.
To implicitly predict domain boundaries from sequence data
alone is an immensely challenging problem, though a number of
methods have been devised using predicted secondary structure,
protein folding or domain-linker patterns recognized through neu-
ral networks. Despite extensive research, none of these methods are
reliable enough or well enough established to use in large-scale
sequence analysis. Instead, most automatic domain clustering
Fig. 3 The domain problem in protein sequence comparison. (a) The three sequences (i, ii, and iii) are
incorrectly clustered due to sequence similarity with shared domains (B and D) in sequence ii. (b) The
identification and excision of domains A to E enables subsequent clustering of the domains into five domain
families
3.3 Families Represented by Multiple Sequence Alignments
The earliest domain family classifications were only able to capture a limited number of
evolutionary relationships because the reliability of pairwise sequence comparison declines
rapidly in the so-called Twilight Zone [23] of sequence similarity (<30 % sequence identity).
Comparison of protein structures has shown that
3.3.2 Profiles
Like patterns, profiles can also be automatically built from a multiple sequence alignment,
but unlike patterns, they tend to be used to
form a consensus view across a domain family. A profile is built to
describe the probability of finding a given amino acid at a given
location in the sequence relatives of a domain family using position-
specific amino acid weightings and gap penalties. These values are
stored in a table or matrix and are used to calculate similarity scores
between a profile and query sequence. Profiles can be used to
represent larger regions of sequence and allow a greater residue
divergence in matched sequences in order to identify more diver-
gent family members. A threshold can be calculated for a given set
of family sequences to enable the reliable identification and inclu-
sion of new family members into the original set of sequences.
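The construction and use of such a table of weights can be sketched as follows. This is a minimal illustration rather than the scheme used by any particular database: it assumes a gap-free toy alignment, a uniform background frequency of 1/20, and a pseudocount of 1, and it omits position-specific gap penalties altogether. The function names (build_profile, score_window, best_match) and the toy sequences are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one {residue: log-odds weight} table per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    profile = []
    for column in zip(*alignment):
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        weights = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            weights[aa] = math.log2(freq / background)
        profile.append(weights)
    return profile

def score_window(profile, segment):
    """Sum the position-specific weights for an ungapped segment."""
    return sum(column[aa] for column, aa in zip(profile, segment))

def best_match(profile, query):
    """Slide the profile along the query; return (best score, start index)."""
    width = len(profile)
    return max((score_window(profile, query[i:i + width]), i)
               for i in range(len(query) - width + 1))

# Toy family alignment (hypothetical sequences, no gaps).
toy_alignment = ["GKST", "GKTT", "GRST", "GKSS"]
profile = build_profile(toy_alignment)
score, start = best_match(profile, "MAAGKSTLLA")
```

A family-specific threshold, as described above, would then be applied to the returned score to decide whether the query is accepted as a new member.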
Profile methods such as HMMs have been shown to be highly
discriminatory in identifying distant homologues when searching
the sequence databases [24]. It is also clear that the expansion of
diversity within the sequence and domain databases will bring an
3.4 Domain Sequence Classifications
A variety of domain sequence databases are now available, most of which, as has been discussed,
use patterns or profiles (or in some cases both) to build a library of domain families. Such libraries can
often be browsed via the internet or used to annotate sequences
over the internet using comparison servers. Some classifications,
such as Pfam, provide libraries of HMMs and comparison software
to enable the user to generate automated up-to-date genome
annotations on their local computer systems. A list of the most
popular databases is shown in Table 1.
The consolidation of domain classifications is a logical progres-
sion towards a comprehensive classification of all protein domains.
However, with so many domain sequence and domain structure
classifications, each with their own unique formats and outputs, it
can be difficult to choose which one to use or how to meaningfully
combine the results from separate sources.
One solution is the manually curated InterPro database [27]
(Integration Resource of Protein Families) at the EBI in the UK.
This resource integrates 11 of the major protein family classifica-
tions and provides regular mappings from these family resources
onto primary sequences in Swiss-Prot and TrEMBL. Contributing
databases include: CATH-Gene3D [28], HAMAP [29], PAN-
THER [30], PIRSF [31], Pfam [32], PRINTS [33], ProDom
[9], PROSITE [34], SMART [35], SUPERFAMILY [36], and
TIGRFAMs [37].
This integration of domain family resources into homologous
groups not only builds upon the individual strengths of the com-
ponent databases, but also provides a measure of objectivity for the
Table 1
Protein domain sequence classifications
3.4.1 PROSITE
PROSITE [34] began as a database of sequence patterns, its underlying principle being that
domain families could be characterized by
the single most conserved motif observed in a multiple sequence
alignment. More recently it has also provided an increasing number
of sequence profiles. PROSITE multiple sequence alignments are
derived from a number of sources, such as a well-characterized
protein family, the literature, sequence searching against SWISS-
PROT and TrEMBL and from sequence clustering. PROSITE
motifs or patterns are subsequently built to represent highly con-
served stretches of contiguous sequence in these alignments. These
typically correspond to specific biological regions such as enzyme
active sites or substrate binding sites. Accordingly, PROSITE pat-
terns tend to embody short conserved biologically active regions
within domain families, rather than representing the domain family
over its entire length. Each pattern is manually tested and refined by
compiling statistics that reflect how often a certain motif matches
sequences in SWISS-PROT. The use of patterns provides a rapid
method for database searching, and in cases where global sequence
comparison becomes unreliable, these recurring “fingerprints” are
often able to identify very remote sequence homologues.
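Pattern-based searching of this kind can be sketched by translating a PROSITE-style pattern into a regular expression. The converter below is a simplified illustration that handles only the common syntax elements (x for any residue, [..] for allowed residues, {..} for excluded residues, repeat counts in parentheses, and < or > anchors at the ends of the pattern). The function names and the query sequence are hypothetical, and the example uses the N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                  # any residue
            core = "."
        elif core.startswith("{"):       # excluded residues
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)

def find_motifs(pattern, sequence):
    """Return (start, matched text) for every match, including overlaps."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]

glyco_regex = prosite_to_regex("N-{P}-[ST]-{P}.")
glyco_hits = find_motifs("N-{P}-[ST]-{P}.", "MKNGSAANPTQ")
```

In this toy query the asparagine at index 2 matches, whereas the one at index 7 does not because it is followed by proline. Short patterns of this kind match by chance fairly often, which is why match statistics against a reference database are compiled for each pattern.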
PROSITE also provides a number of profiles to enable the
detection of more divergent domain families. Such profiles include
a comparison table of position-specific weights representing the
frequency distribution of residues across the initial PROSITE mul-
tiple sequence alignment. The table of weights is then used to
generate a similarity score that describes the alignment between
the whole or partial PROSITE profile and a whole or partial
sequence. Each pair of amino acids in the alignment is scored
based on the probability of a particular type of residue substitution
at each position, with scores above a given threshold constituting a
true match.
build a seed alignment. The output from this step was then verified
to generate the seed alignment for a new Pfam-A family.
The high quality seed alignments are used to build HMMs to
which sequences are automatically aligned to generate the final full
alignments. If the initial alignments are deemed to be diagnostically
unsound the seed is manually checked, and the process repeated
until a sound alignment is generated. The parameters that produce
the best alignment are saved for each family so that the result can be
reproduced. Pfam-B families are created using the ADDA algo-
rithm [38].
3.5 Protein Sequence Classification
There are also a number of databases that classify protein families using whole-protein
sequences. The widely used resources HAMAP, PANTHER, PIRSF, and TIGRFAMs are described
further below.
3.5.2 PANTHER
The PANTHER resource [30] annotates protein sequence families with gene and protein
functional information to reflect events in gene evolution. Phylogenetic trees are used to
transfer experimental annotations from a few model organisms with fully sequenced genomes
onto homologues. To create families for a new release,
all new protein sequences are scanned against the HMMs from the
previous release using InterProScan [41]. Each sequence is
assigned to the family with the largest significant score. The CluSTr
algorithm is also used to define new families [42]. A phylogenetic
tree is built for each family alignment with the GIGA program [43].
Finally, the tree nodes are annotated using three different attri-
butes: protein class membership, GO terms, and subfamily
information.
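The assignment step, in which each sequence goes to the family with the largest significant score, can be sketched as below. This is only an illustration of the selection rule, not PANTHER's own code: the input is assumed to be a list of (sequence, family, score) hits from an HMM scan, and the significance cutoff of 25.0, the function name, and the identifiers are hypothetical.

```python
def assign_families(hits, min_score=25.0):
    """Assign each sequence to its best-scoring family above a cutoff.

    hits: iterable of (sequence_id, family_id, score) tuples.
    min_score: hypothetical significance threshold.
    """
    best = {}
    for seq_id, family_id, score in hits:
        if score < min_score:
            continue  # not significant, so ignore this hit
        if seq_id not in best or score > best[seq_id][1]:
            best[seq_id] = (family_id, score)
    return {seq_id: family for seq_id, (family, _) in best.items()}

# Toy scan results (hypothetical identifiers and scores).
toy_hits = [
    ("seq1", "FAM_A", 80.2),
    ("seq1", "FAM_B", 31.5),
    ("seq2", "FAM_B", 12.0),
    ("seq3", "FAM_C", 40.1),
]
assignments = assign_families(toy_hits)
```

Sequences left without an assignment (seq2 here) are the ones that would be passed on to a clustering step to define new families.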
3.5.4 TIGRFAMs
TIGRFAMs [37, 45] protein families are composed of protein
family alignments and HMMs. The main purpose of this resource
is to provide models for functional annotation. Models are
provided for superfamilies, subfamilies, and “equivalogs,” which
are homologous proteins that have performed the same function
since their last common ancestor.
4.1 Identification of Domain Boundaries at the Structural Level
Over 40 % of known structures in the Protein Data Bank [46] are multidomain proteins, a
percentage that is likely to increase as structure determination methods, such as X-ray crystallography,
become better able to characterize large proteins. It is very difficult
to reliably assign domain boundaries to distant homologues by
sequence-based methods; however, in cases where the three-
dimensional structure has been characterized, putative domain
boundaries can be delineated by manual inspection through the
use of graphical representations of protein structure. Nonetheless,
the delineation of domain boundaries by eye is a time-consuming
process and is not always straightforward, especially for large pro-
teins containing many domains or discontinuous domains in which
one domain is interrupted by the insertion of another domain.
Table 2
Protein domain structure classifications
The concept of a structural domain was first introduced by Richardson, defining it as a
semi-independent globular folding unit [47]. The following criteria, based upon this premise,
are often used to characterize domains:
(a) A compact globular core;
(b) More intra-domain residue contacts than inter-domain
contacts;
(c) Secondary structure elements are not shared between
domains, most significantly beta-strands; and
(d) Evidence of the domain as an evolutionary unit, such as
recurrence in different structural contexts.
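Criterion (b) lends itself to a simple numerical check. The sketch below counts residue contacts within and between domains for a proposed domain assignment. It assumes one representative coordinate per residue (for example the C-alpha atom) and a distance cutoff of 8 Å. The cutoff, the function name, and the toy coordinates are illustrative assumptions, not values taken from any of the algorithms discussed here.

```python
import math
from itertools import combinations

def contact_counts(coords, domain_of, cutoff=8.0):
    """Count residue contacts within and between domains.

    coords: {residue_id: (x, y, z)}, one point per residue.
    domain_of: {residue_id: domain label} for a proposed assignment.
    cutoff: assumed contact distance in angstroms.
    """
    intra = inter = 0
    for a, b in combinations(sorted(coords), 2):
        if math.dist(coords[a], coords[b]) <= cutoff:
            if domain_of[a] == domain_of[b]:
                intra += 1
            else:
                inter += 1
    return intra, inter

# Two compact toy clusters of four residues each, 9 angstroms apart.
toy_coords = {
    1: (0, 0, 0), 2: (3, 0, 0), 3: (0, 3, 0), 4: (3, 3, 0),
    5: (9, 0, 0), 6: (12, 0, 0), 7: (9, 3, 0), 8: (12, 3, 0),
}
toy_domains = {1: "A", 2: "A", 3: "A", 4: "A", 5: "B", 6: "B", 7: "B", 8: "B"}
intra, inter = contact_counts(toy_coords, toy_domains)
```

A proposed boundary that yields many more intra-domain than inter-domain contacts is consistent with criterion (b). A practical method would also need to handle contacts between sequence neighbors and discontinuous domains.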
The growth in structure data has led to the development of a
variety of computer algorithms that automatically recognize
domain boundaries from structural data, each with varying levels
of success. Such methods often use a measure of geometric com-
pactness, exploiting the fact that there are more contacts between
residues within a domain than between neighboring domains, or
searching for hydrophobic clusters that may represent the core of a
structural domain. Many of these algorithms perform well on sim-
ple multidomain structures in which few residue contacts are found
between neighboring domains, though the performance levels tend
4.2 Methods for Structural Comparison
The use of automated methods for structural comparison of protein domains is essential in the
construction of domain structure classifications. Structural comparison and alignment algorithms were
first introduced in the early 1970s and methods such as rigid
body superposition are still used today for superimposing structures
and calculating a similarity measure (root mean square deviation).
This is achieved by translation and rotation of structures in space
relative to one another in order to minimize the number of non-
equivalent residues. Such approaches use dynamic programming,
secondary structure alignment and fragment comparison to enable
comparison of more distantly related structures in which extensive
residue insertions and deletions or shifts in secondary structure
orientations have occurred. More recently, some domain structure
classifications have employed rapid comparison methods, based on
secondary structure, to approximate these approaches (e.g., SEA
[51], VAST [52], and CATHEDRAL [53]). This enables a large
number of comparisons to be performed that are used to assess the
significance of any match via a rigorous statistical analysis. Where
necessary, potential relatives can then be subjected to more reliable,
albeit more computationally intensive, residue-based comparisons
(e.g., SSAP [54] and Dali [55]).
FATCAT (Flexible structural AlignmenT by Chaining Aligned
fragment pairs allowing Twists) [56] produces a flexible structural
alignment and also accounts for structural rearrangements, which
4.3 Domain Structure Classification Hierarchies
The largest domain structure classification databases are organized on a hierarchical basis
corresponding to differing levels of sequence and structure similarity. The terms used in these
hierarchies are summarized in Table 3.
At the top level of the hierarchy is domain class, a term that
refers to the proportion of residues in a given domain adopting an
alpha-helical or beta-strand conformation. This level is usually
divided into four classes: mainly alpha, mainly beta, alternating
alpha-beta (in which the different secondary structures alternate
along the polypeptide chain), and alpha plus beta (in which mainly
alpha and mainly beta regions appear more segregated). In the
CATH database, these last two classes are merged into a single
alpha-beta class as a consequence of the automated assignment of
class. CATH also uses a level beneath class classifying the architec-
ture of a given domain according to the arrangement of secondary
structures regardless of their connectivity (e.g., barrel-like or lay-
ered sandwich). Such a description is also used in the SCOP classi-
fication, but it is less formalized, often appearing for a given
structural family rather than as a completely separate level in the
hierarchy.
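An automated class assignment based on secondary structure content can be sketched as follows. The input is assumed to be a per-residue secondary structure string in which H marks helix and E marks strand, as in DSSP-style output. The thresholds, the function name, and the class labels (which follow the merged three-class scheme plus a category for domains with little secondary structure) are illustrative assumptions and are not the cutoffs used by CATH or SCOP.

```python
def assign_class(ss_string, min_content=0.2, minor=0.1):
    """Assign a domain class from helix and strand proportions.

    ss_string: one character per residue; 'H' = helix, 'E' = strand,
               anything else is treated as coil.
    min_content, minor: hypothetical thresholds on residue fractions.
    """
    if not ss_string:
        raise ValueError("empty secondary structure string")
    helix = ss_string.count("H") / len(ss_string)
    strand = ss_string.count("E") / len(ss_string)
    if helix + strand < min_content:
        return "few secondary structures"
    if strand < minor:
        return "mainly alpha"
    if helix < minor:
        return "mainly beta"
    return "alpha-beta"

toy_classes = [assign_class(s) for s in
               ("HHHHHHHHCC", "EEEECCEEEE", "HHHHCCEEEE", "CCCCCCCCCH")]
```

Distinguishing alternating alpha/beta from alpha + beta domains would require the order of the elements along the chain as well as their proportions. This is one reason an automated assignment may merge the two classes.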
Within each class, structures can then be further clustered at
the fold (also known as topology) level according to equivalences in
the orientation and connectivity of their secondary structures.
Cases in which domains adopt highly similar folds are often indica-
tive of an evolutionary relationship. However, care must be taken:
Table 3
Overview of hierarchical construction of domain structure classifications

Level of hierarchy | Description
Class | The class of a protein domain reflects the proportion of residues adopting an alpha-helical or beta-strand conformation within the three-dimensional structure. The major classes are mainly alpha, mainly beta, alternating alpha/beta and alpha + beta. In CATH the alpha/beta and alpha + beta classes are merged
Architecture | This is the description of the gross arrangement of secondary structures in three-dimensional space independent of their connectivity
Fold/topology | The gross arrangement of secondary structures in three-dimensional space and the orientation and connectivity between them
Superfamily | A group of proteins whose similarity in structure and function suggests a common evolutionary origin
Family | Proteins clustered into families have clear evolutionary relationships. This generally means that pairwise residue identities between the proteins are 30 % and greater. However, in some cases, similar functions and structure provide sufficient evidence of common descent in the absence of high sequence identity
Fig. 4 The 100 largest CATH-Gene3D superfamilies have a biased domain population that contains over half of
all known sequence data in the resource
4.4 Structural Domain Classifications
In this section, the most comprehensive structural domain classifications are briefly
discussed: SCOP, SCOP2, CATH, the Dali Database, SCOPe, and ECOD. These represent manual
(SCOP and SCOP2), semi-automated (CATH), and automated (the rest of the list) approaches to
classification. A more comprehensive list, together with internet links, is also shown in Table 2.
The SCOP database uses an almost entirely manual approach
for the assignment of domain boundaries and recognition of struc-
tural and functional similarities between proteins to generate super-
families. This has resulted in an extremely high quality resource
even though it requires a significant level of input from the
curators.
The SCOP2 prototype, a new structural classification that succeeds SCOP, was
announced in 2014 [62]. SCOP2 similarly aims
to classify protein domains based on their structural and evolution-
ary relationships; however, it uses a directed acyclic graph (DAG) to
represent relationships, rather than a hierarchy. With the growth of
structural data, the SCOP team found evolutionary relationships to
be more complex than first thought, leading to a complete redesign
of how the relationships are captured. Structural and evolutionary
relationships are now split into two different categories, and are
joined by the “protein types” and “evolutionary events” categories
[62].
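The practical difference between a hierarchy and a directed acyclic graph is that a node in a DAG may have more than one parent. The sketch below stores parent links in a dictionary and collects all ancestors of a node. The node names and the function name are hypothetical and do not correspond to actual SCOP2 identifiers or categories.

```python
def ancestors(node, parents):
    """Collect every node reachable by following parent links upward.

    parents: {node: [parent, ...]}; more than one parent is allowed,
             which is what distinguishes a DAG from a strict tree.
    """
    seen = set()
    stack = [node]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Hypothetical graph: one domain linked to a structural grouping and,
# independently, to an evolutionary grouping.
toy_parents = {
    "domain_X": ["structural_group_1", "evolutionary_group_7"],
    "structural_group_1": ["fold_A"],
    "evolutionary_group_7": ["superfamily_B"],
}
domain_ancestors = ancestors("domain_X", toy_parents)
```

In a strict tree, domain_X could sit under only one of the two groupings. The graph form lets structural and evolutionary relationships be recorded side by side.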
Unlike SCOP, the CATH approach to domain classification
aims to automate as many steps as possible, alongside the use of
expert manual intervention to differentiate between homologous
proteins and those merely sharing a common fold. Domain bound-
aries in CATH are automatically assigned through the identification
of recurrent domains using CATH domain family HMMs, and
structure comparison using the CATHEDRAL algorithm. In addi-
tion, a consensus method (DBS) is used to predict novel domains
(i.e., those that have not been previously observed) directly from
structure, with manual validation being applied for particularly
difficult domain assignments. Sequence comparison and the
CATHEDRAL and SSAP structural comparison algorithms are
then used to identify structural and functional relatives within the
existing library of CATH domains. Again, manual validation is
often used at this stage in order to verify fold and superfamily
assignments.
In contrast, the Dali Database established by Holm and coworkers
uses a completely automated protocol that attempts to
Fig. 5 The growth of structural data in the Protein Data Bank (PDB) and CATH-Gene3D. While the number of
PDB structures deposited and domains identified is still increasing annually, the number of new folds
classified has leveled off in the last 10 years, showing that CATH-Gene3D has an extensive coverage of
structural space
6 Conclusions
provide the best approach that we have for annotating the majority
of functionally uncharacterized genome sequences. The compari-
son of entire genomes is becoming an important mechanism for
progressing beyond the simple cataloging of homologous genes
and domains towards an understanding of the biology that under-
lies the variability within families of protein domains.
7 Notes
References
7. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
8. Ponting CP (2001) Issues in predicting protein function from sequence. Brief Bioinform 2:19–29
9. Bru C et al (2005) The ProDom database of protein domain families: more emphasis on 3D. Nucleic Acids Res 33:D212–D215
10. Portugaly E, Linial N, Linial M (2007) EVEREST: a collection of evolutionary conserved protein domains. Nucleic Acids Res 35:D241–D246
11. Heger A (2004) ADDA: a domain database with global coverage of the protein universe. Nucleic Acids Res 33:D188–D191
12. The UniProt Consortium (2014) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
13. Altschul SF et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
14. Kelil A, Wang S, Brzezinski R, Fleury A (2007) CLUSS: clustering of protein sequences based on a new similarity measure. BMC Bioinformatics 8:286
15. Gnanavel M et al (2014) CLAP: a web-server for automatic classification of proteins with special reference to multi-domain proteins. BMC Bioinformatics 15:343
16. Krishnamurthy N, Brown DP, Kirshner D, Sjölander K (2006) PhyloFacts: an online structural phylogenomic encyclopedia for protein functional and structural classification. Genome Biol 7:R83
17. Loewenstein Y, Portugaly E, Fromer M, Linial M (2008) Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics 24:i41–i49
18. Enright AJ, Kunin V, Ouzounis CA (2003) Protein families and TRIBES in genome sequence space. Nucleic Acids Res 31:4632–4638
19. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461
20. Li W, Godzik A (2006) Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22:1658–1659
21. Fu L, Niu B, Zhu Z, Wu S, Li W (2012) CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28:3150–3152
22. Hauser M, Mayer CE, Söding J (2013) kClust: fast and sensitive clustering of large protein sequence databases. BMC Bioinformatics 14:248
23. Feng DF, Doolittle RF (1996) Progressive alignment of amino acid sequences and construction of phylogenetic trees from them. Methods Enzymol 266:368–382
24. Eddy SR (1996) Hidden Markov models. Curr Opin Struct Biol 6:361–365
25. Finn RD et al (2015) HMMER web server: 2015 update. Nucleic Acids Res 43:W30–W38
26. Remmert M, Biegert A, Hauser A, Söding J (2012) HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods 9:173–175
27. Mitchell A et al (2015) The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res 43:D213–D221
28. Sillitoe I et al (2015) CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res 43:D376–D381
29. Pedruzzi I et al (2014) HAMAP in 2015: updates to the protein family classification and annotation system. Nucleic Acids Res 43:D1064–D1070
30. Mi H, Muruganujan A, Thomas PD (2013) PANTHER in 2013: modeling the evolution of gene function, and other gene attributes, in the context of phylogenetic trees. Nucleic Acids Res 41:D377–D386
31. Nikolskayaw QN, Arighi CN, Huang H, Barker WC, Wu CH (2006) PIRSF family classification system for protein functional and evolutionary analysis. Evol Bioinforma 2:197–209
32. Finn RD et al (2014) Pfam: the protein families database. Nucleic Acids Res 42:D222–D230
33. Attwood TK et al (2012) The PRINTS database: a fine-grained protein sequence annotation and analysis resource—its status in 2012. Database (Oxford) 2012:bas019
34. Sigrist CJA et al (2013) New and continuing developments at PROSITE. Nucleic Acids Res 41:D344–D347
35. Letunic I, Doerks T, Bork P (2015) SMART: recent updates, new developments and status in 2015. Nucleic Acids Res 43:D257–D260
36. Oates ME et al (2015) The SUPERFAMILY 1.75 database in 2014: a doubling of data. Nucleic Acids Res 43:D227–D233
37. Haft DH et al (2013) TIGRFAMs and genome properties in 2013. Nucleic Acids Res 41:D387–D395
38. Heger A, Holm L (2003) Exhaustive enumeration of protein domain families. J Mol Biol 328:749–767
39. Penel S et al (2009) Databases of homologous gene families for comparative genomics. BMC Bioinformatics 10(Suppl 6):S3
40. Kriventseva EV et al (2015) OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res 43:D250–D256
41. Jones P et al (2014) InterProScan 5: genome-scale protein function classification. Bioinformatics 30:1236–1240
42. Petryszak R, Kretschmann E, Wieser D, Apweiler R (2005) The predictive power of the CluSTr database. Bioinformatics 21:3604–3609
43. Thomas PD (2010) GIGA: a simple, efficient algorithm for gene tree inference in the genomic age. BMC Bioinformatics 11:312
44. Wu CH et al (2004) PIRSF: family classification system at the Protein Information Resource. Nucleic Acids Res 32:D112–D114
45. Haft DH, Selengut JD, White O (2003) The TIGRFAMs database of protein families. Nucleic Acids Res 31:371–373
46. Berman H, Henrick K, Nakamura H (2003) Announcing the worldwide Protein Data Bank. Nat Struct Biol 10:980
47. Richardson JS (1981) The anatomy and taxonomy of protein structure. Adv Protein Chem 34:167–339
48. Murzin A, Brenner S, Hubbard T, Chothia C (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 247:536–540
49. Orengo CA et al (1997) CATH—a hierarchic classification of protein domain structures. Structure 5:1093–1108
50. Holm L, Sander C (1998) Dictionary of recurrent domains in protein structures. Proteins 33:88–96
51. Sowdhamini R, Rufino SD, Blundell TL (1996) A database of globular protein structural domains: clustering of representative family members into similar folds. Fold Des 1:209–220
52. Gibrat JF, Madej T, Bryant SH (1996) Surprising similarities in structure comparison. Curr Opin Struct Biol 6:377–385
53. Redfern OC, Harrison A, Dallman T, Pearl FMG, Orengo CA (2007) CATHEDRAL: a fast and effective algorithm to predict folds and domain boundaries from multidomain protein structures. PLoS Comput Biol 3:e232
54. Taylor W, Orengo CA (1989) Protein structure alignment. J Mol Biol 208:1–22
55. Holm L, Sander C (1993) Protein structure comparison by alignment of distance matrices. J Mol Biol 233:123–138
56. Ye Y, Godzik A (2003) Flexible structure alignment by chaining aligned fragment pairs allowing twists. Bioinformatics 19:ii246–ii255
57. Subbiah S, Laurents DV, Levitt M (1993) Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol 3:141–148
58. Gerstein M, Levitt M (1998) Comprehensive assessment of automatic structural alignment against a manual standard, the scop classification of proteins. Protein Sci 7:445–456
59. Kolodny R, Koehl P, Levitt M (2005) Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. J Mol Biol 346:1173–1188
60. Dayhoff MO (2005) Atlas of protein sequence and structure. Natl. Biomed. Res. Foundation
61. Orengo CA, Jones DT, Thornton JM (1994) Protein superfamilies and domain superfolds. Nature 372:631–634
62. Andreeva A, Howorth D, Chothia C, Kulesha E, Murzin AG (2014) SCOP2 prototype: a new approach to protein structure mining. Nucleic Acids Res 42:D310–D314
63. Das S et al (2015) Functional classification of CATH superfamilies: a domain-based approach for protein function annotation. Bioinformatics 31:3460–3467
64. Lee DA, Rentzsch R, Orengo C (2010) GeMMA: functional subfamily classification within superfamilies of predicted protein structural domains. Nucleic Acids Res 38:720–737
65. Holm L, Sander C (1994) Parser for protein folding units. Proteins 19:256–268
66. Marchler-Bauer A et al (2014) CDD: NCBI’s conserved domain database. Nucleic Acids Res 43:D222–D226
67. Shindyalov IN, Bourne PE (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng 11:739–747
68. Krissinel E, Henrick K (2004) Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallogr D Biol Crystallogr 60:2256–2268
69. Fox NK, Brenner SE, Chandonia J-MM (2014) SCOPe: Structural Classification of Proteins—extended, integrating SCOP and ASTRAL data and classification of new structures. Nucleic Acids Res 42:D304–D309
70. Andreeva A et al (2007) Data growth and its impact on the SCOP database: new developments. Nucleic Acids Res 36:D419–D425
71. Cheng H et al (2014) ECOD: an evolutionary classification of protein domains. PLoS Comput Biol 10:e1003926
72. Sowdhamini R et al (1998) Protein three-dimensional structural databases: domains, structurally aligned homologues and superfamilies. Acta Crystallogr D Biol Crystallogr 54:1168–1177
73. Orengo CA (1999) CORA—topological fingerprints for protein structural families. Protein Sci 8:699–715
74. Orengo CA, Taylor WR (1996) In: Computer methods for macromolecular sequence analysis, vol 266. Elsevier, Amsterdam, pp 617–635
75. Cuff A, Redfern O, Dessailly B, Orengo C (2011) In Protein function prediction for omics era. Springer, Netherlands
76. Furnham N et al (2012) FunTree: a resource for exploring the functional evolution of structurally defined enzyme superfamilies. Nucleic Acids Res 40:D776–D782
77. Furnham N et al (2012) Exploring the evolution of novel enzyme functions within structurally defined protein superfamilies. PLoS Comput Biol 8:e1002403
78. Barrett AJ (1992) Enzyme nomenclature: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. Academic, San Diego, CA
79. Hadley C, Jones DT (1999) A systematic comparison of protein structure classifications: SCOP, CATH and FSSP. Structure 7:1099–1112
80. Lupas AN, Ponting CP, Russell RB (2001) On the evolution of protein folds: are similar motifs in different protein folds the result of convergence, insertion, or relics of an ancient peptide world? J Struct Biol 134:191–203
81. Park J et al (1998) Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 284:1201–1210
82. Gough J, Chothia C (2002) SUPERFAMILY: HMMs representing all proteins of known structure. SCOP sequence searches, alignments and genome assignments. Nucleic Acids Res 30:268–272
83. Yeats C et al (2006) Gene3D: modelling protein structure, function and evolution. Nucleic Acids Res 34:D281–D284
84. Todd AE, Marsden RL, Thornton JM, Orengo CA (2005) Progress of structural genomics initiatives: an analysis of solved target structures. J Mol Biol 348:1235–1260
Part II
Sequence Analysis
Chapter 8
Abstract
The increasing importance of Next Generation Sequencing (NGS) techniques has highlighted the key role
of multiple sequence alignment (MSA) in comparative structure and function analysis of biological
sequences. MSA often leads to fundamental biological insight into sequence–structure–function relation-
ships of nucleotide or protein sequence families. Significant advances have been achieved in this field, and
many useful tools have been developed for constructing alignments, although many biological and meth-
odological issues are still open. This chapter first provides some background information and considerations
associated with MSA techniques, concentrating on the alignment of protein sequences. Then, a practical
overview of currently available methods and a description of their specific advantages and limitations are
given, to serve as a helpful guide or starting point for researchers who aim to construct a reliable MSA.
Key words Multiple sequence alignment, Progressive alignment, Dynamic programming, Phyloge-
netic tree, Amino acid exchange matrix, Sequence profile, Gap penalty
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_8, © Springer Science+Business Media New York 2017
168 Punto Bawono et al.
1.5 Alignment Iteration
Triggered by the main pitfall of the progressive alignment scenario,
some methods try to alleviate the greediness of this strategy by
implementing an iterative alignment procedure. Pioneered by
Hogeweg and Hesper [10], iterative techniques try to enhance
the alignment quality by gleaning increased information from
repeated alignment procedures, such that earlier alignments are
“corrected” [10, 11]. The idea is to compile an MSA, learn from
it, and do it better next time. In this scenario, a previously gener-
ated MSA is used for improvement of parameter settings, so that
the initial guide tree and consequently the alignment can be opti-
mized. Apart from the guide tree, the alignment procedure itself
can also be adapted based on observed features of a preceding MSA.
The iterative procedure is terminated whenever a preset maximum
number of iterations or convergence is reached. However, depend-
ing on the target function of an iterative procedure, it does not
always reach convergence, so that a final MSA often depends on the
number of iterations set by the user. The alignment scoring func-
tion used during progressive alignment can be different from the
target function of the iteration process. A decision therefore
has to be made, upon convergence or termination of the iterations,
whether to take as the final result the last alignment (with the
maximal iterative target function value) or the highest-scoring
alignment, which may have been encountered earlier during iteration.
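The choice between the last alignment and the best alignment seen so far can be made explicit in code. The following Python sketch is only an illustration of the iteration scheme and not the procedure of any particular program: `realign` and `target` are hypothetical callables standing in for one round of realignment and for the iterative target function, and convergence is assumed to mean that a round leaves the alignment unchanged.

```python
from typing import Callable, Tuple, TypeVar

A = TypeVar("A")  # whatever object represents an alignment


def iterate_alignment(
    initial: A,
    realign: Callable[[A], A],     # hypothetical: one round of realignment
    target: Callable[[A], float],  # hypothetical: iterative target function
    max_iterations: int = 10,
) -> Tuple[A, A, int]:
    """Return (last alignment, best alignment seen, rounds performed)."""
    current = initial
    best, best_score = current, target(current)
    rounds = 0
    for rounds in range(1, max_iterations + 1):
        candidate = realign(current)
        converged = candidate == current  # assumed criterion: nothing changed
        current = candidate
        score = target(current)
        if score > best_score:
            best, best_score = current, score
        if converged:
            break
    return current, best, rounds
```

Returning both alignments leaves the decision to the caller, and the number of rounds shows whether the run stopped because it converged or because the preset maximum was reached.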
Currently, a number of different progressive alignment meth-
ods are able to produce high-quality alignments. These are dis-
cussed in Subheading 3, as well as the options and solutions they
offer, also with respect to the considerations outlined in the pre-
ceding sections.
2 Materials
2.1 Selection of Sequences
Since sequence alignment techniques are based upon a model of
divergent evolution, the input of a multiple alignment algorithm
should be a set of homologous sequences. Sequences can be
retrieved directly from protein sequence databases, but usually a
set is created by employing a homology searching technique for
Multiple Sequence Alignment 171
2.2 Unequal Sequence Lengths: Global and Local Alignment
Query sequence sets comprise sequences that typically will be of
unequal length. The extent of such length differences requires a
decision whether a global or local alignment should be performed.
A global alignment strategy [6] aligns sequences over their entire
length. However, many biological sequences are modular and con-
tain shuffled domains [14], which can render a global alignment of
two complete sequences meaningless (see Note 2). Moreover,
global alignment can also lead to incorrect alignment when large
insertions of gaps are needed, for example, to match two domains A
and B in a two-domain protein against the corresponding domains
in a three-domain structure ACB. In general, the global alignment
strategy is appropriate for sequences of high to medium sequence
similarity. At lower sequence identities, the global alignment tech-
nique can still be useful provided there is confidence that the
sequence set is largely colinear without shuffled sequence motifs
or insertions of domains. Whenever such confidence is not present,
the local alignment technique [15] should be attempted. This
technique selects and aligns the most conserved region shared by
the sequences and discards the remaining sequence fragments. In
cases of medium to low sequence similarity, local alignment is
generally the most appropriate approach with which to start the
analysis. Techniques have also been developed that use the local
alignment technique to iteratively align sequence fragments that
remain after previous local alignment (e.g., [16]).
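The difference between the two strategies can be seen in a minimal Python sketch of the global [6] and local [15] dynamic programming recurrences. It makes simplifying assumptions that real programs do not: a plain match/mismatch score instead of an amino acid exchange matrix, and a linear gap penalty. The parameter values are arbitrary, and only the score is returned, not the alignment itself.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best alignment spanning the full length of both sequences."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best-matching pair of subsequences; never below zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(0, diag, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best
```

For two sequences that share only a short conserved segment, such as `"TTTACGTTT"` and `"GGACGGG"`, the local score reflects the shared `ACG` segment alone, whereas the global score is dragged down by the unrelated flanking residues.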
3 Methods
Table 1
Web sites of multiple sequence alignment programs mentioned in this chapter
Fig. 2 The PRALINE standard web interface. Protein sequences can be pasted in the upper box in FASTA
format or directly uploaded from a file. In addition to using default settings, various alignment strategies can
be selected (see Subheading 3.1) as well as the desired number of iterations or preprocessing cut-off scores
3.2 MUSCLE
MUSCLE [26, 27] is multiple alignment software for both nucleotide
and protein sequences. It includes an online server, but the
user can also choose to download the program and run it locally.
The web server performs calculations using predefined default
parameters, although the program provides a large number of options.
MUSCLE is a very fast algorithm, which should be particularly
considered when aligning large datasets. The progressive alignment
protocol is sped up using a clever pairwise sequence comparison
that avoids the slow DP technique for the construction of the so-
called guide tree. Because of the computational efficiency gained,
MUSCLE by default employs iterative refinement procedures that
have been shown to produce high-quality multiple alignments.
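As an illustration of how sequences can be compared without dynamic programming, the following Python sketch counts the k-mers shared by two sequences after mapping them to a reduced amino acid alphabet. It is a sketch of the general idea only: the residue grouping, the value of k, and the distance formula are assumptions chosen for illustration and are not taken from MUSCLE.

```python
from collections import Counter

# Hypothetical reduced alphabet: each group of similar amino acids maps to one
# symbol. The grouping is illustrative only, not the one used by MUSCLE.
GROUPS = ["AGPST", "C", "DENQ", "FWY", "HKR", "ILMV"]
COMPRESS = {aa: str(i) for i, group in enumerate(GROUPS) for aa in group}


def kmer_counts(seq, k=3):
    """Count k-mers of a sequence after mapping it to the reduced alphabet."""
    reduced = "".join(COMPRESS.get(aa, "X") for aa in seq.upper())
    return Counter(reduced[i:i + k] for i in range(len(reduced) - k + 1))


def kmer_distance(seq1, seq2, k=3):
    """1 - (shared k-mers / k-mers in the shorter sequence)."""
    c1, c2 = kmer_counts(seq1, k), kmer_counts(seq2, k)
    shared = sum((c1 & c2).values())  # multiset intersection
    shortest = min(sum(c1.values()), sum(c2.values()))
    return 1.0 - shared / shortest if shortest else 1.0
```

Because each sequence is scanned only once, the cost of a comparison grows linearly with sequence length, and the pairwise distances obtained in this way can be passed to a clustering method such as UPGMA to build a guide tree.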
1. Iteration. The full iteration procedure used by MUSCLE con-
sists of three steps, although only the last can be considered truly
iterative.
(a) In the first step, sequences are clustered according to the
number of k-mers (contiguous segments of length k) that
they share using a compressed amino acid alphabet [28].
From this the guide tree is calculated using UPGMA, after
which the sequences are progressively aligned following the
tree order.
(b) During the next step the obtained MSA is used to construct
a new tree by applying the Kimura distance correction. This
step is executed at least twice and can be repeated a number
of times until a new tree does not achieve any
3.3 T-Coffee
The T-Coffee program [29] can also handle both DNA and protein
sequences. It includes a web server (following the default settings)
as well as an option to download the program. The algorithm
derives its sensitivity from combining both local and global align-
ment techniques. Additionally, transitivity is exploited using
triplet alignment information including each possible third
sequence. A pairwise alignment is created using a protocol named
matrix extension that includes the following steps:
1. Combining local and global alignment. For each pairwise align-
ment, the match scores obtained from local and global
3.4 MAFFT
The multiple sequence alignment package MAFFT [36, 37] is
suited for DNA and protein sequences. MAFFT includes a script
and a web server that both incorporate several alignment strategies.
An alternative solution is proposed for the construction of the
guide tree, which usually requires most computing time in a pro-
gressive alignment routine. Instead of performing all-against-all
pairwise alignments, the fast Fourier transform (FFT) is used to
rapidly detect homologous segments. The individual amino acids
are characterized using their volume and polarity values, yielding
high FFT peaks in a pairwise comparison whenever homologous
segments are identified. The segments thus identified are then
merged into a final alignment by dynamic programming. Addi-
tional iterative refinement processes, in which the scoring system
is quickly optimized at each cycle, yield a high overall accuracy of
the alignments.
1. Fast alignment strategies. Two options are provided for large
sequence sets: FFT-NS-1 and FFT-NS-2, both of which follow a
strictly progressive protocol. FFT-NS-1 generates a quick and
dirty guide tree and compiles a corresponding MSA. If FFT-NS-
2 is invoked, it takes the alignment obtained by FFT-NS-1 but
now calculates a more reliable guide tree, which is used to
compile another MSA.
2. Iterative strategies. The user can choose from several iterative
approaches. The FFT-NS-i method attempts to further refine
the alignment obtained by FFT-NS-2 by realigning subgroups
until the maximum weighted sum of pairs (WSP) score [38] is
reached. Two more recently included iterative refinement
options (MAFFT version 5.66) incorporate local pairwise align-
ment information into the objective function (sum of the WSP
scores). These are L-INS-i and E-INS-i, which use standard
affine and generalized affine gap costs [39, 40] for scoring the
pairwise comparisons, respectively.
3. Alignment extension. Another tool included in the MAFFT
alignment package is mafftE. This option enhances the original
dimension of the input set by including other homologous
sequences, retrieved from the SwissProt database with BLAST
[12]. Preferences for the exact number of additional sequences
and the e-value can be specified by the user.
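For users who run MAFFT locally, the strategies named above can be selected on the command line. The Python sketch below wraps the flag combinations given in the MAFFT manual for recent versions; it assumes that the `mafft` executable is installed and on the search path, and the flags should be checked against the installed version. The function names and the input file name are hypothetical.

```python
import subprocess

# Strategy name -> command-line flags, as listed in the MAFFT manual.
MAFFT_FLAGS = {
    "FFT-NS-1": ["--retree", "1", "--maxiterate", "0"],
    "FFT-NS-2": ["--retree", "2", "--maxiterate", "0"],
    "FFT-NS-i": ["--retree", "2", "--maxiterate", "1000"],
    "L-INS-i": ["--localpair", "--maxiterate", "1000"],
    "E-INS-i": ["--genafpair", "--maxiterate", "1000"],
}


def mafft_command(strategy, fasta_path):
    """Build the argument list for one of the strategies named in the text."""
    return ["mafft", *MAFFT_FLAGS[strategy], fasta_path]


def run_mafft(strategy, fasta_path):
    """Run MAFFT and return the alignment, which it writes to standard output."""
    result = subprocess.run(
        mafft_command(strategy, fasta_path),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

A call such as `run_mafft("L-INS-i", "input.fasta")` would return the aligned sequences in FASTA format. The fast strategies are the natural choice for large sequence sets, whereas the iterative ones trade speed for accuracy.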
3.5 ProbCons
ProbCons [41] is an accurate but slow progressive alignment algorithm
for protein sequences. The software can be downloaded, but
sequences can also be submitted to the ProbCons web server. The
method follows the T-Coffee approach in spirit, but implements
some of the steps differently. For example, the method uses an
alternative scoring system for pairs of aligned sequences. The
method starts by using a pair-HMM and expectation maximization
(EM) to calculate a posterior probability for each possible residue
match within a pairwise comparison. Next, for each pairwise
sequence comparison, the alignment that maximizes the “expected
accuracy” is determined [42]. In a similar way to the T-Coffee
algorithm, information from pairwise alignments is then extended
by considering consistency with all possible third “intermediate”
sequences. For each pairwise sequence comparison, this leads to a
so-called “probabilistic consistency” that is calculated for each
aligned residue pair using matrix multiplication. These changed
probabilities for matching residue pairs are then used to determine
the final pairwise alignment by dynamic programming. Upon con-
struction of a guide tree, a progressive protocol is followed to build
the final alignment.
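The role of matrix multiplication in the consistency step can be illustrated with a small Python sketch. It assumes that the posterior match probabilities for each pair of sequences are stored as a matrix (one row per residue of the first sequence, one column per residue of the second) and that the re-estimated probabilities are an average over all sequences in the set, with the two sequences themselves each contributing the original matrix. The function and variable names are hypothetical, and the sketch is not the ProbCons implementation.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def consistency_transform(p_xy, via):
    """Re-estimate match probabilities for sequences x and y.

    p_xy -- posterior match probabilities, one row per residue of x
    via  -- list of (p_xz, p_zy) pairs, one per intermediate sequence z
    The cases z = x and z = y each contribute p_xy itself.
    """
    rows, cols = len(p_xy), len(p_xy[0])
    total = [[2.0 * p_xy[i][j] for j in range(cols)] for i in range(rows)]
    for p_xz, p_zy in via:
        product = matmul(p_xz, p_zy)  # support for x~y routed through z
        for i in range(rows):
            for j in range(cols):
                total[i][j] += product[i][j]
    n_seqs = len(via) + 2
    return [[value / n_seqs for value in row] for row in total]
```

A residue pair that is weakly supported in the direct comparison gains probability when an intermediate sequence aligns confidently to both residues, and loses probability when it does not.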
ProbCons allows a few variations of the protocol that the user
can decide to adopt:
1. Consistency replication. The program allows the user to repeat
the probabilistic consistency transformation step, by recalculat-
ing all posterior probability matrices. The default setting
includes two replications, which can be increased to a maximum
of 5.
2. Iterative refinement. The program also includes an additional
iterative refinement procedure for further improving alignment
accuracy. This is based on repeated random subdivision of the
alignment in two blocks of sequences and realignment of the
associated profiles. The default number of replications is set to
100, but can be changed from 0 to 1000 iterations (for the web
server one can select 0, 100, or 500).
3. Pre-training. Parameters for the pair-HMM are estimated using
unsupervised expectation maximization (EM). Emission prob-
abilities, which reflect substitution scores from the
BLOSUM-62 matrix [5], are fixed, whereas gap penalties (tran-
sition probabilities) can be trained on the whole set of
sequences. The user can specify the number of rounds of EM
to be applied on the set of sequences being aligned. The default
number of iterations should be followed, unless there is a clear
need to optimize gap penalties when considering a particular
dataset.
3.6 Kalign
The Kalign algorithm [43] follows the standard progressive alignment
strategy for sequence alignment. To gain speed, while not
losing too much accuracy, the Kalign method incorporates the
Wu–Manber approximate string-matching algorithm [44], which
is used to determine local matches and subsequently to calculate the
sequence distances. A drawback of using pattern matching techni-
ques is that many spurious local matches may be detected. To
address this issue, Kalign incorporates a number of heuristic strate-
gies to filter the matches. This is done to maintain alignment
quality and to preserve fast execution times.
Known obstacles for correct alignment are cases with discon-
tinuous alignments requiring large gap regions, for example as a
result of a domain being inserted or deleted. To facilitate the
insertion of such extensive gap regions, Kalign provides the option
to use Wu–Manber matches as anchor points during alignment. It
incorporates two extra steps to enable the dynamic programming
routine to use the anchor information efficiently:
1. Consistent match finding: The largest set of matches is searched
that can be included in a single colinear alignment, after which
the dynamic programming search matrix is filled with sums of
selected matches. Then, dynamic programming is performed
over the search matrix without applying gap penalties, so as to
allow the algorithm to match local segments, even if this requires
the insertion of extensive gap regions. As an additional filter,
matches occurring on short diagonals (cutoff length is 22) are
deleted.
2. Updating profile match positions: Early during progressive align-
ment, the matches found with the Wu–Manber technique are
indexed using the sequence positions. Later during progressive
alignment, however, these positions must be updated to the
appropriate positions in the profiles, because of gap insertion.
The updated indexes facilitate rapid computation in the next pairwise
alignment step.
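The notion of a set of matches that fits into a single colinear alignment can be illustrated with a short Python sketch. It treats each match as a single anchor position in each sequence with a score, ignores the length of the matches, and uses a simple quadratic-time dynamic programming search. These are simplifying assumptions, and the sketch is not the procedure implemented in Kalign.

```python
def best_colinear_chain(matches):
    """matches: list of (pos_in_seq1, pos_in_seq2, score) tuples.

    Returns (total score, chain) for the highest-scoring subset of matches
    whose positions increase in both sequences.
    """
    ordered = sorted(matches)
    if not ordered:
        return 0, []
    best = [m[2] for m in ordered]   # best chain score ending at each match
    back = [None] * len(ordered)     # predecessor of each match in that chain
    for i, (x, y, score) in enumerate(ordered):
        for j in range(i):
            px, py, _ = ordered[j]
            if px < x and py < y and best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    end = max(range(len(ordered)), key=best.__getitem__)
    total, chain = best[end], []
    while end is not None:           # trace the chain back to its first match
        chain.append(ordered[end])
        end = back[end]
    return total, chain[::-1]
```

Matches that cross the selected chain, such as a match at (2, 5) when the chain passes through (4, 3), cannot be part of the same colinear alignment and are left out.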
The authors compared the speed and accuracy of Kalign to
those of other popular methods. Kalign had accuracy comparable to
the best of the other methods on small alignments, but was
significantly more accurate when aligning large and evolutionarily
divergent sequence sets. Overall, the alignment quality is just under
that of the best performers described in this chapter (e.g., MSAProbs
and Clustal Omega). Kalign is about ten times faster than
ClustalW.
3.8 Clustal Omega
The Clustal Omega alignment engine [46] is the latest version of
the widely used Clustal alignment suite. Unlike ClustalW, which
employs the standard dynamic programming algorithm to align
the sequences [31], Clustal Omega incorporates the HHsearch
method, which is an HMM profile-profile alignment method, to
align the sequences [47]. Similar to MAFFT, Clustal Omega is
able to align a large number of sequences (>10,000 sequences)
within a reasonable time. The method achieves this by exploiting an
algorithm called mBed [48] for rapid generation of a guide tree for
the alignment. The mBed technique is able to produce guide trees
that are just as accurate as those from conventional methods. The
4 Notes
References
1. Gribskov M, McLachlan AD, Eisenberg D (1987) Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A 84:4355–4358
2. Haussler D, Krogh A, Mian IS et al (1993) Protein modeling using hidden Markov models: analysis of globins. In: Proceedings of the Hawaii international conference on system sciences. IEEE Computer Society Press, Los Alamitos, CA
3. Bucher P, Karplus K, Moeri N et al (1996) A flexible motif search technique based on generalized profiles. Comput Chem 20:3–23
4. Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary change in proteins. In: Dayhoff M (ed) Atlas of protein sequence and structure. National Biomedical Research Foundation, Washington, DC
5. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89:10915–10919
6. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 48:443–453
7. Carillo H, Lipman DJ (1988) The multiple sequence alignment problem in biology. SIAM J Appl Math 48:1073–1082
8. Stoye J, Moulton V, Dress AW (1997) DCA: an efficient implementation of the divide-and-conquer approach to simultaneous multiple sequence alignment. Comput Appl Biosci 13:625–626
9. Feng DF, Doolittle RF (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 25:351–360
10. Hogeweg P, Hesper B (1984) The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol 20:175–186
11. Gotoh O (1996) Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 264:823–838
12. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410
13. Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183:63–98
14. Heringa J, Taylor WR (1997) Three-dimensional domain duplication, swapping and stealing. Curr Opin Struct Biol 7:416–421
15. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
16. Waterman MS, Eggert M (1987) A new algorithm for best subsequence alignments with application to tRNA-rRNA comparisons. J Mol Biol 197:723–728
17. Thompson JD, Plewniak F, Poch O (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 15:87–88
18. Heringa J (1999) Two strategies for sequence comparison: profile-preprocessed and secondary structure-induced multiple alignment. Comput Chem 23:341–364
19. Heringa J (2002) Local weighting schemes for protein multiple sequence alignment. Comput Chem 26:459–477
20. Simossis VA, Heringa J (2005) PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res 33:W289–W294
21. Altschul SF, Madden TL, Schaffer AA et al (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
22. Kabsch W, Sander C (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22:2577–2637
23. Jones DT (1999) Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 292:195–202
24. Rost B, Sander C (1993) Prediction of protein secondary structure at better than 70% accuracy. J Mol Biol 232:584–599
25. Lin K, Simossis VA, Taylor WR et al (2005) A simple and fast secondary structure prediction method using hidden neural networks. Bioinformatics 21:152–159
26. Edgar RC (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 5:113
27. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
28. Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Res 32:380–385
29. Notredame C, Higgins DG, Heringa J (2000) T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 302:205–217
30. Huang X, Miller W (1991) A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12:337–357
31. Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22:4673–4680
32. O’Sullivan O, Suhre K, Abergel C et al (2004) 3DCoffee: combining protein sequences and structures within multiple sequence alignments. J Mol Biol 340:385–395
33. Taylor WR, Orengo CA (1989) Protein structure alignment. J Mol Biol 208:1–22
34. Shi J, Blundell TL, Mizuguchi K (2001) FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J Mol Biol 310:243–257
35. Wallace IM, O’Sullivan O, Higgins DG et al (2006) M-Coffee: combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Res 34:1692–1699
36. Katoh K, Misawa K, Kuma K et al (2002) MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 30:3059–3066
37. Katoh K, Kuma K, Toh H et al (2005) MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 33:511–518
38. Gotoh O (1995) A weighting system and algorithm for aligning many phylogenetically related sequences. Comput Appl Biosci 11:543–551
39. Altschul SF (1998) Generalized affine gap costs for protein sequence alignment. Proteins 32:88–96
40. Zachariah MA, Crooks GE, Holbrook SR et al (2005) A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins 58:329–338
41. Do CB, Mahabhashyam MS, Brudno M et al (2005) ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res 15:330–340
42. Holmes I, Durbin R (1998) Dynamic programming alignment accuracy. J Comput Biol 5:493–504
43. Lassmann T, Sonnhammer ELL (2005) Kalign: an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 6(1):298
44. Wu S, Manber U (1992) Fast text searching allowing errors. Commun ACM 35:83–91
45. Liu Y, Schmidt B, Maskell DL (2010) MSAProbs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 26(16):1958–1964
46. Sievers F, Wilm A, Dineen D, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (2011) Fast, scalable generation of high quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol 7(1):539
47. Söding J (2005) Protein homology detection by HMM–HMM comparison. Bioinformatics 21(7):951–960
48. Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms Mol Biol 5:21
49. Rost B (1999) Twilight zone of protein sequence alignments. Protein Eng 12:85–94
50. Morgenstern B, Dress A, Werner T (1996) Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci U S A 93:12098–12103
51. Morgenstern B (2004) DIALIGN: multiple DNA and protein sequence alignment at BiBiServ. Nucleic Acids Res 32:W33–W36
52. Sammeth M, Heringa J (2006) Global multiple-sequence alignment with repeats. Prot Struct Funct Bioinf 64:263–274
53. Phuong TM, Choung BD, Edgar RC, Batzoglou S (2006) Multiple alignment of protein sequences with repeats and rearrangements. Nucleic Acids Res 34:5932–5942
54. Krogh A, Larsson B, von Heijne G et al (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305:567–580
55. Kall L, Krogh A, Sonnhammer EL (2004) A combined transmembrane topology and signal peptide prediction method. J Mol Biol 338:1027–1036
56. Clamp M, Cuff J, Searle SM et al (2004) The Jalview Java alignment editor. Bioinformatics 20:426–427
57. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
58. Galtier N, Gouy M, Gautier C (1996) SEAVIEW and PHYLO_WIN: two graphic tools for sequence alignment and molecular phylogeny. Comput Appl Biosci 12:543–548
59. Li W-H, Graur D (1991) Fundamentals of molecular evolution. Sinauer, Sunderland, MA
60. Gille C, Frommel C (2001) STRAP: editor for STRuctural Alignments of Proteins. Bioinformatics 17:377–378
61. Parry-Smith DJ, Payne AW, Michie AD et al (1998) CINEMA: a novel colour INteractive editor for multiple alignments. Gene 221:GC57–GC63
62. Attwood TK, Beck ME, Bleasby AJ et al (1997) Novel developments with the PRINTS protein fingerprint database. Nucleic Acids Res 25:212–217
63. Golubchik T, Wise MJ, Easteal S, Jermiin LS (2007) Mind the gaps: evidence of bias in estimates of multiple sequence alignments. Mol Biol Evol 24(11):2433–2442
64. Raghava GPS, Searle SMJ, Audley PC, Barber JD, Barton GJ (2003) OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 4(1):47
65. Van Walle I, Lasters I, Wyns L (2005) SABmark: a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 21(7):1267–1268
66. Cline M, Hughey R, Karplus K (2002) Predicting reliable regions in protein sequence alignments. Bioinformatics 18(2):306–314
67. Bawono P, van der Velde A, Abeln S, Heringa J (2015) Quantifying the displacement of mismatches in multiple sequence alignment benchmarks. PLoS ONE 10(5):e0127431
Chapter 9
Abstract
There are millions of sequences deposited in genomic databases, and it is an important task to categorize
them according to their structural and functional roles. Sequence comparison is a prerequisite for proper
categorization of both DNA and protein sequences, and helps in assigning a putative or hypothetical
structure and function to a given sequence. There are various methods available for comparing sequences,
alignment being first and foremost for sequences with a small number of base pairs as well as for large-scale
genome comparison. Various tools are available for performing pairwise large sequence comparison. The
best known tools either perform global alignment or generate local alignments between the two sequences.
In this chapter we first provide basic information regarding sequence comparison. This is followed by the
description of the PAM and BLOSUM matrices that form the basis of sequence comparison. We also give a
practical overview of currently available methods such as BLAST and FASTA, followed by a description and
overview of tools available for genome comparison including LAGAN, MUMmer, BLASTZ, and AVID.
Key words Homology, Orthologs, Paralogs, Substitutions, Indels, Gap penalty, Conservative sub-
stitutions, Dynamic programming algorithm, Heuristic approach, Scoring matrix, Accepted point
mutation, BLOcks SUbstitution Matrix, Global and local alignment, E value
1 Introduction
1.1 Homology, Similarity, and Identity
When we talk about two sequences we typically want to know how
these sequences are related to each other. The terms homology and
similarity are often used interchangeably, but these two terms rep-
resent different relationships. Homology is a general term that is
used to describe a shared common evolutionary ancestry [1].
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_9, © Springer Science+Business Media New York 2017
192 Devi Lal and Mansi Verma
1.2 Substitutions and Indels
In order to assess the similarity or identity any two sequences share,
these sequences must first be aligned (see also Chapter 8). Alignments
of two sequences, referred to as pairwise alignments, can be
used to compute a similarity score based on the number of matches,
mismatches, and gaps. The mutations that accumulate in a sequence
during evolution are substitutions, insertions, and deletions. Substitutions
replace one nucleotide or amino acid residue with another. Insertions
and deletions (together known as indels) denote the addition or
removal of residues, respectively, and are typically represented by
dashes. A run of one or more contiguous indel positions is
commonly referred to as a gap in an alignment. Gaps are usually
heavily penalized in forming an optimal alignment. The most
widely used method for calculation of gap penalties is known as
an affine gap penalty. In this method two parameters are taken into
account: (1) the introduction of a gap and (2) the extension of a
gap. The gap score combines the gap opening penalty (G) and the
gap extension penalty (L): for a gap of length n, the gap score is
G + Ln. The values for gap opening penalties typically lie in the
range 10–15 (a common default is 11) and gap extension penalties
are typically 1–2 (default is 1).
The second method for scoring gaps is known as a non-affine
or linear gap penalty. This method applies the same fixed penalty to
every gap position, so there is no additional cost for opening a gap as
there is with the affine gap penalty.
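The two schemes can be compared with a few lines of Python. The sketch below follows the convention given above, in which a gap of length n costs G + Ln; the default values are the common defaults quoted in the text, whereas the per-position value used for the linear penalty is an arbitrary example.

```python
def affine_gap_penalty(n, gap_open=11, gap_extend=1):
    """Penalty G + L*n for a single gap spanning n positions."""
    if n <= 0:
        return 0
    return gap_open + gap_extend * n


def linear_gap_penalty(n, per_position=2):
    """Same fixed penalty for every gap position (2 is an arbitrary example)."""
    return per_position * max(n, 0)
```

With the affine scheme, one gap of four positions costs 11 + 4 = 15, whereas two separate gaps of two positions each cost 2 × (11 + 2) = 26, so alignments with a few long gaps are favored over alignments with many short ones. The linear scheme charges the same total in both cases.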
In pairwise alignments of proteins (Fig. 1), some aligned resi-
dues may be similar (that is, structurally, functionally or biochemi-
cally related) but not identical. These similar residues are denoted
by a “+” sign in the alignment and are referred to as conservative
substitutions.
Large-Scale Sequence Comparison 193
Seq1 MSDLDRLASRAAIQDLYSDQLIGVDKRQEGRLASIWWDDAEWTIEGIGTYKGPEGALDLA
MSDLDRLASRAAIQDLYSD+LI VDKRQEGRLASIWWDDAEWTIEGIGTYKGPEGALDLA
Seq2 MSDLDRLASRAAIQDLYSDKLIAVDKRQEGRLASIWWDDAEWTIEGIGTYKGPEGALDLA
Seq1 NNVLWPMFHETIHYGTNLRLEFVSADKVNGIGDVLCLGNLVEGNQSILIAAVYTNEYERR
NNVLWPMFHE IHYGTNLRLEFVSADKVNGIGDVL LGNLVEGNQSILIAAV+T+EYERR
Seq2 NNVLWPMFHECIHYGTNLRLEFVSADKVNGIGDVLLLGNLVEGNQSILIAAVFTDEYERR
Seq1 DGVWKLSKLNGCMNYFTPLAGIHFAPPGALLQKS
DGVW SK N C NYFTPLAGIHFAPPG S
Seq2 DGVWKFSKRNACTNYFTPLAGIHFAPPGIHFAPS
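The middle line of such an alignment can be generated with a short Python sketch. It assumes, as BLAST does, that two different residues count as similar when their substitution score is positive. To keep the example short, only the two BLOSUM62 scores needed for the excerpt are included, and any pair missing from the table is treated as scoring zero; a real implementation would load the full matrix. The function names are hypothetical.

```python
# BLOSUM62 scores for the two mismatched pairs in the excerpt used below;
# a real implementation would load the full matrix.
SCORES = {("Q", "K"): 1, ("G", "A"): 0}


def pair_score(a, b, scores=SCORES):
    """Look up a symmetric substitution score; missing pairs count as 0."""
    return scores.get((a, b), scores.get((b, a), 0))


def midline(seq1, seq2, scores=SCORES):
    """Identity -> the residue, positive-scoring mismatch -> '+', else a space."""
    out = []
    for a, b in zip(seq1, seq2):
        if a == b:
            out.append(a)
        elif pair_score(a, b, scores) > 0:
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)


def summarize(seq1, seq2, scores=SCORES):
    """Return (identities, positives, aligned length) for an ungapped pair."""
    line = midline(seq1, seq2, scores)
    identities = sum(a == b for a, b in zip(seq1, seq2))
    return identities, identities + line.count("+"), len(line)
```

Applied to the excerpt `YSDQLIGVDK` / `YSDKLIAVDK` from the first block of the alignment, the Q/K pair is marked with a plus sign because its score is positive, whereas the G/A pair scores zero and is left blank.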
2 Materials
2.1 Sequence Selection: DNA or Protein
The comparison of two sequences can be done at the level of either
nucleotide or protein. Although mutations take place in
DNA, comparing two sequences at the level of protein can help
in revealing important biological information [2]. Many mutations
in DNA are synonymous, that is, they are not reflected at the level
of protein and do not lead to any change in the corresponding
amino acid. Consequently, for distantly related organisms with low
sequence identity at the DNA level, comparison of proteins is
preferred.
2.2 Pairwise Alignment and Scoring Matrices
There are various methods available for pairwise alignment. These
methods include (i) dot-matrix analysis, (ii) use of a dynamic programming
algorithm (which depends on the use of scoring matrices
for scoring alignment), and (iii) heuristic approaches (word or k-tuple
methods).
1. Dot-matrix analysis: This is one of the most popular graphical
methods of aligning two sequences, first described by Gibbs and
McIntyre [3]. The sequences are placed on the X- and Y-axes of
the matrix and a dot is placed wherever a match is found between
the two sequences. Diagonal runs of dots are joined to form the
alignment, whereas isolated dots are regarded as random
matches (Fig. 2). This readily reveals the presence of indels
(insertions or deletions) and repeats (direct or inverted).
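A dot matrix is simple to compute, as the following Python sketch shows. It compares single residues without any window or threshold filtering, which dedicated dot-plot tools usually add to suppress random matches, and it reports diagonal runs of a minimum length. The function names and the minimum length are assumptions made for the illustration.

```python
def dot_matrix(seq_x, seq_y):
    """One text row per residue of seq_y; '*' where it matches seq_x."""
    return ["".join("*" if x == y else "." for x in seq_x) for y in seq_y]


def diagonal_runs(seq_x, seq_y, min_length=3):
    """Return (start_x, start_y, length) for diagonal runs of matching dots."""
    runs = []
    for i in range(len(seq_x)):
        for j in range(len(seq_y)):
            if seq_x[i] != seq_y[j]:
                continue
            if i > 0 and j > 0 and seq_x[i - 1] == seq_y[j - 1]:
                continue  # this dot continues a run that started earlier
            length = 0
            while (i + length < len(seq_x) and j + length < len(seq_y)
                   and seq_x[i + length] == seq_y[j + length]):
                length += 1
            if length >= min_length:
                runs.append((i, j, length))
    return runs
```

Printing the rows returned by `dot_matrix` gives a text version of the plot, and the isolated dots that `diagonal_runs` discards correspond to the random matches mentioned above.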
There are various available tools for dot matrix generation (see
Note 1). The dot matrices give only a graphical representation and
do not reveal the similarity score. Therefore other methods for
sequence comparison have been developed that depend on scoring
any pairwise sequence comparison. The system of scoring imple-
ments the dynamic programming algorithm (DPA) to yield an
optimal alignment between two sequences by breaking an align-
ment into smaller parts and then joining these subalignments in a
sequential manner. Dynamic programming identifies the best
[Fig. 2 axis labels: Sequence 2: T G C G G C T A G; Sequence 1: A T C G G C T A G]
Fig. 2 Diagrammatic representation of dot matrix between sequence 1 “GATCGGCGT” and sequence
2 “TGCGGCTAG”. A dot in the box represents a match; a diagonal line connecting the dots represents a
sequence alignment, whereas dots outside the diagonal represent spurious matches
Alignment 1 (left panel):
Sequence 1     V L D S K Y N V L D
Sequence 2 Y N V L E S K Y N A
4+4+2+4+5+7+6+0 = 32
Alignment 2 (right panel):
Sequence 1 V L D S K Y N V L D
Sequence 2           Y N V L E S K Y N A
7+6+4+4+2 = 23
Fig. 3 Alignment of two sequences can be done in many ways, but the one with the highest score is selected
by DPA
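The two scores in Fig. 3 can be reproduced by summing BLOSUM62 values over the overlapping positions of each alignment. The Python sketch below is a check of that arithmetic rather than a dynamic programming implementation: it scores one ungapped placement at a time, does not penalize the overhanging ends, and contains only the part of BLOSUM62 (Fig. 5) that covers the residues in the two sequences. The function names and the `shift` parameter are hypothetical.

```python
RESIDUES = "ADEKLNSVY"
# BLOSUM62 values (Fig. 5) restricted to the residues that occur in Fig. 3.
SUBMATRIX = {
    #      A   D   E   K   L   N   S   V   Y
    "A": [4, -2, -1, -1, -1, -2, 1, 0, -2],
    "D": [-2, 6, 2, -1, -4, 1, 0, -3, -3],
    "E": [-1, 2, 5, 1, -3, 0, 0, -2, -2],
    "K": [-1, -1, 1, 5, -2, 0, 0, -2, -2],
    "L": [-1, -4, -3, -2, 4, -3, -2, 1, -1],
    "N": [-2, 1, 0, 0, -3, 6, 1, -3, -2],
    "S": [1, 0, 0, 0, -2, 1, 4, -2, -2],
    "V": [0, -3, -2, -2, 1, -3, -2, 4, -1],
    "Y": [-2, -3, -2, -2, -1, -2, -2, -1, 7],
}


def score(a, b):
    """Substitution score for residues a and b."""
    return SUBMATRIX[a][RESIDUES.index(b)]


def ungapped_score(seq1, seq2, shift):
    """Sum of scores over the overlap when residue i of seq1 is paired
    with residue i + shift of seq2 (no gaps, overhangs not penalized)."""
    total = 0
    for i, a in enumerate(seq1):
        j = i + shift
        if 0 <= j < len(seq2):
            total += score(a, seq2[j])
    return total
```

With `seq1 = "VLDSKYNVLD"` and `seq2 = "YNVLESKYNA"`, a shift of 2 pairs VLDSKYNV with VLESKYNA and a shift of -5 pairs YNVLD with YNVLE, corresponding to the two alignments in Fig. 3.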
2.2.1 PAM Matrices
Margaret Dayhoff [4] proposed the first matrices that were used for
quantitatively scoring pairwise protein alignments. The Dayhoff
2.2.2 BLOSUM Matrices
Using a slightly different approach, S. Henikoff and J.G. Henikoff
[7–9] devised other types of scoring matrices that address the
drawbacks in PAM matrices. Their approach was based on the use
A 2
R -2 6
N 0 0 2
D 0 -1 2 4
C -2 -4 -4 -5 12
Q 0 1 1 2 -5 4
E 0 -1 1 3 -5 2 4
G 1 -3 0 1 -3 -1 0 5
H -1 2 2 1 -3 3 1 -2 6
I -1 -2 -2 -2 -2 -2 -2 -3 -2 5
L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6
K -1 3 1 0 -5 1 0 -2 0 -2 -3 5
M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6
F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9
P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6
S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2
T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3
W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17
Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10
V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4
A R N D C Q E G H I L K M F P S T W Y V
Fig. 4 PAM250 matrix, derived by multiplying PAM1 by itself 250 times. This matrix is useful for highly
divergent sequences
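The repeated multiplication mentioned in the caption of Fig. 4 can be illustrated with a toy example in Python. The sketch below uses a hypothetical two-letter alphabet in place of the 20 × 20 PAM1 mutation probability matrix, so the numbers have no biological meaning. The scores shown in Fig. 4 are log-odds values derived from such probabilities, a step that is not shown here.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def matrix_power(m, n):
    """Multiply the square matrix m by itself to obtain m**n (n >= 1)."""
    result = m
    for _ in range(n - 1):
        result = matmul(result, m)
    return result


# Hypothetical two-letter alphabet: each letter stays unchanged with
# probability 0.99 over one PAM unit. The real PAM1 matrix is 20 x 20.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = matrix_power(TOY_PAM1, 250)
```

Each row of `toy_pam250` still sums to 1, but the probability of observing the original letter has fallen from 0.99 to just over 0.5, which illustrates why matrices at large PAM distances are suited to highly divergent sequences.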
Table 1
Relationship between PAM matrices and relative sequence identity
A R N D C Q E G H I L K M F P S T W Y V
A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0
R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3
N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3
D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3
C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2
E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2
G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3
H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3
I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3
L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1
K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2
M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1
F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2
S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2
T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3
Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1
V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4
Fig. 5 BLOSUM62 matrix, derived from protein blocks in which sequences sharing 62 % or more identity are
clustered together. It is the most commonly used matrix in BLAST searches
2.3 Global and Local Alignment
We now shift our focus to the types of alignment and the algorithms
that produce them. There are two main types of alignment: global,
typified by the Needleman and Wunsch algorithm [13], and local,
typified by the Smith and Waterman algorithm [14]. Alignment of two
sequences over their entire length is known as global alignment. In
this type of alignment, both termini of each sequence participate in
the alignment, irrespective of matches, mismatches, or gaps.
Therefore, such alignments almost always introduce gaps and are
preferred for sequences of roughly the same length and high
similarity. However, some sequences show similarity only in some
regions (for example, limited to a motif or domain), and global
alignment may misalign these regions of high similarity. To align
such sequences, local alignment is preferred. Local alignment
searches only for regions of high similarity, irrespective of the
length of the sequences (Fig. 7). For example, if two sequences are of 100 bp length but
Sequence 1 ATCGGCTAGGAACACGACGAGCAG
Sequence 2 GTGCCGCTGGATGAGTGGTCAGTTG
ATCG-GCTAGGAACACGACG-AGCAG -----GCT------------------
| ||| | | | || | |||
GTGCCGCTGG-ATGAGTGGTCAGTTG -----GCT------------------
Global Alignment Local Alignment
Fig. 7 Global and local alignment: in global alignment, two sequences are aligned over their entire length
whereas in local alignment, regions of high similarity are aligned irrespective of the length of the sequence
are highly dissimilar, then local alignment will search for a region of
high similarity (even 4–5 bp) generating a small alignment.
2.3.1 Needleman and Wunsch Algorithm: Global Sequence Alignment
Saul Needleman and Christian Wunsch (1970) described one of the most
important algorithms for aligning two protein sequences. This
algorithm was subsequently modified by Sellers [15] and Gotoh [16].
The algorithm produces an optimal alignment of two sequences,
introducing gaps where needed. The global alignment using the
Needleman and Wunsch algorithm can be obtained via a three-step
process: (1) setting a matrix, (2) scoring the matrix, and
(3) identification of optimal alignment.
1. Setting a matrix: In order to set a matrix, two sequences are
written in two dimensions. The first sequence is written verti-
cally along the y-axis while the second is written horizontally
along the x-axis. If the two sequences are identical, then a perfect
alignment can be constructed, represented by a diagonal line
that extends from top left to bottom right (Fig. 8). Gaps are
represented using vertical or horizontal edges (see Fig. 8 c and d
respectively).
2. Scoring the matrix: Let us consider two sequences of lengths a
and b. A scoring matrix of dimensions a + 1 and b + 1 is created
and the upper left cell is assigned the value “0.” Subsequent cells
across the first row and down the first column will be assigned
the terminal gap penalties (Fig. 9a). The scoring system will be
+1 for a match, −2 for a mismatch, and −2 for a gap. In each cell
the score is derived from the cell diagonally above and to the left
(“+/−” score for match/mismatch), the cell to the left
(“−” score for gap penalty), and the cell directly above that cell
(“−” score for gap penalty) (Fig. 9b). The highest score of these
three is then assigned to the cell. Now let us start from the first
cell in the example given in Fig. 9. The score derived from the
cell diagonally above and to the left will be 0 + 1 = 1 (+1 for
match), the score from the cell to the left will be −2 − 2 = −4
(−2 for gap penalty) and the score from the cell directly above
[Fig. 8 panels: paths through the matrix for (a) 1. MDVPE and 2. MDVPE; (b) 1. MDVPE and 2. MDVKE;
(c) 1. MDVPE and 2. MDV--E; (d) 1. MD--PE and 2. MDVPE]
[Fig. 9 panels (a)-(c): first row and column filled with gap penalties, and derivation of the score of a single cell]
[Fig. 9 panels (d) and (e): completed scoring matrix]
        M    P    R    T    S    K
   0   -2   -4   -6   -8  -10  -12
M  -2   +1   -1   -3   -5   -7   -9
D  -4   -1   -1   -3   -5   -7   -9
V  -6   -3   -3   -3   -5   -7   -9
P  -8   -5   -2   -4   -5   -7   -9
E -10   -7   -4   -4   -6   -7   -9
R -12   -9   -6   -3   -5   -7   -9
K -14  -11   -8   -5   -5   -7   -6
Sequence 1: M D V P E R - - K
Sequence 2: M - - P - R T S K
Fig. 9 Scoring a matrix in Global alignment: (a) a matrix of two sequences (a, b) with a + 1 columns and b + 1
rows is constructed and gap penalties are added in the first row and column. (b) The score in a cell is
2.3.2 Smith and Waterman Algorithm: Local Sequence Alignment
The Smith and Waterman algorithm is used to find regions of local
similarity and aligns only the best-matching parts of two sequences
rather than their entire lengths. The matrix construction for the
Smith and Waterman algorithm is similar to that for the Needleman
and Wunsch algorithm. For two sequences of lengths a and b, a
scoring matrix of dimensions a + 1 and b + 1 is created. However,
the scoring in this system is slightly different from that of the
Needleman–Wunsch algorithm: here the scores can never be negative.
1. Setting a matrix: This step is the same as that of global
alignment. The two sequences are written along the x- and y-axes.
2. Scoring the matrix: A scoring matrix of dimensions a + 1 and
b + 1 is created and every cell in the first row and first column
is assigned the value “0.” The scoring system will be +1 for a
match, −1/3 for a mismatch, and −1.3 for a gap. As in global
alignment, the score in each cell is derived from the cell
diagonally above and to the left (“+/−” score for match/mismatch),
the cell to the left (“−” score for gap penalty), and the cell
directly above that cell (“−” score for gap penalty), and the
highest of the three values is then assigned to the cell.
However, if the score is negative it is replaced with “0.” The score
for each cell is obtained as above and the matrix is filled.
3. Identification of optimum alignment: This method also uses the
trace-back process, but the trace-back starts from the cell with
the highest score instead of the bottom right cell (Fig. 10); in this
matrix, the highest value is 3 and therefore the trace-back will start
in that cell. From this cell, trace-back proceeds in the same way
as for global alignment, but ends as soon as a cell containing
the value “0” is reached. This cell marks the start of the alignment.
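The three steps above can be sketched in code as follows. This is a minimal illustration under the scoring system stated in the text (+1 match, −1/3 mismatch, −1.3 per gap position), not the original Smith–Waterman implementation; the function name smith_waterman and the pointer labels are hypothetical. The two sequences are read off the axes of Fig. 10.

```python
def smith_waterman(seq1, seq2, match=1.0, mismatch=-1 / 3, gap=-1.3):
    """Return (best score, aligned part of seq1, aligned part of seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # First row and first column stay at 0; no cell may become negative.
    score = [[0.0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]  # trace-back pointers
    best, best_cell = 0.0, (0, 0)

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            candidates = [
                (0.0, None),                         # floor at zero
                (score[i - 1][j - 1] + s, "diag"),   # match or mismatch
                (score[i - 1][j] + gap, "up"),       # gap in seq2
                (score[i][j - 1] + gap, "left"),     # gap in seq1
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_cell = score[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_cell
    top, bottom = [], []
    while move[i][j] is not None:
        if move[i][j] == "diag":
            top.append(seq1[i - 1])
            bottom.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


best, top, bottom = smith_waterman("MDVPK", "ADVPSK")
print(best)
print(top)
print(bottom)
```

For these two sequences the highest cell has the value 3 and the trace-back yields DVP aligned with DVP, as in Fig. 10. Storing a pointer per cell costs memory but keeps the trace-back free of floating-point comparisons.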
The Smith–Waterman algorithm and its descendants find the
optimal alignment or alignments between two sequences, but like
Needleman–Wunsch are rather slow. For the two algorithms that
we have just described, the time required is proportional to a × b.
These algorithms are too inefficient when a query sequence is to be
compared with an entire database. Therefore heuristic approaches
have been used to increase the speed of alignment. These heuristic
approaches include FASTA and BLAST programs that are based on
local alignment but produce alignments more rapidly than the
Smith–Waterman algorithm. These programs are described in the
next section.
Fig. 9 (Continued) calculated from the upper left cell (“+/−” score for match/mismatch), cell to the left (“−”
score for gap penalty), and cell directly above that cell (“−” score for gap penalty). (c) The score which is
maximum out of these three is put in the cell. (d) A complete matrix with overall scores derived as explained.
(e) Optimal pairwise alignment using the trace-back process. The cells highlighted are the source of optimum
pairwise alignment
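The global alignment of Fig. 9 can be reproduced with the short sketch below. It assumes the scoring system given in the text (+1 match, −2 mismatch, −2 gap) and is an illustration rather than the authors' code; the function name needleman_wunsch is hypothetical, and when several moves tie during trace-back the sketch arbitrarily prefers the diagonal, then the cell above, then the cell to the left.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-2, gap=-2):
    """Return (score matrix, aligned seq1, aligned seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    score = [[0] * cols for _ in range(rows)]
    # Terminal gap penalties along the first column and first row.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,  # diagonal: match or mismatch
                score[i - 1][j] + gap,    # from above: gap in seq2
                score[i][j - 1] + gap,    # from the left: gap in seq1
            )

    # Trace back from the bottom right cell to the top left cell.
    i, j = rows - 1, cols - 1
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(seq1[i - 1])
                bottom.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(seq1[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(seq2[j - 1])
            j -= 1
    return score, "".join(reversed(top)), "".join(reversed(bottom))


matrix, top, bottom = needleman_wunsch("MDVPERK", "MPRTSK")
for row in matrix:
    print(" ".join(f"{value:4d}" for value in row))
print("Sequence 1:", top)
print("Sequence 2:", bottom)
```

The printed matrix matches Fig. 9d, the bottom right cell holds the overall score of −6 (four matches and five gap positions), and the trace-back returns the alignment shown in Fig. 9e.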
[Fig. 10 panels (a) and (b) show the same scoring matrix]
          A    D    V    P    S    K
     0    0    0    0    0    0    0
M    0    0    0    0    0    0    0
D    0    0    1    0    0    0    0
V    0    0    0    2    0.7  0    0
P    0    0    0    0.7  3    1.7  0.3
K    0    0    0    0    0    0    1
Sequence 1: D V P
Sequence 2: D V P
Fig. 10 Scoring a matrix in local alignment. (a) The matrix construction and derivation of scores for a cell is
similar to global alignment. The scoring system here is +1 for a match, −1/3 for a mismatch, and −1.3 for a
gap. But the score cannot be negative and therefore, the lowest score in this matrix is “0.” (b) The alignment
here does not start from the last cell but from the cell with the highest score and is tracked back until a score
of “0” is encountered
3 Methods
3.1 BLAST
BLAST stands for Basic Local Alignment Search Tool and was
designed by Eugene Myers, Stephen Altschul, Warren Gish,
David J. Lipman, and Webb Miller at NIH in 1990 [17]. It is a
heuristic method that is much faster than the Smith–Waterman
algorithm and also faster than its counterpart FASTA. This utility
helps in searching and comparing biological sequences (nucleotide
or protein) against sequence databases. BLAST calculates the
statistical significance of matches and generates an overall score
based on matches, mismatches, and gaps. The BLAST algorithm relies
on k-tuples, where k is the matching “word length” for searching the
query against each database sequence. The default word length for
BLAST is 3 for proteins and 11 for nucleic acids. As the name
suggests, BLAST explores local alignments, that is, subsequences
within the database are compared to the query sequence. A number
of databases can be searched using BLAST (Fig. 11). The user
needs to enter the query sequence in FASTA format (see Note 2) in
the box provided. Alternatively, the user can enter the accession
number of the query (see Note 3). BLAST can be broadly categorized
into basic and specialized BLAST. Basic BLAST includes the
Database           Description
Nucleotide databases
nr                 All GenBank, EMBL, DDBJ, PDB sequences
refseq_rna         Reference RNA sequences
refseq_genomic     Genomic sequences from NCBI Reference Sequence Project
chromosome         Complete genomes
est                Expressed sequence tags
gss                Genome Survey Sequence
htgs               Unfinished High Throughput Genomic Sequences
pat                Patent sequences
alu_repeats        Human alu repeat elements
dbsts              Database of Sequence Tag Sites
wgs                Whole Genome Shotgun contigs
TSA                Transcriptome Shotgun Assembly
env_nt             Sequences from environmental samples
16S rRNA sequences (from bacteria and archaea)
Protein sequence databases
nr                 Non-redundant protein sequences
Refseq_proteins    NCBI Protein Reference Sequences
swissprot          SWISS-PROT protein sequence database
pat                Patented protein sequences
pdb                Protein Data Bank sequences
env_nr             Metagenomics proteins
tsa_nr             Transcriptome Shotgun Assembly proteins
Basic BLAST
3.1.1 Understanding the BLAST Parameters
As already discussed, a typical BLAST search involves various
parameters that the user has to define. These parameters include
the expected threshold (E-value), word size, matrix, and gap penalties.
(a) Expected threshold (E-value): An E-value represents the
expected number of occurrences of a hit in a database search
purely by chance. Karlin and Altschul [18] defined E-values
using a formula also known as the Karlin–Altschul equation:
$$E = K m N e^{-\lambda S}$$
where K is a minor constant, m is the query size, N is the
total size of the database, S is the score, and λ is a constant used
to normalize the score of a high-scoring pair. As can be noted
from the equation, the value of E decreases exponentially with
increasing score S.
The default threshold E-value in a BLAST search is 10. To
have a better understanding of E-values, let us consider a case
where an alignment has an E-value of 1e−9 with an alignment
score of X bits. This indicates that a score of X bits or better is
expected to occur by chance about 10^−9 times in a search of the
database, that is, roughly once in a billion such searches.
While defining E-values it is also important to
define the scores. Typically there are two types of scores: raw
score (S) and bit score (S′). The raw score is calculated using a
particular substitution matrix and gap penalty parameters,
while the bit score is calculated from the raw score by
normalizing the variables that define a particular scoring system.
A bit score (S′) is given by:
$$S' = \frac{\lambda S - \ln K}{\ln 2}$$
where S is the raw score and λ and K are the statistical para-
meters of a particular scoring system.
Raw scores are without any units and thus raw scores from
different database searches cannot be compared. On the other
hand, the bit scores are produced from the raw scores by
normalizing the variables and therefore they have standard
units, which allows the bit scores from different database
searches to be compared.
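The relationship between raw scores, bit scores, and E-values can be made concrete with a small calculation. The sketch below simply evaluates the Karlin–Altschul equation and the bit-score formula given above. The values of λ and K, the query length m, and the database size N are illustrative assumptions and are not taken from this chapter; in practice BLAST reports λ and K for the scoring system in use. Substituting the bit-score formula into the Karlin–Altschul equation gives E = m N 2^(−S′), which the last function uses.

```python
import math


def e_value(raw_score, m, n, lam, k):
    """Karlin-Altschul equation: E = K * m * N * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, k):
    """Bit score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """Equivalent form obtained by substitution: E = m * N * 2 ** (-S')."""
    return m * n * 2.0 ** (-bits)


# Illustrative assumptions only (not values from the chapter).
LAMBDA = 0.267        # assumed scale parameter of the scoring system
K = 0.041             # assumed Karlin-Altschul constant
QUERY_LENGTH = 250    # m, hypothetical query size
DATABASE_SIZE = 50_000_000  # N, hypothetical total database size

for raw in (50, 100, 150):
    bits = bit_score(raw, LAMBDA, K)
    expect = e_value(raw, QUERY_LENGTH, DATABASE_SIZE, LAMBDA, K)
    print(f"S = {raw:3d}  S' = {bits:6.1f} bits  E = {expect:.3g}")
```

Because the bit score already absorbs λ and K, two hits with the same bit score have the same E-value for a given search space, regardless of the substitution matrix that produced the raw scores.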
(b) Word size: The BLAST algorithm works by dividing the query
into short runs of letters of a particular length, known as query
words (the counterpart of k-tuples in FASTA). The default word size
is 3 for proteins (it can be reduced to 2 if the query is a short
peptide) and 11 for nucleotides (it can be set to 7–15). Lowering
the word size makes the search more sensitive but slower,
while larger word sizes match less frequently, resulting
in a faster search. Let us consider the example given in Fig. 13.
The BLAST algorithm will first divide the query into smaller
words (of three letters each) and then not only find instances
of the first query word (VLD) but also related words where
conservative substitutions have taken place (in this case a
conservative substitution has taken place in the target
Sequence 1 V L D S K Y N V L D
Sequence 2 Y N V L E S K Y N A
5+7+6
Extend to right and left
Sequence 1 V L D S K Y N V L D
Sequence 2 Y N V L E S K Y N A
4+4+2+4+5+7+6+0 = 32
Fig. 13 The typical BLAST search begins with query words. (See text for details.)
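The word-based seeding step illustrated in Fig. 13 can be sketched as follows. This simplified example only looks for exact three-letter word matches between the two sequences of the figure; the real BLAST algorithm additionally accepts related ("neighborhood") words that score above a threshold, such as VLE for the query word VLD, and that step is omitted here. The function names words and exact_word_hits are hypothetical.

```python
from collections import defaultdict


def words(seq, k):
    """List every overlapping word of length k with its start position."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def exact_word_hits(query, target, k=3):
    """Return (word, query position, target position) for each exact hit."""
    # Index the target once so that each query word is a dictionary lookup.
    index = defaultdict(list)
    for j, word in words(target, k):
        index[word].append(j)
    hits = []
    for i, word in words(query, k):
        for j in index.get(word, []):
            hits.append((word, i, j))
    return hits


query = "VLDSKYNVLD"
target = "YNVLESKYNA"

for word, i, j in exact_word_hits(query, target):
    # Hits that share the same diagonal (j - i) can be merged and extended.
    print(f"{word}: query {i}, target {j}, diagonal {j - i}")
```

The hits SKY and KYN fall on the same diagonal; extending that seed to the right and left without gaps gives the segment scored 32 in Fig. 13.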
Fig. 15 RPS-BLAST output. The query used in this example is HCH dehydrochlorinase from Sphingobium
indicum B90A. RPS-BLAST works by searching a protein query against a database of predefined PSSMs. RPS-
BLAST can be accessed at CDD at NCBI
3.2 FASTA FASTA was the first program developed to address rapid database
search for protein or DNA sequences [27–31]. Like BLAST,
FASTA also aims to compare a query sequence against the data-
bases. FASTA search begins by looking for matching words called
k-tuples or ktup which can be considered equivalent to the word size
of BLAST. The ktup length is usually 1–2 for proteins and 4–6 for
nucleic acids. Larger values of ktup result in a faster search but there
may be the possibility of missing similar regions. Once the ktup is
determined the FASTA program looks for word matches that are
close to each other and connects these without introducing any
gaps. Once the initial connections are made, an initial score is
calculated. In the next step, the FASTA program considers the
3.3 Methods and Tools for Large Scale Sequence Comparison
3.3.1 MegaBLAST and Discontiguous MegaBLAST
MegaBLAST is a variation of the BLAST program at NCBI that is
optimized for aligning large or highly similar DNA queries
[33]. MegaBLAST is faster than traditional BLASTN. Its speed
is due to two important changes that have been
introduced into the traditional BLASTN program: (1) The default
word size for MegaBLAST is 28 and can be set as high as 64, while
the word size for BLASTN is 11. (2) MegaBLAST makes use of a
non-affine gap penalty, which means that there is no penalty for gap
opening. Figure 16 shows a typical MegaBLAST output using the
human myoglobin coding region.
A variant of MegaBLAST known as Discontiguous Mega-
BLAST can also be accessed at NCBI. This version has been
designed to compare divergent sequences with low sequence iden-
tity from different organisms. It uses a discontiguous word
approach [34] in which nonconsecutive positions are scanned
over longer sequence segments.
3.3.2 BLAT
BLAT, the BLAST-Like Alignment Tool [35], was developed to align
large genomic DNA sequences of 95 % and greater similarity. It is
quite similar to the MegaBLAST program but differs from
BLAST in its high speed. A BLAT search is used to find the
position of a sequence of interest in a genome. The BLAT query
page is shown in Fig. 17. The sequence of interest, which can be
DNA, protein, translated RNA, or translated DNA, is pasted in the
box provided. The user can choose the genome from the drop-down
menu and can use it for cross-species analysis. In the
following example, we have used the myoglobin coding region from
Fig. 16 A typical Megablast output using the human myoglobin coding region. Megablast is used to search
large DNA query sequences rapidly, in part due to the larger word size implemented in Megablast
human. The result from a typical BLAT search is shown in Fig. 18.
The results are presented with the identity of each hit along with
the position of that particular hit. Details regarding a particular hit
can be found by clicking the hyperlink to its left. The matched bases
are shown in blue and are capitalized while the unaligned regions
are shown in black lower case. The user can also find the distribu-
tion of the BLAT hits across the human chromosomes (Fig. 18c).
3.3.3 BLASTZ BLASTZ is a variation of gapped BLAST [36] that was initially used
to align the human and mouse genomes. The method is slightly
different from BLASTN, which looks for series of exact matches
defined by word size. To begin with, BLASTZ looks for repeat
regions in the first genome that are also found in the second
genome followed by their removal from both genomes. In the
Fig. 17 BLAT homepage. The user can choose the genome against which the query is to be searched using the
drop-down menu
Fig. 18 BLAT output. (a) This part of the results shows the number of hits and their details. (b) Clicking on
details gives additional information about the hit including the alignment. The matched bases are shown in
blue and are capitalized while the unaligned regions are shown in black lower case. (c) Distribution of BLAT
hits across the human chromosomes. The portion of chromosome 22 that is boxed has the maximum identity
with the query
Fig. 19 A typical output of LAGAN showing an alignment between Mycobacterium tuberculosis (AL123456)
and Mycobacterium africanum (FR878060) genomes. The key to the alignment is also given in the left panel
3.3.5 MUMmer
MUMmer is a local-alignment-based tool that is primarily used for
aligning whole genomes [39]. Unlike the other methods, MUMmer
constructs a suffix tree, a data structure that finds all distinct
sub-sequences with high efficiency. Among these sub-sequences,
Maximal Unique Matches (MUMs) are selected. These are unique
in both genomes, and any extension of a MUM results in a
mismatch. Using these MUMs, the algorithm identifies the longest
increasing subsequence (LIS) of MUMs that occur in the same
order and direction in both genomes. The MUMs in the LIS are then
connected by closing the gaps between them using the Smith–Waterman
algorithm. This helps in detecting SNPs, indels, repeats, and
polymorphic regions in genomes. The MUMmer package can also
efficiently align draft genomes with more than 100–1000 contigs
using the NUCmer program. The package can also translate DNA
sequences in six frames and generate alignments using the PROmer
program. The alignments can be viewed using MUMmer plot
(Fig. 20).
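The LIS step can be illustrated with a short sketch. It assumes that the MUMs have already been found and are given as (position in genome A, position in genome B, length) tuples; the coordinates below are invented for illustration, and the function name longest_increasing_mums is hypothetical. A simple quadratic dynamic program is used for clarity, which is not necessarily how the MUMmer package implements this step.

```python
def longest_increasing_mums(mums):
    """Select the longest chain of MUMs that is increasing in both genomes.

    Each MUM is a tuple (position in genome A, position in genome B, length).
    """
    ordered = sorted(mums)  # sort by position in genome A
    n = len(ordered)
    if n == 0:
        return []
    chain_len = [1] * n   # length of the best chain ending at each MUM
    previous = [-1] * n   # predecessor of each MUM in that chain
    for i in range(n):
        for j in range(i):
            # MUM j may precede MUM i only if it also comes first in genome B.
            if ordered[j][1] < ordered[i][1] and chain_len[j] + 1 > chain_len[i]:
                chain_len[i] = chain_len[j] + 1
                previous[i] = j
    end = max(range(n), key=lambda idx: chain_len[idx])
    chain = []
    while end != -1:
        chain.append(ordered[end])
        end = previous[end]
    return chain[::-1]


# Hypothetical MUM coordinates, for illustration only.
mums = [
    (100, 120, 30),
    (400, 2000, 25),   # out of order in genome B (e.g., a rearrangement)
    (600, 650, 40),
    (900, 950, 22),
    (1300, 300, 28),   # out of order in genome B
    (1500, 1600, 35),
]

for pos_a, pos_b, length in longest_increasing_mums(mums):
    print(pos_a, pos_b, length)
```

The MUMs left out of the chain are candidates for rearranged or repeated regions, while the gaps between consecutive MUMs in the chain are the regions that are subsequently aligned in detail.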
3.3.6 AVID AVID is a global alignment program [40] used to align two gen-
omes. Like BLASTZ, repeat regions are masked but unlike it both
Fig. 20 Whole genome alignments of (a) M. bovis AF2122/97 (vertical axis) and M. bovis BCG Pasteur
(horizontal axis) and (b) M. bovis AF2122/97 (vertical axis) and M. avium paratuberculosis (horizontal axis)
using NUCmer showing all MUMs in a dot plot. A straight line indicates aligned regions. Blue dots represent
large-scale chromosomal reversals
masked and unmasked regions are used for alignment. Like MUM-
mer it also relies on the construction of suffix trees to find the initial
matches. This is followed by anchoring and aligning the sequences.
Here again the anchors are selected using the Smith–Waterman
algorithm. After searching for the sub-sequences or matches,
anchoring of the sub-sequences (nonoverlapping, non-crossing
matches) is done so that “noisy” matches are eliminated. Anchors
are then joined to each other using a recursive approach, finally
leading to a global alignment. The AVID program provides fast and
reliable detection of even weak homologies [40].
3.3.7 Mugsy Mugsy is a multiple alignment tool that is used to align whole
genomes [41]. Mugsy can align multiple genomes without requir-
ing any reference and can identify genetic variations like duplica-
tions and rearrangements. Mugsy uses the pairwise aligner
NUCmer and an algorithm for identifying locally collinear blocks
(LCBs are aligned regions from the genomes under consideration).
In the final steps a multiple alignment of LCBs is generated.
3.3.8 WABA
The Wobble Aware Bulk Aligner (WABA) is used for large-scale
genome alignment [42]. Like BLASTZ, WABA is a variant of
gapped BLAST. It determines the initial matches by looking for
six matching nucleotides arranged in the eight-position pattern
11011011, where every position marked 1 must be a match. WABA is
efficient in managing insertions and deletions.
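The seed pattern 11011011 can be expressed directly in code. The sketch below is an illustration of the pattern test only, not of the WABA program itself; the function name seed_hit and the two example windows are hypothetical. Positions marked 0 fall on every third base, which is where codon wobble positions lie when the window is in frame, so mismatches there are tolerated.

```python
def seed_hit(window_a, window_b, pattern="11011011"):
    """True when the two windows agree at every position marked '1'."""
    if len(window_a) != len(pattern) or len(window_b) != len(pattern):
        raise ValueError("windows must have the same length as the pattern")
    return all(
        a == b
        for a, b, flag in zip(window_a, window_b, pattern)
        if flag == "1"
    )


# Hypothetical eight-base windows taken from two genomes.
print(seed_hit("GCAGTTCC", "GCGGTCCC"))  # differences only at '0' positions
print(seed_hit("GCAGTTCC", "GAAGTTCC"))  # difference at a '1' position
```

The first pair differs only at the third and sixth positions and is accepted as a seed; the second pair differs at the second position and is rejected.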
Table 2
Web addresses of the most commonly used tools for large scale genome sequence comparison
4 Notes
Fig. 21 Dotplot generated using YASS genomic similarity search tool. The figure shows alignment of complete
genomes of (a) Mycobacterium tuberculosis F11 and Mycobacterium tuberculosis H37Rv (b) Mycobacterium
tuberculosis F11 and Mycobacterium paratuberculosis K10. The diagonal shown in (a) represents synteny
between the two Mycobacterium genomes while in (b) the two genomes are not in synteny
FASTA
>seq1
MSDLDRLASRAAIQDLYSDQLIGVDKRQEGRLASIWWDDAEWTIEGIGTY
>seq2
KGPEGALDLANNVLWPMFHETIHYGTNLRLEFVSADKVNGIGDVLCLGNL
>seq3
VEGNQSILIAAVYTNEYERRDGVWKLSKLNGCMNYFTPLAGIHFAPPGAL
PIR/NBRF
>P1;seq1
seq1
MSDLDRLASRAAIQDLYSDQLIGVDKRQEGRLASIWWDDAEWTIEGIGTY*
>P1;seq2
seq2
KGPEGALDLANNVLWPMFHETIHYGTNLRLEFVSADKVNGIGDVLCLGNL*
>P1;seq3
seq3
VEGNQSILIAAVYTNEYERRDGVWKLSKLNGCMNYFTPLAGIHFAPPGAL*
#NEXUS
begin taxa;
dimensions ntax=4;
taxlabels
seq1
seq2
seq3
seq4
;
end;
begin characters;
dimensions nchar=50;
format datatype=protein gap=-;
matrix
seq1 MSDLDRLASRAAIQDLYSDQLIGVDKRQEGRLASIWWDDAEWTIEGIGTY
seq2 KGPEGALDLANNVLWPMFHETIHYGTNLRLEFVSADKVNGIGDVLCLGNL
seq3 VEGNQSILIAAVYTNEYERRDGVWKLSKLNGCMNYFTPLAGIHFAPPGAL
seq4 LQKS
;
end;
GENBANK
LOCUS AAR05959 154 aa linear BCT 02-APR-2004
DEFINITION LinA [Sphingobium indicum].
ACCESSION AAR05959
ORGANISM Sphingobium indicum
ORIGIN
1 msdldrlasr aaiqdlysdq ligvdkrqeg rlasiwwdda ewtiegigty kgpegaldla
61 nnvlwpmyhe tihygtnlrl efvsadkvng igdvlclgnl vegnqsilia avytneyerr
121 dgvwklskln gcmnyftpla gihfappgal lqks
Fig. 22 Different file formats that are used in sequence similarity search and
retrieval
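Of the formats shown in Fig. 22, FASTA is the simplest to read programmatically. The following minimal sketch parses FASTA-formatted text into a dictionary keyed by sequence name; the function name parse_fasta is hypothetical, and the sketch assumes that the name is the first whitespace-delimited token after the “>” character. Established libraries offer more robust parsers and should be preferred for production work.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:].split()
            name = header[0] if header else ""
            records[name] = []
        elif name is not None:
            # Sequence lines may be wrapped; collect them all.
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


fasta_text = """\
>seq1
MSDLDRLASRAAIQDLYSDQLIGVDKRQEGRLASIWWDDAEWTIEGIGTY
>seq2
KGPEGALDLANNVLWPMFHETIHYGTNLRLEFVSADKVNGIGDVLCLGNL
>seq3
VEGNQSILIAAVYTNEYERRDGVWKLSKLNGCMNYFTPLAGIHFAPPGAL
"""

sequences = parse_fasta(fasta_text)
for name, seq in sequences.items():
    print(name, len(seq))
```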
Fig. 23 BLASTp search. In the query box of BLASTp, accession number P68871 for the beta-globin gene of
human is typed and the query is searched against the database with default parameters
Fig. 24 Typical BLAST output. (a) The result window of query P68871 showing the hits with a putative
conserved domain with 100 target sequences in the database. Each line (red) represents a pairwise alignment
of the query sequence with 100 individual sequences. The color of this line is dependent upon the alignment
score of the individual 100 hits. (b) Pairwise alignment result of query and individual subjects presented with
their accession numbers, % identity, query coverage, E-value, and alignment score. (c) Detailed pairwise
alignment of query with respective subjects
Fig. 25 FASTA homepage for protein query. In step 1, the user can choose the database against which the user
is interested in aligning the protein. In step 2, the query protein sequence is pasted in the fasta format. In step
3, the user can select and set various parameters
Fig. 26 Typical FASTA search output. Just like the BLAST program, there are various outputs of FASTA: (a)
visual output giving the color code for the E-value of the hits (b) summary table that lists all the hits across the
database (c) functional prediction suggesting the presence of conserved domains in the query protein sequence
References
1. Tautz D (1998) Evolutionary biology. Debatable homologies. Nature 395:17–19
2. Pearson WR (1996) Effective protein sequence comparison. Methods Enzymol 266:227–258
3. Gibbs AJ, McIntyre GA (1970) The diagram, a method for comparing sequences. Its use with amino acid and nucleotide sequences. Eur J Biochem 16:1–11
4. Dayhoff MO, Schwartz RM, Orcutt BC (1978) A model of evolutionary changes in proteins. In: Dayhoff MO (ed) Atlas of protein sequence and structure, vol 5. National Biomedical Research Foundation, Washington, DC, pp 345–352
5. Gonnet GH, Cohen MA, Brenner SA (1992) Exhaustive matching of the entire protein sequence database. Science 256:1443–1445
6. Jones DT, Taylor WR, Thornton JM (1992) The rapid generation of protein mutation data matrices from protein sequences. Comput Appl Biosci 8:275–282
7. Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89:10915–10919
8. Henikoff S, Henikoff JG (1996) Blocks database and its application. Methods Enzymol 266:88–105
9. Henikoff S, Henikoff JG (2000) Amino acid substitution matrices. Adv Protein Chem 54:73–97
10. Henikoff S, Henikoff JG (1991) Automated assembly of protein blocks for database searching. Nucleic Acids Res 19:6565–6572
11. Henikoff S, Henikoff JG (1993) Performance evaluation of amino acid substitution matrices. Proteins Struct Funct Genet 17:49–61
12. Wheeler DG (2003) Selecting the right protein scoring matrix. Curr Protoc Bioinformatics 3.5.1–3.5.6
13. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in amino acid sequence of two proteins. J Mol Biol 48:443–453
14. Smith TF, Waterman MS (1981) Identification of common molecular subsequences. J Mol Biol 147:195–197
15. Sellers PH (1974) On the theory and computation of evolutionary distances. SIAM J Appl Math 26:787–793
16. Gotoh O (1982) An improved algorithm for matching biological sequences. J Mol Biol 162:705–708
17. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215:403–410
18. Karlin S, Altschul SF (1990) Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 87:2264–2268
19. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
20. Altschul SF, Koonin EV (1998) Iterated profile searches with PSI-BLAST: a tool for discovery in protein databases. Trends Biochem Sci 23:444–447
21. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF (2001) Improving the accuracy of PSI-BLAST protein database searches with composition based statistics and other refinements. Nucleic Acids Res 29:2994–3005
22. Bucher P, Karplus K, Moeri N, Hofmann K (1996) A flexible motif search technique based on generalized profiles. Comput Chem 20:3–23
23. Staden R (1988) Methods to define and locate patterns of motifs in sequences. Comput Appl Biosci 4:53–60
24. Tatusov RL, Altschul SF, Koonin EV (1994) Detection of conserved segments in proteins: iterative scanning of sequence databases with alignment blocks. Proc Natl Acad Sci U S A 91:12091–12095
25. Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY et al (2009) CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res 37:D205–D210
26. Zhang Z, Schäffer AA, Miller W, Madden TL, Lipman DJ, Koonin EV, Altschul SF (1998) Protein similarity searches using patterns as seeds. Nucleic Acids Res 26:3986–3990
27. Wilbur WJ, Lipman DJ (1983) Rapid similarity searches of nucleic acid and protein data banks. Proc Natl Acad Sci U S A 80:726–730
28. Lipman DJ, Pearson WR (1985) Rapid and sensitive protein similarity searches. Science 227:1435–1441
29. Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85:2444–2448
30. Pearson WR (1990) Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol 183:63–98
31. Pearson WR (2003) Finding protein and nucleotide similarities with FASTA. Curr Protoc Bioinformatics 3.9.1–3.9.23
32. Pearson WR (2000) Flexible sequence similarity searching with the FASTA3 program package. Methods Mol Biol 132:185–219
33. Zhang Z, Schwartz S, Wagner L, Miller WA (2000) A greedy algorithm for aligning DNA sequences. J Comput Biol 7:203–214
34. Ma B, Tromp J, Li M (2002) Patternhunter: faster and more sensitive homology search. Bioinformatics 18:440–445
35. Kent WJ (2002) BLAT-the BLAST like alignment tool. Genome Res 12:656–664
36. Schwartz S, Kent WJ, Smit A, Zhang Z, Baertsch R, Hardison RC, Haussler D, Miller W (2003) Human–mouse alignments with BLASTZ. Genome Res 13:103–107
37. Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, NISC Comparative Sequencing Program, Green ED, Sidow A, Batzoglou S (2003) LAGAN and multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res 13:721–731
38. Brudno M, Morgenstern B (2002) Fast and sensitive alignment of large genomic sequences. In: Proceedings IEEE computer society bioinformatics conference, Stanford University, pp 138–147
39. Delcher AL, Kasif S, Fleischmann RD, Peterson J, White O, Salzberg SL (1999) Alignment of whole genomes. Nucleic Acids Res 27:2369–2376
40. Bray N, Dubchak I, Pachter L (2003) AVID: a global alignment program. Genome Res 13:97–102
41. Angiuoli SV, Salzberg SL (2011) Mugsy: fast multiple alignment of closely related whole genome. Bioinformatics 27:334–342
42. Kent WJ, Zahler AM (2000) Conservation, regulation, synteny, and introns in a large-scale C. briggsae–C. elegans genomic alignment. Genome Res 10:1115–1125
43. Darling AC, Mau B, Blattner FR, Perna NT (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res 14:1394–1403
44. Nakato R, Gotoh O (2008) A novel method for reducing computational complexity of whole genome sequence alignment. In: Proceedings of the sixth Asia-Pacific bioinformatics conference (APBC2008), pp 101–110
45. Nakato R, Gotoh O (2010) Cgaln: fast and space-efficient whole-genome alignment. BMC Bioinformatics 11:24
46. Kiełbasa SM, Wan R, Sato K, Horton P, Frith MC (2011) Adaptive seeds tame genomic sequence comparison. Genome Res 21:487–493
47. Dalca AV, Brudno M (2008) Fresco: flexible alignment with rectangle scoring schemes. Pac Symp Biocomput 13:3–14
48. Treangen T, Messeguer X (2006) M-GCAT: interactively and efficiently constructing large-scale multiple genome comparison frameworks in closely related species. BMC Bioinformatics 7:433
49. Sonnhammer EL, Durbin R (1995) A dot-matrix program with dynamic threshold control suited for genomic DNA and protein sequence analysis. Gene 167:GC1–GC10
50. Brodie R, Roper RL, Upton C (2004) JDotter: a Java interface to multiple dotplots generated by dotter. Bioinformatics 20:279–281
51. Noe L, Kucherov G (2005) YASS: enhancing the sensitivity of DNA similarity search. Nucleic Acids Res 33:W540–W543
52. Junier T, Pagni M (2000) Dotlet: diagonal plots in a web browser. Bioinformatics 16:178–179
53. Grant JR, Arantes AS, Stothard P (2012) Comparing thousands of circular genomes using the CGView Comparison Tool. BMC Genomics 13:202
54. Alikhan NF, Petty NK, Ben Zakour NL, Beatson SA (2011) BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC Genomics 12:402
Chapter 10
Abstract
The availability of reference genome sequences for virtually all species under active research has revolutio-
nized biology. Analyses of genomic variations in many organisms have provided insights into phenotypic
traits, evolution and disease, and are transforming medicine. All genomic data from publicly funded projects
are freely available in Internet-based databases, for download or searching via genome browsers such as
Ensembl, Vega, NCBI’s Map Viewer, and the UCSC Genome Browser. These online tools generate
interactive graphical outputs of relevant chromosomal regions, showing genes, transcripts, and other
genomic landmarks, and epigenetic features mapped by projects such as ENCODE.
This chapter provides a broad overview of the major genomic databases and browsers, and describes
various approaches and the latest resources for searching them. Methods are provided for identifying
genomic locus and sequence information using gene names or codes, identifiers for DNA and RNA
molecules and proteins; also from karyotype bands, chromosomal coordinates, sequences, motifs, and
matrix-based patterns. Approaches are also described for batch retrieval of genomic information,
performing more complex queries, and analyzing larger sets of experimental data, for example from next-
generation sequencing projects.
Abbreviations
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_10, © Springer Science+Business Media New York 2017
226 James R.A. Hutchins
1 Introduction
[9, 10], the emergence of hominids [11], and even the course of
European history [12].
For any species, individuals within a population will exhibit
genomic variations, and practically none will exactly match the
reference genome sequence. Recent advances in next-generation
sequencing (NGS) technology have allowed the genomes of many
individuals within a population to be sequenced for comparative
analysis. The worldwide 1000 Genomes Project is creating a
detailed catalog of human genetic variation [13, 14], whereas the
UK-based 100,000 Genomes Project [15] focuses on genomic
differences linked to medical conditions; variation data from both
projects are to be made publicly available. Yet even within one
individual, or one tissue sample, different cells may exhibit varia-
tions in genomic sequence. Technological advances have now
enabled genomic sequencing of single cells, revealing much about
inter-cell diversity [16, 17].
A major driving force behind the human genome project was
the quest to identify genes underlying human diseases, for the
development of novel therapies. During the period of the project
the field of medical genetics evolved, from using reverse-genetic
approaches to identify individual disease-linked genes, to prospects
for personalized whole-of-life health care [18–20]. However, early
hopes and expectations for genome-based therapies were damp-
ened slightly by the realization that gene regulation, genome orga-
nization and the genetic basis of disease are much more
complicated than initially thought [21]. Understanding the genetic
basis of cancer was the original motivation behind the human
genome project [22], and cancer genome projects on both sides
of the Atlantic are investigating the relationship between genomic
variation and cancers [23], to identify sequence signatures linked to
different cancer types [24].
Complementing genomic sequence information are data gen-
erated from consortia that attribute evidence-based or computa-
tionally predicted functions to segments of the genome, as well as
mapping epigenetic modifications in certain cell types. The best
known such project, the Encyclopedia of DNA Elements
(ENCODE), aims to identify all functional elements in the
human genome [25]. The parallel modENCODE project has
similar goals, for the model organisms C. elegans and D. melano-
gaster [26, 27], whereas the GENCODE project aims “to identify
all gene features in the human genome using a combination of
computational analysis, manual annotation, and experimental vali-
dation” [28]. Data from these projects are also publicly available,
and will be complemented by further epigenetic features as collab-
orative projects currently underway come to fruition [29].
This chapter provides an overview of current methods for
searching genomic databases using names and unique identifier
codes (IDs) for genes, DNA, or RNA molecules, locus codes,
[Fig. 1 panel labels: 9. BioMart (Ensembl) and Table Browser (UCSC Genome Browser); 11. Specialist applications; 12. Custom scripts, access via APIs]
Fig. 1 Overview of approaches for genomic database searching. Genomic data in public databases can be
searched using many types of query, including gene names and codes, IDs for DNA, RNA, and protein
molecules, karyotype band codes, chromosomal coordinates, sequences, and motifs. A variety of software
approaches are possible, including the use of genome browsers, tools such as BioMart and Table Browser,
specialist applications such as Galaxy and Taverna, and custom scripts employing APIs, in a variety of
languages. Numbers refer to sections within this chapter
2.1 Reference Genome Sequences
For most experimental organisms, reference genome sequences are
the product of collaborative endeavors, with the resulting assemblies
being deposited in public databases. But for the human
genome project there were two parallel, competing efforts: one
led by the publicly funded International Human Genome Sequenc-
ing Consortium (IHGSC), and one undertaken by the private
company Celera Genomics, with “working draft” genomes from
each team reported simultaneously in 2001 [5, 6]. Whereas
IHGSC data were immediately made publicly available, Celera
data were initially available only via paid subscription, but were eventually
released to the public and incorporated into GenBank [33]. The
IHGSC draft genome underwent refinement, for example by filling
Genomic Database Searching 229
2.2 Reference Gene Sets
One issue central to genomic annotation, but which is far from
straightforward, is what a gene is, and how a set of genes can be
identified within a genome. The definition of a gene has evolved
considerably over time, and must now encompass loci
corresponding to a range of protein-coding as well as
non-protein-coding transcripts [35]. Several automated routines have
been developed to predict sets of reference genes within genomic
sequences; these include Genscan [36], AceView [37], the Mammalian
Gene Collection (MGC) [38], Consensus Coding Sequence
(CCDS) [39], Ensembl [40], and RefSeq [41]. Complementing
automated gene-prediction methods are expert-curated manual
annotations, notably the products of the Human and Vertebrate
Analysis and Annotation (HAVANA) procedure, available from the
Vega database [42].
For the human and mouse genomes, the GENCODE consor-
tium merges Ensembl (automated) and HAVANA (manual) gene
annotations to generate a reference gene set that aims to capture
the full extent of transcriptional complexity, including pseudo-
genes, small RNAs and long noncoding RNAs. There are two
versions of this gene set: GENCODE Basic and GENCODE Com-
prehensive. For protein-coding genes the former contains only full-
length, protein-coding transcripts, whereas the latter includes a
fuller set of variant-length transcripts. The GENCODE Compre-
hensive gene set in particular gives very good genome coverage
[43], and is increasingly being adopted as the standard in the
community.
All major genome browsers can be customized to allow the
visualization of multiple gene sets aligned in parallel to the refer-
ence genome, allowing their outputs to be inspected and
compared.
The approach and type of query depend on the starting point, and
the question being asked. Genomic database searching can be
approached in three main ways:
Firstly, complete genome sequences can be downloaded from
Internet-based databases. This gives the user the full freedom and
flexibility to search in the manner of their choosing, which may
range from using a simple text editor to writing a sophisticated
custom program. This approach is also necessary when using soft-
ware not directly connected to Internet-based databases, but
requiring a sequence file to be inputted or uploaded (such as
EMBOSS; Subheading 8.2). Genomic sequence files are typically
downloadable from the file transfer protocol (FTP) servers of each
genomic database, as described in Table 1.
Secondly, genomes can be searched using Web-based genome
browsers and other Internet-connected software that employ
graphical user interfaces (GUIs). This is the most user-friendly
approach to genomic database searching, as genome browsers
automatically recognize many types of query, have in-built search
routines, and allow searches to be stored and combined. Major
Web-based browsers also allow genome searching by incorporating
query terms into a Web address (URL). For users with a spread-
sheet containing many IDs, links to custom genome searches can
thus be automatically created in spreadsheet software, allowing
one-click launching of searches in genome browsers
(Subheading 9.1).
Thirdly, genomic databases can be searched using custom
scripts that execute queries to Internet-based databases by means
of their application programming interfaces (APIs). This is ulti-
mately the most flexible and powerful approach, but for those
without programming experience this involves a considerable
learning curve. Options for this approach are described in
Subheading 12.
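As a minimal sketch of the second and third approaches, the following Python functions build one query address per identifier, as might be done for a column of IDs in a spreadsheet. The two URL patterns are assumptions that do not appear in this section: the Ensembl REST "lookup/symbol" endpoint, and the UCSC Genome Browser "hgTracks" address with "db" and "position" parameters. The function names are illustrative only, and the addresses should be checked against current documentation before use.

```python
from urllib.parse import quote, urlencode

# Assumed base addresses; verify against the current Ensembl and UCSC documentation.
ENSEMBL_REST = "https://rest.ensembl.org"
UCSC_TRACKS = "https://genome.ucsc.edu/cgi-bin/hgTracks"


def ensembl_lookup_url(species: str, symbol: str) -> str:
    """Build an Ensembl REST 'lookup by symbol' address that requests JSON."""
    path = f"/lookup/symbol/{quote(species)}/{quote(symbol)}"
    return f"{ENSEMBL_REST}{path}?content-type=application/json"


def ucsc_browser_url(assembly: str, term: str) -> str:
    """Build a UCSC Genome Browser link for a search term or coordinates."""
    return f"{UCSC_TRACKS}?{urlencode({'db': assembly, 'position': term})}"


if __name__ == "__main__":
    # One line of links per gene symbol, ready to paste into a spreadsheet.
    for gene in ["CDK1", "CDK10"]:
        print(gene, ensembl_lookup_url("homo_sapiens", gene),
              ucsc_browser_url("hg38", gene), sep="\t")
```

The functions only assemble text; fetching the resulting addresses (for example with a Web browser or an HTTP library) is a separate step, which keeps the sketch usable without a network connection.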
Table 1
Genomic databases and genome browsers
4.2 Ensembl
Ensembl [40], a joint initiative between the European Bioinformatics
Institute and the Wellcome Trust Sanger Institute, is a
resource initially created to allow access to data from the public
Human Genome Project. Ensembl now comprises a family of data-
bases providing compiled genomic data for over 80 genome-
sequenced vertebrate species, together with gene predictions and
corresponding transcript and polypeptide sequences, and various
analytical tools.
The main Ensembl resource is complemented by Ensembl Gen-
omes [44], a “superfamily” of databases that includes Ensembl
Bacteria (comprising over 10,000 genomes of bacteria and archaea),
Ensembl Fungi (comprising genomes of over 40 species, including
budding and fission yeasts, and Aspergillus), Ensembl Metazoa
(over 50 invertebrate species, including Drosophila, C. elegans, and
silkworm), Ensembl Plants (over 30 species, including Arabidopsis,
rice, and wheat), and Ensembl Protists (over 20 species, including
Dictyostelium, Plasmodium, and Tetrahymena).
4.2.1 Searching Ensembl
Starting with a gene name (for human genes, this is referred to as a
gene symbol), the genomic locus can be rapidly identified within
Ensembl using a simple search, where the relevant organism is also
specified. At the Ensembl home page, after “Search:” choose the
relevant species from the pull-down menu, enter the gene name
into the search box and click “Go”. At the search results page, the
relevant gene usually appears at the top of the list, and the “Best
gene match” is shown at the top-right. Clicking on either of these
opens an Ensembl gene page showing information about the gene
and a summary of its genomic context.
More straightforward still is searching Ensembl using unique
IDs. In addition to its own identifiers, which have the prefix ENS,
Ensembl recognizes a wide range of, but not all, popular ID types
for genes, nucleic acids and proteins from external databases. For
searching using gene IDs, Ensembl recognizes NCBI’s GeneID
and UniGene [45] codes, as well as several from species-specific
databases such as HGNC [46] and FlyBase [47]. For nucleic acids
4.2.2 Navigating and Customizing the Genome Browser
Following an Ensembl gene search, and the user’s selection of the
gene of interest, a page appears that is rich in information relevant
to that gene. This page is organized into tabs, the three main ones
being Species, Location, and Gene. The Gene tab is shown as
default, displaying summary information about the gene, a table
of transcripts, and then a condensed genome viewer zoomed to the
full length of the gene. Genes are shown with a color code (red,
protein coding; orange, merged Ensembl/Havana; blue, processed
transcript; gray, pseudogene). Within a gene, exons are shown as
solid blocks, and introns as chevron-shaped connecting lines. To
open the full interactive genome browser from the Gene tab, click
the “Location” tab at the top.
The Location tab shows the locus in question in its genomic
context, in an interactive and customizable view. Within the Loca-
tion tab there are three main panels: “Chromosome”, and under
“Region in detail”, the 1 MB Region and the Main Panel. The
Chromosome panel gives the chromosomal coordinates of the
ROI, and a diagram of the chromosome with its banding pattern,
with the ROI bounded by a red box. Below this, the 1 MB Region
allows the user to see the genes in the vicinity of the current ROI,
also bounded by a red box. Genes are color coded by type (e.g.,
protein coding, or RNA gene), and their orientations indicated
next to their names by the > and < symbols. The ROI can be
adjusted by either scrolling left and right, or by selecting a specific
region by dragging a box, then clicking “Jump to region”.
4.2.3 Exporting Outputs from Ensembl
Once the search has been performed and relevant information
visualized in the browser, the user may wish to export the results,
either as an image of the browser output, or as a data file containing
the sequence and other features.
To export the current browser output as a publication-quality
graphic file, click on the “Export this image” icon within the set of
white icons along the left of the blue header bar of the browser
panel. A pop-up box allows the graphical output to be saved in
high-quality raster (PNG) or vector (PDF, SVG) formats.
To export the sequence and other features in the current view,
click the “Export data” link to the left of the browser. In the dialog
4.3 Vega
The Vertebrate Genome Annotation (Vega) database [42] complements
Ensembl by providing expert manual annotation of genes
(protein-coding, noncoding RNA, processed transcripts, and pseu-
dogenes), currently for five organisms: human, mouse, rat, zebra-
fish, and pig, with partial annotation for five other vertebrates. The
genome annotations in the Vega database are updated by a dedi-
cated team via the HAVANA procedure.
The Vega search and browser interface works almost exactly like
Ensembl, and so a separate description of its functionality is not
necessary. As with Ensembl, searching the Vega genome database
using a molecule’s unique ID can be initiated from the search box
at the top-right of the home page. Searches can also be launched by
incorporating query terms into URLs (see Note 3). Such searches
identify the corresponding Vega gene IDs, which have the prefix
OTT, and the corresponding Gene and Location (genome
browser) pages.
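Because Ensembl IDs carry the prefix ENS and Vega IDs the prefix OTT, a query term can be given a rough label before deciding where to search. The sketch below is illustrative rather than a description of how any browser parses queries: the rule for RefSeq-style accessions (two capital letters and an underscore, as in NC_001422) and all function and label names are assumptions, and the patterns require trailing digits so that gene names that merely begin with "ENS" are not mislabeled.

```python
import re

# (pattern, label) pairs; prefixes ENS and OTT come from the text,
# the RefSeq-style rule and the labels themselves are assumptions.
_ID_RULES = [
    (re.compile(r"^ENS[A-Z]*\d+(\.\d+)?$"), "Ensembl"),
    (re.compile(r"^OTT[A-Z]*\d+(\.\d+)?$"), "Vega"),
    (re.compile(r"^[A-Z]{2}_\d+(\.\d+)?$"), "RefSeq"),
]


def classify_id(query: str) -> str:
    """Return a rough source label for a query term based on its prefix."""
    term = query.strip()
    for pattern, label in _ID_RULES:
        if pattern.match(term):
            return label
    # Anything else is treated as a gene name or an ID type not covered here.
    return "name or other ID"


if __name__ == "__main__":
    for term in ["ENSG00000170312", "OTTHUMG00000018283", "NC_001422", "CDK1"]:
        print(term, "->", classify_id(term))
```

A label obtained this way is only a hint for routing a query; the database itself remains the authority on whether an ID is valid.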
Manually curated gene information from Vega/HAVANA
played an enormously valuable role in enhancing the quality and
credibility of the GENCODE gene set; as this information is avail-
able from the main Ensembl website, genome searching from the
Vega site would only be necessary if a user were interested exclu-
sively in manually annotated genes.
4.4 University of California Santa Cruz (UCSC) Genome Browser
The UCSC Genome Browser [51] is one of the most popular and
flexible Web browser-based tools for searching, visualizing, and
accessing genomic data. The browser allows access to the genomes
of currently over 90 species, the vast majority being metazoans.
Access is provided to locus-specific information of many types,
including genes, genomic landmarks, gene expression and regula-
tion (including ENCODE project data), epigenetic modifications,
comparative genomics, and genomic variations.
A unique feature of the UCSC Genome Browser is the “UCSC
genes” gene set [52]. This moderately conservative set of gene
predictions is based on RefSeq, GenBank, CCDS, Rfam [53], and
tRNAs [54], and comprises both protein-coding and noncoding
RNA genes. However, in a move towards standardizing on a common
gene set within the bioinformatics community, as of July 2015
the UCSC Genome Browser has adopted GENCODE
4.4.1 Genome Searching Using the UCSC Genome Browser
Genomes can be searched within the UCSC Genome Browser
using gene names, plus many kinds of ID, including those from
RefSeq, GenBank, UniProt, and Ensembl, although not all of
these ID types are recognized for all organisms. To search for
genomic information for a gene or molecule, the starting point is
the “Genomes” page. From the pull-down menus “group” and
“genome”, choose the appropriate entries for the species of inter-
est. Under “assembly”, choose the latest version, unless an older
one is specifically required. Then under “search term” enter the
gene name or molecule ID, and click “submit”.
The results page that follows lists entries from nucleic acid
databases that correspond to genes showing a match to the search
term; each of these is a link to the corresponding genome browser
page. Each results entry listed has chromosomal coordinates and an
ID, and often a brief title, which may include the gene name.
Matches are categorized according to the type of gene set, such as
UCSC, RefSeq, or GENCODE, as appropriate. Matches to gen-
omes of other organisms also sometimes appear, listed separately.
Whereas searches using unique IDs usually produce one hit per
gene set, searches using gene names may also generate several “off-
target” hits due to non-exact text matching (e.g., “CDK1” also
matches “CDK10”), so the user must exercise caution when
choosing a match from the results list. A useful safeguard is
that gene names may be recognized automatically as they are typed,
before “submit” is clicked, and offered in a drop-down
menu of gene matches; clicking on the appropriate gene match
takes the user to the correct genome browser page.
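The “CDK1” versus “CDK10” problem can be illustrated with a short Python sketch that keeps only hits whose gene symbol equals the query exactly. The list of hits is a hypothetical, hand-made stand-in for a results table, not output from the UCSC Genome Browser, and the field names "symbol" and "gene_set" are assumptions.

```python
def exact_symbol_hits(query, hits):
    """Keep only hits whose gene symbol equals the query (case-insensitive)."""
    wanted = query.strip().upper()
    return [hit for hit in hits if hit["symbol"].upper() == wanted]


# Hypothetical results of a text search for "CDK1": non-exact matching
# has also returned an entry for CDK10.
hits = [
    {"symbol": "CDK1", "gene_set": "RefSeq"},
    {"symbol": "CDK10", "gene_set": "RefSeq"},
    {"symbol": "CDK1", "gene_set": "GENCODE"},
]

if __name__ == "__main__":
    for hit in exact_symbol_hits("CDK1", hits):
        print(hit["symbol"], hit["gene_set"])
```

Comparing whole symbols rather than testing whether the query occurs within a longer string is what removes the off-target entry; searching with a unique ID avoids the problem altogether.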
The UCSC Genome Browser can also be searched by incorpor-
ating query terms into a URL (see Note 4). A more sophisticated
method of searching the UCSC Genome Browser for multiple IDs
is provided by the site’s Table Browser tool; an example of this is
provided in Subheading 9.3.
4.4.2 Genome Browser Navigation and Customization
The main genome browser page features a menu bar along the top,
then navigation controls allowing the user to move the ROI left or
right; zoom in by a factor of 1.5, 3, or 10, or right in to the base-pair
level; and zoom out by a factor of 1.5, 3, 10, or 100. Below this
appear the coordinates of the current ROI and its length, plus an
additional search box. Below this, a chromosomal ideogram with
banding patterns is shown, with a red line or box indicating the
location of the current ROI.
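The effect of the zoom controls on the ROI coordinates can be sketched as simple arithmetic: the width of the region is divided (zoom in) or multiplied (zoom out) by the chosen factor while the centre is held fixed. This is an illustration under stated assumptions, not the browser's own code; it assumes 1-based inclusive coordinates, and the function name, rounding, and clamping to the chromosome ends are choices made for the example.

```python
def zoom(start, end, factor, chrom_length, out=False):
    """Return a new 1-based inclusive (start, end) after zooming about the centre.

    With out=False the width is divided by factor (zoom in);
    with out=True it is multiplied by factor (zoom out).
    """
    width = end - start + 1
    new_width = round(width * factor) if out else round(width / factor)
    new_width = max(1, new_width)
    centre = (start + end) // 2
    new_start = centre - new_width // 2
    # Keep the window inside the chromosome.
    new_start = max(1, min(new_start, chrom_length - new_width + 1))
    new_end = min(chrom_length, new_start + new_width - 1)
    return new_start, new_end


if __name__ == "__main__":
    roi = (1001, 2000)  # a 1000-bp region on a hypothetical 1-Mb chromosome
    print(zoom(*roi, 10, 1_000_000))           # zoom in tenfold
    print(zoom(*roi, 3, 1_000_000, out=True))  # zoom out threefold
```

Near a chromosome end the window is shifted rather than shrunk, so the requested width is preserved whenever the chromosome is long enough.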
Below this is the main browser panel, showing the tracks. The
uppermost track, “Base Position”, shows by default a genome ruler
and a scale bar. Navigation left and right can be performed by
dragging any part of the panel except the Base Position track.
To zoom in to a specific desired region, drag a box over the Base
4.4.3 Exporting Outputs from the UCSC Genome Browser
To export the current graphical output of the genome browser as a
publication-quality vector file, go to the top menu, select “View”
then “PDF/PS”. This opens a page called “PDF Output”, where
the current browser graphic or chromosome ideogram can be
downloaded in PDF or EPS formats.
To obtain genomic sequence corresponding to the current
browser view, in the top menu, click “View” then “DNA”. The
“Get DNA in Window” page opens, with the relevant genomic
coordinates already filled in under “Position”. Under “Sequence
Retrieval Region Options:”, select: “Add 1 extra bases upstream
(5′) and 0 extra downstream (3′).” Under “Sequence Formatting
Options:” choose upper or lower case as desired. Click “get DNA”.
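What adding extra upstream (5′) and downstream (3′) bases means for the coordinates retrieved can be sketched as follows. This is a hedged illustration rather than the behaviour of the “Get DNA” page itself: it assumes 1-based inclusive coordinates, treats "upstream" relative to a strand supplied by the caller, and uses a hypothetical function name.

```python
def pad_region(start, end, upstream=0, downstream=0, strand="+"):
    """Extend a 1-based inclusive region by extra 5' and 3' bases.

    On the "+" strand, upstream (5') lies at lower coordinates;
    on the "-" strand it lies at higher coordinates.
    """
    if strand == "+":
        new_start, new_end = start - upstream, end + downstream
    elif strand == "-":
        new_start, new_end = start - downstream, end + upstream
    else:
        raise ValueError("strand must be '+' or '-'")
    # Coordinates cannot run off the start of the chromosome.
    return max(1, new_start), new_end


if __name__ == "__main__":
    print(pad_region(1000, 2000, upstream=100, downstream=10))
    print(pad_region(1000, 2000, upstream=100, downstream=10, strand="-"))
```

The sketch does not clamp the upper coordinate, because that would require the chromosome length, which is not known from the region alone.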
4.5.1 Searching Map Viewer
Performing a genomic database search from the Map Viewer
home page proceeds by first selecting an organism (after
“Search:”), entering a query term (after “for:”), then clicking
“Go”. The range of ID types recognized is somewhat narrower
than those of Ensembl and the UCSC Genome Browser. Gene
names and NCBI IDs from RefSeq, GenBank, and UniGene are
recognized, but not NCBI GI numbers nor external IDs from
Ensembl or UniProt.
Following a database search, a results page is produced that
shows a karyotype of the organism in question. For a successful
search that matches the query to a single genomic locus, its location
on the corresponding chromosome is indicated by a red line. When
a search matches several loci, the positions of each are indicated on
their respective chromosomes. Where the search term contains
degeneracy (for example “CDK” matches all members of the
cyclin-dependent kinase family), this provides a useful overview of
the distribution of matches across the whole genome. Below the
karyotype is a table listing all the matching nucleotide entries, listed
under the headings Chromosome, Assembly, Match, Map element,
Type and Maps. For many genes there may be matches to dozens of
transcript entries in the NCBI nucleotide database, and so finding
whole gene hits among these may require some effort. Fortunately,
a Quick Filter option to the right of the table allows the user to
narrow down the hits to only Gene or RefSeq entries. A more
extensive range of filters and search options is available by clicking
the “Advanced Search” button. Here, the search can be narrowed
down to the chromosome, assembly, type of mapped object, and
map name. The query is entered in the “Search for” box, and the
search initiated by clicking “Find”. Genomic database searching
using Map Viewer can also be performed by incorporating query
terms into a URL (see Note 5).
4.5.2 Genome Browser Navigation and Customization
The NCBI Map Viewer browser differs from browsers described
previously in that tracks appear vertically rather than horizontally,
with a vertical ideogram of the relevant chromosome to the far left.
Each track has a short title above it; hovering the mouse pointer
over this opens a pop-up box providing further information. In
Map Viewer one track is designated as the “Master Map”; this
appears as the rightmost track, with its title in red. Next to each
track title are two small buttons: a right-facing arrow and an “X”.
Clicking the former makes that track the Master Map; clicking the
latter removes the track from the display. Each element within the
vertical track is connected by a gray line to a horizontal textual
annotation. For the Master Map track these annotations appear
larger, and where appropriate further information is provided in a
table to the right of the graphical panel.
Navigating the Map Viewer by zooming and scrolling is possi-
ble, but not with the same instantaneous interactivity as Ensembl,
Vega or the UCSC Genome Browser. Zooming in and out can be
achieved by clicking on a small panel to the left of the main graphic.
Clicking on a track opens a pop-up window with options for
zooming and re-scaling to 10 Mb, 1 Mb, 100 kb, and 10 kb.
Scrolling can be performed by clicking on the blue up or down
arrowheads at either end of the Master Map track.
Where genes are shown in a track, the exons are shown as solid
blocks, and introns as thin lines. The track called “Genes_seq”
presents a flattened view of all exons that can be spliced together
in various ways. It is advantageous to have Genes_seq as the Master
Map, as the resulting table to the right of the graphical output
contains a wealth of useful information, including gene descrip-
tions, plus links to numerous external databases of gene and protein
function.
Clicking the “Maps & Options” button opens a window that
gives the user control over the tracks displayed, and their order. The
window has two sections: the Available Tracks section on the left
shows a variety of additional data types, including genes from
different sources, transcripts, clones, STSs, CpG islands and contigs.
Clicking the “+” next to a feature adds it to the Tracks
Displayed panel on the right, which lists currently displayed tracks
and allows their order to be changed. Here, highlighting an “R”
icon next to a track adds a genomic ruler to its left in the genome
browser. Clicking “OK” closes the Maps & Options window and
updates the graphical output.
4.5.3 Export from Map Viewer
Options for exporting Map Viewer’s graphical output appear fairly
limited. One approach is to click the right mouse
button over the browser graphic, choose “Save Image As...”, and
save in the PNG raster format; however, this does not capture
annotations to the Master Map track, nor any of the table to its
right. The leading option at present is therefore to take a screenshot or
selective screen grab. In either case, smaller textual elements and
finer lines are of low resolution and appear pixelated.
In contrast, Map Viewer’s options for exporting genomic and
associated data shown in the browser are very comprehensive.
Below the graphical panel is a section called Summary of Maps,
which lists each of the tracks (here called maps), and summarizes
their contents. Next to each map summary are two links. Clicking
the Table View link opens a page that tabulates all of the features for
each track, with genomic coordinates and other relevant data; this
can be downloaded as a tabulated text file. Clicking the Download/
Sequence/Evidence link opens a window that allows the
corresponding sequence to be saved in GenBank or FASTA
formats.
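Once a sequence has been saved in FASTA format, it can be searched locally with a few lines of code. The following minimal sketch, assuming Python and plain exact matching, parses FASTA text and reports the 1-based start positions of a motif; the function names and the toy sequence are invented for illustration and do not correspond to any real genomic record.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a dict of {header: sequence}."""
    records, header, chunks = {}, None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.upper())
    if header is not None:
        records[header] = "".join(chunks)
    return records


def find_motif(sequence, motif):
    """Return 1-based start positions of all exact, possibly overlapping matches."""
    if not motif:
        raise ValueError("motif must not be empty")
    motif = motif.upper()
    positions, index = [], sequence.find(motif)
    while index != -1:
        positions.append(index + 1)
        index = sequence.find(motif, index + 1)
    return positions


if __name__ == "__main__":
    toy = ">toy_sequence invented for illustration\nACGTACGT\nTTACGA\n"
    for name, seq in read_fasta(toy).items():
        print(name, find_motif(seq, "ACG"))
```

For whole-genome files, reading line by line from disk rather than from a single string would be preferable, and only the forward strand is searched here.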
4.5.4 NCBI Gene
Whereas the NCBI Gene database [58] focuses on genes rather
than genomic regions, its search and genome browsing functional-
ity complements and in some cases supersedes that of Map Viewer.
NCBI Gene can be searched using all standard NCBI IDs, includ-
ing nucleotide and protein GI numbers, which should be prefaced
“GI:”. Some external IDs are also recognized, such as Ensembl
genes and UniProt proteins. URL-based searching is also possible
(see Note 6).
The NCBI Gene results page lists possible gene hits, tabulated
as Name/Gene ID, Description, Location, and Aliases. Clicking a
gene name link opens the relevant gene page, which (in “Full
Report” format) hosts a wealth of information about the gene, in
12 sections. The “Genomic context” section summarizes informa-
tion about the locus, with a graphical overview of the gene’s
relationship to neighboring genes. The “Genomic regions, tran-
scripts, and products” section contains a horizontally oriented
interactive genome browser that in some ways surpasses the usabil-
ity of the Map Viewer (to which there is a link to the right); for
example this browser supports drag-based scrolling and ROI selec-
tion, and export of the graphic in high-quality PDF format. Thus,
in cases where a certain ID is not recognized by Map Viewer, then a
search at NCBI Gene, and the use of its horizontal browser or the
link to Map Viewer is one solution to access genomic locus
information.
4.6 Viral Genomes
The NCBI Genomes database contains genomic information from
over 4500 viruses, bacteriophages, viroids, archaeal phages, and
virophages, with a dedicated search portal, NCBI Viral Genomes
[59]. The home page allows for a NCBI Genomes search, which
can be initiated using a virus name or RefSeq ID as a query.
Searches can also be incorporated into a URL, thus:
http://www.ncbi.nlm.nih.gov/genome/?term=PhiX174
http://www.ncbi.nlm.nih.gov/genome/?term=NC_001422
The results page shows a list of virus hits; clicking on one shows
a summary of genomic information for that virus, including an
overview of the whole genome. Clicking on a link under either
the RefSeq or INSDC headings opens a page showing collected
information about the genome, including translated open reading
frames, and the whole genome sequence. Alternatively, clicking on
the “Gene” link under the “Related Information” heading to the
right of the page lists the genes, from the NCBI Gene database, that
are encoded by the viral genome. Clicking on a gene name link
opens the NCBI Gene page; under the “Genomic regions, tran-
scripts, and products” section, the whole viral genome can be
inspected in the interactive genome browser. From here, the
sequence of the ROI can be exported in FASTA or GenBank
formats, or the graphical output exported as a PDF file.
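The URL-based search described in this section can be generated for any virus name or RefSeq ID with a short Python sketch. The address pattern is the one shown earlier in this section; the function name is hypothetical, and encoding spaces in multi-word names as "+" is an assumption about how the query string is best formed.

```python
from urllib.parse import urlencode

# Address pattern taken from the example URLs in this section.
NCBI_GENOME = "http://www.ncbi.nlm.nih.gov/genome/"


def genome_search_url(term: str) -> str:
    """Build an NCBI Genomes search address for a virus name or RefSeq ID."""
    return f"{NCBI_GENOME}?{urlencode({'term': term})}"


if __name__ == "__main__":
    for term in ["PhiX174", "NC_001422", "Tobacco mosaic virus"]:
        print(genome_search_url(term))
```

Using `urlencode` rather than plain string concatenation ensures that spaces and other reserved characters in a virus name do not break the address.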
4.7 Stand-Alone Genome Browsers
The genome browsers described above are Web-based, and allow
both retrieval and display of genomic data. Stand-alone genome
browser applications, not being restricted to Web-browser func-
tionality and speed, are generally more flexible and responsive.
However, a time-consuming step is the import to local storage of
a whole reference genome’s worth of data, which can occupy more
than a gigabyte. The strong point of these tools is,
rather than genomic database searching, their advanced abilities
to display and integrate multiple datasets, including the user’s
own quantitative experimental data, relative to a reference genome.
Several such browsers are available (and in some cases functionally
overlap with software described in Subheading 11); three popular
ones are described here.
The Integrated Genome Browser (IGB, http://bioviz.org/igb/)
[60] is a desktop application that allows the visualization of
genes, genomic features and experimental data aligned to reference
genomes. Genomic sequences can be loaded in to the program
from a remote server, and a certain amount of genomic querying
is also possible. IGB provides an interactive environment for the
exploration of genomic data, with a powerful and user-friendly
GUI. It is very flexible regarding options for data upload, and can
handle multiple sets of quantitative data, which can be visualized in
parallel and customized in terms of their visual aspect, to generate
high-quality figures.
The Integrative Genomics Viewer (IGV, https://www.broadinstitute.org/igv/)
[61] is a desktop application whose primary
5.1 Messenger RNAs (mRNAs)
For several decades, sequences of protein-coding mRNAs have
been gathered in public nucleotide databases. The repositories
GenBank, ENA, and DDBJ collaborate and exchange sequence
data within the International Nucleotide Sequence Database Col-
laboration (INSDC), forming a collective resource commonly
referred to as DDBJ/EMBL/GenBank [66]. In addition, mRNA
entries are present in the RefSeq, Ensembl and Vega databases.
Database entries may comprise mRNAs whose existence is confirmed
experimentally, as well as transcripts predicted by computational
analysis of genome sequences. For most genes, several mRNA
species (database entries) correspond to a single genomic locus,
due to the presence of alternative transcriptional start and stop sites,
and the effects of alternative splicing.
Fortunately, due to data sharing and integration efforts, mRNA IDs from any of the databases mentioned above are recognized by the four main genome browsers (although Map Viewer seems not to recognize Vega IDs). Therefore, genomic database searching from standard mRNA IDs can be performed using the user’s favorite genome browser, as described in the previous sections.
5.2 MicroRNAs (miRNAs)
miRNAs are 19–25 nucleotide single-stranded ncRNAs that bind to 3′-untranslated regions (3′-UTRs) of target mRNAs, destabilizing them or inhibiting their translation [67]. miRNA genes encode
primary transcripts known as pri-miRNAs, which are processed
into short stem-loop structures called pre-miRNAs, then modified
into mature single-stranded miRNAs.
The nomenclature for miRNAs is complicated by the presence
of colloquial and official names and IDs for genes and miRNAs at
different stages of processing, as well as database accession codes.
As many of these terms are unrecognized or misinterpreted by
genome browser databases, the most effective means of accessing
reliable information is from the expert database miRBase [68],
which recognizes all miRNA IDs, and documents the relationships
between them.
To search miRBase from the home page, paste an ID into the
search box at the top right, and click “Search”. The Search Results
page lists the number of hits for each class of miRNA. Clicking on
an entry under the Accession or ID columns opens a page of
information. If the query is a mature miRNA, then the user should
click the entry following the “Stem-loop” heading to open the
corresponding page. In the Stem Loop Sequence page, under the
“Genome context” heading, a link is provided showing the geno-
mic coordinates; clicking this opens an Ensembl genome browser
window at the locus corresponding to the miRNA of interest.
5.3 Piwi-Interacting RNAs (piRNAs)
piRNAs are a class of small ncRNAs, 26–32 bases in length, that bind to Piwi and other proteins of the Argonaute family. Roles for piRNAs have been found in transposon silencing, and
5.4 Long Noncoding RNAs (lncRNAs)
lncRNAs are a class of ncRNA ranging in length from 200 bases to more than 15 kb. lncRNAs may be transcribed antisense to protein-coding genes, within introns, overlapping genes, or outside of genes, and are believed to have roles in the regulation of gene expression, especially during development [75, 76]. In terms of standardized gene nomenclature, lncRNA gene symbols consist of function-based abbreviations, existing protein-coding gene symbols with the suffixes -AS, -IT, or -OT, or symbols starting with LINC (Long Intergenic Non-protein Coding RNA) [77].
The efforts of the GENCODE consortium to include lncRNAs in their comprehensive gene set mean that these entries are present and searchable from Ensembl and the UCSC Genome
Genomic Database Searching 245
8.1 Sequence-Based Searches
Several algorithms exist to search a database for entries that exhibit homology to a given sequence of interest (the “query sequence”); the best known is probably BLAST (Basic Local Alignment Search Tool) [81]. This can be applied to whole genomic database searching; however, the alternative algorithm BLAT (BLAST-Like Alignment Tool) [82] is faster, and has been adopted as the default
Table 2
Software utilities for searching genomes using sequences, motifs and matrices
results page that tabulates the hits: their genomic locations, over-
lapping genes, and alignment and percentage identity scores, with
links to view the alignment. Each Genomic Location entry is a link
that opens the genome browser, showing the query matched to the
relevant genome. An alternative means of accessing these hits is
provided by a karyotype diagram below, which shows matching
regions highlighted by red boxes and arrows; clicking an arrow
opens a pop-up box containing information and links about the
matches.
Within the UCSC Genome Browser, searching genomic data-
bases with sequences using BLAT is accessible via the menu bar at
the top of the page, under >Tools >Blat. In the BLAT Search
Genome page, choose the species and genome assembly, and
paste in the query sequence (DNA or protein). Multiple sequence
queries can be entered in a multi-sequence FASTA format. The
search is initiated by clicking “submit”. When the query is a single
sequence that the user is confident will map to a unique genomic
locus, one option is the “I’m feeling lucky” button, which bypasses
the results page, going straight to a browser showing the best
match from the BLAT search. Following a standard submission,
the BLAT Search Results page lists the set of matches found for
each query, showing the genomic coordinates of the match and the
percentage identity. In addition, there are two hyperlinks: “details” opens a page showing the alignment of bases for that match, and “browser” opens the genome browser, where the query sequence appears as a blue track alongside the genome, annotated as “Your Sequence from Blat Search” and labeled on the left with the FASTA heading or “YourSeq”.
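The same kind of query can in principle be prepared from a script. The following is a minimal sketch, not a definitive client: it formats several queries as multi-sequence FASTA and builds a BLAT query URL without sending it. The endpoint path and the parameter names (userSeq, type, db, output) are assumptions about the UCSC web service and should be checked against the current UCSC documentation; the sequences and their names are arbitrary examples.

```python
from urllib.parse import urlencode

# Assumed endpoint and parameter names for the UCSC web BLAT service;
# verify against the current UCSC documentation before relying on them.
HGBLAT_URL = "https://genome.ucsc.edu/cgi-bin/hgBlat"


def to_multi_fasta(sequences):
    """Format a {name: sequence} mapping as multi-sequence FASTA text."""
    return "\n".join(f">{name}\n{seq}" for name, seq in sequences.items())


def blat_query_url(sequence, db="hg38", query_type="DNA"):
    """Build (but do not send) a BLAT query URL for a single sequence."""
    params = {"userSeq": sequence, "type": query_type, "db": db, "output": "json"}
    return HGBLAT_URL + "?" + urlencode(params)


# Arbitrary example sequences, used only to show the formatting
queries = {
    "query1": "GACCTCGGCGTGGCCTAGCG",
    "query2": "ACGTACGTACGTACGTACGT",
}
fasta_text = to_multi_fasta(queries)      # text suitable for pasting into the query box
url = blat_query_url(queries["query1"])   # URL for a single-sequence search
print(fasta_text)
print(url)
```

Building the URL separately from sending it makes it easy to inspect the request first, and to respect any usage limits the server imposes on automated queries.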
Searching the NCBI’s collection of genomic databases using DNA or protein sequence-based queries is performed from the BLAST page (http://blast.ncbi.nlm.nih.gov/Blast.cgi). More search parameters are available here than on the Ensembl or UCSC sites, but BLAT is not offered. Under the “BLAST Assembled Genomes” heading, first choose the organism by entering a species name or taxonomic ID, or by clicking a link to the right. In the relevant species page, choose the type of BLAST search by selecting a tab across the top (blastn, blastp, blastx, tblastn, or tblastx; for a standard nucleotide-based search, choose blastn). Under “Enter Query Sequence”, paste in the sequence(s) to be searched. Multiple sequence queries can be entered in a multi-sequence FASTA format. The database selection can be left in its default state (“Genome...”). Advanced search options are available by clicking “Algorithm parameters”, but for unique-sequence searches this is usually not necessary. Clicking “BLAST” runs the search, and upon completion a results page shows details of the query-genome matches identified. Where more than one query sequence
8.2 Motif-Based Searches
The specificity of many factors that interact with and modulate the genome derives from their recognition of short stretches of DNA sequence, termed motifs. The specific recognition of DNA at motifs drives and regulates many fundamental nuclear processes, notably gene expression [83], and has been studied mechanistically in much detail [84, 85]. Sequence motifs may also be used to predict the formation of unusual DNA secondary structures [86].
A motif may comprise a single exact sequence; for example, the restriction endonuclease EcoRI cuts at the palindromic motif G↓AATTC, where ↓ indicates the cleavage site [87]. Alternatively, a motif may contain some degeneracy (variability in the bases acceptable at certain positions); such is the case for the transcription factor p53 [88]. Some motifs are even more flexible, involving repeated segments and variable-length gaps, as is the case for those predicting G-quadruplex structures, found in telomeric regions and 5′ to the majority of metazoan origins of replication [89–91].
So how can motifs, including those containing degeneracy and variable gaps, be expressed and used for genomic database searching? One means is the Regular Expression (RegEx), a standardized format for representing text patterns, popular in the computing world (for more information, see http://www.regular-expressions.info). When applied to DNA sequences, the four bases
appear simply as the letters A, C, G and T, a set of letters within
square brackets represents the possible bases at a certain position,
and numbers in curly brackets refer to the number of times that part
of the pattern appears. Taking G-quadruplexes as an example, a
simplified motif for sequences likely to form these structures can be
represented using standard biochemical nomenclature like this
(where N represents any of the four standard DNA bases A, C, G
or T):
G₃-N₁₋₇-G₃-N₁₋₇-G₃-N₁₋₇-G₃
Using RegEx nomenclature, this same pattern can be repre-
sented thus:
G{3}[ACGT]{1,7}G{3}[ACGT]{1,7}G{3}[ACGT]{1,7}G{3}
where [ACGT] represents “any standard DNA base”, {3} means
“exactly three times”, and {1,7} means “between one and seven
times”.
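As a minimal sketch of how such a pattern might be applied, the following Python example uses the standard re module to scan a sequence with the simplified G-quadruplex RegEx given above. The function name and the test sequence are hypothetical and purely illustrative; only the given strand is searched, and matches are reported without overlaps.

```python
import re

# Simplified G-quadruplex motif from the text, in RegEx form
G4_PATTERN = re.compile(r"G{3}[ACGT]{1,7}G{3}[ACGT]{1,7}G{3}[ACGT]{1,7}G{3}")


def find_g4_motifs(sequence):
    """Return (start, end, matched_text) for each non-overlapping match.

    Coordinates are 0-based and end-exclusive, on the given strand only.
    """
    seq = sequence.upper()  # tolerate soft-masked (lower-case) sequence
    return [(m.start(), m.end(), m.group()) for m in G4_PATTERN.finditer(seq)]


# Hypothetical test sequence (not taken from any genome)
example = "ttGGGTTAGGGTTAGGGTTAGGGaa"
hits = find_g4_motifs(example)
print(hits)
```

To cover the complementary strand, the same scan could be repeated with the C-based equivalent of the pattern, or on the reverse complement of the sequence.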
8.3 Matrix-Based Searches
Genome searching using RegEx-based motifs allows for degeneracy at certain positions, but cannot take into account preferences for certain bases over others at each position. This information can be expressed by a Position Weight
Matrix (PWM), also known as a Position-Specific Scoring Matrix
(PSSM). Here, typically experimental data are used to generate a
table (matrix) that gives a score for the likelihood of occurrence of
each base at each position within the motif. A genomic sequence is
then searched using the matrix, the quality of each motif match
being assessed and given a score [96].
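The principle can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scoring scheme of any particular program: it converts base counts to log2-odds scores using a pseudocount and a uniform background (both choices are assumptions), then slides the matrix along a sequence. The counts, the test sequence and the threshold are hypothetical.

```python
import math

BASES = "ACGT"


def build_log_odds_pwm(count_columns, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores.

    count_columns: one dict per motif position, mapping base -> observed count.
    The pseudocount and uniform background are illustrative assumptions.
    """
    pwm = []
    for counts in count_columns:
        total = sum(counts.get(b, 0) for b in BASES) + pseudocount * len(BASES)
        column = {}
        for b in BASES:
            freq = (counts.get(b, 0) + pseudocount) / total
            column[b] = math.log2(freq / background)
        pwm.append(column)
    return pwm


def scan_sequence(sequence, pwm, threshold):
    """Score every window; return (start, score) where score >= threshold."""
    seq = sequence.upper()
    width = len(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, round(score, 3)))
    return hits


# Hypothetical counts for a four-position motif resembling ACGT
counts = [
    {"A": 8, "C": 0, "G": 1, "T": 1},
    {"A": 0, "C": 9, "G": 0, "T": 1},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 1, "C": 1, "G": 0, "T": 8},
]
pwm = build_log_odds_pwm(counts)
hits = scan_sequence("TTACGTAACGA", pwm, threshold=4.0)
print(hits)
```

Unlike a RegEx match, which is all-or-nothing, each window receives a graded score, so the threshold controls the trade-off between sensitivity and specificity.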
Software routines that can perform PWM-based genome
searches include the MATCH algorithm [97], which works
together with the TRANSFAC database of transcription factors,
their binding sites, and nucleotide distributions [98]. Multiple
MATCH searches can be launched from and managed by the
stand-alone program ModuleMaster [99]. The MEME suite con-
tains programs that can run matrix-based sequence searches,
including FIMO, plus GLAM2Scan, which accepts gapped motifs.
The RSAT suite contains a program called Matrix-Scan [100], which performs a similar function.
9.1 Creating a Hyperlinked ID Table
Where a researcher has a list of gene names or molecule IDs in a table, they may wish to perform genomic database searches with
these IDs using one or more genome browsers. A useful approach
to this is to create a series of Web links beside each ID, allowing
one-click direct searching within the browsers. All major spread-
sheet software packages incorporate a HYPERLINK function; the
following is a method that employs this to generate links referring
to IDs within the spreadsheet, creating a hyperlinked ID table
(Fig. 2).
1. In spreadsheet software—open the file containing the list of IDs.
For RefSeq IDs (e.g., NM_000546.5), the decimal points and
version-number suffixes should be removed, as these can cause
recognition problems, notably in Ensembl. This can be per-
formed in the spreadsheet software by selecting the relevant
IDs, then using Find and Replace to convert “.*” (without the
quotes) to “” (nothing). For the purposes of this exercise, the
top-most ID is in cell A2. If the column to the right of the IDs
is not blank, insert an additional column there.
2. Obtain the URL—identify the direct-search URL of the
genome browser you wish to use (e.g., from Notes 1–7).
Copy the URL, and paste it into a text editor.
3. Re-format the URL to create a HYPERLINK function—
replacing the part of the URL specific to the ID by a cell
reference (here, A2), along the lines of this example:
This URL: http://www.ensembl.org/human/Location/View?g=NM_000546
rearranges to:
=HYPERLINK("http://www.ensembl.org/human/Location/View?g="&A2,"Ensembl")
Note: ,"Ensembl" (with a comma) in Excel, but ;"Ensembl" (with a semicolon) in Calc.
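For long ID lists, the same formulas can be generated by a short script rather than typed by hand. The sketch below, written in Python as one possible approach, strips RefSeq version suffixes and builds one HYPERLINK formula per row, starting at cell A2 as in the method above. The function names are hypothetical, and the IDs are examples only.

```python
import re

# URL prefix quoted in the method above
ENSEMBL_URL = "http://www.ensembl.org/human/Location/View?g="


def strip_version(identifier):
    """Remove a trailing '.N' version suffix, e.g. NM_000546.5 -> NM_000546."""
    return re.sub(r"\.\d+$", "", identifier.strip())


def hyperlink_formula(row, label="Ensembl", base_url=ENSEMBL_URL, separator=","):
    """Build the spreadsheet formula referring to the ID in column A of `row`.

    separator is ',' for Excel and ';' for Calc, as noted in the text.
    """
    return f'=HYPERLINK("{base_url}"&A{row}{separator}"{label}")'


ids = ["NM_000546.5", "NM_007294.3"]  # example RefSeq IDs
clean_ids = [strip_version(i) for i in ids]
# The top-most ID is in cell A2, so formulas start at row 2
formulas = [hyperlink_formula(row) for row in range(2, 2 + len(clean_ids))]
for clean_id, formula in zip(clean_ids, formulas):
    print(clean_id, formula, sep="\t")
```

The tab-separated output can be pasted into the spreadsheet, with the cleaned IDs in column A and the formulas in the column to their right.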
9.2 Creating an Annotated ID Table
Also starting with a table of IDs, a researcher may wish to supplement the list with relevant data such as gene names, genomic
coordinates, and even brief functional descriptions, to generate an
annotated ID table (Fig. 2). One method for performing this, using
the BioMart facility of Ensembl [101], is described here.
1. In spreadsheet software—open the file containing the list of IDs.
For RefSeq IDs, decimal points and version-number suffixes
must be removed, as described in Subheading 9.1. Select the
IDs, and copy them to the clipboard.
2. Go to Ensembl BioMart—in a Web browser, go to http://www.ensembl.org/biomart/.
3. Inputting the IDs—from the “CHOOSE DATABASE” pull-
down menu choose “Ensembl Genes”. From the “CHOOSE
DATASET” menu choose the relevant organism and genome,
e.g., “Homo sapiens genes (GRCh38.p3)”. Click “Filters” on
the left, then “[+]” next to “GENE:” to expand the options.
Check the box next to “Input external references ID list limit
[Max 500 advised]”. Choose the relevant type of ID from the
pull-down menu; an example is shown for each ID type, for
example “RefSeq mRNA ID(s) [e.g., NM_001195597]”. In
the box under the ID type, paste in the list of IDs.
4. Selecting output options—click “Attributes” on the left, then
“[+]” next to “GENE:” to expand the options. Ensembl Gene
ID and Ensembl Transcript ID are usually selected by default;
uncheck “Ensembl Transcript ID” to restrict the output to
genes. Select additional output parameters by checking the
relevant boxes, for example the following (in this order): Asso-
ciated Gene Name, Chromosome Name, Gene Start (bp),
Gene End (bp), Strand, Band, Description.
5. Generating and exporting the results table—click the “Results”
button on the top left of the page. An HTML table appears
listing the Ensembl genes corresponding to the input IDs, and
further attributes. At “Export all results to” choose “File”, and
“XLS” from the menus. Check the box for “Unique results
only”, then click “Go”. BioMart will export an Excel spreadsheet file for download, containing the results table with attributes as hyperlinks.
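BioMart queries of this kind can also be written as an XML document and submitted to the BioMart web service. The following Python sketch builds such a query for the filter and attributes chosen in the steps above, without sending it. The service path (martservice) and the internal dataset, filter and attribute names are assumptions that may differ between Ensembl releases, and should be checked against the BioMart installation being used.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Assumed service location, based on the BioMart URL given in step 2
MART_URL = "http://www.ensembl.org/biomart/martservice"

# Assumed internal names for the attributes selected in step 4
ATTRIBUTES = (
    "ensembl_gene_id",
    "external_gene_name",
    "chromosome_name",
    "start_position",
    "end_position",
    "strand",
    "band",
    "description",
)


def build_biomart_query(refseq_ids, dataset="hsapiens_gene_ensembl"):
    """Return a BioMart XML query filtering on RefSeq mRNA IDs."""
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",  # corresponds to "Unique results only"
    })
    ds = ET.SubElement(query, "Dataset", {"name": dataset, "interface": "default"})
    ET.SubElement(ds, "Filter", {"name": "refseq_mrna", "value": ",".join(refseq_ids)})
    for name in ATTRIBUTES:
        ET.SubElement(ds, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


def build_request_url(xml_query):
    """Build (but do not send) the request URL carrying the XML query."""
    return MART_URL + "?" + urlencode({"query": xml_query})


xml_query = build_biomart_query(["NM_000546", "NM_007294"])  # example IDs
request_url = build_request_url(xml_query)
print(xml_query)
```

The XML for a query configured in the Web interface can usually be displayed from the BioMart page itself, which is a convenient way to confirm the internal names before scripting a query.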
[Fig. 2 panel label: Annotated ID Table. Additional columns provide further information about each molecule: gene codes and names, genomic coordinates, and brief descriptions.]
Fig. 2 Two approaches to genomic data retrieval from a list of identifiers. Starting with a spreadsheet table
containing DNA, RNA, or protein identifiers (ID Table; shown here are RefSeq transcript IDs), two methods are
described for obtaining relevant genomic information. The Hyperlinked ID Table (method in Subheading 9.1)
contains multiple Web links, providing one-click access to relevant entries within multiple genome browsers.
The Annotated ID Table (method in Subheading 9.2) contains gene names and genomic locus information for
each entry, in this case obtained from Ensembl
9.3 Batch Retrieval of Sequences from Multiple Genomic Coordinates
A researcher may have a set of genomic coordinates, for example as an output from a bioinformatic program, and may wish to obtain the corresponding DNA sequences. The following is a method for performing this, using the UCSC Genome Browser’s Table Browser tool.
11.1 UCSC Table Browser
As shown in the example in Subheading 9.3, the UCSC Genome Browser contains a Table Browser tool that provides a powerful and
flexible means of performing genomic database searches. Multiple genomic queries can be performed: for example, starting with a set of gene IDs, the corresponding genomic coordinates and transcription start and end sites can rapidly be obtained, as indeed can any of the data stored in the UCSC Genome Browser. Flexible options are offered for formatting the results; for example, exons and introns can be listed separately, or appear in upper or lower case as desired.
The output can be as FASTA lists, or table formats suitable for
other applications and spreadsheets.
Options exist to perform more complex queries. Filters can be
applied to searches, for example to restrict the output to matches
from a certain chromosome or set of genomic regions. Tables can
be stored, and pairs of tables combined through a union or inter-
section (for example, to show which genes are present in both
tables, or in one table but not the other), generating a single
output. For further information on this functionality see the tutorial publication [123], and the Table Browser User’s Guide: http://genome.ucsc.edu/goldenPath/help/hgTablesHelp.html.
Table 3
Application Programming Interface (API) information
12.1 Bio* toolkits
For each of the major programming languages used in bioinformatics, over the past two decades large sets of open-source routines
for performing operations including genomic database searches
have been developed by the scientific community and released for
public use. Starting with BioPerl, a family of routine sets now
includes Biopython, BioRuby, BioJava, and others. These are col-
lectively known as Bio* toolkits [128], and are supported by the
Open Bioinformatics Foundation. Each of these toolkits provides
routines for searching Internet-based genomic databases, plus
methods for interconversion, comparison, and analysis of the
outputs.
12.2 Ensembl
The main API for Ensembl is based on the Perl language [129], and
depends in part on BioPerl. The API allows searching of the
Ensembl database, and retrieval of genomic segments, genes,
12.3 The UCSC Genome Browser
The UCSC Genome Browser offers a set of utilities known as the Kent Source Tree, based on the C programming language. This
comprises nearly 300 command-line applications for Linux and
UNIX platforms, providing a wide range of functionalities inter-
acting with the UCSC Genome Browser database, including data
retrieval, genomic searching using BLAT and pattern-finding rou-
tines. Additionally, a MySQL database of genomic data is main-
tained, allowing queries to be performed via MySQL commands.
Complementing this, but created independently, is an API to the
UCSC Genome Browser written in and for the Ruby language
[131]. Taking the form of a BioRuby plugin, this API allows access
to the main genomic databases, using a dynamic framework to cope
with the complex tables in which the UCSC Genome Browser
holds its data.
12.4 NCBI Genome Resources
The NCBI offers access to their collection of databases via the Entrez Programming Utilities (E-Utilities), a suite of nine programs that support a uniform set of parameters used to search, link
and download data [45, 132]. The functionalities of E-Utilities can
be accessed from any programming language capable of handling
the HTTP protocol; query results are typically returned in the
extensible markup language (XML) format. E-Utility commands
can be linked together within a script to generate an analysis pipe-
line or application. To help with this, the NCBI’s Ebot tool guides
the user step by step in generating a Perl script implementing an E-
Utility pipeline. An additional recent facility is Entrez Direct
(EDirect) [45, 133], an advanced method for accessing the
NCBI’s set of interconnected databases using UNIX terminal
command-line arguments, which can be combined to build multi-
step queries.
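Because E-Utilities are accessed over HTTP and typically return XML, a client needs little more than a URL builder and an XML parser. The following Python sketch shows both for the ESearch utility. The base URL and parameter names reflect the E-Utilities interface as commonly documented, but should be verified against the current NCBI documentation; the XML fragment is a hand-written illustration with placeholder UIDs, not real server output, and no request is actually sent.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Base URL of the E-Utilities service (verify against NCBI documentation)
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=20):
    """Build (but do not send) an ESearch request URL."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def parse_esearch_ids(xml_text):
    """Extract the list of UIDs from an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


url = esearch_url("nucleotide", "NM_000546[Accession]", retmax=5)

# Hand-written illustration of the response layout; the UIDs are placeholders
sample_response = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>111</Id><Id>222</Id></IdList>"
    "</eSearchResult>"
)
ids = parse_esearch_ids(sample_response)
print(url)
print(ids)
```

The UIDs returned by a search of this kind would typically be passed on to further utilities for linking or downloading records, which is how multi-step pipelines are assembled.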
means to search them, and for genomic databases and browsers this
crucial aspect has been greatly enhanced by extensive documenta-
tion, open-access publications, online tutorials, webinars, public
events, and the like that accompany them. The usability of
resources is continuing to improve thanks to the work of curators
and developers, and their willingness to incorporate features
through user feedback and suggestions.
Among the wider community of molecular biologists and
bioinformaticians there is a notable spirit of collegiality and mutual
support, for example through the Biostars question and answer
website (http://www.biostars.org/) [135]. For genomic database
searching, this spirit of openness, information sharing and assis-
tance has proven to be crucial for clarity of navigation in an other-
wise bewilderingly complex terrain, and hopefully will continue in
the future.
14 Notes
Acknowledgements
References
1. Sanger F, Air GM, Barrell BG et al (1977) Nucleotide sequence of bacteriophage phi X174 DNA. Nature 265:687–695
2. Fleischmann RD, Adams MD, White O et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512
3. Johnston M (1996) The complete code for a eukaryotic cell. Genome sequencing. Curr Biol 6:500–503
4. C. elegans Sequencing Consortium (1998) Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282:2012–2018
5. Lander ES, Linton LM, Birren B et al (2001) Initial sequencing and analysis of the human genome. Nature 409:860–921
6. Venter JC, Adams MD, Myers EW et al (2001) The sequence of the human genome. Science 291:1304–1351
7. IHGSC (2004) Finishing the euchromatic sequence of the human genome. Nature 431:931–945
8. Reddy TB, Thomas AD, Stamatis D et al (2015) The Genomes OnLine Database (GOLD) v. 5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 43:D1099–D1106
9. Warren WC, Hillier LW, Marshall Graves JA et al (2008) Genome analysis of the platypus reveals unique signatures of evolution. Nature 453:175–183
10. Amemiya CT, Alfoldi J, Lee AP et al (2013) The African coelacanth genome provides insights into tetrapod evolution. Nature 496:311–316
11. Prüfer K, Racimo F, Patterson N et al (2014) The complete genome sequence of a Neanderthal from the Altai Mountains. Nature 505:43–49
12. King TE, Fortes GG, Balaresque P et al (2014) Identification of the remains of King Richard III. Nat Commun 5:5631
13. Abecasis GR, Altshuler D, Auton A et al (2010) A map of human genome variation from population-scale sequencing. Nature 467:1061–1073
14. Abecasis GR, Auton A, Brooks LD et al (2012) An integrated map of genetic variation from 1,092 human genomes. Nature 491:56–65
15. Torjesen I (2013) Genomes of 100,000 people will be sequenced to create an open access research resource. BMJ 347:f6690
16. Baslan T, Hicks J (2014) Single cell sequencing approaches for complex biological systems. Curr Opin Genet Dev 26C:59–65
17. Liang J, Cai W, Sun Z (2014) Single-cell sequencing technologies: current and future. J Genet Genomics = Yi Chuan Xue Bao 41:513–528
18. Dykes CW (1996) Genes, disease and medicine. Br J Clin Pharmacol 42:683–695
19. Chan IS, Ginsburg GS (2011) Personalized medicine: progress and promise. Annu Rev Genomics Hum Genet 12:217–244
20. Bauer DC, Gaff C, Dinger ME et al (2014) Genomics and personalised whole-of-life healthcare. Trends Mol Med 20(9):479–486
21. Check Hayden E (2010) Human genome at ten: life is complicated. Nature 464:664–667
22. Dulbecco R (1986) A turning point in cancer research: sequencing the human genome. Science 231:1055–1056
23. International Cancer Genome Consortium, Hudson TJ, Anderson W et al (2010) International network of cancer genome projects. Nature 464:993–998
24. Alexandrov LB, Stratton MR (2014) Mutational signatures: the patterns of somatic mutations hidden in cancer genomes. Curr Opin Genet Dev 24C:52–60
25. Hoffman MM, Ernst J, Wilder SP et al (2013) Integrative annotation of chromatin elements from ENCODE data. Nucleic Acids Res 41:827–841
26. modEncode Consortium, Roy S, Ernst J et al (2010) Identification of functional elements and regulatory circuits by Drosophila modENCODE. Science 330:1787–1797
27. Gerstein MB, Lu ZJ, Van Nostrand EL et al (2010) Integrative analysis of the Caenorhabditis elegans genome by the modENCODE project. Science 330:1775–1787
28. Harrow J, Frankish A, Gonzalez JM et al (2012) GENCODE: the reference human genome annotation for The ENCODE Project. Genome Res 22:1760–1774
29. Almouzni G, Altucci L, Amati B et al (2014) Relationship between genome and epigenome—challenges and requirements for future research. BMC Genomics 15:487
30. Hériché JK (2014) Systematic cell phenotyping. In: Hancock JM (ed) Phenomics. CRC Press, Boca Raton, FL, pp 86–110
31. Hutchins JRA (2014) What’s that gene (or protein)? Online resources for exploring functions of genes, transcripts, and proteins. Mol Biol Cell 25:1187–1201
32. Schmidt A, Forne I, Imhof A (2014) Bioinformatic analysis of proteomics data. BMC Syst Biol 8(Suppl 2):S3
33. Kaiser J (2005) Genomics. Celera to end subscriptions and give data to public GenBank. Science 308:775
34. Church DM, Schneider VA, Graves T et al (2011) Modernizing reference genome assemblies. PLoS Biol 9:e1001091
35. Gerstein MB, Bruce C, Rozowsky JS et al (2007) What is a gene, post-ENCODE? History and updated definition. Genome Res 17:669–681
36. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94
37. Thierry-Mieg D, Thierry-Mieg J (2006) AceView: a comprehensive cDNA-supported gene and transcripts annotation. Genome Biol 7(Suppl 1):S12.1–S12.14
38. MGC Project Team, Temple G, Gerhard DS et al (2009) The completion of the Mammalian Gene Collection (MGC). Genome Res 19:2324–2333
39. Farrell CM, O’Leary NA, Harte RA et al (2014) Current status and new features of the Consensus Coding Sequence database. Nucleic Acids Res 42:D865–D872
40. Cunningham F, Amode MR, Barrell D et al (2015) Ensembl 2015. Nucleic Acids Res 43:D662–D669
41. Pruitt KD, Brown GR, Hiatt SM et al (2014) RefSeq: an update on mammalian reference sequences. Nucleic Acids Res 42:D756–D763
42. Harrow JL, Steward CA, Frankish A et al (2014) The Vertebrate Genome Annotation browser 10 years on. Nucleic Acids Res 42:D771–D779
43. Frankish A, Uszczynska B, Ritchie GR et al (2015) Comparison of GENCODE and RefSeq gene annotation and the impact of reference geneset on variant effect prediction. BMC Genomics 16(Suppl 8):S2
44. Kersey PJ, Allen JE, Christensen M et al (2014) Ensembl Genomes 2013: scaling up access to genome-wide data. Nucleic Acids Res 42:D546–D552
45. NCBI Resource Coordinators (2015) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 43:D6–D17
46. Gray KA, Yates B, Seal RL et al (2015) Genenames.org: the HGNC resources in 2015. Nucleic Acids Res 43:D1079–D1085
47. dos Santos G, Schroeder AJ, Goodman JL et al (2015) FlyBase: introduction of the Drosophila melanogaster Release 6 reference genome assembly and large-scale migration of genome annotations. Nucleic Acids Res 43:D690–D697
48. Silvester N, Alako B, Amid C et al (2015) Content discovery and retrieval services at the European Nucleotide Archive. Nucleic Acids Res 43:D23–D29
49. Kodama Y, Mashima J, Kosuge T et al (2015) The DDBJ Japanese Genotype-phenotype Archive for genetic and phenotypic human data. Nucleic Acids Res 43:D18–D22
50. UniProt Consortium (2015) UniProt: a hub for protein information. Nucleic Acids Res 43:D204–D212
51. Rosenbloom KR, Armstrong J, Barber GP et al (2015) The UCSC Genome Browser database: 2015 update. Nucleic Acids Res 43:D670–D681
52. Hsu F, Kent WJ, Clawson H et al (2006) The UCSC known genes. Bioinformatics 22:1036–1046
53. Nawrocki EP, Burge SW, Bateman A et al (2015) Rfam 12.0: updates to the RNA families database. Nucleic Acids Res 43:D130–D137
54. Chan PP, Lowe TM (2009) GtRNAdb: a database of transfer RNA genes detected in genomic sequence. Nucleic Acids Res 37:D93–D97
55. Punta M, Coggill PC, Eberhardt RY et al (2012) The Pfam protein families database. Nucleic Acids Res 40:D290–D301
56. Tatusova T (2010) Genomic databases and resources at the National Center for Biotechnology Information. Methods Mol Biol 609:17–44
57. Wolfsberg TG (2011) Using the NCBI Map Viewer to browse genomic sequence data. Curr Protoc Hum Genet. Chapter 18. Unit 18.15
58. Brown GR, Hem V, Katz KS et al (2015) Gene: a gene-centered information resource at NCBI. Nucleic Acids Res 43:D36–D42
59. Brister JR, Ako-Adjei D, Bao Y et al (2015) NCBI viral genomes resource. Nucleic Acids Res 43:D571–D577
60. Nicol JW, Helt GA, Blanchard SG Jr et al (2009) The Integrated Genome Browser: free software for distribution and exploration of genome-scale datasets. Bioinformatics 25:2730–2731
61. Thorvaldsdottir H, Robinson JT, Mesirov JP (2013) Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Brief Bioinform 14:178–192
62. Fiume M, Smith EJ, Brook A et al (2012) Savant Genome Browser 2: visualization and analysis for population-scale genomics. Nucleic Acids Res 40:W615–W621
63. Wright MW, Bruford EA (2011) Naming ‘junk’: human non-protein coding RNA (ncRNA) gene nomenclature. Hum Genomics 5:90–98
64. Agirre E, Eyras E (2011) Databases and resources for human small non-coding RNAs. Hum Genomics 5:192–199
65. The RNAcentral Consortium (2015) RNAcentral: an international database of ncRNA sequences. Nucleic Acids Res 43:D123–D129
66. Nakamura Y, Cochrane G, Karsch-Mizrachi I (2013) The International Nucleotide Sequence Database Collaboration. Nucleic Acids Res 41:D21–D24
67. Ameres SL, Zamore PD (2013) Diversifying microRNA sequence and function. Nat Rev Mol Cell Biol 14:475–488
68. Kozomara A, Griffiths-Jones S (2014) miRBase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res 42:D68–D73
69. Mani SR, Juliano CE (2013) Untangling the web: the diverse functions of the PIWI/piRNA pathway. Mol Reprod Dev 80:632–664
70. Peng JC, Lin H (2013) Beyond transposons: the epigenetic and somatic functions of the Piwi-piRNA mechanism. Curr Opin Cell Biol 25:190–194
71. Sai Lakshmi S, Agrawal S (2008) piRNABank: a web resource on classified and clustered Piwi-interacting RNAs. Nucleic Acids Res 36:D173–D177
72. Zhang P, Si X, Skogerbo G et al (2014) piRBase: a web resource assisting piRNA functional study. Database (Oxford) 2014:bau110
73. Sarkar A, Maji RK, Saha S et al (2014) piRNAQuest: searching the piRNAome for silencers. BMC Genomics 15:555
74. Skinner ME, Uzilov AV, Stein LD et al (2009) JBrowse: a next-generation genome browser. Genome Res 19:1630–1638
75. Kung JT, Colognori D, Lee JT (2013) Long noncoding RNAs: past, present, and future. Genetics 193:651–669
76. Bonasio R, Shiekhattar R (2014) Regulation of transcription by long noncoding RNAs. Annu Rev Genet 48:433–455
77. Wright MW (2014) A short guide to long non-coding RNA gene nomenclature. Hum Genomics 8:7
78. Fritah S, Niclou SP, Azuaje F (2014) Databases for lncRNAs: a comparative evaluation of emerging tools. RNA 20:1655–1665
79. Quek XC, Thomson DW, Maag JL et al (2015) lncRNAdb v2.0: expanding the reference database for functional long non-coding RNAs. Nucleic Acids Res 43:D168–D173
80. Craig JM, Bickmore WA (1993) Chromosome bands—flavours to savour. Bioessays 15:349–354
81. Altschul SF, Gish W, Miller W et al (1990) Basic local alignment search tool. J Mol Biol 215:403–410
82. Kent WJ (2002) BLAT—the BLAST-like alignment tool. Genome Res 12:656–664
83. Jacox E, Elnitski L (2008) Finding occurrences of relevant functional elements in genomic signatures. Int J Comput Sci 2:599–606
84. Brennan RG, Matthews BW (1989) Structural basis of DNA-protein recognition. Trends Biochem Sci 14:286–290
85. Hudson WH, Ortlund EA (2014) The structure, function and evolution of proteins that bind DNA and RNA. Nat Rev Mol Cell Biol 15:749–760
86. Wells RD (1988) Unusual DNA structures. J Biol Chem 263:1095–1098
87. Hedgpeth J, Goodman HM, Boyer HW (1972) DNA nucleotide sequence restricted by the RI endonuclease. Proc Natl Acad Sci U S A 69:3448–3452
88. Wei CL, Wu Q, Vega VB et al (2006) A global map of p53 transcription-factor binding sites in the human genome. Cell 124:207–219
89. Mergny JL (2012) Alternative DNA structures: G4 DNA in cells: itae missa est? Nat Chem Biol 8:225–226
90. Giraldo R, Suzuki M, Chapman L et al (1994) Promotion of parallel DNA quadruplexes by a yeast telomere binding protein: a circular dichroism study. Proc Natl Acad Sci U S A 91:7658–7662
91. Cayrou C, Coulombe P, Puy A et al (2012) New insights into replication origin characteristics in metazoans. Cell Cycle 11:658–667
92. Brown P, Baxter L, Hickman R et al (2013) MEME-LaB: motif analysis in clusters. Bioinformatics 29:1696–1697
93. Grant CE, Bailey TL, Noble WS (2011) FIMO: scanning for occurrences of a given motif. Bioinformatics 27:1017–1018
94. Medina-Rivera A, Defrance M, Sand O et al (2015) RSAT 2015: regulatory sequence analysis tools. Nucleic Acids Res 43:W50–W56
95. Rice P, Longden I, Bleasby A (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16:276–277
96. Stormo GD, Zhao Y (2010) Determining the specificity of protein-DNA interactions. Nat Rev Genet 11:751–760
97. Kel AE, Gossling E, Reuter I et al (2003) MATCH: A tool for searching transcription factor binding sites in DNA sequences. Nucleic Acids Res 31:3576–3579
98. Wingender E (2008) The TRANSFAC project as an example of framework technology that supports the analysis of genomic regulation. Brief Bioinform 9:326–332
99. Wrzodek C, Schroder A, Drager A et al (2010) ModuleMaster: a new tool to decipher transcriptional regulatory networks. Biosystems 99:79–81
100. Turatsinze JV, Thomas-Chollier M, Defrance M et al (2008) Using RSAT to scan genome sequences for transcription factor binding sites and cis-regulatory modules. Nat Protoc 3:1578–1588
268 James R.A. Hutchins
101. Kinsella RJ, Kahari A, Haider S et al (2011) 116. Lindner R, Friedel CC (2012) A comprehen-
Ensembl BioMarts: a hub for data retrieval sive evaluation of alignment algorithms in the
across taxonomic space. Database (Oxford) context of RNA-seq. PLoS One 7:e52403
2011, bar030 117. Buermans HP, den Dunnen JT (2014) Next
102. Metzker ML (2010) Sequencing generation sequencing technology: advances
technologies—the next generation. Nat Rev and applications. Biochim Biophys Acta
Genet 11:31–46 1842:1932–1941
103. Niedringhaus TP, Milanova D, Kerby MB 118. van Dijk EL, Auger H, Jaszczyszyn Y et al
et al (2011) Landscape of next-generation (2014) Ten years of next-generation sequenc-
sequencing technologies. Anal Chem ing technology. Trends Genet 30:418–426
83:4327–4341 119. Li JW, Schmieder R, Ward RM et al (2012)
104. Ozsolak F, Milos PM (2011) RNA sequenc- SEQanswers: an open access community for
ing: advances, challenges and opportunities. collaboratively decoding genomes. Bioinfor-
Nat Rev Genet 12:87–98 matics 28:1272–1273
105. Li R, Li Y, Kristiansen K et al (2008) SOAP: 120. Scholtalbers J, Rossler J, Sorn P et al (2013)
short oligonucleotide alignment program. Galaxy LIMS for next-generation sequencing.
Bioinformatics 24:713–714 Bioinformatics 29:1233–1234
106. Li H, Ruan J, Durbin R (2008) Mapping 121. Blankenberg D, Hillman-Jackson J (2014)
short DNA sequencing reads and calling var- Analysis of next-generation sequencing data
iants using mapping quality scores. Genome using galaxy. Methods Mol Biol 1150:21–43
Res 18:1851–1858 122. Liu B, Madduri RK, Sotomayor B et al (2014)
107. Langmead B, Trapnell C, Pop M et al (2009) Cloud-based bioinformatics workflow plat-
Ultrafast and memory-efficient alignment of form for large-scale next-generation sequenc-
short DNA sequences to the human genome. ing analyses. J Biomed Inform 49:119–133
Genome Biol 10:R25 123. Zweig AS, Karolchik D, Kuhn RM et al
108. Li H, Durbin R (2009) Fast and accurate (2008) UCSC genome browser tutorial.
short read alignment with Burrows-Wheeler Genomics 92:75–84
transform. Bioinformatics 25:1754–1760 124. Goecks J, Nekrutenko A, Taylor J (2010)
109. Lunter G, Goodson M (2011) Stampy: a sta- Galaxy: a comprehensive approach for sup-
tistical algorithm for sensitive and fast porting accessible, reproducible, and trans-
mapping of Illumina sequence reads. Genome parent computational research in the life
Res 21:936–939 sciences. Genome Biol 11:R86
110. Langmead B, Salzberg SL (2012) Fast 125. Hillman-Jackson J, Clements D, Blankenberg
gapped-read alignment with Bowtie 2. Nat D et al (2012) Using Galaxy to perform large-
Methods 9:357–359 scale interactive data analyses. Curr Protoc
111. Li H (2013) Aligning sequence reads, clone Bioinformatics Chapter 10, Unit 10.15
sequences and assembly contigs with BWA- 126. Smedley D, Haider S, Durinck S et al (2015)
MEM. arXiv preprint arXiv:1303.3997 The BioMart community portal: an innova-
112. Sedlazeck FJ, Rescheneder P, von Haeseler A tive alternative to large, centralized data repo-
(2013) NextGenMap: fast and accurate read sitories. Nucleic Acids Res 43:W589–W598
mapping in highly polymorphic genomes. 127. Wolstencroft K, Haines R, Fellows D et al
Bioinformatics 29:2790–2791 (2013) The Taverna workflow suite: design-
113. Santana-Quintero L, Dingerdissen H, ing and executing workflows of Web Services
Thierry-Mieg J et al (2014) HIVE-hexagon: on the desktop, web or in the cloud. Nucleic
high-performance, parallelized sequence Acids Res 41:W557–W561
alignment for next-generation sequencing 128. Mangalam H (2002) The Bio* toolkits—a
data analysis. PLoS One 9:e99033 brief overview. Brief Bioinform 3:296–302
114. Lee WP, Stromberg MP, Ward A et al (2014) 129. Stabenau A, McVicker G, Melsopp C et al
MOSAIK: a hash-based algorithm for accu- (2004) The Ensembl core software libraries.
rate next-generation sequencing short-read Genome Res 14:929–933
mapping. PLoS One 9:e90581 130. Yates A, Beal K, Keenan S et al (2014) The
115. Fonseca NA, Rung J, Brazma A et al (2012) Ensembl REST API: Ensembl data for any
Tools for mapping high-throughput sequenc- language. Bioinformatics 31(1):143–145
ing data. Bioinformatics 28:3169–3177 131. Mishima H, Aerts J, Katayama T et al (2012)
The Ruby UCSC API: accessing the UCSC
Genomic Database Searching 269
Abstract
Gene finding is the process of identifying genome sequence regions representing stretches of DNA that
encode biologically active products, such as proteins or functional noncoding RNAs. Gene finding is usually the
first step in the analysis of any novel genomic sequence or resequenced sample of well-known organisms, and
all downstream analyses depend on its results, so its accuracy is critical. This chapter describes the
biological basis for gene finding, and the programs and computational approaches that are available for the
automated identification of protein-coding genes. For bacterial, archaeal, and eukaryotic genomes, as well
as for multi-species sequence data originating from environmental community studies, the state of the art in
automated gene finding is described.
Key words Gene prediction, Gene finding, Genomic sequence, Protein-coding sequences, Next-
generation sequencing, Environmental sequence samples
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_11, © Springer Science+Business Media New York 2017
272 Alice Carolyn McHardy and Andreas Kloetgen
2 Methods
2.1 Gene Finding in Bacteria and Archaea
Finding genes in genome sequences is a simpler task in bacteria and
archaea than it is in eukaryotic organisms. First, the gene structure
is less complex: a protein-coding gene corresponds to a single open
Finding Genes in Genome Sequence 273
Table 1
Bacterial and archaeal gene finders
[Fig. 1 (diagram; only panel labels survive): Similarity searches / protein domain searches; Train classifier; C. Post-processing; Completely labeled sequence; reading frames +1, +2, +3, –1, –2, –3]
Fig. 1 Overview of a sequence of steps employed by a bacterial and archaeal gene finder. In (a), the sequence
is initially searched for regions which exhibit significant conservation on amino acid level relative to other
protein-coding regions or which show motifs of protein domains. By extending such regions to a start and stop
codon, a partial labeling of the genome sequence into coding regions (light gray) and noncoding ORFs (nORFs),
which significantly overlap with such coding sequences in another frame (dark gray), can be obtained. The
labeled parts can be used as training sequences to derive the vectors of intrinsic sequence features for
training a binary classifier. In (b), the classifier is applied to classify all ORFs above a certain length in the
sequence as either coding sequences (CDSs) or nORFs. In the post-processing phase (c), the start positions of
the predicted CDSs are reassigned with the help of translation start site models, and conflicts between
neighboring predictions are resolved
##gff-version 2
##date 2006-03-21
Fig. 2 Output of the program GISMO (Table 1). The output is in GFF format. For each prediction, the contig name, the start and stop positions, the support vector
machine (SVM) score and the reading frame are given. The position of a ribosome binding site (RBS) and a confidence assignment (1 for high confidence; 0 for low
confidence) are also given. If a prediction has a protein domain match in PFAM, the e-value of that hit and the description of the PFAM entry are also reported.
Predictions that were discarded in the post-processing phase (removal of overlapping low-confidence predictions) are labeled as discarded (“1”)
Table 2
Metagenomic gene finders
2.3 Gene Finding in Eukaryotes
Gene prediction for bacterial and archaeal genomes has reached
high levels of accuracy and is more accurate than that for eukaryotic
organisms (see Note 3). However, newer programs for eukaryotic
datasets have reached more than 90% specificity and 80% sensitivity,
at least for exon identification in compact eukaryotic genomes
[53] (see also Subheading 2.3.4). Nevertheless, gene prediction in
eukaryotic genomes remains a complex and challenging problem for
several reasons. First, only a small fraction of a eukaryotic genome
sequence corresponds to protein-encoding exons, which are embedded
in vast amounts of noncoding sequence. Second, the gene structure
is complex (Fig. 3).
The CDS encoding the final protein product can be located
discontinuously in two or more exonic regions, which are sometimes
very short and are separated from each other by intron sequences.
The exon–intron boundaries are characterized by splice sites on the
initial transcript, which guide intron removal at the spliceosome
during production of the mature mRNA transcript. Additional signal
sequences, such as a polyadenylation signal, are found in proximity
to the transcript ends. These determine a cleavage site corresponding
to the end of the mature transcript, to which a polyadenylation
(polyA) tail is added for stability. The issue is further complicated
by the fact that genes can have alternative splice and polyA sites,
as well as alternative translation and transcription initiation sites.
Third, due to the massive sequencing requirements, additional
eukaryotic genomes which can be used to study sequence conservation
are becoming available more slowly than their bacterial and archaeal
counterparts. So far, 283 eukaryotic genomes have been finished
compared to 241 archaeal and 5,047 bacterial genomes in the current
GOLD release v6 [54].
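The exon-level accuracy figures quoted at the start of this section can be made concrete with a small calculation. The sketch below is illustrative only: it assumes the convention common in gene-prediction evaluations, in which sensitivity is the fraction of annotated exons that are predicted exactly and "specificity" is the fraction of predicted exons that are correct (elsewhere often called precision). The function name and the coordinates are hypothetical.

```python
def exon_accuracy(predicted, annotated):
    """Return (sensitivity, specificity) at the exon level.

    Exons are (start, end) tuples. A predicted exon counts as correct
    only if both of its boundaries match an annotated exon exactly.
    """
    predicted, annotated = set(predicted), set(annotated)
    true_pos = len(predicted & annotated)
    sensitivity = true_pos / len(annotated) if annotated else 0.0
    specificity = true_pos / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical coordinates: four of five annotated exons are recovered,
# and one predicted exon has no annotated counterpart.
annotated = [(100, 200), (300, 450), (600, 700), (900, 1000), (1200, 1300)]
predicted = [(100, 200), (300, 450), (600, 700), (900, 1000), (1500, 1600)]
sn, sp = exon_accuracy(predicted, annotated)
```

Requiring exact boundary matches is the strictest choice; nucleotide-level or gene-level measures would give different numbers for the same prediction.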
The complex organization of eukaryotic genes makes determi-
nation of the correct gene structure the most difficult problem in
eukaryotic gene prediction. Signals of functional sites are very
informative—for instance, splice site signals are the best means of
locating exon–intron boundaries. Methods which are designed to
identify these or other functional signals such as promoter or polyA
sites are generally referred to as “signal sensors.” Methods that
classify genomic sequences into coding or noncoding content are
called “content sensors.” Content sensors can use all of the above-
mentioned sources of information for gene identification—
Table 3
Eukaryotic gene finders
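The distinction between signal sensors and content sensors drawn above can be illustrated with a deliberately simple sketch. It is not taken from any of the programs listed in Table 3. The signal sensor merely lists the canonical GT (donor) and AG (acceptor) dinucleotides, whereas real tools score the surrounding sequence with statistical models; the content sensor merely reports forward reading frames that contain no in-frame stop codon, whereas real tools typically rely on codon or oligomer statistics. All function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def splice_site_candidates(seq):
    """Toy signal sensor: positions of GT (donor) and AG (acceptor) dinucleotides."""
    seq = seq.upper()
    donors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(len(seq) - 1) if seq[i:i + 2] == "AG"]
    return donors, acceptors


def stop_free_frames(seq):
    """Toy content sensor: forward frames (0, 1, 2) with no in-frame stop codon."""
    seq = seq.upper()
    frames = []
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            frames.append(frame)
    return frames


region = "ATGGCCGTAAGTCCCAGGCT"   # made-up sequence
donors, acceptors = splice_site_candidates(region)
open_frames = stop_free_frames(region)
```

Even in this short made-up sequence several candidate sites appear, which is why signal evidence is usually combined with content evidence rather than used alone.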
2.3.2 Gene Prediction and Annotation Pipelines
Not only gene finding programs but also entire computational
pipelines for data analysis, including gene prediction and annotation,
are of great importance for genome projects (see Note 4). A brief
description of the widely used Ensembl gene prediction and annotation
pipeline [69] and its updates [70] is therefore given at this point.
The complete pipeline involves a wide variety of programs (Fig. 4).
In this pipeline, repetitive elements are initially identified and
masked to remove them from the analyzed input.
1. The sequences of known proteins from the organism are
mapped to a location on the genomic sequence. Local alignment
programs are used for a prior reduction of the sequence search
space, and the HMM-based GeneWise program [71] is used for
the final sequence alignment.
2. An ab initio predictor such as Genscan [72] is run. For predic-
tions confirmed by the presence of homologs, these homologs
are aligned to the genome using GeneWise, as before.
3. Simultaneously with step 1, Exonerate [73] is used to align
known cDNA sequences to the genome sequence. If RNA-seq
data are integrated directly, they are used to fill gaps in the
protein-coding models identified in step 1 and to add models
that are missing entirely.
4. The candidates found via this procedure are merged to create
consensus transcripts with 3′ untranslated regions (UTRs),
CDSs, and 5′ UTRs.
5. Redundant transcripts are merged and genes are identified,
which, by Ensembl’s definition, correspond to sets of transcripts
with overlapping exons.
[Fig. 4 (flowchart; only box labels survive): Genome sequence; Known proteins; Known cDNA sequences; Align to genome; Align to genome and RNA-seq data; Add novel cDNA and RNA-seq genes]
6. Novel cDNA genes which do not match with any exons of the
protein-coding genes identified so far are added to create the
final gene set.
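The gene definition used in step 5, sets of transcripts with overlapping exons, amounts to a clustering problem. The following is a minimal sketch of that idea and not the Ensembl implementation. It assumes that all transcripts lie on the same chromosome and strand, that exons are given as inclusive (start, end) coordinates, and that overlap is transitive across transcripts; the function names are hypothetical.

```python
def exons_overlap(exons_a, exons_b):
    """True if any exon in exons_a shares at least one base with an exon in exons_b."""
    return any(a_start <= b_end and b_start <= a_end
               for a_start, a_end in exons_a
               for b_start, b_end in exons_b)


def cluster_transcripts(transcripts):
    """Group transcripts (name -> list of (start, end) exons) into genes."""
    parent = {name: name for name in transcripts}

    def find(name):
        # Follow parent links to the cluster representative.
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path compression
            name = parent[name]
        return name

    names = sorted(transcripts)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if exons_overlap(transcripts[a], transcripts[b]):
                parent[find(a)] = find(b)  # merge the two clusters

    genes = {}
    for name in names:
        genes.setdefault(find(name), []).append(name)
    return sorted(genes.values())


transcripts = {
    "t1": [(100, 200), (400, 500)],
    "t2": [(450, 550), (700, 800)],
    "t3": [(210, 390)],   # lies within an intron of t1; no exon overlap
    "t4": [(750, 900)],
}
genes = cluster_transcripts(transcripts)
```

In this made-up example, t1 and t4 share no exon with each other but end up in the same gene because each overlaps t2, while t3 remains a separate gene even though it lies inside the span of t1.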
The Ensembl pipeline leans strongly towards producing a spe-
cific prediction with few false positives rather than a sensitive pre-
diction with few false negatives. Every gene prediction that is
produced is supported by direct extrinsic evidence, such as tran-
script sequences, known protein sequences from the organism’s
proteome, or known protein sequences of related organisms. It is
important to note that for the genomes of different organisms, this
procedure is applied with slight variations, depending on the
resources available. Since 2012, supporting evidence provided by
RNA-seq data has been integrated into the process. For some
species, the additional data were directly considered in the annota-
tion process, whereas for others, the results of the standard pipeline
were updated based on RNA-seq evidence [70]. For the latter
approach, an update script was developed and applied to the zebra-
fish genome [74]. Other annotation pipelines that are very similar
to Ensembl exist, such as the UCSC known genes [75] and the
publicly available tool Maker [76]. These also use extrinsic evidence
3 Conclusions
4 Notes
Fig. 5 The location of genes in GC-rich genomes is indicated by the frame-specific GC content. A plot of the
frame-specific GC content was computed for a sliding window 26 bp in size that was moved by a step size
of 5 across the sequence. The GC content is plotted in the lower panel for the three frames—Frame 3,
Frame 2 and Frame 1. The upper panel shows the location of annotated protein-coding genes in the
genome (arrows)
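A frame-specific GC profile of the kind plotted in Fig. 5 can be sketched as follows. The caption gives only the window size (26 bp) and step size (5); the exact definition used for the figure is not stated, so this sketch assumes that the GC content of a frame is the fraction of G or C among every third base of the window, phased on absolute sequence coordinates. The function name is hypothetical.

```python
def frame_gc_profile(seq, window=26, step=5):
    """Return a list of (window_start, [gc_frame1, gc_frame2, gc_frame3]).

    Assumes window >= 3 so that every frame has at least one base.
    """
    seq = seq.upper()
    profile = []
    for start in range(0, len(seq) - window + 1, step):
        gc = []
        for frame in range(3):
            # First position >= start whose coordinate is congruent to frame mod 3.
            first = start + (frame - start) % 3
            bases = seq[first:start + window:3]
            gc.append(sum(base in "GC" for base in bases) / len(bases))
        profile.append((start, gc))
    return profile


# Artificial sequence in which every third base is A.
profile = frame_gc_profile("GCA" * 20)
```

Phasing on absolute coordinates keeps the three curves comparable from one window to the next, so a frame-specific bias in GC content shows up as a sustained separation of the curves.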
Acknowledgments
References
1. Metzker ML (2010) Sequencing technologies—the next generation. Nat Rev Genet 11:31–46
2. Benson DA, Cavanaugh M, Clark K, Karsch-Mizrachi I, Lipman DJ, Ostell J, Sayers EW (2013) GenBank. Nucleic Acids Res 41:D36–D42
3. Dong H, Nilsson L, Kurland CG (1996) Co-variation of tRNA abundance and codon usage in Escherichia coli at different growth rates. J Mol Biol 260:649–663
4. Ikemura T (1981) Correlation between the abundance of Escherichia coli transfer RNAs and the occurrence of the respective codons in its protein genes: a proposal for a synonymous codon choice that is optimal for the E. coli translational system. J Mol Biol 151:389–409
5. Sharp PM, Bailes E, Grocock RJ, Peden JF, Sockett RE (2005) Variation in the strength of selected codon usage bias among bacteria. Nucleic Acids Res 33:1141–1153
6. Rocha EP (2004) Codon usage bias from tRNA’s point of view: redundancy, specialization, and efficient decoding for translation optimization. Genome Res 14:2279–2286
7. Wallace EW, Airoldi EM, Drummond DA (2013) Estimating selection on synonymous codon usage from noisy experimental data. Mol Biol Evol 30:1438–1453
8. McHardy AC, Pühler A, Kalinowski J, Meyer F (2004) Comparing expression level-dependent features in codon usage with protein abundance: an analysis of ‘predictive proteomics’. Proteomics 4:46–58
9. Saunders R, Deane CM (2010) Synonymous codon usage influences the local protein structure observed. Nucleic Acids Res 38:6719–6728
10. Hooper SD, Berg OG (2000) Gradients in nucleotide and codon usage along Escherichia coli genes. Nucleic Acids Res 28:3517–3523
11. Fickett JW, Tung CS (1992) Assessment of protein coding measures. Nucleic Acids Res 20:6441–6450
12. Hayashi T, Makino K, Ohnishi M, Kurokawa K, Ishii K, Yokoyama K, Han CG, Ohtsubo E, Nakayama K, Murata T et al (2001) Complete genome sequence of enterohemorrhagic Escherichia coli O157:H7 and genomic comparison with a laboratory strain K-12. DNA Res 8:11–22
13. Besemer J, Lomsadze A, Borodovsky M (2001) GeneMarkS: a self-training method for prediction of gene starts in microbial genomes. Implications for finding sequence motifs in regulatory regions. Nucleic Acids Res 29:2607–2618
14. Larsen TS, Krogh A (2003) EasyGene—a prokaryotic gene finder that ranks ORFs by statistical significance. BMC Bioinformatics 4:21
15. Lukashin AV, Borodovsky M (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26:1107–1115
16. Delcher AL, Harmon D, Kasif S, White O, Salzberg SL (1999) Improved microbial gene identification with GLIMMER. Nucleic Acids Res 27:4636–4641
17. Krause L, McHardy AC, Nattkemper TW, Pühler A, Stoye J, Meyer F (2007) GISMO—gene identification using a support vector machine for ORF classification. Nucleic Acids Res 35:540–549
18. Mahony S, McInerney JO, Smith TJ, Golden A (2004) Gene prediction using the Self-Organizing Map: automatic generation of multiple gene models. BMC Bioinformatics 5:23
19. Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer and the nature of bacterial innovation. Nature 405:299–304
20. Hayes WS, Borodovsky M (1998) How to interpret an anonymous bacterial genome: machine learning approach to gene identification. Genome Res 8:1154–1171
21. Ou HY, Guo FB, Zhang CT (2004) GS-Finder: a program to find bacterial gene start sites with a self-training method. Int J Biochem Cell Biol 36:535–544
22. Suzek BE, Ermolaeva MD, Schreiber M, Salzberg SL (2001) A probabilistic method for identifying start codons in bacterial genomes. Bioinformatics 17:1123–1130
23. Tech M, Pfeifer N, Morgenstern B, Meinicke P (2005) TICO: a tool for improving predictions of prokaryotic translation initiation sites. Bioinformatics 21:3568–3569
24. Zhu HQ, Hu GQ, Ouyang ZQ, Wang J, She ZS (2004) Accuracy improvement for identifying translation initiation sites in microbial genomes. Bioinformatics 20:3308–3317
25. Shibuya T, Rigoutsos I (2002) Dictionary-driven prokaryotic gene finding. Nucleic Acids Res 30:2710–2725
26. Badger JH, Olsen GJ (1999) CRITICA: coding region identification tool invoking comparative analysis. Mol Biol Evol 16:512–524
27. Frishman D, Mironov A, Mewes HW, Gelfand M (1998) Combining diverse evidence for gene
nGASP—the nematode genome annotation assessment project. BMC Bioinformatics 9:549
54. Reddy TBK, Thomas A, Stamatis D, Bertsch J, Isbandi M, Jansson J, Mallajosyula J, Pagani I, Lobos E, Kyrpides N (2015) The Genomes OnLine Database (GOLD) v. 5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Res 43:D1099–D1106
55. Brent MR, Guigo R (2004) Recent advances in gene structure prediction. Curr Opin Struct Biol 14:264–272
56. Brent MR (2008) Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9:62–73
57. Sleator RD (2010) An overview of the current status of eukaryote gene prediction strategies. Gene 461:1–4
58. DeCaprio D, Vinson JP, Pearson MD, Montgomery P, Doherty M, Galagan JE (2007) Conrad: gene prediction using conditional random fields. Genome Res 17:1389–1398
59. Gross SS, Do CB, Sirota M, Batzoglou S (2007) CONTRAST: a discriminative, phylogeny-free approach to multiple informant de novo gene prediction. Genome Biol 8:R269
60. Bernal A, Crammer K, Pereira F (2012) Automated gene-model curation using global discriminative learning. Bioinformatics 28:1571–1578
61. Schweikert G, Zien A, Zeller G, Behr J, Dieterich C, Ong CS, Philips P, De Bona F, Hartmann L, Bohlen A (2009) mGene: accurate SVM-based gene finding with an application to nematode genomes. Genome Res 19:2133–2143
62. Stanke M, Diekhans M, Baertsch R, Haussler D (2008) Using native and syntenically mapped cDNA alignments to improve de novo gene finding. Bioinformatics 24:637–644
63. Korf I (2004) Gene finding in novel genomes. BMC Bioinformatics 5:59
64. Zickmann F, Lindner MS, Renard BY (2013) GIIRA–RNA-Seq driven gene finding incorporating ambiguous reads. Bioinformatics 30:606–613
65. Martin JA, Wang Z (2011) Next-generation transcriptome assembly. Nat Rev Genet 12:671–682
66. Wang Z, Gerstein M, Snyder M (2009) RNA-Seq: a revolutionary tool for transcriptomics. Nat Rev Genet 10:57–63
67. Ozsolak F, Milos PM (2011) RNA sequencing: advances, challenges and opportunities. Nat Rev Genet 12:87–98
68. Yandell M, Ence D (2012) A beginner’s guide to eukaryotic genome annotation. Nat Rev Genet 13:329–342
69. Curwen V, Eyras E, Andrews TD, Clarke L, Mongin E, Searle SM, Clamp M (2004) The Ensembl automatic gene annotation system. Genome Res 14:942–950
70. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S (2013) Ensembl 2013. Nucleic Acids Res 41:D48–D55
71. Birney E, Clamp M, Durbin R (2004) GeneWise and Genomewise. Genome Res 14:988–995
72. Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268:78–94
73. Slater GS, Birney E (2005) Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6:31
74. Collins JE, White S, Searle SM, Stemple DL (2012) Incorporating RNA-seq data into the zebrafish Ensembl genebuild. Genome Res 22:2067–2078
75. Hsu F, Kent WJ, Clawson H, Kuhn RM, Diekhans M, Haussler D (2006) The UCSC known genes. Bioinformatics 22:1036–1046
76. Cantarel BL, Korf I, Robb SM, Parra G, Ross E, Moore B, Holt C, Alvarado AS, Yandell M (2008) MAKER: an easy-to-use annotation pipeline designed for emerging model organism genomes. Genome Res 18:188–196
77. Lomsadze A, Ter-Hovhannisyan V, Chernoff YO, Borodovsky M (2005) Gene identification in novel eukaryotic genomes by self-training algorithm. Nucleic Acids Res 33:6494–6506
78. Tenney AE, Brown RH, Vaske C, Lodge JK, Doering TL, Brent MR (2004) Gene prediction and verification in a compact genome with numerous small introns. Genome Res 14:2330–2335
79. Wei C, Lamesch P, Arumugam M, Rosenberg J, Hu P, Vidal M, Brent MR (2005) Closing in on the C. elegans ORFeome by cloning TWINSCAN predictions. Genome Res 15:577–582
80. Guigo R, Reese MG (2005) EGASP: collaboration through competition to find human genes. Nat Methods 2:575–577
81. Guigo R, Flicek P, Abril JF, Reymond A, Lagarde J, Denoeud F, Antonarakis S, Ashburner M, Bajic VB, Birney E et al (2006) EGASP: the human ENCODE genome annotation assessment project. Genome Biol 7(Suppl 1):S2
82. ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome. Nature 489:57–74
83. Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG (2013) ENCODE data in the UCSC genome browser: year 5 update. Nucleic Acids Res 41:D56–D63
84. Harrow J, Frankish A, Gonzalez JM, Tapanari E, Diekhans M, Kokocinski F, Aken BL, Barrell D, Zadissa A, Searle S (2012) GENCODE: the reference human genome annotation for the ENCODE project. Genome Res 22:1760–1774
85. Sharpton TJ (2014) An introduction to the analysis of shotgun metagenomic data. Front Plant Sci 5:209
86. Nielsen P, Krogh A (2005) Large-scale prokaryotic gene prediction and comparison to genome annotation. Bioinformatics 21:4322–4329
87. Linke B, McHardy AC, Krause L, Neuwege H, Meyer F (2006) REGANOR: a gene prediction server for prokaryotic genomes and a database of high quality gene predictions for prokaryotes. Appl Bioinformatics 5:193–198
88. Warren AS, Archuleta J, Feng W-C, Setubal JC (2010) Missing genes in the annotation of prokaryotic genomes. BMC Bioinformatics 11:131
89. Osterman A, Overbeek R (2003) Missing genes in metabolic pathways: a comparative genomics approach. Curr Opin Chem Biol 7:238–251
Chapter 12
Abstract
Many biological sequences have a segmental structure that can provide valuable clues to their content,
structure, and function. The program changept is a tool for investigating the segmental structure of a
sequence, and can also be applied to multiple sequences in parallel to identify a common segmental
structure, thus providing a method for integrating multiple data types to identify functional elements in
genomes. In the previous edition of this book, a command line interface for changept was described. Here
we present a graphical user interface for this package, called changeptGUI. This interface also includes
tools for pre- and post-processing of data and results to facilitate investigation of the number and
characteristics of segment classes.
Key words Multiple change-point analysis, Genome segmentation, Functional element discovery,
Model selection
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_12, © Springer Science+Business Media New York 2017
294 Edward Tasker and Jonathan M. Keith
2 Change-Point Analysis
2.1 Software for Sequence Segmentation: changept
Liu and Lawrence [8] introduced a Bayesian multiple change-point
model for sequence segmentation in 1999. The Bayesian model has
since been extended by Keith and coworkers [1–4, 9] and, along with
an efficient technique for Gibbs sampling from a distribution with
varying dimension [10], has been encoded as the C program changept,
which samples parameter estimates from the underlying change-point
model given a target sequence. This chapter describes some of the
practical issues involved in implementing changept and the associated
graphical user interface (GUI) changeptGUI. An attractive feature of
changept is that it not only segments a sequence but also classifies
the segments into groups that share similar sequence characteristics
(these classes can be considered similar to the hidden states in
HMMs). This is consistent with the modular nature of DNA
functionality: functional elements within a DNA sequence, for example
transcription factor binding sites (TFBS), often have defined
boundaries (segmentation), and similar functional elements often have
similar sequence characteristics (classification).
As is typical of Bayesian methodologies, changept does not
merely generate a single segmentation, optimized according to
some scoring function. Rather, it generates multiple segmenta-
tions, sampled from a posterior distribution over the space of all
possible segmentations. Markov chain Monte Carlo (MCMC) sim-
ulation is applied to estimate the posterior probabilities of the
underlying model parameters.
Changept estimates, for each genomic position and each segment
class, the probability that the position belongs to that class.
As seen later, it is these probabilities
in association with the estimates of the segment class characteristics
that will be used to guide biological insight. The ability to estimate
probabilities in this way derives from the Bayesian modeling frame-
work. However, changept is relatively fast compared to alternative
Bayesian genomic segmentation methods, and can feasibly be
applied to whole eukaryotic genomes. Although a full description
of the Bayesian model and sampling algorithm is beyond the scope
of this chapter, a few brief explanations are necessary.
2.2 The Bayesian Segmentation and Classification Model (BSCM): A Model for changept
It is convenient to think about the observed sequence as a random
vector which has been generated given parameters of an underlying
model. The sequence is presumed to be drawn from a probability
distribution, where the probability distribution is a function of the
parameters in the model. The model incorporates three main concepts:
change-points, segments, and segment classes.
For a sequence which contains characters from an alphabet of
any given size (D), a segment is a region of the sequence within
which the probability of observing each of the D characters is
consistent throughout the segment. Change-points are positions
in the sequence at which the characters either side of the change-
point belong to different segments. Each segment in the sequence
belongs to one of a fixed number (T) of segment classes.
Individual segments are each defined by a vector (θ) that con-
tains the probability of observing each of the D characters in that
segment. The θ probability vectors are assumed to be drawn from a
mixture of Dirichlet distributions, where the number of compo-
nents is the number of segment classes (T). Each segment class is
distinguished by a vector, α(j) (with D entries), which is the param-
eter vector for one of the Dirichlet distributions in the mixture. The
mixture proportions are represented by the vector
π = (π_1, ..., π_T). Changept samples the following from the posterior
distribution: the positions of the K change-points, the α(j) vectors,
and the π vector. A full description of the distribution can be found
in the supplementary materials of [2].
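The generative view of the model described above can be sketched in a few lines. This is an illustration of the model structure only, not code from changept: the segment lengths are fixed by hand here, whereas in the BSCM the number and positions of the change-points are unknown and are sampled from the posterior. The function names, the binary alphabet, and all parameter values are hypothetical.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw a probability vector from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def simulate_sequence(alphabet, segment_lengths, alphas, pi, seed=0):
    """Simulate one sequence; returns (sequence, class label of each segment)."""
    rng = random.Random(seed)
    symbols = list(alphabet)
    sequence, labels = [], []
    for length in segment_lengths:
        j = rng.choices(range(len(pi)), weights=pi)[0]   # segment class drawn from pi
        theta = sample_dirichlet(alphas[j], rng)          # theta drawn from Dirichlet(alpha(j))
        sequence.extend(rng.choices(symbols, weights=theta, k=length))
        labels.append(j)
    return "".join(sequence), labels


# D = 2 characters, T = 2 segment classes, three segments of fixed length.
seq, labels = simulate_sequence(
    alphabet="01",
    segment_lengths=[30, 50, 20],
    alphas=[[1.0, 9.0], [9.0, 1.0]],
    pi=[0.7, 0.3],
)
```

Segments of the same class share an α(j) vector but each has its own θ, so their character frequencies are similar without being identical.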
2.3 MCMC Simulation
Two features of MCMC algorithms that need to be explained here
are the burn-in phase and subsampling. MCMC methods involve a
Markov chain for which the limiting distribution is the distribution
from which one wishes to sample, in this case a posterior distribution
over the space of segmentations and classifications. However,
the chain approaches this distribution asymptotically, and thus the
elements generated early in the chain are not typical and need to be
discarded. This early phase of sampling is known as burn-in. Even
after burn-in, it is an inefficient (and often infeasible) use of disk
space to record all of the segmentations generated by the algorithm.
The algorithm therefore asks the user to specify a sampling
block length. The chain of segmentations will be divided into
blocks of this length and only one element in each block will be
recorded for future processing.
Estimates of the α(j) and π vectors are calculated by averaging
the values over post burn-in samples. The relative weights of the D
components of an α(j) vector can be interpreted as the frequencies of
the D characters in that segment class. As discussed in more detail
later, this is the feature that distinguishes each segment class.
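Burn-in, block subsampling, and averaging can be sketched as follows. This is not changept code: changept records one sample per block while it runs, whereas the sketch operates on a chain already held in memory for clarity. Which element of each block is recorded is an assumption here (the last one), and the function names and toy chain are hypothetical.

```python
def thin_chain(chain, burn_in, block_length):
    """Drop the first burn_in states, then keep the last state of each complete block."""
    kept = chain[burn_in:]
    return [kept[i] for i in range(block_length - 1, len(kept), block_length)]


def posterior_mean(samples):
    """Component-wise average of equal-length parameter vectors."""
    n = len(samples)
    return [sum(values) / n for values in zip(*samples)]


# Toy "chain" of two-component parameter vectors, for illustration only.
chain = [[float(t), 1.0] for t in range(1000)]
recorded = thin_chain(chain, burn_in=200, block_length=100)
estimate = posterior_mean(recorded)
```

With 1000 iterations, a burn-in of 200, and a block length of 100, only eight samples are recorded, which illustrates why the run must be long relative to the block length.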
The probability that a position in the sequence belongs to a
given segment class is estimated by first calculating the probability
Sequence Segmentation with changeptGUI 297
3.1 The Input Sequence(s)
Changept, and the underlying BSCM, operates on a sequence
composed of letters from an arbitrary alphabet; when using changept,
the sequence(s) for which the model parameters are being
estimated (which is referred to as the input sequence) must be in a
single line of a text file.
The changeptGUI is capable of taking as input: a single
sequence, parallel sequences, or a DNA sequence alignment in axt
format (description at https://genome.ucsc.edu/goldenPath/help/axt.html).
If using an alignment in axt format, the GUI will
convert the alignment into a single input sequence before running
the changept algorithm. The process of converting the alignment
298 Edward Tasker and Jonathan M. Keith
3.1.1 The Input Sequence Format
Although the input sequence can be composed of a relatively
arbitrary alphabet, the following guidelines should be used:
1. The alphabet should be made up of only alphanumeric characters,
i.e., a–z, A–Z, and 0–9.
2. Capitalized letters will be considered as different characters
from lower-case letters.
3. The characters I,J,K,L,M,N,O (all capitals) will be ignored by
changept. That is, the characters either side of these special
characters will be considered as adjacent by changept. These
characters may be necessary in the sequence as place-holders
when there is no appropriate information for that position, for
example: when there are gaps in DNA sequence alignment.
4. The character ‘#’ is used to mark the location of fixed change-
points; in every sample generated by changept, positions
marked with ‘#’ will be considered as a change-point. This is
useful when concatenating alignment blocks into a single input
sequence; it would not make sense to allow the changept algo-
rithm to consider two points of the genome that are not
adjacent to each other as part of the same segment.
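The guidelines above can be checked before a sequence is loaded. The following is a minimal sketch, assuming the rules exactly as listed; the function name `check_input_sequence` is hypothetical and is not part of changept or changeptGUI.

```python
import string

IGNORED = set("IJKLMNO")            # place-holder characters skipped by changept
FIXED_CHANGE_POINT = "#"
ALLOWED = set(string.ascii_letters + string.digits) | {FIXED_CHANGE_POINT}

def check_input_sequence(seq):
    """Return (alphabet, n_informative, n_fixed_change_points) or raise ValueError."""
    bad = sorted(set(seq) - ALLOWED)
    if bad:
        raise ValueError(f"unexpected characters: {bad}")
    alphabet = sorted(set(seq) - IGNORED - {FIXED_CHANGE_POINT})
    n_informative = sum(1 for c in seq
                        if c not in IGNORED and c != FIXED_CHANGE_POINT)
    return alphabet, n_informative, seq.count(FIXED_CHANGE_POINT)

# One alignment block, delimited by fixed change-points.
alphabet, n_informative, n_fixed = check_input_sequence("#abcdefghIJhgfedcba#")
```

Because a newline is not an allowed character, the check also rejects input that is not on a single line.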
Alignment
Species 1: AAAACCCCC-GGGGTTTT
Species 2: ACGTACGT-AACGTACGT
Encoding
Conservation: 10000100IJ00100001
Bi-directional: abcdefghIJhgfedcba
Full: abcdefghIJijklmnop
species are assigned the character ‘I’, and gaps relative to the
second species are assigned a ‘J’.
2. The bi-directional encoding conserves both match/mismatch
information and DNA sequence information. The encoding
allows for the fact that DNA is a double stranded molecule in
which functional elements can be coded for on either strand. It
also allows for the fact that when a whole genome sequence
alignment is performed the ordering of the reference genome is
kept constant, however alignment blocks from the aligning
species may be in either direction. Considering this, the align-
ment of a particular sequence can be considered as equivalent
to the alignment of the complement of that sequence.
3. The full encoding preserves fully both sequences in the align-
ment. Each possible alignment pairing is assigned a unique
character from ‘a’–‘p’.
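The three encodings can be reproduced with a short Python sketch. It is inferred from the example alignment in this section rather than taken from the changeptGUI source, so the function names and the handling of anything other than A, C, G, T, and '-' are assumptions. A gap in the first sequence is written as 'J' and a gap in the second as 'I', as in the example.

```python
BASES = "ACGT"
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def _gap(x, y):
    # Gap in the first sequence -> 'J'; gap in the second -> 'I' (as in the example).
    if x == "-":
        return "J"
    if y == "-":
        return "I"
    return None

def full_code(x, y):
    """Unique character 'a'..'p' for each of the 16 ordered base pairs."""
    return "abcdefghijklmnop"[4 * BASES.index(x) + BASES.index(y)]

def encode(seq1, seq2, scheme):
    out = []
    for x, y in zip(seq1.upper(), seq2.upper()):
        gap = _gap(x, y)
        if gap:
            out.append(gap)
        elif scheme == "conservation":
            out.append("1" if x == y else "0")
        elif scheme == "full":
            out.append(full_code(x, y))
        elif scheme == "bidirectional":
            # A pair and its complement pair share one character ('a'..'h').
            out.append(min(full_code(x, y),
                           full_code(COMPLEMENT[x], COMPLEMENT[y])))
        else:
            raise ValueError(f"unknown scheme: {scheme}")
    return "".join(out)

species1 = "AAAACCCCC-GGGGTTTT"
species2 = "ACGTACGT-AACGTACGT"
conservation = encode(species1, species2, "conservation")
bidirectional = encode(species1, species2, "bidirectional")
full = encode(species1, species2, "full")
```

Applied to the example alignment, the three calls reproduce the Conservation, Bi-directional, and Full lines shown earlier in this section.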
3.1.3 Recording Genomic Position Information: input.log
When generating the final output (the segmentation map), in order
for the changeptGUI to relate the position of segments in the
model to genomic coordinates, an input.log file is automatically
generated (for the axt input), which records the genomic positions
of the alignment blocks. A user who is generating their own input
sequence that they want to be related to genomic coordinates will
need to generate their own input.log in the following format:
1. The file should be in a tab-separated format, where each line
contains the genomic coordinates of the ends of the alignment
blocks for each species. The coordinates should be 1-based and
inclusive at both ends (the same as for axt format).
2. On each line, separated by tabs, should be the following pieces
of information: the alignment block number; chromosome
for the reference species; starting position of the alignment
block for the reference species; ending position of the alignment
block for the reference species; the strand of the alignment
block for the reference species (either + or -);
chromosome for the aligned species; starting position of
the alignment block for the aligned species; ending position
of the alignment block for the aligned species; the strand of the
alignment block for the aligned species (either + or -).
For an input that is not in axt format, if an input.log file is
not provided then the final output positions will be relative to
positions in the input sequence (as opposed to genomic
coordinates).
Alignment block 17
16 chrY 252611 252675 chr5 109552634 109552702 + 2462
TACGTACCGTGTGACTGCTCCTGAGA----AGATCCTGTCTATCATCTTGGTAGAAAGGGCTGGAAAGG
TGCTCACTGGGTGACAGCACCGGAGAGAGAAGACGCAGTCTATCATCCAGGAAGAGATGGCTGCAAGGG
Bi-directional encoding
#acfecafhfbfafafdffdffbfafaJJJJafacgfdfafaaafaafcdffdafacaefffafgaacff#
Input.log entry
17 chrY 252611 252675 + chr5 109552634 109552702 +
Fig. 2 A demonstration of the bi-directional encoding and the input.log line entry using alignment block
17 of the alignment of the Mouse genome to the Human Y chromosome as an example. The top shows the
alignment block entry in axt format, the middle shows the bi-directional encoding and the bottom shows the
input.log line entry
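For users building their own input.log, the mapping from an axt header line to an input.log entry can be sketched as follows. The function name is hypothetical, and two details are assumptions inferred from the figure rather than documented behavior: the block number is the axt alignment number plus one (axt numbers alignments from 0), and the reference strand is always written as '+'. The sketch copies coordinates unchanged and does not attempt any strand-dependent coordinate conversion.

```python
def axt_header_to_input_log(header):
    """Convert one axt header line into a tab-separated input.log entry."""
    fields = header.split()
    (number, ref_chrom, ref_start, ref_end,
     aln_chrom, aln_start, aln_end, aln_strand) = fields[:8]
    block = int(number) + 1   # assumption: input.log blocks are numbered from 1
    ref_strand = "+"          # assumption: the reference is on the + strand
    return "\t".join([str(block), ref_chrom, ref_start, ref_end, ref_strand,
                      aln_chrom, aln_start, aln_end, aln_strand])

entry = axt_header_to_input_log(
    "16 chrY 252611 252675 chr5 109552634 109552702 + 2462")
```

The resulting entry matches the input.log line shown in Fig. 2, with tabs as separators.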
Species 1: AAAACCCCC-GGGGTTTT
Species 2: ACGTACGT-AACGTACGT
Encoding: aaaaccccIJccccaaaa
10000100IJ00100001
3.1.5 Parallel Input Sequences
The BSCM can be extended to multiple parallel sequences. Each
sequence may be composed of its own alphabet; the alphabets of
the multiple sequences need not be the same. The model assumes
that the positions of change-points in each of the sequences are the
same, and that corresponding segments are always allocated to the
same class. However, the corresponding segments from each of the
parallel sequences have distinct character frequencies.
The guidelines for constructing multiple sequences to be run
through changept in parallel are the same as for a single sequence,
with the additional comment that care should be taken to ensure
that the sequences are the same length, that the positions of any ‘#’
characters are the same and that the positions of any of the special
characters I,J,K,L,M,N,O are the same in each parallel sequence.
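The consistency requirements for parallel sequences lend themselves to an automated check. The sketch below is one possible reading of them: it requires equal lengths and requires that '#' and the place-holder characters occupy the same positions in every sequence, without insisting on the same place-holder letter. The function names are hypothetical.

```python
IGNORED = set("IJKLMNO")

def _kind(char):
    """Classify a character as a fixed change-point, a place-holder, or data."""
    if char == "#":
        return "fixed"
    if char in IGNORED:
        return "ignored"
    return "data"

def check_parallel(sequences):
    """Raise ValueError unless lengths and special-character positions agree."""
    first = sequences[0]
    for n, seq in enumerate(sequences[1:], start=2):
        if len(seq) != len(first):
            raise ValueError(
                f"sequence {n} has length {len(seq)}, expected {len(first)}")
        for pos, (a, b) in enumerate(zip(first, seq), start=1):
            if _kind(a) != _kind(b):
                raise ValueError(
                    f"sequence {n} disagrees with sequence 1 at position {pos}")
    return True

# The two parallel sequences from the two-sequence encoding example.
ok = check_parallel(["aaaaccccIJccccaaaa", "10000100IJ00100001"])
```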
There is a conceptual difference between segmenting parallel
sequences and segmenting a single sequence encoding multiple
sequences. Parallel segments are assumed to be generated indepen-
dently, conditional on the character frequencies in each sequence.
To illustrate this, consider the two-sequence encoding shown in
Fig. 3 and how it differs from the bi-directional encoding in Fig. 1.
3.2.2 Beginning the Project
Upon opening changeptGUI the user will be prompted to either
open an existing project or begin a new project. When beginning a
new project, the user will be prompted to give the project a name
and specify whether or not the input will be a DNA sequence
alignment (in axt format). A subdirectory called “ProjectName”
will be created which will store the output files and the file “pro-
jectname.cpsegpro” which is the file used to load existing
projects.
3.2.3 Loading the Input Sequence
If the user is running changept for a DNA sequence alignment (in
axt format), the user can load the alignment file(s) by clicking the
“load alignment file(s)” button. The user will then need to select
the desired encoding for the alignment (see Subheading 3.1.2). If
multiple alignment files are selected then the alignment blocks from
all selected files will be concatenated into a single input sequence.
If the user is not using a DNA sequence alignment as input
then they will be able to load the input sequence by clicking the
“load input sequence(s)” button (see Subheading 3.1.1). If multi-
ple files are chosen then changeptGUI will assume the user is
running changept for parallel sequences (note changept can be
3.2.4 Choosing Input Parameters
In order to run changept, the user needs to specify three parameters.
Firstly, the user will need to specify the number of segment
classes in the model. In general, the appropriate number of segment
classes will not be known in advance. Consequently, changept
should be run for a range of models and then the appropriate
model chosen using model selection criteria (see Subheading 3.3).
The other two parameters required are the sampling block size and
number of samples. There is considerable latitude in the choice of
these parameters, but the following guidelines can be used.
3.2.5 Sampling Block Size
The sampling block size is the number of samples that changept
generates for each sample that is output to file. For a sampling block
size of, say, 10,000, changept generates 10,000 samples, then 1 of
the 10,000 samples is chosen uniformly at random to output to
file. The sampling block size should be large enough so that con-
secutive post burn-in samples appear relatively independent.
A rough guideline for the sampling block size is ~10 % of the length
of the input sequence, or ~10 times the average number of change
points. For large sequences (e.g., whole chromosomes) 10 % of the
sequence length may be too long and too time consuming (the
larger the sampling block size the longer the sampler will take to
produce the desired number of samples). In this case changept
can be run for a small number of iterations (until convergence first
occurs) with a relatively small sampling block size, then extended
with a sampling block size of ~10 times the average number of
change-points.
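The two rules of thumb above amount to simple arithmetic, sketched here in Python. The function name is hypothetical; the numbers used are the sequence length and approximate change-point count reported in the worked example of this chapter.

```python
def suggested_block_sizes(sequence_length, mean_change_points=None):
    """Return (~10 % of the sequence length, ~10 x the mean number of change-points)."""
    by_length = round(0.10 * sequence_length)
    by_change_points = (None if mean_change_points is None
                        else round(10 * mean_change_points))
    return by_length, by_change_points

# Mouse/human Y chromosome example: length 2,524,681, about 15,000 change-points.
by_length, by_change_points = suggested_block_sizes(2_524_681, 15_000)
```

For this example the length-based rule suggests roughly 252,000, while the change-point rule gives 150,000, the value used for the extended run.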
3.2.6 Number of Samples
This is the number of samples that changept will output to file. The
number should be large enough so that the sampler can both
converge and provide sufficient post-burn-in samples from which
to estimate model parameters. At least 500 post burn-in samples are
recommended. It is recommended to initially underestimate the
number of samples needed. If burn-in is not achieved or the num-
ber of post-burn-in samples is smaller than desired, additional
samples can then be generated. The sampler generates a Markov
chain, thus the last sample from the initial run can be used as the
starting point for a new run, without requiring a second period of
burn-in. After the changept algorithm has run for the initial
number of samples, the run can be extended from the “Run Seg-
mentation” tab in the changeptGUI.
3.2.7 Assessing Convergence
Once changept has finished running, the “Model Analysis” tab
can be used to inspect the output. The first thing that should be
checked is that the sampler has converged; each model will need to
be checked individually. Models with fewer segment classes will
have fewer parameters to estimate and will generally converge faster
than models with more segment classes. To check for convergence,
go to the “Model Analysis” tab and select a model from the drop-
down menu, then click on the “Sample plots” interface. The “Sam-
ple plots” interface can be used to plot model parameter values
from each of the samples generated by changept. The first plot to
generate should be a plot of the log-likelihood over all samples. If
this plot has not converged then changept should be run for more
samples. It can be the case that the log-likelihood has converged
but the sampler has not yet converged for the other parameters.
The next plots that should be checked for convergence are the
mixture proportions for each segment class and then the character
frequencies for each segment class. If the sampler has not converged
it can be set to run again for a larger number of samples.
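Visual inspection of the sample plots can be complemented by a simple numerical summary. The sketch below computes the lag-1 autocorrelation of a parameter trace. This is not a feature of changeptGUI and is not the chapter's procedure; it is offered as an assumption-laden supplement, on the reasoning that a value near zero suggests consecutive recorded samples are close to independent, whereas a value near one suggests the sampling block size is too small or the chain is still drifting.

```python
def lag1_autocorrelation(values):
    """Lag-1 autocorrelation of a sequence of sampled parameter values."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values)
    if variance == 0:
        return 0.0
    covariance = sum((values[i] - mean) * (values[i + 1] - mean)
                     for i in range(n - 1))
    return covariance / variance

# Hypothetical traces: a steadily drifting one and a rapidly alternating one.
drifting = lag1_autocorrelation(list(range(100)))
alternating = lag1_autocorrelation([0, 1] * 50)
```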
3.2.8 Example: Running changept
The alignment of the mouse genome to the human Y chromosome
in axt format was loaded and changept was run using the bi-directional
encoding for models with the number of classes ranging
from 2 to 10. The input sequence length is 2,524,681 (ignoring
gaps). Changept was initially run for 1000 samples with a sampling
block size of 10,000. Figure 4 (top) shows the log-likelihood over
the 1000 samples for the 7-class model. The sampler converges in
log-likelihood after a relatively small number of samples (less than
50). Figure 4 (middle and bottom) show the mixture proportions
and the conservation rate respectively (see Note 1) for each segment
class in the 7-class model across the 1000 samples. Figure 4 shows
that the sampler has converged in the 1000 samples. However,
convergence in mixture proportions and character frequencies
takes longer than for log-likelihood.
As is expected for a sequence of that length, a sampling block
size of 10,000 is too small. This is apparent in Fig. 4 (middle and
bottom), as there is an observable level of correlation between the
values in consecutive samples. Using the “All models” selection in
the drop-down menu, the average number of change-points can be
calculated using the last 500 samples (after convergence). The
number of change-points is approximately 15,000, which suggests
that a good choice for the sampling block size is approximately
150,000. Thus, for each model changept was run for an addi-
tional 500 samples using a sampling block size of 150,000. Figure 5
shows the mixture proportions for the 1500 samples after running
for an additional 500 samples. Figure 5 indicates that for the last
500 samples, with the sampling block size of 150,000, there is less
correlation between consecutive samples. A similar observation can
be made for each of the models (from 2–10 segment classes); thus
[Fig. 4 image: three line plots against sample number (0–1000). Top: Ln-likelihood (about −4.185 × 10^6 to −3.915 × 10^6). Middle: mixture proportions (0–1) for Classes 0–6. Bottom: per-class values (0–1) for Classes 0–6.]
Fig. 4 All three images are plots generated using the “Sample plots” interface of changeptGUI depicting
values from 1000 samples output by changept for the 7-class segmentation of the alignment of the Mouse
genome to the Human Y chromosome. (Top) A plot of the log-likelihood for each sample. (Middle) A plot of the
mixture proportions for each of the seven classes for each sample. (Bottom) A plot of the conservation rate
(see Note 1) for each of the seven classes for each sample
[Fig. 5 image: line plot for Classes 0–6 against sample number (0–1500), vertical axis 0–1.]
Fig. 5 A plot of the mixture proportions for the 1500 samples generated by changept for the 7-class
segmentation of the alignment of the Mouse genome to the Human Y chromosome. The first 1000 samples
were output by changept in an initial run using a sampling block size of 10,000 and the second 500
samples were output using a sampling block size of 150,000. The image is generated using the “Sample
plots” interface in the “Model analysis” tab of the changeptGUI
3.3 Model Selection
3.3.1 Information Criteria
As already stated, changept will generally be run for multiple
models with different numbers of segment classes. The appropriate
model can then be selected using information criteria; a model
with a lower value of an information criterion will be preferred to a
model with a higher value. A full discussion of information criteria
is beyond the scope of this chapter; a discussion of the use of
information criteria for model selection when using changept
can be found in Ref. [3]. ChangeptGUI can calculate estimates of
three different information criteria. In the “Model Analysis” tab,
under the drop-down menu select “All models”, the user will then
be prompted to specify a number of samples to use for generating
the “All models summary” (see Note 2). The GUI will then show
properties for each of the models including the estimates of the
information criteria DICV, AIC, and BIC. As discussed in Ref. [3],
the DICV estimate is generally preferred for model selection.
There are two more considerations that need to be made for
model selection. Both relate to the idea that the selected model
should be informative to the user. The first consideration is that a
model which contains an empty segment class is not informative; an
empty segment class is one for which the mixture proportion for
that class is zero or close to zero. The second consideration is that
segment classes should differ significantly (and informatively) from
3.3.2 Guidelines for Model Selection
1. Select the model with the first minimum for DICV (first, when
moving from the model with the fewest segment classes to the
most).
2. If the model contains at least one empty segment class, choose
the next largest model which does not contain an empty seg-
ment class.
3. If the model contains segment classes that contain character
frequency estimates which are indistinguishable, then choose
the next largest model for which all segment classes are
distinguishable.
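The first two guidelines can be sketched in Python under stated assumptions. The DICV values and mixture proportions below are invented for illustration. The `empty_threshold` of 0.01 is a hypothetical cut-off for "close to zero", and "next largest model" is read here as the largest model with fewer classes than the one initially selected. Guideline 3 is omitted because it needs a criterion for when two classes are indistinguishable, which is left to the user's judgment.

```python
def first_minimum(dicv):
    """dicv maps number of classes -> DICV estimate; return the first local minimum."""
    ks = sorted(dicv)
    for k, k_next in zip(ks, ks[1:]):
        if dicv[k] < dicv[k_next]:
            return k
    return ks[-1]   # DICV never increases: fall back to the largest model

def select_model(dicv, mixture_proportions, empty_threshold=0.01):
    """Apply guidelines 1 and 2; return the chosen number of classes (or None)."""
    k = first_minimum(dicv)
    candidates = [c for c in sorted(dicv) if c <= k]
    for c in reversed(candidates):
        if min(mixture_proportions[c]) > empty_threshold:
            return c
    return None

# Invented values for illustration only.
dicv = {2: 10.0, 3: 8.0, 4: 7.0, 5: 7.5, 6: 6.0}
proportions = {2: [0.5, 0.5],
               3: [0.3, 0.3, 0.4],
               4: [0.4, 0.3, 0.3, 0.0]}
chosen = select_model(dicv, proportions)
```

In this toy case the first DICV minimum is the 4-class model, but it contains an empty class, so the 3-class model is chosen.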
3.3.3 Segment Class Parameter Estimates
The estimates of mixture proportions and character frequencies for
each class can be calculated using the “Model parameters” interface
in the “Model Analysis” tab. As for the “All models summary”,
the user will be prompted to specify the number of samples
to use to generate the parameter estimates (see Note 2). Once the
number of samples to use for parameter estimates has been chosen,
the GUI will calculate the estimates and display them in both
tabular and graphical displays.
3.3.4 Example Continued: Model Selection and Segment Class Parameter Estimates
Changept was run for models with 2–10 segment classes. As
explained in the previous example section, each model was run for
1500 samples and in each case the last 500 samples were appropriate
to use for subsequent parameter estimates. Figure 6 shows the
plot of the estimates of DICV for each model (2–10 segment
classes).
Following the guidelines for model selection (see Subhead-
ing 3.3.2) the first step is to identify the model with the first
minimum in DICV, which as Fig. 6 shows is the 7-class model.
The next step is to check whether the 7-class model contains empty
segments: to do this, select the 7-class model from the drop-down
menu in the “Model Analysis” tab, then select the “Model para-
meters” interface. Figure 7 shows the parameter estimates for the
7-class model using 500 post burn-in samples to generate the
estimates.
The segment class with the smallest mixture proportion is class
4 with 5.47 %, which would generally not be considered empty.
[Fig. 6 image: DICV values (about 7.84 × 10^6 to 7.8525 × 10^6) plotted against number of classes (2–10).]
Fig. 6 A plot of DICV estimates calculated using the last 500 samples (of 1500) for each model (2–10 classes)
for the segmentation of the alignment of the Mouse genome to the Human Y chromosome. The image is
generated by selecting “All models” from the drop-down menu in the “Model analysis” tab of the
changeptGUI
[Fig. 7 image: per-class plot of the mixture proportion and the ‘a’–‘h’ character frequencies, vertical axis 0–0.8.]
Fig. 7 A plot of the point estimates of parameters (including mixture proportions and character frequencies) for
the 7-class segmentation of the alignment of the Mouse genome to the Human Y chromosome. The image is
generated using the “Parameter estimates” interface in the “Model analysis” tab of the changeptGUI
[Fig. 8 image: scatter plot for Classes 0–6 of conservation (vertical axis, 0–1) against GC content (horizontal axis, 0–1).]
Fig. 8 A plot depicting parameter values from the last 500 samples (of 1500) for each class from the 7-class
segmentation of the alignment of the Mouse genome to the Human Y chromosome. The vertical axis shows
the conservation rate and the horizontal axis shows the GC content (see Note 1)
3.4 The Segmentation Map
3.4.1 Generating Map of Segment Positions
Once the model (the number of segment classes) has been chosen,
the segmentation map of segment positions for each segment class
can be generated for the chosen model. It should be emphasized
again that the map of segment positions is based on estimates of the
3.4.2 The Segmentation Viewer
In addition to the segmentation map being saved in bed files, the
map is also able to be viewed in the “Segmentation viewer” interface
in the “Segmentation Map” tab. Bed files are a convenient
[Fig. 9 image: segmentation viewer for chrY, positions 1–10,000,000. Tracks: Segments (Classes 0–6), Ensembl genes, Aligned regions.]
Fig. 9 The segmentation map generated using changeptGUI for the 7-class segmentation of the
alignment of the Mouse genome to the Human Y chromosome. The map is visualized using the “Segmentation
viewer” interface of the GUI; at the top is the segmentation map with positions of segments color-coded by the
class to which they belong; below the segmentation map are two tracks which have been loaded from bed
formatted files. The middle track shows regions with annotated genes, and the bottom track shows the
regions which were aligned
3.4.3 Example Continued: Segmentation Map
The 7-class model was chosen for the segmentation of the alignment
of the mouse genome to the human Y chromosome (see
Subheading 3.3.4). The segmentation map was generated for the
7-class model with a burn-in of 1000 samples using a probability
threshold of 0.5, a minimum segment length of 1, allowing for up
to ten consecutive gaps within a segment and no more than 50 % of
the segment length coming from gaps (these are the default para-
meters for the GUI). Figure 9 shows, for the first 10,000,000 bases
of the human Y chromosome, the segmentation map as displayed in
the “Segmentation viewer” interface. The map is displayed with the
addition of tracks showing the regions of the chromosome which
were aligned and the regions which contain genes (as annotated in
“ensGene.txt.gz” available at
http://hgdownload.soe.ucsc.edu/goldenPath/hg19/database/).
4 Notes
References
1. Keith JM, Adams P, Stephen S, Mattick JS (2008) Delineating slowly and rapidly evolving fractions of the Drosophila genome. J Comput Biol 15(4):407–430
2. Oldmeadow C, Mengersen K, Mattick JS, Keith JM (2010) Multiple evolutionary rate classes in animal genome evolution. Mol Biol Evol 27(4):942–953
3. Oldmeadow C, Keith JM (2011) Model selection in Bayesian segmentation of multiple DNA alignments. Bioinformatics 27(5):604–610
4. Algama M, Oldmeadow C, Tasker E, Mengersen K, Keith JM (2014) Drosophila 3′ UTRs are more complex than protein-coding sequences. PLoS One 9(5):e97336
5. Hoffman MM, Buske OJ, Wang J, Weng Z, Bilmes JA, Noble WS (2012) Unsupervised pattern discovery in human chromatin structure through genomic segmentation. Nat Methods 9:473–476
6. Hoffman MM, Buske OJ, Bilmes JA, Noble WS (2011) Segway: simultaneous segmentation of multiple functional genomics data sets with heterogeneous patterns of missing data. http://noble.gs.washington.edu/proj/segway/manuscript/temposegment.nips09.hoffman.pdf
7. Algama M, Keith JM (2014) Investigating genomic structure using changept: a Bayesian segmentation model. Comput Struct Biotechnol J 10:107–115
8. Liu JS, Lawrence CE (1999) Bayesian inference on biopolymer models. Bioinformatics 15:38–52
9. Keith MJ (2006) Segmenting eukaryotic genomes with the generalized Gibbs sampler. J Comput Biol 13(7):1369–1383
10. Keith MJ, Kroese DP, Bryant D (2004) A generalised Markov sampler. Methodol Comput Appl Probab 6:29–53
11. Karolchik D, Baertsch R, Diekhans M, Furey TS, Hinrichs A et al (2003) The UCSC genome browser database. Nucleic Acids Res 31(1):51–54
12. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D et al (2011) The UCSC genome browser database: update 2011. Nucleic Acids Res 39:D876–D882
Part III
Abstract
In this chapter, I review the basic algorithm underlying the CODEML model implemented in the software
package PAML. This is intended as a companion to the software’s manual, and a primer to the extensive
literature available on CODEML. At the end of this chapter, I hope that you will be able to understand
enough of how CODEML operates to plan your own analyses.
Key words Natural selection, CODEML, PAML, Codon substitution model, dN/dS ratio, Site
models, Branch models, Branch-Site models
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_13, © Springer Science+Business Media New York 2017
316 Anders Gonçalves da Silva
3.1 Step 1: Counting Synonymous (S) and Non-synonymous (N) Sites
To understand how these parameters come together, let us work
through a simple example. Let us say we have the same codon from
two separate sequences, both coding for the amino acid Isoleucine
(I), but using different codons:
Seq1 ATT
Seq2 ATC
$$N_{ATT} = \frac{3}{3} + \frac{3}{3} + \frac{1}{3} = \frac{7}{3}$$
For many codons, S can be estimated as the sum over codons,
and N is defined as $N = 3r - S$, where r is the number of codons
in the sequence. We can then take the average per sequence to
obtain an expectation of S and N for the data-set. At this point it
should be clear why the number of non-synonymous sites is much
larger than the number of synonymous sites.
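The site counts can be checked with a short Python sketch. It builds the standard genetic code, then tallies for each codon position the fraction of the three possible substitutions that leave the amino acid unchanged. The function names are hypothetical, and treating a substitution that creates a stop codon as non-synonymous is an assumption of this sketch; other conventions exclude such changes.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons written with T rather than U.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def site_counts(codon):
    """Return (S, N): synonymous and non-synonymous site counts for one codon."""
    s = 0.0
    for pos in range(3):
        for base in BASES:
            if base == codon[pos]:
                continue
            mutant = codon[:pos] + base + codon[pos + 1:]
            if CODE[mutant] == CODE[codon]:
                s += 1 / 3
    return s, 3 - s

def sequence_site_counts(seq):
    """Sum S over the codons of a sequence; N = 3r - S for r codons."""
    codons = [seq[i:i + 3] for i in range(0, len(seq), 3)]
    s = sum(site_counts(c)[0] for c in codons)
    return s, 3 * len(codons) - s

s_att, n_att = site_counts("ATT")
s_seq, n_seq = sequence_site_counts("ATTATC")
```

For ATT this gives S = 2/3 and N = 7/3, in agreement with the calculation above.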
3.2 Step 2: Counting Synonymous (S) and Non-synonymous (N) Differences
We can then estimate the number of synonymous differences per
synonymous site Sd and the number of non-synonymous differences
per non-synonymous site Nd across the data-set. For Seq1
and Seq2 above, the calculation is straightforward. There is a single
synonymous difference out of a single observed difference. Thus,
$S_d = 1$ and $N_d = 0$.
The calculation becomes a little more involved if there are
additional observed changes. For instance, in the example given
by Yang and Nielsen [12], we have
Seq1 TTA
Seq2 CTC
Measuring Natural Selection 319
Both these codons code for Leucine (L). There are a total of
two differences; however, we do not know the order of the changes.
If we start from sequence 1, the T → C mutation could have
happened first, followed by the A → C, or vice versa. Because we
do not know, we must calculate Sd and Nd across both pathways.
If we suppose T → C happened first, that would lead to a codon
CTA, which also codes for L, which in turn would suffer an A → C
mutation, leading to CTC. In this case, we would have two synonymous
mutations in a total of two changes. If the order were
reversed, starting with the A → C mutation, this would lead to a
codon TTC, which codes for Phenylalanine (F); this would be
followed by a T → C mutation at the first codon position, leading
to the CTC codon, which now codes for L. Thus, in this pathway,
we have two non-synonymous changes over a total of two possible
changes. Across both pathways, therefore, we have two synonymous
changes across four possible changes:
$$S_d = \frac{2 + 0}{4} = \frac{1}{2}$$
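The pathway enumeration can be sketched in Python. The code walks every order in which the differing positions could have mutated and counts synonymous and non-synonymous steps, then reports the synonymous fraction using the same normalization as the worked example (synonymous changes divided by all changes across pathways). The function names are hypothetical, and skipping pathways that pass through a stop codon is an assumption of this sketch.

```python
from itertools import permutations

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons written with T rather than U.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def pathway_changes(codon1, codon2):
    """Return (synonymous, non_synonymous) step counts summed over all pathways."""
    differing = [i for i in range(3) if codon1[i] != codon2[i]]
    syn = non = 0
    for order in permutations(differing):
        current, steps, valid = codon1, [], True
        for pos in order:
            nxt = current[:pos] + codon2[pos] + current[pos + 1:]
            if CODE[nxt] == "*":      # assumption: skip pathways through stop codons
                valid = False
                break
            steps.append(CODE[nxt] == CODE[current])
            current = nxt
        if valid:
            syn += sum(steps)
            non += len(steps) - sum(steps)
    return syn, non

def synonymous_fraction(codon1, codon2):
    """Synonymous changes as a fraction of all changes across pathways."""
    syn, non = pathway_changes(codon1, codon2)
    total = syn + non
    return syn / total if total else 0.0

leu_pair = pathway_changes("TTA", "CTC")
leu_fraction = synonymous_fraction("TTA", "CTC")
ile_pair = pathway_changes("ATT", "ATC")
```

For TTA and CTC the two pathways contribute two synonymous and two non-synonymous steps, giving the fraction 1/2 derived above; for ATT and ATC the single step is synonymous.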
If changes occurred across all three positions, then there would
be a total of six possible pathways. Again, the total Sd is obtained by
summing across r codon sites that are being compared. We can then
obtain the proportion of synonymous changes per synonymous
site (ρS), and the proportion of non-synonymous changes per
non-synonymous site (ρN), by dividing Sd by S, and Nd by N,
respectively.
$$\rho_S = \frac{S_d}{S}, \qquad \rho_N = \frac{N_d}{N}$$
3.3 Step 3: Correcting for Multiple (Latent) Mutational Events to Obtain dN and dS
The proportions ρS and ρN are uncorrected estimates of dS and dN,
respectively; that is, they are not corrected for multiple (latent)
mutation events. The way to correct these estimates is by employing a
nucleotide substitution model that estimates the number of mutation
events given the amount of divergence between sequences. The
simplest model is the Jukes and Cantor [13] model, which would
apply the following corrections:
$$d_S = -\frac{3}{4} \ln\left(1 - \frac{4\rho_S}{3}\right)$$
$$d_N = -\frac{3}{4} \ln\left(1 - \frac{4\rho_N}{3}\right)$$
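The Jukes and Cantor correction is easy to evaluate numerically. The sketch below is a direct transcription of the standard formula, with a guard for proportions of 0.75 or more, where the logarithm is undefined. The function names are hypothetical, and the input values are arbitrary illustrations rather than results from any data set.

```python
import math

def jukes_cantor(p):
    """Correct an observed proportion of differences p for multiple hits."""
    if not 0 <= p < 0.75:
        raise ValueError("correction is defined only for 0 <= p < 0.75")
    return -0.75 * math.log(1 - 4 * p / 3)

def omega(rho_n, rho_s):
    """dN/dS from uncorrected proportions; undefined when dS is zero."""
    d_s = jukes_cantor(rho_s)
    if d_s == 0:
        raise ZeroDivisionError("dS is zero, so omega is undefined")
    return jukes_cantor(rho_n) / d_s

d_small = jukes_cantor(0.1)       # slightly larger than the observed 0.1
ratio = omega(rho_n=0.0, rho_s=0.3)
```

The correction is small when few differences are observed and grows rapidly as the observed proportion approaches 0.75.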
$$d_S = 1$$
$$N = \frac{7}{3}, \qquad N_d = \frac{0}{1}$$
$$\rho_N = \frac{0}{7/3} = 0, \qquad d_N = 0$$
$$\omega = \frac{d_N}{d_S} = \frac{0}{1} = 0$$
The issue with the naive model is that it assumes that all nucleotide
changes have equal weights. One instance in which this assumption
is obviously violated relates to how often we expect a transition
(A <–> G or C <–> T mutations) relative to a transversion (all
four other possible mutations). By chance, we would expect one
transition for every two transversions. Empirical observations,
however, demonstrate that the actual ratio of transitions to transversions
(κ) is often larger than one, and sometimes is as great as
40 [14]. However, before we delve into the violations that affect
how different parameters are treated, I first want to quickly
review how they are put together into a mathematical framework
[12, 15–17].
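The chance expectation can be verified by enumeration. In the sketch below, a substitution is a transition when both bases are purines (A, G) or both are pyrimidines (C, T), and a transversion otherwise. The function name is hypothetical.

```python
PURINES = {"A", "G"}

def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine substitutions."""
    return x != y and ((x in PURINES) == (y in PURINES))

# All 12 directed substitutions between distinct bases.
substitutions = [(x, y) for x in "ACGT" for y in "ACGT" if x != y]
n_transitions = sum(is_transition(x, y) for x, y in substitutions)
n_transversions = len(substitutions) - n_transitions
expected_ratio = n_transitions / n_transversions
```

Of the 12 directed substitutions, 4 are transitions and 8 are transversions, so equal weighting implies a ratio of 0.5.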
4.1 The Markov Process
While the data unit of concern is usually a sequence of nucleic acids
or amino acids, it is easier to model the evolutionary process at the
single base/amino acid level, and assume that each is independent
from the other. It is known that this assumption is violated, but
it provides mathematical convenience and the results are often
reasonable. With this simplification in mind, we can model the
probability of change at a single site, and multiply through all the
observed sites to obtain the likelihood of the data.
When we think of the evolutionary process forward in time, we
can imagine a single nucleotide that may or may not change over a
specific time period, t. Let us say that the probability of it changing
in an infinitesimal time dt is udt. If change does occur, it may take
on any of the three other possible nucleotides (in which case the
identity of the base changes at that site). It is also convenient to
allow the possibility it may take on the same base it had before
(in this case change has occurred but no difference is observed).
On the other hand, with probability $1 - u\,dt$ the base will not
change. With this simple model, we can derive the instantaneous
probability of observing base j some small time dt in the future
given that we now have base i [15]:
$$P_{ij}(dt) = (1 - u\,dt)\,\delta_{ij} + u\,dt\,\pi_j$$
The expression to the right of the plus sign gives us the joint
probability of change occurring (with probability $u\,dt$) and that the
change was to base j (with probability $\pi_j$). In the expression to the
left of the plus sign, $\delta_{ij} = 1$ if $i = j$ and zero otherwise. Thus, if
after dt has passed, and we still observe i, two events are possible
and we need to sum over both: (1) the i we now observe is the same
as the i we observed dt time ago, which happens with probability
$(1 - u\,dt)$; or (2) the i we now observe is a new i resulting from a
change from the previous i, which happens with probability $u\,dt\,\pi_i$.
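To see how the small-step probabilities accumulate over a finite time, the following sketch chains many small steps and compares the result with the closed-form solution of this particular model, $P_{ij}(t) = e^{-ut}\delta_{ij} + (1 - e^{-ut})\pi_j$. The closed form is stated here as a known property of this equal-rate model rather than quoted from the chapter, and the base frequencies, rate, and time used are arbitrary illustrative values.

```python
import math

def step_matrix(u, dt, pi):
    """P_ij(dt) = (1 - u dt) delta_ij + u dt pi_j for a small time step dt."""
    n = len(pi)
    return [[(1 - u * dt) * (i == j) + u * dt * pi[j] for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def compose(u, t, pi, steps):
    """Chain `steps` small steps, each of length t / steps."""
    small = step_matrix(u, t / steps, pi)
    n = len(pi)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(steps):
        result = matmul(result, small)
    return result

def closed_form(u, t, pi):
    """Finite-time transition probabilities for the same model."""
    decay = math.exp(-u * t)
    n = len(pi)
    return [[decay * (i == j) + (1 - decay) * pi[j] for j in range(n)]
            for i in range(n)]

pi = [0.1, 0.2, 0.3, 0.4]      # illustrative base frequencies (A, C, G, T)
approx = compose(u=1.0, t=0.5, pi=pi, steps=1000)
exact = closed_form(u=1.0, t=0.5, pi=pi)
max_gap = max(abs(approx[i][j] - exact[i][j])
              for i in range(4) for j in range(4))
```

Each row of the resulting matrix sums to one, and the chained small steps agree with the closed form to within the discretization error.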
[Fig. 1 image: two observed codons, AAT and AAC, joined to an unknown ancestral node by branches v1 and v2.]
Fig. 1 Calculating the likelihood of a tree can be achieved by collapsing nodes starting at the tips, and working
towards the root, calculating the likelihood of each internal node. Here, we present a single example, that can
be easily expanded to a whole tree by following the same principles. The circles represent observed data at a
single site. One sequence has codon “AAT” while the other has “AAC.” These sequences are connected to
their ancestral node (diamond) through branches of length “v1” and “v2,” respectively. We do not know the
codon of the ancestral node (diamond). Any of the 61 sense codons are plausible. If we fix the unknown
ancestral state at “AAT,” for instance, it is possible to calculate the joint probability that along “v1” the codon
did not change state, while along “v2” the codon changed from “AAT” to “AAC” (see text). We can then fix the
ancestral node at “AAC,” and perform the calculation again, and so on until we have the probability for all 61
possible codons. The likelihood of the node is the sum of the individual ancestral state probabilities
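The calculation described in the caption can be written compactly. This is a sketch under assumptions: $P_{xy}(v)$ denotes the probability of changing from codon x to codon y along a branch of length v, and $L(x)$ is notation introduced here for the contribution of one candidate ancestral codon. When the node is the root of the tree, each term in the sum is usually weighted by the equilibrium frequency $\pi_x$ of the ancestral codon; whether that weighting applies at this step is not stated in the caption.

```latex
L(x) = P_{x,\mathrm{AAT}}(v_1)\, P_{x,\mathrm{AAC}}(v_2),
\qquad
L = \sum_{x \,\in\, \text{61 sense codons}} L(x)
```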
$$P(t) = e^{Qt}$$
4.2 The Importance of the Transition/Transversion Ratio
The basic model (Subheading 4.1) is based on rates of change from
one codon to another. To understand why transition/transversion
bias might affect these rates we first need to define non-degenerate,
two-fold degenerate, and four-fold degenerate sites [18]. Two-fold
degenerate sites are those for which out of the three possible events
that would change the nucleotide at that position, one results in a
synonymous change. Four-fold degenerate sites are those for which
all three possible events result in a synonymous change. Finally,
non-degenerate sites are those for which any change is non-
synonymous (Subheading 2).
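These definitions can be checked directly against the universal genetic code. The following Python sketch counts, for a given codon position, how many of the three possible single-base changes leave the amino acid unchanged. The function name is hypothetical; the code table is the standard one written in U, C, A, G order.

```python
BASES = "UCAG"
# Universal genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def synonymous_alternatives(codon, pos):
    """Number (0 to 3) of single-base changes at position pos (0-based)
    that leave the encoded amino acid unchanged."""
    count = 0
    for base in BASES:
        if base == codon[pos]:
            continue
        alt = codon[:pos] + base + codon[pos + 1:]
        if CODE[alt] == CODE[codon]:
            count += 1
    return count


# 0 -> non-degenerate, 1 -> two-fold degenerate, 3 -> four-fold degenerate.
for codon, pos in [("UUU", 1), ("UUU", 2), ("GGU", 2)]:
    print(codon, "position", pos + 1, "->", synonymous_alternatives(codon, pos))
```

A count of 2, as at the third position of the Ile codons, falls outside the three classes defined in the text.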
It is believed that the reason for the often observed transition
bias lies in the fact that in two-fold degenerate sites the synonymous
change is (almost) always a transition. This can be verified by
examining a table defining the relationship between codons and
amino acids. First, most amino acids are coded by two or four
different codons (Table 1). Furthermore, almost all codons that
code for the same amino acid differ only at the third position.
Finally, for those amino acids coded by only two codons, the
codons differ by a single transition. Thus, the majority
of synonymous mutations are transitions, and would, by our
expectations of how the evolutionary process works, be more frequent
than non-synonymous transversions.
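The claim about amino acids coded by exactly two codons is easy to verify from the code table. The sketch below groups the sense codons by amino acid and tests whether each two-codon pair differs by a single transition (U↔C or A↔G). Helper names are hypothetical.

```python
from collections import defaultdict

BASES = "UCAG"
# Universal genetic code in U, C, A, G order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
TRANSITIONS = {frozenset("UC"), frozenset("AG")}


def differ_by_single_transition(codon1, codon2):
    """True if the codons differ at exactly one position, by a transition."""
    diffs = [(a, b) for a, b in zip(codon1, codon2) if a != b]
    return len(diffs) == 1 and frozenset(diffs[0]) in TRANSITIONS


codons_by_aa = defaultdict(list)
for codon, aa in CODE.items():
    if aa != "*":
        codons_by_aa[aa].append(codon)

two_codon = {aa: cs for aa, cs in codons_by_aa.items() if len(cs) == 2}
for aa, (c1, c2) in sorted(two_codon.items()):
    print(aa, c1, c2, differ_by_single_transition(c1, c2))
```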
Thus, in the basic Markov model transitions are weighted by κ
(the transition/transversion ratio). This has important
consequences for estimates of ω. To illustrate this, I have run the HIV
Table 1
The universal genetic code
Codon Amino acid Codon Amino acid Codon Amino acid Codon Amino acid
UUU Phe UCU Ser UAU Tyr UGU Cys
UUC Phe UCC Ser UAC Tyr UGC Cys
UUA Leu UCA Ser UAA Stop UGA Stop
UUG Leu UCG Ser UAG Stop UGG Trp
CUU Leu CCU Pro CAU His CGU Arg
CUC Leu CCC Pro CAC His CGC Arg
CUA Leu CCA Pro CAA Gln CGA Arg
CUG Leu CCG Pro CAG Gln CGG Arg
AUU Ile ACU Thr AAU Asn AGU Ser
AUC Ile ACC Thr AAC Asn AGC Ser
AUA Ile ACA Thr AAA Lys AGA Arg
AUG Met ACG Thr AAG Lys AGG Arg
GUU Val GCU Ala GAU Asp GGU Gly
GUC Val GCC Ala GAC Asp GGC Gly
GUA Val GCA Ala GAA Glu GGA Gly
GUG Val GCG Ala GAG Glu GGG Gly
example data-set provided with the PAML package (the full files run
for this example can be found at: https://github.com/andersgs/measuringNaturalSelection).
The full data-set includes 13 sequences
spanning 273 bases of the envelope glycoprotein (env) gene, V3
region, of HIV [19, 20]. The first thing to notice is that
the sequences are from a coding region, an important assumption
of the CODEML model, and that they start at the first position of a
codon, and end at the third position of the last codon (273 bases
equals 91 codons). Finally, none of the codons is a stop
codon (see Note 1).
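These requirements can be checked before running an analysis. The following Python sketch, which is not part of PAML, tests that a sequence length is a multiple of three and reports any in-frame stop codon of the universal code. The function name and the example sequences are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # universal code


def check_coding_sequence(seq):
    """Return a list of problems found in an in-frame coding sequence."""
    seq = seq.upper().replace("U", "T")
    problems = []
    if len(seq) % 3 != 0:
        problems.append(f"length {len(seq)} is not a multiple of 3")
    # Only complete codons are inspected.
    for start in range(0, len(seq) - len(seq) % 3, 3):
        codon = seq[start:start + 3]
        if codon in STOP_CODONS:
            problems.append(f"stop codon {codon} at codon {start // 3 + 1}")
    return problems


# Hypothetical toy sequences, not taken from the HIV data-set.
print(check_coding_sequence("ATGAAATTT"))
print(check_coding_sequence("ATGTAAGGG"))
```

An empty list means the sequence passes both checks; 273 bases, for example, divide into exactly 91 codons.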
To demonstrate the effect of κ on how CODEML counts synonymous
and non-synonymous sites, I have analyzed the data by
fixing κ at 0.01, 0.10, 1.00, and 10.00. In Table 2, we can see the
results for a single branch of the tree. As the bias
towards transitions increases, the expected number of
non-synonymous sites (N) decreases (from 235.6 to 228.8), while
the expected number of synonymous sites (S) increases
(from 37.4 to 44.2). This makes sense, because the model
implemented in CODEML weights transitions by κ and, as we saw above,
most synonymous changes are transitions. Not only does κ affect the estimate of
Table 2
Estimates of synonymous and non-synonymous changes across different transition/transversion
ratios for the same data-set
Table 3
Nucleotide frequencies across the three codon positions and the whole data-set
4.3 Codon Frequency Model
Another important parameter of the model is the equilibrium
probability of each codon (πi). In the naive implementation, we
assumed that all codons were equally likely. However, sequences
and genomes of organisms often have bias in their nucleotide
content, and in their codon usage [21]. As illustrated in Table 3
and Fig. 2, nucleotide and codon frequencies are highly skewed in
the HIV data-set.
CODEML implements a number of codon frequency models that
help account for these biases. The simplest is the naive
implementation: all codons have equal frequency (1/61). This model is
sometimes referred to as FEqual or F0. A more realistic approach
involves calculating the expected codon frequency based on the
overall nucleotide frequencies in the data-set, pooled across the
three codon positions. As an example, under this model, the frequency of the
codon “TTT” for the HIV data-set would be 0.23^3 ≈ 0.012, as
the frequency of “T” in the data-set is 0.23, and there are 3 “T”s in
this codon (Table 3). This approach has two implementations in
[Figure 2 image: bar chart of observed codon counts; x-axis “Codons”, y-axis “Counts”]
Fig. 2 Counts of observed codons across 13 sequences spanning 273 bases of the envelope glycoprotein (env)
gene, V3 region, of the HIV virus
CODEML: (1) F1X4 and (2) F1X4MG. The “1X4” stands for the
dimensions of the matrix used to calculate the codon frequencies: it
has four columns (one for each nucleotide), and a single row
(the frequency of each base across the data-set). The “MG” suffix
differentiates between the model described by Muse and Gaut [11]
and that described by Goldman and Yang [16].
The next level up in complexity estimates the expected codon
frequency based on the nucleotide frequencies at each codon
position. Under this model, the frequency of the “TTT” codon in our
data-set would be 0.15 × 0.26 × 0.28 = 0.0109, with 0.15, 0.26,
and 0.28 being the respective frequencies of “T” at the first,
second, and third codon positions in the data-set (Table 3). This
approach also has two implementations: (1) F3X4 and (2)
F3X4MG. The nomenclature has the same rationale as the previous
one. The “3” now refers to the three rows needed to account for all
codon positions.
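The two calculations for “TTT” can be reproduced with a short Python sketch. It multiplies one nucleotide frequency per codon position, using the pooled frequency three times for the 1X4 case and position-specific frequencies for the 3X4 case. Only the “T” frequencies quoted in the text are used, the function name is hypothetical, and any rescaling over the 61 sense codons that an implementation might apply is deliberately left out.

```python
from math import prod


def codon_frequency(codon, position_freqs):
    """Product of the frequency of each base at its codon position.
    position_freqs is a list of three dicts mapping base -> frequency."""
    return prod(freqs[base] for base, freqs in zip(codon, position_freqs))


# 1X4: one pooled set of nucleotide frequencies, reused for all three positions.
pooled = {"T": 0.23}
f1x4_ttt = codon_frequency("TTT", [pooled, pooled, pooled])

# 3X4: a separate set of nucleotide frequencies for each codon position.
by_position = [{"T": 0.15}, {"T": 0.26}, {"T": 0.28}]
f3x4_ttt = codon_frequency("TTT", by_position)

print(f"1X4-style frequency of TTT: {f1x4_ttt:.4f}")
print(f"3X4-style frequency of TTT: {f3x4_ttt:.4f}")
```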
It is also possible to estimate codon frequencies based
solely on the frequencies of the codons in the data-set (called
“Codon Table” in CODEML, and referred to here as FObs; Table 4).
This carries the obvious caveat that a codon not observed in
the data-set will have a probability of zero. The appropriateness of
this assumption must be evaluated on a case-by-case basis. It is
Table 4
Estimates of synonymous and non-synonymous changes across different codon frequency models for
a single branch for the same data-set
Table 5
Estimates of synonymous and non-synonymous change across different phylogenetic hypotheses for
the same branch and data-set
[Figure 3 image: four 13-tip trees, panels A to D, differing in the position of sequence 1]
Fig. 3 Alternative phylogenetic hypotheses for 13 HIV sequences. (A) Tree assuming mutation rate is equal
across all lineages and amount of divergence is only a function of time; the three remaining topologies shift the
position of sequence 1 on the tree, thus forcing mutation rates to be different across lineages. (B) Sequence 1
(highlighted in bold) is exchanged with sequence 4; (C) Sequence 1 is exchanged with sequence 9; and (D)
Sequence 1 is exchanged with sequence 12
position of the tree, this would constitute good evidence for posi-
tive natural selection on this particular lineage.
A much more common case can be demonstrated by taking
only sequences 1, 9, 10, and 11 (as in the sub-clade of tree D from
Fig. 3 that has sequence 1 as the basal sequence). In this case,
4.5 Estimating dN/dS
As discussed above, κ, the equilibrium probabilities of codons (π), and
divergence time and topology all have an effect on the estimates of the
expected number of synonymous/non-synonymous sites and the
expected number of synonymous/non-synonymous differences.
Ultimately, this affects the estimate of ω, the dN/dS ratio.
Thus far, we have assumed that a single dN/dS ratio is sufficient
to explain the observed data-set. This is the simplest model
possible in CODEML, and is often referred to as the “M0” model
(which is modelled by the Markov process described at the end of
Subheading 4.1). Empirical evidence, however, suggests that rate
heterogeneity is frequent [28]. For instance, purifying selection
might be stronger at a section of the gene that is closely related to
its function (say a binding site, for instance), and be less so in
another portion of the gene. Furthermore, positive selection is
expected to occur on only a handful of codons, with the bulk of
the codons either neutral or under purifying selection [20, 29, 30].
To accommodate rate heterogeneity across sites, and improve
our ability to detect distinct natural selection patterns across sites in
a gene, three classes of models have been developed: (1) sites
models, of which the “M0” is the most basic [20, 30]; (2) branch
models [31, 32]; and (3) branch-site models [33–36]. As the names
suggest, the models differ on where ω heterogeneity occurs in a
gene: it is either among sites; or among lineages; or both.
These different models are accommodated within our frame-
work by generating different Q matrices, one for each category of ω
identified in the model. The appropriate Q matrix is then invoked
depending on the site, the branch of the tree, or both. Essentially,
this adds a variable to the model, yij, which specifies
the category of ω to which codon i on branch j belongs.
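A minimal Python sketch of this idea follows. It assumes the usual Goldman and Yang style parameterisation [16], in which the rate between two sense codons that differ at one position is proportional to the frequency of the target codon, multiplied by κ for a transition and by ω for a non-synonymous change. The ω values, the category lookup and the function names are hypothetical and serve only to show how a category selects the rate.

```python
BASES = "TCAG"
# Universal genetic code in T, C, A, G order ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
TRANSITIONS = {frozenset("TC"), frozenset("AG")}


def relative_rate(i, j, kappa, omega, pi_j):
    """Off-diagonal rate from codon i to codon j, up to a scaling constant."""
    if CODE[i] == "*" or CODE[j] == "*":
        return 0.0  # stop codons are not part of the state space
    diffs = [(a, b) for a, b in zip(i, j) if a != b]
    if len(diffs) != 1:
        return 0.0  # only single-nucleotide changes are allowed
    rate = pi_j
    if frozenset(diffs[0]) in TRANSITIONS:
        rate *= kappa
    if CODE[i] != CODE[j]:
        rate *= omega
    return rate


# Hypothetical omega categories; y selects which one applies.
OMEGA_BY_CATEGORY = {0: 0.1, 1: 1.0, 2: 2.5}


def rate_for_category(i, j, y, kappa, pi_j):
    return relative_rate(i, j, kappa, OMEGA_BY_CATEGORY[y], pi_j)


print(rate_for_category("AAT", "AAA", 0, kappa=2.0, pi_j=1 / 61))
print(rate_for_category("AAT", "AAA", 2, kappa=2.0, pi_j=1 / 61))
```

Building one full Q matrix per category amounts to evaluating this function for every ordered pair of sense codons and setting each diagonal so that rows sum to zero.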
4.5.1 Sites Models
The sites models are some of the most developed and tested models
available in CODEML [37]. These models attempt to estimate two
basic sets of parameters: the value of ω for each category; and the
proportion of sites that are evolving under each category. The
number of ω categories is specified a priori for each model, and is
usually between one and three. These cover the following
situations: ω < 1 (purifying selection); ω ≈ 1 (neutral); and ω > 1
(positive selection). It is possible to model additional categories,
but this is generally not advised. The reason is that increasing the
number of parameters in the model will increase the variance of the
estimates, even as it decreases their bias.
Among the site models, there are two separate classes of
models. One set assumes that we can group sites into different
categories, and each category has a single estimate of ω. The
other assumes that the category of sites under purifying selection
has ω values that follow a Beta distribution, with two parameters α
and β. This allows greater complexity to be incorporated into the
model (Fig. 4), but without adding too many new parameters.
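To get a feel for how α and β shape the distribution of ω, the Beta density can be evaluated directly. The Python sketch below uses the four (α, β) pairs listed in the Fig. 4 caption; the function name and the three evaluation points are arbitrary choices for illustration.

```python
import math


def beta_pdf(x, a, b):
    """Density of the Beta(a, b) distribution at 0 < x < 1."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))


# (alpha, beta) pairs from the caption of Fig. 4.
PANELS = {"A": (0.5, 0.5), "B": (5, 5), "C": (2, 0.2), "D": (0.2, 2)}

for name, (a, b) in PANELS.items():
    densities = [beta_pdf(x, a, b) for x in (0.05, 0.5, 0.95)]
    print(name, [round(d, 3) for d in densities])
```

Comparing the density near 0, at 0.5 and near 1 is enough to see which panels put most sites under strong purifying selection and which put them close to neutrality.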
To illustrate the different possibilities, I have run the models
“M0,” “M1a,” “M2a,” “M7,” and “M8.” The “M0” is the basic
model used thus far (and presented in Subheading 4.1), in which
[Figure 4 image: four panels (A to D), each plotting Density against ω over the interval 0.0 to 1.0]
Fig. 4 Probability distribution of sites with different ω values under different Beta distribution hypotheses.
(A) A scenario where there are many sites with strong purifying selection and many that are nearly neutral
(Beta distribution parameters α = 0.5; β = 0.5). (B) A scenario where most of the sites are under intermediate
levels of purifying selection (α = 5; β = 5). (C) A scenario where the majority of the sites are nearly
neutral (α = 2; β = 0.2). (D) A scenario where the majority of the sites are under strong purifying selection
(α = 0.2; β = 2)
Model 0: one-ratio
...
lnL(ntime: 23 np: 25): -1133.892429 +0.000000
...
kappa (ts/tv) = 2.27532
omega (dN/dS) = 0.91089
branch t N S dN/dS dN dS N*dN S*dS
14..1 0.024 234.4 38.6 0.9109 0.0078 0.0086 1.8 0.3
...
Model 1: NearlyNeutral (2 categories)
...
The “M7” and “M8” models are similar to the “M1a” and
“M2a,” respectively, except that the “M7” assumes that all sites are
evolving under ω values that are drawn from a Beta distribution (as
in Fig. 4). The “M8” assumes that only a portion of the sites are
evolving under ω drawn from a Beta distribution, with another
portion having some ω > 1.
The output for these models is similar to the above models, but
has some important distinctions. First, in the model header there is
some additional information contained in the parentheses. In the
analyses performed here, we see that for “M7” there are 10 cate-
gories. These refer to the number of bins the interval between
0 and 1 was divided into in order to estimate the shape parameters
of the Beta distribution. Each bin has an equal proportion of the
sites (as seen in the line starting with “p:”), with the boundaries of
the bins allowed to vary in order to accommodate this constraint (as
seen in the line starting with “w:”). Thus, if we imagine the extreme
case where 90 % of the sites were evolving at ω < 0.1, then there
would be nine bins included in the interval (0,0.1), and a single bin
in the interval [0.1,1), leading to a Beta distribution with most of
its density in values < 0.1 (e.g., Fig. 4, panel D). In the case of
“M8,” there are 11 categories, the 10 categories to estimate the
Beta distribution, and a single category for sites evolving at ω > 1.
The parameters of the Beta distribution (called p and q in the
CODEML output) are not very different for the “M7” and “M8”
models in this data-set (“M7”: p = 0.179 and q = 0.153; “M8”:
p = 0.138 and q = 0.102), and within each model p and q are
similar to each other. This suggests that, under these models, sites evolving
at ω > 0 and ω < 1 are either under strong purifying selection or
4.5.2 Branch Models
The branch models are similar to the site models, but instead of
assuming that rate heterogeneity occurs among sites in the data-set,
they assume that it occurs among lineages [31, 32]. In Subheading 4.4
I examined the effect of moving a sequence to a different
location of the tree (Fig. 3, panels B–D). Here, we will re-analyze
trees A and D using the branch model in order to test the
hypothesis that ω is not constant across branches in the tree.
There are two possible ways of doing this: (1) assume that each
branch of the tree has a unique ω; or (2) modify the tree file in order
to group branches into two or more ω categories. The first approach is
generally not recommended, as the number of parameters grows
with the number of sequences on the tree.
There are good instructions in the manual on how to modify a
tree file in order to group branches by evolutionary rate. In short,
modifying the tree file involves adding a #<integer> or
$<integer> label at the appropriate locations. For instance, a branch
labelled #1 is assigned ω1, while
a $1 on a basal branch automatically assigns #1 to the whole
clade. All branches not assigned a specific ω value are treated as
evolving under ω0. Below, I have reproduced the tree file I used for
obtaining results presented here for the second strategy. As can be
seen, there are two standard Newick formatted trees, each with 13
sequences (hence the “13 2” in the first row). In both trees,
Sequence 1 has a #1 next to it, while all others have no additional
notations. This is equivalent to saying that I am interested in
estimating two separate ω values, and that all sites in all sequences
but Sequence 1 have their non-synonymous changes weighted by
ω0 in the Markov model (Subheading 4.1), and those of Sequence
1 are weighted by ω1.
13 2
(1 #1,2,((3,(4,5)),(6,(7,(8,((((9,11),10),12),13))))));
(12,2,((3,(4,5)),(6,(7,(8,((((9,11),10),1 #1),13))))));
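Before running an analysis it can be worth confirming that the labels in the tree file sit where intended. The following Python sketch reads the file reproduced above and lists which tips carry a # label in each tree. It is a simple text scan with hypothetical function names, it assumes numeric tip names as in this example, and it does not handle labels placed on internal branches or $ clade labels.

```python
import re

TREE_FILE = """13 2
(1 #1,2,((3,(4,5)),(6,(7,(8,((((9,11),10),12),13))))));
(12,2,((3,(4,5)),(6,(7,(8,((((9,11),10),1 #1),13))))));
"""


def parse_tree_file(text):
    """Return (number of sequences, list of Newick strings)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    n_seqs, n_trees = map(int, lines[0].split())
    return n_seqs, lines[1:1 + n_trees]


def labelled_tips(newick):
    """Map tip name -> omega class for tips followed by '#<integer>'."""
    return {tip: int(label) for tip, label in re.findall(r"(\d+)\s*#(\d+)", newick)}


def tip_names(newick):
    """All tip names, after stripping '#' and '$' labels."""
    return re.findall(r"\d+", re.sub(r"[#$]\d+", "", newick))


n_seqs, trees = parse_tree_file(TREE_FILE)
for tree in trees:
    print(len(tip_names(tree)), "tips; labelled:", labelled_tips(tree))
```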
The tree for this data-set has 23 branches (Fig. 3). Thus, under
strategy one, where each branch has a unique ω, 23 ω values are
estimated:
lnL(ntime: 23 np: 47): -1116.995163 +0.000000
...
kappa (ts/tv) = 2.29522
w (dN/dS) for branches: 0.08998 0.57791 999.00000 999.00000 999.00000 999.00000
999.00000 0.82760 0.59207 999.00000 999.00000 1.14794 0.48694 1.04106
0.24161 999.00000 999.00000 0.08994 0.00010 0.40932 0.30594 0.67798
2.28343
branch t N S dN/dS dN dS N*dN S*dS
14..1 0.027 234.3 38.7 0.0900 0.0037 0.0408 0.9 1.6
14..2 0.076 234.3 38.7 0.5779 0.0229 0.0396 5.4 1.5
14..15 0.160 234.3 38.7 999.0000 0.0621 0.0001 14.5 0.0
15..16 0.022 234.3 38.7 999.0000 0.0087 0.0000 2.0 0.0
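The parameter counts reported on the lnL lines (np) can be reconstructed with a little arithmetic. The Python sketch below assumes that np counts one length per branch of an unrooted binary tree, one κ, and one parameter per ω category, with codon frequencies not counted as free parameters. That assumption reproduces the np values shown in the outputs in this section, but it is an inference from those numbers rather than a statement of how CODEML counts parameters in general.

```python
def n_branches_unrooted(n_tips):
    """Number of branches in an unrooted, fully resolved tree."""
    return 2 * n_tips - 3


def n_params(n_tips, n_omega, kappa_estimated=True):
    """Branch lengths + kappa (if estimated) + number of omega values."""
    return n_branches_unrooted(n_tips) + (1 if kappa_estimated else 0) + n_omega


print("branches:", n_branches_unrooted(13))
print("one-ratio model:", n_params(13, 1))
print("free-ratio model:", n_params(13, 23))
print("two-ratio model:", n_params(13, 2))
```

The jump from one ω to 23 is what makes the first strategy hard to justify for a data-set of only 91 codons.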
[Figure 5 image: two 13-tip trees, panels A and B, drawn with branch lengths; scale bars 0.1 (A) and 0.05 (B)]
Fig. 5 Phylogenetic tree for the 13 HIV sequences plotted with branch lengths inferred using CODEML. Branch
lengths indicated on the scales are the expected number of substitutions per codon. (A) The same as tree A in
Fig. 3. (B) The same as tree D in Fig. 3
Tree A
lnL(ntime: 23 np: 26): -1132.801466 +0.000000
kappa (ts/tv) = 2.27454
w (dN/dS) for branches: 0.98143 0.08790
branch t N S dN/dS dN dS N*dN S*dS
14..1 0.027 234.4 38.6 0.0879 0.0037 0.0420 0.9 1.6
14..2 0.075 234.4 38.6 0.9814 0.0250 0.0255 5.9 1.0
14..15 0.159 234.4 38.6 0.9814 0.0530 0.0540 12.4 2.1
15..16 0.022 234.4 38.6 0.9814 0.0074 0.0076 1.7 0.3
Tree D
lnL(ntime: 23 np: 26): -1233.751712 -100.950246
kappa (ts/tv) = 2.57528
w (dN/dS) for branches: 0.99875 999.00000
dN & dS for each branch
branch t N S dN/dS dN dS N*dN S*dS
14..12 0.176 233.5 39.5 0.9987 0.0588 0.0589 13.7 2.3
14..2 0.282 233.5 39.5 0.9987 0.0938 0.0940 21.9 3.7
14..15 0.000 233.5 39.5 0.9987 0.0000 0.0000 0.0 0.0
15..16 0.051 233.5 39.5 0.9987 0.0169 0.0169 3.9 0.7
4.5.3 Branch-Site Models
The branch-site models combine the models above, and allow for
heterogeneity both among sites and among branches [33–36]. They achieve
this in a hierarchical fashion, where some proportion of the sites are
modelled under a “sites model” and the remainder are modelled
under the “branch model.” The number of ω values to be estimated
is determined by the number of “site” categories, and the
number of “branch” categories. When the model is specified,
the user defines the number of categories to be modelled, which
is the number of “site” categories plus 1. The additional category
accommodates the different ω values specified in the tree file.
The available models increase in complexity, and are named
“Clade Model A” through to “Clade Model D.” In Model A
there are four categories of sites: (1) sites under purifying selection
(0 < ω0 < 1); (2) neutral sites (ω1 = 1); (3) sites that are under
positive selection in one clade/branch (ω2 > 1), but under purifying
selection in the rest of the tree (0 < ω0 < 1); and (4) sites that
are under positive selection in one clade/branch (ω2 > 1), but are
neutral in the rest of the tree (ω1 = 1). In this model, ω0 and
ω2 are estimated from the data, while ω1 is fixed at 1. The model
also estimates the proportion of sites in categories 1 and 2.
The remaining sites are split between categories 3 and 4 in proportion
to the sites in categories 1 and 2, respectively. Model B
[Figure image: log-likelihood (y-axis, roughly −1130 to −1100) for models m0 and mdk3; caption not recovered]
5 Model Selection
$\Delta = 2\,\bigl(\ln L(\text{complex}) - \ln L(\text{simple})\bigr)$
An outcome of the “nested models” rule is that the model with
more parameters should always fit the data at least as well, and thus should
have a log-likelihood value that is at least as high. Therefore, Δ will never be
negative, and is expected to be distributed according to a χ² distribution,
with degrees of freedom specified by the difference in the
number of parameters between the two models (which, again,
because these models are nested, will always be larger
than 0). We can then ask what is the probability of observing a
Δ at least this large by chance if the simpler model were true. If such a Δ is unlikely, then we might consider
accepting the more complex model, as the improvement in fit is
greater than we would expect by chance. The threshold probability
below which we are willing to accept the more complex model is
arbitrary, and should be carefully considered before the analyses
begin (or perhaps even before any data is collected). However, it is
generally accepted to be 0.05.
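As a worked illustration, the test can be applied to two of the fits reported in Subheading 4.5: the one-ratio “M0” model (lnL = −1133.892429, np = 25) and the two-ratio branch model for Tree A (lnL = −1132.801466, np = 26). The Python sketch below assumes that both were fitted to the same tree and data, so that the models are nested. The function names are hypothetical, and the p-value helper covers only the one-degree-of-freedom case, where the χ² tail probability equals erfc(√(Δ/2)).

```python
import math


def lrt(lnl_simple, np_simple, lnl_complex, np_complex):
    """Return (delta, df) for a likelihood ratio test of nested models."""
    delta = 2.0 * (lnl_complex - lnl_simple)
    df = np_complex - np_simple
    return delta, df


def chi2_sf_df1(x):
    """Upper tail probability of a chi-squared distribution with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))


# M0 versus the two-ratio branch model on Tree A (values from the outputs above).
delta, df = lrt(-1133.892429, 25, -1132.801466, 26)
p_value = chi2_sf_df1(delta)
print(f"delta = {delta:.6f}, df = {df}, p = {p_value:.3f}")
print("accept the more complex model at 0.05:", p_value < 0.05)

# M0 versus the free-ratio model (one omega per branch); p-value not computed here.
delta_free, df_free = lrt(-1133.892429, 25, -1116.995163, 47)
print(f"delta = {delta_free:.6f}, df = {df_free}")
```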
The assumption that Δ follows a χ² distribution does not always
hold [39]. The authors of CODEML have attempted to determine
empirically whether this assumption holds for many cases [38, 39].
These are carefully outlined in the manual, and the numerous
papers produced during the development of CODEML. Unless it
has been specifically demonstrated for the model comparison you
wish to undertake, it might not be prudent to expect this assump-
tion to hold. In the CODEML manual the authors outline which
models can be compared using this approach.
Finally, it is important to note that there is an expectation, not
often explicitly stated, that the two models being compared differ
only in their assumptions about ω. All other parameters are, for the
most part, kept constant, as is the data. Thus, the topology should
be the same, as should be the codon frequency model. If κ is fixed in
7 Final Remarks
8 Notes
1. CODEML does not allow for stop codons in its data. This is often a
source of frustration among PAML users, as the program will
often end abruptly with no warning or error message if there are
stop codons in the input sequences. If the program stops while
reading the input sequences, it is quite possible this is the issue.
One can manually check sequences using the ExPASy Translate tool.
One should also ensure that CODEML is set to read the codons
using the appropriate translation table (e.g., universal code).
2. There are different strategies employed to identify the best κ
value to use in an analysis. Some decide to use κ values obtained
from the literature, others decide to estimate κ from the data. If
estimating κ from the data, it is often the case that κ is first
estimated with ω fixed at 1. Then κ is fixed at the
estimated value, and ω is estimated. Ideally, one should
perform a sensitivity analysis, where changes in the estimates of
ω are explored along a range of sensible κ values.
3. In order to automate the process of replication, I have written a
bash script that creates the necessary control files and re-runs
the models n times (which can be downloaded at:
https://github.com/andersgs/measuringNaturalSelection).
4. When performing model selection with CODEML one of the
easiest traps to fall into is attempting to analyze all possible
models. This strategy can easily produce a set of
models that differ only slightly from one another, with no statistically
significant differences among them, but with very different
biological interpretations. Thinking carefully about which set of models to examine
before starting the analysis can help avoid this common pitfall.
For instance, if one is interested in whether there is evidence for
differences in selective pressure across multiple lineages, then
perhaps the suite of branch models might be sufficient.
Ultimately, there are no simple heuristics for selecting the “correct”
model. One’s prior expectations along with one’s question
dictate which models are most appropriate. Finally, if one does end up
in such a situation, picking the simplest model is usually the
recommended approach; however, it is not necessarily the best
approach. One should use the information at one’s
disposal to argue for a more complex model if the biology is such
that the simpler model does not seem reasonable.
5. The p-value of the LRT can be determined by using standard
statistical tables. It can also be estimated in R by using the standard
function pchisq(). In the example given in the text
(Δ = 72.34, and degrees of freedom = 3), the p-value can be
obtained with the following command:
pchisq(q = 72.34, df = 3, lower.tail = FALSE)
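For readers working outside R, the same tail probability can be computed without extra libraries. The Python sketch below is a hypothetical helper, not part of PAML or R. It uses the closed-form expressions for the upper tail of the χ² distribution that hold when the degrees of freedom are a positive integer: a finite sum for even values, and a complementary error function plus a finite sum for odd values.

```python
import math


def chi2_sf(x, df):
    """Upper tail probability P(X > x) for a chi-squared variable with
    integer degrees of freedom df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # exp(-x/2) * sum_{i=0}^{df/2 - 1} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # erfc(sqrt(x/2)) + exp(-x/2) * sum_{j=1}^{(df-1)/2} (x/2)^(j-1/2) / Gamma(j+1/2)
    total = math.erfc(math.sqrt(half))
    term = math.exp(-half) * math.sqrt(half) / math.gamma(1.5)
    for j in range(1, (df - 1) // 2 + 1):
        total += term
        term *= half / (j + 0.5)
    return total


# The example from Note 5: delta = 72.34 with 3 degrees of freedom.
print(f"p = {chi2_sf(72.34, 3):.3e}")
```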
Acknowledgements
References
1. Darwin C, Wallace A (1858) On the tendency of species to form varieties; and on the perpetuation of varieties and species by natural means of selection. J Proc Linn Soc 3:45–62
2. Endler JA (1986) Natural selection in the wild. Princeton University Press, Princeton
3. Grant PR, Grant BR (2006) Evolution of character displacement in Darwin’s finches. Science 313:224–226
4. Luikart G, England PR, Tallmon DA et al (2003) The power and promise of population genomics: from genotyping to genome typing. Nat Rev Genet 4:981–994
5. Beutler B, Jiang Z, Georgel P et al (2006) Genetic analysis of host resistance: toll-like receptor signaling and immunity at large. Ann Rev Immunol 24:353–389
6. Kimura M (1968) Genetic variability maintained in a finite population due to mutational production of neutral and nearly neutral isoalleles. Genet Res 11:247–269
7. King JL, Jukes TH (1969) Non-Darwinian evolution. Science 164:788–798
8. Yang Z (2007) PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol 24:1586–1591
9. Yang Z (1997) PAML: a program package for phylogenetic analysis by maximum likelihood. Bioinformatics 13:555–556
10. Nei M, Gojobori T (1986) Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol 3:418–426
11. Muse SV, Gaut BS (1994) A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol Biol Evol 11:715–724
12. Yang Z, Nielsen R (2000) Estimating synonymous and nonsynonymous substitution rates under realistic evolutionary models. Mol Biol Evol 17:32–43
13. Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism. Academic Press, New York, pp 21–123
14. Yang Z, Yoder AD (1999) Estimation of the transition/transversion rate bias and species sampling. J Mol Evol 48:274–283
15. Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
16. Goldman N, Yang Z (1994) A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol Biol Evol 11:725–736
17. Bielawski JP, Yang Z (2005) Maximum likelihood methods for detecting adaptive protein evolution. In: Nielsen R (ed) Statistical methods in molecular evolution. Springer, New York, pp 103–124
18. Li WH, Wu CI, Luo CC (1985) A new method for estimating synonymous and nonsynonymous rates of nucleotide substitution considering the relative likelihood of nucleotide and codon changes. Mol Biol Evol 2:150–174
19. Leitner T, Kumar S, Albert J (1997) Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. J Virol 71:4761–4770
20. Yang Z, Nielsen R, Goldman N et al (2000) Codon-substitution models for heterogeneous selection pressure at amino acid sites. Genetics 155:431–449
21. Grantham R, Gautier C, Gouy M et al (1980) Codon catalog usage and the genome hypothesis. Nucleic Acids Res 8:r49–r62
22. Duret L (2002) Evolution of synonymous codon usage in metazoans. Curr Opin Genet Dev 12:640–649
23. Akashi H (1995) Inferring weak selection from patterns of polymorphism and divergence at “silent” sites in Drosophila DNA. Genetics 139:1067–1076
24. Sharp PM, Averof M, Lloyd AT et al (1995) DNA sequence evolution: the sounds of silence. Philos T Roy Soc B 349:241–247
25. Yang Z, Nielsen R (2008) Mutation-selection models of codon substitution and their use to estimate selective strengths on codon usage. Mol Biol Evol 25:568–579
26. Thorne JL, Kishino H, Felsenstein J (1992) Inching toward reality: an improved likelihood model of sequence evolution. J Mol Evol 34:3–16
27. Yang Z (1994) Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol 39:306–314
28. Yang Z, Swanson WJ (2002) Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Mol Biol Evol 19:49–57
29. Nielsen R (1997) The ratio of replacement to silent divergence and tests of neutrality. J Evol Biol 10:217–231
30. Nielsen R, Yang Z (1998) Likelihood models for detecting positively selected amino acid sites and applications to the HIV-1 envelope gene. Genetics 148:929–936
31. Yang Z (1998) Likelihood ratio tests for detecting positive selection and application to primate lysozyme evolution. Mol Biol Evol 15:568–573
32. Yang Z, Nielsen R (1998) Synonymous and nonsynonymous rate variation in nuclear genes of mammals. J Mol Evol 46:409–418
33. Yang Z, Nielsen R (2002) Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol Biol Evol 19:908–917
34. Bielawski JP, Yang ZH (2004) A maximum likelihood method for detecting functional divergence at individual codon sites, with application to gene family evolution. J Mol Evol 59:121–132
35. Yang Z, Wong WSW, Nielsen R (2005) Bayes empirical Bayes inference of amino acid sites under positive selection. Mol Biol Evol 22:1107–1118
36. Zhang J, Nielsen R, Yang Z (2005) Evaluation of an improved branch-site likelihood method for detecting positive selection at the molecular level. Mol Biol Evol 22:2472–2479
37. Yang Z, Bielawski J (2000) Statistical methods for detecting molecular adaptation. Trends Ecol Evol 15:496–503
38. Whelan S, Goldman N (1999) Distribution of statistics used for comparison of models of sequence evolution in phylogenetics. Mol Biol Evol 16:1292–1299
39. Anisimova M, Bielawski JP, Yang Z (2001) Accuracy and power of the likelihood ratio test in detecting adaptive molecular evolution. Mol Biol Evol 18:1585–1592
40. Felsenstein J (2004) Inferring phylogenies. Sinauer Associates, Inc., Sunderland, MA
41. Nei M, Suzuki Y, Nozawa M (2010) The neutral theory of molecular evolution in the genomic era. Ann Rev Genom Hum G 11:265–289
42. Whelan S, Goldman N (2004) Estimating the frequency of events that cause multiple-nucleotide changes. Genetics 167:2027–2043
43. Kosiol C, Holmes I, Goldman N (2007) An empirical codon model for protein sequence evolution. Mol Biol Evol 24:1464–1479
44. Harrisson KA, Pavlova A, Telonis-Scott M et al (2014) Using genomics to characterize evolutionary potential for conservation of wild populations. Evol Appl. doi:10.1111/eva.12149
Chapter 14
Inferring Trees
Simon Whelan and David A. Morrison
Abstract
Molecular evolution can reveal the relationship between sets of homologous sequences and the patterns of
change that occur during their evolution. An important aspect of these studies is the inference of a
phylogenetic tree, which explicitly describes evolutionary relationships between homologous sequences.
This chapter provides an introduction to evolutionary trees and how to infer them from sequence data
using some commonly used inferential methodology. It focuses on statistical methods for inferring trees
and how to assess the confidence one should have in any resulting tree, with a particular emphasis on the
underlying assumptions of the methods and how they might affect the tree estimate. There is also some
discussion of the underlying algorithms used to perform tree search and recommendations regarding the
performance of different algorithms. Finally, there are a few practical guidelines, including how to combine
multiple software packages to improve inference, and a comparison between Bayesian and Maximum
likelihood phylogenetics.
Key words Phylogenetic inference, Evolutionary trees, Maximum likelihood, Parsimony, Distance
methods, Review
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_14, © Springer Science+Business Media New York 2017
2.1 Assumptions About the Sequence Data
Nearly all methods of tree inference take as input a multiple sequence
alignment, often referred to as the data matrix (see Note 1). Here,
this matrix shall be a set of homologous nucleotide sequences,
meaning that they have all arisen by descent from the same locus in
their common ancestor [4, 12–14]. The columns of this matrix
represent the homology relationships between the nucleotide characters
in the sequences, so that if an A from one sequence is in the
same column as a G from another sequence, the two must
have evolved from the same ancestral nucleotide through a
series of substitutions. Indels (a contraction of ‘insertions and deletions’,
often referred to as gaps) mark one or more characters
that are missing from a sequence and are shown with a ‘-’ character. Indels arise
either through a deletion (where the characters have been lost in the lineage leading to
that sequence) or through an insertion (where an ancestor of the other
sequences in the column acquired extra characters that this
sequence lacks). There may also be ambiguity characters, such as ‘W’
(weak), representing either A or T, or missing data characters, such
as ‘X’ or ‘N.’
When inferring trees it is important to realize that most of the
methods described here use only substitutions to infer trees, and
they completely ignore missing data and indel characters (i.e., gaps
are treated as missing) [15]. They often also treat ambiguity char-
acters as missing data. A small number of methods do attempt to
incorporate information about insertions and deletions when infer-
ring trees, but these are beyond the scope of an introductory text
[16–18]. The accuracy of the alignment is also known to be very
important for tree inference [14, 19–21]. Chapter 8 discusses the
multiple sequence alignment problem in detail, so it is sufficient to
say here that the approach to multiple sequence alignment can have
a substantial effect on downstream analyses, including the tree
estimate and the confidence one has in that tree estimate.
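The treatment of gaps and ambiguity codes as missing data can be illustrated with a simple pairwise distance. The following is a minimal sketch, not taken from any particular package: the function name `p_distance` and the toy alignment are assumptions made for illustration, and real programs handle ambiguity codes in more refined ways.

```python
def p_distance(seq_a, seq_b, valid="ACGT"):
    """Proportion of differing sites between two aligned sequences.

    Columns in which either sequence has a gap, an ambiguity code or a
    missing-data character are skipped, i.e. treated as missing data.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in valid or b not in valid:
            continue  # gap ('-'), ambiguity (e.g. 'W'), or missing ('N', 'X')
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return differences / compared


# Hypothetical toy alignment, invented for this example.
alignment = {
    "Seq1": "ACGT-ACGTN",
    "Seq2": "ACGTTACWTA",
    "Seq3": "ACCTTA-GTA",
}

for first, second in [("Seq1", "Seq2"), ("Seq1", "Seq3"), ("Seq2", "Seq3")]:
    print(first, second, p_distance(alignment[first], alignment[second]))
```

Note that each pair is compared over a different set of columns, which is one reason why alignments rich in gaps can give less reliable estimates.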
2.2 The Tree Assumption
All of the methods discussed here take the data matrix and use it to
estimate a tree, but aiming to infer a tree requires some assumptions
about the evolutionary process. The most important set of assump-
tions are related to the sequences evolving in a tree-like manner (see
Note 2), which involves assuming that all sequences share a com-
mon ancestor and that sequences evolving along all of the branches
in the tree evolve independently. Violations of the former assump-
tion occur when unrelated regions are included in the data. This
occurs, for example, when only subsets of protein domains are
shared between sequences, or when data have entered the tree
from other sources such as sequence contamination, gene flow
(e.g., hybridization, lateral gene transfer), or mobile genetic ele-
ments, such as transposons. Violations of the second assumption
occur when information in one part of a tree affects sequences in
another part. This occurs, for example, in gene families under gene
conversion or gene flow.
Before assuming a bifurcating tree for phylogenetic analyses
one should try to ensure these implicit assumptions are not vio-
lated. This is done by first exploring the data without assuming a
tree structure [22]. Possible tree-independent approaches include
spectral analysis [23] and splits graphs [24], the latter now being a
very popular choice. When the first set of assumptions is not met,
one should consider whether the evolution of the sequences could
be better represented by a network rather than a tree, as this
incorporates reticulations representing gene flow [25, 26].
The second set of assumptions relates to assigning directional-
ity to the evolutionary process. When we have directional
[Figure 1: tree diagrams in two panels, A and B, with tips Seq1–Seq6 and Out1 and the annotations ‘Root’, ‘Inferred Root’, and ‘Outgroup’]
Fig. 1 Two common forms of bifurcating tree are used in phylogenetics. (a)
Rooted trees make explicit assumptions about the most recent common ancestor
of sequences and can imply directionality of the evolutionary process. (b)
Unrooted trees assume a time-reversible evolutionary process. The root of
sequences 1–6 can be inferred by adding an outgroup
2.3 Model Assumptions
Details of the nucleotide substitution models used when inferring
trees are provided in Chapters 13 and 15, but here we discuss how
some of the common assumptions might affect tree inference. Nearly
all of the models that are widely used for the evolution of nucleotides
make three key assumptions [15]: (1) reversibility, which means that
observations about the evolutionary process seem to be the same
going in any direction across the tree; (2) stationarity, which means
that the frequencies with which we see nucleotides (or amino acids)
are the same throughout the tree; and (3) homogeneity, which
means that we expect, on average, the same substitution process to
be acting throughout the tree. The methods of tree inference
described later are thought to work relatively well when these
assumptions approximately hold [11]. When these assumptions are
seriously violated it is often referred to as model misspecification,
which can have serious effects on the quality of the trees inferred, by
causing phylogenetic artifacts. Two widely known artifacts are long
branch attraction, where long and often deep branches in a tree
incorrectly group together [22, 27, 28], and compositional artifacts,
3 Scoring Trees
3.1 Scores
Fortunately, there is now a broad consensus that statistical methods
provide the most reliable results, since they use the information
stored in the sequence data most efficiently and they may be rela-
tively robust to modeling errors. All statistical methods use a likeli-
hood function to assess how likely the alignment of sequences is to
occur, conditional on the substitution model used and the phylo-
genetic tree (see Box 1). Statistical methods come in two flavors:
maximum likelihood (ML), which searches for the tree that max-
imizes the likelihood function, and Bayesian inference, which sam-
ples trees in proportion to the likelihood function and prior
expectations. The primary strength behind statistical methods is
that they are based on established and reliable mathematical and
statistical methodology that has been applied to many areas of
research, from classical population genetics to modeling world
economies [31]. By using substitution models, statistical methods
allow us to capture and correct for the important evolutionary
forces known to affect sequence evolution, which in turn leads to
more accurate estimates of evolutionary trees. Statistical methods
are also statistically consistent: under an accurate evolutionary
model they tend to converge to the ‘true tree’ as progressively
longer sequences are used [32, 33]. This, and other associated
properties, enables statistical methodology to produce high-quality
phylogenetic estimates with a minimum of bias under a wide range
of conditions. Statistical methods are computationally intensive,
which has been an issue in the past, but modern programs and
computing power mean that we can now use these statistical meth-
ods even on thousands of sequences [34, 35].
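To make the idea of a likelihood function concrete, the sketch below scores two alternative topologies for a toy alignment. It assumes the Jukes–Cantor (JC69) substitution model and uses the standard pruning recursion to sum over unobserved ancestral states. The tree representation (nested lists of `(child, branch_length)` pairs), the branch lengths and the alignment are all assumptions chosen for illustration, and no attempt is made to optimize branch lengths as a real ML program would.

```python
import math

BASES = "ACGT"


def jc69_prob(i, j, t):
    """JC69 probability that base i is replaced by base j along a branch
    of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def conditional_likelihoods(node, site):
    """P(data below node | state at node) for each possible state."""
    if isinstance(node, str):  # leaf: the name of a sequence
        base = site[node]
        if base not in BASES:  # gap or ambiguity: treated as missing data
            return {b: 1.0 for b in BASES}
        return {b: (1.0 if b == base else 0.0) for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:  # internal node: list of (child, branch length)
        below = conditional_likelihoods(child, site)
        for b in BASES:
            result[b] *= sum(jc69_prob(b, c, t) * below[c] for c in BASES)
    return result


def log_likelihood(tree, alignment):
    """Sum of per-site log-likelihoods, assuming independent sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for col in range(n_sites):
        site = {name: seq[col] for name, seq in alignment.items()}
        at_root = conditional_likelihoods(tree, site)
        total += math.log(sum(0.25 * at_root[b] for b in BASES))
    return total


# Hypothetical data: Seq1/Seq2 and Seq3/Seq4 share two derived differences.
alignment = {"Seq1": "AAGTC", "Seq2": "AAGTC", "Seq3": "GAGCC", "Seq4": "GAGCC"}


def cherry(x, y):
    """Two tips joined at a node, each on a branch of (assumed) length 0.1."""
    return [(x, 0.1), (y, 0.1)]


tree_12_34 = [(cherry("Seq1", "Seq2"), 0.05), (cherry("Seq3", "Seq4"), 0.05)]
tree_13_24 = [(cherry("Seq1", "Seq3"), 0.05), (cherry("Seq2", "Seq4"), 0.05)]

print(log_likelihood(tree_12_34, alignment))
print(log_likelihood(tree_13_24, alignment))
```

With these data the topology grouping Seq1 with Seq2 receives the higher log-likelihood, because it explains the two variable columns with a single change each on the internal branch.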
Parsimony counts the minimum number of changes required
on a tree to describe the observed variation in the sequence data.
Parsimony is intuitive to understand because it reconstructs the
3.2 Why Estimating Trees Is Difficult
The effective and accurate estimation of phylogenetic trees remains
difficult, despite their wide-ranging importance in experimental
and computational studies. Statistical and parsimony methods
need to identify the highest scoring (optimal) tree, which can
only be done by searching through the entirety of tree space [11].
This approach of exhaustive tree search is infeasible because the size
of tree space increases rapidly with the number of sequences. For 50
sequences, there are approximately 10^76 possible trees, a number
comparable to the estimated number of atoms in the observable
universe. This necessitates heuristic tree search methods for search-
ing tree space that speed up computation, often at the expense of
accuracy. The phylogenetic tree estimation problem is statistically
unusual, and there are few well-studied examples from other
research disciplines to draw on for heuristics [15].
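The growth of tree space can be checked directly. For n labelled sequences, the number of unrooted bifurcating trees is (2n − 5)!! and the number of rooted bifurcating trees is (2n − 3)!!, where !! denotes the double factorial. The short sketch below (function names are our own) evaluates both. For 50 sequences the rooted count is of the order of 10^76 and the unrooted count of the order of 10^74.

```python
def double_factorial(n):
    """Product n * (n - 2) * (n - 4) * ... down to 1 or 2."""
    result = 1
    while n > 1:
        result *= n
        n -= 2
    return result


def count_unrooted_trees(n_taxa):
    """Number of unrooted bifurcating trees on n_taxa labelled tips."""
    if n_taxa < 3:
        raise ValueError("need at least three sequences")
    return double_factorial(2 * n_taxa - 5)


def count_rooted_trees(n_taxa):
    """Number of rooted bifurcating trees on n_taxa labelled tips."""
    if n_taxa < 2:
        raise ValueError("need at least two sequences")
    return double_factorial(2 * n_taxa - 3)


for n in (4, 5, 10, 20, 50):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```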
Consequently, there has been a lot of active research into
methodology for finding the optimal tree using a variety of novel
heuristic algorithms. Many of these approaches take a ‘hill-climb-
ing’ optimization approach and progressively look to improve the
tree estimate by iteratively examining the score of nearby trees, then
making the highest scoring the new best estimate, and stopping
when no further improvements can be found. The nature of these
heuristics means that there is often no way of deciding whether the
newly discovered optimum is the globally best tree or whether it is
one of many other local optima in tree space (although see [40], for
a description of a Branch and Bound algorithm).
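The hill-climbing strategy, and its vulnerability to local optima, can be sketched independently of any tree representation. In the example below the ‘tree space’ is replaced by a hypothetical one-dimensional landscape of scores, which is purely an illustrative stand-in; in a real program the `neighbors` function would propose rearranged trees and `score` would return a likelihood or parsimony score.

```python
def hill_climb(start, neighbors, score):
    """Greedy search: move to the best-scoring neighbor until none improves."""
    current = start
    current_score = score(current)
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current, current_score
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= current_score:
            return current, current_score  # no neighbor is better: stop
        current, current_score = best, best_score


# Toy landscape with a local optimum (index 2) and a global optimum (index 6).
landscape = [1, 3, 5, 4, 2, 6, 9, 7]


def neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(landscape)]


def score(i):
    return landscape[i]


print(hill_climb(0, neighbors, score))  # stops at the local optimum
print(hill_climb(4, neighbors, score))  # reaches the global optimum
```

The two runs differ only in their starting point, which is why the choice of starting tree and the use of resampling matter so much in practice.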
By acknowledging this problem and applying phylogenetic
software to its full potential, it is possible to produce good estimates
of trees that exhibit many characteristics of the sequences’ evolu-
tionary history [41].
4.1 Proposing an Initial Tree
There is no substitute for a good starting topology when inferring
trees. An initial tree can use distance-based clustering methods,
chosen for computational speed, most commonly Neighbor-
Joining [38]. Occasionally, more sophisticated approaches such as
quartet puzzling [43] are used, which can be highly effective for
smaller datasets, but may not scale so well for large numbers of
sequences. An alternative, widely used approach is to use sequence-
based clustering algorithms [11, 40]. These use something akin to
full statistical or parsimony approaches. A popular choice is stepwise
addition (Fig. 2a), which starts with a tree of three sequences
and adds the remaining sequences one at a time, in random order, each at the
location that maximizes the scoring criterion. Since the proposed
[Figure 2: tree-search diagrams in three rows. First row (A), step labels: ‘Add Seq4 to branch leading to Seq3’ and ‘Add Seq5 to branch leading to Seq2’. Second row, step labels: ‘Add branch separating Seq2 and Seq5’ and ‘Add branch separating Seq3 and Seq4’. Third row (C), step labels: ‘Cut current branch under consideration, leaving part of the original tree and the subtree’ and ‘Attach branch leading to subtree to a branch on the remainder of the original tree’]
Fig. 2 Common algorithms used in tree search. The first two rows show common sequence-based clustering
algorithms that are frequently used to propose trees: (a) stepwise addition progressively adds sequences to a
tree at a location that maximizes the score function; and (b) star decomposition starts with a topology with no
defined internal branches and serially adds branches that maximize the score function. The final row, (c),
shows the subtree-pruning-and-regrafting (SPR) tree rearrangement algorithm. Dotted arrows demonstrate
some potential regrafting points
4.2 Refining the Tree Estimate
Refining the tree estimate is the heuristic optimization step, which
uses an iterative procedure similar to hill climbing that stops when
no further improvement can be found. For each iteration a traversal
scheme is used to move around tree space and propose a set of
candidate trees from the current tree estimate. Each tree is assessed
using the score function and one (or more) trees are chosen as the
starting point for the next round of iteration. The set of candidate
trees is proposed by traversal schemes that make small rearrangements
to the current tree, usually examining each internal branch of
the tree in turn; the schemes differ in how they propose new trees from each branch.
The most effective method is probably subtree-pruning-and-
regrafting (SPR) [44], with nearest neighbor interchange (NNI)
being a more restricted simplification, and tree bisection and recon-
nection (TBR) being a generalized extension [45].
SPR generates candidate trees by breaking the internal branch
under consideration and proposes new trees by regrafting the resul-
tant subtree to each of the remaining branches of the original
topology and computing their likelihood, which is demonstrated
in Fig. 2c with three example regraftings (dotted arrows). The
number of trees proposed by SPR increases rapidly with the number
of sequences and makes pure SPR impractical for larger datasets.
Modern tree search programs add a further restriction to SPR that
makes the number of candidate trees linear with the number of
sequences by bounding the number of branches that a subtree can
move from its original position. The subtree in Fig. 2c, for example,
could be bounded in its movement to a maximum of two branches
(all branches not represented by a triangular subtree in the figure).
The simpler NNI algorithm merely breaks an internal branch to
produce four subtrees, which can be arranged to form two new
topologies per branch. This approach is very fast because it
considers only a small number of trees per step, but this is also the
weakness of NNI. The small number of trees means NNI can only
make small steps in tree space, which makes it more liable to getting
stuck at local optima during tree search than other more expansive
schemes [46]. On the other hand, TBR is a more generalized form
of SPR and allows any branch of the pruned subtree to be regrafted
on to the original tree and not just the one cut. In practice, TBR
does not seem much more effective than SPR and the large number
of trees that must be examined at each step in the iteration makes
TBR computationally impractical for large numbers of sequences.
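The NNI move is simple enough to write down in a few lines. The sketch below assumes that the four subtrees around an internal branch are given as arbitrary Python objects (here nested tuples or tip names) and that the current arrangement is ((a, b), (c, d)); the function names are our own. Because an unrooted bifurcating tree with n tips has n − 3 internal branches, it has 2(n − 3) NNI neighbors in total.

```python
def nni_neighbors(a, b, c, d):
    """Two alternative arrangements of the subtrees around one internal
    branch whose current arrangement is ((a, b), (c, d))."""
    return [((a, c), (b, d)), ((a, d), (b, c))]


def count_nni_neighbors(n_tips):
    """Number of NNI neighbors of an unrooted bifurcating tree."""
    if n_tips < 4:
        return 0  # no internal branches to rearrange
    return 2 * (n_tips - 3)


# Subtrees may be single tips or larger clades.
for topology in nni_neighbors("Seq1", "Seq2", ("Seq3", "Seq4"), "Seq5"):
    print(topology)

print(count_nni_neighbors(50))
```

The linear growth in the number of candidates is what makes NNI fast, and also what limits it to small steps in tree space.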
4.3 Stopping Criteria
Many phylogenetic software packages do not resample tree space
and stop after a single round of refinement. When resampling is
used, a stopping rule is required. Such rules are usually arbitrary, allowing
only a prespecified number of resamples or refinements. An
alternative is to base the stopping rule on how frequently improvements
in the overall optimal tree are observed [43].
4.4 Resampling from Tree Space
Sampling from one place in tree space and refining will lead to a
local likelihood optimum, which may or may not be the globally
optimal tree. Resampling expands the area of tree space searched by
the heuristic and may uncover better optima. Each optimum has an
area of tree space associated with it, and the majority of phyloge-
netic problems have potentially large numbers of local optima.
Resampling is achieved by starting the refinement procedure from
another point in tree space. Two of the many possible resampling
schemes are discussed here: stepwise addition and the ratchet.
Stepwise addition with random sequence ordering is a viable
resampling strategy, because adding sequences in a different order
tends to produce relatively good starting trees [34, 40]. An alter-
native approach is to try and use information from the current best
tree estimate as a means of finding other optima. The ratchet
procedure is one such approach and obtains a new starting tree by
reweighting the characters and then refining the current tree [41].
All resampling strategies can also be helped by tabu search, where
regions of tree space close to currently identified optima are not
allowed as new starting trees [42, 47].
4.6 Choosing a Model and Partitioning Data
In addition to the tree model (containing the topology and branch-length
estimates), statistical methods require a substitution model
that captures the major factors affecting sequence evolution in the
observed sequence data. Many models exist, with variable numbers
of parameters, associated with different assumptions about the
evolutionary forces acting on sequences during their evolution.
By far the most popular approach to choosing a model is to use a
formal model selection procedure. Here, an information-theoretic
approach is taken that attempts to measure the fit of each model to
the observed data, using distances derived from Akaike’s Informa-
tion Criterion (AIC) or the Bayesian Information Criterion (BIC),
which adjust the likelihoods for the number of parameters in the
model. The procedure for model selection is now fully automated,
with programs such as jModelTest [51] and ProtTest [52] capable
of selecting the best-fit nucleotide and amino acid model,
respectively.
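The two criteria are straightforward to compute once each candidate model has been fitted: AIC = 2k − 2 ln L and BIC = k ln n − 2 ln L, where k is the number of free parameters, ln L the maximized log-likelihood, and n the sample size. The sketch below uses invented log-likelihood values, counts only substitution-model parameters (branch lengths are shared by all candidates and so do not alter the ranking), and takes the alignment length as n, which is a common convention but an assumption here. It does not reproduce the output of jModelTest or any other program.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike's Information Criterion (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_sites):
    """Bayesian Information Criterion (lower is better)."""
    return n_params * math.log(n_sites) - 2 * log_likelihood


# Hypothetical fits: model -> (maximized log-likelihood, free parameters).
candidates = {
    "JC69": (-2150.0, 0),
    "HKY85": (-2100.0, 4),
    "GTR": (-2098.5, 8),
}
n_sites = 1000  # assumed alignment length

aic_scores = {m: aic(lnl, k) for m, (lnl, k) in candidates.items()}
bic_scores = {m: bic(lnl, k, n_sites) for m, (lnl, k) in candidates.items()}

best_aic = min(aic_scores, key=aic_scores.get)
best_bic = min(bic_scores, key=bic_scores.get)
print(aic_scores, best_aic)
print(bic_scores, best_bic)
```

In this made-up example the small gain in likelihood from the most parameter-rich model does not repay its extra parameters under either criterion.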
For protein coding sequences, this approach to model selection
requires users to decide in what form they should analyze their data,
for example, as nucleotides or amino acids. This arbitrary decision
can have a substantial effect on tree inference and has been
addressed using a more general form of model selection that
assesses both the model and the data type and is available in the
ModelOMatic program [53]. Similar approaches to model selec-
tion can be taken for RNA sequences as well using PHASE [54].
Performance-based model selection provides an alternative
approach and attempts to select models based both on their
model fit and how likely they are to result in inferential errors.
Although not widely used at present, performance-based model
selection has the potential to offer a valuable alternative to
information-theoretic approaches.
A second important factor to consider when selecting a model
for likelihood-based tree inference is possible nonindependence of
evolutionary pressures acting along the sequences. There are many
biological reasons for variation in these forces and they affect the
global applicability of the models along the sequences. One com-
mon solution to this potential problem is to use different models
for different parts of the sequence, which is called partitioning [55].
In principle the different partitions should have greater inter- than
intrapartition variability in substitution rates. The usual procedure
is to apply the above tests to different data subsets defined by
biological motivation (e.g., different genes, different codon posi-
tions, paired versus unpaired rRNA), which can be automated by
PartitionFinder [56]. Programs such as GARLI, MrBayes, and
RAxML can also accommodate partitioning.
An alternative approach to dealing with sequence heterogeneity
is through the use of mixture models. Here, the likelihood of each
character is calculated under more than one model, and these like-
lihoods are then combined. Such models have been developed for
5.1 Measures of General Branch Support and Bootstrapping
The probability of the point estimate of a tree being correct is
vanishingly small. Therefore, it is usual to consider the combined
branch support values as a general measure of confidence in a tree.
Each branch of a rooted tree represents a subtree or clade, which is
all of the descendants of a single common ancestor. The support
provided by the data for clades, and therefore branches, is of
primary importance.
There are many proposed ways to quantify this support, which
can be grouped into: (1) analytical procedures, such as interior-
branch tests, likelihood-ratio tests, clade significance, and the
incongruence length difference test; (2) statistical procedures,
such as the nonparametric bootstrap, posterior probabilities, the
jackknife, topology-dependent permutation, and clade credibility;
and (3) nonstatistical procedures, such as the decay index, clade
stability, data decisiveness, spectral signals, splits graphs, and clou-
dograms. From the third group, splits graphs are the most popular
approach [24], while cloudograms provide an alternative visualiza-
tion [60] (see Note 2).
From among the statistical procedures, the two most com-
monly used measures are bootstrap proportions and posterior
probabilities (Subheading 6). The nonparametric bootstrap uses a
resampling scheme to assess confidence in a tree on a branch-by-
branch basis [61]. Tree estimates are obtained for a large number of
simulated datasets, obtained by resampling the original data. That
is, sampling with replacement is used to produce a simulated data-
set by repeatedly drawing samples from the original data to make a
new dataset of suitable length. Bootstrap values are placed on
branches of the original point estimate of the tree, representing
the frequency that implied bipartitions are observed in the trees
obtained from the simulated data. This procedure assumes that if
5.2 Confidence Sets of Trees: The SH and AU Test
When one has a set of competing hypotheses, each represented by a
specific tree topology, one often wishes to test which of those
hypotheses (topologies) are supported and which can be rejected.
Often, this question is associated with the placement of an out-
group that roots a particular clade on the tree, for instance, when
one is studying the origins of the organelles [64, 65] or testing
molecular evidence for macroevolutionary transitions, such as those
in cetaceans [66].
There are two popular tests that are suitable for addressing this
type of question: the Shimodaira–Hasegawa (SH-)test [67] and the
Approximate-Unbiased (AU-)test [68]. These tests are variants of
the bootstrapping procedure discussed earlier. Instead of looking at
how frequently particular branches are recovered, they examine the
differences in likelihood between the ML tree for the bootstrapped
data and the other hypotheses in the resampled data. Both tests use
these differences in likelihood to form a confidence set of trees,
which can reject those hypotheses that fall below the critical value.
Both tests control the level of type I (false positive) error
successfully. In other words, the confidence interval is conservative
and does not place unwarranted confidence in a small number of
trees. The AU test is constructed in a subtly different manner to the
SH test, which removes a potential bias and can increase statistical
power in some cases. This difference may allow the AU test to reject
more trees than the SH test and produce tighter confidence inter-
vals (demonstrated in Fig. 3).
Readers may also come across the Kishino–Hasegawa (KH-)
test [69] for comparing tree hypotheses. The KH test is only
applicable when all of the potential hypotheses (trees) can be
[Figure 3: schematic of bootstrap resampling for sequences S1–S5. Columns of the original data are sampled with equal probability (0.2 for each of the five columns shown), and each column can be sampled multiple times. A tree is estimated from each resampled dataset, or approximated using RELL or other methods. Bootstrap support (1.00 and 0.75 in the example) is mapped onto the tree from the original data. Values shown for the five candidate topologies:
Bootstrap support  0.75  0.23  0.02  0.00  0.00
AU test            0.84  0.34  0.12  0.04  0.01
SH test            0.88  0.41  0.20  0.06  0.04]
Fig. 3 Two common ways to assess confidence in tree estimate are to use bootstrap resampling (1) to
provide a general measure of confidence in a tree or (2) for hypothesis testing. These approaches resample
with replacement columns from the original data matrix, to generate simulated datasets with properties
that reflect the original data. Full bootstrapping approaches analyze these data in the manner of the original
dataset to produce a tree estimate for each bootstrap replicate, although there are also heuristic
approaches to obtain these estimates. The results from the simulation can be summarized through the
tree estimates (for bootstrap values or support) or the difference in likelihood between the ML tree and
specific topologies (AU- and SH-test)
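The resampling step of the nonparametric bootstrap can be sketched as follows. Columns are drawn with replacement until the replicate has the same length as the original alignment, a tree is estimated from each replicate, and the support for each bipartition of the original tree is the fraction of replicates in which it reappears. The tree estimator is passed in as a function; the one used here, `toy_four_taxon_split`, is a deliberately crude stand-in for four sequences that picks the pairing supported by the most informative columns, and is not the method of any real program. All names and data are assumptions for illustration.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample alignment columns with replacement."""
    n_cols = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[i] for i in picks) for name, seq in alignment.items()}


def bootstrap_support(alignment, estimate_splits, n_replicates=100, seed=1):
    """Fraction of replicates recovering each bipartition of the original tree.

    estimate_splits is a user-supplied function mapping an alignment to a
    set of bipartitions, each represented as a frozenset of sequence names.
    """
    rng = random.Random(seed)
    reference = estimate_splits(alignment)
    counts = {split: 0 for split in reference}
    for _ in range(n_replicates):
        replicate_splits = estimate_splits(bootstrap_alignment(alignment, rng))
        for split in reference:
            if split in replicate_splits:
                counts[split] += 1
    return {split: count / n_replicates for split, count in counts.items()}


def toy_four_taxon_split(alignment):
    """Toy estimator for exactly four sequences: return the pairing that is
    supported by the largest number of 'xxyy'-type columns."""
    a, b, c, d = sorted(alignment)
    pairings = {b: (c, d), c: (b, d), d: (b, c)}  # partner of a -> other pair

    def informative_sites(partner):
        x, y = pairings[partner]
        columns = zip(alignment[a], alignment[partner], alignment[x], alignment[y])
        return sum(1 for col in columns
                   if col[0] == col[1] and col[2] == col[3] and col[0] != col[2])

    best = max(pairings, key=informative_sites)
    return {frozenset((a, best))}


# Hypothetical data: three columns favor S1+S2, one column favors S1+S3.
alignment = {"S1": "ACGTAAGA", "S2": "ACGTAAGC", "S3": "TCGAACGA", "S4": "TCGAACGC"}
print(bootstrap_support(alignment, toy_four_taxon_split, n_replicates=200))
```

Replacing the toy estimator with a call to a real tree search is what makes full bootstrapping expensive, since the search is repeated once per replicate.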
5.3 Other Tests of Tree Topologies
The methods described earlier are very general and used by many
different tree search programs. There are, however, several
program-specific approaches to assessing the confidence in a tree
worth mentioning. Both RAxML and IQPNNI have implemented
different fast approximations to the standard bootstrap [34, 70,
71]. These fast bootstrapping approaches use information gained
during tree search to speed up the assessment of bootstrapped trees
and thus reduce computational costs. An alternative approach, used
in PhyML, is the approximate likelihood ratio test (aLRT), which
performs a per branch statistical test to produce statistics that may
be comparable to bootstrap proportions [72].
[Figure 4: schematic with three panels. ‘Prior’: density over parameter space, showing an informative prior and a vague prior. ‘Sequence data’: Seq1 ACTC … CGCC; Seq2 ACTG … CGCT; Seq3 ATTG … CACT; Seq4 ATTG … CACT. ‘Posterior distributions’: density over parameter space for Trees A, B, and C, and posterior probability over tree space for Trees A, B, and C]
Fig. 4 A schematic showing Bayesian tree inference. The prior (left) contains the
information or beliefs one has about the parameters contained in the tree and the
evolutionary model before seeing the data. During Bayesian inference, this is
combined with the information about the tree and model parameter values held in
the original data (right) to produce the posterior distribution. These may be
summarized to provide estimates of the parameter values (bottom left) and the
trees (bottom right)
6.1 Bayesian Estimation of Trees
To obtain the posterior probability of trees, those parameters that
are not of direct interest to the analysis need to be ‘integrated out’
of the posterior distribution. These extra parameters are sometimes
referred to as ‘nuisance parameters,’ and include components of the
evolutionary model and branch lengths; that is, we want the posterior
probabilities of tree topologies averaged across all values of these parameters.
A common summary measure in Bayesian phylogenetics is the
maximum a posteriori probability (MAP) tree, which is the tree
with the highest posterior probability in tree space. The integration
required to obtain the MAP tree is represented in the transition
from left to right in the posterior distribution section of Fig. 4,
where the area under the curve for each tree on the left equates to
the posterior probability for each tree on the right. The confidence
in the MAP tree can naturally be estimated from the posterior
distribution and requires no additional computation. In Bayesian
parlance, this is achieved by constructing a credibility interval,
which can roughly be considered similar to the confidence interval
of classical statistics and is constructed by adding trees to the
credible set in order of decreasing probability. For example, the
credibility interval for the data in Fig. 4 would be constructed by
very low posterior probability are rarely visited. The overall result is
that the amount of time a chain spends in regions of tree space is
directly proportional to the posterior distribution. This approach
allows the posterior probability of trees to be easily calculated from
the frequency of time that the chain spends visiting different
topologies.
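Given the topologies visited by a converged chain, posterior probabilities, the MAP tree, and a credible set follow from simple counting. The sketch below assumes the sampled topologies have already been reduced to comparable labels (here the strings "Tree A", "Tree B", "Tree C"), and the visit counts are invented. The credible set is built as described in Subheading 6.1, by adding trees in order of decreasing posterior probability until the chosen level is reached.

```python
from collections import Counter


def posterior_from_samples(sampled_topologies):
    """Estimate posterior probabilities as visit frequencies."""
    counts = Counter(sampled_topologies)
    total = len(sampled_topologies)
    return {tree: count / total for tree, count in counts.most_common()}


def credible_set(posterior, level=0.95):
    """Smallest set of trees, taken in order of decreasing probability,
    whose cumulative posterior probability reaches the given level."""
    chosen = []
    cumulative = 0.0
    ranked = sorted(posterior.items(), key=lambda item: item[1], reverse=True)
    for tree, prob in ranked:
        chosen.append(tree)
        cumulative += prob
        if cumulative >= level:
            break
    return chosen


# Hypothetical post-burn-in samples from a chain.
samples = ["Tree A"] * 78 + ["Tree B"] * 20 + ["Tree C"] * 2

posterior = posterior_from_samples(samples)
map_tree = max(posterior, key=posterior.get)
print(posterior)
print(map_tree, credible_set(posterior, 0.95))
```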
6.3 How Long to Run a Markov Chain Monte Carlo
The number of samples required for MCMC to successfully sample
the posterior distribution is dependent on two factors: convergence
and mixing (see Note 6). A chain is said to have converged when it
begins to accurately sample from the posterior distribution, and the
period before this happens is called burn in. The mixing of a chain is
a measure of how similar one sample from the MCMC is to the
next, with good mixing allowing relatively close samples to provide
independent draws from the posterior distribution. Mixing is
important because it controls both how quickly a chain converges
and its ability to sample effectively from the posterior distribution
afterward. When a chain mixes well, all trees can be quickly reached
from all other trees and MCMC is a highly effective method. When
mixing is poor the chain’s ability to sample effectively from the
posterior is compromised. It is common for samples to be recorded
only periodically in the chain, say every 100 or 1000 iterations, to
ensure samples are as close to independent draws from the posterior
as possible.
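Discarding the burn in and keeping only periodic samples amounts to a single slice of the recorded chain. In the sketch below the chain is represented by its iteration numbers, and the burn-in length and thinning interval are arbitrary values chosen for illustration, not recommendations.

```python
def post_burn_in_samples(chain, burn_in, thin):
    """Drop the first burn_in states, then keep every thin-th state."""
    if burn_in < 0 or thin < 1:
        raise ValueError("burn_in must be >= 0 and thin must be >= 1")
    return chain[burn_in::thin]


chain = list(range(10000))  # stand-in for 10,000 recorded states
kept = post_burn_in_samples(chain, burn_in=2500, thin=100)
print(len(kept), kept[:3])
```

Thinning reduces the correlation between successive retained samples, but it cannot rescue a chain that mixes poorly, so the diagnostics discussed in this subheading remain necessary.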
It is notoriously difficult to confirm that the chain has con-
verged and is successfully mixing, but there are diagnostic tools
available to help. A powerful way to examine these conditions is to
run multiple chains and compare them. If a majority of chains
starting from substantially different points in tree space concentrate
their sampling in the same region it is indicative that the chains have
converged. Evidence for successful mixing can be found by com-
paring samples between converged chains. When samples are clearly
different, it is strong evidence that the chain is not mixing well.
These comparative approaches can go awry; for example, when a
small number of good tree topologies with large centers of attrac-
tion are separated by long and deep troughs in the surface of
posterior probabilities. If by chance all the chains start in the same
center of attraction, they can misleadingly appear to have converged
and mixed well, even when they have poorly sampled the
posterior. This behavior has been induced, for example, for small
trees under artificial misspecifications of the evolutionary model,
although the general prevalence of this problem is currently
unknown.
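A simple way to compare independent chains is to contrast how often each visits each topology. The sketch below reports the largest absolute difference in topology frequency between two chains. This is a simplified stand-in, written for illustration, for the split-frequency diagnostics reported by Bayesian phylogenetic programs, and the sample data are invented. Small values are consistent with convergence but, for the reasons given above, do not prove it.

```python
from collections import Counter


def topology_frequencies(samples):
    """Relative frequency of each sampled topology."""
    counts = Counter(samples)
    total = len(samples)
    return {tree: count / total for tree, count in counts.items()}


def max_frequency_difference(chain_a, chain_b):
    """Largest absolute difference in topology frequency between two chains."""
    freq_a = topology_frequencies(chain_a)
    freq_b = topology_frequencies(chain_b)
    trees = set(freq_a) | set(freq_b)
    return max(abs(freq_a.get(t, 0.0) - freq_b.get(t, 0.0)) for t in trees)


# Hypothetical post-burn-in samples from three chains.
chain_1 = ["Tree A"] * 70 + ["Tree B"] * 30
chain_2 = ["Tree A"] * 68 + ["Tree B"] * 30 + ["Tree C"] * 2
chain_3 = ["Tree B"] * 90 + ["Tree A"] * 10

print(max_frequency_difference(chain_1, chain_2))  # small: chains agree
print(max_frequency_difference(chain_1, chain_3))  # large: chains disagree
```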
An alternative diagnostic is to examine a plot of the likelihood
and/or model parameter values, such as rate variation parameters
and sums of branches in the tree, against sample number. Before
convergence these values may tend to show discernable patterns of
change. The likelihood function, for example, may appear on aver-
age to steadily increase while the chain moves to progressively
better areas of tree space. When the chain converges these values
may appear to have quite large random fluctuations with no appar-
ent trend. Fast fluctuation accompanied by quite large differences
in likelihood, for example, would be indicative of successful mixing.
This characteristic alone is a weak indicator of convergence, because
chains commonly fluctuate before they find better regions of tree
space. New sampling procedures, such as Metropolis Coupled
MCMC (MC^3), are being introduced that can address more difficult
sampling and mixing problems, and are likely to feature more
frequently in phylogenetic inference (see Notes 5 and 6).
6.4 The Specification of Priors
All Bayesian phylogenetic analyses require the specification of prior
distributions for all of the parameters in the model, including the
substitution model and the tree and its branch lengths. There are
two broad approaches to specifying priors that divide Bayesian
inference into two groups [80]. The first approach is Subjective
Bayes, where the informative prior represents a researcher’s prior
belief in the question at hand. Bayesian inference adjusts this prior
belief to obtain a posterior that reveals how much the researcher’s
opinion should be changed by the observed data. At a superficial
level Subjective Bayes is exactly how one may wish to conduct
scientific research, but it is difficult to reconcile with the more
general approach to scientific method. The prior beliefs for partic-
ular parameters or models vary between researchers, making it
difficult for researchers to agree on the accuracy, or even relevance,
of the posterior. The rarity with which research fields can agree on a
prior makes publishing using Subjective Bayes almost impossible.
The second approach is Objective Bayes, where a researcher
attempts to express ignorance of the problem at hand in their
priors, and then uses the posterior distribution as a way of assessing
how much information is in their data. This approach is closely
linked to the idea of uninformative or flat priors, which define all
outcomes as equally likely. In tree inference this can be broadly
interpreted as each tree being equally likely, which is philosophically
similar to how other methods, such as likelihood and parsimony,
treat tree estimation. The problem with Objective Bayes is that
there is no such thing as ‘uninformative priors,’ since even relatively
innocent-looking flat priors actually make strong statements about
the data, and this can have a major impact on the posterior distri-
bution. A simple example of this problem can be demonstrated by
trying to specify a prior distribution on a square with edge length l.
One can assign a flat prior to l so that all edge lengths are equally
likely. An alternative, and equally valid approach, is to assign a flat
prior on the area of the square, l², so that all areas are equally likely.
These two approaches produce different and incompatible prior
distributions, since a uniform distribution over l can never be the
same as a uniform distribution over l².
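The incompatibility is easy to check numerically. The following is a minimal sketch, assuming purely for illustration that the edge length is bounded above by a hypothetical value l_max = 1 (the text does not specify a range); the function names are ours. It compares the prior probability that the area is below a given value under each of the two flat priors.

```python
import math

L_MAX = 1.0  # hypothetical upper bound on the edge length (an assumption, not from the text)

def prob_area_below_flat_on_edge(a, l_max=L_MAX):
    """P(l^2 <= a) when the edge length l is uniform on [0, l_max]."""
    return min(math.sqrt(a), l_max) / l_max

def prob_area_below_flat_on_area(a, l_max=L_MAX):
    """P(l^2 <= a) when the area l^2 is uniform on [0, l_max^2]."""
    return min(a, l_max ** 2) / l_max ** 2

# The two "uninformative" priors assign different probabilities to the same event.
for a in (0.01, 0.25, 0.81):
    print(a, prob_area_below_flat_on_edge(a), prob_area_below_flat_on_area(a))
```

Under the flat prior on l, half of the prior mass lies on squares with an area below 0.25, whereas under the flat prior on l² only a quarter does, so the two priors cannot both be regarded as expressing ignorance.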
Inferring Trees 369
8 Notes
Acknowledgments
References
1. Hahn BH et al (2000) AIDS as a zoonosis: scientific and public health implications. Science 287:607–614
2. Pellegrini M et al (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci U S A 96:4285–4288
3. Ames RM et al (2012) Determining the evolutionary history of gene families. Bioinformatics 28:48–55
4. Liberles DA et al (2012) The interface of protein structure, protein biophysics, and molecular evolution. Protein Sci 21:769–785
374 Simon Whelan and David A. Morrison
5. Hahn MW, Han MV, Han S-G (2007) Gene family evolution across 12 Drosophila genomes. PLoS Genet 3:e197
6. Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420:520–562
7. Lynch M, Walsh B (2007) The origins of genome architecture. Sinauer Associates, Sunderland, MA
8. Gogarten JP, Doolittle WF, Lawrence JG (2002) Prokaryotic evolution in light of gene transfer. Mol Biol Evol 19:2226–2238
9. Yang Z, Rannala B (2010) Bayesian species delimitation using multilocus sequence data. Proc Natl Acad Sci U S A 107:9264–9269
10. Siepel A et al (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res 15:1034–1050
11. Felsenstein J (2003) Inferring Phylogenies. Sinauer Associates, Sunderland, MA
12. Löytynoja A, Goldman N (2008) Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 320:1632–1635
13. Anisimova M, Cannarozzi G, Liberles DA (2010) Finding the balance between the mathematical and biological optima in multiple sequence alignment. Trends Evol Biol 2:e7
14. Löytynoja A (2012) Alignment methods: strategies, challenges, benchmarking, and comparative overview. In: Evolutionary genomics. Springer, New York, pp 203–235
15. Yang Z (2006) Computational molecular evolution. Oxford University Press, Oxford
16. Redelings B, Suchard M (2005) Joint Bayesian estimation of alignment and phylogeny. Syst Biol 54:401–418
17. Thorne JL, Kishino H, Felsenstein J (1991) An evolutionary model for maximum likelihood alignment of DNA sequences. J Mol Evol 33:114–124
18. McGuire G, Denham MC, Balding DJ (2001) Models of sequence evolution for DNA sequences containing gaps. Mol Biol Evol 18:481–490
19. Morrison DA, Ellis JT (1997) Effects of nucleotide sequence alignment on phylogeny estimation: a case study of 18S rDNAs of Apicomplexa. Mol Biol Evol 14:428–441
20. Wong K, Suchard M, Huelsenbeck J (2008) Alignment uncertainty and genomic analysis. Science 319:473–476
21. Blackburne BP, Whelan S (2013) Class of multiple sequence alignment algorithm affects genomic analysis. Mol Biol Evol 30:642–653
22. Wägele JW, Mayer C (2007) Visualizing differences in phylogenetic information content of alignments and distinction of three classes of long-branch effects. BMC Evol Biol 7:147
23. Hendy MD, Penny D (1993) Spectral analysis of phylogenetic data. J Classif 10:5–24
24. Morrison DA (2010) Using data-display networks for exploratory data analysis in phylogenetic studies. Mol Biol Evol 27:1044–1057
25. Huson DH, Bryant D (2006) Application of phylogenetic networks in evolutionary studies. Mol Biol Evol 23:254–267
26. Morrison DA (2011) Introduction to phylogenetic networks. RJR Productions, Uppsala, Sweden
27. Philippe H, Germot A (2000) Phylogeny of eukaryotes based on ribosomal RNA: long-branch attraction and models of sequence evolution. Mol Biol Evol 17:830–834
28. Inagaki Y et al (2004) Covarion shifts cause a long-branch attraction artifact that unites microsporidia and archaebacteria in EF-1α phylogenies. Mol Biol Evol 21:1340–1349
29. Viklund J, Ettema TJ, Andersson SG (2011) Independent genome reduction and phylogenetic reclassification of the oceanic SAR11 clade. Mol Biol Evol 29:599–615
30. Morrison DA (2006) Phylogenetic analyses of parasites in the new millennium. Adv Parasitol 63:1–124
31. Edwards AWF (1972) Likelihood: an account of the statistical concept of likelihood and its application to scientific inference. Cambridge University Press, New York
32. Chang JT (1996) Full reconstruction of Markov models on evolutionary trees: identifiability and consistency. Math Biosci 137:51–73
33. Rogers JS (1997) On the consistency of maximum likelihood estimation of phylogenetic trees from nucleotide sequences. Syst Biol 46:354–357
34. Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313
35. Izquierdo-Carrasco F, Smith SA, Stamatakis A (2011) Algorithms, data structures, and numerics for likelihood-based phylogenetic inference of huge trees. BMC Bioinformatics 12:470
36. Steel M, Penny D (2000) Parsimony, likelihood, and the role of models in molecular phylogenetics. Mol Biol Evol 17:839–850
37. Siddall ME, Kluge AG (1997) Probabilism and phylogenetic inference. Cladistics 13:313–336
38. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
39. Allman ES, Rhodes JA (2006) The identifiability of tree topology for phylogenetic models, including covarion and mixture models. J Comput Biol 13:1101–1113
40. Swofford DL et al (1996) Phylogenetic inference. In: Hillis DM, Moritz C, Mable BK (eds) Molecular systematics. Sinauer Associates, Sunderland, MA, pp 407–514
41. Morrison DA (2007) Increasing the efficiency of searches for the maximum likelihood tree in a phylogenetic analysis of up to 150 nucleotide sequences. Syst Biol 56:988–1010
42. Whelan S (2007) New approaches to phylogenetic tree search and their application to large numbers of protein alignments. Syst Biol 56:727–740
43. Vinh LS, von Haeseler A (2004) IQPNNI: moving fast through tree space and stopping in time. Mol Biol Evol 21:1565–1571
44. Money D, Whelan S (2012) Characterizing the phylogenetic tree-search problem. Syst Biol 61:228–239
45. Bryant D (2004) The splits in the neighborhood of a tree. Ann Combin 8:1–11
46. Whelan S, Money D (2010) The prevalence of multifurcations in tree-space and their implications for tree-search. Mol Biol Evol 27:2674–2677
47. Lin Y-M, Fang S-C, Thorne JL (2007) A tabu search algorithm for maximum parsimony phylogeny inference. Eur J Oper Res 176:1908–1917
48. Zwickl D (2006) Genetic algorithm approaches for the phylogenetic analysis of large biological sequence datasets under the maximum likelihood criterion. Ph.D. thesis, University of Texas, USA
49. Lewis PO (1998) A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data. Mol Biol Evol 15:277–283
50. Lemmon AR, Milinkovitch MC (2002) The metapopulation genetic algorithm: an efficient solution for the problem of large phylogeny estimation. Proc Natl Acad Sci U S A 99:10516–10521
51. Darriba D et al (2012) jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 9:772
52. Darriba D et al (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27:1164–1165
53. Whelan S et al (2015) ModelOMatic: fast and automated model selection between RY, nucleotide, amino acid, and codon substitution models. Syst Biol 64:42–55
54. Allen JE, Whelan S (2014) Assessing the state of substitution models describing noncoding RNA evolution. Genome Biol Evol 6:65–75
55. Blair C, Murphy RW (2011) Recent trends in molecular phylogenetic analysis: where to next? J Hered 102:130–138
56. Lanfear R et al (2012) PartitionFinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol Biol Evol 29:1695–1701
57. Pagel M, Meade A (2004) A phylogenetic mixture model for detecting pattern-heterogeneity in gene sequence or character-state data. Syst Biol 53:571–581
58. Le SQ, Lartillot N, Gascuel O (2008) Phylogenetic mixture models for proteins. Philos Trans R Soc B Biol Sci 363:3965–3976
59. Le SQ, Gascuel O (2010) Accounting for solvent accessibility and secondary structure in protein phylogenetics is clearly beneficial. Syst Biol 59:277–287
60. Bouckaert RR (2010) DensiTree: making sense of sets of phylogenetic trees. Bioinformatics 26:1372–1373
61. Felsenstein J (1985) Confidence limits on phylogenies: an approach using the bootstrap. Evolution 39:783–791
62. Hillis DM, Bull JJ (1993) An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Syst Biol 42:182–192
63. Efron B, Halloran E, Holmes S (1996) Bootstrap confidence levels for phylogenetic trees. Proc Natl Acad Sci U S A 93:13429
64. Embley TM, Martin W (2006) Eukaryotic evolution, changes and challenges. Nature 440:623–630
65. Fitzpatrick DA, Creevey CJ, McInerney JO (2006) Genome phylogenies indicate a meaningful α-proteobacterial phylogeny and support a grouping of the mitochondria with the Rickettsiales. Mol Biol Evol 23:74–85
66. McGowen MR, Gatesy J, Wildman DE (2014) Molecular evolution tracks macroevolutionary transitions in Cetacea. Trends Ecol Evol 29:336–346
67. Shimodaira H, Hasegawa M (1999) Multiple comparisons of log-likelihoods with applications to phylogenetic inference. Mol Biol Evol 16:1114–1116
68. Shimodaira H (2002) An approximately unbiased test of phylogenetic tree selection. Syst Biol 51:492–508
69. Kishino H, Hasegawa M (1989) Evaluation of the maximum likelihood estimate of the evolutionary tree topologies from DNA sequence data, and the branching order in Hominoidea. J Mol Evol 29:170–179
70. Stamatakis A, Hoover P, Rougemont J (2008) A rapid bootstrap algorithm for the RAxML web servers. Syst Biol 57:758–771
71. Minh BQ, Nguyen MAT, von Haeseler A (2013) Ultrafast approximation for phylogenetic bootstrap. Mol Biol Evol 30:1188–1195. doi:10.1093/molbev/mst024
72. Anisimova M, Gascuel O (2006) Approximate likelihood-ratio test for branches: a fast, accurate, and powerful alternative. Syst Biol 55:539–552
73. Huelsenbeck JP et al (2002) Potential applications and pitfalls of Bayesian inference of phylogeny. Syst Biol 51:673–688
74. Holder M, Lewis PO (2003) Phylogeny estimation: traditional and Bayesian approaches. Nat Rev Genet 4:275–284
75. Ronquist F, Deans AR (2010) Bayesian phylogenetics and its influence on insect systematics. Annu Rev Entomol 55:189–206
76. Yang Z, Rannala B (2012) Molecular phylogenetics: principles and practice. Nat Rev Genet 13:303–314
77. Drummond AJ et al (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29:1969–1973
78. Ronquist F et al (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61:539–542
79. Larget B, Simon DL (1999) Markov chain Monte Carlo algorithms for the Bayesian analysis of phylogenetic trees. Mol Biol Evol 16:750–759
80. Alfaro ME, Holder MT (2006) The posterior and the prior in Bayesian phylogenetics. Annu Rev Ecol Evol Syst 37:19–42
81. Zhang C, Rannala B, Yang Z (2012) Robustness of compound Dirichlet priors for Bayesian inference of branch lengths. Syst Biol 61:779–784
82. Bergsten J, Nilsson AN, Ronquist F (2013) Bayesian tests of topology hypotheses with an example from diving beetles. Syst Biol 62:660–673
83. Rannala B, Zhu T, Yang Z (2012) Tail paradox, partial identifiability, and influential priors in Bayesian branch length inference. Mol Biol Evol 29:325–335
84. Lewis PO, Holder MT, Holsinger KE (2005) Polytomies and Bayesian phylogenetic inference. Syst Biol 54:241–253
85. Yang ZH (2007) Fair-balance paradox, star-tree paradox, and Bayesian phylogenetics. Mol Biol Evol 24:1639–1655
86. Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21:1095–1109
87. Lartillot N, Brinkmann H, Philippe H (2007) Suppression of long-branch attraction artefacts in the animal phylogeny using a site-heterogeneous model. BMC Evol Biol 7:S4
88. Robinson D et al (2003) Protein evolution with dependence among codons due to tertiary structure. Mol Biol Evol 20:1692–1704
89. Lartillot N, Poujol R (2011) A phylogenetic model for investigating correlated evolution of substitution rates and continuous phenotypic characters. Mol Biol Evol 28:729–744
90. Lukoschek V, Keogh JS, Avise JC (2012) Evaluating fossil calibrations for dating phylogenies in light of rates of molecular evolution: a comparison of three approaches. Syst Biol 61:22–43
91. Baele G et al (2012) Improving the accuracy of demographic and molecular clock model comparison while accommodating phylogenetic uncertainty. Mol Biol Evol 29:2157–2167
92. Delsuc F, Brinkmann H, Philippe H (2005) Phylogenomics and the reconstruction of the tree of life. Nat Rev Genet 6:361–375
93. Landan G, Graur D (2007) Heads or tails: a simple reliability check for multiple sequence alignments. Mol Biol Evol 24:1380–1383
94. Penn O et al (2010) An alignment confidence score capturing robustness to guide tree uncertainty. Mol Biol Evol 27:1759–1767
95. Jordan G, Goldman N (2012) The effects of alignment error and alignment filtering on the sitewise detection of positive selection. Mol Biol Evol 29:1125–1139
96. Huber KT et al (2002) Spectronet: a package for computing spectra and median networks. Appl Bioinformatics 1:2041–2059
97. Huson DH (1998) SplitsTree: analyzing and visualizing evolutionary data. Bioinformatics 14:68–73
98. Gil M et al (2013) CodonPhyML: fast maximum likelihood phylogeny estimation under codon substitution models. Mol Biol Evol 30:1270–1280
99. Swofford DL (2002) Phylogenetic analysis using parsimony (*and other methods). Sinauer Associates, Sunderland, MA
100. Guindon S et al (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59:307–321
101. Lartillot N, Lepage T, Blanquart S (2009) PhyloBayes 3: a Bayesian software package for phylogenetic reconstruction and molecular dating. Bioinformatics 25:2286–2288
102. Nylander JA et al (2008) AWTY (are we there yet?): a system for graphical exploration of MCMC convergence in Bayesian phylogenetics. Bioinformatics 24:581–583
Chapter 15
Abstract
Most phylogenetic methods are model-based and depend on models of evolution designed to approximate
the evolutionary processes. Several methods have been developed to identify suitable models of evolution
for phylogenetic analysis of alignments of nucleotide or amino acid sequences and some of these methods
are now firmly embedded in the phylogenetic protocol. However, in a disturbingly large number of cases, it
appears that these models were used without acknowledgement of their inherent shortcomings. In this
chapter, we discuss the problem of model selection and show how some of the inherent shortcomings may
be identified and overcome.
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_15, © Springer Science+Business Media New York 2017
380 Lars S. Jermiin et al.
same coin: the former is the phylogeny, a rooted binary tree that
depicts the time and order of different divergence events, and the
latter is the process by which mutations in DNA accumulate over
time along diverging lineages. It makes no sense to consider the
pattern and the process separately, even though only one of the two
might be of interest, because the estimate of evolutionary pattern
depends on the evolutionary process, and vice versa. This chapter
also aims to raise awareness of what the term
rate of molecular evolution means: it is not just a single variable, as
commonly portrayed, but rather, in mathematical terms, a matrix of
variables. Finally, although many types of mutations are known to
affect DNA and protein, the focus of this chapter is on point
mutations in DNA. Our reasons for limiting the focus to these
changes are that phylogenetic studies frequently rely on the products
of point mutations as the main source of phylogenetic information,
and that the substitution models used in phylogenetic methods
usually focus on those types of changes (much of what is written
below applies equally well, albeit with some modifications, to
sequences of amino acids).
In the following sections, we first describe the phylogenetic
assumptions, including some of the relevant aspects of the Markov
models commonly used in phylogenetic studies. We then discuss
some of the terminology used to characterize phylogenetic data and
describe several methods that can be used for identifying the opti-
mal Markov model. We also discuss why it is necessary to use data-
surveying methods before and after phylogenetic analysis. Using
such methods prior to phylogenetic analyses is becoming increas-
ingly popular, but it is still rare to see phylogenetic results being
properly evaluated using the parametric bootstrap.
2 Underlying Principles
2.1 The Phylogenetic Assumptions
In the context of the evolutionary pattern, it is generally assumed
that the sequences have evolved along a bifurcating tree, where
each edge in this tree represents the period of time over which
point mutations have accumulated and each bifurcation represents
a speciation event. Sequences that evolve in this manner are consid-
ered useful for studies of many aspects of evolution. A violation of
this assumption occurs when gene duplication, recombination
between homologous chromosomes and/or lateral gene transfer
between genomes has occurred. In phylogenetic trees, gene dupli-
cations resemble speciation events and might be interpreted as such
unless all descendant copies of each gene duplication are accounted
for, which is neither always possible nor always the case. One
solution to this problem would be to carry out a probabilistic
orthology analysis [52]. Recombination between homologous
chromosomes is most easily detected in sequences with a recent
common origin, and the phylogenetically confounding effect of it
diminishes with the age of the recombination event (due to the
subsequent accumulation of point mutations). Lateral gene transfer
is more difficult to detect, but it is thought to affect studies of
phylogenetic data with recent as well as ancient origins. Methods to
detect recombination [53–57] and lateral gene transfer [58–64] are
available, but it is beyond the scope of this chapter to review them.
In the following, we simply assume that the sequences evolved on a
bifurcating tree, without gene duplication, recombination, or lateral
gene transfer.
In the context of the evolutionary process, it is generally assumed
that the sites in a gene have evolved independently under the same
Markovian conditions (and the sites then are said to be independent
and identically distributed). The advantage of this is that only one
Identifying Optimal Models of Evolution 383
Table 1
The spectrum of conditions relating to the phylogenetic assumptions
$$
\begin{aligned}
P(t) &= I + Rt + \frac{(Rt)^2}{2!} + \frac{(Rt)^3}{3!} + \cdots \\
     &= \sum_{k=0}^{\infty} \frac{(Rt)^k}{k!} \\
     &= e^{Rt},
\end{aligned}
\qquad (2)
$$
where R is a time-independent rate matrix satisfying three
conditions:
1. $r_{ij} > 0$ for $i \neq j$;
2. $r_{ii} = -\sum_{j \neq i} r_{ij}$, implying that $R\mathbf{1} = \mathbf{0}$, where $\mathbf{1}^T = (1, 1, 1, 1)$
and $\mathbf{0}^T = (0, 0, 0, 0)$;
3. $\pi^T R = \mathbf{0}^T$, where $\pi^T = (\pi_1, \pi_2, \pi_3, \pi_4)$ is the stationary
distribution of R, $0 < \pi_j < 1$, and $\sum_j \pi_j = 1$.
The second of these conditions is required to ensure that $P(t)$ is
a valid transition matrix for $t > 0$.
Let $f_{0j}$ be the frequency of the j-th nucleotide in the ancestral
sequence. Then the process is
1. stationary, if $\Pr(X(t) = j) = f_{0j} = \pi_j$, for $j = 1, \ldots, 4$, and
2. reversible, if the balance equation $\pi_i r_{ij} = \pi_j r_{ji}$ is met for
$1 \le i, j \le 4$.
R is a key component in the context of modeling the accumu-
lation of point mutations at sites in a nucleotide sequence. Each
element of R has a role to play when modeling the state a site will be
in at time t, so it is useful to understand the implications of chang-
ing R. For this reason, it is unwise to consider the rate of evolution
as a single variable when, in fact, it is better represented by a matrix
of variables.
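The conditions on R and the relationship between R and P(t) can be checked numerically. The sketch below is only an illustration under stated assumptions: the stationary distribution and the symmetric exchangeability values are hypothetical, the function names are ours, and the matrix exponential is approximated by the power series in Eq. (2) combined with scaling and squaring rather than by any routine used by the phylogenetic programs cited here.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def reversible_rate_matrix(pi, s):
    """Build R with r_ij = s_ij * pi_j (s symmetric) and r_ii = -sum_{j != i} r_ij."""
    n = len(pi)
    R = [[s[i][j] * pi[j] if i != j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        R[i][i] = -sum(R[i])  # diagonal is still 0 here, so this is -sum of off-diagonals
    return R

def transition_matrix(R, t, squarings=10, terms=20):
    """Approximate P(t) = exp(Rt): truncated series for exp(Rt / 2^m), then square m times."""
    n = len(R)
    scale = t / (2 ** squarings)
    A = [[R[i][j] * scale for j in range(n)] for i in range(n)]
    P = identity(n)
    term = identity(n)
    for k in range(1, terms + 1):
        term = mat_mul(term, A)
        term = [[x / k for x in row] for row in term]  # term is now A^k / k!
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)
    return P

# Hypothetical stationary distribution and exchangeabilities (illustration only).
pi = [0.4, 0.3, 0.2, 0.1]
s = [[0, 1, 2, 1],
     [1, 0, 1, 2],
     [2, 1, 0, 1],
     [1, 2, 1, 0]]
R = reversible_rate_matrix(pi, s)
P = transition_matrix(R, 0.5)

print([round(sum(row), 12) for row in R])  # condition 2: each row of R sums to zero
print([round(sum(pi[i] * R[i][j] for i in range(4)), 12) for j in range(4)])  # condition 3
print([round(sum(row), 12) for row in P])  # each row of P(t) sums to one
```

Because R is built from a symmetric set of exchangeabilities, it also satisfies the balance equation, so the process in this sketch is reversible as well as stationary. Changing any single off-diagonal element of R changes P(t), which illustrates why the rate of evolution is better regarded as a matrix than as a single number.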
2.3 Modeling the Evolutionary Processes at a Site in Two Sequences
Consider a site in a pair of nucleotide sequences that have diverged
from their common ancestor by two independent Markov processes.
Let X and Y denote the Markov processes operating at the
site, one for each edge, and let $P_X(t)$ and $P_Y(t)$ be the transition
functions that describe the Markov processes X(t) and Y(t). The
joint probability that the sequences contain nucleotide i and j,
respectively, is then given by
$$f_{ij}(t) = \Pr[X(t) = i, Y(t) = j \mid X(0) = Y(0)]; \qquad (3)$$
3.1 Bias
The term bias has variously been used to describe (a) a systematic
distortion of a statistical result due to a factor not allowed for in its
derivation, (b) a nonuniform distribution of the frequencies of
nucleotides, codons or amino acids, and (c) compositional hetero-
geneity among homologous sequences. In some instances, there is
little doubt about the meaning but in others, the authors have
inadvertently provided grounds for confusion. Because of this, we
recommend that the term bias be reserved for statistical purposes,
and that four other terms be used to describe the observed nucleo-
tide content:
1. The nucleotide content of a sequence is uniform if the nucleo-
tide frequencies are identical; otherwise, it is nonuniform;
2. The nucleotide content of two sequences is compositionally
homogeneous if they have the same nucleotide content; other-
wise, it is compositionally heterogeneous.
The advantages of adopting this terminology are that (a) we
can discuss model selection without the ambiguity that we otherwise
might have had to deal with, and (b) we are not forced to
state what the unbiased condition is; it might not be a uniform
nucleotide content, as implied in many instances. The five terms are
also applicable to codons and amino acids without loss of clarity.
non-historical signals [97], include: the rate signal, which may arise
when the sites and/or lineages evolve at nonhomogeneous rates;
the compositional signal, which may arise when the sites and/or
lineages evolve under different stationary conditions; and the
covarion signal, which may emerge when the sites evolve non-
independently (e.g., switching between being variable and invari-
able along different edges).
Occasionally, the term phylogenetic signal is used, either synon-
ymously with the historical signal or to represent the signals that
the phylogenetic methods use during inference of a phylogeny. Due
to its ambiguity and the fact that most popular phylogenetic meth-
ods are unable to distinguish the historical and non-historical sig-
nals [100], we recommend the term phylogenetic signal be used
with caution.
Separating the different signals is difficult because their mani-
festations are similar, so an inspection of the inferred phylogeny is
unlikely to offer the best solution to this problem. Recent simula-
tion studies of nucleotide sequences generated under stationary,
reversible, and homogeneous conditions as well as under more
complex conditions have revealed a complex relationship between
the historical and non-historical signals [100]. The results show
that the historical signal decays over time whereas the other signals
may increase over time, depending on the nature of the evolution-
ary processes operating over time. The results also show that the
relative magnitude of the signals determines whether the phyloge-
netic methods are likely to infer the correct tree, and that it is
possible to infer the correct tree even though the historical signal
has been lost. Therefore, there are reasons to be cautious when
studying ancient evolutionary events: what might be a well-
supported phylogeny may in fact be a tree representing largely the
non-historical signals.
3.3 Testing the Stationary, Reversible, and Homogeneous Condition
The composition of nucleotides in an alignment may vary across
sequences and/or across sites. In either case, it would be unwise to
assume that the evolutionary processes can be modeled accurately
using a single time-reversible Markov model.
A solution to this problem is to determine whether there is
compositional heterogeneity in the alignment of nucleotides and, if
so, what type of compositional heterogeneity it is. If there is com-
positional heterogeneity across the sites but not across the
sequences, then it is possible to model the evolutionary processes
using a set of time-reversible Markov models applied to different
sets of sites in the alignment; several methods facilitate such an
analysis (e.g., [13, 14, 101–104]). On the other hand, if there is
compositional heterogeneity across the sequences, then none of
these strategies are appropriate because lineage-specific evolution-
ary processes must have been different.
Fig. 1 A rooted 2-tipped tree with two diverging nucleotide sequences (i.e., fat horizontal lines) observed at
time t0 (i.e., sequences 1 and 2) and at time t1 (i.e., sequences 1′ and 2′). The evolutionary processes
operating along the terminal edges are labeled R1 and R2
(because the two sequences are identical) while at time t1, the
divergence matrix might look like this:
$$
D(t_1) =
\begin{bmatrix}
191 & 71 & 68 & 57 \\
14 & 142 & 22 & 33 \\
16 & 12 & 144 & 29 \\
26 & 19 & 18 & 138
\end{bmatrix}.
\qquad (6)
$$
Each element of D (i.e., $d_{ij}$) represents the number of sites where the
descendant sequences have nucleotides i and j, respectively. It is easy
to see that the two sequences have different nucleotide frequencies
and that the matrix is asymmetrical (i.e., $d_{ij} \neq d_{ji}$ for $i \neq j$), so it is
tempting to conclude that they have evolved under different condi-
tions. However, doing so would be unwise before testing whether
the two sequences have evolved under the same conditions. This can
be done using the matched-pairs tests of symmetry, marginal sym-
metry, and internal symmetry.
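A divergence matrix of this kind can be tabulated directly from a pair of aligned sequences. The following is a minimal sketch; the two short sequences are hypothetical, the function name is ours, and the row and column order A, C, G, T is an assumption, since the order used in Eq. (6) is not stated here.

```python
NUCLEOTIDES = "ACGT"  # assumed row/column order (not specified in the text)

def divergence_matrix(seq_a, seq_b, alphabet=NUCLEOTIDES):
    """Count, for each pair (i, j), the sites where seq_a has i and seq_b has j."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    index = {nt: k for k, nt in enumerate(alphabet)}
    n = len(alphabet)
    D = [[0] * n for _ in range(n)]
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in index and b in index:  # sites with gaps or ambiguity codes are skipped
            D[index[a]][index[b]] += 1
    return D

# Hypothetical pair of aligned sequences (illustration only).
D = divergence_matrix("ACGTACGTAC", "ACGTTCGAAA")
for row in D:
    print(row)
```

The row sums of D give the nucleotide counts of the first sequence and the column sums give those of the second, which is why the marginal totals carry the information about compositional heterogeneity.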
In order to use the matched-pairs test of symmetry [111], we
enter the elements of D(t1) into the following equation:
$$S_S^2 = \sum_{i<j} \frac{(d_{ij} - d_{ji})^2}{d_{ij} + d_{ji}}; \qquad (7)$$
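Equation (7) can be evaluated for the divergence matrix in Eq. (6) with a few lines of code. The sketch below is an illustration rather than the implementation used by the authors: the function names are ours, pairs with d_ij + d_ji = 0 are skipped and do not count toward the degrees of freedom (an assumption on our part), and the probability is obtained from the closed-form upper tail of the chi-square distribution, which is available when the number of degrees of freedom is even.

```python
import math

# Divergence matrix D(t1) from Eq. (6).
D_T1 = [
    [191, 71, 68, 57],
    [14, 142, 22, 33],
    [16, 12, 144, 29],
    [26, 19, 18, 138],
]

def symmetry_statistic(D):
    """Return the matched-pairs test-of-symmetry statistic (Eq. 7) and its degrees of freedom."""
    n = len(D)
    stat, df = 0.0, 0
    for i in range(n):
        for j in range(i + 1, n):
            total = D[i][j] + D[j][i]
            if total > 0:  # assumption: empty pairs are skipped and reduce df
                stat += (D[i][j] - D[j][i]) ** 2 / total
                df += 1
    return stat, df

def chi2_sf_even(x, df):
    """Upper-tail chi-square probability for even df: exp(-x/2) * sum_{k<df/2} (x/2)^k / k!."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires an even number of degrees of freedom")
    half = x / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

stat, df = symmetry_statistic(D_T1)
print(stat, df, chi2_sf_even(stat, df))
```

For a 4 × 4 matrix with no empty pairs there are six off-diagonal pairs, so the statistic is compared with a chi-square distribution on six degrees of freedom.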
where $d_{1\bullet}$ is the sum of the first row of D(t1), $d_{\bullet 1}$ is the sum of the
$$S_M^2 = \mathbf{u}^T \mathbf{V}^{-1} \mathbf{u}; \qquad (9)$$
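The quadratic form in Eq. (9) can be sketched in code as follows. Because the definitions of u and V are not reproduced in full here, the sketch assumes the standard construction for the matched-pairs test of marginal symmetry (Stuart's test): u holds the differences between the row and column sums for the first three nucleotides, the diagonal of V is d_i• + d_•i − 2d_ii, and the off-diagonal elements are −(d_ij + d_ji). The function names are ours.

```python
import math

# Divergence matrix D(t1) from Eq. (6).
D_T1 = [
    [191, 71, 68, 57],
    [14, 142, 22, 33],
    [16, 12, 144, 29],
    [26, 19, 18, 138],
]

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [[float(v) for v in A[i]] + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < 1e-12:
            raise ValueError("V is singular; the test cannot be computed")
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x

def marginal_symmetry_statistic(D):
    """Return u^T V^-1 u (Eq. 9) and its degrees of freedom, assuming Stuart's construction."""
    n = len(D)
    row = [sum(D[i]) for i in range(n)]
    col = [sum(D[i][j] for i in range(n)) for j in range(n)]
    m = n - 1  # the last category is dropped because the differences sum to zero
    u = [row[i] - col[i] for i in range(m)]
    V = [[(row[i] + col[i] - 2 * D[i][i]) if i == j else -(D[i][j] + D[j][i])
          for j in range(m)] for i in range(m)]
    x = solve_linear(V, u)  # x = V^-1 u, without forming the inverse explicitly
    return sum(ui * xi for ui, xi in zip(u, x)), m

def chi2_sf_3df(x):
    """Upper-tail chi-square probability on 3 degrees of freedom (closed form)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

stat, df = marginal_symmetry_statistic(D_T1)
print(stat, df, chi2_sf_3df(stat))
```

Solving the linear system instead of inverting V is a design choice made for numerical convenience; it yields the same quadratic form.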
3.3.3 Matched-pairs Tests of Homogeneity with More than Two Sequences
We now turn to cases where the alignment contains more than two
sequences. Suppose we have four nucleotide sequences that have
evolved independently on a bifurcating tree with three bifurcations.
The alignment might look like that in Fig. 2. Visual inspection of
this alignment reveals how difficult it is to determine whether the
sequences have evolved under stationary, reversible, and homoge-
neous conditions, emphasizing the need for statistical tests to
address this issue. In the following, we demonstrate how such
tests may be used.
Initially, we use the overall matched-pairs test of marginal symmetry
[106], which returns a test statistic ($T_M^2$) that is asymptotically
distributed as a $\chi^2$-variate on $\nu = (n - 1)(l - 1)$ degrees of
freedom. For the data in Fig. 2, $T_M^2 = 56.93$ and $\nu = 9$. Assuming
evolution under stationary conditions, the probability of
$T_M^2 \ge 56.93$ is ~$5.22 \times 10^{-9}$, implying that these data are
unlikely to have evolved under stationary conditions. This result
fits well with what is known about the processes that generated the
data (Fig. 2).
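The probability quoted above is the upper tail of a chi-square distribution, which can be evaluated without specialized software when the number of degrees of freedom is an integer. The sketch below uses the standard finite-sum expressions for even and odd degrees of freedom; the function name is ours, and statistical packages provide equivalent routines.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) for a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term = total = 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2/pi) * exp(-x/2) * sum_r x^(r - 1/2) / (1*3*...*(2r - 1))
    term = math.sqrt(x)
    total = 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total

# Test statistic and degrees of freedom reported for the data in Fig. 2.
print(chi2_sf(56.93, 9))
```

With the statistic and degrees of freedom given in the text, this returns a probability of the order of 5 × 10⁻⁹, in line with the value quoted above; small differences in the last digit are to be expected because the statistic is given to only two decimal places.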
Notwithstanding this result, it is possible that some of the
sequences have evolved under stationary, reversible, and homoge-
neous conditions. To test whether this is the case, the matched-
pairs tests of symmetry, marginal symmetry, and internal symmetry
for pairs of sequences may be used. Table 2 shows the
corresponding three sets of six probabilities. After using the
sequential Bonferroni correction to counteract the problem of
Seq1 TTTCTGTAGACTACAGCCGAACTGATACAATACAAGCACAAACAATTCACCGCGTCGCGCACAGT
Seq2 CGTCTGGGATCTTTTGCCGGGCTGGGTCGCTACACGAACGCAGAGTTCTACTCCGGTCGCACTTG
Seq3 CTACAGTTAAGTTCTGCAGAGCTGCTTGACTATACGATCAACGAATACAAGACGGGGCGCACAGG
Seq4 CTTCGGTATAGTTCTGCCGAGCTGGTTCGCTACATGATCAATGATTACGACCCTGGGCCCTCTGG
CGTCAAAGCGGCATTCCATAAAAGTTCATCCATACCCCGAGGTAACCTCACGTCGTCACGGGCTGACGTAATCAC
CGGATGAGTTGGTTACGGAGAGTGCGGGTCTTTTCCCAAAGTTCATTTCCCGTCGTTTCGGCCTGTTGTAATCAT
CATATAAGTGGGATTCCGTAAGATCATGTCTCTACCCAAAGGGTACATGTTGTCTTCACGGCCAAACCTAATCAC
CGTATGAGTGGGATGGTGTCAAATTTCTTCTTGACCGGCAGGTCACCTCTTGTCCTGAGGGCCGGGCGGCAGCAG
GAAAGCACCGCCCGACCGGTCAAGCCTCAGAAGGGTCGAACACGGACTCAGTCTCAAGTGCTCCTCCACAAACGT
GTGTGCTCCGCCCCATCGGTGAAGCCCCGCTAGCGTATTACTCGGAATGTGTATCTAGTGCCAATTCATATACGT
GGGTCACCTGCCCAACAGTTGAAGGCGCGCCAGGCCGGCCCACGCATACAGACTCCAGAGCAACTCCATCAACGT
GTCTGTGCTGTTTTGCCTGTAATGCCTCGTCAGGCGGGAGCACGGTTTTAGTATCCTTGCCTACTCTATTATTCT
CATACTTAGTTCACCATCCCCGAGCCTATTTCCCTTAAAATGCGGTAACCCGGCCAGGGAGGAGAGAAAGAGTGG
ATAAGTTAGTTTATAATCTCCTCGCCTATTTCCTTAGAAATAGTATTATCGATCTTTGACGGAGTGAACTATGGG
CAAACTTACCTCAAAGTCTCCGCGGCTAGTCCCATTGAAATACGATTATCTCACCTTGCAAGAGTGAAAAAATCG
GAAATTTAGCTGATAATCTCTTCAGCTAATTCTTTAGAAATAGGCTTATCGTCCCGGGTTGGTGCGAAACATCCG
Fig. 2 An alignment of nucleotides. The data were generated using Hetero [167] with default settings
invoked. However, the marginal distributions of the evolutionary processes operating along the four terminal
edges were nonuniform ($\pi_1^T = (0.4, 0.3, 0.2, 0.1)$ for Seq1 and Seq3, and $\pi_2^T = (0.1, 0.2, 0.3, 0.4)$ for
Seq2 and Seq4), implying that the sequences evolved under nonstationary as well as nonreversible
conditions
Table 2
Results returned from the matched-pairs tests of symmetry, marginal symmetry, and internal
symmetry for the sequences in Fig. 2
Fig. 3 Tetrahedral plot, with the borders (a) or axes (b) displayed. The four dots
represent the four sequences in Fig. 2
Fig. 4 Tetrahedral plots, based on first codon sites (a), second codon sites (b), and third codon sites (c) from an
alignment of 2514 codons (extracted from a longer alignment of mitochondrial protein-coding genes from 53
animal species). This data set was originally analyzed by Bourlat et al. [115], who reported evidence of
compositional heterogeneity
Fig. 5 PP plots from data generated under stationary, reversible, and homogeneous conditions (a) and from the
alignment of 2514 second codon sites (for details, see Fig. 4). Results from the matched-pairs tests of
symmetry, marginal symmetry, and internal symmetry (i.e., 1378 probabilities from each test) are presented in
panels (b), (c), and (d), respectively. For each test, we first sorted the observed probabilities in descending
order before plotting them against a uniform distribution of probabilities. The horizontal line in each panel
marks a probability of 0.05
3.3.5 Phylogenetic
Have Evolved Under
Complex Conditions
So far, we have focused on determining whether homologous
sequences are consistent with evolution under the conditions
assumed by most popular molecular phylogenetic methods and, if
[115] shows, there are cases where the assumptions of popular
One option is to assume that the non-historical signals found in
the data are not big enough to affect the phylogenetic estimate.
Fig. 6 Heat map with color-coded probabilities obtained using the matched-pairs
the corresponding cell is black; otherwise, it is white. The names of the sequences are shown along the
Fig. 7 Heat map with color-coded probabilities obtained using the matched-pairs test of marginal symmetry.
The heat map was obtained by doing synchronized row-and-column permutations of the heat map in Fig. 6
example, if we were to infer the phylogeny of four compositionally
heterogeneous sequences and the true tree were like that on the left
those on the left in Fig. 8b and c, then the two signals would differ
likely outcome if the compositional signal were stronger than the
of the internal edges in the true tree, relative to the length of the
whole tree (the shorter the internal edges in the true tree are, the
400 Lars S. Jermiin et al.
Fig. 8 Diagram showing what might happen if a phylogeny were to be estimated from four compositionally
heterogeneous sequences and the phylogenetic methods assumed evolution under stationary, reversible, and
homogeneous conditions. Three scenarios (a, b, and c) are considered, with the true tree on the left, the
nucleotide composition of the sequences in the middle, and the inferred tree on the right. In each case, the
ancestral nucleotide composition was uniform
harder they are to infer: [105]); and the length of the sequences
(the longer the alignment is, the smaller the effect of stochastic
error and the bigger the likelihood of systematic error [due
to model misspecification]) (Wong and Jermiin, in prep.). In con-
clusion, this is not a recommended option.
Another option is to compare the sequences using phylogenetic
methods that are able to accommodate more complex processes of
molecular evolution. Broadly speaking, these phylogenetic meth-
ods can be divided into distance-based [19–22, 25, 26, 29, 31],
parsimony-based [18, 23], likelihood-based [16, 17, 24, 27, 28,
30, 32, 34, 36–42] and Bayesian [30, 33, 35] methods. It is beyond
the scope of this chapter to review these phylogenetic methods, but
it is still worth highlighting some of their features, strengths and
weaknesses.
The distance-based phylogenetic methods are designed to cal-
culate accurate estimates of the evolutionary distances between
pairs of sequences, which, in turn, may be used to infer a tree
using a clustering algorithm (e.g., [116–119]). The methods are
fast and, therefore, attractive. However, they are often of limited
value because only one model of evolution is applied across the sites
(for an exception, see Ref. [31]). Furthermore, it is often impossible
Identifying Optimal Models of Evolution 401
Fig. 9 Diagram illustrating two algorithms used to identify optimal models of evolution for sequence data.
Starting with a unique model of evolution assigned to each edge in the tree (a), the CORE algorithm [38] will
attempt to reduce complexity of the model until the best model of evolution is identified (b). Beginning with the
same model of evolution assigned to all edges in the tree (c), the bottom-up algorithm [42] increases the
complexity of the model until the best model of evolution is identified (b)
2012 [39] and 2014 [42], with the first one relying on substitution
mapping [121] and the second one not doing so. In both cases, the
search for the best model of evolution begins with the simplest
model of evolution (i.e., one where the same rate matrix is assigned
to all edges in the tree: Fig. 9c). By assigning different rate matrices
to different edges, the complexity of the model of evolution can be
increased. A comparison of the fit between tree, model and data can
then be used to evaluate whether the increased complexity of any of
the new models is a better explanation of the data than the older
less-complex model of evolution. This process of increasing model
complexity can be continued until the fit between tree, model and
data cannot be improved further.
The results obtained using these three classes of phylogenetic
methods are striking because they have revealed a more complex
evolution for much of the sequence data that has been analyzed
thus far. Not only is it now clear that phylogenetic methods that
assume globally stationary, reversible, and homogeneous Markov-
ian processes in many cases are inappropriate for the data and may
produce biased phylogenetic results [100, 105]; it is also clear that
there is a need for accurate and fast phylogenetic methods that
simultaneously can search tree space while fitting the data optimally
to each of the tree topologies considered. However, doing so is a
big combinatorial problem that will require a smart heuristic
solution because there are $\prod_{i=3}^{n} (2i - 3)$ rooted binary trees, each
with $2n - 2$ edges, and a Bell number of distinct rate-matrix
arrangements over the edges for each of these trees [38]. Most of
the phylogenetic methods discussed above assume that the tree is
known and that information about the models of evolution is the
only unknown component, so better methods are clearly needed.
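To give a sense of the scale of this search problem, the following Python sketch evaluates the two counts quoted above for small numbers of sequences. The function names are illustrative and are not taken from any of the cited programs. The sketch assumes that the edges of a tree are distinguishable, so that the number of rate-matrix arrangements over the 2n - 2 edges is the Bell number of 2n - 2 (computed here with the Bell triangle).

```python
def rooted_binary_trees(n):
    """Number of rooted binary trees with n labeled leaves: prod_{i=3}^{n} (2i - 3)."""
    count = 1
    for i in range(3, n + 1):
        count *= 2 * i - 3
    return count


def bell_number(m):
    """Bell number B_m: the number of ways to partition m items into groups."""
    row = [1]                      # first row of the Bell triangle; B_0 = 1
    for _ in range(m):
        new_row = [row[-1]]        # each row starts with the last entry of the previous row
        for value in row:
            new_row.append(new_row[-1] + value)
        row = new_row
    return row[0]


def rate_matrix_arrangements(n):
    """Ways to share rate matrices among the 2n - 2 edges of one rooted binary tree."""
    return bell_number(2 * n - 2)


if __name__ == "__main__":
    for n in (4, 6, 8, 10):
        trees = rooted_binary_trees(n)
        arrangements = rate_matrix_arrangements(n)
        print(n, trees, arrangements, trees * arrangements)
```

Even for four sequences there are 15 rooted trees and 203 arrangements per tree, which illustrates why an exhaustive search quickly becomes impractical.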
3.4 Testing the Assumption of Independent and Identical Processes
It is commonly assumed that the sites in an alignment of nucleotides
are independent and identically distributed, or at least independent.
The latter case includes a scenario where rate-heterogeneity across
sites (RHAS) is modeled using a Γ distribution and a proportion of
invariable sites. However, the order and number of nucleotides in a
gene usually determine the function of the gene product, implying that
it would be unrealistic, and maybe even unwise, to assume that the
sites in the alignment of nucleotides are independent and identically
distributed.
To determine whether sites in an alignment of nucleotides are
independent and identically distributed, it is necessary to compare
this simple model to the more complex models that describe the
interrelationship among sites in the alignment of nucleotides.
Descriptions of more complex models may depend on prior knowl-
edge of the genes and gene products, and comparisons may require
using likelihood-ratio tests, sometimes together with permutation
tests or a parametric bootstrap.
The likelihood-ratio test provides a statistically sound frame-
work for comparing alternative evolutionary hypotheses [122]. In
statistical terms, the likelihood-ratio test statistic, Δ, is defined as
$$\Delta = \frac{\max L(\text{data} \mid H_0)}{\max L(\text{data} \mid H_1)} \qquad (11)$$
where the likelihood, L, of the data, given the null hypothesis (H0),
and the likelihood of the data, given the alternative hypothesis
(H1), both are maximized with respect to the parameters. If
Δ > 1, then the data favor H0; otherwise, if Δ < 1, the data favor
the alternative hypothesis. When the hypotheses are nested and H0
is a special case of H1, then Δ ≤ 1 and −2log(Δ) is asymptotically
distributed under H0 as a $\chi^2$-variate with υ degrees of freedom (υ is
the extra number of parameters in H1). For a discussion of the
likelihood-ratio test, see Whelan and Goldman [123] and Goldman
and Whelan [124]. When the hypotheses are not nested, it is
necessary to use the parametric bootstrap [122, 125], in which
case pseudo-data must be generated under H0. In some cases,
permutation tests may be used instead [126].
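The nested case can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of any cited program: the function names are invented here, the log-likelihood values in the usage example are hypothetical, and the chi-square tail probability is computed with a standard recurrence for integer degrees of freedom so that only the standard library is needed.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, k = math.exp(-half), 2             # closed form for df = 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1  # closed form for df = 1
    while k < df:
        # Recurrence: Q(x; k + 2) = Q(x; k) + (x/2)^(k/2) e^(-x/2) / Gamma(k/2 + 1)
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return q


def likelihood_ratio_test(log_l0, log_l1, extra_params):
    """Return (-2 log(Delta), p-value) for nested hypotheses, with H0 nested in H1."""
    statistic = -2.0 * (log_l0 - log_l1)
    statistic = max(statistic, 0.0)  # guard against small negative values from incomplete optimization
    return statistic, chi2_sf(statistic, extra_params)


if __name__ == "__main__":
    # Hypothetical maximized log-likelihoods, for illustration only.
    stat, p = likelihood_ratio_test(log_l0=-5230.4, log_l1=-5221.9, extra_params=4)
    print(f"-2 log(Delta) = {stat:.2f}, p = {p:.4f}")
```

The sketch covers only the nested case; for non-nested hypotheses the null distribution would have to be obtained by the parametric bootstrap, as described above.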
The evolution of protein-coding genes and RNA-coding genes
is likely to differ due to the structural and functional constraints of
the gene products, so to determine whether sites in an alignment of
such genes evolved under independent and identical conditions, it
is useful to draw on knowledge of the structure and function of
these genes and their gene products. For example, while a protein-
coding gene may be regarded as a sequence of independently evol-
ving sites (Fig. 10a), it might be more appropriate to consider it as a
sequence of independently evolving codons (Fig. 10b) or a
sequence of independently evolving codon positions, with sites in
the same codon position evolving under identical and independent
A
Gene ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCA
Unit
B
Gene ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCA
Unit
C
Gene ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCA
Unit
Category 123123123123123123123123123123123123123123123123123123123123
D
Gene ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCA
Unit
Category 123123123123456456456456456456456456456456456789789789789789
E
Gene ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCA
Unit 1
Unit 2
Fig. 10 Models used to describe the relationship among sites in a protein-coding gene. The protein-coding
gene may be regarded as a sequence of independently evolving units, where each unit is a (a) site, (b) codon,
or (c) site assigned its own model of evolution, given its position within a codon. More complex models include
those that consider (d) information about the gene product’s structure and function (here, categories 1, 2, and
3 correspond to models assigned to sites within the codons that encode amino acids in one structural domain,
categories 4, 5, and 6 correspond to models assigned to sites in codons that encode amino acids in another
structural domain, and so forth), and (e) overlapping reading frames (here, unit 1 corresponds to one reading
frame of one gene whereas unit 2 corresponds to that of the other gene)
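The category labels in panels (c) and (d) of Fig. 10 can be generated programmatically, which is often the first step when preparing a partitioned analysis. The Python sketch below is illustrative only: the function names are invented here, and the domain lengths used in the test of panel (d) style labels (4, 11, and 5 codons) are assumptions chosen to mirror the pattern shown in the figure. Single-digit labels limit the sketch to three domains.

```python
GENE = "ATGAACGAAAATCTGTTCGCTTCATTCATTGCCCCCACAATCCTAGGCCTACCCGCCGCA"


def codon_position_categories(n_sites):
    """Category string as in Fig. 10c: 1, 2, 3 repeated along the gene."""
    return "".join(str(site % 3 + 1) for site in range(n_sites))


def domain_categories(codons_per_domain):
    """Category string as in Fig. 10d: three categories per structural domain."""
    parts = []
    for domain, n_codons in enumerate(codons_per_domain):
        codon_labels = "".join(str(3 * domain + pos) for pos in (1, 2, 3))
        parts.append(codon_labels * n_codons)
    return "".join(parts)


def sites_by_category(sequence, categories):
    """Group the nucleotides of a sequence by their category label."""
    groups = {}
    for nucleotide, label in zip(sequence, categories):
        groups.setdefault(label, []).append(nucleotide)
    return {label: "".join(sites) for label, sites in groups.items()}


if __name__ == "__main__":
    by_position = sites_by_category(GENE, codon_position_categories(len(GENE)))
    for label in sorted(by_position):
        print(label, by_position[label])
```

Each group of sites returned by `sites_by_category` could then be assigned its own model of evolution.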
A
Gene GAACTTGATTTAAAAGCCTATGTTTTGAAAACATAATAAAGAAATATAAATTTTTCT
Unit
B
Gene GAACTTGATTTAAAAGCCTATGTTTTGAAAACATAATAAAGAAATATAAATTTTTCT
Unit
Category 222333222222233322333332211122333332222333332222222333333
C
Gene GAACTTGATTTAAAAGCCTATGTTTTGAAAACATAATAAAGAAATATAAATTTTTCT
Unit
Category 222333222222233322333332211122333332222333332222222333333
Fig. 11 Models used to describe the relationship among sites in RNA-coding genes. An RNA-coding gene may
be regarded as a sequence of independently evolving units, where each unit is (a) a site, or (b) a site assigned
its own model of evolution, given the role it serves in the gene product (here, category 1 corresponds to a
model assigned to sites encoding the anticodon, category 2 corresponds to a model assigned to sites that
encode loops in the gene product, and category 3 corresponds to a model assigned to sites encoding the
stems in the gene product). A more advanced approach uses structural information to link nucleotides that
match each other in the gene product (c) (thin lines connect these pairs of nucleotides)
3.5 Choosing a Time-reversible Substitution Model
If a set of sites in an alignment were found to have evolved
independently under stationary, reversible, and homogeneous
conditions, then there is a large family of time-reversible Markov models
available for analysis of these data. Finding the most appropriate
Markov model from this family of models is simplified by the fact that
many of the models are nested, implying that the likelihood-ratio
test [122] may be used to determine whether the alternative
hypothesis, H1, provides a significantly better fit to the data than
the null hypothesis, H0. This model-selection method became
practically possible in 1998 for nucleotide sequences [145] and in
2005 for amino acid sequences [146]. In both cases, the method
allows for RHAS, thus catering for some of the differences found
among sites in phylogenetic data.
Although the abovementioned model-selection method
appears attractive, there are reasons for concern. For example, it
assumes that at least one of the models compared is correct, an
assumption that would be violated in most cases. Other problems
include those arising when: (a) multiple tests are done on the same
data and the tests are non-independent; (b) sample size (i.e., num-
ber of sites) is small; and (c) non-nested models are compared (for
an informative discussion of the problems, see Refs. [147, 148]).
Finally, it is assumed that the tree used during the comparisons of
models is the most likely tree for every model compared, which
might not be the case.
Some of these problems are easily dealt with by other model-
selection methods. Within the context of likelihood, alternative
models of evolution may be compared using the Akaike Informa-
tion Criterion (AIC) [149], where AIC for a given model, R, is a
function of the maximized log-likelihood on R and the number of
estimable parameters, K (e.g., nucleotide frequency, conditional
rates of change, proportion of invariant sites, rate variation among
sites, and number of edges in the tree):
$$\mathrm{AIC} = -2 \max \log L(\text{data} \mid R) + 2K. \qquad (12)$$
$$\mathrm{AICc} = \mathrm{AIC} + \frac{2K(K + 1)}{l - K - 1}. \qquad (13)$$
The AIC may be regarded as the amount of information lost by
using R to approximate the evolutionary processes and 2K may be
regarded as a penalty for allowing K parameters; hence, the best-fitting
Markov model corresponds to the smallest value of AIC (or
AICc). In Eq. 13, l denotes the sample size, here taken to be the
number of sites in the alignment.
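As a minimal sketch of how Eqs. 12 and 13 are applied, the Python functions below compute AIC and AICc and rank a set of candidate models. The function names, the model labels, and the log-likelihood values are hypothetical and serve only as an illustration; l is assumed to be the number of sites in the alignment.

```python
def aic(max_log_l, k):
    """Eq. 12: AIC = -2 max log L + 2K."""
    return -2.0 * max_log_l + 2.0 * k


def aicc(max_log_l, k, l):
    """Eq. 13: AIC with a correction for small sample size l."""
    if l - k - 1 <= 0:
        raise ValueError("AICc requires l > K + 1")
    return aic(max_log_l, k) + 2.0 * k * (k + 1) / (l - k - 1)


def rank_models(fits, l):
    """Sort models by AICc (smallest first); fits maps name -> (max log L, K)."""
    scored = [(name, aicc(log_l, k, l)) for name, (log_l, k) in fits.items()]
    return sorted(scored, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical fits for three nested models on an alignment of 1000 sites.
    fits = {"JC": (-5300.0, 5), "HKY": (-5230.4, 9), "GTR": (-5221.9, 13)}
    for name, score in rank_models(fits, l=1000):
        print(f"{name}: AICc = {score:.2f}")
```

Because the correction term shrinks as l grows, AIC and AICc give nearly the same ranking for long alignments.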
Within the Bayesian context, alternative models of evolution
may be compared using the Bayesian Information Criterion (BIC)
[151], the Bayes factor (BF) [152–154], posterior probabilities
(PP) [155, 156] and decision theory (DT) [157], where, for
example,
$$BF_{ij} = \frac{\Pr(\text{data} \mid R_i)}{\Pr(\text{data} \mid R_j)} \qquad (14)$$
and
$$\mathrm{BIC} = -2 \max \log L(\text{data} \mid R) + K \log l. \qquad (15)$$
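Equations 14 and 15 can be evaluated in the same way. The Python sketch below is illustrative: the function names are invented here, l is assumed to be the number of sites, and the Bayes factor is handled on the log scale because marginal likelihoods are usually far too small to represent directly. Estimating the marginal likelihoods themselves is not covered by this sketch.

```python
import math


def bic(max_log_l, k, l):
    """Eq. 15: BIC = -2 max log L + K log l."""
    return -2.0 * max_log_l + k * math.log(l)


def log_bayes_factor(log_marginal_i, log_marginal_j):
    """log BF_ij (Eq. 14) from log Pr(data | R_i) and log Pr(data | R_j)."""
    return log_marginal_i - log_marginal_j


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(f"BIC = {bic(-5221.9, 13, 1000):.2f}")
    print(f"log BF = {log_bayes_factor(-5240.3, -5236.8):.2f}")
```

The BIC penalty per parameter, log l, exceeds the AIC penalty of 2 whenever l is larger than about 7.4, so for realistic alignments BIC tends to favor simpler models than AIC.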
3.6 General Approaches to Model Selection
Having inferred the best tree for a given data set using a model of
evolution, it is always a good idea to evaluate the fit between tree,
model(s), and data. This cannot be accomplished using the
non-parametric bootstrap (because it measures variability of the estimate
obtained using a phylogenetic method) but can be done using the
parametric bootstrap. The use of the parametric bootstrap to test
the appropriateness of a given Markov model was proposed by
Goldman [125], and is a modification of Cox’s [164] test, which
considers non-nested models. The test can be performed as follows:
1. For a given model, R, use the original alignment to obtain the
log-likelihood, log L, and the maximum-likelihood estimates of
the free parameters;
2. Calculate the unconstrained log-likelihood, log L*, for the original
alignment using the following equation:
$$\log L^{*} = \sum_{i=1}^{N} \log\left(\frac{N_i}{N}\right), \qquad (16)$$
where N is the number of sites in the alignment and Ni is the
number of times that the pattern at column i occurs in the
alignment;
3. Calculate $\delta_{obs} = \log L^{*} - \log L$;
4. Use the inferred tree and the optimized parameter values
(obtained during step 1) to generate 1000 pseudo-data sets;
5. For each pseudo-data set, $j = 1, \ldots, 1000$, obtain $\log L_j$ (the
log-likelihood under R), $\log L_j^{*}$, and $\delta_j = \log L_j^{*} - \log L_j$;
6. Estimate p, the proportion of times where $\delta_j > \delta_{obs}$. A large
p-value supports the hypothesis that R is sufficient to explain the
evolutionary process underpinning the data while a small p-value
provides evidence against this hypothesis.
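Two parts of this procedure need no phylogenetic software and can be sketched directly in Python: the unconstrained log-likelihood of Eq. 16 and the p-value of step 6. The function names are illustrative. The log-likelihoods under R and the pseudo-data sets (steps 1, 4, and 5) are assumed to come from an external program and are not computed here.

```python
import math
from collections import Counter


def unconstrained_log_likelihood(alignment):
    """Eq. 16: sum over columns of log(N_i / N), N_i being the count of the column's pattern."""
    n_sites = len(alignment[0])
    if any(len(seq) != n_sites for seq in alignment):
        raise ValueError("all sequences must have the same length")
    columns = list(zip(*alignment))          # one tuple of nucleotides per site
    pattern_counts = Counter(columns)
    return sum(math.log(pattern_counts[col] / n_sites) for col in columns)


def bootstrap_p_value(delta_obs, delta_sim):
    """Step 6: proportion of pseudo-data sets with delta_j > delta_obs."""
    return sum(1 for d in delta_sim if d > delta_obs) / len(delta_sim)


if __name__ == "__main__":
    toy_alignment = ["AACGT", "AACGA", "AATGT"]   # toy data, for illustration only
    print(unconstrained_log_likelihood(toy_alignment))
```

In practice, `delta_obs` would be the unconstrained log-likelihood of the original alignment minus its log-likelihood under R, and `delta_sim` would hold the corresponding differences for the 1000 pseudo-data sets.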
Fig. 12 Examples of the results from two parametric bootstrap analyses. (a) Parametric bootstrap results
under the GTR+Γ model based on 1000 simulations. For each bootstrap replicate, the difference in
log-likelihoods was obtained by subtracting the log-likelihood under the GTR+Γ model from the unconstrained
log-likelihood. The arrow indicates the difference in log-likelihood for the actual data under the GTR+Γ model.
(b) Parametric bootstrap results under the BH+I model based on 1000 simulations. For each bootstrap
replicate, the difference in log-likelihoods was obtained by subtracting the log-likelihood under the BH+I
model from the unconstrained log-likelihood. The arrow indicates the difference in log-likelihood for the actual
data under the BH+I model
4 Discussion
References
1. Zakharov EV, Caterino MS, Sperling FAH (2004) Molecular phylogeny, historical biogeography, and divergence time estimates for swallowtail butterflies of the genus Papilio (Lepidoptera: Papilionidae). Syst Biol 53:193–215
2. Brochier C, Forterre P, Gribaldo S (2005) An emerging phylogenetic core of Archaea: phylogenies of transcription and translation machineries converge following addition of new genome sequences. BMC Evol Biol 5:36
3. Hardy MP, Owczarek CM, Jermiin LS et al (2004) Characterization of the type I interferon locus and identification of novel genes. Genomics 84:331–345
4. de Queiroz K, Gauthier J (1994) Toward a phylogenetic system of biological nomenclature. Trends Ecol Evol 9:27–31
5. Board PG, Coggan M, Chelvanayagam G et al (2000) Identification, characterization and crystal structure of the Omega class of glutathione transferases. J Biol Chem 275:24798–24806
6. Pagel M (1999) Inferring the historical patterns of biological evolution. Nature 401:877–884
7. Charleston MA, Robertson DL (2002) Preferential host switching by primate lentiviruses can account for phylogenetic similarity with the primate phylogeny. Syst Biol 51:528–535
8. Jermann TM, Opitz JG, Stackhouse J et al (1995) Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily. Nature 374:57–59
9. Eisen JA (1998) Phylogenomics: improving functional predictions for uncharacterized genes by evolutionary analysis. Genome Res 8:163–167
10. Misof B, Liu SL, Meusemann K et al (2014) Phylogenomics resolves the timing and pattern of insect evolution. Science 346:763–767
11. Darriba D, Taboada GL, Doallo R et al (2011) ProtTest 3: fast selection of best-fit models of protein evolution. Bioinformatics 27:1164–1165
12. Darriba D, Taboada GL, Doallo R et al (2012) jModelTest 2: more models, new heuristics and parallel computing. Nat Methods 9:772
13. Lanfear R, Calcott B, Ho SYW et al (2012) Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses. Mol Biol Evol 29:1695–1701
14. Lanfear R, Calcott B, Kainer D et al (2014) Selecting optimal partitioning schemes for phylogenomic datasets. BMC Evol Biol 14:82
15. Jermiin LS, Ho JWK, Lau KW et al (2009) SeqVis: a tool for detecting compositional heterogeneity among aligned nucleotide sequences. In: Posada D (ed) Bioinformatics for DNA sequence analysis. Humana Press, Totowa, NJ, pp 65–91
16. Barry D, Hartigan JA (1987) Statistical analysis of hominoid molecular evolution. Stat Sci 2:191–210
17. Reeves J (1992) Heterogeneity in the substitution process of amino acid sites of proteins coded for by the mitochondrial DNA. J Mol Evol 35:17–31
18. Steel MA, Lockhart PJ, Penny D (1993) Confidence in evolutionary trees from biological sequence data. Nature 364:440–442
19. Lake JA (1994) Reconstructing evolutionary trees from DNA and protein sequences: paralinear distances. Proc Natl Acad Sci U S A 91:1455–1459
20. Lockhart PJ, Steel MA, Hendy MD et al (1994) Recovering evolutionary trees under a more realistic model of sequence evolution. Mol Biol Evol 11:605–612
21. Steel MA (1994) Recovering a tree from the leaf colourations it generates under a Markov model. Appl Math Lett 7:19–23
22. Galtier N, Gouy M (1995) Inferring phylogenies from DNA sequences of unequal base compositions. Proc Natl Acad Sci U S A 92:11317–11321
23. Steel MA, Lockhart PJ, Penny D (1995) A frequency-dependent significance test for parsimony. Mol Phylogenet Evol 4:64–71
24. Yang Z, Roberts D (1995) On the use of nucleic acid sequences to infer early branches in the tree of life. Mol Biol Evol 12:451–458
25. Gu X, Li W-H (1996) Bias-corrected paralinear and logdet distances and tests of molecular clocks and phylogenies under nonstationary nucleotide frequencies. Mol Biol Evol 13:1375–1383
26. Gu X, Li W-H (1998) Estimation of evolutionary distances under stationary and nonstationary models of nucleotide substitution. Proc Natl Acad Sci U S A 95:5899–5905
27. Galtier N, Gouy M (1998) Inferring pattern and process: maximum-likelihood implementation of a nonhomogeneous model of DNA sequence evolution for phylogenetic analysis. Mol Biol Evol 15:871–879
28. Galtier N, Tourasse N, Gouy M (1999) A nonhyperthermophilic common ancestor to extant life forms. Science 283:220–221
29. Tamura K, Kumar S (2002) Evolutionary distance estimation under heterogeneous substitution pattern among lineages. Mol Biol Evol 19:1727–1736
30. Foster PG (2004) Modelling compositional heterogeneity. Syst Biol 53:485–495
31. Thollesson M (2004) LDDist: a Perl module for calculating LogDet pair-wise distances for protein and nucleotide sequences. Bioinformatics 20:416–418
32. Jayaswal V, Jermiin LS, Robinson J (2005) Estimation of phylogeny using a general Markov model. Evol Bioinf Online 1:62–80
33. Blanquart S, Lartillot N (2006) A Bayesian compound stochastic process for modeling nonstationary and nonhomogeneous sequence evolution. Mol Biol Evol 23:2058–2071
34. Jayaswal V, Robinson J, Jermiin LS (2007) Estimation of phylogeny and invariant sites under the General Markov model of nucleotide sequence evolution. Syst Biol 56:155–162
35. Blanquart S, Lartillot N (2008) A site- and time-heterogeneous model of amino acid replacement. Mol Biol Evol 25:842–858
36. Dutheil J, Boussau B (2008) Non-homogeneous models of sequence evolution in the Bio++ suite of libraries and programs. BMC Evol Biol 8:255
37. Jayaswal V, Jermiin LS, Poladian L et al (2011) Two stationary, non-homogeneous Markov models of nucleotide sequence evolution. Syst Biol 60:74–86
38. Jayaswal V, Ababneh F, Jermiin LS et al (2011) Reducing model complexity when the evolutionary process over an edge is modeled as a homogeneous Markov process. Mol Biol Evol 28:3045–3059
39. Dutheil JY, Galtier N, Romiguier J et al (2012) Efficient selection of branch-specific models of sequence evolution. Mol Biol Evol 29:1861–1874
40. Zou LW, Susko E, Field C et al (2012) Fitting nonstationary general-time-reversible models to obtain edge-lengths and frequencies for the Barry-Hartigan model. Syst Biol 61:927–940
41. Groussin M, Boussau B, Gouy M (2013) A branch-heterogeneous model of protein evolution for efficient inference of ancestral sequences. Syst Biol 62:523–538
42. Jayaswal V, Wong TKF, Robinson J et al (2014) Mixture models of nucleotide sequence evolution that account for heterogeneity in the substitution process across sites and across lineages. Syst Biol 63:726–742
43. Woodhams MD, Fernandez-Sanchez J, Sumner JG (2015) A new hierarchy of phylogenetic models consistent with heterogeneous substitution rates. Syst Biol 64:638–650
44. Jermiin LS, Jayaswal V, Ababneh F et al (2008) Phylogenetic model evaluation. In: Keith J (ed) Bioinformatics: data, sequence analysis, and evolution. Humana Press, Totowa, NJ, pp 331–364
45. Sullivan J, Arellano EA, Rogers DS (2000) Comparative phylogeography of Mesoamerican highland rodents: concerted versus independent responses to past climatic fluctuations. Am Nat 155:755–768
46. Demboski JR, Sullivan J (2003) Extensive mtDNA variation within the yellow-pine chipmunk, Tamias amoenus (Rodentia: Sciuridae), and phylogeographic inferences for northwestern North America. Mol Phylogenet Evol 26:389–408
47. Carstens BC, Stevenson AL, Degenhardt JD et al (2004) Testing nested phylogenetic and phylogeographic hypotheses in the Plethodon vandykei species group. Syst Biol 53:781–792
48. Penny D, Hendy MD, Steel MA (1992) Progress with methods for constructing evolutionary trees. Trends Ecol Evol 7:73–79
49. Tavaré S (1986) Some probabilistic and statistical problems on the analysis of DNA sequences. Lect Math Life Sci 17:57–86
50. Ababneh F, Jermiin LS, Robinson J (2006) Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. J Math Model Algor 5:291–308
51. Bryant D, Galtier N, Poursat M-A (2005) Likelihood calculation in molecular phylogenetics. In: Gascuel O (ed) Mathematics of evolution and phylogeny. Oxford University Press, Oxford, pp 33–62
52. Ullah I, Sjöstrand J, Andersson P et al (2015) Integrating sequence evolution into probabilistic orthology analysis. Syst Biol 64:969–982
53. Drouin G, Prat F, Ell M et al (1999) Detecting and characterizing gene conversion between multigene family members. Mol Biol Evol 16:1369–1390
54. Posada D, Crandall KA (2001) Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc Natl Acad Sci U S A 98:13757–13762
55. Posada D (2002) Evaluation of methods for detecting recombination from DNA sequences: empirical data. Mol Biol Evol 19:708–717
56. Martin DP, Williamson C, Posada D (2005) RDP2: recombination detection and analysis from sequence alignments. Bioinformatics 21:260–262
57. Bruen TC, Philippe H, Bryant D (2006) A simple and robust statistical test for detecting the presence of recombination. Genetics 172:2665–2681
58. Ragan MA (2001) On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett 201:187–191
59. Dufraigne C, Fertil B, Lespinats S et al (2005) Detection and characterization of horizontal transfers in prokaryotes using genomic signature. Nucleic Acids Res 33:e6
60. Azad RK, Lawrence JG (2005) Use of artificial genomes in assessing methods for atypical gene detection. PLoS Comp Biol 1:461–473
61. Tsirigos A, Rigoutsos I (2005) A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Res 33:922–933
62. Ragan MA, Harlow TJ, Beiko RG (2006) Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol 14:4–8
63. Beiko RG, Hamilton N (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol 6:15
64. Sjöstrand J, Tofigh A, Daubin V et al (2014) A Bayesian method for analyzing lateral gene transfer. Syst Biol 63:409–420
65. Fitch WM (1986) An estimation of the number of invariable sites is necessary for the accurate estimation of the number of nucleotide substitutions since a common ancestor. Prog Clin Biol Res 218:149–159
66. Lockhart PJ, Larkum AWD, Steel MA et al (1996) Evolution of chlorophyll and bacteriochlorophyll: the problem of invariant sites in sequence analysis. Proc Natl Acad Sci U S A 93:1930–1934
67. Yang Z (1996) Among-site rate variation and its impact on phylogenetic analysis. Trends Ecol Evol 11:367–372
68. Waddell PJ, Steel MA (1997) General time reversible distances with unequal rates across sites: mixing Γ and inverse Gaussian distributions with invariant sites. Mol Phylogenet Evol 8:398–414
69. Gowri-Shankar V, Rattray M (2006) Compositional heterogeneity across sites: effects on phylogenetic inference and modelling the correlations between base frequencies and substitution rate. Mol Biol Evol 23:352–364
70. Schöniger M, von Haeseler A (1994) A stochastic model for the evolution of autocorrelated DNA sequences. Mol Phylogenet Evol 3:240–247
71. Tillier ERM (1994) Maximum likelihood with multiparameter models of substitution. J Mol Evol 39:409–417
72. Hein J, Støvlbæk J (1995) A maximum-likelihood approach to analyzing nonoverlapping and overlapping reading frames. J Mol Evol 40:181–190
73. Muse SV (1995) Evolutionary analyses of DNA sequences subject to constraints on secondary structure. Genetics 139:1429–1439
74. Rzhetsky A (1995) Estimating substitution rates in ribosomal RNA genes. Genetics 141:771–783
75. Tillier ERM, Collins RA (1995) Neighbor joining and maximum likelihood with RNA sequences: addressing the interdependence of sites. Mol Biol Evol 12:7–15
76. Pedersen A-MK, Wiuf C, Christiansen FB (1998) A codon-based model designed to describe lentiviral evolution. Mol Biol Evol 15:1069–1081
77. Tillier ERM, Collins RA (1998) High apparent rate of simultaneous compensatory base-pair substitutions in ribosomal RNA. Genetics 148:1993–2002
78. Higgs PG (2000) RNA secondary structure: physical and computational aspects. Q Rev Biophys 30:199–253
79. Pedersen A-MK, Jensen JL (2001) A dependent-rates model and an MCMC-based methodology for the maximum-likelihood analysis of sequences with overlapping frames. Mol Biol Evol 18:763–776
80. Savill NJ, Hoyle DC, Higgs PG (2001) RNA sequence evolution with secondary structure constraints: comparison of substitution rate models using maximum-likelihood methods. Genetics 157:399–411
81. Jow H, Hudelot C, Rattray M et al (2002) Bayesian phylogenetics using an RNA substitution model applied to early mammalian evolution. Mol Biol Evol 19:1591–1601
82. Lockhart PJ, Steel MA, Barbrook AC et al (1998) A covariotide model explains apparent phylogenetic structure of oxygenic photosynthetic lineages. Mol Biol Evol 15:1183–1188
83. Galtier N (2001) Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol 18:866–873
84. Pupko T, Galtier N (2002) A covarion-based method for detecting molecular adaptation: application to the evolution of primate mitochondrial genomes. Proc R Soc B 269:1313–1316
85. Susko E, Inagaki Y, Field C et al (2002) Testing for differences in rates-across-sites distributions in phylogenetic subtrees. Mol Biol Evol 19:1514–1523
86. Wang HC, Spencer M, Susko E et al (2007) Testing for covarion-like evolution in protein sequences. Mol Biol Evol 24:294–305
87. Wang HC, Susko E, Spencer M et al (2008) Topological estimation biases with covarion evolution. J Mol Evol 66:50–60
88. Wu JH, Susko E (2009) General heterotachy and distance method adjustments. Mol Biol Evol 26:2689–2697
89. Wang HC, Susko E, Roger AJ (2009) PROCOV: maximum likelihood estimation of protein phylogeny under covarion models and site-specific covarion pattern analysis. BMC Evol Biol 9:225
90. Wang HC, Susko E, Roger AJ (2011) Fast statistical tests for detecting heterotachy in protein evolution. Mol Biol Evol 28:2305–2315
91. Wu JH, Susko E (2011) A test for heterotachy using multiple pairs of sequences. Mol Biol Evol 28:1661–1673
92. Kolmogoroff A (1936) Zur Theorie der Markoffschen Ketten. Math Annal 112:155–160
93. Yang Z (2014) Molecular evolution: a statistical approach. Oxford University Press, Oxford
94. Jukes TH, Cantor CR (1969) Evolution of protein molecules. In: Munro HN (ed) Mammalian protein metabolism. Academic, New York, pp 21–132
95. Lanave C, Preparata G, Saccone C et al (1984) A new method for calculating evolutionary substitution rates. J Mol Evol 20:86–93
96. Naylor GJP, Brown WM (1998) Amphioxus mitochondrial DNA, chordate phylogeny, and the limits of inference based on comparisons of sequences. Syst Biol 47:61–76
97. Grundy WN, Naylor GJP (1999) Phylogenetic inference from conserved sites alignments. J Exp Zool 285:128–139
98. Li CH, Matthes-Rosana KA, Garcia M et al (2012) Phylogenetics of Chondrichthyes and the problem of rooting phylogenies with distant outgroups. Mol Phylogenet Evol 63:365–373
99. Campbell MA, Chen WJ, Lopez JA (2013) Are flatfishes (Pleuronectiformes) monophyletic? Mol Phylogenet Evol 69:664–673
100. Ho SYW, Jermiin LS (2004) Tracing the decay of the historical signal in biological sequence data. Syst Biol 53:623–637
101. Lartillot N, Philippe H (2004) A Bayesian mixture model for across-site heterogeneities in the amino-acid replacement process. Mol Biol Evol 21:1095–1109
102. Le SQ, Dang CC, Gascuel O (2012) Modeling protein evolution with several amino acid replacement matrices depending on site rates. Mol Biol Evol 29:2921–2936
103. Lartillot N, Rodrigue N, Stubbs D et al (2013) PhyloBayes MPI: phylogenetic reconstruction with infinite mixtures of profiles in a parallel environment. Syst Biol 62:611–615
[...] special reference to the positions of hedgehog, armadillo, and elephant. Syst Biol 48:31–53
111. Bowker AH (1948) A test for symmetry in contingency tables. J Am Stat Assoc 43:572–574
112. Stuart A (1955) A test for homogeneity of the marginal distributions in a two-way classification. Biometrika 42:412–416
113. Holm S (1979) A simple sequentially rejective multiple test procedure. Scand J Stat 6:65–70
114. Cannings C, Edwards AWF (1968) Natural selection and the de Finetti diagram. Ann Hum Genet 31:421–428
115. Bourlat SJ, Juliusdottir T, Lowe CJ et al (2006) Deuterostome phylogeny reveals monophyletic chordates and the new phylum Xenoturbellida. Nature 444:85–88
116. Fitch WM, Margoliash E (1967) Construction of phylogenetic trees. Science 155:279–284
117. Cavalli-Sforza LL, Edwards AWF (1967) Phylogenetic analysis: models and estimation procedures. Am J Hum Genet 19:233–257
118. Saitou N, Nei M (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol 4:406–425
104. Nguyen L-T, Schmidt HA, Von Haeseler A 119. Gascuel O (1997) BIONJ: an improved ver-
et al (2015) IQ-TREE: a fast and effective sion of the NJ algorithm based on a simple
stochastic algorithm for estimating model of sequence data. Mol Biol Evol
maximum-likelihood phylogenies. Mol Biol 14:685–695
Evol 32:268–274 120. Zou L, Susko E, Field C et al (2011) The
105. Jermiin LS, Ho SYW, Ababneh F et al (2004) parameters of the Barry-Hartigan model are
The biasing effect of compositional heteroge- statistically non identifiable. Syst Biol
neity on phylogenetic estimates may be 60:872–875
underestimated. Syst Biol 53:638–643 121. Minin VN, Suchard MA (2008) Fast, accurate
106. Ababneh F, Jermiin LS, Ma C et al (2006) and simulation-free stochastic mapping.
Matched-pairs tests of homogeneity with Philos Trans R Soc Lond B 363:3985–3995
applications to homologous nucleotide 122. Huelsenbeck JP, Rannala B (1997) Phyloge-
sequences. Bioinformatics 22:1225–1231 netic methods come of age: testing hypoth-
107. Ho JWK, Adams CE, Lew JB et al (2006) eses in an evolutionary context. Science
SeqVis: visualization of compositional hetero- 276:227–232
geneity in large alignments of nucleotides. 123. Whelan S, Goldman N (1999) Distributions
Bioinformatics 22:2162–2163 of statistics used for the comparison of models
108. Lanave C, Pesole G (1993) Stationary MAR- of sequence evolution in phylogenetics. Mol
KOV processes in the evolution of biological Biol Evol 16:11292–11299
macromolecules. Binary 5:191–195 124. Goldman N, Whelan S (2000) Statistical tests
109. Rzhetsky A, Nei M (1995) Tests of applicabil- of gamma-distributed rate heterogeneity in
ity of several substitution models for DNA models of sequence evolution in phyloge-
sequence data. Mol Biol Evol 12:131–151 netics. Mol Biol Evol 17:975–978
110. Waddell PJ, Cao Y, Hauf J et al (1999) Using 125. Goldman N (1993) Statistical tests of models
novel phylogenetic methods to evaluate of DNA substitution. J Mol Evol 36:182–198
mammalian mtDNA, including amino acid- 126. Telford MJ, Wise MJ, Gowri-Shankar V (2005)
invariant sites-LogDet plus site stripping, to Consideration of RNA secondary structure sig-
detect internal conflicts in the data, with nificantly improves likelihood-based estimates
Identifying Optimal Models of Evolution 419
of phylogeny: examples from the bilateria. Mol 141. Shapiro B, Rambaut A, Drummond AJ
Biol Evol 22:1129–1136 (2005) Choosing appropriate substitution
127. Goldman N, Yang Z (1994) A codon-based models for the phylogenetic analysis of
model of nucleotide substitution for protein- protein-coding sequences. Mol Biol Evol
coding DNA sequences. Mol Biol Evol 23:7–9
11:725–736 142. Hyman IT, Ho SYW, Jermiin LS (2007)
128. Muse SV, Gaut BS (1994) A likelihood Molecular phylogeny of Australian Helicario-
approach for comparing synonymous and nidae, Microcystidae and related groups (Gas-
nonsynonymous nucleotide substitution tropoda: Pulmonata: Stylommatophora)
rates, with application to the chloroplast based on mitochondrial DNA. Mol Phylo-
genome. Mol Biol Evol 11:715–724 genet Evol 45:792–812
129. Dayhoff MO, Schwartz RM, Orcutt BC (eds) 143. Hudelot C, Gowri-Shankar V, Jow H et al
(1978) A model of evolutionary change in (2003) RNA-based phylogenetic methods:
proteins. National Biomedical Research application to mammalian mitochondrial
Foundation, National Biomedical Research RNA sequences. Mol Phylogenet Evol
Foundation, Washington, DC 28:241–252
130. Jones DT, Taylor WR, Thornton JM (1992) 144. Murray S, Flø Jørgensen M, Ho SYW et al
The rapid generation of mutation data matri- (2005) Improving the analysis of dinoflage-
ces from protein sequences. CABIOS late phylogeny based on rDNA. Protist
8:275–282 156:269–286
131. Henikoff S, Henikoff JG (1992) Amino acid 145. Posada D, Crandall KA (1998) MODELT-
substitution matrices from protein blocks. EST: testing the model of DNA substitution.
Proc Natl Acad Sci U S A 89:10915–10919 Bioinformatics 14:817–818
132. Adachi J, Hasegawa M (1996) Model of 146. Abascal F, Zardoya R, Posada D (2005) Prot-
amino acid substitution in proteins encoded Test: selection of best-fit models of protein
by mitochondrial DNA. J Mol Evol evolution. Bioinformatics 21:2104–2105
42:459–468 147. Burnham KP, Anderson DR (2002) Model
133. Cao Y, Janke A, Waddell PJ et al (1998) Con- selection and multimodel inference: a practi-
flict among individual mitochondrial proteins cal information-theoretic approach. Springer,
in resolving the phylogeny of eutherian New York
orders. J Mol Evol 47:307–322 148. Posada D, Buckley TR (2004) Model selec-
134. Yang Z, Nielsen R, Hasegawa M (1998) tion and model averaging in phylogenetics:
Models of amino acid substitution and appli- advantages of akaike information criterion
cations to mitochondrial protein evolution. and bayesian approaches over likelihood
Mol Biol Evol 15:1600–1611 ratio tests. Syst Biol 53:793–808
135. M€ uller T, Vingron M (2000) Modeling 149. Akaike H (1974) A new look at the statistical
amino acid replacement. J Comp Biol model identification. IEEE Trans Auto Cont
7:761–776 19:716–723
136. Adachi J, Waddell PJ, Martin W et al (2000) 150. Sugiura N (1978) Further analysis of the data
Plastid genome phylogeny and a model of by Akaike’s information criterion and the
amino acid substitution for proteins encoded finite corrections. Comm Stat A Theor Meth
by chloroplast DNA. J Mol Evol 50:348–358 7:13–26
137. Whelan S, Goldman N (2001) A general 151. Schwarz G (1978) Estimating the dimension
empirical model of protein evolution derived of a model. Ann Stat 6:461–464
from multiple protein families using a maxi- 152. Suchard MA, Weiss RE, Sinsheimer JS (2001)
mum likelihood approach. Mol Biol Evol Bayesian selection of continuous-time Mar-
18:691–699 kov chain evolutionary models. Mol Biol
138. Dimmic MW, Rest JS, Mindell DP et al Evol 18:1001–1013
(2002) RtREV: an amino acid substitution 153. Aris-Brosou S, Yang Z (2002) Effects of mod-
matrix for inference of retrovirus and reverse els of rate evolution on estimation of diver-
transcriptase phylogeny. J Mol Evol 55:65–73 gence dates with special reference to the
139. Abascal F, Posada D, Zardoya R (2007) metazoan 18S ribosomal RNA phylogeny.
MtArt: a new model of amino acid replace- Syst Biol 51:703–714
ment for Arthropoda. Mol Biol Evol 24:1–5 154. Nylander JA, Ronquist F, Huelsenbeck JP
140. Le SQ, Gascuel O (2008) An improved gen- et al (2004) Bayesian phylogenetic analysis
eral amino acid replacement matrix. Mol Biol of combined data. Syst Biol 53:47–67
Evol 25:1307–1320
420 Lars S. Jermiin et al.
155. Kass RE, Raftery AE (1995) Bayes factors. J phylogenetic substitution models. Syst Biol
Am Stat Assoc 90:773–795 52:594–603
156. Raftery AE (1996) Hypothesis testing and 163. Soubrier J, Steel M, Lee MSY et al (2012) The
model selection. In: Gilks WR, Richardson influence of rate heterogeneity among sites on
S, Spiegelhalter DJ (eds) Markov chain the time dependence of molecular rates. Mol
Monte Carlo in practice. Chapman & Hall, Biol Evol 29:3345–3358
London, pp 163–167 164. Cox DR (1962) Further results on tests of
157. Minin V, Abdo Z, Joyce P et al (2003) separate families of hypotheses. J R Stat Soc
Performance-based selection of likelihood B 24:406–424
models for phylogenetic estimation. Syst 165. Rambaut A, Grassly NC (1997) Seq-Gen: an
Biol 52:674–683 application for the Monte Carlo simulation of
158. Posada D, Crandall KA (2001) Selecting DNA sequence evolution along phylogenetic
methods of nucleotide substitution: An appli- trees. CABIOS 13:235–238
cation to human immunedeficiency virus 1 166. Fletcher W, Yang ZH (2009) INDELible: a
(HIV-1). Mol Biol Evol 18:897–906 flexible simulator of biological sequence evo-
159. Posada D (2008) jModelTest: phylogenetic lution. Mol Biol Evol 26:1879–1888
model averaging. Mol Biol Evol 167. Jermiin LS, Ho SYW, Ababneh F et al (2003)
25:1253–1256 Hetero: a program to simulate the evolution
160. Yang Z (2006) Computational molecular of DNA on a four-taxon tree. Appl Bioinfor-
evolution. Oxford University Press, Oxford matics 2:159–163
161. Yang Z, Kumar S, Nei M (1995) A new 168. Felsenstein J (2004) Inferring phylogenies.
method of inference of ancestral nucleotide Sinauer Associates, Sunderland, MA
and amino acid sequences. Genetics 169. Rokas A, Kr€uger D, Carroll SB (2005) Animal
141:1641–1650 evolution and the molecular signature of
162. Susko E, Field C, Blouin C et al (2003) Esti- radiations compressed in time. Science
mation of rates-across-sites distributions in 310:1933–1938
Chapter 16
Abstract
Lateral genetic transfer (LGT) is the process by which genetic material moves between organisms (and
viruses) in the biosphere. Among the many approaches developed for the inference of LGT events from
DNA sequence data, methods based on the comparison of phylogenetic trees remain the gold standard for
many types of problem. Identifying LGT events from sequenced genomes typically involves a series of steps
in which homologous sequences are identified and aligned, phylogenetic trees are inferred, and their
topologies are compared to identify unexpected or conflicting relationships. These types of approach
have been used to elucidate the nature and extent of LGT and its physiological and ecological consequences
throughout the Tree of Life. Advances in DNA sequencing technology have led to enormous increases in
the number of sequenced genomes, including ultra-deep sampling of specific taxonomic groups and single
cell-based sequencing of unculturable “microbial dark matter.” Environmental shotgun sequencing enables
the study of LGT among organisms that share the same habitat.
This abundance of genomic data offers new opportunities for scientific discovery, but poses two key
problems. As ever more genomes are generated, the assembly and annotation of each individual genome
receives less scrutiny; and with so many genomes available it is tempting to include them all in a single
analysis, but thousands of genomes and millions of genes can overwhelm key algorithms in the analysis
pipeline. Identifying LGT events of interest therefore depends on choosing the right dataset, and on
algorithms that appropriately balance speed and accuracy given the size and composition of the chosen
set of genomes.
Key words Lateral genetic transfer, Horizontal genetic transfer, Phylogenetic analysis, Phyloge-
nomics, Multiple sequence alignment, Orthology
1 Introduction
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_16, © Springer Science+Business Media New York 2017
422 Cheong Xin Chan et al.
large, even among strains of the same species [2], and that LGT was
a significant force in the remodeling of microbial genomes [3–5].
The thousands of genomes now available allow for targeted ana-
lyses of specific groups including pathogens [6, 7], as well as studies
that aim to infer LGT broadly across the Tree of Life, using either
all available genomes [8–10] or representative subsets [11]. Initial
studies of the human microbiome suggest that rates of LGT may be
higher in host-associated settings than in any other environment
[12]. Many microbes can regulate their uptake of DNA from the
environment, and processes such as biofilm formation [13] and
inflammation [14] may induce a massive increase in the rates of
LGT.
Detection of LGT events relies on statistical methods to iden-
tify DNA sequences either with different properties than the major-
ity of the genome, or with unusual patterns of distribution or
relatedness across a set of genomes. Although LGT events need
not be delimited by gene [15] or domain [16] boundaries, in many
studies the gene (or protein) is the unit of analysis owing to interest
in which gene-encoded functions have been acquired via LGT.
Initially, DNA of lateral origin bears features of the donor genome
(e.g., G+C content, codon usage, or higher order compositional
features) which may be anomalous in the new genomic context.
Composition-based approaches have been used to identify genes of
foreign origin and carry the advantage that inference can be made
from a single genome. Phylogenetic approaches have their own
distinct advantages: the best build on many decades of theory and
practice, employ explicit models of the evolutionary process, are
reasonably precise in delineating the lateral region, and can detect
events of different ages from very recent to more ancient. Since
compositional anomalies eventually “ameliorate” [17] to those of
the host genome, phylogenetic methods are needed to detect
more-ancient transfer events [8, 18]. However, phylogenetic
approaches as typically applied require the inference of orthologous
sets of genes or proteins to proceed. All approaches run up against
the challenges of overlapping and/or superimposed events, and
loss of signal at deeper divergences.
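As a rough illustration of the composition-based idea described above, the following Python sketch flags genes whose G+C content departs strongly from the genome-wide mean. It is a deliberately simple stand-in, not a published method: the function names and the z-score cutoff of 2.0 are assumptions made for this example, and real composition-based tools also use codon usage and higher order features.

```python
from statistics import mean, pstdev


def gc_content(seq):
    """Fraction of G or C among unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def flag_atypical_genes(genes, z_cutoff=2.0):
    """Return names of genes whose G+C content lies more than z_cutoff
    population standard deviations from the mean over all genes.

    genes: dict mapping gene name -> nucleotide sequence.
    z_cutoff: hypothetical threshold chosen for illustration only.
    """
    values = {name: gc_content(seq) for name, seq in genes.items()}
    mu = mean(values.values())
    sigma = pstdev(values.values())
    if sigma == 0:
        return []  # all genes have identical composition
    return [name for name, gc in values.items()
            if abs(gc - mu) / sigma > z_cutoff]
```

A flagged gene is only a candidate: as noted above, compositional anomalies ameliorate over time, so older transfers would not be found this way.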
In this chapter, we outline a computational workflow of meth-
ods for assembling and aligning sets of putatively orthologous
sequences, inferring phylogenetic trees and, by comparisons with
a reference topology, identifying prima facie instances of LGT.
Recent work [10] has highlighted the need for algorithms that
scale to tens of thousands of genomes while making use of current
best-practice algorithms. All the methods we present here are
meant to scale well with increasing dataset size, and many have
been applied to thousands and even millions of sequences. The
notes not only present further details, but where appropriate also
call attention to limitations of these methods and to alternative
approaches.
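The final step of the workflow, comparison of a gene tree against a reference topology, can be sketched in a few lines of Python. This is a minimal sketch under strong assumptions: trees are given as rooted nested tuples of leaf names (a hypothetical representation chosen for brevity), and discordance is reported whenever a gene-tree clade partially overlaps a reference clade. Production analyses normally work with unrooted bipartitions and take statistical support into account.

```python
def clades(tree):
    """Return (leaf_set, clade_set) for a rooted tree written as nested
    tuples, e.g. (("A", "B"), ("C", "D")). Leaves are strings."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    leaves = frozenset()
    found = set()
    for child in tree:
        child_leaves, child_clades = clades(child)
        leaves |= child_leaves
        found |= child_clades
    found.add(leaves)  # the clade subtended by this internal node
    return leaves, found


def conflicting_clades(gene_tree, reference_tree):
    """List gene-tree clades that are incompatible with the reference.

    Reference clades are first restricted to the taxa present in the
    gene tree. Two clades are compatible if they are disjoint or nested.
    """
    gene_leaves, gene_clades = clades(gene_tree)
    _, ref_clades = clades(reference_tree)
    restricted = {c & gene_leaves for c in ref_clades}
    conflicts = []
    for g in gene_clades:
        for r in restricted:
            shared = g & r
            if shared and shared != g and shared != r:
                conflicts.append(g)
                break
    return conflicts
```

An empty result corresponds to concordance; a non-empty result is a prima facie instance of LGT that still needs to be weighed against alternative explanations such as poor phylogenetic signal.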
Detecting Lateral Gene Transfer 423
2.1 Sources of Data
The inference methods we focus on depend on the availability of
sequenced genomes (completed or draft), with gene predictions. We
will not review algorithms for sequence assembly or gene prediction
here, and refer the reader instead to reviews of these important steps
in this book (Chapters 2 and 11) and elsewhere [19–21].
The principal source of sequenced genomes remains the Gen-
Bank resource [22]. As of 12 September 2016 the NCBI Genome
database lists over 73,600 genomes of bacteria and archaea, from
Abiotrophia defectiva strain ATCC 49176 to Zymomonas mobilis
ATCC 29192 in various states of assembly, including ~5800 com-
pleted genomes. Also available are over 13,000 viral and plasmid
sequences, and about 3500 genomes of microbial eukaryotes. Other
microbial eukaryote genomes are also available from the Joint
Genome Institute’s Genome Portal (http://genome.jgi.doe.gov/).
These genomes with associated gene predictions can be acquired in a
number of ways, including in bulk using UNIX commands such as
‘rsync’ and ‘wget.’ Other resources include the Pathosystems
Resource Integration Center [23], which offers function and path-
way annotation alongside the sequenced genomes. The Genomes
OnLine Database [24] includes standards-compliant metadata,
including the type of environment and geographic coordinates of
isolation.
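For bulk acquisition with ‘rsync’, a small Python helper can assemble the command so that it can be logged and re-run. This is only a sketch: the helper name is hypothetical, the flags are standard rsync options, and the remote directory must be taken from the current documentation of the data provider (the address in the usage note is a placeholder, not a real server path).

```python
def rsync_command(remote_dir, local_dir):
    """Build (but do not run) an rsync command that mirrors remote_dir.

    --recursive  descend into subdirectories
    --times      preserve modification times, so later runs skip
                 files that have not changed
    --copy-links copy the files that symbolic links point to
    --verbose    list transferred files
    """
    return ["rsync", "--recursive", "--times", "--copy-links",
            "--verbose", remote_dir, local_dir]
```

The returned list can be passed to `subprocess.run(..., check=True)`, for example `rsync_command("rsync://example.org/genomes/", "genomes/")`.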
2.2 Software Programs
All programs described in this chapter are freely available:
BLAST [25]: http://blast.ncbi.nlm.nih.gov/; alternatives include UBLAST [26] and RAPSearch2 [27]
MCL [28]: http://micans.org/mcl/
MUSCLE [29]: http://www.drive5.com/muscle/
MAFFT [30]: http://mafft.cbrc.jp/alignment/software/
BMGE [31]: https://wiki.gacrc.uga.edu/wiki/BMGE
Gblocks [32]: http://molevol.cmima.csic.es/castresana/Gblocks.html
MrBayes [33]: http://mrbayes.sourceforge.net/
RAxML [34]: http://sco.h-its.org/exelixis/web/software/raxml/index.html
CLANN [35]: http://chriscreevey.github.io/clann/
SPR supertree [36]: http://kiwi.cs.dal.ca/Software/index.php/SPRSupertrees
phytools [37]: http://cran.r-project.org/web/packages/phytools/
3 Methods
3.1 Clustering of Sequences into Sets of Putative Homologs and Orthologs
1. All-versus-all BLAST (or the more efficient UBLAST or RAPSearch2) is carried out on the set of predicted proteins from all
genomes in the analysis. Each pairwise BLAST match with expectation score e ≤ 10⁻³ is kept and is normalized by dividing by its
self-score, yielding the set of significant edges (by this crite-
rion). If the genomes are closely related (e.g., conspecific), this
and subsequent steps should instead be carried out on gene
(not protein) sets, with corresponding changes in parameter
values where required.
2. This edge set is used as input for Markov clustering using MCL.
Validation of MCL on the Protein Data Bank suggested that an
inflation parameter (I) of 1.1 is appropriate (see Note 1).
As MCL is memory intensive, it may be necessary to carry out
[Workflow labels recovered from Fig. 1: refinement of alignment (BMGE or Gblocks); concordance: no LGT; discordance: LGT]
Fig. 1 An overall workflow of phylogenetic approach for detecting lateral genetic transfer
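Step 1 of Subheading 3.1 (filtering BLAST matches and normalizing by self-score) can be sketched as follows. The sketch assumes 12-column BLAST tabular output (query, subject, ..., e-value in column 11, bit score in column 12) and assumes that the quantity normalized is the bit score; both are assumptions of this example rather than statements about the authors' scripts, and the function names are hypothetical.

```python
def normalized_edges(blast_rows, max_evalue=1e-3):
    """Turn BLAST tabular rows into weighted edges.

    blast_rows: iterable of 12-field rows (lists of strings).
    Returns {(query, subject): bit score / query self-score} for
    matches with e-value <= max_evalue. Queries without a self-match
    are skipped because no self-score is available for them.
    """
    rows = [(r[0], r[1], float(r[10]), float(r[11])) for r in blast_rows]
    self_score = {}
    for query, subject, evalue, bits in rows:
        if query == subject:
            self_score[query] = max(bits, self_score.get(query, 0.0))
    edges = {}
    for query, subject, evalue, bits in rows:
        if query == subject or evalue > max_evalue:
            continue
        if query not in self_score:
            continue
        weight = bits / self_score[query]
        key = (query, subject)
        edges[key] = max(weight, edges.get(key, 0.0))  # keep best match
    return edges


def abc_lines(edges):
    """Format edges as 'label label weight' lines for clustering."""
    return [f"{q}\t{s}\t{w:.4f}" for (q, s), w in sorted(edges.items())]
```

The resulting lines, written to a file, follow the label-pair layout that MCL reads with its `--abc` option, so that step 2 could then be run along the lines of `mcl edges.abc --abc -I 1.1 -o clusters.txt`.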
3.2 Multiple Sequence Alignment
1. Each MRC of protein sequences is kept in FASTA format, the
most commonly accepted file format across computational
tools for sequence analysis.
2. Next, multiple sequence alignment is performed on each
MRC. Many tools are available for this purpose; MUSCLE
[29] and MAFFT [30] are the most popular programs to
date. Running MUSCLE or MAFFT (mafft-linsi) at default
settings should yield good results in most cases. In MUSCLE,
the maximum number of iterations during the refinement pro-
cess (-maxiters option) is 16 by default, and can be increased
for sets of very dissimilar (highly divergent) sequences. The
aligned sequences are in FASTA format by default. Other
tools, e.g., FSA and Clustal Omega, are tailored for very large
sequence sets (e.g., >100 sequences).
3. For each alignment, ambiguously aligned regions and ragged
ends can be removed using BMGE [31]. The default para-
meters (including use of the BLOSUM62 model) are likely to
be sufficient for most protein sequence alignments. The default
output format is PHYLIP sequential, but other formats can be
easily specified using the -o option. Another popular program
for this purpose is Gblocks [32], but its default settings are too
strict for normal use, so it is usually necessary to adjust its
parameter settings depending on the number of sequences
within a set [40].
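The alignment and trimming steps above can be scripted. The Python sketch below builds a MUSCLE command line with an adjustable `-maxiters` value (assuming the MUSCLE version 3 style options `-in`, `-out`, and `-maxiters`; later releases use a different command line) and then removes gap-rich columns from a FASTA alignment. The column filter is a much simpler stand-in for BMGE or Gblocks, which use entropy and block criteria; the 0.5 gap-fraction cutoff and all function names are assumptions of this example.

```python
def muscle_command(fasta_in, fasta_out, maxiters=16):
    """Build (but do not run) a MUSCLE v3-style command line."""
    return ["muscle", "-in", fasta_in, "-out", fasta_out,
            "-maxiters", str(maxiters)]


def read_fasta(text):
    """Parse FASTA text into {name: sequence}, preserving order."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def drop_gappy_columns(alignment, max_gap_fraction=0.5):
    """Remove columns in which the fraction of '-' exceeds the cutoff."""
    names = list(alignment)
    lengths = {len(alignment[n]) for n in names}
    if len(lengths) != 1:
        raise ValueError("aligned sequences must have equal length")
    n_cols = lengths.pop()
    keep = [i for i in range(n_cols)
            if sum(alignment[n][i] == "-" for n in names) / len(names)
            <= max_gap_fraction]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}
```

For highly divergent sequence sets, `muscle_command("mrc.fa", "mrc.aln.fa", maxiters=32)` would request more refinement iterations than the default of 16.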
4 Notes
Acknowledgements
References
1. Fleischmann RD, Adams MD, White O et al (1995) Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269:496–512
2. Welch RA, Burland V, Plunkett G et al (2002) Extensive mosaic structure revealed by the complete genome sequence of uropathogenic Escherichia coli. Proc Natl Acad Sci U S A 99:17020–17024
3. Gogarten JP, Townsend JP (2005) Horizontal gene transfer, genome innovation and evolution. Nat Rev Microbiol 3:679–687
4. Keeling PJ, Palmer JD (2008) Horizontal gene transfer in eukaryotic evolution. Nat Rev Genet 9:605–618
5. Ochman H, Lawrence JG, Groisman EA (2000) Lateral gene transfer and the nature of bacterial innovation. Nature 405:299–304
6. Chan CX, Beiko RG, Ragan MA (2011) Lateral transfer of genes and gene fragments in Staphylococcus extends beyond mobile elements. J Bacteriol 193:3964–3977
7. Young BC, Golubchik T, Batty EM et al (2012) Evolutionary dynamics of Staphylococcus aureus during progression from carriage to disease. Proc Natl Acad Sci U S A 109:4550–4555
8. Beiko RG, Harlow TJ, Ragan MA (2005) Highways of gene sharing in prokaryotes. Proc Natl Acad Sci U S A 102:14332–14337
9. Puigbò P, Wolf YI, Koonin EV (2010) The tree and net components of prokaryote evolution. Genome Biol Evol 2:745–756
10. Beiko RG (2011) Telling the whole story in a 10,000-genome world. Biol Direct 6:34
11. Yutin N, Puigbò P, Koonin EV et al (2012) Phylogenomics of prokaryotic ribosomal proteins. PLoS One 7:e36972
12. Smillie CS, Smith MB, Friedman J et al (2011) Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480:241–244
13. Ehrlich GD, Ahmed A, Earl J et al (2010) The distributed genome hypothesis as a rubric for understanding evolution in situ during chronic bacterial biofilm infectious processes. FEMS Immunol Med Microbiol 59:269–279
14. Stecher B, Denzler R, Maier L et al (2012) Gut inflammation can boost horizontal gene transfer between pathogenic and commensal Enterobacteriaceae. Proc Natl Acad Sci U S A 109:1269–1274
15. Chan CX, Beiko RG, Darling AE et al (2009) Lateral transfer of genes and gene fragments in prokaryotes. Genome Biol Evol 1:429–438
16. Chan CX, Darling AE, Beiko RG et al (2009) Are protein domains modules of lateral genetic transfer? PLoS One 4:e4524
17. Lawrence JG, Ochman H (1997) Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 44:383–397
18. Ragan MA, Harlow TJ, Beiko RG (2006) Do different surrogate methods detect lateral genetic transfer events of different relative ages? Trends Microbiol 14:4–8
19. Stein L (2001) Genome annotation: from sequence to biology. Nat Rev Genet 2:493–503
20. El-Metwally S, Hamza T, Zakaria M et al (2013) Next-generation sequence assembly: four stages of data processing and computational challenges. PLoS Comput Biol 9:e1003345
21. Richardson EJ, Watson M (2013) The automatic annotation of bacterial genomes. Brief Bioinform 14:1–12
22. Benson DA, Cavanaugh M, Clark K et al (2013) GenBank. Nucleic Acids Res 41:D36–D42
23. Wattam AR, Abraham D, Dalay O et al (2014) PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic Acids Res 42:D581–D591
24. Pagani I, Liolios K, Jansson J et al (2012) The Genomes OnLine Database (GOLD) v. 4: status of genomic and metagenomic projects and their associated metadata. Nucleic Acids Res 40:D571–D579
25. Camacho C, Coulouris G, Avagyan V et al (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421
26. Edgar RC (2010) Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26:2460–2461
27. Zhao Y, Tang H, Ye Y (2012) RAPSearch2: a fast and memory-efficient protein similarity search tool for next-generation sequencing data. Bioinformatics 28:125–126
28. Enright AJ, Van Dongen S, Ouzounis CA (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 30:1575–1584
29. Edgar RC (2004) MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 32:1792–1797
30. Katoh K, Standley DM (2013) MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol Biol Evol 30:772–780
31. Criscuolo A, Gribaldo S (2010) BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evol Biol 10:210
32. Talavera G, Castresana J (2007) Improvement of phylogenies after removing divergent and ambiguously aligned blocks from protein sequence alignments. Syst Biol 56:564–577
33. Ronquist F, Teslenko M, van der Mark P et al (2012) MrBayes 3.2: efficient Bayesian phylogenetic inference and model choice across a large model space. Syst Biol 61:539–542
34. Stamatakis A (2014) RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30:1312–1313
35. Creevey CJ, McInerney JO (2005) CLANN: investigating phylogenetic information through supertree analyses. Bioinformatics 21:390–392
36. Whidden C, Zeh N, Beiko RG (2014) Supertrees based on the subtree prune-and-regraft distance. Syst Biol 63:566–581
37. Revell LJ (2012) phytools: an R package for phylogenetic comparative biology (and other things). Methods Ecol Evol 3:217–223
38. Harlow TJ, Gogarten JP, Ragan MA (2004) A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics 5:45
39. Skippington E, Ragan MA (2011) Within-species lateral genetic transfer and the evolution of transcriptional regulation in Escherichia coli and Shigella. BMC Genomics 12:532
40. Beiko RG, Ragan MA (2008) Detecting lateral genetic transfer: a phylogenetic approach. Methods Mol Biol 452:457–469
41. Yang Z (1994) Estimating the pattern of nucleotide substitution. J Mol Evol 39:105–111
42. Whelan S, Goldman N (2001) A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol 18:691–699
43. Reinert G, Chew D, Sun F et al (2009) Alignment-free sequence comparison (I): statistics and power. J Comput Biol 16:1615–1634
44. Wan L, Reinert G, Sun F et al (2010) Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol 17:1467–1490
45. Ulitsky I, Burstein D, Tuller T et al (2006) The average common substring approach to phylogenomic reconstruction. J Comput Biol 13:336–350
46. Domazet-Lošo M, Haubold B (2009) Efficient estimation of pairwise distances between genomes. Bioinformatics 25:3221–3227
47. Chan CX, Bernard G, Poirion O et al (2014) Inferring phylogenies of evolving sequences without multiple sequence alignment. Sci Rep 4:6504
48. Bonham-Carter O, Steele J, Bastola D (2013) Alignment-free genetic sequence comparisons: a review of recent approaches by word analysis. Brief Bioinform 15:890–905
49. Haubold B (2014) Alignment-free phylogenetics and population genetics. Brief Bioinform 15:407–418
50. Ragan MA, Bernard G, Chan CX (2014) Molecular phylogenetics before sequences: oligonucleotide catalogs as k-mer spectra. RNA Biol 11:176–185
51. Chan CX, Ragan MA (2013) Next-generation phylogenomics. Biol Direct 8:3
52. Baum BR (1992) Combining trees as a way of combining data sets for phylogenetic inference, and the desirability of combining gene trees. Taxon 41:3–10
53. Ragan MA (1992) Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol 1:53–58
54. Beiko RG, Hamilton N (2006) Phylogenetic identification of lateral genetic transfer events. BMC Evol Biol 6:15
55. Whidden C, Beiko R, Zeh N (2013) Fixed-parameter algorithms for maximum agreement forests. SIAM J Comput 42:1431–1466
56. Skippington E, Ragan MA (2011) Lateral genetic transfer and the construction of genetic exchange communities. FEMS Microbiol Rev 35:707–735
57. Aberer AJ, Kobert K, Stamatakis A (2014) ExaBayes: massively parallel Bayesian tree inference for the whole-genome era. Mol Biol Evol 31(10):2553–2556
58. Drummond AJ, Suchard MA, Xie D et al (2012) Bayesian phylogenetics with BEAUti and the BEAST 1.7. Mol Biol Evol 29:1969–1973
59. Guindon S, Dufayard JF, Lefort V et al (2010) New algorithms and methods to estimate maximum-likelihood phylogenies: assessing the performance of PhyML 3.0. Syst Biol 59:307–321
60. Price MN, Dehal PS, Arkin AP (2010) FastTree 2 – approximately maximum-likelihood trees for large alignments. PLoS One 5:e9490
Chapter 17
Abstract
Recombination between nucleotide sequences is a major process influencing the evolution of most species
on Earth. The evolutionary value of recombination has been widely debated and so too has its influence on
evolutionary analysis methods that assume nucleotide sequences replicate without recombining. When
nucleic acids recombine, the evolution of the daughter or recombinant molecule cannot be accurately
described by a single phylogeny. This simple fact can seriously undermine the accuracy of any phylogenetics-
based analytical approach which assumes that the evolutionary history of a set of recombining sequences can
be adequately described by a single phylogenetic tree. There are presently a large number of available
methods and associated computer programs for analyzing and characterizing recombination in various
classes of nucleotide sequence datasets. Here we examine the use of some of these methods to derive and
test recombination hypotheses using multiple sequence alignments.
1 Introduction
Many methods have been developed for the detection and analysis
of recombination in nucleotide sequence data (for a reasonably
comprehensive list see http://www.bioinf.manchester.ac.uk/recombination/programs.shtml).
While most of these methods
provide some statistical indication of how much evidence of recom-
bination is present within a group of nucleotide sequences, some
will also infer likely recombination breakpoint positions, identify
recombinants and their possible parental sequences, or estimate
recombination rates [1].
Detecting recombination is often very straightforward. If, for
example, three nucleotide sequences are considered, two of them
will generally be more closely related to one another than either is
to the third. In the absence of recombination one would expect
these relationships to be maintained continuously across the
lengths of the three sequences. To detect recombination all that is
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_17, © Springer Science+Business Media New York 2017
434 Darren P. Martin et al.
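The three-sequence reasoning in the Introduction can be illustrated with a short Python sketch: for each window along an alignment it reports which pair of sequences is most similar, so that a change in the closest pair between windows hints at a possible recombination event. This is a naive illustration with no statistical test and is not any of the methods implemented in RDP4; the window and step sizes and the function names are assumptions of this example.

```python
from itertools import combinations


def identity(a, b):
    """Proportion of identical sites, ignoring sites with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def closest_pair_by_window(seqs, window, step):
    """For each window, return (start, most similar pair of names).

    seqs: dict of equal-length aligned sequences. Ties are resolved
    in favor of the first pair in sorted-name order.
    """
    names = sorted(seqs)
    length = len(seqs[names[0]])
    result = []
    for start in range(0, length - window + 1, step):
        end = start + window
        scores = {(p, q): identity(seqs[p][start:end], seqs[q][start:end])
                  for p, q in combinations(names, 2)}
        best = max(scores, key=scores.get)
        result.append((start, best))
    return result
```

If the closest pair is the same in every window, the relationships are maintained across the alignment, as expected in the absence of recombination.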
2 Program Usage
2.1 Data Files
RDP4 will accept nucleotide sequence alignments in many common
formats including FASTA, CLUSTAL, MEGA, NEXUS,
GDE, PHYLIP, and DNAMAN. Although the program performs
optimally on alignments containing between 4 and 1000 sequences
of up to 50 kb in length, it can also be used to analyze between 4
and 100 sequences of up to 5 Mb in length.
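The size limits quoted above can be encoded in a small pre-flight check before loading an alignment. The sketch below is only one reading of those limits (taking 50 kb as 50,000 sites and 5 Mb as 5,000,000 sites); the function name and the category labels are hypothetical and are not part of RDP4.

```python
def rdp4_size_class(n_sequences, alignment_length):
    """Classify an alignment against the size ranges given in the text."""
    if 4 <= n_sequences <= 1000 and alignment_length <= 50_000:
        return "optimal"
    if 4 <= n_sequences <= 100 and alignment_length <= 5_000_000:
        return "supported"
    return "outside stated range"
```

For example, an alignment of 80 sequences of 1 Mb each falls in the second range, whereas 500 sequences of that length fall outside both.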
A wide variety of datasets can be productively analyzed for
recombination providing care is taken during their assembly. The
optimal size of a dataset depends on the degree of sequence diver-
sity present therein. As recombination can only be detected if it
occurs in sequences that are not identical to one another, it is
Detecting and Analyzing Genetic Recombination Using RDP4 435
2.2 Program Settings
Before screening a nucleotide sequence alignment for recombination
it may be advisable to adjust various program settings that can
influence how RDP4 will search for and analyze recombination
signals. All of the program’s settings can be changed using the
“Options” button at the top of the main program window (Fig. 1).
Under the “General” tab in the “Analyze Sequences Using:”
section (Fig. 2), it is possible to select the methods that will be used
to detect recombination. In most cases, however, it is advisable to
use the default selections (RDP, GENECONV, and MAXCHI)
[Figure labels: Command buttons; Sequence display; Recombination information; Stop button; Plot display; Options tabs]
[Fig. 3 panels: (A) y-axis “Breakpoints left undetected (%)”; (B) y-axis “Distance from actual breakpoint (nts)”; (C) y-axis “Recombinants misidentified (%)”. Legend: “Methods tested in [19]”, “RDP4 Methods”, “RDP4 without breakpoint polish”, “RDP4 with breakpoint polish”]
Fig. 3 The inference power and accuracy of various recombination detection methods. Seven of the
recombination detection methods implemented in RDP4 were compared individually (RDP, GENECONV,
RECSCAN, MAXCHI, CHIMAERA, SISCAN, and 3SEQ) and in combination (R + G + M referring to the RDP,
GENECONV, and MAXCHI methods used in primary screening mode and all of the methods used in secondary
screening mode as per the RDP4 default settings) with that of both the jpHMM [7] method and the Simplot
that there are two possible ways of using the RECSCAN and
SISCAN methods to detect recombination in an alignment. By
default both RECSCAN and SISCAN will be used to automatically
check recombination signals detected by all other methods (i.e.,
they will be used for secondary or confirmatory recombination
scans) but they will not be used to explore for any new recombina-
tion signals. These methods can be selected to explore for new
recombination signals by ticking the left box beside the method
name (be warned though that analyses can become very slow if
these methods are used for exploratory screening of large datasets).
The RDP, GENECONV, MAXCHI, 3SEQ, and CHIMAERA
methods will all automatically be used to check recombination
signals detected by all other methods regardless of whether they
are selected or not. The LARD [13] method can only be used to
check signals detected by other methods and should only be
selected if datasets are very small (<20 sequences). A very rough
estimate of the anticipated analysis time is given at the bottom of
the options menu so that you can judge whether particular selec-
tions are computationally viable.
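The roles of the different methods described above can be summarized as a small lookup structure. This is only an illustrative sketch of the rules as stated in the text; the constant names and the `plan_scan` function are hypothetical and do not correspond to any RDP4 interface or file format.

```python
# Roles as described in the text (names below are illustrative only).
DEFAULT_EXPLORATORY = ("RDP", "GENECONV", "MAXCHI")
ALWAYS_CHECK = ("RDP", "GENECONV", "MAXCHI", "3SEQ", "CHIMAERA")
CHECK_BY_DEFAULT = ("RECSCAN", "SISCAN")   # secondary scans unless ticked
CHECK_ONLY = ("LARD",)                     # never exploratory
LARD_MAX_SEQUENCES = 20                    # "very small" datasets (<20)


def plan_scan(n_sequences, extra_exploratory=(), use_lard=False):
    """Return (exploratory methods, checking methods, advisory notes)."""
    exploratory = list(DEFAULT_EXPLORATORY)
    notes = []
    for method in extra_exploratory:
        if method in CHECK_ONLY:
            raise ValueError(f"{method} can only check signals found by other methods")
        if method not in exploratory:
            exploratory.append(method)
        if method in CHECK_BY_DEFAULT:
            notes.append(f"{method}: exploratory screening can be very slow on large datasets")
    checking = list(ALWAYS_CHECK) + list(CHECK_BY_DEFAULT)
    if use_lard:
        checking.append("LARD")
        if n_sequences >= LARD_MAX_SEQUENCES:
            notes.append("LARD: advised only for very small datasets (<20 sequences)")
    return exploratory, checking, notes
```

For example, `plan_scan(50, extra_exploratory=["SISCAN"], use_lard=True)` returns two advisory notes, one for each selection that the text warns against on larger datasets.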
Under the “Data processing options” section on the “General”
tab (Fig. 2), it is possible to configure the way RDP4 processes
detectable recombination signals during its formulation of a recom-
bination hypothesis. Apart from the “Disentangle overlapping
events” and “Polish breakpoints” options, default settings should
almost always be used. If the “Disentangle overlapping events”
option is selected, the program will attempt to ensure that the
recombination hypothesis it derives does not invoke recombination
between pairs of recombinant sequences that have breakpoints
which fall at similar sites (such as would be identified as relatively
unlikely reciprocal recombination events). This setting works well
when recombination in the dataset is relatively sparse and some
evidence for recombination hot-spots is present. However, the
method used to disentangle overlapping recombination events
can get into a circular loop where it is unable to derive a recombi-
nation hypothesis that does not involve reciprocal recombination.
Fig. 3 (Continued) version of the BOOTSCAN method [8] using simulated HIV recombinant datasets and
analysis results published in [7]. (a) Recombination detection power (lower scores are better). (b) Breakpoint
inference accuracy without (in blue) and with (in orange) the “Polish breakpoints” setting (lower scores are
better). Note that the jpHMM method has close to the maximum accuracy achievable for a recombination
breakpoint site inference test. (c) The accuracy of recombinant identification without (in blue) and with (in
orange) the “Polish breakpoints” setting (lower is better). Note that the jpHMM method and the Simplot version
of the BOOTSCAN method were used to screen known simulated recombinants against a set of known non-
recombinant reference sequences and therefore could not be directly compared to the RDP4 methods with
respect to recombinant identification accuracy. For almost all of the methods in RDP4 the “Polish breakpoints”
setting has a large positive impact on the accuracy with which both breakpoint sites are inferred, and
recombinant sequences are identified
2.3 Producing a Preliminary Recombination Hypothesis
Once analysis settings have been selected and a multiple sequence
alignment has been loaded in RDP4, an automated exploratory
search for recombination signals can be carried out by pressing
the “X-Over” button (Fig. 1). Note that the automated exploratory
search consists of two main phases: the first involving the
detection of recombination signals in the alignment and the second
2.4 Making a Recombination-Free Dataset
When an automated analysis has been either terminated or run to
completion, a set of colored blocks will be presented in the “Schematic
sequence display” on the bottom right panel of the program
(Fig. 1). These blocks graphically represent the recombination
events that RDP4 has detected and characterized. For each
sequence in the dataset, the name of the sequence and a colored
strip are displayed. Beneath some of these strips (and
corresponding to lightened sections of the colored strips) are a
series of colored blocks. Each of these blocks represents a proposed
recombination event. If the mouse pointer is moved over any of
these blocks, information relating to the represented recombina-
tion event will be displayed in the “recombination information
display” on the top right panel of the screen (Fig. 1). This informa-
tion includes:
1. Possible recombination breakpoints.
2. Names of sequences in the dataset that are most closely related
to the parents of the recombinant sequence.
3. The approximate probability that the apparent recombination
signal arose due to convergent mutation rather than
recombination.
4. The number of sequences in the dataset carrying similar recom-
bination signals (in the “Confirmation table”).
5. A bar graph showing evidence used by the program to identify
the recombinant.
The most important bit of information displayed here is, how-
ever, the series of warnings that the program gives in capitalized red
letters. These will indicate when RDP4 is reasonably unsure about
some of the conclusions it has reached. The program will issue a
warning if (a) one or both of the inferred breakpoint positions are
probably inaccurate, (b) the wrong sequence may have been identi-
fied as the recombinant, (c) it is possible that an alignment error has
generated a false positive recombination signal, (d) there is only one
sequence in the dataset resembling one of the recombinant’s parental
sequences, and (e) only trace evidence of a recombination event is
present within the currently specified sequence (see Note 1).
2.5 Navigating Through the Analysis Results
The automated output given by RDP4 is nothing more than a
preliminary hypothesis describing a small fraction of the recombination
events that have occurred during the evolutionary histories
of the sequences being analyzed. It is very important to understand
that the program is fallible. The program’s failures will be of three
major types:
1. inaccurate identification of recombination breakpoint positions;
2. incorrect identification of parental sequences as recombinants;
and
3. incorrect identification of groups of recombinants that have
descended from the same recombinant ancestor.
Unfortunately there are no automated tools in RDP4 that will
reliably indicate whether the preliminary results that it yields con-
tain these errors. It is very likely that, unless the initial automated
RDP4 results indicate that there are only a few recombinant sequences
(<20 % of the sequences in the dataset are recombinants), the
program will have made some mistakes interpreting the patterns
of recombination it has detected. The size and importance of the
mistakes will scale with the number of unique recombination events
the program detects. It is especially important to understand that
mistakes made early on in an analysis (such as in the first 10 % of
unique recombination events the program characterizes) will be
more impactful than those made in the end stages of an analysis.
This arises because RDP4 identifies and characterizes the easiest to
detect recombination events first and leaves the interpretation of
the least obvious recombination signals until last. Once, for exam-
ple, a mistake has been made identifying which of the sequences is
2.6 Checking the Accuracy of Breakpoint Identification
Graphs in the plot display (Fig. 1) are useful for checking the
accuracy of recombination breakpoint estimation. Light and dark-gray
shaded areas of the graphs, respectively, indicate the 99 % and
95 % confidence intervals of breakpoint locations as determined by
the BURT method (Breakpoint Uncertainty in Recombined Tri-
plets). While the positions of breakpoints are indicated by inter-
secting lines in SISCAN, RDP, and RECSCAN plots, they are
instead indicated by peaks in MAXCHI (see plot display in Fig. 4)
and CHIMAERA plots, and conversely are represented as alternat-
ing peaks and troughs in 3SEQ plots. If the breakpoint positions
that are indicated by different methods do not match, this will
imply that there is a fair degree of uncertainty regarding the posi-
tion of the breakpoints. The various recombination detection
[Fig. 4 panel labels: MAXCHI matrix; MAXCHI plot]
Fig. 4 Various tools for checking the accuracy of breakpoint prediction. The peaks of MAXCHI plots (in the
bottom left panel) should coincide with breakpoint positions (indicated by the left and right bounds of the pink
area). Breakpoint sites should also fall within the gray area (indicating the 99 % confidence interval of the
breakpoint locations as determined by the BURT method). A MAXCHI matrix (top right panel) is similar to a
MAXCHI plot but it expresses maximum Chi square p-values associated with every possible pair of break-
points. The “peaks” in the matrix are the dark red regions. Arrows on the matrix indicate peaks corresponding
with the pair of breakpoints identified by RDP4. The patterns of polymorphic sites in a sequence triplet (top left
panel) can help indicate the range of invariant nucleotide sites where a breakpoint position likely occurs
(grayed-out sites that are indicated by the black box). The site identified as the breakpoint position is indicated
by a vertical arrow
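To illustrate why peaks in a maximum chi-square plot mark likely breakpoints, the following minimal sketch scans a pair of aligned sequences with a sliding window. It is a simplified rendering of the general maximum chi-square idea, not RDP4's MAXCHI implementation: it uses every alignment column rather than only variable sites, reports raw chi-square values rather than p-values, and the function names are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def maxchi_scan(seq_x, seq_y, half_window):
    """Chi-square value at each candidate breakpoint k (0-based).

    For each k, matches and mismatches between the two sequences are
    counted in the half-window to the left of k and in the half-window
    starting at k; a large value means the two halves differ sharply in
    how similar the sequences are.
    """
    if len(seq_x) != len(seq_y):
        raise ValueError("sequences must be aligned to the same length")
    match = [1 if p == q else 0 for p, q in zip(seq_x, seq_y)]
    scores = []
    for k in range(half_window, len(match) - half_window + 1):
        left = sum(match[k - half_window:k])
        right = sum(match[k:k + half_window])
        chi2 = chi_square_2x2(left, half_window - left,
                              right, half_window - right)
        scores.append((k, chi2))
    return scores
```

In this toy setting the highest score falls where the proportion of matching sites changes most abruptly, which is the behavior one looks for when checking that plot peaks coincide with the inferred breakpoint positions.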
2.7 Checking the Accuracy of Recombinant Sequence Identification
The next thing to consider for a particular detected recombination
event is whether RDP4 has correctly identified the recombinant
sequence. This can be very difficult to assess. The program uses a
range of phylogenetic and genetic distance-based tests to infer
which of the three sequences used to detect a recombination signal
is the recombinant. Very often different tests will indicate that
different sequences are most likely recombinant. RDP4 therefore
uses a weighted consensus of these tests when it automatically
identifies recombinant sequences (see Note 2). The results of
these tests are displayed, together with the weighted consensus, as
a series of bar graphs in the “recombination information display”
(Fig. 1) on the top right panel. RDP4 will display a warning if the
tests do not clearly indicate which sequence is recombinant. This
warning should not be disregarded; time and care should be
afforded to determine whether the recombinant has been misiden-
tified as one of the suggested parental sequences.
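The idea of a weighted consensus can be sketched as follows. The tests, their weights, and the scoring scheme used here are purely hypothetical (the actual tests and weights used by RDP4 are not given in this section); the sketch only shows how per-test scores for the three sequences might be combined into consensus scores that sum to one.

```python
def weighted_consensus(test_scores, weights):
    """Combine per-test scores into normalized consensus scores.

    test_scores: {test_name: {sequence_name: score}} (hypothetical tests)
    weights:     {test_name: weight}                 (hypothetical weights)
    Each test's scores are first scaled to sum to one, then multiplied by
    the test's weight; the totals are finally rescaled to sum to one.
    """
    totals = {}
    for test, scores in test_scores.items():
        scale = sum(scores.values())
        if scale == 0:
            continue  # this test is uninformative for the triplet
        for seq, value in scores.items():
            totals[seq] = totals.get(seq, 0.0) + weights[test] * value / scale
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {seq: total / grand_total for seq, total in totals.items()}
```

When the individual tests disagree, no sequence receives a consensus score close to one; this is the kind of situation in which the warning described above should prompt a manual check.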
The best way to assess whether a recombinant sequence has
been correctly identified is to compare phylogenetic trees con-
structed from the portion of the alignment between the inferred
breakpoints with those constructed from the remainder of the
alignment. By default RDP4 will automatically construct
UPGMA trees for each of the two sections of the alignment when-
ever a particular recombination event is selected for more detailed
[Fig. 5 color key: “Major” parent; “Minor” parent; Recombinant]
Fig. 5 Using phylogenetic trees to determine which sequence(s) is (are) recombinant. In this example RDP4
has inferred that five sequences (“O,” “I,” “J,” “N,” and “W”) are descended from a common recombinant
ancestor with parental sequences resembling the “Q” and “T” sequences. Note that these five sequences do
not form a monophyletic group in either tree. This indicates either that RDP4 has “over-grouped” the
sequences or that other recombination events elsewhere in these sequences may be obscuring the phyloge-
netic relationships between them (as turns out to be the case in this example)
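The logic of comparing the region between the breakpoints with the remainder of the alignment can be sketched with simple pairwise distances. This is a distance-based stand-in for the tree comparison described above, not a tree-building method; it assumes linear sequences, 0-based half-open breakpoint coordinates, and hypothetical function names.

```python
def p_distance(seq_a, seq_b, sites):
    """Proportion of differing sites, ignoring columns with gaps."""
    compared = differences = 0
    for i in sites:
        x, y = seq_a[i], seq_b[i]
        if x == "-" or y == "-":
            continue
        compared += 1
        differences += x != y
    return differences / compared if compared else float("nan")


def closest_relative(alignment, query, sites):
    """Name and distance of the sequence closest to `query` over `sites`."""
    best = None
    for name, seq in alignment.items():
        if name == query:
            continue
        d = p_distance(alignment[query], seq, sites)
        if d != d:  # NaN: no comparable sites
            continue
        if best is None or d < best[1]:
            best = (name, d)
    return best


def compare_regions(alignment, query, start, end):
    """Closest relatives inside [start, end) and in the rest of the alignment."""
    length = len(alignment[query])
    inside = list(range(start, end))
    outside = [i for i in range(length) if i < start or i >= end]
    return (closest_relative(alignment, query, inside),
            closest_relative(alignment, query, outside))
```

If the closest relative of a sequence changes between the two regions, this is consistent with that sequence (or its closest relative) being recombinant; as with the trees, the comparison alone does not say which of the sequences involved is the recombinant.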
2.8 Evaluating How Well Recombination Signals Have Been Grouped into Recombination Events
In order to accurately retrace the history of recombination events
that are detectable in a group of sequences, it is necessary to
correctly identify sequences that share evidence of the same ancestral
recombination event(s). RDP4 will often mistakenly group
sequences that are the descendants of different ancestral recombinants.
Conversely, RDP4 will seldom mistakenly identify two
sequences carrying evidence of the same recombination event as
carrying evidence of two different recombination events. Although
the descendants of an ancestral recombinant might be expected to
all have nearly identical recombination breakpoint patterns and
cluster together within phylogenetic trees, this is not always the
case. For example, some sequences may contain only partial evi-
dence of a particular recombination event because a second, newer
recombination event overprinted part of the evidence from the
older recombination event (see Note 4).
Besides searching for the synchronized movement of sequence
clusters within phylogenetic trees constructed from different parts
of sequence alignments, another way that sequences with evidence
of the same ancestral recombination events can be identified is by
comparing RECSCAN or RDP plots that are generated with the
same parental sequences but with different potential recombinant
sequences. In the phylogenetic trees that RDP4 displays, sequences
are highlighted in red, pink, and purple whenever they are inferred
to be carrying evidence of a particular recombination event (Fig. 5).
2.10 Saving Analysis Results
This manual verification procedure can become extremely tedious
for large datasets and it is advisable that analysis results be regularly
saved in .rdp format. This can be performed by pressing the “Save”
button on the menu bar at the top of the screen (Fig. 1) and
selecting “.rdp” (RDP project file) as the format in which the data
should be saved. RDP project files can be reloaded in RDP4 so that
the manual verification part of a recombination analysis can be carried
out over multiple sessions.
When an analysis is completed, the final results can also be saved
in a table format by pressing the “Save” button and selecting the
“.csv” format option. This format will produce a tabulated results file
that can be opened in any spreadsheet program (such as Microsoft
Excel). Note, however, that RDP4 will not be able to reload
analysis results from a “.csv” file.
It is also possible to save or copy sequence alignments, trees,
plots, and matrices in various formats by right clicking on these in
their respective program windows and selecting either the “Save
as...” or “Copy” options presented.
3 Examples
3.1 Producing a Preliminary Recombination Hypothesis
Load the example alignment file “PVY Example.fas” (this and all
other example files referred to here can be found in the directory
where you have installed RDP4) and press the “Options” button
(Fig. 1). The example sequences we will be analyzing are linear
virus genomes, so in the “General Recombination Detection
Options” section under the “General settings” tab, change the
“Sequences are circular” setting to “Sequences are linear”
(Fig. 2). Besides this change, we will use the default RDP4 settings
for this example. Press the “OK” button at the bottom of the
options form. Press the “X-Over” button (Fig. 1) and wait for
the automated analysis to complete (it should take approximately
8 min).
3.2 Navigating Through the Results
Press the left mouse button when the mouse pointer is on a background
grayed area of the schematic sequence display (Fig. 1). This
focuses the program on the display. Pressing either the “Pg Up,”
“Pg Dn,” or “space bar” keys on your computer keyboard will
allow you to navigate through the detected recombination events
in an ordered fashion (alternatively you can use the arrow buttons
beneath the schematic sequence display to do the same thing).
Immediately after finishing the automated analysis, pressing the
“Pg Dn” button will take you to the first recombination event
identified by RDP4. Pressing it again will take you to the second
event, and so forth. Pressing the “Pg Up” button will take you to
the previous event. Pressing the space bar will take you to the
recombination event with the best associated p-value.
Starting with the first event (press the “Pg Up” or “Pg Dn”
button until information on “recombination event 1” is displayed
in the recombination information panel) you will see a graph drawn
on the “plot display” (Fig. 1). The exact type of graph that is
3.3 Checking the Accuracy of Breakpoint Identification
It is important that you check the accuracy with which RDP4 has
identified the recombination breakpoint positions. The RDP
method used to detect this recombination signal has a lower degree
of breakpoint inference accuracy than the RECSCAN, MAXCHI,
CHIMAERA, and 3SEQ methods (Fig. 3). To see a MAXCHI
graph for event 1, press the “check using” listbox on the right-
hand side of the plot display (Fig. 1). One of the options listed is to
construct a MAXCHI plot. Select this option and see whether the
peaks on any of the three lines plotted correspond with the left and
right borders of the pink area. Look at graphs for some of the other
methods. The left and right boundaries of the pink area should
match positions in the RDP, RECSCAN, SISCAN, and DIS-
TANCE plots where two of the three plotted lines intersect.
As with the MAXCHI plot, the left and right boundaries of the pink
area should match peaks in at least one of the lines in CHIMAERA,
TOPAL [20], PHYLPRO, and LARD plots. For this recombina-
tion event all of the methods seem to indicate the recombination
breakpoint at position 2250 has been correctly identified.
The actual breakpoint position in this example might, however,
not be as obvious as you think. Note that in the recombination
information display, five different sequences have been identified as
descendants of the same recombinant (you can tell this by looking
at the confirmation table in the recombination information dis-
play). Press the “Trees” button at the top of the screen (Fig. 1).
Five of the sequences in the trees displayed (Fig. 5) are highlighted
in red, pink, or purple. These are all sequences that potentially also
carry evidence of recombination event 1. The sequence in red, “I,”
is currently selected. Move the mouse pointer over “W” and press
the right mouse button. Select the “Go to W” option. This will
center the schematic sequence display on “W.” Move the mouse
pointer over the leftmost colored block representing the recombination
event 1 signal in “W.” Look at the recombination information
display. Note that the “Ending” breakpoint position is
identified here as 2261 and not 2250. This is a small but important
difference. Press the “show relevant sequences” button (Fig. 4) and
use the scroll bar at the bottom of the sequence display to move to
position 2261. The color coding of the nucleotides now corre-
sponds with the colors of the lines in the plot later. You will see
that at position 2258, “W” and “T” share an A nucleotide, and also
that the breakpoint is inferred to lie three nucleotides to the right of
this point in “W” (instead of eight nucleotides to the left of this
point as in “I”). Now go back to the corresponding representation
of recombination event 1 in “I” and left click on it. Look at the
sequence display and you will see that at position 2258 sequence
“I” has a G residue that is shared with sequence “Q.”
Clearly the breakpoint position should be somewhere in the
region between 2250 and 2261, but its precise location is
unknown. Let us, just for the sake of this example, adjust the
breakpoint position to nucleotide 2261. To do this move
the mouse pointer to nucleotide 2261 of the middle sequence in
the sequence display and right click on it. One of the options
offered will be to “Place ending breakpoint here.” Select this
option, so that when you look at the representations of this event
in the schematic sequence display you will see that they all report
the breakpoint position as 2261.
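The reason a breakpoint can often only be placed within a range of sites, rather than at a single position, can be made concrete with a short sketch. It lists the informative sites in a sequence triplet (sites where the two parental sequences differ and the recombinant matches exactly one of them) and reports each pair of neighboring informative sites at which support switches from one parent to the other. The function names and the 0-based coordinates are assumptions made for illustration; this is not how RDP4 itself reports breakpoints.

```python
def informative_sites(recombinant, parent_a, parent_b):
    """Sites where the parents differ and the recombinant matches one of them.

    Returns a list of (position, "a" or "b") tuples with 0-based positions.
    """
    sites = []
    for i, (r, a, b) in enumerate(zip(recombinant, parent_a, parent_b)):
        if "-" in (r, a, b) or a == b:
            continue  # gapped or uninformative column
        if r == a:
            sites.append((i, "a"))
        elif r == b:
            sites.append((i, "b"))
    return sites


def switch_intervals(recombinant, parent_a, parent_b):
    """Pairs (p, q) of neighboring informative sites that support different parents.

    A breakpoint explaining the switch lies somewhere after site p and no
    later than site q; the sites in between carry no information about
    its exact position.
    """
    sites = informative_sites(recombinant, parent_a, parent_b)
    return [(p, q) for (p, s), (q, t) in zip(sites, sites[1:]) if s != t]
```

For the toy triplet `switch_intervals("ACGTACGAAT", "ACGTACGTAC", "TCGAACGAAT")`, the only switch lies between sites 3 and 7, so any position in that stretch is an equally defensible breakpoint.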
3.4 Checking the Accuracy of Recombinant Sequence Identification
Look at the bar graphs in the recombination information display
(Fig. 1). The first set of three bars indicates the consensus “Recombinant
scores” of sequences “I” (0.667), “Q” (0.077), and “T”
(0.256). These scores are the weighted consensus of a series of tests
(each indicated by a set of three bars in the graph) to determine
3.5 Evaluating RDP4’s Grouping of Recombination Events
There is apparently some evidence that recombination event 1 may
have occurred in the common ancestor of five sequences in the
dataset: those sequences currently highlighted in purple/pink/red
in the trees (“I,” “W,” “J,” “N,” and “O”). It is, however,
also apparent that the five identified recombinant sequences neither
all cluster within the phylogenetic trees, nor all move together
between the phylogenetic trees. This fact suggests that RDP4 may
have “over-grouped” these sequences and that they may in fact be
carrying evidence of multiple different independent recombination
events. If you look at these five sequences in the schematic sequence
display you will, however, immediately see the probable reason that
these sequences do not move together between the two phyloge-
netic trees: RDP4 has identified additional recombination events in
all of these sequences other than “I.” In such cases it is not expected
that a group of sequences carrying evidence of the same ancestral
recombination event will all cluster together in both of the trees.
RDP4 provides another tool with which you can check whether
two sequences carry evidence of the same ancestral recombination
event. In either one of the trees right click on sequence “W” and
select the “Recheck the plot with W as the recombinant” option.
This will compare the plots produced using the currently selected
sequence (in this case sequence “I”—the one in red) with that of
sequence “W.” The result of this comparison is displayed
Fig. 6 Comparing recombination signals to determine whether two recombinants are descended from a
common recombinant ancestor. (a) An RDP method plot for recombination event number 1 in the example
dataset. (b) A similar RDP method plot to that in A but with sequence “W” replacing sequence “I” in the
scanned sequence triplet. The colored line above the plot is a graphical representation of how closely the plot
in (b) resembles that in (a). Note that across the recombination breakpoints the two plots are nearly identical
(the blue color in the bar expresses this similarity) implying that “I” and “W” probably both descended from the
same recombinant ancestor. Note also that in (b) the deep red color in the part of the colored line
corresponding to sequence coordinates ~5000 to ~9000 clearly indicates that “W” likely carries evidence
of a second large recombination event that is not shared with “I”
it does in this case) then the pattern of sites shared by the sequences
being compared and their supposed parental sequences are very
similar across the breakpoint(s). Thus it is very likely that the two
sequences being compared carry evidence of the same recombina-
tion event.
For the sake of this example, let us pretend that the very similar
recombination signals that are evident in “I” and “W” are not
derived from the same ancestral recombination event. To exclude
the recombination event detected in “W” from event 1, go to the
side-by-side tree display and right click on “W.” Choose the option
to “Mark W as not having evidence of this event.” If you would like
to reinclude “W” as having evidence of recombination event 1,
then right click on “W” in the tree and select the “Mark W as
having evidence of this event” option.
3.6 Completing the Analysis
Because you have changed the way that RDP4 has interpreted event
1, you need to let the program reformulate its characterization of all
the other recombination events detected. First, however, it is
important to inform RDP4 that you are content with the current
interpretation of recombination event 1. To do this, right click on
the flashing block in the schematic sequence display. Select the
“Accept this event in all five sequences where it is found” option.
You should notice that a series of red borders have been drawn
around the blocks representing the recombination event 1 signals
in sequences “I,” “W,” “J,” “N,” and “O.” Now either click on the
flashing “Re-scan” button beneath the schematic sequence display
or right click anywhere in the schematic sequence display and select
the “Re-Scan and re-identify recombinant sequences for all unac-
cepted events” option.
When RDP4 has finished reanalyzing the remaining recombina-
tion signals press the “Pg Dn” button on your keyboard and you can
start evaluating recombination event 2. Continue until you reach the
end of the analysis. You may notice that the program skips event 2.
The recombination signal corresponding to recombination event
2 has been identified by RDP4 as being attributable to sequence
misalignment. To see recombination event 2 you will need to click
on the “Options” button at the top of the screen, move to the
“General” tab and, in the “Data Processing Options” section, press the
button beside the “list events detected by >1 method” label until
the label reads “list all events.” If you now look at sequence “B” in
the schematic sequence display, you should notice a gray block
labeled “unknown” under the line representing this sequence: this
block is representative of “recombination event 2.”
Recombination event 3 is detected in sequences “J,” “N,” and
“W” but it is clear that the sizes of the recombinationally derived
fragments in these three sequences differ substantially. Whereas all
three of the sequences have similar recombination signals across the
beginning (or 5′) breakpoint (approximately at position 5680),
3.7 Further Analyses
If large numbers of recombination breakpoints have been detected,
you may want to test whether the distributions of these breakpoints
indicate the presence of recombination hot- or cold-spots within
the sequences that have been analyzed. To demonstrate this, open
the file “HIV Example.rdp” (it can be found in the directory where
you have installed RDP4). Press the arrow beside the “X-Over”
button and select the “breakpoint distribution plot” menu option.
After a minute or two the program will display the plot shown in
Fig. 7. The black line in this plot represents the numbers of
recombination breakpoints (individually indicated by vertical lines
above the plot) that fall within 200 nucleotides of the genome
coordinates indicated on the x-axis (in this case corresponding to
nucleotide sites within the first sequence in the analyzed dataset,
A1.KE.94). The gray and white areas, respectively, represent the
95 % and 99 % confidence intervals of the expected degrees of
breakpoint clustering under random recombination. Whereas
genome coordinates at which the black line spikes up above the
white/gray area are statistically supported recombination hot-
spots, those where the black line dips below the white/gray area
are statistically supported recombination cold-spots.
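The reasoning behind the breakpoint distribution plot can be sketched as a simple permutation test: count the breakpoints that fall within 200 nucleotides of each coordinate, then compare those counts with the range seen when the same number of breakpoints is placed at random. The sketch below assumes that "random recombination" means uniformly random breakpoint positions and evaluates each coordinate separately; both are simplifying assumptions made here, and the function names are hypothetical rather than RDP4's own procedure.

```python
import random


def counts_near(breakpoints, positions, radius):
    """Number of breakpoints within `radius` nucleotides of each position."""
    return [sum(1 for b in breakpoints if abs(b - x) <= radius)
            for x in positions]


def hot_and_cold_spots(breakpoints, genome_length, radius=200, step=50,
                       n_perm=200, level=0.99, seed=0):
    """Coordinates whose breakpoint counts fall outside a permutation envelope."""
    rng = random.Random(seed)
    positions = list(range(1, genome_length + 1, step))
    observed = counts_near(breakpoints, positions, radius)
    simulated = []
    for _ in range(n_perm):
        # Assumption: breakpoints placed uniformly at random along the genome.
        shuffled = [rng.randint(1, genome_length) for _ in breakpoints]
        simulated.append(counts_near(shuffled, positions, radius))
    tail = int(round((1 - level) / 2 * n_perm))
    hot, cold = [], []
    for x, obs, column in zip(positions, observed, zip(*simulated)):
        ordered = sorted(column)
        lower, upper = ordered[tail], ordered[n_perm - 1 - tail]
        if obs > upper:
            hot.append(x)       # more clustering than expected by chance
        elif obs < lower:
            cold.append(x)      # fewer breakpoints than expected by chance
    return hot, cold
```

Coordinates returned in `hot` correspond to places where the observed line would rise above the simulated envelope, and those in `cold` to places where it would dip below it.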
If you have a GenBank file on hand that corresponds with one of
the sequences in a dataset that has been analyzed for recombination,
and this file contains information on the locations of gene bound-
aries, then it is also possible to test for associations between recom-
bination breakpoint distributions and genome organization. Once
again, open the file “HIV Example.rdp.” When it is loaded press the
“Open” button again and select the file “HXB2 Genbank File.txt”
(it can be found in the folder where you installed RDP4). This file
simply contains a plain text version of the GenBank file for sequence
“B.FR.83.HXB2” that is accessible using the following URL:
http://www.ncbi.nlm.nih.gov/nuccore/K03455.1. When this file
is loaded into RDP4 you should notice that a series of arrows are
added to the colored similarity map above the sequence display. Press
the arrow beside the “X-Over” button and again select the
Fig. 7 Recombination breakpoint distribution and recombination-induced protein folding disruption plots. (a)
Recombination breakpoint hot- (red arrows) and cold-spots (blue arrows) detectable within HIV-1M genomes
represented in the file, HIV Example.rdp (after [21]). The black plot represents the inferred clustering of
recombination breakpoints identified within the HIV-1M sequences analyzed here. Dark and gray areas,
respectively, represent the 95 % and 99 % bounds of the expected degrees of breakpoint clustering under
random recombination. Vertical lines above the plot indicate the positions of recombination breakpoints. (b)
Expected degrees of folding disruption within chimeric envelope proteins of simulated HIV-1M recombinants.
Whereas the black line represents the mean folding disruptions inferred for the simulated chimeric envelope
proteins, the gray area represents the range of folding disruptions observed in these simulated proteins. The
vertical black lines above the plots represent the locations of recombination breakpoint sites that were
detected in the envelope genes of HIV genomes represented in file “HIV Example.rdp”
Fig. 8 Testing for associations between genome arrangement and breakpoint distributions. Breakpoint
clustering is compared between the genome regions represented in blue and orange. In this example
(found in the file “HIV Example.rdp”) it is clear from the last three rows of the table that HIV-1M
breakpoints tend to cluster far more around the edges of genes (in blue with low associated p-values) than
they do within the central parts of genes (in orange)
4 Notes
name of the sequence in the tree and pressing the left mouse
button. The same sequence is then marked both in the current
tree, and in all of the other trees. It is also possible to clear
markings or automatically color sequence names (so that they
are the same colors as those displayed in the schematic sequence
display on the bottom right panel). This can be accomplished by
selecting the appropriate option on the menu that appears
whenever you press the right mouse button while the mouse
pointer is over one of the tree displays.
References
1. Martin DP, Lemey P, Posada D (2011) Analysing recombination in nucleotide sequences. Mol Ecol Resour 11:943–955
2. Martin DP, Williamson C, Posada D (2005) RDP2: recombination detection and analysis from sequence alignments. Bioinformatics 21:260–262
3. Muhire B, Martin DP, Brown JK et al (2013) A genome-wide pairwise-identity-based proposal for the classification of viruses in the genus Mastrevirus (family Geminiviridae). Arch Virol 158:1411–1424
4. Martin D, Rybicki E (2000) RDP: detection of recombination amongst aligned sequences. Bioinformatics 16:562–563
5. Padidam M, Sawyer S, Fauquet CM (1999) Possible emergence of new geminiviruses by frequent recombination. Virology 265:218–225
6. Smith JM (1992) Analyzing the mosaic structure of genes. J Mol Evol 34:126–129
7. Schultz A-K, Zhang M, Leitner T et al (2006) A jumping profile Hidden Markov Model and applications to recombination sites in HIV and HCV genomes. BMC Bioinformatics 7:265
8. Lole KS, Bollinger RC, Paranjape RS et al (1999) Full-length human immunodeficiency virus type 1 genomes from subtype C-infected seroconverters in India, with evidence of intersubtype recombination. J Virol 73:152–160
9. Posada D, Crandall KA (2001) Evaluation of methods for detecting recombination from DNA sequences: computer simulations. Proc Natl Acad Sci U S A 98:13757–13762
10. Martin DP, Posada D, Crandall KA, Williamson C (2005) A modified bootscan algorithm for automated identification of recombinant sequences and recombination breakpoints. AIDS Res Hum Retroviruses 21:98–102
11. Boni MF, Posada D, Feldman MW (2007) An exact nonparametric method for inferring mosaic structure in sequence triplets. Genetics 176:1035–1047
12. Gibbs MJ, Armstrong JS, Gibbs AJ (2000) Sister-scanning: a Monte Carlo procedure for assessing signals in recombinant sequences. Bioinformatics 16:573–582
13. Holmes EC, Worobey M, Rambaut A (1999) Phylogenetic evidence for recombination in dengue virus. Mol Biol Evol 16:405–409
14. Price MN, Dehal PS, Arkin AP (2010) FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS One 5:e9490
15. Felsenstein J (1989) PHYLIP - Phylogeny Inference Package (version 3.2). Cladistics 5:163–166
16. Guindon S, Gascuel O (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst Biol 52:696–704
17. Stamatakis A (2006) RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models. Bioinformatics 22:2688–2690
18. Ronquist F, Huelsenbeck JP (2003) MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics 19:1572–1574
19. Weiller GF (1998) Phylogenetic profiles: a graphical method for detecting genetic recombinations in homologous sequences. Mol Biol Evol 15:326–335
20. McGuire G, Wright F (2000) TOPAL 2.0: improved detection of mosaic sequences within multiple alignments. Bioinformatics 16:130–134
21. Simon-Loriere E, Galetto R, Hamoudi M et al (2009) Molecular mechanisms of recombination restriction in the envelope gene of the human immunodeficiency virus. PLoS Pathog 5:e1000418
Chapter 18
Abstract
The history of particular genes and that of the species that carry them can be different for a variety of
reasons. In particular, gene trees and species trees can differ due to well-known evolutionary processes such
as gene duplication and loss, lateral gene transfer, or incomplete lineage sorting. Species tree reconstruction
methods have been developed to take this incongruence into account; these can be broadly divided into
supertree and supermatrix approaches. Here we introduce a new Bayesian hierarchical model that we have
recently developed and implemented in the program guenomu. The new model considers multiple sources
of gene tree/species tree disagreement. Guenomu takes as input posterior distributions of unrooted gene
tree topologies for multiple gene families, in order to estimate the posterior distribution of rooted species
tree topologies.
Key words Supertree, Supermatrix, Tree reconciliation, Bayesian phylogenetics, Tree distance,
Species tree, Duplications and losses, Incomplete lineage sorting
Jonathan M. Keith (ed.), Bioinformatics: Volume I: Data, Sequence Analysis, and Evolution, Methods in Molecular Biology,
vol. 1525, DOI 10.1007/978-1-4939-6622-6_18, © Springer Science+Business Media New York 2017
462 Leonardo de Oliveira Martins and David Posada
Fig. 1 Comparison between the supertree and supermatrix approaches. In the top panel we have an example
data set composed of two gene families (homologous sets), already aligned. In this example we have several
sequences from the same species in each gene family, which might represent paralogs within an individual
sample or more than one sampled individual from the species. A supertree approach (middle panel) would
then estimate the gene tree for each alignment independently and then summarize the information from all
resulting gene trees into a single tree. A supermatrix approach, on the other hand, would first concatenate all
gene family alignments into one single alignment, which would then be used in the phylogenetic reconstruction.
Notice how for the supermatrix approach it is essential to find the corresponding sequences for all gene
families, a limitation also imposed by several supertree methods. Therefore, the user must decide which
representative from each species to use, for all gene families. That is, does the sequence sp1_1 from gene
family 1 correspond to sp1_1 or sp1_2 from gene family 2?
gene tree estimation has been devised [22], although it is restricted
to a fixed species tree. A maximum-likelihood approach based
on a simplified version of this birth–death model has recently been
shown to allow for species tree estimation under a hierarchical
model [23]. Recently, we developed a Bayesian model that takes
3.1.1 Bayesian Sampler
Given a collection of gene families, guenomu estimates the
posterior distribution of species trees together with the other
parameters of the model. Using a resampling technique, species
tree inference proceeds through two Bayesian procedures: first,
the gene family alignments are used one at a time to sample the
distributions of gene trees; then these individual gene tree
distributions are incorporated into a second, multigene Bayesian
model. guenomu is responsible for the latter, multigene Bayesian
sampling, which samples species trees respecting the distance
constraints imposed by the multivariate model. The user is therefore
free to use her favorite single-gene phylogenetic inference algorithm
to estimate the gene tree distributions (e.g., MrBayes [25],
PhyloBayes [26], BEAST [27]), which are then given as input
to guenomu.
Even gene trees that did not come from a Bayesian analysis can
be used in guenomu, and a set of compatible species trees will still
be inferred. However, besides the probabilistic interpretation being
lost in such an attempt, care must be taken to preserve the uncer-
tainty in the gene tree estimation. In other words, even bootstrap
replicates should be preferred over point estimates, since guenomu
relies on the gene tree uncertainty to explore the space of species
trees. The primary output of guenomu is a file or a set of files in a
compact binary format that must be interpreted by the same pro-
gram guenomu in order to output human-readable files.
3.1.3 Summary Statistics Coalescent Species Trees
One important auxiliary functionality of the guenomu package is
offered through the program bmc2_maxtree, which estimates the
species tree under the multispecies coalescent using several
distance-based algorithms, namely, GLASS [29], SD [30],
STEAC [31], and MAC [15]. It only needs a file with the species
names (as for guenomu, as we will see later) and the files with the
gene trees, and it will output four species tree estimates, one under
each modification of the main algorithm. These algorithms rely
strongly on the branch lengths of the gene trees (see Note 3), but
if these are absent then the program will assume that they are all
equal to one.
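To make the idea of a distance-based summary concrete, the following minimal sketch pools leaf-to-leaf distances from several gene trees into a species-by-species table. It is not the implementation used by bmc2_maxtree: the function name, the dictionary-based input, and the leaf labels are hypothetical, and the two summaries shown (minimum and pooled mean) are only in the spirit of minimum-based and average-based methods such as GLASS and STEAC. Building the species tree from the resulting table (for example by hierarchical clustering) is left out.

```python
from statistics import mean


def species_distance_table(gene_distances, leaf_to_species, summary=min):
    """Summarize leaf-to-leaf distances from many gene trees per species pair.

    gene_distances: one dict per gene tree, mapping (leaf_a, leaf_b) to a distance.
    leaf_to_species: dict mapping every leaf name to its species name.
    summary: function applied to all pooled values of a species pair
             (min for a minimum-based summary, mean for an average-based one).
    """
    pooled = {}
    for distances in gene_distances:
        for (leaf_a, leaf_b), value in distances.items():
            sp_a, sp_b = leaf_to_species[leaf_a], leaf_to_species[leaf_b]
            if sp_a == sp_b:
                continue  # distances within one species carry no between-species signal
            pair = tuple(sorted((sp_a, sp_b)))
            pooled.setdefault(pair, []).append(value)
    return {pair: summary(values) for pair, values in pooled.items()}


# Hypothetical toy data: two gene families, three species (a, b, c).
leaf_to_species = {"a_1": "a", "a_2": "a", "b_1": "b", "c_1": "c"}
gene1 = {("a_1", "b_1"): 2.0, ("a_1", "c_1"): 6.0, ("b_1", "c_1"): 6.0}
gene2 = {("a_1", "b_1"): 4.0, ("a_1", "c_1"): 5.0, ("b_1", "c_1"): 5.0,
         ("a_1", "a_2"): 1.0}

minimum_table = species_distance_table([gene1, gene2], leaf_to_species, summary=min)
average_table = species_distance_table([gene1, gene2], leaf_to_species, summary=mean)
```

Passing the summary as a function keeps the pooling step identical across variants, which mirrors the text's description of several modifications of one main algorithm.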
3.1.4 Pairwise Distances Between Trees
Another useful auxiliary program is bmc2_tree, which calculates all
pairwise distances between a set of gene trees and a set of species
trees. Given two tree files, it will create a file named “pairwise.txt”
with a table consisting of the gene tree and species tree indices,
followed by the list of distances between them.
3.1.5 Simulation of Tree Inference Error
The guenomu program can benefit from the uncertainty in the
estimation of individual gene trees, since it allows the model to
consider alternative trees that, however unlikely in isolation, can
provide a better fit when analyzed in conjunction with other gene
families. For each given input gene tree with branch lengths, the
auxiliary program bmc2_addTreeNoise tries to generate a collection
of trees similar to it, based on the premise that shorter branch
lengths are more likely to be wrong. In other words, it will
transform each tree from an input file into a distribution of trees,
increasing the space of possibilities. It also works on trees without
branch lengths and can be used to preprocess the gene tree files
estimated by the user before the Bayesian analysis conducted by
guenomu.
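The premise that shorter branches are more likely to be wrong can be illustrated with a small weighted-sampling sketch. This is only an illustration of that premise and not the algorithm used by bmc2_addTreeNoise, which is not described here: the function name, the inverse-length weighting, and the `floor` parameter are all assumptions. The sketch only chooses which branches to rearrange; the rearrangement itself is omitted.

```python
import random


def pick_branches_to_perturb(branch_lengths, n_picks, seed=0, floor=1e-6):
    """Sample branch identifiers, favoring short branches.

    branch_lengths: dict mapping a branch identifier to its length.
    The weight of each branch is 1 / length (an assumed weighting), with
    lengths below `floor` raised to `floor` to avoid division by zero.
    """
    rng = random.Random(seed)
    names = list(branch_lengths)
    weights = [1.0 / max(branch_lengths[name], floor) for name in names]
    return rng.choices(names, weights=weights, k=n_picks)


# Hypothetical branch lengths: the short branch should be chosen far more often.
picks = pick_branches_to_perturb({"short": 0.001, "long": 1.0}, n_picks=1000)
n_short = picks.count("short")
```

For trees without branch lengths, one simple option under the same assumptions would be to give every branch the same length, so that all branches are equally likely to be chosen.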
4 Input Files
5 Running guenomu
You can run guenomu by including all settings in a control file
(and run it like “guenomu run.ctrl”), or by setting the parameters
as command line arguments. You can also include the default
parameters in the control file, and then override some of these
parameters on the command line (which takes precedence). The control
file can contain a series of arguments and accepts comments
inside square brackets, which are then ignored by the program.
Absent parameters are replaced by default values.
The only exceptions are the collection of gene tree files and a list
with the names of all species, which must be given by the user. Many
programs require a gene-wise mapping between each gene leaf and
the species it represents, but guenomu only assumes that the species
name can be found within the gene leaf name and does the mapping
automatically. For instance, if one has for a given gene family, three
sequences “a1,” “b1,” and “x” all from Escherichia coli (they might
be paralogous sequences, or individuals from the same locus, or a
combination of both), then it is enough to rename the sequences to
something like “ecoli_a1,” “ecoli_b1,” and “ecoli_x” while includ-
ing “ecoli” into the list of species and remembering that in this case
all gene families must use the name “ecoli” to refer to this species. It
is allowed for a given species not to be found in some gene family,
Fig. 2 Excerpt of an example control file for guenomu showing the list of gene tree files and the list of species
names that should be found within the leaves of the gene trees
Fig. 3 Control file options for guenomu that describe the file names containing lists of gene trees and of
species names
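The automatic mapping between gene leaves and species described in this section can be sketched as a simple substring search. The function below is a hypothetical illustration and not guenomu's code; in particular, trying longer species names first to resolve names that contain one another is an assumption made here, and the leaf and species names are the illustrative ones from the text plus one invented species.

```python
def map_leaves_to_species(leaf_names, species_names):
    """Assign each gene leaf to the species whose name occurs within the leaf name."""
    # Assumption: longer names are tried first, so "ecoli_k12" would win over "ecoli".
    ordered = sorted(species_names, key=len, reverse=True)
    mapping = {}
    for leaf in leaf_names:
        match = next((species for species in ordered if species in leaf), None)
        if match is None:
            raise ValueError(f"no species name found in leaf {leaf!r}")
        mapping[leaf] = match
    return mapping


# Leaves renamed as suggested in the text; "senterica" is a made-up second species.
leaves = ["ecoli_a1", "ecoli_b1", "ecoli_x", "senterica_1"]
mapping = map_leaves_to_species(leaves, ["ecoli", "senterica"])
```

Several leaves may map to the same species, which is how paralogs or multiple sampled individuals are accommodated without a separate mapping file.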
5.1 Optimization by Simulated Annealing
If the number of iterations for the main sampling stage of the
Bayesian sampling is set to zero, then the program will behave
differently: it will assume that we are interested in the simulated
annealing results. That is, we are using guenomu as an optimization
tool, and it will store the chain status at the final iteration of each
cycle. The simulated annealing will sample from a ‘modified’
distribution, which is the Bayesian posterior distribution raised
to a power (the inverse of its temperature, in a thermodynamic
interpretation). In the simulated annealing step, several cycles are
performed serially where each cycle is composed of many iterations
with an initial and a final temperature. That is, within each cycle the
temperature changes across iterations, and the chain state at the end
of one cycle is used as the initial state for the next one.
The simulated annealing step was devised to control the initial
state of the chain for the posterior sampling. That is, even within
the Bayesian sampling one can benefit from the simulated annealing:
if one wants to start the chain at a truly random initial state, it
is enough to set both the initial and final (inverse) temperatures to
a value below one, which will allow the chain to explore freely even
regions of very low probability. This is a safe choice for convergence
checks. On the other hand, if one is interested in starting the sampling
at a good initial estimate of trees, then it might be worth setting the
final (inverse) temperature to a high value, such that only moves
increasing the posterior probability are accepted. Increasing the
number of cycles allows for the chain to escape local optima.
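The cycle structure described above can be sketched on a toy problem. The code below is a minimal illustration under stated assumptions, not guenomu's sampler: the linear schedule for the exponent within a cycle, the function names, and the one-dimensional integer target are all hypothetical. The exponent `beta` plays the role of the inverse temperature, so that the chain targets the posterior raised to the power `beta`.

```python
import math
import random


def inverse_temperature(step, n_steps, beta_start, beta_end):
    """Exponent applied to the posterior at a given step (assumed linear schedule)."""
    if n_steps <= 1:
        return beta_end
    return beta_start + (beta_end - beta_start) * step / (n_steps - 1)


def annealed_accept(log_post_new, log_post_old, beta, rng):
    """Metropolis rule for the posterior raised to the power beta."""
    log_ratio = beta * (log_post_new - log_post_old)
    return log_ratio >= 0 or rng.random() < math.exp(log_ratio)


def anneal(log_post, propose, state, n_cycles, n_steps, beta_start, beta_end, seed=0):
    """Run several cycles; each cycle starts where the previous one ended."""
    rng = random.Random(seed)
    end_states = []
    for _ in range(n_cycles):
        for step in range(n_steps):
            beta = inverse_temperature(step, n_steps, beta_start, beta_end)
            candidate = propose(state, rng)
            if annealed_accept(log_post(candidate), log_post(state), beta, rng):
                state = candidate
        end_states.append(state)  # chain status at the final iteration of the cycle
    return end_states


# Hypothetical toy target on the integers, with its mode at 3.
def toy_log_post(x):
    return -float((x - 3) ** 2)


def toy_propose(x, rng):
    return x + rng.choice((-1, 1))


end_states = anneal(toy_log_post, toy_propose, state=-10, n_cycles=3,
                    n_steps=200, beta_start=0.1, beta_end=50.0)
```

With an exponent below one the acceptance rule tolerates moves to much less probable states, whereas a very large exponent rejects essentially every move that lowers the posterior, which corresponds to the two regimes discussed in the text.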
If one is interested in using guenomu as an optimization tool,
then we suggest that several cycles of optimization are employed to
avoid local optima, and that the initial and final (inverse) tempera-
tures should span a large interval. The output given by guenomu
(after running it with the option “-z 1,” for instance) will have the
optimal estimated species tree at the end of each cycle. We might
then look at the most frequent or best species tree overall, remembering
that the optimal values are the set of gene and species trees
that minimize the overall distances among them. In this
scenario, guenomu’s results are therefore a generalization of the
SPR supertree approach [35], the mulRF supertree [32], or the
GTP methods [17], and can be directly compared by choosing only
the appropriate distances.
5.2 Output
The main sampler will store its output with all trees and parameters
in a binary, compact format that cannot be used by other programs.
This file is named “job0.checkpoint.bin,” and if you are running
the parallel version then each job will generate its own file, for
example, “job1.checkpoint.bin,” “job2.checkpoint.bin,” etc.
Such files must be interpreted by another run of guenomu (with
option “-z 1”), which will generate a single output file with all
numeric parameters as well as the set of posterior species and gene
family trees.
5.2.1 Posterior Sample of Numeric Parameters
The output file with the discrete and continuous numeric parameters
from all posterior samples is called “params.txt.” One example
file is shown in Fig. 4. It is a tab-formatted file with a header row
Fig. 4 Example of file params.txt with posterior samples of parameters from the MCMC chain. The first seven columns are the global parameters, while the
remaining columns represent the parameters per gene family. For this example only duplications and losses were used for two gene families
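As a rough illustration of how such a file might be summarized, the sketch below tallies how often each species tree index appears across the posterior samples. It assumes that “params.txt” is tab-delimited with one header row and that it contains the “sptree” column mentioned in this chapter; the function name, the “iter” column, and the toy rows are hypothetical and do not reproduce Fig. 4.

```python
import csv
import io
from collections import Counter


def species_tree_frequencies(handle, column="sptree"):
    """Return (tree index, posterior frequency) pairs, most frequent first."""
    reader = csv.DictReader(handle, delimiter="\t")
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    return [(tree_id, n / total) for tree_id, n in counts.most_common()]


# Hypothetical file contents; a real run would use open("params.txt") instead.
toy_file = io.StringIO("iter\tsptree\n0\t3\n1\t3\n2\t7\n3\t3\n")
frequencies = species_tree_frequencies(toy_file)
```

The first pair returned corresponds to the most frequently sampled species tree index, which can then be looked up in the tree output files.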
Genome-wide Species Tree Estimation
5.2.2 Output Trees
Besides the output file with numeric parameters, guenomu also
outputs the resampled gene tree files as well as the posterior
distribution of species trees in several formats. The resampled gene
trees will contain the same trees as in the input tree files, but with
their posterior frequencies, that is, taking into account information
from other gene families. These files will have the same name as the
input, but with the prefix “post” and in the trprobs format, such
that, for example, an input gene tree file named “gene001.tre” will
give rise to a posterior file called “post.gene001.trprobs.”
The posterior species tree distribution is output to three files:
“species.tre,” “species.trprobs,” and “unrooted.trprobs.” The
model in guenomu works with rooted species trees, and therefore
the files “species.tre” and “species.trprobs” both contain
rooted species trees. The difference between these two files is the
format, where “species.tre” is in standard nexus format and con-
tains all sampled species trees in the order in which they appeared in
the MCMC chain. The file “species.trprobs,” however, is in the
compact nexus format where only distinct trees are represented,
together with their posterior frequencies. The trees inside this file
will be named “tree_0”, “tree_1,” etc., where the number is the
same as in the “sptree” column of the numeric parameters file, and
is used to identify them. This number also appears as comments
inside the “species.tre” file, to allow for the mapping of trees
between both files. Please be careful since most phylogenetic analy-
sis software cannot handle the trprobs format properly, and there-
fore the standard nexus file should be used to estimate the
consensus tree or in other analyses.
If one is interested in obtaining a consensus tree from the
“species.tre” file, then we suggest a few options: (1) the “sumt”
option of MrBayes [25], (2) the “consensus()” function of the ape
library for R [38], (3) the “consensus” method of the dendropy
module for Python [39]. We do, however, suggest always checking
the posterior frequencies directly in the trprobs files as well, to get
an idea of the sharpness of the distributions. In Fig. 5 we show
an example of the file “species.tre” that can be compared to the file
“species.trprobs” from Fig. 6.
Notice how in the file “species.trprobs” the trees are ordered by
frequency; for instance, tree number 3 is the so-called maximum a
posteriori (MAP) tree, with a frequency of 17.5 %. Notice also that
some of the trees (e.g., tree_2, tree_3, tree_7, and tree_12) differ
only in the root location.
The file “unrooted.trprobs” also contains the posterior distri-
bution of species trees in the compact trprobs format, but this time
neglecting the information about the root location. That is, it
Fig. 5 Example of “species.tre” file output by guenomu. In this file we have one
tree topology per MCMC iteration, where identical trees can be identified by the
descriptor “idx” (which is a comment according to the nexus format and therefore
does not interfere with other programs)
Fig. 6 Example of guenomu’s “species.trprobs” output file. Unlike the “species.tre” file (see Fig. 5), distinct
topologies appear only once, and furthermore are ordered according to their posterior frequencies (described
within comments)
Fig. 7 Example of “unrooted.trprobs” output file from the guenomu software. This file has the same
information as “species.trprobs” (Fig. 6) but where the root location of each topology was neglected—
therefore joining several trees that differ only in the root location. Notice how the trees are still represented as
rooted (deepest node is a dichotomy), just to ease reading and visualization
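The relationship between the rooted and unrooted summaries can be sketched as follows: rooted topologies that induce the same set of nontrivial leaf bipartitions share one unrooted topology, so their posterior frequencies can be added. This is a minimal sketch and not guenomu's implementation; the nested-tuple tree representation, the function names, and the toy frequencies are assumptions made for illustration.

```python
from collections import defaultdict


def leaf_set(tree):
    """All leaf names below a node; a leaf is a string, an internal node a tuple."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaf_set(child) for child in tree))


def unrooted_splits(tree):
    """Nontrivial bipartitions of a rooted tree, ignoring where the root sits."""
    all_leaves = leaf_set(tree)
    anchor = min(all_leaves)  # each split is stored as the side without this leaf
    splits = set()

    def visit(node):
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(visit(child) for child in node))
        side = all_leaves - below if anchor in below else below
        if 1 < len(side) < len(all_leaves) - 1:
            splits.add(side)
        return below

    visit(tree)
    return frozenset(splits)


def unrooted_frequencies(rooted_frequencies):
    """Add up the frequencies of rooted trees that share an unrooted topology."""
    totals = defaultdict(float)
    for tree, frequency in rooted_frequencies:
        totals[unrooted_splits(tree)] += frequency
    return dict(totals)


# Hypothetical posterior over rooted trees of four species.
rooted = [
    (((("A", "B"), "C"), "D"), 0.4),
    ((("A", "B"), ("C", "D")), 0.3),
    (("A", ("B", ("C", "D"))), 0.2),
    ((("A", "C"), ("B", "D")), 0.1),
]
totals = unrooted_frequencies(rooted)
```

In this toy example the first three rooted trees differ only in the root location and therefore collapse into a single unrooted topology.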
6 Conclusion
7 Notes
References
1. Rannala B, Yang Z (2008) Phylogenetic inference using whole genomes. Annu Rev Genomics Hum Genet 9:217–231
2. Woese CR (1987) Bacterial evolution. Microbiol Rev 51:221–271
3. Brown JR, Doolittle WF (1997) Archaea and the prokaryote-to-eukaryote transition. Microbiol Mol Biol Rev 61:456–502
4. Fitz-Gibbon ST, House CH (1999) Whole genome-based phylogenetic analysis of free-living microorganisms. Nucleic Acids Res 27:4218–4222
5. Snel B, Bork P, Huynen MA (1999) Genome phylogeny based on gene content. Nat Genet 21:108–110
6. Fukami-Kobayashi K, Minezaki Y, Tateno Y, Nishikawa K (2007) A tree of life based on protein domain organizations. Mol Biol Evol 24:1181–1189
7. Grishin NV, Wolf YI, Koonin EV (2000) From complete genomes to measures of substitution rate variability within and between proteins. Genome Res 10:991–1000
8. Clarke GDP, Beiko RG, Ragan MA, Charlebois RL (2002) Inferring genome trees by using a filter to eliminate phylogenetically discordant sequences and a distance matrix based on mean normalized BLASTP scores. J Bacteriol 184:2072–2080
9. Housworth EA, Postlethwait J (2002) Measures of synteny conservation between species pairs. Genetics 162:441–448
10. Lin Y, Moret BME (2008) Estimating true evolutionary distances under the DCJ model. Bioinformatics 24:i114–i122
11. Gordon A (1986) Consensus supertrees: the synthesis of rooted trees containing overlapping sets of labeled leaves. J Classif 3:335–348
12. Ragan MA (1992) Phylogenetic inference based on matrix representation of trees. Mol Phylogenet Evol 1:53–58
13. Kluge AG (1989) A concern for evidence and a phylogenetic hypothesis of relationships among Epicrates (Boidae, Serpentes). Syst Zool 38:7–25
14. de Queiroz A, Gatesy J (2007) The supermatrix approach to systematics. Trends Ecol Evol 22:34–41
15. Helmkamp LJ, Jewett EM, Rosenberg NA (2012) Improvements to a class of distance matrix methods for inferring species trees from gene trees. J Comput Biol 19:632–649
16. Slowinski J, Page RDM (1999) How should species trees be inferred from molecular sequence data? Syst Biol 48:814–825
17. Chaudhary R, Bansal MS, Wehe A, Fernández-Baca D, Eulenstein O (2010) iGTP: a software package for large-scale gene tree parsimony analysis. BMC Bioinformatics 11:574
18. Felsenstein J (1981) Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol 17:368–376
19. Chaudhary R, Boussau B, Burleigh JG, Fernandez-Baca D (2014) Assessing approaches for inferring species trees from multi-copy genes. Syst Biol 64:325–339
20. Heled J, Drummond AJ (2010) Bayesian inference of species trees from multilocus data. Mol Biol Evol 27:570–580
21. Liu L, Pearl DK (2007) Species trees from gene trees: reconstructing Bayesian posterior distributions of a species phylogeny using estimated gene tree distributions. Syst Biol 56:504–514
22. Akerborg O, Sennblad B, Arvestad L, Lagergren J (2009) Simultaneous Bayesian gene tree reconstruction and reconciliation analysis. Proc Natl Acad Sci U S A 106:5714–5719
23. Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V (2013) Genome-scale coestimation of species and gene trees. Genome Res 23:323–330
24. De Oliveira Martins L, Mallo D, Posada D (2014) A Bayesian supertree model for
480 B IOINFORMATICS: VOLUME I: DATA, SEQUENCE ANALYSIS,
Index
AND EVOLUTION
BLOSUM .......................... 168, 182, 184, 195–197, 206 ChIP. See Chromatin immunoprecipitation (ChIP)
Boa constrictor .............................................................. 397 ChIP-seq..............................................103, 109, 115–117
Bonferroni correction ................................. 391, 395, 398 Chromatin .................................. 4, 8, 103, 109–110, 284
Bootstrap ................................................... 283, 361–364, Chromatin immunoprecipitation (ChIP) .................... 81,
369, 370, 372, 381, 384, 404, 408, 411–414, 104, 109, 237
427, 430, 465, 467 Chromatin interaction analysis by paired-end tag
Bowtie................................................................... 255, 257 sequencing (ChIA-PET) ................................... 110
Bragg’s Law ...............................................................49, 59 CINEMA .............................................................. 186–187
Branch and bound ........................................................ 356 Clade .................................. 337–340, 361, 362, 452, 459
Branch length ............................................ 323, 334, 338, Clade credibility ............................................................ 361
351, 355, 365, 368, 373, 426–428, 464–467 Clade stability ................................................................ 361
Branch models..............................................336–339, 345 CLANN ................................................................ 423, 428
Branch-site models.......................................336, 339–341 CLAP ............................................................................. 142
BRCA1 ................................................................. 113–118 Classification......................................137–161, 172, 238,
Breakpoints................................................ 433, 434, 436, 273, 274, 297–299, 350
438–444, 446–448, 450–451, 453–459 Cleavage...............................................8, 12, 18, 249, 279
Browser Extensible Data (BED) ....................... 111, 112, ClinVar ........................................................................... 110
114, 118, 225, 235, 254, 311 Clone ........................................................... 7, 11, 41, 83,
Burn-in .....................................298, 304, 311, 312, 367, 101–103, 237, 239, 246
369, 426, 468, 469, 472 Cloudogram ......................................................... 361, 371
Burrows-Wheeler Alignment tool (BWA) .......... 255, 257 ClueGO ......................................................................... 128
BWA-MEM ................................................................... 255 CLUSS........................................................................... 142
Clustal Omega.....................................173, 180–182, 425
C ClustalW ..............................................177, 180, 181, 186
Clustering .................................140–142, 145, 146, 150,
Caenorhabditis elegans ......................................... 226, 284
CAGE. See Cap Analysis Gene Expression (CAGE) 154, 156, 157, 161, 355, 357, 400, 424–425,
Cancer......................... 27, 110, 111, 115, 116, 130, 227 455–457
Clusters of Orthologous groups (COG) ..................... 220
Cancer genomes ............................................................ 227
Cap Analysis Gene Expression (CAGE) ....................104, Coalescent ...........................................462, 465, 476, 477
108, 109, 112 Coda............................................................................... 472
CATH ................................ 138, 146, 149, 151–157, 159 CODEML ................................317, 320, 323, 325–327,
329, 330, 332, 333, 335, 338, 340–345
CATHEDRAL ............................................ 151, 155, 156
CATH-Gene3D .........................144, 150, 154, 155, 159 Coding sequence (CDs) ........................... 110, 272–275,
CCD camera. See Charged Coupled Device 277, 315–317, 344
Codon .......................................209, 272, 273, 275, 278,
(CCD) camera
CCT. See CGView Comparison Tool (CCT) 280, 285, 287, 315–320, 322–329, 332, 338,
CDD. See Conserved Domain Database (CDD) 342, 345, 360, 372, 373, 386, 394–396,
404–406, 422
CD-HIT ........................................................................ 142
cDNA. See Complementary DNA (cDNA) Codon bias ........................................................... 272, 329
CDS. See Coding sequence (CDs) Codon substitution model ........................................... 344
Collaborative computational Project n 4 interactive
CGView Comparison Tool (CCT) ............................218,
219, 221 (CCP4i) .................................................. 57, 61, 69
Changept .............................................................. 297–313 Common ancestor..................................... 139, 148, 153,
ChangeptGUI ...................................................... 295–313 329, 330, 339, 350–353, 361, 379, 385, 389,
452, 462, 466
CHAOS ......................................................................... 211
Charged Coupled Device (CCD) camera.................... 16, Community genomics .................................................. 277
50, 73 Complementary DNA (cDNA)......................... 100, 103,
104, 108, 271, 280–282
ChEBI. See Chemical Entities of Biological Interest
(ChEBI) Compositionally heterogeneous..................386, 399–401
Cheliona mydas.............................................................. 397 Compositionally homogeneous .......................... 386, 395
Compositional signal .................................................... 387
Chemical Entities of Biological Interest (ChEBI) ...... 125
ChIA-PET. See Chromatin interaction analysis by paired- Computer aided drug discovery (CADD)..................... 27
end tag sequencing (ChIA-PET) Conditional random fields (CRFs) .............................. 280
CHIMAERA ..................... 436–439, 442, 443, 450, 451 Consed.......................................................................40–42
BIOINFORMATICS
Index 481
Consensus........................143, 145, 151, 158, 174, 178, Differentially expressed........................................ 124, 128
281, 285, 354, 366, 427, 444, 446, 451, 473 Diffraction .......................................49–51, 57, 59–61, 72
Consensus Coding Sequence (CCDS) ............... 229, 235 Dijkstra’s algorithm ........................................................ 40
Consensus sequence............................35, 36, 40, 92, 186 Directed acyclic graph (DAG).................... 127, 155, 156
Conservation .................................... 110, 115, 126, 168, Dirichlet distribution .................................................... 298
185, 186, 191, 196, 237, 272, 275, 279, 300, Distance methods................................................. 355, 357
303, 304, 306, 308–310, 313, 425 DNA Databank of Japan (DDBJ) ......................... 79, 80,
Conservative substitutions..................192, 195, 205, 206 98, 99, 203, 233, 243
Conserved Domain Database (CDD) .......................145, DNase I hypersensitive sites (DHSs) ........................... 110
156, 208 DNase I hypersensitivity ...................................... 103, 237
Constraint.................................143, 149, 323, 335, 342, dN/dS ratio................................................. 320, 332, 338
343, 404, 465 Dotlet.................................................................... 215, 216
Content sensor ..................................................... 279, 283 Dot-matrix................................................... 193, 194, 257
Contig........................................................ 4, 7, 35, 36, 40 Dotter ................................................................... 215, 216
Contig construction........................................................ 36 DREG ................................................................... 247, 250
Convergence......................................170, 175, 303–305, Drosophila melanogaster................................................ 102
313, 366–368, 373, 426, 470, 472 Drug targets .................................................................. 108
Coordinates refinement .................................................. 70 DSSP ..................................................................... 173, 174
COOT .......................................................................57, 70 Dynamic Light Scattering............................................... 48
CORA ............................................................................ 157 Dynamic programming............................. 142, 151, 152,
CORE ............................................................................ 403 168–169, 175–180, 182
Covarion signal.............................................................. 387 algorithm .............................. 35, 140, 168, 169, 181,
CpG ...................................................................... 104, 109 193, 194
CpG islands .......................................................... 230, 239 Dynamic range ................................................... 51, 73, 74
Critica ................................................................... 274, 286
Cross-vectors .............................................................53, 54 E
Crystallographic R factor (Rcryst) .............................55, 68 EasyGene .............................................................. 274, 286
Crystal planes .................................................................. 49
EBI. See European Bioinformatics Institute (EBI)
Cycle reversible termination (CRT)............................... 17 ECOD. See Evolutionary Classification of protein
Cyclic-array sequencing ..................................... 15, 16, 19 Domains (ECOD)
Cyclin-dependent kinase............................................... 238 EcoRI............................................................................. 249
Cytogenetics .................................................................. 245
eCRAIG ................................................................ 281, 285
Cytoscape.............................................................. 128, 133 EGASP. See ENCODE Genome Annotation Assessment
Project (EGASP)
D
Electron density ............................. 51, 54, 56, 70–74, 77
Dali....................................................... 150–152, 154–156 EMBOSS. See European Molecular Biology Open
Dali Dictionary..................................................... 149, 157 Software Suite (EMBOSS)
Danio rerio ........................................................... 261, 397 ENCODE Genome Annotation Assessment Project
Data decisiveness ........................................................... 361 (EGASP) ............................................................ 284
Data management ......................................................... 124 Encyclopedia of DNA Elements (ENCODE)...........111,
DAVID .......................................................................... 128 115, 227, 234, 235, 237, 284
dbGaP ........................................................ 81, 90, 94, 101 Enhancer........................................................................ 108
DDBJ. See DNA Databank of Japan (DDBJ) Entrez ............... 82, 84, 88, 91–97, 99, 100, 150, 156, 258, 259
De Bruijn graph .............................................................. 36 Environmental genomics .............................................. 277
Decay index ................................................................... 361 Environmental sample ............... 28, 84–85, 91, 203, 277
Decision theory (DT) .......................................... 409, 410 Environmental sequence samples..................85, 277–278
Degenerate sites ............................................................ 324 Epigenetic marker ................................................ 109, 295
Dendrogram .................................................................. 477 Epigenetics ...............................9, 27, 28, 109–111, 115,
De novo ...........................4, 5, 8, 10, 280, 283, 284, 286 227, 235, 244
assembly ............................................................ 17, 281 Epigenomics .................................................................... 26
Dialign ........................................................................... 183 Equivalogs ..................................................................... 148
Dialysis ............................................................................. 57 Escherichia coli ........................................... 15, 80, 95, 467
Dictyostelium ................................................................. 232 ESTs. See Expressed sequence tags (ESTs)
DICV .................................................................... 305–308 Euchromatic genome.................................................... 229
Eukaryote .......................... 139, 226, 279–285, 406, 423 Four-fold degenerate .................................................... 324
Eukaryotic .................................... 26, 83, 100, 107, 109, Free energy ...................................................................... 48
127, 147, 271, 272, 279, 280, 282–285, 297, 421 French and Wilson method ......................................61, 64
European Bioinformatics Institute (EBI) ..................130, FSA................................................................................. 425
144, 156, 232 FUGUE ......................................................................... 177
European Molecular Biology Open Software Suite Functional Annotation of the Mammalian Genome
(EMBOSS)...............................230, 247, 250, 257 (FANTOM) ....................................................... 111
European Nucleotide Archive (ENA) .......... 79, 80, 99, 233, 243 Functional element discovery.............................. 295, 299
E-value. See Expectation value (E-value) FunFHMMer ................................................................ 154
Evolutionary Classification of protein Domains FunTree ......................................................................... 157
(ECOD) ........................................... 150, 155, 156
EVolutionary Ensembles of REcurrent SegmenTs G
(EVEREST) ..................................... 141, 142, 145 Galaxy ..................................................228, 255–257, 260
Evolutionary pattern ....................................380–382, 414 Gap extension.............................................. 183, 192, 216
Evolutionary process................................. 321, 324, 350,
Gap opening ................................................ 183, 192, 216
352, 353, 355, 380–385, 387–389, 392, 401, Gap penalty .......... 182, 183, 192, 198–201, 204, 206, 209, 218
405, 408, 409, 411, 412, 414, 422 GARLI ......................................................... 359, 360, 372
Evolutionary trees ......................................................... 354
Gblocks ................................................................. 423, 425
Ewald sphere .............................................................50, 73 Gbrowse......................................................................... 112
Exon..........................................108, 110, 115, 233, 237, GC-rich........................................................ 272, 285, 286
239, 256, 279–285 GeMMA......................................................................... 154
Exonerate....................................................................... 281
GenBank ......................................... 79, 80, 82–102, 203,
ExPASy .......................................................................... 345 217, 228, 233, 235–238, 240, 241, 243, 244,
Expectation-maximization (EM) ................................. 179 271, 455
Expectation value (E-value)............................... 156, 158,
GENCODE.............. 227, 229, 235–237, 244, 260, 284
171, 173, 178, 204, 216, 218, 220, 222, 247, 276 GENECONV ...............................................435, 437–439
Expressed sequence tags (ESTs) ......................... 96, 100, Gene conversion............................................................ 352
103, 108, 141, 203, 237
Gene duplication .................................................. 192, 382
Expression .................................... 51, 81, 126, 143, 174, Gene expression ..................................8, 10, 81, 85, 104,
186, 237, 245, 249, 272, 321–323 108, 235, 242, 244, 249
External profile alignment (EPA)................................. 182 Gene Expression Omnibus (GEO) ........................80–81,
90, 95, 101
F
Gene finding......................................................... 272–287
False negative ................................................................ 282 Gene flow .................................................... 352, 370, 371
False positive..................................... 158, 171, 282, 287, Gene functions .............................................................. 160
362, 435, 440, 448 GeneID ........................................................ 232, 262, 263
FASTA ......................................40–42, 92, 96, 101, 102, Gene loss........................................................................ 176
140, 171, 174, 201, 202, 205, 208–209, 216, Genemark.hmm ........................................ 273, 274, 278,
218, 221, 222, 235, 238, 240, 241, 247, 248, 280, 283
254, 256, 425, 434 Gene ontology (GO) ................................ 123, 125–128,
Fast Fourier Transformation (FFT) ...................... 53, 178 133, 148, 155
FastTree2 ....................................................................... 445 Gene prediction......................................... 229, 232, 273,
FASTX ........................................................................... 209 278, 279, 281–284, 286–287, 423
FATCAT. See Flexible structural AlignmenT by Chaining General feature format (GFF) ............................. 111, 118
Aligned fragment pairs allowing Twists (FATCAT) General time-reversible (GTR) model ................ 384, 402
Figure of merit ................................................................ 76 Genescan........................................................................ 229
Find Individual Motif Occurrences (FIMO).............247, Gene set enrichment analysis........................................ 128
250 Genetic algorithm ................................................ 359, 372
Flexible structural AlignmenT by Chaining Aligned Genetic code...................................................99, 272, 325
fragment pairs allowing Twists (FATCAT).....151, Genetic drift ......................................................... 110, 351
152, 156 Gene transfer format (GTF)................................ 111, 118
FlowerPower ................................................................. 142 Gene tree ........................... 428, 462–469, 472, 473, 476
Fluorescence resonance energy transfer (FRET) .......... 15 Gene tree parsimony reconciliation ............................. 462
FlyBase .................................................................. 126, 232 GeneWise....................................................................... 281
GenInfo identifier (GI).......................233, 238, 240, 462 Hidden Markov models (HMMs) .................... 143–148,
Genome annotation ............................83, 107–118, 137, 155, 159, 168, 181, 182, 184, 273, 274, 277,
159, 235, 237, 285 280, 281, 296, 297
Genome browser ....................................... 112–118, 228, Hierarchical clustering ......................................... 161, 355
230–251, 253, 254, 256, 258–260, 262, 312 Hierarchical model........................................................ 127
Genome reference consortium (GRC) ............... 229, 260 Hill-climbing ........................................................ 358, 369
Genome segmentation.................................................. 295 Histone ........................................................ 109, 110, 237
Genomes OnLine Database (GOLD) ................ 138, 226 Historical signal........................................... 386, 387, 399
Genome Survey Sequence (GSS) ................................. 203 HIV ...........................................324, 326, 327, 329–331,
Genome three dimensional .......................................... 157 338, 455–458
Genomic Data Submission policy .................................. 90 Hive-hexagon ................................................................ 255
Genomic sequence .................................... 80, 100, 112, HKL2000 ........................................................... 56, 60, 61
137, 203, 226, 227, 229, 230, 237, 241, 250, HMMER ....................................................................... 144
254, 271–274, 279–281, 283, 286 HMMs. See Hidden Markov models (HMMs)
Genotype ...................................................................27, 28 Hmmscan ...................................................................... 144
Genscan ......................................................................... 281 Hmmsearch .................................................................. 144
GEO. See Gene Expression Omnibus (GEO) HOGENOM ................................................................. 147
ggsearch ......................................................................... 209 Hominids....................................................................... 227
GIGA ............................................................................. 148 Homogeneity ....................................................... 38, 353,
GISMO ........................................................ 274, 276, 286 383, 388–393, 397
GLAM2Scan.................................................................. 250 Homogeneous........................................... 380, 383, 384,
GLASS ......................................................... 465, 476, 477 387–388, 390–398, 400, 402, 403, 408, 412–414
Global alignment....................................... 171, 176, 177, Homologous structure alignment database
183, 184, 197, 198, 200–202, 211, 213, 214 (HOMSTRAD) ........................................ 150, 157
Globally homogeneous .......................383, 391, 392, 395 Homologue ..............................137, 143, 144, 146, 148,
Go.db............................................................................. 127 149, 154, 157, 159, 171
GOLD. See Genomes OnLine Database (GOLD) Homology ...........................................27, 137, 147, 153,
Goniometer ...............................................................56, 59 168, 170, 171, 173, 191–192, 246, 350, 351
GOstat ........................................................................... 128 Homoplasy ...................................................................... 27
G-quadruplex ................................................................ 249 Homo sapiens......................................130, 252, 397, 398
GS-Finder ...................................................................... 287 HOMSTRAD. See Homologous structure alignment
GTR. See General time-reversible (GTR) database (HOMSTRAD)
Guenomu.............................................................. 461–477 Horizontal genetic transfer ................................. 240, 273
Guide tree .................................169, 170, 175, 177–179, Human.............................3, 4, 8, 13, 27, 28, 61, 81, 83,
181, 182, 185 85, 89, 110, 115, 130, 146, 154, 209, 210, 212,
218, 226–229, 237, 244, 245, 250, 251, 284,
H 299, 301, 302, 306–310, 312, 315, 422
Human and Vertebrate Analysis and Annotation
Haemophilus influenzae ............................ 4, 11, 226, 421
HAL-HAS ..................................................................... 413 (HAVANA) ...................................... 229, 233, 235
Haliotis rubra ................................................................ 397 Human genome project.............................. 27, 226, 227,
232, 260
HAMAP........................................................144, 147–148
Hanging drop............................................................56–58 Hybridization ................................... 15, 18, 22, 104, 352
Haplotype ........................................................................ 27 Hydrophilic ................................................................... 139
HapMap......................................................................... 111 Hydrophobic ..................... 139, 150, 154, 184, 186, 187
Hydrophobicity ........................................... 184, 186, 187
Hashing .......................................................................... 35
HCH dehydrochlorinase .............................................. 208
I
Heat map .............................................................. 397–399
Heuristics........................... 169, 171, 186, 296, 345, 356 Identifiers....................................83, 86, 87, 91, 95, 101,
approach ........................................193, 201, 363, 372 216, 226–228, 232, 242–245, 253
HGNC ........................................................................... 232 iGraph ............................................................................ 131
HHblits.......................................................................... 144 Image plate ...................................................................... 50
HHsearch ............................................................. 144, 181 IMPALE ........................................................................ 435
RefSeq.............................. 79, 83, 84, 86, 89–90, 95–97, SCOPe ......................................................... 150, 155, 156
229, 233, 235–238, 241, 243, 251–253, 262, 263 Scoring matrix .................................. 175, 194, 196, 198,
Reganor ................................................................ 274, 286 201, 206, 207, 250
RegEx. See Regular expression (RegEx) SD ................................................................ 265, 476, 477
Region of interest (ROI) ................................... 230–234, SDT................................................................................ 435
236, 237, 239–241, 260 SEA ................................................................................ 151
Regular expression (RegEx) .............................. 143, 174, SeaView.......................................................................... 186
249, 250 Secondary structure .......................................13, 71, 140,
Regulation ............................................. 9, 108–109, 115, 149–154, 158, 167, 173, 174, 249
227, 234, 237, 244 prediction........................................................ 167, 174
Regulatory Sequence Analysis Tools Secondary Structure Matching (SSM)......................... 156
(RSAT) ...................................................... 247, 250 Segmentation ....................................................... 295–313
Repeats.......................................... 4, 5, 8, 11, 21, 24, 30, Self-organizing map ............................................. 273, 274
146, 158, 172, 179, 183, 184, 193, 210, 213, Self-vectors ...................................................................... 53
230, 237, 366 Sensitivity.......................................... 142, 144, 176, 247,
Resampling ........................ 358, 359, 361, 363, 372, 465 279, 283, 285, 286, 345, 426
Resequencing ......................................................... 8, 9, 27 Seq-Gen ......................................................................... 412
RESTful ................................................................ 259, 260 Sequence(ing)
Restriction endonuclease .............................................. 249 alignment ............................................. 108, 110, 111,
Reverse transcription............................................ 104, 281 142–144, 146, 167–187, 194, 198, 201–202,
Reversibility .......................................................... 353, 383 206, 207, 257, 281, 295, 299, 300, 302, 303,
Reversible.......................................... 9, 17, 18, 353, 380, 308, 311, 313, 350, 352, 355, 425, 434, 435,
383–385, 387–403, 408, 412–414 441, 446, 449
Review......................88–90, 92–94, 255, 280, 294, 296, by cleavage................................................................. 12
315, 317, 321, 344, 382, 400, 423, 428, 462 database .....................................65, 79–92, 140–144,
Rfam...................................................................... 235, 242 156, 170, 171, 173, 202, 243, 274
Ribosome..................................................... 273, 276, 277 homology........................................................ 137, 147
Rice ................................................................................ 232 by hybridization ........................................................ 15
RMSD............................................................................ 157 identity.............................................52, 65, 138, 144,
RNAcentral.................................................................... 242 148, 149, 153, 154, 182, 193, 195, 196
RNA-Seq ........................... 103, 108, 109, 280–283, 285 by ligation ............................................................16, 21
Robinson-Foulds distance ............................................ 476 profile ....................................................................... 146
Root ................................... 70, 144, 151, 176, 322, 323, similarity ........................................ 65, 140–143, 148,
353, 362, 402, 466, 473–475 153, 157, 171, 197, 217, 220, 274
Rooted tree..........................................353, 361, 402, 446 by synthesis...............................................9, 12, 15, 17
Rotating anode generator............................................... 49 Sequence Alignment/Map format (SAM) ......... 111, 113
Rotation function......................................................53, 68 Sequenced Tagged Sites (STSs) ......................7, 100, 239
R Project for Statistical Computing.................... 127, 130 Sequence Read Archive (SRA) ........................ 79–82, 85,
RPS-BLAST.......................................................... 206, 208 90–91, 93–98, 100, 101, 103, 105
rRNA ........................................26, 85, 87, 242, 277, 360 Sequin ...............................................................90, 92, 101
RSAT. See Regulatory Sequence Analysis Tools (RSAT) SeqVis ............................................................................ 394
Shimodaira–Hasegawa (SH) test......................... 362, 363
S Short Oligonucleotide Alignment Program
SABmark ........................................................................ 187 (SOAP) .............................................................. 255
Short read assembly ....................................................9, 36
Saccharomyces ...................................................... 126, 262
Saccharomyces cerevisiae................................................. 226 Short tandem repeat (STR) ............................................ 27
SAM. See Sequence Alignment/Map format (SAM) Signaling pathway ................................................ 130–132
Signal sensor ......................................................... 279, 280
SAMtools ....................................................................... 111
Sanger sequencing................................... 4, 11, 13–14, 18 Signal-to-noise ratio............................50, 51, 61, 72, 296
SAP method .................................................................. 177 Silkworm........................................................................ 232
Simulated annealing .............................. 69, 465, 469–470
Savant Browser .............................................................. 242
SBML. See Systems Biology Markup Language (SBML) Single-cell sequencing................................................... 277
Scaffold ........................................... 4, 42–44, 83, 97, 100 Single linkage ....................................................... 161, 425
SCOP ................................................... 150–152, 154–157 Single molecule sequencing (SMS)................................ 21
Single nucleotide polymorphism (SNP) ...................... 27, Substitutions...............................21, 139, 143, 146, 157,
42, 43, 81, 115, 213, 230, 237, 284 179, 182, 185, 192–196, 204–206, 318, 319,
SISCAN ...............................................436–439, 442, 450 330, 338, 344, 350–355, 360, 366, 368,
Site model.................................................... 275, 333, 336 380–382, 386–413, 426, 427, 429, 430
Sitting drop ...............................................................57, 58 Subtree............................... 357, 358, 361, 362, 428, 430
Sliding window............................................ 273, 286, 294 Subtree pruning and regrafting.................................... 357
Smith and Waterman algorithm ..................197, 201–202 Suffix tree.............................................................. 213, 214
SMRT sequencing ........................................................... 24 Superfamilies ............................................. 144, 148, 150,
SNAP .................................................................... 280, 283 153–155, 157–159, 218, 232
SNP. See Single nucleotide polymorphism (SNP) SUPERFAMILY .................................................. 144, 159
SOAP. See Short Oligonucleotide Alignment Program Superfold ....................................................................... 154
(SOAP) Supermatrix .......................................................... 462, 463
Solexa ....................................6, 16–18, 21, 36, 37, 40–44 Super-saturated solution...........................................47, 48
Space groups....................................................... 48, 60, 64 Supertree .................. 423, 428–430, 462, 463, 470, 476
Species tree ...................................................379, 461–477 Supervised............................................................. 273, 283
Specificity .......................................... 155, 249, 278, 279, Superword array ........................................................37–39
283, 285, 286 Support vector machine....................................... 273, 276
Spectral analysis ............................................................. 352 SWISS-PROT...............................................141, 144–147
Spectral signal................................................................ 361 Symmetry ........................ 48, 54, 61, 62, 64, 75,
SpectroNet..................................................................... 371 76, 388–396, 398, 399
Sphingobium indicum .......................................... 208, 218 Synchrotron...........................................49, 56, 59, 61, 72
Splice junction ............................................................... 108 Synonymous codons ............................................ 272, 329
Spliceosome ................................................................... 279 Synonymous mutation........................316–318, 320, 324
Splice site .............................................279, 280, 283, 287 Synteny ................................................................. 215, 216
Splicing ........................................................ 108, 243, 271 Systems biology.................................................... 108, 128
Splits graph .................................................. 352, 361, 371 Systems Biology Markup Language (SBML) .............. 128
SplitsTree ....................................................................... 371
SPRSupertrees ...................................................... 428, 429 T
SRA. See Sequence Read Archive (SRA) Table browser ............................................ 113, 115, 118,
SSAP ....................................................138, 151, 155, 157 128, 236, 253, 254, 256
ssearch ............................................................................ 209 Taverna ........................................................ 228, 257, 260
SSG. See Structurally similar groups (SSGs)
tbl2asn .................................................................... 92, 101
Stampy ........................................................................... 255 tblastn .......................................................... 203, 247, 248
Star decomposition .............................................. 357, 358 tblastx........................................................... 203, 247, 248
Start codon .................................................. 273, 275, 316
TBR. See Tree bisection and reconnection (TBR)
Stationarity ........................................................... 353, 383 T-Coffee ..................................... 173, 176–181, 183, 185
Stationary.......... 380, 383, 387–403, 408, 412–414, 472 Telomeric ....................................................................... 249
Stationary condition .......................................... 387, 390,
100000 Genomes Project............................................. 426
391, 395–397 Tertiary structure ........................................ 147, 177, 183
STEAC......................................................... 465, 476, 477 Tetrahedral plot.................................................... 393–395
Stepwise addition ........................................ 357, 359, 372
Tetrahymena .................................................................. 232
Stereochemically Restrained Least Squares Refinement ..... 54 TFASTX......................................................................... 209
Stop codon ............... 273, 275, 280, 316, 323, 328, 345 TFBS. See Transcription factor binding sites (TFBSs)
STRAP ........................................................................... 186 The Cancer Genome Atlas (TCGA) ............................ 111
STRUCTAL .................................................................. 152
Thermosynechococcus elongatus ..................................71, 73
Structural domain ............. 149, 150, 155–157, 159, 405 Threading ...................................................................... 177
Structurally similar groups (SSGs) ............................... 157 3D-Coffee............................................................. 173, 177
Structure
3SEQ .......................................... 436–438, 442, 443, 450
comparison ....................................138, 149, 151, 155 1000 Genomes Project ........................................ 111, 227
factors........................................ 52, 55, 56, 61, 69, 76 TICO ............................................................................. 287
STS. See Sequenced Tagged Sites (STSs)
TIGRFAMs .........................................144, 145, 147, 148
Subcellular localisation.................................................. 147 Time-homogeneous.................................... 383, 384, 402
Subfamilies............................................................ 148, 172 TMHMM ...................................................................... 184
Subjective Bayes ................................................... 368, 369 TOPAL .......................................................................... 451