Published online April 28, 2005

Nucleic Acids Research, 2005, Vol. 33, No. 8 2433–2439


doi:10.1093/nar/gki541

A benchmark of multiple sequence alignment programs upon structural RNAs

Paul P. Gardner*, Andreas Wilm¹ and Stefan Washietl²

Department of Evolutionary Biology, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen Ø, Denmark, ¹Institut für Physikalische Biologie, Heinrich-Heine-Universität Düsseldorf, Universitätsstrasse 1, D-40225 Düsseldorf, Germany and ²Institut für Theoretische Chemie, Universität Wien, Währingerstrasse 17, A-1090 Wien, Austria

Downloaded from https://academic.oup.com/nar/article/33/8/2433/2401467 by guest on 04 June 2024


Received March 1, 2005; Revised and Accepted April 12, 2005

ABSTRACT

To date, few attempts have been made to benchmark the alignment algorithms upon nucleic acid sequences. Frequently, sophisticated PAM or BLOSUM-like models are used to align proteins, yet equivalents are not considered for nucleic acids; instead, rather ad hoc models are generally favoured. Here, we systematically test the performance of existing alignment algorithms on structural RNAs. This work was aimed at achieving the following goals: (i) to determine conditions where it is appropriate to apply common sequence alignment methods to the structural RNA alignment problem. This indicates where and when researchers should consider augmenting the alignment process with auxiliary information, such as secondary structure and (ii) to determine which sequence alignment algorithms perform well under the broadest range of conditions. We find that sequence alignment alone, using the current algorithms, is generally inappropriate <50–60% sequence identity. Second, we note that the probabilistic method ProAlign and the aging Clustal algorithms generally outperform other sequence-based algorithms, under the broadest range of applications.

INTRODUCTION

Motivation

The use of multiple sequence alignments is an essential step for many RNA sequence analysis methods [e.g. RNA structure analysis (1–5), RNA homology search (6,7), non-coding RNA (ncRNA) detection (8,9) and RNA-based phylogenetic inference (10,11)]. Structural alignment of RNA is, however, an open problem. An algorithm for simultaneous structural RNA sequence alignment, structure prediction and phylogenetic reconstruction has been proposed (12), yet current implementations are limited in terms of functionality and sequence size (13–18). A second structural RNA alignment approach employs (predicted) structures and aligns these directly (19–21). Again these structures are generally limited in terms of sequence size, but primarily suffer from the inaccuracy of single sequence structure prediction (22–24) [although, pruning low-probability base pairs yields modest improvement (25,26)]. In addition, when structure is not conserved algorithms which attempt to include RNA structural information are likely to fail [e.g. methylation-guide snoRNAs, Air-RNA]. Many of these methods are also impractical when sequence length is large or when single sequence structure prediction (27,28) performs poorly (24).

The performance of current sequence alignment methods has been thoroughly analysed in terms of protein alignment accuracy (29–32). Benchmarking has also been performed for simulated non-coding DNA (33). However, the results of these studies do not explore methods for the specific problem of aligning structural RNAs. In this work, we extend these studies and test the performance of current alignment algorithms upon structural RNA datasets.

The aims of this work are 2-fold. First, to identify the ‘twilight zone’ of RNA sequence alignment, the homology range below which sequence alignment alone is unlikely to produce reliable results and researchers should seriously consider augmenting the alignment process with auxiliary information such as secondary structure. Second, to identify algorithms capable of reliably aligning structural RNA sequences under a range of sequence identities.

Alignment algorithms

The simplest form of an alignment is the pairwise sequence alignment. This can be performed by aligning sequences globally (34) or locally (35), both employ dynamic programming, thus resulting in a quadratic time complexity. Different scoring schemes may be used, which already produce varying results.
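The quadratic cost of pairwise dynamic programming can be seen in a minimal sketch of a global (Needleman–Wunsch style) alignment score. The match, mismatch and gap values below are illustrative assumptions, not the scoring schemes of any program benchmarked here, and a linear gap cost is used for brevity.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best global alignment score of sequences a and b (linear gap cost).

    The scoring values are hypothetical defaults chosen for illustration.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    # Two nested loops over the sequence lengths: quadratic time and space.
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diagonal, score[i - 1][j] + gap, score[i][j - 1] + gap)
    return score[-1][-1]
```

Changing the three scoring parameters changes which alignment is optimal, which is one way to see why different scoring schemes already produce varying results at the pairwise level.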

*To whom correspondence should be addressed. Tel: +45 3532 1338; Fax: +45 3532 1300; Email: [email protected]

© The Author 2005. Published by Oxford University Press. All rights reserved.

The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access
version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press
are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but
only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]
The alignment of multiple sequences is far more complex,
as the mathematically optimal solution imposes exponential
complexity. Therefore, heuristics are used, which do not
guarantee an optimal solution, but perform multiple sequence
alignment in reasonable time.
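To give a feel for this growth, the sketch below counts the work needed if the pairwise dynamic programme is generalised directly to several sequences. This is the textbook generalisation, assumed here for illustration and not a procedure described in this paper: the table has one cell per combination of prefix lengths, and each cell looks back at every non-empty subset of sequences that could contribute a residue to the last column. The function name is hypothetical.

```python
def exact_msa_work(seq_len, n_seqs):
    """Return (table cells, predecessor moves per cell) for exact multiple
    alignment by dynamic programming over n_seqs sequences of equal length."""
    cells = (seq_len + 1) ** n_seqs     # one cell per tuple of prefix lengths
    moves_per_cell = 2 ** n_seqs - 1    # non-empty subsets of sequences advancing
    return cells, moves_per_cell


if __name__ == "__main__":
    for n_seqs in (2, 3, 5):
        cells, moves = exact_msa_work(100, n_seqs)
        print(f"{n_seqs} sequences of length 100: {cells} cells, {moves} moves per cell")
```

For two sequences this reduces to the familiar quadratic table; each additional sequence multiplies the table size by roughly the sequence length.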
One common approach is called progressive alignment (36), which builds a multiple alignment from pairwise alignments. The idea is that an alignment of sequences, which have more recently diverged, is more likely to be reliable. Thus, high-scoring pairwise alignments are aligned first and next closely related sequences (or alignments of sequences) are added progressively. The order of this progressive alignment is, in most cases, defined by a guide tree, which is created beforehand from a distance matrix, produced by aligning all $n(n-1)/2$ possible pairs of sequences first. The basic drawback of this method is the fact that gaps introduced in an early step cannot be removed during the later addition of sequences, i.e. errors made in an early step propagate during the alignment process (‘Once a gap always a gap’). The so-called iterative methods prevent this by realigning sequences or sequence groups in the multiple alignment, thus, theoretically optimizing the alignment until successive iterations fail to improve the alignment or reach a predefined limit (convergence).

Another class of algorithms is called consistency-based. Here, a multiple alignment is constructed by extracting (maximum-scoring) pairwise alignments from a library such that these combined pairwise solutions are not contradictory or mutually exclusive.

Probabilistic methods are an increasingly popular way of generating solutions to biological problems. The basic premise is to produce a model that one believes best describes the system behaviour. Model parameters are subsequently estimated from reliable data. In terms of sequence alignment, a pairwise hidden Markov model-based approach has been proposed (37), and now implemented and extended to multiple sequence alignment (38,39).

The structural RNA alignment approach of Sankoff (12) merges the recursions of Smith–Waterman (35) type sequence alignment and Nussinov et al. (maximal base-pairing) (40) or Zuker and Stiegler (energy-based) (41) RNA structure prediction (15). The basic idea is to implicitly include base-pairing interactions into the alignment procedure such that homologous base pairs are aligned correctly. Unfortunately, the algorithm is computationally expensive [$O(n^{3m})$ in time, and $O(n^{2m})$ in space, where n is the sequence length and m is the number of sequences]. Current implementations, Dynalign (13,14), Foldalign (16), PMcomp (15) and Stemloc (17,18), are restricted implementations of the Sankoff algorithm, which impose practical limits on the size or the shape of substructures. In addition, sensible score routines such as thermodynamics (13,14), a combination of sequence and thermodynamic scores (16) and partition-function derived probability matrices (15), can be used to score alignments.

Figure 1 shows a classification of all the programs used in this study into the above categories.

Figure 1. An overview of alignment programs used in this work. Programs were classified into the categories described in detail in Alignment algorithms section.

MATERIALS AND METHODS

Accuracy measures

In order to evaluate alignment methods on structural RNAs, we use two independent measures. First is the traditional sum-of-pairs score (SPS) employed in many previous alignment benchmarks (29,32,33). SPS is defined as the fraction out of all possible character pairs that are aligned in both the predicted and the reference alignments. Perfectly predicted (concordant) alignments receive an SPS of one, absolutely imperfectly predicted (discordant) alignments receive an SPS of zero, although for nucleotide alignments an SPS of zero is rarely observed. Essentially, SPS provides a measure of the sensitivity of the prediction. The second measure, dubbed the structure conservation index (SCI), provides a measure of the conserved secondary structure information contained within the alignment (9). It is a derivative of the score calculated by the RNAalifold consensus folding algorithm (5,42) which is based upon the sum of a thermodynamic and a covariance term and, in contrast to the SPS, is independent from a reference alignment. The SCI is close to zero if RNAalifold identifies no common RNA structure in the alignment, whereas a set of perfectly conserved structures has an SCI ≈ 1. An SCI >1 indicates that there is a conserved RNA secondary structure which is, in addition, supported by compensatory and/or consistent mutations preserving the common structure. We note that the SCI scores alignment accuracy only in terms of the secondary structure information. For example, if the helices of a secondary structure are accurately aligned, it does not affect the SCI whether the loop regions are well aligned in terms of sequence similarity. The SCI specifically points out the structural aspect of alignment accuracy and, therefore, appears to be a useful measure in addition to the SPS.

Alignment programs

We tested 11 sequence alignment programs and 4 structural alignment programs (see Supplementary Material Table 1). Where previous alignment benchmarks have only considered the default or ‘out of the box’ behaviour we try testing a range of parameter combinations for each alignment method.
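The sum-of-pairs score described under Accuracy measures can be sketched as follows. This is a minimal illustration rather than the scoring code used for the benchmark: it assumes that both alignments are given as lists of gapped strings with the sequences in the same order, that `-` marks a gap, and that the score is the fraction of residue pairs aligned in the reference that are also aligned in the prediction. The function names are hypothetical.

```python
from itertools import combinations


def aligned_pairs(alignment, gap="-"):
    """Return the set of residue pairs that share a column.

    Each residue is identified as (sequence index, ungapped position), so the
    same pair can be recognised in two different alignments.
    """
    counters = [0] * len(alignment)          # next ungapped position per sequence
    pairs = set()
    for column in zip(*alignment):
        residues = []
        for seq_idx, char in enumerate(column):
            if char != gap:
                residues.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        pairs.update(combinations(residues, 2))
    return pairs


def sum_of_pairs_score(predicted, reference):
    """Fraction of reference residue pairs recovered by the predicted alignment."""
    reference_pairs = aligned_pairs(reference)
    if not reference_pairs:
        return 0.0
    return len(reference_pairs & aligned_pairs(predicted)) / len(reference_pairs)
```

For example, with the reference `["AC-G", "ACUG"]` and the prediction `["A-CG", "ACUG"]`, two of the three reference pairs are recovered, so the sketch returns 2/3. Because only pairs present in the reference are counted, the measure behaves like a sensitivity, as noted above.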
Algorithm options are summarized in the Supplementary Material Tables 2 and 3.

Test datasets

We generated four diverse structural RNA datasets of Group II introns, 5S rRNA, tRNA and U5 spliceosomal RNA. The sequences and the reference alignments for calculating the SPS were obtained from the Rfam v5.0 database. SRP from the SRPDB database was included in the original dataset but was later discarded due to poor comparability between predicted and structural alignments contained in this dataset (the results suggest that a fraction of the SRP reference sequences have been misaligned). Using the same procedure as described previously (42), we generated 100 sub-alignments for each family. The alignments contained five sequences each and encompassed a range of sequence identities. This large dataset (hereafter referred to as dataset 1) was divided into high (>75% sequence identity, 73 alignments), medium (<75 and >55% sequence identity, 73 alignments) and low (<55% sequence identity, 242 alignments) sequence homology groups. An additional tRNA dataset was generated with just two sequences to each alignment (hereafter referred to as dataset 2). This was used to contrast pairwise structural alignment methods and sequence alignment methods.

Caveat: tools improve

We note here that our data reflect the state of the art in early 2005. Most of the tools tested are relatively recent, and many are still under development. Hence, not all the observations below will remain reproducible. In fact, we hope this study helps to obtain better results in the future.

RESULTS

Dataset 1: applicability of pure sequence alignment to ncRNA sequences

All the 11 sequence alignment algorithms were tested upon dataset 1. The results and the relative algorithm ranks for each homology group are summarized in Table 1. We experimented

Table 1. The mean SCI score and mean SPSs computed for dataset 1 (see text for details)

Algorithm             High homology           Medium homology         Low homology
                      (75% < seq. id)         (55% < seq. id < 75%)   (seq. id < 55%)
                      SCI     SPS     Rank    SCI     SPS     Rank    SCI     SPS     Rank

Structural            0.9789  1.0000     0    0.9297  1.0000     0    0.7846  1.0000     0

Align-m (1)           0.9827  0.9600     5    0.8453  0.8825    22    0.4957  0.6748    25
Align-m (2)           0.9827  0.9600     6    0.8453  0.8825    21    0.4957  0.6748    24
Align-m (3)           0.9778  0.9593     7    0.8040  0.8742    26    0.4691  0.6635    29
Align-m (4)           0.9778  0.9593     8    0.8040  0.8742    25    0.4691  0.6635    28
Align-m (5)           0.8995  0.9419    30    0.7597  0.8583    30    0.4777  0.6460    30
Clustal               0.9438  0.9741     9    0.9100  0.9194     2    0.6064  0.7423     8
Clustal (qt)          0.9315  0.9743    12    0.8996  0.9123     9    0.6076  0.7345     9
DIALIGN               0.9071  0.9577    27    0.8018  0.8712    27    0.4979  0.6659    26
DIALIGN (o)           0.9077  0.9601    26    0.8568  0.8860    20    0.5202  0.6721    23
DIALIGN (it)          0.8590  0.9491    34    0.7519  0.8556    31    0.4669  0.6546    32
DIALIGN (it,o)        0.8492  0.9486    35    0.7092  0.8456    33    0.4467  0.6467    33
Handel                0.9604  0.9560    11    0.8570  0.8954    19    0.5360  0.7283    19
MAFFT (fftns)         0.8321  0.9145    36    0.5864  0.8030    37    0.3538  0.6448    37
MAFFT (fftnsi)        0.8840  0.9427    32    0.6655  0.8378    36    0.3845  0.6634    36
MAFFT (nwns)          0.9297  0.9502    25    0.6712  0.8349    35    0.3941  0.6724    35
MAFFT (nwnsi)         0.9330  0.9526    22    0.7004  0.8451    34    0.4071  0.6812    34
MUSCLE                0.9222  0.9684    19    0.8988  0.9181     5    0.6065  0.7668     2
MUSCLE (nj)           0.9268  0.9707    16    0.8841  0.9110    17    0.5902  0.7503    12
MUSCLE (mi32)         0.9222  0.9683    21    0.8959  0.9167     8    0.6069  0.7666     1
MUSCLE (mi32,mt6)     0.9222  0.9683    20    0.8959  0.9167     7    0.6068  0.7664     3
MUSCLE (nj,mt6)       0.9268  0.9707    15    0.8841  0.9110    16    0.5902  0.7503    11
MUSCLE (nj,mi32)      0.9268  0.9708    14    0.8855  0.9112    14    0.5897  0.7501    14
MUSCLE (nj,mi32,mt6)  0.9268  0.9708    13    0.8855  0.9112    13    0.5898  0.7501    13
PCMA                  1.0030  0.9635     3    0.9196  0.9059     3    0.5339  0.6890    20
PCMA (agi20)          1.0030  0.9635     2    0.9255  0.9068     1    0.5621  0.7058    16
PCMA (agi60)          1.0030  0.9635     1    0.8938  0.8941    18    0.5270  0.6827    21
POA                   0.8478  0.9644    33    0.7666  0.8739    29    0.4656  0.6740    27
POA (g)               0.9253  0.9722    17    0.8836  0.9130    15    0.5581  0.7543    15
POA (p)               0.8668  0.9656    31    0.7814  0.8814    28    0.5079  0.6964    22
POA (gp)              0.9444  0.9726    10    0.8929  0.9188    10    0.5843  0.7726     6
ProAlign (bw400)      0.9978  0.9631     4    0.9163  0.9072     4    0.6045  0.7490     5
Prrn                  0.9364  0.9458    24    0.9036  0.9064    11    0.5903  0.7549    10
Prrn (S10)            0.9371  0.9458    23    0.8997  0.9086    12    0.5964  0.7596     4
T-Coffee              0.8867  0.9656    29    0.8129  0.8989    24    0.5322  0.7337    18
T-Coffee (c)          0.9201  0.9733    18    0.8956  0.9194     6    0.5972  0.7543     7
T-Coffee (f)          0.8867  0.9656    28    0.8129  0.8989    23    0.5322  0.7337    17
T-Coffee (s)          0.7892  0.9536    37    0.7151  0.8637    32    0.4447  0.6934    31

Dataset 1 has been divided into three homology groups: the high-homology group (75% < seq. id), the medium-homology group (55% < seq. id < 75%) and the
low-homology group (seq. id < 55%). Rankings are computed from the product of SCI and SPS. The top 10 ranks are highlighted in boldface. Abbreviations of the
parameter switches used to produce these results are shown in parentheses. Further details of these can be found in Supplementary Material.
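The ranking rule stated in the footnote (ordering by the product of mean SCI and mean SPS) can be sketched as below. The four rows are copied from the high-homology columns of Table 1; the function name is hypothetical, and the ranks printed are relative to this small subset only, not the published ranks.

```python
def rank_by_product(scores):
    """Rank methods by mean SCI x mean SPS, highest product first.

    scores maps a method name to a (mean_sci, mean_sps) tuple.
    Returns a list of (rank, name, product) tuples, with rank starting at 1.
    """
    ordered = sorted(scores.items(), key=lambda item: item[1][0] * item[1][1], reverse=True)
    return [(rank, name, sci * sps) for rank, (name, (sci, sps)) in enumerate(ordered, start=1)]


# High-homology values for four methods, taken from Table 1.
high_homology = {
    "Clustal": (0.9438, 0.9741),
    "MAFFT (fftns)": (0.8321, 0.9145),
    "ProAlign (bw400)": (0.9978, 0.9631),
    "T-Coffee (s)": (0.7892, 0.9536),
}

if __name__ == "__main__":
    for rank, name, product in rank_by_product(high_homology):
        print(f"{rank}  {name:<18} {product:.4f}")
```

Several rows of Table 1 show identical rounded SCI and SPS values yet different ranks (for example the three PCMA settings in the high-homology group), so the published ranking was presumably computed from unrounded means; a sketch working from the rounded table values cannot resolve such ties.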
with a variety of algorithm parameters. The results using default and any parameter combinations that increased algorithm performance are summarized in Figure 2. In order to measure relative algorithm performances, a ranking for each algorithm (and parameter setting) was calculated within each of the three homology ranges. The rank is based upon the product of mean SCI score and mean SPSs.

For the high-similarity datasets (sequence identity >75%), there is little difference in accuracy across most of the algorithms considered here (see Table 1). Align-m and Handel rank well on this dataset, yet the relative performance of both these methods dropped rapidly with decreasing sequence homology. Interestingly, for Align-m, this is the opposite to what has been observed for protein-based results (43). PCMA ranked well on both the high- and medium-similarity datasets; however, the relative performance of this method dropped on the low-similarity dataset.

ClustalW, MUSCLE, PCMA, POA [with both global and progressive modes, hereafter referred to as POA (gp)], ProAlign, Prrn and T-Coffee (when ClustalW is used to generate a library of pairwise alignments) perform comparatively well across all homology ranges, with little significant variation between these methods. Only ClustalW, ProAlign and POA (gp) consistently ranked in the top 10 across all the datasets.

There is some redundancy in ranking algorithms over all the combinations of algorithm parameters we use here, obviously some combinations produce similar (or even identical) results. However, this has had little impact on our conclusions.

The results suggest that 60% sequence identity is a crude threshold, whereby the structural content of predicted sequence alignments diverges from reference structural alignments (see Figures 2 and 3).

Dataset 2: comparison of structural and sequence methods

Now, we contrast the relative performance of the comparatively good sequence-based methods identified in the previous section with structural alignment methods using a smaller tRNA dataset. The structure-based methods are generally computationally more intensive than the sequence-based methods, hence the small size (in terms of the number of sequences and the sequence length) of this dataset.

We use dataset 2 to compare the relative performances of structure-based methods (e.g. Dynalign, Foldalign, PMcomp and Stemloc) to the ‘better’ sequence-based methods identified in the previous section (e.g. ClustalW, MUSCLE, PCMA, POA (gp), ProAlign and Prrn). We observe a dramatic divergence in relative performances below 60% sequence identity between the structure- and the sequence-based methods (see Figure 3).

The structural methods Dynalign, Foldalign and PMcomp show high conservation of structural information (SCI) across all homology ranges. However, the SPS of Dynalign and PMcomp are significantly lower than that of Foldalign. The difference is mostly marked in the high to medium homology range. This is possibly because the current versions of Dynalign and PMcomp optimize scores solely based on secondary structure information and hence are likely to produce rather different alignments to those used in the test dataset, where sequence information is also included. Stemloc performs comparatively well in terms of SCI for sequence identities >40%. In terms of SPS, however, Stemloc behaves much like the purely sequence-based methods. Given the apparent sophistication of this method and large computational resources required to run the algorithm (17,18) these results are rather

Figure 2. Both measures of structural RNA alignment correctness, SCI (A) and SPS (B), are plotted as functions of the mean pairwise sequence identity
(calculated using the reference alignments). The curves are fit to dataset 1 (see text for details) using lowess (local weighted regression) smoothing. At most,
two curves are plotted for each alignment package—one corresponding to the default parameters, the other corresponds to the best parameter combination we
could identify.
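The x-axis quantity used in these plots, mean pairwise sequence identity computed from a reference alignment, can be sketched as below together with the homology grouping used for dataset 1. The paper does not spell out the identity formula, so the denominator chosen here (all columns in which at least one of the two sequences has a residue) is an assumption, as is the treatment of values falling exactly on the 55% and 75% boundaries; the function names are hypothetical.

```python
from itertools import combinations


def pairwise_identity(row_a, row_b, gap="-"):
    """Percent identity of two aligned rows, ignoring columns where both are gaps."""
    columns = [(x, y) for x, y in zip(row_a, row_b) if not (x == gap and y == gap)]
    if not columns:
        return 0.0
    # A residue opposite a gap never equals it, so only true matches are counted.
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def mean_pairwise_identity(alignment):
    """Average percent identity over all pairs of rows (needs at least two rows)."""
    pairs = list(combinations(alignment, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)


def homology_group(identity):
    """Assign the high, medium or low group using the 75% and 55% cut-offs."""
    if identity > 75:
        return "high"
    if identity > 55:
        return "medium"
    return "low"
```

For instance, the three-sequence alignment `["GAUC", "GAUC", "GAAC"]` has pairwise identities of 100%, 75% and 75%, giving a mean of about 83.3% and placing it in the high-homology group under these assumptions.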
Figure 3. SCI (A) and SPS (B) as functions of the sequence identity for dataset 2 (see text for details). Five structural algorithms are shown: the Sankoff-based
methods Dynalign, Foldalign, PMcomp and Stemloc, and the base pair probability profile alignment method implemented in PMcomp (fast). These are in contrast
with the hand-curated structural alignments and six of the better sequence-based alignment algorithms (ClustalW, MUSCLE, PCMA, POA (gp), ProAlign and Prrn).

disappointing. We hypothesize that perhaps these are due to an overemphasis upon the sequence-based component of the algorithm.

All of the above datasets are freely available from http://www.binf.ku.dk/users/pgardner/bralibase/. As novel and updated algorithms become available, updated results will also be made available from the web page.

DISCUSSION

An aim of this work is to determine the boundaries between when pure sequence alignment methods perform well and when augmentation of the alignment with structure is necessary. We wish to highlight that the benchmarks based purely upon structural protein alignments do not adequately test all the uses of sequence alignment. In addition, we are pleased to note that our two independent measures of alignment fitness (SCI and SPS) produce similar results.

In some cases, we found that altering algorithm parameters produced a dramatic improvement over the defaults (e.g. T-Coffee performance improves using Clustal to generate a library of pairwise alignments and POA performance improves dramatically using a combination of the global and the progressive modes).

We find that the conclusions of previous studies based upon structural protein alignments do not necessarily hold for the alignment of structural ncRNA. For example, DIALIGN, identified as a method which performed well for low-homology protein alignment did not generally improve (relative to the alternative methods) on low-homology datasets (32). Another surprising discovery was that T-Coffee, touted as an excellent method for high-homology datasets, did not perform well (again, relative to the alternative methods) on the ncRNA datasets (32). Another surprise was that the supposedly outdated, yet still widely used method ClustalW, performed consistently well across all homology datasets. This is possibly a consequence of the fact that more recent algorithms are heavily optimized for protein alignment. The relatively new methods ProAlign, POA (gp) and MUSCLE also performed consistently well. ProAlign, in particular, produced (comparatively) reliable alignments and ranked in the top 5 across all homology ranges. This is possibly due to the fact that ProAlign is one of the few algorithms to use a scoring scheme derived from reliable nucleic acid sequence alignments. The performance of POA (gp) is also remarkable, not only because it employs a very fast method [said to accurately align 5000 EST sequences in 4 h on a Pentium II (44)] but also because it performed consistently well over all test sets.

Another conclusion of this work is that the ‘twilight zone’ of ncRNA alignment, the homology range where little to no structural information of predicted alignments (using the current state of the art algorithms) for structurally homologous sequences is retained, is in the 50–60% sequence-identity range. This is dramatically higher than that of the protein sequences which is 10–20% (29). Much of this difference is, of course, due to the different alphabet sizes and the generally limited models and the score matrices for nucleotide alignment.

It is interesting to note that three of the structural methods (Dynalign, Foldalign and PMcomp), for a short homology range (40–60% sequence identity), have higher SCI scores than the reference alignment and that in the same regions there is a dip in the performance when Dynalign, Foldalign and PMcomp performance is measured using SPS. This suggests that the reference alignments themselves may be improved upon in this homology range.
Based upon these results the Foldalign score routines seem to have optimized the delicate balance between the sequence and the structure-based scores. This implementation of Sankoff’s algorithm employs a light-weight energy model (13,41,45,46) in concert with the substitution matrices similar to those of RIBOSUM (47) and BLOSUM (48), which seem to produce excellent predictions. However, the computational complexity of this algorithm is still an issue: in practice, global alignment is restricted to sequences of 200 nt or less. Further optimization may increase this bound, however.

The profile-based approach of Hofacker et al. [pmcomp --fast (15,49)], holds promise for producing fast and reasonably accurate alignments in satisfactory time across all homology ranges. It by no means produces ‘optimal’ alignments in terms of sequence or structure, but is a reasonable compromise between the sequence- and the structure-based methods in terms of improved accuracy for the former and dramatically reduced computational requirements for the latter. This method is in the process of being re-implemented in C with affine gap costs and an adjustable sequence-weighting parameter. This is available as ‘RNApaln’ with the Vienna package version 1.5 or greater (I. Hofacker, personal communication).

SUMMARY

The results and main conclusions of our study can be summarized as follows:

(i) The two independent measures of global alignment accuracy SPS and SCI are generally in agreement. These measures only differ significantly on methods, such as Dynalign and PMcomp, that perform only structural alignment. The SCI is, therefore, a useful score for assessing the accuracy of structural RNA alignments.
(ii) The relative performance of multiple sequence alignment programs on RNA alignments can differ remarkably from the performance observed on protein alignments.
(iii) The multiple sequence alignment algorithms, such as ClustalW, MUSCLE, PCMA, POA (gp), ProAlign and Prrn, perform well on high- to medium-homology datasets.
(iv) ClustalW, ProAlign and POA (gp) consistently ranked in the top 10 across all homology ranges.
(v) The ‘twilight zone’ of ncRNA alignment is in the 50–60% sequence-identity range.
(vi) Below this limit, algorithms incorporating structural information (Dynalign, Foldalign, PMcomp and Stemloc) outperform pure sequence-based methods. However, these algorithms are computationally demanding which severely limits their use in practice.

Future directions

One rather interesting result of this study is that the structure profile alignment method (pmcomp --fast) produces reasonable structural alignments across all homology ranges in a dramatically short time. This method in combination with, as yet undeveloped, iterative alignment refinement strategies seems poised to become a method of choice for RNA researchers in the near future. This also has interesting implications for the notoriously difficult problem of ncRNA homology search. A combination of a database of locally stable regions (50) (analogous to the index creation of the BLAST procedure) and the profile alignment method is likely to produce a superior homology detection tool. This application and the extension of this method to multiple alignments is an area of active research.

Other potentially fruitful research areas to explore are as follows: (i) The implementation of light-weight Sankoff-like algorithms, which produce reasonable alignments in a short time-frame and use score routines combining energy and sequence scores similar to those of Foldalign. (ii) In analogy to the improvement of structure prediction accuracy by including stacking parameters (nearest-neighbour model), perhaps the alignment of RNA sequences over a dinucleotide alphabet could produce improvements in sequence-based alignment. (iii) The current score matrices for nucleotide alignment are generally ad hoc; it is likely that significant improvements could be gained by using RIBOSUM-like matrices (47) for scoring alignment. (iv) ‘Intelligent’ alignment algorithms which employ sequence information when this is reasonable or structure alignment when this is better.

NOTE ADDED TO PROOF

Since embarking on this project the alignment algorithms Align-m, Handel and MAFFT have been updated. Preliminary analysis of the updated algorithms suggests that improvements to Align-m and Handel have resulted in modest performance increases across the high, medium and low similarity groups of data-set 1. The improvements to MAFFT however have resulted in major performance increases across all similarity groups of data-set 1. In fact, across all similarity groups MAFFT (ver. 5) now ranks second only to ProAlign. However, gap-parameters for this algorithm have been estimated directly from data-set 1; this bias could be alleviated by determining optimal gap-parameters for all methods prior to benchmarking. Preliminary work in this direction shows algorithm performances on RNAs can in some cases be enhanced by optimising gap-parameters (personal communication K. Katoh).

SUPPLEMENTARY MATERIAL

Supplementary Material is available at NAR Online.

ACKNOWLEDGEMENTS

The authors thank Jakob Hull Havgaard and Jan Gorodkin for providing predictions from unpublished versions of Foldalign 2.0.0 and thought provoking discussions. The authors also thank K. Katoh for interesting discussions and making the authors aware of recent improvements to MAFFT. P.P.G. was supported by a Carlsberg Foundation Grant (21-00-0680). A.W. was supported by the German National Academic Foundation. S.W. was supported by the Austrian Gen-AU bioinformatics integration network sponsored by BM-BWK and BMWA. This work was conceived at the ‘Computational roads to the RNA world workshop’ hosted by Robert Giegerich at Bielefeld University. Funding to pay
the Open Access publication charges for this article was provided by a Carlsberg Foundation Grant (21-00-0680).

Conflict of interest statement. None declared.

REFERENCES

1. Chiu,D.K. and Kolodziejczak,T. (1991) Inferring consensus structure from nucleic acid sequences. Comput. Appl. Biosci., 7, 347–352.
2. Gutell,R.R., Power,A., Hertz,G.Z., Putz,E.J. and Stormo,G.D. (1992) Identifying constraints on the higher-order structure of RNA: continued development and application of comparative sequence analysis methods. Nucleic Acids Res., 20, 5785–5795.
3. Gorodkin,J., Heyer,L., Brunak,S. and Stormo,G. (1997) Displaying the information contents of structural RNA alignments. CABIOS, 13, 583–586.
4. Knudsen,B. and Hein,J. (2003) Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Res., 31, 3423–3428.
5. Hofacker,I., Fekete,M. and Stadler,P. (2002) Secondary structure prediction for aligned RNA sequences. J. Mol. Biol., 319, 1059–1066.
6. Eddy,S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.
7. Griffiths-Jones,S., Bateman,A., Marshall,M., Khanna,A. and Eddy,S.R. (2003) Rfam: an RNA family database. Nucleic Acids Res., 31, 439–441.
8. Rivas,E. and Eddy,S. (2001) Noncoding RNA gene detection using comparative sequence analysis. BMC Bioinformatics, 2, 8.
9. Washietl,S., Hofacker,I. and Stadler,P. (2005) Fast and reliable prediction of noncoding RNAs. Proc. Natl Acad. Sci. USA, 102, 2454–2459.
10. Woese,C. and Fox,G. (1977) Phylogenetic structure of the prokaryotic domain: the primary kingdoms. Proc. Natl Acad. Sci. USA, 74, 5088–5090.
11. Hudelot,C., Gowri-Shankar,V., Jow,H., Rattray,M. and Higgs,P. (2003) RNA-based phylogenetic methods: application to mammalian mitochondrial RNA sequences. Mol. Phylogenet. Evol., 28, 241–252.
12. Sankoff,D. (1985) Simultaneous solution of the RNA folding, alignment and protosequence problems. SIAM J. Appl. Math., 45, 810–825.
13. Mathews,D. and Turner,D. (2002) Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J. Mol. Biol., 317, 191–203.
14. Mathews,D. (2005) Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics, in press.
15. Hofacker,I., Bernhart,S. and Stadler,P. (2004) Alignment of RNA base pairing probability matrices. Bioinformatics, 20, 2222–2227.
16. Hull Havgaard,J., Lyngsø,R., Stormo,G. and Gorodkin,J. (2005) Pairwise local structural alignment of RNA sequences with sequence similarity less than 40%. Bioinformatics, in press.
17. Holmes,I. (2004) A probabilistic model for the evolution of RNA structure. BMC Bioinformatics, 5, 166.
18. Holmes,I. (2005) Accelerated probabilistic inference of RNA structure evolution. BMC Bioinformatics, 6, 73.
19. Sczyrba,A., Kruger,J., Mersch,H., Kurtz,S. and Giegerich,R. (2003) RNA-related tools on the Bielefeld Bioinformatics Server. Nucleic Acids Res., 31, 3767–3770.
20. Höchsmann,M., Töller,T., Giegerich,R. and Kurtz,S. (2003) Local similarity of RNA secondary structures. In 2nd IEEE Computer Society Bioinformatics Conference (CSB 2003), 11–14 August 2003, Stanford, CA, USA. ISBN 0-7695-2000-6, pp. 159–168.
21. Siebert,S. and Backofen,R. (2003) MARNA: a server for multiple alignment of RNAs. In Proceedings of the German Conference on Bioinformatics, GCB 2003, October 12–14, 2003, Neuherberg/Garching near Munich, Germany, pp. 135–140.
22. Konings,D. and Gutell,R. (1995) A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs. RNA, 1, 559–574.
23. Fields,D. and Gutell,R. (1996) An analysis of large rRNA sequences folded by a thermodynamic method. Fold Des., 1, 419–430.
24. Doshi,K., Cannone,J., Cobaugh,C. and Gutell,R. (2004) Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics, 5, 105.
25. Mathews,D. (2004) Using an RNA secondary structure partition function to determine confidence in base pairs predicted by free energy minimization. RNA, 10, 1178–1190.
26. Gardner,P. and Giegerich,R. (2004) A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 5, 140.
27. Zuker,M. and Sankoff,D. (1984) RNA secondary structures and their prediction. Bull. Math. Biol., 46, 591–621.
28. Hofacker,I.L., Fontana,W., Bonhoeffer,S. and Stadler,P.F. (1994) Fast folding and comparison of RNA secondary structures. Monatshefte für Chemie, 125, 167–188.
29. Thompson,J., Plewniak,F. and Poch,O. (1999) A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res., 27.
30. Thompson,J., Plewniak,F. and Poch,O. (1999) BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics, 15, 87–88.
31. Bahr,A., Thompson,J., Thierry,J. and Poch,O. (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res., 29, 323–326.
32. Lassmann,T. and Sonnhammer,E. (2002) Quality assessment of multiple alignment programs. FEBS Lett., 529, 126–130.
33. Pollard,D., Bergman,C., Stoye,J., Celniker,S. and Eisen,M. (2004) Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics, 5, 6.
34. Needleman,S. and Wunsch,C. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48, 443–453.
35. Smith,T. and Waterman,M. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195–197.
36. Feng,D. and Doolittle,R. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J. Mol. Evol., 25, 351–360.
37. Durbin,R., Eddy,S., Krogh,A. and Mitchison,G. (1998) Biological Sequence Analysis: Probabilistic Models of Protein and Nucleic Acids. Cambridge University Press, Cambridge.
38. Holmes,I. and Bruno,W.J. (2001) Evolutionary HMMs: a Bayesian approach to multiple alignment. Bioinformatics, 17, 803–820.
39. Löytynoja,A. and Milinkovitch,M.C. (2003) A hidden Markov model for progressive multiple alignment. Bioinformatics, 19, 1505–1513.
40. Nussinov,R., Piecznik,G., Grigg,J.R. and Kleitman,D.J. (1978) Algorithms for loop matchings. SIAM J. Appl. Math., 35, 68–82.
41. Zuker,M. and Stiegler,P. (1981) Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res., 9, 133–148.
42. Washietl,S. and Hofacker,I. (2004) Consensus folding of aligned sequences as a new measure for the detection of functional RNAs by comparative genomics. J. Mol. Biol., 342, 19–30.
43. Van Walle,I., Lasters,I. and Wyns,L. (2004) Align-m: a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, 20, 1428–1435.
44. Lee,C., Grasso,C. and Sharlow,M.F. (2002) Multiple sequence alignment using partial order graphs. Bioinformatics, 18, 452–464.
45. Mathews,D., Sabina,J., Zuker,M. and Turner,H. (1999) Expanded sequence dependence of thermodynamic parameters provides robust prediction of RNA secondary structure. J. Mol. Biol., 288, 911–940.
46. Xia,T., SantaLucia,J., Burkard,M., Kierzek,R., Schroeder,S., Jiao,X., Cox,C. and Turner,D. (1998) Thermodynamic parameters for an expanded nearest-neighbor model for formation of RNA duplexes with Watson–Crick base pairs. Biochemistry, 4, 14719–14735.
47. Klein,R. and Eddy,S. (2003) RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics, 4, 44.
48. Henikoff,S. and Henikoff,J. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 10915–10919.
49. Bonhoeffer,S., McCaskill,J., Stadler,P. and Schuster,P. (1993) RNA multi-structure landscapes. A study based on temperature dependent partition functions. Eur. Biophys. J., 22, 13–24.
50. Hofacker,I., Priwitzer,B. and Stadler,P.F. (2004) Prediction of locally stable RNA secondary structures for genome-wide surveys. Bioinformatics, 20, 186–190.
