
The Accuracy of Several Multiple Sequence Alignment Programs For Proteins

This research article evaluates the accuracy of nine popular protein multiple sequence alignment (MSA) programs using simulated sequences generated by the software Simprot. The study finds that alignment accuracy is significantly influenced by the number of insertions and deletions in the sequences, with Mafft and ProbCons demonstrating the highest accuracy among the tested programs. The results suggest that Simprot provides a more flexible approach for assessing alignment accuracy compared to traditional methods using curated databases like BAliBASE.


BMC Bioinformatics BioMed Central

Research article Open Access


The accuracy of several multiple sequence alignment programs for proteins
Paulo AS Nuin1, Zhouzhi Wang1 and Elisabeth RM Tillier*1,2

Address: 1Division of Cancer Genomics and Proteomics, Ontario Cancer Institute, University Health Network, 101 College St, M5G 1L7, Toronto,
Ontario, Canada and 2Dept. Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
Email: Paulo AS Nuin - [email protected]; Zhouzhi Wang - [email protected]; Elisabeth RM Tillier* - [email protected]
* Corresponding author

Received: 26 July 2006
Accepted: 24 October 2006
Published: 24 October 2006
BMC Bioinformatics 2006, 7:471 doi:10.1186/1471-2105-7-471
This article is available from: http://www.biomedcentral.com/1471-2105/7/471
© 2006 Nuin et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract
Background: There have been many algorithms and software programs implemented for the
inference of multiple sequence alignments of protein and DNA sequences. The "true" alignment is
usually unknown due to the incomplete knowledge of the evolutionary history of the sequences,
making it difficult to gauge the relative accuracy of the programs.
Results: We tested nine of the most often used protein alignment programs and compared their
results using sequences generated with the simulation software Simprot which creates known
alignments under realistic and controlled evolutionary scenarios. We have simulated more than
30000 alignment sets using various evolutionary histories in order to define strengths and
weaknesses of each program tested. We found that alignment accuracy is extremely dependent on
the number of insertions and deletions in the sequences, and that indel size has a weaker effect.
We also considered benchmark alignments from the latest version of BAliBASE and the results
relative to BAliBASE- and Simprot-generated data sets were consistent in most cases.
Conclusion: Our results indicate that employing Simprot's simulated sequences allows the
creation of a more flexible and broader range of alignment classes than the usual methods for
alignment accuracy assessment. Simprot also allows for a quick and efficient analysis of a wider
range of possible evolutionary histories that might not be present in currently available alignment
sets. Among the nine programs tested, the iterative approach available in Mafft (L-INS-i) and
ProbCons were consistently the most accurate, with Mafft being the faster of the two.

Background
The determination of homologous regions of molecular sequences is often used for the further inference of their function and evolution, and therefore accurate multiple sequence alignment (MSA) of nucleic acid and protein sequences is crucial. Consequently, there has been tremendous effort in the development and implementation of different MSA algorithms, using distinct approaches to improve the resulting alignment accuracy.

The accuracy assessment of MSA programs is often done by employing manually (or semi-automatically) curated sequence databases such as BAliBASE [1], PREFAB [2] and SABmark [3]. So far, BAliBASE has been the most often used alignment database in evaluating the performance of different MSA programs. It was constructed using protein sequences or models with known three-dimensional structures. The latest release, version 3.0, increased the number of available sequences and alignments. Such improvements apparently have addressed the major concerns of Karplus and Hu [4] regarding the use of BAliBASE to benchmark MSA algorithms.

Alignment databases provide a source of accurate alignments to gauge the accuracy and speed of different programs, but they also present several disadvantages. Even though the databases' alignments are manually curated, there is still the possibility of misalignments which would result in accuracy assessment problems. The sets of alignments still remain rather small and may not represent the complete range of scenarios of protein evolution. Furthermore, a major drawback of the use of alignment databases is that algorithms can potentially be developed and tuned to the alignments present solely in these data sets.

Recently there have been several DNA sequence simulation packages that incorporate indels, such as MySSP [5] and DAWG [6]. MySSP has been widely used in different studies of phylogenetic inference and evolutionary distance estimation coupled with DNA alignment accuracy [7,8]. For proteins, Lassmann and Sonnhammer [9], in a previous comparison of MSA algorithms, used artificially created sequence sets generated by the simulation program Rose [10]. Rose simulates sequences of proteins allowing for the occurrence of indels. Data sets generated by Rose present their own limitations for the study of alignment accuracy. In Rose, indel size and number do not adequately represent empirical data for proteins that have diverged for different evolutionary times. Also, the program assumes equal evolutionary rates of all the sites in the protein.

In this study we introduce an improved approach to assess alignment accuracy by using simulated protein sequences generated by Simprot [11]. Simprot is an advanced simulation program that employs a parameterized version of the Qian and Goldstein [12] insertion and deletion (indel) distribution. Although the original distribution was empirically derived from a subset of alignments of highly diverged protein sequences, the parameterized version permits a very flexible simulation of indels in sequences for all levels of sequence divergence. Simprot also allows variable substitution and indel rates at different sites by implementing gamma-distributed site rates [13]. Three models of amino acid substitution (PMB, PAM and JTT) are also available. We have used Simprot to generate known alignments with a wide variety of evolutionary parameters, as well as the latest BAliBASE database of curated alignments, to investigate the accuracy and speed of popular and publicly available protein multiple sequence alignment software programs.

Alignment programs
There are many available computer packages that generate MSAs of protein sequences. We selected nine of the currently most often used programs (in order of publication date): Clustal W, Dialign2.2, T-Coffee, POA, Muscle, Mafft, ProbCons, Dialign-T and Kalign.

Clustal W [14] version 1.8
This is probably the most widely used alignment program and the oldest among the packages tested. The software performs a progressive alignment, first employing a pairwise sequence comparison by calculating a distance matrix that stores sequence divergence. After this matrix is obtained, a guide tree is built using Neighbor Joining, followed by the third and final step where sequences are aligned according to the branch order in the guide tree. The program employs two gap penalties in its alignment procedure: gap opening and gap extension, and in the case of polypeptides, a full amino acid scoring weight matrix. These gap penalties are mainly dependent on factors such as the weight matrix, sequence length and similarity. In simple cases, Clustal W might accurately align corresponding domains and sequences of known secondary or tertiary structure, while in more complex cases it can be used as a good starting point for further refinement.

Dialign2.2 [15] version 2.2.1
This program uses a diagonal method to align sequences locally and globally. Dialign2.2 does not compare single residues, but whole uninterrupted (no gaps, mismatches allowed) stretches of residues that would form diagonals in a dot-matrix comparison of two sequences. Consequently, it does not penalize the insertion and extension of gaps, and may leave unrelated segments unaligned. The first step in the procedure creates all possible pairwise alignments, storing a collection of diagonals meeting certain consistency criteria [16] without conflicting double or crossover assignments of residues [15]. All saved diagonals are weighted in order to define entries with maximum sum of weights, and then sorted in order to determine the degree of overlap, emphasizing the existence of diagonals present in multiple sequences. A greedy-like algorithm does the final processing, checking diagonal scores from top to bottom and creating a final multiple alignment. Gaps are inserted at the end of the MSA creation until all present residues are connected.

T-Coffee (Tree-based consistency objective function for alignment evaluation) [17] version 3.27
T-Coffee employs a progressive strategy in aligning sequences. The program first creates a library from two different sources: global alignments from Clustal W and local alignments from Lalign [18]. For each pair of sequences, global alignments and the pairwise local alignments are created from the ten top-scoring non-intersecting segments. The program processes the global and local information, assigning weights to all pairwise alignments relative to sequence identity [19]. This is followed by the combination of groups that are merged into a single library. There is an extension phase for this combined library, making the final weight of any pair of residues reflect part of the information contained in the whole library. A final step requires a calculation of a distance matrix and a Neighbor Joining tree, since the alignment is generated with a progressive strategy by aligning the two closest sequences on the tree according to the weight stored in the extended library. The initial pair is then fixed and any existing gaps cannot be shifted later. The progressive alignment continues until every sequence is aligned.

POA (Partial Order Alignment) [20] version 2.0
POA is another MSA package that uses a progressive alignment algorithm without using generalized profiles. This program introduces the use of a Partial Order-Multiple Sequence Alignment (PO-MSA) format to represent sequences, and more accurately reflects biological content. This format stores the alignment as a compacted graph for minimal node and edge counts, still containing all the information available in a traditional MSA. Sequences are stored as a linear series of nodes each connected by two edges. POA uses a traditional dynamic programming algorithm [21,22], where linear sequences are replaced by Partial Order (PO) graphs. These PO structures are transformed into the usual 2D matrices and each combination of cells is scored backwards as in a traditional Smith-Waterman sequence alignment procedure [22]. These matrices are then extended in any direction (diagonal, horizontal, vertical) allowing the production of the pairwise alignment on junction points. The MSA is obtained from the alignment of two sequences at the beginning, with the addition of other sequences successively to the initial pair.

Muscle (Multiple sequence comparison by log-expectation) [2,23] version 3.6
Muscle uses a pairwise profile alignment approach. The program first builds a progressive alignment which is then improved and refined in two subsequent stages. The progressive alignment is created after the sequence similarities, a distance estimation and a UPGMA tree are calculated. Muscle uses two distance measures: a k-mer distance for unaligned sequence pairs and a Kimura distance for aligned pairs [2]. The progressive alignment improvement stage creates a new tree with the already calculated Kimura distance matrix and then builds a better alignment based on this improved tree. The last refinement stage employs a variant of the tree-dependent restricted partitioning [24]. This method deletes one of the tree edges, bi-partitioning the alignment and extracting both partitions' profiles, which are then realigned with a profile-profile alignment. Every tree edge is visited iteratively and the alignment with an updated summed pairwise score of each sequence pair is retained. The edges are visited in order of decreasing distance from the root, with a realignment of individual sequences, moving to more closely related groups of sequences [23].

Mafft (Multiple sequence alignment based on Fast Fourier Transform) [25] version 5.732
Mafft is a program that can be used with different alignment approaches, either progressive alignment alone (with Fast Fourier Transform), or progressive followed by iterative refinement. Mafft's basic run can have up to three steps, but the default procedure performs the initial two steps. First, a progressive alignment is created based on a rough distance between every sequence pair based on shared 6-tuples. A guide tree is also generated by UPGMA with modified linkage and sequences are then aligned following the branch order of the tree (this step alone is called strategy FFT-NS-1). The second step recalculates a distance matrix, based on the information gathered in the previous step, and the progressive alignment is re-done using a tree obtained from the new matrix as a starting point (up to this step, the strategy is known as FFT-NS-2 and it is the default used by the software). The last phase is the iterative refinement which optimizes Gotoh's weighted sum of pairs (WSP) [26] score, with a group-to-group alignment [27] and the tree-dependent restriction partition technique [24]. If all three steps are employed, the procedure is called FFT-NS-i, meaning it uses an FFT method to rapidly identify homologous regions present in the sequences, which is followed by an iterative phase of refinement. FFT converts every single amino acid present in a sequence to a vector representing volume and polarity, which are important factors in substitution events, allowing the software to predict such occurrences with precision.

Mafft also includes three additional refinement algorithms: L-INS-i, G-INS-i and E-INS-i [25]. These strategies increase the number of steps required to create an MSA to five. In these cases the first step also requires the construction of a distance matrix, not using 6-tuples. Unlike the FFT-NS-* approaches, there is no reconstruction of the calculated UPGMA tree and the program moves to the second step, dividing the gap-free segments and storing score arrays for each gap-free segment from one sequence to another. Mafft then calculates an "importance" value from the score of the segment and stores how frequently residues appear on other segments. All "importance" values are then gathered in an "importance" matrix in step three, which is quickly followed by a group-to-group alignment obtained from the score matrices and a weighting scheme [14] based on a Needleman-Wunsch algorithm. A final step iteratively refines the obtained alignments, optimizing a WSP score and the "importance" values calculated previously.

ProbCons (Probabilistic Consistency-based multiple sequence alignment) [28] version 1.1
ProbCons is the only program that uses a probabilistic consistency method of alignment. It is a modification of the traditional sum-of-pairs scoring system, and in addition incorporates a pair-hidden Markov model-based progressive alignment algorithm. The alignment procedure is divided into four steps, starting with a computation of posterior-probability matrices for every pair of sequences. This is followed by a dynamic programming calculation of the expected accuracy of every pairwise alignment. Probabilistic consistency transformation is then employed in order to re-estimate the match accuracy scores. A guide tree is calculated with hierarchical clustering with the similarity defined by a weighted average of values between sequences of each cluster. The guide tree is used to align the sequences using a progressive approach. A post-processing phase is also done, where random bi-partitions of the generated alignment are realigned in order to check for better alignment regions. ProbCons differs from other alignment programs since it does not incorporate biological concepts such as position-specific gap scoring, evolutionary tree construction and other features commonly used by other packages.

Dialign-T [29] version 0.2.1
This program is a re-implementation of the procedure developed in Dialign2.2, but with a better solution to deal with inconsistent fragments, including fragment-chaining. It also implements a new approach for estimating probabilities of the random occurrence of each fragment present in the sequence to be aligned. Dialign-T does not use pre-calculated tables in order to obtain weight scores: it calculates probability tables from several substitution matrices. Additionally, the greedy-like multiple alignment algorithm from Dialign2.2 was changed in order to avoid spurious local similarities.

Kalign [30] version 1.04
Kalign is another program that uses a progressive alignment approach to obtain the best MSA possible. The main difference between this algorithm and other methods is that it employs the Wu-Manber approximate string matching algorithm [31] when calculating the distance among sequences. The Wu-Manber algorithm measures the distance between two strings using a Levenshtein edit distance, which allows an efficient search for mismatches (shared or not) and patterns present in the sequences. According to the Kalign developers, this methodology allows for a distance estimation which is as fast as a k-tuple algorithm but is more accurate [30]. The first step in the alignment procedure is to calculate the pairwise distances using the Wu-Manber approach. The pairwise distance estimation is followed by the construction of a guide tree using UPGMA, which is employed in a global dynamic programming method to align the sequences/profiles. Additionally, the program performs a consistency check in order to define the largest set of sequence matches that can be inserted in the alignment, using a modified version of the Needleman-Wunsch algorithm [21] to find the most consistent path through the dynamic programming matrix. Also, Kalign updates the positions of pattern matchings, which adjusts the absolute position of matches found within sequences to their relative positions within generated profiles [30].

Results
Simprot simulated sequences
Simprot's simulation parameters provide flexibility for generating alignments so that the effects of distinct factors can be examined together and/or separately under multiple evolutionary scenarios. Simulated sequences were used to investigate the influence of sequence length, indel frequency, indel length, evolutionary distance, terminal gap length, gamma density function and tree topology on the accuracy of alignments inferred by different programs. More than 30000 alignments were created independently by Simprot using five phylogenetic trees (Figure 2) with variable lengths and different numbers of variable-size indels, in order to cover different topological evolutionary patterns. Simprot generates a known alignment and another file containing the sequences with no indels. One hundred simulated alignments with different random seed values were created for each combination of tested parameters. All corresponding sequences were also aligned with the nine programs described above and the resulting alignments were compared to the "true" alignment generated by Simprot. The average accuracy values for the 100 alignments of each set are reported here and in some cases a Wilcoxon signed ranks test was employed in order to determine the statistical significance of the difference in average accuracy. The protein substitution matrix used in all simulations was PMB [32], which is also the program's default.

As reported previously [9], sequence length does not affect alignment accuracy of the different programs. In order to confirm this, six different root sequence lengths were employed in the analysis: 50, 100, 150, 200, 250 and 300 amino acids. These values were selected in order to get the resulting alignments in a feasible amount of time, while maintaining a significant difference in root sequence lengths. To determine the effect of amino acid substitutions and sequence length on alignment accuracy, we first kept the indel frequency and indel length very low and considered different trees with various overall evolutionary distances and with increasing root sequence lengths.
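
The comparison of an inferred alignment against Simprot's "true" alignment, described at the start of this Results section, can be illustrated with a small sketch. The section does not define the accuracy measure at this point, so the code below assumes a sum-of-pairs style score (the fraction of residue pairs aligned in the true alignment that are also aligned in the test alignment). That choice, and the function names `aligned_pairs` and `sum_of_pairs_accuracy`, are assumptions made for illustration only and are not taken from the paper.

```python
from itertools import combinations


def aligned_pairs(alignment):
    """Return the set of residue pairs that share a column.

    `alignment` is a list of equal-length gapped strings ('-' marks a gap).
    Each pair has the form ((seq_i, residue_i), (seq_j, residue_j)), where
    residue indices count ungapped positions within each sequence.
    """
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows must have the same length")
    counters = [0] * len(alignment)  # residues consumed so far per sequence
    pairs = set()
    width = len(alignment[0]) if alignment else 0
    for col in range(width):
        present = []
        for seq_idx, row in enumerate(alignment):
            if row[col] != "-":
                present.append((seq_idx, counters[seq_idx]))
                counters[seq_idx] += 1
        # every pair of residues in this column counts as aligned
        pairs.update(combinations(present, 2))
    return pairs


def sum_of_pairs_accuracy(true_alignment, test_alignment):
    """Fraction of truly aligned residue pairs recovered by the test alignment.

    Assumed measure (hypothetical helper, not the paper's stated scoring).
    """
    true_pairs = aligned_pairs(true_alignment)
    if not true_pairs:
        return 0.0
    return len(true_pairs & aligned_pairs(test_alignment)) / len(true_pairs)


# Toy example: the test alignment misplaces one gap in the first sequence.
true_aln = ["AC-GT", "ACAGT"]
test_aln = ["ACG-T", "ACAGT"]
score = sum_of_pairs_accuracy(true_aln, test_aln)
```

In the toy example, three of the four truly aligned residue pairs are recovered. Averaging such a score over the 100 replicate alignments of a parameter set would give one value per program and condition, in the spirit of the averages reported here.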


                                           BAliBASE (RV11 to all refs.)
Program         Publication date  Simprot  RV11     RV12     RV20     RV30     RV40     RV50     all refs.  Time (s)
Clustal W       Sep 1994          0.78923  0.48326  0.81114  0.82769  0.70009  0.65208  0.67851  0.70238    22.012
Dialign2.2      Mar 1999          0.75480  0.41433  0.77499  0.80789  0.66539  0.63704  0.66255  0.66723    52.708
T-Coffee        Sep 2000          0.83629  0.51866  0.84180  0.84176  0.73867  0.68560  0.74086  0.73186    1273.963
POA             Mar 2002          0.75196  0.30605  0.71622  0.77491  0.62705  0.58865  0.57238  0.60754    9.025
Mafft FFT-NS-2  Jul 2002          0.83911  0.46401  0.79774  0.83113  0.73048  0.64222  0.69748  0.70129    1
Muscle          Aug 2004          0.83031  0.53313  0.83181  0.84411  0.73635  0.66969  0.71056  0.73110    4.426
Mafft L-INS-i   Jan 2005          0.86545  0.56564  0.84497  0.86049  0.77123  0.71307  0.75483  0.75813    15.607
ProbCons        Feb 2005          0.86712  0.59117  0.85479  0.85796  0.76782  0.69439  0.75271  0.76227    353.787
Dialign-T       Mar 2005          0.77475  0.41372  0.79267  0.80824  0.67674  0.60237  0.67518  0.67024    41.467
Kalign          Dec 2005          0.80271  0.47593  0.82048  0.82854  0.72459  0.64190  0.70384  0.70801    3.403

Figure 1
Overall average accuracy values obtained with all Simprot's simulated sequences and all BAliBASE's references. Results are ordered by date of publication. Values in the same column that are not significantly different according to a Wilcoxon signed ranks test (p < 0.05) have the same colour; values in black are significantly different, and bold font represents the largest value in the column. CPU times are normalized to Mafft FFT-NS-2 and were obtained with a 44-sequence alignment of 500 residues.
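
As a rough check on how closely the Simprot and BAliBASE results agree, the overall columns of Figure 1 can be ranked and compared. The sketch below copies the "Simprot" and "all refs." values from the table and computes a Spearman rank correlation between the two orderings. The use of Spearman's coefficient and the helper names `ranking` and `spearman` are choices made for this illustration, not an analysis performed in the paper.

```python
# (Simprot overall accuracy, BAliBASE "all refs." accuracy), copied from Figure 1.
SCORES = {
    "Clustal W": (0.78923, 0.70238),
    "Dialign2.2": (0.75480, 0.66723),
    "T-Coffee": (0.83629, 0.73186),
    "POA": (0.75196, 0.60754),
    "Mafft FFT-NS-2": (0.83911, 0.70129),
    "Muscle": (0.83031, 0.73110),
    "Mafft L-INS-i": (0.86545, 0.75813),
    "ProbCons": (0.86712, 0.76227),
    "Dialign-T": (0.77475, 0.67024),
    "Kalign": (0.80271, 0.70801),
}


def ranking(column):
    """Program names sorted from most to least accurate for one column."""
    return sorted(SCORES, key=lambda name: SCORES[name][column], reverse=True)


def spearman(rank_a, rank_b):
    """Spearman rank correlation for two orderings of the same items (no ties)."""
    n = len(rank_a)
    position_b = {name: i for i, name in enumerate(rank_b)}
    d_squared = sum((i - position_b[name]) ** 2 for i, name in enumerate(rank_a))
    return 1 - 6 * d_squared / (n * (n * n - 1))


simprot_rank = ranking(0)
balibase_rank = ranking(1)
rho = spearman(simprot_rank, balibase_rank)
```

With these values, both orderings place ProbCons and Mafft L-INS-i first and second, and Dialign-T, Dialign2.2 and POA last. The largest shift between the two orderings is Mafft FFT-NS-2, which ranks third on the Simprot data and seventh on BAliBASE.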

The majority of the packages tested generated results ranging from good to excellent with increasing sequence length. POA presented the lowest accuracies on the performed tests and at the same time was positively affected by sequence length increase (Figure 3). POA's lower accuracy appears to be due to a tendency of the program to place large internal gaps close to the sequence termini, while the accuracy increase in larger sequence lengths might be explained by the proportionally small influence of these terminal gaps in the alignment scoring. Apparently, the alignment of sequences with a large number of substitutions but a low number of gaps did not present a problem for any of the nine algorithms used, no matter the size of the sequence. Noticeably, Clustal W showed the steepest decline in accuracy when sequence length was increased even at very low gap frequency values (Figures 3 and 4). The program occasionally had an alignment accuracy decrease two to three times larger than the average for all other seven programs (Figure 4B), especially when indels were added to the reference alignments.

Different indel frequencies were also used in the simulations in order to test the effect of indel occurrences on alignment accuracy. Simprot's process for insertions and deletions assumes a Poisson model, where the expected frequency of indels between two sequences separated by an expected 100 PAM distance is

$$p = 1 - e^{-z/c}$$

where z is the indel probability that is scaled by the evolutionary scale factor c. The smallest frequency p employed was the program's default value of 3%, and it was increased up to 30%. As expected, when indels were added to the simulated sequences and evolutionary distance was increased, there was an evident loss in accuracy for all programs. This corroborates results obtained by Lassmann and Sonnhammer [9], who showed that programs tended to have poorer performance as the evolutionary distance increased when indels were present. The best results were generated by ProbCons and Mafft L-INS-i. ProbCons presented better results for trees with longer evolutionary distances when intermediate to large indel frequencies were applied. Conversely, Mafft L-INS-i performed better for smaller evolutionary distances and with intermediate indel frequency values (Figure 4).

In Simprot the evolutionary distance set by the branch lengths in the input tree affects the expected number of substitutions and also affects the expected number of insertions and deletions. In order to further analyze the influence of branch lengths on alignment accuracy, we considered a single tree topology and scaled the branches (Figure 2, tree A) so that the overall tree shape was not changed (Figure 5). As shown above, all programs were negatively affected by increased evolutionary distances, particularly when the employed indel frequency parameter was high. POA had the steepest decline in accuracy, at small indel frequencies (Figure 5A).

Because indel frequency appeared to have a large effect on MSA accuracy, we analyzed the effect of increasing the indel frequency independently of other factors (Figure 6). Our results showed that accuracy and its rate of decline with indel frequency depended on the input tree used. Trees with longer branch lengths (Figure 6A) had sharp decreases in accuracy with increasing indel frequencies. Input trees with shorter branch lengths showed smaller declines in accuracy (Figure 6B–C). For most programs, when a topology with varied branch lengths was used, the accuracy decrease was almost linear with increasing indel frequency. ProbCons and Mafft L-INS-i were the least affected by the increase in evolutionary distance and resulted in the best performances.
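
The Poisson indel model given earlier in this section can be written out as a short numerical sketch. It reads the flattened exponent in the source as -z/c, so that p = 1 - exp(-z/c). This is the only reading for which p = 0 when z = 0, and it matches the standard Poisson result that the probability of at least one event with mean z/c is 1 - exp(-z/c). The function names `indel_frequency` and `indel_parameter` are hypothetical and are not part of Simprot.

```python
import math


def indel_frequency(z, c):
    """Expected indel frequency p = 1 - exp(-z / c).

    z: indel parameter, c: evolutionary scale factor (assumed positive).
    """
    if c <= 0:
        raise ValueError("scale factor c must be positive")
    return 1.0 - math.exp(-z / c)


def indel_parameter(p, c):
    """Inverse relation z = -c * ln(1 - p), for a target frequency 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    if c <= 0:
        raise ValueError("scale factor c must be positive")
    return -c * math.log(1.0 - p)


# z values that would yield the smallest (3%) and largest (30%) frequencies
# used in the study, taking c = 1 purely for illustration.
z_low = indel_parameter(0.03, 1.0)
z_high = indel_parameter(0.30, 1.0)
```

The inverse function is included because the text quotes target frequencies (3% to 30%) rather than values of z. How Simprot itself maps its command-line parameters to p is not stated in this section.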


Figure 2
Tree topologies used in the analysis. A, D and E are artificially created topologies, while B and C are based on PFAM alignments.

Another element that can influence the occurrence and number of indels in protein sequences is the tree topology. We considered this independently of evolutionary distance and Simprot's indel frequency by considering two trees with identical maximum evolutionary distance (Figure 2, Trees D and E) but with different topologies. Although evolution had occurred at different locations in the two tree topologies tested (tips as opposed to internal nodes), this did not seem to have a large influence on overall alignment accuracy for the majority of the algorithms analyzed (Figure 7), as in both cases programs obtained similar alignment scores.

Real protein sequence data often contain non-homologous terminal ends and/or incomplete sequences. We investigated the effect of large terminal gaps on alignment accuracy. A small modification in Simprot's code was necessary to include an additional probability of terminal gaps. Since no reasonable biological model exists for external gaps, we introduced an ad hoc parameter t which determines the probability and length of external gaps by scaling the probabilities for internal gaps. Five different values of the terminal gap insertion parameter were used while keeping the internal gap frequency constant (5%) for these simulations. It was observed that the presence of terminal gaps, regardless of their length, had a minimal effect on alignment accuracy for most of the programs


Figure 3
Comparison of alignment accuracy and increasing sequence length, at low indel frequency values. Selected examples with different input trees. The increase in sequence length did not seem to affect alignment accuracy of the majority of the programs. ProbCons and Mafft L-INS-i were the top performers, followed closely by Muscle, T-Coffee, Mafft FFT-NS-2 and Kalign. Dialign2.2, Dialign-T and Clustal W presented a better accuracy than POA in most of the cases. Scale factor: value by which the tree's branch lengths are multiplied, making them change uniformly; c is the Qian-Goldstein distribution value that determines the average length of indels.
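
The scale factor defined in this caption multiplies every branch length by the same value, which changes the overall evolutionary distance while leaving the tree shape intact. A minimal sketch of that operation follows, assuming the input tree is written in Newick format with numeric branch lengths after colons. The section does not state the tree file format, and both the helper name `scale_branch_lengths` and the example tree are hypothetical.

```python
import re

# A branch length is taken to be the number that follows a colon.
# Simplification: quoted labels containing ':' are not handled.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def scale_branch_lengths(newick, factor):
    """Multiply every branch length in a Newick string by `factor`.

    The topology and the relative branch lengths are unchanged; only the
    overall evolutionary distance is rescaled. Lengths are written back
    with at most six significant digits.
    """
    def _scale(match):
        return ":" + format(float(match.group(1)) * factor, "g")

    return _BRANCH_LENGTH.sub(_scale, newick)


# Hypothetical three-taxon tree, doubled in length.
tree = "((A:0.1,B:0.2):0.05,C:0.3);"
doubled = scale_branch_lengths(tree, 2)
```

Because all lengths are multiplied by the same factor, the ratio between any two branches is preserved, which is the sense in which the overall tree shape does not change.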


Figure 4
Comparison of alignment accuracy and increasing sequence length, at high indel frequency values. Selected examples obtained with different tree topologies. ProbCons and Mafft L-INS-i took turns as top performers. A middle group of programs is revealed by this comparison, comprising Mafft FFT-NS-2, T-Coffee, Muscle and Kalign. The lowest accuracies were shown by Dialign2.2, Dialign-T and POA, and by Clustal W at large sequence sizes. Clustal W curves presented a steep decline in accuracy as the sequence length was increased.


Figure 5
Decrease in accuracy with an increase in the evolutionary scale factor of topology A. POA seemed to be the most affected by the increase of the scale factor applied to topology A from Figure 1. The top performers are again Mafft L-INS-i and ProbCons. An intermediary group formed by T-Coffee, Muscle, Mafft FFT-NS-2 and Kalign is followed by Dialign2.2, Dialign-T, Clustal W and POA that showed poor accuracy values as the scale factor increased.
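The scale factor multiplies every branch length of the input tree by the same value. As an illustration only, the sketch below applies such a factor to a tree written in Newick format. It is not Simprot's implementation. The regular expression assumes plain numeric branch lengths and no colons inside taxon labels, and the example tree is hypothetical rather than topology A.

```python
import re

# Matches ":<number>" branch-length annotations in a Newick string.
_BRANCH_LENGTH = re.compile(r":\s*([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)")


def scale_branch_lengths(newick, factor):
    """Return a copy of `newick` with every branch length multiplied by `factor`."""

    def _scale(match):
        # "g" formatting keeps six significant digits and hides float noise.
        return ":" + format(float(match.group(1)) * factor, "g")

    return _BRANCH_LENGTH.sub(_scale, newick)


# Hypothetical four-taxon tree, scaled by a factor of 3.
tree = "((A:0.1,B:0.1):0.2,(C:0.1,D:0.1):0.1);"
scaled = scale_branch_lengths(tree, 3)
```

Because all branches change by the same factor, the topology and the relative branch lengths are preserved and only the overall amount of evolution increases.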

(Figure 8). Again in this case, Mafft L-INS-i and ProbCons were the top performers.

The analysis of the influence of indel frequency on alignment accuracy did not take into account the overall size of the simulated insertions and deletions. To test a possible effect of indel size on the programs' performance, three different values (2, 3 and 4) of Simprot's c parameter were tested. This value is used by the generalized Qian and Goldstein distribution [11,12] in indel length determination. Larger c values yield shorter indels while smaller values result in longer ones [11]. The indel frequency was kept constant. We found that the larger the indels, the lower the alignment accuracy, although with a moderate difference in the final average score (Figure 9). This accuracy loss could be seen for all phylogenetic trees analyzed and for the majority of the programs. Again, Mafft L-INS-i and ProbCons seemed to fare better and were least affected by the variation in gap length.

According to Rosenberg [7], changes in the shape of Yang's [13] gamma distribution of evolutionary rates influence alignment accuracy. The gamma shape models the proportion of slow to fast evolving positions and accounts for variable substitution rates among sites; the lower the α, the larger the number of sites with a low substitution rate. Decreasing gamma's α shape parameter positively affected the alignment accuracy (Figure 10), due to an increasing number of identical sites between sequences. Simprot allows the modification of the gamma's α shape parameter and in our study it was set at 0.1, 0.7, 1 (Simprot's default), 5 and 10. We examined changing α using different topologies and indel frequency values. The obtained results show a moderate influence of the value of gamma α at low indel frequencies, resulting in an accuracy loss for most of the programs, especially for POA. When larger indel frequencies were employed, the negative effect of increasing gamma α was accentuated up to α = 5, which was reversed by a small gain in accuracy when α was increased from 5 to 10 (Figure 10). This is expected, since with α values the gamma distribution of


Figure 6
Accuracy decline with larger indel frequency values. Different accuracy values from alignments of sequences simulated with distinct topologies and increasing the indel frequency.

evolutionary rates tends to be less extreme (exponential) and to have a curve shape similar to the normal distribution.

In order to deduce why there was a large influence of insertions and deletions on the programs' performance, we analyzed the average number of gaps per sequence present in the "true" alignment and in all resulting alignments (Figure 11). As mentioned above, POA had a tendency to insert long internal gaps at the sequence termini; this inflates the average number of gaps in the alignments that are constructed by the program. Clustal W, under default parameters, was the most conservative of the programs tested and in every case had a smaller average number of gaps in its alignments than the other packages and the known alignment. Kalign and ProbCons were the programs with a final number of inserted gaps closest to the real alignment in the majority of the simulations. Overall, programs that use a progressive alignment with tree determination showed a smaller average number of gaps per sequence than the programs that do not use a guide tree (POA, Dialign2.2 and Dialign-T).

In summary, our results show that it is the total number of indels, independently of where in the tree they occur, and to some degree independently of the number of substitu-


Figure 7
Comparison of the alignment accuracy values of trees with the same maximum evolutionary distance and different topologies. The accuracy curves for both topologies are very similar, independently of the topology employed. The input trees differed in where evolution had occurred, at the tips or internal branches.

tions, that had the greatest effect on alignment accuracy. Also, indel size plays a role in alignment accuracy, but to a lesser extent than indel number. Additionally, the gamma distribution of evolutionary rates generally had a negative effect on the final accuracy. Regarding program performance, ProbCons and Mafft L-INS-i achieved the best results in the majority of the simulated alignment sets. An intermediary group consisted of T-Coffee, Muscle, Mafft FFT-NS-2 and Kalign, while Clustal W, POA, Dialign-T and Dialign2.2 often produced the poorest alignment accuracy. An overall summary of alignment accuracy for each program is shown in Figure 1. With the exception of Clustal W, in scenarios of large sequence lengths and indel frequency, programs that have a tree-guided multiple alignment procedure showed better results than those that do not rely on tree determination to align protein sequences. As pointed out above, programs with a tree-determination step were more conservative in inserting gaps than programs that lack this step, generally achieving better final accuracies.

BAliBASE
It was important to determine if the results obtained from the Simprot-generated sequences were applicable to alignments from actual proteins. We considered the accuracy of the nine programs on the latest version of BAliBASE alignments (Figure 1). Overall we found results similar to those obtained on the simulated sequences in that ProbCons and Mafft using strategy L-INS-i appeared to have the best performance. In BAliBASE's reference RV11, containing equidistant sequences sharing less than 20% identity, ProbCons and Mafft L-INS-i were not statistically different according to the Wilcoxon signed ranks test (p > 0.05). The same result with no statistical separation was observed in reference RV20, which is composed of sequences from divergent subfamilies, in reference RV30, comprised of sequences from protein families with some highly diverged sequences, and in reference RV50, made of sequences with large insertions.

ProbCons did perform significantly better than Mafft L-INS-i in the set RV12 that contains equidistant sequences sharing between 20 and 40% identity. Mafft L-INS-i and


Figure 8
Alignment accuracy comparison when increasing terminal gaps were inserted in the alignments. The program rankings are much different than in other situations analyzed. Mafft L-INS-i and ProbCons were the top performers, followed closely by T-Coffee, Muscle and Mafft FFT-NS-2. t is the parameter used to determine the length of inserted terminal gaps in Simprot.

T-Coffee were not statistically different (Wilcoxon signed ranks test, p > 0.05). Conversely, on reference RV40, composed of protein sequences with large extensions, Mafft L-INS-i outperformed all other packages, with ProbCons and T-Coffee not far behind and not significantly different.

When results for all references are analyzed together, the same pattern observed from the isolated references was also found. In this broader scenario, ProbCons and Mafft L-INS-i achieved the best results and the difference in final alignment accuracy is not statistically significant (Wilcoxon signed ranks test, p > 0.05). An intermediary pack is formed of two distinct groups (defined by Wilcoxon signed ranks test) where Muscle and T-Coffee did slightly better than Mafft FFT-NS-2, Kalign and Clustal W. Showing the poorest performance for the whole database set were Dialign-T and Dialign2.2, which were statistically indistinguishable, and POA. Overall, the results from Simprot and BAliBASE data sets were consistent, with the exception of Mafft FFT-NS-2, which ranked significantly lower on BAliBASE data sets than on Simprot's. These results corroborate in part the findings of Lassmann and Sonnhammer [9], which showed T-Coffee as the best available algorithm at the time for BAliBASE v2 alignments. Their result also indicated POA as the program with the poorest performance.

Speed of execution
Mafft FFT-NS-2 was the fastest program for all tested sequence sizes (Figure 12). T-Coffee, as shown before [9], had the worst speed, with an average alignment time for the smallest sequence set (100 amino acids) longer than for Clustal W, Mafft (FFT-NS-2 and L-INS-i), Kalign and POA when aligning the largest set. ProbCons had the second worst average time for most sequence sizes.
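The timings above were obtained with the system time command and averaged over the simulated sets (see Performance evaluation in the Methods). A comparable measurement can be sketched in Python as below, assuming a Unix-like system where the `resource` module is available. The helper names are hypothetical, and the command list is a placeholder that stands in for the real invocation of an alignment program on each simulated set; no aligner flags are implied.

```python
import resource
import subprocess
import sys


def child_cpu_seconds(command):
    """Run `command` and return the CPU time (user + system) it consumed."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return user + system


def average_cpu_seconds(commands):
    """Average the CPU time over one command per simulated sequence set."""
    times = [child_cpu_seconds(command) for command in commands]
    return sum(times) / len(times)


# Placeholder workload: three identical dummy commands instead of aligner runs.
placeholder = [sys.executable, "-c", "sum(range(100000))"]
mean_cpu = average_cpu_seconds([placeholder] * 3)
```

CPU time rather than wall-clock time is used here because it is less sensitive to other processes running on the same machine.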


Figure 9
Alignment accuracy compared to decreasing c values. The lower the c value, the longer the indels. Both examples show the modest effect of longer indels on the alignment loss of accuracy. ProbCons and Mafft L-INS-i are the top performers, followed by T-Coffee, Mafft FFT-NS-2, Muscle and Kalign respectively. There is a bottom group formed by the remainder of the programs.

Discussion
Overall, Mafft L-INS-i and ProbCons generated the best alignments on our test data, including simulated sequences and BAliBASE's v3.0 reference sets, while POA, Dialign2.2, Dialign-T and Clustal W had the worst accuracy. The intermediary group, formed by T-Coffee, Muscle, Mafft FFT-NS-2 and Kalign, in some cases presented similar results to the top two algorithms, especially T-Coffee and Mafft FFT-NS-2 in tests with short evolutionary distances and low gap frequency and length. This showed the quality of the algorithms and that different approaches to sequence alignment can converge on a very similar MSA.

Additionally, we only tested the programs with their default parameters; different program configurations might improve their accuracy. Our results are consistent with those previously reported in the original articles of Mafft L-INS-i [25] and ProbCons [28], where they ranked top with the best accuracy on BAliBASE v2 alignments.

In this work, it could be observed that all programs have strengths and weaknesses, and among the best performers Mafft has the most flexible algorithm. The recent additions to the program certainly contributed to improving alignment accuracy. Mafft also has a very fast algorithm even when aligning iteratively. It has been suggested that Mafft's accuracy could be increased by incorporating structural information [25]. ProbCons had very similar results and sometimes performed even better than Mafft L-INS-i, but it is the second slowest program overall. The alignment power of its algorithm is excellent, even though it does not consider any biological aspect of the sequences when performing an MSA.

In the intermediary group, T-Coffee and Muscle were the better alternatives, considering that Mafft FFT-NS-2 did not perform as well as the iterative approaches, and Kalign showed inconsistent results, in most cases faring below the other three programs. T-Coffee generates good alignments and has the merit of combining alignments from different sources [25], but the processing time is the worst for every


Figure 10
Alignment accuracy compared to increasing gamma α. The programs' accuracy decreased with α values up to 5, reversed when the value increased to 10. POA is the most affected by the gamma α increase, while Mafft L-INS-i and ProbCons are the least affected.
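The role of the gamma shape parameter can be illustrated numerically: the lower the α, the larger the share of sites with a low substitution rate. The sketch below draws site rates from a gamma distribution with shape α and scale 1/α, so that the mean rate is 1, and reports the fraction of slow sites for the α values used in the study. The mean-one parameterization, the threshold of 0.1 for a "slow" site, the number of sites and the seed are assumptions made for this illustration, not settings taken from Simprot.

```python
import random


def slow_site_fraction(alpha, n_sites=20000, threshold=0.1, seed=1):
    """Fraction of sites whose relative rate falls below `threshold`."""
    rng = random.Random(seed)
    # Shape alpha with scale 1/alpha gives a mean relative rate of 1.
    rates = [rng.gammavariate(alpha, 1.0 / alpha) for _ in range(n_sites)]
    return sum(rate < threshold for rate in rates) / n_sites


# The five alpha values examined in the study.
fractions = {alpha: slow_site_fraction(alpha) for alpha in (0.1, 0.7, 1, 5, 10)}
```

With α = 0.1 most sites are nearly invariant, whereas with α = 5 or 10 almost no site is that slow and the rates cluster around the mean.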

sequence size. Muscle, on the other hand, is an iterative program that produces good quality alignments, often comparable to T-Coffee and Mafft FFT-NS-2, with the advantage of being extremely fast. Muscle allows an increase in the number of iterative steps in its procedure (not tested here) that can probably improve its final alignment quality. Kalign presented accuracy values that were in most cases lower than those of the other three programs in this intermediary group, but showed very good results at low indel frequency values. The packages with the poorest performance, Clustal W, POA, Dialign-T and Dialign2.2, also present qualities such as the rapid assembly of accurate MSAs of closely related sequences with a low number of indels. These programs may be employed to create an initial alignment that can be further improved with another algorithm. Clustal W showed good accuracy results in the alignment of short sequences with indels, but had a steady decline when the length of the sequence containing indels was increased. Although Dialign-T developers claim that the program's new implementation generates better results than version 2.2, this could not be seen in our results. In the simulated sequence analysis, Dialign-T was inconsistent, sometimes as accurate as Clustal W, while otherwise comparable to or worse than Dialign2.2. Dialign-T's accuracy was originally tested solely on alignment databases (BAliBASE v2.1 and IRMbase) [29]. When evaluated against a more diverse collection of protein sequences one can see that the program does not fare as well as claimed initially.

Programs that have a tree-building step in their alignment procedure seemed to produce better results than programs that do not build a phylogenetic tree or cluster in their alignment process. Of the bottom four performers, only Clustal W builds a Neighbor Joining tree to guide the multiple sequence alignment. According to POA developers, their program is more suited for alignment of multidomain sequences, and a way to improve the algorithm would be to use a Clustal-like progressive alignment with a guide tree [20]. Also, our results demonstrate that POA had a tendency to insert large gaps in the sequences' terminal regions, which inflated the average number of gaps per sequence. This led to the low accuracy values generated by POA and a visual inspection of a


Figure 11
Number of gaps inserted by programs relative to the known alignment. The curves show the difference in the number of gaps relative to the known alignment generated by Simprot. Kalign had the smallest difference for most of the sequence lengths, while POA was the program with the largest difference. Above zero are programs that do not generate a guide tree for multiple alignment, while below zero are the programs that do generate a guide tree.
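The quantity compared in this figure, the average number of gaps per sequence in a program's alignment relative to the known alignment, can be sketched as follows. The text does not state whether a gap is counted per gap character or per contiguous run. This sketch assumes runs of "-" (one run corresponds to one indel event), and the function names and toy alignments are hypothetical.

```python
import re

# A gap is taken here to be a maximal run of "-" characters (an assumption).
_GAP_RUN = re.compile(r"-+")


def mean_gaps_per_sequence(alignment):
    """Average number of gap runs per aligned sequence."""
    total = sum(len(_GAP_RUN.findall(row)) for row in alignment)
    return total / len(alignment)


def gap_difference(test, reference):
    """Positive when the test alignment has more gaps than the reference."""
    return mean_gaps_per_sequence(test) - mean_gaps_per_sequence(reference)


# Toy example (hypothetical data): the test alignment merges two gaps into one.
reference = ["AC--GT-A", "ACTTGTCA", "A---GTCA"]
test = ["ACGT---A", "ACTTGTCA", "A---GTCA"]
difference = gap_difference(test, reference)
```

A negative difference corresponds to a program that is conservative in inserting gaps, as described for the guide-tree programs in the caption above.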

considerable number of resulting sets revealed that the intermediary regions of the alignments were consistent with the results of other packages. It was also shown before that global alignment programs usually perform better than local alignment algorithms such as Dialign2.2 and Dialign-T [33]. These two programs seemed to be more suitable for aligning sequences with high local similarities that were shuffled by recombination. Both programs were among the least conservative in inserting gaps, which may explain the low alignment accuracy values obtained.

Among the programs that use a guide tree in their multiple alignment procedure, hierarchical clustering outperformed UPGMA with modified linkage (in Mafft's non-iterative approach), Neighbor Joining and UPGMA (Muscle and Kalign). Clustal W had the lowest accuracy in the group but in many cases it outperformed the programs that lack tree-building capabilities, probably because of its profile alignment procedure. Kalign, which uses UPGMA (Neighbor Joining is an option), had superior results in comparison to Clustal W, which might be explained by the distinct algorithm that calculates pairwise distances. T-Coffee showed better accuracy than Clustal W and Kalign, maybe because of the incorporation of Lalign alignments in its algorithm, which improved the pairwise alignment generated by a Clustal W-like process. At the same time, Muscle performed as well as T-Coffee and Mafft FFT-NS-2, showing that all tree-building methods might be equivalent. UPGMA with modified linkage had an edge when the iterative capabilities of Mafft were employed. Finally, hierarchical clustering, which does not incorporate biological concepts in its calculation, was better than the biology-based tree determination methods in some of the tested scenarios.

Regarding factors that influence alignment accuracy, indel number is surely the one with the largest effect. The overall performance of all programs decreased proportionally to the increase in gap frequency and to a lesser extent indel size. This was shown by increasing indel frequency alone,


Figure 12
CPU time spent by each program when aligning increasing sequence lengths. Mafft FFT-NS-2 is clearly the fastest algorithm while T-Coffee is the slowest among all programs tested.

indel length alone and both combined. Of these two parameters, indel number seems to have more consequences for accuracy loss than indel length alone, maybe because most alignment algorithms have a tendency to merge gaps. Larger evolutionary distance plays a role in the quality of MSA, and this might be related to an increased number of indel events with longer branch lengths. In cases with both low indel frequency and length, sequences were aligned by all packages accurately even for simulations based on trees with long branches. These results show that some alignment programs tend to be conservative with respect to inserting gaps, as the loss of accuracy is mainly due to inferred alignments having fewer and shorter gaps than the known alignments. Although different programs have distinct allowances for terminal gaps, we showed that terminal gaps did not have a large effect on alignment accuracy, regardless of their length.

Conclusion
Our analysis reveals that Mafft is the best choice for protein sequence alignment, based on its overall alignment quality and processing speed. Other algorithms, however, cannot be dismissed as they showed very good results for some evolutionary scenarios. By comparing accuracy and date of publication of the programs (Figure 1), it seems that overall alignment quality has generally improved over time, but there is still room for improvement as alignment accuracy is still fairly low in many cases.

With the advent of Simprot, there is another alternative to assess MSA performance. Our study shows that Simprot

Table 1: Factors analyzed in the alignment simulations, related program parameters and values used to simulate the sequences.

Factor                           Simprot parameter   Values                           Description
sequence length                  -r                  50, 100, 150, 200, 250, 300      length (in amino acids) of the root sequence
indel frequency                  -g                  0.03, 0.05, 0.1, 0.15, 0.2, 0.3  expected indel frequency (number of indels/aa) for 100PAM
gamma alpha                      -x                  0.1, 0.7, 1, 5, 10               shape parameter of the gamma distribution of evolutionary rates
evolutionary scale factor        -c                  2, 3, 4                          controls the expected length of indels according to the generalized Qian-Goldstein distribution
branch length scale multiplier*  -b                  2, 3, 4                          scale lengths of all branches in the input tree equally
terminal gaps insertion          not available       1, 2, 3, 4, 5                    controls the frequency and lengths of terminal gaps (as a function of internal gap parameters)

* only for tree A (Figure 2)


and simulated sequences present a reliable approach to check alignment quality. This methodology proved to be more flexible and able to generate a broader range of alignment classes in comparison to methods used in the past. Although the final conclusions were similar while using our method and BAliBASE sets of protein alignments, our methodology allows us to determine in more detail the strengths and weaknesses of each alignment program and its algorithmic approach. Simprot also proved to be a suitable alternative for alignment quality testing. In conclusion, the ability to create large simulated alignment sets in seconds, with full control of their characteristics, allows a quick and reliable analysis of different evolutionary histories, some of them not available in the current database sets.

Methods
BAliBASE
All five BAliBASE data sets, including sub-references, were aligned using the nine programs described above and the obtained alignment was compared to the original alignment file provided. Accuracy was measured for each alignment and the average of each program was compared separately for each reference and overall for the whole database. A Wilcoxon signed ranks test was used to assess statistical significance of the results.

Simprot simulated sequences
Simprot was used to simulate sets of protein sequences in different evolutionary conditions. This simulation program requires a phylogenetic tree as its initial input in order to generate a file with the known alignment, which is determined from the known evolutionary history of the sequences. In this study, we used five different bifurcating trees, attempting to include general scenarios of evolution with distinct topologies and characteristics in trees obtained from protein MSA and trees created artificially (Figure 2). Two of these trees, tree B (44 taxa) and tree C (20 taxa), were obtained based on PFAM alignments [34]. Artificial trees D and E, both with 16 taxa, had identical topologies and maximum evolutionary distance, differing in when and where evolution had occurred. Finally, tree A also contained 16 taxa in a bifurcating topology, with a large monophyletic clade of 12 taxa and a small sister clade of four taxa. All branches were of equal length except at the node that supported the larger clade, which was twice the size of the other branches.

Alignment accuracy evaluation
Another critical step when comparing the results from different algorithms is the alignment scoring. There are many scoring functions available. BAliBASE creators also have introduced two distinct scores for the comparison of an alignment against a reference set: column score (CS) and sum-of-pair score (SP) [33]. The SP score is defined as the number of correctly aligned residue pairs that are found in the test alignment divided by the total number of aligned residue pairs of the reference alignment, increasing with the number of sequences aligned correctly. SP calculation takes into account pairs of aligned residues occurring in both MSA, while the CS calculation only checks for identical columns in each set of aligned sequences. SP is also known as fD, the developer's score [35,36]. A third option is the modeler's score (fM), which indicates the fraction of residue pairs in the test alignment that are correctly aligned in comparison to the reference [37].

Alignment accuracy, either for BAliBASE sequences or Simprot simulated sequences, was measured using the developer's score and the modeler's score [35]. Both scores are calculated as

$$f_D = \frac{c}{r}, \qquad f_M = \frac{c}{t}$$

where c is the number of residue pairs in the test alignment that are correctly aligned with respect to the reference alignment, r is the number of aligned residue pairs in the reference alignment and t is the number of aligned residue pairs in the test alignment. Both scores have a maximum value of 1 (all pairs correctly aligned), and a minimum equal to 0 (no pairs are correctly aligned). The developer's and modeler's scores have been the most widely used in alignment score assessment and were featured in a comparison of profile alignment scoring by Edgar and Sjölander [38].

The two alignment scoring functions tested yielded very similar results. In most cases, the modeler's score resulted in a value slightly lower than the developer's score, while in very few cases a modest improvement was observed. We therefore decided to present here only the results obtained using the developer's score.

Performance evaluation
All programs' performance was also tested in aligning sequences of different sizes. We did not evaluate the speed variation regarding the number of input sequences, due to restrictions in some programs in aligning large numbers of sequences. Using Simprot's default parameters and tree B (44 taxa) as input, ten sets of simulated protein sequences were generated in seven lengths: 100, 200, 300, 400, 500, 1000 and 1500 amino acids. Total CPU time, calculated by the system time command, was averaged. All programs were run on a dual 3.0 GHz Xeon with 4 GB of memory, running openMosix with Linux kernel 2.4.22.

Abbreviations
CS: column score


FFT: Fast Fourier Transform

indel: insertion/deletion

MSA: multiple sequence alignment

PO: Partial Order

PO-MSA: Partial Order-Multiple Sequence Alignment

SP: sum-of-pair score

WSP: weighted sum-of-pairs

Authors' contributions
PASN designed the study, conducted the analysis and drafted the manuscript. ZW wrote the application to measure alignment scores. ERMT participated in the design and coordination of the study and helped to draft the manuscript. All authors have read and approved the final manuscript.

Acknowledgements
We thank Daniel Lei for helping in the analysis, RL Charlebois for critical reading of the manuscript and Joy Abramson for Simprot support. We also would like to thank both anonymous reviewers for suggestions. This work was supported by the Canadian Institute for Health Research (CIHR).

References
1. Thompson J, Koehl P, Ripp R, Poch O: BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005, 61:127-36.
2. Edgar R: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113.
3. Walle IV, Lasters I, Wyns L: SABmark-a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005, 21(7):1267-8.
4. Karplus K, Hu B: Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 2001, 17(8):713-20.
5. Rosenberg M: MySSP: Non-stationary evolutionary sequence simulation, including indels. Evol Bioinformatics Online 2005, 1:51-53.
6. Cartwright R: DNA assembly with gaps (Dawg): simulating sequence evolution. Bioinformatics 2005, 21(Suppl 3):iii31-iii38.
7. Rosenberg M: Evolutionary distance estimation and fidelity of pair wise sequence alignment. BMC Bioinformatics 2005, 6:102.
8. Rosenberg M: Multiple sequence alignment accuracy and evolutionary distance estimation. BMC Bioinformatics 2005, 6:278.
9. Lassmann T, Sonnhammer E: Quality assessment of multiple alignment programs. FEBS Lett 2002, 529:126-30.
10. Stoye J, Evers D, Meyer F: Rose: generating sequence families. Bioinformatics 1998, 14(2):157-63.
11. Pang A, Smith A, Nuin P, Tillier E: SIMPROT: using an empirically determined indel distribution in simulations of protein evolution. BMC Bioinformatics 2005, 6:236.
12. Qian B, Goldstein R: Distribution of Indel lengths. Proteins 2001, 45:102-4.
13. Yang Z: Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Mol Biol Evol 1993, 10(6):1396-401.
14. Thompson J, Higgins D, Gibson T: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22(22):4673-80.
15. Morgenstern B: DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics 1999, 15(3):211-8.
16. Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci USA 1996, 93(22):12098-103.
17. Notredame C, Higgins D, Heringa J: T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302:205-17.
18. Huang X, Hardison R, Miller W: A space-efficient algorithm for local similarities. Comput Appl Biosci 1990, 6(4):373-81.
19. Sander C, Schneider R: Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins 1991, 9:56-68.
20. Lee C, Grasso C, Sharlow M: Multiple sequence alignment using partial order graphs. Bioinformatics 2002, 18(3):452-64.
21. Needleman S, Wunsch C: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48(3):443-53.
22. Smith T, Waterman M: Identification of common molecular subsequences. J Mol Biol 1981, 147:195-7.
23. Edgar R: MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics 2004, 5:113.
24. Hirosawa M, Totoki Y, Hoshida M, Ishikawa M: Comprehensive study on iterative algorithms of multiple sequence alignment. Comput Appl Biosci 1995, 11:13-8.
25. Katoh K, Kuma K, Toh H, Miyata T: MAFFT version 5: improvement in accuracy of multiple sequence alignment. Nucleic Acids Res 2005, 33(2):511-8.
26. Gotoh O: A weighting system and algorithm for aligning many phylogenetically related sequences. Comput Appl Biosci 1995, 11(5):543-51.
27. Katoh K, Misawa K, Kuma K, Miyata T: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier transform. Nucleic Acids Res 2002, 30(14):3059-66.
28. Do C, Mahabhashyam M, Brudno M, Batzoglou S: ProbCons: Probabilistic consistency-based multiple sequence alignment. Genome Res 2005, 15(2):330-40.
29. Subramanian A, Weyer-Menkhoff J, Kaufmann M, Morgenstern B: DIALIGN-T: an improved algorithm for segment-based multiple sequence alignment. BMC Bioinformatics 2005, 6:66.
30. Lassmann T, Sonnhammer E: Kalign-an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 2005, 6:298.
31. Wu S, Manber U: Fast text searching allowing errors. Communications of the ACM 1992, 35:83-91.
32. Veerassamy S, Smith A, Tillier E: A transition probability model for amino acid substitutions from blocks. J Comput Biol 2003, 10(6):997-1010.
33. Thompson J, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs. Bioinformatics 1999, 15:87-8.
34. Bateman A, Coin L, Durbin R, Finn R, Hollich V, Griffiths-Jones S, Khanna A, Marshall M, Moxon S, Sonnhammer E, Studholme D, Yeats C, Eddy S: The Pfam protein families database. Nucleic Acids Res 2004:D138-41.
35. Sauder J, Arthur J, Dunbrack R: Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins 2000, 40:6-22.
36. Kahsay R, Wang G, Dongre N, Gao G, Dunbrack R: CASA: a server for the critical assessment of protein sequence alignment accuracy. Bioinformatics 2002, 18(3):496-7.
37. Zachariah M, Crooks G, Holbrook S, Brenner S: A generalized affine gap model significantly improves protein sequence alignment accuracy. Proteins 2005, 58(2):329-38.
38. Edgar R, Sjölander K: A comparison of scoring functions for protein sequence profile alignment. Bioinformatics 2004, 20(8):1301-8.
