A survey on the algorithm and development of multiple sequence alignment
https://doi.org/10.1093/bib/bbac069
Review
Yongqing Zhang is an associate professor in the School of Computer Science at the Chengdu University of Information Technology. He is a senior member of CCF. His research interests include machine learning and bioinformatics.
Qiang Zhang is a graduate student in the School of Computer Science at the Chengdu University of Information Technology. His research interests include deep learning and bioinformatics.
Jiliu Zhou is a professor in the School of Computer Science at the Chengdu University of Information Technology. His research interests include intelligent computing and image processing.
Quan Zou is a professor in the Institute of Fundamental and Frontier Science at the University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. His research interests include bioinformatics and machine learning.
Received: December 9, 2021. Revised: January 30, 2022. Accepted: February 9, 2022
© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
2 | Zhang et al.
Figure 1. This figure shows various types of SA. (A) shows PSA and global
SA, (B) shows MSA and global SA and (C) shows MSA and local SA.
sequences homologous or similar to that sequence [26]. However, with the continued development of sequencing technology, the cumulative number of sequences has grown beyond what can be searched manually, and finding similar sequences in a sequence database has become a challenge in its own right.

Sequence database search tools mainly include FASTA [27] and BLAST [26]. FASTA, derived from FASTAP, is the first program widely used in DNA and protein database searching. The search process of FASTA is as follows: first, build a dictionary of sequence fragments of ktup-length, then match the same or similar sequence

They analyzed the specificity required by SARS-CoV-2 spike proteins to cause human infection [38]. The results showed that the N-terminal sequence regions 'MESEFR' and 'SYLTPG' were specific to human SARS-CoV-2. In the receptor-binding domain, two sequence regions, 'VGGNY' and 'EIYQAGSTPCNGV', and a disulfide bond connecting 480C and 488C are structural determinants for recognizing human ACE-2 receptors. The analysis of virus variants also depends on the MSA results. Through the study of base substitution, deletion, variation and single nucleotide polymorphisms (SNP) in the alignment, we can understand the evolution and classification of
Databases of biomolecules

Figure 3. The figure demonstrates the relationship among five sub-databases of UniProt [67].
the rDNA barcoding identification of bacterial and fungal pathogens in vitreous fluids of endophthalmitis patients [78], and the study on the molecular variation of ABO in the Chinese Han population [79].

Recently, there have been several studies similar to ClustalW. For example, Kalign [13], developed by Lassmann and Sonnhammer in 2005, uses the Wu-Manber character matching algorithm when calculating the distance matrix. GramAlign, proposed by Russell et al. [72], uses a grammar-based distance derived from the Lempel–Ziv compression algorithm to compute the distance matrix. In 2021, Maiolo et al. introduced the Poisson Indel Process (PIP) model and a dynamic programming algorithm to calculate the distance matrix, which effectively reduced the complexity of the whole method [75]. In the optimization of the objective function, Al-Shatnawi adopted a variable gap penalty in MSAIndelFR to improve the accuracy of alignment in 2015 [73].

To further improve the accuracy and avoid the severe pitfalls of the greedy strategy, T-Coffee (Tree-based Consistency Objective Function For alignmEnt Evaluation) was presented by Notredame and Higgins in 2000; it is based on the progressive alignment algorithm and consistency [71]. The T-Coffee package integrates a variety of MSA methods with different characteristics, such as low memory overhead or higher accuracy [80, 81]. For example, Garriga et al. [76] proposed an MSA method based on the T-Coffee regressive and progressive alignment algorithms. In their method, sequences are first clustered and then aligned starting from the most distant ones. Their experiments show that the technique can significantly improve alignment accuracy in large-scale datasets.

As the most classic algorithm, the progressive alignment algorithm has the advantages of clarity and speed. However, its accuracy is relatively unsatisfactory, and its performance is easily affected by the number of sequences. Therefore, in addition to the progressive alignment algorithm, most methods, such as MUSCLE [15] and MAFFT [14], also use other techniques to improve the accuracy of alignment.

Iterative algorithm
The iterative algorithm continuously optimizes the alignment to improve the accuracy until it converges. Iterative algorithms can generally be divided into deterministic and stochastic [23]. The calculation results of a deterministic iterative algorithm are the same every time. In contrast, the results of a stochastic iterative algorithm may differ between runs due to the internal use of random values. In this section, we mainly discuss the deterministic iterative algorithm. As shown
Figure 5. Given seven protein sequences, the distance matrix of paired sequences is calculated first, then the NJ algorithm is used to construct the guide tree, and finally, progressive alignment is used to align the sequences gradually [11].
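The first two stages in Figure 5 (distance matrix, then guide tree) can be illustrated with a minimal sketch. Two simplifications are assumptions of this example and not part of the surveyed methods: distances are computed from shared k-mers instead of pairwise alignments, and the tree is built with average-linkage clustering (UPGMA) as a shorter stand-in for NJ. The function names `kmer_distance` and `guide_tree` are hypothetical.

```python
def kmer_distance(a, b, k=2):
    """Distance in [0, 1]: one minus the Jaccard similarity of the k-mer sets.

    Assumes both sequences have at least k characters.
    """
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    return 1.0 - len(ka & kb) / len(ka | kb)


def guide_tree(seqs, k=2):
    """Build a guide tree as nested tuples of sequence indices (UPGMA)."""
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (i, 1) for i in range(len(seqs))}
    # distance matrix stored as {(smaller id, larger id): distance}
    dist = {(i, j): kmer_distance(seqs[i], seqs[j], k)
            for i in range(len(seqs)) for j in range(i + 1, len(seqs))}
    next_id = len(seqs)
    while len(clusters) > 1:
        a, b = min(dist, key=dist.get)          # closest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_dist = {}
        for c in clusters:                      # size-weighted average distance
            d_ac = dist[(min(a, c), max(a, c))]
            d_bc = dist[(min(b, c), max(b, c))]
            new_dist[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        dist = {p: d for p, d in dist.items() if a not in p and b not in p}
        dist.update(new_dist)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


tree = guide_tree(["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"])
print(tree)
```

A progressive aligner would then walk this tree from the leaves to the root, aligning the two children of each internal node in turn.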
in both performance and accuracy. In 2015, Mirarab et al. [86] proposed PASTA based on SATé; this method adopts a parallel strategy to expand the scale of SA.

To construct codon-correct alignments quickly, Libin et al. proposed VIRULIGN [87] in 2019, which exploits the mutual conversion between codons and amino acids. Specifically, VIRULIGN first computes pairwise alignments between each target sequence and the reference sequence. Second, in the same way as the first step, it calculates the alignment between the target sequence and the reference sequence represented as amino acids. Then adjust the target sequence according

genetic algorithms or swarm intelligence algorithms to improve efficiency.

Genetic algorithm
A genetic algorithm is a method that simulates biological evolution in nature. It transforms the problem-solving process into selection, crossover, mutation and other operations on gene-like encodings, analogous to natural chromosomes, and obtains better optimization results through multiple iterations [97]. The related algorithms are shown in Table 4.

Notredame and Higgins [16] first proposed and devel-
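The selection, crossover and mutation loop of a genetic algorithm can be sketched for a toy MSA problem. This is an illustration under stated assumptions, not a reimplementation of any surveyed program: an individual is a list of gapped rows of a fixed width, fitness is a simple sum-of-pairs count of identical residues, crossover takes each row from one of two parents, and mutation re-draws the gap positions of one row. All names (`sp_score`, `random_row`, `evolve`) and parameter values are hypothetical.

```python
import random


def sp_score(rows):
    """Fitness: number of identical residue pairs over all columns."""
    score = 0
    for col in zip(*rows):
        for j in range(len(col)):
            for k in range(j + 1, len(col)):
                if col[j] != "-" and col[j] == col[k]:
                    score += 1
    return score


def random_row(seq, width, rng):
    """Place the residues of seq, in order, into width cells; the rest are gaps."""
    gaps = set(rng.sample(range(width), width - len(seq)))
    residues = iter(seq)
    return "".join("-" if i in gaps else next(residues) for i in range(width))


def evolve(seqs, width, pop_size=30, generations=50, seed=0):
    rng = random.Random(seed)
    pop = [[random_row(s, width, rng) for s in seqs] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=sp_score, reverse=True)
        nxt = pop[:2]                                    # elitism: keep the two best
        while len(nxt) < pop_size:
            p1 = max(rng.sample(pop, 3), key=sp_score)   # tournament selection
            p2 = max(rng.sample(pop, 3), key=sp_score)
            child = [rng.choice(pair) for pair in zip(p1, p2)]  # row-wise crossover
            if rng.random() < 0.3:                       # mutation: re-draw one row's gaps
                r = rng.randrange(len(child))
                child[r] = random_row(child[r].replace("-", ""), width, rng)
            nxt.append(child)
        pop = nxt
    return max(pop, key=sp_score)


seqs = ["ACGT", "AGT", "ACT"]
best = evolve(seqs, width=5)
print("\n".join(best), sp_score(best))
```

Row-wise crossover is chosen here because every row is independently a valid gapped version of its sequence, so any mix of rows from two parents is still a valid alignment.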
by natural behavior can be regarded as a swarm intelligence algorithm. Since the advent of the swarm intelligence algorithm, dozens of algorithms have been derived, including the ant colony algorithm, particle swarm optimization (PSO) and the artificial bee colony. In the field of MSA, swarm intelligence optimization is also applied, as shown in Table 5.

Long et al. [107] described an MSA method based on binary SAGA in 2009. This method starts with multiple alignments and then continuously modifies each alignment until convergence. The results show that most of the alignment scores of their method are better than those of ClustalW, DiAlign, SAGA and T-Coffee. Besides, Chaabane proposed the Particle Swarm Optimization and Simulated Annealing (PSOSA) hybrid model in 2018 [113], which combines PSO and simulated annealing to make full use of the former's exploration ability and the latter's exploitation ability. The experiment shows that its performance is better than that of the MUSCLE, MAFFT and ClustalW programs.

In addition to PSO, there are many optimization algorithms based on animal behavior in MSA. For example, in 2009, Chen et al. proposed an ant colony algorithm-based method [108]. They use a dispersion graph to represent the alignment and then an ant colony algorithm to find an optimal path. The experimental results show that this method can obtain a better solution than ClustalX. Meanwhile, Xiang et al. [109] proposed a similar approach based on a genetic algorithm and a planar graph to optimize the search path. Kuang et al. [114] proposed an artificial bee colony-based method in 2018 that adopts multiple strategies to balance global exploration and local exploitation. The strategies include the following three aspects: tent chaos population initialization, neighborhood search and tournament selection. The results show that this method has better performance and biological characteristics and has strong robustness.

Some studies also adopt less commonly used algorithms. In 2017, Manikandan and Duraisamy proposed a method combining a bacterial foraging algorithm and a genetic algorithm based on the dispersion optimization principle of the individual and population of Escherichia coli [112]. They obtain individuals with higher fitness through bacterial foraging behavior to improve accuracy and speed. The result shows that this method is better than other swarm intelligence algorithms on multiple benchmarks. In 2019, Hussein et al. adopted the flower pollination algorithm to optimize the MSA [115]. Besides, they also proposed a new profile algorithm to improve alignment quality. In their comparison, this method is superior to all other methods except MSAProbs.

Hidden Markov model
The hidden Markov model (HMM) is a statistical analysis model, which is used to describe a Markov process with unknown parameters [119]. In short, HMM can calculate the state transition
probability matrix by analyzing the object's previous and current observable states and inferring the next state of the object through the probability matrix. Due to the Markov property of MSA, some researchers proposed several MSA methods based on HMM, as shown in Table 6.

ProbCons (Probabilistic Consistency) is the first program to use HMM for MSA, developed by Do et al. in 2005 [17]. ProbCons adopts a novel objective function based on probabilistic consistency, which combines the posterior probability matrix derived from HMM and the alignment consistency to integrate the conservation information of sequences. From the results, ProbCons is more accurate than Align-m, ClustalW, MUSCLE and other programs. In 2006, Roshan and Livesay made some improvements based on ProbCons and proposed Probalign (Probabilistic alignment) [84]. The difference is that Probalign uses the partition function instead of an HMM to calculate the posterior probability matrix.

Efficient computation with high accuracy is an urgent need. Therefore, to solve this problem, Liu et al. [122] developed a program, MSAProbs, based on ProbCons and Probalign. To enhance the performance, MSAProbs calculates sequence weights, which participate in the calculation of the probabilistic consistency score and in the progressive alignment. The result shows that MSAProbs has higher accuracy than ClustalW, MAFFT, MUSCLE, ProbCons and Probalign. Since then, the authors proposed MSAProbs-MPI, which adds parallel, distributed-memory computation to MSAProbs and can cope with larger datasets [126, 127]. In 2019, Zhan et al. [125] proposed the ProbPFP method, which mainly draws on the above techniques. However, ProbPFP is unique in that it introduces PSO to optimize the parameters of the HMM.

ClustalW has great limitations on large-scale data; therefore, Sievers et al. developed Clustal Omega in 2011. Clustal Omega is based on the mBed algorithm and HMM. Clustal Omega can process tens of thousands of sequences, and its accuracy is greatly improved. Besides, to improve the alignment efficiency, Clustal Omega can use existing alignments in the database to assist the alignment of new sequences [123, 128]. In 2020, Pachetti et al. [129] applied Clustal Omega to the MSA of SARS-CoV-2 genomes and found eight new recurrent mutations of the virus.

Machine learning
Machine learning is still in the initial stage in MSA. In 2016, Mircea et al. [18] first proposed an MSA method based on reinforcement learning. This method is similar to the progressive alignment algorithm: it finds an optimal alignment order and then aligns the sequences in that order. They use Q-learning, a reinforcement learning method based on dynamic programming, to calculate the optimal alignment order of sequences and use this order to construct the alignment gradually. In 2019, Jafari et al. [20] built on the work of Mircea et al. and replaced Q-learning with the A3C (Asynchronous Advantage Actor Critic) model from deep reinforcement learning, which improved the convergence speed and, to a certain extent, reduced the possibility of falling into a local optimum.
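The idea of learning an alignment order with Q-learning can be sketched with a tabular toy example. It is not the formulation of Mircea et al.: the state (the tuple of sequence indices chosen so far), the reward (shared 2-mers between the last chosen sequence and the next one, standing in for a pairwise alignment score) and all hyperparameters are assumptions made for illustration, and `similarity` and `learn_order` are hypothetical names.

```python
import random


def similarity(a, b):
    """Toy reward: number of distinct 2-mers shared by a and b."""
    kmers = lambda s: {s[i:i + 2] for i in range(len(s) - 1)}
    return len(kmers(a) & kmers(b))


def learn_order(seqs, episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    n = len(seqs)
    q = {}  # (state, action) -> estimated return

    def actions(state):
        return [i for i in range(n) if i not in state]

    def best_value(state):
        return max((q.get((state, a), 0.0) for a in actions(state)), default=0.0)

    for _ in range(episodes):
        state = ()
        while len(state) < n:
            acts = actions(state)
            if rng.random() < epsilon:                       # explore
                a = rng.choice(acts)
            else:                                            # exploit
                a = max(acts, key=lambda x: q.get((state, x), 0.0))
            reward = similarity(seqs[state[-1]], seqs[a]) if state else 0.0
            nxt = state + (a,)
            old = q.get((state, a), 0.0)
            # standard Q-learning update
            q[(state, a)] = old + alpha * (reward + gamma * best_value(nxt) - old)
            state = nxt

    # greedy rollout of the learned policy gives the alignment order
    state = ()
    while len(state) < n:
        state += (max(actions(state), key=lambda x: q.get((state, x), 0.0)),)
    return list(state)


order = learn_order(["ACGTAC", "TTGGCC", "ACGTAA", "TTGGCA"])
print(order)
```

The returned order would then be handed to a progressive aligner, which adds one sequence at a time to the growing alignment.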
Table 7. Summary of divide-and-conquer-based methods
References   Description                                    Year
[21]         SpliVert, splitting-splicing, refinement       2020
[22]         FAME, common area, splice                      2020
[130]        MAGUS, divide-and-conquer, graph clustering    2021
[131]        FMAlign, FM-index, common segments             2021

To reduce the time and memory consumption of long SA, Naznooshsadat et al. [22] proposed FAME (FAst and MEmory), an MSA method based on splitting and splicing. Its process consists of the following three steps: (1) divide the set of sequences into groups of subsequences vertically and maintain the order of these subsequences on the original sequences. (2) Align the subsequences at common regions and align the unaligned subsequences with other MSA methods. (3) Merge all aligned subsequences. The results show that FAME can process large-scale data four times faster than the original method.
The alignment method for large-scale sequences
Figure 9. Two alignments for illustrating the calculation of TC score. In this figure, there are three correctly aligned columns (1st, 2nd and 5th) and six TCs. Therefore, TC score is 0.5 (3 divided by 6).

Table 8. Summary of programs
Program     Year   URL
ClustalW    1994   http://www.clustal.org/
DiAlign     1998   http://dialign.gobics.de/
T-Coffee    2000   http://tcoffee.crg.cat/
MAFFT       2002   https://mafft.cbrc.jp/alignment/software/
MUSCLE      2004   http://www.drive5.com/muscle/
ProbCons    2005   http://probcons.stanford.edu/
Kalign      2005   https://www.ebi.ac.uk/Tools/msa/kalign/
Probalign   2006   http://probalign.njit.edu/standalone.html
PRANK       2008   http://wasabiapp.org/software/prank/
MSAProbs    2010   http://msaprobs.sourceforge.net/
NX4         2019   https://www.nx4.io
FAME        2020   http://github.com/naznoosh/msa

$$\mathrm{WSP} = \sum_{i=1}^{m} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} w_{j,k} \ast s(c_i^j, c_i^k) \tag{2}$$

In addition to the weight difference, WSP also introduces the gap penalty. The gap penalty comprises the gap opening penalty and the gap extension penalty. The former is the penalty incurred by opening a gap, whereas the latter is the penalty incurred by each extension of a gap that has already been opened. The score $s(c_i^j, c_i^k)$ then has the following three cases. The first is to set the score according to PAM or BLOSUM when neither $c_i^j$ nor $c_i^k$ is a gap. The second is to set $s(c_i^j, c_i^k) = 0$ when both are gaps. The last is to set $s(c_i^j, c_i^k) = G$ when exactly one of $c_i^j$ and $c_i^k$ is a gap. The calculation of G is shown in Equation 3, where gop is the opening penalty for the gap, gep is the extension penalty value for the gap and n is the length of consecutive gaps.
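The WSP score of Equation 2, with the three cases for the pair score, can be written as a short sketch. Several choices here are assumptions for illustration: a match/mismatch score of +1/-1 stands in for a PAM or BLOSUM lookup, all weights default to 1, and because Equation 3 is not reproduced in this passage the gap score G is passed in as a plain value (`gap_score`) instead of being computed from gop, gep and the gap length. The names `pair_score` and `wsp` are hypothetical.

```python
from itertools import combinations


def pair_score(a, b, gap_score, match=1.0, mismatch=-1.0):
    """Score s for one pair of characters taken from the same column."""
    if a == "-" and b == "-":
        return 0.0                 # second case: both are gaps
    if a == "-" or b == "-":
        return gap_score           # third case: exactly one gap, score G
    return match if a == b else mismatch   # first case: substitution score


def wsp(alignment, gap_score=-2.0, weights=None):
    """Weighted sum-of-pairs over m columns and all pairs (j, k) of n rows."""
    n = len(alignment)
    m = len(alignment[0])
    total = 0.0
    for i in range(m):
        for j, k in combinations(range(n), 2):
            w = 1.0 if weights is None else weights[(j, k)]
            total += w * pair_score(alignment[j][i], alignment[k][i], gap_score)
    return total


print(wsp(["AC-T", "ACGT", "A--T"]))   # columns contribute 3, -3, -4 and 3
```

Passing `weights` as a dictionary keyed by `(j, k)` with `j < k` mirrors the $w_{j,k}$ term and lets closely related sequence pairs be down-weighted.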
Table 9. Summary of the benchmarking results of several MSA programs on BAliBASE 3.0 [123]
Program               TC scores for the six references            Average TC   Total running time
MSAProbs [122]        0.441  0.865  0.464  0.607  0.622  0.608    0.607        12382.00
Probalign [84]        0.453  0.862  0.439  0.566  0.603  0.549    0.589        10095.20
MAFFT [14]            0.439  0.831  0.450  0.581  0.605  0.591    0.588        1475.40
Probcons [17]         0.417  0.855  0.406  0.544  0.532  0.573    0.558        13086.30
Clustal Omega [123]   0.358  0.789  0.450  0.575  0.579  0.533    0.554        539.91
T-Coffee [71]         0.410  0.848  0.402  0.491  0.545  0.587    0.551        81041.50
Kalign [72]           0.365  0.790  0.360  0.476  0.504  0.435    0.501        21.88
MUSCLE [15]           0.318  0.804  0.350  0.409  0.450  0.460    0.475        789.57
FSA [136]             0.270  0.818  0.187  0.259  0.474  0.398    0.419        53648.10
DiAlign [12]          0.265  0.696  0.292  0.312  0.441  0.425    0.415        3977.44
PRANK [137]           0.223  0.680  0.257  0.321  0.360  0.356    0.376        128355.00
Several benchmarking results on BAliBASE 3.0 are summarized. This table presents total column (TC) scores for six references, average TC scores on all references,
and total running times. The best results in each column are put in bold.
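The TC score reported in Table 9 is the fraction of reference columns that a test alignment reproduces exactly. A minimal sketch follows; it assumes that a column is identified by which residue of each sequence it contains (with a marker for gaps) and that every reference column is counted, which may differ in detail from the scoring programs used for the published benchmark. The names `columns` and `tc_score` are hypothetical.

```python
def columns(alignment):
    """Describe each column by the residue index of every row (-1 for a gap)."""
    counters = [0] * len(alignment)
    cols = []
    for i in range(len(alignment[0])):
        col = []
        for r, row in enumerate(alignment):
            if row[i] == "-":
                col.append(-1)
            else:
                col.append(counters[r])
                counters[r] += 1
        cols.append(tuple(col))
    return cols


def tc_score(reference, test):
    """Fraction of reference columns that also appear in the test alignment."""
    ref_cols = columns(reference)
    test_cols = set(columns(test))
    correct = sum(1 for c in ref_cols if c in test_cols)
    return correct / len(ref_cols)


reference = ["AC-GT", "ACTGT"]
test = ["ACG-T", "ACTGT"]
print(tc_score(reference, test))   # 3 of the 5 reference columns are reproduced
```

Identifying columns by residue indices instead of characters ensures that two columns only count as equal when they pair the same positions of the original sequences.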
31. Khan BM, Sabir M, Alyemeni MN, et al. Genetic similarities and phylogenetic analysis of muntjac (Muntiacus spp.) by comparing the nucleotide sequence of 16S rRNA and cytochrome b genome. Braz J Biol 2021;83:e248153.
32. Liu B, Pavel JA, Hausbeck MK, et al. Phylogenetic analysis, vegetative compatibility, virulence, and fungal filtrates of leaf curl pathogen Colletotrichum fioriniae from celery. Phytopathology 2021;111(4):751–60.
33. Wei R, Zhang XC. Phylogeny of Diplazium (Athyriaceae) revisited: resolving the backbone relationships based on plastid genomes and phylogenetic tree space analysis. Mol Phylogenet Evol 2020;143:106699.
34. Hu Y, Xing W, Hu Z, et al. Phylogenetic analysis and substitution
51. Cantelli G, Bateman A, Brooksbank C, et al. The European Bioinformatics Institute (EMBL-EBI) in 2021. Nucleic Acids Res 2021;50(D1):D11–D19.
52. Sayers EW, Cavanaugh M, Clark K, et al. GenBank. Nucleic Acids Res 2020;48(D1):D84–D86.
53. Ogasawara O, Kodama Y, Mashima J, et al. DDBJ database updates and computational infrastructure enhancement. Nucleic Acids Res 2020;48(D1):D45–D50.
54. Tuli MA, Flores TP, Cameron GN. Submission of nucleotide sequence data to EMBL/GenBank/DDBJ. Mol Biotechnol 1996;6(1):47–51.
55. The UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;47(D1):D506–15.
74. Bhat B, Ganai NA, Andrabi SM, et al. TM-Aligner: multiple sequence alignment tool for transmembrane proteins with reduced time and improved accuracy. Sci Rep 2017;7(1):12543.
75. Maiolo M, Gatti L, Frei D, et al. ProPIP: a tool for progressive multiple sequence alignment with Poisson indel process. BMC Bioinformatics 2021;22(1):518.
76. Garriga E, Di Tommaso, Magis C, et al. Multiple sequence alignment computation using the T-Coffee regressive algorithm implementation. Methods Mol Biol 2021;2231:89–97.
77. Dhivya S, Ashutosh S, Gowtham I, et al. Molecular identification and evolutionary relationships between the subspecies of Musa by DNA barcodes. BMC Genomics 2020;21(1):659.
94. Ishikawa M, Toya T, Hoshida M, et al. Multiple sequence alignment by parallel simulated annealing. Comput Appl Biosci 1993;9:267–73.
95. Hernández-Guía M, Mulet R, Rodríguez-Pérez S. Simulated annealing algorithm for the multiple sequence alignment problem: the approach of polymers in a random medium. Phys Rev E 2005;72:031915.
96. Hwa T, Lässig M. Similarity detection and localization. Phys Rev Lett 1996;76(14):2591–4.
97. Mirjalili S. Genetic Algorithm. Cham: Springer International Publishing, 2019, 43–55.
98. Zhang C, Wong AKC. A genetic algorithm for multiple molecular sequence alignment. Bioinformatics 1997;13(6):565–81.
114. Kuang FJ, Zhang SY, Liu CC. Multiple sequence alignment algorithm based on multi-strategy artificial bee colony. Kongzhi yu Juece/Control Decision 2018;33:1990–6.
115. Hussein AM, Abdullah R, AbdulRashid N. Flower pollination algorithm with profile technique for multiple sequence alignment. In: 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), 2019, 571–6.
116. Dabba A, Tari A, Zouache D. Multiobjective artificial fish swarm algorithm for multiple sequence alignment. INFOR: Inf Syst Oper Res 2020;58(1):38–59.
117. Kumar M, Kumar R, Nidhya R. WOAMSA: whale optimization algorithm for multiple sequence alignment of protein sequence. In: Smys S, Tavares JMRS, Balas VE et al. (eds). Compu-
128. Sievers F, Higgins DG. The Clustal Omega multiple alignment package. Methods Mol Biol 2021;2231:3–16.
129. Pachetti M, Marini B, Benedetti F, et al. Emerging SARS-CoV-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant. J Transl Med 2020;18(1):179.
130. Smirnov V, Warnow T. MAGUS: multiple sequence alignment using graph clustering. Bioinformatics 2021;37(12):1666–72.
131. Liu H, Zou Q, Xu Y. A novel fast multiple nucleotide sequence alignment method based on FM-index. Brief Bioinform 2021;23(1):bbab519, https://doi.org/10.1093/bib/bbab519.
132. Shen C, Zaharias P, Warnow T. Magus+ehmms: improved mul-