Briefings in Bioinformatics, 2022, 23(3), 1–16

https://doi.org/10.1093/bib/bbac069
Review

A survey on the algorithm and development of multiple sequence alignment
Yongqing Zhang, Qiang Zhang, Jiliu Zhou and Quan Zou
Corresponding author: Quan Zou, Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, 610054, Chengdu,
China. Tel: 13656009020; Fax: +86-28-83201896; E-mail: [email protected]

Downloaded from https://academic.oup.com/bib/article/23/3/bbac069/6546258 by guest on 15 September 2024


Abstract
Multiple sequence alignment (MSA) is a cornerstone of bioinformatics that can reveal the information hidden in
biological sequences, such as function, evolution and structure. MSA is widely used in many bioinformatics scenarios, such as
phylogenetic analysis, protein analysis and genomic analysis. However, MSA faces new challenges with the gradual increase in
sequence scale and the increasing demand for alignment accuracy. Therefore, developing an efficient and accurate strategy for MSA
has become one of the research hotspots in bioinformatics. In this work, we mainly summarize the algorithms for MSA and its
applications in bioinformatics. To provide a structured and clear perspective, we systematically introduce the background knowledge
of MSA, including databases, metrics and benchmarks. Besides, we list the most common applications of MSA in the field of bioinformatics,
including database searching, phylogenetic analysis, genomic analysis, metagenomic analysis and protein analysis. Furthermore, we
categorize and analyze classical and state-of-the-art algorithms, divided into progressive alignment, iterative algorithm, heuristics,
machine learning and divide-and-conquer. Moreover, we discuss the challenges and opportunities of MSA in bioinformatics.
Our work provides a comprehensive survey of MSA applications and their relevant algorithms. It could bring valuable insights for
researchers to contribute their knowledge to MSA and relevant studies.

Keywords: multiple sequence alignment, progressive alignment, iterative algorithm, heuristic

Yongqing Zhang is an associate professor in the School of Computer Science at the Chengdu University of Information Technology. He is a senior member of CCF. His research interests include machine learning and bioinformatics.
Qiang Zhang is a graduate student in the School of Computer Science at the Chengdu University of Information Technology. His research interests include deep learning and bioinformatics.
Jiliu Zhou is a professor in the School of Computer Science at the Chengdu University of Information Technology. His research interests include intelligent computing and image processing.
Quan Zou is a professor in the Institute of Fundamental and Frontier Science at the University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. His research interests include bioinformatics and machine learning.
Received: December 9, 2021. Revised: January 30, 2022. Accepted: February 9, 2022
© The Author(s) 2022. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]

Introduction
Sequence alignment (SA) is one of the hotspots and critical steps in bioinformatics. It analyzes the essential biological characteristics of nucleotide or protein sequences, such as functional, structural and evolutionary information, by comparing their similarity [1]. For example, in comparative genomic analysis, SA can identify conserved sequence motifs, estimate evolutionary differences among sequences and infer the historical relationship between genes and species [2].

SA is mainly used to determine the similarity of two or more sequences. As shown in Figure 1A, given two nucleotide sequences, it is only necessary to insert one gap in the 3rd position of the 2nd sequence to maximize the number of matched nucleotide columns between the two sequences. According to the range of alignment, SA can be divided into global SA (shown in Figure 1A or B) and local SA (shown in Figure 1C) [3]. On the other hand, according to the number of sequences, SA can also be divided into pairwise sequence alignment [4] (PSA, shown in Figure 1A) and multiple sequence alignment [5] (MSA, shown in Figure 1B or C). SA gives all resulting sequences the same length and aligns the same or similar parts of each sequence by inserting gaps. At present, dynamic programming with a trace-back process can solve the PSA problem exactly, as in the Needleman–Wunsch algorithm for global alignment [6] and the Smith–Waterman algorithm for local alignment [7]. However, compared with PSA, MSA is more complex. It is an NP (nondeterministic polynomial)-complete problem [8], and there is no comprehensive solution for MSA [9].

Figure 1. Various types of SA. (A) shows PSA and global SA, (B) shows MSA and global SA and (C) shows MSA and local SA.

There are currently many algorithms for MSA, which can be roughly divided into progressive alignment, iterative algorithms, heuristics, machine learning and divide-and-conquer. Progressive alignment is a simple algorithm that mainly includes two steps: constructing the guide tree and gradually constructing the alignment according to the guide tree [10]. The progressive alignment algorithm is more straightforward and faster. It is still widely used in various MSA methods and programs after more than 30 years [11–13]. However, because the progressive alignment algorithm lacks accuracy, some methods introduced the iterative algorithm into progressive alignment to solve this problem [14, 15]. These methods continuously optimize the alignment in the iterative process, but the iterative algorithm undeniably comes with an additional time overhead.
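
To make the dynamic programming and trace-back approach to pairwise alignment concrete, the following is a minimal sketch of Needleman–Wunsch global alignment. The function name and the scoring values (match +1, mismatch -1, linear gap penalty -1) are illustrative assumptions, not values taken from the cited works; practical tools use substitution matrices and affine gap penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these assumed scores, aligning "GATTACA" with "GATACA" yields a score of 5: six matched columns and a single gap inserted into the shorter sequence. The table has (n+1)(m+1) cells, which is why the same exact approach becomes impractical when extended to many sequences at once.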
Figure 2. The frame diagram of the paper.
Recently, some researchers have pinned their hopes on heuristics [16, 17]. A heuristic algorithm is an iterative algorithm in a broad sense, suitable for solving NP problems such as MSA. However, unlike the traditional iterative algorithm, the heuristic algorithm may produce different results on each run due to its randomness. Although the application of machine learning to MSA is currently in its initial stage, there are already some valuable works on this subject [18–20]. These works show that machine learning-based methods for MSA perform better than traditional methods on some datasets. Moreover, divide-and-conquer has also been widely used in MSA in recent years [21, 22]. It operates by dividing a group of sequences into multiple groups of subsequences for processing, which significantly improves the efficiency of the methods.

With the development and extensive application of MSA, many valuable works have reviewed MSA research from several aspects. For example, Notredame described existing MSA algorithms and exposed the potential advantages and disadvantages of the widely used programs [23]. Furthermore, Chowdhury and Garai presented different methods for SA and the latest trend of multi-objective genetic algorithms in MSA. Their research shows that there has been excellent progress in improving the accuracy of the alignment [24]. Moreover, Chatzou et al. reviewed the development and performance of the algorithms for protein, DNA and RNA alignment. Importantly, they also explored the relationship between simulated and empirical data, which affects the development of the algorithms [8]. Bawono et al. provided the background and considerations of MSA techniques, primarily focusing on protein SA. They then reviewed existing MSA methods and summarized the specific strengths and limitations of these methods [5]. Xia et al. reviewed the present situation of parallel local alignment, concentrating on large-scale genomic alignment. Besides, they also surveyed the performance of typical methods and discussed the future directions [25]. However, due to the increasing scale of sequences and the diversity and complexity of the approaches, these works are not systematic and comprehensive. There is an urgent demand to summarize the recent research of MSA clearly and systematically.

In this work, we provide a clear review of the research of MSA and its applications in several scenarios. The structure of our study is shown in Figure 2. We summarize the relevant knowledge of MSA from its application, database, algorithm, program and benchmark. In particular, we introduce and classify the classical and latest algorithms for MSA in more detail. Furthermore, we also discuss the current challenges and opportunities for future directions of MSA. This work comprehensively reviews the research of MSA from several aspects and aims to serve as a helpful guide for researchers. The contributions of this work can be summarized as follows: (1) Summarize the research status of MSA systematically. (2) Classify and detail the classical and latest algorithms for MSA. (3) Discuss the current challenges and opportunities of MSA.

Applications of MSA
MSA plays a vital role in clarifying the biological patterns of multiple related sequences. Two or more sequences with high similarity may show similarity in function, evolution and structure [1]. For this reason, MSA has been widely used in sequence database searching, phylogenetic analysis, genomic analysis, protein analysis and metagenomic analysis [23].

Database searching
For a newly sequenced gene or protein, a vital clue to its function is to search the known
sequences homologous or similar to that sequence [26]. However, with the sustained development of sequencing technology, the cumulative number of sequences has exceeded what can be searched manually. Finding similar sequences in a sequence database has become one of the difficulties.

Sequence database search tools mainly include FASTA [27] and BLAST [26]. FASTA, derived from FASTP, is the first program widely used in DNA and protein database searching. The search process of FASTA is as follows: first, build a dictionary of sequence fragments of ktup length, then match the same or similar sequence fragments in the database. After the search, FASTA expands the matched fragments to obtain the best alignment between each matched sequence and the search sequence. In addition, BLAST is the most widely used tool at present. It offers further improvements and is faster than FASTA. The search process is similar to that of FASTA: it needs to establish a dictionary for the sequence fragments and expand on the matched fragments. The apparent difference is that BLAST introduces a threshold for fragment matching and expansion. For different application scenarios, many derivative versions of BLAST have been developed, including the latest BLASTP-ACC [28] and SMI-BLAST [29].

Phylogenetic analysis
Phylogenetic analysis is a key part of many evolutionary studies. It reveals the phylogenetic relationship between different species and analyzes the major transitions in evolution [30]. MSA is a prerequisite in many studies of phylogenetic analysis, such as the construction of the phylogenetic tree. For example, Khan et al. computed the alignment of 16S rRNA of seven muntjac species using DNASTAR to determine the phylogenetic similarity among the muntjacs. They then constructed the phylogenetic tree of these species based on the alignment. Finally, by analyzing the phylogenetic tree, they summarized the phylogenetic relationship among the selected species [31]. Furthermore, while investigating Colletotrichum acutatum sensu lato, which causes leaf curl of celery, Liu et al. [32] aligned the genes of a collection of isolates from celery and non-celery using ClustalX and conducted a multilocus phylogenetic analysis based on the alignment. In addition, Wei et al. [33] and Hu et al. [34] used the MSA program MAFFT to construct the alignment for phylogenetic analysis.

Genomic analysis
Genomic analysis is an essential means of obtaining biological information. It aims to analyze the functional, genetic, evolutionary or structural information contained in genomes [35].

In early 2020, with the outbreak of SARS-CoV-2 (COVID-19), researchers paid particular attention to the virus [36, 37]. They constructed the MSA and the phylogenetic trees of representative spike proteins of SARS-CoV and SARS-CoV-2 from various host sources. They analyzed the specificity required by SARS-CoV-2 spike proteins to cause human infection [38]. The results showed that the N-terminal sequence regions 'MESEFR' and 'SYLTPG' were specific to human SARS-CoV-2. In the receptor-binding domain, two sequence regions, 'VGGNY' and 'EIYQAGSTPCNGV', and a disulfide bond connecting 480C and 488C are structural determinants for recognizing human ACE-2 receptors. The analysis of virus variants also depends on the MSA results. Through the study of base substitution, deletion, variation and single nucleotide polymorphisms (SNP) in the alignment, we can understand the evolution and classification of variants, which provides essential information for the development and improvement of vaccines [39]. In addition, MSA has been applied to many research fields, including the analysis of viral diseases of different species [40], the impact of enhancers on the evolution of mammals [41] and the study of genes encoding killer-cell immunoglobulin-like receptors [42].

Metagenomic analysis
Genomic analysis mainly analyzes the genome of a single species. In contrast, metagenomic analysis primarily studies the genomes of all microbial communities in a specific environment and explores their genetic composition, community function and diversity [43].

Metagenomic analysis can be combined with MSA to quickly determine the genetic relationship between different organisms. In 2020, Zhou et al. collected the metagenomes of 227 bat samples and identified a new virus called RmYN02 through MSA. According to the analysis, the virus has 93.3% nucleotide homology with SARS-CoV-2 and is the closest relative of SARS-CoV-2 found so far [44]. Furthermore, MSA is also often used in metagenomic classification. For example, early metagenomic classifiers used BLAST to match each genome with all genomes in GenBank [45]. Now, to improve accuracy and efficiency, most classifiers prefer combining MSA with other algorithms, such as k-mers [46] and Markov models [47].

Protein analysis
Similar to genomic analysis, protein sequence analysis mainly focuses on protein function analysis, homology analysis, structure prediction and residue contact prediction [23]. In particular, protein structure prediction is one of the current research hotspots. In the Protein Data Bank (PDB), protein structures are mainly obtained by X-ray crystal diffraction, nuclear magnetic resonance and electron microscopy [48]. Because these structure acquisition methods are costly, each has considerable limitations. Therefore, to reduce the cost of obtaining protein structures, some researchers have proposed combining MSA with prediction methods to predict protein structure. The main steps are as follows: given a group of protein sequences homologous to the one to be tested, use the SA results to select an appropriate
method to predict the structure [49]. A prediction method combined with deep learning has been proposed to improve the prediction accuracy. The core idea of this method is to map the amino acid sequence to a continuous space of adaptive learning with the help of natural language processing, so that the whole alignment can be input into a deep neural network for prediction. The results show that this method performs well in different protein prediction problems [50].
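
Of the applications above, the dictionary step of database searching is the most mechanical, so a small sketch may help. The code below builds a dictionary of ktup-length fragments and counts exact fragment matches between a query and each database sequence. It is a simplified illustration, not the FASTA or BLAST implementation: the function names, the grouping of hits by diagonal and the `min_hits` threshold are assumptions introduced here, and the later step of expanding matched fragments into an alignment is omitted.

```python
from collections import defaultdict


def build_ktup_index(database, ktup=3):
    """Map every ktup-length fragment to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for pos in range(len(seq) - ktup + 1):
            index[seq[pos:pos + ktup]].append((seq_id, pos))
    return index


def seed_hits(query, index, ktup=3):
    """Count exact fragment matches per (sequence id, diagonal).

    The diagonal is the database offset minus the query offset; hits that share
    a diagonal can be joined into one ungapped stretch.
    """
    hits = defaultdict(int)
    for qpos in range(len(query) - ktup + 1):
        for seq_id, dpos in index.get(query[qpos:qpos + ktup], ()):
            hits[(seq_id, dpos - qpos)] += 1
    return hits


def best_candidates(query, database, ktup=3, min_hits=2):
    """Return {sequence id: (hit count, diagonal)} for the best diagonal of each sequence.

    min_hits is a hypothetical cutoff standing in for the matching threshold.
    """
    hits = seed_hits(query, build_ktup_index(database, ktup), ktup)
    best = {}
    for (seq_id, diag), count in hits.items():
        if count >= min_hits and count > best.get(seq_id, (0, 0))[0]:
            best[seq_id] = (count, diag)
    return best
```

Only the sequences returned by `best_candidates` would need the more expensive expansion and alignment step, which is what makes dictionary-based search fast on large databases.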

Databases of biomolecules
Sequence databases are mainly divided into gene, protein and comprehensive databases. In 1982, with the development of sequencing technology, the number of biological sequences increased sharply. The European Bioinformatics Institute established the first DNA sequence database, called the European Molecular Biology Laboratory Bank (EMBL-Bank) [51]. In the same year, GenBank was established by the National Institutes of Health and National Science Foundation [52]. After decades of development, these databases have become first-class comprehensive databases [53]. To ensure data integrity, these databases share and exchange data. Therefore, from the perspective of data content, the resources stored in these databases are almost the same [54].

The protein data in the above databases has some deficiencies. Therefore, databases dedicated to storing proteins have emerged, such as UniProt and PDB. UniProt is a free protein database with rich information and the most comprehensive resources. It integrates EMBL-Bank, the Swiss Institute of Bioinformatics and Protein Information Resource [55]. At present, UniProt is mainly composed of five sub-databases: Swiss-Prot, TrEMBL, UniParc, UniRef and Proteomes. The relationship among them is shown in Figure 3. PDB mainly records the structural information of proteins and other biological macromolecules, such as polysaccharides and nucleic acids [48]. The structural information in PDB is mainly determined by X-ray crystal diffraction, nuclear magnetic resonance or electron microscopy. These techniques are expensive but more accurate, so the data are more suitable for research. A summary of the related databases is shown in Table 1.

Figure 3. The relationship among the five sub-databases of UniProt [67].

Algorithms for MSA
This section mainly introduces the widely used MSA algorithms, which are divided into the following five types according to the characteristics of the algorithm: progressive alignment algorithm, iterative algorithm, heuristic, machine learning and divide-and-conquer algorithm, as shown in Figure 4.

Figure 4. Classification of MSA algorithms.

Progressive alignment algorithm
Hogeweg and Hesper [68] first proposed the progressive alignment algorithm in 1984. The progressive alignment algorithm is more straightforward and efficient and has lower memory consumption than other algorithms. However, it also has an obvious defect, 'once a gap, always a gap', meaning that an inserted gap cannot be changed after the alignment is confirmed [69]. According to relevant research, the progressive alignment algorithm may cause information loss when calculating the distance matrix, making the algorithm unstable [70].

After more than 30 years of development, the progressive alignment algorithm has been widely used in various MSA methods, as shown in Table 2. Typically, Higgins et al. developed ClustalW, derived from ClustalV, in 1994 [11]. It uses the weighted sum-of-pairs (WSP) objective function and the neighbor-joining (NJ) method to build the guide tree, as shown in Figure 5. ClustalW is widely used in genome analysis, such as the study on the evolutionary relationship between banana subspecies [77],
Table 1. Summary of DNA/RNA and protein databases

Database    Description                                               References
EMBL-Bank   Comprehensive database of Europe, top-level database.     [51]
DDBJ        Comprehensive database of Japan, top-level database.      [53]
GenBank     Comprehensive database of America, top-level database.    [52]
CNGB        Comprehensive database of China.                          [56]
SRA         Comprehensive database, storing raw sequences.            [57]
RefSeq      Comprehensive database, storing non-redundant sequences.  [58]
GDB         Genome database, storing human genome sequences.          [59]
MmtDB       Metazoa mtDNA variants specialized database.              [60]
MitBASE     Integrated mitochondrial DNA database.                    [61]
SGD         Saccharomyces genome database.                            [62]
AceDB       Caenorhabditis elegans sequence database.                 [63]
dbSNP       Variant gene database.                                    [64]
OMIM        Online Mendelian Inheritance in Man.                      [65]
DGV         The database of genomic variants from healthy people.     [66]
UniProt     Protein database.                                         [55]
PDB         Protein structure database.                               [48]

Table 2. Summary of progressive alignment algorithm-based methods

References  Description                                                    Year
[11]        ClustalW, guide tree, WSP                                      1994
[12]        DiAlign, segment-segment, probabilistic consideration          1998
[71]        T-Coffee, tree-based consistency objective                     2000
[13]        Kalign, Wu-Manber character matching algorithm                 2005
[72]        GramAlign, grammar-based distance                              2008
[73]        MSAIndelFR, indel flanking region                              2015
[74]        TM-Aligner, Wu-Manber and dynamic string matching algorithm    2017
[75]        ProPIP, dynamic programming                                    2021
[76]        Regressive algorithm                                           2021
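
The methods in Table 2 share the skeleton of progressive alignment: compute pairwise distances, derive a merge order (the guide tree), then align sequences and profiles in that order. The sketch below illustrates this skeleton under several simplifying assumptions that do not come from any of the listed programs: distances are a k-mer Jaccard distance, clusters are merged by UPGMA-style average linkage rather than the neighbor-joining method ClustalW uses, and profile columns are scored with an unweighted average of pairwise scores with a linear gap penalty. All function names are hypothetical.

```python
from itertools import combinations


def kmer_distance(a, b, k=2):
    """Alignment-free distance: 1 - Jaccard similarity of the two k-mer sets."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    union = ka | kb
    return 1.0 - len(ka & kb) / len(union) if union else 0.0


def column_score(col_p, col_q, match=1, mismatch=-1, gap=-1):
    """Average pairwise score between one column of each profile."""
    total = 0
    for x in col_p:
        for y in col_q:
            if x == "-" or y == "-":
                total += 0 if x == y else gap
            else:
                total += match if x == y else mismatch
    return total / (len(col_p) * len(col_q))


def align_profiles(p, q, gap=-1):
    """Needleman-Wunsch over columns; p and q are lists of equal-length rows."""
    n, m = len(p[0]), len(q[0])
    cp = [[row[i] for row in p] for i in range(n)]
    cq = [[row[j] for row in q] for j in range(m)]
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j - 1] + column_score(cp[i - 1], cq[j - 1]),
                          s[i - 1][j] + gap, s[i][j - 1] + gap)
    cols = []  # merged columns, collected back to front
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + column_score(cp[i - 1], cq[j - 1]):
            cols.append(cp[i - 1] + cq[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            cols.append(cp[i - 1] + ["-"] * len(q)); i -= 1   # gap column added to q
        else:
            cols.append(["-"] * len(p) + cq[j - 1]); j -= 1   # gap column added to p
    cols.reverse()
    return ["".join(col[r] for col in cols) for r in range(len(p) + len(q))]


def progressive_align(seqs, k=2):
    """Merge the two closest clusters (average linkage) until one alignment remains."""
    clusters = [([i], [s]) for i, s in enumerate(seqs)]  # (member indices, aligned rows)
    d = {(i, j): kmer_distance(seqs[i], seqs[j], k)
         for i, j in combinations(range(len(seqs)), 2)}

    def avg_dist(a, b):
        return sum(d[min(i, j), max(i, j)] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        x, y = min(combinations(range(len(clusters)), 2),
                   key=lambda ab: avg_dist(clusters[ab[0]][0], clusters[ab[1]][0]))
        merged = (clusters[x][0] + clusters[y][0],
                  align_profiles(clusters[x][1], clusters[y][1]))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)] + [merged]
    members, rows = clusters[0]
    aligned = [None] * len(seqs)
    for idx, row in zip(members, rows):
        aligned[idx] = row  # restore the input order
    return aligned
```

Note that `align_profiles` only ever adds whole gap columns to an existing profile and never revisits earlier columns, which is exactly the 'once a gap, always a gap' behavior of progressive alignment.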

the rDNA barcoding identification of bacterial and fungal pathogens in vitreous fluids of endophthalmitis patients [78], and the study on the molecular variation of ABO in the Chinese Han population [79].

Recently, there have been some studies similar to ClustalW. For example, Kalign [13], developed by Lassmann and Sonnhammer in 2005, uses the Wu-Manber character matching algorithm when calculating the distance matrix. GramAlign, proposed by Russell et al. [72], uses a grammar based on the Lempel–Ziv compression algorithm to compute the distance matrix. In 2021, Maiolo et al. introduced the Poisson Indel Process (PIP) model and a dynamic programming algorithm to calculate the distance matrix, which effectively reduced the complexity of the whole method [75]. Regarding the optimization of the objective function, Al-Shatnawi adopted a variable gap penalty in MSAIndelFR to improve the accuracy of alignment in 2015 [73].

To further improve the accuracy and avoid the severe pitfall caused by greediness, T-Coffee (Tree-based Consistency Objective Function For alignmEnt Evaluation) was presented by Notredame and Higgins in 2000. It is based on the progressive alignment algorithm and consistency [71]. The T-Coffee package integrates a variety of MSA methods with different characteristics, such as low memory overhead or higher accuracy [80, 81]. For example, Garriga et al. [76] proposed a regressive MSA method built on T-Coffee and the progressive alignment algorithm. In their method, sequences are first clustered and then aligned starting from the most distant ones. Their experiments show that the technique can significantly improve alignment accuracy on large-scale datasets.

As the most classic algorithm, the progressive alignment algorithm has the advantages of clarity and speed. However, its accuracy is relatively unsatisfactory, and its performance is easily affected by the number of sequences. Therefore, in addition to the progressive alignment algorithm, most algorithms also use other methods to improve the accuracy of alignment, such as MUSCLE [15] and MAFFT [14].

Iterative algorithm
The iterative algorithm continuously optimizes the alignment to improve the accuracy until it converges. Iterative algorithms can generally be divided into deterministic and stochastic [23]. The results of a deterministic iterative algorithm are the same every time. In contrast, the results of a stochastic iterative algorithm may differ due to the internal use of random values. In this section, we mainly discuss the deterministic iterative algorithm. As shown

Figure 5. Given seven protein sequences, the distance matrix of paired sequences is calculated first, then the NJ algorithm is used to construct the guide tree, and finally progressive alignment is used to align the sequences gradually [11].

Table 3. Summary of deterministic iterative algorithm-based methods

References  Description                                                        Year
[82]        MultAlign, hierarchical clustering                                 1988
[14]        MAFFT, fast Fourier transform, refinement                          2002
[15]        MUSCLE, unweighted pair-group method with arithmetic mean          2004
[83]        PRALINE, homology-extended, secondary structure                    2005
[84]        Probalign, posterior probabilities                                 2006
[85]        SATé, minimizing tree-length, POY                                  2009
[86]        PASTA, large-scale, parallel technique                             2015
[87]        VIRULIGN, reference-guide, codon-correct                           2019
[88]        ViralMSA, reference-guide, real-time                               2021

in Table 3, several MSA methods are based on the deterministic iterative algorithm.
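
The defining loop of a deterministic iterative algorithm, propose a change, keep it only if the objective improves, stop when nothing improves, can be sketched in a few lines. The example below is a toy under stated assumptions and not the refinement used by any method in Table 3: the objective is an unweighted sum-of-pairs score with illustrative values (match +1, mismatch -1, residue against gap -1, gap against gap 0), and the only move is swapping a gap with a neighboring residue in one row. Real programs instead split the alignment along the guide tree and realign the two resulting profiles.

```python
from itertools import combinations


def sp_score(rows, match=1, mismatch=-1, gap=-1):
    """Sum-of-pairs score: add up column scores over every pair of rows."""
    total = 0
    for a, b in combinations(rows, 2):
        for x, y in zip(a, b):
            if x == "-" and y == "-":
                continue  # a gap facing a gap is not scored
            total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total


def refine(rows):
    """Apply single gap/residue swaps while any of them raises the SP score.

    Rows and positions are scanned in a fixed order, so the result is the same
    on every run (deterministic). Returns (refined rows, final score).
    """
    rows = list(rows)
    best = sp_score(rows)
    improved = True
    while improved:
        improved = False
        for r in range(len(rows)):
            for c in range(len(rows[r]) - 1):
                row = rows[r]
                if (row[c] == "-") == (row[c + 1] == "-"):
                    continue  # need exactly one gap in the pair to move it
                swapped = row[:c] + row[c + 1] + row[c] + row[c + 2:]
                trial = rows[:r] + [swapped] + rows[r + 1:]
                score = sp_score(trial)
                if score > best:  # keep the change only if the objective improves
                    rows, best, improved = trial, score, True
    return rows, best
```

Because a swap only moves a gap past a residue, the order of residues in each sequence is preserved. The loop stops at the first alignment that no single swap can improve, which need not be the global optimum; this is the local-optimum weakness that iterative methods are known for.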
MultAlign, developed by Corpet in 1988, is one of the earliest MSA methods to combine the iterative algorithm with progressive alignment [82]. MultAlign calculates the pairwise alignments and the distance matrix with FASTA. While constructing the guide tree, it uses hierarchical clustering to cluster all similar sequences and generates a tree with leaf nodes. MUSCLE (MUltiple Sequence Comparison by Log-Expectation), developed by Edgar in 2004, is one of the widely used programs based on an iterative algorithm [15]. The process of MUSCLE is mainly divided into the following three steps (as shown in Figure 6): (1) Calculate the distance matrix D1, construct the guide tree TREE1 by UPGMA (Unweighted Pair-group Method with Arithmetic Mean) and produce an alignment MSA1 following TREE1. (2) Compute the Kimura distance matrix from MSA1 to produce a guide tree TREE2 and a more accurate alignment. (3) First, delete an edge near the root node of TREE2, generate two subtrees and calculate the subtree profiles [89]. Then, realign the two profiles and obtain a new alignment. If the Sum-of-Pairs (SP) score of the new alignment is higher than before, retain the alignment; otherwise, discard it. Finally, repeat this step until the accuracy reaches the criterion.

Figure 6. The flow chart of MUSCLE [15].

To reduce the time cost, Katoh et al. developed MAFFT (Multiple Alignment based on Fast Fourier Transform) in 2002 [14, 90]. MAFFT uses the Fourier transform to convert the sequence into a vector for processing, which simplifies the calculation process of the algorithm. Its most significant advantage is that it can quickly identify the homologous regions and shorten the overall running time of the program. The results demonstrate that MAFFT in iterative refinement mode is over 100 times faster than T-Coffee when there are more than 60 sequences, without sacrificing accuracy.

Due to the lack of accuracy in phylogeny, Liu et al. [85] proposed SATé in 2009, which first applies the affine gap penalty to POY (a phylogenetic analysis program). Compared with ClustalW, SATé is more efficient when searching for short alignment and tree pairs. Besides, the tree it constructs is also more accurate than that of ClustalW. Furthermore, Liu et al. [91] proposed the 2nd version of SATé in 2012. The method adopts a different divide-and-conquer strategy from the previous one to produce smaller related subsets. Therefore, SATé-II is superior to SATé-I
in both performance and accuracy. In 2015, Mirarab et al. [86] proposed PASTA based on SATé; this method adopts a parallel strategy to expand the scale of SA.

To quickly construct a codon-correct alignment, Libin et al. proposed VIRULIGN [87] in 2019, exploiting the mutual conversion between codons and amino acids. Specifically, VIRULIGN first computes pairwise alignments between each target sequence and the reference sequence. Secondly, in the same way, it computes the alignment between the target sequence and the reference sequence, both represented as amino acids. It then adjusts the target sequence according to the difference between the two alignments produced in the 1st and 2nd steps. Finally, it repeats from the 2nd step until no more differences are detected. Recently, Moshiri proposed ViralMSA, which adopts the reference-guided strategy [88] and aims to align whole genomes in real time. Moshiri claims that ViralMSA is orders of magnitude faster than VIRULIGN.

Compared with the progressive alignment algorithm, the most significant advantages of the iterative algorithm are that it can improve the alignment result, has good robustness and is insensitive to the number of sequences. However, its drawback is that it easily falls into a local optimum.

Heuristics algorithm
A heuristic algorithm is a stochastic iterative algorithm in a broad sense. The main idea is to compute a feasible solution of the combinatorial optimization problem within acceptable time and space limits. Heuristics perform well in solving NP-complete problems and can be roughly divided into the genetic algorithm, swarm intelligence, simulated annealing, hidden Markov model (HMM) and meta-heuristics [92].

Simulated annealing
Simulated annealing originates from the principle of solid annealing. When the temperature of a solid rises, the internal particles become disordered, and when the temperature drops, the particles gradually return to an ordered state. Therefore, the simulated annealing algorithm is often used in combinatorial optimization problems [93].

In 1993, Ishikawa et al. [94] first proposed using simulated annealing to solve MSA. The results demonstrate that the method can get a better solution in a given time than the traditional progressive alignment algorithm. In addition, this method lays the foundation for genetic algorithms or swarm intelligence algorithms to improve efficiency.

Genetic algorithm
A genetic algorithm is a method that simulates biological evolution in nature. It transforms the problem-solving process into gene selection, crossover, mutation and other operations similar to those on natural chromosomes and obtains better optimization results through multiple iterations [97]. The related algorithms are shown in Table 4.

Notredame and Higgins [16] first proposed and developed Sequence Alignment by Genetic Algorithm (SAGA), based on the genetic algorithm, in 1996. The method continuously optimizes the alignment through crossover and insertion operations until the alignment does not change, as shown in Figure 7. The results indicate that SAGA is suitable for most objective functions and can achieve better alignments. Different from SAGA, Chen et al. proposed a method combining the genetic algorithm, the divide-and-conquer algorithm and dynamic programming [99]. The method divides the sequences vertically into multiple subsequences by a genetic algorithm. It then computes the sub-alignments of all subsequences using dynamic programming. Finally, it splices the sub-alignments into a complete alignment.

Some efforts also focus on the objective function to improve the genetic algorithm. For example, Arenas-Díaz and Ochoterena proposed a new objective function, GLObal Criterion for Sequence Alignment (GLOCSA), in 2009. Experiments show that GLOCSA can effectively improve the performance of the alignment [100]. Later, Ortuno applied a multi-objective function to the genetic algorithm, and the results indicate that the genetic algorithm with a multi-objective function has more obvious advantages than the traditional genetic algorithm [101].

Moreover, Gao et al. adopted a chaos algorithm to optimize the genetic algorithm in 2016. Using the ergodicity and inherent randomness of chaotic iteration, the premature convergence phenomenon in the genetic algorithm is effectively solved [103]. Different from Gao's method, Chatterjee uses a chemical reaction optimization algorithm to optimize the population in the evolution process, thus improving the efficiency [104]. In addition, there are other genetic algorithm-based methods: Mishra et al. [105] present a progressive technique to enhance the value of the alignment, and Chowdhury et al. [106] combine the genetic algorithm with a bi-objective function to optimize the selection. The genetic algorithm can give feasible solutions in
the subsequent process based on simulated annealing. a particular time, but the most feasible solutions are
For example, Hernndez-Gua et al. [95] described another local optimal solutions. Therefore, the genetic algorithm
simulated annealing-based way in 2005. They use the needs to be combined with other algorithms to improve
mapping between MSA and Directed Polymer in a Ran- stability and accuracy.
dom Media [96] to allow the insertion or deletion of gaps
in the process of modifying alignment and expand the Swarm intelligence algorithm
space of feasible solutions. Swarm intelligence algorithm simulates the biological
Due to the slow convergence speed of the simulated behavior of groups such as insects, animals or birds in
annealing algorithm, it is often necessary to combine nature. By definition, any algorithm or strategy inspired

Table 4. Summary of genetic algorithm-based methods

References Description Year

[16] Affine gap penalty, crossover, mutation 1996


[98] String, simultaneousness, automated processing 1997
[99] Divide-and-conquer, dynamic programming 2005
[100] Global criterion for SA, evolutionary computation 2009
[101] Multi-objective, structural information, non-gaps percentage totally conserved columns 2013
[102] Multi-objective, affine gap penalty, similarity, support maximization 2014
[103] Chaotic sequences, logistic map 2016
[104] Chemical reaction optimization, single-point operator 2019
[105] Optimization algorithm, gap insertion mutation, gap removal mutation 2020
[106] Bi-objective, integer coding, Wilcoxon sign test 2020

Figure 7. The figure explains how to use the existing alignment to generate a better alignment in SAGA. Given two alignments, two new alignments are
combined by crossing and inserting gaps. Finally, the better one is selected to join the next generation [16].
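
To make the loop in Figure 7 more concrete, the following is a minimal Python sketch of an evolutionary search over alignments. It is not SAGA itself: the row-wise crossover, the single gap-insertion mutation, the toy fitness function (identical residue pairs per column minus a cost per gap) and all names and parameter values (`fitness`, `crossover`, `insert_gap`, `evolve`, `gap_cost`, `pop_size`) are assumptions made for illustration only.

```python
import random
from collections import Counter

GAP = "-"

def fitness(alignment, gap_cost=1):
    """Toy objective (assumed): identical residue pairs per column minus a gap cost."""
    score = 0
    for column in zip(*alignment):
        counts = Counter(column)
        gaps = counts.pop(GAP, 0)
        score += sum(c * (c - 1) // 2 for c in counts.values())
        score -= gap_cost * gaps
    return score

def drop_gap_columns(rows):
    """Remove columns that contain only gaps."""
    kept = [col for col in zip(*rows) if any(ch != GAP for ch in col)]
    return ["".join(chars) for chars in zip(*kept)]

def initial_alignment(seqs):
    """Pad all sequences with trailing gaps to a common length."""
    width = max(len(s) for s in seqs)
    return [s + GAP * (width - len(s)) for s in seqs]

def insert_gap(alignment, rng):
    """Mutation: insert one gap into a random row, pad the other rows at the end."""
    r = rng.randrange(len(alignment))
    pos = rng.randrange(len(alignment[r]) + 1)
    rows = [row[:pos] + GAP + row[pos:] if i == r else row + GAP
            for i, row in enumerate(alignment)]
    return drop_gap_columns(rows)

def crossover(a, b, rng):
    """Row-wise crossover: each gapped row is taken from one of the two parents."""
    rows = [ra if rng.random() < 0.5 else rb for ra, rb in zip(a, b)]
    width = max(len(row) for row in rows)
    return drop_gap_columns([row + GAP * (width - len(row)) for row in rows])

def evolve(seqs, generations=100, pop_size=20, seed=0):
    rng = random.Random(seed)
    start = initial_alignment(seqs)
    population = [start] + [insert_gap(start, rng) for _ in range(pop_size - 1)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]      # the better half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(insert_gap(crossover(a, b, rng), rng))
        population = parents + children
    return max(population, key=fitness)
```

Because the better half of each generation is carried over unchanged, the best fitness never decreases, and every operator only adds or removes gaps, so the ungapped sequences are preserved.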

by natural behavior can be regarded as a swarm intelligence algorithm. Since the advent of swarm intelligence, dozens of algorithms have been derived, including the ant colony algorithm, particle swarm optimization (PSO) and artificial bee colony. Swarm intelligence optimization has also been applied to MSA, as shown in Table 5.

Long et al. [107] described an MSA method based on binary PSO in 2009. This method starts with multiple alignments and then continuously modifies each alignment until convergence. The results show that most of its alignment scores are better than those of ClustalW, DiAlign, SAGA and T-Coffee. Besides, Chaabane proposed the Particle Swarm Optimization and Simulated Annealing (PSOSA) hybrid model in 2018 [113], which combines PSO and simulated annealing to make full use of the former's exploration ability and the latter's exploitation ability. The experiments show that it performs better than MUSCLE, MAFFT and ClustalW.

In addition to PSO, there are many optimization algorithms based on animal behavior in MSA. For example, in 2009, Chen et al. proposed an ant colony algorithm-based method [108]. They use a dispersion graph to represent the alignment and then an ant colony algorithm to find an optimal path. The experimental results show that this method can obtain a better solution than ClustalX. Meanwhile, Xiang et al. [109] proposed a similar approach based on the genetic algorithm and a planar graph to optimize the search path. Kuang et al. [114] proposed an artificial bee colony-based method in 2018 that adopts multiple strategies to balance global exploration and local exploitation. The strategies include the following three aspects: tent chaos population initialization, neighborhood search and tournament selection. The results show that this method has better performance and biological characteristics and is robust.

Some studies also adopt less commonly used algorithms. In 2017, Manikandan and Duraisamy proposed a method combining a bacterial foraging algorithm and a genetic algorithm, based on the dispersion optimization principle of the individual and population of Escherichia coli [112]. They obtain individuals with higher fitness through bacterial foraging behavior to improve accuracy and speed. The results show that this method is better than other swarm intelligence algorithms on multiple benchmarks. In 2019, Hussein et al. adopted the flower pollination algorithm to optimize the MSA [115]. They also proposed a new profile algorithm to improve alignment quality. Compared with other methods, this method is superior to all except MSAProbs.

Hidden Markov model
The hidden Markov model (HMM) is a statistical analysis model used to describe a Markov process with unknown parameters [119]. In short, HMM can calculate the state transition

Table 5. Summary of swarm intelligence algorithm-based methods

References Description Year

[107] Binary PSO 2009


[108] Dispersion graph, ant colony 2009
[109] Ant colony, genetic algorithm, planar graph 2010
[110] SAGA 2011
[111] Artificial fish swarm 2014
[112] Bacterial foraging optimization, genetic algorithm 2017
[113] PSO, simulated annealing 2018
[114] Multi-strategy artificial bee colony 2018
[115] Flower pollination algorithm, profile algorithm 2019
[116] Multi-objective, artificial fish swarm 2020
[117] WOAMSA, whale optimization algorithm 2020

[118] Meta-heuristic, dynamic programming, dynamic simulated SAGA 2021

Table 6. Summary of HMM-based methods

References Description Year

[17] ProbCons, probabilistic consistency-based 2005


[120] MUMMALS, local structural information 2006
[121] PROMALS, database search, secondary structure prediction 2007
[122] MSAProbs, partition function 2010
[123] Clustal Omega, mBed, profile 2011
[124] RDPSO, SAGA 2014
[125] ProbPFP, PSO 2019

probability matrix by analyzing the object's previous and current observable states and inferring the next state of the object through the probability matrix. Due to the Markov property of MSA, researchers have proposed several MSA methods based on HMM, as shown in Table 6.

ProbCons (Probabilistic Consistency) is the first program to use HMM for MSA, developed by Do et al. in 2005 [17]. ProbCons adopts a novel objective function based on probabilistic consistency, which combines the posterior probability matrix derived from HMM and the alignment consistency to integrate the conservation information of sequences. According to the results, ProbCons is more accurate than Align-m, ClustalW, MUSCLE and other programs. In 2006, Roshan and Livesay made some improvements based on ProbCons and proposed Probalign (Probabilistic alignment) [84]. The difference is that Probalign uses the partition function instead of HMM to calculate the posterior probability matrix.

Efficient computation with high accuracy is an urgent need. To address it, Liu et al. [122] developed MSAProbs, a program based on ProbCons and Probalign. To enhance performance, MSAProbs calculates sequence weights, which participate in the calculation of the probabilistic consistency score and in progressive alignment. The results show that MSAProbs has higher accuracy than ClustalW, MAFFT, MUSCLE, ProbCons and Probalign. Subsequently, the authors proposed MSAProbs-MPI, which adds parallel and distributed-memory computing to MSAProbs and can cope with larger datasets [126, 127]. In 2019, Zhan et al. [125] proposed the ProbPFP method, which mainly builds on the above techniques. However, ProbPFP is unique in that it introduces PSO to optimize the parameters of the HMM.

ClustalW has great limitations on large-scale data; therefore, Sievers et al. developed Clustal Omega in 2011. Clustal Omega is based on the mBed algorithm and HMM. It can process tens of thousands of sequences, and its accuracy is greatly improved. Besides, to improve alignment efficiency, Clustal Omega can use existing alignments in the database to assist the alignment of new sequences [123, 128]. In 2020, Pachetti et al. [129] applied Clustal Omega to the MSA of SARS-CoV-2 genomes and found eight new recurrent mutations of the virus.

Machine learning
Machine learning is still in the initial stage in MSA. In 2016, Mircea et al. [18] first proposed an MSA method based on reinforcement learning. Like the progressive alignment algorithm, this method finds an optimal alignment order and then aligns the sequences in that order. They use Q-learning, a reinforcement learning method based on dynamic programming, to calculate the optimal alignment order of sequences and use this order to construct the alignment gradually. In 2019, Jafari et al. [20] built on Mircea's work and replaced Q-learning with the A3C (Asynchronous Advantage Actor Critic) model from deep reinforcement learning, which improved the convergence speed and reduced the possibility of falling into a local optimum to a certain extent.

Table 7. Summary of divide-and-conquer-based methods

References Description Year

[21] SpliVert, splitting-splicing, refinement 2020
[22] FAME, common area, splice 2020
[130] MAGUS, divide-and-conquer, graph clustering 2021
[131] FMAlign, FM-index, common segments 2021

Figure 8. This picture explains the optimization process of SpliVert. The general process can be summarized in four steps: compute the alignment, split the alignment, realign the middle part and merge these parts into an alignment [21].

Then, Ramakrishnan et al. [19] proposed an MSA method based on A3C in 2018, called RLALIGN. Unlike Mircea's method, RLALIGN takes the current alignment as the state and the moving direction of the nucleotide in the alignment as the action. Although this method aligns some small datasets better, the process may not converge as the number or length of sequences increases.

Machine learning has valuable achievements in many fields of bioinformatics, but there is more work to be done for MSA. In particular, there are three challenges: (1) there is a lack of sufficient optimal alignment samples for learning; (2) there is no suitable loss function; (3) it is challenging to build a generalized model due to the variability of the length and quantity of sequences.

Divide-and-conquer algorithm
Divide-and-conquer is a relatively simple algorithm whose idea is to break down a whole problem into multiple independent sub-problems of the related or same type for processing. The divide-and-conquer algorithm is a standard solution for solving MSA, and the relevant methods are shown in Table 7.

Zhan et al. [21] proposed an optimization method for MSA based on splitting and splicing in 2020, called SpliVert (Splitting-splicing Vertically). The process is shown in Figure 8. First, obtain an alignment with other MSA methods. Second, divide the alignment vertically into three parts, remove the gaps in the middle part and realign it. Finally, merge the three regions to obtain the refined alignment. The experiments show that the optimized alignment is better than those obtained by the original MSA method or program.

To reduce the time and memory consumption of long sequence alignment, Naznooshsadat et al. [22] proposed FAME (FAst and MEmory efficient), an MSA method based on splitting and splicing. Its process consists of the following three steps: (1) divide the set of sequences vertically into groups of subsequences and maintain the order of these subsequences on the original sequences; (2) align the subsequences at common regions and align the unaligned subsequences with other MSA methods; (3) merge all aligned subsequences. The results show that FAME can process large-scale data four times faster than the original method.

Alignment methods for large-scale sequences have always been a hot spot. For example, Smirnov et al. [130] proposed MAGUS (MSA using Graph clUStering), an MSA method based on graph clustering and divide-and-conquer, in 2021. This method is similar to PASTA in that it also uses divide-and-conquer to construct the alignment. The difference between PASTA and MAGUS lies in the merging stage of the subsets. Smirnov uses GCM (Graph Clustering Merger), a strategy for merging disjoint alignments, to merge the subsets. In addition, thanks to GCM, MAGUS does not need iteration to improve accuracy as PASTA does. Their experiments show that the method is robust and that it not only improves accuracy but also exceeds PASTA in speed. Furthermore, Shen et al. [132] introduced HMM into Smirnov's work in 2021 to further enhance alignment accuracy.

What's more, Liu et al. proposed FMAlign, a technique based on divide-and-conquer and the FM-index. Their method has four steps: (1) building the FM-index; (2) querying anchors; (3) finding the optimal chain; (4) breaking the sequences and aligning. Compared with MAFFT, Halign and FAME, FMAlign has a shorter running time and higher accuracy on long sequences. Besides, their results show that FMAlign is applicable to large-scale sequence alignment [131].

Because divide-and-conquer divides a whole problem into multiple sub-problems that can be handled directly, its performance has clear advantages over other algorithms. In addition, the algorithm can be flexibly combined with parallel techniques to further improve performance because these sub-problems are independent and of the same or similar type.

Objective functions
In MSA, an objective function is needed to judge whether the alignment results meet expectations. SP is the most common and simplest objective function [10]. Its formula is shown in Equation 1, where $m$ is the length of the alignment and $n$ is the number of sequences. $c_i^j$ and $c_i^k$ denote the $i$th unit in the $j$th and $k$th sequences, respectively. $s(c_i^j, c_i^k)$ represents the matching score between two units, which can be set according to Percent Accepted Mutation (PAM) or BLOcks SUbstitution Matrix (BLOSUM).

$$SP = \sum_{i=1}^{m} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} s(c_i^j, c_i^k) \tag{1}$$
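
Equation 1 can be evaluated directly by iterating over the columns of the alignment and over all pairs of sequences. The sketch below is a minimal Python illustration; the scoring function `toy_score` and its values are assumptions standing in for a PAM or BLOSUM lookup, and scoring a residue against a gap with a flat penalty is likewise an assumed simplification.

```python
from itertools import combinations

GAP = "-"

def toy_score(a, b, match=2, mismatch=-1, gap=-2):
    """Stand-in for s(c_i^j, c_i^k); a real scorer would look up PAM or BLOSUM."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def sp_score(alignment, score=toy_score):
    """Sum-of-pairs score: sum over columns i and sequence pairs j < k."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        score(column[j], column[k])
        for column in zip(*alignment)                      # i = 1..m
        for j, k in combinations(range(len(column)), 2)    # j < k
    )

print(sp_score(["ACG-", "A-GT", "ACGT"]))  # 8
```

The cost is proportional to $m \cdot n^2$, which is why SP is cheap to evaluate even though optimizing it over all alignments is not.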

Figure 9. Two alignments for illustrating the calculation of TC score. In this figure, there are three correctly aligned columns (1st, 2nd and 5th) and six TCs. Therefore, TC score is 0.5 (3 divided by 6).

Table 8. Summary of programs

Programs Year Website

ClustalW 1994 http://www.clustal.org/
DiAlign 1998 http://dialign.gobics.de/
T-Coffee 2000 http://tcoffee.crg.cat/
MAFFT 2002 https://mafft.cbrc.jp/alignment/software/
MUSCLE 2004 http://www.drive5.com/muscle/
ProbCons 2005 http://probcons.stanford.edu/
Kalign 2005 https://www.ebi.ac.uk/Tools/msa/kalign/
Probalign 2006 http://probalign.njit.edu/standalone.html
PRANK 2008 http://wasabiapp.org/software/prank/
MSAProbs 2010 http://msaprobs.sourceforge.net/
NX4 2019 https://www.nx4.io
FAME 2020 http://github.com/naznoosh/msa

Another commonly used function is the WSP, which adds sequence weights to the SP function [11]. Its formula is shown in Equation 2, where $w_{j,k}$ is the product of the sequence weights of $c_i^j$ and $c_i^k$, and the sequence weights are calculated from the guide tree.

$$WSP = \sum_{i=1}^{m} \sum_{j=1}^{n-1} \sum_{k=j+1}^{n} w_{j,k} \cdot s(c_i^j, c_i^k) \tag{2}$$

In addition to the weights, WSP also introduces the gap penalty, which comprises the gap opening penalty and the gap extension penalty. The former is the penalty for the appearance of any gap, whereas the latter is the penalty incurred by extending a gap beyond its opening position. In this case, the score $s(c_i^j, c_i^k)$ has three cases. The first is to set the score according to PAM or BLOSUM when neither $c_i^j$ nor $c_i^k$ is a gap. The second is to set $s(c_i^j, c_i^k) = 0$ when both are gaps. The last is to set $s(c_i^j, c_i^k) = G$ when exactly one of $c_i^j$ and $c_i^k$ is a gap. The calculation of $G$ is shown in Equation 3, where $gop$ is the gap opening penalty, $gep$ is the gap extension penalty and $n$ is the length of the run of consecutive gaps (not the number of sequences, as in Equations 1 and 2).

$$G = gop + gep \cdot n \tag{3}$$
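
The interaction between the sequence weights of Equation 2 and the affine gap penalty of Equation 3 can be sketched as follows. This is one possible reading, stated as an assumption: $G$ is charged once per run of consecutive gaps in each pair of sequences, and columns in which both sequences have a gap are skipped before the runs are formed. The function names, the default penalties and the substitution function `subst` are illustrative and not taken from any particular program.

```python
from itertools import combinations, groupby

GAP = "-"

def gap_penalty(n, gop=-10.0, gep=-0.5):
    """Equation 3: G = gop + gep * n for a run of n consecutive gaps."""
    return gop + gep * n

def column_kind(column):
    a, b = column
    if a == GAP:
        return "gap_in_a"
    if b == GAP:
        return "gap_in_b"
    return "residues"

def pair_score(row_a, row_b, subst, gop=-10.0, gep=-0.5):
    """Score one pair of aligned rows with affine gap penalties."""
    # Columns where both rows have a gap contribute 0 and are dropped.
    columns = [(a, b) for a, b in zip(row_a, row_b)
               if not (a == GAP and b == GAP)]
    total = 0.0
    for kind, run in groupby(columns, key=column_kind):
        run = list(run)
        if kind == "residues":
            total += sum(subst(a, b) for a, b in run)
        else:
            total += gap_penalty(len(run), gop, gep)  # one G per gap run
    return total

def wsp_score(alignment, weights, subst, gop=-10.0, gep=-0.5):
    """Weighted sum of pairs: w_{j,k} is the product of the two sequence weights."""
    return sum(
        weights[j] * weights[k]
        * pair_score(alignment[j], alignment[k], subst, gop, gep)
        for j, k in combinations(range(len(alignment)), 2)
    )
```

For example, with `subst = lambda a, b: 1.0 if a == b else -1.0`, the pair `"AC--GT"` / `"ACTTGT"` scored with `gop=-2.0` and `gep=-1.0` gives four matches (+4) and one gap run of length 2 (−4), for a total of 0.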


Another objective function is the consistency-based objective function for alignment evaluation (COFFEE). The higher the score, the higher the biological relevance of the alignment [133]. Before calculating the COFFEE score, a library of pairwise alignments needs to be established from the FSSP database (Fold Classification Based on Structure-Structure Alignment of Proteins) to provide sufficient information for the subsequent MSA. The formula of this objective function is shown in Equation 4, where $A_{j,k}$ is the pairwise alignment of sequences $j$ and $k$ taken from the multiple alignment, $score(A_{j,k})$ is the number of aligned pairs of residues that are shared between $A_{j,k}$ and the library, $len(A_{j,k})$ is the alignment length of $A_{j,k}$, and $w_{j,k}$ is the weight for the pairwise alignment, such as the similarity of the two sequences.

$$COFFEE = \frac{\sum_{j=1}^{n-1} \sum_{k=j+1}^{n} w_{j,k} \cdot score(A_{j,k})}{\sum_{j=1}^{n-1} \sum_{k=j+1}^{n} w_{j,k} \cdot len(A_{j,k})} \tag{4}$$

The total column (TC) score is a relatively simple evaluation metric indicating the similarity between two alignments. The score is the ratio of the number of correctly aligned columns between the test alignment and the reference alignment to the total number of columns in the reference alignment [134]. Take the two alignments in Figure 9 as an example. There are three correctly aligned columns (1st, 2nd and 5th) and six columns in total. Therefore, the TC score is 0.5 (3 divided by 6). The value range of the TC score is [0, 1]. The higher the value, the more the test alignment agrees with the reference alignment.

Programs
The commonly used programs for MSA include ClustalW, MUSCLE, MAFFT and T-Coffee, which are shown in Table 8.

Among the MSA programs, Clustal is the earliest program used for MSA, and its improved version ClustalW is still commonly used. ClustalW adopts a simple progressive alignment algorithm [11], which cannot guarantee the optimal alignment. MUSCLE is an MSA program based on progressive and iterative algorithms [15] and is one of the fastest MSA programs at present. MUSCLE is better than ClustalW in speed and accuracy. For example, for 10 000 sequences with a length of 350 bp, MUSCLE can compute the alignment in >10 min, whereas ClustalW may take 1 year.

MAFFT is the first program to adopt the fast Fourier transform in MSA. The first version of MAFFT was released in 2002 [14]. MAFFT provides various options and alignment modes, such as FFT-NS-1, FFT-NS-2 and FFT-NS-i, and researchers can select appropriate options for different occasions. MAFFT has higher alignment accuracy than MUSCLE. T-Coffee is one of the most popular programs [71] and has high scalability and abundant functions. T-Coffee can integrate various kinds of sequence information during alignment, so its accuracy is higher than that of ClustalW, MUSCLE and MAFFT.

Table 9. Summary of the benchmarking results of several MSA programs on BAliBASE 3.0 [123]

Program R1-1 R1-2 R2 R3 R4 R5 Avg Score Tot time(s)

MSAProbs [122] 0.441 0.865 0.464 0.607 0.622 0.608 0.607 12382.00
Probalign [84] 0.453 0.862 0.439 0.566 0.603 0.549 0.589 10095.20
MAFFT [14] 0.439 0.831 0.450 0.581 0.605 0.591 0.588 1475.40
Probcons [17] 0.417 0.855 0.406 0.544 0.532 0.573 0.558 13086.30
Clustal Omega [123] 0.358 0.789 0.450 0.575 0.579 0.533 0.554 539.91
T-Coffee [71] 0.410 0.848 0.402 0.491 0.545 0.587 0.551 81041.50
Kalign [72] 0.365 0.790 0.360 0.476 0.504 0.435 0.501 21.88
MUSCLE [15] 0.318 0.804 0.350 0.409 0.450 0.460 0.475 789.57
FSA [136] 0.270 0.818 0.187 0.259 0.474 0.398 0.419 53648.10
DiAlign [12] 0.265 0.696 0.292 0.312 0.441 0.425 0.415 3977.44
PRANK [137] 0.223 0.680 0.257 0.321 0.360 0.356 0.376 128355.00

ClustalW [11] 0.227 0.712 0.220 0.272 0.396 0.308 0.374 766.47

Several benchmarking results on BAliBASE 3.0 are summarized. This table presents total column (TC) scores for six references, average TC scores on all references,
and total running times. The best results in each column are put in bold.
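
The TC scores reported in Table 9 compare each test alignment with a reference alignment column by column. A minimal Python sketch of that comparison is given below. It assumes that a reference column counts as correctly aligned only when the test alignment contains a column with exactly the same residues, where each residue is identified by its sequence and its position within that sequence. The function names and the two small alignments are hypothetical and are not the ones shown in Figure 9.

```python
GAP = "-"

def column_signatures(alignment):
    """Describe each column by the residue index it holds per sequence (None for a gap)."""
    next_index = [0] * len(alignment)
    signatures = []
    for column in zip(*alignment):
        signature = []
        for row, symbol in enumerate(column):
            if symbol == GAP:
                signature.append(None)
            else:
                signature.append(next_index[row])
                next_index[row] += 1
        signatures.append(tuple(signature))
    return signatures

def tc_score(test, reference):
    """Fraction of reference columns that also appear in the test alignment."""
    reference_columns = column_signatures(reference)
    test_columns = set(column_signatures(test))
    correct = sum(column in test_columns for column in reference_columns)
    return correct / len(reference_columns)

reference = ["ACG-TA",
             "AC-TTA"]
test = ["ACGTA-",
        "ACTT-A"]
print(tc_score(test, reference))  # 0.5
```

In this hypothetical pair, the 1st, 2nd and 5th reference columns are reproduced by the test alignment, so three of six columns are correct.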

Benchmarks
Benchmarks are the gold standard for judging MSA methods. Table 9 [123] lists the benchmarking results of popular MSA programs on BAliBASE (Benchmark Alignment dataBASE) [135]. BAliBASE consists of five groups of references, containing 82, 41, 30, 49 and 16 alignments, respectively. Besides, the first group is divided into two subgroups, R1-1 and R1-2, which contain 38 and 44 alignments, respectively. In Table 9, the TC scores of each reference are in columns 2–7. The average TC score over all references is given in the second-to-last column and the total run time for all references is given in the last column. The TC score ranges over [0, 1], and the higher the value, the more the result of the program agrees with the benchmark. As can be seen from the table, MSAProbs has the highest accuracy, but its run time is higher than that of most programs. Kalign has the shortest run time, but its accuracy is not satisfactory. The worst program is PRANK, which runs much longer than the other programs and has one of the lowest scores.

Currently, the most used benchmark is BAliBASE [135], which provides multiple reference sets that have been optimized. Up to now, BAliBASE has had three versions. The 1st version provides four reference sets, each containing multiple reference alignments. The 2nd version provides eight reference sets, and the 3rd version adds an update protocol to allow online updates of the reference sets. In addition to BAliBASE, other common benchmark sets include SABMark (Sequence Alignment BenchMark) [138], OXBench (OXford Benchmark) [139], SMART (Simple Modular Architecture Research Tool) [140], PREFAB (Protein REFerence Alignment Benchmark) [15] and QuanTest [141].

Challenges and opportunities
Although the existing methods for MSA can cope with most alignment tasks, they are not good enough to provide comprehensive solutions for every analysis scenario under every condition. There are still some challenges and opportunities for future directions.

Scale of sequence
Compared with the small-scale and straightforward data in most bioinformatics fields, MSA data may vary widely in the scale of each alignment. At the same time, with the rapid accumulation of sequences, MSA may face hyper-scale sequences. For example, Koyama et al. [142] collected >10 000 genomes to analyze the variation of SARS-CoV-2. However, the existing MSA methods struggle to cope with hyper-scale sequences. Even when they can, these methods have low accuracy or high consumption of memory and time. Developing a technique that can deal with hyper-scale sequences while ensuring accuracy and keeping resource consumption acceptable is a big challenge.

Design of method
Most MSA methods are flexible and complicated. Various efforts are made to improve accuracy and performance, and several works adopt more complex machine learning, such as reinforcement learning and natural language processing. In real-life sequence analysis scenarios, the scale and length of sequences may differ, requiring that the method has good generalization ability and robustness to ensure accuracy. In addition, user-friendly visualization technology is also essential. Making MSA methods more reliable and usable is a promising issue.

Correctness of alignment
Correctness-sensitive biological sequence analysis scenarios put forward higher requirements for the biological correctness of alignments. Most existing MSA methods struggle to predict the evolutionary process of sequences correctly and lack interpretability. There are three main reasons for these problems: the randomness of the evolutionary process, the limitations of bioinformatics methods and the lack of an accurate evolutionary model [143]. Therefore, focusing on the above points is more helpful for applying MSA to biological sequence analysis scenarios.

Conclusion
MSA is widely used in many bioinformatics scenarios because it can extract potential information from sequences. In this work, we introduced the basic knowledge of MSA from several aspects. Furthermore, we described representative applications of MSA for database searching, phylogenetics, genomics, metagenomics and proteomics. We also categorized and analyzed the classical and state-of-the-art algorithms for MSA, such as progressive alignment, iterative algorithm, heuristics, machine learning and divide-and-conquer. Finally, we discussed the challenges of this field and promising directions. This review will provide valuable insight and serve as a starting point for studying MSA in future research.

Key Points
• As many biomedical sequences have been accumulated, MSA is widely applied to extract knowledge from these massive sequences, such as evolution, structure and function.
• We systematically introduce MSA's knowledge from several aspects, including background, application, database, algorithm, metric, program and benchmark.
• We also list and discuss the current challenges and opportunities for future directions in bioinformatics.
• As a comprehensive review of current works, this paper will provide valuable insight and serve as a launching point for researchers in their studies.

Funding
National Natural Science Foundation of China (Grant No. 61922020).

References
1. Wang T, Liang C, Hou Y, et al. Small design from big alignment: engineering proteins with multiple sequence alignment as the starting point. Biotechnol Lett 2020;42(8):1305–15.
2. Makigaki S, Ishida T. Sequence alignment generation using intermediate sequence search for homology modeling. Comput Struct Biotechnol J 2020;18:2043–50.
3. Huang M, Shah ND, Yao L. Evaluating global and local sequence alignment methods for comparing patient medical records. BMC Med Inform Decis Mak 2019;19(Suppl 6):263.
4. Baharav TZ, Kamath GM, Tse DN, et al. Spectral jaccard similarity: a new approach to estimating pairwise sequence alignments. Patterns (N Y) 2020;1(6):100081.
5. Bawono P, Dijkstra M, Pirovano W, et al. Multiple sequence alignment. Methods Mol Biol 2017;1525:167–89.
6. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48(3):443–53.
7. Smith TF, Waterman MS. Identification of common molecular subsequences. J Mol Biol 1981;147(1):195–7.
8. Chatzou M, Magis C, Chang JM, et al. Multiple sequence alignment modeling: methods and applications. Brief Bioinform 2016;17(6):1009–23.
9. Warnow T. Revisiting evaluation of multiple sequence alignment methods. Methods Mol Biol 2021;2231:299–317.
10. Altschul SF, Lipman DJ. Trees, stars, and multiple biological sequence alignment. SIAM J Appl Math 1989;49(1):197–209.
11. Thompson JD, Higgins DG, Gibson TJ. Clustal w: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994;22(22):4673–80.
12. Morgenstern B, Frech K, Dress A, et al. Dialign: finding local similarities by multiple sequence alignment. Bioinformatics 1998;14(3):290–4.
13. Lassmann T, Sonnhammer E. Kalign: an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics 2005;6:298.
14. Katoh K, Misawa K, Kuma K-i, et al. Mafft: a novel method for rapid multiple sequence alignment based on fast fourier transform. Nucleic Acids Res 2002;30:3059–66.
15. Edgar RC. Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res 2004;32(5):1792–7.
16. Notredame C, Higgins DG. Saga: sequence alignment by genetic algorithm. Nucleic Acids Res 1996;24(8):1515–24.
17. Do CB, Mahabhashyam MS, Brudno M, et al. Probcons: probabilistic consistency-based multiple sequence alignment. Genome Res 2005;15(2):330–40.
18. Mircea I, Czibula G, Bocicor M. A q-learning approach for aligning protein sequences. In: 2015 IEEE International Conference on Intelligent Computer Communication and Processing (ICCP), IEEE, Piscataway, 2015, 51–8.
19. Ramakrishnan KR, Singh J, Blanchette M. Rlalign: a reinforcement learning approach for multiple sequence alignment. In: 2018 IEEE 18th International Conference on Bioinformatics and Bioengineering (BIBE), IEEE, Piscataway, 2018, 61–6.
20. Jafari R, Javidi MM, Rafsanjani MK. Using deep reinforcement learning approach for solving the multiple sequence alignment problem. SN Appl Sci 2019;1(6):592.
21. Zhan Q, Fu Y, Jiang Q, et al. Splivert: a protein multiple sequence alignment refinement method based on splitting-splicing vertically. Protein Pept Lett 2020;27(4):295–302.
22. Naznooshsadat E, Elham P, Ali SZ. Fame: fast and memory efficient multiple sequences alignment tool through compatible chain of roots. Bioinformatics 2020;36(12):3662–8.
23. Notredame C. Recent progress in multiple sequence alignment: a survey. Pharmacogenomics 2002;3:131–44.
24. Chowdhury B, Garai G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics 2017;109(5-6):419–31.
25. Xia Z, Cui Y, Zhang A, et al. A review of parallel implementations for the smith-waterman algorithm. Interdiscip Sci 2021;3:1–14.
26. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215(3):403–10.
27. Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988;85(8):2444–8.
28. Li YC, Lu YC. Blastp-acc: parallel architecture and hardware accelerator design for blast-based protein sequence alignment. IEEE Trans Biomed Circuits Syst 2019;13(6):1771–82.
29. Jin X, Liao Q, Wei H, et al. Smi-blast: a novel supervised search framework based on psi-blast for protein remote homology detection. Bioinformatics 2020.
30. Kapli P, Yang Z, Telford MJ. Phylogenetic tree building in the genomic age. Nat Rev Genet 2020;21(7):428–44.

31. Khan BM, Sabir M, Alyemeni MN, et al. Genetic similarities and phylogenetic analysis of muntjac (muntiacus spp.) by comparing the nucleotide sequence of 16s rrna and cytochrome b genome. Braz J Biol 2021;83:e248153.
32. Liu B, Pavel JA, Hausbeck MK, et al. Phylogenetic analysis, vegetative compatibility, virulence, and fungal filtrates of leaf curl pathogen Colletotrichum fioriniae from celery. Phytopathology 2021;111(4):751–60.
33. Wei R, Zhang XC. Phylogeny of diplazium (athyriaceae) revisited: resolving the backbone relationships based on plastid genomes and phylogenetic tree space analysis. Mol Phylogenet Evol 2020;143:106699.
34. Hu Y, Xing W, Hu Z, et al. Phylogenetic analysis and substitution rate estimation of colonial volvocine algae based on mitochondrial genomes. Genes (Basel) 2020;11(1).
35. Fariq A, Blazier JC, Yasmin A, et al. Whole genome sequence analysis reveals high genetic variation of newly isolated Acidithiobacillus ferrooxidans io-2c. Sci Rep 2019;9(1):13049.
36. Hu B, Guo H, Zhou P, et al. Characteristics of sars-cov-2 and covid-19. Nat Rev Microbiol 2021;19(3):141–54.
37. Yin C. Genotyping coronavirus sars-cov-2: methods and implications. Genomics 2020;112(5):3588–96.
38. Guruprasad L. Evolutionary relationships and sequence-structure determinants in human SARS coronavirus-2 spike proteins for host receptor recognition. Proteins 2020;88(11):1387–93.
39. Chang TJ, Yang DM, Wang ML, et al. Genomic analysis and comparative multiple sequences of SARS-cov2. J Chin Med Assoc 2020;83(6):537–43.
40. Madhavan A, Venkatesan G, Kumar A, et al. Comparative sequence and structural analysis of the orf095 gene, a vaccinia virus a4l homolog of capripoxvirus in sheep and goats. Arch Virol 2020;165(6):1419–31.
41. Hecker N, Hiller M. A genome alignment of 120 mammals highlights ultraconserved element variability and placenta-associated enhancers. Gigascience 2020;9(1).
42. Roe D, Vierra-Green C, Pyo CW, et al. A detailed view of kir haplotype structures and gene families as provided by a new motif-based multiple sequence alignment. Front Immunol 2020;11:585731.
43. Hunter CI, Mitchell A, Jones P, et al. Metagenomic analysis: the challenge of the data bonanza. Brief Bioinform 2012;13:743–6, 6.
44. Zhou H, Chen X, Hu T, et al. A novel bat coronavirus closely related to sars-cov-2 contains natural insertions at the s1/s2 cleavage site of the spike protein. Curr Biol 2020;30(11):2196–2203.e3.
45. Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Brief Bioinform 2019;20(4):1125–36.
46. Storato D, Comin M. K2mem: discovering discriminative k-mers from sequencing data for metagenomic reads classification. IEEE/ACM Trans Comput Biol Bioinform 2021;19(1):220–229.
47. Burks DJ, Azad RK. Higher-order Markov models for metagenomic sequence classification. Bioinformatics 2020;36(14):4130–6.
48. Velankar S, Burley SK, Kurisu G, et al. The protein data bank archive. Methods Mol Biol 2021;2305:3–21.
49. Makigaki S, Ishida T. Sequence alignment using machine learning for accurate template-based protein structure prediction. Bioinformatics 2020;36(1):104–11.
50. Mirabello C, Wallner B. Rawmsa: end-to-end deep learning using raw multiple sequence alignments. PLoS One 2019;14(8):e0220182.
51. Cantelli G, Bateman A, Brooksbank C, et al. The European Bioinformatics Institute (EMBL-EBI) in 2021. Nucleic Acids Res 2021;50(D1):D11–D19.
52. Sayers EW, Cavanaugh M, Clark K, et al. Genbank. Nucleic Acids Res 2020;48(D1):D84–D86.
53. Ogasawara O, Kodama Y, Mashima J, et al. DDBJ database updates and computational infrastructure enhancement. Nucleic Acids Res 2020;48(D1):D45–D50.
54. Tuli MA, Flores TP, Cameron GN. Submission of nucleotide sequence data to EMBL/genbank/DDBJ. Mol Biotechnol 1996;6(1):47–51.
55. The UniProt Consortium. Uniprot: a worldwide hub of protein knowledge. Nucleic Acids Res 2019;47(D1):D506–15.
56. Chen FZ, You LJ, Yang F, et al. Cngbdb: China national genebank database. Yi Chuan 2020;42(8):799–809.
57. Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res 2011;39(Database issue):D19–21.
58. Pruitt KD, Tatusova T, Brown GR, et al. NCBI reference sequences (refseq): current status, new features and genome annotation policy. Nucleic Acids Res 2012;40(D1):D130–5.
59. Letovsky SI, Cottingham RW, Porter CJ, et al. GDB: the human genome database. Nucleic Acids Res 1998;26(1):94–9.
60. Caló D, De Pascali, Sasanelli D, et al. Mmtdb: a metazoa mitochondrial DNA variants database. Nucleic Acids Res 1997;25(1):200–5.
61. Attimonelli M, Altamura N, Benne R, et al. Mitbase: a comprehensive and integrated mitochondrial dna database. The present status. Nucleic Acids Res 2000;28(1):148–52.
62. Lang OW, Nash RS, Hellerstedt ST, et al. An introduction to the saccharomyces genome database (SGD). Methods Mol Biol 2018;1757:21–30.
63. Kelley S. Getting started with acedb. Brief Bioinform 2000;1(2):131–7.
64. Sherry ST, Ward MH, Kholodov M, et al. DBSNP: the NCBI database of genetic variation. Nucleic Acids Res 2001;29(1):308–11.
65. Amberger JS, Bocchini CA, Scott AF, et al. Omim.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res 2019;47(D1):D1038–43.
66. MacDonald JR, Ziman R, Yuen RK, et al. The database of genomic variants: a curated collection of structural variation in the human genome. Nucleic Acids Res 2014;42(Database issue):D986–92.
67. Pundir S, Martin MJ, O’Donovan C. Uniprot protein knowledgebase. Methods Mol Biol 2017;1558:41–55.
68. Hogeweg P, Hesper B. The alignment of sets of sequences and the construction of phyletic trees: an integrated method. J Mol Evol 1984;20:175–86.
69. Feng DF, Doolittle RF. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987;25(4):351–60.
70. Boyce K, Sievers F, Higgins DG. Instability in progressive multiple sequence alignment algorithms. Algorithms Mol Biol 2015;10:26.
71. Notredame C, Higgins DG, Heringa J. T-coffee: a novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000;302(1):205–17.
72. Russell DJ, Otu HH, Sayood K. Grammar-based distance in progressive multiple sequence alignment. BMC Bioinformatics 2008;9:306.
73. Al-Shatnawi M, Ahmad MO, Swamy MN. Msaindelfr: a scheme for multiple protein sequence alignment using information on indel flanking regions. BMC Bioinformatics 2015;16:393.

Downloaded from https://academic.oup.com/bib/article/23/3/bbac069/6546258 by guest on 15 September 2024

74. Bhat B, Ganai NA, Andrabi SM, et al. Tm-aligner: multiple sequence alignment tool for transmembrane proteins with reduced time and improved accuracy. Sci Rep 2017;7(1):12543.
75. Maiolo M, Gatti L, Frei D, et al. Propip: a tool for progressive multiple sequence alignment with Poisson indel process. BMC Bioinformatics 2021;22(1):518.
76. Garriga E, Di Tommaso, Magis C, et al. Multiple sequence alignment computation using the t-coffee regressive algorithm implementation. Methods Mol Biol 2021;2231:89–97.
77. Dhivya S, Ashutosh S, Gowtham I, et al. Molecular identification and evolutionary relationships between the subspecies of Musa by DNA barcodes. BMC Genomics 2020;21(1):659.
78. Selva Pandiyan A, Karthikeyan RSG, Rameshkumar G, et al. Identification of bacterial and fungal pathogens by rDNA gene barcoding in vitreous fluids of endophthalmitis patients. Semin Ophthalmol 2020;35(7-8):358–64.
79. Ying YL, Hong XZ, Xu XG, et al. Molecular basis of ABO variants including identification of 16 novel abo subgroup alleles in Chinese Han population. Transfus Med Hemother 2020;47(2):160–6.
80. Lladós J, Cores F, Guirado F, et al. Accurate consistency-based MSA reducing the memory footprint. Comput Methods Programs Biomed 2021;208:106237.
81. Chang JM, Floden EW, Herrero J, et al. Incorporating alignment uncertainty into Felsenstein’s phylogenetic bootstrap to improve its reliability. Bioinformatics 2019;37(11):1506–14.
82. Corpet F. Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res 1988;16(22):10881–90.
83. Simossis VA, Heringa J. Praline: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res 2005;33(Web Server issue):W289–94.
84. Roshan U, Livesay DR. Probalign: multiple sequence alignment using partition function posterior probabilities. Bioinformatics 2006;22(22):2715–21.
85. Liu K, Nelesen S, Raghavan S, et al. Barking up the wrong treelength: the impact of gap penalty on alignment and tree accuracy. IEEE/ACM Trans Comput Biol Bioinform 2009;6(1):7–21.
86. Mirarab S, Nguyen N, Guo S, et al. Pasta: ultra-large multiple sequence alignment for nucleotide and amino-acid sequences. J Comput Biol 2015;22(5):377–86.
87. Libin PJK, Deforche K, Abecasis AB, et al. Virulign: fast codon-correct alignment and annotation of viral genomes. Bioinformatics 2019;35(10):1763–5.
88. Moshiri N. Viralmsa: massively scalable reference-guided multiple sequence alignment of viral genomes. Bioinformatics 2021;37(5):714–6.
89. Rychlewski L, Jaroszewski L, Li W, et al. Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Sci 2000;9(2):232–41.
90. Baxevanis AD. Practical aspects of multiple sequence alignment. Methods Biochem Anal 1998;39:172–88.
91. Liu K, Warnow TJ, Holder MT, et al. Sate-ii: very fast and accurate simultaneous estimation of multiple sequence alignments and phylogenetic trees. Syst Biol 2012;61(1):90–106.
92. Amorim AR, Zafalon GFD, de Godoi Contessoto, et al. Metaheuristics for multiple sequence alignment: a systematic review. Comput Biol Chem 2021;94:107563.
93. Caiyang Y, Heidari AA, Chen H. A quantum-behaved simulated annealing algorithm-based moth-flame optimization method. App Math Model 2020;87:1–19.
94. Ishikawa M, Toya T, Hoshida M, et al. Multiple sequence alignment by parallel simulated annealing. Comput Appl Biosci 1993;9:267–73.
95. Hernández-Guía M, Mulet R, Rodríguez-Pérez S. Simulated annealing algorithm for the multiple sequence alignment problem: the approach of polymers in a random medium. Phys Rev E 2005;72:031915.
96. Hwa T, Lässig M. Similarity detection and localization. Phys Rev Lett 1996;76(14):2591–4.
97. Mirjalili S. Genetic Algorithm. Cham: Springer International Publishing, 2019, 43–55.
98. Zhang C, Wong AKC. A genetic algorithm for multiple molecular sequence alignment. Bioinformatics 1997;13(6):565–81.
99. Chen S-M, Lin C-H, Chen S-J. Multiple DNA sequence alignment based on genetic algorithms and divide-and-conquer techniques. Int J Appl Sci Eng 2005;3:89–100.
100. Arenas-Díaz E, Ochoterena H. Multiple sequence alignment using a genetic algorithm and glocsa. J Artif Evol Appl 2009;2009:963150.
101. Ortuño FM, Valenzuela O, Rojas F, et al. Optimizing multiple sequence alignments using a genetic algorithm based on three objectives: structural information, non-gaps percentage and totally conserved columns. Bioinformatics 2013;29(17):2112–21.
102. Kaya M, Sarhan A, Alhajj R. Multiple sequence alignment with affine gap by using multi-objective genetic algorithm. Comput Methods Programs Biomed 2014;114(1):38–49.
103. Gao C, Wang B, Zhou CJ, et al. Multiple sequence alignment based on combining genetic algorithm with chaotic sequences. Genet Mol Res 2016;15(2):gmr8788.
104. Chatterjee S, Barua P, Hasibuzzaman MM, et al. A hybrid genetic algorithm with chemical reaction optimization for multiple sequence alignment. In: 2019 22nd International Conference on Computer and Information Technology (ICCIT), IEEE, Piscataway, 2019, 1–6.
105. Mishra A, Tripathi BK, Singh Soam S. A genetic algorithm based approach for the optimization of multiple sequence alignment. In: 2020 International Conference on Computational Performance Evaluation (ComPE), IEEE, Piscataway, 2020, 415–8.
106. Chowdhury B, Garai G. A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm. Soft Comput 2020;24(20):15871–88.
107. Long H, Xu W, Sun J, et al. Multiple sequence alignment based on a binary particle swarm optimization algorithm. In: 2009 Fifth International Conference on Natural Computation, IEEE, Piscataway, Vol. 3, 2009, 265–9.
108. Chen W, Liao B, Zhu W, et al. Multiple sequence alignment algorithm based on a dispersion graph and ant colony algorithm. J Comput Chem 2009;30(13):2031–8.
109. Xuyu X, Dafan Z, Jiaohua Q, et al. Ant colony with genetic algorithm based on planar graph for multiple sequence alignment. Inf Technol J 2010;9:274–81.
110. Jagadamba PVSL, Prasad Babu MS, Rao AA, et al. An improved algorithm for multiple sequence alignment using particle swarm optimization. In: 2011 IEEE 2nd International Conference on Software Engineering and Service Science, 2011, 544–7.
111. Yang W-H. An improved artificial fish swarm algorithm and its application in multiple sequence alignment. J Comput Theor Nanosci 2014;11:888–92.
112. Manikandan P, Ramyachitra D. Bacterial foraging optimization-genetic algorithm for multiple sequence alignment with multi-objectives. Sci Rep 2017;7(1):8833.
113. Chaabane L. A hybrid solver for protein multiple sequence alignment problem. J Bioinform Comput Biol 2018;16(4):1850015.

114. Kuang FJ, Zhang SY, Liu CC. Multiple sequence alignment algorithm based on multi-strategy artificial bee colony. Kongzhi yu Juece/Control Decision 2018;33:1990–6.
115. Hussein AM, Abdullah R, AbdulRashid N. Flower pollination algorithm with profile technique for multiple sequence alignment. In: 2019 IEEE Jordan International Joint Conference on Electrical Engineering and Information Technology (JEEIT), 2019, 571–6.
116. Dabba A, Tari A, Zouache D. Multiobjective artificial fish swarm algorithm for multiple sequence alignment. INFOR: Inf Syst Oper Res 2020;58(1):38–59.
117. Kumar M, Kumar R, Nidhya R. Woamsa: whale optimization algorithm for multiple sequence alignment of protein sequence. In: Smys S, Tavares JMRS, Balas VE et al. (eds). Computational Vision and Bio-Inspired Computing. Springer International Publishing, Cham, 2019, 131–9.
118. Chaabane L. An enhanced cooperative method to solve multiple-sequence alignment problem. Int J Data Mining Modell Manage 2021;13(1-2):1–16.
119. Baum Leonard E, Ted P, George S, et al. A maximization technique occurring in the statistical analysis of probabilistic functions of Markov chains. Ann Math Stat 1970;41(1):164–71.
120. Pei J, Grishin NV. Mummals: multiple sequence alignment improved by using hidden Markov models with local structural information. Nucleic Acids Res 2006;34(16):4364–74.
121. Pei J, Grishin NV. Promals: towards accurate multiple sequence alignments of distantly related proteins. Bioinformatics 2007;23(7):802–8.
122. Liu Y, Schmidt B, Maskell DL. Msaprobs: multiple sequence alignment based on pair hidden Markov models and partition function posterior probabilities. Bioinformatics 2010;26(16):1958–64.
123. Sievers F, Wilm A, Dineen D, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using clustal omega. Mol Syst Biol 2011;7:539.
124. Sun J, Palade V, Wu X, et al. Multiple sequence alignment with hidden Markov models learned by random drift particle swarm optimization. IEEE/ACM Trans Comput Biol Bioinform 2014;11(1):243–57.
125. Zhan Q, Wang N, Jin S, et al. Probpfp: a multiple sequence alignment algorithm combining hidden Markov model optimized by particle swarm optimization with partition function. BMC Bioinformatics 2019;20(Suppl 18):573.
126. González-Domínguez J, Liu Y, Touriño J, et al. Msaprobs-mpi: parallel multiple sequence aligner for distributed-memory systems. Bioinformatics 2016;32(24):3826–8.
127. González-Domínguez J. Fast and accurate multiple sequence alignment with msaprobs-mpi. Methods Mol Biol 2021;2231:39–47.
128. Sievers F, Higgins DG. The clustal omega multiple alignment package. Methods Mol Biol 2021;2231:3–16.
129. Pachetti M, Marini B, Benedetti F, et al. Emerging sars-cov-2 mutation hot spots include a novel RNA-dependent-RNA polymerase variant. J Transl Med 2020;18(1):179.
130. Smirnov V, Warnow T. Magus: multiple sequence alignment using graph clustering. Bioinformatics 2021;37(12):1666–72.
131. Liu H, Zou Q, Xu Y. A novel fast multiple nucleotide sequence alignment method based on fm-index. Brief Bioinform 2021;23(1):bbab519, https://doi.org/10.1093/bib/bbab519.
132. Shen C, Zaharias P, Warnow T. Magus+ehmms: improved multiple sequence alignment accuracy for fragmentary sequences. Bioinformatics 2021;38(4):918–24.
133. Notredame C, Holm L, Higgins DG. Coffee: an objective function for multiple sequence alignments. Bioinformatics 1998;14(5):407–22.
134. Narayan Behera MS, Jeevitesh JJ, Kant K, et al. Higher accuracy protein multiple sequence alignments by genetic algorithm. Proc Comput Sci 2017;108:1135–44.
135. Thompson JD, Koehl P, Ripp R, et al. Balibase 3.0: latest developments of the multiple sequence alignment benchmark. Proteins 2005;61(1):127–36.
136. Bradley RK, Roberts A, Smoot M, et al. Fast statistical alignment. PLoS Comput Biol 2009;5(5):e1000392.
137. Löytynoja A, Goldman N. Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis. Science 2008;320(5883):1632–5.
138. Van Walle, Lasters I, Wyns L. Sabmark-a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics 2005;21(7):1267–8.
139. Raghava GP, Searle SM, Audley PC, et al. Oxbench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics 2003;4:47.
140. Schultz J, Copley RR, Doerks T, et al. Smart: a web-based tool for the study of genetically mobile domains. Nucleic Acids Res 2000;28(1):231–4.
141. Sievers F, Higgins DG. Quantest2: benchmarking multiple sequence alignments using secondary structure prediction. Bioinformatics 2020;36(1):90–5.
142. Koyama T, Platt D, Parida L. Variant analysis of SARS-cov-2 genomes. Bull World Health Organ 2020;98(7):495–504.
143. Ashkenazy H, Sela I, Levy Karin E, et al. Multiple sequence alignment averaging improves phylogeny reconstruction. Syst Biol 2019;68(1):117–30.
