Multiple Sequence Alignment Algorithms in Bioinformatics

Chapter · January 2022


DOI: 10.1007/978-981-16-4016-2_9


All content following this page was uploaded by Bharath Reddy on 07 February 2022.



Multiple Sequence Alignment Algorithms in Bioinformatics

Bharath Reddy, Richard Fields
Process Automation R&D, Schneider-Electric
Lake Forest, U.S.A.
[email protected], [email protected]

Abstract—Bioinformatics is a fast-evolving topic today. It has been useful in tasks ranging from establishing phylogenetic trees and predicting protein structure to discovering drugs, and hence the importance of bioinformatics should not be underestimated. Multiple sequence alignment (MSA) is the main step in performing these tasks.

Multiple sequence alignment is a method in which more than two sequences are arranged one above the other to find the regions of similarity between them. These regions of similarity are called ‘conserved regions’. Over time, many algorithms have been developed to give a ‘good’ alignment. These developments were essential for accurate phylogenetic reconstruction and protein structure prediction.

In this paper, we talk about the most popular multiple sequence alignment algorithms. We first begin with the definition of multiple sequence alignment. Thereafter, we talk about the different techniques in multiple sequence alignment along with the most popular MSA algorithms.

Keywords—Bioinformatics, phylogenetic trees, multiple sequence alignment, conserved region

1 Introduction

Multiple sequence alignment is a procedure in which more than two sequences are aligned, as opposed to pairwise sequence alignment, where only two sequences are aligned. The goal of both pairwise and multiple sequence alignment (MSA) is the same: to find regions of similarity [1] [2] [3] [4]. The quality of the alignment is the most important factor, as it determines the similarity between different sequences and can provide ‘good’ biological information that helps biologists or lab technicians better understand the organisms under study or identify new functionality amongst the organisms. Since this is a significant endeavor, a lot of research is going on to develop new methods that align better, longer, or more sequences.

The computational complexity of MSA is much higher than that of pairwise sequence alignment. With high-throughput sequencing technologies, the sequences are getting longer and the data is also increasing exponentially. The genome project and others are generating huge sets of data. Therefore, the analysis of these sequences is a big problem and a challenge at the same time.

There are many computational algorithms to solve the MSA problem. These include the dynamic programming method, which is slow but highly accurate. In the initial days, only dynamic programming was used for pairwise sequence alignment, and dynamic programming was then attempted for MSA. However, producing an optimal alignment turned out to be a computationally complex problem. Hence heuristic algorithms were developed [1]. Heuristic algorithms do not produce an optimal alignment; however, they produce a ‘near-optimal’ alignment at a faster speed. Hence today we see a lot of heuristic algorithms.

2 Multiple Sequence Alignment (MSA)

In this section we start with the first popular heuristic algorithm, developed by Feng and Doolittle [5]. They used a technique called ‘progressive alignment’. In this technique, pairwise alignment is performed on all sequences using an optimal pairwise alignment algorithm, either local or global [5] [6]. After that, a relationship between the sequences is built using the k-tuple or mBed methods [7]. Based on the similarity scores, a guide tree is built using neighbor joining [8] or the unweighted pair group method with arithmetic mean (UPGMA) [9].

This guide tree determines how the multiple sequence alignment progresses step by step. Initially, the most closely related pair of sequences is chosen and aligned, and then the rest of the sequences are aligned on top of the already aligned pair. This way the whole series of sequences is aligned.

This process is called progressive alignment. Some of the algorithms based on this approach are ClustalW [10], Clustal Omega [7], MAFFT [11], Kalign [12], Probalign [13], MUSCLE [14], DIALIGN [15], PRANK [16], FSA [17], T-Coffee [18, 19], and ProbCons [20].

The other approach is called an ‘iterative progressive procedure’, in which the algorithm repeatedly applies dynamic programming to the already aligned pairs of sequences to eliminate errors that would otherwise propagate throughout the progressive method. Since dynamic programming is applied repeatedly at all stages, these algorithms might turn out to be a little slower but have better quality. The most popular iterative alignment algorithms include MUSCLE [14], DIALIGN [15], SAGA [21] and T-Coffee [18, 19].

MSA can also be developed from protein structures, by reading and analyzing the structural information and adding it to the alignment. The accuracy of these methods is a little better than that of the iterative algorithms. The structure-based MSA methods are 3D-Coffee [22], EXPRESSO [23] and MICAlign [24].

Genetic algorithms (GA), motifs and short-sequence algorithms are another category of MSA that has advanced more recently, mainly due to advancing sequencing techniques. We will touch upon some
GA in the later part of the paper. Motif-based MSA methods are not so popular because motif discovery is quite hard to begin with when the sequences are quite large. Both motif-based and short-sequence-based MSA are outside the scope of this paper.

3 Multiple Sequence Alignment Algorithms (MSA)

With the advent of the genome project, coupled with the progress in computational power, an increasingly great number of algorithms is published every year. The number of sequences aligned and the quality (accuracy in terms of biologically useful data from the sequence alignment) are also improving. The term ‘quality’ is subjective and there is no perfect algorithm yet. When it comes to MSA, there is one algorithm which everyone is familiar with today: ClustalW [10].

Figure 1: Clustal Omega Procedure [29]

3.1 ClustalW

ClustalW was developed by Thompson [10] in 1994 and quickly became popular, as its alignment quality was much better than that of earlier algorithms and it also aligned the sequences in much less time. The first step of ClustalW is to align all sequences pairwise and develop a score and weighting scheme. It does this step using the Wilbur and Lipman algorithm [25]. ClustalW then uses the similarity scores of all the sequences to create a distance score. In other words, the distance score determines how closely the sequences are related to each other. The algorithm then uses this score to produce a guide tree using neighbor joining (NJ) [8].

The NJ algorithm keeps track of all the nodes in the tree. When nodes are connected, their common ancestor is added to the tree and the terminal nodes with their branches are removed from the tree. In this process the newly added ancestor becomes a terminal node. Eventually all terminal nodes are replaced by one node, and the resulting tree is unrooted, with branches in line with the divergence of each branch. Using this guide tree, weights are calculated for each sequence based on the distance from its branch to the root.

Using the branches developed by the NJ method, ClustalW starts at the tips and works its way to the root. At each branch level, dynamic programming (DP) is used for sequence alignment with the respective scoring matrix (BLOSUM), and a score is developed. Eventually all the sequences are merged to produce a final alignment. This is shown in Fig. 1.

3.2 Clustal Omega

Clustal Omega is an updated version of the Clustal algorithm previously discussed, concentrated on protein sequences. Clustal Omega outperforms the previously described ClustalW when the sequences are large. Similarly to ClustalW, the algorithm first produces pairwise sequence alignments using the k-tuple method [7]. Then the sequences are grouped together using the mBed method [7]. Finally, the multiple sequence alignment is produced using the HHalign package [26].

The clustering is done using methods such as k-means or UPGMA [27]. This method is a widely used method of clustering [28]. It is simple yet fast at clustering the sequences and overcomes the problems of cluster centers for k-means, and therefore directly improves the speed of the algorithm.

The guide tree is constructed in a step-by-step procedure: the pairs of sequences that are most similar are determined first; subsequently, new pairs with the highest similarity are added, and all are then clustered together. In the end, Clustal Omega uses the HHalign package by Söding [26] for stitching together all progressive alignments. This method improves the sensitivity of the alignment.

3.3 T-Coffee

T-Coffee [18, 19] is an iterative MSA algorithm and differs from the previous two algorithms discussed above. From Figure 2, we can see that it receives both local and global sequence alignments in the first stage. In the same stage, a distance matrix is produced based on the alignments. This matrix is then used to produce the guide tree using the same NJ method we mentioned in the previous section.

The tree is then used to segregate closely related sequences from the distantly related ones. The closest sequences are aligned first using a dynamic programming technique. In the next step, the next closest sequences are aligned, and the procedure continues until all the sequences are aligned. To align a group of realigned sequences, the
algorithm uses scores from the extended library [29]. The algorithm performs much better than the previous algorithms, mainly because of the use of dynamic programming, but it cannot cope when the number of sequences increases, again because of the dynamic programming.

3.4 MAFFT

MAFFT [11] is a highly accurate algorithm. It has two things going for it. First, similar regions are identified by the fast Fourier transform (FFT); the amino acid sequences are converted to sequences of volume and polarity values [29]. Second, a simplified version of the scoring system is used [29]. MAFFT offers a progressive method (FFT-NS-2) and an iterative refinement method (FFT-NS-i). In the progressive method (FFT-NS-2), the full distance matrix is calculated very quickly and an interim MSA is produced [29]. The algorithm is fast and, at the same time, can align up to 100,000 sequences.

Figure 2: T-Coffee Procedure [29]

3.5 Kalign

Kalign [12] is another algorithm which falls into the progressive category. It follows the same steps as the previous progressive algorithms: pairwise sequence alignment, with pairwise distances calculated using the k-tuple method. A guide tree is then built using either the UPGMA or the NJ method. Kalign differs from the previous methods in that it uses the Wu-Manber approximate string-matching algorithm [29]. The Wu-Manber method is used in the distance calculation and in the dynamic programming stage. It allows string matching with mismatches, and the distance between two strings is measured using the Levenshtein edit distance [29].

3.6 MUSCLE

MUSCLE, which stands for multiple sequence comparison by log-expectation, is another fairly accurate algorithm, developed by Edgar [14]. It uses two distance measures: the k-mer distance for unaligned pairs of sequences and the Kimura distance for aligned pairs of sequences. Guide trees are built using the UPGMA method.

A progressive alignment is then produced based on the guide tree. The Kimura method is used again to re-estimate the distances, and UPGMA is then used to group the sequences and produce an updated guide tree. Thereafter the final alignment is produced starting from the first intermediate alignment. If the final alignment score is below that of the intermediate alignment, the final alignment is discarded and the intermediate alignment is used as the final alignment.

There are other algorithms which fall into the same category as MUSCLE, MAFFT and Kalign, a well-known one being MUMMALS.

3.7 ProbCons

ProbCons [20] stands for probabilistic consistency-based MSA. It is based on the hidden Markov model (HMM) [29]. An HMM is a statistical model, and MSA methods based on HMMs are called statistical MSA methods; ProbCons is one of the best known of them. The algorithm is a progressive MSA built on a new scoring function based on statistical and probabilistic consistency. Using this probabilistic and statistical information, ProbCons can produce a better-quality MSA. ProbCons is very effective when aligning homologous sequences and in protein alignment.

3.8 Genetic Algorithms

Genetic algorithms (GA) [29] have been gathering pace lately. Rather than following a single guide tree, they treat alignment as a search problem and evolve a population of candidate solutions. Let us investigate these algorithms. One of the popular contributions on GA-based algorithms comes from Naznin [29][30][31]. The first is called vertical decomposition with genetic algorithm (VDGA) [31]. In this approach, the sequences at hand are divided and then solved separately using the guide tree. Finally, they are all combined to produce the final alignment.

The other contribution from the same author is GAPAM, which stands for genetic-based progressive alignment method [30]. This is also based on the guide tree but differs in what is used to produce the guide trees and in how many guide trees are developed in the process. In this algorithm, there are three stages of guide tree development.

In the first stage, a distance table amongst the sequences is developed to produce a guide tree. The distance table is calculated from mismatches in the pairwise sequence alignments. An intermediate MSA is developed based on this guide tree. In the second stage, another guide tree is produced from another distance table (Kimura), which we talked about for the previous algorithm. In the third stage, sequences are randomly selected from the first two generated guide trees. Since this involves a lot of guide tree development, and perhaps the only easy step is the random generation of guide trees from the sequences, this approach is computationally very expensive and hard to scale, although the accuracy is greatly increased.
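The first two guide-tree stages described above (a distance table from pairwise mismatches, then a Kimura-corrected table) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the GAPAM or MUSCLE implementation: the sequences are assumed to be already aligned to equal length, all function names are hypothetical, the Kimura step uses the common approximation d = -ln(1 - D - D^2/5) for protein sequences, and UPGMA is assumed as the tree-building step because the paper names it for MUSCLE and Kalign.

```python
import math
from itertools import combinations


def mismatch_fraction(a, b):
    """Fraction of mismatching residues over columns where neither row has a gap."""
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 1.0  # assumption: no comparable columns counts as maximally distant
    return sum(x != y for x, y in columns) / len(columns)


def kimura_distance(d):
    """Kimura's approximate correction for protein distances: -ln(1 - D - D^2/5)."""
    inner = 1.0 - d - 0.2 * d * d
    return -math.log(inner) if inner > 0 else float("inf")


def distance_table(aligned, correct=False):
    """Pairwise distances for a dict of name -> aligned sequence (equal lengths assumed)."""
    table = {}
    for p, q in combinations(aligned, 2):
        d = mismatch_fraction(aligned[p], aligned[q])
        table[frozenset((p, q))] = kimura_distance(d) if correct else d
    return table


def upgma(names, table):
    """Average-linkage clustering; returns the guide tree as nested tuples."""
    dist = dict(table)
    size = {name: 1 for name in names}  # cluster label -> number of sequences in it
    while len(size) > 1:
        # Pick the closest pair of clusters and merge them.
        a, b = min(combinations(size, 2), key=lambda pair: dist[frozenset(pair)])
        na, nb = size.pop(a), size.pop(b)
        merged = (a, b)
        # Distance to every remaining cluster is the size-weighted average.
        for c in size:
            dist[frozenset((merged, c))] = (
                na * dist[frozenset((a, c))] + nb * dist[frozenset((b, c))]
            ) / (na + nb)
        size[merged] = na + nb
    return next(iter(size))


if __name__ == "__main__":
    aligned = {"s1": "MKVLAT", "s2": "MKVLAS", "s3": "MRILGS"}
    for corrected in (False, True):
        print(upgma(list(aligned), distance_table(aligned, corrected)))
```

In this toy input, s1 and s2 differ in one column out of six, so they are merged first and s3 joins last under both distance tables. A progressive aligner would then walk such a tree from the leaves towards the root, aligning the closest groups first.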
4 Evaluation of MSA

There is no universally accepted standard for measuring the accuracy or quality of an alignment. Over time, however, a generally accepted norm has developed, and most new algorithms use the same metrics for comparison.

In order to do this, standard sets of input sequences and their corresponding reference alignments have been collected into databases over time. The well-known benchmarks are BAliBASE [32] and HOMSTRAD [33]. Of the two, BAliBASE is the more widely used. This database contains an application which produces a score for an alignment from a new MSA algorithm [29].

This application is written in C and the score varies from 0 to 1. If the score is 0, the alignment is not in line with BAliBASE, meaning that it varies significantly from the established norm. If the score is 1, the alignment matches the established alignment in the database. If it matches only part of the established alignment, the score is between 0 and 1.

BAliBASE consists of eight reference sets. Reference 1 contains several mostly homologous sequences. Reference 2 contains divergent sequences, and reference 3 consists of varied divergent sequences. Reference 4 contains terminal extensions, reference 5 contains insertions and deletions, and references 6–8 contain repeats and others [29].

5 Conclusion

The MSA problem is a very difficult challenge: it poses not only a computational challenge but also a challenge for the quality of the alignment, if BAliBASE and HOMSTRAD are held as the standard going forward. There are many variations in guide tree development, distance scoring, the order in which sequences are aligned, and the number of guide trees used in the alignment process. Yet an alignment method is either relatively accurate but slower, or faster and scalable but less accurate. This balance is quite difficult to achieve, and hence researchers are looking at different methods to make alignment faster yet accurate. In the future, we believe, a different guide tree or perhaps higher-throughput hardware running the old ClustalW might lead to a breakthrough.

REFERENCES

[1] C. Kemena and C. Notredame, “Upcoming challenges for multiple sequence alignment methods in the high-throughput era,” Bioinformatics, vol. 25, no. 19, pp. 2455–2465, 2009.
[2] R. C. Edgar and S. Batzoglou, “Multiple sequence alignment,” Current Opinion in Structural Biology, vol. 16, no. 3, pp. 368–373, 2006.
[3] Haque, W., Aravind, A.A., & Reddy, B. (2008). An efficient algorithm for local sequence alignment. 2008 30th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 1367-1372.
[4] Reddy, B, Fields, R. 2020, Multiple Anchor Staged alignment algorithm – Sensitive, 2020, In proceedings with The International Conference on Information and Computer Technologies (ICICT), 2020, San Jose. USA.
[5] D.-F. Feng and R. F. Doolittle, “Progressive sequence alignment as a prerequisite to correct phylogenetic trees,” Journal of Molecular Evolution, vol. 25, no. 4, pp. 351–360, 1987.
[6] I. M. Wallace, G. Blackshields, and D. G. Higgins, “Multiple sequence alignments,” Current Opinion in Structural Biology, vol. 15, no. 3, pp. 261–266, 2005.
[7] F. Sievers, A. Wilm, D. Dineen et al., “Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega,” Molecular Systems Biology, vol. 7, article 539, 2011.
[8] N. Saitou and M. Nei, “The neighbor-joining method: a new method for reconstructing phylogenetic trees,” Molecular Biology and Evolution, vol. 4, no. 4, pp. 406–425, 1987.
[9] I. Gronau and S. Moran, “Optimal implementations of UPGMA and other common clustering algorithms,” Information Processing Letters, vol. 104, no. 6, pp. 205–210, 2007.
[10] J. D. Thompson, D. G. Higgins, and T. J. Gibson, “CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice,” Nucleic Acids Research, vol. 22, no. 22, pp. 4673–4680, 1994.
[11] K. Katoh and D. M. Standley, “MAFFT multiple sequence alignment software version 7: improvements in performance and usability,” Molecular Biology and Evolution, vol. 30, no. 4, pp. 772–780, 2013.
[12] T. Lassmann and E. L. L. Sonnhammer, “Kalign—an accurate and fast multiple sequence alignment algorithm,” BMC Bioinformatics, vol. 6, article 298, 2005.
[13] U. Roshan and D. R. Livesay, “Probalign: multiple sequence alignment using partition function posterior probabilities,” Bioinformatics, vol. 22, no. 22, pp. 2715–2721, 2006.
[14] R. C. Edgar, “MUSCLE: a multiple sequence alignment method with reduced time and space complexity,” BMC Bioinformatics, vol. 5, article 113, 2004.
[15] B. Morgenstern, “DIALIGN: multiple DNA and protein sequence alignment at BiBiServ,” Nucleic Acids Research, vol. 32, supplement 2, pp. W33–W36, 2004.
[16] A. Löytynoja and N. Goldman, “Phylogeny-aware gap placement prevents errors in sequence alignment and evolutionary analysis,” Science, vol. 320, no. 5883, pp. 1632–1635, 2008.
[17] R. K. Bradley, A. Roberts, M. Smoot et al., “Fast statistical alignment,” PLoS Computational Biology, vol. 5, no. 5, Article ID e1000392, 2009.
[18] P. Di Tommaso, S. Moretti, I. Xenarios et al., “T-Coffee: a webserver for the multiple sequence alignment of protein and RNA sequences using structural information and homology extension,” Nucleic Acids Research, vol. 39, supplement 2, pp. W13–W17, 2011.
[19] C. Notredame, D. G. Higgins, and J. Heringa, “T-coffee: a novel method for fast and accurate multiple sequence alignment,” Journal of Molecular Biology, vol. 302, no. 1, pp. 205–217, 2000.
[20] C. B. Do, M. S. P. Mahabhashyam, M. Brudno, and S. Batzoglou, “ProbCons: probabilistic consistency-based multiple sequence alignment,” Genome Research, vol. 15, no. 2, pp. 330–340, 2005.
[21] C. Notredame and D. G. Higgins, “SAGA: sequence alignment by genetic algorithm,” Nucleic Acids Research, vol. 24, no. 8, pp. 1515–1524, 1996.
[22] O. O’Sullivan, K. Suhre, C. Abergel, D. G. Higgins, and C. Notredame, “3DCoffee: combining protein sequences and structures within multiple sequence alignments,” Journal of Molecular Biology, vol. 340, no. 2, pp. 385–395, 2004.
[23] F. Armougom, S. Moretti, O. Poirot et al., “Expresso: automatic incorporation of structural information in multiple sequence alignments using 3D-Coffee,” Nucleic Acids Research, vol. 34, supplement 2, pp. W604–W608, 2006.
[24] X. Xia, S. Zhang, Y. Su, and Z. Sun, “MICAlign: a sequence-to-structure alignment tool integrating multiple sources of information in conditional random fields,” Bioinformatics, vol. 25, no. 11, pp. 1433–1434, 2009.
[25] W. J. Wilbur and D. J. Lipman, “Rapid similarity searches of nucleic acid and protein data banks,” Proceedings of the National Academy of Sciences of the United States of America, vol. 80, no. 3, pp. 726–730, 1983.
[26] J. Söding, “Protein homology detection by HMM-HMM comparison,” Bioinformatics, vol. 21, no. 7, pp. 951–960, 2005.
[27] I. Gronau and S. Moran, “Optimal implementations of UPGMA and other common clustering algorithms,” Information Processing Letters, vol. 104, no. 6, pp. 205–210, 2007.
[28] D. Arthur and S. Vassilvitskii, “k-means++: the advantages of careful seeding,” in Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, Society for Industrial and Applied Mathematics, 2007.
[29] Chowdhury B, Garai G. A review on multiple sequence alignment from the perspective of genetic algorithm. Genomics. 2017;109(5-6):419-431. doi: 10.1016/j.ygeno.2017.06.007
[30] F. Naznin, R. Sarker, D. Essam, Progressive alignment method using genetic algorithm for multiple sequence alignment, IEEE Trans. Evol. Comput. 16 (2012) 615–631.
[31] F. Naznin, R. Sarker, D. Essam, Vertical decomposition with genetic algorithm for multiple sequence alignment, BMC Bioinf. 12 (2011) 353.
[32] J.D. Thompson, F. Plewniak, O. Poch, BAliBASE: a benchmark alignment database for the evaluation of multiple alignment programs, Bioinformatics 15 (1999) 87–88.
[33] K. Mizuguchi, C.M. Deane, T.L. Blundell, J.P. Overington, HOMSTRAD: a database of protein structure alignments for homologous families, Protein Sci. 7 (1998) 2469–2471.
