MultipleSequenceAlignment_2021_PDF
Bharath Reddy
Schneider Electric, USA
All content following this page was uploaded by Bharath Reddy on 07 February 2022.
Abstract—Bioinformatics is a fast-evolving field. It has proved useful for tasks ranging from establishing phylogenetic trees and predicting protein structure to discovering drugs, so its importance should not be underestimated. Multiple sequence alignment (MSA) is the main step in performing all of these tasks.

Multiple sequence alignment is a method in which more than two sequences are arranged one above the other to find the regions of similarity between them. These regions of similarity are called ‘conserved regions’. Over time, many algorithms have been developed to give a ‘good’ alignment. These developments were essential for accurate phylogenetic reconstruction and protein structure prediction.

In this paper, we discuss the most popular multiple sequence alignment algorithms. We begin with the definition of multiple sequence alignment. Thereafter, we discuss the different techniques in multiple sequence alignment along with the most popular MSA algorithms.

Keywords— Bioinformatics, phylogenetic trees, multiple sequence alignment, conserved region

I. INTRODUCTION
Multiple sequence alignment is a procedure in which more than two sequences are aligned, as opposed to pairwise sequence alignment, in which only two sequences are aligned. The goal of both pairwise and multiple sequence alignment (MSA) is the same: to find regions of similarity [1] [2] [3] [4]. The quality of the alignment is the most important factor, as it determines the similarity between different sequences and can provide ‘good’ biological information that helps biologists or lab technicians better understand the organisms under study or identify new functionality among them. Since this is a significant endeavor, a lot of research is under way to develop new methods that align better, longer, or more sequences.

The computational complexity of MSA is much higher than that of pairwise sequence alignment. With high-throughput sequencing technologies, the sequences are getting longer, and the volume of data is increasing exponentially. The genome project and others are generating huge data sets. Therefore, analyzing these sequences is both a big problem and a challenge.

There are many computational algorithms for solving the MSA problem. These include dynamic programming, which is slow but highly accurate. In the early days, only dynamic programming was used for pairwise sequence alignment, and it was then attempted for MSA. However, producing an optimal alignment turned out to be a computationally complex problem. Hence heuristic algorithms were developed [1]. Heuristic algorithms do not produce an optimal alignment; however, they produce a ‘near-optimal’ alignment much faster. Hence today we see a lot of heuristic algorithms.

2 Multiple Sequence Alignment (MSA)
In this section we start with the first popular heuristic algorithm, developed by [5]. It uses a technique called “progressive alignment”. In this technique, all sequences are first aligned pairwise using an optimal pairwise alignment algorithm, either local or global [5] [6]. After that, a relationship between the sequences is established using the k-tuple or mBed methods [7]. Based on the similarity scores, a guide tree is built using neighbor joining [8] or the unweighted pair group method with arithmetic mean (UPGMA) [9].

This guide tree determines how the multiple sequence alignment progresses step by step. Initially, the most closely related pair of sequences is chosen and aligned, and then the rest of the sequences are aligned on top of the already aligned pair. In this way the whole series of sequences is aligned.

This process is called progressive alignment. Some of the algorithms based on this approach are ClustalW [10], Clustal Omega [7], MAFFT [11], Kalign [12], Probalign [13], MUSCLE [14], DIALIGN [15], PRANK [16], FSA [17], T-Coffee [18, 19], and Probcons [20].

The other approach is called an ‘iterative progressive procedure’, in which the algorithm repeatedly applies dynamic programming to the already aligned sequences to eliminate errors that would otherwise propagate throughout the progressive method. Since dynamic programming is applied repeatedly at all stages, these algorithms may be a little slower but give better quality. The most popular iterative alignment algorithms include MUSCLE [14], DIALIGN [15], SAGA [21] and T-Coffee [18, 19].

An MSA can also be built from protein structures, by reading the structural information and incorporating it into the alignment. The accuracy of such methods is a little better than that of the iterative algorithms. The structure-based MSA methods are 3D-Coffee [22], EXPRESSO [23] and MICAlign [24].

Genetic algorithms (GA), motif-based algorithms and short-sequence algorithms form another category of MSA, which has advanced more recently, mainly due to advances in sequencing techniques. We will touch upon some
GA in the later part of the paper. Motif-based MSA methods are not so popular because motif discovery is hard to begin with when the sequences are large. Both motif-based and short-sequence-based MSA are outside the scope of this paper.

3 Multiple Sequence Alignment Algorithms (MSA)
With the advent of the genome project, coupled with progress in computational power, an increasing number of algorithms are published every year. The number of sequences that can be aligned and the quality of the alignment (accuracy in terms of biologically useful data obtained from it) are also improving. The term ‘quality’ is subjective, and there is no perfect algorithm yet. When it comes to MSA, there is one algorithm with which everyone is familiar today: ClustalW [10].

Figure 1: Clustal Omega Procedure [29]

3.1 ClustalW
ClustalW was developed by Thompson et al. [10] in 1994 and quickly became popular, as its alignment quality was much better than that of earlier algorithms and it also aligned the sequences in much less time.

The first step of ClustalW is to align all sequences pairwise and develop a score and weighting scheme. It does this using the Wilbur and Lipman [25] algorithm. ClustalW then converts the similarity scores of all the sequence pairs into distance scores. In other words, the distance score indicates how closely the sequences are related to each other. The algorithm then uses these scores to produce a guide tree using Neighbor-Joining (NJ) [26].

The NJ algorithm keeps track of all the nodes in the tree. When two nodes are joined, their common ancestor is added to the tree, and the terminal nodes with their branches are removed from the tree. In this process the newly added ancestor becomes a terminal node. Eventually all terminal nodes are replaced by one node. The resulting tree is unrooted, and its branch lengths reflect the divergence along each branch. Using this guide tree, a weight is calculated for each sequence, given by the distance from its branch to the root.

Using the branching order produced by the NJ method, ClustalW starts at the tips and works its way to the root. At each branch level, dynamic programming (DP) is used for sequence alignment with the respective scoring matrix (BLOSUM), and a score is computed. Eventually all the sequences are merged to produce a final alignment. This is shown in Fig 1.

3.3 T-Coffee
T-Coffee [18, 19] is an iterative MSA algorithm and differs from the previous two algorithms discussed above.

As Figure 2 shows, it takes both local and global sequence alignments as input in the first stage. In the same stage, a distance matrix is produced from these alignments. This matrix is then used to produce the guide tree with the same NJ method mentioned in the previous section.

The tree is then used to separate closely related sequences from distantly related ones. The closest sequences are aligned first using dynamic programming. In the next step, the next closest sequences are aligned, and the procedure continues until all the sequences are aligned. To align groups of already aligned sequences, the
algorithm uses scores from the extended library [29]. The algorithm performs much better than the previous algorithms, mainly because of its use of dynamic programming, but for the same reason it does not scale when the number of sequences increases.

3.4 MAFFT
MAFFT [11] is a highly accurate algorithm. It has two things going for it. First, similar regions are identified using the fast Fourier transform (FFT): the amino acid sequences are converted to sequences of volume and polarity values [29]. Second, a simplified scoring system is used [29]. MAFFT offers a progressive method (FFT-NS-2) and an iterative refinement method (FFT-NS-i). In the progressive method (FFT-NS-2), the full distance matrix is calculated very quickly and an interim MSA is produced [29]. The algorithm is fast and can align up to 100,000 sequences.

3.5 MUSCLE
MUSCLE, which stands for multiple sequence comparison by log-expectation, is another fairly accurate algorithm, developed by [14]. It uses two distance measures: k-mer distance for unaligned pairs of sequences and Kimura distance for aligned pairs. Guide trees are built using the UPGMA method.

A progressive alignment is first produced from the guide tree. The Kimura method is then used to re-estimate the distances, and UPGMA is applied again to group the sequences into an updated guide tree. Thereafter, the final alignment is produced using the first intermediate alignment. If the score of the final alignment is below that of the intermediate alignment, the final alignment is discarded and the intermediate one is used as the final alignment.

There are other algorithms in the same category as MUSCLE, MAFFT and Kalign, the best known being MUMMALS.
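The first MUSCLE stage described in Section 3.5 (k-mer distances between unaligned sequences, followed by a UPGMA guide tree) can be sketched as follows. This is a minimal illustration under stated assumptions, not MUSCLE's implementation: the function names `kmer_distance` and `upgma`, the choice k = 3, and the distance formula (one minus the fraction of shared k-mers) are all chosen for the example.

```python
from collections import Counter
from itertools import combinations


def kmer_distance(a, b, k=3):
    """Distance in [0, 1] between two unaligned sequences.

    Assumed formula for illustration: 1 minus the fraction of k-mers shared,
    relative to the number of k-mers in the shorter sequence.
    """
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    shared = sum((ca & cb).values())          # multiset intersection
    shortest = min(len(a), len(b)) - k + 1    # k-mers in the shorter sequence
    return 1.0 - shared / shortest if shortest > 0 else 1.0


def upgma(names, dist):
    """Build a guide tree (nested tuples) from a square distance matrix.

    dist[i][j] is the distance between names[i] and names[j].
    """
    # cluster id -> (subtree, number of leaves)
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i, j in combinations(range(len(names)), 2)}
    next_id = len(names)
    while len(clusters) > 1:
        i, j = min(d, key=d.get)              # closest pair of clusters
        (ti, ni), (tj, nj) = clusters.pop(i), clusters.pop(j)
        del d[(i, j)]
        # Distance to the merged cluster is the size-weighted average.
        for k in clusters:
            dik = d.pop((min(i, k), max(i, k)))
            djk = d.pop((min(j, k), max(j, k)))
            d[(k, next_id)] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTTTGGGG"}
names = list(seqs)
dist = [[kmer_distance(seqs[a], seqs[b]) for b in names] for a in names]
print(upgma(names, dist))  # ('s3', ('s1', 's2'))
```

The two similar sequences are joined first, so a progressive aligner following this tree would align s1 with s2 before adding s3.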
3.5 Probcons
Genetic algorithms [29] have been gathering pace lately, and these algorithms are based on the genetic information in the sequences. Let us look at these algorithms. One of the popular contributions to GA-based algorithms comes from Naznin [29][30][31]. The first is called vertical decomposition with genetic algorithm (VDGA) [30]. In this approach, the sequences at hand are divided into parts, which are then solved separately using the guide tree. Finally, the parts are combined to produce the final alignment.

Figure 2: T-Coffee Procedure [29]

The other contribution from the same author is GAPAM, which stands for genetic-based progressive alignment method [31]. This is also based on the guide tree but differs in what is used to produce the guide trees and in how many guide trees are developed in the process. In this algorithm, there are three stages of guide tree development.
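The divide, solve and combine flow described for VDGA can be sketched as below. The sketch only illustrates the vertical decomposition and recombination steps; it is not the VDGA algorithm itself. The helper names (`split_vertically`, `pad_block`, `vertical_decomposition_align`) are hypothetical, and `pad_block` is a trivial stand-in (right-padding with gaps) for the guide-tree and genetic-algorithm alignment that would be run on each part.

```python
def split_vertically(seqs, parts):
    """Cut every sequence into `parts` consecutive pieces of near-equal length."""
    blocks = [[] for _ in range(parts)]
    for s in seqs:
        for p in range(parts):
            start = len(s) * p // parts
            end = len(s) * (p + 1) // parts
            blocks[p].append(s[start:end])
    return blocks


def pad_block(block):
    """Placeholder aligner (assumption): pad each piece with '-' to equal width."""
    width = max(len(piece) for piece in block)
    return [piece.ljust(width, "-") for piece in block]


def vertical_decomposition_align(seqs, parts=2, align_block=pad_block):
    """Align each vertical block separately, then join the blocks row by row."""
    aligned_blocks = [align_block(b) for b in split_vertically(seqs, parts)]
    return ["".join(pieces) for pieces in zip(*aligned_blocks)]


rows = vertical_decomposition_align(["ACGTAC", "ACGT", "AGTACG"], parts=2)
for row in rows:
    print(row)
```

Any real block aligner can be passed as `align_block`, provided it returns rows of equal length in the same order as its input; the recombination step then yields a rectangular alignment from which each original sequence can be recovered by removing the gaps.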
3.5 Kalign