Chapter 7 Multiple Alignment

7.1 Introduction
Multiple sequence alignment (MSA) is a crucial first step in most bioinformatics
analyses of protein or nucleotide sequences, since it shows how similar the sample
sequences are to one another. Evolutionary relationships are difficult to infer without
an MSA, so it is a prerequisite for phylogenetic analysis: a phylogenetic tree is
normally built from one. MSAs are also used to find conserved domains.

7.2 Progressive alignment


The most widely used approach to multiple sequence alignment is a heuristic search
known as the progressive technique (also called the hierarchical or tree method),
developed by Da-Fei Feng and Russell Doolittle in 1987. Progressive alignment builds up
the final MSA by combining pairwise alignments, beginning with the most similar pair
and progressing to the most distantly related sequences.
All progressive alignment methods require two stages:
• a first stage in which the relationships between the sequences are represented as
a tree, called a guide tree,
• and a second stage in which the MSA is built by adding the sequences
sequentially to the growing MSA according to the guide tree.
The guide tree, an approximate phylogenetic tree, determines the order of pairwise
alignments in the progressive alignment scheme. The initial guide tree is built by an
efficient clustering method such as neighbor-joining or the Unweighted Pair-Group
Method with Arithmetic Mean (UPGMA).
Steps
• Compute pairwise distance scores for all pairs of sequences
• Generate a guide tree, which ensures that similar sequences are near in the tree
• Align sequences one by one according to the guide tree
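The first of these steps can be sketched in Python as follows. This is a minimal illustration, not the procedure of any particular program: the scoring values (match 1, mismatch -1, gap -2), the function names, and the three toy sequences are all assumptions chosen for the example. Each pair is aligned globally (Needleman-Wunsch), and the distance is taken as one minus the fraction of identical columns.

```python
from itertools import combinations


def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment; returns the two gapped strings."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def pairwise_distances(seqs):
    """Distance = 1 - fraction of identical columns in the pairwise alignment."""
    dist = {}
    for x, y in combinations(seqs, 2):
        ax, ay = global_align(seqs[x], seqs[y])
        identical = sum(p == q for p, q in zip(ax, ay))
        dist[(x, y)] = 1 - identical / len(ax)
    return dist


# Toy input (hypothetical sequences, non-empty by assumption).
seqs = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "TTGTCCAA"}
dist = pairwise_distances(seqs)
closest = min(dist, key=dist.get)
print(closest, dist[closest])
```

The closest pair found this way is the one a progressive method would align first; the full distance table is what the guide tree is built from.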

Distance based UPGMA (Unweighted Pair-Group Method with Arithmetic Mean)
• Align all sequences and name them
• Count the mismatches and record them in the distance matrix table
• Complete the table by comparing all the sequences
• Look for the smallest value in the table to find the first group
• Recompute the distances to each new group as arithmetic means, step by step, and
construct the tree
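The UPGMA steps listed here can be sketched in Python as below. The four aligned toy sequences, the function names, and the nested-tuple tree representation are assumptions made for illustration; branch lengths are left out to keep the sketch short.

```python
from itertools import combinations


def mismatches(x, y):
    """Count positions at which two aligned, equal-length sequences differ."""
    return sum(p != q for p, q in zip(x, y))


def upgma(names, dist):
    """Cluster with UPGMA and return the tree topology as nested tuples.

    names: list of sequence labels.
    dist:  dict mapping (name_a, name_b) -> distance, one entry per pair.
    """
    size = {name: 1 for name in names}      # cluster -> number of leaves
    d = {}
    for (a, b), value in dist.items():
        d[(a, b)] = d[(b, a)] = value

    while len(size) > 1:
        # The smallest remaining distance defines the next group.
        a, b = min(combinations(size, 2), key=lambda pair: d[pair])
        merged = (a, b)
        # Distance from the new group to every other cluster is the
        # arithmetic mean, weighted by the number of leaves on each side.
        for c in size:
            if c not in merged:
                avg = (size[a] * d[(a, c)] + size[b] * d[(b, c)]) / (size[a] + size[b])
                d[(merged, c)] = d[(c, merged)] = avg
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    return next(iter(size))


# Hypothetical aligned sequences, already of equal length.
aligned = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGTTCAA", "D": "TGCATCAA"}
table = {(x, y): mismatches(aligned[x], aligned[y])
         for x, y in combinations(aligned, 2)}
tree = upgma(list(aligned), table)
print(table)
print(tree)
```

In this toy table A and B differ at a single position, so they form the first group; C joins next and D last.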
The main weakness of the progressive alignment method is that the final MSA depends
heavily on the pairwise alignments made in the initial step. Also, because the method
is based on global alignment, it handles sets of sequences with very different lengths
poorly. ClustalW is perhaps the best-known example of progressive alignment, along
with MAFFT, T-Coffee, and MUSCLE.
Progressive alignments are not guaranteed to be globally optimal. The primary problem
is that errors made at any stage in growing the MSA are propagated through to the
final result. Performance is also particularly poor when all of the sequences in the set
are rather distantly related.

7.3 Iterative method


Iterative methods work much like progressive approaches, but they repeatedly realign
the original sequences as well as adding new sequences to the growing MSA. Progressive
techniques depend on a high-quality initial alignment because early alignments are
always carried into the final product: once a sequence has been aligned into the MSA,
its alignment is never reevaluated. This approximation increases efficiency but
compromises accuracy. Iterative algorithms, by contrast, can return to previously
calculated pairwise alignments or sub-MSAs in order to optimize a general objective
function such as the alignment score.
The software packages PRRN and PRRP are examples of iterative methods. They use a
hill-climbing algorithm to maximize the MSA alignment score, repeatedly correcting
both the alignment weights and the “gappy” sections of the MSA. PRRP performs best
when refining an alignment created by a faster method. MUSCLE (multiple sequence
alignment by log-expectation), another well-known iterative method, improves on
progressive methods by using a more accurate distance measure to assess how closely
two sequences are related.
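The idea of hill climbing on an objective function can be illustrated with a small Python sketch. It is not the PRRN/PRRP or MUSCLE algorithm: the sum-of-pairs scoring values, the single-gap shift move, and the toy alignment are assumptions chosen to keep the example short. A change is kept only when it raises the score, and the search stops when no single move helps.

```python
from itertools import combinations


def sum_of_pairs(msa, match=1, mismatch=-1, gap=-2):
    """Objective function: score every pair of rows in every column."""
    total = 0
    for column in zip(*msa):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue                    # gap-gap pairs are ignored
            if x == "-" or y == "-":
                total += gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


def swap(row, i, j):
    """Return the row with the characters at positions i and j exchanged."""
    chars = list(row)
    chars[i], chars[j] = chars[j], chars[i]
    return "".join(chars)


def refine(msa):
    """Hill climbing: shift one gap sideways at a time while the score improves."""
    msa = list(msa)
    best = sum_of_pairs(msa)
    improved = True
    while improved:
        improved = False
        for r, row in enumerate(msa):
            for i in range(len(row) - 1):
                # Swapping a gap with a neighbouring residue moves the gap
                # without changing the order of residues in the sequence.
                if (row[i] == "-") != (row[i + 1] == "-"):
                    candidate = msa[:r] + [swap(row, i, i + 1)] + msa[r + 1:]
                    score = sum_of_pairs(candidate)
                    if score > best:
                        msa, best, improved = candidate, score, True
                        break
            if improved:
                break
    return msa, best


# Toy alignment in which the gap of the third row is misplaced by one column.
start = ["ACGTA", "ACGTA", "A-CTA"]
refined, score = refine(start)
print(sum_of_pairs(start), refined, score)
```

Like any hill-climbing search, this can stop at a local optimum, which is one reason iterative methods tend to work best when they start from a reasonable alignment.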

7.4 MSA filtering
MSA approaches have several flaws because they rely on heuristic searches that are
guided by imperfect objective functions. As a result, even when their output is
excellent overall, it almost always contains some errors. A number of software tools
have been created with the goal of filtering MSAs so that only the most reliable parts
remain.
This is accomplished by removing sites, sequences, or residues (the latter by replacing
them with the gap symbol “-” or an ambiguity symbol, “?,” “N,” or “X,” depending on
the sequence type).
It is crucial to make sure that filtering does not eliminate the signal along with the
noise created by the improperly aligned sections. The trade-off between noise reduction
and signal loss is critical in MSA filtering.
Filtering techniques fall into two main groups. The first consists of methods that
filter an MSA by completely deleting certain regions or sequences.
These methods give the user only two options for each region and sequence, “take it
or leave it,” which is why they are known as TILI-filtering techniques.
The second group consists of techniques that mask individual residues (replacing them
with the gap symbol “-” or an ambiguity character such as “?,” “N,” or “X,” depending
on the type of sequence). These are termed “picky filtering” techniques because they
pick out the usable information from a region or sequence while discarding the rest.
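The difference between the two groups can be shown with a short Python sketch. The function names, the 50% gap threshold, and the toy protein alignment are assumptions for illustration only; real filtering tools use more elaborate criteria to decide what to remove or mask.

```python
def drop_gappy_columns(msa, max_gap_fraction=0.5):
    """TILI-style filtering: a site is either kept whole or deleted whole."""
    kept = [col for col in zip(*msa)
            if col.count("-") / len(col) <= max_gap_fraction]
    if not kept:
        return ["" for _ in msa]
    return ["".join(chars) for chars in zip(*kept)]


def mask_segment(msa, row, start, end, symbol="X"):
    """Picky-style filtering: hide residues start..end-1 of one sequence only.

    Gap characters inside the segment are left untouched.
    """
    seq = msa[row]
    masked = "".join(symbol if c != "-" else c for c in seq[start:end])
    return msa[:row] + [seq[:start] + masked + seq[end:]] + msa[row + 1:]


# Hypothetical protein alignment: one gappy site and one divergent segment.
msa = ["MKT-LLV",
       "MKTALLV",
       "MRT-WWV",
       "MKT-LLV"]

print(drop_gappy_columns(msa))    # the fourth site is removed from every row
print(mask_segment(msa, 2, 4, 6)) # only the "WW" of the third row is hidden
```

The first function discards the whole site for every sequence, while the second keeps the site and hides only the suspicious residues of one sequence.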
Fundamental principles of filtering techniques
A. Gaps indicate sites that are difficult to align and may be saturated
Aligning sequences comes down to placing gaps. The more gaps there are at a site, the
more work the alignment method had to do there, and hence the more likely it is to
have made mistakes. Biologically, insertions and deletions are less common in proteins
than point substitutions. A site with many gaps therefore shows an aberrant
evolutionary pattern, perhaps owing to an MSA problem. Otherwise, multiple mutations
at the same site are likely, obscuring the evolutionary signal.
B. Per site, a small number of residues with similar properties are expected
Homologous sites are likely to share traits, notably the amino acids observed. If all
20 amino acids are found at one site, then first, it is questionable whether they all
descend from the same ancestral amino acid. Second, the protein has remained
functional regardless of the physicochemical properties of the residue at this site,
and at least 19 substitutions must have occurred to produce this pattern (which
suggests a saturated site). In this situation, removing the site from the alignment
would be safer. Sites that contain only hydrophobic or only positively charged
residues, by contrast, should be retained. Residue variability at a site can be
estimated in a variety of ways, from simple (a count of the amino acids found at the
site) to complicated.
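The simplest of these estimates can be sketched in Python. The grouping of amino acids into four property classes is an assumption made for this example (published classifications differ in detail), and the toy alignment is hypothetical.

```python
# Assumed, simplified grouping of amino acids by physicochemical property.
PROPERTY_CLASS = {}
for label, residues in {"hydrophobic": "AVLIMFWC",
                        "positive": "KRH",
                        "negative": "DE",
                        "polar": "STNQYGP"}.items():
    for residue in residues:
        PROPERTY_CLASS[residue] = label


def residue_diversity(msa):
    """Number of distinct residues (gaps excluded) observed at each site."""
    return [len(set(column) - {"-"}) for column in zip(*msa)]


def class_diversity(msa):
    """Number of distinct property classes observed at each site."""
    counts = []
    for column in zip(*msa):
        classes = {PROPERTY_CLASS.get(r, r) for r in column if r != "-"}
        counts.append(len(classes))
    return counts


# Hypothetical alignment of four short protein fragments.
msa = ["MKTAL",
       "MRSCL",
       "MKGDL",
       "MHNEL"]

print(residue_diversity(msa))
print(class_diversity(msa))
```

The second site holds three different residues, but all of them are positively charged, so it counts as a single property class; such a site would be retained under the principle described here.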
C. Similarity in homologous sequences is expected
In most pipelines, homologous sequences are identified by their similarity. This
ensures that the sequences to be aligned have a minimum overall similarity. Despite
this, it is not unusual for a portion of one or more sequences to deviate dramatically
from the rest of the alignment. Such a segment is unlikely to be homologous to the
region it is aligned with, either because it was misaligned and is in fact homologous
to a different portion of the alignment or because it has no homology at all.
The latter circumstance can be recognized before the sequences are aligned, and various
approaches have been developed to do so. This filtering should be done before sequence
alignment, since long insertions in particular sequences might hamper the MSA.
D. Orthologous sequences should be consistent across loci (post-filtering)
Alignment filtering algorithms will almost certainly fail to detect a non-orthologous
sequence that is homologous to the remaining sequences. In a phylogenomic setting,
however, it is possible to inspect the MSAs of every locus under consideration. With a
large number of taxa and genes, it becomes easier to characterize the loci, along with
the evolutionary rate of each taxon. The OrthoMaM v10 pipeline uses a basic technique
for detecting non-orthologous sequences, whereas Phylo-MCOA offers a more detailed
solution. Such checks do not apply, however, when the study focuses on the evolution of
an entire gene family (both paralogous and orthologous sequences).

7.5 Programs and methods for multiple sequence alignment


7.5.1 Clustal family
The Clustal family is the best-known set of MSA programs. Its classic members are
ClustalW and ClustalX, the latter adding a graphical interface. Clustal family tools
are fast and relatively reliable; like other progressive alignment tools, they first
align all pairs of the provided sequences by global dynamic programming (DP), using
either PAM250 or BLOSUM62.
The pairwise scores are then used by Clustal to generate a neighbor-joining tree; some
early versions used the UPGMA method instead. The sequences are then aligned by
following the tree from the leaves inward. Clustal gains sensitivity by adjusting gap
penalties by position: a short hydrophilic stretch is taken as an indication of a loop
or random coil region, so the gap penalties are reduced for these stretches.
Clustal Omega: The most recent and best-performing program in the Clustal family. It
can align around 200,000 sequences in a single run and give results in a few hours. In
terms of accuracy, it is comparable to other high-quality aligners, and on large
sequence sets it performs well in terms of both alignment quality and completion time.
Its first step is pairwise comparison by the k-tuple method. The mBed method is then
used for clustering, together with the K-means clustering method, and a guide tree is
built with the UPGMA method.
Finally, the HHalign package builds the multiple alignment by aligning pairs of hidden
Markov model profiles.
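A k-tuple comparison avoids a full alignment by counting short words shared between two sequences. The Python sketch below shows the general idea only; the exact measure used inside Clustal Omega is not given in this chapter, so the formula, the function names, and the word length k = 2 are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping word of length k in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_distance(a, b, k=2):
    """1 - (shared k-mers / number of k-mers in the shorter sequence).

    Assumes both sequences contain at least k characters.
    """
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    shortest = min(len(a), len(b)) - k + 1
    return 1 - shared / shortest


# Hypothetical nucleotide sequences.
print(ktuple_distance("ACGTAC", "ACGTTC"))
print(ktuple_distance("ACGTAC", "ACGTAC"))
```

Because no dynamic programming is involved, each comparison runs in time roughly linear in the sequence length, which matters when many thousands of sequences have to be compared.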
7.5.2 DIAlign
DIAlign combines global and local pairwise alignment. It builds the MSA from gap-free
pairs of equal-length segments that show statistically significant similarity. In this
respect DIAlign resembles FASTA alignment.
7.5.3 Tree-based consistency objective function for alignment evaluation (T-coffee)
T-Coffee is also a progressive multiple sequence alignment tool. It starts by using
ClustalW and LALIGN to build a primary library of pairwise alignments that combines
global pairwise alignments from ClustalW with local pairwise alignments from LALIGN.
T-Coffee then creates an extended library from triplets in the primary library, in
which a third sequence is used to align the other two and so supports or creates a
pairwise alignment between them; duplicate pairs are merged into a single entry. This
extended library is in effect a list of weighted residue pairs. The final multiple
sequence alignment is generated from it by progressive alignment. 3D-Coffee is an
extension of T-Coffee that uses a specialized server to create a 3D structure
alignment, that is, a superposition showing where two structures overlap. This
superposition is used to increase the weight of the corresponding residue pairs in the
library.
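The library extension can be sketched in Python as follows. The data layout (a dictionary keyed by pairs of (sequence, position) residues), the function name, and the weights are assumptions for illustration. The rule applied is the triplet rule: a residue pair gains, for every residue of a third sequence aligned to both, the smaller of the two connecting weights.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Extend a primary library of weighted residue pairs through triplets.

    primary maps frozenset({(seq_a, pos_a), (seq_b, pos_b)}) -> weight.
    """
    neighbours = defaultdict(dict)
    for pair, weight in primary.items():
        x, y = tuple(pair)
        neighbours[x][y] = weight
        neighbours[y][x] = weight

    extended = defaultdict(float)
    for pair, weight in primary.items():
        extended[pair] += weight            # keep the direct evidence
    for z, linked in neighbours.items():
        # z is the residue of the third sequence; x and y are both aligned to it.
        for (x, w_xz), (y, w_zy) in combinations(linked.items(), 2):
            if x[0] != y[0]:                # x and y come from different sequences
                extended[frozenset((x, y))] += min(w_xz, w_zy)
    return dict(extended)


def pair(a, b):
    """Helper for readable keys: pair(("A", 0), ("B", 0))."""
    return frozenset((a, b))


# Hypothetical primary library for three sequences A, B and C.
primary = {
    pair(("A", 0), ("B", 0)): 88,
    pair(("A", 0), ("C", 0)): 77,
    pair(("C", 0), ("B", 0)): 100,
    pair(("A", 1), ("C", 1)): 77,
    pair(("C", 1), ("B", 2)): 100,
}
extended = extend_library(primary)
print(extended[pair(("A", 0), ("B", 0))])
print(extended[pair(("A", 1), ("B", 2))])
```

The pair A0-B0 is reinforced because both residues are also aligned to C0, and the pair A1-B2, absent from the primary library, is created through C1.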
7.5.4 FAlign
FAlign uses a combination of iterative and progressive algorithms to align multiple
sequences. To use FAlign, users need to identify and define all the regions in all the
sequences where a motif is present. The sequences are split at the motif boundaries,
and the resulting segments are aligned progressively. The aligned segments are then
assembled into a full alignment. FAlign uses BLOSUM62 as the score matrix for
computing a sum-of-pairs score.
The alignment score is then improved by iteratively and randomly shifting the gaps in
the non-motif regions.
