Multiple Sequence Alignments
A multiple sequence alignment (MSA) is similar to a pairwise alignment, but it involves
more than two DNA or protein sequences. It is an extension of pairwise sequence
alignment and aims to assess and identify the conserved sequence regions of
protein domains and structures across a group of sequences hypothesized to be
evolutionarily related. Various analyses, such as homology modelling for prediction of
protein structure, phylogenetic analysis and motif detection, can be based on
the results of a multiple sequence alignment. For example, information from conserved
sequence motifs observed in an MSA can be used in conjunction with structural and
mechanistic information to locate the catalytic active sites of enzymes. Alignments
are also used to aid in establishing evolutionary relationships by constructing
phylogenetic trees. A multiple sequence alignment can sometimes be presented in a
tree-shaped fashion.
Both exhaustive and heuristic approaches are used in MSA. The exhaustive
alignment method examines all possible aligned positions simultaneously.
It uses an algorithm similar to the dynamic programming of pairwise alignment,
but it is computationally prohibitive and not feasible for a large data set.
Faster, heuristic alignment techniques are used instead. Heuristic algorithms
fall into three categories: the progressive alignment type, the iterative
alignment type and the block-based alignment type. The progressive
method is a step-wise assembly of a multiple alignment according to pairwise
similarity. The iterative approach works by repetitive refinement of suboptimal
alignments. The block-based method focuses on identifying regional similarities.
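To see why the exhaustive approach becomes prohibitive, it helps to count the work involved. The Python sketch below assumes the straightforward generalization of pairwise dynamic programming to N sequences, in which the score table has one axis per sequence and each cell looks back at every neighbour where at least one sequence advances by one residue. The function name exhaustive_dp_cost and the sequence length of 300 are illustrative choices, not taken from any particular program.

```python
from math import prod


def exhaustive_dp_cost(lengths):
    """Return (cells, moves_per_cell) for a full N-dimensional DP table."""
    n = len(lengths)
    # One axis per sequence, with an extra row/column for the empty prefix.
    cells = prod(length + 1 for length in lengths)
    # Each sequence either advances or stays put, but not all can stay put.
    moves_per_cell = 2 ** n - 1
    return cells, moves_per_cell


for n in (2, 3, 5, 10):
    cells, moves = exhaustive_dp_cost([300] * n)
    print(f"{n:2d} sequences of length 300: {cells:.3e} cells, {moves} moves per cell")
```

The table size grows as the product of the sequence lengths, so every added sequence multiplies the work by roughly its length.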
Module 18 |1
Many software packages, such as Clustal, T-Coffee, Phylip, MSA and MUSCLE, are
available for obtaining multiple sequence alignments.
Gap open – The penalty to open a gap. The presence of a gap is frequently given
more significance than the length of the gap. By default, the gap opening penalty is
10.
Gap extension – The penalty to extend a gap. Extending a gap leaves additional
amino acids unmatched, which is also penalized in the scoring of the alignment. By
default, the gap extension penalty is 0.20.
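The two penalties combine into what is usually called an affine gap penalty. The sketch below uses the default values quoted above. The function name and the convention that the opening penalty covers the first gap position are assumptions, since programs differ on whether the extension penalty is also charged for the first position.

```python
def affine_gap_penalty(gap_length, gap_open=10.0, gap_extend=0.20):
    """Penalty for one contiguous gap of gap_length positions.

    Assumed convention: the opening penalty is charged once, and the
    extension penalty is charged for every position after the first.
    """
    if gap_length < 0:
        raise ValueError("gap_length must be non-negative")
    if gap_length == 0:
        return 0.0
    return gap_open + gap_extend * (gap_length - 1)


one_long_gap = affine_gap_penalty(4)
two_short_gaps = 2 * affine_gap_penalty(2)
print(one_long_gap, two_short_gaps)
```

Under this convention a single gap of length 4 is much cheaper than two separate gaps of length 2, which reflects the statement that the presence of a gap matters more than its length.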
Figure 1. Multiple alignment of the protein kinase with a set of five PI3-kinases (which
have considerable overall homology to each other) has the effect of forcing the
best-conserved regions to be matched. Here the DFG motif and the important N and D (green)
residues are aligned correctly in all the sequences. In addition it is apparent that a G
(green) is also totally conserved (identical) and that three more residues are conserved in
their
Hierarchical/Progressive alignment
Figure 2. The tree method for the multiple alignment of sequences A, B, C, D, and
E. (A) Pairwise alignments are first made between all possible pairs of sequences (that is,
AB, AC, AD, and so on) to determine their relative similarity to each other (not shown).
A cluster analysis is performed on this preliminary round of alignments, and the individual
sequences are ranked in a tree according to their similarity to each other. (B) In the next
step, the most similar sequences are aligned in pairs as far as possible. These are then
aligned to the next closest sequence. This is repeated until all sequences or groups of
sequences are aligned.
In the next step, the next closest sequence based on the guide tree is aligned with
the consensus sequence using dynamic programming. More distant sequences or
sequence profiles are subsequently added one at a time in accordance with their
relative position on the guide tree. After realignment with a new sequence using
dynamic programming, a new consensus is derived, which is then used for the next
round of alignment. The process is repeated until all the sequences are aligned.
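The dynamic programming step that is repeated at each stage can be sketched for the simplest case of two plain sequences. The Python code below is a minimal Needleman-Wunsch global alignment with a linear gap penalty. The scoring values (match 1, mismatch -1, gap -2) are illustrative assumptions; a real progressive aligner would instead score a sequence against a consensus or profile using a substitution matrix and separate gap opening and extension penalties.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b; returns (row_a, row_b, score)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] is the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align two residues
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append("-")
            i -= 1
        else:
            row_a.append("-")
            row_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), score[n][m]


print(needleman_wunsch("ACGT", "AGT"))
```

In a progressive method, the same recurrence is applied at every node of the guide tree, with the two inputs being sequences or previously built alignments.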
Iterative method
A major problem with the progressive alignment method is that errors in the initial
alignments of the most closely related sequences are propagated to the MSA. This
problem is more acute when the starting alignments are between more distantly
related sequences. Iterative methods attempt to correct for this problem by
repeatedly realigning subgroups of the sequences. The objective is to improve the
overall alignment score.
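The text does not say which score is being improved. One common choice is the sum-of-pairs score, which adds up the scores of every pair of residues in every column. The sketch below assumes that objective, with illustrative scoring values and the assumed convention that a gap aligned with a gap contributes nothing. An iterative method would keep a realignment only if this number increases.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of equal-length aligned rows ('-' marks a gap)."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    total = 0
    for column in zip(*alignment):
        for x, y in combinations(column, 2):
            if x == "-" and y == "-":
                continue            # gap against gap is ignored (assumption)
            if x == "-" or y == "-":
                total += gap        # residue against gap
            elif x == y:
                total += match
            else:
                total += mismatch
    return total


shifted = ["ACGT-", "-ACGT", "ACGT-"]   # second row misplaced by one column
refined = ["ACGT", "ACGT", "ACGT"]      # after realigning the second row
print(sum_of_pairs(shifted), sum_of_pairs(refined))
```

Here the refined alignment scores higher than the shifted one, so a refinement step that produced it would be accepted.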
choice of a matrix depends on the evolutionary distances measured from the guide
tree.
Another feature is the use of adjustable gap penalties that allow more insertions and
deletions in regions that are outside the conserved domains, but fewer in conserved
regions. For example, a gap near a series of hydrophobic residues carries a higher penalty
than one next to a series of hydrophilic or glycine residues, which are common in
loop regions. In addition, gaps that are too close to one another can be penalized
more than gaps occurring in isolated loci.
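A minimal sketch of such position-dependent penalties is given below. It lowers the gap opening penalty inside runs of hydrophilic or glycine residues and leaves it unchanged elsewhere. The residue set, the minimum run length of 5 and the reduction factor of 0.5 are assumptions chosen for illustration, not the values of any specific program, and the extra penalty for closely spaced gaps is not modelled.

```python
# Residues treated as hydrophilic or loop-forming (assumed set).
HYDROPHILIC = set("GPSNDQEKR")


def position_gap_open(sequence, base=10.0, min_run=5, loop_factor=0.5):
    """Return one gap opening penalty per residue of the sequence."""
    penalties = [base] * len(sequence)
    run_start = None
    # The trailing '#' is a sentinel that closes a run at the end of the sequence.
    for i, residue in enumerate(sequence + "#"):
        if residue in HYDROPHILIC:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_run:
                for k in range(run_start, i):
                    penalties[k] = base * loop_factor
            run_start = None
    return penalties


print(position_gap_open("LLLLGSNDQLLL"))
```

In the example, only the five-residue stretch GSNDQ receives the reduced penalty, so gaps are steered towards the likely loop region and away from the hydrophobic flanks.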
The algorithm also applies a weighting scheme to increase the reliability of aligning
divergent sequences, i.e. sequences with less than 25% identity. This is done by
down-weighting redundant and closely related groups of sequences in the alignment
by a certain factor. This helps prevent similar sequences from dominating the
alignment. The weighting factor for each sequence is determined by its branch length
on the guide tree. The branch lengths are normalized by the number of sequences that
share a basal branch from the root of the tree. The value obtained for each
sequence is subsequently used to multiply the raw alignment scores of residues from
that sequence, so as to decrease the matching scores of frequent
characters in the multiple alignment and thereby increase those of infrequent
characters.
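The weighting can be sketched as follows, assuming that each branch length is divided by the number of sequences below that branch and that a sequence's weight is the sum of these shares along its path from the root. The nested-list tree format, the function names and the branch lengths are hypothetical and serve only to illustrate the idea.

```python
def count_leaves(node):
    """Number of sequences below a node; a leaf is a plain string."""
    if isinstance(node, str):
        return 1
    return sum(count_leaves(child) for child, _ in node)


def sequence_weights(node, inherited=0.0, weights=None):
    """Map each sequence name to its tree-based weight.

    An internal node is a list of (child, branch_length) pairs.
    """
    if weights is None:
        weights = {}
    if isinstance(node, str):
        weights[node] = inherited
        return weights
    for child, length in node:
        # A branch shared by several sequences is split equally among them.
        share = length / count_leaves(child)
        sequence_weights(child, inherited + share, weights)
    return weights


# A and B are close relatives; C is a distant sequence.
tree = [([("A", 0.1), ("B", 0.1)], 0.3), ("C", 0.5)]
print(sequence_weights(tree))
```

The two similar sequences A and B each receive a weight of about 0.25, half that of the divergent sequence C, so together they do not dominate the alignment.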
Pileup
Pileup is an MSA program that uses a method very similar to ClustalW. The
sequences are aligned pair-wise using the Needleman-Wunsch dynamic
programming algorithm, and the scores are used to produce a tree by the
unweighted pair-group method using arithmetic averages (UPGMA). The resulting tree
is then used to guide the alignment of the most closely related sequences and
groups of sequences. The resulting alignment is a global alignment produced by the
Needleman-Wunsch algorithm. Standard scoring matrices and gap
opening/extension penalties are used. Pileup does not guarantee an optimal
alignment.
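The tree-building step can be illustrated with a small UPGMA sketch. It assumes that the pairwise alignment scores have already been converted into distances (smaller means more similar); the conversion, the input format and the example distances are assumptions for illustration and are not taken from Pileup itself.

```python
def upgma(names, dist):
    """Cluster sequences by UPGMA.

    names: list of sequence labels.
    dist:  dict mapping (label_a, label_b) to a distance, one entry per pair.
    Returns (tree, merge_order); the tree is a nested tuple of labels.
    """
    clusters = {i: (name, 1) for i, name in enumerate(names)}  # id -> (tree, size)
    index = {name: i for i, name in enumerate(names)}
    d = {}
    for (a, b), value in dist.items():
        i, j = sorted((index[a], index[b]))
        d[(i, j)] = value
    next_id = len(names)
    merge_order = []
    while len(clusters) > 1:
        i, j = min(d, key=d.get)                 # closest pair of clusters
        tree_i, n_i = clusters.pop(i)
        tree_j, n_j = clusters.pop(j)
        merge_order.append((tree_i, tree_j, d[(i, j)]))
        new_d = {}
        for k in clusters:
            d_ik = d[tuple(sorted((i, k)))]
            d_jk = d[tuple(sorted((j, k)))]
            # Average over all members of the merged cluster.
            new_d[(k, next_id)] = (n_i * d_ik + n_j * d_jk) / (n_i + n_j)
        d = {p: v for p, v in d.items() if i not in p and j not in p}
        d.update(new_d)
        clusters[next_id] = ((tree_i, tree_j), n_i + n_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree, merge_order


distances = {("A", "B"): 2, ("A", "C"): 6, ("A", "D"): 10,
             ("B", "C"): 6, ("B", "D"): 10, ("C", "D"): 8}
tree, order = upgma(["A", "B", "C", "D"], distances)
print(tree)
print(order)
```

The merge order is the order in which the sequences and groups would then be aligned: the closest pair first, followed by progressively more distant sequences.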
that motifs taken from regions known to be functionally important a priori conserve
the overall phylogeny of the family, i.e. MINER reverses this scenario to look for
regions that reproduce the phylogenetic clustering, and then presents them as
putative functional sites.
MINER algorithm
MINER takes as input any multiple sequence alignment (MSA); if the sequences are
unaligned, it aligns them using ClustalW. MINER requires a minimum of five
sequences in FASTA format; however, using 25 or more sequences is recommended
to ensure sufficient evolutionary diversity. Optionally, a Protein Data
Bank (PDB) structure can be submitted to better highlight phylogenetic motif (PM)
regions. A sliding sequence window algorithm is used to quantitatively evaluate the
phylogenetic similarity between each sequence region and the whole sequence.
Distance-based trees are calculated both for the whole alignment and for each window.
Phylogenetic similarity is calculated based on tree topology, using the partition
metric algorithm. Partition metric values are recast as Z-scores. Overlapping sequence
windows scoring past some preset phylogenetic similarity Z-score (PSZ) threshold are
identified as PMs. Empirically, a window width between 5 and 10 and a
PSZ threshold between −1.5 and −2.2 (lower scores indicate greater similarity)
represent ideal default parameters for functional site prediction. MINER allows the
user to easily change these parameters as desired. By default, alignment positions
with >50% gaps are eliminated (masked). MINER can also determine the
PSZ threshold automatically, without human subjectivity. Alternatively, MINER provides
the option to identify traditional motifs using the False Positive Expectation (FPE) of a
regular expression or profile. When used in conjunction with the PM results, these
alternative approaches often provide synergistic information.
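The last steps of this procedure, turning partition metric values into Z-scores and merging overlapping windows that pass the threshold, can be sketched as below. The sketch assumes the partition metric between each window tree and the whole-alignment tree has already been computed, and it uses the population standard deviation. The function name, the input list and these choices are assumptions and do not reproduce the MINER implementation.

```python
from statistics import mean, pstdev


def phylogenetic_motif_windows(partition_metrics, window_width=5, psz_threshold=-1.5):
    """Return (z_scores, regions) for a list of per-window partition metrics.

    partition_metrics[i] is the tree distance for the window starting at
    alignment column i; regions are (start, end) column ranges, inclusive.
    """
    mu = mean(partition_metrics)
    sigma = pstdev(partition_metrics)
    if sigma == 0:
        return [0.0] * len(partition_metrics), []
    z_scores = [(value - mu) / sigma for value in partition_metrics]
    # Lower scores indicate greater similarity to the whole-alignment tree.
    hits = [i for i, z in enumerate(z_scores) if z <= psz_threshold]
    regions = []
    for start in hits:
        end = start + window_width - 1
        if regions and start <= regions[-1][1]:
            regions[-1] = (regions[-1][0], end)   # overlaps the previous window
        else:
            regions.append((start, end))
    return z_scores, regions


metrics = [10, 10, 10, 2, 2, 10, 10, 10, 10, 10]   # made-up example values
z, regions = phylogenetic_motif_windows(metrics)
print(regions)
```

In this made-up example the two windows starting at columns 3 and 4 have trees unusually close to the whole-alignment tree, and because they overlap they are reported as a single candidate region.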
Mathematical methods for carrying out alignments, using empirical models, can be
extremely demanding in computer resources for more than a couple of sequences. They
are also difficult to apply when the sequences are less than 30% identical. In practice,
heuristic methods are used for all but the smallest data sets, and the most commonly
used heuristic methods are based on the progressive-alignment strategy. The idea is to
take an initial, approximate phylogenetic tree between the sequences and to gradually
build up the alignment, following the order in the tree. Although successful in a wide
variety of cases, this method suffers from its greediness. Errors made in the first alignments
cannot be rectified later as the rest of the sequences are added in. Tree-based
Consistency Objective Function for alignment Evaluation, or T-Coffee, is a
progressive method for sequence alignment that attempts to minimize this effect.
It can combine signals from heterogeneous sources (e.g. sequence-alignment
programs, structure alignments, threading, manual alignment, motifs and specific
constraints) into a unique consensus multiple sequence alignment.
Although the strategy in T-Coffee is also a greedy progressive method, it allows for
much better use of information in the early stages.
T-Coffee Algorithm
T-Coffee has two main features. First, it provides a simple and flexible means of
generating multiple alignments using heterogeneous data sources. The data from
these sources are provided to T-Coffee via a library of pair-wise alignments.
T-Coffee computes multiple alignments using a library generated from a mixture of
local and global pair-wise alignments. The second main feature of T-Coffee
is the optimization method, which is used to find the multiple alignment that best fits
the pair-wise alignments in the input library, using a progressive strategy
similar to that used in ClustalW.
Using the information in the input library, T-Coffee carries out progressive alignment in a
manner that allows us to consider the alignments between all the pairs while carrying
out each step of the progressive multiple alignment. This gives us progressive
alignment, with all its advantages of speed and simplicity, but with a far lesser
tendency to make errors like the one shown in Figure 6(a), i.e. misalignment of the
word CAT. T-Coffee is thus a progressive alignment method with the ability to consider
information from all of the sequences during each alignment step, not just those
being aligned at that stage.
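The way information from other sequences enters each pairwise comparison can be sketched with a simplified version of the library extension. For every pair of residues, the extended weight is taken to be the direct weight from the primary library plus, for each residue of a third sequence aligned to both, the smaller of the two connecting weights. The data layout (residues as (sequence name, position) tuples), the function name and the example weights are assumptions for illustration, not the T-Coffee file format.

```python
from collections import defaultdict
from itertools import combinations


def extend_library(primary):
    """Extend a primary library of weighted residue pairs.

    primary maps ((seq_x, i), (seq_y, j)) to the weight of aligning residue i
    of seq_x with residue j of seq_y (the two sequences must differ).
    Returns a dict keyed by the sorted residue pair.
    """
    neighbours = defaultdict(dict)
    extended = defaultdict(float)
    for (x, y), weight in primary.items():
        neighbours[x][y] = weight
        neighbours[y][x] = weight
        extended[tuple(sorted((x, y)))] += weight        # direct evidence
    for partners in neighbours.values():
        # Two residues aligned to the same intermediate residue support each other.
        for a, b in combinations(sorted(partners), 2):
            if a[0] != b[0]:                             # must be different sequences
                extended[(a, b)] += min(partners[a], partners[b])
    return dict(extended)


primary = {
    (("A", 0), ("B", 0)): 88,
    (("A", 0), ("C", 0)): 77,
    (("C", 0), ("B", 0)): 100,
}
print(extend_library(primary))
```

In this example the pair formed by residue 0 of A and residue 0 of B gains 77 on top of its direct weight of 88, because both residues are also aligned to residue 0 of C. Pairs that are consistent with the other pairwise alignments therefore end up with higher weights in the position-specific library used during the progressive steps.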
Figure 5. Layout of the T-Coffee strategy; the main steps required to compute a multiple
sequence alignment using the T-Coffee method. Square blocks designate procedures while
rounded blocks indicate data structures.
Figure 6. The library extension. (a) Progressive alignment. Four sequences have been
designed. The tree indicates the order in which the sequences are aligned when using a
progressive method such as ClustalW. The resulting alignment is shown, with the word CAT
misaligned. (b) Primary library. Each pair of sequences is aligned using ClustalW. In these
alignments, each pair of aligned residues is associated with a weight equal to the average
identity among matched residues within the complete alignment (mismatches are indicated
in bold type). (c) Library extension for a pair of sequences. The three possible alignments of
sequence A and B are shown (A and B, A and B through C, A and B through D). These
alignments are combined, as explained in the text, to produce the position-specific library.
This library is resolved by dynamic programming to give the correct alignment. The thickness
of the lines indicates the strength of the weight.