Sequence Analysis 2
: 06 Computational Biology
Principal Investigator: Dr. Vibha Dhawan, Distinguished Fellow and Sr. Director
The Energy and Resources Institute (TERI), New Delhi
Computational Biology
Biotechnology
Sequence Analysis – II – Multiple sequence alignment
Description of Module
Subject Name Biotechnology
Module Id 03
Pre-requisites
Objectives
Keywords
Sequence Comparison: Multiple Alignments
A global multiple alignment of k > 2 strings S = {S1, S2, S3, ⋯, Sk} is a natural generalization of the
alignment of two strings. Chosen spaces are inserted in between, or at either end of, each of the k
strings so that the resultant strings all have the same length L. The strings are then arrayed in k rows
and L columns so that each character and space of each string occupies a unique row and column.
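The definition above can be turned into a simple check. The following is a minimal sketch, assuming that spaces are written as "-"; the function name `is_valid_alignment` and the example strings are hypothetical and are not taken from the module.

```python
GAP = "-"  # assumed symbol for an inserted space


def is_valid_alignment(originals, aligned):
    """Return True if `aligned` is a global multiple alignment of `originals`."""
    # One aligned row is needed for every input string.
    if len(originals) != len(aligned):
        return False
    # All k rows must share the same length L.
    if len({len(row) for row in aligned}) != 1:
        return False
    # Removing the inserted spaces must give back each input string unchanged.
    return all(row.replace(GAP, "") == s for s, row in zip(originals, aligned))


originals = ["ACGT", "AGT", "ACT"]
aligned = ["ACGT", "A-GT", "AC-T"]
print(is_valid_alignment(originals, aligned))
```

Note that the check only tests whether the rows form a legal alignment; it says nothing about whether the alignment is a good one.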
In order to understand the importance of constructing multiple alignments in Biology, one needs to
first understand the concept of homology. A pair of homologous sequences is said to have been
derived from a common ancestor by a process of evolutionary divergence. Homologous sequences
are often similar to each other; however, the converse is not always true: sequence-level
similarities can arise even when the sequences are not derived from a common ancestor, in other
words, when they are not homologous. In the literature one often encounters a rather
peculiar notion called “percent homology”, usually meant as a quantitative measure of similarity
between two sequences. This is a false notion because, in its precise biological meaning, homology is
a qualitative concept. Two sequences are either homologous or they are not; they cannot exhibit a
particular level of homology or percent homology.1 When spaces are inserted in a multiple
alignment in such a way as to maximize the similarity of the amino acids/bases in each column of
the alignment, the result can be thought of as a hypothesis of positional homology between the
bases/amino acids in a set of nucleic acid or protein sequences. It should be noted that a pair of
sequences can be partially homologous, in the sense that one part of the sequence shares a common
ancestor while another part does not. Such cases can arise through a number of biological processes,
most commonly gene fusion.
Multiple sequence alignments (MSAs) can be used for highlighting the similarities and differences
among proteins belonging to different structural or functional families. As a hypothesis of positional
homology, an MSA becomes the input for phylogenetic tree construction algorithms. By highlighting
regions within a gene or protein that are exceptionally conserved, it can be used to identify and
characterize residues of exceptional functional importance. Accurate construction of MSAs is
therefore an essential skill that practicing biologists of many different specializations must
pick up.
Figure 1. An example multiple alignment of four protein fragments. The numbers at the top of the figure denote
the column index of the alignment. If there were no indels (indicated by dots) then this would correspond to the
residue index of the proteins to be aligned. The alignment has been annotated by colouring those columns that
are identical in all four proteins red, and those that are physico-chemically similar blue. Arrows below the
alignment indicate secondary structure elements that are found in some (or all) of the proteins.
1 Reeck et al. (1987) “Homology” in Proteins and Nucleic Acids: A terminology muddle and a way out of it.
Cell 50, 667.
Accuracy of Multiple Sequence Alignments
A number of factors can affect the accuracy of an MSA. One can, in principle, construct alignments
from gene sequences, i.e., DNA sequences using a four-letter alphabet (A, T, G, C), or one can first
translate the gene sequences to the 20-letter protein alphabet before alignment. Because of the
small size of the DNA alphabet, there is a much greater chance of getting random matches with DNA
sequences than with the corresponding protein sequences. Hence it is much easier to construct
accurate MSAs with protein sequences than with DNA sequences. Even when DNA sequence alignments
are required, it is not unusual to first construct a protein alignment and then to back-translate the
protein alignment to the corresponding DNA alphabet.
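The back-translation step mentioned above can be sketched as follows. This is a minimal illustration, assuming that the coding DNA of each protein is available, that it is in frame, and that gaps are written as "-"; the function name `back_translate` and the toy sequences are hypothetical.

```python
def back_translate(aligned_protein, coding_dna, gap="-"):
    """Thread an unaligned coding DNA sequence onto its aligned protein."""
    residues = aligned_protein.replace(gap, "")
    if len(coding_dna) < 3 * len(residues):
        raise ValueError("DNA is too short for the protein sequence")
    codons = []
    pos = 0  # index of the next unused codon in coding_dna
    for ch in aligned_protein:
        if ch == gap:
            codons.append(gap * 3)  # a gap in the protein becomes a 3-base gap
        else:
            codons.append(coding_dna[pos:pos + 3])
            pos += 3
    return "".join(codons)


protein_alignment = ["MK-L", "MKAL"]
dna = ["ATGAAACTG", "ATGAAAGCTCTG"]
dna_alignment = [back_translate(p, d) for p, d in zip(protein_alignment, dna)]
for row in dna_alignment:
    print(row)
```

Because each residue is replaced by its own codon, gaps in the resulting DNA alignment always fall between codons and never split one.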
A second factor that can affect the accuracy of an MSA is the average sequence identity of the
sequences that are to be aligned. In the case of proteins, if the average sequence identity is greater
than 60%, alignments become extremely easy to construct and can be produced by automatic
methods without any manual intervention or post-alignment correction. When the
average sequence identity falls between 30% and 60%, different automatic alignment programs
return slightly different results. In such cases, it is better to run a number of different
alignment algorithms on the same sequence data and attempt to create a consensus alignment that
agrees best with the different algorithms. Manual post-alignment correction may
often be required in this case. MSAs become extremely difficult to construct when the average
sequence identity lies between 10% and 30%. This region is often called the “twilight zone”: some
information is available, but not enough to construct an accurate alignment from sequence alone.
Nevertheless, one can construct accurate alignments if one supplements the sequence data with
additional structural information. When the average sequence identity falls below 10%, it becomes
nearly impossible to judge the accuracy of the alignment. This region is often called the
“midnight zone”, where too little information is available to construct accurate MSAs with high
confidence.
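The identity ranges discussed above can be summarized in a small helper. This is only a sketch: it assumes the sequences have already been aligned pair-wise, it counts identity over columns where neither sequence has a gap (one of several possible conventions), and the function names and the treatment of the exact boundary values are assumptions.

```python
from itertools import combinations


def pairwise_identity(a, b, gap="-"):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def average_identity(aligned):
    """Mean of the pair-wise identities over all pairs of rows."""
    scores = [pairwise_identity(a, b) for a, b in combinations(aligned, 2)]
    return sum(scores) / len(scores)


def difficulty_zone(avg_identity):
    """Map an average percent identity to the regions described in the text."""
    if avg_identity > 60:
        return "easy: automatic methods usually suffice"
    if avg_identity >= 30:
        return "moderate: compare several programs, expect manual correction"
    if avg_identity >= 10:
        return "twilight zone: add structural information"
    return "midnight zone: accuracy cannot be judged"


rows = ["ACGT", "ACGA", "ACGT"]
print(difficulty_zone(average_identity(rows)))
```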
Multiple Sequence Alignment Algorithms
Figure 3. Dynamic programming grid for the alignment of three sequences. The grid is now in the shape of a
cuboid, three edges of which are used to represent the three sequences. The best alignment path runs close to the
body diagonal of the cuboid. For n sequences (n > 3), the corresponding grid will be an n-dimensional
hypercube.
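The growth of the grid described in the caption can be made concrete with a little arithmetic. The sketch below assumes the usual convention that a sequence of length n contributes n + 1 grid points along its edge, and that each interior cell can be reached from 2^k − 1 neighbouring cells (every non-empty choice of sequences that advance by one character). The function names are hypothetical.

```python
from math import prod


def dp_grid_cells(lengths):
    """Number of cells in the dynamic programming grid for the given sequence lengths."""
    return prod(n + 1 for n in lengths)


def predecessors_per_cell(k):
    """Number of neighbouring cells from which an interior cell can be reached."""
    return 2 ** k - 1


for k in (2, 3, 5, 10):
    print(k, dp_grid_cells([100] * k), predecessors_per_cell(k))
```

Even for short sequences the number of cells grows exponentially with the number of sequences, which is why exact dynamic programming is rarely attempted for more than a handful of sequences.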
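The sum-of-pairs (SP) score of a multiple alignment A with k rows and L columns is:
$$SP(A) = \sum_{h=1}^{L} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} w_{i,j}\, \sigma(A_{i,h}, A_{j,h})$$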
Where σ(Ai,h, Aj,h) is the score for replacing the character in the hth column of the ith sequence with
the character in the hth column of the jth sequence, and wi,j is a weighting factor. In other words, the
SP score is just the weighted sum of all possible pair-wise alignment scores of the sequences to be
aligned. The optimal MSA is the one that maximizes (or minimizes) the SP score, depending on
whether the character comparison function σ(A,B) is a similarity score or a penalty score. Cast in this
way, the multiple alignment problem based on the SP score becomes an optimization problem, which
can, in principle, be solved by a function optimization method. However, the straightforward solution
using dynamic programming has been shown to be impractical, as discussed in the previous section.
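The SP score of a given alignment is straightforward to compute. The following is a minimal sketch, assuming unit weights unless a weight matrix is supplied; the function `unit_penalty` is a hypothetical character comparison function (a penalty score), chosen only for illustration.

```python
from itertools import combinations


def sp_score(alignment, sigma, weights=None):
    """Weighted sum of the pair-wise scores induced by a multiple alignment."""
    total = 0.0
    for i, j in combinations(range(len(alignment)), 2):
        w = 1.0 if weights is None else weights[i][j]
        # Score rows i and j column by column.
        total += w * sum(sigma(a, b) for a, b in zip(alignment[i], alignment[j]))
    return total


def unit_penalty(a, b):
    """Hypothetical penalty: 0 for identical symbols (including two spaces), 1 otherwise."""
    return 0 if a == b else 1


alignment = ["ACGT", "A-GT", "AC-T"]
print(sp_score(alignment, unit_penalty))
```

With a penalty function such as this one, a better alignment has a lower SP score; with a similarity function the sense is reversed.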
A number of heuristic solutions have been suggested that optimize the SP score without
guaranteeing that the final score will be the true maximum (or minimum). One such algorithm is
the method of ‘Centre-star alignment’, which can be described as follows:
Given two strings S and T, define D(S,T) as the value of the minimum penalty for aligning S and T. If
the input is a set So of k strings, find S1 ∈ So that minimizes:
$$\sum_{S \in S_o \setminus \{S_1\}} D(S_1, S)$$
This can be done by running the 2-sequence dynamic programming algorithm on all the 𝑘(𝑘 − 1)/2
pairs of strings. The string 𝑆1 can be thought to be like an “average” of all the strings in the set,
because it is most similar to all of them. Call the remaining strings in So {𝑆2 , 𝑆3 , ⋯ , 𝑆𝑘 }. Add these
strings one at a time to the multiple alignment that initially contains only S1 as follows:
Suppose {S1, S2, ⋯, Si−1} are already aligned as {S1′, S2′, ⋯, Si−1′}. Run the 2-sequence dynamic
programming algorithm on S1′ and Si to produce S1′′ and Si′. Adjust {S2′, S3′, ⋯, Si−1′} by adding spaces
to those columns where spaces were added to obtain S1′′ from S1′. Replace S1′ by S1′′. Although the
Centre-star algorithm does not necessarily return the absolute minimum SP penalty, it can be shown
that there exists an upper bound to the error introduced, provided the pair-wise penalty function
satisfies the triangle inequality. Thus if v(M) is the SP penalty of the multiple alignment M of k
strings produced by the algorithm and v(M*) is the theoretically attainable minimum penalty, then it
can be shown that:
$$\frac{v(M)}{v(M^{*})} \le \frac{2(k-1)}{k} < 2$$
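The centre-star procedure described above can be sketched in a few dozen lines. This is an illustrative implementation under stated assumptions, not a reference one: it uses a hypothetical unit penalty (0 for identical symbols, 1 otherwise, with a space opposite a space costing nothing), writes spaces as "-", and breaks ties by taking the first candidate. The function names are invented for this example.

```python
GAP = "-"


def cost(a, b):
    """Assumed unit penalty; a space opposite a space costs nothing."""
    return 0 if a == b else 1


def align_pair(s, t):
    """2-sequence dynamic programming: returns (penalty, aligned s, aligned t)."""
    n, m = len(s), len(t)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + cost(s[i - 1], GAP)
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + cost(GAP, t[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]),
                          D[i - 1][j] + cost(s[i - 1], GAP),
                          D[i][j - 1] + cost(GAP, t[j - 1]))
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + cost(s[i - 1], GAP):
            a.append(s[i - 1]); b.append(GAP); i -= 1
        else:
            a.append(GAP); b.append(t[j - 1]); j -= 1
    return D[n][m], "".join(reversed(a)), "".join(reversed(b))


def add_new_gaps(old_centre, new_centre, row):
    """Insert spaces into `row` wherever the centre gained a new column."""
    out, p = [], 0
    for ch in new_centre:
        if p < len(old_centre) and ch == old_centre[p]:
            out.append(row[p]); p += 1  # column already existed
        else:
            out.append(GAP)             # column created by a new space
    return "".join(out)


def centre_star(strings):
    k = len(strings)
    dist = [[0] * k for _ in range(k)]
    for i in range(k):                  # k(k-1)/2 pair-wise alignments
        for j in range(i + 1, k):
            dist[i][j] = dist[j][i] = align_pair(strings[i], strings[j])[0]
    c = min(range(k), key=lambda i: sum(dist[i]))  # index of the centre string
    centre, rows = strings[c], []
    others = [i for i in range(k) if i != c]
    for i in others:
        _, new_centre, new_row = align_pair(centre, strings[i])
        rows = [add_new_gaps(centre, new_centre, r) for r in rows]
        rows.append(new_row)
        centre = new_centre
    result = [None] * k                 # restore the input order
    result[c] = centre
    for i, r in zip(others, rows):
        result[i] = r
    return result


for row in centre_star(["ACGT", "AGT", "ACT"]):
    print(row)
```

The helper `add_new_gaps` implements the "adjust" step: every space newly added to the centre string is copied into the rows that were aligned earlier, so all rows keep the same length.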
Progressive Alignment
Given that a multiple sequence alignment can be considered to be a hypothesis of positional
homology between a set of sequences, it is to be expected that a biologically meaningful algorithm
for producing a multiple alignment should make use of evolutionary information. With this idea in
mind, a number of algorithms have been devised that make explicit or implicit use of evolutionary
information. The best known class of such algorithms is known as progressive alignment, which is
based on an idea first proposed by Feng and Doolittle.2 Given k strings to align, the first step in
progressive alignment is to carry out all possible pair-wise alignments of the k sequences, i.e.,
k(k − 1)/2 pair-wise alignments are carried out. A distance matrix whose elements are the scores of these
2 Feng, D. and Doolittle, RF. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic
trees. J. Mol. Evol. 25, 351-60.
pair-wise alignments is then constructed, which becomes the basis for arranging the sequences in
the form of a phylogenetic tree, also known as the guide tree. The tree begins with a root node that
bifurcates into two branches. Each branch ends either in an intermediate node, which may in turn
bifurcate further, or in a terminal node (or leaf), which does not bifurcate.
The leaves (terminal nodes) of the tree represent the sequences. The intermediate nodes represent
local evolutionary divergence points and the root node represents the common ancestor of all
the sequences to be aligned. The evolutionary distance between a pair of sequences is given by
the combined length of the branches connecting the terminal nodes that represent them. Given
such a guide tree, the multiple alignment is now constructed in the following step-wise
fashion:
i. The alignment is seeded with the two most closely related sequences, i.e., the
sequences separated by the shortest distance on the guide tree. Once they are
aligned, a profile or position specific weight matrix is constructed. This is a matrix
whose elements are the frequencies of the different characters appearing in a particular
column of the alignment. The number of rows in the profile matrix equals the
alphabet size (which may also include a null character) and the number of columns
equals the aligned length of the two sequences. The two terminal nodes of the
guide tree are now merged into one terminal node that is represented by the
alignment profile of the two sequences.
ii. The modified guide tree obtained from the previous step is now searched for the
pair of terminal nodes that are now the closest. Three situations might arise at this
point which are described below:
a. Both the terminal nodes are represented by single sequences. In this case a
repeat of step i is carried out, which results in a merger of the two nodes
and their replacement with a new terminal node represented by a partial
alignment with a profile matrix created from the sequences in the partial
alignment.
b. One of the two nodes is represented by a single sequence, while the other is
represented by a partial alignment or profile with its corresponding profile
matrix. In this case a sequence-profile alignment is carried out, which is a
modification of the standard 2-sequence alignment algorithm with the
following modified recursive relations:
For a character y and a column j of the profile matrix, let p(y,j) be the frequency
with which character y appears in column j of the profile. Let S(x, j) =
∑y[σ(x, y) × p(y, j)] be the score for aligning the character x with column j
of the profile. With this notation, the recursive relations can be written
down as under:
$$V(0, j) = \sum_{k \le j} S(-, k)$$
If a space is inserted in the profile, it means that a space has to be inserted
in the same position of every sequence represented by the profile. The
aligned sequence is now added to the profile and a new profile matrix
created. Finally the two nodes are merged and a new node represented by
the newly updated profile is created.
The main advantage of progressive alignment is that it makes explicit use of evolutionary
information, which makes it meaningful for most biological applications. However, there are a few
limitations as well. One of the main disadvantages of the method is that it lacks an objective
function; therefore the accuracy of the alignment cannot be quantitatively determined. Moreover,
once a partial alignment is made, it is never changed. If an error creeps into a partial alignment
early on, it can never be corrected, and the entire alignment may become seriously compromised.
In spite of the above limitations, the method
of progressive alignment enjoys immense popularity among practicing biologists. One of its
implementations, known as ClustalW,3 was for a long time the de facto standard multiple sequence
alignment application.
3 Larkin, MA et al. (2007) ClustalW and ClustalX version 2.0. Bioinformatics 23 (21), 2947-2948.
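The order in which progressive alignment merges sequences is read off the guide tree. The sketch below builds such a tree from a distance matrix by repeatedly merging the closest pair of nodes, using average linkage between groups. The clustering rule is an assumption made for illustration (the text does not prescribe a particular tree-building method), and the function name and the distance values are hypothetical.

```python
def guide_tree(names, dist):
    """Average-linkage clustering; returns the tree as nested tuples of names."""
    # Each cluster id maps to (subtree, indices of the sequences it contains).
    clusters = {i: (name, [i]) for i, name in enumerate(names)}
    next_id = len(names)

    def average_distance(a, b):
        members_a, members_b = clusters[a][1], clusters[b][1]
        total = sum(dist[i][j] for i in members_a for j in members_b)
        return total / (len(members_a) * len(members_b))

    while len(clusters) > 1:
        ids = sorted(clusters)
        pairs = [(a, b) for n, a in enumerate(ids) for b in ids[n + 1:]]
        a, b = min(pairs, key=lambda pair: average_distance(*pair))
        (tree_a, members_a), (tree_b, members_b) = clusters.pop(a), clusters.pop(b)
        # The two closest nodes are merged into one new node.
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree


names = ["A", "B", "C", "D"]
dist = [[0, 2, 6, 10],
        [2, 0, 6, 10],
        [6, 6, 0, 8],
        [10, 10, 8, 0]]
print(guide_tree(names, dist))
```

The innermost tuple corresponds to the pair that seeds the alignment; each enclosing tuple corresponds to one later merge of a sequence or profile with the growing alignment.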
Unfortunately, depending on which physico-chemical parameters are being considered, several quite
different parametrizations of the character comparison functions are possible and it is often not
clear at the outset which function to use for a given biological problem.
An alternative way to solve the problem is to consider the observed rate of substitution between
amino acid pairs in homologous proteins and use that to parametrize the comparison function. A
simple count of substitutions between pairs of amino acids in homologous proteins is, however, not a
good way to parametrize the comparison function, because such counts will be higher for commonly
occurring amino acids regardless of the substitution probability between them. One thus has to
normalize the substitution counts by the observed frequencies of the amino acids themselves, and
one obtains what is known as the log odds score.4
$$\sigma(x, y) = \log\left(\frac{p(x, y)}{q(x)\, q(y)}\right)$$
Where 𝑝(𝑥, 𝑦) is the substitution probability of amino acid 𝑥 to amino acid 𝑦 in “correct” alignments
of homologous proteins and 𝑞(𝑥), 𝑞(𝑦) are the observed frequencies of the amino acids 𝑥 and 𝑦.
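The log odds calculation can be sketched directly from counts of aligned character pairs. The example below is a toy illustration, assuming ungapped pair-wise alignments as input, base-2 logarithms, and background frequencies q(x) estimated from the same aligned data; each aligned pair is counted in both directions so that p(x, y) = p(y, x). The function name and the two-letter example are hypothetical.

```python
import math
from collections import Counter


def log_odds_matrix(aligned_pairs):
    """Log odds scores from ungapped aligned sequence pairs."""
    pair_counts = Counter()
    for a, b in aligned_pairs:
        for x, y in zip(a, b):
            pair_counts[(x, y)] += 1  # count each aligned pair in both
            pair_counts[(y, x)] += 1  # directions, so p(x, y) = p(y, x)
    total = sum(pair_counts.values())
    p = {xy: n / total for xy, n in pair_counts.items()}
    q = Counter()
    for (x, _), prob in p.items():
        q[x] += prob                  # observed frequency of character x
    # Pairs that were never observed have p = 0 and are left out of the result.
    return {(x, y): math.log2(p[(x, y)] / (q[x] * q[y])) for (x, y) in p}


scores = log_odds_matrix([("AABB", "AABA")])
for pair in sorted(scores):
    print(pair, round(scores[pair], 3))
```

Pairs that occur more often than expected from the character frequencies alone receive positive scores, and pairs that occur less often receive negative scores.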
Naive application of the log odds score to arbitrary pairs of homologous proteins can also cause
problems. In order to understand the nature of the problem, let us use a hypothetical example.
Suppose the substitution probability for amino acid X to Y is very high, but the substitution
probability for amino acid X to Z is very low. One would imagine that between pairs of homologous
proteins, one would find a much larger fraction of X→Y than X→Z substitutions. This might indeed be
true for some pairs of homologous proteins; however, in other pairs there might be a surprisingly
high fraction of X→Z substitutions, which might be even higher than that of X→Y substitutions. This
will happen if, in our hypothetical evolutionary model, the Y→Z and Y→X substitution probabilities are
both high. Thus if the pair of homologous proteins is very similar, i.e., the time of divergence from
their last common ancestor is relatively short, then only single substitutions of the X→Y or X→Z
type will be seen. However, if the time of divergence from the last common ancestor is somewhat
longer, then substitutions of the type X→Y→Z and X→Y→X will have occurred. The former cannot be
distinguished from the simple X→Z type of substitution and the latter will be missed entirely.
A way out of this conundrum is to consider only those pairs of homologous proteins which diverged
from a common ancestor only a short while ago in terms of evolutionary time. In such cases,
multiple substitutions at the same site, i.e., transitions of the type X→Y1→Y2…Z, would not have
appeared. A log-odds substitution matrix of this type, called PAM1 (Point Accepted Mutation), was
prepared by comparing pairs of homologous proteins that were more than 85% identical.5 The PAM1
matrix, although based upon sound theoretical considerations, is not very useful for general purpose
sequence alignments because in most cases homologous proteins have undergone evolutionary
divergence for great lengths of time, during which there were ample opportunities for multiple
substitutions at the same site, something that is not accounted for in the PAM1 substitution matrix.
There is, however, an elegant approach to extrapolate from the PAM1 matrix so as to obtain
substitution matrices capable of handling multiple substitutions at the same site. In order to
4 It is common to use the log odds value rounded to the nearest integer. Although an approximation, this does
not significantly degrade the quality of alignments produced.
5 Calculation of the substitution matrix requires accurate positioning of the two sequences,
so that homologous positions are properly matched. This matching would normally require an alignment
to be performed, and the alignment in turn requires a substitution matrix. However, when sequences are
more than 85% identical, there is little chance of large blocks of indels appearing. The alignments, in
this case, can be done simply by manual inspection without the need for an algorithm that depends upon
a substitution matrix.
understand how this may be done, let us consider the specific case of homologous sequence pairs
that have diverged for just long enough to allow a double substitution at the same site. Let p1(x, y) be
the substitution probability of character x → y in a single step, and p2(x, y) be the substitution
probability of the same character pair when the transition is of the type x → z → y, i.e., a double
substitution with an intermediate character z. The two are related by:
$$p_2(x, y) = \sum_{z \in S} p_1(x, z)\, p_1(z, y)$$
Figure 4. Mutation from Alanine (A) to Threonine (T) in PAM2 units of evolutionary time. The thin arrows
represent a substitution event with a given probability of substitution. Thus pAY represents the probability of
substitution from Alanine to Tyrosine and pYT denotes the probability of substitution from Tyrosine to
Threonine. The thick blue arrows denote PAM1 and the thick green arrow denotes PAM2 units of evolutionary
time.
Here the summation extends over all possible values of the character z in the alphabet S. This is
explained further in Figure 4, with the example of an Ala→Thr substitution in PAM2 units of
evolutionary time. Since two substitutions are possible within this span of time, let Ala→Tyr
be the first substitution with probability pAY and Tyr→Thr be the second substitution with probability
pYT. The probability of the Ala→Tyr→Thr substitution will now be given by the product pAY.pYT. There is
however no restriction on the path that Alanine can take to mutate into Threonine. Thus it is
possible that instead of Tyrosine any of the other 19 amino acids is chosen as the intermediate
state. Since the substitution paths are mutually exclusive alternatives, the
total substitution probability from Alanine to Threonine will be the sum of the probabilities of the
individual Ala→X→Thr paths (where X is any amino acid), which leads to the formula given above.
One immediately notices that the formula thus derived is identical to the well known formula for
matrix multiplication. Thus to obtain the PAM2 matrix capable of handling double substitutions, one
simply has to multiply the PAM1 matrix by itself. Repeating this process, one can easily obtain
substitution matrices capable of handling any given number of substitutions between homologous
proteins. Even though the above extrapolation is reasonable, one may further argue that in the case of
arbitrary pairs of proteins, an unknown amount of evolutionary time must have passed since their
emergence from their common ancestor; hence it is not clear at the outset which PAM matrix
one should use. However, it is observed that as one calculates higher and higher powers of the
basic PAM1 matrix, the elements of the resulting matrices tend to settle down to stationary
values which do not change further in any significant way. This behaviour is consistent with the
general intuition that after a sufficiently large evolutionary time period the degree of divergence
settles down to a particular level and does not change significantly any further. Hence, for arbitrary
pairs of proteins whose degree of divergence cannot be specified exactly but is known to be large,
one can safely use any power of the PAM matrix, as long as it is sufficiently high. In
common practice PAM250 (i.e., the PAM1 matrix raised to the power 250) is routinely used for
proteins with unknown, but high levels of divergence.
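The extrapolation by repeated matrix multiplication, and the settling down of high powers to stationary values, can be illustrated with a toy example. The sketch below uses a hypothetical three-letter alphabet and invented one-step substitution probabilities (each row sums to 1); it is not the actual PAM1 matrix. It works on substitution probabilities only, so the conversion to log odds scores is left out.

```python
def mat_mul(A, B):
    """Matrix product: entry (x, y) is the sum over z of A[x][z] * B[z][y]."""
    n = len(A)
    return [[sum(A[x][z] * B[z][y] for z in range(n)) for y in range(n)]
            for x in range(n)]


def mat_power(M, n):
    """M multiplied by itself so that the result represents n substitution steps."""
    size = len(M)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical one-step substitution probabilities for a 3-letter alphabet.
P1 = [[0.98, 0.015, 0.005],
      [0.01, 0.98, 0.01],
      [0.005, 0.015, 0.98]]

P2 = mat_power(P1, 2)
print("two steps:", [round(v, 6) for v in P2[0]])
for n in (50, 250, 1000):
    Pn = mat_power(P1, n)
    # As n grows, the rows approach one another: the starting character
    # matters less and less.
    spread = max(abs(Pn[0][j] - Pn[2][j]) for j in range(3))
    print(n, "steps, largest difference between first and last row:", spread)
```

In this toy model the difference between rows shrinks steadily with the number of steps, which mirrors the observation that sufficiently high powers of the matrix stop changing in any significant way.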
The most important advantage of the PAM series of substitution matrices is that they are based
not on the perceived biological importance of arbitrary physico-chemical parameters, but on an
explicit evolutionary model. A second advantage is that since the mutations are calculated based on
global alignments, they apply equally well to highly conserved and to highly mutable regions.
However, there are also some limitations. It is well known that rates of mutation vary among
different regions within a protein and also vary between different proteins. Moreover, within the
same homologous set of proteins, the rates of evolution may differ considerably in different
branches of the phylogenetic tree that describes the evolution of the set of proteins. The
extrapolation process used in the derivation of PAM matrices of higher indices nevertheless makes
the assumption that the rates of mutation are constant throughout. A second serious difficulty of the
PAM matrices, also associated with the extrapolation process, is that even a small error in the initial
parameters (i.e., the elements of the PAM1 matrix) may lead to large errors in PAM matrices of
higher powers due to the process of repeated matrix multiplication. Some of these limitations have
been addressed in alternative versions of substitution matrices like the BLOSUM series of matrices,
which will be discussed below.
The BLOSUM series of substitution matrices find their origin in the BLOCKS database created by
Henikoff and Henikoff.6 The BLOCKS database is a collection of ungapped multiply aligned
subsequences from highly conserved regions in protein families. The subsequences are those
which display a specific sequence pattern or motif. Such motifs are often found to be associated
with specific protein functions or to have other biological importance.7 The blocks used to derive
the matrices characterize more than 500 groups of related proteins and can therefore be considered to
represent the natural sequence diversity contained in the motifs. With the large amount of
alignment data available, it is straightforward to calculate a log odds matrix. There is also no need
for extrapolation because one can simply subdivide the input data into sequence sets whose
individual members have a similarity greater than a predetermined cut-off; each such set is then
treated as a single sequence when substitutions are counted. The BLOSUM90 matrix,
for example, was created from sequence sets that were at least 90% similar, while the BLOSUM50
matrix was created from sequence sets that were at least 50% similar.
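The subdivision of a block into sequence sets at a given cut-off can be sketched as a simple clustering step. The following is an illustration only, assuming that similarity is measured as percent identity over the ungapped block and that a sequence joins a set as soon as it reaches the cut-off with any one member of it (single linkage). The function names and the toy block are hypothetical, and the subsequent counting and weighting of substitutions is not shown.

```python
def identity(a, b):
    """Percent identity between two rows of an ungapped block."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_block(block, cutoff):
    """Group rows so that each row is >= `cutoff` % identical to a member of its set."""
    clusters = []
    for seq in block:
        keep, merged = [], [seq]
        for cluster in clusters:
            if any(identity(seq, member) >= cutoff for member in cluster):
                merged.extend(cluster)  # seq links this set to the new one
            else:
                keep.append(cluster)
        clusters = keep + [merged]
    return clusters


block = ["ACDEF", "ACDEY", "ACDWY", "WWWWW"]
for cutoff in (80, 100):
    print(cutoff, cluster_block(block, cutoff))
```

Lowering the cut-off merges more rows into the same set, so fewer closely related pairs contribute to the counts and the resulting matrix reflects more diverged sequences.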
There is a difference in the way the PAM and BLOSUM matrices are numbered. Low
order PAM matrices (like PAM1) should be used for highly similar proteins, while higher order PAM
matrices (like PAM250) should be used for proteins which are more diverged. In the case of the BLOSUM
series, high order matrices (like BLOSUM90) should be used for proteins that are most similar, while
lower order matrices (like BLOSUM62) should be used for more diverged proteins.
6 Henikoff, S. and Henikoff, JG. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad.
Sci. USA 89 (22), 10915-9.
7 Sequence motifs will be discussed in detail in a subsequent module.
Summary
To summarize the above discussion one can make the following critical points:
the alignment protocol will return the best possible alignment consistent with that scoring
function. However, for the alignment to be biologically meaningful, it is imperative to use a
scoring function that makes biological sense. Depending on the particular biological problem
at hand, one can think of a number of different scoring functions. For example, one can
build a scoring function based on the similarities and differences between amino acids
with respect to a set of physico-chemical parameters. However, the arbitrariness involved in
choosing the set of relevant physico-chemical parameters makes such a scoring function
essentially arbitrary, and it is unlikely to be suitable for most biological problems. It can be
argued that for the large majority of biologically significant processes, the underlying mechanism
of the process is rooted in the theory of evolution. Hence a scoring function based on
evolutionary principles is likely to work for most biological problems. The PAM and the
BLOSUM series of scoring functions are based on evolutionary principles and are the most
popularly used scoring functions.
The PAM (Point Accepted Mutation) series of scoring functions calculates the log odds ratio of
amino acid x mutating to amino acid y in a certain number of steps. In PAM1, it is assumed
to occur in a single step, in PAM2 it is assumed to occur in two steps, and so on. PAM1
scoring matrices were generated by calculating the log odds ratio from alignments that were
more than 85% identical. PAM matrices with higher indices were extrapolated from PAM1 by a
process of matrix multiplication. The PAM250 scoring matrix (PAM1 raised to the power 250)
can be used for aligning sequences that are highly diverged.
The BLOSUM series of scoring matrices were generated by directly calculating the log odds
ratios from ungapped alignments whose sequences are similar to a certain degree. For example, the
alignments used to calculate the BLOSUM62 matrix were subdivided into sequence sets whose members
are at least 62% similar to each other. Unlike the PAM matrices, BLOSUM matrices of lower indices,
e.g., BLOSUM30, should be used for aligning highly diverged sequences.