Sequence Analysis 2


Paper No.: 06 Computational Biology

Module: 03 Sequence Analysis – II – Multiple sequence alignment

Principal Investigator: Dr. Vibha Dhawan, Distinguished Fellow and Sr. Director,
The Energy and Resources Institute (TERI), New Delhi

Co-Principal Investigator: Prof S K Jain, Professor,
Jamia Hamdard University, New Delhi

Paper Coordinator: Dr. Indira Ghosh, Professor,
Jawaharlal Nehru University, New Delhi

Content Writer: Dr. Devapriya Choudhury, Associate Professor,
Jawaharlal Nehru University, New Delhi

Paper Reviewer: Dr. Debasisa Mohanty,
National Institute of Immunology, New Delhi

Computational Biology
Biotechnology
Sequence Analysis – II – Multiple sequence alignment
Description of Module

Subject Name: Biotechnology
Paper Name: Computational Biology
Module Name/Title: Sequence Analysis – II – Multiple sequence alignment
Module Id: 03
Pre-requisites:
Objectives:
Keywords:

Sequence Comparison: Multiple Alignments
A global multiple alignment of $k > 2$ strings $S = \{S_1, S_2, S_3, \ldots, S_k\}$ is a natural generalization of the
alignment of two strings. Chosen spaces are inserted between the characters of, or at either end of, each of the
$k$ strings so that the resulting strings all have the same length $L$. The strings are then arrayed in $k$ rows and
$L$ columns so that each character and space of each string occupies a unique row and column.
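
The definition above can be checked mechanically. The following is a minimal Python sketch, assuming the space character is written as "-"; the function name and the example strings are illustrative and not part of the module.

```python
GAP = "-"  # assumed symbol for an inserted space

def is_valid_alignment(rows, originals):
    """Return True if `rows` is a multiple alignment of `originals`."""
    if not rows or len(rows) != len(originals):
        return False
    # Every row must have the same aligned length L.
    same_length = len({len(r) for r in rows}) == 1
    # Removing the inserted spaces must give back each input string unchanged.
    degapped_ok = all(r.replace(GAP, "") == s for r, s in zip(rows, originals))
    return same_length and degapped_ok

strings = ["ACGT", "AGT", "ACT"]
alignment = ["ACGT", "A-GT", "AC-T"]   # k = 3 rows, L = 4 columns
print(is_valid_alignment(alignment, strings))  # True
```

The check says nothing about whether the alignment is a good one; it only confirms that the rows form a k by L array of the original strings.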

In order to understand the importance of constructing multiple alignments in biology, one first needs to
understand the concept of homology. A pair of homologous sequences are said to have been
derived from a common ancestor by a process of evolutionary divergence. Homologous sequences
are often similar to each other; however, the converse is not always true, i.e., sequence-level
similarities can arise even when the sequences are not derived from a common
ancestor, in other words, when they are not homologous. In the literature one often encounters a rather
peculiar notion called “percent homology”, usually meant as a quantitative measure of similarity
between two sequences. This is a false notion because, in its precise biological meaning, homology is
a qualitative concept. Two sequences are either homologous or they are not; they cannot exhibit a
particular level of homology or percent homology.¹ When spaces are inserted in a multiple
alignment in such a way as to maximize the similarity of amino acids/bases in each column of
the alignment, the alignment can be thought of as a hypothesis of positional homology between the
bases/amino acids in a set of nucleic acid or protein sequences. It should be noted that a pair of sequences can be
partially homologous, in the sense that one part of the sequences shares a common ancestor while
another part does not. Such cases can arise through a number of biological processes, most commonly
gene fusion.

Multiple sequence alignments (MSAs) can be used to highlight the similarities and differences
among proteins belonging to different structural or functional families. As a hypothesis of positional
homology, an MSA becomes the input for phylogenetic tree construction algorithms. By highlighting
regions within a gene or protein that are exceptionally conserved, it can be used to identify and
characterize residues of exceptional functional importance. Accurate construction of MSAs is
therefore an essential skill for practicing biologists across many specializations.

Figure 1. An example multiple alignment of four protein fragments. The numbers at the top of the figure denote
the column index of the alignment. If there were no indels (indicated by dots), this would correspond to the
residue index of the proteins to be aligned. The alignment has been annotated by colouring those columns that
are identical in all four proteins red, and those that are physico-chemically similar blue. Arrows below the
alignment indicate secondary structure elements that are found in some (or all) of the proteins.

¹ Reeck et al. (1987) “Homology” in Proteins and Nucleic Acids: A terminology muddle and a way out of it.
Cell 50, 667.

Accuracy of Multiple Sequence Alignments
A number of factors can affect the accuracy of an MSA. One can, in principle, construct alignments
from gene sequences, i.e., DNA sequences using a four-letter alphabet (A, T, G, C), or one can first
translate the gene sequences to the 20-letter protein alphabet before alignment. Because the DNA
alphabet is so small, there is a much greater chance of random matches between DNA sequences
than between the corresponding protein sequences. Hence it is much easier to construct accurate
MSAs with protein sequences than with DNA sequences. Even when DNA sequence alignments are
required, it is not unusual to first construct a protein alignment and then to back-translate the
protein alignment to the corresponding DNA alphabet.
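
The back-translation step can be sketched as follows. This is a minimal Python illustration under stated assumptions: gaps are written as "-", the coding DNA supplies exactly one codon per residue, and the function name `back_translate` is hypothetical. The sketch does not check that each codon actually encodes its residue.

```python
GAP = "-"  # assumed gap symbol in the protein alignment

def back_translate(aligned_protein, coding_dna):
    """Thread the codons of `coding_dna` onto one gapped protein row."""
    n_residues = len(aligned_protein.replace(GAP, ""))
    if len(coding_dna) != 3 * n_residues:
        raise ValueError("coding sequence must supply one codon per residue")
    out, pos = [], 0
    for ch in aligned_protein:
        if ch == GAP:
            out.append(GAP * 3)                   # a protein gap becomes a 3-base gap
        else:
            out.append(coding_dna[pos:pos + 3])   # copy the next codon unchanged
            pos += 3
    return "".join(out)

print(back_translate("M-K", "ATGAAA"))  # ATG---AAA
```

Applying the function to every row of a protein alignment yields a DNA alignment in which gaps always fall on codon boundaries.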

Figure 2. Average sequence identity and appropriate multiple alignment methodology.

A second factor that can affect the accuracy of an MSA is the average sequence identity of the
sequences that are to be aligned. In case of proteins, if the average sequence identity is greater than
60%, then alignments become extremely easy to construct and can be done even by automatic
methods, which do not require any manual intervention or post-alignment corrections. When the
average sequence identity falls between 30-60%, one finds that slightly different results are returned
by different automatic alignment programs. In such cases, it is better to use a number of different
alignment algorithms on the same sequence data and attempt to create a consensus alignment that
agrees best with the different algorithms. Manual post-alignment corrections of the alignment may
often be required in this case. MSAs become extremely difficult to construct when the average
sequence identity is in the region between 10-30%. Nevertheless one can construct accurate
sequence alignments, if one also includes additional structural information in the process of
alignment construction. When the average sequence identity falls below 10%, it becomes nearly
impossible to judge the accuracy of the alignment. This region of sequence identity is often
called the “midnight zone”, where too little information is available to construct accurate
MSAs with high confidence. The region between 10-30% is often called the “twilight zone”, where
one finds some information, but not enough to construct an accurate alignment, and therefore one
has to supplement the available information with structural data.
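
The thresholds described above can be turned into a small helper. The Python sketch below is only illustrative: it assumes identity is measured over columns in which both sequences have a residue (other definitions exist), it assumes a preliminary alignment is already available, and the function names and returned messages are hypothetical.

```python
from itertools import combinations

GAP = "-"  # assumed gap symbol

def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither row has a gap."""
    cols = [(x, y) for x, y in zip(a, b) if x != GAP and y != GAP]
    if not cols:
        return 0.0
    return sum(x == y for x, y in cols) / len(cols)

def average_identity(rows):
    """Mean pairwise identity over all pairs of aligned rows."""
    pairs = list(combinations(rows, 2))
    return sum(pairwise_identity(a, b) for a, b in pairs) / len(pairs)

def suggested_strategy(identity):
    """Map an average identity (0..1) to the regime described in the text."""
    if identity > 0.60:
        return "automatic alignment"
    if identity >= 0.30:
        return "consensus of several programs, with manual correction"
    if identity >= 0.10:
        return "twilight zone: add structural information"
    return "midnight zone: accuracy cannot be judged"

rows = ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEYGHIKL"]
print(suggested_strategy(average_identity(rows)))  # automatic alignment
```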

Multiple Sequence Alignment Algorithms

Dynamic programming algorithm


In theory, it is possible to construct multiple alignment algorithms by a straightforward extension of
the dynamic programming algorithms for two-sequence alignment. All that one has to do is to
increase the dimensionality of the dynamic programming grid to equal the number of sequences
being aligned, i.e., if $k$ sequences are being aligned, the dynamic programming grid becomes a
$k$-dimensional hypercube instead of the simple two-dimensional row-column matrix used in
two-sequence alignment. Although this approach is quite easy to visualize theoretically, it is fraught
with serious practical difficulties. Recall that both the space and time complexities of two-sequence
alignment are $O(mn)$, where $m$ and $n$ are the lengths of the two sequences. For multiple alignment
of $k$ sequences this becomes $O(\prod_{i=1}^{k} m_i)$, where $m_i$ is the length of the $i$-th sequence.
For example, if we have to align 20 sequences, each 100 characters long, both the time and space
requirements will be of the order of $100^{20} = 10^{40}$, which is an astronomically large number. It is
possible to devise heuristic algorithms that do not have to fill the $k$-dimensional dynamic programming
grid completely. However, even with such heuristics, alignments of more than 5-7 strings of normal
length are not practical even with the fastest available computers.
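
The size of the grid is easy to compute directly. The short Python sketch below follows the product-of-lengths estimate used in the text; the rate of one billion cells per second is an arbitrary assumption, chosen only to give a sense of scale.

```python
import math

def grid_cells(lengths):
    """Number of cells in the full DP grid: the product of the sequence lengths."""
    return math.prod(lengths)

two_sequences = grid_cells([100, 100])
twenty_sequences = grid_cells([100] * 20)

print(two_sequences)                 # 10000
print(twenty_sequences == 10 ** 40)  # True

# Assumed rate of 1e9 cells per second, purely for illustration.
years = twenty_sequences / 1e9 / (3600 * 24 * 365)
print(f"about {years:.1e} years to fill the grid")
```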

Figure 3. Dynamic programming grid for the alignment of three sequences. The grid is now in the shape of a
cuboid, three edges of which are used to represent the three sequences. The best alignment path runs close to the
body diagonal of the cuboid. For n sequences (n > 3), the corresponding grid will be an n-dimensional
hypercube.

Multiple Alignments based on Sum of Pairs (SP) Score


Let us assume a MSA with k sequences, where each sequence has an aligned length L. The SP score
can now be defined as:

$$SP_{score} = \sum_{h=1}^{L} \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} w_{i,j}\, \sigma(A_{i,h}, A_{j,h})$$

where $\sigma(A_{i,h}, A_{j,h})$ is the score for replacing the character in the $h$-th column of the $i$-th sequence with
the character in the $h$-th column of the $j$-th sequence, and $w_{i,j}$ is a weighting factor. In other words, the
SP score is just the weighted sum of all possible pair-wise alignment scores of the sequences to be
aligned. The optimal MSA is the one that maximizes (or minimizes) the SP score, depending on
whether the character comparison function $\sigma(A, B)$ is a similarity score or a penalty score. Cast in this
way, the multiple alignment problem based on the SP score becomes an optimization problem, which
can, in principle, be solved by a function optimization method. However, the straightforward solution
using dynamic programming is impractical, as discussed in the previous section.
A number of heuristic solutions have been suggested that optimize the SP score without
guaranteeing that the final score will be a maximum (or minimum). One such algorithm is
the method of ‘Centre-star alignment’, which can be described as follows:

Given two strings $S$ and $T$, define $D(S, T)$ as the value of the minimum penalty for aligning $S$ and $T$. If
the input is a set $S_o$ of $k$ strings, find the string $S_1 \in S_o$ that minimizes:

$$\sum_{S \in S_o - \{S_1\}} D(S_1, S)$$

This can be done by running the 2-sequence dynamic programming algorithm on all the $k(k-1)/2$
pairs of strings. The string $S_1$ can be thought of as an “average” of all the strings in the set,
because it is the one most similar to all the others. Call the remaining strings in $S_o$ $\{S_2, S_3, \ldots, S_k\}$. Add these
strings one at a time to the multiple alignment, which initially contains only $S_1$, as follows:

Suppose $\{S_1, S_2, \ldots, S_{i-1}\}$ are already aligned as $\{S_1', S_2', \ldots, S_{i-1}'\}$. Run the 2-sequence dynamic
programming algorithm on $S_1'$ and $S_i$ to produce $S_1''$ and $S_i'$. Adjust $\{S_2', S_3', \ldots, S_{i-1}'\}$ by adding spaces
to those columns where spaces were added to obtain $S_1''$ from $S_1'$. Replace $S_1'$ by $S_1''$.

Although the Centre-star algorithm is not guaranteed to return the absolute minimum SP penalty, it can be shown
that the error it introduces is bounded. If $v(M)$ is the SP penalty of the multiple alignment of $k$ strings
produced by the Centre-star method and $v(M^*)$ is the theoretically attainable minimum penalty, then
(for penalty functions that satisfy the triangle inequality) it can be shown that:

$$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$$
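
The Centre-star procedure and the SP penalty can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation: it assumes a unit penalty (0 for identical symbols, 1 for a mismatch or a space), unit weights $w_{i,j} = 1$, and "-" as the space character; all function names are hypothetical.

```python
from itertools import combinations

GAP = "-"

def sigma(a, b):
    """Unit penalty (an assumption): 0 for identical symbols, 1 otherwise."""
    return 0 if a == b else 1

def align_pair(s, t):
    """Minimum-penalty global alignment of s and t; returns (D, s', t')."""
    m, n = len(s), len(t)
    V = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        V[i][0] = V[i - 1][0] + sigma(s[i - 1], GAP)
    for j in range(1, n + 1):
        V[0][j] = V[0][j - 1] + sigma(GAP, t[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = min(V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]),
                          V[i - 1][j] + sigma(s[i - 1], GAP),
                          V[i][j - 1] + sigma(GAP, t[j - 1]))
    a, b, i, j = [], [], m, n
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + sigma(s[i - 1], GAP):
            a.append(s[i - 1]); b.append(GAP); i -= 1
        else:
            a.append(GAP); b.append(t[j - 1]); j -= 1
    return V[m][n], "".join(reversed(a)), "".join(reversed(b))

def add_gap_columns(old_centre, new_centre, rows):
    """Insert a space column into every row wherever new_centre gained a space."""
    out, p = [[] for _ in rows], 0
    for ch in new_centre:
        if p < len(old_centre) and ch == old_centre[p]:
            for r, row in zip(out, rows):
                r.append(row[p])       # existing column, copied unchanged
            p += 1
        else:
            for r in out:
                r.append(GAP)          # newly inserted space column
    return ["".join(r) for r in out]

def centre_star(strings):
    """Centre-star multiple alignment; rows are returned in input order."""
    k = len(strings)
    D = [[0] * k for _ in range(k)]
    for i, j in combinations(range(k), 2):
        D[i][j] = D[j][i] = align_pair(strings[i], strings[j])[0]
    c = min(range(k), key=lambda i: sum(D[i]))   # index of the centre string S1
    order, rows = [c], [strings[c]]              # rows[0] is always the centre
    for i in range(k):
        if i == c:
            continue
        _, new_centre, new_row = align_pair(rows[0], strings[i])
        rows = add_gap_columns(rows[0], new_centre, rows)
        rows.append(new_row)
        order.append(i)
    result = [None] * k
    for idx, row in zip(order, rows):
        result[idx] = row
    return result

def sp_penalty(alignment):
    """Sum-of-pairs penalty with unit weights."""
    return sum(sigma(x, y) for a, b in combinations(alignment, 2) for x, y in zip(a, b))

msa = centre_star(["ACT", "AT", "ACGT"])
print(msa)              # ['AC-T', 'A--T', 'ACGT']
print(sp_penalty(msa))  # 4
```

In this example "ACT" is chosen as the centre because its summed distance to the other two strings (1 + 1) is the smallest.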

Progressive Alignment
Given that a multiple sequence alignment can be considered a hypothesis of positional
homology between a set of sequences, it is to be expected that a biologically meaningful algorithm
for producing a multiple alignment should make use of evolutionary information. With this idea in
mind, a number of algorithms have been devised that make explicit or implicit use of evolutionary
information. The best known class of such algorithms is known as progressive alignment, which is
based on an idea first proposed by Feng and Doolittle.² Given $k$ strings to align, the first step in
progressive alignment is to carry out all possible pair-wise alignments of the $k$ sequences, i.e.,
$k(k-1)/2$ pair-wise alignments are carried out. A distance matrix whose elements are the scores of these

² Feng, D. and Doolittle, R.F. (1987) Progressive sequence alignment as a prerequisite to correct phylogenetic
trees. J. Mol. Evol. 25, 351-60.

pair-wise alignments is then constructed, which becomes the basis for arranging the sequences in
the form of a phylogenetic tree, also known as the guide tree. The tree begins with a root node that
bifurcates into two branches, each ending in an intermediate node that may in turn bifurcate
further, until a terminal node (or leaf) is reached which does not bifurcate further.
The leaves (terminal nodes) of the tree represent the sequences. The intermediate nodes represent
local evolutionary divergence points, and the root node represents the common ancestor of all
the sequences to be aligned. The evolutionary distance between a pair of sequences is denoted by
the combined length of the branches connecting the terminal nodes that represent them. Given
such a phylogenetic tree, the multiple alignment is now constructed in the following step-wise
fashion:

i. The alignment is seeded with the two most closely related sequences, i.e., the
sequences separated by the smallest distance on the guide tree. Once they are
aligned, a profile or position-specific weight matrix is constructed; this is a matrix whose
elements are the frequencies of the different characters appearing in each
column of the alignment. The number of rows in the profile matrix equals the
alphabet size (which may also include a null character), and the number of columns
equals the aligned length of the two sequences. The two terminal nodes of the
guide tree are now merged into one terminal node, which is represented by the
alignment profile of the two sequences.
ii. The modified guide tree obtained from the previous step is now searched for the
pair of terminal nodes that are now the closest. Three situations might arise at this
point which are described below:
a. Both the terminal nodes are represented by single sequences. In this case a
repeat of step i is carried out, which results in a merger of the two nodes
and their replacement with a new terminal node represented by a partial
alignment with a profile matrix created from the sequences in the partial
alignment.
b. One of the two nodes is represented by a single sequence, while the other is
represented by a partial alignment or profile with its corresponding profile
matrix. In this case a sequence-profile alignment is carried out, which is a
modification of the standard 2-sequence alignment algorithm with the
following modified recursive relations:
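For a character $y$ in column $j$ of the profile matrix, let $p(y, j)$ be the frequency
with which character $y$ appears in column $j$ of the profile. Let
$S(x, j) = \sum_{y} [\sigma(x, y) \times p(y, j)]$ be the score for aligning the character $x$ with column $j$
of the profile, and let $S_1(i)$ denote the $i$-th character of the single sequence $S_1$.
The character $x$ may also be a space, written “−”: $S(-, j)$ is the score for aligning a space
with column $j$, and $S(x, -)$ is the score for aligning $x$ with a space inserted in the profile.
With this notation, the recursive relations can be written down as under:

$$V(0, j) = \sum_{k \le j} S(-, k)$$

$$V(i, 0) = \sum_{k \le i} S(S_1(k), -)$$

$$V(i, j) = \max \left\{ \begin{array}{l} V(i-1, j-1) + S(S_1(i), j) \\ V(i-1, j) + S(S_1(i), -) \\ V(i, j-1) + S(-, j) \end{array} \right.$$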

If a space is inserted in the profile, it means that a space has to be inserted
in the same position of every sequence represented by the profile. The
aligned sequence is now added to the profile and a new profile matrix is
created. Finally the two nodes are merged and a new node represented by
the newly updated profile is created.

c. Both the nodes are represented by partial alignments or profiles. In this
case a profile-profile alignment is carried out using a straightforward
extension of the sequence-profile alignment procedure just described. As
before, inserting a space at any position of the profile means insertion of a
space at the corresponding position on all sequences represented by the
profile.
iii. The previous steps are repeated: at each step sequences are aligned and merged
into profiles, and profiles are aligned and merged into larger profiles, until a single
profile remains, which is the final multiple alignment.
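
The profile matrix and the sequence-profile recursion of step ii(b) can be sketched in Python as follows. The scoring values (match +1, mismatch −1, space −2, space against space 0) and all function names are assumptions made for illustration; only the score $V(m, n)$ is returned, without the traceback.

```python
GAP = "-"

def sigma(a, b, match=1.0, mismatch=-1.0, gap=-2.0):
    """Similarity score for two symbols (the values are illustrative assumptions)."""
    if a == GAP and b == GAP:
        return 0.0
    if a == GAP or b == GAP:
        return gap
    return match if a == b else mismatch

def build_profile(rows):
    """Column-wise character frequencies p(y, j) of a partial alignment."""
    profile = []
    for column in zip(*rows):
        freqs = {}
        for ch in column:
            freqs[ch] = freqs.get(ch, 0.0) + 1.0 / len(column)
        profile.append(freqs)
    return profile

def S(x, column):
    """S(x, j): expected score of aligning symbol x with one profile column."""
    return sum(sigma(x, y) * p for y, p in column.items())

def sequence_profile_score(seq, profile):
    """Fill V(i, j) with the recursion from the text and return V(m, n)."""
    m, n = len(seq), len(profile)
    V = [[0.0] * (n + 1) for _ in range(m + 1)]
    for j in range(1, n + 1):
        V[0][j] = V[0][j - 1] + S(GAP, profile[j - 1])
    for i in range(1, m + 1):
        V[i][0] = V[i - 1][0] + sigma(seq[i - 1], GAP)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            V[i][j] = max(V[i - 1][j - 1] + S(seq[i - 1], profile[j - 1]),
                          V[i - 1][j] + sigma(seq[i - 1], GAP),
                          V[i][j - 1] + S(GAP, profile[j - 1]))
    return V[m][n]

profile = build_profile(["AC", "AC"])
print(sequence_profile_score("AC", profile))  # 2.0
```

A profile-profile alignment, as in step ii(c), would replace the single character by a second profile column and average the column score over both sets of frequencies.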

The main advantage of progressive alignment is that it makes explicit use of evolutionary
information, which makes it meaningful for most biological applications. However, there are a few
limitations as well. One of the main disadvantages of the method is that it lacks an objective
function; therefore the accuracy of the alignment cannot be quantitatively determined. In addition,
once a partial alignment is made it is never changed, so if an error creeps into a partial
alignment early on, it can never be corrected and the entire alignment may become seriously
compromised. In spite of these limitations, progressive alignment enjoys immense popularity
among practicing biologists. One of its implementations, known as ClustalW,³ was for a long time
the de facto standard multiple sequence alignment application.

Scoring Matrices for Sequence Alignment


One important ingredient for the construction of accurate multiple alignments is the scoring
function or the character comparison function $\sigma(A, B)$. If this function is not correct, then the
multiple alignment will not be correct regardless of the method by which it is constructed. While
designing a character comparison function, care should be taken that the function provides a
numerical value of the similarity (or difference) between two characters in the sequence alphabet
that is biologically meaningful. Thus depending upon the nature of the specific biological problem at
hand, one might like to use different functions for character comparison. One of the simplest
functions to use would be the binary function which returns 1 for a character match and 0 for a
mismatch or indel. When using this function, the similarity between a pair of sequences simply
becomes their relative sequence identity. Such a function, while being useful for some biological
problems, can become a little less informative for other biological problems. For example, one finds
that even in homologous proteins that retain the same function, there are often a few changes in
the amino acid sequences. Often such changes are between pairs of chemically or structurally
related amino acids like Leucine/Isoleucine, Asparagine/Glutamine etc. Transitions between
chemically/structurally very different amino acids like Glycine/Arginine are relatively rare. In view of
this, it is natural to require that the comparison function be modified to reflect the relative ease of
transitions between similar amino acids. One way to achieve this would be to parametrize the
character comparison function on the basis of physico-chemical similarities between the amino acids.

³ Larkin, M.A. et al. (2007) ClustalW and ClustalX version 2.0. Bioinformatics 23 (21), 2947-2948.

Unfortunately, depending on which physico-chemical parameters are being considered, several quite
different parametrizations of the character comparison function are possible, and it is often not
clear at the outset which function to use for a given biological problem.
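
The difference between the binary function and a physico-chemical one can be illustrated with a short Python sketch. The amino acid grouping and the scores 2/1/0 below are assumptions chosen only for illustration; as noted above, many other parametrizations are possible.

```python
GAP = "-"

# Illustrative grouping only; other physico-chemical groupings are equally defensible.
GROUPS = ["AVLIM", "FWY", "STNQ", "KRH", "DE", "CGP"]

def binary_sigma(a, b):
    """1 for a character match, 0 for a mismatch or indel."""
    return 1 if a == b and a != GAP else 0

def group_sigma(a, b):
    """2 for identical residues, 1 for residues in the same group, 0 otherwise."""
    if GAP in (a, b):
        return 0
    if a == b:
        return 2
    return 1 if any(a in g and b in g for g in GROUPS) else 0

def similarity(row_a, row_b, sigma):
    """Column-by-column sum of the comparison function over two aligned rows."""
    return sum(sigma(x, y) for x, y in zip(row_a, row_b))

a, b = "LDGK", "IDRK"
print(similarity(a, b, binary_sigma))  # 2
print(similarity(a, b, group_sigma))   # 5
```

With the binary function the Leu/Ile and Gly/Arg columns score the same; the grouped function rewards the conservative Leu/Ile change but not Gly/Arg.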

An alternative way to solve the problem is to consider the observed rate of substitution between
amino acid pairs in homologous proteins and use that to parametrize the comparison function. A
simple count of transitions between pairs of amino acids in homologous proteins is, however, not a
good way to parametrize the comparison function, because such counts will be higher for commonly
occurring amino acids regardless of the substitution probability between them. One thus has to
normalize the substitution counts by the observed frequencies of the amino acids themselves, and
one obtains what is known as the log odds score:⁴
$$\sigma(x, y) = \log \left( \frac{p(x, y)}{q(x)\, q(y)} \right)$$

where $p(x, y)$ is the substitution probability of amino acid $x$ to amino acid $y$ in “correct” alignments
of homologous proteins and $q(x)$, $q(y)$ are the observed frequencies of the amino acids $x$ and $y$.
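
The log odds calculation can be sketched in Python as follows. The sketch assumes that the input is a list of aligned residue pairs taken from trusted alignments, that each pair is counted in both orders so that $p(x, y)$ is symmetric, and that logarithms are taken to base 2; the function name and the toy two-letter data are hypothetical.

```python
import math
from collections import Counter

def log_odds_matrix(aligned_pairs):
    """Log odds scores log2(p(x, y) / (q(x) q(y))) from observed aligned pairs."""
    joint = Counter()
    for x, y in aligned_pairs:
        joint[(x, y)] += 1      # count each pair in both orders
        joint[(y, x)] += 1      # so that p(x, y) == p(y, x)
    total = sum(joint.values())
    p = {pair: n / total for pair, n in joint.items()}
    q = Counter()
    for (x, _), prob in p.items():
        q[x] += prob            # q(x) is the marginal frequency of x
    return {(x, y): math.log2(prob / (q[x] * q[y])) for (x, y), prob in p.items()}

# Toy data over a two-letter alphabet: identities are common, A/B changes are rare.
pairs = [("A", "A")] * 8 + [("B", "B")] * 8 + [("A", "B")] * 4
scores = log_odds_matrix(pairs)
print(round(scores[("A", "A")], 2))  # 0.68
print(round(scores[("A", "B")], 2))  # -1.32
```

Pairs that are more frequent than expected by chance receive positive scores and rarer pairs negative ones. Pairs that are never observed are simply absent from the result, since their log odds would be undefined.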

Naive application of the log odds score on arbitrary pairs of homologous proteins can also cause
problems. In order to understand the nature of the problem let us use a hypothetical example.

Suppose the transition probability for amino acid X to Y is very high, but the transition probability for
amino acid X to Z is very low. One would imagine that between pairs of homologous proteins, one
would find a much larger fraction of X→Y than X→Z substitutions. This might indeed be
true for some pairs of homologous proteins; in other pairs, however, there might be a surprisingly
high fraction of X→Z substitutions, which might be even higher than that of X→Y substitutions. This will
happen if, in our hypothetical evolutionary model, the Y→Z and Y→X substitution probabilities are
both high. Thus if the pair of homologous proteins are very similar, i.e., the time of divergence from
their last common ancestor is relatively short, then only direct X→Y or X→Z substitutions will be
seen; however, if the time of divergence from the last common ancestor is somewhat longer, then
transitions of the type X→Y→Z and X→Y→X will have occurred. The former cannot be distinguished
from the simple X→Z type of substitution, and the latter will be missed entirely.

A way out of this conundrum is to consider only those pairs of homologous proteins which diverged
from a common ancestor only a short while ago in terms of evolutionary time. In such cases,
multiple substitutions at the same site, i.e., transitions of the type X→Y1→Y2…Z, would not have
appeared. A log-odds substitution matrix of this type called PAM1 (Point Accepted Mutation) was
prepared by comparing pairs of homologous proteins that were more than 85% identical.⁵ The PAM1
matrix, although based upon sound theoretical considerations, is not very useful for general-purpose
sequence alignments, because in most cases homologous proteins have undergone evolutionary
divergence for great lengths of time, during which there were ample opportunities for multiple
substitutions at the same site, something that is not accounted for in the PAM1 substitution matrix.
There is, however, an elegant approach to extrapolate from the PAM1 matrix so as to obtain
substitution matrices capable of handling multiple transitions at the same site. In order to

⁴ It is common to use the log odds value rounded to the nearest integer. Although an approximation, this does
not significantly degrade the quality of the alignments produced.
⁵ Calculation of the substitution matrix requires accurate positioning of the two sequences,
so that homologous positions are properly matched. This matching would normally require an alignment to be
performed, and the alignment in turn requires a substitution matrix. However, when sequences are more than
85% identical, there is little chance of large blocks of indels appearing. The alignments, in this case, can be
done simply by manual inspection without the need of an algorithm that depends upon a substitution matrix.

understand how this may be done let us consider the specific case of homologous sequence pairs
that have diverged for just long enough to allow double substitution at the same site. Let 𝑝1 (𝑥, 𝑦) be
the substitution probability of character 𝑥 → 𝑦 and 𝑝2 (𝑥, 𝑦) be the substitution probability of the
same character pair but when the transition is of the type 𝑥 → 𝑧 → 𝑦 i.e., a double substitution with
an intermediate character 𝑧.

Figure 4. Mutation from Alanine (A) to Threonine (T) in PAM2 units of evolutionary time. The thin arrows
represent a substitution event with a given probability of substitution. Thus pAY represents the probability of
substitution from Alanine to Tyrosine and pYT denotes the probability of substitution from Tyrosine to
Threonine. The thick blue arrows denote PAM1 and the thick green arrow denotes PAM2 units of evolutionary
time.

From simple rules of probability one obtains:

$$p_2(x, y) = \sum_{z \in S} p_1(x, z)\, p_1(z, y)$$

where the summation extends over all possible values of the character $z$ in the alphabet $S$. This is
explained further in Figure 4, with the example of an Ala→Thr substitution in PAM2 units of
evolutionary time. Since at least two substitutions are possible within this span of time, let Ala→Tyr
be the first substitution with probability pAY and Tyr→Thr be the second substitution with probability
pYT. The probability of the Ala→Tyr→Thr substitution will then be given by the product pAY·pYT. There is,
however, no restriction on the path that Alanine can take to mutate into Threonine. Thus it is
possible that instead of Tyrosine any of the other 19 amino acids is chosen as the intermediate
state. Since the individual substitution paths are mutually exclusive alternatives, the
total substitution probability from Alanine to Threonine will be the sum of the probabilities of the
individual Ala→X→Thr paths (where X is any amino acid), which leads to the formula given above.
One immediately notices that the formula thus derived is identical to the well-known formula for
matrix multiplication. Thus to obtain the PAM2 matrix capable of handling double transitions, one
simply has to multiply the PAM1 matrix by itself. Repeating this process, one can easily obtain
substitution matrices capable of handling any given degree of transitions between homologous
proteins. Even though the above extrapolation is reasonable, one may further argue that in the case of
arbitrary pairs of proteins, an unknown amount of evolutionary time must have passed since their
emergence from their common ancestor; hence it is not clear at the outset which PAM matrix
one should use. However, it is observed that as one calculates higher and higher powers of the
basic PAM1 matrix, the elements of the resulting matrices tend to settle down to stationary
values which do not change further in any significant way. This behaviour is consistent with the

general intuition that after a sufficiently large evolutionary time period the degree of divergence
settles down to a particular level and does not change significantly any further. Hence, for arbitrary
pairs of proteins, where the degree of divergence cannot be specified exactly (as long as it is
sufficiently large) one can safely use any power of the PAM matrix, as long as it is sufficiently high. In
common practice PAM250 (i.e., the PAM1 matrix raised to the power 250) is routinely used for
proteins with unknown, but high levels of divergence.
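
The extrapolation by matrix multiplication, and the settling down of high powers, can be sketched in Python. The three-letter alphabet and the one-step probabilities below are hypothetical and mutate much faster than a real PAM1 matrix. Note that the product is taken over substitution probabilities (each row sums to 1), not over log odds scores.

```python
def matmul(A, B):
    """Plain matrix product: (AB)[x][y] = sum over z of A[x][z] * B[z][y]."""
    n = len(A)
    return [[sum(A[x][z] * B[z][y] for z in range(n)) for y in range(n)]
            for x in range(n)]

def matpow(M, power):
    """M raised to a non-negative integer power by repeated squaring."""
    n = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = M
    while power > 0:
        if power & 1:
            result = matmul(result, base)
        base = matmul(base, base)
        power >>= 1
    return result

# Hypothetical one-step substitution probabilities for a 3-letter alphabet.
P1 = [[0.90, 0.06, 0.04],
      [0.05, 0.90, 0.05],
      [0.02, 0.08, 0.90]]

P2 = matmul(P1, P1)      # two-step probabilities p2(x, y)
P250 = matpow(P1, 250)   # many steps: the rows become practically identical

print(round(P2[0][1], 4))  # 0.1112
for row in P250:
    print([round(v, 4) for v in row])
```

Here `P2[0][1]` equals 0.90·0.06 + 0.06·0.90 + 0.04·0.08, the sum over the three possible intermediate characters. In the high power every row is practically the same, so the outcome no longer depends on the starting character.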

The most important advantage of the PAM series of substitution matrices is that they are based
not on the perceived biological importance of arbitrary physico-chemical parameters, but on an
explicit evolutionary model. A second advantage is that since the mutations are calculated based on
global alignments, they apply equally well to highly conserved as well as highly mutable regions.
However, there are also some limitations. It is well known that rates of mutation vary among
different regions within a protein and also vary between different proteins. Moreover, within the
same homologous set of proteins, the rates of evolution may differ considerably in different
branches of the phylogenetic tree that describes the evolution of the set of proteins. The
extrapolation process used in the derivation of PAM matrices of higher indexes makes the
assumption that the rates of mutations are constant throughout. A second serious difficulty of the
PAM matrix, also associated with the extrapolation process is that even a small error in the initial
parameters (i.e., the elements of the PAM1 matrix) may lead to large errors in PAM matrices of
higher powers due to the process of repeated matrix multiplication. Some of these limitations have
been addressed in alternative versions of substitution matrices like the BLOSUM series of matrices
which will be discussed below.

The BLOSUM series of substitution matrices find their origin in the BLOCKS database created by
Henikoff and Henikoff.⁶ The BLOCKS database is a collection of ungapped multiply aligned
subsequences from highly conserved regions in protein families. The subsequences were those
which display a specific sequence pattern or motif. Such motifs are often found to be associated
with specific protein functions or have other biological importance.⁷ The matrices were derived from
about 2000 such blocks, characterizing more than 500 groups of related proteins, and the database
can therefore be considered to represent the natural sequence diversity contained in the motifs.
With the large amount of alignment data available, it is straightforward to calculate a log odds
matrix. There is also no need for extrapolation, because one can simply subdivide the input data
into sequence sets whose individual members have a similarity greater than a predetermined cut-off.
The BLOSUM90 matrix, for example, was created from sequence sets that were at least 90% similar,
while the BLOSUM50 matrix was created from sequences that were at least 50% similar.

The PAM and BLOSUM matrices are numbered in opposite senses. Low-order
PAM matrices (like PAM1) should be used for highly similar proteins, while higher-order PAM
matrices (like PAM250) should be used for proteins which are more diverged. In the case of the BLOSUM
series, high-order matrices (like BLOSUM90) should be used for proteins that are most similar, while
low-order matrices (like BLOSUM62) should be used for more diverged proteins.

⁶ Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad.
Sci. USA 89 (22), 10915-9.
⁷ Sequence motifs will be discussed in detail in a subsequent module.

Summary
To summarize the above discussion one can make the following critical points:

• A multiple alignment is a straightforward generalization of the two-sequence alignment
process, in which a set of more than two sequences is taken and spaces are inserted
into them in such a way that the lengths of all the sequences become equal.
• Often the spaces are inserted in such a way that residues that are similar in terms of
functional properties are brought together in the same column of the alignment.
• A multiple alignment provides a succinct description of the similarities and differences
among members of a homologous group of proteins/genes. In this way it becomes a
hypothesis that seeks to define the evolutionary pathways followed by that group of
proteins/genes.
• While it is possible, in principle, to carry out multiple alignments by extending the dynamic
programming based 2-sequence alignment algorithms to handle more than two sequences,
such an algorithm will have an unacceptable level of run-time complexity. Hence, dynamic
programming is almost never used to align more than two sequences in practice.
• The accuracy of different multiple alignment algorithms depends upon the diversity of the
sequences that are to be aligned. If the average sequence identity is greater than 60%, then
any alignment method will give acceptable results. If the identity falls below 60% but is
greater than 30%, then one may find slight differences between alignments carried out by
different methods. In such cases, either a consensus between a number of different
algorithms, or manual corrections carried out by an expert, will produce acceptable results.
If the sequence identity falls below 30%, one may need additional information, most often in
the form of structural data, in order to create acceptable alignments. When the sequence
identity falls below 10%, one may use sophisticated pattern recognition approaches to
obtain a reasonable alignment. In any case, the accuracy of the resulting alignment is often
difficult to prove.
 Practically realizable multiple alignment algorithms come in two classes. In the first class, a
“Sum of Pairs” (SP) Score is minimized. It can be shown that the general problem of
minimizing the SP Score is NP-hard, so no polynomial-time algorithm for it is known; several
approximations, notably the “Centre-Star Algorithm”, do run in polynomial
time. The Centre-Star approximation boils down to finding an “average” sequence from the
set to be aligned. The average sequence is the one that has the lowest total sequence
space distance from all other sequences in the set. Once such a sequence is identified, the
multiple alignment can be constructed by simply aligning the remaining sequences to this
sequence using a slight modification of the standard dynamic programming algorithm. The
second class of algorithms is known as “Progressive Alignment”. This algorithm runs in a
step-wise manner. In the first step, all possible pair-wise alignments are carried out among
sequences in the set. This is followed by the creation of a type of phylogenetic tree known as
the “guide-tree”, in which similar sequences are found in branches close to each other.
The final alignment is then started with the pair of sequences that are closest to
each other. Subsequently the closest remaining sequences in the tree are added to the growing
alignment one at a time until the whole alignment is built.
 A critical aspect of multiple sequence alignment is the scoring function or the character
comparison function that one uses. In principle one can use any type of scoring function and
the alignment protocol will return the best possible alignment consistent with that scoring
function. However, for the alignment to be biologically meaningful, it is imperative to use a
scoring function that makes biological sense. Depending on the particular biological problem
at hand, one can think of a number of different scoring functions. For example, one can
build a scoring function based on the similarities and differences of different amino acids
with respect to a set of physico-chemical parameters. However, the arbitrariness involved in
choosing the set of relevant physico-chemical parameters will make the scoring function
essentially arbitrary, and it is unlikely to be suitable for most biological problems. It can be
argued that for the large majority of biologically significant processes, the underlying mechanism
of the process is rooted in the theory of evolution. Hence a scoring function based on
evolutionary principles is likely to work for most biological problems. The PAM and the
BLOSUM series of scoring functions are based on evolutionary principles and are the most
popularly used scoring functions.
 The PAM (Point Accepted Mutation) series of scoring functions calculates the log odds ratio of
amino acid x mutating to amino acid y in a certain number of steps. In PAM1, it is assumed
to occur in a single step, in PAM2 it is assumed to occur in two steps and so on. PAM1
scoring matrices were generated by calculating the log odds ratio from alignments that were
more than 85% identical. PAM matrices with higher indices were extrapolated from PAM1 by a
process of matrix multiplication. The PAM250 scoring matrix (derived from PAM1 raised to the
250th power) can be used for aligning sequences that are highly diverged.
 The BLOSUM series of scoring matrices was generated by directly calculating the log odds
ratios from ungapped alignments that are similar to a certain degree. For example, the
alignments used to calculate the BLOSUM62 matrix contain sequences that are at least 62%
similar to each other. Unlike the PAM matrices, BLOSUM matrices of lower indices, e.g.,
BLOSUM30, should be used for aligning highly diverged sequences.
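
The Centre-Star step summarized above, choosing the sequence with the lowest total distance to all
other sequences, can be sketched in Python. This is a minimal illustration under stated assumptions,
not the module's own implementation: a unit-cost edit distance stands in for the "sequence space
distance" (the text does not fix a particular measure), and the names edit_distance, centre_sequence
and sp_distance are hypothetical.

```python
from itertools import combinations


def edit_distance(a, b):
    """Unit-cost edit distance between two strings (assumed distance measure)."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or substitute
        prev = curr
    return prev[-1]


def centre_sequence(seqs):
    """Return (index, total) for the sequence with the lowest summed distance."""
    totals = [sum(edit_distance(s, t) for t in seqs) for s in seqs]
    best = min(range(len(seqs)), key=totals.__getitem__)
    return best, totals[best]


def sp_distance(rows):
    """Sum-of-pairs distance of an alignment given as equal-length rows.

    Each pair of rows contributes 1 per column where the two characters differ
    (a gap "-" against a residue counts as a difference).
    """
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(sum(x != y for x, y in zip(r1, r2))
               for r1, r2 in combinations(rows, 2))


seqs = ["ACGT", "ACGA", "TCGA"]
index, total = centre_sequence(seqs)
print(seqs[index], total)      # ACGA 2
print(sp_distance(seqs))       # 4
```

Here ACGA is picked as the centre because its summed distance to the other two sequences (1 + 1) is
the smallest. The sketch stops at centre selection and scoring; aligning the remaining sequences to
the centre, as the summary describes, would still require the modified pairwise dynamic programming
step. In practice a substitution matrix such as PAM or BLOSUM would replace the unit costs.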
