Lec4 - Multiple Sequence Alignment
Why Multiple Sequence Alignment?
• Important for phylogenetic analyses
• Study evolutionary history of a set of sequences. At
what point in history did certain mutations occur?
• Finding common DNA-binding motifs
• Finding homologous residues and domains in a set
of proteins, both structurally and evolutionarily, then
characterizing them
• Building a profile for sequence-based search for
remotely related proteins/DNA motifs
• Note: protein structures evolve more slowly than protein
sequences
MSA of different CoVs (figure: Jana et al., PNAS Nexus 2022)
Multiple Sequence Alignment (MSA)
Given:
• A set of > 2 sequences
• A scoring method for alignment
Problem:
• Find the alignment of the sequences that maximizes
the alignment score
Recap of pairwise alignment
Substitution matrix for pairwise alignment
• Consider
• A pair of amino acid sequences
• $x$ with $i$-th symbol $x_i$ and $y$ with $j$-th symbol $y_j$
• $q_a$ is the probability of amino acid $a$
• Under the random model $R$, residues occur independently
• Probability of the two sequences:
$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$
• Assume aligned pairs occur with a joint probability $p_{ab}$.
• Under the match model $M$, with residues aligned at each position $i$:
$$P(x, y \mid M) = \prod_i p_{x_i y_i}$$
• Odds ratio = ratio of the two likelihoods
$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$
• We use log-odds to make it a summation:
$$S = \sum_i s(x_i, y_i)$$
where
$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$
is the log-likelihood ratio of the residue pair $(a, b)$ occurring as
an aligned pair, as opposed to an unaligned pair.
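The log-odds construction can be checked numerically. The sketch below uses a made-up two-letter alphabet; the probabilities in `q` and `p` are illustrative assumptions, not values from any real substitution matrix. It shows that summing $s(x_i, y_i)$ over columns gives the same number as taking the log of the odds ratio directly.

```python
import math

# Toy two-letter alphabet; all probabilities are made-up illustrative values.
q = {"A": 0.55, "B": 0.45}                     # background frequencies q_a
p = {("A", "A"): 0.50, ("A", "B"): 0.05,
     ("B", "A"): 0.05, ("B", "B"): 0.40}       # joint aligned-pair probabilities p_ab

def s(a, b):
    """Log-odds score s(a, b) = log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))

def alignment_score(x, y):
    """Score S of an ungapped alignment: sum of s(x_i, y_i) over columns."""
    assert len(x) == len(y)
    return sum(s(a, b) for a, b in zip(x, y))

def log_odds_ratio(x, y):
    """log [P(x, y | M) / P(x, y | R)] computed directly from the products."""
    p_match = math.prod(p[(a, b)] for a, b in zip(x, y))
    p_random = math.prod(q[a] for a in x) * math.prod(q[b] for b in y)
    return math.log(p_match / p_random)

x, y = "AAB", "ABB"
print(alignment_score(x, y), log_odds_ratio(x, y))  # the two values agree
```

With these toy numbers, identical pairs get a positive score and mismatched pairs a negative one, which is the behaviour expected of a substitution matrix.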
Scoring in MSA
• Need to account for the fact that:
• some positions are more conserved than others
• sequences are related by a phylogenetic tree
• Ideal way – use a complete probabilistic model of
molecular sequence evolution, which would be very
complex as it would depend on when exactly the
sequence evolved on the time axis.
• Assuming individual columns of an alignment are
statistically independent:
$$S(m) = G + \sum_i S(m_i)$$
where $G$ = function for scoring gaps in the alignment,
$m_i$ = column $i$ of the MSA $m$,
$S(m_i)$ = score of column $i$.
MSA using Dynamic Programming
• Optimal alignment using DP
• Generalizing pairwise alignment method for MSA
• Instead of a pair of sequences, consider $k$ sequences
• Instead of a 2-D matrix, we need a matrix of $k$
dimensions
• Space complexity grows from $O(n^2)$ to $O(n^k)$
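• Maximum score of an MSA of the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}$:
$$\alpha_{i_1, i_2, \dots, i_k} = \max \begin{cases}
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k-1} + S(x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1,\, i_2-1,\, \dots,\, i_k-1} + S(-, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1-1,\, i_2,\, \dots,\, i_k-1} + S(x^1_{i_1}, -, \dots, x^k_{i_k}) \\
\quad\vdots \\
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k} + S(x^1_{i_1}, x^2_{i_2}, \dots, -) \\
\alpha_{i_1,\, i_2,\, \dots,\, i_k-1} + S(-, -, \dots, x^k_{i_k}) \\
\quad\vdots
\end{cases}$$
There are $2^k - 1$ such combinations.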
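As a rough illustration of this recurrence, the sketch below fills the $k$-dimensional table for a handful of very short sequences. The column score is a toy sum-of-pairs function with assumed match/mismatch/gap values, gap-gap pairs are assumed to score 0, and the separate gap term $G$ is taken to be zero because gaps are already scored inside each column. The names `sp_column_score` and `msa_dp_score` are hypothetical.

```python
from itertools import product

GAP = "-"

def sp_column_score(column, match=1, mismatch=-1, gap=-2):
    """Toy sum-of-pairs score of one column; gap-gap pairs score 0 (assumption)."""
    total = 0
    for a in range(len(column)):
        for b in range(a + 1, len(column)):
            x, y = column[a], column[b]
            if x == GAP and y == GAP:
                continue
            if x == GAP or y == GAP:
                total += gap
            else:
                total += match if x == y else mismatch
    return total

def msa_dp_score(seqs, score=sp_column_score):
    """Optimal MSA score obtained by filling the k-dimensional table alpha."""
    k = len(seqs)
    origin = (0,) * k
    # Each non-zero 0/1 vector marks which sequences contribute a residue
    # to the new column: 2^k - 1 possible moves.
    moves = [d for d in product((0, 1), repeat=k) if any(d)]
    alpha = {origin: 0}
    # product() yields indices in lexicographic order, so every predecessor
    # of a cell has been filled before the cell itself is visited.
    for idx in product(*(range(len(seq) + 1) for seq in seqs)):
        if idx == origin:
            continue
        best = None
        for d in moves:
            prev = tuple(i - di for i, di in zip(idx, d))
            if min(prev) < 0:
                continue
            column = tuple(seqs[j][idx[j] - 1] if d[j] else GAP for j in range(k))
            cand = alpha[prev] + score(column)
            if best is None or cand > best:
                best = cand
        alpha[idx] = best
    return alpha[tuple(len(seq) for seq in seqs)]

print(msa_dp_score(["ACG", "AG", "ACG"]))
```

The table has one entry per index tuple, so this is only usable for tiny inputs; a traceback over the same table would recover the alignment itself.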
• Assuming a total of $k$ sequences of approximately the same
length $n$
• Space complexity $O(n^k)$
• Time complexity $O(2^k n^k)$
• Such a DP approach can be used for MSA with column scores based on
• minimizing entropy, $O(k)$ per column, or
• the sum of pairs method, $O(k^2)$ per column
• These are possible methods, but their complexity is very
high.
• Thus, we use heuristic approaches.
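To get a feel for these bounds, the following small calculation counts table cells and cell updates for the full DP. It assumes all $k$ sequences have exactly length $n$ and ignores the cost of scoring each column; `dp_cost` is a hypothetical helper name.

```python
def dp_cost(n, k):
    """Return (table cells, cell updates) for k sequences of length n."""
    cells = (n + 1) ** k          # one cell per index tuple (i_1, ..., i_k)
    moves = 2 ** k - 1            # predecessor combinations per cell
    return cells, cells * moves

for k in (2, 3, 5):
    cells, updates = dp_cost(100, k)
    print(f"k={k}: {cells:.2e} cells, {updates:.2e} updates")
```

For two sequences of length 100 the table has about ten thousand cells; for five sequences of the same length it already has on the order of $10^{10}$ cells, which is why exact DP is rarely practical.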
Method: Sum of pairs (SP) scoring
• Treating a gap as an extra residue
$$S(m_i) = \sum_{k < l} s(m_i^k, m_i^l)$$
• where $s(a, b)$ comes from a substitution matrix,
• $m_i^k$ = residue of the $k$-th sequence in the $i$-th column (here $k$ and $l$ index sequences),
• All $(k, l)$ pairs are evaluated
• $s(a, -)$ and $s(-, a)$ score gaps; this corresponds to a linear gap penalty.
• For three sequences we should be computing $\log \frac{p_{abc}}{q_a q_b q_c}$
• But we are considering $\log \frac{p_{ab}}{q_a q_b} + \log \frac{p_{bc}}{q_b q_c} + \log \frac{p_{ac}}{q_a q_c}$
• We need a multidimensional dynamic programming
algorithm
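A minimal sketch of SP scoring for a finished alignment follows. The pair score `toy_s` is an assumed stand-in for a real substitution matrix (match 2, mismatch -1, gap -2, gap-gap 0), and the function names are hypothetical.

```python
from itertools import combinations

def toy_s(a, b, match=2, mismatch=-1, gap=-2):
    """Assumed pair score s(a, b), with '-' treated as an extra residue."""
    if a == "-" and b == "-":
        return 0                  # assumption: two gaps cost nothing
    if a == "-" or b == "-":
        return gap                # linear gap penalty
    return match if a == b else mismatch

def sp_score(alignment, s=toy_s):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    assert len({len(row) for row in alignment}) == 1, "rows must have equal length"
    total = 0
    for column in zip(*alignment):            # column m_i
        # every unordered pair (k, l) with k < l in this column
        total += sum(s(a, b) for a, b in combinations(column, 2))
    return total

msa = ["AC-",
       "ACG",
       "A-G"]
print(sp_score(msa))
```

Each column of $k$ residues contributes $k(k-1)/2$ pair scores, which is where the $O(k^2)$ per-column cost comes from.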
Heuristic Methods
• Heuristic methods replace DP's exponential time
complexity for MSA
• Note: heuristic approaches are not guaranteed to find the
optimal alignment
• Most of the time (not always) we don't need the
optimal solution, and a close result is enough
• Types
• Progressive alignment
• Iterative refinement
Heuristic Methods: Progressive alignment
• Constructing a succession of pairwise alignments.
Steps:
• Two sequences chosen and aligned using a pairwise alignment
method
• Next sequence is aligned to the previous alignment
• Iterate until all sequences are aligned
• It does not directly optimize any global scoring
function, so it tends to deviate from the optimal solution for
MSA.
• Fast, efficient, reasonable alignments.
• Example – Tree alignment methods like Feng-Doolittle,
CLUSTALW, others.
• Star alignment method
• “Once a gap always a gap.”
Tree Alignments
• Organize MSA using a guide tree
• Leaves represent sequences
• Internal nodes represent alignments
• Determine alignments from bottom of tree upward
• Return MSA at the root of the tree
Feng-Doolittle Progressive MSA
(Feng & Doolittle, 1987)
Steps
• Calculate a triangular matrix of $k(k-1)/2$ distances
between all pairs of the $k$ sequences using standard
pairwise alignment methods
• Construct a guide tree from the distance matrix using a
clustering algorithm (Fitch and Margoliash, 1967)
• Traverse the nodes in the order they were added to the
tree, aligning the child nodes of each node in turn
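The first step can be sketched as follows. Edit distance, which is itself a pairwise-alignment DP with unit costs, is used here purely as a stand-in; it is not the score-derived distance of the original method, and the function names are hypothetical.

```python
from itertools import combinations

def edit_distance(x, y):
    """Unit-cost edit distance via the standard pairwise DP (row by row)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, 1):
        cur = [i]
        for j, b in enumerate(y, 1):
            cur.append(min(prev[j] + 1,              # gap in y
                           cur[j - 1] + 1,           # gap in x
                           prev[j - 1] + (a != b)))  # match or mismatch
        prev = cur
    return prev[-1]

def pairwise_distances(seqs):
    """Upper-triangular distance table with k(k-1)/2 entries keyed by (i, j), i < j."""
    return {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}

seqs = ["ACGT", "ACT", "AGT", "TTTT"]
d = pairwise_distances(seqs)
closest = min(d, key=d.get)   # a clustering step would typically merge this pair first
print(len(d), closest, d[closest])
```

A clustering algorithm would then consume this table to build the guide tree.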
Fitch and Margoliash clustering
• Alignment is done using pairwise DP algorithms.
• A sequence is added to an existing group by
aligning it pairwise to each sequence in the group. The
highest-scoring pairwise alignment determines how the
sequence is aligned to the group.
• Aligning group to group: every sequence of one group is
first aligned pairwise to every sequence of the other. The best
pairwise alignment determines the alignment of the groups.
• After alignment, each gap is replaced by the symbol ‘X’
• Problem 1: “Once a gap always a gap.”
• Problem 2: to align two groups, all pairs of sequences
must first be aligned pairwise
CLUSTALW
Differences compared to Feng-Doolittle
• For a group, instead of aligning a sequence to each
sequence of the group, a sequence profile of the group is
used for the alignment.
• Guide tree using Neighbour-joining clustering algorithm
(Saitou & Nei, 1987)
• Progressively align at nodes in order of decreasing
similarity, using alignment of
• Sequence to sequence
• Sequence to profile
• Profile with a profile
• In all cases we can use dynamic programming
• for the profile cases, use sum of pairs scoring
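One simple way to picture a profile is as per-column residue counts. The sketch below builds such counts and computes the sum-of-pairs cross term between two profile columns, which is the quantity a profile-based DP would use as its match score. It is only an illustration under assumed scores, not CLUSTALW's actual weighting scheme; `toy_s`, `build_profile` and `column_pair_score` are hypothetical names.

```python
from collections import Counter

def toy_s(a, b, match=2, mismatch=-1, gap=-2):
    """Assumed pair score with '-' treated as an extra residue."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def build_profile(alignment):
    """Per-column residue counts for an alignment given as equal-length rows."""
    return [Counter(column) for column in zip(*alignment)]

def column_pair_score(col_a, col_b, s=toy_s):
    """Sum-of-pairs cross term between two profile columns (Counters)."""
    return sum(na * nb * s(a, b)
               for a, na in col_a.items()
               for b, nb in col_b.items())

group = ["AC-",
         "ACG",
         "A-G"]
profile = build_profile(group)
single = Counter("C")   # a lone sequence residue is a profile column with one count
print(column_pair_score(profile[1], single))
```

Treating a single sequence as a one-row profile lets the same scoring function cover the sequence-to-profile and profile-to-profile cases.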
Thompson et al. 1994, Dr. Colin Dewey notes
Tree-alignment example (figure: Dr. Colin Dewey notes)
Iterative refinement methods
• Problem with progressive alignment algorithms –
subalignments are frozen. Iterative refinement
methods try to solve this problem.
Steps
• An initial alignment is generated using any method
• One sequence (or set of sequences) is taken out and
realigned to the profile of the remaining sequences.
• Repeated until the alignment stops changing.
• Guaranteed to converge to a local optimum of the alignment score.
Thank you
• Inspired by Dr. Colin Dewey's notes
• Biological Sequence Analysis: Probabilistic Models
of Proteins and Nucleic Acids. By R. Durbin, S.
Eddy, A. Krogh, and G. Mitchison.
• All my slides/notes excluding third party material
are licensed by various authors including myself
under https://creativecommons.org/licenses/by-nc/4.0/