Lec7 - Multiple Sequence Alignment
Why Multiple Sequence Alignment?
• Important for phylogenetic analyses
• Study the evolutionary history of a set of sequences: at what point in history did certain mutations occur?
• Finding common DNA-binding motifs
• Finding residues and domains that are homologous, both structurally and evolutionarily, across a set of proteins, then characterizing them
• Building profiles for sequence-based searches for remotely related proteins/DNA motifs
• Note: protein structures evolve more slowly than protein sequences
MSA of different CoVs
[Figure: multiple sequence alignment of different coronaviruses. Source: Jana et al., PNAS Nexus 2022]
Multiple Sequence Alignment (MSA)
Given:
• A set of more than two sequences
• A scoring method for alignments
Problem:
• Find the alignment of the sequences that maximizes the alignment score
Recap of pairwise alignment
Substitution matrix for pairwise
• Consider a pair of amino acid sequences:
• $x$ with $i$-th symbol $x_i$, and $y$ with $j$-th symbol $y_j$
• $q_a$ is the probability of amino acid $a$
• Under the random model $R$, residues occur independently, so the probability of the two sequences is
$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$
• Assume aligned pairs occur with a joint probability $p_{ab}$.
• Under the match model $M$, with residues aligned at each position $i$, the probability is
$$P(x, y \mid M) = \prod_i p_{x_i y_i}$$
Substitution matrix for pairwise
• Odds ratio = ratio of the two likelihoods
$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$
• We use log-odds to turn the product into a summation:
$$S = \sum_i s(x_i, y_i)$$
where
$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$
which is the log-likelihood ratio of the residue pair $(a, b)$ occurring as an aligned pair, as opposed to an unaligned pair.
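As a minimal sketch of the log-odds construction above, the snippet below builds $s(a, b)$ from background frequencies and joint probabilities and checks that the summed score equals the log of the odds ratio. The two-letter alphabet and all numeric values are hypothetical toy data, not real amino acid statistics, and the function names are illustrative only.

```python
import math

# Hypothetical toy values over a two-letter alphabet (not real amino acid data).
q = {"A": 0.6, "B": 0.4}                      # background frequencies q_a
p = {("A", "A"): 0.5, ("A", "B"): 0.1,
     ("B", "A"): 0.1, ("B", "B"): 0.3}        # joint probabilities p_ab of aligned pairs


def s(a, b):
    """Log-odds score s(a, b) = log(p_ab / (q_a * q_b))."""
    return math.log(p[(a, b)] / (q[a] * q[b]))


def alignment_score(x, y):
    """Score S of an ungapped alignment: the sum of s(x_i, y_i) over positions."""
    assert len(x) == len(y), "an ungapped alignment needs equal lengths"
    return sum(s(a, b) for a, b in zip(x, y))


def log_odds_ratio(x, y):
    """log of P(x, y | M) / P(x, y | R), computed directly from the two products."""
    p_match = math.prod(p[(a, b)] for a, b in zip(x, y))
    p_random = math.prod(q[a] for a in x) * math.prod(q[b] for b in y)
    return math.log(p_match / p_random)
```

With these toy numbers, identical pairs score positive and the mismatched pair scores negative, which is the qualitative behaviour expected of a substitution matrix.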
Scoring in MSA
• Need to account for the fact that:
• some positions are more conserved than others
• sequences are related by a phylogenetic tree
• Ideal approach: use a complete probabilistic model of molecular sequence evolution. This would be very complex, because it would depend on exactly when each sequence evolved along the time axis.
• Simplification: assume the individual columns of an alignment are statistically independent, so that
$$S(m) = G + \sum_i S(m_i)$$
where $G$ = function for scoring gaps in the alignment,
$m_i$ = column $i$ of the MSA $m$,
$S(m_i)$ = score of column $i$.
MSA using Dynamic Programming
• Optimal alignment using DP
• Generalizing pairwise alignment method for MSA
• Instead of a pair of sequences, consider $k$ sequences
• Instead of a 2-D matrix, we need a matrix of $k$ dimensions
• Space complexity is $O(n^k)$ instead of $O(n^2)$
MSA using Dynamic Programming
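• Maximum score of an MSA of the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}$:
$$
\alpha_{i_1, i_2, \dots, i_k} = \max
\begin{cases}
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k-1} + S(x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1,\, i_2-1,\, \dots,\, i_k-1} + S(-, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1-1,\, i_2,\, \dots,\, i_k-1} + S(x^1_{i_1}, -, \dots, x^k_{i_k}) \\
\quad\vdots \\
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k} + S(x^1_{i_1}, x^2_{i_2}, \dots, -) \\
\alpha_{i_1,\, i_2,\, \dots,\, i_k-1} + S(-, -, \dots, x^k_{i_k}) \\
\quad\vdots
\end{cases}
$$
There are $2^k - 1$ such combinations (every way of choosing a residue or a gap for each sequence, except the all-gap column).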
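The recurrence can be sketched directly in code for small inputs. The version below is a minimal illustration, not an efficient implementation: it stores the $k$-dimensional table in a dictionary, enumerates all $2^k - 1$ moves, and returns only the optimal score. The column score used here is a simple sum-of-pairs with hypothetical match, mismatch, and gap values, and it assumes a gap paired with a gap scores 0; the function names are illustrative.

```python
from itertools import product


def sp_column(col, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; assumes '-' paired with '-' scores 0."""
    total = 0
    for a in range(len(col)):
        for b in range(a + 1, len(col)):
            x, y = col[a], col[b]
            if x == "-" and y == "-":
                continue
            total += gap if "-" in (x, y) else (match if x == y else mismatch)
    return total


def msa_dp_score(seqs, score=sp_column):
    """Optimal MSA score by filling the k-dimensional table alpha."""
    k = len(seqs)
    # All 2^k - 1 moves: 1 = consume a residue from that sequence, 0 = gap.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    alpha = {(0,) * k: 0}
    # Lexicographic order guarantees every predecessor cell is filled first.
    for idx in product(*(range(len(seq) + 1) for seq in seqs)):
        if idx in alpha:
            continue
        best = None
        for m in moves:
            prev = tuple(i - d for i, d in zip(idx, m))
            if min(prev) < 0:
                continue  # this move would step outside the table
            col = tuple(seq[i - 1] if d else "-"
                        for seq, i, d in zip(seqs, idx, m))
            cand = alpha[prev] + score(col)
            if best is None or cand > best:
                best = cand
        alpha[idx] = best
    return alpha[tuple(len(seq) for seq in seqs)]
```

Even for three or four short sequences the table grows quickly, which is the practical motivation for the heuristic methods discussed later.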
MSA using Dynamic Programming
• Assuming $k$ sequences of approximately the same length $n$:
• Space complexity: $O(n^k)$
• Time complexity: $O(2^k n^k)$ cell updates, each requiring one column score
• The column score $S(m_i)$ used in this DP can be computed by
• minimum entropy, $O(k)$ per column, or
• sum of pairs, $O(k^2)$ per column
• These methods are exact, but their complexity is exponential in $k$.
• Thus, we use heuristic approaches.
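To get a feel for these bounds, here is a rough operation count under assumed values $k = 5$ and $n = 100$ (illustrative numbers, not taken from the slides), using sum-of-pairs column scoring:

```latex
\begin{align*}
\text{cells} &\approx n^k = 100^5 = 10^{10} \\
\text{predecessors per cell} &= 2^k - 1 = 31 \\
\text{pair scores per column} &= \tfrac{k(k-1)}{2} = 10 \\
\text{total pair scores} &\approx 10^{10} \times 31 \times 10 = 3.1 \times 10^{12}
\end{align*}
```

For comparison, pairwise alignment ($k = 2$) of sequences of the same length needs roughly $10^4 \times 3 \times 1 = 3 \times 10^4$ pair scores.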
Method : Sum of pairs (SP) scoring
• Treat a gap as an extra residue
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
• where $s(a, b)$ comes from a substitution matrix,
• $m_i^k$ = residue of the $k$-th sequence in the $i$-th column,
• All $(k, l)$ pairs are evaluated
• $s(a, -)$ and $s(-, a)$ are gap costs and $s(-, -) = 0$, which gives a linear gap penalty.
• For three sequences, we should be computing $\log \dfrac{p_{abc}}{q_a q_b q_c}$
• But we are instead computing $\log \dfrac{p_{ab}}{q_a q_b} + \log \dfrac{p_{bc}}{q_b q_c} + \log \dfrac{p_{ac}}{q_a q_c}$
• Optimizing the SP score exactly still needs a multidimensional dynamic programming algorithm
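A small sketch of SP scoring for a whole alignment follows. The `pair_score` function is a hypothetical stand-in for a real substitution matrix such as BLOSUM, with made-up match, mismatch, and gap values; it assumes a gap paired with a gap contributes 0, so the gap cost is linear and folded into the column scores.

```python
from itertools import combinations


def pair_score(a, b, match=2, mismatch=-1, gap=-3):
    """Hypothetical s(a, b): stand-in for a substitution matrix with a linear gap cost."""
    if a == "-" and b == "-":
        return 0                      # assumption: gap-gap pairs are neutral
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_score(rows):
    """SP score of an alignment given as equal-length gapped strings."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return sum(pair_score(a, b)
               for column in zip(*rows)               # column m_i
               for a, b in combinations(column, 2))   # all pairs k < l
```

For example, `sp_score(["AC-", "ACG", "A-G"])` sums three columns: the all-`A` column contributes three matches, while each of the other two columns contributes one match and two gap costs.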
Heuristic Methods
• Heuristic methods avoid the exponential time complexity of DP for MSA
• Note: heuristic approaches are not guaranteed to find the optimal alignment
• Most of the time (not always) we don't need the optimal solution, and a close result is enough
• Types
• Progressive alignment
• Iterative refinement
Heuristic Methods: Progressive alignment
• Constructs a succession of pairwise alignments.
Steps:
• Two sequences are chosen and aligned using a pairwise alignment method
• The next sequence is aligned to the previous alignment
• Iterate until all sequences are aligned
• It does not directly optimize any global scoring function, so it tends to deviate from the optimal MSA.
• Fast and efficient, and gives reasonable alignments.
• Examples: tree alignment methods such as Feng-Doolittle, CLUSTALW, and others.
• Star alignment method
• “Once a gap always a gap.”
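The steps above can be sketched as follows. This is a simplified illustration rather than any published method: sequences are added in input order (no guide tree), each step aligns the existing alignment to the next sequence with Needleman-Wunsch over columns, and columns are compared by a cross-group sum of pairs with hypothetical match, mismatch, and gap values. Existing columns are never changed, which is exactly the "once a gap always a gap" behaviour. Input sequences are assumed to be non-empty.

```python
def column_pair_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Sum of pair scores between two columns (cross-group pairs only)."""
    total = 0
    for a in col_a:
        for b in col_b:
            if a == "-" and b == "-":
                continue              # assumption: gap-gap pairs are neutral
            total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total


def align_groups(A, B):
    """Needleman-Wunsch on two alignments, each a list of equal-length gapped rows."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)
    n, m = len(cols_a), len(cols_b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + column_pair_score(cols_a[i - 1], gap_b)
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + column_pair_score(gap_a, cols_b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + column_pair_score(cols_a[i - 1], cols_b[j - 1]),
                F[i - 1][j] + column_pair_score(cols_a[i - 1], gap_b),
                F[i][j - 1] + column_pair_score(gap_a, cols_b[j - 1]),
            )
    # Traceback: rebuild merged columns from the bottom-right corner.
    merged, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and F[i][j] ==
                F[i - 1][j - 1] + column_pair_score(cols_a[i - 1], cols_b[j - 1])):
            merged.append(cols_a[i - 1] + cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + column_pair_score(cols_a[i - 1], gap_b):
            merged.append(cols_a[i - 1] + gap_b)
            i -= 1
        else:
            merged.append(gap_a + cols_b[j - 1])
            j -= 1
    merged.reverse()
    return ["".join(row) for row in zip(*merged)]


def progressive_align(seqs):
    """Add sequences one at a time to a growing alignment."""
    alignment = [seqs[0]]
    for seq in seqs[1:]:
        alignment = align_groups(alignment, [seq])
    return alignment
```

Because `align_groups` works on columns, the same function covers the sequence-to-sequence, sequence-to-group, and group-to-group cases; a tree-based method would differ mainly in the order in which it calls it.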
Tree Alignments
• Organize MSA using a guide tree
• Leaves represent sequences
• Internal nodes represent alignments
• Determine alignments from bottom of tree upward
• Return MSA at the root of the tree
Feng-Doolittle Progressive MSA
(Feng-Doolittle 1987)
Steps
• Calculate a triangular matrix of $k(k-1)/2$ distances between all pairs of the $k$ sequences using standard pairwise alignment methods
• Construct a guide tree from the distance matrix using a clustering algorithm (Fitch and Margoliash, 1967)
• Traverse the nodes in the order in which they were added to the tree, aligning the child nodes of each node in turn
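The first step can be sketched as below. As a loudly labeled substitution, this example uses unit-cost edit distance as the pairwise distance; it is a stand-in chosen for brevity, not the score-to-distance conversion used by Feng and Doolittle. The function names are illustrative.

```python
from itertools import combinations


def edit_distance(x, y):
    """Unit-cost edit distance by pairwise DP (stand-in for a score-derived distance)."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        cur = [i]
        for j, b in enumerate(y, start=1):
            cur.append(min(prev[j] + 1,               # gap in y
                           cur[j - 1] + 1,            # gap in x
                           prev[j - 1] + (a != b)))   # match or mismatch
        prev = cur
    return prev[-1]


def pairwise_distances(seqs):
    """Distances for all k(k-1)/2 unordered pairs, keyed by index pair (i, j), i < j."""
    return {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}


def closest_pair(distances):
    """The pair of sequences a clustering step would typically join first."""
    return min(distances, key=distances.get)
```

For four sequences this yields six distances, matching $k(k-1)/2$; a clustering algorithm would then consume this dictionary to build the guide tree.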
Fitch and Margoliash clustering
• Alignments are done using pairwise DP algorithms.
• Sequence to group: the sequence is aligned pairwise to each sequence in the group. The highest-scoring pairwise alignment determines how the sequence is aligned to the group.
• Group to group: all pairs of sequences between the two groups are aligned first. The best pairwise alignment determines the alignment of the groups.
• After an alignment is made, each gap is replaced by the symbol ‘X’.
• Problem 1: “Once a gap always a gap.”
• Problem 2: aligning two groups requires aligning all cross-group sequence pairs first.
CLUSTALW
Differences compared to Feng-Doolittle
• To align against a group, a sequence profile of the group is used instead of aligning to each sequence of the group separately.
• The guide tree is built using the neighbour-joining clustering algorithm (Saitou & Nei, 1987)
• Progressively align at nodes in order of decreasing similarity, using alignment of
• Sequence to sequence
• Sequence to profile
• Profile to profile
• In all cases we can use dynamic programming
• For the profile cases, use sum of pairs scoring
Thompson et al. 1994, Dr. Colin Dewey notes
Tree-alignment example
[Figure: tree-alignment example. Source: Dr. Colin Dewey notes]
Iterative refinement methods
• Problem with progressive alignment algorithms: subalignments are frozen. Iterative refinement methods try to solve this problem.
Steps
• An initial alignment is generated using any method
• One sequence (or a set of sequences) is taken out and realigned to the profile of the remaining sequences.
• This is repeated until the alignment stops changing.
• Guaranteed to converge to a local optimum of the alignment score.
Thank you
• Inspired by Dr. Colin Dewey's notes
• Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. By R. Durbin, S. Eddy, A. Krogh, and G. Mitchison.
• All my slides/notes excluding third party material are licensed by various authors including myself under https://creativecommons.org/licenses/by-nc/4.0/