
Lec7 - Multiple Sequence Alignment

This document discusses multiple sequence alignment (MSA) and different methods used for MSA. It begins by explaining why MSA is important and then describes dynamic programming and heuristic approaches for MSA. It specifically covers sum of pairs scoring, progressive alignment using tree methods like Feng-Doolittle and CLUSTALW, and iterative refinement methods. Progressive alignment builds the alignment incrementally using pairwise alignments while iterative refinement realigns portions to optimize the overall alignment.

Uploaded by

ayush saini

Computational Biology

BE-305 Autumn 2023-24

Dr. Devesh Bhimsaria


Office: F9, Old Building
Department of Biosciences and Bioengineering
Indian Institute Of Technology–Roorkee
Multiple Sequence Alignment

Why Multiple Sequence Alignment?
• Important for phylogenetic analyses
• Study evolutionary history of a set of sequences. At
what point in history did certain mutations occur?
• Finding common DNA-binding motifs
• Finding homologous residues and domains of a set
of proteins, both structurally and evolutionarily, then
characterizing them
• Building profiles for sequence-based searches for
remotely related proteins/DNA motifs
• Note: protein structures evolve more slowly than protein
sequences

MSA of different CoVs

Alignment of one pair of HR2 helices of different human-infecting coronaviruses
(Jana et al., PNAS Nexus 2022)

Multiple Sequence Alignment (MSA)
Given:
• A set of > 2 sequences
• A scoring method for alignment
Problem:
• Find the alignment of the sequences that maximizes
the alignment score

Recap of pairwise alignment

Substitution matrix for pairwise alignment
• Consider
• A pair of amino acid sequences
• $x$ with $i$-th symbol $x_i$ and $y$ with $j$-th symbol $y_j$
• $q_a$ is the probability of amino acid $a$
• Under the random model $R$
• Probability of the two sequences
$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$
• Assume aligned pairs occur with a joint probability $p_{ab}$.
• Probability under the match model $M$, with the residues aligned at each position $i$
$$P(x, y \mid M) = \prod_i p_{x_i y_i}$$

Substitution matrix for pairwise alignment
• Odds ratio = ratio of the two likelihoods
$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$
• We use log-odds to make it a summation:
$$S = \sum_i s(x_i, y_i)$$
where
$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$
which is the log-likelihood ratio of the residue pair $(a, b)$ occurring as
an aligned pair, as opposed to an unaligned pair.
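
The log-odds definition above can be sketched in a few lines of Python. This is a minimal illustration only: the two-letter alphabet and all probabilities below are made-up assumptions, not values from a real substitution matrix, and the function names are hypothetical.

```python
import math

def log_odds_matrix(pair_probs, background):
    """Return s(a, b) = log(p_ab / (q_a * q_b)) for every residue pair."""
    scores = {}
    for (a, b), p_ab in pair_probs.items():
        scores[(a, b)] = math.log(p_ab / (background[a] * background[b]))
    return scores

def alignment_score(x, y, scores):
    """Sum the per-position log-odds scores of an ungapped alignment."""
    if len(x) != len(y):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    return sum(scores[(a, b)] for a, b in zip(x, y))

# Toy alphabet {A, B}; every number here is an illustrative assumption.
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}
s = log_odds_matrix(p, q)
total = alignment_score("AAB", "ABB", s)
```

Pairs that occur together more often than chance (here A with A) get a positive score, and pairs that occur less often than chance get a negative one.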
Scoring in MSA
• Need to account for the fact that:
• some positions are more conserved than others
• sequences are related by a phylogenetic tree
• Ideal way – use a complete probabilistic model of
molecular sequence evolution, which would be very
complex as it would depend on when exactly the
sequence evolved on the time axis.
• Assuming individual columns of an alignment are
statistically independent
$$S(m) = G + \sum_i S(m_i)$$
where $G$ = function for scoring gaps in the alignment,
$m_i$ = column $i$ of the MSA $m$,
$S(m_i)$ = score of column $i$.
MSA using Dynamic Programming
• Optimal alignment using DP
• Generalizing pairwise alignment method for MSA
• Instead of a pair of sequences, consider $k$ sequences
• Instead of a 2-D matrix, we need a matrix of $k$
dimensions
• Space complexity is $O(n^k)$ instead of $O(n^2)$

MSA using Dynamic Programming
• Maximum score of an MSA of the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}$
$$\alpha_{i_1, i_2, \dots, i_k} = \max \begin{cases}
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k-1} + S(x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1,\, i_2-1,\, \dots,\, i_k-1} + S(-, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1-1,\, i_2,\, \dots,\, i_k-1} + S(x^1_{i_1}, -, \dots, x^k_{i_k}) \\
\quad \vdots \\
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k} + S(x^1_{i_1}, x^2_{i_2}, \dots, -) \\
\alpha_{i_1,\, i_2,\, \dots,\, i_k-1} + S(-, -, \dots, x^k_{i_k}) \\
\quad \vdots
\end{cases}$$
There are $2^k - 1$ such combinations
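
The recurrence above can be written out directly for small inputs. The following is a sketch under stated assumptions: the column score is a simple sum-of-pairs with made-up match, mismatch, and gap values, the extra gap term $G$ is left out, and `exact_msa_score` is a hypothetical name. It only returns the optimal score, not the alignment itself.

```python
from itertools import combinations, product

def sp_column(column, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of one column; '-' is a gap and s(-,-) = 0."""
    total = 0
    for a, b in combinations(column, 2):
        if a == "-" and b == "-":
            continue
        if a == "-" or b == "-":
            total += gap
        else:
            total += match if a == b else mismatch
    return total

def exact_msa_score(seqs):
    """Optimal column-score sum via a k-dimensional DP table (a dict)."""
    k = len(seqs)
    # Each non-zero 0/1 vector says which sequences contribute a residue
    # to the new column: 2^k - 1 possible moves.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    origin = (0,) * k
    alpha = {origin: 0}
    # product() yields index tuples in lexicographic order, so every
    # predecessor cell is filled before the cell that depends on it.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        best = None
        for move in moves:
            prev = tuple(c - d for c, d in zip(cell, move))
            if min(prev) < 0:
                continue  # this move would step outside the table
            column = [s[c - 1] if d else "-"
                      for s, c, d in zip(seqs, cell, move)]
            candidate = alpha[prev] + sp_column(column)
            if best is None or candidate > best:
                best = candidate
        alpha[cell] = best
    return alpha[tuple(len(s) for s in seqs)]
```

For example, `exact_msa_score(["AC", "AC", "AC"])` scores the gap-free alignment of three identical sequences. The table has one entry per index tuple, which is where the $O(n^k)$ space requirement comes from.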
MSA using Dynamic Programming
• Assuming a total of $k$ sequences of approximately the same
length $n$
• Space complexity $O(n^k)$
• Time complexity $O(2^k n^k)$
• We can use such a DP approach for MSA with column scores based on
• minimum entropy, $O(k)$ per column, or
• sum of pairs, $O(k^2)$ per column
• These methods are possible, but their complexity is very
high.
• Thus, we use heuristic approaches.
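
To get a feel for these bounds, the table size and the number of cell updates can be counted for a few values of $k$. This is a rough sketch: the function name is hypothetical, and the count ignores the per-column scoring cost.

```python
def exact_dp_cost(n, k):
    """Cells in the k-dimensional table and cell updates (cells x moves)."""
    cells = (n + 1) ** k   # indices 0..n along each of the k axes
    moves = 2 ** k - 1     # non-empty subsets of sequences that advance
    return cells, cells * moves

for k in (2, 3, 5, 10):
    cells, updates = exact_dp_cost(100, k)
    print(f"k={k:2d}  cells={cells:.2e}  updates={updates:.2e}")
```

With sequences of length 100, the table already has over a million cells at $k = 3$ and grows by a factor of about 100 with every additional sequence.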
Method: Sum of pairs (SP) scoring
• Treating a gap as an extra residue
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
• where $s(a, b)$ comes from a substitution matrix,
• $m_i^k$ = residue of the $k$-th sequence in the $i$-th column,
• All $(k, l)$ pairs are evaluated
• $s(a, -)$ and $s(-, a)$ score gaps, and $s(-, -) = 0$. This gives a linear
gap penalty.
• For three sequences we should be computing $\log \frac{p_{abc}}{q_a q_b q_c}$
• But we are considering $\log \frac{p_{ab}}{q_a q_b} + \log \frac{p_{bc}}{q_b q_c} + \log \frac{p_{ac}}{q_a q_c}$
• Finding the optimal SP alignment still needs a multidimensional dynamic programming
algorithm
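
As an illustration of SP scoring, the sketch below scores a given alignment column by column. The pair scores (match 1, mismatch -1, gap -2) are arbitrary assumptions standing in for a real substitution matrix, and the function names are hypothetical.

```python
from itertools import combinations

def s(a, b, match=1, mismatch=-1, gap=-2):
    """Toy pair score with the gap treated as an extra residue."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_column_score(column):
    """S(m_i): sum of s over all pairs k < l of rows in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))

def sp_alignment_score(rows):
    """Sum of the column scores of an alignment given as gapped strings."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(sp_column_score(col) for col in zip(*rows))

msa = ["AC-T", "ACGT", "A-GT"]
total = sp_alignment_score(msa)
# The same total is obtained by adding up the scores of the three
# pairwise alignments induced by the MSA.
pairwise_total = sum(sp_alignment_score([r1, r2])
                     for r1, r2 in combinations(msa, 2))
```

The agreement between `total` and `pairwise_total` is what the name "sum of pairs" refers to: each pair of sequences is scored as if the other sequences were absent.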

Heuristic Methods
• Heuristic methods replace DP, whose time complexity
is exponential in the number of sequences
• Note: heuristic approaches are not guaranteed to find the
optimal alignment
• Most of the time (not always) we don’t need the
optimal solution and a close result is enough
• Types
• Progressive alignment
• Iterative refinement

Heuristic Methods: Progressive alignment
• Constructing a succession of pairwise alignments.
Steps:
• Two sequences are chosen and aligned using a pairwise alignment
method
• The next sequence is aligned to the previous alignment
• Iterate until all sequences are aligned
• It does not directly optimize any global scoring
function, so it tends to deviate from the optimal solution for
MSA.
• Fast and efficient, and gives reasonable alignments.
• Example – Tree alignment methods like Feng-Doolittle,
CLUSTALW, others.
• Star alignment method
• “Once a gap always a gap.”

Tree Alignments
• Organize MSA using a guide tree
• Leaves represent sequences
• Internal nodes represent alignments
• Determine alignments from bottom of tree upward
• Return MSA at the root of the tree
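
The bottom-up order in which a guide tree is processed can be sketched with a post-order traversal. The nested-tuple tree format, the sequence names, and the function name are assumptions made for this example. The sketch only lists which groups would be aligned at each internal node and performs no alignment.

```python
def merge_order(tree, merges=None):
    """Return (leaves under this node, list of merges in bottom-up order)."""
    if merges is None:
        merges = []
    if isinstance(tree, str):          # a leaf is a single sequence name
        return [tree], merges
    left, right = tree                 # an internal node has two children
    left_leaves, _ = merge_order(left, merges)
    right_leaves, _ = merge_order(right, merges)
    # Both children are finished, so their alignments can now be merged.
    merges.append((left_leaves, right_leaves))
    return left_leaves + right_leaves, merges

guide_tree = (("s1", "s2"), ("s3", ("s4", "s5")))
leaves, merges = merge_order(guide_tree)
for left_group, right_group in merges:
    print("align", left_group, "with", right_group)
```

The last merge in the list corresponds to the root, where the alignment of all sequences is returned.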

Feng-Doolittle Progressive MSA
(Feng-Doolittle 1987)
Steps
• Calculate a triangular matrix of $k(k-1)/2$ distances
between all pairs of the $k$ sequences by standard
pairwise alignment methods
• Construct a guide tree from the distance matrix using a
clustering algorithm (Fitch and Margoliash, 1967)
• Traverse the nodes in the order in which they were added to the
tree, aligning the child nodes of each node in turn
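
The first step can be sketched as follows. Note the assumptions: the pairwise scores come from a plain Needleman-Wunsch recursion with made-up match, mismatch, and gap values, and the score-to-distance conversion is a simple stand-in chosen for this example, not the normalized log transform used by Feng and Doolittle.

```python
from itertools import combinations

def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score (Needleman-Wunsch with a linear gap cost)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def pairwise_distances(seqs):
    """Return {(i, j): distance} for all i < j, i.e. k(k-1)/2 entries."""
    dist = {}
    for i, j in combinations(range(len(seqs)), 2):
        score = nw_score(seqs[i], seqs[j])
        # Stand-in conversion (an assumption): identical sequences get 0.
        dist[(i, j)] = 1 - score / max(len(seqs[i]), len(seqs[j]))
    return dist

seqs = ["ACGT", "ACGA", "TTGT", "ACGT"]
d = pairwise_distances(seqs)
closest = min(d, key=d.get)   # the pair a clustering method would join first
```

A clustering algorithm then builds the guide tree from `d`, typically starting with the closest pair.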

Fitch and Margoliash clustering
• Alignment is done using pairwise DP algorithms.
• A sequence is added to an existing group by
aligning it pairwise to each sequence in the group. The highest
scoring pairwise alignment determines how the sequence
is aligned to the group.
• Aligning a group to a group: all sequence pairs between the two groups are
first aligned pairwise. The best pairwise sequence alignment
determines the alignment of the groups.
• After an alignment is completed, each gap is replaced by the symbol ‘X’
• Problem 1: “Once a gap always a gap.”
• Problem 2: to align two groups, all pairs of sequences must first be
aligned pairwise
CLUSTALW
Differences compared to Feng-Doolittle
• To align against a group, a sequence profile of the group is used,
instead of aligning to each sequence of the group
separately.
• The guide tree is built using the neighbour-joining clustering algorithm
(Saitou & Nei, 1987)
• Progressively align at nodes in order of decreasing
similarity, using alignment of
• Sequence to sequence
• Sequence to profile
• Profile to profile
• In all cases we can use dynamic programming
• For the profile cases, use sum of pairs scoring
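
A sequence-to-profile step with sum of pairs scoring might look like the sketch below. It is not the CLUSTALW implementation: the group is represented simply by its alignment columns, the pair scores are toy values, there are no sequence weights or position-specific gap penalties, and the function names are hypothetical.

```python
def s(a, b, match=1, mismatch=-1, gap=-2):
    """Toy pair score; s(-,-) = 0."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def align_seq_to_profile(rows, seq):
    """Align seq to a frozen alignment (a list of equal-length rows)."""
    cols = list(zip(*rows))
    k, n, m = len(rows), len(cols), len(seq)
    gap_col = ("-",) * k          # all-gap column inserted into the group

    def col_score(col, c):
        # SP contribution of the new row's symbol c against one column.
        return sum(s(c, r) for r in col)

    # F[i][j]: best score of the first i columns vs the first j residues.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + col_score(cols[i - 1], "-")
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + col_score(gap_col, seq[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1]),
                F[i - 1][j] + col_score(cols[i - 1], "-"),
                F[i][j - 1] + col_score(gap_col, seq[j - 1]),
            )
    # Traceback: whole columns of the group are kept or shifted together.
    out_cols, new_row = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and
                F[i][j] == F[i - 1][j - 1] + col_score(cols[i - 1], seq[j - 1])):
            out_cols.append(cols[i - 1]); new_row.append(seq[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] + col_score(cols[i - 1], "-"):
            out_cols.append(cols[i - 1]); new_row.append("-")
            i -= 1
        else:
            out_cols.append(gap_col); new_row.append(seq[j - 1])
            j -= 1
    out_cols.reverse(); new_row.reverse()
    group = ["".join(col[r] for col in out_cols) for r in range(k)]
    return group + ["".join(new_row)], F[n][m]
```

For instance, `align_seq_to_profile(["AT", "AT"], "ACT")` has to open a gap column in both rows of the group to make room for the C. Existing gaps in the group are never removed, which is the "once a gap always a gap" behaviour.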

Thompson et al. 1994, Dr. Colin Dewey notes
Tree-alignment example

[Figure: tree-alignment example; entire columns are shifted to incorporate gaps.]
Dr. Colin Dewey notes

Iterative refinement methods
• Problem with progressive alignment algorithms:
subalignments are frozen. Iterative refinement
methods try to solve this problem.
Steps
• An initial alignment is generated using any method
• One sequence (or a set of sequences) is taken out and
realigned to the profile of the remaining sequences.
• This is repeated until the alignment stops changing.
• Guaranteed to converge to a local optimum of the alignment score.
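
The refinement loop can be sketched as below. To keep the example short, realignment is done by brute force over all placements of the removed sequence within the current alignment width, which is an assumption made for illustration. A real method would realign to the profile with dynamic programming. The pair scores and function names are likewise hypothetical.

```python
from itertools import combinations

def sp_score(rows):
    """Sum-of-pairs score with toy values: match 1, mismatch -1, gap -2."""
    total = 0
    for col in zip(*rows):
        for a, b in combinations(col, 2):
            if a == "-" and b == "-":
                continue
            total += -2 if "-" in (a, b) else (1 if a == b else -1)
    return total

def brute_force_realign(rest, seq, idx):
    """Try every placement of seq in the existing columns; keep the best."""
    width = len(rest[0])
    best_rows, best_score = None, None
    for positions in combinations(range(width), len(seq)):
        row = ["-"] * width
        for pos, ch in zip(positions, seq):
            row[pos] = ch
        rows = rest[:idx] + ["".join(row)] + rest[idx:]
        score = sp_score(rows)
        if best_score is None or score > best_score:
            best_rows, best_score = rows, score
    return best_rows

def iterative_refinement(rows, realign, score, max_rounds=100):
    """Leave-one-out refinement; a change is kept only if the score rises."""
    best = score(rows)
    for _ in range(max_rounds):
        improved = False
        for idx in range(len(rows)):
            seq = rows[idx].replace("-", "")       # take one sequence out
            rest = rows[:idx] + rows[idx + 1:]
            candidate = realign(rest, seq, idx)    # put it back, realigned
            cand_score = score(candidate)
            if cand_score > best:
                rows, best, improved = candidate, cand_score, True
        if not improved:
            break                                  # alignment stopped changing
    return rows, best

start = ["ACGT", "ACGT", "AG-T"]
refined, refined_score = iterative_refinement(start, brute_force_realign, sp_score)
```

Because a realignment is accepted only when the score strictly increases, the loop must stop, but the result depends on the starting alignment and need not be the global optimum.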

Thank you
• Inspired by Dr. Colin Dewey's notes
• Biological Sequence Analysis: Probabilistic Models
of Proteins and Nucleic Acids. By R. Durbin, S.
Eddy, A. Krogh, and G. Mitchison.
• All my slides/notes excluding third party material
are licensed by various authors including myself
under https://creativecommons.org/licenses/by-nc/4.0/
