
Lec4 - Multiple Sequence Alignment


Bioinformatics

BEC-108 Spring 2023-24

Dr. Devesh Bhimsaria


Office: F9, Old Building
Department of Biosciences and Bioengineering
Indian Institute Of Technology–Roorkee
[email protected]
1
Multiple Sequence Alignment

2
Why Multiple Sequence Alignment?
• Important for phylogenetic analyses
• Studying the evolutionary history of a set of sequences: at
what point in history did certain mutations occur?
• Finding common DNA-binding motifs
• Finding homologous residues and domains in a set
of proteins, both structurally and evolutionarily, then
characterizing them
• Building profiles for sequence-based searches for
remotely related proteins/DNA motifs
• Note: protein structures evolve more slowly than protein
sequences

3
MSA of different CoVs

Figure: alignment of one pair of HR2 helices of different human-infecting coronaviruses (Jana et al., PNAS Nexus 2022)

4
Multiple Sequence Alignment (MSA)
Given:
• A set of > 2 sequences
• A scoring method for alignment
Problem:
• Find the alignment of the sequences that maximizes
the alignment score

5
Recap of pairwise alignment

6
Substitution matrix for pairwise
• Consider
• A pair of amino acid sequences
• $x$ with $i$-th symbol $x_i$ and $y$ with $j$-th symbol $y_j$
• $q_a$ is the probability of amino acid $a$
• Under the random model $R$
• Probability of the two sequences
$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$$
• Assume aligned pairs occur with a joint probability $p_{ab}$.
• Probability under a match model $M$, with the sequences aligned at each position $i$
$$P(x, y \mid M) = \prod_i p_{x_i y_i}$$

7
Substitution matrix for pairwise
• Odds ratio = ratio of the two likelihoods
$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_i q_{y_i}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}}$$
• We use log-odds to make it a summation:
$$S = \sum_i s(x_i, y_i)$$
where
$$s(a, b) = \log \frac{p_{ab}}{q_a q_b}$$
which is the log-likelihood ratio of residue pair $(a, b)$ occurring as
an aligned pair, as opposed to an unaligned pair.
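As a rough illustration of the log-odds score, the Python sketch below evaluates s(a, b) and the summed score S for two ungapped sequences of equal length. The two-letter alphabet and the probabilities in `Q` and `P` are made-up values, chosen only so that each distribution sums to 1; they do not come from any real substitution matrix.

```python
import math

# Hypothetical background frequencies q_a for a toy two-letter alphabet.
Q = {"A": 0.5, "B": 0.5}

# Hypothetical joint probabilities p_ab of aligned pairs (sum to 1).
P = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}


def log_odds(p_ab, q_a, q_b):
    """s(a, b) = log(p_ab / (q_a * q_b)), using the natural logarithm."""
    return math.log(p_ab / (q_a * q_b))


def alignment_score(x, y, p, q):
    """S = sum_i s(x_i, y_i) for two ungapped sequences of equal length."""
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(log_odds(p[(a, b)], q[a], q[b]) for a, b in zip(x, y))


print(alignment_score("AB", "AB", P, Q))  # positive: identical residues
print(alignment_score("AB", "BA", P, Q))  # negative: mismatched residues
```

With these toy numbers, matching pairs are more likely under the match model than under the random model, so they score above zero, while mismatches score below zero.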
8
Scoring in MSA
• Need to account for the fact that:
• some positions are more conserved than others
• sequences are related by a phylogenetic tree
• Ideal approach: use a complete probabilistic model of
molecular sequence evolution. This would be very
complex, as it would depend on when exactly each
sequence evolved on the time axis.
• Simplification: assume individual columns of an alignment are
statistically independent, so that
$$S(m) = G + \sum_i S(m_i)$$
where $G$ = function for scoring gaps in the alignment,
$m_i$ = column $i$ of the MSA $m$,
$S(m_i)$ = score of column $i$.
9
MSA using Dynamic Programming
• Optimal alignment using DP
• Generalizing pairwise alignment method for MSA
• Instead of a pair of sequences, consider $k$ sequences
• Instead of a 2-D matrix, we need a matrix of $k$
dimensions
• Space complexity becomes $O(n^k)$ instead of $O(n^2)$

10
MSA using Dynamic Programming
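• Maximum score of an MSA of the subsequences ending with $x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}$
$$\alpha_{i_1, i_2, \dots, i_k} = \max \begin{cases}
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k-1} + S(x^1_{i_1}, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1,\, i_2-1,\, \dots,\, i_k-1} + S(-, x^2_{i_2}, \dots, x^k_{i_k}) \\
\alpha_{i_1-1,\, i_2,\, \dots,\, i_k-1} + S(x^1_{i_1}, -, \dots, x^k_{i_k}) \\
\quad \vdots \\
\alpha_{i_1-1,\, i_2-1,\, \dots,\, i_k} + S(x^1_{i_1}, x^2_{i_2}, \dots, -) \\
\alpha_{i_1,\, i_2,\, \dots,\, i_k-1} + S(-, -, \dots, x^k_{i_k}) \\
\quad \vdots
\end{cases}$$
There are $2^k - 1$ such combinations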
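To make the count of combinations concrete, the sketch below enumerates the predecessor moves of one DP cell as 0/1 vectors: a 1 means that sequence contributes its current residue to the column, a 0 means it contributes a gap. The helper names `predecessor_moves` and `column_for` are illustrative only.

```python
from itertools import product


def predecessor_moves(k):
    """All 0/1 step vectors of length k except the all-zero vector."""
    return [move for move in product((0, 1), repeat=k) if any(move)]


def column_for(move, seqs, idx):
    """Alignment column produced by `move` at 1-based positions `idx`.

    Sequence j contributes its residue at position idx[j] when move[j] == 1
    and a gap otherwise.
    """
    return tuple(s[i - 1] if step else "-" for s, i, step in zip(seqs, idx, move))


moves = predecessor_moves(3)
print(len(moves))  # 7, i.e. 2**3 - 1
print(column_for((1, 0, 1), ["AC", "GT", "TT"], (2, 1, 1)))  # ('C', '-', 'T')
```

The all-zero vector is excluded because a column consisting only of gaps would not advance in any sequence.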
11
MSA using Dynamic Programming
• Assume $k$ sequences in total, each of approximately the same
length $n$
• Space complexity: $O(n^k)$
• Time complexity: $O(2^k n^k)$
• Such a DP approach can be used for MSA with column scores based on
• minimum entropy, $O(k)$ per column, or
• sum of pairs, $O(k^2)$ per column
• These methods are possible, but their complexity is very
high.
• Thus, we use heuristic approaches.
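A quick back-of-the-envelope calculation shows how fast these bounds grow. The sketch below plugs in an illustrative length of n = 100 (an assumed value, not one from the lecture) and reports the approximate number of table cells and of cell updates.

```python
def dp_cost(n, k):
    """Approximate DP table size n**k and number of updates (2**k - 1) * n**k."""
    cells = n ** k
    updates = (2 ** k - 1) * cells
    return cells, updates


for k in (2, 3, 5, 10):
    cells, updates = dp_cost(100, k)
    print(f"k={k:2d}  cells={cells:.1e}  updates={updates:.1e}")
```

Already for a handful of sequences the table no longer fits in memory, which is the motivation for the heuristic approaches.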
12
Method : Sum of pairs (SP) scoring
• Treat a gap as an extra residue
$$S(m_i) = \sum_{k<l} s(m_i^k, m_i^l)$$
• where $s(a, b)$ comes from a substitution matrix,
• $m_i^k$ = residue of the $k$-th sequence in the $i$-th column,
• All $(k, l)$ pairs are evaluated
• $s(a, -)$ and $s(-, a)$ treat gaps. Linear gap penalty.
• For three sequences, we should be computing $\log \frac{p_{abc}}{q_a q_b q_c}$
• But we are considering $\log \frac{p_{ab}}{q_a q_b} + \log \frac{p_{bc}}{q_b q_c} + \log \frac{p_{ac}}{q_a q_c}$
• We need a multidimensional dynamic programming
algorithm
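The sum-of-pairs column score can be sketched in a few lines of Python. The scoring function `toy_s` is a stand-in with made-up values (match +1, mismatch -1, residue against gap -2, gap against gap 0); in practice s(a, b) would come from a substitution matrix.

```python
from itertools import combinations

GAP = "-"


def toy_s(a, b):
    """Hypothetical pair score; a real s(a, b) comes from a substitution matrix."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2  # linear gap penalty: every residue/gap pair costs the same
    return 1 if a == b else -1


def sp_column_score(column, s=toy_s):
    """S(m_i): sum of s over all unordered pairs of symbols in one column."""
    return sum(s(a, b) for a, b in combinations(column, 2))


def sp_alignment_score(rows, s=toy_s):
    """Sum of the column scores over all columns of the alignment."""
    return sum(sp_column_score(col, s) for col in zip(*rows))


print(sp_column_score("AAC"))                    # 1 - 1 - 1 = -1
print(sp_alignment_score(["AC-", "AAG", "A-G"]))  # 3 - 5 - 3 = -5
```

Each column of k symbols contributes k(k-1)/2 pair scores, which is where the quadratic per-column cost comes from.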
13
Heuristic Methods
• Heuristic methods replace DP’s exponential time
complexity for MSA
• Note: heuristic approaches are not guaranteed to find the
optimal alignment
• Most of the time (not always) we don’t need the
optimal solution, and a close result is enough
• Types
• Progressive alignment
• Iterative refinement

14
Heuristic Methods: Progressive alignment
• Construct a succession of pairwise alignments.
Steps:
• Two sequences are chosen and aligned using a pairwise alignment
method
• The next sequence is aligned to the previous alignment
• Iterate until all sequences are aligned
• It does not directly optimize any global scoring
function, so it tends to deviate from the optimum solution for
MSA.
• Fast, efficient, reasonable alignments.
• Example – Tree alignment methods like Feng-Doolittle,
CLUSTALW, others.
• Star alignment method
• “Once a gap always a gap.”

15
Tree Alignments
• Organize MSA using a guide tree
• Leaves represent sequences
• Internal nodes represent alignments
• Determine alignments from bottom of tree upward
• Return MSA at the root of the tree

16
Feng-Doolittle Progressive MSA
(Feng and Doolittle, 1987)
Steps
• Calculate a triangular matrix of $k(k-1)/2$ distances
between all pairs of $k$ sequences by standard
pairwise alignment methods
• Construct a guide tree from the distance matrix using a
clustering algorithm (Fitch and Margoliash, 1967)
• Traverse the nodes in the order in which they were added to the
tree, repeatedly aligning the child nodes
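The first step can be sketched as follows. As a simplifying assumption, this example uses unit-cost edit distance (itself computed by pairwise DP) as a stand-in for the alignment-derived distances of the original method; the example sequences are made up.

```python
from itertools import combinations


def edit_distance(x, y):
    """Unit-cost edit distance via pairwise dynamic programming."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            curr.append(min(prev[j] + 1,              # gap in y
                            curr[j - 1] + 1,          # gap in x
                            prev[j - 1] + (a != b)))  # match or substitution
        prev = curr
    return prev[-1]


def pairwise_distances(seqs):
    """Distances for all k(k-1)/2 unordered pairs, keyed by index pair (i, j), i < j."""
    return {(i, j): edit_distance(seqs[i], seqs[j])
            for i, j in combinations(range(len(seqs)), 2)}


seqs = ["ACGT", "ACGA", "TCGT", "ACG"]
dist = pairwise_distances(seqs)
print(len(dist))  # 6 pairs for k = 4
```

A clustering algorithm would then consume `dist` to build the guide tree; that step is not shown here.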

17
Fitch and Margoliash clustering
• Alignment done using pairwise DP algorithms.
• A sequence is added to an existing group by
aligning it pairwise to each sequence in the group. The
highest scoring pairwise alignment determines how the
sequence is aligned to the group.
• Aligning group to group: all sequence pairs between the two
groups are first aligned pairwise. The best pairwise sequence
alignment determines the alignment of the groups.
• After alignment, each gap is replaced by the symbol ‘X’
• Problem 1: “Once a gap always a gap.”
• Problem 2: to align two groups, all sequence pairs must
first be aligned pairwise
18
CLUSTALW
Differences compared to Feng-Doolittle
• A group is represented by a sequence profile, and alignment
is done against that profile instead of against each
sequence of the group.
• Guide tree using Neighbour-joining clustering algorithm
(Saitou & Nei, 1987)
• Progressively align at nodes in order of decreasing
similarity, using alignment of
• Sequence to sequence
• Sequence to profile
• Profile with a profile
• In all cases we can use dynamic programming
• for the profile cases, use sum of pairs scoring
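A minimal sketch of the profile idea, under stated assumptions: the profile is taken to be the per-column symbol frequencies of a group (gaps counted as a symbol), and a residue is scored against a profile column by the frequency-weighted average of pair scores, which equals its sum-of-pairs contribution divided by the group size. The names `build_profile`, `residue_vs_column`, and `toy_s` are illustrative; CLUSTALW's actual weighting schemes are not reproduced here.

```python
from collections import Counter

GAP = "-"


def toy_s(a, b):
    """Hypothetical pair score; a real s(a, b) comes from a substitution matrix."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def build_profile(rows):
    """Per-column symbol frequencies of an alignment (gaps counted as a symbol)."""
    n = len(rows)
    return [{sym: count / n for sym, count in Counter(col).items()}
            for col in zip(*rows)]


def residue_vs_column(residue, column_freqs, s=toy_s):
    """Frequency-weighted average of s(residue, symbol) over one profile column."""
    return sum(freq * s(residue, sym) for sym, freq in column_freqs.items())


profile = build_profile(["AC", "AG", "A-"])
print(profile[0])                          # {'A': 1.0}
print(residue_vs_column("A", profile[0]))  # 1.0
```

A sequence-to-profile DP would use `residue_vs_column` in place of the pairwise s(a, b), so the group is consulted once per column rather than once per member sequence.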

Source: Thompson et al. 1994; Dr. Colin Dewey notes

19
Tree-alignment example

Figure annotation: shift entire column to incorporate gaps

Source: Dr. Colin Dewey notes

20
Iterative refinement methods
• Problem with progressive alignment algorithms:
subalignments are frozen. Iterative refinement
methods try to solve this problem.
Steps
• An initial alignment is generated using any method
• One sequence (or set of sequences) is taken out and
realigned to the profile of the remaining sequences.
• This is repeated until the alignment stops changing.
• Guaranteed to converge to a local optimum, which need not be the global optimum.
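The loop above can be sketched on a toy scale. This example makes several simplifying assumptions that are not part of the lecture: the alignment width is kept fixed, the removed row is re-placed by brute force over all gap positions instead of by profile DP, and the made-up `toy_s` sum-of-pairs score is the objective.

```python
from itertools import combinations

GAP = "-"


def toy_s(a, b):
    """Hypothetical pair score; a real s(a, b) comes from a substitution matrix."""
    if a == GAP and b == GAP:
        return 0
    if a == GAP or b == GAP:
        return -2
    return 1 if a == b else -1


def sp_score(rows):
    """Sum-of-pairs score of the whole alignment."""
    return sum(toy_s(a, b) for col in zip(*rows) for a, b in combinations(col, 2))


def placements(seq, width):
    """All ways to write `seq` in `width` columns by inserting gaps."""
    for gap_cols in combinations(range(width), width - len(seq)):
        residues = iter(seq)
        yield "".join(GAP if c in gap_cols else next(residues) for c in range(width))


def realign_row(rows, idx):
    """Take row idx out and re-place it to maximize the score against the fixed rest."""
    seq = rows[idx].replace(GAP, "")
    width = len(rows[idx])
    candidates = (rows[:idx] + [cand] + rows[idx + 1:] for cand in placements(seq, width))
    return max(candidates, key=sp_score)


def refine(rows, max_rounds=100):
    """Repeat single-row realignment until no row improves the score."""
    best = sp_score(rows)
    for _ in range(max_rounds):
        changed = False
        for idx in range(len(rows)):
            candidate = realign_row(rows, idx)
            candidate_score = sp_score(candidate)
            if candidate_score > best:  # accept strict improvements only
                rows, best, changed = candidate, candidate_score, True
        if not changed:
            break
    return rows, best


print(refine(["ACGT", "A-GT", "AG-T"]))  # (['ACGT', 'AG-T', 'AG-T'], 1)
```

Because only strict improvements are accepted and the score is bounded, the loop must stop. In this example it stops at a score of 1, although the alignment `["ACGT", "A-GT", "A-GT"]` scores 5 under the same toy scheme, which shows that the result is a local optimum and not necessarily the global one.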

21
Thank you
• Inspired by Dr. Colin Dewey's notes
• Biological Sequence Analysis: Probabilistic Models
of Proteins and Nucleic Acids. By R. Durbin, S.
Eddy, A. Krogh, and G. Mitchison.
• All my slides/notes excluding third party material
are licensed by various authors including myself
under https://creativecommons.org/licenses/by-nc/4.0/

22
