Bioinformatics 1 - Lecture 8: Multiple Sequence Alignment
[Figure: freq. plotted for values 1 to 10, annotated with m*P(S≥x) = 4, 3, 2 and 1.]
In class competition exercise:
Editing a multiple sequence alignment in
UGENE
• Download and open “bad alignment”
from the course web page
• Export all sequences as alignment.
• Edit the alignment.
• Try to improve the %identity, and
consolidate gaps.
Methods for multiple
sequence alignment
• Dynamic programming
• Star
• Progressive
– ClustalW, uses variable gap penalty
– MUSCLE, stochastic. Uses profiles.
– Kalign. Very fast. Uses exact match.
• Is optimality possible? DP for three or more
sequences.
A 3D alignment matrix... DP in 3D
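A(i,j,k) = MAX {
A(i-1,j-1,k-1)+S(i,j,k),
A(i-1,j,k)-gap,
A(i,j-1,k)-gap,
A(i,j,k-1)-gap,
A(i-1,j-1,k)-gap,
A(i-1,j,k-1)-gap,
A(i,j-1,k-1)-gap }
where A is the 3D alignment matrix and S(i,j,k) is the score for
aligning the residues at positions i, j and k.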
How many more arrows when we
add a 4th seq?
How does DP scale?
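The 3D recurrence can be sketched in Python for three sequences. This is a minimal sketch under assumptions that are not in the slides: the names `predecessor_moves` and `align3_score` are hypothetical, S(i,j,k) is taken to be the sum of the three pairwise match/mismatch scores, and every move that introduces a gap is charged one flat `gap` penalty, as in the recurrence. Enumerating the step vectors also counts the arrows entering each cell.

```python
from itertools import product

def predecessor_moves(n_seqs):
    """All non-zero 0/1 step vectors: 1 = consume a residue, 0 = insert a gap."""
    return [m for m in product((0, 1), repeat=n_seqs) if any(m)]

def align3_score(a, b, c, match=1, mismatch=-1, gap=2):
    """Optimal score of a three-way alignment under the flat-gap recurrence."""
    def pair(x, y):
        return match if x == y else mismatch

    la, lb, lc = len(a), len(b), len(c)
    neg_inf = float("-inf")
    # A[i][j][k] = best score for prefixes a[:i], b[:j], c[:k]
    A = [[[neg_inf] * (lc + 1) for _ in range(lb + 1)] for _ in range(la + 1)]
    A[0][0][0] = 0
    moves = predecessor_moves(3)
    for i in range(la + 1):
        for j in range(lb + 1):
            for k in range(lc + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue
                    if (di, dj, dk) == (1, 1, 1):
                        # Assumed S(i,j,k): sum of the three pairwise scores.
                        step = (pair(a[i - 1], b[j - 1])
                                + pair(a[i - 1], c[k - 1])
                                + pair(b[j - 1], c[k - 1]))
                    else:
                        step = -gap  # flat penalty for any move with a gap
                    best = max(best, A[pi][pj][pk] + step)
                A[i][j][k] = best
    return A[la][lb][lc]

print(len(predecessor_moves(3)), len(predecessor_moves(4)))
```

With N sequences each cell has 2^N - 1 incoming arrows (7 for three sequences, 15 for four), and the matrix has on the order of L^N cells for sequences of length L, so exact DP becomes impractical quickly.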
Multiple sequence alignment -- Star
method
1. Align all sequences to one sequence.
2. Stack them up.
[Figure: star diagram of sequences A, B, C, D, E and G.]
Each pairwise alignment by itself looks fine, but when you stack them up, you see disagreements.
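The stacking step can be sketched as follows, assuming the pairwise alignments to the center sequence have already been computed. The function name `stack_on_center` and the toy sequences are hypothetical. The sketch keeps only columns where the center has a residue, so insertions relative to the center are dropped; a full star alignment would instead open gaps in every row.

```python
def stack_on_center(center, pairwise):
    """Stack pairwise alignments on the coordinates of the center sequence.

    pairwise: list of (aligned_center, aligned_other) strings using '-' for gaps.
    Returns one row per sequence, each exactly len(center) long.
    """
    rows = [center]
    for aligned_center, aligned_other in pairwise:
        if aligned_center.replace("-", "") != center:
            raise ValueError("pairwise alignment does not contain the center sequence")
        # Keep only the columns in which the center has a residue.
        row = [o for c, o in zip(aligned_center, aligned_other) if c != "-"]
        rows.append("".join(row))
    return rows

rows = stack_on_center(
    "AGHIW",
    [("AGHIW", "AGH-W"),      # deletion relative to the center
     ("AG-HIW", "AGTHIW")],   # insertion relative to the center (dropped here)
)
for row in rows:
    print(row)
```

Because each row is placed independently of the others, nothing checks that the non-center sequences agree with one another, which is where the disagreements appear.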
BLAST query-anchored alignments are
star alignments
Multiple sequence alignment --
Progressive method
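1. Align all pairs. Save scores in a distance matrix.
2. Pairwise align the two most similar.
3. Add the next most similar sequence, following the guide tree. Etc.
4. Continue until all sequences are aligned.
[Figure: a DP alignment matrix between the current alignment and the sequence to add.]
Current alignment:
A G H I . W W P F
A G H I I F W P Y
Sequence to add (residues visible in the figure): W, P, Y
Score of a residue against an alignment column:
S(P,[W,F]) = (1/2)(S(P,W) + S(P,F))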
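The column score S(P,[W,F]) = (1/2)(S(P,W) + S(P,F)) generalizes to an average over all residues in a column. The sketch below assumes a hypothetical `column_score` helper and a toy match/mismatch scoring function rather than a real substitution matrix; ignoring gap characters in the column is also an assumption.

```python
def column_score(residue, column, pair_score):
    """Average pairwise score of one residue against an alignment column."""
    residues = [r for r in column if r not in "-."]  # assumption: gaps are skipped
    if not residues:
        return 0.0
    return sum(pair_score(residue, r) for r in residues) / len(residues)

def toy_score(x, y):
    """Hypothetical scoring: +2 for identity, -1 otherwise."""
    return 2 if x == y else -1

# Column [W, F] from the current alignment, scored against W and against P.
print(column_score("W", "WF", toy_score))
print(column_score("P", "WF", toy_score))
```

In the DP matrix, this value takes the place of the single pairwise score S(i,j) when one axis is an alignment instead of a sequence.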
Distance and similarity are
interconvertible metrics.
Maximizing similarity and minimizing distance are
equivalent if, for each position in the alignment,
• d(i,j) + s(i,j) = smax,
where smax is the maximum possible similarity, and
the minimum distance is d = 0.
• Distance based on identity score
(p-distance)
d = 100 - %identity
• Distance using empirical J-C correction
dJC = -ln((Sreal-Srand)/(Sident-Srand))
where Sreal = score of the observed alignment,
Sident = score of an identity alignment, and
Srand = mode score of a false alignment.
[Figure: dJC plotted against Sreal.]
• For proteins, Srand ≈ 25%. “Twilight zone”
(R. Doolittle, 1986)
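Both distances are one-line computations. The sketch below uses hypothetical function names, and returning infinity when the observed score is at or below the random score is an assumption made here to avoid taking the log of a non-positive number.

```python
import math

def p_distance(percent_identity):
    """p-distance: d = 100 - %identity."""
    return 100.0 - percent_identity

def jc_distance(s_real, s_rand, s_ident):
    """Empirical J-C corrected distance: -ln((Sreal - Srand) / (Sident - Srand))."""
    fraction = (s_real - s_rand) / (s_ident - s_rand)
    if fraction <= 0:
        return math.inf  # assumption: at or below random, treat as saturated
    return -math.log(fraction)

# Example with scores expressed as % identity and Srand = 25 for proteins.
print(p_distance(97))
print(round(jc_distance(97, 25, 100), 4))
```

An identity alignment gives dJC = 0, and dJC grows without bound as Sreal approaches Srand, unlike the p-distance, which is capped at 100.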
In class: progressive alignment
Making a guide tree
Neighbor-joining algorithm:
     A    B    C    D    E    F
A        97   81   82   59   32   A
B             77   80   55   31   B
C                  90   65   40   C
D                       61   42   D
E                            33   E
F                                 F
Draw guide tree here
Fill in J-C distances.
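One way to check a hand-drawn guide tree is to run the exercise in code. The sketch below makes several assumptions: the matrix entries are read as % identity, the J-C correction uses Srand = 25 and Sident = 100, and `neighbor_joining` is a hypothetical, topology-only implementation (no branch lengths) that returns the tree as nested tuples.

```python
import math
from itertools import combinations

# Upper triangle of the in-class matrix, read here as % identity (assumption).
IDENTITY = {
    ("A", "B"): 97, ("A", "C"): 81, ("A", "D"): 82, ("A", "E"): 59, ("A", "F"): 32,
    ("B", "C"): 77, ("B", "D"): 80, ("B", "E"): 55, ("B", "F"): 31,
    ("C", "D"): 90, ("C", "E"): 65, ("C", "F"): 40,
    ("D", "E"): 61, ("D", "F"): 42,
    ("E", "F"): 33,
}

def jc_distance(s_real, s_rand=25.0, s_ident=100.0):
    """dJC = -ln((Sreal - Srand) / (Sident - Srand))."""
    return -math.log((s_real - s_rand) / (s_ident - s_rand))

def neighbor_joining(names, dist):
    """Join the pair minimizing Q(i,j) = (n-2)*d(i,j) - r(i) - r(j) until two nodes remain.

    dist maps frozenset({x, y}) to a distance. Returns nested tuples (topology only).
    """
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # r[x] = sum of distances from x to every other active node
        r = {x: sum(d[frozenset((x, y))] for y in nodes if y != x) for x in nodes}

        def q(pair):
            i, j = pair
            return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

        i, j = min(combinations(nodes, 2), key=q)
        new = (i, j)  # internal node represented by its two children
        for k in nodes:
            if k not in (i, j):
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d[frozenset((i, j))]
                )
        nodes = [k for k in nodes if k not in (i, j)] + [new]
    return tuple(nodes)

DIST = {frozenset(pair): jc_distance(score) for pair, score in IDENTITY.items()}
TREE = neighbor_joining("ABCDEF", DIST)
print(TREE)
```

The first pair joined is the one with the lowest Q value, which is not always the pair with the smallest raw distance; that correction for long branches is what distinguishes neighbor-joining from simple clustering.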
CLUSTALW
JD Thompson, DG Higgins, TJ Gibson - Nucleic acids research, 1994
[Figure: DP alignment matrix. Current alignment:
A G H I . W W P F
A G H I I F W P Y
Sequence to add (residues visible in the figure): A, W, P, Y]
MUSCLE
RC Edgar - Nucleic acids research, 2004
• Iterative MSA
– k-mer distance matrix (based on short identical matches)
– UPGMA tree
– progressive alignment --> MSA1
– Kimura distances from MSA1
– UPGMA tree
– progressive alignment --> MSA2
– For all tree branches:
• split tree into two
• calculate profiles
• align profiles
• accept or reject the alignment.
• Repeat
(Z&B p174)
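The first step, a distance based on short identical matches, needs no alignment at all. The sketch below is a simplified illustration and not MUSCLE's exact formula: the function names, the choice k = 3, and the normalization by the shorter sequence's k-mer count are all assumptions.

```python
from collections import Counter

def kmer_counts(seq, k=3):
    """Count every overlapping k-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def kmer_distance(a, b, k=3):
    """1 - (shared k-mers / number of k-mers in the shorter sequence)."""
    shared = sum((kmer_counts(a, k) & kmer_counts(b, k)).values())
    denom = min(len(a), len(b)) - k + 1
    if denom <= 0:
        return 1.0  # sequences too short to contain a k-mer
    return 1.0 - shared / denom

print(kmer_distance("ACDEFG", "ACDEFG"))
print(kmer_distance("AAAA", "CCCC"))
```

Because it only counts exact word matches, this distance is cheap enough to compute for every pair before any alignment exists, which is why it can seed the first UPGMA tree.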
MUSCLE iterative alignment
[Figure: phylogenetic tree with a random cut point (X) that splits the alignment into two subtrees.]
Current MSA:
XP_001615335 YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
XP_002259219 YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
XP_001347897 YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
XP_726635    YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--
XP_671449    ------------------------------------------------------------
XP_001458064 VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
XP_001347129 VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
XP_002283970 DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
XP_002367832 RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA
Subtree 1 (after the cut):
YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--
Subtree 2 (after the cut):
VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA
New MSA:
YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYV..SIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYV..SIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYV..FIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYI..NIFIYGNLSIPNEINIKNETN--
VVQAQYYTAELFLEELNILDLESLQQFHS..NYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHS..NYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVP..MLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLE..KVLTKNALDVFIMGDIDYEEARKLAEDFRAA
In each iteration:
The phylogenetic tree is cut at a random branch, the two
subtrees are converted to profiles, and the profiles are
aligned (DP profile-profile alignment). The new alignment
is either accepted or rejected.
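The iteration can be sketched as a loop with the profile-profile aligner passed in as a function. Everything named here is hypothetical: `refine`, `sp_score`, `strip_gap_columns` and the `align_profiles` callback are illustrative, the tree is represented only by the leaf sets on one side of each branch, and accepting a candidate when its sum-of-pairs score improves is an assumed acceptance rule.

```python
import random

def sp_score(rows, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score over all columns; gap-gap pairs score 0."""
    total = 0
    for col in zip(*rows):
        for x in range(len(col)):
            for y in range(x + 1, len(col)):
                a, b = col[x], col[y]
                if a == "-" and b == "-":
                    continue
                total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def strip_gap_columns(rows):
    """Remove columns that contain only gaps (left over after splitting)."""
    keep = [col for col in zip(*rows) if any(ch != "-" for ch in col)]
    return ["".join(col[i] for col in keep) for i in range(len(rows))]

def refine(msa, bipartitions, align_profiles, iterations=10, seed=0):
    """msa: dict name -> aligned row. bipartitions: leaf sets, one per tree branch."""
    rng = random.Random(seed)
    best, best_score = dict(msa), sp_score(list(msa.values()))
    for _ in range(iterations):
        side = rng.choice(bipartitions)              # cut the tree at a random branch
        left = [n for n in best if n in side]
        right = [n for n in best if n not in side]
        prof_l = strip_gap_columns([best[n] for n in left])
        prof_r = strip_gap_columns([best[n] for n in right])
        new_l, new_r = align_profiles(prof_l, prof_r)  # profile-profile alignment
        candidate = dict(zip(left + right, new_l + new_r))
        score = sp_score(list(candidate.values()))
        if score > best_score:                       # accept, otherwise reject
            best, best_score = candidate, score
    return best, best_score
```

Passing the aligner as a callback keeps the outer loop independent of how profiles are scored, so the same skeleton works with any profile-profile DP routine.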
Databases of multiple
sequence alignments
• BAliBASE -- structural alignment-based
• BLOCKS -- gapless regions
• PFAM -- Hidden Markov models
• CDD -- conserved domain database
• FSSP -- structural alignment-based
(families)
UGENE podcast: large
alignments
• Watch UGENE podcast #13
• http://ugene.unipro.ru/podcast_archive.html
Selective re-alignment
• Global affine-gap DP alignment may
be used to refine the region of an alignment
between two conserved, confidently aligned
columns.
– Select the columns, then align the selected
columns with MUSCLE.
– Or, paste into the ClustalW web site. Use the same
penalty for opening gap and end gap.
Review
• Are multiple sequence alignments optimal?
• How is phylogenetic information used in MSA
algorithms?
• What are the advantages/disadvantages of a
“star” alignment?
• What information is ClustalW encoding in its
MSA algorithm?
• What does the outermost loop in the MUSCLE
alignment probably look like?