Bioinformatics 1 - Lecture 8: Multiple Sequence Alignment

1. MUSCLE performs multiple sequence alignment iteratively: it builds an initial alignment, calculates distances, builds a new tree, and refines the alignment repeatedly until convergence.
2. It splits branches of the phylogenetic tree and realigns the resulting profiles, keeping only changes that improve the alignment score.
3. This lets it incorporate more information from related sequences than progressive methods do, improving alignment accuracy.

Uploaded by

Mohsan Ullah

Bioinformatics 1 -- Lecture 8

Multiple sequence alignment


Pop-quiz explanation
• At the end of lecture 7, I presented a pop quiz: how many random scores have e-values less than or equal to 10? (Answer: 10.) Why?
• Consider this random score distribution. Each marked zone has the same area, right?
• If you randomly pick from this distribution (think about darts), what does the distribution of zones (e-values) look like? (It's flat!)
[Figure: random score distribution divided into equal-area zones labeled by m*P(S≥x) (1, 2, 3, 4, ... 10), and a histogram of frequency versus m*P(S≥x) over the values 1 to 10.]
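The flat histogram can be checked with a small simulation. This is a minimal sketch, assuming random scores follow an exponential tail, P(S≥x) = exp(-x); that model and the database size m = 2000 are illustrative choices, not the lecture's scoring model. Since the e-value is m*P(S≥x), each unit-wide e-value zone should collect about one random score on average, and about 10 scores should have e-values of 10 or less.

```python
import math
import random

def evalue_zone_counts(m, seed):
    """Draw m random scores; count E-values falling in (0,1], (1,2], ..., (9,10]."""
    rng = random.Random(seed)
    counts = [0] * 10
    for _ in range(m):
        score = rng.expovariate(1.0)     # random score (assumed exponential)
        p_value = math.exp(-score)       # P(S >= score) under the same model
        e_value = m * p_value            # E-value = m * P(S >= x)
        if e_value <= 10:
            zone = max(math.ceil(e_value) - 1, 0)
            counts[zone] += 1
    return counts

# Average over many simulated databases of m = 2000 random scores each.
runs = [evalue_zone_counts(2000, seed) for seed in range(400)]
mean_per_zone = [sum(r[z] for r in runs) / len(runs) for z in range(10)]
mean_total = sum(mean_per_zone)
print([round(x, 2) for x in mean_per_zone])  # each zone averages close to 1
print(round(mean_total, 2))                  # the total averages close to 10
```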
In-class competition exercise: editing a multiple sequence alignment in UGENE
• Download and open “bad alignment” from the course web page.
• Export all sequences as alignment.
• Edit the alignment.
• Try to improve the %identity, and consolidate gaps.

How many indel events are implied by your alignment?
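One rough way to count implied indel events is to count distinct gap blocks. The sketch below assumes that gaps spanning exactly the same columns in several sequences come from a single event, and that every other gap block is a separate event. This is a simplification: a proper count would place events on the phylogenetic tree. The toy alignment is hypothetical.

```python
def gap_runs(row, gap_chars="-."):
    """Return the set of (start, end) column intervals of contiguous gaps in one row."""
    runs, start = set(), None
    for col, ch in enumerate(row):
        if ch in gap_chars:
            if start is None:
                start = col              # a gap run begins here
        elif start is not None:
            runs.add((start, col - 1))   # the gap run ended at the previous column
            start = None
    if start is not None:                # a gap run reaches the end of the row
        runs.add((start, len(row) - 1))
    return runs

def implied_indel_events(alignment):
    """Count distinct gap intervals; identical intervals in several rows count once."""
    events = set()
    for row in alignment:
        events |= gap_runs(row)
    return len(events)

toy = ["AGH-IWW",
       "AGH-IFW",
       "AGHII-W"]
print(implied_indel_events(toy))  # 2: one shared gap at column 3, one at column 5
```

Consolidating gaps so that they share the same start and end columns lowers this count, which is the point of the exercise.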

Methods for multiple sequence alignment
• Dynamic programming
• Star
• Progressive
– ClustalW, uses variable gap penalty
– MUSCLE, stochastic. Uses profiles.
– Kalign. Very fast. Uses exact match.

MSA algorithms must be computationally efficient AND biologically relevant.
Applications of MSAs
• Phylogenetic analysis
• Function prediction
• Structure prediction

• Is optimality possible? DP for three or more sequences.

A 3D alignment matrix... DP in 3D
A(i,j,k) = MAX {
  A(i-1,j-1,k-1) + S(i,j,k),
  A(i-1,j,k) - gap,
  A(i,j-1,k) - gap,
  A(i,j,k-1) - gap,
  A(i-1,j-1,k) - gap,
  A(i-1,j,k-1) - gap,
  A(i,j-1,k-1) - gap }
where A is the alignment score matrix and S(i,j,k) is the score for aligning positions i, j and k.
How many more arrows when we add a 4th seq?
How does DP scale?
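The arrows can be enumerated directly. In the sketch below each sequence either advances by one position or takes a gap, and the move in which no sequence advances is excluded, so n sequences give 2^n - 1 arrows per cell. The operation count is a rough estimate that assumes n sequences of equal length; the function names are illustrative.

```python
from itertools import product

def dp_moves(n):
    """All predecessor offsets ('arrows') into one cell of an n-dimensional DP matrix."""
    # Each sequence either advances (1) or takes a gap (0); all-zero is not a move.
    return [move for move in product((0, 1), repeat=n) if any(move)]

def dp_cost(n, length):
    """Rough operation count: number of cells times arrows per cell."""
    return (length + 1) ** n * len(dp_moves(n))

for n in (2, 3, 4):
    print(n, len(dp_moves(n)), dp_cost(n, 100))
```

Going from three to four sequences raises the arrow count from 7 to 15, while the number of cells grows by another factor of the sequence length, so exact DP is exponential in the number of sequences.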
Multiple sequence alignment -- Star method
1. Align all sequences to one sequence.
2. Stack them up.

[Figure: star diagram of sequences A, B, C, D, E, F, G]

Potential problems with star alignment:
• Unaligned gaps.
• Ambiguous associations.

A G H . I . W W . P F W P
A G H . I I F W . P Y . .
A G H I I . . W F P F W P
A G H . I P W W . P . . .

Each pairwise alignment by itself looks fine, but when you stack them up, you see disagreements.
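The stacking step can be sketched as follows. This is a minimal illustration, assuming the pairwise alignments to the center sequence are already available as (center, other) string pairs with "-" for gaps; the function name and the toy sequences are hypothetical.

```python
def merge_star(center_pairs):
    """Stack pairwise alignments (center_aligned, other_aligned) into one MSA."""
    first_center, first_other = center_pairs[0]
    msa = [list(first_center), list(first_other)]   # row 0 is always the center
    for center, other in center_pairs[1:]:
        new_row, i, j = [], 0, 0
        while i < len(msa[0]) or j < len(center):
            mc = msa[0][i] if i < len(msa[0]) else None   # center column in the MSA
            pc = center[j] if j < len(center) else None   # center column in the pair
            if mc is not None and pc is not None and (mc == "-") == (pc == "-"):
                new_row.append(other[j])    # same column in both alignments
                i += 1
                j += 1
            elif mc == "-":
                new_row.append("-")         # gap column that exists only in the MSA
                i += 1
            else:
                for row in msa:             # gap column that exists only in the pair
                    row.insert(i, "-")
                new_row.append(other[j])
                i += 1
                j += 1
        msa.append(new_row)
    return ["".join(row) for row in msa]

pairs = [("AGHI-WP", "AGHIIWP"),
         ("AGHIWP", "AGH-WP"),
         ("AGHI-WP", "AGHIPWP")]
for row in merge_star(pairs):
    print(row)
```

In the toy result the inserted I of the second row and the inserted P of the fourth row land in the same column even though they were never aligned to each other, which is the kind of ambiguous association listed above.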
BLAST query-anchored alignments are star alignments
Multiple sequence alignment -- Progressive method
1. Align all pairs. Save scores in a distance matrix.
2. Pairwise align the two most similar.
3. Add the next most similar sequence, following the guide tree. Etc.
4. Continue until all sequences are aligned.

Current alignment:
A G H I . W W P F
A G H I I F W P Y
Sequence to add: A W P Y
[Figure: DP alignment matrix of the current alignment against the sequence to add]

A residue is scored against an alignment column by averaging over the column:
S(P,[W,F]) = (1/2)(S(P,W) + S(P,F))
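The column score above generalizes to any number of sequences in the column. This is a minimal sketch, assuming a toy match/mismatch score in place of a real substitution matrix such as BLOSUM62, and assuming that gap characters in the column are skipped; both choices are illustrative.

```python
def pair_score(a, b, match=1.0, mismatch=0.0):
    """Toy substitution score (an assumption); a real aligner would use a matrix."""
    return match if a == b else mismatch

def column_score(residue, column, score=pair_score):
    """Average score of one residue against every residue in an alignment column."""
    residues = [c for c in column if c not in "-."]   # gaps are skipped (assumption)
    if not residues:
        return 0.0
    return sum(score(residue, c) for c in residues) / len(residues)

print(column_score("P", "WF"))  # (S(P,W) + S(P,F)) / 2 = 0.0 with the toy score
print(column_score("W", "WF"))  # (S(W,W) + S(W,F)) / 2 = 0.5 with the toy score
```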
Distance and similarity are interconvertible metrics.
Maximizing similarity and minimizing distance are equivalent if
• d(i,j) + s(i,j) = smax,
where smax is the maximum possible similarity and the minimum distance is d = 0, for each position in the alignment.
• Distance based on identity score (p-distance):
  d = 100 - %identity
• Distance using empirical J-C correction:
  dJC = -ln((Sreal - Srand)/(Sident - Srand))
where Sreal = score of the actual alignment, Sident = score of an identity alignment, and Srand = mode score of a false alignment.
[Figure: plot of dJC versus Sreal]
• For proteins, Srand ≈ 25%. “Twilight zone” (R. Doolittle, 1986)
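Both distances are one-line computations. The sketch below assumes scores are expressed as percent identity, so that Sident = 100, and uses Srand = 25 as quoted above for proteins. Returning infinity at or below the random score is an added convention, since the logarithm is undefined there.

```python
import math

def p_distance(percent_identity):
    """p-distance: d = 100 - %identity."""
    return 100.0 - percent_identity

def jc_distance(s_real, s_rand=25.0, s_ident=100.0):
    """Empirical J-C corrected distance: -ln((Sreal - Srand) / (Sident - Srand))."""
    excess = s_real - s_rand
    if excess <= 0:
        return math.inf          # no better than a false alignment
    return math.log((s_ident - s_rand) / excess)

for identity in (97, 59, 32):
    print(identity, p_distance(identity), round(jc_distance(identity), 3))
```

The corrected distance grows much faster than the p-distance as the identity approaches the random level.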
In class: progressive alignment
Making a guide tree
Neighbor-joining algorithm:

     A    B    C    D    E    F
A         97   81   82   59   32
B              77   80   55   31
C                   90   65   40
D                        61   42
E                             33
F

Draw guide tree here (leaves A, B, C, D, E, F).
Fill in J-C distances.
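A guide tree can be built from this matrix by repeatedly joining the closest pair. Note that the sketch below uses average-linkage (UPGMA-style) clustering on d = 100 - similarity, not the neighbor-joining algorithm named above, because it is shorter to write; neighbor joining can give a different topology. It also assumes the matrix entries are percent similarities.

```python
from itertools import combinations

def upgma_guide_tree(names, similarity):
    """Average-linkage clustering on d = 100 - similarity; returns a nested tuple."""
    # Each key is a subtree (leaf name or nested tuple); the value lists its leaves.
    clusters = {name: [name] for name in names}

    def leaf_distance(a, b):
        s = similarity[(a, b)] if (a, b) in similarity else similarity[(b, a)]
        return 100.0 - s

    def cluster_distance(x, y):
        pairs = [(a, b) for a in clusters[x] for b in clusters[y]]
        return sum(leaf_distance(a, b) for a, b in pairs) / len(pairs)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=lambda p: cluster_distance(*p))
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)   # join closest pair
    return next(iter(clusters))

similarity = {
    ("A", "B"): 97, ("A", "C"): 81, ("A", "D"): 82, ("A", "E"): 59, ("A", "F"): 32,
    ("B", "C"): 77, ("B", "D"): 80, ("B", "E"): 55, ("B", "F"): 31,
    ("C", "D"): 90, ("C", "E"): 65, ("C", "F"): 40,
    ("D", "E"): 61, ("D", "F"): 42,
    ("E", "F"): 33,
}
tree = upgma_guide_tree("ABCDEF", similarity)
print(tree)
```

With these values A and B join first, then C and D, then the two pairs, then E, and finally F.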
CLUSTALW
JD Thompson, DG Higgins, TJ Gibson - Nucleic acids research, 1994

• Start with unrooted tree, using neighbor joining.
• Choose root to get guide tree.
• Progressive alignment
– matches are scored using sequence weights
– gap penalties are position dependent
• GOP (gap opening penalty) lower for polar residues
• GOP zero where there is already a gap
http://www.ebi.ac.uk/Tools/msa/clustalw2/
http://www.ch.embnet.org/software/ClustalW-XXL.html
There should be no gap penalty for aligning a gap to an already existing gap.
If i is already a gap position in any sequence, set gap(i)=0.

Current alignment:
A G H I . W W P F
A G H I I F W P Y
Sequence to add: A W P Y

A(i,j) = A(i-1,j) - gap(i)

NOTE: DP is still optimal when the gap penalty is position-specific.
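The rule gap(i)=0 can be dropped into an ordinary global DP. This is a minimal sketch, assuming a linear gap cost of 2, a toy identity score averaged over each column, and a hypothetical sequence to add; none of these values come from ClustalW itself, and every column is assumed to contain at least one residue.

```python
def column_gap_penalties(alignment, gap=2.0):
    """gap(i) = 0 if any sequence already has a gap in column i, else the default."""
    ncols = len(alignment[0])
    return [0.0 if any(row[i] in "-." for row in alignment) else gap
            for i in range(ncols)]

def align_score(seq, alignment, gap=2.0):
    """Best global score of seq against the alignment columns."""
    gaps = column_gap_penalties(alignment, gap)
    ncols = len(alignment[0])

    def match(residue, i):
        # Toy score: fraction of residues in column i identical to the new residue.
        column = [row[i] for row in alignment if row[i] not in "-."]
        return sum(1.0 for c in column if c == residue) / len(column)

    # A[i][j]: best score of the first i columns against the first j residues.
    A = [[0.0] * (len(seq) + 1) for _ in range(ncols + 1)]
    for i in range(1, ncols + 1):
        A[i][0] = A[i - 1][0] - gaps[i - 1]
    for j in range(1, len(seq) + 1):
        A[0][j] = A[0][j - 1] - gap
    for i in range(1, ncols + 1):
        for j in range(1, len(seq) + 1):
            A[i][j] = max(
                A[i - 1][j - 1] + match(seq[j - 1], i - 1),
                A[i - 1][j] - gaps[i - 1],   # A(i,j) = A(i-1,j) - gap(i)
                A[i][j - 1] - gap,           # new gap column in the alignment
            )
    return A[ncols][len(seq)]

current = ["AGHI.WWPF",
           "AGHIIFWPY"]
print(column_gap_penalties(current))
print(align_score("AGHIWWPY", current))
```

The new sequence skips the column that already contains a gap at no cost, so its best score is the sum of its column matches.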
CLUSTALW Position specific gap penalty

MUSCLE
RC Edgar - Nucleic acids research, 2004

• Iterative MSA
– k-mer distance matrix (based on short identical matches)
– UPGMA tree
– progressive alignment --> MSA1
– Kimura distances from MSA1
– UPGMA tree
– progressive alignment --> MSA2
– For all tree branches:
• split tree into two
• calculate profiles
• align profiles
• accept or reject the alignment.
• Repeat
(Z&B p174)
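The first step, a distance based on short identical matches, needs no alignment at all. The sketch below is a simplified set-based version, assuming k = 3 and normalizing by the smaller k-mer set; it illustrates the idea and is not MUSCLE's exact formula.

```python
def kmers(seq, k=3):
    """Set of all overlapping k-mers (short words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def kmer_distance(a, b, k=3):
    """1 - fraction of shared k-mers, relative to the smaller k-mer set."""
    ka, kb = kmers(a, k), kmers(b, k)
    if not ka or not kb:
        return 1.0                       # a sequence shorter than k shares nothing
    return 1.0 - len(ka & kb) / min(len(ka), len(kb))

# Two sequences taken from the progressive-alignment example, gaps removed.
print(round(kmer_distance("AGHIWWPF", "AGHIIFWPY"), 3))
```

Because only word lookups are involved, a full distance matrix can be computed quickly before any alignment exists.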
MUSCLE iterative alignment

Current MSA, with its phylogenetic tree (X marks the random cut point):
XP_001615335  YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
XP_002259219  YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
XP_001347897  YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
XP_726635     YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--
XP_671449     ------------------------------------------------------------
XP_001458064  VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
XP_001347129  VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
XP_002283970  DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
XP_002367832  RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA

The two sub-alignments after the cut:
YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYVSIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYVSIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYVVIFIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYINIFIYGNLSIPNEINIKNETN--

VVQAQYYTAELFLEELNILDLESLQQFHSNYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHSNYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVPMLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLEKVLTKNALDVFIMGDIDYEEARKLAEDFRAA

DP profile-profile alignment

New MSA:
YEPTDKEMDDILSAYFFYPSYKDYTRYVVDIFHRNYV..SIFIYGNIAMPTEKEDENATS--
YDPTDKEMDDLLSAYFFYPSYKDYTKYVVDFFHRNYV..SIFIYGNIAMTTEKENENATS--
YTPTNKEMYDILNAYFFYPSYNAYRTYVNEYFLRNYV..FIYGNIIISDLKGEENITKNN
YIPTNKEIYDILNAYLFYPLYNSYIKYINNFFHKNYI..NIFIYGNLSIPNEINIKNETN--
VVQAQYYTAELFLEELNILDLESLQQFHS..NYFSNFRVSSFVSGNILRSEVEDLLHSIR--
VVQAQYYTSQLFQDELATLDLESLQEFHS..NYFSNFRVSSFVSGNILRSEVEDLLHTIR--
DNTWPWMDG---LEVIPHLEADDLAKFVP..MLLSRAFLECYIAGNIEPKEAEAMIHHIE--
RNRFSQLDLRSAVTDASS-QFEDFKVFLE..KVLTKNALDVFIMGDIDYEEARKLAEDFRAA

In each iteration: the phylogenetic tree is cut at a random branch, the two subtrees are converted to profiles, and aligned. The new alignment is either accepted or rejected.
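The iteration described on this slide has a simple outer loop. The sketch below follows the slide's description (random branch, split, realign, accept or reject) and is not taken from MUSCLE's source. The `split`, `realign` and `score` callbacks are hypothetical placeholders, and the toy versions used here only drop all-gap columns and prefer shorter alignments.

```python
import random

def refine(alignment, branches, split, realign, score, iterations=100, seed=0):
    """Tree-cut refinement loop: keep a realignment only if the score improves."""
    rng = random.Random(seed)
    best, best_score = alignment, score(alignment)
    for _ in range(iterations):
        branch = rng.choice(branches)        # random cut point
        left, right = split(best, branch)    # the two sub-alignments (profiles)
        candidate = realign(left, right)     # profile-profile alignment
        candidate_score = score(candidate)
        if candidate_score > best_score:     # accept
            best, best_score = candidate, candidate_score
        # otherwise reject and keep the current alignment
    return best, best_score

# Toy stand-ins (assumptions): a "branch" is a row index that divides the rows.
def toy_split(rows, branch):
    return rows[:branch], rows[branch:]

def toy_realign(left, right):
    rows = left + right
    keep = [i for i in range(len(rows[0])) if any(r[i] != "-" for r in rows)]
    return ["".join(r[i] for i in keep) for r in rows]

def toy_score(rows):
    return -len(rows[0])                     # fewer columns counts as better

start = ["AG-H", "AG-H", "A--H"]
result, result_score = refine(start, [1, 2], toy_split, toy_realign, toy_score)
print(result, result_score)
```

Because a candidate is kept only when its score is strictly higher, the alignment score never decreases from one iteration to the next.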
Databases of multiple sequence alignments
• BAliBASE -- structural alignment-based
• BLOCKS -- gapless regions
• PFAM -- Hidden Markov models
• CDD -- conserved domain database
• FSSP -- structural alignment-based (families)
UGENE podcast: large alignments
• Watch UGENE podcast #13
• http://ugene.unipro.ru/podcast_archive.html

– Reproduce the steps from the podcast!
Selective re-alignment
• Global affine-gap DP alignment may be used to refine an alignment between two conserved and confidently aligned columns.
– Select the region, then Align with MUSCLE on the selected columns.
– Or, paste into the ClustalW web site. Use the same penalty for opening gap and end gap.
Review
• Are multiple sequence alignments optimal?
• How is phylogenetic information used in MSA algorithms?
• What are the advantages/disadvantages of a “star” alignment?
• What information is ClustalW encoding in its MSA algorithm?
• What does the outermost loop in the MUSCLE alignment probably look like?
