Multiple Alignment
Ranojit Sarker
ITME
Reasons for aligning sets of sequences
• Organise data to reflect sequence homology
• Infer phylogenetic trees from homologous
sites
• Highlight conserved sites/regions
• Highlight variable sites/regions
• Uncover changes in gene structure
• Summarise information
Multiple sequence alignments (MA)
Assumptions:
1) Sequences are related by a common ancestor+
2) Sequences are independent*+
3) Positions (columns) of the alignment are independent*
Ideally we would have a species tree and the sequences from each species …
Multiple Sequence Alignment
Again, we need a scoring system and a search method.
Scoring:
a) Same substitution matrices and gap penalties as for pairwise alignment
b) Score of an alignment is the ‘Sum of Pairs’ (SP)*
* Here is where having some very closely related species can skew things.
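The Sum of Pairs score can be sketched in a few lines. This is a minimal illustration, not the scoring used by any particular program: the match, mismatch and gap values are assumptions made for the example, where a real run would use a substitution matrix and the gap penalties mentioned above.

```python
from itertools import combinations

# Assumed toy scoring scheme (not from the slides): match +1, mismatch -1,
# residue against gap -2, gap against gap 0.
MATCH, MISMATCH, GAP = 1, -1, -2


def pair_score(a: str, b: str) -> int:
    """Score one pair of symbols taken from the same alignment column."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH


def sum_of_pairs(alignment: list[str]) -> int:
    """Add up pair_score over every pair of rows in every column."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            total += pair_score(a, b)
    return total


print(sum_of_pairs(["AC-", "ACG", "A-G"]))
```

Because every pair of rows counts equally, adding near-duplicate sequences multiplies the number of agreeing pairs, which is one way to see the skew noted in the footnote.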
Alignment methods
Pairwise vs Multiple Sequences
• Sequence alignment is easy with
sufficiently closely related sequences
How Clustal works
• Exploit the fact that similar sequences are
evolutionarily related.
• Build up a multiple alignment progressively by a
series of pairwise alignments, following the
branches of a guide tree.
• In brief: we first guess the evolutionary picture,
then we generate the alignment according to it.
• Naturally, the alignment will suggest an
evolutionary picture which might be different
from the one we guessed first.
How Clustal works
1) all pairs of sequences are aligned separately in
order to calculate a distance matrix giving the
divergence of each pair of sequences;
2) a guide tree is calculated from the distance matrix
(how? we’ll see);
3) the sequences are progressively aligned according
to the branching order (i.e. starting from the
closest pairs) in the guide tree.
step 1: pairwise alignments
• Global pairwise alignment between every
pair of sequences.
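A minimal sketch of this step, assuming a Needleman-Wunsch style global alignment with a simple linear gap penalty. The scoring values and the function names are illustrative assumptions; they are not Clustal's actual parameters.

```python
from itertools import combinations


def global_alignment_score(s: str, t: str, match: int = 1,
                           mismatch: int = -1, gap: int = -2) -> int:
    """Best global alignment score of s and t (dynamic programming)."""
    # prev[j] holds the best score for aligning s[:i-1] with t[:j].
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def all_pairs_scores(seqs: dict[str, str]) -> dict[tuple[str, str], int]:
    """Score every unordered pair of named sequences."""
    return {
        (a, b): global_alignment_score(seqs[a], seqs[b])
        for a, b in combinations(sorted(seqs), 2)
    }


scores = all_pairs_scores({"A": "GATTACA", "B": "GATCA", "C": "GTTACA"})
```

With N sequences this performs N(N-1)/2 pairwise alignments, so the step grows quadratically with the number of sequences.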
step 1b: distance matrix
• The score of the alignment between any two
sequences is converted into a distance in [0, 1]
(0 for identical sequences, 1 for sequences with no identical positions).
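One simple way to obtain such a distance is one minus the fraction of identical positions in the pairwise alignment. The sketch below assumes that conversion and assumes that columns containing a gap are skipped; the slides do not state the exact formula.

```python
def identity_distance(a: str, b: str) -> float:
    """Distance in [0, 1] from two rows of a pairwise alignment.

    Assumption: distance = 1 - (identical positions / compared positions),
    where columns containing a gap are not compared.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue
        compared += 1
        identical += x == y
    if compared == 0:
        return 1.0  # nothing to compare: treat as maximally distant
    return 1.0 - identical / compared


print(identity_distance("ACGT", "ACGT"))  # identical rows give 0.0
```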
distance matrix
[Table: pairwise distance matrix; only the last two rows survive]
seq C   -   0.6
seq D   -
Step 2: the guide tree
[Figure: guide tree calculated from the distance matrix; only the leaf labels A and D survive]
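The slides do not say here which method builds the guide tree. As a stand-in, the sketch below uses UPGMA (average-linkage clustering), chosen only because it is short; other tree-building methods such as neighbour joining could be used instead. The distances in the example are invented, apart from the value 0.6 between C and D.

```python
def upgma(names, dist):
    """Build a guide tree as nested tuples by average-linkage clustering.

    names: leaf labels; dist: {(name_a, name_b): distance} for every pair.
    """
    index = {name: i for i, name in enumerate(names)}
    trees = dict(enumerate(names))   # cluster id -> subtree
    sizes = {i: 1 for i in trees}    # cluster id -> number of leaves
    d = {tuple(sorted((index[a], index[b]))): v for (a, b), v in dist.items()}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(d, key=d.get)     # closest pair of clusters
        new_size = sizes[i] + sizes[j]
        for k in trees:
            if k in (i, j):
                continue
            dik = d[tuple(sorted((i, k)))]
            djk = d[tuple(sorted((j, k)))]
            # Size-weighted average, so every leaf pair counts equally.
            d[(k, next_id)] = (sizes[i] * dik + sizes[j] * djk) / new_size
        # Forget all distances that involve the two merged clusters.
        d = {pair: v for pair, v in d.items() if i not in pair and j not in pair}
        trees[next_id] = (trees.pop(i), trees.pop(j))
        sizes[next_id] = new_size
        del sizes[i], sizes[j]
        next_id += 1
    return next(iter(trees.values()))


# Hypothetical distances for four sequences.
toy = {("A", "C"): 0.2, ("A", "B"): 0.4, ("B", "C"): 0.4,
       ("A", "D"): 0.6, ("B", "D"): 0.6, ("C", "D"): 0.6}
tree = upgma(["A", "B", "C", "D"], toy)
print(tree)
```

The left/right order inside each tuple carries no meaning; only the nesting matters. Here A and C are joined first, then B, then D.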
Progressive Pairwise Methods
• Most of the available multiple alignment
programs use some sort of incremental
or progressive method that makes
pairwise alignments, then adds new
sequences one at a time to these
aligned groups.
• This is an approximate method!
Step 3: progressive alignment
• Align sequences gradually, starting from the
closest ones on the tree. Each time sequences are
aligned, we make a further hypothesis as to how
evolution has worked.
• Every time an alignment is performed, the original
sequences are substituted with their alignment.
• Along the way we align alignments instead of
sequences. This is not a problem: profiles can be
aligned against sequences, or against other
profiles.
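Aligning two alignments can be sketched with the same dynamic programming as a pairwise alignment, scoring whole columns instead of single residues. The code below is an illustration under stated assumptions: the column score is the average pairwise score, gaps are linear, and the names `column_score` and `align_profiles` are hypothetical. Gaps already present inside either input are never removed.

```python
def column_score(col_a, col_b, match=1, mismatch=-1, gap=-2):
    """Average pairwise score between two alignment columns."""
    total = 0
    for x in col_a:
        for y in col_b:
            if x == "-" and y == "-":
                continue
            if x == "-" or y == "-":
                total += gap
            else:
                total += match if x == y else mismatch
    return total / (len(col_a) * len(col_b))


def align_profiles(A, B, gap=-2):
    """Merge two alignments (lists of equal-length gapped strings)."""
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    n, m = len(cols_a), len(cols_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1]),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )
    # Trace back from the bottom-right corner, inserting all-gap columns.
    gap_a, gap_b = ("-",) * len(A), ("-",) * len(B)
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + column_score(cols_a[i - 1], cols_b[j - 1])
        ):
            out_a.append(cols_a[i - 1]); out_b.append(cols_b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(cols_a[i - 1]); out_b.append(gap_b)
            i -= 1
        else:
            out_a.append(gap_a); out_b.append(cols_b[j - 1])
            j -= 1
    out_a.reverse(); out_b.reverse()
    rows_a = ["".join(r) for r in zip(*out_a)]
    rows_b = ["".join(r) for r in zip(*out_b)]
    return rows_a + rows_b


merged = align_profiles(["ACGT", "ACGT"], ["AGT"])
print(merged)
```

A single sequence is just an alignment with one row, so the same function covers the profile-against-sequence case.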
progressive alignment
[Figure sequence: clusters merged step by step along the guide tree]
A+C
(A+C) + B
((A+C) + B) + D
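The merge order shown in these steps can be read off a guide tree by visiting it bottom-up. A minimal sketch, assuming the tree is stored as nested tuples of sequence names; this representation and the function name `merge_order` are choices made for the example.

```python
def merge_order(tree):
    """List the alignment steps implied by a guide tree, leaves first."""
    steps = []

    def visit(node):
        if isinstance(node, str):  # a leaf is a single sequence name
            return node
        left, right = (visit(child) for child in node)
        label = f"({left}+{right})"
        steps.append(label)        # children are aligned before parents
        return label

    visit(tree)
    return steps


for step in merge_order(((("A", "C"), "B"), "D")):
    print(step)
```

Each printed step corresponds to one pairwise alignment of sequences or profiles.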
ClustalW: a common heuristic multiple alignment program
Advantages of ClustalW:
3. Uses affine gap penalties that are influenced by existing gaps in the
multiple alignment
Y D D G A V - E A L
Y D G G - - - E A L
F E G G I L V E A L
F D - G I L V Q A V
Y E G G A V V Q A L
Y D G G A/I V/L V E A L
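The last row of the example reads like a majority-rule summary of the five rows above it, with ties joined by "/". A minimal sketch that reproduces it, assuming gaps are ignored when counting and tied residues are listed in top-to-bottom order of first appearance:

```python
from collections import Counter

ROWS = [
    "Y D D G A V - E A L",
    "Y D G G - - - E A L",
    "F E G G I L V E A L",
    "F D - G I L V Q A V",
    "Y E G G A V V Q A L",
]


def consensus(rows: list[str]) -> str:
    """Most frequent residue per column; ties are joined with '/'."""
    out = []
    for column in zip(*(row.split() for row in rows)):
        counts = Counter(res for res in column if res != "-")
        if not counts:
            out.append("-")  # a column made only of gaps
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so ties print top to bottom.
        out.append("/".join(res for res, n in counts.items() if n == top))
    return " ".join(out)


print(consensus(ROWS))  # Y D G G A/I V/L V E A L
```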