Multiple Alignment

The document discusses multiple sequence alignment (MSA), its importance for organizing sequence data, inferring phylogenetic relationships, and identifying conserved and variable regions. It outlines the challenges of aligning multiple sequences compared to pairwise alignments and describes methods such as Clustal W, which uses a progressive approach based on evolutionary relationships. The document also highlights scoring systems and the significance of consensus sequences in representing common features across aligned sequences.

Uploaded by

rebernate92

Multiple Sequence Alignment

Ranojit Sarker
ITME

1
Reasons for aligning sets of sequences
• Organise data to reflect sequence homology
• Infer phylogenetic trees from homologous sites
• Highlight conserved sites/regions
• Highlight variable sites/regions
• Uncover changes in gene structure
• Summarise information
2
Multiple sequence alignments (MSA)

• We may want to find the optimal alignment of multiple sequences instead of pairs of sequences.
• For instance, given proteins with the same function from multiple organisms, we want to find out which parts of the sequences match and which parts contain the most gaps and mismatches.
3
Multiple Alignments
• In theory, making an optimal alignment between two sequences is computationally straightforward (dynamic programming: Needleman-Wunsch for global alignment, Smith-Waterman for local alignment), but aligning a large number of sequences with the same method is computationally infeasible.
• The cost grows exponentially with the number of sequences: the dynamic-programming table has as many cells as the product of the sequence lengths.
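To get a feel for this growth, here is a minimal sketch that counts the cells of a full dynamic-programming table. The function name `dp_cells` and the sequence length of 300 are illustrative assumptions, not part of the original slides.

```python
from math import prod

def dp_cells(lengths):
    """Number of cells in a full dynamic-programming table (hypothetical helper)."""
    # One axis per sequence; each axis has length + 1 positions
    # because the empty prefix also needs a cell.
    return prod(n + 1 for n in lengths)

# Assumed example: k sequences of 300 residues each.
for k in (2, 3, 5, 10):
    print(k, "sequences:", dp_cells([300] * k), "cells")
```

Two sequences need about 90 thousand cells; each extra sequence multiplies that by roughly 300 again.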
4
Multiple Alignment: an extension of pair-wise alignment
Sequence a   A C A - - - A T G
Sequence b   T C A A C T A T C
Sequence c   A C A C - - A G C
Sequence d   A C C G - - A T C
Sequence e   A G A - - - A T C

Assumptions:
1) Sequences are related by a common ancestor+
2) Sequences are independent*+
3) Positions (columns) of the alignment are independent*

* Actually, neither of these is true … but it makes the computation easier.

BUT, treating all the sequences as independent can cause significant biases.
+ The sequences are related by a common ancestor: multiple closely related sequences can skew the alignment.

Ideally we would have a species tree and the sequences from each species …
5
Multiple Sequence Alignment
Again, we need a scoring system and a search method.

Scoring:
a) Same substitution matrices and gap penalties as pair-wise
b) Score of an alignment is the ‘Sum of Pairs’ (SP)*: the pairwise scores summed over every pair of sequences

* Here is where having some very closely related species can skew things.

c) Other scoring systems exist too (e.g. entropy-based column scores)
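A minimal sketch of Sum of Pairs scoring. The match/mismatch/gap values and the linear (not affine) gap penalty are simplifying assumptions for illustration; a real aligner would use a substitution matrix such as BLOSUM or PAM.

```python
from itertools import combinations

# Illustrative scoring values (assumptions, not ClustalW defaults).
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(x, y):
    """Score one pair of symbols taken from the same alignment column."""
    if x == "-" and y == "-":
        return 0          # gap against gap is ignored
    if x == "-" or y == "-":
        return GAP        # residue against gap
    return MATCH if x == y else MISMATCH

def sum_of_pairs(alignment):
    """Sum pair_score over every pair of rows in every column."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(
        pair_score(x, y)
        for column in zip(*alignment)
        for x, y in combinations(column, 2)
    )

print(sum_of_pairs(["ACA-", "ACAT", "TCAT"]))
```

Adding several near-identical copies of one sequence adds many high-scoring pairs, which is how closely related sequences can dominate the SP score.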

Alignment methods

1. Multi-dimensional dynamic programming

2. Profile Hidden Markov Model (HMM)

3. Progressive pair-wise alignment


6
Optimal Alignment
• For a given group of sequences, there is no single "correct" alignment, only an alignment that is "optimal" according to some set of calculations.

• Determining what alignment is best for a given set of sequences is really up to the judgement of the investigator.

7
Pairwise vs Multiple Sequences

• Pairs of sequences are typically aligned using exact algorithms (dynamic programming)

• Multiple sequences are typically aligned using heuristic methods

8
• Sequence alignment is easy with sufficiently closely related sequences

• Below a certain level of identity, sequence alignment may become meaningless
– twilight zone for amino acid sequences: ~30% identity

• In the twilight zone it is good to make use of additional information if possible (e.g. structure)
9
evolution
• Biological sequences are not randomly sampled.
• If two sequences are similar, most probably they have an evolutionary relationship.
• When we have many similar sequences, we can try to guess their relationship.
• This guess can drive the search for an optimal multiple alignment.
10
Clustal W
• A heuristic program for multiple sequence alignment: not exact, but quite fast.

• The main paper on it [Thompson, Higgins, Gibson 1994] has well over 10k citations, making it one of the most cited papers in bioinformatics.

11
How Clustal works
• Exploit the fact that similar sequences are evolutionarily related.
• Build up a multiple alignment progressively by a series of pairwise alignments, following the branches of a guide tree.
• In brief: we first guess the evolutionary picture, then we generate the alignment according to it.
• Naturally, the alignment will suggest an evolutionary picture which might be different from the one we guessed first.

12
How Clustal works
1) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences;
2) a guide tree is calculated from the distance matrix (how? we’ll see);
3) the sequences are progressively aligned according to the branching order in the guide tree (i.e. starting from the closest pairs).

13
step 1: pairwise alignments
• Global pairwise alignment between every pair of sequences.

• If we have S sequences of length n, this costs S(S-1)/2 alignments, each taking O(n^2) time with full dynamic programming.
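A minimal sketch of this step, assuming a Needleman-Wunsch global alignment with a linear gap penalty and made-up scoring values. The function names and the example sequences are illustrative, and only the score (not the alignment itself) is computed.

```python
from itertools import combinations

def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score with a linear gap penalty (illustrative values)."""
    # prev holds the previous row of the DP table; row 0 is all gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def all_pair_scores(seqs):
    """Score every unordered pair of sequences: S(S-1)/2 alignments."""
    return {
        (i, j): global_alignment_score(seqs[i], seqs[j])
        for i, j in combinations(range(len(seqs)), 2)
    }

seqs = ["ACAATG", "TCAACTATC", "ACACAGC", "ACCGATC"]  # assumed toy input
scores = all_pair_scores(seqs)
print(len(scores))  # 6 pairs for S = 4
```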

14
step1b: distance matrix
• The score of the alignment between any two sequences is converted into a distance in [0, 1] (0 for identical sequences, 1 for sequences with no identical positions).

• The distances are stored in an SxS (symmetric) distance matrix.

15
distance matrix

         seq A   seq B   seq C   seq D
seq A    -       0.2     0.1     0.6
seq B            -       0.4     0.1
seq C                    -       0.6
seq D                            -
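One way such a matrix could be filled in, sketched under two assumptions: the distance is taken as 1 minus the fraction of identical positions, and the rows of a single existing alignment stand in for the separate pairwise alignments. The function names are hypothetical.

```python
def identity_distance(aligned_a, aligned_b):
    """1 - fraction of identical positions, ignoring columns with a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue
        compared += 1
        identical += (x == y)
    # With nothing to compare, treat the pair as maximally distant.
    return 1.0 if compared == 0 else 1.0 - identical / compared

def distance_matrix(rows):
    """Symmetric S x S matrix of identity distances."""
    size = len(rows)
    matrix = [[0.0] * size for _ in range(size)]
    for i in range(size):
        for j in range(i + 1, size):
            matrix[i][j] = matrix[j][i] = identity_distance(rows[i], rows[j])
    return matrix

# Rows a, b, c of the example alignment shown earlier in the slides.
rows = ["ACA---ATG", "TCAACTATC", "ACAC--AGC"]
for line in distance_matrix(rows):
    print(["%.2f" % d for d in line])
```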

16
Step 2: the guide tree

• branch lengths are proportional to the estimated divergence along each branch
• the estimated divergence is based on the distances in the distance matrix
• in ClustalW the tree is built using a method called Neighbour Joining (NJ).
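A sketch of only the first joining decision of Neighbour Joining, not the full tree construction. NJ picks the pair (i, j) minimising Q(i, j) = (n - 2) d(i, j) - sum_k d(i, k) - sum_k d(j, k). The five-taxon integer distance table below is a made-up example, not data from these slides.

```python
from itertools import combinations

def nj_pick_pair(labels, dist):
    """Return the pair that Neighbour Joining would join first."""
    n = len(labels)
    # Total distance from each taxon to all the others.
    totals = {
        a: sum(dist[frozenset((a, b))] for b in labels if b != a)
        for a in labels
    }

    def q(a, b):
        return (n - 2) * dist[frozenset((a, b))] - totals[a] - totals[b]

    return min(combinations(labels, 2), key=lambda pair: q(*pair))

# Hypothetical distances between five taxa (illustrative only).
labels = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(pair): value for pair, value in raw.items()}
print(nj_pick_pair(labels, dist))
```

Note that the joined pair is not simply the closest pair: d and e have the smallest distance in this table, yet a and b have the lower Q value because Q also accounts for how far each taxon is from everything else.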

17
step 2: the guide tree
[Figure: guide tree relating sequences A to D]
18
Progressive Pairwise Methods
• Most of the available multiple alignment programs use some sort of incremental or progressive method that makes pairwise alignments, then adds new sequences one at a time to these aligned groups.
• This is an approximate method!

19
Step 3: progressive alignment
• Align sequences gradually, starting from the closest ones on the tree. Each time sequences are aligned, we make a further hypothesis as to how evolution has worked.
• Every time an alignment is performed, the original sequences are replaced by their alignment.
• Along the way we align alignments instead of sequences. This is not a problem (we can align profiles against sequences, or profiles against profiles)
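To make "profile" concrete: a minimal sketch that turns an alignment into per-column symbol frequencies and scores a single residue against one column. The match/mismatch values, the treatment of gaps as ordinary symbols, and the function names are assumptions of this sketch, not ClustalW's actual scheme.

```python
from collections import Counter

def build_profile(alignment):
    """Per-column symbol frequencies (gaps counted as '-') for aligned rows."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    depth = len(alignment)
    return [
        {symbol: count / depth for symbol, count in Counter(column).items()}
        for column in zip(*alignment)
    ]

def column_score(profile_column, residue, match=1.0, mismatch=-1.0):
    """Frequency-weighted score of one residue against one profile column."""
    return sum(
        freq * (match if symbol == residue else mismatch)
        for symbol, freq in profile_column.items()
    )

profile = build_profile(["ACA", "ACC"])
print(profile[2])                     # column 3 is half A, half C
print(column_score(profile[2], "A"))  # average of a match and a mismatch
```

Scoring a residue against a column this way is what lets the same dynamic-programming recurrence align a sequence to an existing alignment.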

20
progressive alignment
[Figure sequence: the guide tree is collapsed step by step, merging A+C first, then (A+C) + B, then ((A+C) + B) + D]

26
ClustalW: a common heuristic multiple alignment program

Advantages of ClustalW:

1. Weights sequences according to evolutionary distance: sequences that are recently related through evolution are more likely to be similar in sequence because they haven’t had time to diverge

2. Uses different substitution matrices depending on sequence similarity

3. Uses affine gap penalties that are influenced by existing gaps in the multiple alignment

4. Guided by guide tree to choose the order of the sequences to align

5. Takes advantage of profile alignment rather than only doing pairwise alignments
27
Consensus Sequences
• Simplest Form:
A single sequence which represents the most common amino acid/base in each position

           Y   D   D   G   A   V   -   E   A   L
           Y   D   G   G   -   -   -   E   A   L
           F   E   G   G   I   L   V   E   A   L
           F   D   -   G   I   L   V   Q   A   V
           Y   E   G   G   A   V   V   Q   A   L
Consensus  Y   D   G   G   A/I V/L V   E   A   L
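The simplest form can be sketched as a per-column majority vote. Ignoring gaps when counting and reporting ties as `X/Y` are assumptions chosen to match the example above; the function name is illustrative.

```python
from collections import Counter

def consensus(alignment, gap="-"):
    """Most common non-gap symbol per column; ties are joined with '/'."""
    result = []
    for column in zip(*alignment):
        counts = Counter(symbol for symbol in column if symbol != gap)
        if not counts:
            result.append(gap)  # column made only of gaps
            continue
        top = max(counts.values())
        # Counter keeps first-seen order, so ties appear in row order.
        result.append("/".join(s for s, c in counts.items() if c == top))
    return result

rows = [
    "YDDGAV-EAL",
    "YDGG---EAL",
    "FEGGILVEAL",
    "FD-GILVQAV",
    "YEGGAVVQAL",
]
print(" ".join(consensus(rows)))  # Y D G G A/I V/L V E A L
```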

28
