Module Comparing and Visualizing Multiple Biological Sequences

From Pairwise to Multiple Alignment

• Up until now we have only tried to align two sequences.
• A faint (and statistically insignificant) similarity between two sequences becomes significant if it is present in many other sequences.
• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.

Bioinformatics Algorithms: An Active Learning Approach.


Copyright 2018 Compeau and Pevzner.
Alignment of Three A-domains

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA

-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS

Generalizing Pairwise to Multiple Alignment

• An alignment of 2 sequences is a 2-row matrix.
• An alignment of 3 sequences is a 3-row matrix:

A T - G C G -
A - C G T - A
A T C A C - A

• Our scoring function should score alignments with conserved columns higher.
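
One simple way to reward conserved columns is to count the columns in which every row carries the same letter. The slide does not fix a scoring function, so this rule and the name conserved_columns are assumptions made only for illustration:

```python
def conserved_columns(rows):
    """Count columns where every row holds the same non-gap symbol."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all alignment rows must have the same length")
    return sum(1 for col in zip(*rows) if col[0] != "-" and len(set(col)) == 1)

# The 3-row alignment from the slide, written without spaces.
alignment = ["AT-GCG-", "A-CGT-A", "ATCAC-A"]
print(conserved_columns(alignment))  # only the first column (A, A, A) is conserved
```

Under this rule, an alignment that lines up more identical letters in the same column gets a higher score.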

Alignments = Paths in 3-D

• Alignment of ATGC, AATC, and ATGC

The numbers above each row give the number of symbols used up to a given position:

0 1 1 2 3 4
  A - T G C

0 1 2 3 3 4
  A A T - C

0 0 1 2 3 4
  - A T G C

• Path: (0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)
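
The mapping from an alignment to a path can be sketched in a few lines: each column advances the coordinate of every row that holds a symbol rather than a gap. The function name alignment_to_path is a hypothetical helper, not part of the slides:

```python
def alignment_to_path(rows):
    """Return the grid nodes visited by an alignment, starting at the origin."""
    position = [0] * len(rows)
    path = [tuple(position)]
    for column in zip(*rows):
        for r, symbol in enumerate(column):
            if symbol != "-":
                position[r] += 1  # this row consumed one symbol
        path.append(tuple(position))
    return path

# Alignment of ATGC, AATC, and ATGC from the slide.
rows = ["A-TGC", "AAT-C", "-ATGC"]
print(alignment_to_path(rows))
```

For the slide's alignment this reproduces the path from (0,0,0) to (4,4,4).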

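2-D Alignment Cell versus 3-D Alignment Cell

[Figure: a 2-D alignment cell next to a 3-D alignment cell. In 3-D, node (i,j,k) has seven predecessors: (i-1,j-1,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j,k), (i,j-1,k), (i,j,k-1).]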

Multiple Alignment: Dynamic Programming

• δ(x, y, z) is an entry in the 3-D scoring matrix.
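
The recurrence itself did not survive in the slide text. A standard form is sketched below, with one term per predecessor of node (i,j,k). The symbols are assumptions introduced here: v, w, u are the three sequences, and s_{i,j,k} is the best score for aligning their prefixes of lengths i, j, k.

```latex
s_{i,j,k} = \max
\begin{cases}
s_{i-1,j-1,k-1} + \delta(v_i, w_j, u_k) \\
s_{i-1,j-1,k} + \delta(v_i, w_j, -) \\
s_{i-1,j,k-1} + \delta(v_i, -, u_k) \\
s_{i,j-1,k-1} + \delta(-, w_j, u_k) \\
s_{i-1,j,k} + \delta(v_i, -, -) \\
s_{i,j-1,k} + \delta(-, w_j, -) \\
s_{i,j,k-1} + \delta(-, -, u_k)
\end{cases}
```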

Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is proportional to 7n^3.
• For a k-way alignment, build a k-dimensional Gedi graph with
  – n^k nodes
  – most nodes have 2^k – 1 incoming edges.
  – Runtime: O(2^k n^k)

Calculate the runtime for aligning 10 sequences of length 100 each.
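
The exercise can be checked with a quick calculation. This is a minimal sketch that counts (2^k – 1) · n^k edge relaxations; the helper name edge_count is hypothetical:

```python
def edge_count(k, n):
    """Approximate number of edges in the k-dimensional alignment graph."""
    return (2 ** k - 1) * n ** k

print(edge_count(3, 100))             # 7 * 100**3 for three sequences
print(f"{edge_count(10, 100):.3e}")   # ten sequences of length 100
```

For k = 10 and n = 100 this is 1023 · 10^20, about 10^23 steps. Even at an assumed 10^9 steps per second, that is on the order of millions of years, which is why exact dynamic programming is impractical for more than a handful of sequences.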

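Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments:

AC-GCGG-C
AC-GC-GAG
GCCGC-GAG

Induced alignments (a column is dropped when both rows have a gap):

Rows 1 and 2    Rows 1 and 3    Rows 2 and 3
ACGCGG-C        AC-GCGG-C       AC-GCGAG
ACGC-GAG        GCCGC-GAG       GCCGCGAG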
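
Projecting a multiple alignment onto two of its rows can be sketched as follows: keep the two rows and drop every column in which both have a gap. The function name induced_pairwise is a hypothetical helper:

```python
from itertools import combinations

def induced_pairwise(rows, a, b):
    """Project a multiple alignment onto rows a and b, dropping all-gap columns."""
    kept = [(x, y) for x, y in zip(rows[a], rows[b]) if not (x == "-" and y == "-")]
    return "".join(x for x, _ in kept), "".join(y for _, y in kept)

msa = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
for a, b in combinations(range(len(msa)), 2):
    print(a + 1, b + 1, induced_pairwise(msa, a, b))
```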

Idea: Construct Multiple from Pairwise Alignments

Given a set of arbitrary pairwise alignments, can we construct a multiple alignment that induces them?

AAAATTTT---- ----AAAATTTT TTTTGGGG----
----TTTTGGGG GGGGAAAA---- ----GGGGAAAA

Profile Representation of Multiple Alignment

    -   A   G   G   C   T   A   T   C   A   C   C   T   G
    T   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   C   C   A   -   -   -   G
    C   A   G   -   C   T   A   T   C   A   C   -   G   G
    C   A   G   -   C   T   A   T   C   G   C   -   G   G

A   0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C  .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G   0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T  .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
-  .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0
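
The profile is just the per-column frequency of each symbol, including the gap. A minimal sketch follows; the function name profile and the dictionary layout are illustrative choices:

```python
def profile(rows, alphabet="ACGT-"):
    """Return {symbol: [frequency in each column]} for an alignment."""
    n = len(rows)
    columns = list(zip(*rows))
    return {s: [col.count(s) / n for col in columns] for s in alphabet}

# The five aligned rows from the slide, written without spaces.
msa = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
for symbol, freqs in profile(msa).items():
    print(symbol, " ".join(f"{f:.1f}" for f in freqs))
```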

Aligning Profile Against Profile

• In the past we were aligning a sequence against a sequence.
  – Can we align a sequence against a profile?
  – Can we align a profile against a profile?
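
One common way to align a sequence against a profile is to replace the symbol-versus-symbol score with an expected score against a profile column, and then run the usual pairwise dynamic programming. The slides do not specify a scoring scheme, so the +1/-1 values and the name column_score below are assumptions made only for illustration:

```python
def column_score(symbol, column, match=1, mismatch=-1):
    """Expected score of `symbol` against one profile column.

    `column` maps each symbol (including '-') to its frequency in that column.
    """
    return sum(freq * (match if other == symbol else mismatch)
               for other, freq in column.items())

# Column 8 of the profile above: C with frequency .4, T with frequency .6.
col8 = {"A": 0.0, "C": 0.4, "G": 0.0, "T": 0.6, "-": 0.0}
print(column_score("T", col8))  # positive: T is the majority symbol
print(column_score("C", col8))  # negative: C is the minority symbol
```

A profile-against-profile score can be built the same way by averaging over the symbols of both columns.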
Approximate methods to perform MSA

Approximate methods:
* Build the alignment from small conserved regions, or from statistical or probabilistic models, which makes them fast.
* Give a sub-optimal alignment that is close to the best, in reasonable time.
* Most popular methods:
  – Progressive (greedy approach)
  – Iterative

Multiple Alignment: Greedy Approach
• Choose the most similar sequences and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k – 2 sequences and 1 profile.
• Iterate
• Used by ClustalW, T-COFFEE

Greedy Approach: Example
• Sequences: GATTCA (s1), GTCTGA (s2), GATATT (s3), GTCAGC (s4).
• 6 pairwise alignments (premium for match +1, penalties for indels and mismatches -1):

s2 GTCTGA                   s1 GATTCA--
s4 GTCAGC (score = 2)       s4 G-T-CAGC (score = 0)

s1 GAT-TCA                  s2 G-TCTGA
s2 G-TCTGA (score = 1)      s3 GATAT-T (score = -1)

s1 GAT-TCA                  s3 GAT-ATT
s3 GATAT-T (score = 1)      s4 G-TCAGC (score = -1)
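
The six scores can be recomputed directly from the aligned rows. This sketch only scores the alignments as given (it does not search for optimal alignments), and it reads the s4 row of the s1/s4 alignment as G-T-CAGC, with a single gap after the G:

```python
def score(a, b, match=1, penalty=-1):
    """Score two aligned rows: +1 per match, -1 per mismatch or indel."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    return sum(match if x == y and x != "-" else penalty for x, y in zip(a, b))

alignments = {
    ("s2", "s4"): ("GTCTGA", "GTCAGC"),
    ("s1", "s4"): ("GATTCA--", "G-T-CAGC"),
    ("s1", "s2"): ("GAT-TCA", "G-TCTGA"),
    ("s2", "s3"): ("G-TCTGA", "GATAT-T"),
    ("s1", "s3"): ("GAT-TCA", "GATAT-T"),
    ("s3", "s4"): ("GAT-ATT", "G-TCAGC"),
}
scores = {pair: score(*rows) for pair, rows in alignments.items()}
closest = max(scores, key=scores.get)  # the pair the greedy step merges first
print(scores)
print(closest)
```

The highest-scoring pair is the one the greedy approach consolidates into a profile first.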
Greedy Approach: Example
• Since s2 and s4 are closest, we consolidate them into a profile:
s2     GTCTGA
s4     GTCAGC
s2,4 = GTCt/aGa/c
• New set of 3 sequences to align:
s1     GATTCA
s3     GATATT
s2,4   GTCt/aGa/c

What will be the next steps?


Approximate algorithms: Iterative alignment

ALGORITHM:

1. Obtain a draft progressive alignment
2. Improve the alignment using a Kimura distance matrix
3. Refine the alignment

Approximate algorithms: Refine Iterative alignment

REFINEMENT:

1. Choose a sequence with gaps
2. Move gaps within the aligned sequence to obtain more matches
3. Go to step 1 until all sequences with gaps have been realigned
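
The refinement loop can be sketched as a greedy search. The slides do not say how gaps are moved or how an alignment is scored, so everything here is an assumption for illustration: a sum-of-pairs score with +1 for a match and -1 for a mismatch or indel, and a move that swaps a gap with a neighbouring symbol. The names sp_score and refine_row are hypothetical.

```python
def sp_score(rows, match=1, penalty=-1):
    """Sum-of-pairs score; a pair of gaps in the same column scores 0."""
    total = 0
    for col in zip(*rows):
        for i in range(len(col)):
            for j in range(i + 1, len(col)):
                x, y = col[i], col[j]
                if x == "-" and y == "-":
                    continue
                total += match if x == y else penalty
    return total

def refine_row(rows, r):
    """Shift gaps in row r one position at a time while the score improves."""
    rows = list(rows)
    best = sp_score(rows)
    improved = True
    while improved:
        improved = False
        row = rows[r]
        for i in range(len(row) - 1):
            # Only consider positions where a gap sits next to a symbol.
            if (row[i] == "-") != (row[i + 1] == "-"):
                swapped = row[:i] + row[i + 1] + row[i] + row[i + 2:]
                candidate = rows[:r] + [swapped] + rows[r + 1:]
                s = sp_score(candidate)
                if s > best:  # accept only strict improvements, so the loop ends
                    rows, best, improved = candidate, s, True
                    break
    return rows

draft = ["GATTACA", "GATTACA", "G-ATACA"]
refined = refine_row(draft, 2)
print(refined[2], sp_score(draft), sp_score(refined))
```

Step 3 of the refinement corresponds to calling refine_row for each row that contains gaps.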
What do these alignments look like in practice?

Similarity (amino acids with similar properties) > Identity (same amino acid)

Sequence conservation implies function


Why do we need multiple sequence alignments?

An MSA of DNA or protein sequences can yield more information than analysis of a single sequence, such as:

● Which part of the sequence is shared between different organisms?
  It is known that the selective pressure of evolution results from the need to conserve function.

● When dealing with a new protein of unknown function, the presence of several domains (functional parts of a protein) similar to domains in other “known” protein sequences can imply a similar structure or function.
  In proteins, maintaining function generally requires a specific 3D structure, so an MSA can give information about protein structure.
Other uses of MSA

● Predicting the structure of a new protein
● Building phylogenetic trees (depicting evolutionary relationships)

How do we make phylogenetic trees?

Next module
Revisiting Darwin’s experiments using MSA
