Module Comparing and Visualizing Multiple Biological Sequences

From Pairwise to Multiple Alignment

• Up until now we have only tried to align two sequences.
• A faint (and statistically insignificant) similarity between two sequences becomes significant if it is present in many other sequences.
• Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal.

Bioinformatics Algorithms: An Active Learning Approach. Copyright 2018 Compeau and Pevzner.

Alignment of Three A-domains

YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA

-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS

IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS

Generalizing Pairwise to Multiple Alignment

• Alignment of 2 sequences is a 2-row matrix.
• Alignment of 3 sequences is a 3-row matrix:

A T - G C G -
A - C G T - A
A T C A C - A

• Our scoring function should score alignments with conserved columns higher.

Alignments = Paths in 3-D

• Alignment of ATGC, AATC, and ATGC. The numbers above each row count the symbols used up to a given position:

0  1  1  2  3  4
   A  -  T  G  C

0  1  2  3  3  4
   A  A  T  -  C

0  0  1  2  3  4
   -  A  T  G  C

• Reading the three counters column by column gives a path in a 3-D grid:

(0,0,0) → (1,1,0) → (1,2,1) → (2,3,2) → (3,3,3) → (4,4,4)

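The correspondence between an alignment and a path can be checked with a few lines of Python. This is a minimal sketch: the function name `alignment_to_path` is an illustrative choice, and gaps are assumed to be written as a single `-`.

```python
def alignment_to_path(rows):
    """Return the grid path (one node per column boundary) of an alignment."""
    counts = [0] * len(rows)          # symbols consumed so far in each row
    path = [tuple(counts)]            # every path starts at the origin
    for column in zip(*rows):         # walk the alignment column by column
        for r, symbol in enumerate(column):
            if symbol != "-":         # a gap consumes no symbol
                counts[r] += 1
        path.append(tuple(counts))
    return path


path = alignment_to_path(["A-TGC", "AAT-C", "-ATGC"])
print(" -> ".join(str(node) for node in path))
```

Each step of the path moves by 0 or 1 along every axis, and never by 0 along all three, which is why a node in the 3-D grid has up to seven predecessors.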
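2-D Alignment Cell versus 3-D Alignment Cell
[Figure: a 2-D alignment cell (a square) beside a 3-D alignment cell (a cube). The corners of the cube are (i-1,j-1,k-1), (i-1,j,k-1), (i-1,j-1,k), (i-1,j,k), (i,j-1,k-1), (i,j,k-1), (i,j-1,k), and (i,j,k).]
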
Multiple Alignment: Dynamic Programming

• δ(x, y, z) is an entry in the 3-D scoring matrix.
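
The recurrence itself does not appear in the text of this slide, so the following is only a sketch of the usual 3-D dynamic program: each node (i, j, k) takes the best of its seven predecessors plus the score δ of the column that the move creates. The particular `delta` used here (1 for a fully conserved column, 0 otherwise) is an assumption chosen to keep the example small; any 3-D scoring matrix could be substituted.

```python
from itertools import product


def delta(x, y, z):
    """Assumed column score: 1 if all three symbols agree (and are not gaps)."""
    return 1 if x == y == z != "-" else 0


def three_way_score(u, v, w, score=delta):
    """Best score of a global alignment of u, v and w under `score`."""
    n, m, l = len(u), len(v), len(w)
    neg_inf = float("-inf")
    s = [[[neg_inf] * (l + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    s[0][0][0] = 0
    # The 7 moves into a cell: every 0/1 step vector except (0, 0, 0).
    moves = [d for d in product((0, 1), repeat=3) if any(d)]
    # Lexicographic order guarantees predecessors are filled in first.
    for i, j, k in product(range(n + 1), range(m + 1), range(l + 1)):
        for di, dj, dk in moves:
            pi, pj, pk = i - di, j - dj, k - dk
            if pi < 0 or pj < 0 or pk < 0:
                continue
            # A step of 1 along an axis emits a symbol, a step of 0 a gap.
            x = u[pi] if di else "-"
            y = v[pj] if dj else "-"
            z = w[pk] if dk else "-"
            candidate = s[pi][pj][pk] + score(x, y, z)
            if candidate > s[i][j][k]:
                s[i][j][k] = candidate
    return s[n][m][l]


print(three_way_score("ATGC", "AATC", "ATGC"))
```

With this assumed δ the score is the number of fully conserved columns, so the result equals the length of the longest subsequence common to all three strings.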

Multiple Alignment: Running Time
• For 3 sequences of length n, the run time is proportional to 7n^3.
• For a k-way alignment, build a k-dimensional Gedi graph with
– n^k nodes
– most nodes have 2^k – 1 incoming edges.
– Runtime: O(2^k n^k)

Calculate the runtime for aligning 10 sequences of length 100 each.
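
The exercise can be worked through numerically. The sketch below simply multiplies the node count n^k by the 2^k – 1 incoming edges per node; the function name and the rate of 10^9 edge evaluations per second are assumptions made for illustration.

```python
def msa_dp_operations(k, n):
    """Approximate number of edge evaluations for a k-way alignment."""
    return (2 ** k - 1) * n ** k


ops = msa_dp_operations(10, 100)
seconds = ops / 1e9                      # assumed: 1e9 evaluations per second
years = seconds / (3600 * 24 * 365)
print(f"{ops:.3e} operations, about {years:.1e} years")
```

For k = 3 the same formula gives the 7n^3 from the first bullet. For k = 10 and n = 100 it gives roughly 10^23 operations, which is why exact dynamic programming is impractical beyond a handful of sequences.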

Multiple Alignment Induces Pairwise Alignments

Every multiple alignment induces pairwise alignments:

AC-GCGG-C
AC-GC-GAG
GCCGC-GAG

ACGCGG-C    AC-GCGG-C    AC-GCGAG
ACGC-GAG    GCCGC-GAG    GCCGCGAG
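
A pairwise alignment is induced by keeping two rows of the multiple alignment and dropping every column in which both rows have a gap. A minimal sketch, with an illustrative function name and rows indexed from 0:

```python
from itertools import combinations


def induced_pairwise(rows):
    """Map each pair of row indices to the pairwise alignment it induces."""
    result = {}
    for a, b in combinations(range(len(rows)), 2):
        # Keep only the columns where at least one of the two rows has a symbol.
        kept = [(x, y) for x, y in zip(rows[a], rows[b]) if (x, y) != ("-", "-")]
        top = "".join(x for x, _ in kept)
        bottom = "".join(y for _, y in kept)
        result[(a, b)] = (top, bottom)
    return result


for pair, (top, bottom) in induced_pairwise(
    ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
).items():
    print(pair, top, bottom)
```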

Idea: Construct Multiple from Pairwise Alignments

Given a set of arbitrary pairwise alignments, can we construct a multiple alignment that induces them?

AAAATTTT----    ----AAAATTTT    TTTTGGGG----
----TTTTGGGG    GGGGAAAA----    ----GGGGAAAA

Profile Representation of Multiple Alignment

     -   A   G   G   C   T   A   T   C   A   C   C   T   G
     T   A   G   -   C   T   A   C   C   A   -   -   -   G
     C   A   G   -   C   T   A   C   C   A   -   -   -   G
     C   A   G   -   C   T   A   T   C   A   C   -   G   G
     C   A   G   -   C   T   A   T   C   G   C   -   G   G

A    0   1   0   0   0   0   1   0   0  .8   0   0   0   0
C   .6   0   0   0   1   0   0  .4   1   0  .6  .2   0   0
G    0   0   1  .2   0   0   0   0   0  .2   0   0  .4   1
T   .2   0   0   0   0   1   0  .6   0   0   0   0  .2   0
-   .2   0   0  .8   0   0   0   0   0   0  .4  .8  .4   0
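
Each profile entry is the fraction of rows that carry a given symbol in a given column. The sketch below recomputes the table from the five aligned rows; the function name `profile` and the symbol order `ACGT-` are illustrative choices.

```python
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """Column-wise symbol frequencies of a multiple alignment."""
    n = len(rows)
    columns = [Counter(column) for column in zip(*rows)]
    # A Counter returns 0 for symbols that never occur in a column.
    return {symbol: [c[symbol] / n for c in columns] for symbol in alphabet}


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
for symbol, freqs in profile(rows).items():
    print(symbol, " ".join(f"{f:.1f}" for f in freqs))
```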

Aligning Profile Against Profile

• In the past we were aligning a sequence against a sequence.
– Can we align a sequence against a profile?
– Can we align a profile against a profile?
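
One possible way to align a sequence against a profile is to reuse the pairwise dynamic program and score a symbol against a column by its frequency-weighted score. The sketch below is not taken from the slides: the scoring values (+1 match, -1 mismatch or indel, 0 for gap against gap) and all function names are assumptions.

```python
def delta(a, b):
    """Assumed symbol score: +1 match, -1 mismatch or indel, 0 gap vs gap."""
    if a == "-" and b == "-":
        return 0
    return 1 if a == b else -1


def build_profile(rows):
    """List of columns, each a dict mapping symbol -> frequency."""
    n = len(rows)
    return [{s: col.count(s) / n for s in set(col)} for col in zip(*rows)]


def column_score(symbol, column):
    """Frequency-weighted score of one symbol against one profile column."""
    return sum(freq * delta(symbol, other) for other, freq in column.items())


def align_to_profile(seq, prof):
    """Best global score of `seq` against the profile `prof`."""
    n, m = len(seq), len(prof)
    s = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] - 1          # symbol against a new all-gap column
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + column_score("-", prof[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(
                s[i - 1][j - 1] + column_score(seq[i - 1], prof[j - 1]),
                s[i - 1][j] - 1,
                s[i][j - 1] + column_score("-", prof[j - 1]),
            )
    return s[n][m]


prof = build_profile(["GTCTGA", "GTCAGC"])
print(align_to_profile("GTCTGA", prof))
```

Aligning a profile against a profile follows the same pattern, with the column score averaged over the symbols of both columns.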
Approximate methods to perform MSA
Approximate methods:
* Build the alignment from small conserved regions, or from statistical or probabilistic models, which makes them fast.

* Give a sub-optimal alignment that is close to the best one in reasonable time.

* Most popular methods:
– Progressive (greedy approach)
– Iterative
Multiple Alignment: Greedy Approach
• Choose the most similar sequences and combine them into a profile, thereby reducing alignment of k sequences to an alignment of k – 2 sequences and 1 profile.
• Iterate.
• Used by ClustalW, T-COFFEE

Greedy Approach: Example
• Sequences: s1 = GATTCA, s2 = GTCTGA, s3 = GATATT, s4 = GTCAGC.

• 6 pairwise alignments (premium for match +1, penalties for indels and mismatches -1):

s2 GTCTGA
s4 GTCAGC (score = 2)

s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s1 GATTCA--
s4 G-T-CAGC (score = 0)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
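
The six scores can be reproduced with a standard global alignment under the stated scoring (+1 match, -1 mismatch, -1 indel). This is a minimal sketch that keeps only two rows of the dynamic programming table; the function and variable names are illustrative.

```python
from itertools import combinations


def global_score(a, b, match=1, mismatch=-1, indel=-1):
    """Optimal global alignment score of a and b."""
    prev = [j * indel for j in range(len(b) + 1)]   # row for the empty prefix of a
    for i in range(1, len(a) + 1):
        cur = [i * indel]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + indel, cur[j - 1] + indel))
        prev = cur
    return prev[-1]


seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {
    (p, q): global_score(seqs[p], seqs[q]) for p, q in combinations(seqs, 2)
}
closest = max(scores, key=scores.get)   # the pair the greedy method merges first
print(scores)
print("closest pair:", closest)
```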
Greedy Approach: Example
• Since s2 and s4 are closest, we consolidate them into a profile:
s2 GTCTGA
s4 GTCAGC
s2,4 = GTCt/aGa/c
• New set of 3 sequences to align:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
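
The consolidated string is a compact way of writing a two-row profile: a column where both sequences agree keeps its letter, and a column where they differ lists both options in lower case. A small sketch of that notation, assuming the two sequences are already aligned to the same length (the function name is illustrative):

```python
def consolidate(a, b):
    """Write two aligned sequences as one string, e.g. 't/a' where they differ."""
    parts = []
    for x, y in zip(a, b):
        if x == y:
            parts.append(x)                          # conserved column
        else:
            parts.append(f"{x.lower()}/{y.lower()}") # both options, lower case
    return "".join(parts)


print(consolidate("GTCTGA", "GTCAGC"))
```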

What will be the next steps?


Approximate algorithms: Iterative alignment

ALGORITHM:

1. Obtain a draft progressive alignment.
2. Improve the alignment using the Kimura distance matrix.
3. Refine the alignment.
Approximate algorithms: Refine Iterative alignment

REFINEMENT:

1. Choose a sequence with gaps.
2. Move gaps within the aligned sequence to obtain more matches.
3. Go to step 1 until all sequences with gaps have been realigned.
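
The refinement loop can be sketched as a simple hill climb. The slides do not say how gaps are moved or how matches are counted, so everything below is an assumption: a gap is swapped with a neighbouring residue, and the move is kept only if the row then matches the other rows in more positions.

```python
def count_matches(rows, idx):
    """Positions where row `idx` carries the same residue as another row."""
    target = rows[idx]
    return sum(
        1
        for j, other in enumerate(rows)
        if j != idx
        for a, b in zip(target, other)
        if a == b and a != "-"
    )


def refine_row(rows, idx):
    """Shift gaps in row `idx` one position at a time while matches increase."""
    rows = list(rows)
    improved = True
    while improved:
        improved = False
        best = count_matches(rows, idx)
        row = rows[idx]
        for p, ch in enumerate(row):
            if ch != "-":
                continue
            for q in (p - 1, p + 1):                 # try both neighbours
                if 0 <= q < len(row) and row[q] != "-":
                    cand = list(row)
                    cand[p], cand[q] = cand[q], cand[p]
                    trial = rows[:idx] + ["".join(cand)] + rows[idx + 1:]
                    if count_matches(trial, idx) > best:
                        rows, improved = trial, True
                        break
            if improved:
                break                                # rescan the updated row
    return rows


def refine(rows):
    """Apply refine_row once to every row that contains a gap."""
    for idx in range(len(rows)):
        if "-" in rows[idx]:
            rows = refine_row(rows, idx)
    return rows


print(refine(["ACGTA", "ACGTA", "A-CGT"]))
```

Swapping a gap with an adjacent residue never changes the order of the residues, so the underlying sequence is preserved. Because every accepted move strictly increases the match count, the loop terminates, although it may stop at a local optimum.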
What do these alignments look like in practice?

Similar (amino acids with similar properties) > Identity (same amino acid)

Sequence conservation implies function


Why do we need multiple sequence alignments?

An MSA of DNA or protein sequences can yield more information than analysis of a single sequence, such as:

● Which part of the sequence is shared between different organisms?
It is known that the selective pressure of evolution results from the need to conserve function.

● When dealing with a new protein of unknown function, the presence of several domains (functional parts of a protein) similar to domains in other “known” protein sequences can imply a similar structure or function.
In proteins, maintaining function generally requires a specific 3D structure, so an MSA can give information about protein structure.
Other uses of MSA

● Predicting the structure of a new protein
● Building phylogenetic trees (depicting evolutionary relationships)

How do we make phylogenetic trees? Next module.
Revisiting Darwin’s experiments using MSA
