Module Comparing and Visualizing Multiple Biological Sequences
YAFDLGYTCMFPVLLGGGELHIVQKETYTAPDEIAHYIKEHGITYIKLTPSLFHTIVNTASFAFDANFESLRLIVLGGEKIIPIDVIAFRKMYGHTE-FINHYGPTEATIGA
-AFDVSAGDFARALLTGGQLIVCPNEVKMDPASLYAIIKKYDITIFEATPALVIPLMEYI-YEQKLDISQLQILIVGSDSCSMEDFKTLVSRFGSTIRIVNSYGVTEACIDS
IAFDASSWEIYAPLLNGGTVVCIDYYTTIDIKALEAVFKQHHIRGAMLPPALLKQCLVSA----PTMISSLEILFAAGDRLSSQDAILARRAVGSGV-Y-NAYGPTENTVLS
A T - G C G -
A - C G T - A
A T C A C - A
A - T G C
A A T - C
- A T G C
Alignment path (number of symbols of each row consumed after each column):
0 1 1 2 3 4    A - T G C
0 1 2 3 3 4    A A T - C
0 0 1 2 3 4    - A T G C
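The path coordinates above can be recomputed from the alignment rows. The following is a minimal sketch, assuming gaps are written as a single "-" and that all rows have equal length; the function name `alignment_path` is hypothetical and not taken from the slides.

```python
def alignment_path(rows):
    """Return the list of grid points visited by a multiple alignment.

    Each point holds, per row, how many non-gap symbols have been
    consumed after reading the columns so far.
    """
    counts = [0] * len(rows)
    path = [tuple(counts)]          # every path starts at the origin
    for column in zip(*rows):       # walk the alignment column by column
        for r, symbol in enumerate(column):
            if symbol != "-":       # a gap does not advance that sequence
                counts[r] += 1
        path.append(tuple(counts))
    return path


rows = ["A-TGC", "AAT-C", "-ATGC"]
path = alignment_path(rows)
for point in path:
    print(point)
```

Reading the first coordinate of every point gives 0 1 1 2 3 4, the second gives 0 1 2 3 3 4, and the third gives 0 0 1 2 3 4, matching the three rows of numbers in the example.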
[Figure: cell (i,j,k) of the 3-D alignment grid and its neighboring cells (i-1,j-1,k), (i-1,j,k), (i,j-1,k), (i,j-1,k-1), (i,j,k-1)]
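In a three-sequence alignment grid, a cell (i,j,k) can be reached from up to seven predecessor cells: one for every non-empty subset of the three sequences that advances in the last column. The sketch below enumerates them; the function name `predecessors` is an illustrative assumption, and cells with a negative index are skipped because they lie outside the grid.

```python
from itertools import product


def predecessors(i, j, k):
    """List the cells from which (i, j, k) can be reached in one step."""
    cell = (i, j, k)
    result = []
    # Each step says, per sequence, whether it advances (1) or gets a gap (0).
    for step in product((0, 1), repeat=3):
        if not any(step):
            continue                # the all-gap column is not allowed
        pred = tuple(c - s for c, s in zip(cell, step))
        if min(pred) >= 0:          # stay inside the grid
            result.append(pred)
    return result


print(predecessors(1, 1, 1))
print(len(predecessors(1, 1, 1)))
```

For k sequences the same enumeration yields up to 2^k - 1 predecessors, which is why exact dynamic programming becomes expensive as the number of sequences grows.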
- A G G C T A T C A C C T G
T A G - C T A C C A - - - G
C A G - C T A C C A - - - G
C A G - C T A T C A C - G G
C A G - C T A T C G C - G G
A 0 1 0 0 0 0 1 0 0 .8 0 0 0 0
C .6 0 0 0 1 0 0 .4 1 0 .6 .2 0 0
G 0 0 1 .2 0 0 0 0 0 .2 0 0 .4 1
T .2 0 0 0 0 1 0 .6 0 0 0 0 .2 0
- .2 0 0 .8 0 0 0 0 0 0 .4 .8 .4 0
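The profile matrix above holds, for each column of the alignment, the fraction of rows showing each symbol. A minimal sketch of that computation follows, assuming the five rows are written without spaces and with "-" for gaps; the function name `profile` is hypothetical.

```python
from collections import Counter


def profile(rows, alphabet="ACGT-"):
    """Return {symbol: [frequency in column 1, column 2, ...]}."""
    n = len(rows)
    columns = [Counter(col) for col in zip(*rows)]   # symbol counts per column
    return {sym: [col[sym] / n for col in columns] for sym in alphabet}


rows = [
    "-AGGCTATCACCTG",
    "TAG-CTACCA---G",
    "CAG-CTACCA---G",
    "CAG-CTATCAC-GG",
    "CAG-CTATCGC-GG",
]
prof = profile(rows)
for sym in "ACGT-":
    print(sym, " ".join(f"{x:.1f}" for x in prof[sym]))
```

Every column of the resulting matrix sums to 1, because each row contributes exactly one symbol (a nucleotide or a gap) to each column.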
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 0 1 0 0 0 0 1 0 0 .8 0 0 0 0
C .6 0 0 0 1 0 0 .4 1 0 .6 .2 0 0
G 0 0 1 .2 0 0 0 0 0 .2 0 0 .4 1
T .2 0 0 0 0 1 0 .6 0 0 0 0 .2 0
- .2 0 0 .8 0 0 0 0 0 0 .4 .8 .4 0
- A G G C T A T C A C C T G
T A G – C T A C C A - - - G
C A G – C T A C C A - - - G
C A G – C T A T C A C – G G
C A G – C T A T C G C – G G
A 0 1 0 0 0 0 1 0 0 .8 0 0 0 0
C .6 0 0 0 1 0 0 .4 1 0 .6 .2 0 0
G 0 0 1 .2 0 0 0 0 0 .2 0 0 .4 1
T .2 0 0 0 0 1 0 .6 0 0 0 0 .2 0
- .2 0 0 .8 0 0 0 0 0 0 .4 .8 .4 0
A 0 1 0 0 0 0 1 0 0 .8 0 0 0 0
C .6 0 0 0 1 0 0 .4 1 0 .6 .2 0 0
G 0 0 1 .2 0 0 0 0 0 .2 0 0 .4 1
T .2 0 0 0 0 1 0 .6 0 0 0 0 .2 0
- .2 0 0 .8 0 0 0 0 0 0 .4 .8 .4 0
s1 GAT-TCA
s2 G-TCTGA (score = 1)

s1 GAT-TCA
s3 GATAT-T (score = 1)

s2 G-TCTGA
s3 GATAT-T (score = -1)

s3 GAT-ATT
s4 G-TCAGC (score = -1)
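The slides do not state the scoring scheme. The scores of the pairwise alignments shown above are consistent with +1 for a match and -1 for a mismatch or a gap, so the sketch below assumes those values. The function name `score_alignment` is hypothetical.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=-1):
    """Score a given pairwise alignment column by column.

    The default parameter values are an assumption inferred from the
    example scores, not a scheme stated in the slides.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            total += gap
        elif x == y:
            total += match
        else:
            total += mismatch
    return total


print(score_alignment("GAT-TCA", "G-TCTGA"))   # s1 vs s2
print(score_alignment("G-TCTGA", "GATAT-T"))   # s2 vs s3
print(score_alignment("GAT-TCA", "GATAT-T"))   # s1 vs s3
print(score_alignment("GAT-ATT", "G-TCAGC"))   # s3 vs s4
```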
Bioinformatics Algorithms: An Active Learning Approach.
Copyright 2018 Compeau and Pevzner.
Greedy Approach: Example
• Since s2 and s4 are closest, we consolidate them
into a profile:
s2 GTCTGA
s4 GTCAGC
s2,4 = GTCt/aGa/c
• New set of 3 sequences to align:
s1 GATTCA
s3 GATATT
s2,4 GTCt/aGa/c
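One greedy step can be sketched as follows: score every pair with a global alignment, pick the highest-scoring pair, and merge it into a profile. This is a sketch under stated assumptions, not the course's implementation. The +1/-1/-1 scoring is assumed, the names `nw_score` and `merge_ungapped` are hypothetical, and the merge only handles the ungapped, equal-length case that applies to s2 and s4 here.

```python
from itertools import combinations


def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal global alignment score (Needleman-Wunsch, score only)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def merge_ungapped(a, b):
    """Column-wise merge of two equal-length sequences aligned without gaps."""
    return "".join(x if x == y else f"{x.lower()}/{y.lower()}"
                   for x, y in zip(a, b))


seqs = {"s1": "GATTCA", "s2": "GTCTGA", "s3": "GATATT", "s4": "GTCAGC"}
scores = {pair: nw_score(seqs[pair[0]], seqs[pair[1]])
          for pair in combinations(seqs, 2)}
closest = max(scores, key=scores.get)        # highest-scoring pair
merged = merge_ungapped(seqs[closest[0]], seqs[closest[1]])
print(closest, scores[closest], merged)
```

After the merge, the procedure repeats on the reduced set (s1, s3 and the s2,4 profile), which requires aligning a sequence against a profile rather than against another sequence; that step is not shown here.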
ALGORITHM:
REFINEMENT:
Why do we need multiple sequence alignments?
An MSA of DNA or protein sequences can yield more information than the analysis
of a single sequence. For example:
● When dealing with a new protein of unknown function, the presence of several
domains (functional parts of a protein) similar to domains in other, known
protein sequences can imply a similar structure or function.
● In proteins, maintaining function generally requires a specific 3D
structure; an MSA can give information about protein structure.
Other uses of MSA
Next module
Revisiting Darwin’s experiments using MSA