PCB Lect02 Pairwise Allign
C Mutation?
[Figure: an example of a dot plot where the relation between sequences is not obvious. Two panels, both with Vpa Chr I on the vertical axis; the horizontal axes are Vvu Chr I (left panel) and Vch Chr I (right panel).]
OWEN: aligning long collinear regions of genomes
OWEN is an interactive tool for aligning two long DNA sequences that represents similarity
between them by a chain of collinear local similarities. OWEN employs several methods for
constructing and editing local similarities and for resolving conflicts between them.
Sequence alignment
• Write one sequence along the other so as to expose
any similarity between the sequences. Each element of
a sequence is placed either alongside the corresponding
element in the other sequence or alongside a special
“gap” character.
• Example: TGKGI and AGKVGL can be aligned as
TGK-GI
AGKVGL
• Is there a better alignment? How can we compare the
“goodness” of two alignments?
• We need to have:
– A way of scoring an alignment
– A way of computing a maximum-score alignment.
Identity score
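Let (x,y) be an aligned pair of elements of two
sequences (at least one of x,y must not be a gap).
$$\mathrm{id}(x,y) = \begin{cases} 1 & \text{if } x = y \\ 0 & \text{if } x \neq y \end{cases}$$
Examples (in the gapped examples, the sums charge each gap -2 for opening plus -1 per gap character):

ATCG
ATTG
1+1+0+1 = 3

AT-CG
ATT-G
1+1-2-1-2-1+1 = -3

ATC--TA
ATTTTTA
1+1+0-2-1-1+1+1 = 0

AT-C-TA
ATTTTTA
1+1-2-1+0-2-1+1+1 = -2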
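The sums above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the function names `identity` and `score_alignment` are made up for illustration, and the gap cost (2 to open, 1 per gap character) is an assumption read off the worked sums.

```python
def identity(x: str, y: str) -> int:
    """Identity score of an aligned pair of non-gap elements."""
    return 1 if x == y else 0


def score_alignment(top: str, bottom: str, gap_open: int = 2, gap_extend: int = 1) -> int:
    """Score a given alignment column by column.

    Assumption (inferred from the worked sums): a run of k gap characters
    in one row costs gap_open + k * gap_extend.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    total = 0
    prev_gap_row = None  # row that held the gap in the previous column, if any
    for x, y in zip(top, bottom):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        if x == "-" or y == "-":
            gap_row = "top" if x == "-" else "bottom"
            if gap_row != prev_gap_row:
                total -= gap_open  # a new gap starts in this column
            total -= gap_extend
            prev_gap_row = gap_row
        else:
            total += identity(x, y)
            prev_gap_row = None
    return total


for top, bottom in [("ATCG", "ATTG"), ("AT-CG", "ATT-G"),
                    ("ATC--TA", "ATTTTTA"), ("AT-C-TA", "ATTTTTA")]:
    print(top, bottom, score_alignment(top, bottom))  # scores: 3, -3, 0, -2
```

Keeping the pair score in its own function makes it easy to swap the identity score for a scoring matrix later.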
Problems with identity score
• In the two pairs of aligned sequences below there are
mutations at the 1st and 6th positions and an insertion (or
deletion) at the 4th position. However, while V and A share
significant biophysical similarity and mutations between them
are often observed, W and A do not often substitute
for one another. The identity score cannot tell these two cases apart.
VGK-GI…    WGK-GI…
AGKVGL…    AGKVGL…
• What if I (isoleucine) mutated to V and then back to I? Should this have
the same score as when I was unchanged? If we would like to
use the score to estimate evolutionary distances, it would
be wrong to consider the two cases identical.
Scoring Matrices
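An amino-acid scoring matrix is a 20x20 table indexed by amino acids,
so that the entry at position (X,Y) gives the
score of aligning amino acid X with amino acid Y.
With S_i the prefix of S of length i, a_i its i-th element, s the pair score and g the gap penalty:
$$\mathrm{Align}(S_i, S'_j) = \max \begin{cases} \mathrm{Align}(S_{i-1}, S'_{j-1}) + s(a_i, a'_j) \\ \mathrm{Align}(S_i, S'_{j-1}) - g \\ \mathrm{Align}(S_{i-1}, S'_j) - g \end{cases}$$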
Organizing the computation – dynamic
programming table
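[Figure: the dynamic programming table Align, with rows indexed by i and columns by j. The entry Align(i,j) = Align(S_i, S'_j) is the max of three values: the diagonal neighbor plus s(a_i, a'_j), and the upper and left neighbors, each minus g.]
$$\mathrm{Align}(S_i, S'_j) = \max \begin{cases} \mathrm{Align}(S_{i-1}, S'_{j-1}) + s(a_i, a'_j) \\ \mathrm{Align}(S_{i-1}, S'_j) - g \\ \mathrm{Align}(S_i, S'_{j-1}) - g \end{cases}$$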
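Example of DP computation with
g = 0; match = 1; mismatch = 0
(Maximal Common Subsequence)
Row 0 and column 0 are the initialization. Each cell is the max of its three neighbors, where the diagonal step adds +1 if the letters match and 0 otherwise. The table is shown partially filled.

          A   T   T   G   C   G   C   G   C   A   T
      0   0   0   0   0   0   0   0   0   0   0   0
  A   0   1   1   1   1   1   1   1   1   1   1   1
  T   0   1   2   2   2   2   2   2   2
  G   0   1   2
  C   0   1
  T   0   1
  T   0   1
  A   0   1
  A   0   1
  C   0   1
  C   0   1
  A   0   1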
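The partially filled table above can be completed programmatically. This is a small sketch under the stated parameters (g = 0, match = 1, mismatch = 0); the function name `lcs_table` is hypothetical, and the row and column sequences are read off the table labels.

```python
def lcs_table(s: str, t: str) -> list[list[int]]:
    """Fill the DP table for g = 0, match = 1, mismatch = 0.

    With these parameters the bottom-right entry is the length of a
    maximal common subsequence of s and t.
    """
    m, n = len(s), len(t)
    table = [[0] * (n + 1) for _ in range(m + 1)]  # row 0 and column 0 stay 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 1 if s[i - 1] == t[j - 1] else 0
            table[i][j] = max(
                table[i - 1][j - 1] + match,  # diagonal: align s[i] with t[j]
                table[i - 1][j],              # gap in t (penalty 0)
                table[i][j - 1],              # gap in s (penalty 0)
            )
    return table


rows = "ATGCTTAACCA"   # sequence labelling the rows of the table
cols = "ATTGCGCGCAT"   # sequence labelling the columns
table = lcs_table(rows, cols)
for label, row in zip(" " + rows, table):
    print(label, row)
print("length of a maximal common subsequence:", table[-1][-1])
```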
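Example of DP computation with
g = 2; match = 2; mismatch = -1
Initialization (penalty for starting with a gap): A[i,0] = -2i and A[0,j] = -2j. Each cell is the max of its three neighbors, where the diagonal step adds +2 if the letters match and -1 otherwise, and a step down or right costs -2. The table is shown partially filled.

           A    T    T    G    C    G    C    G    C    A    T
       0  -2   -4   -6   -8  -10  -12  -14  -16  -18  -20  -22
  A   -2   2    0   -2
  T   -4   0    4
  G   -6   6
  C   -8
  T  -10
  T  -12
  A  -14
  A  -16
  C  -18
  C  -20
  A  -22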
The iterative algorithm
m = |S|; n = |S’|
for i ← 0 to m do A[i,0] ← -i * g
for j ← 0 to n do A[0,j] ← -j * g
for i ← 1 to m do
  for j ← 1 to n do
    A[i,j] ← max (
      A[i-1,j] - g,
      A[i-1,j-1] + s(i,j),
      A[i,j-1] - g
    )
return A[m,n]
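The pseudocode translates almost line by line into Python. The sketch below is illustrative only: the function name is made up, and the default parameters (match = 2, mismatch = -1, g = 2) are borrowed from the earlier worked example rather than prescribed by the algorithm.

```python
def global_alignment_table(s: str, t: str, match: int = 2,
                           mismatch: int = -1, g: int = 2) -> list[list[int]]:
    """Fill the global alignment table A; A[m][n] is the optimal score."""
    m, n = len(s), len(t)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = -i * g          # prefix of s aligned against gaps only
    for j in range(n + 1):
        A[0][j] = -j * g          # prefix of t aligned against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(
                A[i - 1][j] - g,        # s[i] aligned with a gap
                A[i - 1][j - 1] + sub,  # s[i] aligned with t[j]
                A[i][j - 1] - g,        # t[j] aligned with a gap
            )
    return A


A = global_alignment_table("ATGCTTAACCA", "ATTGCGCGCAT")
print("optimal global score:", A[-1][-1])
```

Python strings are 0-indexed, so `s[i - 1]` plays the role of the i-th element in the pseudocode.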
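Complexity of the algorithm
The table has (m+1)(n+1) cells and each cell is computed in constant time, so the algorithm takes O(mn) time and O(mn) space.
[Figure: trace-back path through the table for the sequences ATTG and ATGC, marked “Start path from here!” at the bottom-right cell.]
ATTG-
AT-GC
If at some position several choices lead to the same max value, the path need not be unique.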
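One way to recover an alignment is to re-check, at each cell on the path, which of the three choices produced its value. The sketch below assumes match = 2, mismatch = -1, g = 2 (values from the earlier example) and uses a hypothetical function name. When several choices tie, it prefers the diagonal, so it returns one optimal alignment, not necessarily the one drawn on the slide.

```python
def global_align(s: str, t: str, match: int = 2,
                 mismatch: int = -1, g: int = 2) -> tuple[int, str, str]:
    """Return (score, aligned s, aligned t) for one optimal global alignment."""
    m, n = len(s), len(t)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        A[i][0] = -i * g
    for j in range(n + 1):
        A[0][j] = -j * g
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] - g, A[i - 1][j - 1] + sub, A[i][j - 1] - g)

    # Trace back from the bottom-right cell to the top-left cell.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if A[i][j] == A[i - 1][j - 1] + sub:   # diagonal step
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and A[i][j] == A[i - 1][j] - g:   # step up: gap in t
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:                                      # step left: gap in s
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return A[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, row1, row2 = global_align("ATTG", "ATGC")
print(score)
print(row1)
print(row2)
```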
Extra information not obligatory
Reducing space complexity in the
global alignment
Recall: Computing the score in linear space is easy.
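Leaving a “trace” for finding the optimal alignment is harder. Why? The trace-back needs the whole table, which a linear-space computation does not keep.
Let $\mathrm{OPT}\begin{bmatrix} x \\ y \end{bmatrix}$ be an optimal alignment between sequences x and y. For a fixed position i of x there is a position j of y such that either x[i] is aligned with y[j]:
$$\mathrm{OPT}\begin{bmatrix} x \\ y \end{bmatrix} = \mathrm{OPT}\begin{bmatrix} x[1,\dots,i-1] \\ y[1,\dots,j-1] \end{bmatrix} + \begin{bmatrix} x[i] \\ y[j] \end{bmatrix} + \mathrm{OPT}\begin{bmatrix} x[i+1,\dots,m] \\ y[j+1,\dots,n] \end{bmatrix}$$
OR x[i] is aligned with a gap:
$$\mathrm{OPT}\begin{bmatrix} x \\ y \end{bmatrix} = \mathrm{OPT}\begin{bmatrix} x[1,\dots,i-1] \\ y[1,\dots,j] \end{bmatrix} + \begin{bmatrix} x[i] \\ - \end{bmatrix} + \mathrm{OPT}\begin{bmatrix} x[i+1,\dots,m] \\ y[j+1,\dots,n] \end{bmatrix}$$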
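The "easy" half of the statement can be illustrated directly: each row of the table depends only on the previous row, so two rows suffice to compute the score. This is a minimal sketch with a hypothetical function name and the parameters of the earlier example (match = 2, mismatch = -1, g = 2) as assumed defaults; it returns only the score, not the alignment.

```python
def global_score_linear_space(s: str, t: str, match: int = 2,
                              mismatch: int = -1, g: int = 2) -> int:
    """Optimal global alignment score using two rows of the DP table."""
    if len(t) > len(s):
        s, t = t, s  # the scoring is symmetric, so keep the rows as short as possible
    prev = [-j * g for j in range(len(t) + 1)]       # row 0 of the table
    for i in range(1, len(s) + 1):
        curr = [-i * g] + [0] * len(t)               # column 0 of row i
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(prev[j] - g, prev[j - 1] + sub, curr[j - 1] - g)
        prev = curr                                  # row i-1 is no longer needed
    return prev[-1]


print(global_score_linear_space("ATTG", "ATGC"))
```

Because earlier rows are discarded, no trace-back is possible from this computation alone, which is the difficulty the decomposition above addresses.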
Extra information – not obligatory
Take the largest value in the last row/column and trace back from there.
Example of DP computation
ignoring flanking gaps by assigning 0 to initial gap penalties
Row 0 and column 0 are all 0. Each cell is the max of its three neighbors, where the diagonal step adds s(ai,aj) and a step down or right costs -2. The table is shown partially filled.

          A   T   T   G   C   G   C   G   C   A   T
      0   0   0   0   0   0   0   0   0   0   0   0
  A   0   1  -1
  T   0   1   2
  G   0
  C   0
  T   0
  T   0
  A   0
  A   0
  C   0
  C   0
  A   0

To ignore final gap penalties, choose the highest-scoring entry in the last
column or last row (highlighted in red on the slide) and trace the path back from there.
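The two modifications (zero initialization, and taking the best entry of the last row or column) can be sketched as follows. The function name is hypothetical, and the defaults match = 1, mismatch = -1, g = 2 are assumptions chosen for illustration, not values stated on the slide.

```python
def semiglobal_score(s: str, t: str, match: int = 1,
                     mismatch: int = -1, g: int = 2) -> tuple[int, int, int]:
    """Alignment score ignoring flanking gaps.

    Returns (score, i, j), where (i, j) is the cell in the last row or
    last column from which the trace-back would start.
    """
    m, n = len(s), len(t)
    A = [[0] * (n + 1) for _ in range(m + 1)]  # initial gaps cost nothing
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j - 1] + sub, A[i - 1][j] - g, A[i][j - 1] - g)
    # Final gaps cost nothing: take the best entry of the last row or column.
    last_row = [(A[m][j], m, j) for j in range(n + 1)]
    last_col = [(A[i][n], i, n) for i in range(m + 1)]
    return max(last_row + last_col)


print(semiglobal_score("GCG", "ATTGCGCAT"))   # short sequence contained in a long one
print(semiglobal_score("AAAGGG", "GGGTTT"))   # suffix of one overlaps prefix of the other
```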
Compressing the gaps
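The two alignments below have the same score when every gap character is penalized equally.
The alignment with a single contiguous gap is better: one insertion or deletion of length 2 is more plausible than two separate events.
ATTTTAGTAC    ATTTTAGTAC
ATT--AGTAC    A-T-TAGTAC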
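To see why a length-dependent gap penalty separates the two alignments, one can add up the penalty of each gap run. The sketch below is illustrative: the helper names, the reading of the scattered alignment as `A-T-TAGTAC`, and the penalty functions (2k per gap for the linear case, 3 + k for the affine case) are all assumptions.

```python
from itertools import groupby


def gap_run_lengths(row: str) -> list[int]:
    """Lengths of the maximal runs of '-' in one row of an alignment."""
    return [len(list(run)) for is_gap, run in groupby(row, key=lambda ch: ch == "-") if is_gap]


def total_gap_penalty(rows: tuple[str, str], w) -> int:
    """Sum w(k) over every gap run of length k in both rows."""
    return sum(w(k) for row in rows for k in gap_run_lengths(row))


compressed = ("ATTTTAGTAC", "ATT--AGTAC")   # one gap of length 2
scattered = ("ATTTTAGTAC", "A-T-TAGTAC")    # two gaps of length 1 (assumed reading)

linear = lambda k: 2 * k      # hypothetical: every gap character costs 2
affine = lambda k: 3 + 1 * k  # hypothetical: h = 3 to open, g = 1 per character

print(total_gap_penalty(compressed, linear), total_gap_penalty(scattered, linear))
print(total_gap_penalty(compressed, affine), total_gap_penalty(scattered, affine))
```

Both alignments have the same eight matching columns, so any difference in total score comes from the gap penalties alone.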
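General gap penalty
Let w(k) be the penalty for a gap of length k. Complexity: O(n^3), because each cell takes a max over all possible gap lengths.
$$a[i,j] = \max \begin{cases} a[i-1,j-1] + s(i,j) \\ b[i,j] \\ c[i,j] \end{cases}$$
$$b[i,j] = \max_{1 \le k \le j} \big( a[i,j-k] - w(k) \big) \qquad \text{(close a gap of length } k \text{ in the first sequence)}$$
$$c[i,j] = \max_{1 \le k \le i} \big( a[i-k,j] - w(k) \big) \qquad \text{(close a gap of length } k \text{ in the second sequence)}$$
Initial conditions (written for w(k) = h + gk):
a[0,0] = 0;  a[i,0] = -infinity (i ≠ 0);  a[0,j] = -infinity (j ≠ 0)
b[i,0] = -infinity;  b[0,j] = -(h+gj)
c[i,0] = -(h+gi);  c[0,j] = -infinity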
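The cubic cost is easiest to see in code: every cell loops over all gap lengths. The sketch below uses a single table instead of the three tables a, b, c, and initializes the borders with -w(i) and -w(j) so that an alignment may start with a gap. Both choices are simplifying assumptions of this example, as are the function name and the scores match = 2, mismatch = -1.

```python
def align_general_gap(s: str, t: str, w, match: int = 2, mismatch: int = -1) -> int:
    """Optimal global score for an arbitrary gap penalty function w(k)."""
    m, n = len(s), len(t)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = -w(i)               # one leading gap of length i
    for j in range(1, n + 1):
        A[0][j] = -w(j)               # one leading gap of length j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            best = A[i - 1][j - 1] + sub
            for k in range(1, j + 1):  # close a gap of length k in the first sequence
                best = max(best, A[i][j - k] - w(k))
            for k in range(1, i + 1):  # close a gap of length k in the second sequence
                best = max(best, A[i - k][j] - w(k))
            A[i][j] = best
    return A[m][n]


print(align_general_gap("ATTTTAGTAC", "ATTAGTAC", lambda k: 3 + k))
```

The two inner loops over k are what raise the cost from O(n^2) to O(n^3).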
Affine gap penalty function - cont
w(k) = h + gk ; h,g constants
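Interpretation: cost of starting a gap: h+g; extending a gap: +g
Let a, b, c be as before. Now they can be computed as follows:
$$a[i,j] = \max \begin{cases} a[i-1,j-1] + s(i,j) \\ b[i,j] \\ c[i,j] \end{cases}$$
$$b[i,j] = \max \begin{cases} a[i,j-1] - (h+g) & \text{start a new gap in the first sequence} \\ b[i,j-1] - g & \text{extend the gap in the first sequence by one} \end{cases}$$
$$c[i,j] = \max \begin{cases} a[i-1,j] - (h+g) & \text{start a new gap in the second sequence} \\ c[i-1,j] - g & \text{extend the gap in the second sequence by one} \end{cases}$$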
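The three recurrences can be filled in a single pass over the table, which brings the cost back to O(mn). The sketch below is one possible reading, not the slide author's code. The function name and the default values (h = 3, g = 1, match = 2, mismatch = -1) are assumptions. It also sets the border values a[i,0] and a[0,j] to the cost of a leading gap instead of minus infinity, an assumption made here so that an alignment can start with a gap and then continue along the diagonal.

```python
NEG_INF = float("-inf")


def align_affine(s: str, t: str, h: int = 3, g: int = 1,
                 match: int = 2, mismatch: int = -1):
    """Optimal global score with gap penalty w(k) = h + g*k."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # best score overall
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a gap in the first sequence
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with a gap in the second sequence
    a[0][0] = 0
    for i in range(1, m + 1):
        c[i][0] = -(h + g * i)
        a[i][0] = c[i][0]   # assumption: leading gap allowed (see lead-in)
    for j in range(1, n + 1):
        b[0][j] = -(h + g * j)
        a[0][j] = b[0][j]   # assumption: leading gap allowed (see lead-in)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            b[i][j] = max(a[i][j - 1] - (h + g),  # start a new gap
                          b[i][j - 1] - g)        # extend the gap by one
            c[i][j] = max(a[i - 1][j] - (h + g),
                          c[i - 1][j] - g)
            a[i][j] = max(a[i - 1][j - 1] + sub, b[i][j], c[i][j])
    return a[m][n]


print(align_affine("ATTTTAGTAC", "ATTAGTAC"))
```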
More sophisticated gap penalties
• The gap penalty can be made to depend non-linearly
on the gap length (e.g. as a logarithmic function)
k-band alignment
[Figure: strip of cells with |j-i| <= k around the main diagonal of the table.]
Modify the “max” expressions so that cells outside the strip are
not considered.
n = |S| = |S’|
for i ← 0 to k do A[i,0] ← -i * g
for j ← 0 to k do A[0,j] ← -j * g
for i ← 1 to n do
  for d ← -k to k do
    j ← i + d
    if j >= 1 and inside_strip(i,j,k) then
      A[i,j] ← max (
        if inside_strip(i-1,j,k) then A[i-1,j] - g else -infinity,
        A[i-1,j-1] + s(i,j),
        if inside_strip(i,j-1,k) then A[i,j-1] - g else -infinity
      )
return A[n,n]
where inside_strip(i,j,k) tests whether cell A[i,j] is inside the strip, that is, whether |i-j| <= k and 0 <= i,j <= n.
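A compact way to implement the strip is to store only the cells inside it and treat every other cell as minus infinity. The sketch below takes that approach with a dictionary. The function name and the scores (match = 2, mismatch = -1, g = 2) are assumptions for illustration.

```python
NEG_INF = float("-inf")


def banded_alignment_score(s: str, t: str, k: int, match: int = 2,
                           mismatch: int = -1, g: int = 2):
    """Best global score among alignments whose path stays within |i-j| <= k."""
    n = len(s)
    if len(t) != n:
        raise ValueError("this sketch assumes sequences of equal length")
    A = {}  # only cells inside the strip are stored: O(n*k) memory

    def get(i: int, j: int):
        return A.get((i, j), NEG_INF)  # cells outside the strip count as -infinity

    for i in range(0, min(k, n) + 1):
        A[(i, 0)] = -i * g
    for j in range(0, min(k, n) + 1):
        A[(0, j)] = -j * g
    for i in range(1, n + 1):
        for j in range(max(1, i - k), min(n, i + k) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[(i, j)] = max(get(i - 1, j) - g,
                            get(i - 1, j - 1) + sub,
                            get(i, j - 1) - g)
    return A[(n, n)]


print(banded_alignment_score("ACGTACGT", "CGTACGTA", k=1))
print(banded_alignment_score("ACGTACGT", "CGTACGTA", k=0))
```

With k = 0 only the main diagonal is available, so no gaps can be used. A band that is too narrow can therefore miss the optimal alignment.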
Local alignment
• The alignment techniques considered so far
work well for sequences that are similar over
their whole length.
• This need not be the case. For example, genes
of the Hox family share a very short but highly
conserved subsequence, the so-called Hox
domain.
• The global alignment methods considered so far (that
is, algorithms that try to find the best alignment
over the whole length) can miss such a region of local
similarity.
[Figure: global alignment versus local alignment.]
Local alignment (Smith, Waterman)
So far we have been dealing with global alignment.
Local alignment – an alignment between substrings.
Main idea: if the alignment becomes too bad, drop it.
Finding the alignment: find the highest-scoring cell and trace back from it.
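In the Smith-Waterman recurrence, "drop it" is usually realized by adding 0 as a fourth option in the max, so a cell never goes negative. The trace-back then starts at the highest-scoring cell and stops at the first cell with value 0. The sketch below follows that scheme. The function name and the scores (match = 2, mismatch = -1, g = 2) are assumptions.

```python
def local_align(s: str, t: str, match: int = 2,
                mismatch: int = -1, g: int = 2) -> tuple[int, str, str]:
    """Return (score, aligned substring of s, aligned substring of t)."""
    m, n = len(s), len(t)
    A = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(0,                      # drop a bad alignment and restart
                          A[i - 1][j - 1] + sub,
                          A[i - 1][j] - g,
                          A[i][j - 1] - g)
            if A[i][j] > best:
                best, best_pos = A[i][j], (i, j)

    # Trace back from the highest-scoring cell until a cell with value 0.
    top, bottom = [], []
    i, j = best_pos
    while i > 0 and j > 0 and A[i][j] > 0:
        sub = match if s[i - 1] == t[j - 1] else mismatch
        if A[i][j] == A[i - 1][j - 1] + sub:
            top.append(s[i - 1]); bottom.append(t[j - 1])
            i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j] - g:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("TTTTACGTACTTTT", "GGGGACGTACGGGG"))
```

In this usage example the two sequences share only the central block, which a global alignment would have to pad with mismatches or gaps on both sides.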
Example
Global/local comparison
alignment
• semiglobal
• global
Step 2 of FASTA