Inexact Matching, Sequence Alignment, and Dynamic Programming
R I M D M D M M I
v   i n t n e r
w r i   t   e r s
• The permitted edit operations are
– Insertion (I) of a character into the first string
– Deletion (D) of a character from the first string
– Substitution (or replacement) (R) of a character in the first string with
a character in the second string
• For Match (M) no operation is necessary
Edit Transcript vs. Edit Distance
Edit Transcript: A string over the alphabet I, D, R, M that
describes a transformation of one string to another is called
an edit transcript, or transcript for short, of the two strings.
R I M D M D M M I
v   i n t n e r
w r i   t   e r s
Edit Distance: The minimum number of edit operations –
insertions, deletions and substitutions – needed to
transform the first string into the second. Also known as
Levenshtein distance.
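The definition above can be sketched in Python as follows. This is a minimal illustration, assuming the usual dynamic-programming recurrence with unit cost for each insertion, deletion, and substitution; the function name `edit_distance` and the table name `D` are chosen here for illustration only.

```python
def edit_distance(s1: str, s2: str) -> int:
    """Levenshtein distance between s1 and s2, using an (n+1) x (m+1) table."""
    n, m = len(s1), len(s2)
    # D[i][j] holds the edit distance between the prefixes s1[:i] and s2[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i  # delete all i characters of the s1 prefix
    for j in range(1, m + 1):
        D[0][j] = j  # insert all j characters of the s2 prefix
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1  # 0 for a match, 1 for a substitution
            D[i][j] = min(
                D[i - 1][j] + 1,      # deletion
                D[i][j - 1] + 1,      # insertion
                D[i - 1][j - 1] + t,  # match or substitution
            )
    return D[n][m]


print(edit_distance("vintner", "writers"))  # 5
```

The result 5 agrees with the transcript RIMDMDMMI, which contains five non-match operations (one R, two I, two D).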
[Figure: dynamic programming table for the edit distance computation; the table is filled in O(nm) time]
The Traceback
For an optimal edit transcript, follow any path of traceback pointers from cell (n, m) to cell (0, 0):
1. A horizontal edge, from (i, j) to (i, j-1), is an insertion (I) of character S2(j) into S1
2. A vertical edge, from (i, j) to (i-1, j), is a deletion (D) of S1(i) from S1
3. A diagonal edge, from (i, j) to (i-1, j-1), is a match (M) if S1(i) = S2(j) and a
substitution (R) if S1(i) ≠ S2(j)
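The traceback rules can be sketched in Python as below. This is an illustrative sketch, not a reference implementation: instead of storing pointers, it re-derives a valid predecessor from the table values, and it arbitrarily prefers diagonal, then vertical, then horizontal edges when several are valid. The name `edit_transcript` is an assumption made for this example.

```python
def edit_transcript(s1: str, s2: str) -> str:
    """Return one optimal edit transcript over the alphabet I, D, R, M."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + t)

    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        t = 1 if i > 0 and j > 0 and s1[i - 1] != s2[j - 1] else 0
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + t:
            ops.append("R" if t else "M")  # diagonal edge
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + 1:
            ops.append("D")                # vertical edge
            i -= 1
        else:
            ops.append("I")                # horizontal edge
            j -= 1
    return "".join(reversed(ops))          # the path was walked backwards


print(edit_transcript("vintner", "writers"))
```

Because ties are broken arbitrarily, the transcript printed may differ from RIMDMDMMI, but it always contains the minimum number of I, D, and R operations.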
• Let V(i, j) be the value of the optimal alignment of
prefixes S1[1..i] and S2[1..j]
End-space Free Variant
Local vs. global alignment
Three steps in Needleman-Wunsch Algorithm
• Initialization
• Scoring
• Trace back (Alignment)
• Consider two DNA sequences to be
globally aligned:
ATCG (x = 4, length of sequence 1)
TCG (y = 3, length of sequence 2)
Scoring Scheme
• Match Score = +1
• Mismatch Score = -1
• Gap penalty = -1
• Substitution Matrix
     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
Initialization Step
• Create a matrix with x + 1 rows and y + 1
columns
• The 1st row and the 1st column of the score
matrix are filled with multiples of the gap penalty
          T    C    G
     0   -1   -2   -3
A   -1
T   -2
C   -3
G   -4
Scoring
• The score of any cell C(i, j) is the maximum of:
score_diag = C(i-1, j-1) + S(i, j)
score_up = C(i-1, j) + g
score_left = C(i, j-1) + g
where S(i, j) is the substitution score for the letters
at row i and column j, and g is the gap penalty
Scoring ….
• Example:
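The calculation for the cell C(2, 2), the first cell past the boundary row and
column (A against T, a mismatch):
score_diag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
score_up = C(i-1, j) + g = -1 + (-1) = -2
score_left = C(i, j-1) + g = -1 + (-1) = -2
C(2, 2) = max(-1, -2, -2) = -1
          T    C    G
     0   -1   -2   -3
A   -1   -1
T   -2
C   -3
G   -4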
Scoring ….
• Final Scoring Matrix
          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2
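The initialization and scoring steps can be reproduced with a short Python sketch. It assumes the scoring scheme of this example (match +1, mismatch -1, gap -1); the function name `nw_matrix` is illustrative. The code uses 0-based indices, so the cell the slides call C(2, 2) is `C[1][1]` here.

```python
def nw_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the Needleman-Wunsch score matrix (rows: seq1, columns: seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    C = [[0] * cols for _ in range(rows)]
    # Initialization: first row and column are multiples of the gap penalty.
    for i in range(1, rows):
        C[i][0] = i * gap
    for j in range(1, cols):
        C[0][j] = j * gap
    # Scoring: each cell is the maximum of its three candidate scores.
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(
                C[i - 1][j - 1] + s,  # score_diag
                C[i - 1][j] + gap,    # score_up
                C[i][j - 1] + gap,    # score_left
            )
    return C


for row in nw_matrix("ATCG", "TCG"):
    print(row)
```

The printed rows match the final scoring matrix above, with the global alignment score 2 in the bottom-right cell.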
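Trace back
• Start at the bottom-right cell (score 2) and move to the predecessor cell that
produced its score.
• Here the only possible predecessor is the diagonal match/mismatch neighbor. If more than
one possible predecessor exists, any can be chosen. This gives us a current alignment of
Seq 1: G
       |
Seq 2: G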
Trace back ….
• Final Trace back
          T    C    G
     0   -1   -2   -3
A   -1   -1   -2   -3
T   -2    0   -1   -2
C   -3   -1    1    0
G   -4   -2    0    2
Traceback path (scores): 2, 1, 0 along the diagonal, then -1 and 0 up the first column
Best Alignment:
A T C G
  | | |
_ T C G
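The full global alignment, including the trace back, can be sketched as follows. This is an assumption-laden illustration: it uses the example scoring scheme (match +1, mismatch -1, gap -1), recomputes predecessors from the matrix instead of storing pointers, and prefers the diagonal predecessor when several are valid. The name `needleman_wunsch` is chosen for this example.

```python
def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (aligned seq1, aligned seq2, score) for one optimal global alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    C = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        C[i][0] = i * gap
    for j in range(1, cols):
        C[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)

    # Trace back from the bottom-right cell to the top-left cell.
    a1, a2 = [], []
    i, j = len(seq1), len(seq2)
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            if C[i][j] == C[i - 1][j - 1] + s:  # diagonal: match or mismatch
                a1.append(seq1[i - 1])
                a2.append(seq2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and C[i][j] == C[i - 1][j] + gap:  # up: space in seq2
            a1.append(seq1[i - 1])
            a2.append("_")
            i -= 1
        else:                                       # left: space in seq1
            a1.append("_")
            a2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), C[-1][-1]


print(needleman_wunsch("ATCG", "TCG"))  # ('ATCG', '_TCG', 2)
```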
Local Sequence Alignment
• The Smith-Waterman algorithm performs a
local alignment on two sequences
• It is an example of dynamic programming
• Useful for dissimilar sequences that are
suspected to contain regions of similarity or
similar sequence motifs within their larger
sequence context
• Aim: The best alignment over the conserved
domain of two sequences
Differences between the Needleman-Wunsch and
Smith-Waterman Algorithms (what Smith-Waterman changes):
• In the initialization stage, the first row and first
column are all filled in with 0s
• While filling the matrix, if a score becomes
negative, put in 0 instead
• In the traceback, start with the cell that has
the highest score and work back until a cell
with a score of 0 is reached.
Three steps in Smith-Waterman Algorithm
• Initialization
• Scoring
• Trace back (Alignment)
• Consider two DNA sequences to be
locally aligned:
ATCG (x = 4, length of sequence 1)
TCG (y = 3, length of sequence 2)
Scoring Scheme
• Match Score = +1
• Mismatch Score = -1
• Gap penalty = -1
• Substitution Matrix
     A    C    G    T
A    1   -1   -1   -1
C   -1    1   -1   -1
G   -1   -1    1   -1
T   -1   -1   -1    1
Initialization Step
• Create a matrix with x + 1 rows and y + 1
columns
• The 1st row and the 1st column of the score
matrix are filled with 0s
         T   C   G
     0   0   0   0
A    0
T    0
C    0
G    0
Scoring
• The score of any cell C(i, j) is the maximum of:
score_diag = C(i-1, j-1) + S(i, j)
score_up = C(i-1, j) + g
score_left = C(i, j-1) + g
and 0
(here S(i, j) is the substitution score for the letters at row i
and column j, and g is the gap penalty)
Scoring ….
• Example:
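The calculation for the cell C(2, 2), the first cell past the boundary row and
column (A against T, a mismatch):
score_diag = C(i-1, j-1) + S(i, j) = 0 + (-1) = -1
score_up = C(i-1, j) + g = 0 + (-1) = -1
score_left = C(i, j-1) + g = 0 + (-1) = -1
C(2, 2) = max(-1, -1, -1, 0) = 0
         T   C   G
     0   0   0   0
A    0   0
T    0
C    0
G    0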
Scoring ….
• Final Scoring Matrix
         T   C   G
     0   0   0   0
A    0   0   0   0
T    0   1   0   0
C    0   0   2   1
G    0   0   1   3
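The local-alignment scoring step differs from the global one only in the zero initialization and the extra 0 candidate. A minimal Python sketch, assuming the example scoring scheme (match +1, mismatch -1, gap -1) and using the illustrative name `sw_matrix`:

```python
def sw_matrix(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Fill the Smith-Waterman score matrix (rows: seq1, columns: seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # Initialization: the first row and first column stay at 0.
    C = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(
                0,                    # a negative score is replaced by 0
                C[i - 1][j - 1] + s,  # score_diag
                C[i - 1][j] + gap,    # score_up
                C[i][j - 1] + gap,    # score_left
            )
    return C


for row in sw_matrix("ATCG", "TCG"):
    print(row)
```

The printed rows match the final scoring matrix above; the highest score, 3, is where the trace back starts.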
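Trace back
• Start at the cell with the highest score (3, in the bottom-right corner here) and move to
the predecessor cell that produced its score.
• Here the only possible predecessor is the diagonal match/mismatch neighbor. If more than
one possible predecessor exists, any can be chosen. This gives us a current alignment of
Seq 1: G
       |
Seq 2: G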
Trace back ….
• Final Trace back
         T   C   G
     0   0   0   0
A    0   0   0   0
T    0   1   0   0
C    0   0   2   1
G    0   0   1   3
Best Alignment:
T C G
| | |
T C G
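The complete local alignment, with the trace back that starts at the highest-scoring cell and stops at a 0, can be sketched like this. It is a sketch under the example scoring scheme (match +1, mismatch -1, gap -1); ties for the highest score are resolved by taking the first such cell in row-major order, which is an arbitrary choice, and the name `smith_waterman` is illustrative.

```python
def smith_waterman(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return (aligned part of seq1, aligned part of seq2, score) for one best local alignment."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    C = [[0] * cols for _ in range(rows)]
    best, bi, bj = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            s = match if seq1[i - 1] == seq2[j - 1] else mismatch
            C[i][j] = max(0, C[i - 1][j - 1] + s, C[i - 1][j] + gap, C[i][j - 1] + gap)
            if C[i][j] > best:  # remember the highest-scoring cell
                best, bi, bj = C[i][j], i, j

    # Trace back from the best cell until a cell with score 0 is reached.
    a1, a2 = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and C[i][j] > 0:
        s = match if seq1[i - 1] == seq2[j - 1] else mismatch
        if C[i][j] == C[i - 1][j - 1] + s:    # diagonal
            a1.append(seq1[i - 1])
            a2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif C[i][j] == C[i - 1][j] + gap:    # up: space in seq2
            a1.append(seq1[i - 1])
            a2.append("_")
            i -= 1
        else:                                 # left: space in seq1
            a1.append("_")
            a2.append(seq2[j - 1])
            j -= 1
    return "".join(reversed(a1)), "".join(reversed(a2)), best


print(smith_waterman("ATCG", "TCG"))  # ('TCG', 'TCG', 3)
```

Unlike the global alignment of the same pair, the leading A of sequence 1 is simply left out instead of being paired with a space.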
Gaps
• A gap is any maximal, consecutive run of spaces in a
single string of a given alignment.
c t t t a a c _ _ a _ a c
c _ _ _ c a c c c a t _ c
Four gaps and seven spaces
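The count can be checked with a few lines of Python. This sketch assumes that spaces are written as `_` and that each row of the alignment is given as a string without the separating blanks; the function name is illustrative.

```python
import re


def count_gaps_and_spaces(aligned: str, space: str = "_"):
    """Return (number of gaps, number of spaces) in one row of an alignment."""
    # A gap is a maximal consecutive run of space characters.
    runs = re.findall(re.escape(space) + "+", aligned)
    return len(runs), sum(len(run) for run in runs)


top = "ctttaac__a_ac"
bottom = "c___cacccat_c"
gaps_top, spaces_top = count_gaps_and_spaces(top)
gaps_bottom, spaces_bottom = count_gaps_and_spaces(bottom)
print(gaps_top + gaps_bottom, spaces_top + spaces_bottom)  # 4 7
```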
The simplest objective function that includes gaps
• The top row shows part of the RNA sequence of one strain of the HIV-1 virus.
• The HIV virus mutates rapidly.
• Each of the three bottom rows shows a virus strain mutated from the original
one.
• Dark regions are the matching portions; white space represents gaps.
• Matching here means similarity, i.e. mismatches or spaces may be present, but only in a small
percentage of the region.
cDNA Matching: A Concrete Example
[Figure: a gene with its exon and intron regions labeled]
The cDNA
• Each cell contains the same chromosomes, the same set
of genes
• Yet, in each specialized cell (a liver cell for example)
only a small fraction of the genes are expressed
• Suppose you want to find the location of the gene encoding
a specific protein expressed in that cell
• Capture the mRNA in that cell after it leaves the cell
nucleus
• That mRNA is used to create a DNA string
complementary to it, which is known as cDNA
cDNA Problem
Why Gaps in the Objective Function
• Without a gap term in the objective function, you will not get
long gaps, and you cannot tailor the gaps to your own choice or
to the specific problem
Choice of Gap Weights
• Constant
– Maximize [Wm(# matches) – Wms(# mismatches) – Wg(# gaps)]
– Or
• Affine
– Maximize [Wm(# matches) – Wms(# mismatches) – Wg(# gaps) – Ws(# spaces)]
– Wg is the gap initiation cost, Ws is the gap extension cost
• Convex
• Arbitrary
c t t t a a c _ _ a _ a c
c _ _ _ c a c c c a t _ c
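The constant and affine objective functions can be evaluated for a given alignment with a short Python sketch. The weights used below (Wm = 1, Wms = 1, Wg = 2, Ws = 1) are hypothetical values picked only for illustration, and the function name is an assumption; setting Ws = 0 gives the constant gap weight model.

```python
import re


def alignment_score(top, bottom, wm, wms, wg, ws, space="_"):
    """Wm(# matches) - Wms(# mismatches) - Wg(# gaps) - Ws(# spaces)."""
    if len(top) != len(bottom):
        raise ValueError("both rows of an alignment must have the same length")
    matches = mismatches = 0
    for a, b in zip(top, bottom):
        if a == space or b == space:
            continue  # columns with a space are charged through the gap terms
        if a == b:
            matches += 1
        else:
            mismatches += 1
    gaps = spaces = 0
    for row in (top, bottom):
        runs = re.findall(re.escape(space) + "+", row)  # maximal runs of spaces
        gaps += len(runs)
        spaces += sum(len(run) for run in runs)
    return wm * matches - wms * mismatches - wg * gaps - ws * spaces


top = "ctttaac__a_ac"
bottom = "c___cacccat_c"
print(alignment_score(top, bottom, wm=1, wms=1, wg=2, ws=1))  # affine: -11
print(alignment_score(top, bottom, wm=1, wms=1, wg=2, ws=0))  # constant: -4
```

The example alignment has 5 matches, 1 mismatch, 4 gaps, and 7 spaces, so the affine score is 5 - 1 - 8 - 7 = -11 and the constant score is 5 - 1 - 8 = -4.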
Reference
• Chapters 10 and 11: Algorithms on Strings, Trees
and Sequences (Dan Gusfield)