Introduction Dynamic Programming
Dynamic Programming
The sequence alignment problem
Korf, I., Yandell, M., and Bedell, J. (2003). BLAST. O’Reilly Media, Inc.
Design goals
Query:   ATTACCAG
         || |||||
Subject: ATCACCAG
Sequences must have high percent identity
Applications:
PAM scoring matrix (align sequences with >= 85% identity)
Align mononucleotide runs during sequence improvement
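The percent-identity requirement above can be illustrated with a minimal sketch. It assumes an ungapped, position-by-position comparison of two equal-length sequences; the function name `percent_identity` is hypothetical and not part of the slides.

```python
def percent_identity(query: str, subject: str) -> float:
    """Percent of columns with identical bases in an ungapped alignment."""
    if len(query) != len(subject):
        raise ValueError("an ungapped alignment needs equal-length sequences")
    if not query:
        raise ValueError("sequences must be non-empty")
    # Count the columns where both sequences carry the same base.
    matches = sum(q == s for q, s in zip(query, subject))
    return 100.0 * matches / len(query)


# The example pair from the slide: 7 of 8 columns are identical.
print(percent_identity("ATTACCAG", "ATCACCAG"))  # 87.5
```

At 87.5% identity, this pair would clear the >= 85% threshold quoted for the PAM application.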
Strategy #2: Enumerate all alignments
Brute-force algorithm
Establish baseline performance and test cases
Identify patterns in the problem space
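As a rough sketch of the brute-force strategy, the code below enumerates every gapped alignment of two very short sequences and scores each one. The scoring values (match +5, mismatch -2, gap -6) are borrowed from the later slides, and the function names are hypothetical. It is only usable on tiny inputs, which is the point of establishing it as a baseline.

```python
MATCH, MISMATCH, GAP = 5, -2, -6  # assumed scores, taken from the later slides


def all_alignments(query: str, subject: str):
    """Yield every gapped alignment as a (query_row, subject_row) pair."""
    if not query and not subject:
        yield "", ""
        return
    if query and subject:  # align the two leading bases
        for q, s in all_alignments(query[1:], subject[1:]):
            yield query[0] + q, subject[0] + s
    if query:  # leading query base against a gap in the subject
        for q, s in all_alignments(query[1:], subject):
            yield query[0] + q, "-" + s
    if subject:  # leading subject base against a gap in the query
        for q, s in all_alignments(query, subject[1:]):
            yield "-" + q, subject[0] + s


def score(query_row: str, subject_row: str) -> int:
    """Sum the column scores of one alignment."""
    total = 0
    for q, s in zip(query_row, subject_row):
        if q == "-" or s == "-":
            total += GAP
        else:
            total += MATCH if q == s else MISMATCH
    return total


best = max(all_alignments("TGC", "TC"), key=lambda pair: score(*pair))
print(best, score(*best))  # ('TGC', 'T-C') 4
```

Because every alignment is generated and scored independently, the result can serve as a test case for faster methods on small inputs.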
Apply the brute-force algorithm to a single column of the alignment
[Figure: a single alignment column; labels: "Align" (the two bases are identical), "Query (x)"]
Assessment of the three sequence alignment strategies
Infeasible to examine all possible alignments
Need to reduce the search space
Query:   A T A - T A T -
Subject: A T - A T A - T
[Figure: axes labeled "Query (x)" and "Subject"]
Assume the query and subject sequences are the same
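Three different ways to reach cell (i,j) in the alignment matrix
[Figure: cell (i,j) and its neighbors (i-1,j-1), (i,j-1), and (i-1,j); query (x) base a, subject (y) base b]
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) & \text{(align)} \\
S(i-1,j) + \gamma & \text{(gap in subject)} \\
S(i,j-1) + \gamma & \text{(gap in query)}
\end{cases}
$$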
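The three ways of reaching a cell can be sketched as a small function. This is an illustration under assumptions: the default scores (match +5, mismatch -2, gap -6) come from the later slides, σ is taken to be a simple match/mismatch score, and the name `cell_candidates` and its argument names are hypothetical. Here `diag` stands for S(i-1,j-1), `left` for S(i-1,j), and `up` for S(i,j-1).

```python
def cell_candidates(diag: int, left: int, up: int, a: str, b: str,
                    match: int = 5, mismatch: int = -2, gap: int = -6) -> dict:
    """Return the score of each of the three ways to reach cell (i,j)."""
    sigma = match if a == b else mismatch  # assumed form of sigma(a,b)
    return {
        "align": diag + sigma,         # S(i-1,j-1) + sigma(a,b)
        "gap_in_subject": left + gap,  # S(i-1,j) + gamma
        "gap_in_query": up + gap,      # S(i,j-1) + gamma
    }


candidates = cell_candidates(diag=0, left=-6, up=-6, a="T", b="T")
print(candidates)
print(max(candidates.values()))  # 5
```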
Determine the best way to reach cell (i,j) if it were part of the optimal alignment
[Figure: Query vs. Subject matrix showing the optimal alignment and an undetermined step ("?") into cell (i,j)]
S(i,j) = max {
  Align:          a over b
  Gap in subject: a over -
  Gap in query:   - over b
}
Use the maximum score at each cell to eliminate entire branches of suboptimal alignments
Fill in the cells in the first row and first column with the cumulative gap costs
Calculate the maximum score for subsequent cells (i,j)
Keep track of the decision that leads to the maximum score (S)
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) \\
S(i-1,j) + \gamma \\
S(i,j-1) + \gamma
\end{cases}
$$
Needleman SB, Wunsch CD. A general method applicable to the search for similarities in
the amino acid sequence of two proteins. J Mol Biol. 1970 Mar;48(3):443-53.
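The three steps above can be sketched as a matrix-filling routine. This is a minimal illustration, not the published implementation. It assumes a simple match/mismatch score for σ and a linear gap cost γ (values from the later slides), stores the matrix as `S[j][i]` with j indexing the subject and i the query, and breaks ties in favor of "align". The function and move names are hypothetical.

```python
def fill_global_matrix(query: str, subject: str,
                       match: int = 5, mismatch: int = -2, gap: int = -6):
    """Return (S, moves), where S[j][i] scores query[:i] against subject[:j]."""
    cols, rows = len(query) + 1, len(subject) + 1
    S = [[0] * cols for _ in range(rows)]
    moves = [[None] * cols for _ in range(rows)]
    # Step 1: first row and first column hold cumulative gap costs.
    for i in range(1, cols):
        S[0][i], moves[0][i] = i * gap, "gap_in_subject"
    for j in range(1, rows):
        S[j][0], moves[j][0] = j * gap, "gap_in_query"
    # Steps 2 and 3: take the maximum of the three options and remember it.
    for j in range(1, rows):
        for i in range(1, cols):
            sigma = match if query[i - 1] == subject[j - 1] else mismatch
            candidates = [
                (S[j - 1][i - 1] + sigma, "align"),
                (S[j][i - 1] + gap, "gap_in_subject"),
                (S[j - 1][i] + gap, "gap_in_query"),
            ]
            # max() keeps the first candidate on ties (assumed tie-break).
            S[j][i], moves[j][i] = max(candidates, key=lambda c: c[0])
    return S, moves


S, moves = fill_global_matrix("TGCTCGTA", "TTCATA")
print(S[-1][-1])  # 11
```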
Initialize the alignment matrix
(Match = +5; Mismatch = -2; Gap = -6)
Columns: Query; rows: Subject (Eddy, 2004)
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1   T   -6
2   T  -12
3   C  -18
4   A  -24
5   T  -30
6   A  -36
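A short sketch of the initialization step, assuming a linear gap cost of -6 and the same row/column layout as the matrix above (rows for the subject, columns for the query). The name `init_matrix` is hypothetical, and cells not yet computed are left as `None`.

```python
def init_matrix(query: str, subject: str, gap: int = -6):
    """Fill only the first row and first column with cumulative gap costs."""
    S = [[None] * (len(query) + 1) for _ in range(len(subject) + 1)]
    for i in range(len(query) + 1):
        S[0][i] = i * gap  # leading query bases aligned to gaps
    for j in range(len(subject) + 1):
        S[j][0] = j * gap  # leading subject bases aligned to gaps
    return S


S0 = init_matrix("TGCTCGTA", "TTCATA")
print(S0[0])
print([row[0] for row in S0])
```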
Calculate the possible scores for the cell at position (1,1)
Neighboring cells: S(0,0) = 0; S(1,0) = -6; S(0,1) = -6
S(1,1) = max {
  Align:          S(0,0) + σ(T,T)
  Gap in subject: S(0,1) + γ
  Gap in query:   S(1,0) + γ
}
Calculate the optimal score for the cell at position (1,1)
(Match = +5; Mismatch = -2; Gap = -6)
S(1,1) = max {
  Align:           0 + (+5) = 5
  Gap in subject: -6 + (-6) = -12
  Gap in query:   -6 + (-6) = -12
}
S(1,1) = 5
Calculate the possible scores for the cell at position (2,1)
(Match = +5; Mismatch = -2; Gap = -6)
Query base G (column 2); subject base T (row 1)
S(2,1) = max {
  Align:           -6 + (-2) = -8
  Gap in subject:   5 + (-6) = -1
  Gap in query:   -12 + (-6) = -18
}
S(2,1) = -1
Alignment matrix after two iterations
(Match = +5; Mismatch = -2; Gap = -6)
Color legend: Align; Gap in subject; Gap in query
Columns: Query; rows: Subject
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1   T   -6    5   -1
2   T  -12
3   C  -18
4   A  -24
5   T  -30
6   A  -36
Calculate the optimal score for the cell at position (3,1)
(Match = +5; Mismatch = -2; Gap = -6)
Query base C (column 3); subject base T (row 1)
S(3,1) = max {
  Align:          -12 + (-2) = -14
  Gap in subject:  -1 + (-6) = -7
  Gap in query:   -18 + (-6) = -24
}
S(3,1) = -7
Matrix after three iterations
(Match = +5; Mismatch = -2; Gap = -6)
Color legend: Align; Gap in subject; Gap in query
Columns: Query; rows: Subject
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1   T   -6    5   -1   -7
2   T  -12
3   C  -18
4   A  -24
5   T  -30
6   A  -36
Calculate the optimal score for the cell at position (1,2)
(Match = +5; Mismatch = -2; Gap = -6)
Query base T (column 1); subject base T (row 2)
S(1,2) = max {
  Align:           -6 + (+5) = -1
  Gap in subject: -12 + (-6) = -18
  Gap in query:     5 + (-6) = -1
}
S(1,2) = -1
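In the cell (1,2) calculation, two of the three candidates produce the same maximum of -1. The slides do not state how such ties are resolved, so the sketch below simply reports every move that reaches the maximum. The function name `best_moves` and the argument names are hypothetical: `diag` is S(i-1,j-1), `left` is S(i-1,j), and `up` is S(i,j-1).

```python
def best_moves(diag: int, left: int, up: int, a: str, b: str,
               match: int = 5, mismatch: int = -2, gap: int = -6):
    """Return the maximum score and every move that achieves it."""
    sigma = match if a == b else mismatch
    candidates = {
        "align": diag + sigma,
        "gap_in_subject": left + gap,
        "gap_in_query": up + gap,
    }
    best = max(candidates.values())
    return best, [move for move, s in candidates.items() if s == best]


# Cell (1,2): S(0,1) = -6, S(0,2) = -12, S(1,1) = 5; both bases are T.
print(best_moves(diag=-6, left=-12, up=5, a="T", b="T"))
# (-1, ['align', 'gap_in_query'])
```

Either tied move gives the same cell score; which one is recorded only matters for the alignment reported during traceback, and only if that cell lies on the optimal path.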
Complete alignment matrix
(Match = +5; Mismatch = -2; Gap = -6)
Color legend: Align; Gap in subject; Gap in query
Columns: Query; rows: Subject
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1   T   -6    5   -1   -7  -13  -19  -25  -31  -37
2   T  -12   -1    3   -3   -2   -8  -14  -20  -26
3   C  -18   -7   -3    8    2    3   -3   -9  -15
4   A  -24  -13   -9    2    6    0    1   -5   -4
5   T  -30  -19  -15   -4    7    4   -2    6    0
6   A  -36  -25  -21  -10    1    5    2    0   11
Use traceback to recover the optimal alignment
Start from the cell within the last row and last column that has the highest score
Recall the step (color) that leads to this optimal score
Report this step in the alignment output
All the alignment decisions have already been made
Last four rows of the alignment matrix:
    C  -18   -7   -3    8    2    3   -3   -9  -15
    A  -24  -13   -9    2    6    0    1   -5   -4
    T  -30  -19  -15   -4    7    4   -2    6    0
    A  -36  -25  -21  -10    1    5    2    0   11
Calculate the optimal score for the cell at position (5,3)
(Match = +5; Mismatch = -2; Gap = -6)
Query base C (column 5); subject base C (row 3)
S(5,3) = max {
  Align:          -2 + (+5) = 3
  Gap in subject:  2 + (-6) = -4
  Gap in query:   -8 + (-6) = -14
}
S(5,3) = 3
Traceback must follow the steps that produce the optimal cumulative global alignment score
Columns: Query (x); rows: Subject (y)
          T    C
    T    -2   -8
    C     2    3
Query:   T G C T C G T A
Subject: T - - T C A T A
Traceback:
Columns: Query; rows: Subject
              T    G    C    T    C    G    T    A
         0   -6  -12  -18  -24  -30  -36  -42  -48
    T   -6    5   -1   -7  -13  -19  -25  -31  -37
    T  -12   -1    3   -3   -2   -8  -14  -20  -26
    C  -18   -7   -3    8    2    3   -3   -9  -15
    A  -24  -13   -9    2    6    0    1   -5   -4
    T  -30  -19  -15   -4    7    4   -2    6    0
    A  -36  -25  -21  -10    1    5    2    0   11
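The traceback can be sketched by storing one move per cell while filling the matrix and then walking backwards from the end. This is a minimal illustration under assumptions: it starts at the bottom-right cell (the usual choice for a global alignment), uses the match/mismatch/gap scores from the slides, and prefers the diagonal move on ties. The function name `global_align` and the move labels are hypothetical.

```python
def global_align(query: str, subject: str,
                 match: int = 5, mismatch: int = -2, gap: int = -6):
    """Return (score, aligned_query, aligned_subject) for a global alignment."""
    cols, rows = len(query) + 1, len(subject) + 1
    S = [[0] * cols for _ in range(rows)]
    move = [[None] * cols for _ in range(rows)]
    for i in range(1, cols):
        S[0][i], move[0][i] = i * gap, "left"
    for j in range(1, rows):
        S[j][0], move[j][0] = j * gap, "up"
    for j in range(1, rows):
        for i in range(1, cols):
            sigma = match if query[i - 1] == subject[j - 1] else mismatch
            S[j][i], move[j][i] = max(
                (S[j - 1][i - 1] + sigma, "diag"),  # align
                (S[j][i - 1] + gap, "left"),        # gap in subject
                (S[j - 1][i] + gap, "up"),          # gap in query
                key=lambda c: c[0],
            )
    # Walk back from the bottom-right cell, replaying the stored decisions.
    q_row, s_row = [], []
    i, j = len(query), len(subject)
    while i > 0 or j > 0:
        step = move[j][i]
        if step == "diag":
            q_row.append(query[i - 1])
            s_row.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif step == "left":
            q_row.append(query[i - 1])
            s_row.append("-")
            i -= 1
        else:
            q_row.append("-")
            s_row.append(subject[j - 1])
            j -= 1
    return S[-1][-1], "".join(reversed(q_row)), "".join(reversed(s_row))


print(global_align("TGCTCGTA", "TTCATA"))  # (11, 'TGCTCGTA', 'T--TCATA')
```

The walk does no scoring of its own; it only replays decisions recorded during the fill, which is why the traceback is cheap compared with filling the matrix.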
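The Needleman-Wunsch algorithm is an example of a dynamic programming algorithm
Local alignment recurrence (adds 0 as a fourth option):
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) \\
S(i-1,j) + \gamma \\
S(i,j-1) + \gamma \\
0
\end{cases}
$$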
Smith TF, Waterman MS. Identification of common molecular subsequences.
J Mol Biol. 1981 Mar 25;147(1):195-7.
Global versus local alignments
Global alignment
Optimal alignment along the entire length of two sequences
Compare protein sequences to identify orthologs
Local alignment
Optimal alignment between parts of two sequences
Identify conserved domains within protein sequences
Columns: Query; rows: Subject
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0    0    0    0    0    0    0    0    0
1   T    0
2   T    0
3   C    0
4   A    0
5   T    0
6   A    0
Calculate the possible local alignment scores for the cell at position (1,1)
Neighboring cells: S(0,0) = 0; S(1,0) = 0; S(0,1) = 0
S(1,1) = max {
  Align:          S(0,0) + σ(T,T)
  Gap in subject: S(0,1) + γ
  Gap in query:   S(1,0) + γ
  0
}
Calculate the optimal local alignment score for the cell at position (1,1)
(Match = +5; Mismatch = -2; Gap = -6)
S(1,1) = max {
  Align:          0 + (+5) = 5
  Gap in subject: 0 + (-6) = -6
  Gap in query:   0 + (-6) = -6
  0
}
S(1,1) = 5
Local alignment matrix
(Match = +5; Mismatch = -2; Gap = -6)
Color legend: Align; Gap in subject; Gap in query
Columns: Query; rows: Subject
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0    0    0    0    0    0    0    0    0
1   T    0    5    0    0    5    0    0    5    0
2   T    0    5    3    0    5    3    0    5    3
3   C    0    0    3    8    2   10    4    0    3
4   A    0    0    0    2    6    4    8    2    5
5   T    0    5    0    0    7    4    2   13    7
6   A    0    0    3    0    1    5    2    7   18
Query:   T C G T A
Subject: T C A T A
Traceback:
Columns: Query; rows: Subject
              T    G    C    T    C    G    T    A
         0    0    0    0    0    0    0    0    0
    T    0    5    0    0    5    0    0    5    0
    T    0    5    3    0    5    3    0    5    3
    C    0    0    3    8    2   10    4    0    3
    A    0    0    0    2    6    4    8    2    5
    T    0    5    0    0    7    4    2   13    7
    A    0    0    3    0    1    5    2    7   18
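A local alignment can be sketched with the same fill loop plus two changes: every cell is floored at 0, and the traceback starts at the highest-scoring cell and stops when it reaches a cell scored 0. The code below is an illustration under those assumptions, with the slide's scores, a tie-break that prefers stopping at 0, and the hypothetical function name `local_align`.

```python
def local_align(query: str, subject: str,
                match: int = 5, mismatch: int = -2, gap: int = -6):
    """Return (score, aligned_query, aligned_subject) for the best local alignment."""
    cols, rows = len(query) + 1, len(subject) + 1
    S = [[0] * cols for _ in range(rows)]      # first row and column stay 0
    move = [[None] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for j in range(1, rows):
        for i in range(1, cols):
            sigma = match if query[i - 1] == subject[j - 1] else mismatch
            S[j][i], move[j][i] = max(
                (0, None),                          # start a new alignment here
                (S[j - 1][i - 1] + sigma, "diag"),  # align
                (S[j][i - 1] + gap, "left"),        # gap in subject
                (S[j - 1][i] + gap, "up"),          # gap in query
                key=lambda c: c[0],
            )
            if S[j][i] > best:
                best, best_pos = S[j][i], (i, j)
    # Trace back from the best cell until a cell with no recorded move (score 0).
    q_row, s_row = [], []
    i, j = best_pos
    while move[j][i] is not None:
        step = move[j][i]
        if step == "diag":
            q_row.append(query[i - 1])
            s_row.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif step == "left":
            q_row.append(query[i - 1])
            s_row.append("-")
            i -= 1
        else:
            q_row.append("-")
            s_row.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(q_row)), "".join(reversed(s_row))


print(local_align("TGCTCGTA", "TTCATA"))  # (18, 'TCGTA', 'TCATA')
```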
Techniques to improve the performance of sequence alignment
Time and space complexity: O(MN)
Doubling the length of both sequences leads to a four-fold increase in the time and space required
Korf, I., Yandell, M. and Bedell, J. (2003). The BLAST Algorithm. In BLAST (76-87).
Sebastopol, CA: O’Reilly Media, Inc.
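The four-fold growth can be checked by counting matrix cells. The sketch assumes the matrix has (M+1) x (N+1) cells, as in the worked example with its extra row and column for the initial gap costs; the helper name `matrix_cells` is hypothetical.

```python
def matrix_cells(m: int, n: int) -> int:
    """Number of cells in the alignment matrix for sequence lengths m and n."""
    return (m + 1) * (n + 1)


print(matrix_cells(8, 6))  # the 8-base query and 6-base subject: 63 cells
for n in (100, 1000, 10000):
    ratio = matrix_cells(2 * n, 2 * n) / matrix_cells(n, n)
    print(n, matrix_cells(n, n), round(ratio, 3))
```

The ratio approaches 4 as the sequences grow, which matches the O(MN) estimate for both time and space.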
Number of alignments for two sequences with length N
Stirling’s approximation
Brute-force alignment approach is computationally intractable
Sequence length (N)    # possible alignments
10                     1.87E+05
50                     1.01E+29
100                    9.07E+58
200                    1.03E+119
300                    1.35E+179
400                    1.88E+239
500                    2.70E+299
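The slide's own formulas did not survive extraction, so the following is an inference rather than the author's stated method. The tabulated values are consistent with approximating the central binomial coefficient C(2N, N) = (2N)! / (N!)^2 by Stirling's formula, which gives 2^(2N) / sqrt(πN). The function name `approx_alignments` is hypothetical.

```python
import math


def approx_alignments(n: int) -> float:
    """Stirling-based estimate of C(2n, n), assumed to underlie the table."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (10, 50, 100, 200, 300, 400, 500):
    print(n, f"{approx_alignments(n):.2E}")

# For comparison, the exact central binomial coefficient for N = 10:
print(math.comb(20, 10))  # 184756
```

For N = 10 the estimate (about 1.87E+05) is within roughly 1.3% of the exact value, and the relative error shrinks as N grows.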