Lecture 5 Introduction Dynamic Programming
Dynamic Programming
The sequence alignment problem
Brute-force algorithm
Establish baseline performance and test cases
Identify patterns in the problem space
Apply the brute-force algorithm to a single column of the alignment
[Figure: alignment plot; axes: Query (x), Subject]
Assume the query and subject sequences are the same
Three different ways to reach cell (i,j) in the alignment matrix
[Figure: cell (i,j) and its neighbors (i-1,j-1), (i-1,j) and (i,j-1); axes: Query (x), Subject (y); arrow = alignment]
From (i-1,j-1): align A in query with A in subject
From (i-1,j): gap in subject (A in query opposite -)
From (i,j-1): gap in query (- opposite A in subject)
Construct a scoring system to measure similarity between two sequences
Scoring system for the aligned state: 𝛔
𝛔(a, b) = Score for aligning a in query with b in subject
𝛔(A, A) = Bonus for aligning A in query with A in subject
𝛔(A, T) = Penalty for aligning A in query with T in subject
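The scoring function can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the name `sigma` and its keyword arguments are hypothetical, and the default values (+5 for a match, -2 for a mismatch) are the ones used in the worked example later in this lecture.

```python
# Hypothetical helper: the name `sigma` and its parameters are illustrative.
def sigma(a: str, b: str, match: int = 5, mismatch: int = -2) -> int:
    """Score for aligning residue a (query) with residue b (subject)."""
    return match if a == b else mismatch

print(sigma("A", "A"))  # bonus for aligning A with A
print(sigma("A", "T"))  # penalty for aligning A with T
```

A real protein alignment would typically replace this two-valued function with a substitution matrix lookup, which is outside the scope of this sketch.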
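[Figure: cell (i,j) with incoming arrows from (i-1,j-1), (i-1,j) and (i,j-1); axes: Query (x), Subject (y)]
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) & \text{(align)} \\
S(i-1,j) + \gamma & \text{(gap in subject)} \\
S(i,j-1) + \gamma & \text{(gap in query)}
\end{cases}
$$
Here a is the query residue, b is the subject residue, and 𝛾 is the gap penalty.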
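The three-way choice for a single cell can be written as a small function. This is a sketch under stated assumptions: the name `cell_score` and the argument names `diag`, `left` and `up` are hypothetical, the query is assumed to run along the columns (so S(i-1,j) is the left neighbor and S(i,j-1) is the upper neighbor), and the default scores are those of the worked example later in the lecture.

```python
def cell_score(diag, left, up, a, b, match=5, mismatch=-2, gap=-6):
    """Return (score, decision) for one cell of the alignment matrix.

    diag = S(i-1, j-1), left = S(i-1, j), up = S(i, j-1);
    a is the query residue and b is the subject residue.
    """
    s = match if a == b else mismatch
    candidates = [
        (diag + s, "align"),
        (left + gap, "gap in subject"),
        (up + gap, "gap in query"),
    ]
    # max() keeps the first maximum, so ties favour the earlier entry.
    return max(candidates, key=lambda c: c[0])

# Cell (1,1) of the worked example: all three neighbors are known.
score, decision = cell_score(diag=0, left=-6, up=-6, a="T", b="T")
print(score, decision)
```

Returning the decision together with the score is what later makes traceback possible without recomputing anything.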
Determine the best way to reach cell (i,j) if it were part of the optimal alignment
[Figure: optimal alignment path through cell (i,j); axes: Query, Subject]
S(i,j) = max { Align (a opposite b), Gap in subject (a opposite -), Gap in query (- opposite b) }
Use the maximum score at each cell to eliminate entire branches of suboptimal alignments
Fill in the cells in the first row and first column with the cumulative gap costs
Calculate the maximum score for subsequent cells (i,j)
Keep track of the decision that leads to the maximum score (S)
$$
S(i,j) = \max
\begin{cases}
S(i-1,j-1) + \sigma(a,b) \\
S(i-1,j) + \gamma \\
S(i,j-1) + \gamma
\end{cases}
$$
Needleman SB, Wunsch CD. A general method applicable to the search
for similarities in the amino acid sequence of two proteins. J Mol Biol.
Initialize the alignment matrix
(Match = +5; Mismatch = -2; Gap = -6)
Rows: Subject (y); columns: Query (x)
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1  T    -6
2  T   -12
3  C   -18
4  A   -24
5  T   -30
6  A   -36
Query (Eddy,
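The initialization step can be sketched as follows. The variable names are illustrative, and the list-of-lists layout `S[j][i]` (row j for the subject, column i for the query) is an assumption chosen so that printing the matrix matches the slide layout.

```python
query = "TGCTCGTA"   # runs along the columns
subject = "TTCATA"   # runs along the rows
gap = -6

# S[j][i] holds the slide's S(i,j); cells not yet computed are None.
S = [[None] * (len(query) + 1) for _ in range(len(subject) + 1)]

for i in range(len(query) + 1):
    S[0][i] = i * gap   # first row: i query residues opposite gaps
for j in range(len(subject) + 1):
    S[j][0] = j * gap   # first column: j subject residues opposite gaps

first_row = S[0]
first_col = [row[0] for row in S]
print(first_row)
print(first_col)
```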
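Calculate the possible scores for the cell at position (1,1)
[Figure: cells (0,0) = 0, (1,0) = -6, (0,1) = -6 and target cell (1,1); Query (x): T; Subject (y): T]
$$
S(1,1) = \max
\begin{cases}
S(0,0) + \sigma(T,T) & \text{(align)} \\
S(0,1) + \gamma & \text{(gap in subject)} \\
S(1,0) + \gamma & \text{(gap in query)}
\end{cases}
$$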
Calculate the optimal score for the cell at position (1,1)
(Match = +5; Mismatch = -2; Gap = -6)
$$
S(1,1) = \max
\begin{cases}
0 + (+5) = 5 & \text{(align)} \\
-6 + (-6) = -12 & \text{(gap in subject)} \\
-6 + (-6) = -12 & \text{(gap in query)}
\end{cases}
= 5
$$
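Calculate the possible scores for the cell at position (2,1)
[Figure: cells (1,0) = -6, (2,0) = -12, (1,1) = 5 and target cell (2,1); Query (x): T G; Subject (y): T]
$$
S(2,1) = \max
\begin{cases}
S(1,0) + \sigma(T,G) & \text{(align)} \\
S(1,1) + \gamma & \text{(gap in subject)} \\
S(2,0) + \gamma & \text{(gap in query)}
\end{cases}
$$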
Calculate the optimal score for the cell at position (2,1)
(Match = +5; Mismatch = -2; Gap = -6)
$$
S(2,1) = \max
\begin{cases}
-6 + (-2) = -8 & \text{(align)} \\
5 + (-6) = -1 & \text{(gap in subject)} \\
-12 + (-6) = -18 & \text{(gap in query)}
\end{cases}
= -1
$$
Alignment matrix after two iterations
(Match = +5; Mismatch = -2; Gap = -6)
Rows: Subject (y); columns: Query (x)
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1  T    -6    5   -1
2  T   -12
3  C   -18
4  A   -24
5  T   -30
6  A   -36
Calculate the optimal score for the cell at position (3,1)
(Match = +5; Mismatch = -2; Gap = -6)
$$
S(3,1) = \max
\begin{cases}
-12 + (-2) = -14 & \text{(align)} \\
-1 + (-6) = -7 & \text{(gap in subject)} \\
-18 + (-6) = -24 & \text{(gap in query)}
\end{cases}
= -7
$$
Matrix after three iterations
(Match = +5; Mismatch = -2; Gap = -6)
Rows: Subject (y); columns: Query (x)
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1  T    -6    5   -1   -7
2  T   -12
3  C   -18
4  A   -24
5  T   -30
6  A   -36
Calculate the optimal score for the cell at position (1,2)
(Match = +5; Mismatch = -2; Gap = -6)
$$
S(1,2) = \max
\begin{cases}
-6 + (+5) = -1 & \text{(align)} \\
-12 + (-6) = -18 & \text{(gap in subject)} \\
5 + (-6) = -1 & \text{(gap in query)}
\end{cases}
= -1
$$
Complete alignment matrix
(Match = +5; Mismatch = -2; Gap = -6)
Rows: Subject (y); columns: Query (x)
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0   -6  -12  -18  -24  -30  -36  -42  -48
1  T    -6    5   -1   -7  -13  -19  -25  -31  -37
2  T   -12   -1    3   -3   -2   -8  -14  -20  -26
3  C   -18   -7   -3    8    2    3   -3   -9  -15
4  A   -24  -13   -9    2    6    0    1   -5   -4
5  T   -30  -19  -15   -4    7    4   -2    6    0
6  A   -36  -25  -21  -10    1    5    2    0   11
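The complete matrix can be reproduced with a short fill routine. This is a minimal sketch rather than a reference implementation: the function name `needleman_wunsch_matrix` is hypothetical, and the matrix is stored as `S[j][i]` (row j for the subject, column i for the query) so that the printed rows line up with the slide.

```python
def needleman_wunsch_matrix(query, subject, match=5, mismatch=-2, gap=-6):
    """Fill the global alignment matrix; S[j][i] holds the slide's S(i,j)."""
    n, m = len(query), len(subject)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, n + 1):
        S[0][i] = i * gap              # first row: cumulative gap costs
    for j in range(1, m + 1):
        S[j][0] = j * gap              # first column: cumulative gap costs
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            S[j][i] = max(
                S[j - 1][i - 1] + s,   # align
                S[j][i - 1] + gap,     # gap in subject
                S[j - 1][i] + gap,     # gap in query
            )
    return S

S = needleman_wunsch_matrix("TGCTCGTA", "TTCATA")
for row in S:
    print(" ".join(f"{value:4d}" for value in row))
```

Each cell is computed once from three already-filled neighbors, which is why the fill order (row by row, left to right) matters.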
Use traceback to recover the optimal alignment
Start from the cell within the last row and last column that has the highest score
Recall the step (color) that leads to this optimal score
Report this step in the alignment output
All the alignment decisions have already been made
   C   -18   -7   -3    8    2    3   -3   -9  -15
   A   -24  -13   -9    2    6    0    1   -5   -4
   T   -30  -19  -15   -4    7    4   -2    6    0
   A   -36  -25  -21  -10    1    5    2    0   11
Calculate the optimal score for the cell at position (5,3)
(Match = +5; Mismatch = -2; Gap = -6)
$$
S(5,3) = \max
\begin{cases}
-2 + (+5) = 3 & \text{(align)} \\
2 + (-6) = -4 & \text{(gap in subject)} \\
-8 + (-6) = -14 & \text{(gap in query)}
\end{cases}
= 3
$$
Traceback must follow the steps that produce the optimal cumulative global alignment score
[Figure: cells (4,2) = -2, (5,2) = -8, (4,3) = 2, (5,3) = 3; Query (x): T C; Subject (y): T C]
Query:   T G C T C G T A
Subject: T - - T C A T A
Traceback
Rows: Subject (y); columns: Query (x)
              T    G    C    T    C    G    T    A
         0   -6  -12  -18  -24  -30  -36  -42  -48
   T    -6    5   -1   -7  -13  -19  -25  -31  -37
   T   -12   -1    3   -3   -2   -8  -14  -20  -26
   C   -18   -7   -3    8    2    3   -3   -9  -15
   A   -24  -13   -9    2    6    0    1   -5   -4
   T   -30  -19  -15   -4    7    4   -2    6    0
   A   -36  -25  -21  -10    1    5    2    0   11
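Fill and traceback can be combined in one sketch. The names below (`global_align`, `step`) are hypothetical, the traceback is assumed to start at the bottom-right cell (which holds the global score), and ties are broken in the order align, gap in subject, gap in query. A different tie rule could report a different alignment with the same score.

```python
def global_align(query, subject, match=5, mismatch=-2, gap=-6):
    """Return (score, aligned query, aligned subject) for a global alignment."""
    n, m = len(query), len(subject)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    step = [[None] * (n + 1) for _ in range(m + 1)]   # decision per cell
    for i in range(1, n + 1):
        S[0][i], step[0][i] = i * gap, "gap in subject"
    for j in range(1, m + 1):
        S[j][0], step[j][0] = j * gap, "gap in query"
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            options = [
                (S[j - 1][i - 1] + s, "align"),
                (S[j][i - 1] + gap, "gap in subject"),
                (S[j - 1][i] + gap, "gap in query"),
            ]
            # max() keeps the first maximum, which fixes the tie-breaking order.
            S[j][i], step[j][i] = max(options, key=lambda o: o[0])

    # Traceback: replay the stored decisions from the bottom-right cell.
    i, j = n, m
    q_out, s_out = [], []
    while i > 0 or j > 0:
        if step[j][i] == "align":
            q_out.append(query[i - 1]); s_out.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif step[j][i] == "gap in subject":
            q_out.append(query[i - 1]); s_out.append("-")
            i -= 1
        else:  # gap in query
            q_out.append("-"); s_out.append(subject[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(q_out)), "".join(reversed(s_out))

score, q_aln, s_aln = global_align("TGCTCGTA", "TTCATA")
print(score)
print("Query:  ", q_aln)
print("Subject:", s_aln)
```

Storing one decision per cell costs the same O(MN) space as the score matrix, and the traceback itself takes at most M + N steps.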
The Needleman-Wunsch algorithm is an example of a dynamic programming algorithm
Problem must satisfy two criteria:
Optimal substructure
Optimal solution to the complete problem is composed of optimal solutions to the subproblems
Overlapping subproblems
Re-use the results for the subproblems (e.g., lookup table)
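Both criteria can be seen in a top-down version of the same recurrence. The sketch below is an illustration, not the tabular algorithm from the slides: it swaps the explicit matrix for recursion plus Python's `functools.lru_cache` as the lookup table, and the function name `global_score` is hypothetical.

```python
from functools import lru_cache

def global_score(query, subject, match=5, mismatch=-2, gap=-6):
    """Return (optimal global score, number of distinct subproblems solved)."""

    @lru_cache(maxsize=None)           # the lookup table for subproblem results
    def S(i, j):
        if i == 0:
            return j * gap             # only gaps in the query so far
        if j == 0:
            return i * gap             # only gaps in the subject so far
        s = match if query[i - 1] == subject[j - 1] else mismatch
        # Optimal substructure: the best score for (i, j) is built from
        # the best scores of three smaller subproblems.
        return max(S(i - 1, j - 1) + s, S(i - 1, j) + gap, S(i, j - 1) + gap)

    score = S(len(query), len(subject))
    # Each cache miss is one subproblem that was actually computed.
    return score, S.cache_info().misses

score, solved = global_score("TGCTCGTA", "TTCATA")
print(score, solved)
```

Without the cache, the same (i, j) pairs would be recomputed many times over; with it, each subproblem is solved once, so the work is bounded by the number of matrix cells.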
$$
S(i,j) = \max
\begin{cases}
0 \\
S(i-1,j-1) + \sigma(a,b) \\
S(i-1,j) + \gamma \\
S(i,j-1) + \gamma
\end{cases}
$$
Smith TF, Waterman MS. Identification of common molecular
subsequences.
Global versus local alignments
Global alignment
Optimal alignment along the entire length of two
sequences
Compare protein sequences to identify orthologs
Local alignment
Optimal alignment between parts of two sequences
Identify conserved domains within protein sequences
Rows: Subject (y); columns: Query (x)
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0    0    0    0    0    0    0    0    0
1  T     0
2  T     0
3  C     0
4  A     0
5  T     0
6  A     0
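Calculate the possible local alignment scores for the cell at position (1,1)
[Figure: cells (0,0) = 0, (1,0) = 0, (0,1) = 0 and target cell (1,1); Query (x): T; Subject (y): T]
$$
S(1,1) = \max
\begin{cases}
S(0,0) + \sigma(T,T) & \text{(align)} \\
S(0,1) + \gamma & \text{(gap in subject)} \\
S(1,0) + \gamma & \text{(gap in query)} \\
0
\end{cases}
$$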
Calculate the optimal local alignment score for the cell at position (1,1)
(Match = +5; Mismatch = -2; Gap = -6)
$$
S(1,1) = \max
\begin{cases}
0 + (+5) = 5 & \text{(align)} \\
0 + (-6) = -6 & \text{(gap in subject)} \\
0 + (-6) = -6 & \text{(gap in query)} \\
0
\end{cases}
= 5
$$
Local alignment matrix
(Match = +5; Mismatch = -2; Gap = -6)
Rows: Subject (y); columns: Query (x)
         0    1    2    3    4    5    6    7    8
              T    G    C    T    C    G    T    A
0        0    0    0    0    0    0    0    0    0
1  T     0    5    0    0    5    0    0    5    0
2  T     0    5    3    0    5    3    0    5    3
3  C     0    0    3    8    2   10    4    0    3
4  A     0    0    0    2    6    4    8    2    5
5  T     0    5    0    0    7    4    2   13    7
6  A     0    0    3    0    1    5    2    7   18
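The local alignment matrix differs from the global one in two places: the first row and column stay at zero, and every cell is floored at zero. A minimal sketch, with the hypothetical function name `smith_waterman_matrix` and the same `S[j][i]` layout (row j for the subject, column i for the query):

```python
def smith_waterman_matrix(query, subject, match=5, mismatch=-2, gap=-6):
    """Fill the local alignment matrix; S[j][i] holds the slide's S(i,j)."""
    n, m = len(query), len(subject)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # first row and column stay 0
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            S[j][i] = max(
                0,                     # never let the score drop below zero
                S[j - 1][i - 1] + s,   # align
                S[j][i - 1] + gap,     # gap in subject
                S[j - 1][i] + gap,     # gap in query
            )
    return S

S = smith_waterman_matrix("TGCTCGTA", "TTCATA")
best = max(max(row) for row in S)
for row in S:
    print(" ".join(f"{value:3d}" for value in row))
print("best local score:", best)
```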
Query:   T C G T A
Subject: T C A T A
Traceback
Rows: Subject (y); columns: Query (x)
              T    G    C    T    C    G    T    A
         0    0    0    0    0    0    0    0    0
   T     0    5    0    0    5    0    0    5    0
   T     0    5    3    0    5    3    0    5    3
   C     0    0    3    8    2   10    4    0    3
   A     0    0    0    2    6    4    8    2    5
   T     0    5    0    0    7    4    2   13    7
   A     0    0    3    0    1    5    2    7   18
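Local traceback can be sketched as follows. The assumptions are labeled in the code: the function name `local_align` is hypothetical, the traceback starts at the highest-scoring cell anywhere in the matrix and stops at the first cell with score 0, and instead of storing decisions it re-derives each step from the neighboring scores (preferring align, then gap in subject).

```python
def local_align(query, subject, match=5, mismatch=-2, gap=-6):
    """Return (score, aligned query part, aligned subject part)."""
    n, m = len(query), len(subject)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for j in range(1, m + 1):
        for i in range(1, n + 1):
            s = match if query[i - 1] == subject[j - 1] else mismatch
            S[j][i] = max(0, S[j - 1][i - 1] + s,
                          S[j][i - 1] + gap, S[j - 1][i] + gap)
            if S[j][i] > best:                 # remember the first best cell
                best, best_pos = S[j][i], (i, j)

    # Traceback from the best cell until a zero-score cell is reached.
    i, j = best_pos
    q_out, s_out = [], []
    while i > 0 and j > 0 and S[j][i] > 0:
        s = match if query[i - 1] == subject[j - 1] else mismatch
        if S[j][i] == S[j - 1][i - 1] + s:     # align
            q_out.append(query[i - 1]); s_out.append(subject[j - 1])
            i, j = i - 1, j - 1
        elif S[j][i] == S[j][i - 1] + gap:     # gap in subject
            q_out.append(query[i - 1]); s_out.append("-")
            i -= 1
        else:                                  # gap in query
            q_out.append("-"); s_out.append(subject[j - 1])
            j -= 1
    return best, "".join(reversed(q_out)), "".join(reversed(s_out))

score, q_aln, s_aln = local_align("TGCTCGTA", "TTCATA")
print(score)
print("Query:  ", q_aln)
print("Subject:", s_aln)
```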
Techniques to improve the performance of sequence alignment
Time and space complexity: O(MN)
Doubling the length of both sequences leads to a four-fold increase in the amount of time and space required
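The four-fold growth follows directly from the size of the matrix. A small arithmetic sketch, assuming one cell per pair of prefixes, so (M+1)(N+1) cells in total; the helper name `matrix_cells` is illustrative.

```python
def matrix_cells(m: int, n: int) -> int:
    """Number of cells in the alignment matrix for lengths m and n."""
    return (m + 1) * (n + 1)

for length in (100, 200, 400):
    print(f"M = N = {length}: {matrix_cells(length, length)} cells")

# Doubling both lengths multiplies the cell count by roughly four.
ratio = matrix_cells(200, 200) / matrix_cells(100, 100)
print(f"ratio after doubling: {ratio:.2f}")
```

The ratio is slightly below four only because of the extra row and column for the empty prefix; it approaches four as the sequences get longer.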
Number of alignments for two sequences with length N: $\binom{2N}{N} = \frac{(2N)!}{(N!)^2}$
Stirling's approximation: $\binom{2N}{N} \approx \frac{2^{2N}}{\sqrt{\pi N}}$
Brute force alignment approach is computationally intractable
Sequence length (N)    # possible alignments
10                     1.87E+05
50                     1.01E+29
100                    9.07E+58
200                    1.03E+119
300                    1.35E+179
400                    1.88E+239
500                    2.70E+299
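The table values can be checked numerically. The sketch below assumes the count is the central binomial coefficient C(2N, N) and that the tabulated numbers come from its Stirling approximation 2^(2N) / sqrt(pi N); this assumption reproduces the table, but the formula itself is a reconstruction. The function names are illustrative, and the approximation is evaluated in log10 space to keep large values readable.

```python
import math

def exact_count(n: int) -> int:
    """Exact central binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)

def stirling_count_sci(n: int) -> str:
    """Stirling approximation 2^(2n) / sqrt(pi * n), in scientific notation."""
    log10_value = 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)
    exponent = math.floor(log10_value)
    mantissa = 10 ** (log10_value - exponent)
    return f"{mantissa:.2f}E+{exponent:02d}"

for n in (10, 50, 100, 200, 300, 400, 500):
    print(f"{n:4d}  {stirling_count_sci(n)}")

print("exact count for N = 10:", exact_count(10))
```

For N = 10 the exact count is slightly smaller than the approximation, and the relative error shrinks as N grows.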