Pairwise Sequence Alignment
CS 838
www.cs.wisc.edu/~craven/cs838.html
Mark Craven
craven@biostat.wisc.edu
January 2001
Announcements
• New optional, but recommended, reading on the
course web page: Molecular Biology for Computer
Scientists by Larry Hunter
Pairwise Alignment:
Task Definition
• Given
– a pair of sequences (DNA or protein)
– a method for scoring the similarity of a pair of
characters
• Do
– determine the correspondences between
substrings in the sequences such that the
similarity score is maximized
Motivation
• comparing sequences to gain information
about the structure/function of a query
sequence
• putting together a set of sequenced
fragments (fragment assembly)
• comparing a segment sequenced by two
different labs
The Role of Homology
• homology: similarity due to descent from a
common ancestor
• often we can infer homology from similarity
• thus we can sometimes infer
structure/function from sequence similarity
Homology
• homologous sequences can be divided into
two groups
– orthologous sequences: sequences that differ
because they are found in different species
(e.g. human α-globin and mouse α-globin)
– paralogous sequences: sequences that differ
because of a gene duplication event
(e.g. human α-globin and human β-globin,
various versions of both)
Issues in Sequence Alignment
• the sequences we’re comparing probably differ in
length
• there may be only a relatively small region in the
sequences that matches
• we want to allow partial matches (i.e. some amino
acid pairs are more substitutable than others)
• variable length regions may have been
inserted/deleted from the common ancestral
sequence
Gaps
• sequences may have diverged from a
common ancestor through various types of
mutations:
– substitutions (ACGA → AGGA)
– insertions (ACGA → ACCGA)
– deletions (ACGA → AGA)
• the latter two will result in gaps in
alignments
Insertions/Deletions and
Protein Structure
Example Alignment
GSAQVKGHGKKVADALTNAVAHV---D--DMPNALSALSDLHAHKL
++ ++++H+ KV + +A ++ +L+ L+++H+ K
NNPELQAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKG
Alignments in the Olden Days:
Dot Plots
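  G A C G G A T T A G
G * . . * * . . . . *
A . * . . . * . . * .
T . . . . . . * * . .
C . . * . . . . . . .
G * . . * * . . . . *
G * . . * * . . . . *
A . * . . . * . . * .
A . * . . . * . . * .
T . . . . . . * * . .
A . * . . . * . . * .
G * . . * * . . . . *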
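A dot plot like the one above can be generated with a few lines of code. The following is a minimal sketch, not part of the original slides; the function name `dot_plot` and the `*` / `.` markers are illustrative choices.

```python
def dot_plot(x, y):
    """Return the dot plot of x (rows) against y (columns) as text lines."""
    lines = ["  " + " ".join(y)]  # header row: the characters of y
    for a in x:
        # mark '*' wherever the row character equals the column character
        cells = ["*" if a == b else "." for b in y]
        lines.append(a + " " + " ".join(cells))
    return lines


for line in dot_plot("GATCGGAATAG", "GACGGATTAG"):
    print(line)
```

Runs of matches along a diagonal correspond to regions where the two sequences agree without gaps.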
Types of Alignment
• global: find best match of both sequences in their
entirety
• local: find best subsequence match
• semi-global: find best match without penalizing
gaps on the ends of the alignment
Pairwise Alignment Via Dynamic
Programming
• Needleman & Wunsch, Journal of Molecular
Biology, 1970
• dynamic programming: solve an instance of a
problem by taking advantage of computed
solutions for smaller subparts of the problem
• determine alignment of two sequences by
determining alignment of all prefixes of the
sequences
Linear Gap Penalty Function
• different gap penalty functions require
somewhat different DP algorithms
• the simplest case is when a linear gap
function is used
w(k) = gk
where g is a constant and k is the length of the gap
• we’ll start by considering this case
[Figure: an alignment = alignment of these prefixes + aligning this pair]
Dynamic Programming Idea
• given an n-character sequence x, and an m-
character sequence y
• construct an (n+1) x (m+1) matrix F
• F [ i, j ] = score of the best alignment of
x[1…i ] with y[1…j ]
[Figure: cell F[i, j] is reached from F[i-1, j-1] by adding s(x[i], y[j]), from F[i-1, j] by adding g, or from F[i, j-1] by adding g]
Dynamic Programming Idea
• in extending an alignment, we have 3 choices:
– align x[ 1… i-1] with y[ 1… j-1] and match x[ i ]
with y[ j ]
– align x[1… i ] with y[ 1… j-1 ] and match a gap
with y[ j ]
– align x[ 1…i-1 ] with y[ 1… j ] and match a gap
with x[ i ]
• choose highest scoring choice to fill in F [ i, j ]
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + g \\ F(i, j-1) + g \end{cases}$$
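The recurrence can be turned into a matrix-filling routine almost directly. The sketch below is illustrative rather than the course's own code: the function name and the default scores (`match=1`, `mismatch=-1`, `g=-2`) are assumptions, and the first row and column are set to multiples of g, as a linear gap penalty implies.

```python
def global_score_matrix(x, y, match=1, mismatch=-1, g=-2):
    """Fill the (n+1) x (m+1) matrix F for global alignment, linear gap penalty."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * g  # x[1..i] aligned entirely against gaps
    for j in range(1, m + 1):
        F[0][j] = j * g  # y[1..j] aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,  # match x[i] with y[j]
                F[i - 1][j] + g,      # match x[i] with a gap
                F[i][j - 1] + g,      # match y[j] with a gap
            )
    return F


F = global_score_matrix("AAAC", "AGC")
print(F[-1][-1])  # score of the best global alignment
```

Python strings are 0-indexed, so `x[i - 1]` plays the role of x[i] in the recurrence.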
Initializing Matrix: Global
Alignment with Linear Gap Penalty
        A   G   C
    0   g  2g  3g
A   g
A  2g
A  3g
C  4g
DP Algorithm Sketch
• initialize first row and column of matrix
• fill in rest of matrix from top to bottom, left
to right
• for each F [ i, j ], save pointer(s) to cell(s)
that resulted in best score
• F [n, m] holds the optimal alignment score;
trace pointers back from F [n, m] to F [0, 0]
to recover alignment
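The steps of this sketch can be written out as follows. This is a minimal illustration under stated assumptions, not the definitive algorithm from the course: it stores a single pointer per cell (so it recovers only one optimal alignment, whereas the slide allows several pointers), breaks ties in the order diagonal, up, left, and uses illustrative default scores.

```python
DIAG, UP, LEFT = "diag", "up", "left"


def global_align(x, y, match=1, mismatch=-1, g=-2):
    """Return (score, aligned x, aligned y) for one optimal global alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    # initialize first column and first row
    for i in range(1, n + 1):
        F[i][0], ptr[i][0] = i * g, UP
    for j in range(1, m + 1):
        F[0][j], ptr[0][j] = j * g, LEFT
    # fill the rest top to bottom, left to right, saving one pointer per cell
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, DIAG),
                (F[i - 1][j] + g, UP),
                (F[i][j - 1] + g, LEFT),
            ]
            # max() keeps the first maximal candidate, so ties favour DIAG, then UP
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])
    # trace pointers back from F[n][m] to F[0][0]
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == DIAG:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == UP:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(global_align("AAAC", "AGC"))
```

A different tie-breaking order can yield a different alignment with the same optimal score.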
DP Algorithm Example
• suppose we choose the following scoring scheme:
s(x[i], y[j]) =
+1 when x[i] = y[j]
-1 when x[i] ≠ y[j]
g (penalty for aligning with a gap) = -2
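To make the scoring scheme concrete, here is a small sketch that scores an alignment that has already been written out with gap characters. The helper names (`s`, `alignment_score`, `GAP`) are hypothetical, and the alignment of AAAC with AGC is used only as an illustration.

```python
GAP = "-"


def s(a, b):
    """Substitution score: +1 for identical characters, -1 otherwise."""
    return 1 if a == b else -1


def alignment_score(ax, ay, g=-2):
    """Sum the column scores of two gapped strings of equal length."""
    if len(ax) != len(ay):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    for a, b in zip(ax, ay):
        if a == GAP or b == GAP:
            total += g  # a character aligned with a gap
        else:
            total += s(a, b)
    return total


print(alignment_score("AAAC", "AG-C"))  # +1 -1 -2 +1
```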
DP Algorithm Example
        A   G   C
    0  -2  -4  -6
A  -2   1  -1  -3
A  -4  -1   0  -2
A  -6  -3  -2  -1
C  -8  -5  -4  -1
DP Comments
• works for either DNA or protein sequences,
although the substitution matrices used
differ
• finds an optimal alignment
• the exact algorithm (and computational
complexity) depends on gap penalty
function (we’ll come back to this issue)
• highroad and lowroad alignments show the two
most different optimal alignments
Highroad & Lowroad Alignments
        A   G   C
    0  -2  -4  -6
A  -2   1  -1  -3
A  -4  -1   0  -2
A  -6  -3  -2  -1
C  -8  -5  -4  -1

highroad alignment
x: A A A C
y: A G - C

lowroad alignment
x: A A A C
y: - A G C
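Both alignments can be recovered from the same matrix by changing how ties are broken during traceback. The sketch below assumes that "highroad" means preferring the upward move and "lowroad" means preferring the leftward move, with x on the rows; that reading is inferred from the example above rather than stated on the slide, and the function names are illustrative.

```python
def s(a, b):
    return 1 if a == b else -1


def fill(x, y, g=-2):
    """Global alignment score matrix with a linear gap penalty."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * g
    for j in range(1, m + 1):
        F[0][j] = j * g
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] + g,
                          F[i][j - 1] + g)
    return F


def traceback(F, x, y, preference, g=-2):
    """Walk back from F[n][m], trying moves in the given order of preference."""
    i, j = len(x), len(y)
    ax, ay = [], []
    while i > 0 or j > 0:
        for move in preference:
            if move == "up" and i > 0 and F[i][j] == F[i - 1][j] + g:
                ax.append(x[i - 1])
                ay.append("-")
                i -= 1
                break
            if (move == "diag" and i > 0 and j > 0
                    and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1])):
                ax.append(x[i - 1])
                ay.append(y[j - 1])
                i, j = i - 1, j - 1
                break
            if move == "left" and j > 0 and F[i][j] == F[i][j - 1] + g:
                ax.append("-")
                ay.append(y[j - 1])
                j -= 1
                break
        else:
            raise ValueError("preference must contain 'up', 'diag' and 'left'")
    return "".join(reversed(ax)), "".join(reversed(ay))


x, y = "AAAC", "AGC"
F = fill(x, y)
print(traceback(F, x, y, ("up", "diag", "left")))  # assumed highroad order
print(traceback(F, x, y, ("left", "diag", "up")))  # assumed lowroad order
```

Here the moves are re-derived from the scores instead of being read from stored pointers, which makes it easy to try different preference orders on one matrix.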
• for two sequences of length n, the number of possible alignments is
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
• e.g. two sequences of length 1000 have $\approx 10^{600}$
possible alignments
• but the DP approach finds an optimal alignment
efficiently
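The size of this search space is easy to check numerically. The following sketch (function names are illustrative, and `math.comb` requires Python 3.8 or later) computes the exact binomial coefficient and the base-10 logarithm of the approximation.

```python
import math


def count_alignments(n):
    """Exact value of the binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def log10_estimate(n):
    """log10 of 2^(2n) / sqrt(pi * n), computed in log space to avoid overflow."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


n = 1000
exact = count_alignments(n)
print("decimal digits in exact count:", len(str(exact)))
print("log10 of exact count:         ", math.log10(exact))
print("log10 of approximation:       ", log10_estimate(n))
```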
Computational Complexity
• initialization: O(m), O(n)
• filling in rest of matrix: O(mn)
• traceback: O(m + n)
• hence, if the sequences have nearly the same
length, the computational complexity is
O(n^2)
Local Alignment
• so far we have discussed global alignment,
where we are looking for best match
between sequences from one end to the
other.
• more commonly, we will want a local
alignment, the best match between
subsequences of x and y.
Local Alignment Motivation
• useful for comparing protein sequences that
share a common domain but differ
elsewhere
• useful for comparing against genomic
sequences (long stretches of
uncharacterized sequence)
• more sensitive when comparing highly
diverged sequences
Local Alignment DP Algorithm
• the recurrence relation is slightly different from that of the
global algorithm
$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ F(i-1, j) + g \\ F(i, j-1) + g \\ 0 \end{cases}$$
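In code, the only changes from the global case are the extra 0 in the maximum and a first row and column left at 0. The sketch below is illustrative: the function name and the default scores (+1 match, -1 mismatch, g = -2) are assumptions.

```python
def local_score_matrix(x, y, match=1, mismatch=-1, g=-2):
    """Fill F for local alignment; no cell is allowed to drop below 0."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]  # row 0 and column 0 stay 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(
                F[i - 1][j - 1] + s,
                F[i - 1][j] + g,
                F[i][j - 1] + g,
                0,  # start a new local alignment here
            )
    return F


F = local_score_matrix("TTAAG", "AAGA")
best = max(max(row) for row in F)
print(best)  # score of the best local alignment
```

The best local score is the largest entry anywhere in the matrix, not necessarily the bottom-right cell.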
Local Alignment Example
        A   A   G   A
    0   0   0   0   0
T   0   0   0   0   0
T   0   0   0   0   0
A   0   1   1   0   1
A   0   1   2   0   1
G   0   0   0   3   1

x: A A G
y: A A G
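The alignment above can be recovered by starting the traceback at the highest-scoring cell and stopping at the first cell whose score is 0. This is a minimal sketch under assumptions: the scores +1 / -1 / -2 are the ones that reproduce the matrix above, ties are broken in favour of the diagonal move, and the first maximal cell in row order is used as the starting point.

```python
def local_align(x, y, match=1, mismatch=-1, g=-2):
    """Return (score, aligned x, aligned y) for one best local alignment."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] + g, F[i][j - 1] + g, 0)
            if F[i][j] > best:  # remember the first highest-scoring cell
                best, best_i, best_j = F[i][j], i, j
    # trace back from the best cell until a 0 is reached
    ax, ay = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] + g:
            ax.append(x[i - 1])
            ay.append("-")
            i -= 1
        else:
            ax.append("-")
            ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(local_align("TTAAG", "AAGA"))
```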