Pairwise Alignment 2017
Pairwise Alignment
CG © Ron Shamir
Main source
Why compare sequences?
[Figure: Human hexosaminidase A vs Mouse hexosaminidase A. Source: www.mathworks.com/.../jan04/bio_genome.html]
Sequence Alignment
• The problem: comparing two sequences while allowing certain mismatches between them.
• Main motivation:
– Comparing DNA sequences and proteins from databases:
  • Comparing two or more sequences for similarity
  • Searching databases for related sequences and subsequences
  • Finding informative elements in protein and DNA sequences
Alignment definition
• Input: Two sequences of possibly different lengths
• Goal: Space the sequences so that they have the same
length and no two spaces are matched.
Example 1: S = acbcdb, T = cadbd
  a c - - b c d b
  - c a d b - d -
Example 2: S = acbc, T = acdc
  a c b c
  a c d c
acbcdbacdbb
cadbdadbbb
Alignment scoring: Similarity vs. Difference
Simplest model: Edit Distance
The edit distance between two sequences is the minimum number of edit
operations (single-letter insertion, deletion, and substitution) needed to
transform one sequence into the other.
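The definition above can be computed with a small dynamic program. The following is a minimal sketch, assuming unit cost for each insertion, deletion, and substitution; the function name `edit_distance` is illustrative and not from the slides.

```python
def edit_distance(s, t):
    """Minimum number of single-letter insertions, deletions and
    substitutions turning s into t (unit costs assumed)."""
    # prev[j] = distance between the current prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]  # turning s[:i] into the empty string takes i deletions
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,              # delete a
                           cur[j - 1] + 1,           # insert b
                           prev[j - 1] + (a != b)))  # substitute or keep
        prev = cur
    return prev[-1]


print(edit_distance("acbc", "acdc"))  # one substitution (b -> d)
```

Only two rows are kept at a time, so the sketch uses space proportional to the length of `t`.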
How many alignments are possible?
• Each alignment matches 0 ≤ k ≤ min(n,m) pairs.
• The number of alignments with k matched pairs is $\binom{n}{k}\binom{m}{k}$, so

$$N = \sum_{k=0}^{\min(n,m)} \binom{n}{k}\binom{m}{k} = \binom{n+m}{\min(n,m)}$$
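The count can be checked numerically. The sketch below sums the number of alignments over k and compares it with the closed form; `count_alignments` is a hypothetical helper name used only for this illustration.

```python
from math import comb


def count_alignments(n, m):
    """Sum over k of C(n, k) * C(m, k): the number of alignments of
    sequences of lengths n and m under the counting used above."""
    return sum(comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))


# The sum agrees with the closed form C(n + m, min(n, m)).
for n, m in [(2, 2), (3, 5), (10, 10)]:
    print(n, m, count_alignments(n, m), comb(n + m, min(n, m)))
```

For n = m the count is C(2n, n), which grows roughly like 4^n / sqrt(πn), so enumerating all alignments is infeasible even for short sequences.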
Global Alignment Algorithm
• First dynamic programming solution by Needleman &
Wunsch (1970); improved later by Sankoff (1972).
Notation:
• σ(a,b) : the score (weight) of the alignment of
character a with character b.
• V(i,j) : the optimal score of the alignment of S’=s1…si
and T’=t1…tj (0 ≤ i ≤ n, 0 ≤ j ≤ m)
V(i,j) := optimal score of the alignment
of S’=s1…si and T’=t1…tj (0 ≤ i ≤ n, 0 ≤ j ≤ m)
Backtracking the alignment
Example
x = AGTA, y = ATA
m = 1 (match), s = -1 (mismatch), d = -1 (gap)

F(i,j)   i=0    1    2    3    4
                A    G    T    A
j=0        0   -1   -2   -3   -4
1  A      -1    1    0   -1   -2
2  T      -2    0    0    1    0
3  A      -3   -1   -1    0    2

V(1, 1) = max{V(0,0) + s(A, A), V(0, 1) + d, V(1, 0) + d}
        = max{0 + 1, -1 - 1, -1 - 1} = 1

Optimal alignment (score 2):
AGTA
A-TA
CS262 Lecture 2, Win06, Batzoglou
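The table and backtracking step in this example can be reproduced with a short sketch. It assumes the recurrence used in the V(1,1) computation (diagonal move plus substitution score, or a move from above or from the left plus the gap score d), a single score for all matches and another for all mismatches, and a fixed tie-breaking order during backtracking (diagonal first). The function name and parameters are illustrative.

```python
def global_alignment(x, y, match=1, mismatch=-1, gap=-1):
    """Fill the global alignment table V and backtrack one optimal alignment."""
    n, m = len(x), len(y)

    def score(a, b):
        return match if a == b else mismatch

    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = i * gap
    for j in range(1, m + 1):
        V[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            V[i][j] = max(V[i - 1][j - 1] + score(x[i - 1], y[j - 1]),
                          V[i - 1][j] + gap,
                          V[i][j - 1] + gap)

    # Backtrack from (n, m) to (0, 0), preferring the diagonal on ties.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and V[i][j] == V[i - 1][j - 1] + score(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and V[i][j] == V[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return V[n][m], "".join(reversed(ax)), "".join(reversed(ay))


print(global_alignment("AGTA", "ATA"))
```

Other tie-breaking orders may return a different alignment with the same score.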
[Figure: alignment graph from node (0,0) to node (n,m), with columns labeled λ C T C G C A G C and edge weights σ(-, tj+1), σ(si+1,-) and σ(si+1,tj+1).]
Alignment Graph
Definition: The alignment graph of sequences S=s1…sn
and T=t1…tm, is a directed graph G=(V,E) on
(n+1)x(m+1) nodes, each labeled with a distinct pair
(i,j) (0≤i≤n, 0≤j≤m), with the following weighted
edges:
• ((i,j), (i+1,j)) with weight σ(si+1,-)
• ((i,j), (i,j+1)) with weight σ(-, tj+1)
• ((i,j), (i+1,j+1)) with weight σ(si+1,tj+1)
Note: a path from node (0,0) to node (n,m)
corresponds to an alignment and its total weight is
the alignment score.
Goal: find an optimal path from node (0,0) to node
(n,m)
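The correspondence between alignments and paths can be made concrete. In this sketch an alignment is given as two gapped rows of equal length using `-` for a space; `alignment_path` lists the nodes visited and `path_weight` sums the edge weights. Both names, and the unit scoring function in the usage lines, are assumptions made for illustration.

```python
def alignment_path(row_s, row_t):
    """Nodes (i, j) of the alignment graph visited by an alignment."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row_s, row_t):
        i += a != "-"   # a letter of S is consumed: move down
        j += b != "-"   # a letter of T is consumed: move right
        path.append((i, j))
    return path


def path_weight(row_s, row_t, sigma):
    """Total weight of the path: one edge weight per alignment column."""
    return sum(sigma(a, b) for a, b in zip(row_s, row_t))


def unit(a, b):
    # assumed scoring: +1 for a match, -1 for a mismatch or a space
    return 1 if a == b else -1


print(alignment_path("AGTA", "A-TA"))
print(path_weight("AGTA", "A-TA", unit))
```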
Complexity
• Time: O(mn) (proportional to |E|)
• Space to find opt alignment: O(mn)
(proportional to |V|)
• Space is often the bottleneck!
• Space to find opt alignment value
only: O(m+n). Why?
• Can we improve space complexity for
finding opt alignment?
Warm-up questions
How do we efficiently compute the opt
alignment scores of S to each prefix
t1….tk of T?
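One possible answer, sketched under the assumption of a single space score and match/mismatch scoring: row i of the table depends only on row i-1, so keeping two rows suffices for the optimal value, and the final row already holds the scores of S against every prefix of T. The name `prefix_scores` is illustrative.

```python
def prefix_scores(s, t, match=1, mismatch=-1, space=-1):
    """Return [V(n, 0), ..., V(n, m)]: the optimal global scores of s
    against each prefix of t, using only two rows of the table."""
    prev = [k * space for k in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space]
        for j in range(1, len(t) + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,   # align s_i with t_j
                           prev[j] + space,     # s_i against a space
                           cur[j - 1] + space)) # t_j against a space
        prev = cur
    return prev


print(prefix_scores("AGTA", "ATA"))
```

The last entry of the returned row is the optimal global alignment value.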
Reducing Space Complexity
V*(n-i,m-j) = opt alignment value of si+1…sn and tj+1…tm
Lemma: $V(n,m) = \max_{0 \le k \le m} \left\{ V\left(\tfrac{n}{2}, k\right) + V^{*}\left(\tfrac{n}{2}, m-k\right) \right\}$
Linear-space Alignments
mn + ½ mn + ¼ mn + 1/8 mn + 1/16 mn + … = 2 mn
Fernandez-Baca & Dobbs, http://www.cs.iastate.edu/~cs544/
Hirschberg Alg in more detail
k* - position k maximizing
V(n/2,k)+V*(n/2,m-k)
Proved: ∃ opt path L through (n/2,k*)
Def: Ln/2 – subpath of L that
• starts with the last node in L in row
n/2-1 and
• ends with the first node in L in row
n/2+1
[Figure: DP matrix showing the subpath Ln/2 around row n/2; labeled rows n/2-1, n/2, n/2+1, n and columns k1, k*, k2, m.]
Lemma: k* can be found in O(mn) time and O(m) space. Ln/2 can be found and
stored in the same bounds.
• Run DP up to row n/2, getting values V(n/2, i) for all i and back pointers
  for row n/2.
• Run DP backwards up to row n/2, getting values V*(n/2, i) for all i and
  forward pointers for row n/2.
  (these two steps: O(mn) time, O(m) space)
• Compute V(n/2,i)+V*(n/2,m-i) for each i, get maximizing index k*.
  (O(m) time, space)
• Use back pointers to compute subpath from (n/2,k*) to last node in row n/2-1.
• Use forward pointers to compute subpath from (n/2,k*) to first node in row
  n/2+1.
  (these two steps: O(m) time, space, incl. storage)
Full Alg and Analysis
• Assume time to fill a p by q DP matrix : cpq
• time to compute rows V(n/2,.), V*(n/2,.): cmn
• time cmn, space O(m) to find k*, k1 , k2, Ln/2
• Recursively solve top subproblem of size ≤ nk*/2,
bottom subproblem of size ≤ n(m-k*)/2
• Time for top level cmn, 2nd level cmn/2
• Time for all i-th level computations: cmn/2^(i-1) (each subproblem has
n/2^i rows, and the columns of all subproblems are distinct)
• Total time: cmn(1 + 1/2 + 1/4 + …) ≤ 2cmn
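The divide-and-conquer scheme can be sketched in a few lines. This is a simplified variant, not the exact procedure on the slides: it splits exactly at node (n/2, k*) and recurses on the two halves, without extracting the subpath Ln/2 or the columns k1 and k2. It also assumes one score `space` for every space; `last_row`, `hirschberg` and `sigma` are illustrative names.

```python
def last_row(s, t, sigma, space):
    """V(len(s), k) for k = 0..len(t), keeping one row at a time."""
    prev = [k * space for k in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * space]
        for j in range(1, len(t) + 1):
            cur.append(max(prev[j - 1] + sigma(s[i - 1], t[j - 1]),
                           prev[j] + space,
                           cur[j - 1] + space))
        prev = cur
    return prev


def hirschberg(s, t, sigma, space):
    """Return one optimal global alignment of s and t as two gapped rows."""
    if not s:
        return "-" * len(t), t
    if not t:
        return s, "-" * len(s)
    if len(s) == 1:
        # Single row: pair s[0] with its best partner in t, or with a space.
        k = max(range(len(t)), key=lambda q: sigma(s, t[q]))
        if sigma(s, t[k]) + (len(t) - 1) * space >= (len(t) + 1) * space:
            return "-" * k + s + "-" * (len(t) - k - 1), t
        return s + "-" * len(t), "-" + t
    mid = len(s) // 2
    fwd = last_row(s[:mid], t, sigma, space)              # V(n/2, k)
    bwd = last_row(s[mid:][::-1], t[::-1], sigma, space)  # V*(n/2, m-k)
    k = max(range(len(t) + 1), key=lambda q: fwd[q] + bwd[len(t) - q])
    top_s, top_t = hirschberg(s[:mid], t[:k], sigma, space)
    bot_s, bot_t = hirschberg(s[mid:], t[k:], sigma, space)
    return top_s + bot_s, top_t + bot_t


def unit(a, b):
    # assumed scoring: +1 for a match, -1 for a mismatch
    return 1 if a == b else -1


print(hirschberg("AGTA", "ATA", unit, -1))
```

The slicing in this sketch copies substrings for brevity; an index-based version would be needed to keep the working space at O(m) beyond the output itself.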
Dan Hirschberg
Daniel S. Hirschberg is a full
professor in Computer
Science at University of
California, Irvine. His
research interests are in the
theory of design and analysis
of algorithms.
Sequence assembly
End-Space Free Alignment (2)
• solution: similar to global alignment alg:
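The formulas for this slide did not survive extraction. The sketch below assumes the usual formulation of end-space free alignment, which may differ in detail from the original slide: leading spaces cost nothing, so V(i,0) = V(0,j) = 0, and trailing spaces cost nothing, so the answer is the maximum over the last row and the last column. Names and default scores are illustrative.

```python
def end_space_free_score(s, t, match=1, mismatch=-1, space=-1):
    """Best alignment score when spaces at either end of either
    sequence are not penalised (assumed standard formulation)."""
    m = len(t)
    prev = [0] * (m + 1)          # V(0, j) = 0: free leading spaces
    best = prev[m]                # running max over the last column
    for i in range(1, len(s) + 1):
        cur = [0]                 # V(i, 0) = 0: free leading spaces
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            cur.append(max(prev[j - 1] + sub,
                           prev[j] + space,
                           cur[j - 1] + space))
        best = max(best, cur[m])
        prev = cur
    return max(best, max(prev))   # also consider the whole last row


# A suffix of the first string overlaps a prefix of the second,
# as happens when assembling overlapping reads.
print(end_space_free_score("xxxabc", "abcyyy"))
```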
Local Alignment
Definition: Given sequences S, T, find substrings (contiguous subsequences) α
of S and β of T of maximum similarity (i.e., with the highest-scoring
optimal global alignment between α and β).
Motivation:
• ignore stretches of non-coding DNA
• protein domains (functional subunits)
Example:
• S=abcxdex, T=xxxcded
• Similarity score: 2 per match, -1 for subs/indel
• α=cxde and β=c-de have optimal alignment score (2 - 1 + 2 + 2 = 5):
    a b c x d e x
  x x x c - d e d
Local alignments in the
alignment graph
Computing Local Alignment
The local suffix alignment problem for S’, T’: find a
(possibly empty) suffix α of S’=s1…si and a
(possibly empty) suffix β of T’=t1…tj such that the
value of their alignment is maximum over all values
of alignments of suffixes of S’ and T’.
Computing Local Alignment (2)
λ C T C G C A G C
λ 0 0 0 0 0 0 0 0 0
C 0 1 0 1 0 1 0 0 1
A 0 0 0 0 0 0 2 0 0
T 0 0 1 0 0 0 0 1 0
T 0 0 1 0 0 0 0 0 0
C 0 1 0 2 0 1 0 0 1
A 0 0 0 0 1 0 2 0 0
C 0 1 0 1 0 2 0 1 1
+1 for a match, -1 for a mismatch, -5 for a space
Fernandez-Baca & Dobbs, http://www.cs.iastate.edu/~cs544/
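The table above can be reproduced with a short sketch. It assumes the standard local alignment recurrence, in which each entry is the maximum of 0 and the three global-alignment options, with all border entries equal to 0. The function name and parameter defaults are illustrative; the defaults follow the scores stated under the table.

```python
def local_alignment_table(s, t, match=1, mismatch=-1, space=-5):
    """Table of best local suffix alignment values; rows index s, columns t."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,                     # empty suffixes
                          V[i - 1][j - 1] + sub,
                          V[i - 1][j] + space,
                          V[i][j - 1] + space)
    return V


V = local_alignment_table("CATTCAC", "CTCGCAGC")
for row in V:
    print(*row)
print("optimal local score:", max(max(row) for row in V))
```

The optimal local alignment value is the largest entry in the table, and the alignment ends at the cell holding it.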
Computing Local Alignment (3)
• Time O(nm)
• Space O(n+m): the optimum value and the end points of α and β can be
  found in linear space
• Finding the starting points of α and β can also be done in linear space
  (exercise)
• The actual alignment can be computed using Hirschberg’s algorithm
• Smith & Waterman (1981)
Gap Penalties
• Observation: spaces tend to occur in batches.
• Idea: when scoring an alignment, use the number of gaps (contiguous runs
  of spaces) and not the number of spaces.
• Definitions:
– A gap is any maximal run of consecutive spaces in a single sequence of a
  given alignment.
– The length of a gap is the number of spaces in it.
– The number of gaps in the alignment is denoted by #gaps.
• Example:
  S= attc--ga-tggacc
  T= a--cgtgatt---cc
– 4 gaps, 8 spaces, 7 matches, 0 mismatches.
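The counts in the example can be checked mechanically. In this sketch an alignment is given as two gapped rows using `-` for a space; `gap_stats` is a hypothetical helper name.

```python
def gap_stats(row_s, row_t):
    """Return (#gaps, #spaces, #matches, #mismatches) of an alignment."""
    assert len(row_s) == len(row_t)
    gaps = spaces = matches = mismatches = 0
    for row in (row_s, row_t):
        prev = ""
        for c in row:
            if c == "-":
                spaces += 1
                if prev != "-":   # a new maximal run of spaces starts here
                    gaps += 1
            prev = c
    for a, b in zip(row_s, row_t):
        if a != "-" and b != "-":
            if a == b:
                matches += 1
            else:
                mismatches += 1
    return gaps, spaces, matches, mismatches


print(gap_stats("attc--ga-tggacc", "a--cgtgatt---cc"))
```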
Gap Penalty Models
Motivation:
– Indels of an entire subsequence in a single mutation.
– When comparing cDNA to DNA, introns are gapped.
Alignment types:
• E(i, j): alignment ending with a space in S opposite tj
    S.....i------
    T..............j
• F(i, j): alignment ending with si opposite a space in T
    S...............i
    T.....j-------
Base Conditions:
V(i, 0) = F(i, 0) = Wg + iWs
V(0, j) = E(0, j) = Wg + jWs
Recursive Computation:
V(i, j) = max{ E(i, j), F(i, j), G(i, j) }
where:
• G(i, j) = V(i-1, j-1) + σ(si, tj)
• E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws }
• F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws }
• Time complexity O(nm) - compute 4 matrices instead of one.
• Space complexity O(nm) - saving 4 matrices (trivial implementation).
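The four-matrix recurrence can be sketched directly. Several points here are assumptions rather than statements from the slides: Wg and Ws are taken to be non-positive weights that are added to the score (as the plus signs in the base conditions suggest), V(0,0) = 0, and entries not covered by the base conditions are treated as minus infinity. The function name and the scoring function in the usage lines are illustrative.

```python
NEG_INF = float("-inf")


def affine_gap_score(s, t, sigma, wg, ws):
    """Optimal global score where a gap of length q contributes wg + q * ws."""
    n, m = len(s), len(t)

    def table():
        return [[NEG_INF] * (m + 1) for _ in range(n + 1)]

    V, E, F, G = table(), table(), table(), table()
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = wg + i * ws
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = wg + j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            G[i][j] = V[i - 1][j - 1] + sigma(s[i - 1], t[j - 1])
            E[i][j] = max(E[i][j - 1] + ws,        # extend a gap in S
                          G[i][j - 1] + wg + ws,   # open a gap in S
                          F[i][j - 1] + wg + ws)
            F[i][j] = max(F[i - 1][j] + ws,        # extend a gap in T
                          G[i - 1][j] + wg + ws,   # open a gap in T
                          E[i - 1][j] + wg + ws)
            V[i][j] = max(E[i][j], F[i][j], G[i][j])
    return V[n][m]


def unit(a, b):
    # assumed scoring: +1 for a match, -1 for a mismatch
    return 1 if a == b else -1


# acgt / a--t: two matches and one gap of length 2
print(affine_gap_score("acgt", "at", unit, wg=-2, ws=-1))
```

This stores all four matrices, matching the trivial O(nm)-space implementation mentioned above.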
Other Gap Penalty Models:
Convex Gap Penalty Model:
• Each additional space in a gap contributes less to
the gap weight.
• Example: Wg + log(q), where q is the length of the
gap.
• solvable in O(nm log m) time