Pairwise Alignment 2017

pairwise sequence alignment Bioinformatics


Lecture 2:

Pairwise Alignment

1
CG © Ron Shamir
Main source

Why compare sequences?
[Figure: Human hexosaminidase A vs. mouse hexosaminidase A. Source: www.mathworks.com/.../jan04/bio_genome.html]
Sequence Alignment (Hebrew: עימוד רצפים)
• The problem: comparing two sequences while allowing certain mismatches between them.
• Main motivation:
– Comparing DNA sequences and proteins from databases:
• Comparing two or more sequences for similarity
• Searching databases for related sequences and subsequences
• Finding informative elements in protein and DNA sequences
Alignment definition
• Input: Two sequences of possibly different lengths
• Goal: Space the sequences so that they have the same length and no two spaces are matched.

acbcdb    a c - - b c d b
cadbd     - c a d b - d -

acbc      a c b c
acdc      a c d c

acbcdbacdbb
cadbdadbbb
Alignment scoring: Similarity vs. Difference

• Resemblance of DNA sequences of different organisms is explained by common ancestral origin
• Differences are explained by mutation:
– Insertion (insertions and deletions together: “indel”)
– Deletion
– Substitution
• Distance between two sequences is the minimum (weighted) sum of mutations transforming one into the other.
• Similarity of two sequences is the maximum (weighted) sum of resemblances between them.
Nomenclature

Computer Science                 Biology
– String, word                   – Sequence
– Substring (contiguous)         – Subsequence (a contiguous segment of a sequence)
– Subsequence (non-contiguous)   – N/A
– Exact matching                 – N/A
– Inexact matching               – Alignment

We shall use the biology nomenclature
Simplest model: Edit Distance
The edit distance between two sequences is the min no. of edit operations (single letter insertion, deletion and substitution) needed to transform one sequence into the other.

Example: ACCTGA and AGCTTA

3 operations: ACCTGA → AGCCTGA → AGCTTGA → AGCTTA, i.e. the alignment (“_” is a space)
A_CCTGA
AGCTT_A

2 operations: the alignment
ACCTGA
AGCTTA

dist=2
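The definition above can be checked with a short dynamic program. The following is a minimal sketch, assuming unit cost for every insertion, deletion and substitution; the function name `edit_distance` is illustrative and not part of the lecture.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of single-letter insertions, deletions and substitutions."""
    m = len(t)
    # prev[j] = distance between the prefix of s processed so far and t[:j]
    prev = list(range(m + 1))
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            substitution = prev[j - 1] + (s[i - 1] != t[j - 1])
            deletion = prev[j] + 1       # drop s[i-1]
            insertion = curr[j - 1] + 1  # insert t[j-1]
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[m]


print(edit_distance("ACCTGA", "AGCTTA"))  # 2, as on the slide
```

Only two rows of the table are kept, which is enough when just the distance (and not the edit script) is needed.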
Alignment
SEQ 1 GTAGTACAGCT-CAGTTGGGATCACAGGCTTCT
      |||| || ||| ||||||   ||||||  |||
SEQ 2 GTAGAACGGCTTCAGTTG---TCACAGCGTTC-

Each column is a match (|), a substitution, an insertion or a deletion.

• 24 matches
• Subs: T/A, A/G, G/C, C/G
• Indels: -/T, G/-, G/-, A/-, T/-
• Distance I: match 0, subs 1, indel 2 ⇒ dist = 14
Alignment
SEQ 1 GTAGTACAGCT-CAGTTGGGATCACAGGCTTCT
      |||| || ||| ||||||   ||||||  |||
SEQ 2 GTAGAACGGCTTCAGTTG---TCACAGCGTTC-
• 24 matches; Subs: T/A, A/G, G/C, C/G; Indels: -/T, G/-, G/-, A/-, T/-
• Distance II: match 0, d(A,T)=d(G,C)=1, d(A,G)=1.5, indel 2 ⇒ dist = 14.5
• Similarity I: match 1, subs 0, indel –1.5 ⇒ similarity = 16.5
• General setup: substitution matrix S(i,j), indel S(i,-). Usually symmetric; aligning ‘-’ to ‘-’ is not allowed.
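The three scores quoted for this alignment can be reproduced by summing column scores. The sketch below is illustrative only: the helper names are made up here, and Distance II is assumed to charge 1 for any substitution other than A/G, since the slide lists only the pairs that occur.

```python
def score_alignment(row1, row2, pair_score, indel_score):
    """Sum the column scores of an alignment given as two equal-length gapped rows."""
    assert len(row1) == len(row2)
    total = 0.0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("a space may not be aligned to a space")
        if a == "-" or b == "-":
            total += indel_score
        else:
            total += pair_score(a, b)
    return total


SEQ1 = "GTAGTACAGCT-CAGTTGGGATCACAGGCTTCT"
SEQ2 = "GTAGAACGGCTTCAGTTG---TCACAGCGTTC-"


def distance_1(a, b):
    return 0 if a == b else 1


def distance_2(a, b):
    if a == b:
        return 0
    # Assumption: pairs not listed on the slide cost 1.
    return 1.5 if {a, b} == {"A", "G"} else 1


def similarity_1(a, b):
    return 1 if a == b else 0


print(score_alignment(SEQ1, SEQ2, distance_1, 2))       # 14.0
print(score_alignment(SEQ1, SEQ2, distance_2, 2))       # 14.5
print(score_alignment(SEQ1, SEQ2, similarity_1, -1.5))  # 16.5
```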
Models for Alignment
Problem: Global Alignment
Input: Two sequences S, T of roughly the same
length
Goal: Find an optimal alignment between them

Problem: Local Alignment


Input: Two sequences S, T
Goal: Find a subsequence of S and a
subsequence of T with optimal alignment
(most similar subsequences).
Models for Alignment (2)
Problem: Ends free alignment (End space-free alignment)
Input: Two sequences S, T
Question: Find an optimal alignment between subsequences
of S and T where at least one of these subsequences is a
prefix of the original sequence and one (not necessarily
the other) is a suffix.
gap: max contiguous run of spaces in a single sequence within a given alignment

Problem: Alignment with Gap Penalty
Input: Two sequences S, T
Question: Find an optimal alignment between them, given a gap penalty function.
(The gap penalty function measures the cost of a gap as a (nonlinear) function of its length.)
Global alignment

[Figure: Human hexosaminidase A vs. mouse hexosaminidase A. Source: www.mathworks.com/.../jan04/bio_genome.html]

Global Alignment Problem:
Input: Two sequences S=s1…sn, T=t1…tm (n~m)
Goal: Find an optimal (max. similarity) alignment under a given scoring function.
How many alignments are possible?
• Each alignment matches 0 ≤ k ≤ min(n,m) pairs.
• The number of alignments with k matched pairs is $\binom{n}{k}\binom{m}{k}$.
• Hence the total number of alignments is

$$N = \sum_{k=0}^{\min(n,m)} \binom{n}{k}\binom{m}{k} = \binom{n+m}{\min(n,m)}$$
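The closed form can be sanity-checked numerically. This is a small sketch using only the standard library; `count_alignments` is an illustrative name, and it counts alignments the way the slide does, by the set of matched pairs.

```python
from math import comb


def count_alignments(n: int, m: int) -> int:
    """Sum over k of (n choose k) * (m choose k), k = number of matched pairs."""
    return sum(comb(n, k) * comb(m, k) for k in range(min(n, m) + 1))


# The sum agrees with the closed form for small n, m.
for n in range(8):
    for m in range(8):
        assert count_alignments(n, m) == comb(n + m, min(n, m))

print(count_alignments(2, 2))  # 6
```

The count grows exponentially in the sequence lengths, which is why enumerating alignments is not an option and dynamic programming is used instead.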
Global Alignment Algorithm
• First dynamic programming solution by Needleman & Wunsch (1970); improved later by Sankoff (1972).

Notation:
• σ(a,b) : the score (weight) of the alignment of character a with character b.
• V(i,j) : the optimal score of the alignment of S’=s1…si and T’=t1…tj (0 ≤ i ≤ n, 0 ≤ j ≤ m)
V(i,j) := optimal score of the alignment of S’=s1…si and T’=t1…tj (0 ≤ i ≤ n, 0 ≤ j ≤ m)

Lemma: V(i,j) has the following properties:
• Base conditions (alignment with 0 elements ≡ all spaces):
– V(i,0) = Σk=1..i σ(sk,-)
– V(0,j) = Σk=1..j σ(-,tk)
• Recurrence relation: ∀ 1≤i≤n, 1≤j≤m:
V(i,j) = max of
– V(i-1,j-1) + σ(si,tj)   (S’=s1...si-1 with T’=t1...tj-1, and si with tj)
– V(i-1,j) + σ(si,-)      (S’=s1...si-1 with T’=t1...tj, and si with ‘-’)
– V(i,j-1) + σ(-,tj)      (S’=s1...si with T’=t1...tj-1, and ‘-’ with tj)
Optimal Alignment - Tabular Computation
• Use dynamic programming to compute V(i,j) for all possible i,j values:
for i=1 to n do
begin
    for j=1 to m do
    begin
        Calculate V(i,j) using V(i-1,j-1), V(i,j-1), V(i-1,j)
    end
end

[Figure: snapshot of computing the table. Costs: match 2, mismatch/indel -1]
Optimal Alignment - Tabular Computation
• Add back pointer(s) from cell (i,j) to father cell(s) realising V(i,j).
• Trace back the pointers from (n,m) to (0,0)
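The table fill and traceback can be sketched in a few lines. This is an illustrative implementation, not the lecture's own code: it assumes a simple match/mismatch/indel scheme, and instead of storing back pointers it re-derives them during traceback by checking which predecessor realises V(i,j), preferring the diagonal when several do.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, indel=-1):
    """Return (optimal global score, gapped s, gapped t)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        V[i][0] = V[i - 1][0] + indel
    for j in range(1, m + 1):
        V[0][j] = V[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(V[i - 1][j - 1] + sigma,
                          V[i - 1][j] + indel,
                          V[i][j - 1] + indel)

    # Trace back from (n, m) to (0, 0).
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            if V[i][j] == V[i - 1][j - 1] + sigma:
                top.append(s[i - 1]); bottom.append(t[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and V[i][j] == V[i - 1][j] + indel:
            top.append(s[i - 1]); bottom.append("-")
            i -= 1
        else:
            top.append("-"); bottom.append(t[j - 1])
            j -= 1
    return V[n][m], "".join(reversed(top)), "".join(reversed(bottom))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A-TA')
```

When several predecessors tie, this sketch reports only one optimal alignment; following every tie would enumerate all of them.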
Backtracking the alignment
Example

x = AGTA    m = 1 (match)
y = ATA     s = -1 (mismatch)
            d = -1 (indel)

F(i,j)   i=0    1    2    3    4
                A    G    T    A
j=0       0   -1   -2   -3   -4
1   A    -1    1    0   -1   -2
2   T    -2    0    0    1    0
3   A    -3   -1   -1    0    2

V(1,1) = max{V(0,0) + s(A,A), V(0,1) + d, V(1,0) + d}
       = max{0 + 1, -1 – 1, -1 – 1} = 1

Resulting alignment:
AGTA
A-TA
CS262 Lecture 2, Win06, Batzoglou
        λ    C    T    C    G    C    A    G    C
λ       0   -5  -10  -15  -20  -25  -30  -35  -40
C      -5   10    5
A     -10
T     -15
T     -20
C     -25
A     -30
C     -35
+10 for match, -2 for mismatch, -5 for space
Source: Fernandez-Baca & Dobbs, http://www.cs.iastate.edu/~cs544/
        λ    C    T    C    G    C    A    G    C
λ       0   -5  -10  -15  -20  -25  -30  -35  -40
C      -5   10    5    0   -5  -10  -15  -20  -25
A     -10    5    8    3   -2   -7    0   -5  -10
T     -15    0   15   10*   5    0   -5   -2   -7
T     -20   -5   10   13*   8    3   -2   -7   -4
C     -25  -10    5   20   15   18   13    8    3
A     -30  -15    0   15   18   13   28   23   18
C     -35  -20   -5   10   13   28   23   26   33

Traceback can yield both optimum alignments (* marks the cells where the two optimal paths differ)
Source: Fernandez-Baca & Dobbs, http://www.cs.iastate.edu/~cs544/
Alignment Graph
[Figure: grid graph from node (0,0) to node (n,m); horizontal edges weighted σ(-, tj+1), vertical edges σ(si+1,-), diagonal edges σ(si+1,tj+1)]
Alignment Graph
Definition: The alignment graph of sequences S=s1…sn
and T=t1…tm, is a directed graph G=(V,E) on
(n+1)x(m+1) nodes, each labeled with a distinct pair
(i,j) (0≤i≤n, 0≤j≤m), with the following weighted
edges:
• ((i,j), (i+1,j)) with weight σ(si+1,-)
• ((i,j), (i,j+1)) with weight σ(-, tj+1)
• ((i,j), (i+1,j+1)) with weight σ(si+1,tj+1)
Note: a path from node (0,0) to node (n,m)
corresponds to an alignment and its total weight is
the alignment score.
Goal: find an optimal path from node (0,0) to node
(n,m)

Complexity
• Time: O(mn) (proportional to |E|)
• Space to find opt alignment: O(mn)
(proportional to |V|)
• Space is often the bottleneck!
• Space to find opt alignment value
only: O(m+n). Why?
• Can we improve space complexity for
finding opt alignment?
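One way to see why the optimal value needs only linear space: each row of V depends only on the row above it. The sketch below (illustrative names, simple match/mismatch/indel scoring assumed) keeps two rows and returns the last one, whose final entry is V(n,m).

```python
def last_row_scores(s, t, match=1, mismatch=-1, indel=-1):
    """Return the row V(n, 0..m) of the global alignment table using O(m) space."""
    prev = [j * indel for j in range(len(t) + 1)]  # row 0
    for i in range(1, len(s) + 1):
        curr = [i * indel]                          # V(i, 0)
        for j in range(1, len(t) + 1):
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sigma,    # si with tj
                            prev[j] + indel,        # si with a space
                            curr[j - 1] + indel))   # tj with a space
        prev = curr
    return prev


print(last_row_scores("AGTA", "ATA"))  # [-4, -2, 0, 2]
```

Entry k of the returned row is the optimal score of S against the prefix t1…tk, so a single pass yields the scores against every prefix of T.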

Warm-up questions
How do we efficiently compute the opt alignment scores of S to each prefix t1…tk of T?

How do we efficiently compute the opt alignment score of the sequence suffixes si+1…sn and tj+1…tm?
Reducing Space Complexity
V*(n-i,m-j) = opt alignment value of si+1…sn and tj+1….tm

Lemma:

$$V(n,m) = \max_{0 \le k \le m} \left[ V\!\left(\tfrac{n}{2}, k\right) + V^{*}\!\left(\tfrac{n}{2}, m-k\right) \right]$$

Pf: max{...} ≤ V(n,m):
• ∀ position k’ in T, ∃ alignment of S and T consisting of:
– an opt alignment of s1...sn/2 and t1...tk’ and
– a disjoint opt alignment of sn/2+1...sn and tk’+1...tm.
Proof (contd)
• max{...} ≥ V(n,m) :
• For an opt. alignment of S and T, let k’
be the rightmost position in T that is
aligned with a character at or before
position n/2 in S. Then the optimal
alignment of S and T consists of:
– an alignment of s1...sn/2 and t1...tk’
and
– a disjoint alignment of sn/2 +1...sn and
tk’+1...tm.
‘Divide & Conquer’ Alg (Hirschberg ’75)
• Compute opt cost of all paths from start to any point at centerline
• Compute opt cost of back paths from end to any pt at centerline
• Pick centerline pt with opt sum of the two costs
• Continue recursively on the subproblems
[Figure: DP matrix with rows n/4, n/2 and 3n/4 marked]
Linear-space Alignments

mn + ½ mn + ¼ mn + 1/8 mn + 1/16 mn + … = 2 mn
Source: Fernandez-Baca & Dobbs, http://www.cs.iastate.edu/~cs544/
Hirschberg Alg in more detail
k* - position k maximizing
V(n/2,k)+V*(n/2,m-k)
Proved: ∃ opt path L through (n/2,k*)
Def: Ln/2 – subpath of L that
• starts with the last node in L in row
n/2-1 and
• ends with the first node in L in row
n/2+1
[Figure: subpath Ln/2 in the DP matrix with columns 0..m; it leaves row n/2-1 at column k1, passes through (n/2, k*), and enters row n/2+1 at column k2]
Lemma: k* can be found in O(mn) time and O(m) space. Ln/2 can be found and stored in the same bounds.
• Run DP up to row n/2, getting values V(n/2, i) for all i and back pointers for row n/2. [O(mn) time, O(m) space]
• Run DP backwards up to row n/2, getting values V*(n/2, i) for all i and forward pointers for row n/2. [O(mn) time, O(m) space]
• Compute V(n/2,i)+V*(n/2,m-i) for each i, get maximizing index k*. [O(m) time, space]
• Use back pointers to compute subpath from (n/2,k*) to last node in row n/2-1. [O(m) time, space incl. storage]
• Use forward pointers to compute subpath from (n/2,k*) to first node in row n/2+1. [O(m) time, space incl. storage]
Full Alg and Analysis
• Assume time to fill a p by q DP matrix: cpq
• ⇒ time to compute rows V(n/2,.), V*(n/2,.): cmn
• ⇒ time cmn, space O(m) to find k*, k1, k2, Ln/2
• Recursively solve top subproblem of size ≤ nk*/2, bottom subproblem of size ≤ n(m-k*)/2
• Time for top level cmn, 2nd level cmn/2
• Time for all i-th level computations: $cmn/2^{i-1}$ (each subproblem has $n/2^{i}$ rows, the cols of all subprobs are distinct)
• Total time: $\sum_{i=1}^{\log n} cmn/2^{i-1} \le 2cmn$
• Total space: O(m+n)
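The divide-and-conquer scheme can be sketched as follows. This is a simplified variant under stated assumptions, not the slides' exact procedure: it splits at (n/2, k*) and recurses on the two halves without extracting the subpath Ln/2, it assumes match/mismatch/indel scoring, and it uses string slices for brevity. All function names are illustrative.

```python
def last_row(s, t, match, mismatch, indel):
    """Last row of the global DP table for s vs t, kept in O(len(t)) space."""
    prev = [j * indel for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * indel]
        for j in range(1, len(t) + 1):
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sigma, prev[j] + indel, curr[j - 1] + indel))
        prev = curr
    return prev


def hirschberg(s, t, match=1, mismatch=-1, indel=-1):
    """Return one optimal global alignment (gapped s, gapped t)."""
    n, m = len(s), len(t)
    if n == 0:
        return "-" * m, t
    if m == 0:
        return s, "-" * n
    if n == 1:
        # Pair the single character with its best partner in t, or gap everything.
        k = max(range(m), key=lambda j: match if s == t[j] else mismatch)
        sigma = match if s == t[k] else mismatch
        if sigma + (m - 1) * indel >= (m + 1) * indel:
            return "-" * k + s + "-" * (m - k - 1), t
        return s + "-" * m, "-" + t
    mid = n // 2
    forward = last_row(s[:mid], t, match, mismatch, indel)               # V(n/2, .)
    backward = last_row(s[mid:][::-1], t[::-1], match, mismatch, indel)  # V*(n/2, .)
    k = max(range(m + 1), key=lambda j: forward[j] + backward[m - j])    # k*
    top_l, bot_l = hirschberg(s[:mid], t[:k], match, mismatch, indel)
    top_r, bot_r = hirschberg(s[mid:], t[k:], match, mismatch, indel)
    return top_l + top_r, bot_l + bot_r


def alignment_score(top, bottom, match=1, mismatch=-1, indel=-1):
    """Score a gapped alignment column by column."""
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            total += indel
        else:
            total += match if a == b else mismatch
    return total


print(hirschberg("AGTA", "ATA"))  # ('AGTA', 'A-TA')
```

The backward pass is obtained by running the same forward DP on the reversed suffixes, which works because the scoring depends only on the aligned character pairs.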
Dan Hirschberg
Daniel S. Hirschberg is a full
professor in Computer
Science at University of
California, Irvine. His
research interests are in the
theory of design and analysis
of algorithms.

Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences". Communications of the ACM 18 (6): 341–343
End-Space Free Alignment
• Suppose spaces at the beginning and the end of the alignment contribute zero weight.
• Example (the leading spaces in S and the trailing spaces in T carry no weight):
  S = - - c a c - d b d v l
  T = l t c a b d d b - - -
• Motivation: “shotgun sequence assembly” - finding the original sequence given many of its subsequences (possibly overlapping).
[Figure: sequence assembly]
End-Space Free Alignment (2)
• Solution: similar to global alignment alg.
• Base conditions (instead of Σk σ(sk,-) and Σk σ(-,tk)):
V(i,0) = 0
V(0,j) = 0
• Recurrence relation (the same as in global alignment):
V(i,j) = max of
– V(i-1,j-1) + σ(si,tj)
– V(i-1,j) + σ(si,-)
– V(i,j-1) + σ(-,tj)
• Search for i* and j* such that
V(n, i*) = maxi{V(n, i)}
V(j*, m) = maxj{V(j, m)}
• V(S, T) = max{ V(n, i*), V(j*, m) } (instead of V(n,m))
• Time complexity: O(nm)
– computing the matrix: O(nm),
– finding i* and j*: O(n+m).
• Space complexity for opt value: O(n+m)
– computing the matrix: O(n+m),
– computing i* and j* requires the last row and column to be saved: O(n+m)
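A minimal sketch of the end-space free variant, computing the optimal value only. It assumes match/mismatch/indel scoring; the function name and the test sequences are made up for illustration. Only the previous row and a running maximum over the last column are stored.

```python
def end_space_free_score(s, t, match=1, mismatch=-1, indel=-1):
    """Optimal alignment value when leading and trailing spaces are free."""
    n, m = len(s), len(t)
    prev = [0] * (m + 1)             # V(0, j) = 0
    best_last_col = prev[m]          # running max of V(i, m)
    for i in range(1, n + 1):
        curr = [0]                   # V(i, 0) = 0
        for j in range(1, m + 1):
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sigma, prev[j] + indel, curr[j - 1] + indel))
        best_last_col = max(best_last_col, curr[m])
        prev = curr
    # Best value over the last row and the last column, instead of V(n, m).
    return max(max(prev), best_last_col)


print(end_space_free_score("AAAGGG", "GGGTTT"))  # 3: suffix GGG overlaps prefix GGG
print(end_space_free_score("XXABCXX", "ABC"))    # 3: T lies entirely inside S
```

The first call models the overlap case that arises in assembly; the second shows that containment is handled by the same recurrence.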
Why compare sequences? (II)
Local Alignment
Definition: Given sequences S, T, find subsequences α of S and β of T, of maximum similarity (i.e., with optimal global alignment between α & β).
Motivation:
• ignore stretches of non-coding DNA
• protein domains (functional subunits)
Example:
• S=abcxdex, T=xxxcded,
• Similarity score: 2 per match, -1 for subs/indel,
• α=cxde and β=c-de have optimal alignment score:

    a b c x d e x
  x x x c - d e d
Local alignments in the alignment graph
Computing Local Alignment
The local suffix alignment problem for S’, T’: find a
(possibly empty) suffix α of S’=s1…si and a
(possibly empty) suffix β of T’=t1…tj such that the
value of their alignment is maximum over all values
of alignments of suffixes of S’ and T’.

• V(i,j) : the value of optimal local suffix alignment for a given pair i, j of indices.
• How are the V(i,j) related to opt local alignment value?
Computing Local Alignment (2)

A scheme of the algorithm:
• Assumption: match ≥ 0, mismatch/indel ≤ 0
• Solve local suffix alignment for S’=s1...si and T’=t1...tj by discarding prefixes whose similarity is ≤ 0
• Find the indices i*, j* after which the similarity only decreases.

Algorithm - Recursive Definition
Base Condition: ∀i,j V(i,0) = 0, V(0,j) = 0
Recursion Step: ∀ i>0, j>0
V(i,j) = max of
– 0,
– V(i-1, j-1) + σ(si, tj),
– V(i, j-1) + σ(-, tj),
– V(i-1, j) + σ(si, -)
Compute i*, j* s.t. V(i*, j*) = max1≤i≤n, 1≤j≤m V(i,j)
     λ   C   T   C   G   C   A   G   C
λ    0   0   0   0   0   0   0   0   0
C    0   1   0   1   0   1   0   0   1
A    0   0   0   0   0   0   2   0   0
T    0   0   1   0   0   0   0   1   0
T    0   0   1   0   0   0   0   0   0
C    0   1   0   2   0   1   0   0   1
A    0   0   0   0   1   0   2   0   0
C    0   1   0   1   0   2   0   1   1
+1 for a match, -1 for a mismatch, -5 for a space
Source: Fernandez-Baca & Dobbs, http://www.cs.iastate.edu/~cs544/
Computing Local Alignment (3)
• Time O(nm)
• Space O(n+m)
The optimum value and the ends of subsequences α
and β can be found in linear space
• Finding the starting point of the two subsequences
can be done in linear space (ex.)
• The actual alignment can be computed using
Hirschberg’s algorithm
• Smith-Waterman 81
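The local alignment recurrence can be sketched as below. This is an illustrative quadratic-space version (it stores the full table for the traceback rather than using the linear-space refinements listed above), with match/mismatch/indel scoring assumed and ties in the traceback broken in the order diagonal, up, left.

```python
def smith_waterman(s, t, match=2, mismatch=-1, indel=-1):
    """Return (best local score, gapped alpha, gapped beta)."""
    n, m = len(s), len(t)
    V = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            V[i][j] = max(0,
                          V[i - 1][j - 1] + sigma,
                          V[i][j - 1] + indel,
                          V[i - 1][j] + indel)
            if V[i][j] > best:
                best, best_i, best_j = V[i][j], i, j

    # Trace back from (i*, j*) until a cell with value 0 is reached.
    alpha, beta = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and V[i][j] > 0:
        sigma = match if s[i - 1] == t[j - 1] else mismatch
        if V[i][j] == V[i - 1][j - 1] + sigma:
            alpha.append(s[i - 1]); beta.append(t[j - 1])
            i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + indel:
            alpha.append(s[i - 1]); beta.append("-")
            i -= 1
        else:
            alpha.append("-"); beta.append(t[j - 1])
            j -= 1
    return best, "".join(reversed(alpha)), "".join(reversed(beta))


print(smith_waterman("abcxdex", "xxxcded"))  # (5, 'cxde', 'c-de')
```

With the earlier example's scoring (2 per match, -1 for subs/indel) this recovers α=cxde and β=c-de; another tie-breaking order would report the equally good pair xde / xcde with a space in α.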

Gap Penalties
• Observation: spaces tend to occur in
batches.
• Idea: when scoring an alignment, use the no.
of contiguous gaps and not the no. of spaces
• Definitions:
– A gap is any maximal run of consecutive spaces
in a single sequence of a given alignment.
– The length of a gap is the number of spaces in
it.
– The number of gaps in the alignment is denoted by #gaps.
• Example:
  S= attc--ga-tggacc
  T= a--cgtgatt---cc
– 4 gaps, 8 spaces, 7 matches, 0 mismatches.
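The counts in the example can be verified mechanically. The sketch below uses illustrative names; a gap is found as a maximal run of '-' within one row of the alignment.

```python
import re


def gap_statistics(row1: str, row2: str) -> dict:
    """Count gaps, spaces, matches and mismatches in a gapped alignment."""
    assert len(row1) == len(row2)
    spaces = row1.count("-") + row2.count("-")
    # A gap is a maximal run of consecutive spaces in a single row.
    gaps = len(re.findall(r"-+", row1)) + len(re.findall(r"-+", row2))
    pairs = [(a, b) for a, b in zip(row1, row2) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in pairs)
    return {"gaps": gaps, "spaces": spaces,
            "matches": matches, "mismatches": len(pairs) - matches}


print(gap_statistics("attc--ga-tggacc", "a--cgtgatt---cc"))
# {'gaps': 4, 'spaces': 8, 'matches': 7, 'mismatches': 0}
```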
Gap Penalty Models
Motivation:
– Indels of an entire subsequence can occur in a single mutation.
– When comparing cDNA to DNA, introns are gapped.

Constant Gap Penalty Model:
• Each individual space is free,
• Constant weight Wg for each gap, independent of its length (gap opening cost)
Goal: maximize Σσ(s’i, t’i) + Wg × #gaps

Affine Gap Penalty Model:
• In addition to Wg, each space has cost Ws (gap extension cost).
Goal: maximize Σσ(s’i, t’i) + Wg × #gaps + Ws × #spaces
Alignment with Affine Gap Penalty
Three Types of Alignments:

1  S.....i
   T.....j
• G(i,j) is max value of any alignment of type 1, where si and tj match

2  S.....i-------
   T..............j
• E(i,j) is max value of any alignment of type 2, where tj matches a space

3  S...............i
   T.....j-------
• F(i,j) is max value of any alignment of type 3, where si matches a space
Alignment with Affine Gap Penalty (2)

Base Conditions:
V(i, 0) = F(i, 0) = Wg + iWs
V(0, j) = E(0, j) = Wg + jWs
Recursive Computation:
V(i, j) = max{ E(i, j), F(i, j), G(i, j)}
where:
• G(i, j) = V(i-1, j-1) + σ(si, tj)
• E(i, j) = max{ E(i, j-1) + Ws , G(i, j-1) + Wg + Ws , F(i, j-1) + Wg + Ws }
• F(i, j) = max{ F(i-1, j) + Ws , G(i-1, j) + Wg + Ws , E(i-1, j) + Wg + Ws }
• Time complexity O(nm) - compute 4 matrices instead of one.
• Space complexity O(nm) - saving 4 matrices (trivial implementation).
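The four-matrix recurrence can be transcribed almost literally. This sketch makes several assumptions that are not on the slides: Wg and Ws are given as non-positive weights that are added to the score, V(0,0) = 0, undefined cells are treated as minus infinity, and σ is a simple match/mismatch function. It returns the optimal value only.

```python
def affine_gap_score(s, t, match=1, mismatch=-1, wg=-2, ws=-1):
    """Optimal global alignment value under the affine gap model (value only)."""
    n, m = len(s), len(t)
    NEG = float("-inf")  # marks cells where a matrix is undefined

    def table():
        return [[NEG] * (m + 1) for _ in range(n + 1)]

    V, E, F, G = table(), table(), table(), table()
    V[0][0] = 0
    for i in range(1, n + 1):
        V[i][0] = F[i][0] = wg + i * ws
    for j in range(1, m + 1):
        V[0][j] = E[0][j] = wg + j * ws
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sigma = match if s[i - 1] == t[j - 1] else mismatch
            G[i][j] = V[i - 1][j - 1] + sigma
            # tj aligned to a space: extend a gap in S or open a new one.
            E[i][j] = max(E[i][j - 1] + ws,
                          G[i][j - 1] + wg + ws,
                          F[i][j - 1] + wg + ws)
            # si aligned to a space: extend a gap in T or open a new one.
            F[i][j] = max(F[i - 1][j] + ws,
                          G[i - 1][j] + wg + ws,
                          E[i - 1][j] + wg + ws)
            V[i][j] = max(E[i][j], F[i][j], G[i][j])
    return V[n][m]


# Six matches and one gap of three spaces: 6 + (-2) + 3 * (-1) = 1.
print(affine_gap_score("AAATTTGGG", "AAAGGG"))  # 1
```

Since every row depends only on the previous one, the same value could be computed with two rows per matrix; the full tables are kept here to mirror the trivial implementation mentioned above.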
Other Gap Penalty Models:
Convex Gap Penalty Model:
• Each additional space in a gap contributes less to the gap weight.
• Example: Wg + log(q), where q is the length of the gap.
• Solvable in O(nm log m) time.

Arbitrary Gap Penalty Model:
• Most general gap weight.
• Weight of a gap is an arbitrary function of its length w(q).
• Solvable in O(nm² + n²m) time.