Lecture5 Newest

The document discusses sequence alignment and dynamic programming techniques. It introduces global alignment, semi-global alignment and local alignment. It also covers affine gap penalty models, which score runs of gaps more realistically than a simple linear gap penalty. The document provides examples and explanations of how dynamic programming computes optimal alignments between sequences in O(mn) time and, with Hirschberg's algorithm, in linear space.

Uploaded by

Times Square
Copyright
© Attribution Non-Commercial (BY-NC)
6.096 Algorithms for Computational Biology
Sequence Alignment and Dynamic Programming
Lecture 1 - Introduction
Lecture 2 - Hashing and BLAST
Lecture 3 - Combinatorial Motif Finding
Lecture 4 - Statistical Motif Finding

Challenges in Computational Biology
1 Gene finding
2 Sequence alignment
3 Database lookup
4 Genome assembly
5 Regulatory motif discovery
6 Comparative genomics
7 Evolutionary theory
8 Gene expression analysis
9 Cluster discovery
10 Gibbs sampling
11 Protein network analysis
12 Regulatory network inference
13 Emerging network properties
[Figure: overview diagram linking DNA, RNA transcript and example sequences (TCATGCTAT, TCGTGATAA, TGAGGATAT, TTATCATAT, TTATGATTT) to the numbered topics]

Comparing two DNA sequences
Given two possibly related strings S1 and S2
What is the longest common subsequence?
S1:   A C G T C A T C A
S2:   T A G T G T C A
LCSS: A G T T C A
Edit distance: number of changes needed to turn S1 into S2

How can we compute the best alignment?
S1: A C G T C A T C A
S2: T A G T G T C A
Need a scoring function:
  Score(alignment) = total cost of editing S1 into S2
  Cost of mutation
  Cost of insertion/deletion
  Reward of match
Need an algorithm for inferring the best alignment
  Enumeration?
  How would you do it?
  How many alignments are there?

Why we need a smart algorithm
Ways to align two sequences of length m, n (closed form shown for m = n):
$$\binom{m+n}{m} = \frac{(m+n)!}{(m!)^2} \approx \frac{2^{m+n}}{\sqrt{\pi m}}$$
For two sequences of length n:
  n     Enumeration   Today's lecture
  10    184,756       100
  20    1.40E+11      400
  100   9.00E+58      10,000
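
The table can be reproduced with a short sketch. It assumes, as the closed form above does, that both sequences have length n, so the enumeration count is the binomial coefficient C(2n, n) and the dynamic-programming count is the n × n table size. The function name `count_alignments` is illustrative, not from the lecture.

```python
import math


def count_alignments(n: int) -> int:
    """Number of ways to interleave two length-n sequences: C(2n, n)."""
    return math.comb(2 * n, n)


# Compare brute-force enumeration with the n*n cells a DP table needs.
for n in (10, 20, 100):
    print(n, f"{count_alignments(n):.2e}", n * n)
```

The gap between the two columns is the reason enumeration is ruled out even for short sequences.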

Key insight: score is additive!
S1: A C G T C A T C A   (split after position i)
S2: T A G T G T C A     (split after position j)
Compute the best alignment recursively
For a given split (i, j), the best alignment is:
  Best alignment of S1[1..i] and S2[1..j]
  + Best alignment of S1[i+1..n] and S2[j+1..m]
[Figure: the same pair S1, S2 drawn with several different split points (i, j), for example the prefixes A C G T / T A G T G and the suffixes C G T C A T C A / T G T C A]

Key insight: re-use computation
Identical sub-problems! We can reuse our work!

Solution #1: Memoization
  Create a big dictionary, indexed by aligned seqs
  When you encounter a new pair of sequences
    If it is in the dictionary:
      Look up the solution
    If it is not in the dictionary
      Compute the solution
      Insert the solution in the dictionary
  Ensures that there is no duplicated work
    Only need to compute each sub-alignment once!
  Top down approach

Solution #2: Dynamic programming
  Create a big table, indexed by (i,j)
  Fill it in from the beginning all the way till the end
  You know that you'll need every subpart
  Guaranteed to explore entire search space
  Ensures that there is no duplicated work
    Only need to compute each sub-alignment once!
  Very simple computationally!
  Bottom up approach
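
A minimal sketch of Solution #1, the top-down approach. It assumes the scores used later in the lecture's worked tables (match +1, mismatch -1, gap -2) and uses Python's `functools.lru_cache` as the "big dictionary", keyed by prefix lengths (i, j) rather than by the sequences themselves. The function name is illustrative.

```python
from functools import lru_cache

MATCH, MISMATCH, GAP = 1, -1, -2  # assumed scores, taken from the later slides


def align_score_memo(s: str, t: str) -> int:
    """Best global alignment score of s and t, computed top-down."""

    @lru_cache(maxsize=None)  # the dictionary of already-solved sub-problems
    def best(i: int, j: int) -> int:
        # best(i, j) is the best score for aligning s[:i] with t[:j]
        if i == 0:
            return j * GAP
        if j == 0:
            return i * GAP
        diag = MATCH if s[i - 1] == t[j - 1] else MISMATCH
        return max(
            best(i - 1, j - 1) + diag,  # s[i] paired with t[j]
            best(i - 1, j) + GAP,       # s[i] paired with a gap
            best(i, j - 1) + GAP,       # t[j] paired with a gap
        )

    return best(len(s), len(t))


print(align_score_memo("AAGC", "AGT"))
```

Each (i, j) pair is solved once, so the work is proportional to the table size even though the code reads like plain recursion.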

Key insight: Matrix representation of alignments
[Figure: S1 = A C G T C A T C A along the top and S2 = T A G T G T C A down the side of a matrix; the highlighted path is labeled A G T C/G T C A]
Goal: find the best path through the matrix

Sequence alignment
Dynamic Programming
  Global alignment

0. Setting up the scoring matrix
       -    A    G    T
  -    0
  A
  A
  G
  C
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
  }
Termination:
  Bottom right

1. Allowing gaps in s
       -    A    G    T
  -    0
  A   -2
  A   -4
  G   -6
  C   -8
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
  }
Termination:
  Bottom right

2. Allowing gaps in t
       -    A    G    T
  -    0   -2   -4   -6
  A   -2   -4   -6   -8
  A   -4   -6   -8  -10
  G   -6   -8  -10  -12
  C   -8  -10  -12  -14
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
  }
Termination:
  Bottom right

3. Allowing mismatches
       -    A    G    T
  -    0   -2   -4   -6
  A   -2   -1   -3   -5
  A   -4   -3   -2   -4
  G   -6   -5   -4   -3
  C   -8   -7   -6   -5
(every diagonal step costs -1)
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) - 1
  }
Termination:
  Bottom right

4. Choosing optimal paths
       -    A    G    T
  -    0   -2   -4   -6
  A   -2   -1   -3   -5
  A   -4   -3   -2   -4
  G   -6   -5   -4   -3
  C   -8   -7   -6   -5
(every diagonal step costs -1; arrows mark the predecessor that attains the max)
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) - 1
  }
Termination:
  Bottom right

5. Rewarding matches
       -    A    G    T
  -    0   -2   -4   -6
  A   -2    1   -1   -3
  A   -4   -1    0   -2
  G   -6   -3    0   -1
  C   -8   -5   -2   -1
(a diagonal step scores +1 on a match and -1 on a mismatch)
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) ± 1
  }
Termination:
  Bottom right
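
The update rule of step 5 can be sketched as a bottom-up table fill. This is an illustrative implementation, assuming s = AAGC labels the rows and t = AGT the columns as in the slides; `global_table` is a name chosen here, not the lecture's.

```python
def global_table(s, t, match=1, mismatch=-1, gap=-2):
    """Fill the global alignment matrix A for s (rows) and t (columns)."""
    rows, cols = len(s) + 1, len(t) + 1
    A = [[0] * cols for _ in range(rows)]
    # Initialization: top left is 0, first row and column are runs of gaps.
    for i in range(1, rows):
        A[i][0] = A[i - 1][0] + gap
    for j in range(1, cols):
        A[0][j] = A[0][j - 1] + gap
    # Update rule: best of a gap in t, a gap in s, or a diagonal step.
    for i in range(1, rows):
        for j in range(1, cols):
            diag = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] + gap,
                          A[i][j - 1] + gap,
                          A[i - 1][j - 1] + diag)
    return A


for row in global_table("AAGC", "AGT"):
    print(row)
# Termination: the bottom-right entry is the global alignment score.
```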

Sequence alignment
Dynamic Programming
  Global Alignment
  Semi-Global

Semi-Global Motivation
Aligning the following sequences
  CAGCACTTGGATTCTCGG
  CAGC-----G-T----GG
  vvvv-----v-v----vv   = 8(1) + 0(-1) + 10(-2) = -12
We might prefer the alignment
  CAGCA-CTTGGATTCTCGG
  ---CAGCGTGG--------
  ---vv-vxvvv--------  = 6(1) + 1(-1) + 12(-2) = -19
(v = match, x = mismatch, - = gap; terms are matches, mismatches, gaps)
New qualities sought, new scoring scheme designed
  Intuitively, don't penalize "missing" end of the sequence
  We'd like to model this intuition

Ignoring starting gaps
       -    A    G    T
  -    0    0    0    0
  A    0    1   -1   -1
  A    0    1    0   -2
  G    0   -1    2    0
  C    0   -1    0    1
Initialization:
  1st row/col: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) ± 1
  }
Termination:
  Bottom right

Ignoring trailing gaps
       -    A    G    T
  -    0    0    0    0
  A    0    1   -1   -1
  A    0    1    0   -2
  G    0   -1    2    0
  C    0   -1    0    1
Initialization:
  1st row/col: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) ± 1
  }
Termination:
  max(last row/col)
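
The two changes (zero first row/column, maximum over the last row/column) can be sketched as follows. This is an illustrative version that frees end gaps on both sequences, assuming the same +1 / -1 / -2 scores as the tables; `overlap_score` is a hypothetical name.

```python
def overlap_score(s, t, match=1, mismatch=-1, gap=-2):
    """Best alignment score when leading and trailing gaps are free."""
    rows, cols = len(s) + 1, len(t) + 1
    # Initialization: the first row and column stay 0 (free starting gaps).
    A = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        for j in range(1, cols):
            diag = match if s[i - 1] == t[j - 1] else mismatch
            A[i][j] = max(A[i - 1][j] + gap,
                          A[i][j - 1] + gap,
                          A[i - 1][j - 1] + diag)
    # Termination: best entry in the last row or last column (free trailing gaps).
    last_row = max(A[-1])
    last_col = max(row[-1] for row in A)
    return max(last_row, last_col)


print(overlap_score("AAGC", "AGT"))
```

Only the initialization and the termination differ from the global version; the update rule is untouched.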

Using the new scoring scheme
With the old scoring scheme (all gaps count -2)
  CAGCACTTGGATTCTCGG
  CAGC-----G-T----GG
  vvvv-----v-v----vv   = 8(1) + 0(-1) + 10(-2) + 0(-0) = -12
New score (end gaps are free)
  CAGCA-CTTGGATTCTCGG
  ---CAGCGTGG--------
  ---vv-vxvvv--------  = 6(1) + 1(-1) + 1(-2) + 11(-0) = 3
(terms are match, mismatch, gap, end gap)

Semi-global alignments
Applications (a short query aligned against a long subject):
  Finding a gene in a genome
  Aligning a read onto an assembly
  Finding the best alignment of a PCR primer
  Placing a marker onto a chromosome
These situations have in common
  One sequence is much shorter than the other
  Alignment should span the entire length of the smaller sequence
  No need to align the entire length of the longer sequence
In our scoring scheme we should
  Penalize end-gaps for subject sequence
  Do not penalize end-gaps for query sequence

Semi-Global Alignment

Query: s, Subject: t (align all of s)
       -    A    G    T
  -    0   -2   -4   -6
  A    0    1   -1   -1
  A    0    1    0   -2
  G    0   -1    2    1
  C    0   -1    0    0
Initialization:
  1st col
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) ± 1
  }
Termination:
  max(last col)

...or...

Query: t, Subject: s (align all of t)
       -    A    G    T
  -    0   -2   -4   -6
  A   -2    1   -1   -1
  A   -4    1    0   -2
  G   -6   -1    2    0
  C   -8   -1    0   -1
Initialization:
  1st row
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) ± 1
  }
Termination:
  max(last row)

Sequence alignment
Dynamic Programming
  Global Alignment
  Semi-Global
  Local Alignment

Intro to Local Alignments
Statement of the problem
  A local alignment of strings s and t is an alignment of a substring of s with a substring of t
Definitions (reminder):
  A substring consists of consecutive characters
  A subsequence of s need not be contiguous in s
Naive algorithm
  Now that we know how to use dynamic programming
  Take all O((nm)^2) pairs of substrings, and run each alignment in O(nm) time
Dynamic programming
  By modifying our existing algorithms, we achieve O(mn)

Global Alignment
       -    A    G    T
  -    0   -2   -4   -6
  A   -2    1   -1   -3
  A   -4   -1    0   -2
  G   -6   -3    0   -1
  C   -8   -5   -2   -1
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) ± 1
  }
Termination:
  Bottom right

Local Alignment
       -    A    G    T
  -    0    0    0    0
  A    0    1    0    0
  A    0    1    0    0
  G    0    0    2    0
  C    0    0    0    1
Initialization:
  Top left: 0
Update Rule:
  A(i,j) = max{
    A(i-1, j) - 2
    A(i, j-1) - 2
    A(i-1, j-1) ± 1
    0
  }
Termination:
  Anywhere

Local Alignment issues
Resolving ambiguities
  When following arrows back, one can stop at any of the zero entries. Only stop when no arrow leaves: this gives the longest alignment.
Correctness sketch by induction
  Assume we've correctly aligned up to (i,j)
  Consider the four cases of our max computation
  By inductive hypothesis recurse on (i-1,j-1), (i-1,j), (i,j-1)
  Base case: empty strings are suffixes aligned optimally
Time analysis
  O(mn) time
  O(mn) space, can be brought to O(m+n)
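
A sketch of the local variant with traceback, again assuming +1 / -1 / -2 scores. For simplicity this version stops at the first zero entry it reaches, which is one of the permitted stopping points rather than the "longest" rule above; the function name and return shape are choices made here.

```python
def local_alignment(s, t, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned piece of s, aligned piece of t)."""
    rows, cols = len(s) + 1, len(t) + 1
    A = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = match if s[i - 1] == t[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            A[i][j] = max(0, A[i - 1][j] + gap, A[i][j - 1] + gap,
                          A[i - 1][j - 1] + diag)
            if A[i][j] > best:
                best, best_pos = A[i][j], (i, j)
    # Termination "anywhere": trace back from the best cell until a zero.
    i, j = best_pos
    out_s, out_t = [], []
    while i > 0 and j > 0 and A[i][j] > 0:
        diag = match if s[i - 1] == t[j - 1] else mismatch
        if A[i][j] == A[i - 1][j - 1] + diag:
            out_s.append(s[i - 1]); out_t.append(t[j - 1]); i -= 1; j -= 1
        elif A[i][j] == A[i - 1][j] + gap:
            out_s.append(s[i - 1]); out_t.append("-"); i -= 1
        else:
            out_s.append("-"); out_t.append(t[j - 1]); j -= 1
    return best, "".join(reversed(out_s)), "".join(reversed(out_t))


print(local_alignment("AAGC", "AGT"))
```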

Sequence alignment
Dynamic Programming
  Global Alignment
  Semi-Global
  Local Alignment
  Affine Gap Penalty

Scoring the gaps more accurately
Current model:
  Gap of length n incurs penalty n×d
However, gaps usually occur in bunches
Convex gap penalty function γ(n):
  for all n, γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)
[Figure: plots of γ(n) for the linear and the convex penalty]

General gap dynamic programming
Initialization: same
Iteration:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ \max_{k=0 \ldots i-1} F(k,j) - \gamma(i-k) \\ \max_{k=0 \ldots j-1} F(i,k) - \gamma(j-k) \end{cases}$$
Termination: same
Running Time: O(N^2 M) (assume N > M)
Space: O(NM)
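
Compromise: affine gaps
  γ(n) = d + (n-1)×e
  d: gap open penalty
  e: gap extend penalty
[Figure: plot of γ(n) with the first step labeled d and later steps labeled e]
To compute optimal alignment,
At position i,j, need to "remember" best score if gap is open, best score if gap is not open
  F(i,j): score of alignment $x_1 \ldots x_i$ to $y_1 \ldots y_j$ if $x_i$ aligns to $y_j$
  G(i,j): score if $x_i$, or $y_j$, aligns to a gap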

Motivation for affine gap penalty
Modeling evolution
  To introduce the first gap, a break must occur in DNA
  Multiple consecutive gaps likely to be introduced by the same evolutionary event. Once the break is made, it's relatively easy to make multiple insertions or deletions.
  Fixed cost for opening a gap: p+q
  Linear cost increment for increasing number of gaps: q
Affine gap cost function
  New gap function for length k: w(k) = p + q*k
  p+q is the cost of the first gap in a run
  q is the additional cost of each additional gap in same run

Additional Matrices
The amount of state needed increases
  In scoring a single entry in our matrix, we need to remember an extra piece of information
    Are we continuing a gap in s? (if not, start is more expensive)
    Are we continuing a gap in t? (if not, start is more expensive)
    Are we continuing from a match between s(i) and t(j)?
Dynamic programming framework
  We encode this information in three different states for each element (i,j) of our alignment. Use three matrices
    a(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with t[j]
    b(i,j): best alignment of s[1..i] & t[1..j] that aligns gap with t[j]
    c(i,j): best alignment of s[1..i] & t[1..j] that aligns s[i] with gap

Update rules
When s[i] and t[j] are aligned (score can be different for each pair of chars)
$$a(i,j) = \mathrm{score}(s[i], t[j]) + \max \begin{cases} a(i-1,j-1) \\ b(i-1,j-1) \\ c(i-1,j-1) \end{cases}$$
When t[j] aligns with a gap in s
$$b(i,j) = \max \begin{cases} a(i,j-1) - (p+q) & \text{starting a gap in } s \\ b(i,j-1) - q & \text{extending a gap in } s \\ c(i,j-1) - (p+q) & \text{stopping a gap in } t \text{, and starting one in } s \end{cases}$$
When s[i] aligns with a gap in t
$$c(i,j) = \max \begin{cases} a(i-1,j) - (p+q) \\ c(i-1,j) - q \\ b(i-1,j) - (p+q) \end{cases}$$
Find maximum over all three arrays max(a[m,n], b[m,n], c[m,n]).
Follow arrows back, skipping from matrix to matrix

Simplified rules
Transitions from b to c are not necessary...
  ...if the worst mismatch costs less than p+q
    ACC-GGTA        ACCGGTA
    A--TGGTA        A-TGGTA
When s[i] and t[j] are aligned (score can be different for each pair of chars)
$$a(i,j) = \mathrm{score}(s[i], t[j]) + \max \begin{cases} a(i-1,j-1) \\ b(i-1,j-1) \\ c(i-1,j-1) \end{cases}$$
When t[j] aligns with a gap in s
$$b(i,j) = \max \begin{cases} a(i,j-1) - (p+q) & \text{starting a gap in } s \\ b(i,j-1) - q & \text{extending a gap in } s \end{cases}$$
When s[i] aligns with a gap in t
$$c(i,j) = \max \begin{cases} a(i-1,j) - (p+q) \\ c(i-1,j) - q \end{cases}$$
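
The simplified three-matrix rules can be sketched directly. The boundary values are an assumption made here (the slides do not spell them out): a leading run of k gaps costs w(k) = p + q*k, and every other border state is unreachable. The parameter defaults p = 2, q = 1 are illustrative and satisfy the condition that the worst mismatch (-1) costs less than p+q.

```python
NEG_INF = float("-inf")


def affine_score(s, t, p=2, q=1, match=1, mismatch=-1):
    """Global alignment score with gap cost w(k) = p + q*k (simplified rules)."""
    m, n = len(s), len(t)
    a = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # s[i] aligned with t[j]
    b = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # t[j] aligned with a gap
    c = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # s[i] aligned with a gap
    a[0][0] = 0
    for j in range(1, n + 1):          # assumed border: one leading gap run in s
        b[0][j] = -(p + q * j)
    for i in range(1, m + 1):          # assumed border: one leading gap run in t
        c[i][0] = -(p + q * i)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            a[i][j] = sub + max(a[i - 1][j - 1], b[i - 1][j - 1], c[i - 1][j - 1])
            b[i][j] = max(a[i][j - 1] - (p + q),   # start a gap in s
                          b[i][j - 1] - q)         # extend a gap in s
            c[i][j] = max(a[i - 1][j] - (p + q),   # start a gap in t
                          c[i - 1][j] - q)         # extend a gap in t
    return max(a[m][n], b[m][n], c[m][n])


# Six matches and one gap run of length 3: 6 - (2 + 3*1) = 1
print(affine_score("AAATTTGGG", "AAAGGG"))
```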

General Gap Penalty
Gap penalties are limited by the amount of state
  Linear gap penalty: w(k) = k*p
    State: current index tells if in a gap or not
  Affine gap penalty: w(k) = p + q*k, where q < p
    State: add binary value for each sequence: starting a gap or not
  What about quadratic: w(k) = p + q*k + r*k^2?
    State: needs to encode the length of the gap, which can be O(n)
    To encode it we need O(log n) bits of information. Not feasible
  What about a (mod 3) gap penalty for protein alignments
    Gaps of length divisible by 3 are penalized less: conserve frame
    This is feasible, but requires more possible states
    Possible states are: starting, mod 3 = 1, mod 3 = 2, mod 3 = 0

Sequence alignment
Dynamic Programming
  Global Alignment
  Semi-Global
  Local Alignment
  Affine Gap Penalty
  Variations on the Theme

Dynamic Programming Versatility
Unified framework
  Dynamic programming algorithm. Local updates.
  Re-using past results in future computations.
  Memory usage optimizations
Tools at our disposal
  Global alignment: entire length of two orthologous genes
  Semi-global alignment: piece of a larger sequence aligned entirely
  Local alignment: two genes sharing a functional domain
  Affine Gap Penalty: penalize first gap more than subsequent gaps
  Edit distance, min # of edit operations. M=0, m=g=-1, every operation subtracts 1, be it mutation or gap
  Longest common subsequence: M=1, m=g=0. Every match adds one, be it contiguous or not with previous.
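
The last two bullets can be checked with one generic scorer. This sketch assumes M is the match score, m the mismatch score and g the per-position gap score, as in the bullets; `dp_score` is an illustrative name.

```python
def dp_score(s, t, M, m, g):
    """Global alignment score with match M, mismatch m, and gap g."""
    prev = [j * g for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        cur = [i * g] + [0] * len(t)
        for j in range(1, len(t) + 1):
            diag = M if s[i - 1] == t[j - 1] else m
            cur[j] = max(prev[j - 1] + diag, prev[j] + g, cur[j - 1] + g)
        prev = cur
    return prev[-1]


S1, S2 = "ACGTCATCA", "TAGTGTCA"
# Longest common subsequence length: M=1, m=g=0
print(dp_score(S1, S2, M=1, m=0, g=0))
# Edit distance: M=0, m=g=-1, then negate the score
print(-dp_score(S1, S2, M=0, m=-1, g=-1))
```

Only the three numbers change between the two problems; the recurrence and the table are the same.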

DP Algorithm Variations

Global Alignment
       -    A    G    T
  -    0   -2   -4   -6
  A   -2    1   -1   -3
  A   -4   -1    0   -2
  G   -6   -3    0   -1
  C   -8   -5   -2   -1

Semi-Global Alignment
       -    A    G    T
  -    0   -2   -4   -6
  A    0    1   -1   -1
  A    0    1    0   -2
  G    0   -1    2    1
  C    0   -1    0    0

Local Alignment
       -    A    G    T
  -    0    0    0    0
  A    0    1    0    0
  A    0    1    0    0
  G    0    0    2    0
  A    0    1    0    1

[Figure: for each variant, a sketch of how s and t overlap]

Linear-Space Alignment
Hirschberg's algorithm
Longest common subsequence
  Given sequences $s = s_1 s_2 \ldots s_m$, $t = t_1 t_2 \ldots t_n$,
  Find longest common subsequence $u = u_1 \ldots u_k$
Algorithm:
$$F(i,j) = \max \begin{cases} F(i-1,j) \\ F(i,j-1) \\ F(i-1,j-1) + [1 \text{ if } s_i = t_j;\ 0 \text{ otherwise}] \end{cases}$$
Hirschberg's algorithm solves this in linear space

Introduction: Compute optimal score
It is easy to compute F(M,N) in linear space
  Allocate(column[1])
  Allocate(column[2])
  For i = 1…M
    If i > 1, then:
      Free(column[i-2])
      Allocate(column[i])
    For j = 1…N
      F(i,j) = …
[Figure: two adjacent columns of the F(i,j) matrix]

Linear-space alignment
To compute both the optimal score and the optimal alignment:
Divide & Conquer approach:
Notation:
  $x^r$, $y^r$: reverse of x, y
  E.g. x = accgg; $x^r$ = ggcca
  $F^r(i,j)$: optimal score of aligning $x^r_1 \ldots x^r_i$ & $y^r_1 \ldots y^r_j$
  same as F(M-i+1, N-j+1)
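
Linear-space alignment
Lemma:
$$F(M,N) = \max_{k=0 \ldots N} \left( F(M/2, k) + F^r(M/2, N-k) \right)$$
[Figure: the matrix for x and y split at column M/2; F(M/2, k) covers the part before the crossing point k*, F^r(M/2, N-k) the part after it]

Linear-space alignment
Now, using 2 columns of space, we can compute
for k = 0…N, F(M/2, k), $F^r$(M/2, N-k)
PLUS the backpointers

Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + $F^r$(M/2, N-k)
Also, we can trace the path exiting column M/2 from k*

Linear-space alignment
Iterate this procedure to the left and right!
[Figure: the two sub-rectangles of size M/2 × k* and M/2 × (N-k*) on either side of the crossing point]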

Linear-space alignment
Hirschberg's Linear-space algorithm:
MEMALIGN(l, l', r, r'): (aligns $x_l \ldots x_{l'}$ with $y_r \ldots y_{r'}$)
1. Let h = ⌈(l'-l)/2⌉
2. Find in Time O((l'-l) × (r'-r)), Space O(r'-r) the optimal path, $L_h$, entering column h-1, exiting column h
   Let $k_1$ = pos'n at column h-2 where $L_h$ enters
       $k_2$ = pos'n at column h+1 where $L_h$ exits
3. MEMALIGN(l, h-2, r, $k_1$)
4. Output $L_h$
5. MEMALIGN(h+1, l', $k_2$, r')
Top level call: MEMALIGN(1, M, 1, N)

Linear-space alignment
Time, Space analysis of Hirschberg's algorithm:
To compute optimal path at middle column,
For box of size M × N,
  Space: 2N
  Time: cMN, for some constant c
Then, left, right calls cost c(M/2 × k* + M/2 × (N-k*)) = cMN/2
All recursive calls cost
  Total Time: cMN + cMN/2 + cMN/4 + ... = 2cMN = O(MN)
  Total Space: O(N) for computation,
               O(N+M) to store the optimal alignment
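
The split step behind the lemma can be sketched with two score-only passes that each keep two rows in memory. This is not the full recursion, only the search for k*; the scores (match 1, mismatch penalty 1, gap penalty 2) and the function names are assumptions made for illustration, and x is split at M // 2.

```python
def last_row(x, y, m=1, s=1, d=2):
    """Scores F(len(x), k) for k = 0..len(y), keeping only two rows."""
    prev = [-k * d for k in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            cur[j] = max(prev[j - 1] + sub, prev[j] - d, cur[j - 1] - d)
        prev = cur
    return prev


def best_split(x, y):
    """Return (k*, score): where an optimal path crosses the middle of x."""
    h = len(x) // 2
    n = len(y)
    forward = last_row(x[:h], y)                # F(M/2, k)
    backward = last_row(x[h:][::-1], y[::-1])   # F^r(M/2, N-k)
    k_star = max(range(n + 1), key=lambda k: forward[k] + backward[n - k])
    return k_star, forward[k_star] + backward[n - k_star]


x, y = "AGTACGCA", "TATGC"
print(best_split(x, y), last_row(x, y)[-1])
```

The score returned by `best_split` equals the full-table score, which is what the lemma states; recursing on the two halves around k* would recover the alignment itself.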
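
The Four-Russian Algorithm
A useful speedup of Dynamic Programming
Main Observation
  Within a rectangle of the DP matrix, values of D depend only on the values of A, B, C, and substrings $x_{l \ldots l'}$, $y_{r \ldots r'}$
Definition:
  A t-block is a t × t square of the DP matrix
Idea:
  Divide matrix in t-blocks,
  Precompute t-blocks
Speedup: O(t)
[Figure: a t-block with corner A, top row B, left column C and remaining border D, spanning $x_l \ldots x_{l'}$ and $y_r \ldots y_{r'}$]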

The Four-Russian Algorithm
Main structure of the algorithm:
  Divide N × N DP matrix into K × K $\log_2 N$-blocks that overlap by 1 column & 1 row
  For i = 1…K
    For j = 1…K
      Compute $D_{i,j}$ as a function of $A_{i,j}$, $B_{i,j}$, $C_{i,j}$, $x[l_i \ldots l'_i]$, $y[r_j \ldots r'_j]$
Time: $O(N^2 / \log^2 N)$ times the cost of step 4
[Figure: the DP matrix tiled with t × t blocks]

The Four-Russian Algorithm
Another observation:
(Assume m = 0, s = 1, d = 1)
Lemma. Two adjacent cells of F(.,.) differ by at most 1
Gusfield's book covers case where m = 0, called the edit distance (p. 216):
  minimum # of substitutions + gaps to transform one string to another
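
The lemma can be checked numerically on a small example. This sketch assumes the edit-distance setting stated on the slide (match 0, substitution 1, gap 1), computed as a minimization; the helper names are illustrative.

```python
def edit_distance_table(x, y):
    """Edit-distance matrix D with unit substitution and gap costs."""
    D = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        D[i][0] = i
    for j in range(1, len(y) + 1):
        D[0][j] = j
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            D[i][j] = min(D[i - 1][j] + 1,
                          D[i][j - 1] + 1,
                          D[i - 1][j - 1] + (x[i - 1] != y[j - 1]))
    return D


def offsets(values):
    """Differences between consecutive cells, with a leading 0."""
    return [0] + [b - a for a, b in zip(values, values[1:])]


D = edit_distance_table("AACT", "CACT")
for row in D:
    print(row, offsets(row))  # every offset is -1, 0 or 1
```

Because each row or column is determined by its first value plus a vector over {-1, 0, 1}, blocks can be described by such vectors instead of absolute scores.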

The Four-Russian Algorithm
Proof of Lemma:
1. Same row:
   a. F(i,j) - F(i-1,j) ≤ +1
      At worst, one more gap: $x_1 \ldots x_{i-1}\, x_i$ aligned with $y_1 \ldots y_j$ -
   b. F(i,j) - F(i-1,j) ≥ -1
      [Figure: case diagrams comparing the alignments behind F(i,j) and F(i-1,j-1), with $x_i$ paired against some $y_a$; the cases are labeled ≥ -1 and +1]
2. Same column: similar argument

The Four-Russian Algorithm
Proof of Lemma:
3. Same diagonal:
   a. F(i,j) - F(i-1,j-1) ≤ +1
      At worst, one additional mismatch in F(i,j)
   b. F(i,j) - F(i-1,j-1) ≥ -1
      [Figure: case diagrams comparing the alignments behind F(i,j) and F(i-1,j-1), with $x_i$ paired against $y_j$ or against some $y_a$; the cases are labeled > -1 and +1]
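
The Four-Russian Algorithm
Definition:
  The offset vector is a t-long vector of values from {-1, 0, 1}, where the first entry is 0
If we know the value at A, and the top row, left column offset vectors, and $x_l \ldots x_{l'}$, $y_r \ldots y_{r'}$,
Then we can find D
[Figure: t-block with A, B, C, D marked, spanning $x_l \ldots x_{l'}$ and $y_r \ldots y_{r'}$]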

The Four-Russian Algorithm
Example:
x = AACT
y = CACT
       A    A    C    T
  C    5    6    5    5
  A    6    5    5    4
  C    5    6    5    5
  T    4    5    6    5
[Figure: the same block annotated with its row and column offset vectors, entries from {-1, 0, 1}]

The Four-Russian Algorithm
Example:
x = AACT
y = CACT
       A    A    C    T
  C    1    2    1    1
  A    2    1    1    0
  C    1    2    1    1
  T    0    1    2    1
[Figure: the same block annotated with its row and column offset vectors, identical to those of the previous example]
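
The Four-Russian Algorithm
Definition:
  The offset function of a t-block is a function that for any given offset vectors of top row, left column, and $x_l \ldots x_{l'}$, $y_r \ldots y_{r'}$, produces offset vectors of bottom row, right column
[Figure: t-block with A, B, C, D marked, spanning $x_l \ldots x_{l'}$ and $y_r \ldots y_{r'}$]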

The Four-Russian Algorithm
We can pre-compute the offset function:
  $3^{2(t-1)}$ possible input offset vectors
  $4^{2t}$ possible strings $x_l \ldots x_{l'}$, $y_r \ldots y_{r'}$
Therefore $3^{2(t-1)} \times 4^{2t}$ values to pre-compute
We can keep all these values in a table, and look up in linear time, or in O(1) time if we assume constant-lookup RAM for log-sized inputs

The Four-Russian Algorithm
Four-Russians Algorithm: (Arlazarov, Dinic, Kronrod, Faradzev)
1. Cover the DP table with t-blocks
2. Initialize values F(.,.) in first row & column
3. Row-by-row, use offset values at leftmost column and top row of each block, to find offset values at rightmost column and bottom row
4. Let Q = total of offsets at row N
   F(N,N) = Q + F(N,0)
[Figure: the DP table covered with t × t blocks]

Evolution at the DNA level
  ACGGTGCAGTCACCA
  ACGTTGCAGTCCACCA
Sequence changes
Computing best alignment in absence of gaps

Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition
Given two strings $x = x_1 x_2 \ldots x_M$, $y = y_1 y_2 \ldots y_N$,
an alignment is an assignment of gaps to positions 0, …, M in x, and 0, …, N in y, so as to line up each letter in one sequence with either a letter, or a gap in the other sequence

Scoring Function
Sequence edits:   AGGCCTC
  Mutations       AGGACTC
  Insertions      AGGGCCTC
  Deletions       AGG.CTC
Scoring Function:
  Match: +m
  Mismatch: -s
  Gap: -d
Score F = (# matches) × m - (# mismatches) × s - (# gaps) × d

How do we compute the best alignment?
[Figure: alignment matrix with AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA along one axis and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC along the other]
Too many possible alignments: $O(2^{M+N})$

Alignment is additive
Observation:
  The score of aligning $x_1 \ldots x_M$ with $y_1 \ldots y_N$ is additive
Say that $x_1 \ldots x_i \;\; x_{i+1} \ldots x_M$
aligns to $y_1 \ldots y_j \;\; y_{j+1} \ldots y_N$
The two scores add up:
  F(x[1:M], y[1:N]) = F(x[1:i], y[1:j]) + F(x[i+1:M], y[j+1:N])

Dynamic Programming
We will now describe a dynamic programming algorithm
Suppose we wish to align
  $x_1 \ldots x_M$
  $y_1 \ldots y_N$
Let
  F(i,j) = optimal score of aligning
    $x_1 \ldots x_i$
    $y_1 \ldots y_j$
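
Dynamic Programming (cont'd)
Notice three possible cases:
1. $x_i$ aligns to $y_j$
     $x_1 \ldots x_{i-1} \;\; x_i$
     $y_1 \ldots y_{j-1} \;\; y_j$
   F(i,j) = F(i-1, j-1) + { m, if $x_i = y_j$; -s, if not }
2. $x_i$ aligns to a gap
     $x_1 \ldots x_{i-1} \;\; x_i$
     $y_1 \ldots y_j \;\; -$
   F(i,j) = F(i-1, j) - d
3. $y_j$ aligns to a gap
     $x_1 \ldots x_i \;\; -$
     $y_1 \ldots y_{j-1} \;\; y_j$
   F(i,j) = F(i, j-1) - d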

Dynamic Programming (cont'd)
How do we know which case is correct?
Inductive assumption:
  F(i, j-1), F(i-1, j), F(i-1, j-1) are optimal
Then,
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i-1,j) - d \\ F(i,j-1) - d \end{cases}$$
Where $s(x_i, y_j)$ = m, if $x_i = y_j$; -s, if not

Example
x = AGTA
y = ATA
m = 1, s = 1, d = 1 (a mismatch scores -1, a gap scores -1)
  F(i,j)    i=0    1    2    3    4
                   A    G    T    A
  j=0         0   -1   -2   -3   -4
  1    A     -1    1    0   -1   -2
  2    T     -2    0    0    1    0
  3    A     -3   -1   -1    0    2
Optimal Alignment:
  F(4,3) = 2
  AGTA
  A-TA

The Needleman-Wunsch Matrix
[Figure: matrix with $x_1 \ldots x_M$ along one axis and $y_1 \ldots y_N$ along the other]
Every nondecreasing path from (0,0) to (M,N) corresponds to an alignment of the two sequences
Can think of it as a divide-and-conquer algorithm
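
The Needleman-Wunsch Algorithm
1. Initialization.
   a. F(0, 0) = 0
   b. F(0, j) = -j × d
   c. F(i, 0) = -i × d
2. Main Iteration. Filling-in partial alignments
   a. For each i = 1…M
        For each j = 1…N
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) & \text{[case 1]} \\ F(i-1,j) - d & \text{[case 2]} \\ F(i,j-1) - d & \text{[case 3]} \end{cases}$$
$$\mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG}, & \text{if [case 1]} \\ \text{LEFT}, & \text{if [case 2]} \\ \text{UP}, & \text{if [case 3]} \end{cases}$$
3. Termination. F(M, N) is the optimal score, and from Ptr(M, N) can trace back optimal alignment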
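
The three steps can be sketched as a runnable function with traceback. The pointer names follow the slide (LEFT for case 2, UP for case 3, since x runs along the horizontal axis there); ties are broken in favor of DIAG, which is a choice made here rather than something the slides specify.

```python
def needleman_wunsch(x, y, m=1, s=1, d=1):
    """Return (score, aligned x, aligned y) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    # 1. Initialization
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "LEFT"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "UP"
    # 2. Main iteration
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            choices = [
                (F[i - 1][j - 1] + sub, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "LEFT"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "UP"),          # case 3: y_j aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(choices, key=lambda c: c[0])
    # 3. Termination: trace back from Ptr(M, N)
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "LEFT":
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))
```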

Performance
  Time: O(NM)
  Space: O(NM)
Later we will cover more efficient methods

A variant of the basic algorithm:
Maybe it is OK to have an unlimited # of gaps in the beginning and end:
----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------
Then, we don't want to penalize gaps in the ends
Different types of overlaps
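
The Overlap Detection variant
[Figure: matrix with $x_1 \ldots x_M$ along one axis and $y_1 \ldots y_N$ along the other]
Changes:
1. Initialization
   For all i, j,
     F(i, 0) = 0
     F(0, j) = 0
2. Termination
$$F_{OPT} = \max \begin{cases} \max_i F(i, N) \\ \max_j F(M, j) \end{cases}$$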

The local alignment problem
Given two strings $x = x_1 \ldots x_M$, $y = y_1 \ldots y_N$
Find substrings x', y' whose similarity (optimal global alignment value) is maximum
e.g. x = aaaacccccgggg
     y = cccgggaaccaacc

Why local alignment
Genes are shuffled between genomes
Portions of proteins (domains) are often conserved
Image removed due to copyright restrictions.

Cross-species genome similarity
98% of genes are conserved between any two mammals
>70% average similarity in protein sequence
atoh enhancer in human, mouse, rat, fugu fish:
hum_a: GTTGACAATAGAGGGTCTGGCAGAGGCTC--------------------- @ 57331/400001
mus_a: GCTGACAATAGAGGGGCTGGCAGAGGCTC--------------------- @ 78560/400001
rat_a: GCTGACAATAGAGGGGCTGGCAGAGACTC--------------------- @ 112658/369938
fug_a: TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG @ 36008/68174

hum_a: CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 57381/400001
mus_a: CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 78610/400001
rat_a: CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG @ 112708/369938
fug_a: TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG @ 36058/68174

hum_a: AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT @ 57431/400001
mus_a: AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT @ 78659/400001
rat_a: AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT @ 112757/369938
fug_a: AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC @ 36084/68174

hum_a: AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 57481/400001
mus_a: AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 78708/400001
rat_a: AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG @ 112806/369938
fug_a: CCGAGGACCCTGA------------------------------------- @ 36097/68174

The Smith-Waterman algorithm
Idea: Ignore badly aligning regions
Modifications to Needleman-Wunsch:
Initialization: F(0, j) = F(i, 0) = 0
Iteration:
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j) - d \\ F(i,j-1) - d \\ F(i-1,j-1) + s(x_i, y_j) \end{cases}$$

The Smith-Waterman algorithm
Termination:
1. If we want the best local alignment
   $F_{OPT} = \max_{i,j} F(i,j)$
2. If we want all local alignments scoring > t
   For all i, j find F(i,j) > t, and trace back
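
Needleman-Wunsch with affine gaps
Initialization:
  F(i, 0) = -(d + (i-1) × e)
  F(0, j) = -(d + (j-1) × e)
Iteration:
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ G(i-1,j-1) + s(x_i, y_j) \end{cases}$$
$$G(i,j) = \max \begin{cases} F(i-1,j) - d \\ F(i,j-1) - d \\ G(i,j-1) - e \\ G(i-1,j) - e \end{cases}$$
Termination: same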

Scoring the gaps more accurately
Simple, linear gap model:
  Gap of length n incurs penalty n×d
However, gaps usually occur in bunches
Convex gap penalty function γ(n):
  for all n, γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)
Algorithm: $O(N^3)$ time, $O(N^2)$ space
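
Why do we need two matrices?
1. $x_i$ aligns to $y_j$
     $x_1 \ldots x_{i-1} \;\; x_i \;\; x_{i+1}$
     $y_1 \ldots y_{j-1} \;\; y_j \;\; -$
   Add -d
2. $x_i$ aligns to a gap
     $x_1 \ldots x_{i-1} \;\; x_i \;\; x_{i+1}$
     $y_1 \ldots y_j \;\; - \;\; -$
   Add -e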

To generalize a little
  ...think of how you would compute optimal alignment with this gap function γ(n)
  ...in time O(MN)
[Figure: plot of the gap function γ(n)]

Bounded Dynamic Programming
Assume we know that x and y are very similar
Assumption: # gaps(x, y) < k(N) (say N > M)
Then, $x_i$ aligned with $y_j$ implies |i - j| < k(N)
We can align x and y more efficiently:
  Time, Space: O(N × k(N)) << $O(N^2)$
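
Bounded Dynamic Programming
Initialization:
  F(i, 0), F(0, j) undefined for i, j > k
Iteration:
  For i = 1…M
    For j = max(1, i - k)…min(N, i + k)
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) - d, & \text{if } j > i - k(N) \\ F(i-1,j) - d, & \text{if } j < i + k(N) \end{cases}$$
Termination: same
Easy to extend to the affine gap case
[Figure: band of width k(N) around the main diagonal of the matrix for $x_1 \ldots x_M$ and $y_1 \ldots y_N$]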
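
A sketch of the banded iteration. Cells outside the band are treated as minus infinity, which plays the role of "undefined" and removes the need for the explicit side conditions. For readability this version still allocates the full matrix, so it shows the O(N × k) time but not the space saving; the scores m = s = d = 1 and the function name are assumptions.

```python
NEG_INF = float("-inf")


def banded_score(x, y, k, m=1, s=1, d=1):
    """Global alignment score restricted to cells with |i - j| <= k."""
    M, N = len(x), len(y)
    F = [[NEG_INF] * (N + 1) for _ in range(M + 1)]
    F[0][0] = 0
    # Initialization: border cells are defined only inside the band.
    for i in range(1, min(M, k) + 1):
        F[i][0] = -i * d
    for j in range(1, min(N, k) + 1):
        F[0][j] = -j * d
    # Iteration: visit only the band around the main diagonal.
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sub = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sub,
                          F[i][j - 1] - d,   # -inf when (i, j-1) is outside the band
                          F[i - 1][j] - d)   # -inf when (i-1, j) is outside the band
    return F[M][N]


print(banded_score("AGTA", "ATA", k=1))
```

The band must be at least as wide as the length difference of the two sequences, otherwise the bottom-right cell lies outside it and the result is minus infinity.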