Approximating The Shortest Superstring Problem: Martin Paluszewski University of Copenhagen
Martin Paluszewski
University of Copenhagen
12/2-2008
Applications
  Sequencing of DNA, Fragment Assembly
  Shotgun Method
  Data Compression
Shortest Superstring Problem
Greedy Algorithm
Set Cover Based Algorithm
Factor 4 Algorithm
Factor 3 Algorithm
Sequencing of DNA
DNA, a string of 4 letters (A, G, C, T)
Virus DNA: 5 × 10^4 letters (base pairs)
Human DNA: 3 × 10^9 letters (base pairs)
Shotgun Method
[Figure: DNA → shotgun → fragments → sequence → assembly]
DNA:
ACTGAACTTACGGGCTAAAGCCATACGAATCCTACGAGA
Fragments:
ACTGAACTTACGGG
AATCCTAC
ACGGGCTAAAGCCATA
TGAACTTACGGGCTAAAG
TAAAGCCATACGAATCCTAC
ACTGAACTTACGGGCTA
ATACGAATCC
TACGAGA
Data Compression
Input: 0010101011101001010111001010100010011100101010010
Output: 0010101110100101010010011
Fragment        index, length (0-based index into the output)
00101           0, 5
0101110100      3, 10
1010111         2, 7
001010100       11, 9
010011          19, 6
100101010010    10, 12
Output of compression:
A shortest superstring
An ordered list of start indices and fragment lengths
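The compression scheme above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `encode` and `decode` are hypothetical, the split of the input into six fragments is the one shown on the slide, and start indices are 0-based positions of the first occurrence found by `str.find`.

```python
def encode(fragments, superstring):
    """Return one (index, length) pair per fragment, locating it in the superstring."""
    pairs = []
    for frag in fragments:
        index = superstring.find(frag)  # 0-based position of the first occurrence
        if index < 0:
            raise ValueError(f"{frag!r} is not a substring of the superstring")
        pairs.append((index, len(frag)))
    return pairs


def decode(superstring, pairs):
    """Rebuild the original input by concatenating the referenced slices."""
    return "".join(superstring[i:i + n] for i, n in pairs)


fragments = ["00101", "0101110100", "1010111",
             "001010100", "010011", "100101010010"]
superstring = "0010101110100101010010011"

pairs = encode(fragments, superstring)
restored = decode(superstring, pairs)
```

Decoding only needs the superstring and the pair list, which is why a shorter superstring gives better compression.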
Shortest Superstring Problem
  Definition
  Example
  Solution Methods
  Outline
Example
Σ = {A, G, C, T}
Input (S):
CATGC
CTAAGT
GCTA
TTCA
ATGCATC
Output:
s = GCTAAGTTCATGCATC
    ........CATGC...
    .CTAAGT.........
    GCTA............
    ......TTCA......
    .........ATGCATC
The problem is NP-hard!
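For an instance this small, the optimum can be checked by brute force. The sketch below is not from the slides: it tries every ordering of the input and merges neighbours with their maximum overlap, which yields a shortest superstring as long as no input string is a substring of another (an assumption that holds for this example). The names `overlap_len` and `shortest_superstring_bruteforce` are made up for the illustration, and the running time is factorial in the number of strings.

```python
from itertools import permutations


def overlap_len(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def shortest_superstring_bruteforce(strings):
    """Exhaustive search over orderings; assumes a substring-free input set."""
    best = None
    for order in permutations(strings):
        merged = order[0]
        for prev, nxt in zip(order, order[1:]):
            # Append only the part of nxt that prev does not already supply.
            merged += nxt[overlap_len(prev, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
s = shortest_superstring_bruteforce(S)
```

On the example input this returns a superstring of length 16, the same length as the output string shown on the slide.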
Solution Methods
Exact Algorithms
Metaheuristics
Approximation Algorithms
Outline
Greedy Algorithm
  Algorithm
  Example
  Approximation Guarantee
Example
Approximation guarantee
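The algorithm slide itself is not reproduced in this transcript. As a stand-in, here is a minimal sketch of the textbook greedy merge: repeatedly replace the two strings with the largest overlap by their merge until one string remains. The function names, the tie-breaking (first maximum found) and the initial removal of strings contained in other strings are assumptions of this sketch.

```python
def overlap_len(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Merge the pair with maximum overlap until a single string is left."""
    pool = list(dict.fromkeys(strings))  # drop duplicates, keep order
    # Strings contained in another string are covered automatically.
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]
    if not pool:
        return ""
    while len(pool) > 1:
        best_k, best_i, best_j = -1, None, None
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap_len(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0]


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
result = greedy_superstring(S)
```

On the running example the merges happen with overlaps 4, 3, 2 and 1, and the result has length 16.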
Set Cover Based Algorithm
  Strategy
  Set Cover
  Example
  Approximation Guarantee
Greedy SCA
Subsets
Cost of a subset
Example
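GCTA......
...ATGCATC
π = GCTATGCATC, set(π) = {GCTA, ATGCATC}, cost 10

TTCA......
...ATGCATC
π = TTCATGCATC, set(π) = {TTCA, ATGCATC, CATGC}, cost 10

GCTA...
.CTAAGT
π = GCTAAGT, set(π) = {GCTA, CTAAGT}, cost 7

TTCA...
..CATGC
π = TTCATGC, set(π) = {CATGC, TTCA}, cost 7

CATGC...
.ATGCATC
π = CATGCATC, set(π) = {CATGC, ATGCATC}, cost 8

π = CATGC, set(π) = {CATGC}, cost 5
π = CTAAGT, set(π) = {CTAAGT}, cost 6
π = GCTA, set(π) = {GCTA}, cost 4
π = TTCA, set(π) = {TTCA}, cost 4
π = ATGCATC, set(π) = {ATGCATC}, cost 7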
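The rows above can be generated mechanically. The following sketch builds one candidate string for every positive overlap of two distinct input strings, plus every input string on its own, takes the covered set to be the inputs contained in the candidate and the cost to be its length, and then runs the usual greedy rule for weighted set cover (lowest cost per newly covered string). The helper names and the tie-breaking are assumptions of this sketch, not taken from the slides.

```python
def candidate_sets(strings):
    """Map each candidate string pi to the set of input strings it contains."""
    candidates = set(strings)  # every input string on its own
    for a in strings:
        for b in strings:
            if a == b:
                continue
            for k in range(1, min(len(a), len(b))):
                if a.endswith(b[:k]):  # a and b overlap in exactly k letters
                    candidates.add(a + b[k:])
    return {pi: frozenset(s for s in strings if s in pi) for pi in candidates}


def greedy_set_cover_superstring(strings):
    """Greedy weighted set cover; the chosen strings are concatenated."""
    sets = candidate_sets(strings)
    uncovered = set(strings)
    chosen = []
    while uncovered:
        useful = [pi for pi in sets if sets[pi] & uncovered]
        # Cost-effectiveness: length divided by number of newly covered strings.
        pi = min(useful, key=lambda p: (len(p) / len(sets[p] & uncovered), p))
        chosen.append(pi)
        uncovered -= sets[pi]
    return "".join(chosen), chosen


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
sets = candidate_sets(S)
superstring, chosen = greedy_set_cover_superstring(S)
```

The concatenation is a valid superstring because every input string is contained in at least one chosen candidate.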
Approximation
Greedy SCA
Lemma
$\mathrm{OPT}_{\mathrm{SSP}} \le \mathrm{OPT}_{\mathrm{SCP}} \le 2 \cdot \mathrm{OPT}_{\mathrm{SSP}}$
Approximation
Proof
First inequality: $\mathrm{OPT}_{\mathrm{SSP}} \le \mathrm{OPT}_{\mathrm{SCP}}$
Let s be the string obtained from an optimal solution to the set cover problem. Then
$\mathrm{OPT}_{\mathrm{SCP}} = |s|$
Since s covers all strings, s is a valid solution to the SSP and
$|s| \ge \mathrm{OPT}_{\mathrm{SSP}}$
Approximation
Proof, continued
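Second inequality: $\mathrm{OPT}_{\mathrm{SCP}} \le 2 \cdot \mathrm{OPT}_{\mathrm{SSP}}$
Show that a set cover satisfying the inequality can always be constructed.
[Figure: a superstring s with the strings $s_{b_1}, s_{e_1}, s_{b_2}, s_{e_2}, s_{b_3}, s_{e_3}$ marked along it, and the substrings $\pi_1, \pi_2, \pi_3$ of s that span them]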
Approximation
Proof, continued
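$\mathrm{set}(\pi_1), \mathrm{set}(\pi_2), \ldots, \mathrm{set}(\pi_t)$ is a solution to SCP (not necessarily optimal).
Each input string is covered by at most two π strings ($\pi_i$ cannot overlap with $\pi_{i+2}$).
$\sum_i |\pi_i| \le 2 \cdot \mathrm{OPT}_{\mathrm{SSP}}$
$\mathrm{OPT}_{\mathrm{SCP}} \le 2 \cdot \mathrm{OPT}_{\mathrm{SSP}}$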
Approximation
Proof, continued
$\mathrm{GREEDY}_{\mathrm{SCP}} \le H_n \cdot \mathrm{OPT}_{\mathrm{SCP}}$
$H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$
(Vijay V. Vazirani)
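Chaining the greedy set cover bound with the lemma gives the guarantee for the whole algorithm. The derivation below assumes that the algorithm outputs the concatenation of the chosen strings, so that the length of the output equals the cost of the greedy cover.

```latex
\[
  \mathrm{GREEDY}_{\mathrm{SCP}}
  \;\le\; H_n \cdot \mathrm{OPT}_{\mathrm{SCP}}
  \;\le\; 2 H_n \cdot \mathrm{OPT}_{\mathrm{SSP}}
\]
```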
Factor 4 Algorithm
  Definitions
  Prefix Graph
  Cycle Cover
  Example
  Approximation Guarantee
Some definitions
overlap(s_i, s_j): the maximum overlap between s_i and s_j, i.e. the longest suffix of s_i that is also a prefix of s_j.
s_i      ACGGCTAT....
s_j      ....CTATTAGC
overlap  ....CTAT....
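A small sketch of this definition, together with the companion function prefix(s_i, s_j) that the later proofs use. Two assumptions are made here: prefix(s_i, s_j) is taken to be s_i with its overlap with s_j removed (the usual definition, not spelled out in this transcript), and the overlap is required to be shorter than both strings.

```python
def overlap(si, sj):
    """Longest proper suffix of si that is also a prefix of sj."""
    for k in range(min(len(si), len(sj)) - 1, 0, -1):
        if si[-k:] == sj[:k]:
            return sj[:k]
    return ""


def prefix(si, sj):
    """si with its overlap with sj cut off the end (assumed definition)."""
    return si[:len(si) - len(overlap(si, sj))]


o = overlap("ACGGCTAT", "CTATTAGC")
p = prefix("ACGGCTAT", "CTATTAGC")
```

With these definitions, prefix(s_i, s_j) followed by s_j is the shortest string that starts with s_i and ends with s_j.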
Example
s: TCTAAGTTCATGCATC
1 TCTA............
2 .CTAAGT.........
3 ......TTCA......
4 ........CATGC...
5 .........ATGCATC
1 ..............TCTA
Prefix Graph
[Figure: directed graph on the vertices CATGC, TTCA, ATGCAT, GCTA, CTAAGT, shown first without and then with a weight on every edge]
$\mathrm{OPT}_{\mathrm{TSP}} \le \mathrm{OPT}_{\mathrm{SSP}}$
[Figure: the weighted prefix graph with two cycles marked C1 and C2]
Algorithm (factor 4)
Example
[Figure: the weighted prefix graph with the cycles C1 and C2 marked]
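The algorithm slide is not reproduced in this transcript, so the following is only a sketch of the cycle-cover approach as described in Vazirani's book, under several assumptions: the cycle cover is found by brute force over all permutations (a real implementation would solve an assignment problem), self-loops are allowed, the representative of each cycle is simply its first string, and all function names are made up for the illustration.

```python
from itertools import permutations


def overlap_len(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def min_cycle_cover(strings):
    """Brute-force minimum weight cycle cover of the prefix graph."""
    n = len(strings)
    # w[i][j] = |prefix(s_i, s_j)|, the weight of edge i -> j.
    w = [[len(a) - overlap_len(a, b) for b in strings] for a in strings]
    succ = min(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))
    cycles, seen = [], set()
    for start in range(n):
        cycle, i = [], start
        while i not in seen:
            seen.add(i)
            cycle.append(i)
            i = succ[i]
        if cycle:
            cycles.append(cycle)
    return cycles, w


def cycle_cover_superstring(strings):
    """Concatenate sigma(c) = alpha(c) + r over all cycles c of the cover."""
    cycles, w = min_cycle_cover(strings)
    pieces = []
    for cycle in cycles:
        sigma = ""
        for i, j in zip(cycle, cycle[1:] + cycle[:1]):
            sigma += strings[i][:w[i][j]]  # prefix(s_i, s_j) along the cycle
        sigma += strings[cycle[0]]         # close with the representative r
        pieces.append(sigma)
    return "".join(pieces)


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
result = cycle_cover_superstring(S)
```

Each piece has length w(c) + |r|, which is the quantity bounded in the approximation guarantee.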
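$\mathrm{ALG}_{\mathrm{CC}} \le 4 \cdot \mathrm{OPT}_{\mathrm{SSP}}$
Proof (sketch):
Lemma 1
If each string in $S' \subseteq S$ is a substring of $t^\infty$ for a string $t$, then there is a cycle of weight at most $|t|$ in the prefix graph covering all vertices corresponding to strings in $S'$.
Proof (sketch)
[Figure: $t^\infty = t\,t\,t \cdots$ with the occurrences of three strings a, b, c marked]
- Sort the strings by their first occurrence in $t^\infty$ (here: a, c, b)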
[Figure: the strings a, c, b arranged as the cycle a → c → b → a]
Cost(a → c → b → a) ≤ |t|
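The sorting argument can be tried out on a concrete case. The sketch below uses hypothetical data (t = AGT and three strings labelled a, b, c to match the figure), sorts the strings by their first occurrence in a long enough prefix of t^∞, and sums the prefix-graph weights around the resulting cycle. It assumes that no string is a substring of another.

```python
def overlap_len(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def cycle_from_period(t, strings):
    """Order strings by first occurrence in t^infinity and weigh the cycle."""
    longest = max(len(s) for s in strings)
    # Enough copies of t that every first occurrence fits completely.
    unrolled = t * (2 + longest // len(t))
    order = sorted(strings, key=unrolled.index)  # ValueError if not a substring
    weight = sum(len(x) - overlap_len(x, y)
                 for x, y in zip(order, order[1:] + order[:1]))
    return order, weight


t = "AGT"
a, b, c = "AGTA", "TAGT", "GTAG"
order, weight = cycle_from_period(t, [a, b, c])
```

Here the sorted order is a, c, b, and the cycle a → c → b → a has weight 3 = |t|.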
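Lemma 2
Let C be an optimal cycle cover. Let c and c' be two cycles in C, and let r, r' be strings from each cycle. Then
$|\mathrm{overlap}(r, r')| < w(c) + w(c')$
[Figure: two cycles C and C', each of weight w = 3. C visits AGTAGTAGT, GTAGTAGTA, TAGTAGTAG with edge labels A, G, T; C' visits CAGCAGCAG, AGCAGCAGC, GCAGCAGCA with edge labels C, A, G.]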
Proof
Suppose $|\mathrm{overlap}(r, r')| \ge w(c) + w(c')$.
$\alpha(c) = \mathrm{prefix}(r, s_{i_2}) \circ \cdots \circ \mathrm{prefix}(s_{i_{l-1}}, s_{i_l}) \circ \mathrm{prefix}(s_{i_l}, r)$
r is a substring of $\alpha(c)^\infty$, so overlap(r, r') is also a substring of $\alpha(c)^\infty$.
$|\mathrm{overlap}(r, r')| \ge w(c)$, so the overlap consists of repeated copies of $\alpha(c)$.
[Figure: overlap(r, r') drawn as consecutive copies of α]
[Figure: overlap(r, r') read from r as copies of α and from r' as copies of α'; comparing the two readings gives]
$\alpha' \circ \alpha = \alpha \circ \alpha'$
Proof, continued
α and α' commute, so
$\alpha^k \circ \alpha'^k = \alpha'^k \circ \alpha^k$
for all k > 0, so
$\alpha^\infty = \alpha'^\infty$
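The step from commuting strings to equal infinite powers can be checked on small cases. In this sketch the infinite strings are compared through a finite test: α^∞ = α'^∞ exactly when α repeated |α'| times equals α' repeated |α| times, since both have length |α|·|α'|. The function names and the sample strings are assumptions for the illustration.

```python
def commute(a, b):
    """True if a followed by b equals b followed by a."""
    return a + b == b + a


def same_infinite_power(a, b):
    """True if a^infinity equals b^infinity, via a common-length comparison."""
    return a * len(b) == b * len(a)


alpha, alpha_prime = "AGTAGT", "AGT"

# The k-th powers commute as well, for a few values of k.
powers_commute = all(
    alpha * k + alpha_prime * k == alpha_prime * k + alpha * k
    for k in range(1, 5)
)
```

For strings that do not commute, such as AGT and CAG, both checks fail.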
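Proof, continued
All strings from c and c' are substrings of $\alpha^\infty$.
There is a cycle covering all strings in c and c' with weight at most $|\alpha|$ (Lemma 1).
Contradiction: $|\alpha| = w(c) < w(c) + w(c')$, so C would not be optimal.
[Figure: the representative strings $r_1, \ldots, r_i, r_{i+1}, \ldots, r_n$]
$\mathrm{OPT}_{\mathrm{SSP}} \ge \sum_{i=1}^{k} |r_i| - \sum_{i=1}^{k-1} |\mathrm{overlap}(r_i, r_{i+1})|$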
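Lemma 2:
$|\mathrm{overlap}(r, r')| < w(c) + w(c')$
$\mathrm{OPT}_{\mathrm{SSP}} \ge \sum_{i=1}^{k} |r_i| - 2 \sum_{i=1}^{k} w(c_i)$
$\sum_{i=1}^{k} |r_i| \le \mathrm{OPT}_{\mathrm{SSP}} + 2 \sum_{i=1}^{k} w(c_i) \le 3 \cdot \mathrm{OPT}_{\mathrm{SSP}}$
$\mathrm{ALG} = \sum_{i=1}^{k} |\sigma(c_i)| = w(C) + \sum_{i=1}^{k} |r_i| \le 4 \cdot \mathrm{OPT}_{\mathrm{SSP}}$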
Factor 3 Algorithm
  Algorithm
  Approximation Guarantee
Compression
$\|S\| = \sum_{x \in S} |x|$
Compression $= \|S\| - |s|$
Max. compression ↔ min. superstring
Lemma: The greedy superstring algorithm achieves at least half the optimal compression:
$\|S\| - \mathrm{GREEDY} \ge \frac{1}{2} \left( \|S\| - \mathrm{OPT} \right)$
Lemma 1
$|\tau| \le \mathrm{OPT}_\sigma + w(C)$
Proof
Assume $\sigma(c_1), \ldots, \sigma(c_k)$ appear in this order in a shortest superstring of $S_\sigma$.
Maximum compression is:
$\|S_\sigma\| - \mathrm{OPT}_\sigma = \sum_{i=1}^{k-1} |\mathrm{overlap}(\sigma(c_i), \sigma(c_{i+1}))|$
Lemma 2
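$\mathrm{OPT}_\sigma \le \mathrm{OPT}_{\mathrm{SSP}} + w(C)$
Proof
Let $S_r = \{r_1, \ldots, r_k\}$ (representative strings).
Each $\sigma(c_i)$ begins and ends with $r_i$, so
$\|S_\sigma\| - \mathrm{OPT}_\sigma \ge \|S_r\| - \mathrm{OPT}_r$
$\|S_\sigma\| = \|S_r\| + w(C)$
$\mathrm{OPT}_\sigma \le \mathrm{OPT}_r + w(C) \le \mathrm{OPT}_{\mathrm{SSP}} + w(C)$, since $S_r \subseteq S$.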
Lemma 1
$|\tau| \le \mathrm{OPT}_\sigma + w(C)$
Lemma 2
$\mathrm{OPT}_\sigma \le \mathrm{OPT}_{\mathrm{SSP}} + w(C)$
$|\tau| \le 3 \cdot \mathrm{OPT}_{\mathrm{SSP}}$
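The final bound follows by chaining the two lemmas. The derivation below also uses $w(C) \le \mathrm{OPT}_{\mathrm{SSP}}$, which is not restated on this slide; it is assumed here from the earlier comparison $\mathrm{OPT}_{\mathrm{TSP}} \le \mathrm{OPT}_{\mathrm{SSP}}$, since a minimum weight cycle cover costs no more than a minimum weight tour.

```latex
\[
  |\tau|
  \;\le\; \mathrm{OPT}_\sigma + w(C)
  \;\le\; \mathrm{OPT}_{\mathrm{SSP}} + 2\,w(C)
  \;\le\; 3 \cdot \mathrm{OPT}_{\mathrm{SSP}}
\]
```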