
Approximating The Shortest Superstring Problem: Martin Paluszewski University of Copenhagen

This document discusses algorithms for approximating the shortest superstring problem. It begins by defining the shortest superstring problem and providing an example. It then outlines several approximation algorithms that have been developed, including a greedy algorithm, a set cover based algorithm, and factor 4 and factor 3 algorithms. The greedy algorithm is described in more detail, showing the steps and providing an example of how it works. The document states that the greedy algorithm has an approximation guarantee of 4 times the optimal solution.


Applications

Shortest Superstring Problem


Greedy Algorithm
Set Cover Based Algorithm
Factor 4 Algorithm
Factor 3 Algorithm

Approximating the Shortest Superstring Problem

Martin Paluszewski
University of Copenhagen

12/2-2008


Sequencing of DNA
DNA: a string over 4 letters (A, G, C, T)
Virus DNA: $5 \times 10^4$ letters (base pairs)
Human DNA: $3 \times 10^9$ letters (base pairs)


Shotgun Method

DNA → shotgun → fragments → sequence assembly

DNA:
ACTGAACTTACGGGCTAAAGCCATACGAATCCTACGAGA

Fragments:
ACTGAACTTACGGG
AATCCTAC
ACGGGCTAAAGCCATA
TGAACTTACGGGCTAAAG
TAAAGCCATACGAATCCTAC
ACTGAACTTACGGGCTA
ATACGAATCC
TACGAGA

Problem: How can we automate this?



Data Compression

Input: 0010101011101001010111001010100010011100101010010

Output: 0010101110100101010010011

Fragment        Index (0-based)   Length
00101           0                 5
0101110100      3                 10
1010111         2                 7
001010100       11                9
010011          19                6
100101010010    10                12

Output of compression:
A shortest superstring
An ordered list of start indices and fragment lengths.
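
A minimal sketch of the decompression step, assuming the compressed output is the superstring plus an ordered list of (index, length) pairs. The name `decompress` is hypothetical, and the offsets used here are 0-based positions recomputed from the superstring shown on this slide.

```python
def decompress(superstring, entries):
    """Rebuild the original data from a superstring and (index, length) pairs."""
    # Each entry points at one fragment inside the superstring;
    # concatenating the fragments in order restores the input.
    return "".join(superstring[i:i + n] for i, n in entries)


superstring = "0010101110100101010010011"
# 0-based start offsets (an assumption of this sketch) and fragment lengths.
entries = [(0, 5), (3, 10), (2, 7), (11, 9), (19, 6), (10, 12)]
original = decompress(superstring, entries)
```

The 25-character superstring plus six pairs replaces the 49-character input; whether that saves space in practice depends on how the pairs are encoded.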

Shortest Superstring Problem (SSP)


Given a finite alphabet $\Sigma$ and a set of $n$ strings $S = \{s_1, \ldots, s_n\} \subseteq \Sigma^+$,
find a shortest string $s$ that contains each $s_i$ as a substring.
Without loss of generality, we may assume that no string $s_i$ is a substring
of another string $s_j$, $j \neq i$.
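
The two conditions in the definition can be sketched as small helpers. The names `drop_contained` and `is_superstring` are hypothetical; this is only an illustration of the "without loss of generality" preprocessing and of the superstring check, not part of the original slides.

```python
def drop_contained(strings):
    """Remove duplicates and any string that occurs inside another input string."""
    unique = list(dict.fromkeys(strings))  # keeps first occurrences, preserves order
    return [s for s in unique
            if not any(s != t and s in t for t in unique)]


def is_superstring(candidate, strings):
    """True if every input string is a substring of candidate."""
    return all(s in candidate for s in strings)


cleaned = drop_contained(["CATGC", "ATG", "GCTA", "CATGC"])
```

Dropping contained strings does not change the optimum: any superstring of the remaining strings already contains the dropped ones.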


Example

$\Sigma = \{A, G, C, T\}$

Input (S)
CATGC
CTAAGT
GCTA
TTCA
ATGCATC

Output
s=GCTAAGTTCATGCATC
  ........CATGC...
  .CTAAGT.........
  GCTA............
  ......TTCA......
  .........ATGCATC

NP-hard!

Solution Methods
Exact Algorithms
Metaheuristics
Approximation Algorithms


Outline

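1 Greedy, conjecture: $\mathrm{ALG} \le 2 \cdot \mathrm{OPT}$
2 Set cover, $\mathrm{ALG} \le 2 \cdot H_n \cdot \mathrm{OPT}$, where $H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$
3 Cycle cover, $\mathrm{ALG} \le 4 \cdot \mathrm{OPT}$
4 Cycle cover (greedy), $\mathrm{ALG} \le 3 \cdot \mathrm{OPT}$
Best known: $\mathrm{ALG} \le 2.5 \cdot \mathrm{OPT}$ (Sweedyk)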


Greedy Algorithm

Greedy Shortest Superstring
input: A set of strings S.
output: A short superstring of S.
T ← S
while |T| > 1 do
    Let a and b be the two most overlapping strings of T
    Replace a and b with the string obtained by overlapping a and b
end while
T contains a superstring of S
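
The pseudocode above can be sketched in Python as follows. The function names are hypothetical, ties between equally overlapping pairs are broken by taking the first pair found (the slides do not specify a rule), and the input is assumed to be non-empty with no string contained in another.

```python
from itertools import permutations


def overlap_len(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    # Assumption: strings is non-empty and no string is a substring of another.
    t = list(strings)
    while len(t) > 1:
        # Ordered pair (a, b) with the largest overlap; first pair wins on ties.
        k, i, j = max(((overlap_len(a, b), i, j)
                       for (i, a), (j, b) in permutations(enumerate(t), 2)),
                      key=lambda x: x[0])
        merged = t[i] + t[j][k:]  # overlap a and b by k letters
        t = [s for idx, s in enumerate(t) if idx not in (i, j)] + [merged]
    return t[0]


result = greedy_superstring(["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"])
# result == "GCTAAGTTCATGCATC"
```

This sketch recomputes all pairwise overlaps in every round, which is simple but slow; it is meant for small inputs only.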


Example

S = T = {CATGC, CTAAGT, GCTA, TTCA, ATGCATC}


T = {CATGCATC, CTAAGT, GCTA, TTCA}
T = {CATGCATC, GCTAAGT, TTCA}
T = {TTCATGCATC, GCTAAGT}
T = {GCTAAGTTCATGCATC}


Approximation guarantee

$\mathrm{ALG} \le 4 \cdot \mathrm{OPT}$ (proved by Blum et al.)
$\mathrm{ALG} \le 2 \cdot \mathrm{OPT}$ (conjectured)

Conjectured worst case
$S = \{ab^k, b^k c, b^{k+1}\}$
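
A short worked calculation of why this instance approaches ratio 2, assuming greedy breaks the three-way tie in the worst possible way (the slides do not state a tie-breaking rule):

$$|\mathrm{overlap}(ab^k, b^k c)| = |\mathrm{overlap}(ab^k, b^{k+1})| = |\mathrm{overlap}(b^{k+1}, b^k c)| = k$$

If greedy first merges $ab^k$ and $b^k c$ into $ab^k c$, the remaining string $b^{k+1}$ has no overlap with it in either direction, so

$$\mathrm{ALG} = |ab^k c| + |b^{k+1}| = (k + 2) + (k + 1) = 2k + 3$$

The optimal superstring is $ab^{k+1}c$, so

$$\mathrm{OPT} = k + 3, \qquad \frac{\mathrm{ALG}}{\mathrm{OPT}} = \frac{2k + 3}{k + 3} \to 2 \quad (k \to \infty)$$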


Approximating SSP Using Set Cover

Shortest Superstring Problem (SSP) → Set Cover Problem (SCP) → Greedy SCA


Set Cover Problem:

Choose some sets that cover all elements with least cost
Elements
    The input strings
Subsets
    $\sigma_{ijk}$ = string obtained by overlapping input strings $s_i$ and $s_j$ by $k$ letters.
    $\beta = S \cup \{\sigma_{ijk} : \text{all } i, j, k\}$
    Let $\pi \in \beta$
    $\mathrm{set}(\pi) = \{s \in S \mid s \text{ is a substring of } \pi\}$
Cost of a subset
    The cost of $\mathrm{set}(\pi)$ is $|\pi|$
A solution to SSP is the concatenation of the strings $\pi$ obtained from SCP
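
The construction above can be sketched as follows. The name `build_candidates` is hypothetical, and the sketch assumes that only overlaps of $k \ge 1$ letters are used and that no input string is contained in another.

```python
from itertools import permutations


def build_candidates(strings):
    """Return {pi: (set(pi), cost)} for every pi in beta."""
    candidates = set(strings)  # beta contains the input strings themselves
    for a, b in permutations(strings, 2):
        for k in range(1, min(len(a), len(b))):
            if a.endswith(b[:k]):            # last k letters of a = first k letters of b
                candidates.add(a + b[k:])    # sigma_ijk
    return {pi: (frozenset(s for s in strings if s in pi), len(pi))
            for pi in candidates}


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
instance = build_candidates(S)
```

For example, `instance["CATGCTAAGT"]` is the set {CATGC, CTAAGT, GCTA} with cost 10, since GCTA happens to occur inside the overlap of CATGC and CTAAGT.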


Example

S = {CATGC, CTAAGT, GCTA, TTCA, ATGCATC}

π              Formed from        k   Set                     Cost
CATGCTAAGT     CATGC + CTAAGT     1   CATGC, CTAAGT, GCTA     10
CATGCTA        CATGC + GCTA       2   CATGC, GCTA             7
ATGCATCATGC    ATGCATC + CATGC    1   CATGC, ATGCATC          11
CTAAGTTCA      CTAAGT + TTCA      1   CTAAGT, TTCA            9
ATGCATCTAAGT   ATGCATC + CTAAGT   1   CTAAGT, ATGCATC         12
GCTATGCATC     GCTA + ATGCATC     1   GCTA, ATGCATC           10
TTCATGCATC     TTCA + ATGCATC     1   TTCA, ATGCATC, CATGC    10
GCTAAGT        GCTA + CTAAGT      3   GCTA, CTAAGT            7
TTCATGC        TTCA + CATGC       2   CATGC, TTCA             7
CATGCATC       CATGC + ATGCATC    4   CATGC, ATGCATC          8
CATGC          (input string)         CATGC                   5
CTAAGT         (input string)         CTAAGT                  6
GCTA           (input string)         GCTA                    4
TTCA           (input string)         TTCA                    4
ATGCATC        (input string)         ATGCATC                 7
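
A minimal sketch of the greedy set cover algorithm on a few rows of this table. The name `greedy_set_cover` is hypothetical; the rule used (pick the subset with the lowest cost per newly covered element) is the standard greedy rule for weighted set cover, and the sketch assumes the subsets can cover the whole universe.

```python
def greedy_set_cover(universe, subsets):
    """subsets: {name: (covered_elements, cost)}; returns the chosen names in order."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Most cost-effective subset: lowest cost per newly covered element.
        name = min((n for n, (elems, _) in subsets.items() if elems & uncovered),
                   key=lambda n: subsets[n][1] / len(subsets[n][0] & uncovered))
        chosen.append(name)
        uncovered -= subsets[name][0]
    return chosen


S = ["CATGC", "CTAAGT", "GCTA", "TTCA", "ATGCATC"]
# A subset of the rows above, enough to cover every input string.
table = {
    "CATGCTAAGT": ({"CATGC", "CTAAGT", "GCTA"}, 10),
    "TTCATGCATC": ({"TTCA", "ATGCATC", "CATGC"}, 10),
    "GCTAAGT": ({"GCTA", "CTAAGT"}, 7),
    "CATGCATC": ({"CATGC", "ATGCATC"}, 8),
    "TTCA": ({"TTCA"}, 4),
    "ATGCATC": ({"ATGCATC"}, 7),
}
picked = greedy_set_cover(S, table)
superstring = "".join(picked)  # concatenation of the chosen strings pi
```

The concatenation is always a valid superstring, but its length depends on how ties are broken and is in general longer than the optimum.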


Approximation

Shortest Superstring Problem (SSP) → Set Cover Problem (SCP) → Greedy SCA

Lemma
$\mathrm{OPT}_{SSP} \le \mathrm{OPT}_{SCP} \le 2 \cdot \mathrm{OPT}_{SSP}$


Proof
First inequality: $\mathrm{OPT}_{SSP} \le \mathrm{OPT}_{SCP}$
Let $s$ be the string obtained from an optimal solution to the set
cover problem. Then
$\mathrm{OPT}_{SCP} = |s|$
Since $s$ covers all strings, $s$ is a valid solution to the SSP and:
$|s| \ge \mathrm{OPT}_{SSP}$


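Proof, continued
Second inequality: $\mathrm{OPT}_{SCP} \le 2 \cdot \mathrm{OPT}_{SSP}$
Show that some set cover can always be constructed such that the
inequality holds.
[Figure: an optimal superstring $s$ with the input strings $s_{b_1}, s_{e_1}, s_{b_2}, s_{e_2}, s_{b_3}, s_{e_3}$ marked along it, and the strings $\pi_1, \pi_2, \pi_3$ spanning them.]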


Proof, continued
$\mathrm{set}(\pi_1), \mathrm{set}(\pi_2), \ldots, \mathrm{set}(\pi_t)$ is a solution to SCP (not
necessarily optimal).
Each position of $s$ is covered by at most two $\pi$ strings ($\pi_i$
cannot overlap with $\pi_{i+2}$).
$\sum_i |\pi_i| \le 2 \cdot \mathrm{OPT}_{SSP}$
$\mathrm{OPT}_{SCP} \le 2 \cdot \mathrm{OPT}_{SSP}$


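Proof, continued
$\mathrm{GREEDY}_{SCP} \le H_n \cdot \mathrm{OPT}_{SCP}$, where $H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$
(Vijay V. Vazirani)

$\mathrm{GREEDY}_{SCP} \le 2 H_n \cdot \mathrm{OPT}_{SSP}$

$H_{10} \approx 2.9$, $H_{100} \approx 5.2$, $H_{1000} \approx 7.5$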
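
The harmonic numbers quoted above can be checked with a few lines; `harmonic` is a hypothetical helper name.

```python
def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / i for i in range(1, n + 1))


values = {n: round(harmonic(n), 1) for n in (10, 100, 1000)}
# values == {10: 2.9, 100: 5.2, 1000: 7.5}
```

Since $H_n$ grows like $\ln n$, the factor $2 H_n$ keeps growing with the number of input strings, unlike the constant factors of the cycle cover algorithms.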


Factor 4 Algorithm (cycle cover)

Some definitions
overlap($s_i$, $s_j$): the maximum overlap between $s_i$ and $s_j$, i.e. the longest suffix of $s_i$ that is a prefix of $s_j$

i        ACGGCTAT....
j        ....CTATTAGC
overlap  ....CTAT....

prefix($s_i$, $s_j$): the first letters of $s_i$, with overlap($s_i$, $s_j$) removed

i        ACGGCTAT....
j        ....CTATTAGC
prefix   ACGG........
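
The two definitions can be sketched as follows. The sketch assumes the overlap is proper (shorter than both strings), so that it also makes sense for a string with itself; that restriction is an assumption here, not something stated on the slide.

```python
def overlap(a, b):
    """Longest suffix of a that is a prefix of b, shorter than both strings."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a[-k:]
    return ""


def prefix(a, b):
    """The part of a that remains once overlap(a, b) is cut off its end."""
    return a[:len(a) - len(overlap(a, b))]


o = overlap("ACGGCTAT", "CTATTAGC")   # "CTAT"
p = prefix("ACGGCTAT", "CTATTAGC")    # "ACGG"
```

By construction `prefix(a, b) + overlap(a, b) == a`, which is the identity the later slides rely on.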


Important property of a shortest superstring s


$s = \mathrm{prefix}(s_1, s_2) \circ \mathrm{prefix}(s_2, s_3) \circ \cdots \circ \mathrm{prefix}(s_n, s_1) \circ \mathrm{overlap}(s_n, s_1)$

Example
s: TCTAAGTTCATGCATC
1  TCTA............
2  .CTAAGT.........
3  ......TTCA......
4  ........CATGC...
5  .........ATGCATC
1  ..............TCTA
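
The decomposition can be checked on this example with a short sketch. The helper names are hypothetical, and the overlap is taken to be the longest proper suffix/prefix match.

```python
def overlap(a, b):
    """Longest suffix of a that is a prefix of b, shorter than both strings."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return a[-k:]
    return ""


def prefix(a, b):
    return a[:len(a) - len(overlap(a, b))]


def cyclic_decomposition(order):
    """prefix(s1,s2) ∘ ... ∘ prefix(sn,s1) ∘ overlap(sn,s1) for strings in the given order."""
    n = len(order)
    parts = [prefix(order[i], order[(i + 1) % n]) for i in range(n)]
    return "".join(parts) + overlap(order[-1], order[0])


s = cyclic_decomposition(["TCTA", "CTAAGT", "TTCA", "CATGC", "ATGCATC"])
# Pieces: T, CTAAG, TT, C, ATGCA, then the closing overlap TC.
```

The prefixes alone have total length 14, which is the weight of the corresponding tour in the prefix graph; the final overlap TC brings the length to 16.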


Prefix Graph

[Figure: the prefix graph, a complete directed graph on the vertices CATGC, TTCA, ATGCAT, GCTA and CTAAGT, in which each edge $s_i \to s_j$ has weight $|\mathrm{prefix}(s_i, s_j)|$.]


TSP in Prefix Graph


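[Figure: the weighted prefix graph on CATGC, TTCA, ATGCAT, GCTA and CTAAGT.]

$\mathrm{OPT}_{TSP} = |\mathrm{prefix}(s_a, s_b)| + |\mathrm{prefix}(s_b, s_c)| + \cdots + |\mathrm{prefix}(s_n, s_a)|$

$\mathrm{OPT}_{SSP} = |\mathrm{prefix}(s_1, s_2)| + |\mathrm{prefix}(s_2, s_3)| + \cdots + |\mathrm{prefix}(s_n, s_1)| + |\mathrm{overlap}(s_n, s_1)|$

Lower bound:

$\mathrm{OPT}_{TSP} \le \mathrm{OPT}_{SSP}$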


TSP is NP-hard → cycle cover problem

[Figure: the weighted prefix graph covered by two cycles, C1 and C2.]

$\mathrm{OPT}_{CCP} \le \mathrm{OPT}_{TSP} \le \mathrm{OPT}_{SSP}$


Algorithm (factor 4)

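1 Construct the prefix graph corresponding to the strings in S
2 Find a minimum weight cycle cover of the prefix graph, $C = \{c_1, \ldots, c_k\}$
3 For each cycle $c_i \in C$, arbitrarily choose a representative string $s_{i_1}$
4 For each cycle $c_i = (s_{i_1} \to s_{i_2} \to \cdots \to s_{i_l} \to s_{i_1}) \in C$, construct:
  $\alpha(c_i) = \mathrm{prefix}(s_{i_1}, s_{i_2}) \circ \cdots \circ \mathrm{prefix}(s_{i_{l-1}}, s_{i_l}) \circ \mathrm{prefix}(s_{i_l}, s_{i_1})$
5 For each cycle $c_i \in C$, construct:
  $\sigma(c_i) = \alpha(c_i) \circ s_{i_1}$
6 Output $\sigma(c_1) \circ \cdots \circ \sigma(c_k)$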
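
The six steps can be sketched for small inputs as follows. All function names are hypothetical. The minimum weight cycle cover is found here by brute force over all permutations, which is only a stand-in for a polynomial-time assignment algorithm, and self-loops are allowed in the prefix graph (an assumption of this sketch).

```python
from itertools import permutations


def overlap_len(a, b):
    """Longest proper overlap: suffix of a equal to a prefix of b, shorter than both."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def min_cycle_cover(strings):
    """Brute force: every permutation i -> p[i] is a cycle cover of the prefix graph."""
    n = len(strings)
    w = [[len(a) - overlap_len(a, b) for b in strings] for a in strings]  # step 1
    best = min(permutations(range(n)),
               key=lambda p: sum(w[i][p[i]] for i in range(n)))           # step 2
    cycles, seen = [], set()
    for start in range(n):
        if start in seen:
            continue
        cycle, v = [], start
        while v not in seen:
            seen.add(v)
            cycle.append(v)
            v = best[v]
        cycles.append(cycle)
    return cycles


def cycle_cover_superstring(strings):
    out = []
    for cycle in min_cycle_cover(strings):
        names = [strings[i] for i in cycle]
        rep = names[0]                                   # step 3: representative
        alpha = "".join(a[:len(a) - overlap_len(a, b)]   # step 4: alpha(c)
                        for a, b in zip(names, names[1:] + [rep]))
        out.append(alpha + rep)                          # step 5: sigma(c)
    return "".join(out)                                  # step 6


S = ["CATGC", "TTCA", "ATGCAT", "GCTA", "CTAAGT"]
s = cycle_cover_superstring(S)
```

The brute-force search visits n! permutations, so this sketch is only usable for a handful of strings.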


Example
[Figure: the weighted prefix graph with the cycle cover C1 = (CATGC → ATGCAT → CATGC) and C2 = (GCTA → CTAAGT → TTCA → GCTA).]

$\sigma(c_1)$ = C ◦ ATG ◦ CATGC
$\sigma(c_2)$ = G ◦ CTAAG ◦ TTCA ◦ GCTA
s = C ◦ ATG ◦ CATGC ◦ G ◦ CTAAG ◦ TTCA ◦ GCTA

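$\mathrm{ALG}_{CC} \le 4 \cdot \mathrm{OPT}_{SSP}$

Proof (sketch):

(1) $s = \sigma(c_1) \circ \cdots \circ \sigma(c_k)$
(2) $s = \alpha(c_1) \circ s_{1_1} \circ \cdots \circ \alpha(c_k) \circ s_{k_1}$
(3) $|s| = |\alpha(c_1) \circ \cdots \circ \alpha(c_k)| + |s_{1_1} \circ \cdots \circ s_{k_1}|$

Since $\mathrm{OPT}_{CCP} = |\alpha(c_1) \circ \cdots \circ \alpha(c_k)|$, the $\alpha$ strings are bounded by $\mathrm{OPT}_{SSP}$:
$\mathrm{OPT}_{CCP} \le \mathrm{OPT}_{TSP} \le \mathrm{OPT}_{SSP}$
We need to show that the concatenation of the representative strings
is bounded by $3 \cdot \mathrm{OPT}_{SSP}$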

Lemma 1
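If each string in $S' \subseteq S$ is a substring of $t^\infty$ for a string $t$, then there is a cycle
of weight at most $|t|$ in the prefix graph covering all vertices corresponding to
strings in $S'$.

Proof (sketch)
[Figure: the strings a, b, c drawn as substrings of $t^\infty = t\,t\,t \cdots$.]
- Sort the strings by their starting position in $t^\infty$
[Figure: the same strings along $t^\infty$ in sorted order a, c, b.]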


Proof (sketch), continued


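[Figure: the strings a, c, b along $t^\infty$ in sorted order, followed by the next occurrences of b and a one period later.]

$\mathrm{Cost}(a \to c \to b \to a) \le |t|$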


Lemma 2
Let C be an optimal cycle cover. Let c and c' be two cycles in C,
and let r, r' be strings from each cycle. Then

$|\mathrm{overlap}(r, r')| < w(c) + w(c')$

[Figure: two cycles, each of weight w = 3. Cycle C contains AGTAGTAGT, GTAGTAGTA and TAGTAGTAG; cycle C' contains CAGCAGCAG, AGCAGCAGC and GCAGCAGCA.]


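Proof
Suppose $|\mathrm{overlap}(r, r')| \ge w(c) + w(c')$
$\alpha(c) = \mathrm{prefix}(r, s_{i_2}) \circ \cdots \circ \mathrm{prefix}(s_{i_{l-1}}, s_{i_l}) \circ \mathrm{prefix}(s_{i_l}, r)$
r is a substring of $\alpha(c)^\infty$, so overlap(r, r') is also a substring
of $\alpha(c)^\infty$.
$|\mathrm{overlap}(r, r')| \ge w(c)$, so the overlap consists of repeated
copies of $\alpha(c)$.

[Figure: overlap(r, r') drawn beneath the repetition $\alpha\,\alpha\,\alpha$.]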


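[Figure: overlap(r, r') read along r as repetitions of $\alpha$ and along r' as repetitions of $\alpha'$; comparing the two readings gives $\alpha' \circ \alpha = \alpha \circ \alpha'$.]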


Proof, continued
$\alpha$ and $\alpha'$ commute, so
$\alpha^k \circ \alpha'^k = \alpha'^k \circ \alpha^k$
for all $k > 0$, so
$\alpha^\infty = \alpha'^\infty$


Proof, continued
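All strings from c and c' are substrings of $\alpha^\infty$.
There is a cycle covering all strings in c and c' with weight at
most $|\alpha|$ (Lemma 1).
Since $|\alpha| = w(c) < w(c) + w(c')$, replacing c and c' by this cycle gives a cheaper cycle cover. Contradiction with the optimality of C.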


Proof of factor 4 algorithm


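$r_i$ are the representative strings, sorted by their position in an optimal superstring s
[Figure: the representative strings $r_1, \ldots, r_i, r_{i+1}, \ldots$ laid along s in sorted order.]

$\mathrm{OPT}_{SSP} \ge \sum_{i=1}^{k} |r_i| - \sum_{i=1}^{k-1} |\mathrm{overlap}(r_i, r_{i+1})|$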


Proof of factor 4 algorithm, continued.


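$\mathrm{OPT}_{SSP} \ge \sum_{i=1}^{k} |r_i| - \sum_{i=1}^{k-1} |\mathrm{overlap}(r_i, r_{i+1})|$

Lemma 2:
$|\mathrm{overlap}(r, r')| < w(c) + w(c')$

$\mathrm{OPT}_{SSP} \ge \sum_{i=1}^{k} |r_i| - 2 \sum_{i=1}^{k} w(c_i)$

$\sum_{i=1}^{k} |r_i| \le \mathrm{OPT}_{SSP} + 2 \sum_{i=1}^{k} w(c_i) \le 3 \cdot \mathrm{OPT}_{SSP}$

$\mathrm{ALG} = \sum_{i=1}^{k} |\sigma(c_i)| = w(C) + \sum_{i=1}^{k} |r_i| \le 4 \cdot \mathrm{OPT}_{SSP}$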


Factor 3 Algorithm

Construct the prefix graph corresponding to the strings in S
Find a minimum weight cycle cover of the prefix graph, $C = \{c_1, \ldots, c_k\}$
Run the greedy superstring algorithm on $\{\sigma(c_1), \ldots, \sigma(c_k)\}$ and output the resulting string $\tau$
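
A compact sketch of these three steps for small inputs. The function names are hypothetical, the cycle cover is again found by brute force over permutations rather than by an assignment algorithm, and σ strings that are contained in another σ string are dropped before the greedy merge (an assumption of this sketch).

```python
from itertools import permutations


def ov(a, b):
    """Length of the longest proper overlap of a suffix of a with a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def sigma_strings(strings):
    """sigma(c) for every cycle c of a minimum weight cycle cover (brute force)."""
    n = len(strings)
    w = [[len(a) - ov(a, b) for b in strings] for a in strings]
    p = min(permutations(range(n)), key=lambda q: sum(w[i][q[i]] for i in range(n)))
    sigmas, seen = [], set()
    for start in range(n):
        cycle, v = [], start
        while v not in seen:
            seen.add(v)
            cycle.append(strings[v])
            v = p[v]
        if cycle:
            alpha = "".join(a[:len(a) - ov(a, b)]
                            for a, b in zip(cycle, cycle[1:] + cycle[:1]))
            sigmas.append(alpha + cycle[0])  # representative = first string of the cycle
    return sigmas


def greedy_merge(strings):
    # Drop duplicates and strings contained in another string, then merge greedily.
    t = [s for s in dict.fromkeys(strings)
         if not any(s != u and s in u for u in strings)]
    while len(t) > 1:
        k, i, j = max(((ov(a, b), i, j)
                       for (i, a), (j, b) in permutations(enumerate(t), 2)),
                      key=lambda x: x[0])
        merged = t[i] + t[j][k:]
        t = [s for idx, s in enumerate(t) if idx not in (i, j)] + [merged]
    return t[0]


def factor3_superstring(strings):
    return greedy_merge(sigma_strings(strings))


S = ["CATGC", "TTCA", "ATGCAT", "GCTA", "CTAAGT"]
tau = factor3_superstring(S)
```

The only difference from the factor 4 algorithm is the last step: the σ strings are overlapped greedily instead of being concatenated.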


Compression
$\|S\| = \sum_{x \in S} |x|$
Compression $= \|S\| - |s|$
Max. compression ↔ min. superstring
Lemma: the greedy superstring algorithm achieves at least half the
optimal compression:

$\|S\| - \mathrm{GREEDY} \ge \frac{1}{2} (\|S\| - \mathrm{OPT})$


Lemma 1
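$|\tau| \le \mathrm{OPT}_\sigma + w(C)$,
where $S_\sigma = \{\sigma(c_1), \ldots, \sigma(c_k)\}$ and $\mathrm{OPT}_\sigma$ is the length of a shortest superstring of $S_\sigma$

Proof
Assume $\sigma(c_1), \ldots, \sigma(c_k)$ appear in this order in the
shortest superstring of $S_\sigma$
Maximum compression is:
$\|S_\sigma\| - \mathrm{OPT}_\sigma = \sum_{i=1}^{k-1} |\mathrm{overlap}(\sigma(c_i), \sigma(c_{i+1}))|$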


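$|\mathrm{overlap}(r, r')| < w(c) + w(c')$ (lemma 2)

So maximum compression is:
$\|S_\sigma\| - \mathrm{OPT}_\sigma \le 2 w(C)$
Greedy algorithm gives:
$\|S_\sigma\| - |\tau| \ge \frac{1}{2} (\|S_\sigma\| - \mathrm{OPT}_\sigma)$
$2 (|\tau| - \mathrm{OPT}_\sigma) \le \|S_\sigma\| - \mathrm{OPT}_\sigma \le 2 w(C)$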


Lemma 2
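$\mathrm{OPT}_\sigma \le \mathrm{OPT}_{SSP} + w(C)$

Proof
Let $S_r = \{r_1, \ldots, r_k\}$ (representative strings)
Each $\sigma(c_i)$ begins and ends with $r_i$, so
$\|S_\sigma\| - \mathrm{OPT}_\sigma \ge \|S_r\| - \mathrm{OPT}_r$
$\|S_\sigma\| = \|S_r\| + w(C)$
$\mathrm{OPT}_\sigma \le \mathrm{OPT}_r + w(C)$
Since $S_r \subseteq S$, $\mathrm{OPT}_r \le \mathrm{OPT}_{SSP}$, which gives the lemma.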


Lemma 1
$|\tau| \le \mathrm{OPT}_\sigma + w(C)$

Lemma 2
$\mathrm{OPT}_\sigma \le \mathrm{OPT}_{SSP} + w(C)$

Combine lemma 1 and lemma 2 to get:

$|\tau| \le 3 \cdot \mathrm{OPT}_{SSP}$
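
Written out as one chain, the combination uses the cycle cover lower bound $w(C) = \mathrm{OPT}_{CCP} \le \mathrm{OPT}_{SSP}$ from the factor 4 analysis:

$$|\tau| \le \mathrm{OPT}_\sigma + w(C) \le \mathrm{OPT}_{SSP} + 2\,w(C) \le \mathrm{OPT}_{SSP} + 2 \cdot \mathrm{OPT}_{SSP} = 3 \cdot \mathrm{OPT}_{SSP}$$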
