A Subquadratic Algorithm For Minimum Palindromic Factorization
Abstract
We give an $O(n \log n)$-time, $O(n)$-space algorithm for factoring a string into
the minimum number of palindromic substrings. That is, given a string $S[1..n]$,
in $O(n \log n)$ time our algorithm returns the minimum number of palindromes
$S_1, \ldots, S_\ell$ such that $S = S_1 \cdots S_\ell$. We also show that the time complexity is
$O(n)$ on average and $\Omega(n \log n)$ in the worst case. The last result is based on a
characterization of the palindromic structure of Zimin words.
Keywords: string algorithms, palindromes, factorization
1. Introduction
Palindromic substrings are a well-studied topic in stringology and combinatorics on words.
Since a single character is a palindrome, there are always
between $n$ and $\binom{n}{2} + n = \Theta(n^2)$ non-empty palindromic substrings in a string of
length $n$. There are only $2n - 1$ possible centers of those substrings, however
(i.e., the $n$ individual characters and the $n - 1$ gaps between them), so many
algorithms involving palindromic substrings still run in subquadratic time. For
example, Manacher [12] gave a linear-time algorithm for listing all the palindromic prefixes of a string. Apostolico, Breslauer and Galil [3] observed that
Manacher's algorithm can be used to list in linear time all maximal palindromic
substrings, which are those that cannot be extended without changing the position of the center. Other linear-time algorithms for this problem were given
by Jeuring [9] and Gusfield [7]. Since any palindromic substring is contained
within the maximal palindromic substring with the same center, the list of all
maximal palindromic substrings can be viewed as a linear-space representation
of all palindromic substrings. For more discussion of algorithms involving palindromes, we refer the reader to Jeuring's recent survey [10].
Palindromes are a useful tool for investigating string complexity; see, e.g., [2].
A natural measure of the asymmetry of a string S is its palindromic length
Authors: Gabriele Fici, Travis Gagie, Juha Kärkkäinen, Dominik Kempa
August 8, 2014
Editor's note: we are satisfied that the results of this paper, and those of [8] and [11],
have all been achieved independently.
Algorithm Palindromic-length(S[1..n])
 1: PL[0] ← 0
 2: P ← ∅
 3: for j ← 1 to n do
 4:     P' ← ∅
 5:     foreach i ∈ P do
 6:         if i > 1 and S[i − 1] = S[j] then
 7:             P' ← P' ∪ {i − 1}
 8:     if j > 1 and S[j − 1] = S[j] then
 9:         P' ← P' ∪ {j − 1}
10:     P ← P' ∪ {j}
11:     PL[j] ← j
12:     foreach i ∈ P do
13:         PL[j] ← min(PL[j], PL[i − 1] + 1)
14: return PL[n]
Figure 1: A simple quadratic-time algorithm for computing the palindromic length. Every
iteration of the for loop in line 3 starts with $P = P_{j-1}$ and ends with $P = P_j$.
We compute and store an array $\mathit{PL}[0..n]$, where $\mathit{PL}[0] = 0$ and $\mathit{PL}[i] = \mathit{PL}(S[1..i])$
for $i \ge 1$. At each step $j$, we compute the set $P_j$ of the starting positions of all
palindromes ending at $j$ from the set $P_{j-1}$ using the observation that $S[i..j]$,
$i + 1 \le j - 1$, is a palindrome if and only if $S[i+1..j-1]$ is a palindrome and
$S[i] = S[j]$. The algorithm is given in Figure 1.
The space requirement is clearly $O(n)$. During the $j$th step of the algorithm,
we use time $O(|P_j| + |P_{j-1}|)$, so for all the steps we use total time proportional
to the number of palindromic substrings in $S$. For most strings the time is
linear (see Theorem 11) but the worst case is quadratic, e.g., for $S = a^n$ or
$S = (ab)^{n/2}$.
It is straightforward to modify the algorithm so that it produces an actual
minimum palindromic factorization of S, without increasing the running time
or space by more than a constant factor.
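The quadratic algorithm of Figure 1, together with the modification mentioned above, can be sketched in Python as follows. This is only an illustrative sketch: the function name `palindromic_length` and the back-pointer array `prev` are our own choices and do not appear in the paper, and positions are kept 1-based to match the pseudocode.

```python
def palindromic_length(s):
    """Return (PL(s), one minimum palindromic factorization of s).

    Follows Figure 1; `prev` is a hypothetical helper array used only to
    recover an actual factorization.
    """
    n = len(s)
    pl = [0] * (n + 1)    # pl[j] = palindromic length of s[:j]
    prev = [0] * (n + 1)  # prev[j] = 1-based start of the last factor chosen for s[:j]
    starts = []           # P_{j-1}: 1-based starts of palindromes ending at j-1
    for j in range(1, n + 1):
        c = s[j - 1]
        # Extend S[i..j-1] to S[i-1..j] when the two outer characters match.
        new_starts = [i - 1 for i in starts if i > 1 and s[i - 2] == c]
        if j > 1 and s[j - 2] == c:
            new_starts.append(j - 1)  # palindrome of length 2
        new_starts.append(j)          # single character
        starts = new_starts           # now equals P_j
        pl[j] = j
        for i in starts:
            if pl[i - 1] + 1 <= pl[j]:
                pl[j] = pl[i - 1] + 1
                prev[j] = i
    # Walk the back-pointers to recover the factors from right to left.
    factors = []
    j = n
    while j > 0:
        i = prev[j]
        factors.append(s[i - 1:j])
        j = i - 1
    return pl[n], factors[::-1]
```

For instance, `palindromic_length("abaab")` yields length 2 with the factors `a` and `baab`. The extra array costs only O(n) additional space, in line with the remark above.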
3. Faster Computation of Palindromes
In this section, we replace the representation $P_j$ of the palindromes ending
at $j$ with a more compact representation $G_j$ that needs only $O(\log j)$ space and
can be computed in $O(\log j)$ time from $G_{j-1}$. The representation is based on
combinatorial properties of palindromes.
A string $y$ is a border of a string $x$ if $y$ is both a prefix of $x$ and a suffix of $x$,
and a proper border if $y \ne x$. The following easy lemmas establish a connection
between borders and palindromes.
Lemma 1 ([13]). Let y be a suffix of a palindrome x. Then y is a border of x
iff y is a palindrome.
Figure 2: Proof of Lemma 4: $|u| \ge |v|$; if $|u| > |v|$ then $|u| > |z|$; and if $|u| = |v|$ then $u = v$.
Lemma 2 ([13]). Let $x$ be a string with a border $y$ such that $|x| \le 2|y|$. Then
$x$ is a palindrome iff $y$ is a palindrome.
A positive integer $p \le |x|$ is a period of a string $x$ if there exists a string $w$
of length $p$ such that $x$ is a factor of $w^{\infty}$. It is well known that $y$ is a proper
border of $x$ if and only if $|x| - |y|$ is a period of $x$. This, together with Lemma 1,
implies the following connection between periods and palindromes.
Lemma 3. Let $y$ be a proper suffix of a palindrome $x$. Then $|x| - |y|$ is a period
of $x$ iff $y$ is a palindrome. In particular, $|x| - |y|$ is the smallest period of $x$ iff
$y$ is the longest palindromic proper suffix of $x$.
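Lemmas 1 and 3 are easy to check exhaustively on short strings. The sketch below does so for all palindromes over a two-letter alphabet up to a chosen length. All function names are ours, and we assume here that the empty string counts as a palindrome and as a border, so that $|x|$ itself is a period.

```python
from itertools import product


def is_palindrome(t):
    return t == t[::-1]


def proper_border_lengths(x):
    # k = 0 is the empty border; k = len(x) is excluded because it is not proper.
    return {k for k in range(len(x)) if x[:k] == x[len(x) - k:]}


def periods(x):
    n = len(x)
    return {p for p in range(1, n + 1)
            if all(x[i] == x[i + p] for i in range(n - p))}


def check_lemma_3(max_len=10, alphabet="ab"):
    """Check Lemmas 1 and 3 on every palindrome of length 1..max_len."""
    checked = 0
    for n in range(1, max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if not is_palindrome(x):
                continue
            pal_suffix_lengths = {k for k in range(n) if is_palindrome(x[n - k:])}
            # Lemma 1: palindromic proper suffixes are exactly the proper borders.
            assert pal_suffix_lengths == proper_border_lengths(x)
            # Lemma 3: the periods are exactly |x| - |y| over those suffixes y.
            assert periods(x) == {n - k for k in pal_suffix_lengths}
            checked += 1
    return checked
```

As a concrete instance, the palindrome `abaababaaba` has palindromic proper suffixes of lengths 0, 1, 3 and 6, so its periods are 11, 10, 8 and 5, and its smallest period 5 corresponds to the longest palindromic proper suffix `abaaba`.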
Now we are ready to state and prove the key combinatorial property of
palindromic suffixes.
Lemma 4. Let $x$ be a palindrome, $y$ the longest palindromic proper suffix of $x$
and $z$ the longest palindromic proper suffix of $y$. Let $u$ and $v$ be strings such
that $x = uy$ and $y = vz$. Then
(1) $|u| \ge |v|$;
(2) if $|u| > |v|$ then $|u| > |z|$;
(3) if $|u| = |v|$ then $u = v$.
Proof. See Figure 2 for an illustration.
(1) By Lemma 3, $|u| = |x| - |y|$ is the smallest period of $x$, and $|v| = |y| - |z|$
is the smallest period of $y$. Since $y$ is a factor of $x$, either $|u| > |y| > |v|$ or $|u|$
is a period of $y$ too, and thus it cannot be smaller than $|v|$.
(2) By Lemma 1, $y$ is a border of $x$ and thus $v$ is a prefix of $x$. Let $w$ be a
string such that $x = vw$. Then $z$ is a border of $w$ and $|w| = |zu|$, see Figure 3.
Since we assume $|u| > |v|$, we must have $|w| > |y|$. Suppose to the contrary that
$|u| \le |z|$. Then $|w| = |zu| \le 2|z|$, and by Lemma 2, $w$ is a palindrome. But this
contradicts $y$ being the longest palindromic proper suffix of $x$.
(3) In the proof of (2) we saw that $v$ is a prefix of $x$, and so is $u$ by definition.
Thus $u = v$ if $|u| = |v|$.
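The three claims of Lemma 4 can also be sanity-checked by brute force. The following sketch enumerates all palindromes $x$ of length 2 to 12 over a two-letter alphabet, derives $y$, $z$, $u$ and $v$ as in the lemma, and asserts (1) to (3). The helper names are ours, and we assume $z$ may be the empty string when $y$ is a single character.

```python
from itertools import product


def is_palindrome(t):
    return t == t[::-1]


def longest_pal_proper_suffix(x):
    """Longest palindromic proper suffix of x (the empty string if none is longer)."""
    for start in range(1, len(x) + 1):
        if is_palindrome(x[start:]):
            return x[start:]
    return ""


def check_lemma_4(max_len=12, alphabet="ab"):
    checked = 0
    for n in range(2, max_len + 1):
        for letters in product(alphabet, repeat=n):
            x = "".join(letters)
            if not is_palindrome(x):
                continue
            y = longest_pal_proper_suffix(x)  # non-empty because n >= 2
            z = longest_pal_proper_suffix(y)
            u = x[:len(x) - len(y)]           # x = u y
            v = y[:len(y) - len(z)]           # y = v z
            assert len(u) >= len(v)           # claim (1)
            if len(u) > len(v):
                assert len(u) > len(z)        # claim (2)
            else:
                assert u == v                 # claim (3)
            checked += 1
    return checked
```

For example, for $x$ = `aabaa` we get $y$ = `aa`, $z$ = `a`, $u$ = `aab` and $v$ = `a`, so $|u| = 3 > |v| = 1$ and indeed $|u| > |z|$.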
We will use the above lemma to establish the properties of the set $P_j$. Let
$P_j = \{p_1, p_2, \ldots, p_m\}$ with $p_1 < p_2 < \cdots < p_m$. By gap we mean the difference
$p_i - p_{i-1}$ of two consecutive values in $P_j$. The following result has been proven
in [14] but we provide a proof for completeness.
Figure 3: Proof of Lemma 4(2): if $|u| > |v|$ and $|u| \le |z|$ then $w$ is a palindromic proper suffix
of $x$ longer than $y$.
Panel (a): $S[1..j-1]$ = c a a a b a a a b a a a b a a a
Panel (b): $S[1..j]$ = c a a a b a a a b a a a b a a a b
Figure 4: (a) The palindromic suffixes of $S[1..j-1]$ for $j = 17$ start at positions $P_{j-1} =
\{2, 6, 10, 14, 15, 16\}$ and the compact representation is $G_{j-1} = ((2, \infty, 1), (6, 4, 3), (15, 1, 2))$.
The shaded symbols will be compared with the next symbol appended to the text.
(b) The palindromic suffixes after appending $S[j]$. The sequence $G'_j$ is obtained by taking each
triple $(i, \Delta, k) \in G_{j-1}$ and either removing it or replacing it with $(i - 1, \Delta, k)$. The resulting
sequence $G'_j = ((5, 4, 3))$, however, is no longer a valid gap partitioning because the gap of
the first element encoded by triple $(5, 4, 3)$ is $\infty$. This is fixed by separating this element
into its own triple. At this point we also add the palindromes of length at most 2 to obtain
$G''_j = ((5, \infty, 1), (9, 4, 2), (17, 4, 1))$. Finally, we merge neighboring triples with the same $\Delta$ to
obtain $G_j = ((5, \infty, 1), (9, 4, 3))$.
Figure 8 and the example of computation is given in Figure 4b. Each triple is
processed in constant time and the number of triples never exceeds $O(|G_{j-1}|)$.
Lemma 7. $G_j$ can be computed from $G_{j-1}$ in $O(|G_{j-1}|) = O(\log j)$ time.
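To make the compact representation concrete, the sketch below builds the triples $(i, \Delta, k)$ directly from a sorted list of starting positions by grouping consecutive positions with equal gaps. This is a brute-force illustration of what $G_j$ encodes, not the incremental update of Lemma 7. The function names are ours, and `math.inf` stands in for the gap $\infty$ of the first position.

```python
import math


def pal_suffix_starts(s):
    """1-based starting positions of all palindromic suffixes of s (the set P_j for j = len(s))."""
    return [i for i in range(1, len(s) + 1) if s[i - 1:] == s[i - 1:][::-1]]


def gap_partition(starts):
    """Group sorted positions into triples (i, gap, k): k positions i, i+gap, ..., i+(k-1)*gap."""
    triples = []
    prev = None
    for p in starts:
        gap = math.inf if prev is None else p - prev
        if triples and triples[-1][1] == gap:
            i, d, k = triples[-1]
            triples[-1] = (i, d, k + 1)   # same gap as the previous position: extend the run
        else:
            triples.append((p, gap, 1))   # new gap value: start a new triple
        prev = p
    return triples
```

On the string of Figure 4, `gap_partition(pal_suffix_starts("caaabaaabaaabaaa"))` gives `[(2, inf, 1), (6, 4, 3), (15, 1, 2)]`, and after appending `b` the result is `[(5, inf, 1), (9, 4, 3)]`, matching $G_{j-1}$ and $G_j$ in the caption.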
4. Faster Factorization
In this section, we will show how to compute $\mathit{PL}[j]$ from $\mathit{PL}[0..j-1]$ and
$G_j$ in $O(|G_j|)$ time. The key to fast computation of $G_j$ was the close relation
between $P_{j,\Delta}$ and $P_{j-1,\Delta}$. Now we will rely on the relation between $P_{j,\Delta}$ and
$P_{j-\Delta,\Delta}$ captured by the following result.
Lemma 8. If $(i, \Delta, k) \in G_j$ for $k \ge 2$, then $(i, \Delta, k - 1) \in G_{j-\Delta}$.
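Proof. By definition, $(i, \Delta, k) \in G_j$ is equivalent to saying that $P_{j,\Delta} = \{i, i + \Delta, \ldots, i + (k-1)\Delta\}$,
and we need to show that $P_{j-\Delta,\Delta} = \{i, i + \Delta, \ldots, i + (k-2)\Delta\}$.
We will show first that $P_{j-\Delta,\Delta} \cap [i - \Delta + 1..j - \Delta] = \{i, i + \Delta, \ldots, i + (k-2)\Delta\}$
and then that $P_{j-\Delta,\Delta} \cap [1..i - \Delta] = \emptyset$.
Since $y = S[i..j]$ and $x = S[i - \Delta..j]$ are palindromes and $y$ is the longest
proper border of $x$, $S[i - \Delta..j - \Delta] = y = S[i..j]$. Thus for all $\ell \in [i..j]$,
$\ell \in P_j$ iff $\ell - \Delta \in P_{j-\Delta}$ (see Figure 5a). In particular, the gaps in both cases
are the same and for all $\ell \in [i + 1..j]$, $\ell \in P_{j,\Delta}$ iff $\ell - \Delta \in P_{j-\Delta,\Delta}$. Thus
$P_{j-\Delta,\Delta} \cap [i - \Delta + 1..j - \Delta] = \{i, i + \Delta, \ldots, i + (k-2)\Delta\}$.
We still need to show that $P_{j-\Delta,\Delta} \cap [1..i - \Delta] = \emptyset$, which is true if and
only if $i - 2\Delta \notin P_{j-\Delta}$. Suppose to the contrary that $S[i - 2\Delta..j - \Delta]$ is a
palindrome and let $w = S[i - 2\Delta..i - \Delta - 1]$. Then $S[j - 2\Delta + 1..j - \Delta] = w^R$,
the reverse of $w$. Since $z = S[i - \Delta..j - \Delta]$ and $S[i - \Delta..j]$ are palindromes too,
we have that $S[i - \Delta..i - 1] = w$ and $S[j - \Delta + 1..j] = w^R$. Finally, since $z$ is a
palindrome, $S[i - 2\Delta..j] = w z w^R$ is a palindrome (see Figure 5b). This implies
that $i - 2\Delta \in P_j$ and thus $i - \Delta \in P_{j,\Delta}$, which is a contradiction.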
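Lemma 8 can be checked by brute force on short strings. The sketch below recomputes the gap partition of every prefix from scratch and asserts that each triple $(i, \Delta, k)$ with $k \ge 2$ in round $j$ has the counterpart $(i, \Delta, k - 1)$ in round $j - \Delta$. The helper names are ours, `math.inf` stands in for the gap of the first position, and the quadratic recomputation is for illustration only.

```python
import math
from itertools import product


def pal_suffix_starts(s, j):
    """P_j: 1-based starts i such that S[i..j] is a palindrome."""
    return [i for i in range(1, j + 1) if s[i - 1:j] == s[i - 1:j][::-1]]


def gap_partition(starts):
    """Group sorted positions into triples (i, gap, k) of equal consecutive gaps."""
    triples = []
    prev = None
    for p in starts:
        gap = math.inf if prev is None else p - prev
        if triples and triples[-1][1] == gap:
            i, d, k = triples[-1]
            triples[-1] = (i, d, k + 1)
        else:
            triples.append((p, gap, 1))
        prev = p
    return triples


def check_lemma_8(length=11, alphabet="ab"):
    """Check Lemma 8 on all prefixes of all strings of the given length."""
    checked = 0
    for letters in product(alphabet, repeat=length):
        s = "".join(letters)
        g = [None] + [gap_partition(pal_suffix_starts(s, j))
                      for j in range(1, length + 1)]
        for j in range(1, length + 1):
            for (i, d, k) in g[j]:
                if k >= 2:  # implies a finite gap d, so round j - d exists
                    assert (i, d, k - 1) in g[j - d]
                    checked += 1
    return checked
```

On the running example with $j = 16$, the triple $(6, 4, 3) \in G_{16}$ corresponds to $(6, 4, 2) \in G_{12}$.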
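By the above lemma, $P_{j,\Delta} = P_{j-\Delta,\Delta} \cup \{\max P_{j,\Delta}\}$ whenever $|P_{j,\Delta}| \ge 2$.
Thus we can compute $\mathit{PL}_{j,\Delta} = \min\{\mathit{PL}[i - 1] + 1 : i \in P_{j,\Delta}\}$ from $\mathit{PL}_{j-\Delta,\Delta}$
in constant time. We will store the value $\mathit{PL}_{j,\Delta}$ in an array $\mathit{GPL}[1..n]$ at the
position $m = \min P_{j,\Delta} - \Delta$. Note that $m$ is the predecessor of $\min P_{j,\Delta}$ in $P_j$
and the position is shared by $\mathit{PL}_{j-\Delta,\Delta}$ (when $|P_{j,\Delta}| \ge 2$). The following lemma
shows that the position is not overwritten by another value between the rounds
$j - \Delta$ and $j$. See Figure 6 for an example.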
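Lemma 9. Let $m = \min P_{j-\Delta,\Delta} - \Delta$. For all $\ell \in [j - \Delta + 1..j - 1]$, $m \notin P_\ell$.
Proof. Suppose to the contrary that $m \in P_\ell$ for some $\ell \in [j - \Delta + 1..j - 1]$, i.e.,
$S[m..\ell]$ is a palindrome. Then $S[m + h..\ell - h]$ for $h = \ell - j + \Delta$ is a palindrome
too (see Figure 7). Since $\ell - h = j - \Delta$ and $m < m + h < m + \Delta = \min P_{j-\Delta,\Delta}$,
this contradicts $m$ being the predecessor of $\min P_{j-\Delta,\Delta}$ in $P_{j-\Delta}$.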
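The non-overwriting property can likewise be tested exhaustively on short strings. In the sketch below, `jp` plays the role of the earlier round $j - \Delta$: for every triple $(i, \Delta, k)$ with a finite gap in that round, the slot $m = i - \Delta$ must not start a palindrome ending at any of the next $\Delta - 1$ positions. The names are ours, the check is truncated at the end of the string, and `math.inf` marks the gap of the first position.

```python
import math
from itertools import product


def pal_suffix_starts(s, j):
    """P_j: 1-based starts i such that S[i..j] is a palindrome."""
    return [i for i in range(1, j + 1) if s[i - 1:j] == s[i - 1:j][::-1]]


def gap_partition(starts):
    """Group sorted positions into triples (i, gap, k) of equal consecutive gaps."""
    triples = []
    prev = None
    for p in starts:
        gap = math.inf if prev is None else p - prev
        if triples and triples[-1][1] == gap:
            i, d, k = triples[-1]
            triples[-1] = (i, d, k + 1)
        else:
            triples.append((p, gap, 1))
        prev = p
    return triples


def check_lemma_9(length=11, alphabet="ab"):
    checked = 0
    for letters in product(alphabet, repeat=length):
        s = "".join(letters)
        p = [[]] + [pal_suffix_starts(s, j) for j in range(1, length + 1)]
        for jp in range(1, length + 1):            # jp stands for j - delta
            for (i, d, k) in gap_partition(p[jp]):
                if d == math.inf:
                    continue                        # first position has no predecessor
                m = i - d                           # slot used in the GPL array
                for l in range(jp + 1, min(jp + d - 1, length) + 1):
                    assert m not in p[l]
                    checked += 1
    return checked
```

In the example of Figure 6, the slot is $m = 2$ for $\Delta = 4$ and round $j - 4 = 12$, and position 2 indeed starts no palindrome ending at positions 13, 14 or 15.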
The full algorithm is given in Figure 8. The running time of round $j$ is
$O(|G_{j-1}| + |G_j|)$. Since $|G_j| = O(\log j)$ for all $j$, we obtain the following result.
Figure 5: Proof of Lemma 8. (a) $\ell \in P_j$ iff $\ell - \Delta \in P_{j-\Delta}$ for all $\ell \in [i..j]$. (b) If $i - 2\Delta \in P_{j-\Delta}$
then $S[i - 2\Delta..j]$ is a palindrome.
Panel (a), iteration $j$: $S[1..j]$ = c a a a b a a a b a a a b a a a, with $\mathit{PL}[0..16]$ = 0 1 2 2 2 3 3 3 2 3 3 3 2 3 3 3 2.
Panel (b), iteration $j - 4$: $S[1..j-4]$ = c a a a b a a a b a a a, with $\mathit{PL}[0..12]$ = 0 1 2 2 2 3 3 3 2 3 3 3 2.
Figure 6: Example usage of the $\mathit{GPL}$ array for $j = 16$. The value of $\mathit{PL}_{j,4}$ computed in
iteration $j$ depends on shaded elements from the $\mathit{PL}$ array. Rather than scanning them all, we
apply Lemma 8. Since $|P_{j,4}| \ge 2$ we get $P_{j,4} = P_{j-4,4} \cup \{14\}$. Therefore we can compute
$\mathit{PL}_{j,4}$ as $\min\{\mathit{PL}_{j-4,4}, \mathit{PL}[13] + 1\}$. The value of $\mathit{PL}_{j-4,4}$ was computed during iteration $j - 4$
and stored at position $\min P_{j-4,4} - 4 = \min P_{j,4} - 4 = 2$ in the $\mathit{GPL}$ array, and by Lemma 9
it was not overwritten between iterations $j - 4$ and $j$. Thus we compute $\mathit{PL}_{j,4}$ in constant
time as $\min\{\mathit{GPL}[2], \mathit{PL}[13] + 1\}$ and update $\mathit{GPL}[2]$ with the new value.
n/2
X
i=1
ni +
i=1
+1 n
3 n .
1
Therefore the average number of palindromes ending at any position is less than
three, and both algorithms spend a constant time on average for processing each
position.
We show the worst case complexity of the algorithm by constructing a family
of strings based on the Zimin words [4, Chapter 5.4]. Let $Z_0 = \varepsilon$, and $Z_i =
Z_{i-1}\, i\, Z_{i-1}$ for $i > 0$. The limit of this sequence is the infinite Zimin word $Z =
1213121412131215\ldots$. For a non-negative integer $n$, let $B(n)$ be the number
of 1-bits in the binary representation of $n$. For example, $B(0) = 0$, $B(1) = 1$,
$B(7) = 3$ and $B(8) = 1$.
Lemma 12. The prefix Z[1..n] of the infinite Zimin word Z has exactly B(n)
suffix palindromes.
Proof. From the definition, it is easy to see that the prefix $Z[1..n]$ has a unique
factorization of the form
$$Z[1..n] = Z_{i_k}(i_k + 1) \cdot Z_{i_{k-1}}(i_{k-1} + 1) \cdots Z_{i_2}(i_2 + 1) \cdot Z_{i_1}(i_1 + 1)$$
where $0 \le i_1 < i_2 < \cdots < i_{k-1} < i_k$. For example, $Z[1..10] = Z_3\, 4\, Z_1\, 2$. Since the
length of a factor $Z_i(i + 1)$ is $2^i$, we must have that $\sum_{j=1}^{k} 2^{i_j} = n$. Thus $i_1, \ldots, i_k$
are the positions of 1-bits in the binary representation of $n$, and $k = B(n)$.
Let $n_j = 2^{i_j}$ for $j \in [1..k]$. Clearly, $Z[2n_k - n..n]$ is a palindrome of length
$2(n - n_k) + 1$ centered at $Z[n_k] = (i_k + 1)$. For example, $Z[6..10] = 21412$ is a
palindrome centered at $Z[8] = 4$. Since $Z[n_k]$ is the only occurrence of $(i_k + 1)$
in $Z[1..n]$, there can be no other suffix palindromes with a starting position in
$Z[1..n_k]$. By a similar argument, there is exactly one suffix palindrome with a
starting position in $Z[n_k + 1..n_k + n_{k-1}]$, the one centered at $Z[n_k + n_{k-1}] =
(i_{k-1} + 1)$, and so on. In total, $Z[1..n]$ has exactly $k$ suffix palindromes.
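Lemma 12 is simple to confirm experimentally. The sketch below generates a prefix of the infinite Zimin word from the recurrence $Z_i = Z_{i-1}\, i\, Z_{i-1}$ and counts its non-empty suffix palindromes by brute force. The function names are ours, and symbols are represented as Python integers.

```python
def zimin_prefix(n):
    """Return the first n symbols of the infinite Zimin word as a list of ints."""
    z = []
    i = 0
    while len(z) < n:
        i += 1
        z = z + [i] + z  # Z_i = Z_{i-1} i Z_{i-1}
    return z[:n]


def count_suffix_palindromes(w):
    """Number of non-empty suffixes of w that are palindromes."""
    return sum(1 for i in range(len(w)) if w[i:] == w[i:][::-1])


def one_bits(n):
    """B(n): number of 1-bits in the binary representation of n."""
    return bin(n).count("1")
```

For $n = 10$ the prefix is 1 2 1 3 1 2 1 4 1 2, whose suffix palindromes are `21412` and `2`, in agreement with $B(10) = 2$.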
Algorithm Palindromic-length(S[1..n])
 1: PL[0] ← 0
 2: G ← ()
 3: for j ← 1 to n do
 4:     G' ← ()
 5:     foreach (i, Δ, k) ∈ G do
 6:         if i > 1 and S[i − 1] = S[j] then
 7:             G'.pushback((i − 1, Δ, k))        // appends the given triple
 8:     G'' ← ()
 9:     r ← −j                                    // makes i − r big enough to act as ∞
10:     foreach (i, Δ, k) ∈ G' do
11:         if i − r ≠ Δ then
12:             G''.pushback((i, i − r, 1))
13:             if k > 1 then
14:                 G''.pushback((i + Δ, Δ, k − 1))
15:         else
16:             G''.pushback((i, Δ, k))
17:         r ← i + (k − 1)Δ
18:     if j > 1 and S[j − 1] = S[j] then
19:         G''.pushback((j − 1, j − 1 − r, 1))
20:         r ← j − 1
21:     G''.pushback((j, j − r, 1))
22:     G ← ()
23:     (i', Δ', k') ← G''.popfront()             // removes and returns the first triple
24:     foreach (i, Δ, k) ∈ G'' do
25:         if Δ' = Δ then
26:             k' ← k' + k
27:         else
28:             G.pushback((i', Δ', k'))
29:             (i', Δ', k') ← (i, Δ, k)
30:     G.pushback((i', Δ', k'))
31:     PL[j] ← j
32:     foreach (i, Δ, k) ∈ G do
33:         r ← i + (k − 1)Δ
34:         m ← PL[r − 1] + 1
35:         if k > 1 then
36:             m ← min(m, GPL[i − Δ])
37:         if Δ ≤ i then
38:             GPL[i − Δ] ← m
39:         PL[j] ← min(PL[j], m)
40: return PL[n]
Figure 8: Algorithm for computing the palindromic length in $O(n \log n)$ time.
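The pseudocode of Figure 8 translates almost line by line into Python. The sketch below is our own transcription under the reading of the figure in which the first triple uses $i + j$ as a stand-in for the gap $\infty$. The function names and the naive reference implementation used for cross-checking are not part of the paper.

```python
def palindromic_length_fast(s):
    """Palindromic length of s, following the structure of Figure 8 (1-based positions)."""
    n = len(s)
    pl = [0] * (n + 1)
    gpl = [0] * (n + 1)
    g = []  # G_{j-1}: list of triples (i, delta, k)
    for j in range(1, n + 1):
        c = s[j - 1]
        # Lines 4-7: extend every surviving palindrome by one character on each side.
        g1 = [(i - 1, d, k) for (i, d, k) in g if i > 1 and s[i - 2] == c]
        # Lines 8-21: repair the gap of each triple's first element, add short palindromes.
        g2 = []
        r = -j  # makes i - r large enough to act as infinity
        for (i, d, k) in g1:
            if i - r != d:
                g2.append((i, i - r, 1))
                if k > 1:
                    g2.append((i + d, d, k - 1))
            else:
                g2.append((i, d, k))
            r = i + (k - 1) * d
        if j > 1 and s[j - 2] == c:
            g2.append((j - 1, j - 1 - r, 1))
            r = j - 1
        g2.append((j, j - r, 1))
        # Lines 22-30: merge neighbouring triples with the same gap.
        g = []
        pi, pd, pk = g2[0]
        for (i, d, k) in g2[1:]:
            if pd == d:
                pk += k
            else:
                g.append((pi, pd, pk))
                pi, pd, pk = i, d, k
        g.append((pi, pd, pk))
        # Lines 31-39: one constant-time update per triple, via the GPL array.
        pl[j] = j
        for (i, d, k) in g:
            r = i + (k - 1) * d
            m = pl[r - 1] + 1
            if k > 1:
                m = min(m, gpl[i - d])
            if d <= i:
                gpl[i - d] = m
            pl[j] = min(pl[j], m)
    return pl[n]


def palindromic_length_naive(s):
    """Reference dynamic program that tests every substring directly (cubic time)."""
    n = len(s)
    pl = [0] * (n + 1)
    for j in range(1, n + 1):
        pl[j] = min(pl[i - 1] + 1 for i in range(1, j + 1)
                    if s[i - 1:j] == s[i - 1:j][::-1])
    return pl[n]
```

A reasonable way to gain confidence in such a transcription is to compare the two functions on all short strings over a small alphabet, since the naive version is easy to verify by inspection.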
Theorem 13. The running time of the algorithm in Figure 8 for input $Z[1..n]$
is $\Theta(n \log n)$.
Proof. By Lemma 12, $Z[1..j]$ has exactly $B(j)$ suffix palindromes, i.e., $|P_j| =
B(j)$. From the proof it is easy to see that each of the suffix palindromes is
at least twice as long as the next shorter suffix palindrome. Thus there are no
two identical gaps in $P_j$ and $|G_j| = |P_j| = B(j)$. Since the algorithm
spends $\Theta(|G_{j-1}| + |G_j|)$ time in round $j$, the total time complexity is
$\Theta\bigl(\sum_{j=1}^{n} B(j)\bigr)$, which is $\Theta(n \log n)$ [15].
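The growth of the bit-count sum is easy to see when $n$ is a power of two: among the integers $0, \ldots, 2^m - 1$ each of the $m$ bit positions is set exactly half of the time, so $\sum_{j=1}^{2^m} B(j) = m\,2^{m-1} + 1 = \tfrac{1}{2} n \log_2 n + 1$ for $n = 2^m$. The following short sketch (the function name is ours) evaluates the sum directly so the closed form can be compared numerically.

```python
def total_one_bits(n):
    """Sum of B(j) for j = 1..n, where B(j) is the number of 1-bits of j."""
    return sum(bin(j).count("1") for j in range(1, n + 1))
```

For $n = 1024 = 2^{10}$ the sum is $10 \cdot 512 + 1 = 5121$.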
Acknowledgements
Many thanks to the organizers and participants of the Stringmasters 2013
workshops in Verona and Prague, and to the anonymous reviewers. This research was partially supported by the Italian MIUR Project PRIN 2010LYA9RH,
"Automi e Linguaggi Formali: Aspetti Matematici e Applicativi", and by the
Academy of Finland through grant 268324 and grant 118653 (ALGODAN).
References
[1] A. Alatabbi, C. S. Iliopoulos, and M. S. Rahman. Maximal palindromic
factorization. In Proceedings of the Prague Stringology Conference (PSC),
pages 70–77, 2013.
[2] J. P. Allouche, M. Baake, J. Cassaigne, and D. Damanik. Palindrome
complexity. Theoretical Computer Science, 292(1):9–31, 2003.
[3] A. Apostolico, D. Breslauer, and Z. Galil. Parallel detection of all palindromes in a string. Theoretical Computer Science, 141(1–2):163–173, 1995.
[4] J. Berstel, A. Lauve, C. Reutenauer, and F. V. Saliola. Combinatorics
on Words: Christoffel Words and Repetition in Words, volume 27 of
CRM Monograph Series. American Mathematical Society and Centre de
Recherches Mathématiques, 2008.
[5] G. Fici and L. Q. Zamboni. On the least number of palindromes contained
in an infinite word. Theoretical Computer Science, 481:1–8, 2013.
[6] A. E. Frid, S. Puzynina, and L. Zamboni. On palindromic factorization of
words. Advances in Applied Mathematics, 50(5):737–748, 2013.
[7] D. Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[8] T. I, S. Sugimoto, S. Inenaga, H. Bannai, and M. Takeda. Computing
palindromic factorizations and palindromic covers on-line. In Proceedings
of the 25th Symposium on Combinatorial Pattern Matching (CPM), volume
8486 of LNCS, pages 150–161. Springer, 2014.