
Volume 12, number 5          INFORMATION PROCESSING LETTERS          13 October 1981

AN OPTIMAL ALGORITHM FOR COMPUTING THE REPETITIONS IN A WORD

Max CROCHEMORE
Laboratoire d'Informatique, Université de Haute-Normandie, BP 67, 76130 Mont-Saint-Aignan, France

Received 27 January 1981; revised version received 14 May 1981

Analysis of algorithm, optimality, partitioning, repetitions in words

A word has a repetition when it has at least two consecutive equal factors. For instance, abab is a repetition (a square) in aababba.

Recently, it has been proved that the set of words containing a square is not context-free [3,7].

This paper presents an algorithm to compute all the repetitions of primitive factors in a word x in time $O(|x| \log_2 |x|)$. A straightforward adaptation of Knuth, Morris and Pratt's string-matching algorithm [5] also solves the problem, but in time $O(|x|^2)$.

Main and Lorentz have given an $O(|x| \log_2 |x|)$ algorithm to find one square in a word x. Their method cannot be directly extended to solve the present problem since they eliminate many repetitions when they are guaranteed to find another one later in the search.

Our algorithm uses an improved version of the well-known partitioning technique [1] for refinements of equivalence relations. This version has already been fruitful in a problem concerning partitions on graphs [2].

The optimality of the algorithm is proved by showing that there exist words which have indeed $O(|x| \log_2 |x|)$ repetitions. These particular words are Fibonacci words.

With a slight modification, the algorithm gives the maximal repetitions of a word. This algorithm is also optimal since it computes all the $O(|x| \log_2 |x|)$ maximal repetitions of a Fibonacci word x in time $O(|x| \log_2 |x|)$.

1. Repetitions in words

Let A be a finite alphabet and $A^*$ be the free monoid generated by A. The length of a word x in $A^*$ is denoted by $|x|$. Let $x = x_1 x_2 \cdots x_n$ be a word ($x_i \in A$). A position of x is, as in [4], any integer in $\{1, 2, \ldots, n\}$.

A word u of length p is said to occur at position i in x if

$$i + p - 1 \le n \quad\text{and}\quad u = x_i x_{i+1} \cdots x_{i+p-1}.$$

As usual, a word is said to be primitive if it cannot be written $v^e$ with $v \in A^*$ and $e \ge 2$.

Then, a repetition in x is a non-trivial power of a primitive word which occurs in x. More precisely, a repetition in x is defined to be a triple (i, p, e) such that, setting $u = x_i \cdots x_{i+p-1}$, one has:

$u^e$ occurs at position i in x.
$u^{e+1}$ does not occur at position i in x.
u is primitive.

The integers p and e are called respectively the period and the exponent of the repetition (i, p, e). For instance, (1, 3, 2), (3, 1, 2), (4, 2, 2) and (5, 2, 2) are repetitions in abaababa.

Maximal repetitions are also considered and defined by: a repetition (i, p, e) in x is maximal if $i - p \le 0$ or $x_i x_{i+1} \cdots x_{i+p-1}$ does not occur at position $i - p$.

The algorithm given in this paper computes, for a word x in $A^*$, its set of repetitions or only its set of maximal repetitions.
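The definitions of this section can be checked mechanically on small words. The following is a minimal brute-force sketch in Python that applies the definition of a repetition (i, p, e) literally; it is not the algorithm of this paper and runs in polynomial rather than $O(|x| \log_2 |x|)$ time. The function names `is_primitive`, `repetitions_by_definition` and `is_maximal` are illustrative choices, and positions are reported 1-based as in the text.

```python
def is_primitive(u):
    """u is primitive iff its only occurrences inside uu start at offsets 0 and |u|."""
    return (u + u).find(u, 1) == len(u)


def repetitions_by_definition(x):
    """All triples (i, p, e), 1-based i and e >= 2, taken straight from the definition."""
    n = len(x)
    found = set()
    for p in range(1, n // 2 + 1):
        for start in range(n - 2 * p + 1):       # leave room for at least u^2
            u = x[start:start + p]
            if not is_primitive(u):
                continue
            e = 1
            # Extend while the next block of length p equals u; a short slice at
            # the end of x never equals u, so the loop stops there.
            while x[start + e * p:start + (e + 1) * p] == u:
                e += 1
            if e >= 2:
                found.add((start + 1, p, e))
    return found


def is_maximal(x, rep):
    """Maximality as defined above: u must not also occur at position i - p."""
    i, p, _ = rep
    return i - p <= 0 or x[i - p - 1:i - 1] != x[i - 1:i - 1 + p]


print(sorted(repetitions_by_definition("abaababa")))
# [(1, 3, 2), (3, 1, 2), (4, 2, 2), (5, 2, 2)]
```

On abaababa the sketch returns exactly the four triples listed above, and all four are maximal in the sense of the definition.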

0020-0190/81/0000-0000/$02.50 © 1981 North-Holland



2. Equivalences on positions

A sequence $(E_p)_{p \ge 1}$ of equivalences on the positions in a word is defined as follows. Let $x \in A^*$, $n = |x|$ and $p \ge 1$; then $(i, j) \in E_p$ iff

$$i + p - 1 \le n, \quad j + p - 1 \le n \quad\text{and}\quad x_i \cdots x_{i+p-1} = x_j \cdots x_{j+p-1}.$$

So, two positions in x are equivalent according to $E_p$ when the factors of x of length p starting at i and j are equal.

For each $p \ge 1$, a function on the positions of x is also defined, which gives, for a position i, its difference to the next position in the same equivalence class:

$$D_p(i) = \begin{cases} \text{the least integer } k > 0 \text{ s.t. } (i, i + k) \in E_p, \\ \infty \text{ if there is no such } k. \end{cases}$$

Then, repetitions in x are characterized in terms of the differences $D_p$:

Lemma 1. (i, p, e) is a repetition iff

$$D_p(i) = D_p(i + p) = \cdots = D_p(i + (e - 2)p) = p$$

and

$$D_p(i + (e - 1)p) \ne p.$$

Maximal repetitions are characterized in the same way:

Lemma 2. (i, p, e) is a maximal repetition iff (i, p, e) is a repetition and $i - p \le 0$ or $D_p(i - p) \ne p$.

Proof of Lemma 1. Of course, the conditions are sufficient. Now, if (i, p, e) is a repetition we have:

$$\forall j \in \{i, i + p, \ldots, i + (e - 2)p\} \quad D_p(j) \le p,$$

and

$$D_p(i + (e - 1)p) \ne p.$$

Suppose that $D_p(j) = p' < p$ for one j. The word $u = x_j \cdots x_{j+p-1}$ occurs also at positions $j + p'$ and $j + p$. In such a situation, denoting by v the word $x_j \cdots x_{j+p'-1}$ and by w the word $x_{j+p'} \cdots x_{j+p-1}$, it can easily be seen that $u = vw = wv$. Thus u is a power of a word in $A^*$ [6], which contradicts the fact that u is primitive.

3. The basic lemma for computing the equivalences

One can easily check that any equivalence $E_{p+1}$ is a refinement of $E_p$ ($E_p \supseteq E_{p+1}$); furthermore, there clearly exists a smallest integer N, $1 \le N \le n$, so that

$$E_1 \supset E_2 \supset \cdots \supset E_N,$$

and $E_N = E_{N+1} = \cdots$ is the equality relation on $\{1, \ldots, n\}$.

The computation of the equivalences $E_p$ may be done by the classical Moore's algorithm which computes successively $E_1, E_2, \ldots, E_N$. It is based on the relation:

$$(i, j) \in E_p \quad\text{iff}\quad (i, j) \in E_{p-1} \text{ and } (i + 1, j + 1) \in E_{p-1}.$$

Exploiting this relation directly leads to an $O(n^2)$ algorithm to compute the equivalences.

The other classical partitioning algorithm, Hopcroft's one [1], does not work for this problem since it computes $E_N$ via other equivalences than the $E_p$'s.

The method retained here was used in [2] to partition graphs. It leads to an $O(n \log_2 n)$ algorithm.

Let us consider two consecutive values of the equivalences, $E_{p-1}$ and $E_p$. Let $\{C_1, \ldots, C_q\}$ be the equivalence classes according to $E_p$ ($E_p$-classes) and $\{C'_1, \ldots, C'_{q'}\}$ the $E_{p-1}$-classes. $E_p$ being a refinement of $E_{p-1}$, each $E_{p-1}$-class is a union of $E_p$-classes.

A choice function is a function

$$f : \{C'_1, \ldots, C'_{q'}\} \to \{C_1, \ldots, C_q\},$$

with the properties: for any $C'$ in $\{C'_1, \ldots, C'_{q'}\}$, [$f(C') \subseteq C'$ and for any C in $\{C_1, \ldots, C_q\}$, $C \subseteq C' \Rightarrow |C| \le |f(C')|$].

So, f associates to each $E_{p-1}$-class one of its $E_p$-subclasses of maximal size.

Given a choice function f, each $E_p$-class $f(C')$ is called a big class; the others are called small classes. Of course, there are as many big $E_p$-classes as $E_{p-1}$-classes. In particular, $E_p = E_{p-1}$ iff there is no small $E_p$-class. By definition, all the $E_1$-classes are small.

Now, a new sequence $(S_p)_{p \ge 1}$ of equivalences on

the positions of x is defined:

$(i, j) \in S_p$ iff $(i, j) \in E_p$, or both i and j are in big $E_p$-classes.

Equivalently we have:

$(i, j) \in S_p$ iff for any small $E_p$-class C, $i \in C$ iff $j \in C$.

Lemma 3. For any $p \ge 1$, $(i, j) \in E_{p+1}$ iff $(i, j) \in E_p$ and $(i + 1, j + 1) \in S_p$.

Denoting by $\bar{S}_p$ the equivalence:

$(i, j) \in \bar{S}_p$ iff $(i + 1, j + 1) \in S_p$,

Lemma 3 asserts that $E_{p+1} = E_p \cap \bar{S}_p$.

Proof. $E_p$ being a refinement of $S_p$ we have $E_{p+1} \subseteq E_p \cap \bar{S}_p$. Let i and j be two positions such that $(i, j) \in E_p$ and $(i + 1, j + 1) \in S_p$. If $i + 1$ is in a small $E_p$-class then $j + 1$ is in the same $E_p$-class; so, $(i, j) \in E_{p+1}$. If $i + 1$ is in a big $E_p$-class, so it is for $j + 1$. From $(i, j) \in E_p$ we deduce $(i + 1, j + 1) \in E_{p-1}$, which proves that $i + 1$ and $j + 1$ must be in the same big $E_p$-class. Thus, we have again $(i, j) \in E_{p+1}$.

4. Outline of the algorithm

A schema of the algorithm is drawn in Fig. 1. From the word x, $E_1$ and $D_1$ are computed and their values put in E and D. The indices of the $E_1$-classes (which are all small) are put in SMALL. Then, in the "while" loop, the successive values of E are computed using Lemma 3. The difference function D is updated at the same time, and the new small E-classes are determined and memorized in SMALL. At the beginning of each execution of the loop, the new repetitions are calculated as stated in Lemma 1.

It is shown in the next section how to implement steps 5 and 6 efficiently, with a time complexity

$$O\Big(\sum_{s \in \mathrm{SMALL}} |E\text{-class } s|\Big),$$

that is, with a complexity proportional to the union of small classes. Thus, the cost of all executions of steps 5 and 6 is

$$C = \sum_{p=1}^{N} \; \sum_{s \text{ index of a small } E_p\text{-class}} |E_p\text{-class } s|,$$

where N is the first integer such that $E_N = E_{N+1}$ or equivalently such that $E_{N+1}$ has no small equivalence class.

Lemma 4. $C \le n \log_2(n - m + 1)$ where m is the number of distinct letters in x.

Proof. Consider a position i in a small $E_p$-class C, and let $C'$ be its $E_{p-1}$-class. By definition of the small classes (and choice functions) $|C| \le |C'|/2$. Thus, a position i cannot belong to a small class more than $\log_2(n - m + 1)$ times, since the $E_1$-class of i has a cardinality of at most $n - m + 1$. As there are n positions, $C \le n \log_2(n - m + 1)$.

5. The algorithm

The algorithm that gives in R the repetitions in a word x is given in Fig. 2 as a procedure named REP. It parallels the schema in Fig. 1.

The data structures used to implement the algorithm are now described.

The equivalence E is represented twice: an array E gives for each position the index of its E-class; a doubly-linked list ECLASS gives for each equivalence class index the positions in the equivalence class. Doing so, transferring a position from an E-class to another is realized in constant time. To each E-class is associated its number of elements.

A stack NEWINDEX contains the available indices of E-classes. This stack may be seen as a 'garbage collection'. An index k is available when ECLASS(k) is empty.

The difference function D is realized by an array; simultaneously a doubly-linked list DCLASS is maintained and gives for each period p the set of positions i satisfying D(i) = p. The function D together with the list DCLASS permit a search of the repetitions of period p linear in their number.

Steps 5.1 and 5.2 realize step 5 in Fig. 1. First, in step 5.1, the small E-classes are copied in a queue
procedure REP(x)
(1) define E to be E_1 on the word x; define D to be D_1;
    p ← 1; make R empty; SMALL ← {indices of E-classes};
(2) while SMALL ≠ ∅ do
(3)   begin add to R the repetitions of period p (Lemma 1);
(4)     p ← p + 1; if p > |x|/2 then return R;
(5)     E ← E ∩ S̄ (Lemma 3); update D from the value of E;
(6)     SMALL ← {indices of small E-classes};
      end;
    return R.

Fig. 1. Schema of the repetition-searching algorithm.
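The control flow of Fig. 1 can be sketched in Python as follows. This is an illustration only, under a simplifying assumption: step 5 is replaced by recomputing $D_p$ from scratch for each period p, so the sketch is quadratic and does not achieve the complexity of procedure REP. It shows how Lemma 1 (and Lemma 2, through the hypothetical flag `maximal_only`) turns the differences into repetitions. The names `differences` and `rep_schema` are not from the paper; positions are 0-based internally and reported 1-based.

```python
import math


def differences(x, p):
    """D_p for 0-based positions: distance to the next start of the same length-p factor."""
    n = len(x)
    nxt = {}                       # factor -> closest start seen so far (right-to-left scan)
    d = [math.inf] * n             # positions with no later equal factor keep infinity
    for i in range(n - p, -1, -1):
        factor = x[i:i + p]
        if factor in nxt:
            d[i] = nxt[factor] - i
        nxt[factor] = i
    return d


def rep_schema(x, maximal_only=False):
    """Loop of Fig. 1, with D_p recomputed naively instead of refined (steps 5 and 6)."""
    n = len(x)
    found = set()
    for p in range(1, n // 2 + 1):
        d = differences(x, p)
        for i in range(n):
            if d[i] != p:
                continue
            if maximal_only and i - p >= 0 and d[i - p] == p:
                continue           # Lemma 2: the repetition extends one period to the left
            e, j = 2, i + p
            while j < n and d[j] == p:   # Lemma 1: follow D(i) = D(i+p) = ... = p
                e, j = e + 1, j + p
            found.add((i + 1, p, e))     # 1-based position, as in the text
    return found


print(sorted(rep_schema("abaababa")))
# [(1, 3, 2), (3, 1, 2), (4, 2, 2), (5, 2, 2)]
```

For the word aaaa, the sketch reports (1, 1, 4), (2, 1, 3) and (3, 1, 2) as repetitions and only (1, 1, 4) as maximal.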

QUEUE in order to preserve the increasing order on the positions in each small class. At the same time the set, SPLIT, of E-classes submitted to the 'splitting instruction' 5.2 is created. For each E-class k in SPLIT, a set SUBCLASS(k) is initialized to contain its subclass indices, together with a variable LASTSMALL(k). This indicator gives in step 5.2 the last small class s that has been used to split the E-class k.

During step 5.2 the equivalence classes are split. One position at a time is transferred to a new class $\bar{k}$, from the E-class k. Let us assume that i' is the last position in ECLASS(k) that has been transferred to a class k', using a small class s'; in this case LASTSMALL(k) = s'; if s' is used again to transfer i into ECLASS($\bar{k}$) then i and i' are equivalent according to the value of E being computed and $\bar{k}$ is defined to be k'. If not, a new index is extracted from NEWINDEX to define $\bar{k}$, and LASTSMALL(k) is set to be s.

While a position is transferred, D and DCLASS are updated. The computation of D relies heavily on the fact that positions in equivalence classes are in increasing order.

At step 6, a new value of SMALL is calculated. The array that gives the number of elements in each E-class makes it possible to find the small classes efficiently.

Theorem 5. The procedure REP in Fig. 2 computes all the repetitions in a word x.

Proof. It is easy to see that the algorithm stops. The computation of a new value of the equivalence E is done in steps 5.1 and 5.2 exactly as stated in Lemma 3. If we assume that D is correctly calculated, then from Lemma 1 it can be shown that all the repetitions of period p are added to R at step 3.

It remains to prove that D is well updated. At each execution of the while loop 5.2 exactly one position i is transferred from its equivalence class k to another $\bar{k}$. If i' is the position that precedes i in ECLASS(k) then the value of $D_p(i')$ after i has been extracted from ECLASS(k) is $D_{p-1}(i') + D_{p-1}(i)$ since positions in ECLASS(k) are in increasing order. When i is added to ECLASS($\bar{k}$) its predecessor i'' in ECLASS($\bar{k}$) must satisfy

$$D_p(i'') = i - i'',$$

since the positions in the small classes (copied in QUEUE) are in increasing order. Furthermore, i being the greatest position in ECLASS($\bar{k}$), we have $D_p(i) = \infty$. These three points correspond to what is done during step 5.2.

The procedure REP may be immediately modified to calculate maximal repetitions in the word x. Regarding Lemma 2 we have only to move the instruction 3.1 after the step 3.2. Let this new procedure be called REPMAX.

Theorem 6. The procedure REPMAX computes all the maximal repetitions of a word x.

Theorem 7. The time complexity of procedure REP (or REPMAX) is $O(|x| \log_2 |x| + |A|\,|x|)$.

Proof. Step 1 in Fig. 2 contributes $O(m|x|)$ to the total complexity, where m is the number of distinct letters in the word x. This is bounded by $O(|A|\,|x|)$.

Next, we discuss the complexity of the "while" loop 2. All the executions of step 3 take a time proportional to the number of repetitions in the word x. This number is bounded by $|x| \log_2 |x|$ [6].

The cost of the executions of steps 5.1, 5.2 and 6

procedure REP(x)
      for k ← 2n step −1 until 1 do begin push k onto NEWINDEX; make ECLASS(k) empty;
      end;
(1)   for i ← 1 until n do begin if (x_i already occurs at j) then k ← E(j)
          else pop k from NEWINDEX; E(i) ← k; add i at the end of ECLASS(k);
      end;
      define D; put in same DCLASS the positions that have same values of D; p ← 1; make R, QUEUE, SPLIT empty;
      SMALL ← {indices of the E-classes};
(2)   while SMALL ≠ ∅ do
      begin comment computation of the repetitions of period p;
(3)     while DCLASS(p) ≠ ∅ do
        begin i ← a position in DCLASS(p);
          repeat i ← i + p until D(i) ≠ p; e ← 1;
          repeat begin i ← i − p; e ← e + 1;
(3.1)         add (i, p, e) to R; erase i from DCLASS(p);
            end;
          until (i − p ≤ 0 or D(i − p) ≠ p);
(3.2)     comment see computation of maximal repetitions;
        end;
(4)     p ← p + 1; if p > n/2 then return R;
        comment copy of small classes in QUEUE;
(5.1)   while SMALL ≠ ∅ do
        begin extract s from SMALL;
          for j from the first to the last element of ECLASS(s) do
          begin if j ≠ 1 then
            begin add (j, s) at the end of QUEUE; k ← E(j − 1);
              if k ∉ SPLIT then
              begin add k to SPLIT; set SUBCLASS(k) = {k}; LASTSMALL(k) ← 0;
              end;
            end;
          end;
        end;
        comment computation of the new values of E and D;
(5.2)   while QUEUE ≠ ∅ do
        begin (j, s) ← the first pair in QUEUE; i ← j − 1; k ← E(i);
          if LASTSMALL(k) ≠ s then
          begin LASTSMALL(k) ← s; pop NI from NEWINDEX; add NI to SUBCLASS(k);
          end;
          k̄ ← the last index put in SUBCLASS(k);
          if (i has a predecessor i′ in ECLASS(k)) then
          begin D(i′) ← D(i′) + D(i); transfer i′ to DCLASS(D(i′));
          end;
          transfer i at the end of ECLASS(k̄); E(i) ← k̄; D(i) ← ∞; transfer i to DCLASS(∞);
          if (i has a predecessor i″ in ECLASS(k̄)) then
          begin D(i″) ← i − i″; transfer i″ to DCLASS(D(i″));
          end;
        end;
        comment determination of the small classes;
(6)     while SPLIT ≠ ∅ do
        begin extract k from SPLIT;
          if |ECLASS(k)| = 0 then
          begin push k onto NEWINDEX; erase k from SUBCLASS(k);
          end;
          add to SMALL all the indices in SUBCLASS(k) but one, corresponding to a greatest E-class;
        end;
      end;
      return R.

Fig. 2. Searching repetitions in a word x.
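The refinement performed by steps 5.1, 5.2 and 6 of Fig. 2 can be illustrated by the following Python sketch of one step $E_p \to E_{p+1}$ driven only by the small classes, as licensed by Lemma 3. It is a simplified illustration, not the procedure REP itself: dictionaries and Python lists stand in for the linked lists ECLASS, the stack NEWINDEX and the LASTSMALL indicator, so moving a position is not constant-time here, and D is not maintained. The names `initial_classes`, `refine`, `target` and `pieces` are hypothetical, and positions are 0-based.

```python
from collections import defaultdict


def initial_classes(x):
    """E_1: group 0-based positions by letter; by definition every E_1-class is small."""
    ids, cls, members = {}, [], defaultdict(list)
    for i, letter in enumerate(x):
        k = ids.setdefault(letter, len(ids))
        cls.append(k)
        members[k].append(i)
    return cls, members, set(members)


def refine(cls, members, small):
    """One step E_p -> E_(p+1); only predecessors of positions in small classes move."""
    # Step 5.1 analogue: freeze the pairs (i, s) before any class is modified.
    queue = [(j - 1, s) for s in small for j in members[s] if j > 0]
    target = {}                   # (old class of i, small class s) -> new class index
    pieces = defaultdict(set)     # old class index -> indices of the classes split off it
    next_id = max(members) + 1
    # Step 5.2 analogue: move i out of its class k, grouped by the small class s of i + 1.
    for i, s in queue:
        k = cls[i]
        if (k, s) not in target:
            target[(k, s)] = next_id
            next_id += 1
        new = target[(k, s)]
        members[k].remove(i)      # linear here; the linked lists of the paper make it O(1)
        members[new].append(i)
        cls[i] = new
        pieces[k].add(new)
    # Step 6 analogue: in every split class, all pieces but one largest are small.
    new_small = set()
    for k, parts in pieces.items():
        if members[k]:
            parts.add(k)
        else:
            del members[k]
        biggest = max(parts, key=lambda c: len(members[c]))
        new_small |= parts - {biggest}
    return new_small


cls, members, small = initial_classes("abaababa")
small = refine(cls, members, small)
print(sorted(sorted(m) for m in members.values()))
# E_2 classes, 0-based: [[0, 3, 5], [1, 4, 6], [2], [7]]
```

As in Fig. 2, the last position of the word has no successor and therefore ends up alone in its class after the first step, which amounts to treating the word as if it were followed by a unique end marker.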

is proportional to the length of QUEUE which is

$$\sum_{s \text{ index of a small } E\text{-class}} |\mathrm{ECLASS}(s)|.$$

Thus, applying Lemma 4, the aggregate cost of all the executions of steps 5.1, 5.2 and 6 is $O(|x| \log_2 |x|)$.

6. Optimality

Theorem 8. The procedure REP is optimal in the class of algorithms computing all the repetitions of a word.

The proof is a direct consequence of Lemma 10 on the number of squares in Fibonacci words. Observing that Fibonacci words do not contain repetitions of exponent 4, together with Lemma 10, we obtain also the optimality of the procedure REPMAX:

Theorem 9. The procedure REPMAX is optimal in the class of algorithms computing all the maximal repetitions of a word.

Lemma 10. Let us define the sequence of Fibonacci words by: $f_0 = b$, $f_1 = a$ and $f_{q+1} = f_q f_{q-1}$, q integer $\ge 1$. Then, the number $R_q$ of squares (repetitions of exponent 2) in $f_q$ satisfies, for any $q \ge 5$:

$$R_q \ge \tfrac{1}{6} |f_q| \log_2 |f_q|.$$

Proof. The property can be checked for q = 5 and 6. We proceed by induction on q. Suppose $q \ge 6$ and consider the word $f_{q+1}$ which is $f_q f_{q-1}$. Then

$$R_{q+1} = R_q + R_{q-1} + r_{q+1},$$

where $r_{q+1}$ is the number of squares in $f_{q+1} = f_q f_{q-1}$ that are neither squares in $f_q$ nor in $f_{q-1}$, i.e. squares that overlap the border line between $f_q$ and $f_{q-1}$.

By induction hypothesis, we have:

$$R_{q+1} \ge \tfrac{1}{6} |f_q| \log_2 |f_q| + \tfrac{1}{6} |f_{q-1}| \log_2 |f_{q-1}| + r_{q+1}.$$

To get the result, it suffices to prove:

$$\tfrac{1}{6} |f_q| \log_2 |f_q| + \tfrac{1}{6} |f_{q-1}| \log_2 |f_{q-1}| + r_{q+1} \ge \tfrac{1}{6} |f_{q+1}| \log_2 |f_{q+1}|,$$

or

$$r_{q+1} \ge \tfrac{1}{6} |f_q| \log_2 \frac{|f_{q+1}|}{|f_q|} + \tfrac{1}{6} |f_{q-1}| \log_2 \frac{|f_{q+1}|}{|f_{q-1}|},$$

using the relation $|f_{q+1}| = |f_q| + |f_{q-1}|$.

It is well known that Fibonacci words satisfy, for $q \ge 4$,

$$|f_q| / |f_{[\ldots]}| \le |f_4| / |f_{[\ldots]}| = [\ldots].$$

So it remains to prove that

$$r_{q+1} \ge [\ldots]\,(|f_q| + |f_{q-1}|) \log_2 [\ldots],$$

or

$$r_{q+1} \ge [\ldots]\,|f_{q+1}|. \qquad (1)$$

First, we prove that $f_{q+1}$ contains $|f_{q-3}| + 1$ squares of period $|f_{q-1}|$; so, these squares contribute to $r_{q+1}$. We have successively:

$$f_{q+1} = f_q f_{q-1} = f_{q-1} f_{q-2} f_{q-2} f_{q-3}$$
$$= f_{q-1} f_{q-2} f_{q-3} f_{q-4} f_{q-3}$$
$$= f_{q-1} f_{q-1} f_{q-4} f_{q-3}.$$

The square $f_{q-1} f_{q-1}$ is then a prefix of $f_{q+1}$. For $q \ge 6$, $f_{q-3}$ is a prefix of $f_{q-4} f_{q-3}$ since:

$$f_{q-4} f_{q-3} = f_{q-4} f_{q-4} f_{q-5}$$
$$= f_{q-4} f_{q-5} f_{q-6} f_{q-5}$$
$$= f_{q-3} f_{q-6} f_{q-5}.$$

The word $f_{q-3}$ being also a prefix of $f_{q-1}$ we get $|f_{q-3}|$ other squares of period $|f_{q-1}|$.

Secondly, $f_{q+1}$ may be written $f_{q-1} f_{q-2} f_{q-2} f_{q-3}$. Analogously, we get $|f_{q-3}| + 1$ squares which contribute to $r_{q+1}$, since $f_{q-3}$ is a prefix of $f_{q-2}$.

So, for $q \ge 6$, we have $r_{q+1} \ge 2|f_{q-3}|$. The result (1) follows from the inequality:

$$|f_{q-3}| \ge [\ldots]\,|f_{q+1}|.$$

References

[1] A.V. Aho, J.E. Hopcroft and J.D. Ullman, The Design and Analysis of Computer Algorithms (Addison-Wesley, MA, 1974) 157-162.
[2] A. Cardon and M. Crochemore, Partitioning a graph in $O(|A| \log_2 |V|)$, Theoret. Comput. Sci., to appear.
[3] A. Ehrenfeucht and G. Rozenberg, On the separating

power of EOL systems, RAIRO, to appear.
[4] M.A. Harrison, Introduction to Formal Language Theory (Addison-Wesley, MA, 1978).
[5] D.E. Knuth, J.H. Morris and V.R. Pratt, Fast pattern-matching in strings, SIAM J. Comput. 6 (1977) 323-350.
[6] A. Lentin and M.P. Schützenberger, A combinatorial problem in the theory of free monoids, in: Combinatorial Mathematics and Its Applications (University of North Carolina Press, NC, 1969) 128-144.
[7] R. Ross and R. Winklmann, Repetitive strings are not context-free, CS-81-070 (Washington State University, Pullman, WA, 1981).
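
Returning to Lemma 10, the bound can be checked numerically on the first Fibonacci words. The sketch below is an illustration, not part of the proof, and rests on two assumptions: the constant of the lemma is read as 1/6, and a square is counted as an occurrence (i, p) of uu with u primitive of length p. The names `fibonacci_word` and `count_squares` are illustrative.

```python
import math


def fibonacci_word(q):
    """f_0 = 'b', f_1 = 'a', f_(q+1) = f_q f_(q-1), as in Lemma 10."""
    prev, cur = "b", "a"
    for _ in range(q - 1):
        prev, cur = cur, cur + prev
    return cur if q >= 1 else prev


def count_squares(x):
    """Number of pairs (i, p) such that uu occurs at i with u primitive of length p."""
    n, total = len(x), 0
    for p in range(1, n // 2 + 1):
        for i in range(n - 2 * p + 1):
            u = x[i:i + p]
            # (u + u).find(u, 1) == p holds exactly when u is primitive.
            if x[i + p:i + 2 * p] == u and (u + u).find(u, 1) == p:
                total += 1
    return total


for q in range(5, 11):
    f = fibonacci_word(q)
    # q, |f_q|, number of squares, and the assumed bound (1/6) |f_q| log2 |f_q|
    print(q, len(f), count_squares(f), len(f) * math.log2(len(f)) / 6)
```

With these conventions $f_5$ = abaababa contains 4 squares, which meets the bound $\tfrac{1}{6} \cdot 8 \cdot 3 = 4$ with equality, and $f_6$ contains 11.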
