Simple Fast Algorithms For The Editing Distance Be
Kaizhong Zhang and Dennis Shasha
Abstract. Ordered labeled trees are trees in which the left-to-right order among siblings is significant.
The distance between two ordered trees is considered to be the weighted number of edit operations
(insert, delete, and modify) to transform one tree to another. The problem of approximate tree matching is
also considered. Specifically, algorithms are designed to answer the following kinds of questions:
1. What is the distance between two trees?
2. What is the minimum distance between $T_1$ and $T_2$ when zero or more subtrees can be removed
from $T_2$?
3. Let the pruning of a tree at node n mean removing all the descendants of node n. The analogous
question for prunings as for subtrees is answered.
A dynamic programming algorithm is presented to solve the three questions in sequential time
$O(|T_1| \times |T_2| \times \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \times \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$
and space $O(|T_1| \times |T_2|)$, compared with
$O(|T_1| \times |T_2| \times (\mathrm{depth}(T_1))^2 \times (\mathrm{depth}(T_2))^2)$
for the best previous published algorithm due to Tai [J. Assoc. Comput. Mach., 26 (1979), pp. 422-433].
Further, the algorithm presented here can be parallelized to give time $O(|T_1| + |T_2|)$.
Key words. trees, editing distance, parallel algorithm, dynamic programming, pattern recognition
1. Motivation.
1.1. Applications. Ordered labeled trees are trees whose nodes are labeled and in
which the left-to-right order among siblings is significant. As such they can represent
grammar parses, image descriptions, and many other phenomena. Comparing such
trees is a way to compare scenes, parses, and so on.
As an example, consider the secondary structure comparison problem for RNA.
Because RNA is a single strand of nucleotides, it folds back onto itself into a shape
that is topologically a tree (called its secondary structure). Each node of this tree
contains several nucleotides. Nodes have colorful labels such as "bulge" and "hairpin."
Various researchers [ALKBO], [BSSBWD], [DD] have observed that the secondary
structure influences translation rates (from RNA to proteins). Because different sequences
can produce similar secondary structures [DA], [SK], comparisons among secondary
structures are necessary to understanding the comparative functionality of different
RNAs. Previous methods for comparing multiple secondary structures of RNA
molecules represent the tree structures as parenthesized strings [S88]. These methods have
recently been converted to use our tree distance algorithms.
Currently we are implementing a package containing algorithms described in this
paper and some other related algorithms. A preliminary version of the package is being
used at the National Cancer Institute for the RNA comparison problem.
1.2. Algorithmic approach. The tree distance problem is harder than the string
distance problem. Intuitively, here is why. In the string case, if $S_1[i] = S_2[j]$, then the
distance between $S_1[1..i-1]$ and $S_2[1..j-1]$ is the same as between $S_1[1..i]$ and
$S_2[1..j]$. The main difficulty in the tree case is that preserving ancestor relationships
in the mapping between trees prevents the analogous implication from holding.
* Received by the editors August 5, 1987; accepted for publication (in revised form) February 12, 1989.
This work was partially supported by the National Science Foundation under grant number DCR8501611
and by the Office of Naval Research under grant number N00014-85-K-0046.
† Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York,
New York 10012 ([email protected]). Present address, Department of Computer Science, Middlesex
College, The University of Western Ontario, London, Ontario, Canada N6A 5B7.
‡ Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York,
New York, 10012 ([email protected]).
By introducing the distance between ordered forests and carefully eliminating
certain subtree-to-subtree distance calculations, we are able to improve on the time and
space of the best previous published algorithm [T]. Note that the improvement in space
for this problem is extremely important in practical applications.
Besides improving on the time and space of the best previous algorithm [T], our
algorithm is far simpler to understand and to implement. In style, it resembles algorithms
for computing the distance between strings. In fact, the string distance algorithm is a
special case of our algorithm when the input is a string.
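For comparison, the string recurrence referred to in Section 1.2 can be sketched as follows. This is only an illustration, assuming unit costs for insert, delete, and change; the function name `string_edit_distance` is ours and does not appear in the paper.

```python
def string_edit_distance(s1, s2):
    """Classic dynamic program for the edit distance between two strings.

    Assumption: every insert, delete, and change costs 1.
    """
    n, m = len(s1), len(s2)
    # d[i][j] is the distance between the prefixes s1[:i] and s2[:j].
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete the first i characters
    for j in range(1, m + 1):
        d[0][j] = j                      # insert the first j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # When the last characters agree, the diagonal term adds nothing,
            # which is the implication that fails for trees.
            change = 0 if s1[i - 1] == s2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + change)
    return d[n][m]
```

In the string case each table entry depends only on three neighboring prefixes; the tree algorithm keeps this shape but has to work on forests.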
2. Definitions.
2.1. Edit operations and editing distance between trees. Let us consider three kinds
of operations. Changing node n means changing the label on n. Deleting a node n
means making the children of n become the children of the parent of n and then
removing n. Inserting is the complement of delete. This means that inserting n as the
child of n’ will make n the parent of a consecutive subsequence of the current children
of n’. Figs. 1-3 illustrate these editing operations.
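The three operations can be sketched on a small mutable tree. The `Node` class and the helper names below are hypothetical, chosen only for illustration; deleting the root is not handled because the sketch addresses a node through its parent. The example tree is an assumption as well.

```python
class Node:
    """A node of an ordered labeled tree (hypothetical helper class)."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = list(children or [])

def change(node, new_label):
    """Change: replace the label on a node."""
    node.label = new_label

def delete(parent, k):
    """Delete the k-th child of parent; its children take its place, in order."""
    victim = parent.children[k]
    parent.children[k:k + 1] = victim.children

def insert(parent, label, start, stop):
    """Insert a new node under parent that adopts the consecutive
    children parent.children[start:stop]."""
    new = Node(label, parent.children[start:stop])
    parent.children[start:stop] = [new]
    return new

def show(node):
    """Render a tree as label(child,child,...)."""
    if not node.children:
        return node.label
    return node.label + "(" + ",".join(show(c) for c in node.children) + ")"

# Example (assumed trees): delete c below d, then insert c between f and d.
d = Node("d", [Node("a"), Node("c", [Node("b")])])
t = Node("f", [d, Node("e")])
before = show(t)          # f(d(a,c(b)),e)
delete(d, 1)
insert(t, "c", 0, 1)
after = show(t)           # f(c(d(a,b)),e)
```

Insert is the inverse of delete here: the new node adopts a consecutive run of its parent's children, which is exactly the run a later delete would hand back.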
FIG. 1. Changing a node label $(a \to b)$.
FIG. 2. Deleting a node $(b \to \Lambda)$.
FIG. 3. Inserting a node $(\Lambda \to b)$.
FIG. 4. A mapping between trees $T_1$ and $T_2$.
Consider the diagram of a mapping in Fig. 4. A dotted line from $T_1[i]$ to $T_2[j]$
indicates that $T_1[i]$ should be changed to $T_2[j]$ if $T_1[i] \ne T_2[j]$, or that $T_1[i]$ remains
unchanged if $T_1[i] = T_2[j]$. The nodes of $T_1$ not touched by a dotted line are to be
deleted and the nodes of $T_2$ not touched are to be inserted. The mapping shows a way
to transform $T_1$ to $T_2$.
Formally we define a triple $(M, T_1, T_2)$ to be a mapping from $T_1$ to $T_2$, where $M$
is any set of pairs of integers $(i, j)$ satisfying:
(1) $1 \le i \le N_1$, $1 \le j \le N_2$;
(2) For any pair $(i_1, j_1)$ and $(i_2, j_2)$ in $M$,
(a) $i_1 = i_2$ if and only if $j_1 = j_2$ (one-to-one),
(b) $T_1[i_1]$ is to the left of $T_1[i_2]$ if and only if $T_2[j_1]$ is to the left of $T_2[j_2]$
(sibling order preserved),
(c) $T_1[i_1]$ is an ancestor of $T_1[i_2]$ if and only if $T_2[j_1]$ is an ancestor of $T_2[j_2]$
(ancestor order preserved).
We will use $M$ instead of $(M, T_1, T_2)$ if there is no confusion. Let $M$ be a mapping
from $T_1$ to $T_2$. Let $I$ and $J$ be the sets of nodes in $T_1$ and $T_2$, respectively, not touched
by any line in $M$. Then we can define the cost of $M$:
\[
\gamma(M) = \sum_{(i,j) \in M} \gamma(T_1[i] \to T_2[j]) + \sum_{i \in I} \gamma(T_1[i] \to \Lambda) + \sum_{j \in J} \gamma(\Lambda \to T_2[j]).
\]
Note that our definition of mapping is different from the definition in [T]. We believe that our definition
is more natural because it does not depend on any traversal ordering of the tree.
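The mapping conditions and the cost formula can be checked mechanically once nodes are numbered in left-to-right postorder and the leftmost-leaf number l(i) of each node is known. The sketch below assumes unit costs and uses two hard-coded six-node trees, f(d(a,c(b)),e) and f(c(d(a,b)),e), as an assumed stand-in for the trees of Fig. 4, which did not survive in this copy. All function names are ours.

```python
# Postorder labels and leftmost-leaf numbers l(), 1-indexed (index 0 unused).
A, L1 = [None, "a", "b", "c", "d", "e", "f"], [None, 1, 2, 2, 1, 5, 1]
B, L2 = [None, "a", "b", "d", "c", "e", "f"], [None, 1, 2, 1, 1, 5, 1]

def is_ancestor(l, u, v):
    """True if node u is a proper ancestor of node v (postorder numbers)."""
    return l[u] <= v < u

def is_left_of(l, u, v):
    """True if u precedes v and is not a descendant of v."""
    return u < l[v]

def is_mapping(M, l1, l2):
    """Check conditions (1) and (2a)-(2c) for a set M of (i, j) pairs."""
    n1, n2 = len(l1) - 1, len(l2) - 1
    for (i1, j1) in M:
        if not (1 <= i1 <= n1 and 1 <= j1 <= n2):
            return False
        for (i2, j2) in M:
            if (i1 == i2) != (j1 == j2):
                return False
            if is_left_of(l1, i1, i2) != is_left_of(l2, j1, j2):
                return False
            if is_ancestor(l1, i1, i2) != is_ancestor(l2, j1, j2):
                return False
    return True

def mapping_cost(M, a, b):
    """Cost of M under unit costs: changes + deletions + insertions."""
    changes = sum(1 for (i, j) in M if a[i] != b[j])
    deletions = (len(a) - 1) - len({i for (i, _) in M})
    insertions = (len(b) - 1) - len({j for (_, j) in M})
    return changes + deletions + insertions

good = {(1, 1), (2, 2), (4, 3), (5, 5), (6, 6)}   # c is deleted and re-inserted
bad = {(3, 4), (4, 3)}                            # reverses an ancestor pair
```

Here `good` leaves node 3 of the first tree and node 4 of the second untouched, so its cost is one deletion plus one insertion.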
3. A simple new algorithm. This algorithm, unlike [T], [L], and [Z83], will, in its
intermediate steps, consider the distance between two ordered forests. At first sight
one may think that this will complicate the work, but it will in fact make matters easier.
We use a postorder numbering of the nodes in the trees. In the postordering,
$T_1[1..i]$ and $T_2[1..j]$ will generally be forests as in Fig. 5. (The edges are those in the
subgraph of the tree induced by the vertices.) Fortunately, the definition of mapping
for ordered forests is the same as for trees.
3.1. Notation. Let $T[i]$ be the $i$th node in the tree according to the left-to-right
postorder numbering; $l(i)$ is the number of the leftmost leaf descendant of the subtree
rooted at $T[i]$. When $T[i]$ is a leaf, $l(i) = i$. The parent of $T[i]$ is denoted $p(i)$.
We define $p^0(i) = i$, $p^1(i) = p(i)$, $p^2(i) = p(p^1(i))$, and so on. Let
$\mathrm{anc}(i) = \{p^k(i) \mid 0 \le k \le \mathrm{depth}(i)\}$.
$T[i..j]$ is the ordered subforest of $T$ induced by the nodes numbered $i$ to $j$ inclusive
(Fig. 5). If $i > j$, then $T[i..j] = \emptyset$. $T[1..i]$ will be referred to as forest(i), when the
tree $T$ referred to is clear. $T[l(i)..i]$ will be referred to as tree(i). Size(i) is the number
of nodes in tree(i).
[FIG. 5: a tree $T$ and the subforest $T[1..7]$.]
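The notation of Section 3.1 can be made concrete with a short sketch that numbers a tree in left-to-right postorder and records l(i) for every node. The tuple representation `(label, [children])` and the names `postorder_index` and `size` are assumptions made for this illustration.

```python
def postorder_index(tree):
    """Return (labels, l), both 1-indexed lists with index 0 unused.

    labels[i] is the label of T[i]; l[i] is the postorder number of the
    leftmost leaf descendant of T[i]. A tree is a (label, [children]) pair.
    """
    labels, l = [None], [None]

    def visit(node):
        label, children = node
        leftmost = [visit(child) for child in children]
        labels.append(label)
        i = len(labels) - 1          # postorder number of this node
        # A leaf is its own leftmost leaf; otherwise inherit from first child.
        l.append(leftmost[0] if leftmost else i)
        return l[i]

    visit(tree)
    return labels, l

def size(l, i):
    """Size(i): tree(i) is T[l(i)..i], so it has i - l(i) + 1 nodes."""
    return i - l[i] + 1

# Example tree (an assumption for illustration): f(d(a, c(b)), e)
t1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
labels, l = postorder_index(t1)
# labels == [None, 'a', 'b', 'c', 'd', 'e', 'f'];  l == [None, 1, 2, 2, 1, 5, 1]
```

Because tree(i) occupies the contiguous range l(i)..i, the two arrays are all the structural information the dynamic program needs.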
3.2. New algorithm. We first present three lemmas and then give our new algorithm.
Recall that $\mathrm{anc}(i) = \{p^k(i) \mid 0 \le k \le \mathrm{depth}(i)\}$.
LEMMA 3. (i) $\mathrm{forestdist}(\emptyset, \emptyset) = 0$.
(ii) $\mathrm{forestdist}(T_1[l(i_1)..i], \emptyset) = \mathrm{forestdist}(T_1[l(i_1)..i-1], \emptyset) + \gamma(T_1[i] \to \Lambda)$.
(iii) $\mathrm{forestdist}(\emptyset, T_2[l(j_1)..j]) = \mathrm{forestdist}(\emptyset, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$,
where $i_1 \in \mathrm{anc}(i)$ and $j_1 \in \mathrm{anc}(j)$.
Proof. Case (i) requires no edit operation. In (ii) and (iii), the distances correspond
to the cost of deleting or inserting the nodes in $T_1[l(i_1)..i]$ and $T_2[l(j_1)..j]$, respectively.
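LEMMA 4. Let $i_1 \in \mathrm{anc}(i)$ and $j_1 \in \mathrm{anc}(j)$. Then
\[
\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j) = \min \begin{cases}
\mathrm{forestdist}(l(i_1)..i-1,\, l(j_1)..j) + \gamma(T_1[i] \to \Lambda), \\
\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]), \\
\mathrm{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + \mathrm{forestdist}(l(i)..i-1,\, l(j)..j-1) + \gamma(T_1[i] \to T_2[j]).
\end{cases}
\]
Since these three cases express all the possible mappings yielding
$\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j)$, we take the minimum of these three costs.
LEMMA 5. Let $i_1 \in \mathrm{anc}(i)$ and $j_1 \in \mathrm{anc}(j)$. Then
(1) If $l(i) = l(i_1)$ and $l(j) = l(j_1)$,
\[
\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j) = \min \begin{cases}
\mathrm{forestdist}(l(i_1)..i-1,\, l(j_1)..j) + \gamma(T_1[i] \to \Lambda), \\
\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]), \\
\mathrm{forestdist}(l(i_1)..i-1,\, l(j_1)..j-1) + \gamma(T_1[i] \to T_2[j]).
\end{cases}
\]
(2) If $l(i) \ne l(i_1)$ or $l(j) \ne l(j_1)$ (i.e., otherwise),
\[
\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j) = \min \begin{cases}
\mathrm{forestdist}(l(i_1)..i-1,\, l(j_1)..j) + \gamma(T_1[i] \to \Lambda), \\
\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]), \\
\mathrm{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + \mathrm{treedist}(i, j).
\end{cases}
\]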
Proof. By Lemma 4, if $l(i) = l(i_1)$ and $l(j) = l(j_1)$ then, since
$\mathrm{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) = \mathrm{forestdist}(\emptyset, \emptyset) = 0$, (1) follows immediately.
Because the distance is the cost of a minimal cost mapping, we know
$\mathrm{forestdist}(l(i_1)..i,\, l(j_1)..j) \le \mathrm{forestdist}(l(i_1)..l(i)-1,\, l(j_1)..l(j)-1) + \mathrm{treedist}(i, j)$, since the
latter formula represents a particular (and therefore possibly suboptimal) mapping of
$T_1[l(i_1)..i]$ to $T_2[l(j_1)..j]$.
[Figure: the two cases of Lemma 5, $l(i) = l(i_1)$ and $l(j) = l(j_1)$ versus $l(i) \ne l(i_1)$ or $l(j) \ne l(j_1)$, for the pair $T_1[i]$, $T_2[j]$.]
It is easy to see that there is a linear time algorithm to compute the function l()
and the set LR_keyroots. We can also assume that the results are in arrays l and LR_keyroots.
Furthermore, the elements of array LR_keyroots are in increasing order.
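A linear-time computation of LR_keyroots from the array l can be sketched as below. The definition of LR_keyroots is not part of this copy of the text, so the sketch assumes the usual one: a node k is a keyroot if no node with a larger postorder number has the same leftmost leaf, that is, k is the root or has a left sibling. The function name `lr_keyroots` is ours.

```python
def lr_keyroots(l):
    """Return the LR_keyroots in increasing order.

    l is the 1-indexed leftmost-leaf array (index 0 unused). Assumed
    definition: k is a keyroot when no k' > k satisfies l[k'] == l[k].
    """
    seen = set()
    roots = []
    # Scan from the root downward; the first node met for each value of
    # l() is the largest node with that leftmost leaf.
    for k in range(len(l) - 1, 0, -1):
        if l[k] not in seen:
            seen.add(l[k])
            roots.append(k)
    roots.reverse()                  # increasing order, as the text assumes
    return roots

# l() arrays of two assumed example trees, f(d(a,c(b)),e) and f(c(d(a,b)),e):
keyroots1 = lr_keyroots([None, 1, 2, 2, 1, 5, 1])
keyroots2 = lr_keyroots([None, 1, 2, 1, 1, 5, 1])
```

Every node lies on the leftmost path of exactly one keyroot, which is why computing treedist only for keyroot pairs still fills in the distance for every pair of subtrees.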
We are now ready to give our new simple algorithm.
Input: Tree T1 and T2.
Output: Tree_dist(i, j), where 1 ≤ i ≤ |T1| and 1 ≤ j ≤ |T2|.
Preprocessing
    (To compute l(), LR_keyroots1 and LR_keyroots2)
Main loop
    for i' := 1 to |LR_keyroots(T1)|
        for j' := 1 to |LR_keyroots(T2)|
            i = LR_keyroots1[i'];
            j = LR_keyroots2[j'];
            Compute treedist(i, j);
We use dynamic programming to compute treedist(i, j). The forestdist values computed
and used here are put in a temporary array that is freed once the corresponding treedist
is computed. The treedist values are put in the permanent treedist array.
The computation of treedist(i,j).
forestdist(∅, ∅) = 0;
for i1 := l(i) to i
    forestdist(T1[l(i)..i1], ∅) = forestdist(T1[l(i)..i1-1], ∅) + γ(T1[i1] → Λ)
for j1 := l(j) to j
    forestdist(∅, T2[l(j)..j1]) = forestdist(∅, T2[l(j)..j1-1]) + γ(Λ → T2[j1])
for i1 := l(i) to i
    for j1 := l(j) to j
        if l(i1) = l(i) and l(j1) = l(j) then
            forestdist(T1[l(i)..i1], T2[l(j)..j1]) = min {
                forestdist(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
                forestdist(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
                forestdist(T1[l(i)..i1-1], T2[l(j)..j1-1]) + γ(T1[i1] → T2[j1]) }
            treedist(i1, j1) = forestdist(T1[l(i)..i1], T2[l(j)..j1]) /* put in permanent array */
        else
            forestdist(T1[l(i)..i1], T2[l(j)..j1]) = min {
                forestdist(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
                forestdist(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
                forestdist(T1[l(i)..l(i1)-1], T2[l(j)..l(j1)-1]) + treedist(i1, j1) }
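The pseudocode above can be turned into a compact runnable sketch. This is one possible transcription, not the authors' implementation: it assumes unit costs by default, represents a tree as a `(label, [children])` pair, uses `None` for the null node Λ, and relies on the assumed keyroot definition (largest node for each leftmost leaf). All names are ours.

```python
def index(tree):
    """Postorder labels and leftmost-leaf numbers, 1-indexed."""
    labels, l = [None], [None]
    def visit(node):
        label, children = node
        leftmost = [visit(c) for c in children]
        labels.append(label)
        l.append(leftmost[0] if leftmost else len(labels) - 1)
        return l[-1]
    visit(tree)
    return labels, l

def keyroots(l):
    """Largest node number for each distinct leftmost leaf, increasing."""
    seen, roots = set(), []
    for k in range(len(l) - 1, 0, -1):
        if l[k] not in seen:
            seen.add(l[k])
            roots.append(k)
    return sorted(roots)

def unit_cost(x, y):
    """Assumed cost function; None plays the role of the null node."""
    return 0 if x == y else 1

def tree_edit_distance(t1, t2, cost=unit_cost):
    a, l1 = index(t1)
    b, l2 = index(t2)
    n, m = len(a) - 1, len(b) - 1
    td = [[0] * (m + 1) for _ in range(n + 1)]       # permanent treedist array
    for i in keyroots(l1):
        for j in keyroots(l2):
            # fd[x][y] = forestdist(T1[l(i)..l(i)+x-1], T2[l(j)..l(j)+y-1]);
            # this temporary table is discarded after each keyroot pair.
            rows, cols = i - l1[i] + 2, j - l2[j] + 2
            fd = [[0] * cols for _ in range(rows)]
            for x in range(1, rows):
                fd[x][0] = fd[x - 1][0] + cost(a[l1[i] + x - 1], None)
            for y in range(1, cols):
                fd[0][y] = fd[0][y - 1] + cost(None, b[l2[j] + y - 1])
            for x in range(1, rows):
                i1 = l1[i] + x - 1
                for y in range(1, cols):
                    j1 = l2[j] + y - 1
                    delete = fd[x - 1][y] + cost(a[i1], None)
                    insert = fd[x][y - 1] + cost(None, b[j1])
                    if l1[i1] == l1[i] and l2[j1] == l2[j]:
                        change = fd[x - 1][y - 1] + cost(a[i1], b[j1])
                        fd[x][y] = min(delete, insert, change)
                        td[i1][j1] = fd[x][y]
                    else:
                        # Forest prefixes ending just before tree(i1), tree(j1).
                        p, q = l1[i1] - l1[i], l2[j1] - l2[j]
                        fd[x][y] = min(delete, insert, fd[p][q] + td[i1][j1])
    return td[n][m]
```

Because the keyroots are processed in increasing order, every treedist value read in the else branch has already been stored by an earlier keyroot pair or earlier in the same table.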
THEOREM 1. The basic algorithm is correct.
Proof. We will prove that for any pair $(i, j)$ such that $i \in LR\_keyroots(T_1)$ and
$j \in LR\_keyroots(T_2)$, the following invariants hold.
tree_dist
0 1 2 3 1 5
1 0 2 3 1 5
2 1 2 2 2 4
3 3 1 2 4 4
1 1 3 4 0 5
5 5 3 3 5 2
FIG. 8. The result of computation for $T_1$ and $T_2$ in Fig. 4.
\[
\sum_{i=1}^{|LR\_keyroots(T)|} \mathrm{Size}(i) = \sum_{j=1}^{N} LR\_colldepth(j)
\]
Consider the time complexity of our algorithm. The preprocessing takes linear
time. The subtree distance dynamic programming algorithm takes Size(i) × Size(j) for
the subtree rooted at $T_1[i]$ and the subtree rooted at $T_2[j]$. We have a main loop that
calls this subroutine several times. So the time is:
\[
\sum_{i=1}^{|LR\_keyroots(T_1)|} \sum_{j=1}^{|LR\_keyroots(T_2)|} \mathrm{Size}(i) \times \mathrm{Size}(j)
= \sum_{i=1}^{|LR\_keyroots(T_1)|} \mathrm{Size}(i) \times \sum_{j=1}^{|LR\_keyroots(T_2)|} \mathrm{Size}(j)
= \sum_{i=1}^{N_1} LR\_colldepth(i) \times \sum_{j=1}^{N_2} LR\_colldepth(j).
\]
4.2. Mapping. It is natural to ask for a mapping that yields the distance computed.
Also given two trees, we may ask, what is the largest common substructure of these
two trees? This is analogous to the longest common substring problem for strings. We
can find the mapping in the same time and space complexity as finding the distance,
although we do not give the details here. The mapping is produced by our toolkit.
4.3. Parallel implementation. A straightforward transformation of our algorithm
to a parallel one yields an algorithm with time complexity $O(N_1 + N_2)$, whereas [T]
and [Z83] have time complexity $O((N_1 + N_2) \times (\mathrm{depth}(T_1) + \mathrm{depth}(T_2)))$. Our algorithm
uses $O(\min(|T_1|, |T_2|) \times \mathrm{leaves}(T_1) \times \mathrm{leaves}(T_2))$ processors.
Actually, by controlling the starting point of each treedist computation more carefully, we can reduce
the processor bound to $O(\min(|T_1|, |T_2|) \times \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \times \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$. The
algorithm is more complicated, however.
The algorithm computes in "waves" for all subtree pairs tree(i) and tree(j), where
$i \in LR\_keyroots(T_1)$ and $j \in LR\_keyroots(T_2)$, simultaneously. We start at wave 0. At
wave $k$, for each such subtree pair tree(i) and tree(j), compute
$\mathrm{forestdist}(l(i)..i_1,\, l(j)..j_1)$, where $(i_1 - l(i)) + (j_1 - l(j)) = k$.
We now present the parallel algorithm in detail. (When the PARBEGIN-PAREND
construct surrounds one or more for loops, it means that every setting of the iterators
in the enclosed for loops can be executed in parallel. The semantics are those of the
sequential program ignoring this construct.)
In the algorithm, dist[i, j] is the array for the computation of treedist(i, j). Therefore
dist[i, j][p, q] is the distance forestdist(l(i)..p, l(j)..q) and is the (p, q)th member of
the array computing treedist(i, j).
        ...
        dist[i, j][l(i) + k, l(j) ...
PAREND
...
    for i' := 1 to |LR_keyroots(T1)|
        for j' := 1 to |LR_keyroots(T2)|
            i := LR_keyroots1[i']
            j := LR_keyroots2[j']
            dist[i, j][l(i) - 1, l(j) + k]
                := dist[i, j][l(i) - 1, l(j) + k - 1] + γ(Λ → T2[l(j) + k])
PAREND
for k := 0 to N + M - 2
PARBEGIN
    for i' := 1 to |LR_keyroots(T1)|
        for j' := 1 to |LR_keyroots(T2)|
            i := LR_keyroots1[i']
            j := LR_keyroots2[j']
            for i1, j1 satisfying i1 - l(i) + j1 - l(j) = k and l(i) ≤ i1 ≤ i, l(j) ≤ j1 ≤ j
                if l(i) = l(i1) and l(j) = l(j1) then
                    dist[i, j][i1, j1] := min {
                        dist[i, j][i1 - 1, j1] + γ(T1[i1] → Λ)
                        ...
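The wave ordering can be illustrated by a sequential simulation: all keyroot pairs advance together, and wave k fills exactly the cells with (i1 - l(i)) + (j1 - l(j)) = k. The sketch below is not the parallel algorithm itself; it only shows that this ordering respects the data dependencies. It assumes unit costs, takes the postorder label and l() arrays as input, and uses names of our own choosing. The example arrays describe two assumed six-node trees.

```python
def keyroots(l):
    """Largest node number for each distinct leftmost leaf, increasing."""
    seen, roots = set(), []
    for k in range(len(l) - 1, 0, -1):
        if l[k] not in seen:
            seen.add(l[k])
            roots.append(k)
    return sorted(roots)

def treedist_by_waves(a, l1, b, l2):
    """Sequential simulation of the wave order, with unit costs assumed."""
    n, m = len(a) - 1, len(b) - 1
    pairs = [(i, j) for i in keyroots(l1) for j in keyroots(l2)]
    dist = {pair: {} for pair in pairs}      # dist[i, j][p, q]
    treedist = {}
    for (i, j) in pairs:                     # boundary rows and columns
        d = dist[(i, j)]
        d[(l1[i] - 1, l2[j] - 1)] = 0
        for p in range(l1[i], i + 1):
            d[(p, l2[j] - 1)] = d[(p - 1, l2[j] - 1)] + 1
        for q in range(l2[j], j + 1):
            d[(l1[i] - 1, q)] = d[(l1[i] - 1, q - 1)] + 1
    for k in range(n + m - 1):               # waves 0 .. n + m - 2
        # Every cell handled in this iteration reads only boundary cells or
        # cells of earlier waves, so a parallel machine could fill them at once.
        for (i, j) in pairs:
            d = dist[(i, j)]
            for p in range(l1[i], i + 1):
                q = l2[j] + k - (p - l1[i])
                if not l2[j] <= q <= j:
                    continue
                delete = d[(p - 1, q)] + 1
                insert = d[(p, q - 1)] + 1
                if l1[p] == l1[i] and l2[q] == l2[j]:
                    change = d[(p - 1, q - 1)] + (0 if a[p] == b[q] else 1)
                    d[(p, q)] = treedist[(p, q)] = min(delete, insert, change)
                else:
                    d[(p, q)] = min(delete, insert,
                                    d[(l1[p] - 1, l2[q] - 1)] + treedist[(p, q)])
    return treedist[(n, m)]

# Assumed example: f(d(a,c(b)),e) versus f(c(d(a,b)),e).
A, L1 = [None, "a", "b", "c", "d", "e", "f"], [None, 1, 2, 2, 1, 5, 1]
B, L2 = [None, "a", "b", "d", "c", "e", "f"], [None, 1, 2, 1, 1, 5, 1]
```

The else branch is safe because treedist(p, q) is produced in the table of the keyroot pair whose leftmost paths contain p and q, at wave (p - l(p)) + (q - l(q)), which is strictly smaller than k whenever l(p) > l(i) or l(q) > l(j).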
FIG. 9. Remove subtree rooted at T[8].
FIG. 10. Pruning at T[8]: remove all its proper descendants.
Assume an ordering for tree $T$. Define a subtree set $S(T)$ as follows: $S(T)$ is a
set of numbers satisfying
(1) $i \in S(T)$ implies that $1 \le i \le |T|$;
(2) $i, j \in S(T)$ implies that neither is an ancestor of the other.
Define $R(T, S(T))$ to be the tree $T$ with removal at all nodes in $S(T)$.
Define $P(T, S(T))$ to be the tree $T$ with pruning at all nodes in $S(T)$.
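The two operators can be sketched on trees written as `(label, [children])` pairs, with the nodes in S(T) given as postorder numbers of the original tree. The function names `remove_subtrees` and `prune` and the example tree are assumptions made for this illustration.

```python
def _cut(tree, nodes, keep_root):
    """Shared traversal: number nodes in postorder and cut at `nodes`."""
    counter = 0

    def visit(node):
        nonlocal counter
        label, children = node
        kept = [visit(child) for child in children]
        counter += 1                      # postorder number of this node
        if counter in nodes:
            # Pruning keeps the node as a leaf; removal drops it entirely.
            return (label, []) if keep_root else None
        return (label, [c for c in kept if c is not None])

    return visit(tree)

def remove_subtrees(tree, s):
    """R(T, S): delete the whole subtree rooted at every node in s."""
    return _cut(tree, set(s), keep_root=False)

def prune(tree, s):
    """P(T, S): delete all proper descendants of every node in s."""
    return _cut(tree, set(s), keep_root=True)

# Assumed example tree f(d(a, c(b)), e); node 4 is d in postorder.
t = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
removed = remove_subtrees(t, {4})     # f(e)
pruned = prune(t, {4})                # f(d, e)
```

Condition (2) on S(T) makes the result independent of the order in which the cuts are applied, since no cut lies inside another.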
Now we can give the definition of approximate tree matching. Given trees $T$ and
PAT, for each $i$, we want to calculate
\[
D_R(T[l(i)..i], PAT) = \min_{S} \{\mathrm{treedist}(R(T[l(i)..i], S(T[l(i)..i])), PAT)\}.
\]
empty_initialization:
    F_DR(∅, ∅) = 0
left_initialization:
    F_DR(T1[l(i)..i1], ∅) = 0
right_initialization:
    F_DR(∅, T2[l(j)..j1]) = F_DR(∅, T2[l(j)..j1-1]) + γ(Λ → T2[j1])
general_if_computation
/* applies if l(i1) = l(i) and l(j1) = l(j) */
    F_DR(T1[l(i)..i1], T2[l(j)..j1]) = min {
        F_DR(∅, T2[l(j)..j1]),
        F_DR(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DR(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DR(T1[l(i)..i1-1], T2[l(j)..j1-1]) + γ(T1[i1] → T2[j1]) }
/* put the derived treedist in the permanent array, as specified by template */
general_else_computation
    F_DR(T1[l(i)..i1], T2[l(j)..j1]) = min {
        F_DR(T1[l(i)..l(i1)-1], T2[l(j)..j1]),
        F_DR(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DR(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DR(T1[l(i)..l(i1)-1], T2[l(j)..l(j1)-1]) + T_DR(i1, j1) }
LEMMA 8. Algorithm Subtree Removal is correct.
Proof. First we show that the initialization is correct. The empty_initialization and
the right_initialization are the same as in the tree distance algorithm. The left_initialization
$F\_DR(T_1[l(i)..i_1], \emptyset) = 0$ is correct, because we can remove all of $T_1[l(i)..i_1]$.
For the general term $F\_DR(T_1[l(i)..i_1], T_2[l(j)..j_1])$, we ask first whether or not
the subtree $T_1[l(i_1)..i_1]$ is removed. If it is removed, then the distance should be
$F\_DR(T_1[l(i)..l(i_1)-1], T_2[l(j)..j_1])$. Otherwise, consider the mapping between
$T_1[l(i)..i_1]$ and $T_2[l(j)..j_1]$ after we perform an optimal removal of subtrees of
$T_1[l(i)..i_1]$. Now we have the same three cases as in Lemma 4. Hence the general
expression should be the minimum of these four terms:
\[
F\_DR(T_1[l(i)..i_1], T_2[l(j)..j_1]) = \min \begin{cases}
F\_DR(T_1[l(i)..l(i_1)-1], T_2[l(j)..j_1]), \\
F\_DR(T_1[l(i)..i_1-1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda), \\
F\_DR(T_1[l(i)..i_1], T_2[l(j)..j_1-1]) + \gamma(\Lambda \to T_2[j_1]), \\
F\_DR(T_1[l(i)..l(i_1)-1], T_2[l(j)..l(j_1)-1]) + F\_DR(T_1[l(i_1)..i_1-1], T_2[l(j_1)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1]).
\end{cases}
\]
As in Lemma 5, this specializes to the general_if_computation and the
general_else_computation given in the algorithm.
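The subtree removal recurrences can be sketched by changing two things in the basic dynamic program: the left initialization becomes 0, and every cell gains the extra option of removing the current subtree of the text. The sketch below is our reading of the recurrences above, not the authors' code; it assumes unit costs and the `(label, [children])` tree representation, and all names are ours.

```python
def index(tree):
    """Postorder labels and leftmost-leaf numbers, 1-indexed."""
    labels, l = [None], [None]
    def visit(node):
        label, children = node
        leftmost = [visit(c) for c in children]
        labels.append(label)
        l.append(leftmost[0] if leftmost else len(labels) - 1)
        return l[-1]
    visit(tree)
    return labels, l

def keyroots(l):
    """Largest node number for each distinct leftmost leaf, increasing."""
    seen, roots = set(), []
    for k in range(len(l) - 1, 0, -1):
        if l[k] not in seen:
            seen.add(l[k])
            roots.append(k)
    return sorted(roots)

def removal_distance(text, pat):
    """Distance from text to pat when subtrees of text may be removed free."""
    a, l1 = index(text)
    b, l2 = index(pat)
    n, m = len(a) - 1, len(b) - 1
    t_dr = [[0] * (m + 1) for _ in range(n + 1)]
    for i in keyroots(l1):
        for j in keyroots(l2):
            rows, cols = i - l1[i] + 2, j - l2[j] + 2
            # left_initialization: column 0 stays 0 (remove everything).
            f = [[0] * cols for _ in range(rows)]
            for y in range(1, cols):              # right_initialization
                f[0][y] = f[0][y - 1] + 1
            for x in range(1, rows):
                i1 = l1[i] + x - 1
                p = l1[i1] - l1[i]                # nodes before tree(i1)
                for y in range(1, cols):
                    j1 = l2[j] + y - 1
                    q = l2[j1] - l2[j]            # nodes before tree(j1)
                    # Remove tree(i1), delete T1[i1], or insert T2[j1].
                    best = min(f[p][y], f[x - 1][y] + 1, f[x][y - 1] + 1)
                    if p == 0 and q == 0:         # general_if_computation
                        change = 0 if a[i1] == b[j1] else 1
                        best = min(best, f[x - 1][y - 1] + change)
                        t_dr[i1][j1] = best
                    else:                         # general_else_computation
                        best = min(best, f[p][q] + t_dr[i1][j1])
                    f[x][y] = best
    return t_dr[n][m]
```

For instance, with text a(b, c) and pattern a(c), removing the subtree b leaves an exact match, so the sketch returns 0, whereas the plain editing distance between the two trees is 1.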
5.2.2. Prune at any number of nodes from the TEXT tree. Given trees $T_1$ and $T_2$,
we want to know what is the minimum distance between $T_1[l(i)..i]$ and $T_2$ when there
have been zero or more prunings at nodes of $T_1[l(i)..i]$.
Let $F\_DP(T_1[l(i)..i_1], T_2[l(j)..j_1])$ denote the minimum distance between forest
$T_1[l(i)..i_1]$ and $T_2[l(j)..j_1]$ with zero or more prunings from $T_1[l(i)..i_1]$. Let
$T\_DP(i, j)$ denote the minimum distance between tree $T_1[l(i)..i]$ and $T_2[l(j)..j]$ with
zero or more prunings from $T_1[l(i)..i]$. The following initialization and general term
computation steps will give us an algorithm to solve our problem.
ALGORITHM PRUNINGS.
empty_initialization:
    F_DP(∅, ∅) = 0
left_initialization:
    F_DP(T1[l(i)..i1], ∅) = F_DP(T1[l(i)..l(i1)-1], ∅) + γ(T1[i1] → Λ)
right_initialization:
    F_DP(∅, T2[l(j)..j1]) = F_DP(∅, T2[l(j)..j1-1]) + γ(Λ → T2[j1])
general_if_computation
/* applies if l(i1) = l(i) and l(j1) = l(j) */
    F_DP(T1[l(i)..i1], T2[l(j)..j1]) = min {
        ...
        F_DP(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DP(T1[l(i)..i1-1], T2[l(j)..j1-1]) + γ(T1[i1] → T2[j1]) }
/* put the derived treedist in the permanent array, as specified by template */
general_else_computation
    F_DP(T1[l(i)..i1], T2[l(j)..j1]) = min {
        F_DP(T1[l(i)..l(i1)-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DP(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DP(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DP(T1[l(i)..l(i1)-1], T2[l(j)..l(j1)-1]) + T_DP(i1, j1) }
LEMMA 9. Algorithm Prunings is correct.
Proof. First we show that the initialization is correct. The empty_initialization and
the right_initialization are the same as in the tree distance algorithm. For left_initialization,
the best we can do for tree $T_1[l(i_1)..i_1]$ is to prune at $T_1[i_1]$. Therefore
$F\_DP(T_1[l(i)..i_1], \emptyset) = F\_DP(T_1[l(i)..l(i_1)-1], \emptyset) + \gamma(T_1[i_1] \to \Lambda)$. Hence the
left_initialization is correct.
For the general term $F\_DP(T_1[l(i)..i_1], T_2[l(j)..j_1])$, we have the following
similar three cases.
(1) $T_1[i_1]$ is not touched by a line of $M$.
(1a) (without pruning) $F\_DP(T_1[l(i)..i_1-1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)$
(1b) (with pruning) $F\_DP(T_1[l(i)..l(i_1)-1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)$
(2) $T_2[j_1]$ is not touched by a line of $M$. Since we only prune from $T_1$, there is
only one case here:
$F\_DP(T_1[l(i)..i_1], T_2[l(j)..j_1-1]) + \gamma(\Lambda \to T_2[j_1])$
(3) Both $T_1[i_1]$ and $T_2[j_1]$ are touched by lines of $M$.
(3a) (without pruning)
$F\_DP(T_1[l(i)..l(i_1)-1], T_2[l(j)..l(j_1)-1]) + F\_DP(T_1[l(i_1)..i_1-1], T_2[l(j_1)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1])$
If $l(i) = l(i_1)$ and $l(j) = l(j_1)$, consider cases (1b) and (3b). Case (1b) becomes
$F\_DP(\emptyset, T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)$. Case (3b) becomes
$F\_DP(\emptyset, T_2[l(j)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1])$. Now from the right_initialization we know that
\[
\begin{aligned}
F\_DP(\emptyset, T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)
&= F\_DP(\emptyset, T_2[l(j)..j_1-1]) + \gamma(\Lambda \to T_2[j_1]) + \gamma(T_1[i_1] \to \Lambda) \\
&\ge F\_DP(\emptyset, T_2[l(j)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1]).
\end{aligned}
\]
So the distance given by case (1b) ≥ the distance from (3b). The proposed
general_if_computation is therefore correct where the first term handles two cases.
If $l(i) \ne l(i_1)$ or $l(j) \ne l(j_1)$, consider case (3). As in Lemma 6, cases (3a) and (3b)
can be replaced by $F\_DP(T_1[l(i)..l(i_1)-1], T_2[l(j)..l(j_1)-1]) + T\_DP(i_1, j_1)$. The
proposed general_else_computation is therefore correct. Hence algorithm pruning is
correct. □
REFERENCES
[ALKBO] S. ALLUVIA, H. LOCKER-GILADI, S. KOBY, O. BEN-NUN, AND A.B. OPPENHEIM, RNase III
stimulates the translations of the cIII gene of bacteriophage lambda, Proc. Nat. Acad. Sci.
U.S.A., 85 (1987), pp. 1-5.
[BSSBWD] B. BERKOUT, B.F. SCHMIDT, A. STRIEN, J. BOOM, J. WESTRENEN, AND J. DUIN, Lysis
gene of bacteriophage MS2 is activated by translation termination at the overlapping coat gene,
Proc. Nat. Acad. Sci. U.S.A., 195 (1987), pp. 517-524.
[DA] N. DELIHAS AND J. ANDERSON, Generalized structures of 5s ribosomal RNA’s, Nucleic Acid
Res. 10 (1982) p. 7323.
[DD] I.C. DECKMAN AND D. E. DRAPER, S4-alpha mRNA translation regulation complex, Molecular
Biol 196 (1987), pp. 323-332.
[HO] C.M. HOFFMANN AND M. J. O’DONNELL, Pattern matching in trees, J. Assoc. Comput. Mach.,
29 (1982), pp. 68-95.
[L] S. Y. Lu, A tree-to-tree distance and its application to cluster analysis, IEEE Trans. Pattern Anal.
Mach. Intelligence, (1979), pp. 219-224.
[LV] G.M. LANDAU AND U. VISHKIN, Introducing efficient parallelism into approximate string
matching and a new serial algorithm, in Proc. 18th Annual ACM Symposium on Theory of
Computing, Association for Computing Machinery, New York, 1986, pp. 220-230.
[S80] P.H. SELLERS, The theory and computation of evolutionary distances, J. Algorithms, (1980),
pp. 359-373.
[S88] B.A. SHAPIRO, An algorithm for comparing multiple RNA secondary structures, Comput. Appl.
Biosci. (1988), pp. 387-393.
[SK] J.L. SUSSMAN AND S. H. KIM, Three dimensional structure of a transfer RNA in two crystal
forms, Science, 192 (1976), p. 853.
[T] KUO-CHUNG TAI, The tree-to-tree correction problem, J. Assoc. Comput. Mach., 26 (1979),
pp. 422-433.
[U83] E. UKKONEN, On approximate string matching, in Proc. Internat. Conference on the Foundations
of Computing Theory, Lecture Notes in Computer Science 158, Springer-Verlag, Berlin,
New York, 1983, pp. 487-495.
[U85] E. UKKONEN, Finding approximate pattern in strings, J. Algorithms, 6 (1985), pp. 132-137.
[WF] R. WAGNER AND M. FISCHER, The string-to-string correction problem, J. Assoc. Comput. Mach.,
21 (1974), pp. 168-178.
[Z83] KAIZHONG ZHANG, An algorithm for computing similarity of trees, Tech. Report, Mathematics
Department, Peking University, Peking, China, 1983.
[Z89] KAIZHONG ZHANG, The editing distance between trees: algorithms and applications, Ph.D. thesis,
Department of Computer Science, Courant Institute of Mathematical Sciences, New York University,
New York, 1989.
In a separate result obtained with the help of Rick Statman, we find that the editing distance between
unordered labelled trees (i.e., where the sibling order is insignificant) is NP-complete. The reduction is from
exact cover by 3-sets [Z89].