

Simple Fast Algorithms for the Editing Distance Between Trees

and Related Problems

Article in SIAM Journal on Computing · December 1989

DOI: 10.1137/0218082 · Source: DBLP


SIAM J. COMPUT. (C) 1989 Society for Industrial and Applied Mathematics
Vol. 18, No. 6, pp. 1245-1262, December 1989 011

SIMPLE FAST ALGORITHMS FOR THE EDITING DISTANCE BETWEEN

TREES AND RELATED PROBLEMS*
KAIZHONG ZHANG† AND DENNIS SHASHA‡

Abstract. Ordered labeled trees are trees in which the left-to-right order among siblings is significant.
The distance between two ordered trees is considered to be the weighted number of edit operations
(insert, delete, and modify) to transform one tree to another. The problem of approximate tree matching is
also considered. Specifically, algorithms are designed to answer the following kinds of questions:
1. What is the distance between two trees?
2. What is the minimum distance between $T_1$ and $T_2$ when zero or more subtrees can be removed
from $T_2$?
3. Let the pruning of a tree at node n mean removing all the descendants of node n. The analogous
question for prunings as for subtrees is answered.
A dynamic programming algorithm is presented to solve the three questions in sequential time
$O(|T_1| \times |T_2| \times \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \times \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$
and space $O(|T_1| \times |T_2|)$ compared with
$O(|T_1| \times |T_2| \times (\mathrm{depth}(T_1))^2 \times (\mathrm{depth}(T_2))^2)$ for the best previous published
algorithm due to Tai [J. Assoc. Comput. Mach., 26 (1979), pp. 422-433]. Further, the algorithm presented
here can be parallelized to give time $O(|T_1| + |T_2|)$.

Key words. trees, editing distance, parallel algorithm, dynamic programming, pattern recognition

AMS(MOS) subject classifications. 68P05, 68Q25, 68Q20, 68R10

1. Motivation.
1.1. Applications. Ordered labeled trees are trees whose nodes are labeled and in
which the left-to-right order among siblings is significant. As such they can represent
grammar parses, image descriptions, and many other phenomena. Comparing such
trees is a way to compare scenes, parses, and so on.
As an example, consider the secondary structure comparison problem for RNA.
Because RNA is a single strand of nucleotides, it folds back onto itself into a shape
that is topologically a tree (called its secondary structure). Each node of this tree
contains several nucleotides. Nodes have colorful labels such as "bulge" and "hairpin."
Various researchers [ALKBO], [BSSBWD], [DD] have observed that the secondary
structure influences translation rates (from RNA to proteins). Because different sequences
can produce similar secondary structures [DA], [SK], comparisons among secondary
structures are necessary to understanding the comparative functionality of different
RNAs. Previous methods for comparing multiple secondary structures of RNA
molecules represent the tree structures as parenthesized strings [S88]. These have been
recently converted to using our tree distance algorithms.
Currently we are implementing a package containing algorithms described in this
paper and some other related algorithms. A preliminary version of the package is being
used at the National Cancer Institute for the RNA comparison problem.

* Received by the editors August 5, 1987; accepted for publication (in revised form) February 12, 1989.
This work was partially supported by the National Science Foundation under grant number DCR8501611
and by the Office of Naval Research under grant number N00014-85-K-0046.
† Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York,
New York 10012 ([email protected]). Present address, Department of Computer Science, Middlesex
College, The University of Western Ontario, London, Ontario, Canada N6A 5B7.
‡ Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street, New York,
New York, 10012 ([email protected]).

1.2. Algorithmic approach. The tree distance problem is harder than the string
distance problem. Intuitively, here is why. In the string case, if $S_1[i] = S_2[j]$, then the
distance between $S_1[1..i-1]$ and $S_2[1..j-1]$ is the same as between $S_1[1..i]$ and
$S_2[1..j]$. The main difficulty in the tree case is that preserving ancestor relationships
in the mapping between trees prevents the analogous implication from holding.
By introducing the distance between ordered forests and carefully eliminating
certain subtree-to-subtree distance calculations, we are able to improve the time and
space of the best previous published algorithm [T]. Note that the improvement of space
for this problem is extremely important in practical applications.
Besides improving on the time and space of the best previous algorithm [T], our
algorithm is far simpler to understand and to implement. In style, it resembles algorithms
for computing the distance between strings. In fact, the string distance algorithm is a
special case of our algorithm when the input is a string.
2. Definitions.
2.1. Edit operations and editing distance between trees. Let us consider three kinds
of operations. Changing node n means changing the label on n. Deleting a node n
means making the children of n become the children of the parent of n and then
removing n. Inserting is the complement of delete. This means that inserting n as the
child of n’ will make n the parent of a consecutive subsequence of the current children
of n’. Figs. 1-3 illustrate these editing operations.

FIG. 1. $T_1$ to $T_2$ via $(a \to b)$.

FIG. 2. $(b \to \Lambda)$.

FIG. 3. $(\Lambda \to b)$.

(1) Change. To change one node label to another.

(2) Delete. To delete a node. (All children of the deleted node b become children
of the parent a.)
(3) Insert. To insert a node. (A consecutive sequence of siblings among the children
of a become the children of b.)
Following [WF] and [T], we represent an edit operation as a pair $(a, b) \ne (\Lambda, \Lambda)$,
sometimes written as $a \to b$, where $a$ is either $\Lambda$ or a label of a node in tree $T_1$ and $b$
is either $\Lambda$ or a label of a node in tree $T_2$. We call $a \to b$ a change operation if $a \ne \Lambda$
and $b \ne \Lambda$; a delete operation if $b = \Lambda$; and an insert operation if $a = \Lambda$. Since many
nodes may have the same label, this notation is potentially ambiguous. It could be
made precise by identifying the nodes as well as their labels. However, in this paper,
which node is meant will always be clear from the context.
Let $S$ be a sequence $s_1, \ldots, s_k$ of edit operations. An $S$-derivation from $A$ to $B$
is a sequence of trees $A_0, \ldots, A_k$ such that $A = A_0$, $B = A_k$, and $A_{i-1} \to A_i$ via $s_i$ for
$1 \le i \le k$.
Let $\gamma$ be a cost function that assigns to each edit operation $a \to b$ a nonnegative
real number $\gamma(a \to b)$. This cost can be different for different nodes, so it can be used
to give greater weights to, for example, the higher nodes in a tree than to lower nodes.
We constrain $\gamma$ to be a distance metric. That is,
(i) $\gamma(a \to b) \ge 0$; $\gamma(a \to a) = 0$;
(ii) $\gamma(a \to b) = \gamma(b \to a)$; and
(iii) $\gamma(a \to c) \le \gamma(a \to b) + \gamma(b \to c)$.
We extend $\gamma$ to the sequence $S$ by letting $\gamma(S) = \sum_{i=1}^{k} \gamma(s_i)$. Formally the distance
between $T_1$ and $T_2$ is defined as follows:

$$\delta(T_1, T_2) = \min \{\gamma(S) \mid S \text{ is an edit operation sequence taking } T_1 \text{ to } T_2\}.$$

The definition of $\gamma$ makes $\delta$ a distance metric also.
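To make conditions (i)-(iii) concrete, here is a minimal Python sketch that checks them by brute force over a small set of symbols. The names `unit_cost` and `is_metric` are hypothetical, `None` is an assumed stand-in for $\Lambda$, and the unit-cost function is only one possible choice of $\gamma$, not one prescribed by the paper.

```python
from itertools import product

def unit_cost(a, b):
    """Unit cost: 0 if the two symbols agree, 1 otherwise (None plays the role of Lambda)."""
    return 0 if a == b else 1

def is_metric(cost, symbols):
    """Check conditions (i)-(iii) for every combination of the given symbols."""
    for a, b in product(symbols, repeat=2):
        # (i) nonnegativity and zero cost for a -> a
        if cost(a, b) < 0 or cost(a, a) != 0:
            return False
        # (ii) symmetry
        if cost(a, b) != cost(b, a):
            return False
    for a, b, c in product(symbols, repeat=3):
        # (iii) triangle inequality
        if cost(a, c) > cost(a, b) + cost(b, c):
            return False
    return True

symbols = ["a", "b", "c", None]
print(is_metric(unit_cost, symbols))
```

A cost function that charges different amounts for insertion and deletion of the same label would fail condition (ii) in this check.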
2.2. Mapping. Let $T_1$ and $T_2$ be two trees with $N_1$ and $N_2$ nodes, respectively.
Suppose that we have an ordering for each tree, then $T[i]$ means the $i$th node of tree
$T$ in the given ordering.
The edit operations give rise to a mapping that is a graphical specification of what
edit operations apply to each node in the two trees (or two ordered forests). The
mapping in Fig. 4 shows a way to transform $T_1$ to $T_2$. It corresponds to the sequence
(delete (node with label c), insert (node with label c)).

FIG. 4. (Panels: $T_1$, $T_2$.)

Consider the diagram of a mapping in Fig. 4. A dotted line from $T_1[i]$ to $T_2[j]$
indicates that $T_1[i]$ should be changed to $T_2[j]$ if $T_1[i] \ne T_2[j]$, or that $T_1[i]$ remains
unchanged if $T_1[i] = T_2[j]$. The nodes of $T_1$ not touched by a dotted line are to be
deleted and the nodes of $T_2$ not touched are to be inserted. The mapping shows a way
to transform $T_1$ to $T_2$.
Formally we define a triple $(M, T_1, T_2)$ to be a mapping from $T_1$ to $T_2$, where $M$
is any set of pairs of integers $(i, j)$ satisfying:
(1) $1 \le i \le N_1$, $1 \le j \le N_2$;
(2) For any pair of $(i_1, j_1)$ and $(i_2, j_2)$ in $M$,
(a) $i_1 = i_2$ if and only if $j_1 = j_2$ (one-to-one),
(b) $T_1[i_1]$ is to the left of $T_1[i_2]$ if and only if $T_2[j_1]$ is to the left of $T_2[j_2]$
(sibling order preserved),
(c) $T_1[i_1]$ is an ancestor of $T_1[i_2]$ if and only if $T_2[j_1]$ is an ancestor of $T_2[j_2]$
(ancestor order preserved).
We will use $M$ instead of $(M, T_1, T_2)$ if there is no confusion. Let $M$ be a mapping
from $T_1$ to $T_2$. Let $I$ and $J$ be the sets of nodes in $T_1$ and $T_2$, respectively, not touched
by any line in $M$. Then we can define the cost of $M$:
$$\gamma(M) = \sum_{(i,j) \in M} \gamma(T_1[i] \to T_2[j]) + \sum_{i \in I} \gamma(T_1[i] \to \Lambda) + \sum_{j \in J} \gamma(\Lambda \to T_2[j]).$$
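The cost of a mapping can be evaluated directly from its definition. The sketch below is illustrative only: `mapping_cost` is a hypothetical helper, `None` stands for $\Lambda$, and the two label arrays are assumed postorder labelings chosen to agree with the delete-c, insert-c example of Fig. 4 (the figure itself is not reproduced in the text). The function does not check that `M` satisfies conditions (1) and (2).

```python
def mapping_cost(M, labels1, labels2, cost):
    """Cost of mapping M, a set of (i, j) pairs over 1-indexed label arrays."""
    touched1 = {i for i, _ in M}
    touched2 = {j for _, j in M}
    # Pairs joined by a line: change (or keep) the label.
    changes = sum(cost(labels1[i], labels2[j]) for i, j in M)
    # Nodes of T1 not touched by a line are deleted.
    deletes = sum(cost(labels1[i], None)
                  for i in range(1, len(labels1)) if i not in touched1)
    # Nodes of T2 not touched by a line are inserted.
    inserts = sum(cost(None, labels2[j])
                  for j in range(1, len(labels2)) if j not in touched2)
    return changes + deletes + inserts

unit = lambda a, b: 0 if a == b else 1
labels1 = [None, "a", "b", "c", "d", "e", "f"]  # index 0 unused
labels2 = [None, "a", "b", "d", "c", "e", "f"]
M = {(1, 1), (2, 2), (4, 3), (5, 5), (6, 6)}
print(mapping_cost(M, labels1, labels2, unit))
```

With unit costs, this mapping leaves node 3 of the first tree and node 4 of the second untouched, so its cost is one deletion plus one insertion.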

Mappings can be composed. Let $M_1$ be a mapping from $T_1$ to $T_2$ and let $M_2$ be
a mapping from $T_2$ to $T_3$. Define
$$M_1 \circ M_2 = \{(i, j) \mid \exists k \text{ s.t. } (i, k) \in M_1 \text{ and } (k, j) \in M_2\}.$$
LEMMA 1. (1) $M_1 \circ M_2$ is a mapping.
(2) $\gamma(M_1 \circ M_2) \le \gamma(M_1) + \gamma(M_2)$.
Proof. Case (1) follows from the definition of mapping.
(2) Let $M_1$ be the mapping from $T_1$ to $T_2$. Let $M_2$ be the mapping from $T_2$ to
$T_3$. Let $M_1 \circ M_2$ be the composed mapping from $T_1$ to $T_3$ and let $I$ and $J$ be the
corresponding deletion and insertion sets. Three general situations occur: $(i, j) \in
M_1 \circ M_2$, $i \in I$, or $j \in J$. In each case this corresponds to an editing operation $\gamma(x \to y)$
where $x$ and $y$ may be nodes or may be $\Lambda$. In all such cases, the triangle inequality
on the distance metric $\gamma$ ensures that $\gamma(x \to y) \le \gamma(x \to z) + \gamma(z \to y)$.
The relation between a mapping and a sequence of edit operations is as follows.
LEMMA 2. Given $S$, a sequence $s_1, \ldots, s_k$ of edit operations from $T_1$ to $T_2$, there
exists a mapping $M$ from $T_1$ to $T_2$ such that $\gamma(M) \le \gamma(S)$. Conversely, for any mapping
$M$, there exists a sequence $S$ of editing operations such that $\gamma(S) = \gamma(M)$.
Proof. The first part can be proved by induction on $k$. The base case is $k = 1$. This
case holds, because any single editing operation preserves the ancestor and sibling
relationships in the mapping. In the general case, let $S_1$ be the sequence $s_1, \ldots, s_{k-1}$
of edit operations. There exists a mapping $M_1$ such that $\gamma(M_1) \le \gamma(S_1)$. Let $M_2$ be the
mapping for $s_k$. From Lemma 1, we have that
$$\gamma(M_1 \circ M_2) \le \gamma(M_1) + \gamma(M_2) \le \gamma(S).$$
To construct the sequence of editing operations, simply perform all the deletes
indicated by the mapping (i.e., all nodes in $T_1$ having no lines attached to them are
deleted), then all relabellings, then all inserts. □

Hence, $\delta(T_1, T_2) = \min \{\gamma(M) \mid M \text{ is a mapping from } T_1 \text{ to } T_2\}$.

Note that our definition of mapping is different from the definition in [T]. We believe that our definition
is more natural because it does not depend on any traversal ordering of the tree.

There has been previous work on this problem. Tai [T] gave the best published
algorithm for the problem. [Z83] is an improvement of [T], giving better sequential
time and space than [T]. Our new algorithm is much simpler than [T] and [Z83], gives
better time and space than both of them, and extends to related problems. The algorithm
of Lu [L] does not solve this problem for trees of more than two levels.

3. A simple new algorithm. This algorithm, unlike [T], [L], and [Z83], will, in its
intermediate steps, consider the distance between two ordered forests. At first sight
one may think that this will complicate the work, but it will in fact make matters easier.
We use a postorder numbering of the nodes in the trees. In the postordering,
$T_1[1..i]$ and $T_2[1..j]$ will generally be forests as in Fig. 5. (The edges are those in the
subgraph of the tree induced by the vertices.) Fortunately, the definition of mapping
for ordered forests is the same as for trees.

3.1. Notation. Let $T[i]$ be the $i$th node in the tree according to the left-to-right
postorder numbering; $l(i)$ is the number of the leftmost leaf descendant of the subtree
rooted at $T[i]$. When $T[i]$ is a leaf, $l(i) = i$. The parent of $T[i]$ is denoted $p(i)$.
We define $p^0(i) = i$, $p^1(i) = p(i)$, $p^2(i) = p(p^1(i))$, and so on. Let
$\mathrm{anc}(i) = \{p^k(i) \mid 0 \le k \le \mathrm{depth}(i)\}$.
$T[i..j]$ is the ordered subforest of $T$ induced by the nodes numbered $i$ to $j$ inclusive
(Fig. 5). If $i > j$, then $T[i..j] = \emptyset$. $T[1..i]$ will be referred to as forest($i$), when the
tree $T$ referred to is clear. $T[l(i)..i]$ will be referred to as tree($i$). Size($i$) is the number
of nodes in tree($i$).
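The postorder numbering and the function $l(\cdot)$ can be computed in a single traversal. The following Python sketch is one way to do it, under stated assumptions: trees are represented as `(label, [children])` tuples, the name `postorder_index` is hypothetical, and the example tree is an assumed shape (the drawing of Fig. 4 is not reproduced in the text). Arrays are 1-indexed to match the notation, with index 0 unused.

```python
def postorder_index(tree):
    """Return (labels, l): labels[i] is the label of T[i] and l[i] its leftmost leaf."""
    labels, l = [None], [None]

    def visit(node):
        label, children = node
        # Number all children first (left to right), remembering each child's l value.
        firsts = [visit(child) for child in children]
        labels.append(label)
        i = len(labels) - 1          # postorder number of this node
        # A leaf is its own leftmost leaf; otherwise inherit from the first child.
        l.append(firsts[0] if firsts else i)
        return l[i]

    visit(tree)
    return labels, l

# Assumed example tree: f(d(a, c(b)), e)
T1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
labels, l = postorder_index(T1)
print(labels[1:], l[1:])
```

With these arrays, tree($i$) is the index range `l[i]..i` and Size($i$) is `i - l[i] + 1`, so subtrees never need to be materialized. The traversal is recursive, so very deep trees would need an explicit stack instead.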

FIG. 5. (Panels: $T$; the forest $T[1..7]$.)

The distance between $T_1[i'..i]$ and $T_2[j'..j]$ is denoted $\mathrm{forestdist}(T_1[i'..i], T_2[j'..j])$
or $\mathrm{forestdist}(i'..i, j'..j)$ if the context is clear. We use a more abbreviated
notation for certain special cases. The distance between $T_1[1..i]$ and $T_2[1..j]$ is
sometimes denoted $\mathrm{forestdist}(i, j)$. The distance between the subtree rooted at $i$ and
the subtree rooted at $j$ is sometimes denoted $\mathrm{treedist}(i, j)$.

3.2. New algorithm. We first present three lemmas and then give our new algorithm.
Recall that $\mathrm{anc}(i) = \{p^k(i) \mid 0 \le k \le \mathrm{depth}(i)\}$.
LEMMA 3. (i) $\mathrm{forestdist}(\emptyset, \emptyset) = 0$.
(ii) $\mathrm{forestdist}(T_1[l(i_1)..i], \emptyset) = \mathrm{forestdist}(T_1[l(i_1)..i-1], \emptyset) + \gamma(T_1[i] \to \Lambda)$.
(iii) $\mathrm{forestdist}(\emptyset, T_2[l(j_1)..j]) = \mathrm{forestdist}(\emptyset, T_2[l(j_1)..j-1]) + \gamma(\Lambda \to T_2[j])$,
where $i_1 \in \mathrm{anc}(i)$ and $j_1 \in \mathrm{anc}(j)$.
Proof. Case (i) requires no edit operation. In (ii) and (iii), the distances correspond
to the cost of deleting or inserting the nodes in $T_1[l(i_1)..i]$ and $T_2[l(j_1)..j]$, respectively.
LEMMA 4. Let $i_1 \in \mathrm{anc}(i)$ and $j_1 \in \mathrm{anc}(j)$. Then

$$\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j) = \min \begin{cases}
\mathrm{forestdist}(l(i_1)..i-1,\; l(j_1)..j) + \gamma(T_1[i] \to \Lambda), \\
\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]), \\
\mathrm{forestdist}(l(i_1)..l(i)-1,\; l(j_1)..l(j)-1) \\
\quad + \mathrm{forestdist}(l(i)..i-1,\; l(j)..j-1) + \gamma(T_1[i] \to T_2[j]).
\end{cases}$$

Proof. We compute $\mathrm{forestdist}(l(i_1)..i, l(j_1)..j)$ for $l(i_1) \le i \le i_1$ and $l(j_1) \le j \le j_1$.
We are trying to find a minimum-cost map $M$ between forest($l(i_1)..i$) and
forest($l(j_1)..j$). The map can be extended to $T_1[i]$ and $T_2[j]$ in three ways.
(1) $T_1[i]$ is not touched by a line in $M$. Then $(i, \Lambda) \in M$. So,
$\mathrm{forestdist}(l(i_1)..i, l(j_1)..j) = \mathrm{forestdist}(l(i_1)..i-1, l(j_1)..j) + \gamma(T_1[i] \to \Lambda)$.
(2) $T_2[j]$ is not touched by a line in $M$. Then $(\Lambda, j) \in M$. So,
$\mathrm{forestdist}(l(i_1)..i, l(j_1)..j) = \mathrm{forestdist}(l(i_1)..i, l(j_1)..j-1) + \gamma(\Lambda \to T_2[j])$.
(3) $T_1[i]$ and $T_2[j]$ are both touched by lines in $M$. Then $(i, j) \in M$. Here is why.
Suppose $(i, k)$ and $(h, j)$ are in $M$. If $l(i_1) \le h \le l(i) - 1$, then $i$ is to the right of $h$ so
$k$ must be to the right of $j$ by the sibling condition on mappings. This is impossible
in forest($l(j_1)..j$). Similarly, if $i$ is a proper ancestor of $h$, then $k$ must be a proper
ancestor of $j$ by the ancestor condition on mappings. This too is impossible. So, $h = i$.
By symmetry, $k = j$ and $(i, j) \in M$.
Now, by the ancestor condition on mapping, any node in the subtree rooted at
$T_1[i]$ can only be touched by a node in the subtree rooted at $T_2[j]$. Hence,

$$\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j) = \mathrm{forestdist}(l(i_1)..l(i)-1,\; l(j_1)..l(j)-1)
+ \mathrm{forestdist}(l(i)..i-1,\; l(j)..j-1) + \gamma(T_1[i] \to T_2[j]).$$

Figure 6 shows the situation.

FIG. 6. Case (3) of Lemma 4. (Panels: $T_1[l(i_1)..l(i)-1]$ and $T_1[l(i)..i-1]$ below $T_1[i]$; $T_2[l(j_1)..l(j)-1]$ and $T_2[l(j)..j-1]$ below $T_2[j]$.)

Since these three cases express all the possible mappings yielding
$\mathrm{forestdist}(l(i_1)..i, l(j_1)..j)$, we take the minimum of these three costs. Thus,

$$\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j) = \min \begin{cases}
\mathrm{forestdist}(l(i_1)..i-1,\; l(j_1)..j) + \gamma(T_1[i] \to \Lambda), \\
\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]), \\
\mathrm{forestdist}(l(i_1)..l(i)-1,\; l(j_1)..l(j)-1) \\
\quad + \mathrm{forestdist}(l(i)..i-1,\; l(j)..j-1) + \gamma(T_1[i] \to T_2[j]).
\end{cases}$$

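LEMMA 5. Let $i_1 \in \mathrm{anc}(i)$ and $j_1 \in \mathrm{anc}(j)$. Then

(1) If $l(i) = l(i_1)$ and $l(j) = l(j_1)$,

$$\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j) = \min \begin{cases}
\mathrm{forestdist}(l(i_1)..i-1,\; l(j_1)..j) + \gamma(T_1[i] \to \Lambda), \\
\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]), \\
\mathrm{forestdist}(l(i_1)..i-1,\; l(j_1)..j-1) + \gamma(T_1[i] \to T_2[j]).
\end{cases}$$

(2) If $l(i) \ne l(i_1)$ or $l(j) \ne l(j_1)$ (i.e., otherwise),

$$\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j) = \min \begin{cases}
\mathrm{forestdist}(l(i_1)..i-1,\; l(j_1)..j) + \gamma(T_1[i] \to \Lambda), \\
\mathrm{forestdist}(l(i_1)..i,\; l(j_1)..j-1) + \gamma(\Lambda \to T_2[j]), \\
\mathrm{forestdist}(l(i_1)..l(i)-1,\; l(j_1)..l(j)-1) + \mathrm{treedist}(i, j).
\end{cases}$$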

Proof. By Lemma 4, if $l(i) = l(i_1)$ and $l(j) = l(j_1)$ then, since
$\mathrm{forestdist}(l(i_1)..l(i)-1, l(j_1)..l(j)-1) = \mathrm{forestdist}(\emptyset, \emptyset) = 0$, (1) follows immediately.
Because the distance is the cost of a minimal cost mapping, we know
$\mathrm{forestdist}(l(i_1)..i, l(j_1)..j) \le \mathrm{forestdist}(l(i_1)..l(i)-1, l(j_1)..l(j)-1) + \mathrm{treedist}(i, j)$ since the
latter formula represents a particular (and therefore possibly suboptimal) mapping of
forest($l(i_1)..i$) to forest($l(j_1)..j$). For the same reason,
$\mathrm{treedist}(i, j) \le \mathrm{forestdist}(l(i)..i-1, l(j)..j-1) + \gamma(T_1[i] \to T_2[j])$. Lemma 4 and these two inequalities
imply that the substituting of $\mathrm{treedist}(i, j)$ for
$\mathrm{forestdist}(l(i)..i-1, l(j)..j-1) + \gamma(T_1[i] \to T_2[j])$ in (2) is correct. (See Fig. 7.)
Lemma 5 has three important implications:
First, the formulas it yields suggest that we can use a dynamic programming style
algorithm to solve the tree distance problem.
Second, from (2) of Lemma 5 we observe that to compute $\mathrm{treedist}(i_1, j_1)$ we need
in advance almost all values of $\mathrm{treedist}(i, j)$ where $i_1$ is the root of a subtree containing
$i$ and $j_1$ is the root of a subtree containing $j$. This suggests a bottom-up procedure for
computing all subtree pairs.
Third, from (1) in Lemma 5 we can observe that when $i$ is in the path from $l(i_1)$
to $i_1$ and $j$ is in the path from $l(j_1)$ to $j_1$, we do not need to compute $\mathrm{treedist}(i, j)$
separately. These subtree distances can be obtained as a byproduct of computing
$\mathrm{treedist}(i_1, j_1)$.
These implications lead to the following definition and then our new algorithm.
Let us define the set LR_keyroots of tree $T$ as follows:
$$\mathrm{LR\_keyroots}(T) = \{k \mid \text{there exists no } k' > k \text{ such that } l(k) = l(k')\}.$$
That is, if $k$ is in LR_keyroots($T$) then either $k$ is the root of $T$ or $l(k) \ne l(p(k))$,
i.e., $k$ has a left sibling. Intuitively, this set will be the roots of all the subtrees of tree
$T$ that need separate computations.
Consider trees $T_1$ and $T_2$ in Fig. 4. From the above definition we can see that
LR_keyroots($T_1$) = {3, 5, 6} and LR_keyroots($T_2$) = {2, 5, 6}.
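The definition of LR_keyroots translates into one backward scan over the array of leftmost leaves: scanning $k$ from the largest number down, the first node seen for each value of $l(k)$ is a keyroot. The Python sketch below assumes a 1-indexed list `l` (index 0 unused); the function name `lr_keyroots` and the two input arrays are hypothetical, with the arrays chosen so that the results agree with the keyroot sets quoted above.

```python
def lr_keyroots(l):
    """Return the keyroots in increasing order, given 1-indexed leftmost-leaf array l."""
    seen = set()
    roots = []
    # Scan from the highest postorder number down: the first node found for a
    # given leftmost leaf has no larger-numbered node sharing that leaf.
    for k in range(len(l) - 1, 0, -1):
        if l[k] not in seen:
            seen.add(l[k])
            roots.append(k)
    return sorted(roots)

l1 = [None, 1, 2, 2, 1, 5, 1]  # assumed l() values for T1
l2 = [None, 1, 2, 1, 1, 5, 1]  # assumed l() values for T2
print(lr_keyroots(l1), lr_keyroots(l2))
```

Since every keyroot corresponds to a distinct leftmost leaf, the result can never contain more entries than the tree has leaves.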
FIG. 7. The two situations of Lemma 5. (Panels: the case $l(i) = l(i_1)$ and $l(j) = l(j_1)$, with $T_1[l(i)..i-1]$ below $T_1[i]$ and $T_2[l(j)..j-1]$ below $T_2[j]$; the case $l(i) \ne l(i_1)$ or $l(j) \ne l(j_1)$, with $T_1[l(i_1)..l(i)-1]$ beside tree($i$) and $T_2[l(j_1)..l(j)-1]$ beside tree($j$).)

It is easy to see that there is a linear time algorithm to compute the function $l(\cdot)$
and the set LR_keyroots. We can also assume that the results are in arrays l and LR_keyroots.
Furthermore, the elements of array LR_keyroots are in increasing order.
We are now ready to give our new simple algorithm.
Input: Tree T1 and T2.
Output: Tree_dist(i, j), where 1 ≤ i ≤ |T1| and 1 ≤ j ≤ |T2|.
Preprocessing
    (To compute l(), LR_keyroots1 and LR_keyroots2)
Main loop
    for i' := 1 to |LR_keyroots(T1)|
        for j' := 1 to |LR_keyroots(T2)|
            i = LR_keyroots1[i'];
            j = LR_keyroots2[j'];
            Compute treedist(i, j);
We use dynamic programming to compute treedist(i, j). The forestdist values computed
and used here are put in a temporary array that is freed once the corresponding treedist
is computed. The treedist values are put in the permanent treedist array.
The computation of treedist(i, j).

forestdist(∅, ∅) = 0;
for i1 := l(i) to i
    forestdist(T1[l(i)..i1], ∅) = forestdist(T1[l(i)..i1-1], ∅) + γ(T1[i1] → Λ)
for j1 := l(j) to j
    forestdist(∅, T2[l(j)..j1]) = forestdist(∅, T2[l(j)..j1-1]) + γ(Λ → T2[j1])
for i1 := l(i) to i
    for j1 := l(j) to j
        if l(i1) = l(i) and l(j1) = l(j) then
            forestdist(T1[l(i)..i1], T2[l(j)..j1]) = min {
                forestdist(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
                forestdist(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
                forestdist(T1[l(i)..i1-1], T2[l(j)..j1-1]) + γ(T1[i1] → T2[j1]) }
            treedist(i1, j1) = forestdist(T1[l(i)..i1], T2[l(j)..j1]) /* put in permanent array */
        else
            forestdist(T1[l(i)..i1], T2[l(j)..j1]) = min {
                forestdist(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
                forestdist(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
                forestdist(T1[l(i)..l(i1)-1], T2[l(j)..l(j1)-1]) + treedist(i1, j1) }
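The main loop and the treedist subroutine above can be combined into a short program. The following Python sketch is one possible rendering, not the authors' implementation: trees are assumed to be `(label, [children])` tuples, `None` stands for $\Lambda$, the default cost is an assumed unit cost, and the names `index_tree` and `tree_edit_distance` are hypothetical. The two example trees are assumed shapes consistent with the keyroot sets quoted for Fig. 4.

```python
def index_tree(tree):
    """Postorder labels, leftmost leaves l(), and sorted LR_keyroots (all 1-indexed)."""
    labels, l = [None], [None]

    def visit(node):
        label, children = node
        firsts = [visit(c) for c in children]
        labels.append(label)
        i = len(labels) - 1
        l.append(firsts[0] if firsts else i)
        return l[i]

    visit(tree)
    seen, keyroots = set(), []
    for k in range(len(l) - 1, 0, -1):
        if l[k] not in seen:
            seen.add(l[k])
            keyroots.append(k)
    return labels, l, sorted(keyroots)

def tree_edit_distance(t1, t2, cost=lambda a, b: 0 if a == b else 1):
    lab1, l1, kr1 = index_tree(t1)
    lab2, l2, kr2 = index_tree(t2)
    n1, n2 = len(lab1) - 1, len(lab2) - 1
    treedist = [[0] * (n2 + 1) for _ in range(n1 + 1)]  # permanent array
    for i in kr1:
        for j in kr2:
            oi, oj = l1[i] - 1, l2[j] - 1
            rows, cols = i - oi, j - oj
            # Temporary array: fd[a][b] = forestdist(T1[l(i)..oi+a], T2[l(j)..oj+b]).
            fd = [[0] * (cols + 1) for _ in range(rows + 1)]
            for a in range(1, rows + 1):
                fd[a][0] = fd[a - 1][0] + cost(lab1[oi + a], None)
            for b in range(1, cols + 1):
                fd[0][b] = fd[0][b - 1] + cost(None, lab2[oj + b])
            for a in range(1, rows + 1):
                i1 = oi + a
                for b in range(1, cols + 1):
                    j1 = oj + b
                    delete = fd[a - 1][b] + cost(lab1[i1], None)
                    insert = fd[a][b - 1] + cost(None, lab2[j1])
                    if l1[i1] == l1[i] and l2[j1] == l2[j]:
                        # Both prefixes are trees: ordinary three-way recurrence.
                        fd[a][b] = min(delete, insert,
                                       fd[a - 1][b - 1] + cost(lab1[i1], lab2[j1]))
                        treedist[i1][j1] = fd[a][b]
                    else:
                        # Reuse the stored distance between tree(i1) and tree(j1).
                        fd[a][b] = min(delete, insert,
                                       fd[l1[i1] - 1 - oi][l2[j1] - 1 - oj]
                                       + treedist[i1][j1])
    return treedist[n1][n2]

T1 = ("f", [("d", [("a", []), ("c", [("b", [])])]), ("e", [])])
T2 = ("f", [("c", [("d", [("a", []), ("b", [])])]), ("e", [])])
print(tree_edit_distance(T1, T2))
```

The sketch allocates a fresh temporary array per keyroot pair rather than reusing one buffer; that keeps the index arithmetic visible at the cost of extra allocation.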
THEOREM 1. The basic algorithm is correct.
Proof. We will prove that for any pair $(i, j)$ such that $i \in \mathrm{LR\_keyroots}(T_1)$ and
$j \in \mathrm{LR\_keyroots}(T_2)$, the following invariants hold.
(1) Immediately before the computation of $\mathrm{treedist}(i, j)$, all distances
$\mathrm{treedist}(i_1, j_1)$, where $l(i) \le i_1 \le i$ and $l(j) \le j_1 \le j$ and either $l(i_1) \ne l(i)$ or $l(j_1) \ne l(j)$,
are available. In other words, $\mathrm{treedist}(i_1, j_1)$ is available if $i_1$ is in the subtree of tree($i$)
but not in the path from $l(i)$ to $i$, or $j_1$ is in the subtree of tree($j$) but not in the path
from $l(j)$ to $j$.
(2) Immediately after the computation of $\mathrm{treedist}(i, j)$, all distances $\mathrm{treedist}(i_1, j_1)$,
where $l(i) \le i_1 \le i$ and $l(j) \le j_1 \le j$, are available.
We first show that if (1) is true then (2) is true. From Lemma 5 we know that all
required subtree-to-subtree distances are available. (We need all $\mathrm{treedist}(i_1, j_1)$ such
that $l(i) \le i_1 \le i$ and $l(j) \le j_1 \le j$ and either $l(i_1) \ne l(i)$ or $l(j_1) \ne l(j)$, and by (1) all these
distances are available.) We compute each $\mathrm{treedist}(i_1, j_1)$, where $l(i_1) = l(i)$ and
$l(j_1) = l(j)$, in the if part and add it to the permanent treedist array. So, (2) holds.
Let us show that (1) always holds. Suppose $l(i_1) \ne l(i)$. Let $i_1'$ be the lowest ancestor
of $i_1$ such that $i_1' \in \mathrm{LR\_keyroots}(T_1)$. Since $l(i_1') = l(i_1) \ne l(i)$, $i_1' \ne i$. Since
$i \in \mathrm{LR\_keyroots}(T_1)$, $i_1' \le i$. So $i_1' < i$. Let $j_1'$ be the lowest ancestor of $j_1$ such that
$j_1' \in \mathrm{LR\_keyroots}(T_2)$. Since $j \in \mathrm{LR\_keyroots}(T_2)$, $j_1' \le j$. Hence $i_1' + j_1' < i + j$. This means that
$\mathrm{treedist}(i_1', j_1')$ will have already been computed before $\mathrm{treedist}(i, j)$ because in the main
loop LR_keyroots1 and LR_keyroots2 are in increasing order. Hence $\mathrm{treedist}(i_1, j_1)$ is
available after the computation of $\mathrm{treedist}(i_1', j_1')$. □

FIG. 8. The result of computation for $T_1$ and $T_2$ in Fig. 4. (Panels: the temporary arrays for tree_dist(3, 2), tree_dist(3, 5), tree_dist(3, 6), tree_dist(5, 2), tree_dist(5, 5), tree_dist(5, 6), tree_dist(6, 2), tree_dist(6, 5), tree_dist(6, 6); the final tree_dist array.)

As an example, consider trees $T_1$ and $T_2$ in Fig. 4. For simplicity, assume that all
insert, delete, and change (of labels) operations will cost one. Figure 8 shows the result
of applying our new algorithm to $T_1$ and $T_2$. The matrix below tree_dist(i, j) is the
temporary array produced by the computation of tree_dist(i, j). (Out of 36
possible tree_dist arrays, only nine, those corresponding to pairs of keyroots, are
explicitly computed.) The matrix below tree_dist is the final result. The value in the
lower right corner (2) is the distance between $T_1$ and $T_2$.
4. Some aspects of our algorithm.
4.1. Complexity.
LEMMA 6. $|\mathrm{LR\_keyroots}(T)| \le |\mathrm{leaves}(T)|$.
Proof. We will prove that for any distinct $i, j \in \mathrm{LR\_keyroots}(T)$, $l(i) \ne l(j)$.
Let $i, j \in \mathrm{LR\_keyroots}(T)$ and $i < j$. If $l(i) = l(j)$, from $i < j$ we know that $i$ is in
the path from $l(j)$ to $j$. By the definition of $l(j)$, $i$ has no left sibling. This contradicts
the assertion that $i \in \mathrm{LR\_keyroots}(T)$. Hence each leaf is the leftmost descendant of
at most one member of LR_keyroots($T$). So, $|\mathrm{LR\_keyroots}(T)| \le |\mathrm{leaves}(T)|$. □
Because not all subtree-to-subtree distances need be computed, the number of
such calculations a node participates in is less than its depth. Instead, it is the node's
collapsed depth:
$$\mathrm{LR\_colldepth}(i) = |\mathrm{anc}(i) \cap \mathrm{LR\_keyroots}(T)|.$$
We define the collapsed depth of tree $T$ as follows:
$$\mathrm{LR\_colldepth}(T) = \max_i \mathrm{LR\_colldepth}(i).$$

By the definition and Lemma 6 we can see that $\mathrm{LR\_colldepth}(i) \le
\min(\mathrm{depth}(T), \mathrm{leaves}(T))$ for $1 \le i \le |T|$. Hence $\mathrm{LR\_colldepth}(T) \le
\min(\mathrm{depth}(T), \mathrm{leaves}(T))$.
LEMMA 7.
$$\sum_{i=1}^{|\mathrm{LR\_keyroots}(T)|} \mathrm{Size}(i) = \sum_{j=1}^{N} \mathrm{LR\_colldepth}(j).$$
Proof. Consider when node $j$ is counted in the first summation: in the subtrees
corresponding to each of its ancestors that is in LR_keyroots($T$). By the definition of
LR_colldepth( ), $j$ is counted LR_colldepth($j$) times. □
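The collapsed depth and the counting identity of Lemma 7 can be checked numerically. The sketch below relies on the fact that node $j$ lies in tree($k$) exactly when $l(k) \le j \le k$; the function name `collapsed_depths` and the input arrays are hypothetical, with the arrays taken as assumed values for a six-node tree whose keyroots are {3, 5, 6}.

```python
def collapsed_depths(l, keyroots):
    """LR_colldepth(j) for j = 1..N: the number of keyroots k with j inside tree(k)."""
    n = len(l) - 1
    return [None] + [sum(1 for k in keyroots if l[k] <= j <= k)
                     for j in range(1, n + 1)]

l = [None, 1, 2, 2, 1, 5, 1]   # assumed leftmost-leaf array, index 0 unused
keyroots = [3, 5, 6]           # keyroots for that array

depths = collapsed_depths(l, keyroots)
sizes = [k - l[k] + 1 for k in keyroots]   # Size(k) for each keyroot k

# Lemma 7: both sides count each node once per keyroot ancestor.
print(sum(sizes), sum(depths[1:]))
```

This quadratic-time helper is meant only for checking small examples; it is not part of the distance algorithm itself.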
THEOREM 2. The time complexity is
$O(|T_1| \times |T_2| \times \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \times \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$.
The space complexity is $O(|T_1| \times |T_2|)$.
Proof. Let us consider the space complexity first. We use a permanent array for
treedist and a temporary array for forestdist. Each of these two arrays requires space
$O(|T_1| \times |T_2|)$.
Consider the time complexity of our algorithm. The preprocessing takes linear
time. The subtree distance dynamic programming algorithm takes $\mathrm{Size}(i) \times \mathrm{Size}(j)$ for
the subtree rooted at $T_1[i]$ and the subtree rooted at $T_2[j]$. We have a main loop that
calls this subroutine several times. So the time is:
$$\sum_{i=1}^{|\mathrm{LR\_keyroots}(T_1)|} \sum_{j=1}^{|\mathrm{LR\_keyroots}(T_2)|} \mathrm{Size}(i) \times \mathrm{Size}(j)
= \sum_{i=1}^{|\mathrm{LR\_keyroots}(T_1)|} \mathrm{Size}(i) \times \sum_{j=1}^{|\mathrm{LR\_keyroots}(T_2)|} \mathrm{Size}(j).$$

By Lemma 7, the above equals

$$\sum_{i=1}^{N_1} \mathrm{LR\_colldepth}(i) \times \sum_{j=1}^{N_2} \mathrm{LR\_colldepth}(j).$$

This is at most

$$|T_1| \times |T_2| \times \mathrm{LR\_colldepth}(T_1) \times \mathrm{LR\_colldepth}(T_2).$$
By the definition of LR_colldepth, we have that the time complexity is
$O(|T_1| \times |T_2| \times \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \times \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$.
These time and space complexities are an improvement over the
$O(|T_1| \times |T_2| \times \mathrm{depth}(T_1)^2 \times \mathrm{depth}(T_2)^2)$ time and space complexity of [T].
Note. If we use a right-to-left postorder numbering for tree nodes and define
similar functions $r(i)$, RL_keyroots($T$) and RL_colldepth($i$), we can have the same
result as above. The complexity will be
$\sum_{i=1}^{N_1} \mathrm{RL\_colldepth}(i) \times \sum_{j=1}^{N_2} \mathrm{RL\_colldepth}(j)$.
Clearly, using the left-to-right or right-to-left postorder numberings gives the same
worst-case time complexity. However, in practice it may be beneficial to choose the
ordering that gives the lower of the following two products:
$\sum_{i=1}^{N_1} \mathrm{LR\_colldepth}(i) \times \sum_{j=1}^{N_2} \mathrm{LR\_colldepth}(j)$ and
$\sum_{i=1}^{N_1} \mathrm{RL\_colldepth}(i) \times \sum_{j=1}^{N_2} \mathrm{RL\_colldepth}(j)$.

4.2. Mapping. It is natural to ask for a mapping that yields the distance computed.
Also given two trees, we may ask, what is the largest common substructure of these
two trees? This is analogous to the longest common substring problem for strings. We
can find the mapping in the same time and space complexity as finding the distance,
although we do not give the details here. The mapping is produced by our toolkit.
4.3. Parallel implementation. A straightforward transformation of our algorithm
to a parallel one yields an algorithm with time complexity $O(N_1 + N_2)$ whereas [T]
and [Z83] have time complexity $O((N_1 + N_2) \times (\mathrm{depth}(T_1) + \mathrm{depth}(T_2)))$. Our algorithm
uses $O(\min(|T_1|, |T_2|) \times \mathrm{leaves}(T_1) \times \mathrm{leaves}(T_2))$ processors.

Actually, by controlling the starting point of each treedist computation more carefully, we can reduce
the processor bound to $O(\min(|T_1|, |T_2|) \times \min(\mathrm{depth}(T_1), \mathrm{leaves}(T_1)) \times \min(\mathrm{depth}(T_2), \mathrm{leaves}(T_2)))$. The
algorithm is more complicated however.

The algorithm computes in "waves" for all subtree pairs tree($i$) and tree($j$), where
$i \in \mathrm{LR\_keyroots}(T_1)$ and $j \in \mathrm{LR\_keyroots}(T_2)$, simultaneously. We start at wave 0. At
wave $k$, for each such subtree pair tree($i$) and tree($j$), compute
$\mathrm{forestdist}(l(i)..i_1, l(j)..j_1)$, where $(i_1 - l(i)) + (j_1 - l(j)) = k$.
We now present the parallel algorithm in detail. (When the PARBEGIN-PAREND
construct surrounds one or more for loops, it means that every setting of the iterators
in the enclosed for loops can be executed in parallel. The semantics are those of the
sequential program ignoring this construct.)
In the algorithm dist[i, j] is the array for the computation of treedist(i, j). Therefore
dist[i, j][p, q] is the distance $\mathrm{forestdist}(l(i)..p, l(j)..q)$ and is the (p, q)th member of
the array computing treedist(i, j).
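The wave schedule can be illustrated on a single temporary array. The Python sketch below enumerates the cells of a `rows` by `cols` table grouped by anti-diagonal; the generator name `waves` and the zero-based offsets `(a, b)`, standing for $(i_1 - l(i), j_1 - l(j))$, are assumptions made for illustration. The code only lists the schedule sequentially and does not itself run anything in parallel.

```python
def waves(rows, cols):
    """Yield, for k = 0, 1, ..., the list of cells (a, b) with a + b == k."""
    for k in range(rows + cols - 1):
        # Keep a within [0, rows) and b = k - a within [0, cols).
        yield [(a, k - a) for a in range(max(0, k - cols + 1), min(rows, k + 1))]

for k, cells in enumerate(waves(2, 3)):
    print(k, cells)
```

Every cell in wave $k$ depends only on its left, upper, and upper-left neighbors, which lie in waves $k-1$ and $k-2$, so all cells of one wave could be evaluated at the same time. The number of waves is `rows + cols - 1`, which mirrors the $N + M - 2$ bound of the last loop in the parallel algorithm.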

ALGORITHM PARALLEL DISTANCE.

begin
    PARBEGIN
        for i' := 1 to |LR_keyroots(T1)|
            for j' := 1 to |LR_keyroots(T2)|
                i := LR_keyroots1[i']
                j := LR_keyroots2[j']
                dist[i, j][l(i) - 1, l(j) - 1] := 0 /* initializes temporary array for each treedist */
    PAREND
    for k := 0 to N - 1
        PARBEGIN
            for i' := 1 to |LR_keyroots(T1)|
                for j' := 1 to |LR_keyroots(T2)|
                    i := LR_keyroots1[i']
                    j := LR_keyroots2[j']
                    dist[i, j][l(i) + k, l(j) - 1]
                        := dist[i, j][l(i) + k - 1, l(j) - 1] + γ(T1[l(i) + k] → Λ)
        PAREND
    for k := 0 to M - 1
        PARBEGIN
            for i' := 1 to |LR_keyroots(T1)|
                for j' := 1 to |LR_keyroots(T2)|
                    i := LR_keyroots1[i']
                    j := LR_keyroots2[j']
                    dist[i, j][l(i) - 1, l(j) + k]
                        := dist[i, j][l(i) - 1, l(j) + k - 1] + γ(Λ → T2[l(j) + k])
        PAREND
    for k := 0 to N + M - 2
        PARBEGIN
            for i' := 1 to |LR_keyroots(T1)|
                for j' := 1 to |LR_keyroots(T2)|
                    i := LR_keyroots1[i']
                    j := LR_keyroots2[j']
                    for i1, j1 satisfying (i1 - l(i)) + (j1 - l(j)) = k and l(i) ≤ i1 ≤ i, l(j) ≤ j1 ≤ j
                        if l(i) = l(i1) and l(j) = l(j1) then
                            dist[i, j][i1, j1] := min {
                                dist[i, j][i1 - 1, j1] + γ(T1[i1] → Λ),
                                dist[i, j][i1, j1 - 1] + γ(Λ → T2[j1]),
                                dist[i, j][i1 - 1, j1 - 1] + γ(T1[i1] → T2[j1])
                            }
                            treedist[i1, j1] := dist[i, j][i1, j1]
                        else
                            dist[i, j][i1, j1] := min {
                                dist[i, j][i1 - 1, j1] + γ(T1[i1] → Λ),
                                dist[i, j][i1, j1 - 1] + γ(Λ → T2[j1]),
                                dist[i, j][l(i1) - 1, l(j1) - 1] + treedist[i1, j1]
                            }
        PAREND
end
It is easy to see that in the above algorithm all the terms, except treedist[$i_1, j_1$], are
available whenever needed. We now show that treedist[$i_1, j_1$] is available whenever
we use it. Our argument is similar to the one we used in the sequential case.
Note that we compute all terms such that $(i_1 - l(i)) + (j_1 - l(j)) = k$ together. During
that computation, all terms such that $(i_1 - l(i)) + (j_1 - l(j)) < k$ are available. So, when
we need item treedist[$i_1, j_1$], either $l(i_1) > l(i)$ or $l(j_1) > l(j)$. Let $i_2$ be the lowest ancestor
of $i_1$ such that $i_2 \in \mathrm{LR\_keyroots}(T_1)$. Let $j_2$ be the lowest ancestor of $j_1$ such that
$j_2 \in \mathrm{LR\_keyroots}(T_2)$. Since $l(i_1) = l(i_2)$ and $l(j_1) = l(j_2)$ we know either $l(i_2) > l(i)$ or
$l(j_2) > l(j)$. Therefore, $(i_1 - l(i_2)) + (j_1 - l(j_2)) < (i_1 - l(i)) + (j_1 - l(j)) = k$. Hence
treedist[$i_1, j_1$] was already computed in the computation of dist[$i_2, j_2$][$i_1, j_1$] and put
into the permanent tree distance array. This settles correctness.
THEOREM 3. The Parallel Distance Algorithm has time complexity $O(|T_1| + |T_2|)$.
Proof. By simple analysis of the for loop. □
4.4. From trees to strings. Strings are an important special case of trees. This
algorithm is a generalization of the natural dynamic programming algorithms on strings
in two senses: time complexity and algorithmic style.
First, we consider the time complexity. Since a string has only one leaf, applying
our algorithms to strings yields a time complexity of $O(|T_1| \times |T_2|)$. This is the same
as that of the best available algorithm for the general problem of string distance.
Second, we consider the algorithm itself. For a string $S$, LR_keyroots($S$) = {root}.
So the main loop will only have one iteration. In the dynamic programming subroutine,
since $l(i) = 1$ for every node $i$, we will never come to the case $l(i_1) \ne l(i)$ or $l(j_1) \ne l(j)$. So if we change
$i$ to $|T_1|$, $j$ to $|T_2|$, $l(i)$ to one, $l(j)$ to one, delete the main loop and delete the case
where $l(i_1) \ne l(i)$ or $l(j_1) \ne l(j)$, we will have exactly the string distance algorithm.
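For comparison, here is a sketch of the string special case obtained by the simplifications just described: one table, no main loop, and only the three-way recurrence. The function name `string_distance` is hypothetical, `None` stands for $\Lambda$, and the default unit cost is an assumption.

```python
def string_distance(s1, s2, cost=lambda a, b: 0 if a == b else 1):
    """Edit distance between two strings with a pluggable cost function."""
    n, m = len(s1), len(s2)
    fd = [[0] * (m + 1) for _ in range(n + 1)]
    # Distance to the empty string: delete (or insert) every symbol.
    for i in range(1, n + 1):
        fd[i][0] = fd[i - 1][0] + cost(s1[i - 1], None)
    for j in range(1, m + 1):
        fd[0][j] = fd[0][j - 1] + cost(None, s2[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            fd[i][j] = min(fd[i - 1][j] + cost(s1[i - 1], None),           # delete
                           fd[i][j - 1] + cost(None, s2[j - 1]),           # insert
                           fd[i - 1][j - 1] + cost(s1[i - 1], s2[j - 1]))  # change
    return fd[n][m]

print(string_distance("kitten", "sitting"))
```

The table `fd` plays the role of the forestdist array; because every prefix of a string is itself a "tree" in the chain representation, the else branch of the tree algorithm is never needed.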
5. The general technique applied to approximate tree matching. Many problems in
strings can be solved with dynamic programming. Similarly, our algorithm not only
applies to tree distance but also provides a way to do dynamic programming for a
variety of tree problems with the same time complexity. In this section we show how
to apply this general paradigm to approximate tree matching.
5.1. Algorithm template. Here is the general form of the algorithm (assuming a
left-to-right postorder traversal):
preprocessing
main loop
    for i' := 1 to |LR_keyroots(T1)|
        for j' := 1 to |LR_keyroots(T2)|
            i := LR_keyroots1[i'];
            j := LR_keyroots2[j'];
            compute Tree_D(i, j);

subroutine for Tree_D(i, j)
    empty_initialization
    for i1 := l(i) to i
        left_initialization
    for j1 := l(j) to j
        right_initialization
    for i1 := l(i) to i
        for j1 := l(j) to j
            if l(i1) = l(i) and l(j1) = l(j) then
                general_if_computation
                Tree_D(i1, j1) := Forest_D(T1[l(i)..i1], T2[l(j)..j1]);
            else
                general_else_computation
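To make the template concrete, the following Python sketch fills it in with the unit-cost tree distance recurrences. The nested `(label, children)` tuple representation, the helper names, and the unit costs are assumptions chosen for illustration; this is not the authors' toolkit.

```python
def postorder(tree):
    """Return (labels, l), 1-based postorder arrays; l[k] is the leftmost leaf of node k."""
    labels, l = [None], [0]

    def visit(node):
        label, children = node
        first = None
        for child in children:
            leaf = visit(child)
            if first is None:
                first = leaf
        labels.append(label)
        l.append(first if first is not None else len(labels) - 1)
        return l[-1]

    visit(tree)
    return labels, l


def keyroots(l):
    """LR_keyroots: for each leftmost leaf, the highest node that has it."""
    last = {}
    for k in range(1, len(l)):
        last[l[k]] = k
    return sorted(last.values())


def tree_distance(t1, t2):
    lab1, l1 = postorder(t1)
    lab2, l2 = postorder(t2)
    n1, n2 = len(lab1) - 1, len(lab2) - 1
    treedist = [[0] * (n2 + 1) for _ in range(n1 + 1)]    # permanent array
    for i in keyroots(l1):                                 # main loop
        for j in keyroots(l2):
            rows, cols = i - l1[i] + 2, j - l2[j] + 2
            # fd[a][b] = Forest_D(T1[l(i)..l(i)+a-1], T2[l(j)..l(j)+b-1])
            fd = [[0] * cols for _ in range(rows)]         # empty_initialization
            for a in range(1, rows):
                fd[a][0] = fd[a - 1][0] + 1                # left_initialization
            for b in range(1, cols):
                fd[0][b] = fd[0][b - 1] + 1                # right_initialization
            for a in range(1, rows):
                i1 = l1[i] + a - 1
                for b in range(1, cols):
                    j1 = l2[j] + b - 1
                    best = min(fd[a - 1][b] + 1, fd[a][b - 1] + 1)
                    if l1[i1] == l1[i] and l2[j1] == l2[j]:
                        change = 0 if lab1[i1] == lab2[j1] else 1
                        fd[a][b] = min(best, fd[a - 1][b - 1] + change)
                        treedist[i1][j1] = fd[a][b]        # store Tree_D(i1, j1)
                    else:
                        fd[a][b] = min(best, fd[l1[i1] - l1[i]][l2[j1] - l2[j]]
                                       + treedist[i1][j1])
    return treedist[n1][n2]
```

Processing the keyroots in increasing order guarantees that every `treedist[i1][j1]` read in the else branch was stored by an earlier call.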
5.2. Approximate tree matching. We first consider approximate string matching
[S80], [U83], [U85], [LV]. We will then give two natural generalizations of approximate
string matching to approximate tree matching. This will also be a generalization of the
exact tree matching algorithm as found in Hoffmann and O'Donnell [HO].
The approximate string matching problem is the following. Given two strings
STEXT and SPAT, the problem is to compute, for each $i$,
$SD[i, SPAT] = \min_j \{D(STEXT[j..i], SPAT)\}$, where $1 \le j \le i+1$ and $D$ is the string
distance metric. In other words, the problem is to compute, for each $i$, the minimum
number of editing operations between the "pattern" string SPAT[1..|SPAT|] and the
"text" string STEXT[1..i] where any prefix can be removed from STEXT[1..i].
(Intuitively, the algorithm finds the "occurrence" in STEXT that most closely matches SPAT.)
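The string version of the problem can be sketched in a few lines of Python. This is a minimal illustration assuming unit costs; the function name is hypothetical. The only change from the ordinary string distance is that the empty pattern prefix costs 0 at every text position, which is what makes removing a text prefix free.

```python
def approx_match(text: str, pat: str) -> list:
    """Return sd, where sd[i] = min over j of D(text[j..i], pat), 1-based, 1 <= j <= i+1."""
    m = len(pat)
    prev = list(range(m + 1))      # empty text: insert the first p pattern symbols
    sd = [prev[m]]                 # sd[0] is the value for the empty text prefix
    for ch in text:
        cur = [0] * (m + 1)        # cur[0] = 0: the removed prefix costs nothing
        for p in range(1, m + 1):
            change = 0 if ch == pat[p - 1] else 1
            cur[p] = min(prev[p] + 1,            # delete ch from the text
                         cur[p - 1] + 1,         # insert pat[p]
                         prev[p - 1] + change)   # change or keep
        sd.append(cur[m])
        prev = cur
    return sd
```

For example, `approx_match("xxabcxx", "abc")` reaches 0 at position 5, where an exact occurrence of the pattern ends.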
To extend this problem to trees, we must generalize the notion of removing a
prefix. For us, a prefix will mean a collection of subtrees.
We first define two operations at a node.
Removing at node T[i] means removing the subtree rooted at T[i]. In other words,
delete T[l(i)..i]. (See Fig. 9.)
Pruning at node T[i] means removing all the descendants of T[i]. In other words,
delete T[l(i)..i-1]. (Thus, a pruning never eliminates the entire tree.) (See Fig. 10.)

[Figure not reproduced: tree T and the resulting tree T'.]
FIG. 9. Remove subtree rooted at T[8].
[Figure not reproduced: tree T and the resulting tree T'.]
FIG. 10. Pruning at T[8]: remove all its proper descendants.

Assume an ordering for tree T. Define a subtree set S(T) as follows: S(T) is a
set of numbers satisfying
(1) $i \in S(T)$ implies that $1 \le i \le |T|$;
(2) $i, j \in S(T)$ implies that neither is an ancestor of the other.
Define R(T, S(T)) to be the tree T with removing at all nodes in S(T).
Define P(T, S(T)) to be the tree T with pruning at all nodes in S(T).
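The definitions above translate directly into operations on postorder node numbers. The Python sketch below is illustrative only: it assumes the tree is given by a 1-based array `l`, where `l[k]` is the leftmost leaf descendant of node `k`, and the function names are hypothetical. It relies on the fact that, in postorder numbering, the subtree rooted at `i` is exactly the range `l[i]..i`.

```python
def is_subtree_set(l, s):
    """Check conditions (1) and (2) for a candidate subtree set s."""
    n = len(l) - 1
    if not all(1 <= i <= n for i in s):
        return False
    # i is a proper ancestor of k exactly when l[i] <= k < i
    return all(not (l[i] <= k < i) for i in s for k in s)


def removing(l, s):
    """Surviving nodes of R(T, S): delete T[l(i)..i] for every i in s."""
    gone = {k for i in s for k in range(l[i], i + 1)}
    return [k for k in range(1, len(l)) if k not in gone]


def pruning(l, s):
    """Surviving nodes of P(T, S): delete T[l(i)..i-1] for every i in s."""
    gone = {k for i in s for k in range(l[i], i)}
    return [k for k in range(1, len(l)) if k not in gone]
```

With the six-node tree f(d(a, c(b)), e), whose postorder is a, b, c, d, e, f and whose array is `l = [0, 1, 2, 2, 1, 5, 1]`, removing at node 3 leaves nodes 1, 4, 5, 6, while pruning at node 4 leaves nodes 4, 5, 6.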
Now we can give the definition of approximate tree matching. Given trees T and
PAT, for each $i$, we want to calculate

$$DR(T[l(i)..i], PAT) = \min_{S} \{ treedist(R(T[l(i)..i], S(T[l(i)..i])), PAT) \},$$

$$DP(T[l(i)..i], PAT) = \min_{S} \{ treedist(P(T[l(i)..i], S(T[l(i)..i])), PAT) \}.$$

The minimum here is over all possible subtree sets $S(T[l(i)..i])$. We consider
each generalization in turn.
5.2.1. Remove any number of subtrees from TEXT tree. The problem is as follows.
Given trees $T_1$ and $T_2$, we want to know what is the minimum distance between
$T_1[l(i)..i]$ and $T_2$ when zero or more subtrees can be removed from $T_1[l(i)..i]$.
Let $F\_DR(T_1[l(i)..i_1], T_2[l(j)..j_1])$ denote the minimum distance between forest
$T_1[l(i)..i_1]$ and $T_2[l(j)..j_1]$ with zero or more subtrees removed from $T_1[l(i)..i_1]$. Let
$T\_DR(i, j)$ denote the minimum distance between tree $T_1[l(i)..i]$ and $T_2[l(j)..j]$ with
zero or more subtrees removed from $T_1[l(i)..i]$. We write the algorithm in the form
suggested by the algorithm template.
ALGORITHM SUBTREE REMOVAL.

empty_initialization:
    F_DR(∅, ∅) = 0
left_initialization:
    F_DR(T1[l(i)..i1], ∅) = 0
right_initialization:
    F_DR(∅, T2[l(j)..j1]) = F_DR(∅, T2[l(j)..j1-1]) + γ(Λ → T2[j1])
general_if_computation
    /* applies if l(i1) = l(i) and l(j1) = l(j) */
    F_DR(T1[l(i)..i1], T2[l(j)..j1]) = min {
        F_DR(∅, T2[l(j)..j1]),
        F_DR(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DR(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DR(T1[l(i)..i1-1], T2[l(j)..j1-1]) + γ(T1[i1] → T2[j1]) }
    /* put the derived treedist in the permanent array, as specified by template */
general_else_computation
    F_DR(T1[l(i)..i1], T2[l(j)..j1]) = min {
        F_DR(T1[l(i)..l(i1)-1], T2[l(j)..j1]),
        F_DR(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DR(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DR(T1[l(i)..l(i1)-1], T2[l(j)..l(j1)-1]) + T_DR(i1, j1) }
LEMMA 8. Algorithm Subtree Removal is correct.
Proof. First we show that the initialization is correct. The empty_initialization and
the right_initialization are the same as in the tree distance algorithm. The
left_initialization $F\_DR(T_1[l(i)..i_1], \emptyset) = 0$ is correct, because we can remove all of
$T_1[l(i)..i_1]$.
For the general term $F\_DR(T_1[l(i)..i_1], T_2[l(j)..j_1])$, we ask first whether or not
the subtree $T_1[l(i_1)..i_1]$ is removed. If it is removed, then the distance should be
$F\_DR(T_1[l(i)..l(i_1)-1], T_2[l(j)..j_1])$. Otherwise, consider the mapping between
$T_1[l(i)..i_1]$ and $T_2[l(j)..j_1]$ after we perform an optimal removal of subtrees of
$T_1[l(i)..i_1]$. Now we have the same three cases as in Lemma 4. Hence the general
expression should be the minimum of these four terms:

$$F\_DR(T_1[l(i)..i_1], T_2[l(j)..j_1]) = \min \begin{cases}
F\_DR(T_1[l(i)..l(i_1)-1], T_2[l(j)..j_1]), \\
F\_DR(T_1[l(i)..i_1-1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda), \\
F\_DR(T_1[l(i)..i_1], T_2[l(j)..j_1-1]) + \gamma(\Lambda \to T_2[j_1]), \\
F\_DR(T_1[l(i)..l(i_1)-1], T_2[l(j)..l(j_1)-1]) + F\_DR(T_1[l(i_1)..i_1-1], T_2[l(j_1)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1])
\end{cases}$$

As in Lemma 5, this specializes to the general_if_computation and the
general_else_computation given in the algorithm.
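The recurrences of Algorithm Subtree Removal can be sketched in Python as follows. This is an illustrative reading under stated assumptions: unit costs, trees supplied as 1-based postorder arrays (`lab[k]` is the label of node `k`, `l[k]` its leftmost leaf descendant), and hypothetical function names. The index `p` is the size of the forest $T_1[l(i)..l(i_1)-1]$, so `f[p][b]` is the "subtree rooted at $i_1$ is removed" term in both branches.

```python
def keyroots(l):
    """LR_keyroots: for each leftmost leaf, the highest node that has it."""
    last = {}
    for k in range(1, len(l)):
        last[l[k]] = k
    return sorted(last.values())


def subtree_removal(lab1, l1, lab2, l2):
    """Return dr with dr[i] = DR(T1[l(i)..i], T2); dr[0] is unused."""
    n1, n2 = len(lab1) - 1, len(lab2) - 1
    t_dr = [[0] * (n2 + 1) for _ in range(n1 + 1)]    # permanent T_DR array
    for i in keyroots(l1):
        for j in keyroots(l2):
            rows, cols = i - l1[i] + 2, j - l2[j] + 2
            # f[a][b] = F_DR(T1[l(i)..l(i)+a-1], T2[l(j)..l(j)+b-1])
            f = [[0] * cols for _ in range(rows)]     # left_initialization: all 0
            for b in range(1, cols):
                f[0][b] = f[0][b - 1] + 1             # right_initialization
            for a in range(1, rows):
                i1 = l1[i] + a - 1
                p = l1[i1] - l1[i]
                for b in range(1, cols):
                    j1 = l2[j] + b - 1
                    q = l2[j1] - l2[j]
                    best = min(f[p][b],               # remove the subtree rooted at i1
                               f[a - 1][b] + 1,       # delete T1[i1]
                               f[a][b - 1] + 1)       # insert T2[j1]
                    if p == 0 and q == 0:             # l(i1) = l(i) and l(j1) = l(j)
                        change = 0 if lab1[i1] == lab2[j1] else 1
                        f[a][b] = min(best, f[a - 1][b - 1] + change)
                        t_dr[i1][j1] = f[a][b]
                    else:
                        f[a][b] = min(best, f[p][q] + t_dr[i1][j1])
    return [None] + [t_dr[k][n2] for k in range(1, n1 + 1)]
```

For the text tree a(b, c) and the pattern a(b), the entry for the root is 0, because removing the subtree c leaves an exact copy of the pattern.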
5.2.2. Prune at any number of nodes from the TEXT tree. Given trees $T_1$ and $T_2$,
we want to know what is the minimum distance between $T_1[l(i)..i]$ and $T_2$ when there
have been zero or more prunings at nodes of $T_1[l(i)..i]$.
Let $F\_DP(T_1[l(i)..i_1], T_2[l(j)..j_1])$ denote the minimum distance between forest
$T_1[l(i)..i_1]$ and $T_2[l(j)..j_1]$ with zero or more prunings from $T_1[l(i)..i_1]$. Let
$T\_DP(i, j)$ denote the minimum distance between tree $T_1[l(i)..i]$ and $T_2[l(j)..j]$ with
zero or more prunings from $T_1[l(i)..i]$. The following initialization and general term
computation steps will give us an algorithm to solve our problem.
ALGORITHM PRUNINGS.

empty_initialization:
    F_DP(∅, ∅) = 0
left_initialization:
    F_DP(T1[l(i)..i1], ∅) = F_DP(T1[l(i)..l(i1)-1], ∅) + γ(T1[i1] → Λ)
right_initialization:
    F_DP(∅, T2[l(j)..j1]) = F_DP(∅, T2[l(j)..j1-1]) + γ(Λ → T2[j1])
general_if_computation
    /* applies if l(i1) = l(i) and l(j1) = l(j) */
    F_DP(T1[l(i)..i1], T2[l(j)..j1]) = min {
        F_DP(∅, T2[l(j)..j1-1]) + γ(T1[i1] → T2[j1]),
        F_DP(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DP(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DP(T1[l(i)..i1-1], T2[l(j)..j1-1]) + γ(T1[i1] → T2[j1]) }
    /* put the derived treedist in the permanent array, as specified by template */
general_else_computation
    F_DP(T1[l(i)..i1], T2[l(j)..j1]) = min {
        F_DP(T1[l(i)..l(i1)-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DP(T1[l(i)..i1-1], T2[l(j)..j1]) + γ(T1[i1] → Λ),
        F_DP(T1[l(i)..i1], T2[l(j)..j1-1]) + γ(Λ → T2[j1]),
        F_DP(T1[l(i)..l(i1)-1], T2[l(j)..l(j1)-1]) + T_DP(i1, j1) }
LEMMA 9. Algorithm Prunings is correct.
Proof. First we show that the initialization is correct. The empty_initialization and
the right_initialization are the same as in the tree distance algorithm. For
left_initialization, the best we can do for tree $T_1[l(i_1)..i_1]$ is to prune at $T_1[i_1]$. Therefore
$F\_DP(T_1[l(i)..i_1], \emptyset) = F\_DP(T_1[l(i)..l(i_1)-1], \emptyset) + \gamma(T_1[i_1] \to \Lambda)$. Hence the
left_initialization is correct.
For the general term $F\_DP(T_1[l(i)..i_1], T_2[l(j)..j_1])$, we have the following
similar three cases.
(1) $T_1[i_1]$ is not touched by a line of M.
    (1a) (without pruning) $F\_DP(T_1[l(i)..i_1-1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)$
    (1b) (with pruning) $F\_DP(T_1[l(i)..l(i_1)-1], T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)$
(2) $T_2[j_1]$ is not touched by a line of M. Since we only prune from $T_1$, there is
only one case here:
    $F\_DP(T_1[l(i)..i_1], T_2[l(j)..j_1-1]) + \gamma(\Lambda \to T_2[j_1])$
(3) Both $T_1[i_1]$ and $T_2[j_1]$ are touched by lines of M.
    (3a) (without pruning)
    $F\_DP(T_1[l(i)..l(i_1)-1], T_2[l(j)..l(j_1)-1]) + F\_DP(T_1[l(i_1)..i_1-1], T_2[l(j_1)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1])$
    (3b) (with pruning)
    $F\_DP(T_1[l(i)..l(i_1)-1], T_2[l(j)..l(j_1)-1]) + F\_DP(\emptyset, T_2[l(j_1)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1])$

If $l(i) = l(i_1)$ and $l(j) = l(j_1)$, consider cases (1b) and (3b). Case (1b) becomes
$F\_DP(\emptyset, T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)$. Case (3b) becomes
$F\_DP(\emptyset, T_2[l(j)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1])$. Now from the right_initialization,
and because $\gamma$ satisfies the triangle inequality, we know that

$$F\_DP(\emptyset, T_2[l(j)..j_1]) + \gamma(T_1[i_1] \to \Lambda)
= F\_DP(\emptyset, T_2[l(j)..j_1-1]) + \gamma(\Lambda \to T_2[j_1]) + \gamma(T_1[i_1] \to \Lambda)
\ge F\_DP(\emptyset, T_2[l(j)..j_1-1]) + \gamma(T_1[i_1] \to T_2[j_1]).$$

So the distance given by case (1b) is at least the distance from (3b). The proposed
general_if_computation is therefore correct, where the first term handles two cases.
If $l(i_1) \ne l(i)$ or $l(j_1) \ne l(j)$, consider case (3). As in Lemma 6, cases (3a) and (3b)
can be replaced by $F\_DP(T_1[l(i)..l(i_1)-1], T_2[l(j)..l(j_1)-1]) + T\_DP(i_1, j_1)$. The
proposed general_else_computation is therefore correct. Hence Algorithm Prunings is
correct. □
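A Python sketch of the pruning recurrences follows. As with any such illustration, it rests on assumptions not in the text: unit costs, trees supplied as 1-based postorder arrays (`lab[k]` is the label of node `k`, `l[k]` its leftmost leaf descendant), and hypothetical function names. In the if branch, `f[0][b - 1] + change` is the merged term for cases (1b) and (3b) of the proof.

```python
def keyroots(l):
    """LR_keyroots: for each leftmost leaf, the highest node that has it."""
    last = {}
    for k in range(1, len(l)):
        last[l[k]] = k
    return sorted(last.values())


def prunings(lab1, l1, lab2, l2):
    """Return dp with dp[i] = DP(T1[l(i)..i], T2); dp[0] is unused."""
    n1, n2 = len(lab1) - 1, len(lab2) - 1
    t_dp = [[0] * (n2 + 1) for _ in range(n1 + 1)]    # permanent T_DP array
    for i in keyroots(l1):
        for j in keyroots(l2):
            rows, cols = i - l1[i] + 2, j - l2[j] + 2
            # f[a][b] = F_DP(T1[l(i)..l(i)+a-1], T2[l(j)..l(j)+b-1])
            f = [[0] * cols for _ in range(rows)]
            for a in range(1, rows):
                p = l1[l1[i] + a - 1] - l1[i]
                f[a][0] = f[p][0] + 1                 # left_initialization: prune, then delete
            for b in range(1, cols):
                f[0][b] = f[0][b - 1] + 1             # right_initialization
            for a in range(1, rows):
                i1 = l1[i] + a - 1
                p = l1[i1] - l1[i]                    # size of T1[l(i)..l(i1)-1]
                for b in range(1, cols):
                    j1 = l2[j] + b - 1
                    q = l2[j1] - l2[j]
                    best = min(f[a - 1][b] + 1,       # delete T1[i1], no pruning
                               f[a][b - 1] + 1)       # insert T2[j1]
                    if p == 0 and q == 0:             # l(i1) = l(i) and l(j1) = l(j)
                        change = 0 if lab1[i1] == lab2[j1] else 1
                        f[a][b] = min(best,
                                      f[0][b - 1] + change,       # prune at i1, map i1 to j1
                                      f[a - 1][b - 1] + change)   # map i1 to j1, no pruning
                        t_dp[i1][j1] = f[a][b]
                    else:
                        f[a][b] = min(best,
                                      f[p][b] + 1,                # prune at i1, delete it
                                      f[p][q] + t_dp[i1][j1])
    return [None] + [t_dp[k][n2] for k in range(1, n1 + 1)]
```

For the text tree a(b, c) and the single-node pattern a, the entry for the root is 0, because pruning at the root leaves exactly the pattern.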

6. Conclusion. We present a simple dynamic programming algorithm for finding
the editing distance between ordered labelled trees. Our algorithm
(1) Has better time and space complexity than any in the literature;
(2) Is efficiently parallelizable; and
(3) Is generalizable with the same time complexity to approximate tree matching
problems.
We have implemented these algorithms as a toolkit that has already been used at
the National Cancer Institute.
Acknowledgments. We thank Bob Hummel for helpful discussions and the referees
for valuable comments.

REFERENCES
[ALKBO] S. ALTUVIA, H. LOCKER-GILADI, S. KOBY, O. BEN-NUN, AND A.B. OPPENHEIM, RNase III
stimulates the translation of the cIII gene of bacteriophage lambda, Proc. Nat. Acad. Sci.
U.S.A., 85 (1987), pp. 1-5.
[BSSBWD] B. BERKOUT, B.F. SCHMIDT, A. STRIEN, J. BOOM, J. WESTRENEN, AND J. DUIN, Lysis
gene of bacteriophage MS2 is activated by translation termination at the overlapping coat gene,
Proc. Nat. Acad. Sci. U.S.A., 195 (1987), pp. 517-524.
[DA] N. DELIHAS AND J. ANDERSON, Generalized structures of 5s ribosomal RNA’s, Nucleic Acid
Res. 10 (1982) p. 7323.
[DD] I.C. DECKMAN AND D. E. DRAPER, S4-alpha mRNA translation regulation complex, Molecular
Biol 196 (1987), pp. 323-332.
[HO] C.M. HOFFMANN AND M. J. O’DONNELL, Pattern matching in trees, J. Assoc. Comput. Mach.,
29 (1982), pp. 68-95.
[L] S. Y. Lu, A tree-to-tree distance and its application to cluster analysis, IEEE Trans. Pattern Anal.
Mach. Intelligence, (1979), pp. 219-224.
[LV] G.M. LANDAU AND U. VISHKIN, Introducing efficient parallelism into approximate string
matching and a new serial algorithm, in Proc. 18th Annual ACM Symposium on Theory of
Computing, Association for Computing Machinery, New York, 1986, pp. 220-230.
[S80] P.H. SELLERS, The theory and computation of evolutionary distances, J. Algorithms, (1980),
pp. 359-373.
[S88] B.A. SHAPIRO, An algorithm for comparing multiple RNA secondary structures, Comput. Appl.
Biosci. (1988), pp. 387-393.
[SK] J.L. SUSSMAN AND S. H. KIM, Three dimensional structure of a transfer RNA in two crystal
forms, Science, 192 (1976), p. 853.
[T] KUO-CHUNG TAI, The tree-to-tree correction problem, J. Assoc. Comput. Mach., 26 (1979),
pp. 422-433.
[U83] E. UKKONEN, On approximate string matching, in Proc. Internat. Conference on the Foundations
of Computing Theory, Lecture Notes in Computer Science 158, Springer-Verlag, Berlin,
New York, 1983, pp. 487-495.
[U85] E. UKKONEN, Finding approximate pattern in strings, J. Algorithms, 6 (1985), pp. 132-137.
[WF] R. WAGNER AND M. FISHER. The string-to-string correction problem, J. Assoc. Comput. Mach.,
21 (1974), pp. 168-178.
[Z83] KAIZHONG ZHANG, An algorithm for computing similarity of trees, Tech. Report, Mathematics
Department, Peking University, Peking, China, 1983.
[Z89] KAIZHONG ZHANG, The editing distance between trees: algorithms and applications, Ph.D. thesis, Department
of Computer Science, Courant Institute of Mathematical Sciences, New York University,
New York, 1989.

In a separate result obtained with the help of Rick Statman, we find that the editing distance between
unordered labelled trees (i.e., where the sibling order is insignificant) is NP-complete. The reduction is from
exact cover by 3-sets [Z89].
