Squibs and Discussions Memoization in Top-Down Parsing
Mark Johnson
Brown University
1. Introduction
In a paper published in this journal, Norvig (1991) pointed out that memoization of a
top-down recognizer program produces a program that behaves similarly to a chart
parser. This is not surprising to anyone familiar with logic-programming approaches to
natural language processing (NLP). For example, the Earley deduction proof procedure
is essentially a memoizing version of the top-down selected literal deletion (SLD) proof
procedure employed by Prolog. Pereira and Warren (1983) showed that the steps of
the Earley Deduction proof procedure proving the well-formedness of a string S from
the standard 'top-down' definite clause grammar (DCG) axiomatization of a context-
free grammar (CFG) G correspond directly to those of Earley's algorithm recognizing
S using G.
Yet as Norvig notes in passing, using his approach the resulting parsers in general
fail to terminate on left-recursive grammars, even with memoization. The goal of
this paper is to discover why this is the case and present a functional formalization
of memoized top-down parsing for which this is not so. Specifically, I show how
to formulate top-down parsers in a 'continuation-passing style,' which incrementally
enumerates the right string positions of a category, rather than returning a set of such
positions as a single value. This permits a type of memoization not described to my
knowledge in the context of functional programming before. This kind of memoization
is akin to that used in logic programming, and yields terminating parsers even in the
face of left recursion.
In this paper, algorithms are expressed in the Scheme programming language (Rees
and Clinger 1991). Scheme was chosen because it is a popular, widely known language
that many readers find easy to understand. Scheme's 'first-class' treatment of functions
simplifies the functional abstraction used in this paper, but the basic approach can be
implemented in more conventional languages as well. Admittedly elegance is a matter
of taste, but personally I find the functional specification of CFGs described here as
simple and elegant as the more widely known logical (DCG) formalization, and I hope
that the presentation of working code will encourage readers to experiment with the
ideas described here and in more substantial works such as Leermakers (1993). In
fact, my own observations suggest that with minor modifications (such as the use of
integers rather than lists to indicate string positions, and vectors indexed by string
positions rather than lists in the memoization routines) an extremely efficient chart
parser can be obtained from the code presented here.
Ideas related to the ones discussed here have been presented on numerous occa-
sions. Almost 20 years ago Shiel (1976) noticed the relationship between chart parsing
and top-down parsing. Leermakers (1993) presents a more abstract discussion of the
functional treatment of parsing, and avoids the left-recursion problem for memoized
If sets are represented by unordered lists, union can be given the following
definition. The function reduce is defined such that an expression of the form
(reduce f e '(x1 ... xn)) evaluates to (f (... (f e x1) ...) xn).
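The definitions referred to here can be sketched as follows. This is a minimal sketch written from the description in the text, assuming sets are duplicate-free lists; it is not necessarily the code of the original listing, and defining reduce at top level shadows any reduce procedure a particular Scheme implementation may already provide.

```scheme
;; Left fold, following the description in the text:
;; (reduce f e '(x1 ... xn)) evaluates to (f (... (f e x1) ...) xn).
(define (reduce f e xs)
  (if (null? xs)
      e
      (reduce f (f e (car xs)) (cdr xs))))

;; Union of two sets represented as unordered, duplicate-free lists.
(define (union set1 set2)
  (cond ((null? set1) set2)
        ((member (car set1) set2) (union (cdr set1) set2))
        (else (cons (car set1) (union (cdr set1) set2)))))

(reduce union '() '((a b) (b c) (c d)))  ; => (a b c d)
(reduce - 0 '(1 2 3))                    ; => -6, i.e. (- (- (- 0 1) 2) 3)
```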
When evaluated using Scheme's applicative-order reduction rule, such a system be-
haves as a depth-first, top-down recognizer in which nondeterminism is simulated by
backtracking. For example, in (2) the sequence V NP is first investigated as a potential
analysis of VP, and then the sequence V S is investigated.
Rather than defining the functions f by hand as in (2), higher-order functions can
be introduced to automate this task. It is convenient to use suffixes of the input string
to represent the string positions of the input string (as in DCGs).
The expression (terminal x) evaluates to a function that maps a string position $l$ to
the singleton set $\{r\}$ iff the terminal x spans from $l$ to $r$, and the empty set otherwise.
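One way to realize this description is sketched below, assuming (as stated above) that a string position is represented by the suffix of the input beginning at that position, and that sets are lists. The parameter name pos is illustrative.

```scheme
;; (terminal x) returns a recognizer: given a left string position pos
;; (a suffix of the input), it returns the list of right positions
;; reachable by consuming the single terminal x.
(define (terminal x)
  (lambda (pos)
    (if (and (pair? pos) (eq? (car pos) x))
        (list (cdr pos))   ; singleton set: the position after x
        '())))             ; empty set: x does not start at pos

((terminal 'knows) '(knows every student))  ; => ((every student))
((terminal 'knows) '(every student))        ; => ()
```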
The expression (seq fA fB) evaluates to a function that maps a string position $l$ to the
set of string positions $\{r_i\}$ such that there exists an $m \in f_A(l)$, and $r_i \in f_B(m)$. Informally,
the resulting function recognizes substrings that are the concatenation of a substring
recognized by $f_A$ and a substring recognized by $f_B$.
These higher-order functions can be used to provide simpler definitions, such as (2a)
or (2b), for the function VP defined in (2) above.
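As a rough illustration of how such definitions might look, the sketch below defines seq and a set-valued alt from reduce and union, and uses them for a single VP rule. The lexical entries for V and NP are hypothetical, the helper definitions are repeated only to keep the block self-contained, and the code is reconstructed from the prose rather than copied from the original listings.

```scheme
(define (reduce f e xs)
  (if (null? xs) e (reduce f (f e (car xs)) (cdr xs))))

(define (union set1 set2)
  (cond ((null? set1) set2)
        ((member (car set1) set2) (union (cdr set1) set2))
        (else (cons (car set1) (union (cdr set1) set2)))))

(define (terminal x)
  (lambda (pos)
    (if (and (pair? pos) (eq? (car pos) x)) (list (cdr pos)) '())))

;; (seq fA fB): apply fB at every right position m that fA returns for
;; pos, and take the union of the resulting sets.
(define (seq fA fB)
  (lambda (pos)
    (reduce union '() (map fB (fA pos)))))

;; (alt fA fB): right positions found by either alternative.
(define (alt fA fB)
  (lambda (pos)
    (union (fA pos) (fB pos))))

;; Hypothetical toy lexicon.
(define V  (terminal 'knows))
(define NP (alt (terminal 'kim) (terminal 'sandy)))

;; VP -> V NP
(define VP (seq V NP))

(VP '(knows sandy))       ; => (())      one analysis, nothing left over
(VP '(knows sandy well))  ; => ((well))
(VP '(sandy knows))       ; => ()
```

Given a definition of S, the second VP rule could be added in the same style as (alt (seq V NP) (seq V S)).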
Computational Linguistics Volume 21, Number 3
3. Memoization and Left Recursion
1 This problem can arise even if syntactic constructions specifically designed to express mutual recursion
are used, such as letrec. Although these variables are closed over, their values are not applied when
the defining expressions are evaluated, so such definitions should not be problematic for an
applicative-order evaluator. Apparently Scheme requires that mutually recursive functional expressions
syntactically contain a lambda expression. Note that this is not a question of reduction strategy (e.g.,
normal-order versus applicative-order), but an issue about the syntactic scope of variables.
4 The relation $r_A$ and the function $f_A$ mentioned above satisfy $\forall r \, \forall l \;\; r_A(l, r) \leftrightarrow r \in f_A(l)$.
5 Several readers of this paper, including a reviewer, suggested that this can be formulated more
succinctly using Scheme's call/cc continuation-constructing primitive. After this paper was accepted
for publication, Jeff Sisskind devised an implementation based on call/cc which does not require
continuations to be explicitly passed as arguments to functions.
(16) (define (square x) (* x x))
(17) (define (square cont x) (cont (* x x)))
(18) > (square display 3)
9
For a more complicated example, consider the two rules defining VP in the fragment
above, repeated here as (20). These could be formalized as the CPS function defined
in (21).
6 Tail recursion optimization prevents the procedure call stack from growing unboundedly.
7 This CPS formalization of CFGs is closely related to the 'downward success passing' method of
translating Prolog into Lisp discussed by Kahn and Carlsson (1984).
In this example V, NP, and S are assumed to have CPS definitions. Informally, the
expression (lambda (pos1) (NP continuation pos1)) is a continuation that specifies
what to do if a V is found, viz., pass the V's right string position pos1 to the NP
recognizer as its left-hand string position, and instruct the NP recognizer in turn to
pass its right string positions to continuation.
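A CPS function for the two VP rules, consistent with the description above, might look like the following sketch. The definitions of V, NP, and S are deliberately tiny, hypothetical stand-ins (S is just NP V here) so that the block runs on its own; the results variable is likewise only for illustration.

```scheme
;; Hypothetical toy CPS recognizers: each calls its continuation once
;; for every right string position it finds, and otherwise does nothing.
(define (V continuation pos)
  (if (and (pair? pos) (eq? (car pos) 'knows))
      (continuation (cdr pos))))

(define (NP continuation pos)
  (if (and (pair? pos) (memq (car pos) '(kim sandy)))
      (continuation (cdr pos))))

;; Stand-in for S, kept minimal: S -> NP V.
(define (S continuation pos)
  (NP (lambda (pos1) (V continuation pos1)) pos))

;; VP -> V NP and VP -> V S: try both rules in turn. Each continuation
;; says what to do with the right string position pos1 of the V.
(define (VP continuation pos)
  (V (lambda (pos1) (NP continuation pos1)) pos)
  (V (lambda (pos1) (S continuation pos1)) pos))

;; Collect every right position that VP passes to its continuation.
(define results '())
(VP (lambda (pos) (set! results (cons pos results)))
    '(knows kim knows))
results  ; => (() (knows))
```

The continuation is invoked twice: once with (knows) via the rule VP -> V NP, and once with the empty suffix via VP -> V S.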
The recognition process begins by passing the function corresponding to the root
category the string to be recognized, and a continuation (to be evaluated after suc-
cessful recognition) that records the successful analysis. 8
Thus rather than constructing a set of all the right string positions (as in the previous
encoding), this encoding exploits the ability of the CPS approach to 'return' a value
zero, one or more times (corresponding to the number of right string positions). And
although it is not demonstrated in this paper, the ability of a CPS procedure to 'return'
more than one value at a time can be used to pass other information besides right string
position, such as additional syntactic features or semantic values.
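The recognition process just described can be sketched as below. The grammar is a hypothetical toy fragment written directly in CPS; the continuation passed to the root category S sets a flag when it receives the empty suffix, i.e., when an analysis spans the whole input.

```scheme
;; Hypothetical toy fragment in CPS: S -> NP VP, VP -> V NP.
(define (NP continuation pos)
  (if (and (pair? pos) (memq (car pos) '(kim sandy)))
      (continuation (cdr pos))))

(define (V continuation pos)
  (if (and (pair? pos) (eq? (car pos) 'knows))
      (continuation (cdr pos))))

(define (VP continuation pos)
  (V (lambda (pos1) (NP continuation pos1)) pos))

(define (S continuation pos)
  (NP (lambda (pos1) (VP continuation pos1)) pos))

;; The continuation may be called zero, one, or several times; it
;; records success only when the whole string has been consumed.
(define (recognize words)
  (let ((recognized #f))
    (S (lambda (pos)
         (if (null? pos) (set! recognized #t)))
       words)
    recognized))

(recognize '(kim knows sandy))  ; => #t
(recognize '(kim knows))        ; => #f
```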
Again, higher-order functions can be used to simplify the definitions of the CPS
functions corresponding to categories. The CPS versions of the terminal, seq, and alt
functions are given as (23), (25), and (24) respectively.
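CPS versions of these three functions might be sketched as follows. This is written from the surrounding description and is not necessarily identical to the original listings; the small grammar and the recognize procedure are hypothetical additions so that the block can be run directly.

```scheme
;; CPS terminal: call the continuation with the right position if the
;; word is found at pos; otherwise "return" nothing at all.
(define (terminal word)
  (lambda (continuation pos)
    (if (and (pair? pos) (eq? (car pos) word))
        (continuation (cdr pos)))))

;; CPS alt: run both alternatives with the same continuation.
(define (alt alt1 alt2)
  (lambda (continuation pos)
    (alt1 continuation pos)
    (alt2 continuation pos)))

;; CPS seq: the continuation given to seq1 hands each of its right
;; positions to seq2 as a left position.
(define (seq seq1 seq2)
  (lambda (continuation pos)
    (seq1 (lambda (pos1) (seq2 continuation pos1)) pos)))

;; Hypothetical non-recursive toy grammar.
(define V  (terminal 'knows))
(define NP (alt (terminal 'kim) (terminal 'sandy)))
(define VP (seq V NP))
(define S  (seq NP VP))

(define (recognize words)
  (let ((recognized #f))
    (S (lambda (pos) (if (null? pos) (set! recognized #t))) words)
    recognized))

(recognize '(sandy knows kim))  ; => #t
(recognize '(knows kim))        ; => #f
```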
8 Thus this formalization makes use of mutability to return final results, and so cannot be expressed in a
purely functional language. However, it is possible to construct a similar formalization in the purely
functional subset of Scheme by passing around an additional 'result' argument (here the last
argument). The examples above would be rewritten as the following under this approach.
If these three function definitions replace the earlier definitions given in (5), (6), and
(7), the fragment in Figure 1 defines a CPS recognizer. Note that just as in the first CFG
encoding, the resulting program behaves as a top-down recognizer. Thus in general
these programs fail to terminate when faced with a left-recursive grammar for
essentially the same reason: the procedures that correspond to left-recursive categories
involve ill-founded recursion.
The memo procedure defined in (15) is not appropriate for CPS programs because it as-
sociates the arguments of the functional expression with the value that the expression
reduces to, but in a CPS program the 'results' produced by an expression are the val-
ues it passes on to the continuation, rather than the value that the expression reduces
to. That is, a memoization procedure for a CPS procedure should associate argument
values with the set of values that the unmemoized procedure passes to its continua-
tion. Because an unmemoized CPS procedure can produce multiple result values, its
memoized version must store not only these results, but also the continuations passed
to it by its callers, which must receive any additional results produced by the original
unmemoized procedure.
The cps-memo procedure in (26) achieves this by associating with each set of
argument values a table entry that has two components: a list of caller continuations
and a list of result values. The caller continuation entries are constructed when the
memoized procedure is called, and the result values are entered and propagated back
to callers each time the unmemoized procedure 'returns' a new value. 9
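The behavior described in the preceding two paragraphs can be sketched as follows. This is an illustrative reimplementation, not the code of (26): it assumes each 'return' passes a single value, keeps the table as an association list compared with equal?, and uses for-each in place of dolist. The entry helpers and the factors example are hypothetical names introduced only for this sketch.

```scheme
;; A table entry is a two-slot vector: #(continuations results).
(define (make-entry) (vector '() '()))
(define (entry-continuations e) (vector-ref e 0))
(define (entry-results e) (vector-ref e 1))
(define (push-continuation! e k)
  (vector-set! e 0 (cons k (entry-continuations e))))
(define (push-result! e r)
  (vector-set! e 1 (cons r (entry-results e))))

(define (cps-memo cps-fn)
  (let ((table '()))                 ; association list: args -> entry
    (lambda (continuation . args)
      (let ((hit (assoc args table)))
        (if hit
            ;; Arguments seen before: register the caller and replay
            ;; the results found so far. Later results reach it too,
            ;; because it is now among the entry's continuations.
            (let ((entry (cdr hit)))
              (push-continuation! entry continuation)
              (for-each (lambda (result) (continuation result))
                        (entry-results entry)))
            ;; First call: run the unmemoized procedure once, with a
            ;; continuation that stores each new result and passes it
            ;; to every caller registered so far.
            (let ((entry (make-entry)))
              (set! table (cons (cons args entry) table))
              (push-continuation! entry continuation)
              (apply cps-fn
                     (lambda (result)
                       (if (not (member result (entry-results entry)))
                           (begin
                             (push-result! entry result)
                             (for-each (lambda (k) (k result))
                                       (entry-continuations entry)))))
                     args)))))))

;; Hypothetical example: a CPS procedure that 'returns' every divisor of n.
(define calls 0)
(define factors
  (cps-memo
   (lambda (continuation n)
     (set! calls (+ calls 1))
     (do ((d 1 (+ d 1)))
         ((> d n))
       (if (= 0 (remainder n d))
           (continuation d))))))

(define seen '())
(define (record d) (set! seen (cons d seen)))
(factors record 6)
(factors record 6)
calls          ; => 1  the unmemoized procedure ran only once
(length seen)  ; => 8  each call still received all four divisors
```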
9 The dolist form used in (26) behaves as the dolist form in Common Lisp. It can be defined in terms
of Scheme primitives as follows:
(define-syntax dolist
  (syntax-rules ()
    ((dolist (var list) . body)
     (do ((to-do list (cdr to-do)))
         ((null? to-do))
       (let ((var (car to-do)))
         . body)))))
Entries are manipulated by the following procedures. Again, because this fragment
does not produce partially specified results, the result subsumption check can be per-
formed by the Scheme function member.
As claimed above, the memoized version of the CPS top-down parser does terminate,
even if the grammar is left-recursive. Informally, memoized CPS top-down parsers
terminate in the face of left-recursion because they ensure that no unmemoized pro-
cedure is ever called twice with the same arguments. For example, we can replace
the definition of NP in the fragment with the left-recursive one given in (35) with-
out compromising termination, as shown in (36) (where the input string is meant to
approximate Kim's professor knows every student).
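The termination claim can be checked on a small scale with the sketch below. It is a hypothetical reconstruction rather than the code of (35) and (36): the lexicon, the single-value cps-memo, and the lowercase input are assumptions, and the helper definitions are repeated only so that the block is self-contained. The lambda around the body of NP delays the reference to NP until call time, which makes the recursive definition possible.

```scheme
;; Memoization for single-result CPS procedures (a sketch).
(define (make-entry) (vector '() '()))      ; #(continuations results)
(define (entry-continuations e) (vector-ref e 0))
(define (entry-results e) (vector-ref e 1))
(define (push-continuation! e k)
  (vector-set! e 0 (cons k (entry-continuations e))))
(define (push-result! e r)
  (vector-set! e 1 (cons r (entry-results e))))

(define (cps-memo cps-fn)
  (let ((table '()))
    (lambda (continuation . args)
      (let ((hit (assoc args table)))
        (if hit
            (let ((entry (cdr hit)))
              (push-continuation! entry continuation)
              (for-each (lambda (r) (continuation r))
                        (entry-results entry)))
            (let ((entry (make-entry)))
              (set! table (cons (cons args entry) table))
              (push-continuation! entry continuation)
              (apply cps-fn
                     (lambda (r)
                       (if (not (member r (entry-results entry)))
                           (begin
                             (push-result! entry r)
                             (for-each (lambda (k) (k r))
                                       (entry-continuations entry)))))
                     args)))))))

;; CPS combinators.
(define (terminal word)
  (lambda (continuation pos)
    (if (and (pair? pos) (eq? (car pos) word))
        (continuation (cdr pos)))))
(define (alt alt1 alt2)
  (lambda (continuation pos)
    (alt1 continuation pos)
    (alt2 continuation pos)))
(define (seq seq1 seq2)
  (lambda (continuation pos)
    (seq1 (lambda (pos1) (seq2 continuation pos1)) pos)))

;; Hypothetical lexicon.
(define PN  (terminal 'kim))
(define Det (terminal 'every))
(define N   (alt (terminal 'professor) (terminal 'student)))
(define V   (terminal 'knows))

;; Left-recursive rule: NP -> PN | NP N | Det N.
(define NP
  (cps-memo
   (lambda (continuation pos)
     ((alt PN (alt (seq NP N) (seq Det N))) continuation pos))))

(define VP (seq V NP))
(define S  (seq NP VP))

(define (recognize words)
  (let ((recognized #f))
    (S (lambda (pos) (if (null? pos) (set! recognized #t))) words)
    recognized))

(recognize '(kim professor knows every student))  ; => #t
(recognize '(kim professor knows every))          ; => #f
```

In this sketch the memo table of NP persists between calls to recognize; because entries are keyed by input suffixes this only caches earlier results, but a fuller implementation would presumably reset the tables for each new input.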