University of Waikato Computer Science Working Paper 93/03
ISSN 1170-487X
January, 1993
Tony C. Smith
Ian H. Witten
Abstract
1 Introduction
Syntactic analysis of natural language has generally focused on the structural roles fulfilled by the
thematic elements of linguistic expression: agent, principal action, recipient, instrument, and so
on [4, 30]. This has produced theories in which the noun and verb are the primary constituents of
every utterance, and syntactic structure emerges as a projection from these major lexical categories
[15, 19, 27, 24]. Language processors developed under this prescriptive tradition face two serious
practical limitations: an inadequate lexicon and an incomplete structural description for the language
in question [6, 8, 20]. In contrast, the present paper investigates an alternative methodology that
passively infers grammatical information from positive instances of well-formed expressions.
Grammar induction has often been employed to find descriptive formulations of language
structure [3, 6, 9, 10, 20, 33, 35]. Such efforts commonly adopt straightforward sequence inference
techniques without regard to lexical categories or syntactic modularity. This paper discusses the
development of a system that uses the simple notion of closed-class lexemes to infer lexical and
syntactic information, including lexical categories and grammar rules, from statistical analyses of
copious quantities of machine-readable text. Closed-class inferencing compares well with current
linguistic theories of syntax and offers a wide range of potential applications.
2 Function words
Most linguists accept that there is a set of function words that can be characterized as "closed class."
But there is no consensus on exactly which words this comprises. Since language inference is a
discovery process, membership in the closed class should be based on a criterion that identifies
function words by analysing their usage patterns. Of the previously mentioned characteristics of
function words, relatively high frequency is the only one that can be determined from a static
language analysis.
Table 1: Most frequent words in Far From the Madding Crowd and Moby Dick
The principal practical obstacle to the 1% cutoff is how to ascertain an "average person's
everyday vocabulary." One might object to the suggestion that Hardy or Melville exemplify
common parlance-despite the fact that their demonstrated vocabularies are of an appropriate size.
But we assume that closed-class elements are functionally significant for the language itself, and will
therefore be statistically dominant in any individual's vernacular, including Hardy's or Melville's.
For example, Table 2 shows that the top 1% of Hardy's vocabulary accounts for almost 54% of the
text in Far From The Madding Crowd. These 115 words are listed in Table 3 and only about 16 of
them fail any sort of intuitive test as function words.
Table 1 shows tremendous commonality between the most frequently used words of Hardy and
Melville. Sixteen of the top twenty are the same, the first six differing only in their order. This
similarity extends beyond the words listed here. But there are also some significant discrepancies.
For instance, there are no feminine pronouns in the 80 most frequently used words of Moby Dick,
with she appearing in the relatively distant 217th position, though it is the 14th most common word
in Hardy's novel. Moreover, whale is the 28th most common word in Moby Dick yet it never occurs
in Far From the Madding Crowd; similarly Bathsheba, Hardy's 38th most frequently used word,
does not appear in Melville's book.
Of course, neither Bathsheba nor whale conforms to our intuitive notion of a function word,
and both should be removed from the class, whereas it would be unfortunate if feminine pronouns
were overlooked. Therefore neither vocabulary is entirely suited to be the paradigm. But we can
capitalize on their similarities by intersecting the two vocabularies before taking the top 1%. This
removes lexical items peculiar to any one text and, as a consequence, moves function words that
may otherwise have been overlooked higher up in the frequency ordering.

  number of   vocabulary items                   fraction of total     usage    fraction of
  words       represented                        vocabulary                     text
        1     {the}                                    0.01%            7,746       5.5%
        2     {and, the}                               0.02%           12,031       8.5%
        3     {a, and, the}                            0.03%           15,942      11.3%
        5     {a, and, of, the, to}                    0.04%           23,315      16.6%
       10     {a, and, I, in, it, ...}                 0.09%           32,857      23.4%
       15     {a, and, as, I, in, ...}                 0.13%           39,638      28.2%
      115     {a, about, again, all, am, ...}          0.99%           75,688      53.8%
    11589     {aaron, abandon, abasement, ...}       100.00%          140,632     100.0%

Table 2: Coverage of Far From the Madding Crowd by its most frequent words

Table 3: The most frequent 1% of words in Far From the Madding Crowd

Table 4 lists the function
words obtained by applying this method to the vocabularies of Hardy, Melville, and that employed
in a collection of works by Lewis Carroll (Alice in Wonderland, Alice Through the Looking Glass,
and The Hunting of the Snark). Unfortunately "she" still does not appear in the list, though "he"
and "her" do.
Table 4: The closed class, inferred from Hardy, Melville and Carroll
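The frequency-based selection described above is straightforward to reproduce. The following
sketch is our own illustration, not the original implementation: it assumes plain-text files and a
naive tokenizer, intersects the vocabularies of the given texts, and keeps the most frequent 1% of
the shared vocabulary as closed-class candidates.

```python
import re
from collections import Counter

def vocabulary_counts(path):
    """Count word occurrences in a plain-text file (lower-cased, letters and apostrophes)."""
    with open(path, encoding="utf-8") as f:
        words = re.findall(r"[a-z']+", f.read().lower())
    return Counter(words)

def closed_class_candidates(paths, fraction=0.01):
    """Intersect the vocabularies of several texts, then return the top `fraction`
    of the shared vocabulary, ranked by combined frequency."""
    counts = [vocabulary_counts(p) for p in paths]
    shared = set(counts[0])
    for c in counts[1:]:
        shared &= set(c)                       # keep only words used in every text
    combined = Counter({w: sum(c[w] for c in counts) for w in shared})
    k = max(1, int(len(combined) * fraction))  # size of the top 1%
    return [w for w, _ in combined.most_common(k)]

# Hypothetical usage with locally stored e-texts:
# closed_class_candidates(["madding_crowd.txt", "moby_dick.txt", "carroll.txt"])
```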
specific referent, whereas "a" works as a kind of universal quantifier indicating a representative of
a general class of referent. Moreover, determiners like "his", "some", "many", and "all" permit
reference at greater and lesser degrees of specificity.
It seems that closed-class words fall into functional categories. This is attractive because it
greatly reduces the number of syntactic roles in a language. However, in keeping with a static
analysis, we seek to achieve such generalization without relying on semantic or psychological
properties. Once again, frequency analysis provides a solution.
The frequency-based method for discovering closed-class words can be regarded as a kind of
zero-order test which considers the usage of words in isolation. It takes no account of the structural
usage demonstrated by a word - its proximity and juxtaposition with respect to neighbors. But if
closed-class words represent functional categories, then words from the same category might be
expected to demonstrate similar structural usage. This can be determined by comparing the number
of times each one is used in a structural context similar to that of another.
Define the "first-order successors" of a function word to be the set of words that immediately
follow it in a particular text. (To extend the idea further, the "second-order successors" can be
defined as the set of words following second after it, and so on.) The relative size of the intersection
of the first-order successors of two function words is a measure of how often the words are used in
similar syntactic structures. Where two closed-class words share an unusually common structural
usage, we assume that they are functionally similar.
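As an illustration, consider the following sketch, which is our own construction and assumes a list
of pre-tokenized sentences and a set of closed-class words: it collects the first-order successors of
each function word and reports the set sizes needed for the comparison.

```python
from collections import defaultdict

def first_order_successors(sentences, function_words):
    """Map each function word to the set of words that immediately follow it."""
    successors = defaultdict(set)
    fw = set(function_words)
    for sentence in sentences:                 # each sentence is a list of tokens
        for left, right in zip(sentence, sentence[1:]):
            if left in fw:
                successors[left].add(right)
    return successors

def overlap(successors, w1, w2):
    """Return n1, n2 (successor-set sizes) and n (size of their intersection)."""
    s1, s2 = successors[w1], successors[w2]
    return len(s1), len(s2), len(s1 & s2)

# For Far From the Madding Crowd, overlap(..., "i", "you") should yield figures of
# the order reported in Table 5 (231, 293, 110), though exact counts depend on the
# tokenization used.
```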
To determine whether two function words have an unusually large degree of commonality in their
first-order successors, assume that closed-class words play no part in establishing functional roles.
Then the words following each particular closed-class lexeme in a text would represent a more or
less random sampling of the vocabulary.
By counting the number of different words that occur after two particular closed-class words,
the expected number of different words that will appear after both can be calculated, under the
assumption of random sampling. In fact, the degree of commonality is often very much higher than
expected. This is no doubt partly due to the breakdown of our simplifying assumption. However,
in some cases the degree of commonality-measured as the probability of this much commonality
occurring by chance-is so extremely high that it indicates a substantial similarity between the
syntactic roles of the two closed-class words being considered.
  word    first-order   word    first-order   intersection   log           apparent
          successors            successors    size           probability   association
  I           231       you         293            110         -316.0      strong
  we           71       you         293             45         -238.0      strong
  her         557       you         293             55          -27.7      weak

  he          348       they        138             71         -253.0      strong
  her         557       my          243             99         -149.0      strong
  him         113       me          104             27         -149.0      strong
  her         557       his         562            149         -138.0      strong
  him         113       he          348             20          -18.9      weak
  his         562       he          348             13           -0.1      weak

  had         341       have        205             80         -211.0      strong
  had         341       was         641            115         -117.0      strong
  is          229       was         641             93         -117.0      strong
  from        126       was         641             32          -23.1      weak

  about        63       at          124             24         -184.0      strong
  at          124       from        126             29         -127.0      strong
  on          147       from        126             28         -101.0      strong
  have        205       at          124             15          -18.9      weak
  was         641       at          124             26          -15.2      weak

Table 5: Intersections of first-order successor sets for selected function words in Far From the
Madding Crowd
What is the probability that the intersection between two randomly-chosen sets is as large as
a given value? Consider sets S1 and S2 of given sizes n1 and n2, whose members are drawn
independently and at random from a set of size N. Denote the size of their intersection, |S1 ∩ S2|,
by the random variable I. It can be shown that I is distributed according to a hypergeometric
distribution, and the probability that it exceeds a certain value n, Pr[I ≥ n], can be determined.
Unfortunately, the calculation is infeasible for large values of n1, n2 and N. Various approximations
can be used to circumvent the problem, such as the binomial, Poisson and Normal distributions.
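For reference, the exact distribution implied by this sampling model is the standard hypergeometric
one; the formulas below are supplied by us for completeness rather than taken from the original:

$$
\Pr[I = k] = \frac{\binom{n_1}{k}\binom{N-n_1}{n_2-k}}{\binom{N}{n_2}},
\qquad
\Pr[I \ge n] = \sum_{k=n}^{\min(n_1,n_2)} \Pr[I = k],
$$

with mean $\mu = n_1 n_2 / N$ and variance $\sigma^2 = n_1 n_2 (N-n_1)(N-n_2) / \big(N^2(N-1)\big)$.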
For example, suppose that for a particular corpus with a vocabulary of 10000 words (N =
10000), two particular function words are both followed by 2000 different words (n1 = 2000,
n2 = 2000). Suppose that these two sets have 700 words in common (n = 700). Then the Normal
approximation has mean μ ≈ 400; in other words one expects only 400 words to be in common if
the sets were randomly chosen. Its standard deviation is σ ≈ 16, and so the actual figure of 700
is 19 standard deviations from the mean. It follows that the probability of I being at least as large
as it is, Pr[I ≥ 700], is very tiny - about 10^-80. (In fact tables of the Normal distribution do not
generally give values for z ≥ 5 - they end with Pr[z > 4.99] = 3 × 10^-7.)
To estimate the probability Pr[I ≥ n] in general, several approximations are possible. It was
decided to split the problem into three cases depending on the size of n, n1 and n2. First, when
n = 0, use Pr[I ≥ 0] = 1. Second, when either n1 or n2 is large (say n1 or n2 > 100), use
the Normal approximation to the hypergeometric distribution, employing standard mathematical
tables to approximate the integral that is involved. Otherwise, when both n1 and n2 are small (i.e.
≤ 100), calculate an approximation directly from the hypergeometric distribution and evaluate it
using precomputed factorials up to 100 stored in a table.
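The three-case scheme can be rendered compactly as follows. This is our own reconstruction: the
original used precomputed factorial tables and standard Normal tables, whereas the sketch below
uses log-gamma and a Normal tail bound.

```python
import math

def log10_binom(a, b):
    """log10 of the binomial coefficient C(a, b), computed via log-gamma."""
    return (math.lgamma(a + 1) - math.lgamma(b + 1) - math.lgamma(a - b + 1)) / math.log(10)

def log10_tail(N, n1, n2, n):
    """Approximate log10 of Pr[I >= n] under the hypergeometric intersection model."""
    if n <= 0:
        return 0.0                                   # Pr[I >= 0] = 1
    if max(n1, n2) <= 100:
        # small sets: sum the hypergeometric upper tail directly
        total = sum(10 ** (log10_binom(n1, k) + log10_binom(N - n1, n2 - k)
                           - log10_binom(N, n2))
                    for k in range(n, min(n1, n2) + 1))
        return math.log10(total) if total > 0 else float("-inf")
    # large sets: Normal approximation to the hypergeometric tail
    mu = n1 * n2 / N
    sigma = math.sqrt(n1 * n2 * (N - n1) * (N - n2) / (N * N * (N - 1)))
    z = (n - mu) / sigma
    if z <= 0:
        return 0.0          # at or below the mean: the tail probability is not small
    # log10 of the standard Normal upper-tail bound phi(z)/z
    return (-z * z / 2 - math.log(z * math.sqrt(2 * math.pi))) / math.log(10)

# Worked example from the text: log10_tail(10000, 2000, 2000, 700) is roughly -78,
# consistent with the order of magnitude quoted above (about 10^-80).
```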
Table 5 lists the probabilities calculated for intersection sizes of the first-order successors for
some of the function words in the novel Far From the Madding Crowd. The first line shows that "I"
and "you" were followed by 231 and 293 different words respectively, of which 110 are in common.
Considering the vocabulary size of 11,589 words, it is very unlikely that as many as 110 would
be in common had the successors been randomly chosen - the probability is in fact only 10^-316!
"I" and "you" thus seem to perform similar functions. So do "we" and "you", whereas "her" and
"you" are much less strongly associated. The remaining blocks of the table give samples of other
associations, both strong and weak. Possessive pronouns, for example, show strong associations
with each other, as do pronouns in the same case (i.e. nominative, objective, etc.). Relatively weak
associations are indicated by comparisons across such class boundaries. Auxiliary verbs also show
strong associations with each other, and prepositions do as well, yet these two categories offer little
statistical evidence of any relationship between them.
Figure 1: Categorization clusters for Hardy (solid lines) and Melville (dashed lines)
precludes access to any sort of semantic information that would help to assign open-class words to
these categories.
Category generalization
When applied to Far From The Madding Crowd, this procedure creates about 90,000 initial
categories. Each is subsequently compared against all others in the same manner as the first-order
successors for function words were compared. That is, the strength of the association between two
categories is determined by the probability that the sets have an intersection of the size exhibited.
The larger the intersection, the more likely it is that the categories share the same lexical function.
Probabilities are calculated for all pairs before any are combined, and amalgamation is performed
in a single pass. Once again, no provision is made to prevent a word from occupying several
categories.
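Schematically, the single-pass amalgamation might be expressed as below. This is a loose
reconstruction under our own assumptions (threshold value, data structures, and a caller-supplied
probability function such as the log10_tail sketch given earlier), not the actual procedure used.

```python
def amalgamate(categories, N, log_prob, threshold=-100.0):
    """Merge word categories whose member-set intersections are improbably large.

    categories : list of sets of words (the initial categories)
    N          : vocabulary size
    log_prob   : function (N, n1, n2, n) -> log10 Pr[intersection >= n],
                 e.g. the log10_tail sketch given earlier
    threshold  : log10 probability below which two categories count as associated
    """
    # probabilities for all pairs are computed before anything is combined
    associated = []
    for i in range(len(categories)):
        for j in range(i + 1, len(categories)):
            a, b = categories[i], categories[j]
            if log_prob(N, len(a), len(b), len(a & b)) < threshold:
                associated.append((i, j))
    # amalgamation in a single pass; a word may end up in several categories
    merged = [set(c) for c in categories]
    for i, j in associated:
        merged[i] |= categories[j]
        merged[j] |= categories[i]
    return merged
```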
Table 7 shows some of the 61 final content word categories derived using this technique.
Category cw44 exemplifies a fairly sound collection of adverbs, and cw41 and cw51 are reasonably
consistent sets of past tense and present participle verbs respectively. Category cw58 includes many
of the plural nouns from Far From The Madding Crowd. These groupings represent some of the
more coherent open class categories; however, they do not demonstrate complete collections of the
classic grammatical forms they exemplify. For example, most of the present participle verbs used
in Hardy's novel are found in groupings not listed here, often mixed in with words from a variety
of standard syntactic categories. Of the 61 categories, 58 contain fewer than 170 words, each of
which tends toward a particular grammatical class. Unfortunately the three largest sets contain over
3000 words and do not submit to characterization under traditional syntactic forms. In general, the
larger the group the more difficult it is to interpret using standard grammatical terminology.
  category   elements
  cw41       pulled sent drew wrong formed asked visible returned short used closed
  cw44       certainly merely entirely already apparently sometimes really nearly hardly
  cw51       doing beginning able coming next began feeling looking having going
  cw58       miles circumstances pounds clothes hours arms feet neighbours thoughts
             horses trees features lips days others sort hands minutes things times
             people sheep women years words men
Table 7: Some content word categories from Far From the Madding Crowd
[Figure: stages of rule generalization leading to the final PSRs, expressed over function-word
symbols (fw), content-word category symbols (cw), and fw-phrase symbols (FP).]
Unlike the other category symbols, the terminating symbol for each segment does not denote a
substituted word. It merely indicates the type of fw-phrase that follows the segment in question and
serves to preserve fw-phrase links within the grammar. As before, the symbol fwφ represents a null
fw-phrase and is used at the end of sentences.
Table 9: Stages of grammar reduction for Far From the Madding Crowd
longer sequence. Comparison continues for shorter and shorter sequences down to substrings of
length two. Finally, the fw-phrase rules are presented in Backus-Naur Form as a context-free
grammar describing the text. Table 8 shows the grammar for the example sentence.
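The decomposition just described can be pictured as repeatedly factoring shared symbol sequences
out of the rules, longest first, down to length two. The sketch below is a simplified illustration of
that idea under our own assumptions (rule naming, data structures); it is not the procedure that
produced Table 8.

```python
def factor_common_subsequences(rules, min_len=2):
    """Greedily replace symbol sequences shared by two or more rules with new rules.

    rules : dict mapping rule names to lists of symbols (e.g. fw-phrase rules).
    Scans from the longest shared subsequence down to length `min_len`.
    """
    counter = 0
    longest = max((len(body) for body in rules.values()), default=0)
    for length in range(longest, min_len - 1, -1):
        while True:
            seen, repeated = {}, None
            for name, body in rules.items():
                for i in range(len(body) - length + 1):
                    seg = tuple(body[i:i + length])
                    if seg in seen and seen[seg] != name:
                        repeated = seg          # same sequence found in two rules
                        break
                    seen.setdefault(seg, name)
                if repeated:
                    break
            if not repeated:
                break                           # nothing left to factor at this length
            counter += 1
            new_name = f"R{counter}"            # hypothetical name for the new rule
            rules[new_name] = list(repeated)
            for name in list(rules):
                if name == new_name:
                    continue
                body, i = rules[name], 0
                while i <= len(body) - length:
                    if tuple(body[i:i + length]) == repeated:
                        body[i:i + length] = [new_name]
                    else:
                        i += 1
    return rules
```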
5.1 Applications
Possible applications for any language processing system are many and varied. Grammars produced
from syntax induction are inherently generative to the extent that they can be used to reproduce at
least the set of expressions from which the rules were inferred. This has practical implications for
day-to-day computing with improved data compression techniques, and more esoteric applications
in computer generation of prose and poetry. This kind of grammatical analysis may provide a new
tool for attacking authorship puzzles for anonymous texts, and the use of function word grammars
for semantic-free language processors may have prospects in artificial intelligence. We briefly
outline each possible application in turn.
Text compression
The substitution and decomposition procedures uncover a tremendous amount of similarity within
the expressions of a text. These similarities reflect general syntactic structures characterized as a
context-free grammar. If we express the original text of Far From the Madding Crowd as a grammar
such that each sentence is equated with a production rule, then the entire text requires 7281 rules to
describe its 7282 sentences ("I must go." is the only duplicate sentence), with each rule averaging
19.31 symbols (i.e. words) in length. The same text can be expressed by 8801 fw-phrase structures
with an average length of 4.46 symbols, and although the grammar generalization stage creates
about 1650 new rules, the rules' average length decreases significantly, to just under 3 symbols.
The number of rules and symbols per rule in the various grammars is summarized in Table 9. The
total size of the grammar in symbols is the product of these two quantities. It seems likely that
the generalizations captured by these grammars can be used to compress the text through standard
encoding techniques [7], and this possibility is presently being investigated.
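As a rough size comparison, using the figures quoted above (this arithmetic is ours, and ignores
any subsequent entropy coding):

$$
7281 \times 19.31 \approx 140{,}600 \text{ symbols}
\qquad\text{versus}\qquad
8801 \times 4.46 \approx 39{,}300 \text{ symbols},
$$

with the roughly 1650 rules created by generalization, averaging just under 3 symbols each,
adding only a few thousand more.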
An soothingly were perceived miss laid of the hour. It hope what which have brought of
accidenL And gloves to such stream and in the herself and the board inexpressibly stirred of two
and inflamed any liddy. He reach window of such juno. I has good the people plainly cajolery
for mossy the little whistling to crack about frankly and tarried of a with bis cbrisunas ingenuity
you must keep to the multiplying no her dark try know the omen with the running rest to oldest
girls on some enough to one tartly off all but it health in he leafless on he revealed shivering in
age evil and meeting to of a matter not to not. As stream at coggan and a winter in the boys.
From at his high two fog water.
Table 10: Text generated randomly from the grammar for Far From the Madding Crowd
Text generation
There has been much interest over the years in the "creative computer," using programs to create
prose, poetry, and other forms of literature [34, 29]. One of the key problems in this area is the
immense amount of labor required to develop a system to create text in a particular genre. The ability
to infer a grammar from a given text and then use it for generation opens up new possibilities for
the automatic writing of text within a particular genre. Table 10 shows a sample of text generated
randomly from the grammar inferred from Far From the Madding Crowd. We find the quality
of this extract disappointing, although, to be fair, this is characteristic of the text generated from
compression schemes in general [37]. It indicates that the system in its present state has not been
successful in capturing the essence of Hardy's grammar. We plan to investigate this deficiency and,
if possible, remedy it. Studying the shortcomings of randomly-generated text is an excellent device
for focussing attention on the quality of the grammar that is inferred.
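Generation of this kind is easy to sketch: expand the start symbol by choosing productions at
random. The example below is our own illustration over a toy grammar, not the grammar inferred
from Hardy.

```python
import random

def generate(grammar, symbol="S", max_depth=20):
    """Randomly expand `symbol` using a context-free grammar.

    grammar : dict mapping non-terminals to lists of alternative right-hand sides,
              each right-hand side being a list of symbols.
    Any symbol that is not a key of the grammar is treated as a terminal word.
    """
    if symbol not in grammar or max_depth == 0:
        return [symbol]
    words = []
    for s in random.choice(grammar[symbol]):
        words.extend(generate(grammar, s, max_depth - 1))
    return words

# Toy grammar (hypothetical): fw-phrases headed by function words.
toy = {
    "S":   [["FP1", "FP2"]],
    "FP1": [["the", "CW1"], ["a", "CW1"]],
    "FP2": [["of", "the", "CW1"], ["was", "CW2"]],
    "CW1": [["shepherd"], ["whale"], ["morning"]],
    "CW2": [["running"], ["shivering"]],
}
print(" ".join(generate(toy)))
```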
Authorship analysis
Statistical techniques have often been employed to identify authors of anonymous texts, or to
challenge authorship claims [11, 18]. O'Donnell [31] outlines statistical analysis of sentence length,
vocabulary size, distribution of sentence complexity, and other "stylistic variables" to evaluate the
proposal that Thackeray and Dickens were one and the same author, and similarly for Shakespeare
and Marlowe. Grammatical inference allows such analyses to examine the more microscopic details
of sentence structure.
Recently, law enforcement agencies have begun to use statement analysis as a field tool for
interrogation [13]. The technique statically examines the use of determiners, connectives, tense,
and possessive pronouns to evaluate the sincerity of witness statements and to provide indications
for further questioning. The method is based upon conjectures of an indissoluble relationship
between language and thought.
DP-Theory
Chomsky incorporated X-bar into his theory of grammar to capture cross-categorial generalizations
that are true of natural language phrase structures. A satisfactory exposition of X-bar is well beyond
the scope of this paper, but we can for our purposes express a crude summary by the following
formulation:
XP ⇒ {COMP | SPEC} X {COMP | SPEC}.
In X-bar, all hierarchical substructures of linguistic expression are labelled as head or non-head
nodes, where non-head nodes are constrained by thematic projections from their respective head
nodes. For English, the following formulation generalizes XP for verb phrases and prepositional
phrases:
XP ⇒ X NP PP* (S̄).
For verb phrases, such as invited the man with the bald head, and prepositional phrases, such
as down the street, the head is phrase-initial and the complement structure is subject to selection
(generally rightward) by, among other things, a thematic relation to the head-a relation projected
as a lexical property [27]. That is, the structural head (the X) is also the semantic head-it is the
lexical source of the descriptive content within the structure.
Unlike other structures in English, the head of a noun phrase occurs in the final position, as in
the little brown fox. The fact that determiners appear exclusively in noun phrases suggests that there
is selection between the noun and determiner. But if selection in English is generally rightward,
one must assume that the determiner selects the noun.
The desire for uniform treatment of the nominal system within X-bar has contributed to the
development of DP-Theory-a formalization of the view taken by Fukui and Speas [21] and argued
for extensively by Abney [1, 2] that selection is functional.
In DP-Theory, the determiner is the functional head of the noun phrase. "Its function is to specify
the reference of the phrase. The noun provides a predicate, and the determiner picks out a particular
member of that predicate's extension" ([1], page 3). In the verbal system, DP-Theory maintains
that tense, or inflection, is the functional head of the verb phrase. Tense locates a particular event
in time from the class of events predicated by the verb.
Like DP-Theory, the function word approach to grammar induction uses functional elements to
indicate the onset of a new phrase type, and generalizes phrase structure as a rightward continuation
from that functional head.
Psycholinguistics
The peculiar properties exhibited by function words indicate that they receive a rather different
treatment in cognitive language processes than do content words. Their late appearance in the
productive vocabulary during first language acquisition seems to imply that the use of function
words involves inferring somewhat more abstract grammatical knowledge than that required by
other lexical items.
Psycholinguistic research on aphasics, individuals who suffer from language processing deficits
due to brain damage, indicates that the function word vocabulary may exist as a separate mental
lexicon from the rest of an individual's vocabulary. Goldstein [26] noted that one class of aphasics,
those suffering from a condition known as agrammatism, demonstrate selective impairment in using
one class of vocabulary elements-the "little words," or "grammatical words." Kean [28] further
noted that the omission of function words in agrammatism is often accompanied by inflectional
omissions-a characterization confirmed by Badecker and Caramazza [5].
Some slips-of-the-tongue by non-aphasics reveal the possibility that functional elements are
composed into syntactic structures prior to the insertion of any major lexical items. Word exchanges,
such as he is planting the garden in the flowers, and "stranding" errors, such as he is schooling to
go, were among the corpus of thousands of naturally-occurring speech errors that led Garrett [22]
to develop a psychological model of sentence production wherein functional elements establish
sentence form.
Garrett's model describes a sentence planning process in which the choice and location of
function words and inflectional morphemes are determined apart from processes that determine what
content words are to appear. The model indicates that the syntactic level of sentence production
consists of functional representations similar to those shown in Figure 4. Though the semantic
intention of the production may influence which representation is to be selected, the semantic
elements are inserted after the basic syntactic structure has been established.
The extent of psycholinguistic evidence that indicates a special status for functional elements
in inflectional languages warrants a sincere effort to incorporate this notion into a computational
account. The research presented in this paper represents such an effort.
6 Acknowledgments
This research has been supported by the Natural Sciences and Engineering Research Council of
Canada. We gratefully acknowledge the help of Ingrid Rinsma in locating approximations to the
hypergeometric distribution.
References
[1] Steven Abney. Functional elements and licensing. Presented to GLOW, Gerona, Spain, April
1986.
[2] Steven Abney. The Noun Phrase in its Sentential Aspect. PhD thesis, MIT, 1987. Unpublished.
[3] D. Angluin. Inductive inference of formal languages from positive data. Information and
Control, 45:117-135, 1980.
[4] Emmon Bach. An extension of classical transformational grammar. In Problems in Linguistic
Metatheory: Proceedings of the 1976 Conference at Michigan State University, pages 183-224,
1976.
[5] B. Badecker and A. Caramazza. On consideration of method and theory governing the
uses of clinical categories in neurolinguistics and cognitive psychology: the case against
agrammatism. Cognition, 20:97- 125, 1985.
[8] R. C. Berwick. The acquisition of syntactic knowledge. MIT Press, Cambridge, Mass, 1986.
[9] R. C. Berwick and S. Pilato. Learning syntax by automata induction. Machine Learning,
2(1):9-38, 1987.
[12] D. Caplan. Neurolinguistics and Linguistic Aphasiology. Cambridge University Press, Cam-
bridge, 1987.
[13] Sgt. Robert Chamberlain. private communication. RCMP Serious Crimes Division, Prince
George, B.C., Canada, April 1990.
[14] Noam Chomsky. Aspects of the Theory of Syntax. MIT Press, Cambridge, Mass., 1965.
[15] Noam Chomsky. Lectures on Government and Binding. Foris Publications, Dordrecht, 1981.
[16] V. J. Cook. Chomsky's Universal Grammar. Basil Blackwell Ltd., Oxford, England, 1988.
[17] Hamish Dewar, Paul Bratley, and James Peter Thorne. A program for the syntactic analysis
of English. Communications of the ACM, 12(8):476-479, August 1969.
[18] Alvar A. Ellegård. A Statistical Method for Determining Authorship. Göteborg, Sweden,
1962.
[19] Ann Farmer. Modularity in Syntax. MIT Press, Cambridge, Mass., 1984.
[20] J. A. Feldman. Some decidability results on grammatical inference and complexity. AI Memo
93.1, Computer Science Dept., Stanford University, Stanford, California, 1970.
[21] Naoki Fukui and Peggy Speas. Specifiers and projection. MIT Working Papers in Linguistics,
8:128-172, 1986.
[22] M. F. Garrett. Syntactic processes in sentence production. In R. Wales and E. Walker, editors,
New Approaches to Language Mechanisms. North-Holland, Amsterdam, 1976.
[23] M. F. Garrett. The organization of processing structure for language production. In D. Caplan,
A. R. Lecours, and A. Smith, editors, Biological Perspectives on Language. MIT Press,
Cambridge, Mass., 1984.
[24] Gerald Gazdar, Ewan Klein, Geoffrey Pullum, and Ivan Sag. Generalized Phrase Structure
Grammar. Basil Blackwell, Oxford, UK, 1985.
[25] N. Geschwind. The paradoxical position of Kurt Goldstein in the history of aphasia. Cortex,
1:214-224, 1964.
[26] K. Goldstein. Language and Language Disturbances. Grune & Stratton, New York, 1948.
[29] K. McKeown. Discourse strategies for generating natural language text. Artificial Intelligence,
27:1-42, 1985.
[30] Richard Montague. Formal philosophy. In R. H. Thomason, editor, Selected Papers of Richard
Montague. Yale University Press, New Haven, CT, 1974.
[31] Bernard O'Donnell. An Analysis of Prose Style to Determine Authorship. Mouton & Company,
The Netherlands, 1970.
[32] William O'Grady and Michael Dobrovolsky, editors. Contemporary Linguistic Analysis. Copp
Clark Pitman Ltd., Toronto, 1987.
[33] T. W. Pao and J. W. Carr. A solution of the syntactical induction-inference problem for regular
languages. Computer Languages, 3:53-64, 1978.
[34] Tony C. Smith and Ian H. Witten. A planning mechanism for text generation. Literary &
Linguistic Computing, 6(2):119-126, 1991.
[35] R. Solomonoff. A new method for discovering the grammars of phrase structure languages.
Information Processing, pages 258-290, June 1959.
[36] Timothy Stowell. Subjects across categories. The Linguistic Review, 2:285- 312, 1983.
[37] I. H. Witten and T. C. Bell. Source models for natural language text. International Journal of
Man-Machine Studies, 32(5):545-579, May 1990.