Unsupervised Learning of the Morphology of a Natural Language
John Goldsmith*
University of Chicago
This study reports the results of using minimum description length (MDL) analysis to model
unsupervised learning of the morphological segmentation of European languages, using corpora
ranging in size from 5,000 words to 500,000 words. We develop a set of heuristics that rapidly
develop a probabilistic morphological grammar, and use MDL as our primary tool to determine
whether the modifications proposed by the heuristics will be adopted or not. The resulting grammar
matches well the analysis that would be developed by a human morphologist.
In the final section, we discuss the relationship of this style of MDL grammatical analysis to
the notion of evaluation metric in early generative grammar.
1. Introduction
* Department of Linguistics, University of Chicago, 1010 E. 59th Street, Chicago, IL 60637. E-mail:
[email protected].
1 Some of the work reported here was done while I was a visitor at Microsoft Research in the winter of
1998, and I am grateful for the support I received there. A first version was written in September, 1998,
and a much-revised version was completed in December, 1999. This work was also supported in part
by a grant from the Argonne National Laboratory-University of Chicago consortium, which I thank for
its support. I am also grateful for helpful discussion of this material with a number of people,
including Carl de Marcken, Jason Eisner, Zhiyi Chi, Derrick Higgins, Jorma Rissanen, Janos Simon,
Svetlana Soglasnova, Hisami Suzuki, and Jessie Pinkham. As noted below, I owe a great deal to the
remarkable work reported in de Marcken's dissertation, without which I would not have undertaken
the work described here. I am grateful as well to several anonymous reviewers for their considerable
improvements to the content of this paper.
2 Sylvain Neuvel has recently produced an interesting computational implementation of a theory of
morphology that does not have a place for morphemes, as described at http://www.neuvel.net. It is
well established that nonconcatenative morphology is found in some scattered language families,
notably Semitic and Penutian. African tone languages require simultaneous morphological analyses of
the tonal and the segmental material.
3 But see the following note.
The program in question takes a text file as its input (typically in the range of 5,000
to 1,000,000 words) and produces a partial morphological analysis of most of the words
of the corpus; the goal is to produce an output that matches as closely as possible the
analysis that would be given by a human morphologist. It performs unsupervised
learning in the sense that the program's sole input is the corpus; we provide the
program with the tools to analyze, but no dictionary and no morphological rules
particular to any specific language. At present, the goal of the program is restricted to
providing the correct analysis of words into component pieces (morphemes), though
with only a rudimentary categorical labeling.
The underlying model that is utilized invokes the principles of the minimum
description length (MDL) framework (Rissanen 1989), which provides a helpful per-
spective for understanding the goals of traditional linguistic analysis. MDL focuses
on the analysis of a corpus of data that is optimal by virtue of providing both the
most compact representation of the data and the most compact means of extracting
that compression from the original data. It thus requires both a quantitative account
whose parameters match the original corpus reasonably well (in order to provide
the basis for a satisfactory compression) and a spare, elegant account of the overall
structure.
The novelty of the present account lies in the use of simple statements of mor-
phological patterns (called signatures below), which aid both in quantifying the MDL
account and in constructively building a satisfactory morphological grammar (for MDL
offers no guidance in the task of seeking the optimal analysis). In addition, the system
whose development is described here sets reasonably high goals: the reformulation in
algorithmic terms of the strategies of analysis used by traditional morphologists.
Developing an unsupervised learner using raw text data as its sole input offers
several attractive aspects, both theoretical and practical. At its most theoretical, un-
supervised learning constitutes a (partial) linguistic theory, producing a completely
explicit relationship between data and analysis of that data. A tradition of consider-
able age in linguistic theory sees the ultimate justification of an analysis A of any single
language L as residing in the possibility of demonstrating that analysis A derives from
a particular linguistic theory LT, and that that LT works properly across a range of
languages (not just for language L). There can be no better way to make the case that
a particular analysis derives from a particular theory than to automate that process,
so that all the linguist has to do is to develop the theory-as-computer-algorithm; the
application of the theory to a particular language is carried out with no surreptitious
help.
From a practical point of view, the development of a fully automated morphology
generator would be of considerable interest, since we still need good morphologies
of many European languages, and producing a morphology of a given language "by
hand" can take weeks or months. With the advent of considerable historical text
available on-line (such as the ARTFL database of historical French), it is of great
interest to develop morphologies of particular stages of a language, and the process
of automatic morphology writing can considerably simplify this stage, where there
are no native speakers available.
A third motivation for this project is that it can serve as an excellent preparatory
phase (in other words, a bootstrapping phase) for an unsupervised grammar acqui-
sition system. As we will see, a significant proportion of the words in a large corpus
can be assigned to categories, though the labels that are assigned by the morpholog-
ical analysis are corpus internal; nonetheless, the assignment of words into distinct
morphologically motivated categories can be of great service to a syntax acquisition
device.
Goldsmith Unsupervised Learning of the Morphology of a Natural Language
Table 1
Some signatures from Tom Sawyer.

Signature       Example                              Stem Count (type)   Token Count
NULL.ed.ing     betray betrayed betraying                           69           864
NULL.ed.ing.s   remain remained remaining remains                   14           516
NULL.s          cow cows                                           253         3,414
e.ed.es.ing     notice noticed notices noticing                      4            62
The problem, then, involves both the determination of the correct morphological
split for individual words, and the establishment of accurate categories of stems based
on the range of suffixes that they accept:
To give a sense of the results of the program, consider one aspect of its analysis
of the novel The Adventures of Tom Sawyer--and this result is consistent, by and large,
regardless of the corpus one chooses. Consider the top-ranked signatures, illustrated
in Table 1: a signature is an alphabetized list of affixes that appear with a particular
stem in a corpus. (A larger list of these patterns of suffixation in English is given in
Table 2, in Section 5.)
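The definition of a signature can be made concrete with a small sketch. It assumes that words have already been split into (stem, suffix) pairs by some earlier step; the function name `signatures` and the toy input are illustrative and are not taken from Linguistica.

```python
from collections import defaultdict

def signatures(splits):
    """Group stems by the alphabetized list of suffixes they occur with.

    `splits` is an iterable of (stem, suffix) pairs; an empty suffix is
    written NULL, as in Table 1.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in splits:
        suffixes_of[stem].add(suffix or "NULL")

    stems_of = defaultdict(set)
    for stem, sufs in suffixes_of.items():
        # sorted() uses code-point order, so the uppercase NULL comes first.
        stems_of[".".join(sorted(sufs))].add(stem)
    return dict(stems_of)

# Toy input (hypothetical splits, chosen to echo rows of Table 1).
splits = [
    ("betray", ""), ("betray", "ed"), ("betray", "ing"),
    ("cow", ""), ("cow", "s"),
    ("notic", "e"), ("notic", "ed"), ("notic", "es"), ("notic", "ing"),
]
print(signatures(splits))
```

Counting the stems under each key gives the stem (type) count of Table 1; token counts would additionally require the corpus frequency of each word.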
The present morphology learning algorithm is contained in a C++ program called
Linguistica that runs on a desktop PC and takes a text file as its input. 5 Analyzing a
4 In addition, one would like a statement of general rules of allomorphy as well; for example, a
statement that the stems hit and hitt (as in hits and hitting, respectively) are forms of the same linguistic
stem. In an earlier version of this paper, we discussed a practical method for achieving this. The work
is currently under considerable revision, and we will leave the reporting on this aspect of the problem
to a later paper; there is a very brief discussion below.
5 The executable is available at http://humanities.uchicago.edu/faculty/goldsmith/Linguistica2000,
along with instructions for use. The functions described in this paper can be incrementally applied to a
corpus by the user of Linguistica.
Computational Linguistics Volume 27, Number 2
corpus of 500,000 words in English requires about five minutes on a Pentium II 333.
Perfectly respectable results can be obtained from corpora as small as 5,000 words.
The system has been tested on corpora in English, French, German, Spanish, Italian,
Dutch, Latin, and Russian; some quantitative results are reported below. The corpora
that serve as its input are largely materials that have been obtained over the Internet,
and I have endeavored to make no editorial changes to the files that are the input.
In this paper, I will discuss prior work in this area (Section 2), the nature of the
MDL model we propose (Section 3), heuristics for the task of the initial splitting of
words into stem and affix (Section 4), the resulting signatures (Section 5), use of MDL
to search the space of morphologies (Section 6), results (Section 7), the identification
of entirely spurious generalizations (Section 8), the grouping of signatures into larger
units (Section 9), and directions for further improvements (Section 10). Finally, I will
offer some speculative observations about the larger perspective that this work sug-
gests and work in progress (Section 11).
The task of automatic word analysis has intrigued workers in a range of disciplines,
and the practical and theoretical goals that have driven them have varied consider-
ably. Some, like Zellig Harris (and the present writer), view the task as an essential
one in defining the nature of the linguistic analysis. But workers in the area of data
compression, dictionary construction, and information retrieval have all contributed
to the literature on automatic morphological analysis. (As noted earlier, our primary
concern here is with morphology and not with regular allomorphy or morphophonol-
ogy, which is the study of the changes in the realization of a given morpheme that
are dependent on the grammatical context in which it appears, an area occasionally
confused for morphology. Several researchers have explored the morphophonologies
of natural language in the context of two-level systems in the style of the model de-
veloped by Kimmo Koskenniemi [1983], Lauri Karttunen [1993], and others.) The only
general review of work in this area that I am aware of is found in Langer (1991), which
is ten years old and unpublished.
Work in automatic morphological analysis can be usefully divided into four major
approaches. The first approach proposes to identify morpheme boundaries first, and
thus indirectly to identify morphemes, on the basis of the degree of predictability of the
n + 1st letter given the first n letters (or the mirror-image measure). This was first pro-
posed by Zellig Harris (1955, 1967), and further developed by others, notably by Hafer
and Weiss (1974). The second approach seeks to identify bigrams (and trigrams) that
have a high likelihood of being morpheme internal, a view pursued in work discussed
below by Klenk, Langer, and others. The third approach focuses on the discovery of
patterns (we might say, of rules) of phonological relationships between pairs of related
words. The fourth approach, which includes that used in this paper, is top-down, and
seeks an analysis that is globally most concise. In this section, we shall review some
of the work that has pursued these approaches--briefly, necessarily. 6 While not all
of the approaches discussed here use no prior language-particular knowledge (which
is the goal of the present system), I exclude from discussions those systems that are
based essentially on a prior human-designed analysis of the grammatical morphemes
of a language, aiming at identifying the stem(s) and the correct parsing; such is the
6 Another effort is that attributed to Andreev (1965) and discussed in Altmann and Lehfeldt (1980),
especially p. 195 and following, though their description does not facilitate establishing a comparison
with the present approach.
case, for example, in Pacak and Pratt (1976), Koch, Küstner, and Rüdiger (1989), and
Wothke and Schmidt (1992). With the exception of Harris's algorithm, the complexity
of the algorithms is such as to make implementation for purposes of comparison
prohibitively time-consuming.
At the heart of the first approach, due to Harris, is the desire to place boundaries
between letters (respectively, phonemes) in a word based on conditional entropy, in
the following sense. We construct a device that generates a finite list of words, our
corpus, letter by letter and with uniform probability, in such a way that at any point
in its generation (having generated the first n letters $l_1 l_2 l_3 \cdots l_n$) we can
inquire of it what the entropy is of the set consisting of the next letter of all the
continuations it might make. (In current parlance, we would most naturally think of
this as a path from the root of a trie to one of its terminals, inquiring of each node its
associated one-letter entropy, based on the continuations from that node.) Let us refer
to this as the prefix conditional entropy; clearly we may be equally interested in
constructing a trie from the right edge of words, which then provides us with a suffix
conditional entropy, in mirror-image fashion.
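A minimal sketch of the prefix conditional entropy just described, assuming a plain word list in which every word is equally likely. A dictionary keyed by prefix stands in for the trie, and the end-of-word marker `END` and the function names are assumptions introduced for illustration.

```python
import math
from collections import Counter, defaultdict

END = "#"  # hypothetical end-of-word marker, so "word ends here" is a continuation

def successor_counts(words):
    """Map each prefix to a Counter of the letters that follow it in the word list."""
    nxt = defaultdict(Counter)
    for w in words:
        padded = w + END
        for n in range(len(padded)):
            nxt[padded[:n]][padded[n]] += 1
    return nxt

def prefix_conditional_entropy(words, prefix):
    """Entropy, in bits, of the next letter after `prefix`."""
    counts = successor_counts(words)[prefix]
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

words = ["walk", "walks", "walked", "walking"]
print(prefix_conditional_entropy(words, "wal"))   # only one continuation
print(prefix_conditional_entropy(words, "walk"))  # four equally likely continuations
```

The suffix conditional entropy can be obtained from the same functions by reversing every word (and the query string) first.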
Harris himself employed no probabilistic notions, and the inclusion of entropy
in the formulation had to await Hafer and Weiss (1974); but allowing ourselves the
anachronism, we may say that Harris proposed that local peaks of prefix (and suffix)
conditional entropy should identify morpheme breaks. The method proposed in Harris
(1955) appealed to what today we would call an oracle for information about the
language under scrutiny, but in his 1967 article, Harris implemented a similar procedure
on a computer and a fixed corpus, restricting his problem to that of finding morpheme
boundaries within words. Harris's method is quite good as a heuristic for finding a
good set of candidate morphemes, comparable in quality to the mutual information-based
heuristic that I have used, and which I describe below. It has the same problem
that good heuristics frequently have: it has many inaccuracies, and it does not lend
itself to a next step, a qualitatively more reliable approximation of the correct solution. 7
Hafer and Weiss (1974) explore in detail various ways of clarifying and improving
on Harris's algorithm while remaining faithful to the original intent. A brief summary
does not do justice to their fascinating discussion, but for our purposes, their results
confirm the character of the Harrisian test as heuristic: with Harris's proposal, a
quantitative measure is proposed (and Hafer and Weiss develop a range of 15 different
measures, all of them rooted in Harris's proposal), and best results for morphological
analysis are obtained in some cases by seeking a local maximum of prefix conditional
entropy, in others by seeking a value above a threshold, and in yet others, good results
are obtained only when this measure is paired with a similar measure constructed in
mirror-image fashion from the end of the word--and then some arbitrary thresholds
are selected which yield the best results. While no single method emerges as the best,
one of the best yields precision of 0.91 and recall of 0.61 on a corpus of approximately
6,200 word types. (Precision here indicates the proportion of predicted morpheme breaks
that are correct, and recall denotes the proportion of correct breaks that are predicted.)
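The two measures defined in the parenthesis above can be computed as follows. This is a sketch only: representing a break as a (word, position) pair is an assumption made here, and the toy data are invented rather than drawn from Hafer and Weiss.

```python
def precision_recall(predicted, gold):
    """Precision and recall of predicted morpheme breaks.

    Both arguments are sets of (word, position) pairs, where position is
    the index at which the word is cut.
    """
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Invented example: three predicted breaks, four true breaks, two in common.
predicted = {("walked", 4), ("cats", 3), ("string", 1)}
gold = {("walked", 4), ("cats", 3), ("jumping", 4), ("dogs", 3)}
print(precision_recall(predicted, gold))
```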
The second approach that can be found in the literature is based on the hypothesis
that local information in the string of letters (respectively, phonemes) is sufficient to
identify morpheme boundaries. This hypothesis would be clearly correct if all
morpheme boundaries were between pairs of letters $l_1$-$l_2$ that never occur in that sequence
7 But Harris's method does lend itself to a generalization to more difficult cases of morphological
analysis going beyond the scope of the present paper. In work in progress, we have used minimization
of mutual information between successive candidate morphemes as part of a heuristic for preferring a
morphological analysis in languages with a large number of suffixes per word.
The third approach focuses on the discovery of patterns explicating the overt
shapes of related forms in a paradigm. Džeroski and Erjavec (1997) report on work
that they have done on Slovene, a South Slavic language with a complex morphology,
in the context of a similar project. Their goal essentially was to see if an inductive
logic program could infer the principles of Slovene morphology to the point where
it could correctly predict the nominative singular form of a word if it were given an
oblique (nonnominative) form. Their project apparently shares with the present one
the requirement that the automatic learning algorithm be responsible for the decision
as to which letters constitute the stem and which are part of the suffix(es), though the
details offered by Džeroski and Erjavec are sketchy as to how this is accomplished.
In any event, they present their learning algorithm with a labeled pair of words--a
base form and an inflected form. It is not clear from their description whether the
base form that they supply is a surface form from a particular point in the inflectional
paradigm (the nominative singular), or a more articulated underlying representation
in a generative linguistic sense; the former appears to be their policy.
Džeroski and Erjavec's goal is the development of rules couched in traditional
linguistic terms; the categories of analysis are decided upon ahead of time by the
programmer (or, more specifically, by the tagger of the corpus), and each individual
word is identified with regard to what morphosyntactic features it bears. The form
bolecina is marked, for example, as a feminine noun singular genitive. In sum, their
project thus gives the system a good deal more information than the present project
does. 9
Two recent papers, Jacquemin (1997) and Gaussier (1999), deserve consideration
here. 10 Gaussier (1999) approaches a very similar task to that which we consider, and
takes some similar steps. His goal is to acquire derivational rules from an inflectional
lexicon, thus ensuring that his algorithm has access to the lexical category of the words
it deals with (unlike the present study, which is allowed no such access). Using the
terminology of the present paper, Gaussier considers candidate suffixes if they appear
with at least two stems of length 5. His first task is (in our terms) to infer paradigms
from signatures (see Section 9), which is to say, to find appropriate clusters of
signatures. One example cited is depart, departure, departer. He used a hierarchical
agglomerative clustering method, which begins with all signatures forming distinct
clusters, and successively collapses the two most similar clusters, where similarity
between stems is defined as the number of suffixes that two stems share, and similarity
between clusters is defined as the similarity between the two least similar stems in the
respective clusters. He reports a success rate of 77%, but it is not clear how to evaluate
this figure. 11
The task that Gaussier addresses is defined from the start to be that of derivational
morphology, and because of that, his analysis does not need to address the problem of
inflectional morphology, but it is there (front and center, so to speak) that the difficult
clustering problem arises, which is how to ensure that the signature NULL.s.'s (for
nouns in English) and the signature NULL.ed.s (or NULL.ed.ing.s) are not assigned to
a single cluster. 12 That is, in English both nouns and verbs freely occur with the suffixes
9 Baroni (2000) reported success using an MDL-based model in the task of discovering English prefixes. I
have not had access to further details of the operation of the system.
10 I am grateful to a referee for drawing my attention to these papers.
11 The analysis of a word w in cluster C counts as a success if most of the words that in fact are related to
w also appear in the cluster C, and if the cluster "comprised in majority words of the derivational
family of w." I am not certain how to interpret this latter condition; it means perhaps that more than
half of the words in C contain suffixes shared by forms related to w.
12 In traditional terms, inflectional morphology is responsible for marking different forms of the same
lexical item (lemma), while derivational morphology is responsible for the changes in form between
distinct but morphologically related lexical items (lemmas).
NULL and -s, and while -ed and -'s disambiguate the two cases, it is very difficult to
find a statistical and morphological basis for this knowledge. 13
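The clustering procedure attributed to Gaussier above (merge the two most similar clusters, with cluster similarity set by the two least similar stems, that is, complete linkage) can be sketched as follows. Several details are assumptions made for illustration: the sketch starts from individual stems rather than signatures, and the stopping threshold `min_similarity` is hypothetical, since the description above gives no stopping criterion.

```python
from itertools import combinations

def stem_similarity(a, b, suffixes):
    """Number of suffixes that stems a and b share."""
    return len(suffixes[a] & suffixes[b])

def cluster_similarity(c1, c2, suffixes):
    """Similarity of the two least similar stems across the clusters."""
    return min(stem_similarity(a, b, suffixes) for a in c1 for b in c2)

def agglomerate(suffixes, min_similarity=1):
    """Repeatedly merge the most similar pair of clusters.

    `suffixes` maps each stem to the set of suffixes it takes. Merging
    stops when the best pair falls below `min_similarity` (an assumed
    stopping rule).
    """
    clusters = [frozenset([stem]) for stem in sorted(suffixes)]
    while len(clusters) > 1:
        best = max(combinations(clusters, 2),
                   key=lambda pair: cluster_similarity(pair[0], pair[1], suffixes))
        if cluster_similarity(best[0], best[1], suffixes) < min_similarity:
            break
        clusters = [c for c in clusters if c not in best] + [best[0] | best[1]]
    return set(clusters)

# Invented toy data.
toy = {
    "depart": {"NULL", "ure", "er"},
    "press": {"NULL", "ure"},
    "cow": {"s"},
    "dog": {"s", "'s"},
}
print(agglomerate(toy))
```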
Jacquemin (1997) explores an additional source of evidence regarding clustering of
hypothesized segmentation of words into stems and suffixes; he notes that the hypoth-
esis that there is a common stem gen in gene and genetic, and a common stem express
in expression and expressed, is supported by the existence of small windows in corpora
containing the word pair genetic...expression and the word pair gene...expressed (as
indicated, the words need not be adjacent in order to provide evidence for the rela-
tionship). As this example suggests, Jacquemin's work is situated within the context
of a desire for superior information retrieval.
In terms of the present study, Jacquemin's algorithm consists of (1) finding sig-
natures with the longest possible stems and (2) establishing pairs of stems that occur
together in two or more windows of length 5 or less. He tests his results on 100 ran-
dom pairs discovered in this fashion, placing upper bounds on the length of the suffix
permitted between one and five letters, and independently varying the length of the
window in question. He does not vary the minimum size of the stem, a consideration
that turns out to be quite important in Germanic languages, though less so in Ro-
mance languages. He finds that precision varies from 97% when suffixes are limited
to a length of one letter, to 64% when suffixes may be five letters long, with both
figures assuming an adjacency window of two words; precision falls to 15% when a
window of four words is permitted.
Jacquemin also employs the term signature in a sense not entirely dissimilar to
that employed in the present paper, referring to the structured set of four suffixes
that appear in the two windows (in the case above, the suffixes are -ion, -ed; NULL,
-tic). He notes that incorrect signatures arise in a large number of cases (e.g., good:
optical control ~ optimal control; adoptive transfer ~ adoptively transfer, paralleled by bad:
ear disease ~ early disease), and suggests a quality function along the following lines:
Stems are linked in pairs (adopt-transfer, ear-disease); compute then the average length
of the shorter stem in each pair (that is, create a set of the shorter member of each
pair, and find the average length of that set). The quality function is defined as that
average divided by the length of the largest suffix in the signature; reject any signature
class for which that ratio is less than 1.0. This formula, and the threshold, is purely
empirical, in the sense that there is no larger perspective that bears on determining
the appropriateness of the formula, or the values of the parameters.
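The quality function described above is simple enough to state in a few lines. The sketch below follows the verbal description only; the function names, the representation of NULL as an empty string, and the handling of an all-NULL signature are assumptions, and the second example is invented.

```python
def signature_quality(stem_pairs, suffixes):
    """Average length of the shorter stem in each pair, divided by the
    length of the longest suffix in the signature."""
    shorter = [min(len(a), len(b)) for a, b in stem_pairs]
    average_shorter = sum(shorter) / len(shorter)
    longest_suffix = max(len(s) for s in suffixes)
    if longest_suffix == 0:
        return float("inf")  # assumption: an all-NULL signature is never rejected
    return average_shorter / longest_suffix

def keep_signature(stem_pairs, suffixes, threshold=1.0):
    """Reject a signature class whose ratio falls below the threshold."""
    return signature_quality(stem_pairs, suffixes) >= threshold

# The gene/express case from the text: suffixes -ion, -ed; NULL, -tic.
print(signature_quality([("gene", "express")], ["ion", "ed", "", "tic"]))
# An invented case with a short stem and a long suffix.
print(keep_signature([("an", "other")], ["", "imal"]))
```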
The strength of this approach, clearly, is its use of information that co-occurrence
in a small window provides regarding semantic relatedness. This allows a more ag-
gressive stance toward suffix identification (e.g., alpha interferon ~ alpha2 interferon).
There can be little question that the type of corpus studied (a large technical medical
corpus, and a list of terms--partially multiword terms) lends itself particularly to this
style of inference, and that similar patterns would be far rarer in unrestricted text such
as Tom Sawyer or the Brown corpus. 14
13 Gaussier also offers a discussion of inference of regular morphophonemics, which we do not treat in
the present paper, and a discussion in a final section of additional analysis, though without test results.
Gaussier aptly calls our attention to the relevance of minimum edit distance relating two potential
allomorphs, and he proposes a probabilistic model based on patterns established between allomorphs.
In work not discussed in this paper, I have explored the integration of minimum edit distance to an
MDL account of allomorphy as well, and will discuss this material in future work.
14 In a final section, Jacquemin considers how his notion of signatures can be extended to identify sets of
related suffixes (e.g., onic/atic/ic--his example). He uses a greedy clustering algorithm to successively
add nonclustered signatures to clusters, in a fashion similar to that of Gaussier (whom Jacquemin thanks
for discussion, and of course Jacquemin's paper preceded Gaussier's paper by two years), using a
metric more complex than the familiar minimum edit distance, but no results are offered in support of
the choice of the additional complexity.
15 I am grateful to Scott Meredith for drawing my attention to this paper.
16 Brent's description of his algorithm is not detailed enough to satisfy the curiosity of someone like the
present writer, who has encountered problems that Brent's approach would seem certain to encounter
equally. As we shall see below, the central practical problem to grapple with is the fact that when
considering suffixes (or candidate suffixes) consisting of only a single letter (let us say, s, for example),
it is extremely difficult to get a good estimate of how many of the potential occurrences (of word-final
s) are suffixal s and how many are not. As we shall suggest towards the end of this paper, the only
accurate way to make an estimate is on the basis of a multinomial estimate once larger suffix
signatures have been established. Without this, it is difficult not to overestimate the frequency of
single-letter suffixes, a result that may often, in my experience, deflect the learning algorithm from
discovering a correct two-letter suffix (e.g., the suffix -al in French).
De Marcken (1995) addresses a similar but distinct task, that of determining the
correct breaking of a continuous stream of segments into distinct words. This prob-
lem has been addressed in the context of Asian (Chinese-Japanese-Korean) languages,
where standard orthography does not include white space between words, and it has
been discussed in the context of language acquisition as well. De Marcken describes
an unsupervised learning algorithm for the development of a lexicon using a minimum
description length framework. He applies the algorithm to a written corpus of
Chinese, as well as to written and spoken corpora of English (the English text has had
the spaces between words removed), and his effort inspired the work reported here.
De Marcken's algorithm begins by taking all individual characters to be the baseline
lexicon, and it successively adds items to the lexicon if the items will be useful in
creating a better compression of the corpus in question, or rather, when the improve-
ment in compression yielded by the addition of a new item to the codebook is greater
than the length (or "cost") associated with the new item in the codebook. In general, a
lexical item of frequency F can be associated with a compressed length of - log F, and
de Marcken's algorithm computes the compressed length of the Viterbi-best parse of
the corpus, where the compressed length of the whole is the sum of the compressed
lengths of the individual words (or hypothesized chunks, we might say) plus that of
the lexicon. In general, the addition of chunks to the lexicon (beginning with such
high-frequency items as th) will improve the compression of the corpus as a whole,
and de Marcken shows that successive iterations add successively larger pieces to the
lexicon. De Marcken's procedure builds in a bottom-up fashion, looking for larger
and larger chunks that are worth (in an MDL sense) assigning the status of dictionary
entries. Thus, if we look at unbroken orthographic texts in English, the two-letter com-
bination th will become the first candidate chosen for lexical status; later, is will achieve
that status too, and soon this will as well. The entry this will not, in effect, point to
its four letters directly, but will rather point to the chunks th and is, which still retain
their status in the lexicon (for their robust integrity is supported by their appearance
throughout the lexicon). The creation of larger constituents will occasionally lead to
the elimination of smaller chunks, but only when the smaller chunk appears almost
always in a single larger unit.
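The trade-off described above, in which a chunk is added only when the compression it buys exceeds its own cost, can be illustrated with a toy calculation. This is not de Marcken's algorithm: the greedy parse replaces his Viterbi-best parse, frequencies are taken as relative frequencies, and the flat per-letter cost of a lexicon entry (`BITS_PER_LETTER`) is an assumption made purely for illustration.

```python
import math
from collections import Counter

BITS_PER_LETTER = math.log2(26)  # assumed flat cost of spelling out a lexicon entry

def corpus_cost(tokens):
    """Compressed length in bits: each token costs -log2 of its relative frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return sum(-math.log2(counts[t] / total) for t in tokens)

def parse_with_chunk(text, chunk):
    """Greedy left-to-right parse that uses `chunk` wherever it occurs."""
    tokens, i = [], 0
    while i < len(text):
        if text.startswith(chunk, i):
            tokens.append(chunk)
            i += len(chunk)
        else:
            tokens.append(text[i])
            i += 1
    return tokens

def gain_from_adding(text, chunk):
    """Bits saved on the corpus minus the cost of the new lexicon entry."""
    before = corpus_cost(list(text))
    after = corpus_cost(parse_with_chunk(text, chunk))
    return before - after - len(chunk) * BITS_PER_LETTER

text = "thethisthatthen" * 20  # unbroken toy text
print(gain_from_adding(text, "th") > 0)  # frequent chunk: worth adding
print(gain_from_adding(text, "zq") > 0)  # unattested chunk: pure cost
```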
An example of an analysis provided by de Marcken's algorithm is given in (1),
taken from de Marcken (1995), in which I have indicated the smallest-level constituent
by placing letters immediately next to one another, and then higher structure with
various pair brackets (parentheses, etc.) for orthographic convenience; there is no
theoretical significance to the difference between "( )" and the other bracket types. De Marcken's analysis
succeeds quite well at identifying words, but does not make any significant effort at
identifying morphemes as such.
([the]{([unit]ed)([stat]es)})(of{ame([ric])a}) (1)
of the lexical items that have been hypothesized to form the lexicon of the corpus.
It would certainly be natural to try using this figure of merit on words in English,
along with the constraint that all words should be divided into exactly two pieces.
Applied straightforwardly, however, this gives uninteresting results: words will always
be divided into two pieces, where one of the pieces is the first or the last letter of
the word, since individual letters are so much more common than morphemes. 17 (I
will refer to this effect as peripheral cutting below.) In addition--and this is less
obvious--the hierarchical character of de Marcken's model of chunking leaves no
place for a qualitative difference between high-frequency "chunks," on the one hand,
and true morphemes, on the other: str is a high-frequency chunk in English (as schl
is in German), but it is not at all a morpheme. The possessive marker 's, on the other
hand, is of relatively low frequency in English, but is clearly a morpheme.
MDL is nonetheless the key to understanding this problem. In the next section,
I will present a brief description of the algorithm used to bootstrap the problem,
one which avoids the trap mentioned briefly in note 21. This provides us with a
set of candidate splittings, and the notion of the signature of the stem becomes the
working tool for determining which of these splits is linguistically significant. MDL
is a framework for evaluating proposed analyses, but it does not provide a set of
heuristics that are nonetheless essential for obtaining candidate analyses, which will
be the subject of the next two sections.
The central idea of minimum description length analysis (Rissanen 1989) is composed
of four parts: first, a model of a set of data assigns a probability distribution to the
sample space from which the data is assumed to be drawn; second, the model can then
be used to assign a compressed length to the data, using familiar information-theoretic
notions; third, the model can itself be assigned a length; and fourth, the optimal
analysis of the data is the one for which the sum of the length of the compressed data
and the length of the model is the smallest. That is, we seek a maximally compact
specification of both the model and the data, simultaneously. Accordingly, we use the
conceptual vocabulary of information theory as it becomes relevant to computing the
length, in bits, of various aspects of the morphology and the data representation.
Goldsmith Unsupervised Learning of the Morphology of a Natural Language
A. Affixes: 6
   1. NULL
   2. ed
   3. ing
   4. s
   5. e
   6. es
B. Stems: 9
   1. cat
   2. dog
   3. hat
   4. John
   5. jump
   6. laugh
   7. sav
   8. the
   9. walk
C. Signatures: 4
   Signature 1:
      Stems:    SimpleStem : ptr(cat)
                SimpleStem : ptr(dog)
                SimpleStem : ptr(hat)
                ComplexStem : ptr(Sig2):ptr(sav) + ptr(ing)
      Suffixes: ptr(NULL), ptr(s)
   Signature 2:
      Stems:    SimpleStem : ptr(sav)
      Suffixes: ptr(e), ptr(es), ptr(ing)
   Signature 3:
      Stems:    SimpleStem : ptr(jump)
                SimpleStem : ptr(laugh)
                SimpleStem : ptr(walk)
      Suffixes: ptr(NULL), ptr(ed), ptr(ing), ptr(s)
   Signature 4:
      Stems:    SimpleStem : ptr(John)
                SimpleStem : ptr(the)
      Suffixes: ptr(NULL)
Figure 2
A sample morphology. This morphology covers the words: cat, cats, dog, dogs, hat, hats, save,
saves, saving, savings, jump, jumped, jumping, jumps, laugh, laughed, laughing, laughs, walk, walked,
walking, walks, the, John.
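The structure of Figure 2 can be made concrete with a small sketch. The encoding below is a hypothetical one chosen for illustration (plain Python dictionaries, with a complex stem written as a tuple); it is not the representation used in the author's implementation. It simply checks that the four signatures generate the words listed in the caption.

```python
# Hypothetical encoding of the sample morphology in Figure 2.
# A simple stem is a string; a complex stem is (signature number, stem, suffix).
SIGNATURES = {
    1: {"stems": ["cat", "dog", "hat", (2, "sav", "ing")],
        "suffixes": ["NULL", "s"]},
    2: {"stems": ["sav"], "suffixes": ["e", "es", "ing"]},
    3: {"stems": ["jump", "laugh", "walk"],
        "suffixes": ["NULL", "ed", "ing", "s"]},
    4: {"stems": ["John", "the"], "suffixes": ["NULL"]},
}


def spell(stem_entry):
    """Spell out a stem entry; a complex stem is its inner stem plus suffix."""
    if isinstance(stem_entry, tuple):
        _signature, stem, suffix = stem_entry
        return stem + suffix
    return stem_entry


def generated_words(signatures):
    """Return every stem + suffix combination licensed by the signatures."""
    words = set()
    for signature in signatures.values():
        for entry in signature["stems"]:
            base = spell(entry)
            for suffix in signature["suffixes"]:
                words.add(base if suffix == "NULL" else base + suffix)
    return words


if __name__ == "__main__":
    print(sorted(generated_words(SIGNATURES)))
```

Note that saving is produced twice (once from Signature 2 and once as the complex stem of Signature 1 with a NULL suffix), so the set contains 24 distinct words.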
Since stem, suffix, and signature all begin with s, we opt for using t to represent
a stem, f to represent a suffix, and $\sigma$ to represent a signature, while the uppercase
T, F, $\Sigma$ represent the sets of stems, suffixes, and signatures, respectively. The number
of members of such a set will be represented $\langle T \rangle$, $\langle F \rangle$, etc., while the number of
occurrences of a stem, suffix, etc., will be represented as [t], [f], etc. The set of all
words in the corpus will be represented as W; hence the length of the corpus is [W],
and the size of the vocabulary is $\langle W \rangle$.
Note the structure of the signatures in Figure 2. Logically a signature consists
of two lists of pointers, one a list of pointers to stems, the other a list of pointers to
suffixes. To specify a list of length N, we must specify at the beginning of the signature
that N items will follow, and this requires just slightly more than $\log_2 N$ bits to do (see
Rissanen [1989, 33-34] for detailed discussion); I will use the notation $\lambda(N)$ to indicate
this function.
A pointer to a stem t, in turn, is of length $-\log \mathrm{prob}(t)$, a basic principle of
information theory (Li and Vitányi 1997). Hence the length of a signature is the sum
of the (inverse) log probabilities of its stems, plus that of its suffixes, plus the number
of bits it takes to specify the number of its stems and suffixes, using the $\lambda$ function.
We will return in a moment to how we determine the probabilities of the stems and
suffixes; looking ahead, it will be the empirical frequency.
Let us consider the length of stem list T. As we have already observed, its length
is $\lambda(\langle T \rangle)$--this is the length of the information specifying how long the list is--plus
the length of each stem specification. In most of our work, we make the assumption
that the length of a stem is the number of letters in it, weighted by the factor $\log_2 26$
converting to binary bits, in a language with 26 letters. 18 The same reasoning holds
for the suffix list F: its length is $\lambda(\langle F \rangle)$ plus the length of each suffix, which we may
take to be the total number of letters in the suffix times $\log_2 26$.
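As a rough illustration of the list-length computation just described, the sketch below adds the cost of announcing the list size to the letter-by-letter cost of each entry. The function names are hypothetical, and lam approximates the function written $\lambda(N)$ by $\log_2 N$; the text says only that the true value is slightly larger.

```python
import math


def lam(n):
    """Approximate cost in bits of announcing a list of n items.

    Assumption: log2(n) is used as a stand-in; the text states only that
    the true cost is slightly more than this.
    """
    return math.log2(n) if n > 0 else 0.0


def typo_length(string, alphabet_size=26):
    """Length of a string in bits: number of letters times log2 of the alphabet size."""
    return len(string) * math.log2(alphabet_size)


def list_length(items, alphabet_size=26):
    """Length in bits of a stem list or suffix list."""
    return lam(len(items)) + sum(typo_length(s, alphabet_size) for s in items)


if __name__ == "__main__":
    print(list_length(["cat", "dog", "hat"]))
```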
We return to the question of how long the pointer (found inside a signature) to a
stem or suffix is. The probability of a stem is its (empirical) frequency, i.e., the total
number of words in the corpus corresponding to the words whose analysis includes
the stem in question; the probability of a suffix is defined in parallel fashion. Using
W to indicate all the words of the corpus, we may say that a pointer to
a stem t is of length
\[ \log \frac{[W]}{[t]}, \]
a pointer to suffix f is of length
\[ \log \frac{[W]}{[f]}, \]
18 This is a reasonable, and convenient, assumption, but it may not be precise enough for all work. A
more refined measure would take the length of a letter to be -1 times the binary log of its frequency.
A still more refined measure would base the probability of a letter on bigram context; this matters for
English, where stem final t is very common. In addition, there is information in the linear order in
which the letters are stored, roughly equal to
\[ \sum_{k=1}^{n} \log_2 k \]
for a string of length n (compare the information that distinguishes the lexical representation of
anagrams). This is an additional consideration in an MDL analysis of morphology pressing in favor of
breaking words into morphemes when possible.
and a pointer to a signature $\sigma$ is of length
\[ \log \frac{[W]}{[\sigma]}. \]
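The pointer lengths above all have the same shape, the log of the corpus token count over the token count of the item pointed to. A minimal sketch follows, assuming base-2 logarithms since lengths are measured in bits; pointer_length is a hypothetical helper name.

```python
import math


def pointer_length(corpus_tokens, item_tokens):
    """Length in bits of a pointer to an item with the given token count.

    Implements log([W] / [x]) with a base-2 logarithm (an assumption
    consistent with measuring lengths in bits).
    """
    return math.log2(corpus_tokens / item_tokens)


if __name__ == "__main__":
    # In a 1,024-token corpus, a stem accounting for 16 tokens costs
    # log2(64) bits to point to; a rarer stem costs more.
    print(pointer_length(1024, 16))
    print(pointer_length(1024, 2))
```

The design consequence is the one the text relies on later: the rarer the item, the longer every pointer to it.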
We have now settled the question of how to determine the length of our initial
model; we next must determine the probability that the model assigns to each word
in the corpus, and armed with that knowledge, we will be able to compute the com-
pressed length of the corpus.
The morphology assigns a probability to each word w as the product of the probability
of w's signature times w's stem, given its signature, and w's suffix, given its
signature: $\mathrm{prob}(w = t + f) = \mathrm{prob}(\sigma)\,\mathrm{prob}(t \mid \sigma)\,\mathrm{prob}(f \mid \sigma)$, where $\sigma$ is the signature
associated with t: $\sigma = \mathrm{sig}(t)$. Thus while stems and suffixes, which are defined
relative to a particular morphological model, are assigned their empirical frequency
as their probability, words are assigned a probability based on the model, one which
will always depart from the empirical frequency. The compression of the corpus is
thus worse than would be a compression based on word frequency alone, 19 or to put
it another way, the morphological analysis in which all words are unanalyzed is the
analysis in which each word is trivially assigned its own empirical frequency (since
the word equals the stem). But this decrease in compression that comes with morphological
analysis is the price willingly paid for not having to enter every distinct word
in the stem list of the morphology.
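Summarizing, the compressed length of the corpus is
\[ \sum_{w \in W} [w] \left[ \log \frac{[W]}{[\sigma(w)]} + \log \frac{[\sigma(w)]}{[\mathrm{stem}(w)]} + \log \frac{[\sigma(w)]}{[\mathrm{suffix}(w) \text{ in } \sigma(w)]} \right], \]
where we have summed over the words in the corpus, and $\sigma(w)$ is the signature to
which word w is assigned. The compressed length of the model is the length of the
stem list, the suffix list, and the signature list. The length in bits of the stem list is
\[ \lambda(\langle T \rangle) + \sum_{t \in \mathrm{Stems}} L_{\mathrm{typo}}(t), \]
and that of the suffix list is
\[ \lambda(\langle F \rangle) + \sum_{f \in \mathrm{Suffixes}} L_{\mathrm{typo}}(f), \]
where $L_{\mathrm{typo}}()$ is the measurement of the length of a string of letters in bits, which we
take to be $\log_2 26$ times the number of letters (but recall note 18). The length of the
signature list is
\[ \lambda(\langle \Sigma \rangle) + \sum_{\sigma \in \mathrm{Signatures}} L(\sigma), \]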
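where $L(\sigma)$ is the length of signature $\sigma$. If the set of stems linked to signature $\sigma$ is
$T(\sigma)$ and the set of suffixes linked to signature $\sigma$ is $F(\sigma)$, then
\[ L(\sigma) = \lambda(\langle T(\sigma) \rangle) + \lambda(\langle F(\sigma) \rangle) + \sum_{t \in T(\sigma)} \log \frac{[W]}{[t]} + \sum_{f \in F(\sigma)} \log \frac{[\sigma]}{[f \text{ in } \sigma]}. \]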
19 Due to the fact that the cross-entropy is always greater than or equal to the entropy.
(The denominator in the last term consists of the token count of words in a particular
signature with the given suffix f, and we will refer to this below more simply as
[f in $\sigma$].)
It is no doubt easy to get lost in the formalism, so it may be helpful to point out
what the contribution of the additional structure accomplishes. We observed above that
the MDL analysis is an elaboration of the insight that the best morphological analysis
of a corpus is obtained by counting the total number of letters in the list of stems and
suffixes according to various analyses, and choosing the analysis for which this sum is
the least (cf. Figure 2). This simple insight fails rapidly when we observe in a language
such as English that there are a large number of verb stems that end in t. Verbs appear
with a null suffix (that is, in bare stem form), and with the suffixes -s, -ed, and -ing. But
once we have 11 stems ending in t, the naive letter-counting approach will judge it a
good idea to create a new set of suffixes: -t, -ted, -ts, and -ting, because those 10 letters
will allow us to remove 11 or more letters from the list of stems. It is the creation of the
lists, notably the signature list, and an information cost which increases as probability
decreases, that overcomes that problem. Creating a new signature may save some
information associated with the stem list in the morphology, but since the length of
pointers to a signature $\sigma$ is $-\log \mathrm{freq}(\sigma)$, the length of the pointers to the signatures
for all of the words in the corpus associated with the old signature (-∅, -ed, -s, -ing) or
the new signature (-t, -ted, -ting, -ts) will be longer than the length of the pointers to a
signature whose token count is the sum of the token count of the two combined, i.e.,
20 In addition, the number of words in a corpus will change if the analysis determines that all
occurrences of (let us say) -ings are to be reanalyzed as complex words, and the stem in question
We may distinguish between those words, like work or working, whose immediate
analysis involves a stem appearing in the stem list (we may call these $W_{\mathrm{SIMPLE}}$) and
those whose analysis, like workings, involves recursive structure (we may call these
$W_{\mathrm{COMPLEX}}$). As we have noted, every stem entry in a signature begins with a flag
indicating which kind of stem it is, and this flag will be of length
\[ \log \frac{[W]}{[W_{\mathrm{SIMPLE}}]} \]
for simple stems, and of length
\[ \log \frac{[W]}{[W_{\mathrm{COMPLEX}}]} \]
for complex stems. We also keep track separately of the total number of words in the
corpus (token count) that are morphologically analyzed, and refer to this set as WA;
this consists of all words except those that are analyzed as having no suffix (see item
(ii) in (2), below).
(i) Case of simple stem: flag of length
\[ \log \frac{[W]}{[W_{\mathrm{SIMPLE}}]} \]
(perhaps work-ing)did not appear independently as a freestanding word in the corpus; we will refer to
these inferred words as being "virtual" words with virtual counts.
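\[ \log \frac{[W]}{[t]}, \]
or
(ii) Case of complex stem: flag of length
\[ \log \frac{[W]}{[W_{\mathrm{COMPLEX}}]} \]
\[ \log \frac{[W]}{[\mathrm{stem}(t)]} + \log \frac{[\sigma]}{[\mathrm{suffix}(t) \text{ in } \sigma]}. \]
(d) a pointer to each suffix, of total length
\[ \sum_{f \in \mathrm{suffixes}(\sigma)} \log \frac{[\sigma]}{[f \text{ in } \sigma]} \]
Compressed length of the corpus:
\[ \sum_{w \in W} [w] \left[ \log \frac{[W]}{[\sigma(w)]} + \log \frac{[\sigma(w)]}{[\mathrm{stem}(w)]} + \log \frac{[\sigma(w)]}{[\mathrm{suffix}(w) \text{ in } \sigma(w)]} \right] \]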
MDL thus provides a figure of merit that we wish to minimize, and we will seek
heuristics that modify the morphological analysis in such a fashion as to decrease this
figure of merit in a large proportion of cases. In any given case, we will accept a
modification to our analysis just in case the description length decreases, and we will
suggest that this strategy coincides with traditional linguistic judgment in all clear
cases.
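To make the corpus term of this figure of merit concrete, the following sketch computes the compressed length of a corpus from word counts and a proposed analysis. It assumes base-2 logarithms, simple (non-recursive) stems only, and each stem belonging to exactly one signature; the function and argument names are hypothetical.

```python
import math
from collections import Counter


def compressed_corpus_length(word_counts, analysis):
    """Sum over words of [w] * (-log2 of the model probability of w).

    word_counts: word -> token count
    analysis:    word -> (stem, suffix, signature)
    """
    total = sum(word_counts.values())   # [W]
    sig_count = Counter()               # [sigma]
    stem_count = Counter()              # [t]
    suffix_in_sig = Counter()           # [f in sigma]
    for word, n in word_counts.items():
        stem, suffix, sig = analysis[word]
        sig_count[sig] += n
        stem_count[stem] += n
        suffix_in_sig[(suffix, sig)] += n

    bits = 0.0
    for word, n in word_counts.items():
        stem, suffix, sig = analysis[word]
        bits += n * (
            math.log2(total / sig_count[sig])
            + math.log2(sig_count[sig] / stem_count[stem])
            + math.log2(sig_count[sig] / suffix_in_sig[(suffix, sig)])
        )
    return bits


if __name__ == "__main__":
    counts = {"jump": 2, "jumps": 2, "walk": 2, "walks": 2}
    parse = {
        "jump": ("jump", "NULL", "NULL.s"),
        "jumps": ("jump", "s", "NULL.s"),
        "walk": ("walk", "NULL", "NULL.s"),
        "walks": ("walk", "s", "NULL.s"),
    }
    print(compressed_corpus_length(counts, parse))
```

In the toy example every token costs one bit for the choice of stem and one bit for the choice of suffix, and nothing for the signature, since there is only one.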
The MDL model designed in the preceding section will be of use only if we can provide
a practical means of creating one or more plausible morphologies for a given corpus.
That is, we need bootstrapping heuristics that enable us to go from a corpus to such
a morphology. As we shall see, it is not in fact difficult to come up with a plausible
initial morphology, but I would like to consider first an approach which, though it
might seem like the most natural one to try, fails, and for an interesting reason.
The problem we wish to solve can be thought of as one suited to an expectation-maximization
(EM) approach (Dempster, Laird, and Rubin 1977). Along such a line,
each word w of length N would be initially conceived of as being analyzed in N
different ways, cutting the word into stem + suffix after i letters, $1 \le i < N$, with each
of these N analyses being assigned probability mass of
\[ \frac{[w]}{N[W]}. \]
That probability mass is then summed over the resulting set of stems and suffixes,
and on successive iterations, each of the N cuts into stem + suffix is weighted by its
probability; that is, if the ith cut of word w, of length l, cuts it into a stem t of length i
and suffix of length l - i, then the probability of that cut is defined as
where $w_{j,k}$ refers to the substring of w from the jth to the kth letter. Probability mass
for the stem and the suffix in each such cut is then augmented by an amount equal
to the frequency of word w times the probability of the cut. After several iterations
(approximately four), estimated probabilities stabilize, and each word is analyzed on
the basis of the cut with the largest probability.
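The procedure just described can be sketched as follows. Because the defining equation for the probability of a cut did not survive in this copy, the sketch assumes that a cut's weight is the product of the current stem mass and suffix mass, normalized over the cuts of that word; it also considers only the cuts that leave a nonempty stem and a nonempty suffix. Both choices, and the name em_split, are assumptions for illustration.

```python
from collections import defaultdict


def em_split(word_counts, iterations=4):
    """Iteratively reweight every stem + suffix cut of every word.

    Returns (best_cut, weight), where best_cut[word] is the stem length of
    the highest-weighted cut and weight[word][i] is the weight of cutting
    after i letters.
    """
    total = sum(word_counts.values())
    cuts = {w: list(range(1, len(w))) for w in word_counts if len(w) > 1}
    # Start by sharing each word's mass equally among its cuts.
    weight = {w: {i: 1.0 / len(c) for i in c} for w, c in cuts.items()}

    for _ in range(iterations):
        stem_mass = defaultdict(float)
        suffix_mass = defaultdict(float)
        # Accumulate mass: word frequency times the current weight of the cut.
        for w, c in cuts.items():
            for i in c:
                share = word_counts[w] / total * weight[w][i]
                stem_mass[w[:i]] += share
                suffix_mass[w[i:]] += share
        # Reweight each cut (assumed form: stem mass times suffix mass).
        for w, c in cuts.items():
            raw = {i: stem_mass[w[:i]] * suffix_mass[w[i:]] for i in c}
            z = sum(raw.values())
            weight[w] = {i: raw[i] / z for i in c}

    best_cut = {w: max(c, key=lambda i: weight[w][i]) for w, c in cuts.items()}
    return best_cut, weight


if __name__ == "__main__":
    best, _ = em_split({"jump": 3, "jumps": 2, "walk": 4, "walks": 1})
    for word, i in sorted(best.items()):
        print(word[:i], "+", word[i:])
```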
This initially plausible approach fails because it always prefers an analysis in which
either the stem or (more often) the suffix consists of a single letter. More importantly,
the probability that a sequence of one or more word-final letters is a suffix is very
poorly modeled by the sequence's frequency. 21 To put the point another way, even the
initial heuristic analyzing one particular word must take into account all of the other
analyses in a more articulated way than this particular approach does.
I will turn now to two alternative heuristics that succeed in producing an initial
morphological analysis (and refer to a third in a note). It seems likely that one could
construct a number of additional heuristics of this sort. The point to emphasize is
that the primary responsibility of the overall morphology is not that of the initial
heuristic, but rather of the MDL model described in the previous section. The heuristics
described in this section create an initial morphology that can serve as a starting point
in a search for the shortest overall description of the morphology. We deal with that
process in Section 5.
21 It is instructive to think about why this should be so. Consider a word such as diplomacy. If we cut the
word into the pieces diplomac + y, its probability is freq (diplomac)* freq (y), and contrast that value
with the corresponding values of two other analyses: freq (diploma)* freq (cy), and
freq (diplom)* freq (acy). Now, the ratio of the frequency of words that begin with diploma and those
that begin with diplomac is less than 3, while the ratio of the frequency of words that end in y and
those that end in cy is much greater. In graphical terms, we might note that tries (the data structure)
based on forward spelling have by far the greatest branching structure early in the word, while tries
based on backward spelling have the greatest branching structure close to the root node, which is to
say at the end of the word.
where
\[ Z = \sum_{i=1}^{n-1} H(w_{1,i} + w_{i+1,n}). \]
For each word, we note what the best parse is, that is, which parse has the highest
rating by virtue of the H-function. We iterate until no word changes its optimal parse,
which empirically is typically fewer than five iterations on the entire lexicon. 22 We now
have an initial split of all words into stem plus suffix. Even for words like this and
stomach we have such an initial split.
which we may refer to as the weighted mutual information. We choose the top 100
n-grams on the basis of this measure as our set of candidate suffixes.
We should bear in mind that this ranking will be guaranteed to give incorrect
results as well as correct ones; for example, while ing is very highly ranked in an
English corpus, ting and ng will also be highly ranked, the former because so many
stems end in t, the latter because all ings end in ng, but of the three, only ing is a
morpheme in English.
We then parse all words into stem plus suffix if such a parse is possible using a
suffix from this candidate set. A considerable number of words will have more than
one such parse under those conditions, and we utilize the figure of merit described in
the preceding section to choose among those potential parses.
22 Experimenting with other functions suggests empirically that the details of our choices for a figure of
merit, and the distribution reported in the text, are relatively unimportant. As long as the measurement
is capable of ensuring that the cuts are not strongly pushed towards the periphery, the results we get
are robust.
23 Various versions of Harris's method of morpheme identification can be used as well. Harris's approach
has the interesting characteristic (unlike the heuristics discussed in the text) that it is possible to impose
restrictions that improve its precision while at the same time worsening its recall to unacceptably low
levels. In work in progress, we are exploring the consequences of using such an initial heuristic with
significantly higher precision, while depending on MDL considerations to extend the recall of the
entire morphology.
splits which (from our external vantage point) are splits between prefix and stem:
words beginning with de (defense, demand, delete, etc.) will at this point all be split after
the initial de. So there is work to be done, and for this we return to the central notion
of the signature.
5. Signatures
Each word now has been assigned an optimal split into stem and suffix by the initial
heuristic chosen, and we consider henceforth only the best parse for that word, and we
retain only those stems and suffixes that were optimal for at least one word. For each
stem, we make a list of those suffixes that appear with it, and we call an alphabetized
list of such suffixes (separated by an arbitrary symbol, such as period) the stem's
signature; we may think of it as a miniparadigm. For example, in one English corpus,
the stems despair, pity, appeal, and insult appear with the suffixes ing and ingly. However,
they also appear as freestanding words, and so we use the word NULL to indicate
a zero suffix. Thus their signature is NULL.ing.ingly. Similarly, the stems assist and
ignor are assigned the signature ance.ant.ed.ing in a certain corpus. Because each stem
is associated with exactly one signature, we will also use the term signature to refer to
the set of affixes along with the associated set of stems when no ambiguity arises.
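The construction of a stem's signature from the chosen splits can be sketched in a few lines. The input format (a collection of stem and suffix pairs, with the empty string for a freestanding word) and the function name are assumptions made for the example; the sorting relies on Python's default string order, which places NULL before lowercase suffixes as in the examples above.

```python
from collections import defaultdict


def stem_signatures(splits):
    """Map each stem to its signature.

    splits: iterable of (stem, suffix) pairs, one per analyzed word, where
    an empty suffix marks a freestanding word and is recorded as NULL.
    """
    suffixes_of = defaultdict(set)
    for stem, suffix in splits:
        suffixes_of[stem].add(suffix or "NULL")
    # The signature is the alphabetized suffix list joined by periods.
    return {stem: ".".join(sorted(sufs)) for stem, sufs in suffixes_of.items()}


if __name__ == "__main__":
    splits = [
        ("despair", ""), ("despair", "ing"), ("despair", "ingly"),
        ("assist", "ance"), ("assist", "ant"), ("assist", "ed"), ("assist", "ing"),
    ]
    print(stem_signatures(splits))
```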
We establish a data structure of all signatures, keeping track for each signature of
which stems are associated with that signature. As an initial heuristic, subject to correction
below, we discard all signatures that are associated with only one stem (these
latter form the overwhelming majority, well over 90%) and all signatures with only
one suffix. The remaining signatures we shall call regular signatures, and we will call
all of the suffixes that we find in them the regular suffixes. As we shall see, the regular
suffixes are not quite the suffixes we would like to establish for the language, but they
are a very good approximation, and constitute a good initial analysis. The nonregular
signatures produced by the take-all-splits approach are typically of no interest, as
examples such as ch.e.erial.erials.rimony.rons.uring and el.ezed.nce.reupon.ther illustrate.
The reader may identify the single English pseudostem that occurs with each of these
signatures.
The regular signatures are thus those that specify exactly the entire set of suffixes
used by at least two stems in the corpus. The presence of a signature rests upon the
existence of a structure as in (6), where there are at least two members present in each
column, and all combinations indicated in this structure are present in the corpus,
and, in addition, each stem is found with no other suffix. (This last condition does
not hold for the suffixes; a suffix may well appear in other signatures, and this is the
difference between stems and affixes.) 24
\[ \left\{ \begin{array}{l} \mathrm{stem}_1 \\ \mathrm{stem}_2 \\ \mathrm{stem}_3 \end{array} \right\} \left\{ \begin{array}{l} \mathrm{suffix}_1 \\ \mathrm{suffix}_2 \end{array} \right\} \qquad (6) \]
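The filter that defines regular signatures (at least two stems and at least two suffixes) can be sketched as below. It assumes signatures are stored as period-separated strings, one per stem, as in the text's examples; the function name is hypothetical.

```python
from collections import defaultdict


def regular_signatures(signature_of):
    """Keep signatures that have at least two stems and at least two suffixes.

    signature_of: stem -> signature string such as 'NULL.ed.ing'
    Returns: signature -> set of stems associated with it.
    """
    stems_by_signature = defaultdict(set)
    for stem, signature in signature_of.items():
        stems_by_signature[signature].add(stem)
    return {
        signature: stems
        for signature, stems in stems_by_signature.items()
        if len(stems) >= 2 and len(signature.split(".")) >= 2
    }


if __name__ == "__main__":
    example = {
        "jump": "NULL.ed.ing",
        "walk": "NULL.ed.ing",
        "ca": "n.n't.p.red.st.t",   # only one stem: discarded
        "hous": "e",                # only one suffix: discarded
        "mous": "e",
    }
    print(regular_signatures(example))
```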
If we have a morphological pattern of five suffixes, let us say, and there is a large
set of stems that appear with all five suffixes, then that set will give rise to a regular
signature with five suffixal members. This simple pattern would be perturbed
by the (for our purpose) extraneous fact that a stem appearing with these suffixes
24 Langer 1991 discusses some of the historical origins of this criterion, known in the literature as a
Greenberg square (Greenberg 1957). As Langer points out, important antecedents in the literature
include Bloomfield's brief discussion (1933, 161) as well as Nida (1948, 1949).
should also appear with some other suffix; and if all stems that associate with these
five suffixes appear with idiosyncratic suffixes (i.e., each different from the others),
then the signature of those five suffixes would never emerge. In general, however, in
a given corpus, a good proportion of stems appears with a complete set of what a
grammarian would take to be the paradigmatic set of suffixes for its class: this will
be neither the stems with the highest nor the stems with the lowest frequency, but
those in between. In addition, there will be a large range of words with no accept-
able morphological analysis, which is just as it should be: John, stomach, the, and so
forth.
To get a sense of what are identified as regular signatures in a language such as
English, let us look at the results of a preliminary analysis in Table 2 of the 86,976 words
of The Adventures of Tom Sawyer, by Mark Twain. The signatures in Table 2 are ordered
by the breadth of a signature, defined as follows. A signature $\sigma$ has both a stem count
(the number of stems associated with it) and an affix count (the number of affixes
it contains), and we use log (stem count) × log (affix count) as a rough guide to the
centrality of a signature in the corpus. The suffixes identified are given in Table 3 for
the final analysis of this text.
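A small sketch of the breadth measure follows, reading it as the product of the two logarithms. The base of the logarithm is not stated in the text; natural logarithms are assumed here, and any common base yields the same ordering. The stem counts in the usage example are taken from Table 2.

```python
import math


def breadth(signature, stem_count):
    """log(stem count) times log(affix count) for a period-separated signature."""
    affix_count = len(signature.split("."))
    return math.log(stem_count) * math.log(affix_count)


def rank_by_breadth(stem_counts):
    """Order signatures from greatest to least breadth."""
    return sorted(stem_counts, key=lambda s: breadth(s, stem_counts[s]), reverse=True)


if __name__ == "__main__":
    counts = {
        "NULL.s": 253,
        "NULL.ly": 105,
        "NULL.ed.ing": 69,
        "e.ed.ing": 35,
        "NULL.ed.s": 30,
    }
    for signature in rank_by_breadth(counts):
        print(signature, round(breadth(signature, counts[signature]), 3))
```

On these five rows the computed order agrees with their relative ranks in Table 2, even though NULL.s has by far the most stems.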
In this corpus of some 87,000 words, there are 202 regular signatures identified
through the procedure we have outlined so far (that is, preceding the refining opera-
tions described in the next section), and 803 signatures composed entirely of regular
suffixes (the 601 additional signatures either have only one suffix, or pertain to only
a single stem).
The top five signatures are: NULL.ed.ing, e.ed.ing, NULL.s, NULL.ed.s, and
NULL.ed.ing.s; the third is primarily composed of noun stems (though it includes
a few words from other categories--hundred, bleed, new), while the others are verb
stems. Number 7, NULL.ly, identifies 105 words, of which all are adjectives (apprehensive,
sumptuous, gay, ...) except for Sal, name, love, shape, and perhaps earth. The
results in English are typical of the results in the other European languages that I
have studied.
These results, then, are derived by the application of the heuristics described above.
The overall sketch of the morphology of the language is quite reasonable already in
its outlines. Nevertheless, the results, when studied up close, show that there remain
a good number of errors that must be uncovered using additional heuristics and
evaluated using the MDL measure. These errors may be organized in the following
ways:
1. The collapsing of two suffixes into one: for example, we find the suffix
ings here; in most corpora, the equally spurious suffix ments is found.
2. The systematic inclusion of stem-final material into a set of (spurious)
suffixes. In English, for example, the high frequency of stem-final ts can
lead the system to analyze a set of suffixes as in the spurious signature
ted.ting.ts, or ted.tion.
3. The inclusion of spurious signatures, largely derived from short stems
and short suffixes, and the related question of the extent of the inclusion
of signatures based on real suffixes but overapplied. For example, s is a
real suffix of English, but not every word ending in s should be analyzed
as containing that suffix. On the other hand, every word ending in ness
should be analyzed as containing that suffix (in this corpus, this reveals
the stems: selfish, uneasi, wretched, loveli, unkind, cheeri, wakeful, drowsi,
cleanli, outrageous, and loneli). In the initial analysis of Tom Sawyer, the
stem ca is posited with the signature n.n't.p.red.st.t.
Table 2
Top 81 signatures from Tom Sawyer.
Rank  Signature            Stems    Rank  Signature            Stems
   1  NULL.ed.ing             69      42  's.NULL.ly.s             3
   2  e.ed.ing                35      43  NULL.ed.s.y              3
   3  NULL.s                 253      44  t.tion                   8
   4  NULL.ed.s               30      45  NULL.less                8
   5  NULL.ed.ing.s           14      46  e.er                     8
   6  's.NULL.s               23      47  NULL.ment                8
   7  NULL.ly                105      48  le.ly                    8
   8  NULL.ing.s              18      49  NULL.ted                 7
   9  NULL.ed                 89      50  NULL.tion                7
  10  NULL.ing                77      51  l.t                      7
  11  ed.ing                  74      52  ence.ent                 6
  12  's.NULL                 65      53  NULL.ity                 6
  13  e.ed                    44      54  NULL.est.ly              3
  14  e.es                    42      55  ed.er.ing                3
  15  NULL.er.est.ly           5      56  NULL.ed.ive              3
  16  e.es.ing                 7      57  NULL.led.s               3
  17  NULL.ly.ness             7      58  NULL.er.ly               3
  18  NULL.ness               20      59  NULL.ily.y               3
  19  e.ing                   18      60  NULL.n.s                 3
  20  NULL.ly.s                6      61  NULL.ed.ings             3
  21  NULL.y                  17      62  NULL.ed.es               3
  22  NULL.er                 16      63  e.en.ing                 3
  23  e.ed.es.ing              4      64  NULL.ly.st               3
  24  NULL.ed.er.ing           4      65  NULL.s.ter               3
  25  NULL.es                 16      66  NULL.ed.ing.ings.s       2
  26  NULL.ful                13      67  NULL.i.ii.v.x            2
  27  NULL.e                  13      68  NULL.ed.ful.ing.s        2
  28  ed.s                    13      69  ious.y                   5
  29  e.ed.es                  5      70  NULL.en                  5
  30  ed.es.ing                5      71  ation.ed                 5
  31  NULL.ed.ly               5      72  NULL.able                5
  32  NULL.n't                10      73  ed.er                    5
  33  NULL.t                  10      74  nce.nt                   5
  34  'll.'s.NULL              4      75  NULL.an                  4
  35  ed.ing.ings              4      76  NULL.ed.ing.y            2
  36  NULL.s.y                 4      77  NULL.en.ing.s            2
  37  NULL.ed.er               4      78  NULL.ed.ful.ing          2
  38  NULL.ed.ment             4      79  NULL.st                  4
  39  NULL.ful.s               4      80  e.ion                    4
  40  NULL.ed.ing.ings         3      81  NULL.al.ed.s             2
  41  ted.tion                 9
In the next section, we discuss some of the approaches we have taken to resolving
these problems.
Table 3
Suffixes from Tom Sawyer.
We can use the description length of the grammar formulated in (2) and (3) to evaluate
any proposed revision, as we have already observed: note the description length of the
grammar and the compressed corpus, perform a modification of the grammar, recompute
the two lengths, and see if the modification improved the resulting description
length. 25
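The evaluate-modify-recompute cycle described here can be sketched generically. The grammar representation, the modification, and the description-length function are all supplied by the caller in this sketch; the toy measure in the usage example (total number of letters) is only a placeholder and is not the MDL measure of the paper.

```python
def try_modification(grammar, modify, description_length):
    """Apply a proposed change and keep it only if the description gets shorter.

    grammar:            any object representing the current analysis
    modify:             function returning a revised copy of the grammar
    description_length: function returning the total length (model + corpus)
    Returns (grammar to keep, its description length).
    """
    before = description_length(grammar)
    candidate = modify(grammar)
    after = description_length(candidate)
    if after < before:
        return candidate, after
    return grammar, before


if __name__ == "__main__":
    # Placeholder measure: total letters in a list of strings.
    def letters(entries):
        return sum(len(e) for e in entries)

    suffixes = ["ment", "ments", "s"]
    kept, length = try_modification(
        suffixes, lambda g: [e for e in g if e != "ments"], letters
    )
    print(kept, length)
```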
25 This computation is rather lengthy, and in actual practice it may be preferable to replace it with far
faster approaches to testing a change. One way to speed up the task is to compute the differential of
the MDL function, so that we can directly compute the change in description length given some prior
changes in the variables that define the morphology that are modified in the hypothetical change being
evaluated (see the Appendix). The second way to speed up the task is, again, to use heuristics to
identify clear cases for which full description length computation is not necessary, and to identify a
smaller number of cases where fine description length is appropriate. For example, in the case
mentioned in the text, that of determining whether a suffix such as ments should always be split into
two independently motivated suffixes ment and s, we can compute the fraction of words ending in
ments that correspond to freestanding words ending in ment. Empirical observation suggests that ratios
over 0.5 should always be split into two suffixes, ratios under 0.3 should not be split, and those in
between must be studied with more care.
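The ratio test mentioned in note 25 can be sketched as follows. The sketch counts word types rather than tokens, which is an assumption (the note does not say which is intended), and the function names are hypothetical; the 0.5 and 0.3 thresholds are the ones given in the note.

```python
def split_ratio(vocabulary, compound="ments", inner="ment"):
    """Fraction of words ending in `compound` whose `inner` form also occurs.

    Assumes `compound` is `inner` plus at least one further letter.
    Returns None if no word ends in `compound`.
    """
    words = set(vocabulary)
    candidates = [w for w in words if w.endswith(compound)]
    if not candidates:
        return None
    extra = len(compound) - len(inner)
    supported = sum(1 for w in candidates if w[:-extra] in words)
    return supported / len(candidates)


def decision(ratio, split_above=0.5, keep_below=0.3):
    """Apply the thresholds reported in the note."""
    if ratio is None:
        return "no data"
    if ratio > split_above:
        return "split"
    if ratio < keep_below:
        return "keep"
    return "inspect"


if __name__ == "__main__":
    vocab = ["payment", "payments", "moment", "moments", "movement", "movements"]
    r = split_ratio(vocab)
    print(r, decision(r))
```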
26 This is accomplished by the command am4 in Linguistica.
27 This is accomplished by the command am5 in Linguistica.
linguistic pattern. On the other hand, if the suffix is long enough, even one stem
may be enough to motivate a signature, especially if the suffix in question is oth-
erwise quite frequent in the language. A single stem occurring with a single pair
of suffixes may be a very convincing signature for other reasons as well. In Italian,
for example, even in a relatively small corpus we are likely to find a signature
such as a.ando.ano.are.ata.ate.ati.ato.azione.ò with several stems in it; once we are
sure that the 10-suffix signature is correct, then the discovery of a subsignature along
with a stem is perfectly natural, and we would not expect to find multiple stems
associated with each of the occurring combinations. (A similar example in English
from Tom Sawyer is NULL.ed.ful.ing.ive.less for the single stem rest.) And a signature
may be "contaminated," so to speak, by a spurious intruder. A corpus containing
rag, rage, raged, raging, and rags gave rise to a signature: NULL.e.ed.ing.s for the stem
rag. It seems clear that we need to use information that we have obtained regard-
ing the larger, robust patterns of suffix combinations in the language to influence
our decisions regarding smaller combinations. We return to the matter of triage be-
low.
We are currently experimenting with methods to improve the identification of related
stems. Current efforts yield interesting but inconclusive results. We compare all
pairs of stems to determine whether they can be related by a simple substitution process
(one letter for none, one letter for one letter, one letter for two letters), ignoring
those pairs that are related by virtue of one being the stem of the other already within
the analysis. We collect all such rules, and compare by frequency. In a 500,000-word
English corpus, the top two such pairs of 1:1 relationships are (1) 46 stems related by
a final d/s alternation, including intrud/intrus, apprehend/apprehens, provid/provis,
suspend/suspens, and elud/elus, and (2) 43 stems related by a final i/y alternation, including
reli/rely, ordinari/ordinary, decri/decry, suppli/supply, and accompani/accompany. This
approach can quickly locate patterns of allomorphy that are well known in the European
languages (e.g., alternation between a and ä in German, between o and ue in
Spanish, between c and ç in French). However, we do not currently have a satisfactory
means of segregating meaningful cases, such as these, from the (typically less frequent
and) spurious cases of stems whose forms are parallel but ultimately not related.
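A deliberately narrow sketch of the pairwise comparison is given below. It covers only the 1:1 case in which two stems of equal length differ in their final letter, and it does not filter out pairs already related within the analysis; the full procedure described above also considers the other substitution types. The function name is hypothetical, and the quadratic loop is meant for illustration rather than for a large stem list.

```python
from collections import Counter
from itertools import combinations


def final_alternations(stems):
    """Count final-letter alternations among equal-length stems.

    Returns a list of (alternation, count) pairs, most frequent first,
    where an alternation is written like 'd/s'.
    """
    counts = Counter()
    for a, b in combinations(sorted(set(stems)), 2):
        if len(a) == len(b) and len(a) > 1 and a[:-1] == b[:-1]:
            counts["/".join(sorted((a[-1], b[-1])))] += 1
    return counts.most_common()


if __name__ == "__main__":
    stems = ["intrud", "intrus", "provid", "provis", "elud", "elus", "reli", "rely"]
    print(final_alternations(stems))
```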
7. Results
On the whole, the inclusion of the strategies described in the preceding sections leads
to very good, but by no means perfect, results. In this section we shall review some
of these results qualitatively, some quantitatively, and discuss briefly the origin of the
incorrect parses.
We obtain the most striking result by looking at the top list of signatures in a
language, if we have some familiarity with the language: it is almost as if the textbook
patterns have been ripped out and placed in a chart. As these examples suggest,
the large morphological patterns identified tend to be quite accurately depicted. To
illustrate the results on European languages, we include signatures found from a
500,000-word corpus of English (Table 4), a 350,000-word corpus of French (Table 5),
Don Quijote, which contains 124,716 words of Spanish (Table 6), a 125,000-word corpus
of Latin (Table 7), and 100,000 words and 1,000,000 words of Italian (Tables 8 and 9).
The 500,000-word (token-count) corpus of English (the first part of the Brown Corpus)
contains slightly more than 30,000 distinct words.
To illustrate the difference of scale that is observed depending on the size of
the corpus, compare the signatures obtained in Italian on a corpus of 100,000 words
(Table 8) and a corpus of 1,000,000 words (Table 9). When one sees the rich inflectional
Goldsmith Unsupervised Learning of the Morphology of a Natural Language
Table 4
Top 10 signatures, 500,000-word English corpus.
10. NULL.ed.s
acclaim
beckon
benefit
blend
blister
bogey
bother
breakfast
buffet
burden
Computational Linguistics Volume 27, Number 2
Table 5
Top 10 signatures, 350,000-word French corpus.
pattern emerging, as with the example of the 10 suffixes on first-conjugation stems
(a.ando.ano.are.ata.ate.ati.ato.azione.ò), one cannot but be struck by the grammatical detail
that is emerging from the study of a larger corpus. 28
28 Signature 1 is formed from adjectival stems in the fem.sg., fem.pl., masc.pl., and masc.sg. forms;
Signature 2 is entirely parallel, based on stems ending with the morpheme -ic/-ich, where ich is used
before i and e. Signature 4 is an extension of Signature 2, including nominalized (sg. and pl.) forms.
Signature 5 is the large regular verb inflection pattern (seven such verb stems are identified). Signature
3 is a subset of Signature 1, composed of stems accidentally not found in the feminine plural form.
Signatures 6 and 8 are primarily masculine nouns, sg. and pl.; Signature 10 is feminine nouns, sg. and
pl.; and the remaining Signatures 7 and 9 are again subsets of the regular adjective pattern of
Signature 1.
Table 6
Top 10 signatures, 130,000-word Spanish corpus.
1. a.as.o.os: abiert, aficionad, ajen, amig, antigu, compuest, cortesan, cubiert, cuy, delicad
2. NULL.s: aborrecido, abrasado, abundante, acaecimiento, accidente, achaque, acompañado, acontecimiento, acosado, acostumbrado
3. a.o.os: afligid, ánim, asalt, caballeriz, desagradecid, descubiert, despiert, dorad, enemig, flac
4. NULL.n: abría, abriría, acabase, acabe, acaece, acertaba, acometía, acompañaba, acordaba, aguardaba
5. NULL.n.s: caballero, cante, debía, dice, dijere, duerme, entiende, fuerza, hubiera, miente
6. a.as.o: agradezc, anch, atónit, confus, conozc, decill, dificultos, estrech, extrañ, fresc
7. NULL.a.as.o.os: algun, buen, es, mí, primer, un
8. NULL.es: ángel, animal, árbol, azul, bachiller, belianis, bien, buey, calidad, cardenal
9. da.do.r: amanceba, ata, averigua, colga, emplea, feri, fingi, heri, pedi, persegui
10. NULL.le: abrazó, acomodar, aconsejó, afligióse, agradeció, aguardar, alegró, arrojó, atraer, besó
Table 7
Top 10 signatures, 125,000-word Latin corpus.
1. NULL.que: abierunt, acceperunt, accepit, accinctus, accipient, addidit, adiuvit, adoravit, adplicabis, adprehendens
4. NULL.m: abdia, abia, abira, abra, adonira, adsistente, adulescente, adulescentia, adustione, aetate
7. NULL.e.m: angustia, baptista, barachia, bethania, blasphemia, causa, conscientia, corona, ignorantia, lorica
do this, we have selected from the English and the French analyses a set of 1,000
consecutive words in the alphabetical list of words from the corpus and divided them into
distinct sets regarding the analysis provided by the present algorithm. See Tables 10
and 11.
The first category of analyses, labeled Good, is self-explanatory in the case of most
words (e.g., proceed, proceeded, proceeding, proceeds), and many of the errors are equally
easy to identify by eye (abide with no analysis, next to abid-e and abid-ing, or Abn-er).
Quite honestly, I was surprised how many words there were in which it was difficult
to say what the correct analysis was. For example, consider the pair aboli-tion and
abol-ish. The words are clearly related, and abolition clearly has a suffix; but does it have the
suffix -ion, -tion, or -ition, and does abolish have the suffix -ish, or -sh? It is hard to say.
Table 8
Top 10 signatures, 100,000-word Italian corpus.
Rank Signature Number of Stems Participating in this Signature
1 a.e.i.o 55
2 ica.iche.ici.ico 17
3 a.i.o 33
4 e.i 221
5 i.o 164
6 e.i.o 24
7 a.e.o 23
8 a.e.i 23
9 a.e 131
10 NULL.o 71
11 e.i.ità 14
Table 9
Top 10 signatures, 1,000,000-word Italian corpus.
Rank Signature Number of Stems Participating in this Signature
1 .a.e.i.o. 136
2 .ica.iche.ici.ico. 43
3 .a.i.o. 114
4 .ia.ica.iche.ici.ico.ie. 13
5 .a.ando.ano.are.ata.ate.ati.ato.azione.ò. 7
6 .e.i. 583
7 .a.e.i. 47
8 .i.o. 383
9 .a.e.o. 32
10 .a.e. 236
Table 10
Results (English).
Category Count Percent
Good 829 82.9%
Wrong analysis 52 5.2%
Failed to analyze 36 3.6%
Spurious analysis 83 8.3%
Table 11
Results (French).
Category Count Percent
Good 833 83.3%
Wrong analysis 61 6.1%
Failed to analyze 42 4.2%
Spurious analysis 64 6.4%
give credit to either the analysis aboli-tion/aboli-sh or the analysis abol-ition/abol-ish. The
second criterion is a bit more subtle. Consider the pair of words alumnus and alumni.
Should these be morphologically analyzed in a corpus of English, or rather, should
failure to analyze them be penalized for this morphology algorithm? (Compare in like
manner alibi or allegretti; do these English words contain suffixes?). My principle has
been that if I would have given the system additional credit by virtue of discovering
that relationship, I have penalized it if it did not discover it; that is a relatively harsh
criterion to apply, to be sure. Should proper names be morphologically analyzed?
The answer is often unclear. In the 500,000-word English corpus, we encounter Alex
and Alexis, and the latter is analyzed as alex-is. I have scored this as correct, much
as I have scored as correct the analyses of Alexand-er and Alexand-re. On the other
hand, the failure to analyze Alexeyeva despite the presence of Alex and Alexei does
not seem to me to be an error, while the analysis Anab-el has been scored as an
error, but John-son (and a bit less obviously Wat-son) have not been treated as errors. 29
Difficult to classify, too, is the treatment of words such as abet/abetted/abetting. The
present algorithm selects the uniform stem abet in that case, assigning the signature
NULL.ted.ting. Ultimately what we would like to have is a means of indicating that
the doubled t is predictable, and that the correct signature is NULL.ed.ing. At present
this is not implemented, and I have chosen to mark this as correct, on the grounds
that it is more important to identify words with the same stem than to identify the
(in some sense) correct signature. Still, unclear cases remain: for example, consider the
words accompani-ed/accompani-ment/accompani-st. The word accompany does not appear
as such, but the stem accompany is identified in the word accompany-ing. The analysis
accompani-st fails to identify the suffix -ist, but it will successfully identify the stem as
being the same as the one found in accompanied and accompaniment, which it would
not have done if it had associated the i with the suffix. I have, in any event, marked
this analysis as wrong, but without much conviction behind the decision. Similarly,
the analysis of French putative stem embelli with suffixes e/rent/t passes the low test
of treating related words with the same stem, but I have counted it as in error, on the
grounds that the analysis is unquestionably one letter off from the correct, traditional
analysis of second-conjugation verbs. This points to a more general issue regarding
French morphology, which is more complex than that of English. The infinitive écrire
'to write' would ideally be analyzed as a stem écr plus a derivational suffix i followed
by an infinitival suffix re. Since the derivational suffix i occurs in all its inflected forms,
it is not unreasonable to find an analysis in which the i is integrated into the stem
itself. This is what the algorithm does, employing the stem écri for the words écri-re and
écri-t. Écrit in turn is the stem for écrite, écrite, écrites, écrits, and écriture. An alternate
stem form écriv is used for past tense forms (and the nominalization écrivain) with
the suffixes aient, ait, ant, irent, it. The algorithm does not make explicit the connection
between these two stems, as it ideally would.
Thus in the tables, Good indicates the categories of words where the analysis was
clearly right, while the incorrect analyses have been broken into several categories.
Wrong Analysis is for bimorphemic words that are analyzed, but incorrectly analyzed,
by the algorithm. Failed to Analyze are the cases of words that are bimorphemic but
29 My inability to determine the correct morphological analysis in a wide range of words that I know
perfectly well seems to me to be essentially the same response as has often been observed in the case
of speakers of Japanese, Chinese, and Korean when forced to place word boundaries in e-mail
romanizations of their language. Ultimately the quality of a morphological analysis must be measured
by how well the algorithm handles the clear cases, how well it displays the relationships between
words perceived to be related, and how well it serves as the language model for a stochastic
morphology of the language in question.
for which no analysis was provided by the algorithm, and Spurious Analysis are the
cases of words that are not morphologically complex but were analyzed as containing
a suffix.
For both English and French, correct performance is found in 83% of the words;
details are presented in Tables 10 and 11. For English, these figures correspond to
precision of 829/(829 + 52 + 83) = 85.9%, and recall of 829/(829 + 52 + 36) = 90.4%.
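The precision and recall figures can be recomputed directly from the category counts in Tables 10 and 11. The sketch below applies the two formulas given above; the function name `precision_recall` is hypothetical, and the French figures it prints follow from applying the same formulas to Table 11 rather than from values stated in the text. Unrounded, the English precision is roughly 0.85996.

```python
def precision_recall(good, wrong, failed, spurious):
    """Precision and recall as defined in the text for Tables 10 and 11."""
    # Precision: of the words the algorithm analyzed, how many were right.
    precision = good / (good + wrong + spurious)
    # Recall: of the words that should have been analyzed, how many were right.
    recall = good / (good + wrong + failed)
    return precision, recall

# Category counts copied from Table 10 (English) and Table 11 (French).
english = dict(good=829, wrong=52, failed=36, spurious=83)
french = dict(good=833, wrong=61, failed=42, spurious=64)

for language, counts in (("English", english), ("French", french)):
    p, r = precision_recall(**counts)
    print(f"{language}: precision = {p:.4f}, recall = {r:.4f}")
```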
8. Triage
As noted above, the goal of triage is to determine how many stems must occur in
order for the data to be strong enough to support the existence of a linguistically real
signature. MDL provides a simple but not altogether satisfactory method of achieving
this end.
Using MDL for this task amounts to determining whether the total description
length decreases when a signature is eliminated by taking all of its words and elim-
inating their morphological structure, and reanalyzing the words as morphologically
simple (i.e., as having no morphological structure). This is how we have implemented
it, in any event; one could well imagine a variant under which some or all subparts
of the signature that comprised other signatures were made part of those other sig-
natures. For example, the signature NULL.ine.ly is motivated just for the stem just.
Under the former triage criterion, justine and justly would be treated as unanalyzed
words, whereas under the latter, just and justly would be made members of the (large)
NULL.ly signature, and just and justine might additionally be treated as comprising
parts of the signature NULL.ine along with bernard, gerald, eng, capitol, elephant, def, and
sup (although that would involve permitting a single stem to participate in two distinct
signatures).
Our MDL-based measure tests the goodness of a signature by testing each
signature σ to see if the analysis is better when that signature is deleted. This deletion
entails treating the signature's words as members of the signature of unanalyzed words
(which is the largest signature, and hence such signature pointers are relatively short).
Each word member of the signature, however, now becomes a separate stem, with all
of the increase in pointer length that that entails, as well as increase in letter content
for the stem component.
One may draw the following conclusions, I believe, from the straightforward ap-
plication of such a measure. On the whole, the effects are quite good, but by no means
as close as one would like to a human's decisions in a certain number of cases. In
addition, the effects are significantly influenced by two decisions that we have al-
ready discussed: (i) the information associated with each letter, and (ii) the decision
as to whether to model suffix frequency based solely on signature-internal frequencies,
or based on frequency across the entire morphology. The greater the information as-
sociated with each letter, the more worthwhile morphology is (because maintaining
multiple copies of nearly similar stems becomes increasingly costly and burdensome).
When suffix frequencies (which are used to compute the compressed length of any
analyzed word) are based on the frequency of the suffixes in the entire lexicon, rather
than conditionally within the signature in question, the loss of a signature entails a hit
on the compression of all other words in the lexicon that employed that suffix; hence
triage is less dramatic under that modeling assumption.
Consider the effect of this computation on the signatures produced from a 500,000-
word corpus of English. After the modifications discussed to this point, but before
triage, there were 603 signatures with two or more stems and two or more suffixes,
and there were 1,490 signatures altogether. Application of triage leads to the loss
of only 240 signatures. The single-suffix signatures that were eliminated were: ide,
it, rs, he, ton, o, and ie, all of which are spurious. However, a number of signatures
that should not have been lost were eliminated, most strikingly: NULL.ness, with 51
good analyses, NULL.ful, with 18 good analyses, and NULL.ish with only 8 analyses.
Most of the cases eliminated, however, were indeed spurious. Counting only those
signatures that involve suffixes (rather than compounds) and that were in fact correct,
the percentage of the words whose analysis was incorrectly eliminated by triage was
21.9% (236 out of 1,077 changes). Interestingly, in light of the discussion on results
above, one of the signatures that was lost was i.us for the Latin plural (based in this
particular case on genii/genius). Also eliminated (and this is most regrettable) was
NULL.n't (could/had/does/were/would/did).
Because maximizing correct results is as important as testing the MDL model
proposed here, I have also utilized a triage algorithm that departs from the MDL-based
optimization in certain cases, which I shall identify in a moment. I believe that
when the improvements identified in Section 10 below are made, the purely MDL-based
algorithm will be more accurate; that prediction remains to be tested, to be
sure. On this account, we discard any signature for which the total number of stem
letters is less than five, and any signature consisting of a single, one-letter suffix; we
keep, then, only signatures for which the savings in letter counts is greater than 15
(where savings in letter counts is simply the difference between the sum of the lengths
of words spelled out as monomorphemic words and the sum of the lengths of the
stems and the suffixes); 15 is chosen empirically.
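The non-MDL triage rule just described can be sketched as follows. This is an illustrative reading, not the implementation used for the reported results: the function names `letter_savings` and `keep_signature` are hypothetical, a signature is represented simply as a list of stems plus a list of suffixes with NULL standing for the empty suffix, and "a single, one-letter suffix" is interpreted as a signature whose only suffix has length one.

```python
def letter_savings(stems, suffixes):
    """Letters saved by listing stems and suffixes once instead of every word."""
    suffix_forms = ["" if s == "NULL" else s for s in suffixes]
    # Cost of spelling every word out as if it were monomorphemic.
    word_letters = sum(len(stem) + len(suf)
                       for stem in stems for suf in suffix_forms)
    # Cost of spelling each stem and each suffix exactly once.
    analyzed_letters = (sum(len(stem) for stem in stems)
                        + sum(len(suf) for suf in suffix_forms))
    return word_letters - analyzed_letters

def keep_signature(stems, suffixes, min_stem_letters=5, min_savings=15):
    """Apply the three heuristic tests; the thresholds 5 and 15 come from the text."""
    suffix_forms = ["" if s == "NULL" else s for s in suffixes]
    if sum(len(stem) for stem in stems) < min_stem_letters:
        return False  # too little stem material
    if len(suffix_forms) == 1 and len(suffix_forms[0]) == 1:
        return False  # a single, one-letter suffix
    return letter_savings(stems, suffixes) > min_savings

# Hypothetical signatures for illustration.
print(keep_signature(["jump", "walk", "talk", "laugh"], ["NULL", "ed", "ing", "s"]))
print(keep_signature(["just"], ["NULL", "ine", "ly"]))
print(keep_signature(["pian", "sol", "temp"], ["o"]))
```

Under these assumptions the four-stem NULL.ed.ing.s signature saves 69 letters and is kept, while NULL.ine.ly on the lone stem just fails the stem-letter test and the single-suffix signature o fails the one-letter-suffix test.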
9. Paradigms
30 As long as we keep the total number of words fixed, the global task of minimizing description length
can generally be obtained by the local strategy of finding the largest cohort for a group of forms to
associate with: if the same data can be analyzed in two ways, with the data forming groups of sizes
$\{a_i^1\}$ in one case and $\{a_i^2\}$ in the other, maximal compression is obtained by choosing the case ($k = 1, 2$)
for which $\sum_i \log(a_i^k)$ is the greatest.
posite words are worse than they were, leading to a poorer description (via increased
cross-entropy, we might say). In practice, the collapsing of signatures is rejected by
the MDL measure that we have implemented here.
In work in progress, we treat groups of signatures (as defined here) as parts of
larger groups, called paradigms. A paradigm consisting of the suffixes NULL.ed.ing.s,
for example, includes all 15 possible combinations of these suffixes. We can in general
estimate the number of stems we would expect to appear with zero counts for one or
more of the suffixes, given a frequency distribution, such as a multinomial distribution,
for the suffixes. 31 In this way, we can establish some reasonable frequencies for the case
of stems appearing in a corpus with only a single suffix. It appears at this time that the
unavailability of this information is the single most significant cause of inaccuracies
in the present algorithm. It is thus of considerable importance to get a handle on such
estimates. 32
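The count of 15 combinations, and the kind of estimate described for stems that happen to lack some suffixes, can be illustrated with a short sketch. It assumes that each of a stem's t tokens independently draws one suffix from a fixed multinomial distribution, and it computes the probability that exactly a given subset of suffixes is observed by inclusion-exclusion, which is a standard technique swapped in here rather than a formula taken from the text. The suffix probabilities and the names `nonempty_subsets` and `subsignature_probability` are hypothetical.

```python
from itertools import combinations

def nonempty_subsets(suffixes):
    """Yield every non-empty combination of the paradigm's suffixes."""
    for size in range(1, len(suffixes) + 1):
        yield from combinations(suffixes, size)

def subsignature_probability(observed, probs, t):
    """P(exactly the suffixes in `observed` appear among t >= 1 tokens)."""
    total = 0.0
    for size in range(len(observed) + 1):
        for subset in combinations(observed, size):
            # Probability that all t tokens fall inside `subset`.
            mass = sum(probs[s] for s in subset)
            sign = -1 if (len(observed) - size) % 2 else 1
            total += sign * mass ** t
    return total

# Hypothetical suffix frequencies for the paradigm NULL.ed.ing.s.
probs = {"NULL": 0.4, "ed": 0.2, "ing": 0.15, "s": 0.25}
subsets = list(nonempty_subsets(list(probs)))
print(len(subsets), "possible subsignatures")
for subset in subsets:
    p = subsignature_probability(subset, probs, t=5)
    print(".".join(subset), round(p, 4))
```

Summing these probabilities over every subsignature that lacks a given suffix recovers the quantity (1 - prob(f))^t for that suffix, and multiplying a subsignature's probability by the number of stems with t tokens gives an expected stem count for it.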
31 In particular, consider a paradigm with a set $\{f_i\}$ of suffixes. We may represent a subsignature of that
signature as a string of 0s and 1s (a Boolean string b, of the form {0,1}*, abbreviated $b_k$) indicating
whether (or not) the kth suffix is contained in the subsignature. If a stem t occurs [t] times, then the
probability that it occurs without a particular suffix $f_i$ is $(1 - \mathrm{prob}(f_i))^{[t]}$; the probability that it occurs
without all of the suffixes missing from the particular subsignature $b = \{b_k\}$ is
$$\prod_k (1 - b_k)\,(1 - \mathrm{prob}(f_k))^{[t]};$$
and the probability that the particular subsignature b will arise at all is the sum of those values over
all of the stems in the signature:
. Identifying paradigms from signatures. We would like to automatically
identify NULL.ed.ing as a subcase of the more general NULL.ed.ing.s. This
is a difficult task to accomplish well, as English illustrates, for we would
like to be able to determine that NULL.s is primarily a subcase of
's.NULL.s, and not of (e.g.) NULL.ed.s. 33
. Determining the relationship between prefixation and suffixation. The
system currently assumes that prefixes are to be stripped off the stem
that has already been identified by suffix stripping. In future work, we
would like to see alternative hypotheses regarding the relationship of
prefixation and suffixation tested by the MDL criterion.
. Identifying compounds. In work reported in Goldsmith and Reutter
(1998), we have explored the usefulness of the present system for
determining the linking elements used in German compounds, but more
work remains to be done to identify compounds in general. Here we run
straight into the problem of assigning very short strings a lower
likelihood of being words than longer strings. That is, it is difficult to
avoid positing a certain number of very short stems, as in English m and
an, the first because of pairs such as me and my, the second because of
pairs such as an and any, but these facts should not be taken as strong
evidence that man is a compound.
. As noted at the outset, the present algorithm is limited in its ability to
discover the morphology of a language in which there are not a
sufficient number of words with only one suffix in the corpus. In work
in progress, we are developing a related algorithm that deals with the
33 We noted in the preceding section that we can estimate the likelihood of a subsignature assuming a
multinomial distribution. We can in fact do better than was indicated there, in the sense that for a
given observed signature σ*, whose suffixes constitute a subset of a larger signature σ, we can
compute the likelihood that σ is responsible for the generation of σ*, where {φ_i} are the frequencies
(summing to 1.0) associated with each of the suffixes in σ, and {c_i} are the counts of the
corresponding suffixes in the observed signature σ*:
$$\binom{[t]}{[c_1], [c_2], \ldots, [c_n]} \prod_{i=1}^{n} \phi_i^{c_i} = \frac{[t]!}{[c_1]!\,[c_2]! \cdots [c_n]!} \prod_{i=1}^{n} \phi_i^{c_i}$$
or approximately
,,ogt
from Stirling's approximation. If we normalize the c_i to form a distribution (by dividing by [t]) and
denote these by d_i, then this can be simply expressed in terms of the Kullback-Leibler distance
D(σ* ‖ σ):
more general case. In the more general case, it is even more important to
develop a model that deals with the layered relationship among suffixes
in a language. The present system does not explicitly deal with these
relationships: for example, while it does break up ments into ment and s,
it does not explicitly determine which suffixes s may attach to, etc. This
must be done in a more adequate version.
. In work in progress, we have added to the capability of the algorithm
the ability to posit suffixes that are in part subtractive morphemes. That
is, in English, we would like to establish a single signature that combines
NULL.ed.ing.s and e.ed.es.ing (for jump and love, respectively). We posit an
operator <x> which deletes a preceding character x, and with the
mechanism, we can establish a single signature NULL.<e>ed.<e>ing.s,
composed of familiar suffixes NULL and s, plus two suffixes <e>ed and
<e>ing, which delete a preceding (stem-final) e if one is present.
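The subtractive suffixes in the last item can be illustrated with a minimal sketch. It assumes the deletion operator is written as a single character in angle brackets at the start of a suffix, so that the combined signature reads NULL.<e>ed.<e>ing.s; that spelling, the function name `attach`, and the regular expression are illustrative assumptions, not the notation or code of the system described here.

```python
import re

# A suffix may begin with a deletion operator such as "<e>" (assumed notation).
_DELETION = re.compile(r"^<(.)>(.*)$")

def attach(stem, suffix):
    """Attach a suffix to a stem, honoring an optional leading deletion operator."""
    if suffix == "NULL":
        return stem
    match = _DELETION.match(suffix)
    if match:
        deleted_char, rest = match.groups()
        # Delete the stem-final character only if it is actually present.
        if stem.endswith(deleted_char):
            stem = stem[:-1]
        return stem + rest
    return stem + suffix

signature = ["NULL", "<e>ed", "<e>ing", "s"]
for stem in ("jump", "love"):
    print(stem, [attach(stem, suffix) for suffix in signature])
```

With the operator applied conditionally, one signature covers both stems: jump yields jump, jumped, jumping, jumps, and love yields love, loved, loving, loves.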
11. Conclusion
Linguists face at the present time the question of whether, and to what extent,
information-theoretic notions will play a significant role in our understanding of lin-
guistic theory over the years to come, and the present system perhaps casts a small
ray of light in this area. As we have already noted, MDL analysis makes clear what the
two areas are in which an analysis can be judged: it can be judged in its ability to deal
with the data, as measured by its ability to compress the data, and it can be judged on
its complexity as a theory. While the former view is undoubtedly controversial when
viewed from the light of mainstream linguistics, it is the prospect of being able to say
something about the complexity of a theory that is potentially the most exciting. Even
more importantly, to the extent that we can make these notions explicit, we stand a
chance of being able to develop an explicit model of language acquisition employing
these ideas.
A natural question to ask is whether the algorithm presented here is intended
to be understood as a hypothesis regarding the way in which human beings acquire
morphology. I have not employed, in the design of this algorithm, a great deal of innate
knowledge regarding morphology, but that is for the simple reason that knowledge of
how words divide into subpieces is an area of knowledge which no one would take
to be innate in any direct fashion: if sanity is parsed as san + ity in one language, it
may perfectly well be parsed as sa + nity in another language.
That is, while passion may flame disagreements between partisans of Universal
Grammar and partisans of statistically grounded empiricism regarding the task of
syntax acquisition, the task which we have studied here is a considerably more humble
one, which must in some fashion or other be figured out by grunt work by the language
learner. It thus allows us a much sharper image of how powerful the tools are likely
to be that the language acquirer brings to the task. And does the human child perform
computations at all like the ones proposed here?
From most practical points of view, nothing hinges on our answer to this question,
but it is a question that ultimately we cannot avoid facing. Reformulated a bit, one
might pose the question, does the young language learner--who has access not only
to the spoken language, but perhaps also to the rudiments of the syntax and to the
intended meaning of the words and sentences--does the young learner have access
to additional information that simplifies the task of morpheme identification? It is
the belief that the answer to this question is yes that drives the intuition (if one has
this intuition) that an MDL-based analysis of the present sort is an unlikely model of
human language acquisition.
But I think that such a belief is very likely mistaken. Knowledge of semantics and
even grammar is unlikely to make the problem of morphology discovery significantly
easier. In surveying the various approaches to the problem that I have explored (only
the best of which have been described here), I do not know of any problem (of those
which the present algorithm deals with successfully) that would have been solved
by having direct access to either syntax or semantics. To the contrary: I have tried to
find the simplest algorithm capable of dealing with the facts as we know them. The
problem of determining whether two distinct signatures derive from a single larger
paradigm would be simplified with such knowledge, but that is the exception and not
the rule.
So in the end, I think that the hypothesis that the child uses an MDL-like analysis
has a good deal going for it. In any event, it is far from clear to me how one could
use information, either grammatical or contextual, to elucidate the problem of the
discovery of morphemes without recourse to notions along the lines of those used in
the present algorithm.
Of course, in all likelihood, the task of the present algorithm is not the same
as the language learner's task; it seems unlikely that the child first determines what
the words are in the language (at least, the words as they are defined in traditional
orthographic terms) and then infers the morphemes. The more general problem of
language acquisition is one that includes the problems of identifying morphemes,
of identifying words both morphologically analyzed and nonanalyzed, of identifying
syntactic categories of the words in question, and of inferring the rules guiding the
distribution of such syntactic categories. It seems to me that the only manageable
kind of approach to dealing with such a complex task is to view it as an optimization
problem, of which MDL is one particular style.
Chomsky's early conception of generative grammar (Chomsky 1975 [1955]; hence-
forth LSLT) was developed along these lines as well; his notion of an evaluation metric
for grammars was equivalent in its essential purpose to the description length of the
morphology utilized in the present paper. The primary difference between the LSLT
approach and the MDL approach is this: the LSLT approach conjectured that the gram-
mar of a language could be factored into two parts, one universal and one language-
particular; and when we look for the simplest grammatical description of a given
corpus (the child's input) it is only the language-particular part of the description that
contributes to complexity--that is what the theory stipulates. By contrast, the MDL
approach makes minimal universal assumptions, and so the complexity of everything
comprising the description of the corpus must be counted in determining the complex-
ity of the description. The difference between these hypotheses vanishes asymptotically
(as Janos Simon has pointed out to me) as the size of the language increases, or to put it
another way, strong Chomskian rationalism is indistinguishable from pure empiricism
as the information content of the (empiricist) MDL-induced grammar increases in size
relative to the information content of UG. Rephrasing that slightly, the significance
of Chomskian-style rationalism is greater, the simpler language-particular grammars
are, and it is less significant as language-particular grammars grow larger, and in the
limit, as the size of grammars grows asymptotically, traditional generative grammar
is indistinguishable from MDL-style rationalism. We return to this point below.
There is a striking point that has so far remained tacit regarding the treatment
of this problem in contemporary linguistic theory. That point is this: the problem ad-
dressed in this paper is not mentioned, not defined, and not addressed. The problem
of dividing up words into morphemes is generally taken as one that is so trivial and
devoid of interest that morphologists, or linguists more generally, simply do not feel
obliged to think about the problem. 34 In a very uninteresting sense, the challenge pre-
sented by the present paper to current morphological theory is no challenge at all,
because morphological theory makes no claims to knowing how to discover morpho-
logical analysis; it claims only to know what to do once the morphemes have been
identified.
The early generative grammar view, as explored in LSLT, posits a grammar of
possible grammars, that is, a format in which the rules of the morphology and syntax
must be written, and it establishes the semantics of these rules, which is to say, how
they function. This grammar of grammars is called variously Universal Grammar, or
Linguistic Theory, and it is generally assumed to be accessible to humans on the basis
of an innate endowment, though one need not buy into that assumption to accept
the rest of the theory. In Syntactic Structures (Chomsky 1957, 51ff.), Chomsky famously
argued that the goal of a linguistic theory that produces a grammar automatically,
given a corpus as input, is far too demanding a goal. His own theory cannot do that,
and he suggests that no one else has any idea how to accomplish the task. He suggests
furthermore that the next weaker position--that of developing a linguistic theory that
could determine, given the data and the account (grammar), whether this was the best
grammar--was still significantly past our theoretical reach, and he suggests finally that
the next weaker position is a not unreasonable one to expect of linguistic theory: that
it be able to pass judgment on which of two grammars is superior with respect to a
given corpus.
That position is, of course, exactly the position taken by the MDL framework,
which offers no help in coming up with analyses, but which is excellent at judging the
relative merits of two analyses of a single corpus of data. In this paper, we have seen
this point throughout, for we have carefully distinguished between heuristics, which
propose possible analyses and modifications of analyses, on the one hand, and the
MDL measurement, which makes the final judgment call, deciding whether to accept
a modification proposed by the heuristics, on the other.
On so much, the early generative grammar of LSLT and MDL agree. But they
disagree with regard to two points, and on these points, MDL makes clearer, more
explicit claims, and both claims appear to be strongly supported by the present study.
The two points are these: the generative view is that there is inevitably an idiosyn-
cratic character to Universal Grammar that amounts to a substantive innate capacity,
on the grounds (in part) that the task of discovering the correct grammar of a human
language, given only the corpus available to the child, is insurmountable, because this
corpus is not sufficient to home in on the correct grammar. The research strategy asso-
ciated with this position is to hypothesize certain compression techniques (generally
called "rule formalisms" in generative grammar) that lead to significant reduction in
the size of the grammars of a number of natural languages, compared to what would
have been possible without them. Sequential rule ordering is one such suggestion
discussed at length in LSLT.
To reformulate this in a fashion that allows us to make a clearer comparison with
MDL, we may formulate early generative grammar in the following way: To select
the correct Universal Grammar out of a set of proposed Universal Grammars {UGi},
given corpora for a range of human languages, select that UG for which the sum of the
sizes of the grammars for all of the corpora is the smallest. It does not follow--it need not
be the case--that the grammar of English (or German, etc.) selected by the winning
Computational Linguistics Volume 27, Number 2
UG is the shortest one of all the candidate English grammars, but the winning UG is
all-round the supplier of the shortest grammars around the world. 35
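This selection criterion can be written compactly. The notation is ours and is only a sketch: we assume corpora C_1, ..., C_n, write G_i(C_l) for the grammar that UG_i makes available for corpus C_l, and write |G| for the size of a grammar.

\[
UG^{*} \;=\; \operatorname*{arg\,min}_{UG_i} \; \sum_{l=1}^{n} \bigl| G_i(C_l) \bigr|
\]

Only the sum is minimized; nothing in the criterion requires any single term |G_i(C_l)| to be the smallest available for its corpus.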
MDL could be formulated in those terms, undoubtedly, but it also can be formulated
in a language-particular fashion, which is how it has been used in this paper.
Generative grammar is inherently universalist; it has no language-particular format,
other than to say that the best grammar for a given language is the shortest grammar.
But we know that such a position is untenable, and it is precisely out of that
knowledge that MDL was born. The position is untenable because we can always
make an arbitrarily small compression of a given set of data, if we are allowed to
make the grammar arbitrarily complex, to match and, potentially, to overfit the data,
and it is untenable because generative grammar offers no explicit notion of how well
a grammar must match the training data. MDL's insight is that it is possible to make
explicit the trade-off between complexity of the analysis and snugness of fit to the
data-corpus in question.
The first tool in that computational trade-off is the use of a probabilistic model
to compress the data, using stock tools of classical information theory. These notions
were rejected as irrelevant by early workers in generative grammar (Goldsmith
2001). Notions of probabilistic grammar due to Solomonoff (1995) were not integrated
into that framework, and the possibility of using them to quantify the goodness of fit
of a grammar to a corpus was not exploited.
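To make the trade-off concrete, here is a minimal sketch of a two-part code length. It is a deliberately simplified stand-in and not the description length formula of this paper: we assume a flat cost of log2(26) bits per letter for listing each morpheme once, and a cost of -log2 p for each morpheme token in the corpus, where p is its relative frequency. The toy word list and the `split` segmenter are hypothetical.

```python
import math
from collections import Counter

BITS_PER_LETTER = math.log2(26)  # assumption: flat cost per character


def description_length(segmented_corpus):
    """Return (model_bits, data_bits) for a corpus given as lists of morphemes."""
    counts = Counter(m for word in segmented_corpus for m in word)
    total = sum(counts.values())
    # Model: spell out each distinct morpheme once.
    model_bits = sum(len(m) * BITS_PER_LETTER for m in counts)
    # Data: each morpheme token costs -log2 of its relative frequency.
    data_bits = sum(-c * math.log2(c / total) for c in counts.values())
    return model_bits, data_bits


def split(word, suffixes=("ing", "ed", "s")):
    """Hypothetical segmenter: peel off the first matching suffix, if any."""
    for suffix in suffixes:
        if word.endswith(suffix):
            return [word[: -len(suffix)], suffix]
    return [word]


words = ["jump", "jumps", "jumped", "jumping", "walk", "walks", "walked", "walking"]
unanalyzed = [[w] for w in words]
segmented = [split(w) for w in words]

for name, analysis in (("unanalyzed", unanalyzed), ("segmented", segmented)):
    model_bits, data_bits = description_length(analysis)
    total_bits = model_bits + data_bits
    print(f"{name}: model={model_bits:.1f} data={data_bits:.1f} total={total_bits:.1f}")
```

On this toy list the segmented analysis has a much shorter model and a slightly longer encoding of the data, and its total is smaller. That is the pattern an MDL comparison rewards.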
It seems to me that it is in this context that we can best understand how
traditional generative grammar and contemporary probabilistic grammar
formalisms complement each other. I, at least, take it in that
way, and this paper is offered in that spirit.
Appendix
Since what we are really interested in computing is not the minimum description
length as such, but rather the difference between the description length of one model
and that of a variant, it is convenient to consider the general form of the difference
between two MDL computations. In general, let us say we will compare two analyses
S1 and S2 for the same corpus, where S2 typically contains some item(s) that S1 does
not (or they may differ by where they break a string into factors). Let us write out the
difference in length between these two analyses, as in (7)-(11), calculating the length
of S1 minus the length of S2. The general formulas derived in (7)-(11) are not of direct
computational interest; they serve rather as a template that can be filled in to compute
the change in description length occasioned by a particular structural change in the
morphology proposed by a particular heuristic. This template is rather complex in
its most general form, but it simplifies considerably in any specific application. The
heuristic determines which of the terms in these formulas take on nonzero values,
and what their values are; the overall formula determines whether the change in
question improves the description length. In addition, we may regard the formulas in
35 As the discussion in the text may suggest, I am skeptical of the generative position, and I would like to
identify what empirical result would confirm the generative position and dissolve my skepticism. The
result would be the discovery of two grammars of English, G1 and G2, with the following properties:
G1 is inherently simpler than G2, using some appropriate notion of Turing machine program
complexity, and yet G2 is the correct grammar of English, based on some of the complexity of G2 being
the responsibility of linguistic theory, hence "free" in the complexity competition between G1 and G2.
That is, the proponent of the generative view must be willing to acknowledge that overall complexity
of the grammar of a language may be greater than logically necessary due to evolution's investment in
one particular style of programming language.
Goldsmith Unsupervised Learning of the Morphology of a Natural Language
second, that this difference is generally computed inside a summation over a set of
morphemes, and hence the first term simplifies to a constant times the type count of
the morphemes in the set in question. Indeed, so prevalent in these calculations is the
formula
\[
\log \frac{x_{\mathrm{state1}}}{x_{\mathrm{state2}}},
\]
where the numerator is a count in S1, and the denominator a count of the same variable
in S2, that, if no confusion would result, we write it as Δx. 36
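As a worked illustration of why this abbreviation is convenient (our own gloss, assuming that a pointer to a morpheme f costs log([W]/[f]) bits, the form that pointer lengths take in this appendix), the change in the length of one such pointer between the two states regroups as follows:

\[
\log \frac{[W]_1}{[f]_1} - \log \frac{[W]_2}{[f]_2}
\;=\; \log \frac{[W]_1}{[W]_2} - \log \frac{[f]_1}{[f]_2}
\;=\; \Delta W - \Delta f .
\]

Summed over a set of n morphemes present in both states, the first term contributes n ΔW, a constant times the type count of the set, and the second contributes the sum of the individual Δf.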
Let us review the terms listed in (7)-(11). ΔN is a measure of the change in the
number of total words due to the proposed modification (the difference between the S1
and S2 analyses); an increase in the total number of words results in a slightly negative
value. In the text above, I indicated that we could, by judicious choice of word count
distribution, keep W1 = W2; I have included the more general case in (7)-(11) where
the two may be different. ΔW_S and ΔW_C are similar measures of the change in the words
that have morphologically simple, and morphologically complex, stems, respectively.
They measure the global effects of the typically small changes brought about by a
hypothetical change in morphological model. In the derivation of each formula, we
consider first the case of those morphemes that are found in both S1 and S2 (indicated
(S1, S2)), followed by those found only in S1 (S1, ~S2), and then those only found in
S2 (~S1, S2). Recall that angle brackets are used to indicate the type count of a set, the
number of typographically distinct members of a set.
In (8), we derive a formula for the change in length of the suffix component of
the morphology. Observe the final formulation, in which the first two terms involve
suffixes present in both S1 and S2, while the third term involves suffixes present only
in S1 and the fourth term involves suffixes present only in S2. This format will appear
in all of the components of this computation. Recall that the function L_typo specifies
the length of a string in bits, which we may take here to be simply log(26) times the
number of characters in the string.
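The four-term format can be checked numerically. The sketch below rests on assumptions of ours: each suffix f on the list is taken to cost log2([W]/[f]) + L_typo(f) bits, logarithms are base 2, and the counts are invented for illustration. It compares a direct subtraction of the two list lengths with the four-term decomposition.

```python
import math


def l_typo(s):
    """Length of a string in bits: log2(26) per character."""
    return math.log2(26) * len(s)


def suffix_list_length(word_count, suffix_counts):
    """Bits for a suffix list: pointer cost plus spelling cost per suffix (assumed form)."""
    return sum(math.log2(word_count / c) + l_typo(f) for f, c in suffix_counts.items())


def delta(x1, x2):
    """The abbreviation Δx: log of the state-1 count over the state-2 count."""
    return math.log2(x1 / x2)


def suffix_component_change(w1, counts1, w2, counts2):
    """Length in S1 minus length in S2, written in the four-term format."""
    shared = counts1.keys() & counts2.keys()
    only1 = counts1.keys() - counts2.keys()
    only2 = counts2.keys() - counts1.keys()
    return (
        delta(w1, w2) * len(shared)                                    # ΔW times type count
        - sum(delta(counts1[f], counts2[f]) for f in shared)           # sum of Δf
        + sum(math.log2(w1 / counts1[f]) + l_typo(f) for f in only1)   # only in S1
        - sum(math.log2(w2 / counts2[f]) + l_typo(f) for f in only2)   # only in S2
    )


# Invented counts: S2 drops the suffix "e" and adds "ly".
s1 = {"s": 120, "ed": 80, "ing": 60, "e": 15}
s2 = {"s": 125, "ed": 80, "ing": 60, "ly": 10}
direct = suffix_list_length(1000, s1) - suffix_list_length(1010, s2)
template = suffix_component_change(1000, s1, 1010, s2)
print(math.isclose(direct, template))
```

The spelling costs of the shared suffixes cancel, which is why L_typo survives only in the third and fourth terms.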
In (9), we derive the corresponding formula for the stem component.
The general form of the computation of the change to the signature component
(10) is more complicated, and this complexity motivates a little bit more notation to
simplify it. First, we can compute the change in the pointers to the signatures, and the
information that each signature contains regarding the count of its stems and suffixes
36 We beg the reader's indulgence in recognizing that we prepend the operator Δ immediately to the left
of the name of a set to indicate the change in the size of the counts of the set, which is to say, "ΔW" is
shorthand for "Δ([W])", and "Δ⟨W⟩" for "Δ(⟨W⟩)".
as in (10a). But the heart of the matter is the treatment of the stems and suffixes within
the signatures, given in (10b)-(10d).
Bear in mind, first of all, that each signature consists of a list of pointers to stems,
and a list of pointers to suffixes. The treatment of suffixes is given in (10d), and is
relatively straightforward, but the treatment of stems (10c) is a bit more complex.
Recall that all items on the stem list will be pointed to by exactly one stem pointer,
located in some particular signature. All stem pointers in a signature that point to
stems on the stem list directly describe a "simple" word, a notion we have
already encountered: a word whose stem is not further analyzable. But other words
may be complex, that is, may contain a stem whose pointer is to an analyzable word,
and hence the stem's representation consists of a pointer triple: a pointer to a signature,
a stem within the signature, and a suffix within the signature. And each stem pointer
is preceded by a flag indicating which type of stem it is.
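The two kinds of stem pointer can be pictured with a small data structure. This is a sketch under our own assumptions rather than a description of any actual implementation: the class names, the use of list indices as pointers, and the toy lexicon are all hypothetical, and the class of each pointer plays the role of the flag.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class SimpleStem:
    """Flag value 'simple': points straight at an entry on the stem list."""
    stem_index: int


@dataclass(frozen=True)
class ComplexStem:
    """Flag value 'complex': the pointer triple naming an analyzable word."""
    signature_index: int
    stem_in_signature: int
    suffix_in_signature: int


StemPointer = Union[SimpleStem, ComplexStem]


@dataclass
class Signature:
    stems: List[StemPointer]  # pointers to stems
    suffixes: List[int]       # pointers (indices) into the suffix list


def spell_stem(ptr, stem_list, suffix_list, signatures):
    """Follow a stem pointer and return the string it stands for."""
    if isinstance(ptr, SimpleStem):
        return stem_list[ptr.stem_index]
    sig = signatures[ptr.signature_index]
    base = spell_stem(sig.stems[ptr.stem_in_signature], stem_list, suffix_list, signatures)
    return base + suffix_list[sig.suffixes[ptr.suffix_in_signature]]


stem_list = ["work"]
suffix_list = ["", "ing", "s"]  # "" stands in for the null suffix
signatures = [
    Signature(stems=[SimpleStem(0)], suffixes=[0, 1]),         # work, working
    Signature(stems=[ComplexStem(0, 0, 1)], suffixes=[0, 2]),  # working, workings
]
```

Here the stem of "workings" is not listed on the stem list; it is recovered by following the triple to the first signature, its stem "work", and its suffix "ing".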
We thus have three things whose difference in the two states, S1 and S2, we wish
to compute. The difference of the lengths of the flags is given in (10c.i). In (10c.ii), we
need the change in the total length of the pointers to the stems, and this has actually
already been computed, during the computation of (9). 37 Finally in (10c.iii), the set of
pointers from certain stem positions to words consists of pointers to all of the words
that we have already labeled as being in W_C, and we can compute the length of these
pointers by adding counts to these words; the length of the pointers to these words
needs to be computed anyway in determining the compressed length of the corpus.
This completes the computations needed to compare two states of the morphology.
In addition, we must compute the difference in the compressed length of the
corpus in the two states, and this is given in (11).
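(8) Difference in description length for the suffix component of the morphology:
\[
\Delta W \, \langle \mathit{Suffixes} \rangle_{(1,2)}
- \sum_{f \in \mathit{Suffixes}_{(1,2)}} \Delta f
+ \sum_{f \in \mathit{Suffixes}_{(1,\sim 2)}} \left[ \log \frac{[W]_1}{[f]_1} + L_{\mathrm{typo}}(f) \right]
- \sum_{f \in \mathit{Suffixes}_{(\sim 1,2)}} \left[ \log \frac{[W]_2}{[f]_2} + L_{\mathrm{typo}}(f) \right]
\]
(9) Difference in description length for the stem component of the morphology:
\[
\Delta W \, \langle \mathit{Stems} \rangle_{(1,2)}
- \sum_{t \in \mathit{Stems}_{(1,2)}} \Delta t
+ \sum_{t \in \mathit{Stems}_{(1,\sim 2)}} \left[ \log \frac{[W]_1}{[t]_1} + L_{\mathrm{typo}}(t) \right]
- \sum_{t \in \mathit{Stems}_{(\sim 1,2)}} \left[ \log \frac{[W]_2}{[t]_2} + L_{\mathrm{typo}}(t) \right]
\]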
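(10) Difference in description length for the signature component of the morphology:
(a) + (b) + (c) + (d)
(b) Change in counts of stems and suffixes within each signature, summed
over all signatures:
\[
\sum_{\sigma \in \mathit{Signatures}_{(1,2)}} \left[ \Delta \langle \mathit{stems}(\sigma) \rangle + \Delta \langle \mathit{suffixes}(\sigma) \rangle \right]
\]
(c) Change in the lengths of the stem pointers within the signatures = (c.i)
+ (c.ii) + (c.iii), as follows:
(c.i) Change in total length of flags for each stem indicating whether
simple or complex:
\[
\cdots
- \langle W_{\mathrm{SIMPLE}} \rangle_{(\sim 1,2)} \log \frac{[W]_2}{[W_{\mathrm{SIMPLE}}]_2}
+ \langle W_{\mathrm{COMPLEX}} \rangle_{(1,\sim 2)} \log \frac{[W]_1}{[W_{\mathrm{COMPLEX}}]_1}
- \langle W_{\mathrm{COMPLEX}} \rangle_{(\sim 1,2)} \log \frac{[W]_2}{[W_{\mathrm{COMPLEX}}]_2}
\]
(c.ii) Change in total length of the pointers to the stems:
\[
\Delta W \, \langle \mathit{Stems} \rangle_{(1,2)}
- \sum_{t \in \mathit{Stems}_{(1,2)}} \Delta t
+ \sum_{t \in \mathit{Stems}_{(1,\sim 2)}} \log \frac{[W]_1}{[t]_1}
- \sum_{t \in \mathit{Stems}_{(\sim 1,2)}} \log \frac{[W]_2}{[t]_2}
\]
(c.iii) Change in total length of the pointers from stem positions to words:
\[
\cdots
+ \sum_{w \in W_{\mathrm{COMPLEX}(1,\sim 2)}} \log \frac{[W]_1}{[\mathit{stem}(w)]_1}
- \sum_{w \in W_{\mathrm{COMPLEX}(\sim 1,2)}} \log \frac{[W]_2}{[\mathit{stem}(w)]_2}
+ \sum_{w \in W_{\mathrm{COMPLEX}(1,\sim 2)}} \log \frac{[\sigma(w)]_1}{[\mathit{suff}(w) \text{ in } \sigma(w)]_1}
- \sum_{w \in W_{\mathrm{COMPLEX}(\sim 1,2)}} \log \frac{[\sigma(w)]_2}{[\mathit{suff}(w) \text{ in } \sigma(w)]_2}
\]
(d) Change in the lengths of the suffix pointers within the signatures:
\[
\cdots
- \sum_{\sigma \in \mathit{Signatures}_{(\sim 1,2)}} \; \sum_{f \in \sigma} \log \frac{[\sigma]_2}{[f \text{ in } \sigma]_2}
\]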