Structure in Linguistics
Linas Vepstas
Novamente, OpenCog Project
1. Introduction
The starting point for automated linguistic learning is collocation. One of the
simplest analyses one can perform is to discover important bigrams by means
of mutual information (Yuret 1998). While a simple count of probabilities might
show which word pairs (or N-grams) occur frequently, mutual information gives
a different view, as it uncovers idiomatic expressions and “set phrases”: collec-
tions of words that are used together in a relatively invariant fashion. Examples
of high-mutual-information word pairs are Johns Hopkins and jet propulsion, both
scoring an MI of about 15 in one dataset (Vepstas 2008). The most striking result
from looking over lists of high-mutual-information word pairs is that most are the
names of conceptual entities.
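The computation itself is elementary: for an adjacent word pair (w1, w2), the score is the base-2 logarithm of the joint probability divided by the product of the individual word probabilities. The following is a minimal Python sketch of this count-and-score procedure; the corpus file name, whitespace tokenization, and frequency cutoff are illustrative assumptions rather than details of the cited experiments.

    import math
    from collections import Counter

    def pmi_bigrams(tokens, min_count=5):
        """Score adjacent word pairs by log2( P(w1,w2) / (P(w1) * P(w2)) )."""
        unigrams = Counter(tokens)
        bigrams = Counter(zip(tokens, tokens[1:]))
        n_uni = sum(unigrams.values())
        n_bi = sum(bigrams.values())
        scores = {}
        for (w1, w2), count in bigrams.items():
            if count < min_count:
                continue                      # rare pairs give unreliable estimates
            p_joint = count / n_bi
            p_indep = (unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)
            scores[(w1, w2)] = math.log2(p_joint / p_indep)
        return scores

    # Hypothetical usage: on a large corpus, pairs such as ("Johns", "Hopkins")
    # or ("jet", "propulsion") should surface near the top of the ranking.
    tokens = open("corpus.txt").read().split()          # placeholder corpus file
    ranked = sorted(pmi_bigrams(tokens).items(), key=lambda kv: -kv[1])
    for pair, score in ranked[:20]:
        print(pair, round(score, 2))

A plain frequency count would rank pairs like of the at the top; the ratio in the score instead rewards pairs whose component words rarely occur apart.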
Mutual information between word pairs provides the starting point for a
class of parsers such as the Minimum Spanning Tree Parser (McDonald & Pereira
2005). These are typically dependency parsers that discover dependency relations
through purely “unsupervised” training. The key word here is “unsupervised”:
rather than providing the learning algorithm with linguist-annotated datasets, the
parser “learns” merely by observing text “au naturel”, without annotations. This is
in contrast to many other parsers built on statistical techniques, which require an
annotated corpus as input. By eschewing annotation, such a parser stays true to one of the core
ideals of corpus linguistics: “trust the text”.
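The parsing idea can be caricatured in a few lines: treat the MI score as an affinity between words, and take the maximum spanning tree over the words of a sentence as a crude, unlabeled dependency skeleton. The sketch below illustrates only that intuition; it is not the McDonald & Pereira algorithm, which uses trained, directed edge scores and labeled dependencies, and the toy affinity table is invented for the example.

    def mst_parse(words, affinity):
        """Prim-style maximum spanning tree over word positions.

        affinity(w1, w2) -> float, a symmetric score such as mutual information.
        Returns a list of (head_index, dependent_index) links.
        """
        in_tree = {0}                          # grow the tree from the first word
        links = []
        while len(in_tree) < len(words):
            best = None
            for i in in_tree:
                for j in range(len(words)):
                    if j in in_tree:
                        continue
                    score = affinity(words[i], words[j])
                    if best is None or score > best[0]:
                        best = (score, i, j)
            _, head, dep = best
            in_tree.add(dep)
            links.append((head, dep))
        return links

    # Hypothetical usage with an invented affinity table:
    toy = {("jet", "propulsion"): 15.0, ("the", "jet"): 2.0,
           ("propulsion", "laboratory"): 9.0}
    affinity = lambda a, b: toy.get((a, b), toy.get((b, a), 0.1))
    print(mst_parse(["the", "jet", "propulsion", "laboratory"], affinity))

This prints [(0, 1), (1, 2), (2, 3)]: each word attaches to the neighbour with which it shares the highest affinity.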
Sometimes it is imagined that statistical techniques stop here: anyone who has
played with (or read about) Markov chains built from N-grams knows that such
systems are adept at generating quasi-grammatical nonsense text. But this does not
imply that machine learning techniques cannot advance further: it merely means
that a single level of analysis has only limited power, and must be aided by further
layers providing additional structure and constraints. So, for example, although
a naive Markov chain might allow nonsense word combinations, a higher-order
structure discriminator or “layer” would note that certain frequently-generated
patterns are rarely or never observed in an actual corpus, and prohibit their gen-
eration (alternatively, balk at creating a parse when presented with such ungram-
matical input). Examples of such “layering” are presented below.
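As a minimal sketch of what such a layer might look like, the generator below uses a naive bigram Markov chain to propose the next word, while a second layer vetoes any continuation whose resulting trigram was never attested in the corpus. The corpus file, tokenization, and the simple resample-until-attested policy are illustrative assumptions, not a description of any particular published system.

    import random
    from collections import defaultdict

    def build_layers(tokens):
        bigram_next = defaultdict(list)        # layer 1: what may follow a word
        attested_trigrams = set(zip(tokens, tokens[1:], tokens[2:]))  # layer 2
        for w1, w2 in zip(tokens, tokens[1:]):
            bigram_next[w1].append(w2)
        return bigram_next, attested_trigrams

    def generate(bigram_next, attested_trigrams, start, length=10, tries=50):
        out = list(start)                      # start must contain two seed words
        for _ in range(length):
            candidates = bigram_next.get(out[-1])
            if not candidates:
                break
            for _ in range(tries):             # resample until layer 2 approves
                word = random.choice(candidates)
                if (out[-2], out[-1], word) in attested_trigrams:
                    out.append(word)
                    break
            else:
                break                          # no attested continuation found
        return " ".join(out)

    tokens = open("corpus.txt").read().split() # placeholder corpus file
    nxt, tri = build_layers(tokens)
    print(generate(nxt, tri, start=tokens[:2]))

A real system would back off more gracefully, but the point is the division of labour: the first layer proposes, the second disposes.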
Another important point is that the observation of structures in corpora can
be used to validate “mentalistic” theories of grammar. The whole point of tradi-
tional, hand-built parsers was to demonstrate that one could model certain high-
frequency word occurrences with a relatively small set of rules — that these rules
can differentiate between grammatical and ungrammatical text. The “small” set of
rules captures and distils the contents of a much larger corpus. Thus, for example,
high-mutual-information word pairs help validate the “mentalistic” notion of a
“dependency” parser. But this is a two-way street: theories based on practical ex-
perience can serve as a guide for the kinds of structures and patterns that one
might be able to automatically mine, using unsupervised techniques, from a text
corpus. Thus, for example, the “Meaning-Text Theory” developed by Igor Mel’čuk
and others (Mel’čuk & Polguère 1987, Steele 1990) provides a description of ‘set
phrases’, of deep semantic structures, and of the interesting notion of ‘lexical func-
tions’. Perhaps one might be able to discover these by automated means, and con-
versely, use these automatically discovered structures to perform more accurate
parsing and even analysis of semantic content.
Such attempts at automated, unsupervised learning of deeper structure are
already being done, albeit without any theoretical baggage in tow. Prominent ex-
amples include Dekang Lin’s work on the automatic discovery of synonymous
phrases (Lin & Pantel 2001), and Poon & Domingos’ (2009) extension of these
ideas to the automated discovery of synonymous relations of higher N-arity. The
earlier work by Lin applies a fairly straightforward statistical analysis to discover
co-occurring phrases and thus deduce their synonymy.
However, it is important to note that Lin’s analysis is “straightforward” only
because a certain amount of heavy lifting was already performed by parsing the
input corpus. That is, rather than applying statistical techniques on raw N-grams,
the text is first analyzed with a dependency parser; it is the output of the parser
that is subject to statistical analysis. Crudely speaking, “collocations of parses” are
found.
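To make “collocations of parses” concrete, the sketch below measures how often two dependency paths connect the same pairs of slot fillers, which is the core intuition behind treating the paths as paraphrases. It simplifies Lin & Pantel’s method, which weights each filler by mutual information rather than counting raw overlap; the toy triples and the similarity cutoff are invented for illustration.

    from collections import defaultdict
    from itertools import combinations

    def similar_paths(triples, threshold=0.3):
        """triples: iterable of (x, path, y) extracted from dependency parses.
        Returns pairs of paths whose slot fillers overlap strongly."""
        fillers = defaultdict(set)             # path -> set of (x, y) filler pairs
        for x, path, y in triples:
            fillers[path].add((x, y))
        pairs = []
        for p1, p2 in combinations(fillers, 2):
            a, b = fillers[p1], fillers[p2]
            overlap = len(a & b) / len(a | b)  # Jaccard overlap of filler pairs
            if overlap >= threshold:
                pairs.append((p1, p2, overlap))
        return pairs

    # Invented triples standing in for real parser output:
    triples = [
        ("aspirin", "X relieves Y", "headache"),
        ("aspirin", "X is a remedy for Y", "headache"),
        ("ibuprofen", "X relieves Y", "pain"),
        ("ibuprofen", "X is a remedy for Y", "pain"),
    ]
    print(similar_paths(triples))              # the two paths come out as similar

The interesting step is upstream: a dependency parser has already reduced each sentence to such triples, so the statistics run over structures rather than over raw word sequences.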
3. Conclusion
In the connectionist view of semantics, “meaning” exists only in patterns and re-
lations and connections: meaning is expressed in linguistic dialog. In the case of
embodied actors, meaning can be expressed in action; but again, the action is in
reference to some thing. There is no atomic kernel to meaning; there is no “there”
there. Rather, meaning is tied into the expression and articulation of structure.
But what is meant by “structure” and “connections”? There are multiple can-
didates for things that could be “structure” in linguistics. One obvious candidate
is “lexis” — that grand lexicon in the sky, where all words, phrases and idioms
are defined maximally in relation to one another. But there is also the structure
commonly known as “syntax” — a set of rules governing the positional ordering
of words. Cognitive linguistics posits that there are even deep structures of vari-
ous sorts (such as the “lexical functions” of Meaning-Text Theory). It would seem
that each of these different kinds of structures can be discerned to some degree or
another using automatic, unsupervised machine-learning techniques, taking only
a bare, naked corpus of text as input.
To put it crudely, collocation need not always be done “by hand”: let the com-
puter do it, and look at what it does. Don’t just look at what words are next to
each other, but look at what structures are next to each other. Future effort at the
intersection of computational and corpus linguistics lies in the discovery of more
subtle structure and in the invention of new techniques to expose and manipulate it.
Insofar as current structures capture only an approximation of human discourse,
more work clearly remains to be done.
Perhaps to the reader, this layering and interaction of different algorithms to
explain linguistic phenomena is reminiscent of the use of epicycles on epicycles
to correct for circular planetary orbits. Perhaps it is. But at this stage, it is still
too early to tell. Certainly, the parsers of a few decades back were perhaps less
than impressive. Language has a lot of structure, and capturing that structure with
hand-coded rules proved to be an overwhelming task. Yet, the structure is there,
unclear as it may often be. What we have now that we didn’t have back then are
the tools to raise the exploration to a “meta” level: the tools themselves find rules
and structure in the corpus — and now we can try to understand what sort of rules
and structure they are finding, and why some kinds of structure are invisible to
some kinds of tools, and what it is that makes one tool better than another.
References
Bai, B. et al. 2009: online. “Polynomial semantic indexing”. In Proceedings of the 2009 Confer-
ence in Advances in Neural Information Processing Systems 22. Available at: http://books.
nips.cc/papers/files/nips22/NIPS2009_0881.pdf (accessed April 2010).
Lin, D. & Pantel, P. 2001. “Discovery of inference rules from text”. In Proceedings of the Sev-
enth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
(KDD’01). USA: ACM Press, 323–328.
McDonald, R. & Pereira, F. 2005: online. “Minimum-Spanning Tree Parser”. Available at: http://
www.seas.upenn.edu/~strctlrn/MSTParser/MSTParser.html (accessed April 2010).
Mel’čuk, I. A. & Polguère, A. 1987. “A formal lexicon in the meaning-text theory: (Or how to do
lexica with words)”. Computational Linguistics, 13 (3–4), 261–275.
Michael, L. & Valiant, L. G. 2008. “A first experimental demonstration of massive knowledge
infusion”. In Proceedings of the 11th International Conference on Principles of Knowledge
Representation and Reasoning, Sydney, Australia, 16–20 September 2008, 378–389.
Pfeifer, R. & Scheier, C. 1999. Understanding Intelligence. Cambridge, MA: The MIT Press.
Poon, H. & Domingos, P. 2009. “Unsupervised semantic parsing”. In Proceedings of the 2009 Con-
ference on Empirical Methods in Natural Language Processing. Singapore: Association for
Computational Linguistics, 1–10. Also available at: http://www.aclweb.org/anthology/D/
D09/D09–1001 (accessed April 2010).
Steele, J. (Ed.) 1990. Meaning-Text Theory: Linguistics, Lexicography, and Implications. Canada:
University of Ottawa Press.
Varela, F. J., Thompson, E. & Rosch, E. 1991. The Embodied Mind: Cognitive Science and Human
Experience. Cambridge, MA: The MIT Press.
Vepstas, L. 2008: online. “Linas’ collection of NLP data”. Available at: http://gnucash.org/linas/
nlp/ (accessed April 2010).
Yuret, D. 1998. Discovery of Linguistic Relations Using Lexical Attraction. PhD Thesis, Mas-
sachusetts Institute of Technology. Also available at: http://arxiv.org/PS_cache/cmp-lg/
pdf/9805/9805009v1.pdf (accessed April 2010).
Author’s address
Linas Vepstas
Novamente, OpenCog Project
1518 Enfield Road
Austin TX 78703
[email protected]