Assessing Linguistic Complexity
Patrick Juola
Duquesne University
Abstract
The results show that, as expected, languages are all about equally
tinuum.
about process, including an implicit description of the underlying cog-
1 Introduction
Most people with any background in language have at least an informal un-
that you have to study in order to pass the exam on it, and in particular to the
amount of stuff you simply have to memorize, such as lists of irregular verbs,
which (Nichols 1986) is an obvious example; she counts the number of points
extensive use of inflectional morphology), but he ties this, at least in theory,
requires. Despite the obvious practical difficulties (how do you compare two
complexity.
tion of “length” itself raises the issue of in what language (in a mathematical
that analysis of data from natural languages can shed light on the psycho-
theoretic basis can be found in (Zipf 1949), in his argument about applica-
tions of words:
True, we do not yet know that whenever man talks, his speech is
bine it with our previous view of speech as a set of tools and stated:
words are tools that are used to convey meanings in order to achieve
objectives. . .
nature. Since it is usually felt that words are “combined with mean-
ings” we may suspect that there is latent in speech both a more and a
Information theory provides a way, unavailable to Zipf, of resolving the
that can be used in any situation for any meaning. The hearer’s economy
requires that the speaker be easily understandable, and thus that the amount
messages, but at the same time, too much useless information will clog things
up. One can thus see that both the speaker and hearer have incentives to
measured in bits, that can be carried along such a channel.1 If less than
this information is sent, some messages will not be distinguishable from each
other. If more is sent, the “extra” is wasted resources. This is, for any source,
the amount of time/space/bandwidth necessary to send messages from that
requires a Zipf-like framework, where the most frequently sent messages have
requires much more bandwidth than the information content of the message.
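The entropy bound described above is easy to compute for a concrete source; a minimal sketch in Python (the four-message source and its probabilities are illustrative assumptions, not data from the paper):

```python
import math

def entropy(probs):
    """Shannon entropy H(S) = -sum p_i * log2(p_i), in bits per message."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative source: four messages with a Zipf-like skew in frequency.
probs = [0.5, 0.25, 0.125, 0.125]
print(entropy(probs))  # 1.75 bits per message
```

Sending less than this number of bits per message (on average) makes some messages indistinguishable; sending more wastes resources.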
In practical terms, there are a number of problems with the direct ap-
beforehand all possible “messages” that might be sent along with their prob-
dent, meaning that the chance of a message being sent does not change,
depending upon what other messages have been sent. In natural language,
content of the previous messages. Finally, the assumption that the receiver
since the knowledge I wish to transmit may be less than the language struc-
ture demands; if all I wish to say is that one of my siblings has red hair,
the sex of the relevant sibling is irrelevant, as is their comparative age, but
some languages (e.g. Japanese) may force me not only to specify the sex
plexity (Li and Vitányi 1997). Kolmogorov complexity measures the infor-
‘b’s would be easily (and quickly) described, while a (specific) random collec-
tion of a thousand ‘a’s and ‘b’s would be very difficult to describe. For a large
any given set of messages produced by a Shannon source, it is a very efficient
prove this is non-trivial, the result can be seen intuitively by observing that
strict technical sense related to the Halting Problem. Despite this technical
1969; Schneier 1996) Linear complexity addresses the issue by assuming that
the reconstruction machinery/algorithm is of a specific form, a linear feed-
back shift register (LFSR, see figure 1) composed of an ordered set of (shift)
registers and a (linear) feedback function. The register set acts as a queue,
where the past few elements of the sequence line up politely in order, while
the feedback function generates the next single text element and adds it to
the end of the queue (dropping the element at the head, of course). The
simple computer program (Massey 1969). Such systems are widely used as
the next element of the sequence (be it a word, morpheme, phoneme, etc.)
pel and Ziv 1976; Ziv and Lempel 1977) (the complexity metric that underlies
[Figure 1: a linear feedback shift register, showing the shift registers and the feedback function]
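The register-and-feedback construction can be sketched in a few lines of Python. This is a minimal binary LFSR; the paper's version operates over text elements rather than bits, and the seed and tap positions here are arbitrary choices for illustration:

```python
from collections import deque

def lfsr(seed_bits, taps, n):
    """Generate n elements from a linear feedback shift register.

    The register acts as a queue: the feedback function (XOR of the
    tapped positions, i.e. a linear function over GF(2)) generates the
    next element, which is appended to the end of the queue while the
    element at the head is dropped."""
    reg = deque(seed_bits)
    out = []
    for _ in range(n):
        out.append(reg[0])
        fb = 0
        for t in taps:
            fb ^= reg[t]
        reg.popleft()   # drop the element at the head
        reg.append(fb)  # enqueue the newly generated element
    return out

print(lfsr([1, 0, 0, 1], taps=[0, 3], n=8))
```

The linear complexity of a sequence is then the length of the shortest such register that reproduces it.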
most of the ZIP family of commercial file compressors) which by contrast in-
“lexicon” is not confined solely to traditional lexemes, but can also incorpo-
and Lempel 1977) or similar techniques will, given infinite computing power
logical underpinnings) would apply to any other proposed form of file com-
pression.
a detailed analysis not only of overall complexity, but of its linguistic and
cognitive components.
3 Linguistic Experiments
interest. Elsewhere (Juola 1997), it has been argued that the information
• the shared information omitted between the author and her audience
information contained in several expressions of the same ideas. A language
brother has red hair. From a purely semantic analysis of the meanings of
the words, it is apparent that a single person is being spoken of and that
the person is neither myself, nor the listener. It is not necessary to transmit
this explicitly, for example via third person singular verb inflection. The
mation. If English (or any language) were perfectly regular, the nature of the
Other examples of this sort of mandatory complexity would include gen-
one can determine the importance of this type of complexity in the overall
measure.
and Diab 1999). It is relatively large, widely available, for the most part free
formalize common wisdom (pace McWhorter) regarding linguistic complexity
as follows:
Finnish, French, Maori, and Russian), it was shown that the variation in
size of the uncompressed text (4,242,292 bytes, +/- 376,471.4, or about 8.86%
variation) was substantially more than the variation in size after compression
via LZ (1,300,637 bytes, +/- 36,068.2, about 2.77%). This strongly suggests
that much of the variance in document size (of the Bible) is from the charac-
ter encoding system, and that the underlying message complexity is (more)
uniform.
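This kind of comparison is straightforward to reproduce with any LZ-family compressor; a sketch using Python's zlib (an LZ77 variant). The two byte strings are hypothetical stand-ins for full translations: the same "message" rendered with different surface verbosity:

```python
import zlib

def compressed_sizes(texts):
    """LZ-compressed size in bytes of each text (zlib is an LZ77 variant)."""
    return [len(zlib.compress(t, 9)) for t in texts]

# Hypothetical stand-ins: in the study these were whole Bible translations.
texts = {
    "verbose": b"the cat sat upon the mat " * 400,  # longer surface form
    "terse":   b"cat sat on mat " * 400,            # same content, shorter
}
raw = {k: len(v) for k, v in texts.items()}
comp = dict(zip(texts, compressed_sizes(texts.values())))
print("raw sizes:       ", raw)
print("compressed sizes:", comp)
```

The raw sizes differ substantially, but after compression both shrink toward the (much smaller) information content they share, which is the pattern the Bible study reports.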
We can contrast this with the results found (Juola 2000) by a comparable
study of the linear complexity of the Bible translations. In this study, it was
shown that the linear complexity of the Bible varies directly with the number
advantage to the linear complexity framework. (We are tempted to draw the
related conclusion that this similarly illustrates that the underlying process
of the paradigms. One such experiment (Juola, Bailey, and Pothos 1998)
mulation. The results, replicated here as figure 2, show that, as expected, the
“simplest” system is no inflection at all, that the system where all words were
inflected was more complex (reflecting the additional complexity of the rule
itself), but that the measured complexity varied smoothly and continuously,
and that the most complex system (as predicted by Shannon’s mathematics)
was at an intermediate stage where the question of whether a word was subject
[Figure 2: plot of estimated complexity C (y-axis, roughly 285 to 315) against number of inflected words (x-axis, 0 to 50)]
Figure 2: Variations in measured complexity as number of inflected words
varies from 0 (0%) to 55 (100%)
appropriate corpora, and give meaningful measurements. Linear complexity
measurements can also be taken, but do not map to our preliminary intuitions
tive, this argues strongly that LZ, with its focus on the lexicon and long-term
processing that did not include a lexicon. But this also suggests that careful
the cognition underlying human language, either in ways that it is the same
omitted as shared between the author and her audience.” One observation
made by scholars of the translation process is that text in translation tends
between the speaker and hearer need not be expressed. For example, a writer
and a reader who share a common knowledge of the city layout can be less
One of the more obvious examples of common knowledge is, of course, the
tural cues, and general knowledge that can be broadly lumped together as
the readers of the same work in translation. (Baker 1993) suggests that this
feature of increased specificity may be a universal aspect of the translation
process and presents a brilliantly clear example, where the (English) sentence
the American Presidency and the end of the Second World War, his political
implicit information need not even necessarily be “factual,” but can instead
the degree of shared knowledge between a reader and writer, even between
the writer and a “general audience.” For the simplest case, that of two
primary difference is, of course, the social and cultural context of both. This
conjecture has been tested directly (Juola 1997) using the LZ compression
language as one of the test set.2 If the translation process adds information
Language Size (raw) Size (compressed)
English 859,937 231,585
Dutch 874,075 245,296
Finnish 807,179 243,550
French 824,584 235,067
Maori 878,227 221,101
Russian 690,909 226,453
Mean 822,485 235,775
Deviation 70,317 10,222
HEBREW (506,945) (172,956)
true.
Again, we see that the variation in compressed size is much smaller than
the original, and we see that Hebrew is substantially smaller than either.
Of course, written Hebrew omits vowels, but we can see that this is not the
from the ACL. Of the nine versions of 1984, the two smallest are the (original)
Work Language Size
1984 ENGLISH (v2) 228,046
1984 ENGLISH 232,513
1984 Slovene 242,088
1984 Croat (v2) 242,946
1984 Estonian 247,355
1984 Czech 274,310
1984 Hungarian 274,730
1984 Romanian 283,061
1984 Bulgarian 369,482
We thus see that the original versions of corpora tend to be smaller than
tion scholars.
With this framework in hand, we are able to address more specific questions,
As before, we start with some definitions. In particular, we treat the
marked with the -s suffix. The fact that the suffix -ing often signals a present
to the information conveyed by the entire text: for example, one where the
the word order would destroy syntactic relationships (one would no longer
jump     walk     touch     |   8634   139    5543
jumped   walked   touched   |   15     4597   1641
jumping  walking  touching  |   3978   102    6

Figure 3: Example of morphological degradation process
Consider figure 3: by replacing each type in the input corpus with a randomly
chosen symbol (here a decimal number), the simple regularity between the
rows and columns of the left half of the table has been replaced by arbitrary
relationships; where the left half can be easily described by a single row and
column, the entire right half would need to be memorized individually. More
accurately, this process is carried out tokenwise where each token of a given
type is replaced by the same (but randomly chosen) symbol which is unique to
morphological rules have been “degraded” into a situation where all versions
Note, however, that this does not eliminate lexical information. If, in
the original text, “I am” predicted a present participle, the new symbols
that replace “I am” will predict (the set of) symbols which correspond to
and replace present participles. However, because these symbols have been
randomly rewritten, there will be no easy and simple properties to determine
which symbols these are. Examining figure 3 again, the third row contains
the present participles. In the left hand side, all entries in this row are
instantly recognizable by their “-ing” morphology; the right hand side has
way of predicting anything about the new token. Compressing the resulting
substituted file will show the effects on the “complexity,” i.e. the role of
morphological complexity.
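The tokenwise substitution described above can be sketched as follows. This is a minimal reconstruction under stated assumptions: the whitespace tokenizer and the numeric symbol inventory are simplifications of the paper's actual procedure:

```python
import random

def degrade(tokens, seed=0):
    """Tokenwise morphological degradation: every token of a given type is
    replaced by the same randomly chosen symbol, unique to that type, so
    regularities like -ed/-ing are destroyed while lexical identity and
    cooccurrence patterns are preserved."""
    rng = random.Random(seed)
    mapping = {}
    out = []
    for tok in tokens:
        if tok not in mapping:
            sym = str(rng.randrange(10**6))
            while sym in mapping.values():  # keep one symbol per type
                sym = str(rng.randrange(10**6))
            mapping[tok] = sym
        out.append(mapping[tok])
    return out

tokens = "I am walking ; I walked ; you are walking".split()
print(degrade(tokens))
```

Both occurrences of "walking" map to the same symbol, but that symbol no longer bears any systematic relationship to the symbol for "walked"; compressing the substituted file then measures what the lost regularity was worth.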
Performing this substitution has shown (Juola 1998) that languages dif-
fer strongly in the way and to the degree that this substitution affects their
the resulting R/C ratios sort the languages into the order (of increasing com-
few would argue with any measure of morphological complexity that puts
Russian and Finnish near the top (and English near the bottom), there are
Table 3: Size (in bytes) of various samples
Language Uncompressed Comp.(raw) Comp. (“cooked”) R/C Ratio
Dutch 4,509,963 1,383,938 1,391,046 0.994
English 4,347,401 1,303,032 1,341,049 0.972
Finnish 4,196,930 1,370,821 1,218,222 1.12
French 4,279,259 1,348,129 1,332,518 1.01
Maori 4,607,440 1,240,406 1,385,446 0.895
Russian 3,542,756 1,285,503 1,229,459 1.04
the languages in this sample (English, French, and Russian) are also part of
less complex than either Russian or French, which are themselves equivalent.
The ranking here agrees with Nichols’ ranking, placing English below the
other two and French and Russian adjacent. This agreement, together with
servation that (within the studied samples) the ordering produced by the
number of word types in the samples, and identical-reversed with the ordering
Spearman’s rank test yields p < 0.0025, while Kendall’s τ test yields p <
Table 4: R/C ratios with linguistic form counts
Language R/C Types in sample Tokens in sample
Maori 0.895 19,301 1,009,865
English 0.972 31,244 824,364
Dutch 0.994 42,347 805,102
French 1.01 48,609 758,251
Russian 1.04 76,707 600,068
Finnish 1.12 86,566 577,413
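The reported rank agreement can be checked directly against table 4's own values; a sketch computing Spearman's rank correlation by hand (the formula assumes no ties, which holds for this table):

```python
def spearman_rho(xs, ys):
    """Spearman's rank correlation, 1 - 6*sum(d^2)/(n(n^2-1)); no ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank + 1
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Values from table 4, in the order Maori ... Finnish.
rc     = [0.895, 0.972, 0.994, 1.01, 1.04, 1.12]
types  = [19301, 31244, 42347, 48609, 76707, 86566]
tokens = [1009865, 824364, 805102, 758251, 600068, 577413]
print(spearman_rho(rc, types))   # identical ordering -> 1.0
print(spearman_rho(rc, tokens))  # reversed ordering  -> -1.0
```

Within this sample, the R/C ordering is identical to the ordering by type count and exactly reversed against the ordering by token count.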
to have a wide variety of distinct linguistic forms, while languages which are
of types, as in table 5.
almost by definition are few types but many tokens. However, this finding is
use of an extraordinarily wide variety of function word types (thus inflating
the number of types) or that inflates the number of tokens (for example
teen non-English versions, shows, again, that languages are about equally
complex, but that they express their complexity differently at different levels.
Methodologically, this study was slightly different, both in the scope of the
levels studied and in how distortion was performed. All samples were divided
identically into verses (e.g. Genesis 1:2 represented the same “message”
cally by the deletion of 10% of the letters in each verse at random. Samples
English Versions:
  American Standard Version (asv)
  Authorized (King James) Version (av)
  Bible in Basic English (bbe)
  Darby Translation (dby)
  Complete Jewish Bible (jps)
  Revised Standard (rsv)
  Webster’s Revised (rweb)
  Young’s Literal Translation (ylt)

Non-English Versions:
  Bahasa Indonesian (bis)
  Portuguese (brp)
  Haitian Creole (crl)
  French (dby)
  Finnish (fin)
  Hungarian (karoli)
  Dutch (lei)
  German (lut)
  Modern Greek (mgreek)
  Modern Greek [unaccented] (mgreeku)
  French (neg)
  Russian (rst)
  German (sch)
  German (uelb)
  Ukrainian (ukraine)

Table 6: Bible samples and languages used
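The morphological distortion in this study, deleting 10% of the letters in each verse at random, can be sketched as follows (a minimal reconstruction; treating every character of the verse as deletable is a simplification of the paper's procedure):

```python
import random

def delete_letters(verse, fraction=0.10, seed=0):
    """Delete `fraction` of the characters of a verse, chosen at random."""
    rng = random.Random(seed)
    n_del = int(len(verse) * fraction)
    doomed = set(rng.sample(range(len(verse)), n_del))
    return "".join(ch for i, ch in enumerate(verse) if i not in doomed)

verse = "In the beginning God created the heaven and the earth."
out = delete_letters(verse)
print(out)
print(len(verse), "->", len(out))
```

Because each verse is the same "message" across all translations, the same deletion rate can be applied uniformly and the resulting change in compressed size compared across languages.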
terpreted through context. For example, the use of pronouns hinges on the
gether make even heavier use of context. Just as elimination of words can
distort the structure of individual sentences and make them difficult to fol-
Although all three procedures will provably delete the same number of
expected characters from the source files, the effects of compression on these
using the UNIX gzip utility.4
Table 7 shows the compressed and uncompressed sizes (in bytes) of the
of the original and of the various distorted versions. Table 9 normalizes all
results by presenting them as multiples of the original size (e.g. the morpho-
logically distorted version of ’asv’ is 1.17 times its original size when both
are compressed). This table is also sorted in order of decreasing value, and
English versions are labeled in ALL CAPS for ease of identification. Exam-
ining these tables together supports the hypotheses presented earlier. First,
Version Uncompressed Compressed
asv 4280478 1269845
av 4277786 1272510
bbe 4309197 1234147
bis 4588199 1346047
brp 3963452 1273549
crl 4351280 1312252
dby 4227529 1265405
drb 4336255 1337768
fin 4169287 1370042
jps 4309185 1288416
karoli 3957833 1404639
lei 4134479 1356996
lsg 4252648 1347637
lut 4180138 1341866
mgreek 4348255 1468263
mgreeku 4321200 1341237
neg 4188814 1323919
rst 3506401 1269684
rsv 4061749 1253476
rweb 4247431 1262061
sch 4317881 1417428
uelb 4407756 1383337
ukraine 3564937 1315103
ylt 4265621 1265242
conversation-level constructions are handled similarly in all languages stud-
deleting words or phrases from a corpus should result in the overall lowering
overall increase in information — but that is a clear result of table 9. The ex-
alternatively a number of freely varying synonyms for each lexical item. Un-
such as Finnish and Hungarian, ironically, suffer the least penalty because
they have more word forms, and therefore fewer new variants are introduced
in each one. Thus, the results show both that Finnish and Hungarian have
hypothesis, this should be the smallest in compressed size. It is, in fact, about
Ver.. Normal Morph. Prag. Syn.
asv 1269845.00 1487311.21 1162625.76 1235685.14
av 1272510.00 1483580.35 1164269.82 1236878.18
bbe 1234147.00 1461058.00 1130314.66 1207880.69
bis 1346047.00 1601727.60 1229751.14 1282227.44
brp 1273549.00 1452517.41 1164547.00 1226345.93
crl 1312252.00 1518413.44 1200910.23 1285958.47
dby 1265405.00 1474721.85 1158317.63 1229267.18
drb 1337768.00 1557772.69 1224481.67 1295321.34
fin 1370042.00 1548445.81 1251576.37 1294915.38
jps 1288416.00 1504061.39 1179304.44 1253465.97
karoli 1404639.00 1575233.97 1283223.98 1330816.55
lei 1356996.00 1522644.85 1239991.16 1297103.54
lsg 1347637.00 1560146.19 1233249.90 1300432.48
lut 1341866.00 1529869.88 1226982.62 1285579.01
mgreek 1468263.00 1701221.45 1343161.70 1408603.16
mgreeku 1341237.00 1550600.30 1226936.07 1292009.44
neg 1323919.00 1533468.52 1211325.57 1277609.72
rst 1269684.00 1410919.44 1161101.89 1204901.46
rsv 1253476.00 1445795.94 1147694.76 1215960.88
rweb 1262061.00 1471718.69 1154952.77 1227624.14
sch 1417428.00 1604828.85 1295267.52 1353695.77
uelb 1383337.00 1598355.09 1265317.06 1329331.17
ukraine 1315103.00 1451550.59 1201829.66 1247197.05
ylt 1265242.00 1482038.11 1158189.17 1230766.62
Ver.. Morph. Prag. Syn.
ukraine 1.10375 0.913867 0.948365
rst 1.11124 0.914481 0.948977
karoli 1.12145 0.913561 0.947444
lei 1.12207 0.913777 0.955864
fin 1.13022 0.913531 0.945165
sch 1.13221 0.913815 0.955037
lut 1.14011 0.914385 0.958053
brp 1.14053 0.914411 0.962936
RSV 1.15343 0.91561 0.970071
uelb 1.15543 0.914685 0.96096
mgreeku 1.1561 0.914779 0.963297
crl 1.15711 0.915152 0.979963
lsg 1.15769 0.91512 0.964972
neg 1.15828 0.914954 0.965021
mgreek 1.15866 0.914796 0.959367
drb 1.16446 0.915317 0.968271
DBY 1.16541 0.915373 0.971442
AV 1.16587 0.91494 0.971999
RWEB 1.16612 0.915132 0.972714
JPS 1.16737 0.915313 0.972874
ASV 1.17125 0.915565 0.973099
YLT 1.17135 0.915389 0.972752
BBE 1.18386 0.915867 0.978717
bis 1.18995 0.913602 0.952587
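Each entry in table 9 is simply a distorted compressed size from table 8 divided by the undistorted compressed size for the same version; for example, for 'asv' (values copied from tables 7 and 8):

```python
# Normalized distortion cost for the 'asv' version.
normal = 1269845.00                      # compressed, undistorted
morph, prag, syn = 1487311.21, 1162625.76, 1235685.14  # distorted versions

for label, v in (("Morph.", morph), ("Prag.", prag), ("Syn.", syn)):
    print(label, round(v / normal, 6))   # matches the 'asv' row of table 9
```

Ratios above 1 mean the distortion made the text harder to compress (information was added); ratios below 1 mean information was removed.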
average. Readers who wish to regard this as evidence against McWhorter’s
single data point. As discussed in the next section, a much larger experiment
4 Discussion
tion theory can provide a structure for developing such studies. Taking a
the speaker and hearer — one can compare the mathematical properties of
that characterize the language in which the texts are written. By systemat-
ically distorting a single base text, we can establish whether this distortion
The key questions are thus: what are the (mathematical) properties that
can be measured, and how do they relate to the (linguistic) properties of
One possible approach to resolving this is to examine more closely the notion
available, and measures the complexity by the size of this very bound. Since
human memory is known to have a large capacity, it makes intuitive sense
that linear complexity should not well measure aspects of human cognition –
complexity measurements.
such as LZ-compression.
that will be useful for the quite practical problem of text compression. With
the ever increasing amount of text available (Nerbonne 2004), the ability to
store ideas in a way that matches the efficiency of the human mind is almost
theory represents human performance under ideal conditions, studying how
plain aspects of human performance under less than ideal conditions, includ-
ing both situational and cognitive degradation. This may improve science’s
Briefly to recap the findings of the previous sections: the question of “lin-
multiple translations of the same basic text (for example, the Bible or the
novel 1984), one can see whether “all languages are about equally complex”
show correspondingly less plausible results. In particular, the variation in
translated texts need to be more explicit so that the reader can understand
translated texts are more complex, in this framework, than their originals.
measure the role that type of expression plays in the overall complexity of
language. Using this framework, studies have shown that languages differ
ity and vice versa. This is compatible with the tool-and-effort framework of
ogy need not encode the same information syntactically. A further finding —
reflect the underlying cognitive universals of the human mind/brain system.
These findings only scratch the surface of what could be explored un-
der this framework. In particular, McWhorter’s theory that creoles are less
carefully, one would want a text sample, written in one language and inde-
Assuming that the same concepts are expressed in each translation, any sig-
theoretical observations.
measurements. The three concepts presented above are only a few of the
Beyond this, the primary difference between the intuitively plausible re-
sults of the Ziv-Lempel metric and the implausible ones of linear complexity
memory and the lexicon. Can we find other methods that illustrate or demon-
linguistics has been dealing with the results of corpus-oriented studies; the
useful tool to help in this inference and bridge the gap between the paper
Notes
1. More formally, Shannon demonstrated that, given a source S, capable of sending any

H(S) = \sum_{i=1}^{N} p_i \cdot l_i    (1)

where p_i is the probability that message i is sent, and l_i is the length of message i.
The optimum lengths are when l_i is equal to the negative logarithm of p_i. Thus,
H(S) = -\sum_{i=1}^{N} p_i \log_2(p_i)    (2)
2. The Bible as a whole is actually somewhat problematic in this regard as there is no single original language. For this reason, we momentarily restrict our attention to the first
3. This project is described at http://nl.ijs.si/ME/Corpus/mte-D21M/mte-D21M.html
4. Another analysis, using a slightly different compression scheme (bzip2, using the Burrows-Wheeler transform), obtained similar results but will not be discussed further.
References
Juola, P. (2000, August). A linear model of complexity (and its
WI.
mars. Linguistic Typology 6, 125–166.
Press.
Nerbonne, J. (2004, June). The data deluge. In Proc. 2004 Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities.
Schneier, B. (1996). Applied Cryptography: Protocols, Algorithms and Source Code in C. New York: John Wiley and Sons, Inc.
Shannon, C. E. (1948). A mathematical theory of communication. Bell System Technical Journal 27, 379–423, 623–656.
Zipf, G. K. (1949). Human Behavior and the Principle of Least Effort. New
23 (3), 373–343.