2020 Acl-Main 181

What Determines the Order of Adjectives in English?
Comparing
Efficiency-Based Theories Using Dependency Treebanks
Richard Futrell William Dyer Gregory Scontras

University of California, Irvine Oracle Corporation University of California, Irvine
[email protected] [email protected] [email protected]
Abstract of syntax. These theories provide mathematical

We take up the scientific question of what models that can describe the distribution of words
determines the preferred order of adjectives in sentences and the way those words combine to
in English, in phrases such as big blue box yield the meaning of a sentence, in a way that cap-
where multiple adjectives modify a following tures the fine-grained quantitative patterns observ-
noun. We implement and test four quantita- able in large text datasets (Manning, 2003; Bres-
tive theories, all of which are theoretically mo- nan et al., 2007; Chen and Ferrer-i-Cancho, 2019).
tivated in terms of efficiency in human lan-
Quantitative syntactic theories are often
guage production and comprehension. The
four theories we test are subjectivity (Scon-
efficiency-based, meaning that they model word
tras et al., 2017), information locality (Futrell, distributions as the result of a process that tries to
2019), integration cost (Dyer, 2017), and in- maximize information transfer while minimizing
formation gain, which we introduce. We eval- some measure of cognitive cost; as a result, they
uate theories based on their ability to predict often use the mathematical language of informa-
orders of unseen adjectives in hand-parsed and tion theory. Such theories promise not only to
automatically-parsed dependency treebanks. describe distributions of words, but also to explain
We find that subjectivity, information locality,
why they take the shape they do, by viewing
and information gain are all strong predictors,
with some evidence for a two-factor account, human language as an efficient code subject to
where subjectivity and information gain reflect appropriate constraints. This work informs NLP
a factor involving semantics, and information by providing a theory of language structure that
locality reflects collocational preferences. integrates with data-driven, optimization-based
machine learning models.
1 Introduction
Adjective order is a fruitful empirical target for
Across languages, there exist strong and stable quantitative theories of syntax because it is an area
constraints on the order of adjectives when more where the traditional discrete and symbolic the-
than one is used to modify a noun (Dixon, 1982; ories become highly complex, and a quantitative
Sproat and Shih, 1991). For example, in English, approach becomes more attractive. For example,
big blue box sounds natural and appears relatively in the formal syntax literature, a standard expla-
frequently in corpora, while blue big box sounds nation for adjective order constraints is that each
less natural and occurs less frequently (Scontras adjective belongs to a certain semantic class (e.g.,
et al., 2017). In this paper, we take up the scien- COLOR or SIZE ) and that there exists a universal
tific question of what explains these constraints in total order on these semantic classes (e.g., COLOR
natural language. To do so, we implement quanti- < SIZE) shared among all languages, which deter-
tative models that have been proposed in previous mines the order of adjectives in any given instance
literature as explanations for these constraints, and (Cinque, 1994; Scott, 2002). Such discrete theo-
compare their accuracy in predicting adjective or- ries of adjective order become complex rapidly as
dering data in parsed corpora of English1 . the number of semantic classes to be posited be-
In the last few years, adjective order has become comes large (upwards of twelve in Scontras et al.
a crucial testing ground for quantitative theories 2017) and more fine-grained (see Bar-Sever et al.
1
All code and data are available at https://github. 2018 for discussion of the learning problem posed
com/langprocgroup/adjorder. by such classifications).
2003
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 2003–2012
July 5 - 10, 2020. c 2020 Association for Computational Linguistics
In contrast, quantitative syntax theories typi- tions concerning the composition of meaning in
cally identify a single construct that grounds out in multi-adjective strings, subjectivity-based order-
real-valued numerical scores given to adjectives, ings lead to a greater probability of successful ref-
which determine their ordering preferences. These erence resolution; the authors thus offer an evolu-
scores can be estimated based on large-scale cor- tionary explanation for the role of subjectivity in
pus data or based on human ratings. In what fol- adjective ordering (see also Simonič, 2018; Hahn
lows, we test the predictions of four such theories: et al., 2018; Scontras et al., 2019).
the subjectivity hypothesis (Scontras et al., 2017;
Simonič, 2018; Hahn et al., 2018; Franke et al., 2.2 Information locality
2019; Scontras et al., 2019), the information lo-
cality hypothesis (Futrell and Levy, 2017; Futrell The theory of information locality holds that
et al., 2017; Hahn et al., 2018; Futrell, 2019), the words that have high mutual information are un-
integration cost hypothesis (Dyer, 2017), and the der pressure to be close to each other in linear or-
information gain hypothesis, which we introduce. der (Futrell and Levy, 2017; Futrell et al., 2017).
We begin with a presentation of the details Information locality is a generalization of the well-
of each theory, then implement the theories and supported principle of dependency length mini-
test their predictions against large-scale naturalis- mization (Liu et al., 2017; Temperley and Gildea,
tic data from English. In addition to comparing 2018). In the case of adjective ordering, the pre-
the predictors in terms of accuracy, we also per- diction is simply that adjectives that have high
form a number of analyses to determine the impor- pointwise mutual information (PMI) with their
tant similarities and differences among their pre- head noun will tend to be closer to that noun. The
dictions. The paper concludes with a discussion PMI of an adjective a and a noun n is (Fano, 1961;
of what our results tell us about adjective order and Church and Hanks, 1990):
related issues, and a look towards future work.
p(a, n)
PMI(a : n) ≡ log . (1)
2 Theories of adjective order p(a)p(n)
2.1 Subjectivity
In this paper, we take the relevant joint distribu-
Scontras et al. (2017) show that adjective order tion p(a, n) to be the distribution of adjectives
is strongly predicted by adjectives’ subjectivity and nouns in a dependency relationship, with the
scores: an average rating obtained by asking hu-
P
marginals calculated as p(a) = n p(a, n) and
man participants to rate adjectives on a numerical
P
p(n) = a p(a, n).
scale for how subjective they are. Adjectives that Information locality is motivated as a conse-
are rated as more subjective typically appear far- quence of a more general theory of efficiency
ther from the noun than adjectives rated as less in human language. In this theory, languages
subjective, and the strength of ordering prefer- should maximize information transfer while mini-
ences tracks the subjectivity differential between mizing cognitive information-processing costs as-
two adjectives. For example, in big blue box, the sociated with language production and compre-
adjective big has a subjectivity rating of 0.64 (out hension. Information locality emerges from these
of 1), and the adjective blue has a subjectivity rat- theories when we assume that the relevant measure
ing of 0.30. If adjectives are placed in order of de- of information-processing cost is the surprisal of
creasing subjectivity, then big must appear before words given lossy memory representations (Hale,
blue, corresponding to the preferred order. The no- 2001; Levy, 2008; Smith and Levy, 2013; Futrell
tion of subjectivity as a predictor of adjective order and Levy, 2017; Futrell, 2019).
was previously introduced by Hetzron (1978).
Subsequent work has attempted to explain the
2.3 Integration Cost
role of subjectivity in adjective ordering by ap-
pealing to the communicative benefit afforded by The theory of integration cost is also based in
ordering adjectives with respect to decreasing sub- the idea of efficiency with regard to information-
jectivity. For example, Franke et al. (2019) use processing costs. It differs from information lo-
simulated reference games to demonstrate that, cality in that it assumes that the correct metric of
given a set of independently-motivated assump- processing difficulty for a word w is the entropy
2004
over the possible heads of w: shrinks. In line with the logic for integration cost,
we propose that the word with smaller informa-
Cost(w) ∝ H[T |w] tion gain will be placed earlier, so that the set of
X
= −pT (t|w) log pT (t|w), (2) referents is gradually narrowed by each word.
t As generally implemented in decision trees,
information gain refers to the reduction of en-
where T is a random variable indicating the head
tropy obtained from partitioning a set on a feature
t of the word w (Dyer, 2017). This notion of cost
(Quinlan, 1986). In our case, the distribution of
captures the amount of uncertainty that has to be
nouns N is partitioned on a given adjective a, cre-
resolved about the proper role of the word w with
ating two partitions: Na and its complement Na c .
respect to the rest of the words in the sentence.
The difference between the starting entropy H[N ]
Like information locality, the theory of integration
and the sum of the entropy of each partition, con-
cost recovers dependency length minimization as
ditioned on the size of that partition, is the infor-
a special case. For the case of predicting adjective
mation gain of a:
order, the prediction is that an adjective a will be
closer to a noun when it has lower integration cost: IG(a) = H[N ]
|Na c | (4)

|Na | c
IC(a) = H[N |a], (3) − H[Na ] + H[Na ] .
|N | |N |
where N is a random variable ranging over nouns.
Integration cost corresponds to an intuitive idea Information gain is therefore comprised of both
previously articulated in the adjective ordering lit- positive and negative evidence. That is, specify-
erature. The idea is that adjectives that can mod- ing an adjective such as big partitions the proba-
ify a smaller set of nouns appear closer to the bility distribution of nouns into Nbig , the subset of
noun: for example, an order such as big wooden N which takes big as a dependent, and Nbig C , the
spoon is preferred over wooden big spoon be- subset of N which does not.
cause the word big can modify nearly any noun, Crucially, H[Na ] is not H[N |a] in general.
while wooden can only plausibly modify a small H[N |a] is the conditional entropy of nouns given a
set of nouns (Ziff, 1960). The connection be- specific adjective, while H[Na ] is the entropy of a
tween integration cost and set size comes from distribution over nouns whose support is limited to
the information-theoretic notion of the typical set noun types that have been observed to occur with
(Cover and Thomas, 2006, pp. 57–71); the entropy an adjective a. Combined with H[Na c ], informa-
of a random variable can be interpreted as the (log) tion gain tells us how much the entropy of N is
cardinality of the typical set of samples from that reduced by partitioning on a. This means that in-
variable. When we order adjectives by integra- formation gain and integration cost, while concep-
tion cost, this is equivalent to ordering them such tually similar, are not mathematically equivalent.
that adjectives that can modify a larger typical set To our knowledge, information gain has not
of nouns appear farther from the noun. The re- been previously suggested as a predictor of ad-
sult is that each adjective gradually reduces the en- jective ordering, although Danks and Glucksberg
tropy of the possible nouns to follow, thus avoid- (1971) expressed a similar intuition in proposing
ing information-processing costs that may be asso- that adjectives are ordered according to their ‘dis-
ciated with entropy reduction (Hale, 2006, 2016; criminative potential’. Although decision-tree al-
Dye et al., 2018). gorithms such as ID3 choose the highest-IG fea-
ture first, we predict that the lower-information-
2.4 Information gain gain adjective will precede the higher one.
We propose a new efficiency-based predictor of
3 Related Work
adjective order: information gain. The idea is to
view the noun phrase, consisting of prenominal Previous corpus studies of adjective order include
adjectives followed by the noun, as a decision tree Malouf (2000), who examined methods for or-
for identifying a referent, where each word parti- dering adjectives in a natural language generation
tions the space of possible referents. Each parti- context, and Wulff (2003), who examined effects
tioning is associated with some information gain, of phonological length, syntactic category ambi-
indicating how much the set of possible referents guity, semantic closeness, adjective frequency, and
2005
a measure similar to PMI called noun specificity. First, we extract adjective–noun (AN) pairs: a
Our work differs from this previous work by fo- set of pairs hA, N i where A is an adjective and
cusing on recently-introduced predictors that have N is a noun and N is the head of A with depen-
theoretical motivations grounded in efficiency and dency type amod. As in Futrell (2019), we de-
information theory. fine A as an adjective iff its part-of-speech is JJ
The theories we test here (except information and its wordform is listed as an adjective in the
gain) have been tested in previous corpus studies, English CELEX database (Baayen et al., 1995).
but never compared against each other. Scontras We define N as a noun iff its part-of-speech is
et al. (2017) validate that subjectivity is a good NN or NNS and its wordform is listed as a noun in
predictor of adjective order in corpora, and Hahn CELEX. These AN pairs are used to estimate the
et al. (2018) and Futrell et al. (2019) evaluate information-theoretic predictors that we are inter-
both information locality and subjectivity. Dyer ested in. We extracted 33,210,207 adjective–noun
(2018) uses integration cost to model the order of pairs from the parsed Common Crawl corpus.
same-side sibling dependents cross-linguistically Second, we extract adjective–adjective–noun
and across all syntactic categories. (AAN) triples: a set of triples hA1 , A2 , N i where
A1 and A2 are adjectives as defined above, and A1
4 Methods and A2 are both adjective dependents with relation
type amod of a single noun head N . Furthermore,
Our task is to find predictors of adjective order
A1 and A2 must not have any further dependents,
based solely on data about individual adjectives
and they must appear in the order A1 A2 N in the
and nouns. More formally, the goal is to find
corpus with no intervening words. We extracted
a scoring function S(A, N ) applying to an ad-
a total of 842,714 AAN triples from the parsed
jective A and a noun N , such that the order of
Common Crawl corpus.
two adjectives modifying a noun A1 A2 N can be
predicted accurately by comparing S(A1 , N ) and The values of all corpus-based predictors are es-
S(A2 , N ). Furthermore, the scoring function S timated using the AN pairs. The AAN triples are
should not include information about relative or- used only for fitting regressions from the predic-
der in observed sequences of the form A1 A2 N — tors to adjective orders, and for evaluation.
the scoring function should be based only on cor-
pus data about co-occurrences of A and N , or on Ratings-based predictors We gathered subjec-
human ratings about A and/or N . We apply this tivity ratings for all 398 adjectives present in AAN
restriction because our goal is to evaluate scientific triples in the English UD corpus. These subjec-
theories of why adjectives are ordered the way they tivity ratings were collected over Amazon.com’s
are, rather than to achieve maximal raw accuracy. Mechanical Turk, using the methodology of Scon-
tras et al. (2017). 264 English-speaking partici-
4.1 Data sources pants indicated the subjectivity of 30 random ad-
Corpus-based predictors We estimate jectives by adjusting a slider between endpoints
information-theoretic quantities for adjectives labeled “completely objective” (coded as 0) and
using a large automatically-parsed subsection of “completely subjective” (coded as 1). Each adjec-
the English Common Crawl corpus (Buck et al., tive received an average of 20 ratings.
2014; Futrell et al., 2019). The use of a parsed
corpus is necessary to identify adjectives that are Test set As a held-out test set for our predictors,
dependents of nouns in order to calculate PMI we use the English Web Treebank (EWT), a hand-
and IC. As described in Futrell et al. (2019), this parsed corpus, as contained in Universal Depen-
corpus was produced by heuristically filtering dencies (UD) v2.4 (Silveira et al., 2014; Nivre,
Common Crawl to contain only full sentences and 2015). Following our criteria, we extract 155
to remove web boilerplate text, and then parsing AAN triples having scores for all our predictors.
the resulting text using SyntaxNet (Andor et al., Because this test set is very small, we also evaluate
2016), obtaining a total of ∼1 billion tokens of against a held-out portion of the parsed Common
automatically parsed web text. In this work, we Crawl data. In the Common Crawl test set, after
use a subset of this corpus, described below. including only AAN triples that have scores for all
From this corpus, we extract two forms of data. of our predictors, we have 41,822 AAN triples.
2006
4.2 Estimation of predictors strictly speaking, it is not necessary to fit these pre-
Our information-theoretic predictors require esti- dictors to any training data: we can evaluate our
mates of probability distributions over adjectives theories based on their a priori predictions simply
and nouns. To estimate these probability distribu- by asking how accurately we can predict the or-
tions, we first use maximum likelihood estimation der of adjectives in AAN triples based on the rules
as applied to counts of wordforms in AN pairs. We above.
call these estimates wordform estimates. However, we can get a deeper picture of the per-
Although maximum likelihood estimation is formance of our predictors by using them in classi-
sufficient to give an estimate of the general entropy fiers for adjective order. By fitting classifiers using
of words (Bentz et al., 2017), it is not yet clear that our predictors, we can easily extend our models
it gives a good measure for conditional entropy or to ones with multiple predictors, in order to de-
mutual information, due to data sparsity, even with termine if a combined set of the predictors gives
millions of tokens of text (Futrell et al., 2019). increased accuracy over any one.
Therefore, as a second method that alleviates Logistic regression method We fit logistic re-
the data sparsity issue, we also calculate our gressions to predict adjective order in AAN triples
probability distributions not over raw wordforms using our predictors. Our goal is to predict the
but over clusterings of words in an embedding order of the triple from the unordered set of the
space, a method which showed promise in Futrell two adjectives {A1 , A2 } and the noun N . To do
et al. (2019). To derive word clusters, we use so, we consider the adjectives in lexicographic
sklearn.cluster.KMeans applied to a pre- order: Given an AAN triple, let A1 denote the
trained set of 1.9 million 300-dimension GloVe lexicographically-first adjective, and A2 the sec-
vectors2 generated from the Common Crawl cor- ond. Then any given AAN triple is either of the
pus (Pennington et al., 2014). We classify ad- form hA1 , A2 , N i or hA2 , A1 , N i. We fit a logis-
jectives into kA = 300 clusters and nouns into tic regression to predict this order given the differ-
kN = 1000 clusters. These numbers k were found ence in the values of the predictors for the two ad-
by choosing the largest k multiple of 100 that did jectives. That is, we fit a logistic regression of the
not result in any singleton clusters. We then es- form in Figure 1. This method of fitting a classifier
timated probabilities p(a, n) by maximum likeli- to predict order data was used previously in Mor-
hood estimation after replacing wordforms a and gan and Levy (2016). Based on theoretical consid-
n with their cluster indices. erations and previous empirical results, we expect
This clustering method alleviates data sparsity that the fitted values of β1 will be negative for PMI
by reducing the size of the support of the distri- and positive for IC and subjectivity. The regres-
butions over adjectives and nouns, to kA and kN sion in Figure 1 can easily be extended to include
respectively, and by effectively spreading prob- multiple predictors, with a separate β for each.
ability mass among words with similar seman-
tics. The clusters might also end up recapitulating Evaluation metrics We evaluate our models us-
the semantic categories that have played a role in ing raw accuracy in predicting the order of held-
more traditional syntactic theories of adjective or- out AAN triples. We also calculate 95% confi-
der (Dixon, 1982; Cinque, 1994; Scott, 2002). We dence intervals on these accuracies, indicating our
call these estimates cluster estimates. uncertainty about how the accuracy would change
in repeated experiments. Following standard ex-
4.3 Evaluation perimental practice, if we find that two predictors
Fitting predictors to data Most of our individ- achieve different accuracies, but their confidence
ual predictors come along with theories that say intervals overlap, then we conclude that we do not
what their effect on adjective order should be. Ad- have evidence that their accuracies are reliably dif-
jectives with low PMI should be farther from the ferent. We say a difference in accuracy between
noun, adjectives with high IC should be farther predictors is significant if the 95% confidence in-
from the noun, and adjectives with high subjec- tervals do not overlap.
tivity should be farther from the noun. Therefore, Evaluation on held-out hand-parsed data It
2
http://nlp.stanford.edu/data/glove. is crucial that we not evaluate solely on
42B.300d.zip automatically-parsed data. The reason is that both
2007
p(hA1 , A2 , N i)
log = β0 + β1 (S(A1 , N ) − S(A2 , N )) + ǫ
p(hA2 , A1 , N i)
Figure 1: Logistic regression for adjective order. The function S(A, N ) is the predictor to be evaluated, β0 and β1
are the free parameters to be fit, and ǫ is an error term to be minimized.
PMI and IC, as measures of the strength of sta- investigate whether our predictors might be mak-
tistical association between nouns and adjectives, ing independent contributions to explaining adjec-
could conceivably double as predictors of pars- tive order, we fit logistic regressions containing
ing accuracy for automatic dependency parsers. If multiple predictors. If the best accuracy comes
that is the case, then we might observe that AAN from a model with two or more predictors, then
triples with low PMI or high IC are rare in auto- this would be evidence that these two predictors
matically parsed data. However, this would not be are picking up on separate sources of information
a consequence of any interesting theory of cogni- relevant for predicting adjective order.
tive cost, but rather simply an artifact of the auto- We conducted logistic regressions using all sets
matic parser used. To avoid this confound, we in- of two of our predictors. The top 5 such mod-
clude an evaluation based on held-out hand-parsed els, in terms of Common Crawl test set accuracy,
data in the form of the English Web Treebank. are shown in Table 2. The best two are clus-
ter/wordform subjectivity and wordform PMI, fol-
5 Results
lowed by cluster subjectivity and cluster informa-
Table 1a shows the accuracies of our predictors in tion gain. No set of three predictors achieves sig-
predicting held-out adjective orders in the Com- nificantly higher accuracy than the best predictors
mon Crawl test set, visualized in Figure 2a. We shown in Table 2.
find that the pattern of results depends on whether
predictors are estimated based on wordforms or
based on distributional clusters. When estimat- 5.2 Qualitative analysis
ing based on wordforms, we find that subjectivity
and PMI have the best accuracy. When estimating We manually examined cases where each model
based on clusters, the accuracy of PMI drops, and made correct and incorrect predictions in the hand-
the best predictor is subjectivity, with IG close be- parsed EWT data. Table 3a shows example AAN
hind. We find a negative logistic regression weight triples that were ordered correctly by PMI, but not
for information gain, indicating that the adjective by subjectivity. These are typically cases where a
with lower information gain is placed first. certain adjective–noun pair forms a common col-
This basic pattern of results is confirmed in the location whose meaning is in some cases even
hand-parsed EWT data. Accuracies of predictors noncompositional; for example, “bad behaviors”
on the EWT test set are shown in Table 1b and vi- is a common collocation when describing train-
sualized in Figure 2b. When estimating based on ing animals, and “ulterior motives” and “logical
wordforms, the best predictors are subjectivity and fallacy” are likewise common English colloca-
PMI, although the confidence intervals of all pre- tions. In contrast, when subjectivity makes the
dictors are overlapping. When estimating based right prediction and PMI makes the wrong predic-
on clusters, IG has the best performance, and PMI tion, these are often cases where a word pair which
again drops in accuracy. For this case, IG, IC, and normally would form a collocation is broken up
subjectivity all have overlapping confidence inter- by another adjective, such as “dear sick friend”,
vals, so we conclude that there is no evidence that where “dear friend” is a common collocation.
one is better than the other. However, we do have We also performed a manual qualitative anal-
evidence that IG and IC are more accurate than ysis to determine the contribution of information
PMI when estimated based on clusters. gain beyond subjectivity and PMI. Table 3b shows
examples of such cases from the EWT. Many of
5.1 Multivariate analysis these seem to be cases with weak preferences,
Adjective order may be determined by multiple where both the attested order and the the flipped
separate factors operating in parallel. In order to order are acceptable (e.g., “tiny little kitten”).
2008
Predictor Accuracy Conf. Interval Predictor Accuracy Conf. Interval
Subj. (cluster) .661 [.657, .666] IG (cluster) .737 [.668, .806]
PMI (wordform) .659 [.654, .664] Subj. (wordform) .724 [.654, .795]
Subj. (wordform) .659 [.654, .664] IC (cluster) .705 [.633, .777]
IG (cluster) .650 [.645, .654] Subj. (cluster) .692 [.620, .765]
IC (wordform) .642 [.634, .646] PMI (wordform) .667 [.592, .741]
IG (wordform) .640 [.635, .645] IC (wordform) .641 [.566, .717]
IC (cluster) .613 [.608, .618] IG (wordform) .603 [.526, .680]
PMI (cluster) .606 [.601, .610] PMI (cluster) .590 [.512, .667]
(a) Common Crawl (N = 41822). (b) Hand-parsed EWT (N = 155). All confidence inter-
vals overlap, other than cluster-based PMI and IG.
Table 1: Accuracies of the predictors on AAN triples in the held-out test data.
Wordforms Clusters Wordforms Clusters

1.00 1.00
0.75 0.75
EWT accuracy
CC accuracy
0.50 0.50
0.25 0.25
0.00 0.00
IC
IG
ec I
IC
IG
ec I
IC
IG
IC
IG
ec I
ity
ity
ity
ity
PM
PM
PM
PM
tiv
tiv
tiv
tiv
ec
bj
bj
bj
bj
Su
Su
Su
Su
(a) Common Crawl (N = 41822). (b) Hand-parsed EWT (N = 155)
Figure 2: Accuracies of predictors on AAN triples in the held-out test data, with 95% confidence intervals shown.
Predictor Accuracy Conf. Interval

Subj. (cluster) + PMI (wordform) .723 [.719, .727]
Subj. (wordform) + PMI (wordform) .713 [.708, .717]
Subj. (cluster) + IG (cluster) .699 [.695, .703]
Subj. (cluster) + IC (cluster) .690 [.686, .695]
IG (cluster) + IC (cluster) .684 [.680, .689]
Table 2: Common Crawl test set accuracy of the top 5 models combining two predictors.
2009
A1 A2 N A1 A2 N
major bad behaviors tiny little kitten
large outstanding debts correct legal name
classical logical fallacy chronic intractable pain
dark ulterior motives radical religious politics
minor fine tuning lonely eerie place
(a) Ordered correctly by wordform PMI, but not by word- (b) Ordered correctly by cluster-based information gain,
form subjectivity. but not by cluster-based subjectivity nor PMI.
Table 3: Selected examples of AAN triples ordered incorrectly by our models, from the EWT test set.
5.3 Interpretation This study provides a framework for evaluat-

Our results broadly support the following interpre- ing further theories of adjective order, and for
tation. Adjective ordering preferences are largely evaluating the theories given here against new
determined by a semantic factor that can be quan- data from dependency treebanks. Generalizing to
tified variously using wordform subjectivity or larger datasets of English is straightforward. More
distributional-cluster-based estimates of informa- excitingly, we now have the opportunity to bring
tion gain. In addition to this factor, another fac- new languages into the fold. The vast majority of
tor is in play: when an adjective–noun pair forms research on adjective ordering, and all the corpus
a collocation with a possibly non-compositional work to our knowledge, has been done on English,
meaning, then the adjective in this pair will tend where adjectives almost always come before the
to be placed next to the noun. This latter factor is noun. Studying other typologically-distinct lan-
measured by PMI. This interpretation matches that guages provides an opportunity to disentangle the
of Hahn et al. (2018), who found separate contri- theories that we studied here in a way that cannot
butions from PMI and a model-based operational- be done in English.
ization of subjectivity.
The available behavioral evidence suggests that
Our interpretation is supported by the following
mirror-image preferences (e.g., “box blue big”)
points from the analysis above. First, among pre-
may be the norm in post-nominal adjective lan-
dictors based solely on wordforms, the best accu-
guages (Martin, 1969; Scontras et al., 2020). In-
racy is obtained by a combination of subjectivity
formation locality, subjectivity, and integration
and PMI. Second, when we turn to estimates based
cost make precisely that prediction, though none
on clusters, two things happen: the accuracy of
addresses mixed-type languages in which adjec-
PMI drops, and the accuracy of information gain
tives can precede or follow nouns. It is an open
increases while the accuracy of subjectivity stays
question how to implement IG for these post-
about the same. This pattern of results suggests
or mixed-placement adjectives; one possibility is
that PMI is measuring a factor that has more to do
to measure the information gained when the set
with specific wordforms, while IG and subjectiv-
of adjectives associated to a noun An is parti-
ity are measuring a factor that has more to do with
tioned by an adjective a. In that case, the predic-
semantic uncertainty about the noun or about the
tions about post-nominal order could differ sub-
relationship between the adjective and the noun.
stantially from the predictions about pre-nominal
6 Conclusion order.
We examined a number of theoretically-motivated Our dependency-treebank-based methods can

predictors of adjective order in dependency tree- be applied to any other corpus of any language,
bank corpora of English. We found that the pre- provided it has enough data in the form of
dictors have comparable accuracy, but that it is adjective–noun pairs to get reliable estimates of
possible to identify two broad factors: a seman- the information-theoretic predictors. Such stud-
tic factor variously captured by subjectivity scores ies will be crucial to achieve a complete compu-
and information gain based on word clusters, and tational understanding of natural language syntax.
a wordform-based factor captured by PMI.
2010
References Robert M. W. Dixon. 1982. Where have all the ad-
jectives gone? And other essays in semantics and
Daniel Andor, Chris Alberti, David Weiss, Aliaksei syntax. Mouton, Berlin, Germany.
Severyn, Alessandro Presta, Kuzman Ganchev, Slav
Petrov, and Michael Collins. 2016. Globally nor- Melody Dye, Petar Milin, Richard Futrell, and Michael
malized transition-based neural networks. In Pro- Ramscar. 2018. Alternative solutions to a language
ceedings of the 54th Annual Meeting of the Associa- design problem: The role of adjectives and gender
tion for Computational Linguistics (Volume 1: Long marking in efficient communication. Topics in cog-
Papers), pages 2442–2452, Berlin, Germany. Asso- nitive science, 10(1):209–224.
ciation for Computational Linguistics.
William Dyer. 2018. Integration complexity and the
R. Harald Baayen, Richard Piepenbrock, and Leon Gu- order of cosisters. In Proceedings of the Second
likers. 1995. The CELEX Lexical Database. Release Workshop on Universal Dependencies (UDW 2018),
2 (CD-ROM). Linguistic Data Consortium, Univer- pages 55–65, Brussels, Belgium. Association for
sity of Pennsylvania. Computational Linguistics.
Galia Bar-Sever, Rachael Lee, Gregory Scontras, and
William E. Dyer. 2017. Minimizing integration cost:
Lisa S. Pearl. 2018. Little lexical learners: Quantita-
A general theory of constituent order. Ph.D. thesis,
tively assessing the development of adjective order-
University of California, Davis, Davis, CA.
ing preferences. In 42nd annual Boston University
Conference on Language Development, pages 58–
71. Robert M. Fano. 1961. Transmission of Information:
A Statistical Theory of Communication. MIT Press,
Christian Bentz, Dimitrios Alikaniotis, Michael Cambridge, MA.
Cysouw, and Ramon Ferrer-i-Cancho. 2017. The
entropy of words—Learnability and expressivity Michael Franke, Gregory Scontras, and Mihael Si-
across more than 1000 languages. Entropy, 19:275– monič. 2019. Subjectivity-based adjective ordering
307. maximizes communicative success. In Proceedings
of the 41st Annual Meeting of the Cognitive Science
Joan Bresnan, Anna Cueni, Tatiana Nikitina, and Har- Society, pages 344–350.
ald Baayen. 2007. Predicting the dative alternation.
In Cognitive Foundations of Interpration, pages 69– Richard Futrell. 2019. Information-theoretic locality
94. Royal Netherlands Academy of Science, Ams- properties of natural language. In Proceedings of
terdam. the First International Conference on Quantitative
Syntax, pages 2–15, Paris.
Christian Buck, Kenneth Heafield, and Bas Van Ooyen.
2014. N-gram counts and language models from the Richard Futrell and Roger Levy. 2017. Noisy-
common crawl. In LREC, volume 2, page 4. Cite- context surprisal as a human sentence processing
seer. cost model. In Proceedings of the 15th Confer-
ence of the European Chapter of the Association for
Xinying Chen and Ramon Ferrer-i-Cancho, editors. Computational Linguistics: Volume 1, Long Papers,
2019. Proceedings of the First Workshop on Quanti- pages 688–698, Valencia, Spain.
tative Syntax (Quasy, SyntaxFest 2019). Association
for Computational Linguistics, Paris, France. Richard Futrell, Roger Levy, and Edward Gibson.
2017. Generalizing dependency distance: Comment
Kenneth Ward Church and Patrick Hanks. 1990. Word on “dependency distance: A new perspective on syn-
association norms, mutual information, and lexicog- tactic patterns in natural languages” by haitao liu et
raphy. Computational Linguistics, 16(1):22–29. al. Physics of Life Reviews, 21:197–199.
Guglielmo Cinque. 1994. On the evidence for partial Richard Futrell, Peng Qian, Edward Gibson, Evelina
N-movement in the Romance DP. In R S Kayne, Fedorenko, and Idan Blank. 2019. Syntactic de-
G Cinque, J Koster, J.-Y. Pollock, Luigi Rizzi, pendencies correspond to word pairs with high mu-
and R Zanuttini, editors, Paths Towards Universal tual information. In Proceedings of the Fifth In-
Grammar. Studies in Honor of Richard S. Kayne, ternational Conference on Dependency Linguistics
pages 85–110. Georgetown University Press, Wash- (DepLing 2019), Paris.
ington DC.
Michael Hahn, Judith Degen, Noah Goodman,
Thomas M. Cover and J. A. Thomas. 2006. Elements of Daniel Jurafsky, and Richard Futrell. 2018. An
Information Theory. John Wiley & Sons, Hoboken, information-theoretic explanation of adjective order-
NJ. ing preferences. In Proceedings of the 40th Annual
Meeting of the Cognitive Science Society (CogSci).
J. H. Danks and S. Glucksberg. 1971. Psychological
scaling of adjective orders. Journal of Verbal Learn- John Hale. 2006. Uncertainty about the rest of the sen-
ing and Verbal Behavior, 10(1):63–67. tence. Cognitive science, 30(4):643–672.
2011
John T. Hale. 2001. A probabilistic Earley parser as a Gregory Scontras, Judith Degen, and Noah D. Good-
psycholinguistic model. In Proceedings of the Sec- man. 2019. On the grammatical source of adjective
ond Meeting of the North American Chapter of the ordering preferences. Semantics and Pragmatics.
Association for Computational Linguistics and Lan-
guage Technologies, pages 1–8. G.-J. Scott. 2002. Stacked adjectival modification and
the structure of nominal phrases. In G Cinque, edi-
John T. Hale. 2016. Information-theoretical complex- tor, The cartography of syntactic structures, Volume
ity metrics. Language and Linguistics Compass, 1: Functional structure in the DP and IP, pages 91–
10(9):397–412. 120. Oxford University Press, Oxford.
R. Hetzron. 1978. On the relative order of adjec- Natalia Silveira, Timothy Dozat, Marie-Catherine
tives. In H. Seller, editor, Language Universals. de Marneffe, Samuel Bowman, Miriam Connor,
Narr, Tübingen, Germany. John Bauer, and Christopher D. Manning. 2014. A
gold standard dependency corpus for English. In
Roger Levy. 2008. Expectation-based syntactic com- Proceedings of the Ninth International Conference
prehension. Cognition, 106(3):1126–1177. on Language Resources and Evaluation (LREC-
2014).
Haitao Liu, Chunshan Xu, and Junying Liang. 2017.
Dependency distance: A new perspective on syntac- Mihael Simonič. 2018. Functional explanation of ad-
tic patterns in natural languages. Physics of Life Re- jective ordering preferences using probabilistic pro-
views, 21:171–193. gramming. Master’s thesis, University of Tübingen.
Robert Malouf. 2000. The order of prenominal adjec- Nathaniel J. Smith and Roger Levy. 2013. The effect of
tives in natural language generation. In Proceed- word predictability on reading time is logarithmic.
ings of the 38th Annual Meeting on Association for Cognition, 128(3):302–319.
Computational Linguistics, ACL ’00, pages 85–92,
Stroudsburg, PA, USA. Association for Computa- R. Sproat and C. Shih. 1991. The cross-linguistic
tional Linguistics. distribution of adjective ordering restrictions. In
C. Georgopoulos and R. Ishihara, editors, Interdis-
Christopher D. Manning. 2003. Probabilistic syntax. ciplinary approaches to language: Essays in honor
In Probabilistic Linguistics, pages 289–341. MIT of S.-Y. Kuroda, pages 565–593. Kluwer Academic,
Press. Dordrecht, Netherlands.
J E Martin. 1969. Some competence-process relation- David Temperley and Daniel Gildea. 2018. Min-
ships in noun phrases with prenominal and postnom- imizing syntactic dependency lengths: Typologi-
inal adjectives. Journal of Verbal Learning and Ver- cal/cognitive universal? Annual Review of Linguis-
bal Behavior, 8:471–480. tics, 4:1–15.
Emily Morgan and Roger Levy. 2016. Abstract knowl- Stefanie Wulff. 2003. A multifactorial corpus analysis
edge versus direct experience in processing of bino- of adjective order in english. International Journal
mial expressions. Cognition, 157:382–402. of Corpus Linguistics, 8(2):245–282.
Joakim Nivre. 2015. Towards a universal grammar for P. Ziff. 1960. Semantic analysis. Cornell University
natural language processing. In Computational Lin- Press, Ithaca, NY.
guistics and Intelligent Text Processing, pages 3–16.
Springer.
Jeffrey Pennington, Richard Socher, and Christo-

pher D. Manning. 2014. GloVe: Global vectors for
word representation. In Empirical Methods in Nat-
ural Language Processing (EMNLP), pages 1532–
1543.
J. R. Quinlan. 1986. Induction of decision trees. Ma-

chine Learning, 1(1):81–106.
Gregory Scontras, Galia Bar-Sever, Zeinab

Kachakeche, Cesar Manuel Rosales Jr., and
Suttera Samonte. 2020. Incremental semantic re-
striction and subjectivity-based adjective ordering.
In Proceedings of Sinn und Bedeutung 24.
Gregory Scontras, Judith Degen, and Noah D. Good-

man. 2017. Subjectivity predicts adjective ordering
preferences. Open Mind: Discoveries in Cognitive
Science, 1(1):53–65.
2012

2020 Acl-Main 181

Uploaded by

Copyright:

Available Formats

2020 Acl-Main 181

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

2020 Acl-Main 181

Uploaded by

Copyright:

Available Formats

What Determines the Order of Adjectives in English?

Richard Futrell William Dyer Gregory Scontras

Abstract of syntax. These theories provide mathematical

Wordforms Clusters Wordforms Clusters

(a) Common Crawl (N = 41822). (b) Hand-parsed EWT (N = 155)

Predictor Accuracy Conf. Interval

5.3 Interpretation This study provides a framework for evaluat-

We examined a number of theoretically-motivated Our dependency-treebank-based methods can

Jeffrey Pennington, Richard Socher, and Christo-

J. R. Quinlan. 1986. Induction of decision trees. Ma-

Gregory Scontras, Galia Bar-Sever, Zeinab

Gregory Scontras, Judith Degen, and Noah D. Good-

You might also like