A Comprehensive Comparative Evaluation and Analysis of Distributional Semantic Models
Abstract Distributional semantics has deeply changed in the last decades. First, pre-
dict models stole the thunder from traditional count ones, and more recently both of
them were replaced in many NLP applications by contextualized vectors produced
by Transformer neural language models. Although an extensive body of research has
been devoted to Distributional Semantic Model (DSM) evaluation, we still lack a
thorough comparison with respect to tested models, semantic tasks, and benchmark
datasets. Moreover, previous work has mostly focused on task-driven evaluation, in-
stead of exploring the differences between the way models represent the lexical se-
mantic space. In this paper, we perform a comprehensive evaluation of type distribu-
tional vectors, either produced by static DSMs or obtained by averaging the contex-
tualized vectors generated by BERT. First of all, we investigate the performance of
embeddings in several semantic tasks, carrying out an in-depth statistical analysis to
identify the major factors influencing the behavior of DSMs. The results show that i.)
the alleged superiority of predict-based models is more apparent than real, and surely
not ubiquitous, and ii.) static DSMs surpass contextualized representations in most
out-of-context semantic tasks and datasets. Furthermore, we borrow from cognitive
neuroscience the methodology of Representational Similarity Analysis to inspect and
compare the semantic spaces produced by the different models.
1 Introduction
Distributional semantics (Lenci, 2008, 2018; Boleda, 2020) is today the leading
approach to lexical meaning representation in natural language processing (NLP),
artificial intelligence (AI), and cognitive modeling. Grounded in the Distributional
Hypothesis (Harris, 1954; Sahlgren, 2008), according to which words with similar
linguistic contexts tend to have similar meanings, distributional semantics represents
lexical items with real-valued vectors (nowadays commonly referred to as embed-
dings) that encode their linguistic distribution in text corpora. We refer to models that
learn such representations as Distributional Semantic Models (DSMs).
Many types of DSMs have been designed throughout the years (see Table 1). The
first generation of DSMs dates back to the 1990s and is characterized by so-called
count models, which learn the distributional vector of a target lexical item by record-
ing and counting its co-occurrences with linguistic contexts. These can consist of
documents (Landauer and Dumais, 1997; Griffiths et al., 2007) or lexical collocates,
the latter in turn identified with either a “bag-of-word” window surrounding the target
(Lund and Burgess, 1996; Kanerva et al., 2000; Sahlgren, 2008) or syntactic depen-
dency relations extracted from a parsed corpus (Padó and Lapata, 2007; Baroni and
Lenci, 2010). In matrix models, directly stemming from the Vector Space Model in
information retrieval (Salton et al., 1975), target-context co-occurrence frequencies
(or frequency-derived scores that are more suitable to reflect the importance of the
contexts) are arranged into a co-occurrence matrix (or a more complex geometric
object, like a tensor; Baroni and Lenci, 2010). In this matrix, target lexical items are
represented with high-dimensional and sparse explicit vectors (Levy and Goldberg,
2014b), such that each dimension is labeled with a specific context in which they
have been observed in the corpus. In order to improve the quality of the resulting
semantic space by smoothing unseen data, removing noise, and exploiting redundan-
cies and correlations between the linguistic contexts (Turney and Pantel, 2010), the
co-occurrence matrix is typically mapped onto a reduced matrix of low-dimensional,
dense vectors consisting of “latent” semantic dimensions implicit in the original dis-
tributional data. Dense vectors are generated from explicit ones by factorizing the
co-occurrence matrix with Singular Value Decomposition (Landauer and Dumais,
1997) or with Bayesian probabilistic methods like Latent Dirichlet Allocation (Blei
et al., 2003). A different kind of count DSM is represented by random encoding models: Instead
of collecting global co-occurrence statistics into a matrix, they directly learn low-
dimensional distributional representations by assigning to each lexical item a ran-
dom vector that is incrementally updated depending on the co-occurring contexts
(Kanerva et al., 2000).
With the emergence of deep learning methods in the 2010s (Goodfellow et al.,
2016; Goldberg, 2017), a new generation of so-called predict models has entered the
scene of distributional semantics and competed with more traditional ones. Rather
than counting co-occurrences, predict DSMs are artificial neural networks that di-
rectly generate low-dimensional, dense vectors by being trained as language models
that learn to predict the contexts of a target lexical item. In this period of innova-
tion, documents have largely been abandoned as linguistic contexts, and models have
focused on window-based collocates and to a much lesser extent on syntactic ones
(Levy and Goldberg, 2014b). Thanks to the deep learning wave, neural models – not
necessarily deep in themselves – like word2vec (Mikolov et al., 2013a,b) and FastText
(Bojanowski et al., 2017) have quickly overshadowed count DSMs, though the debate on
the alleged superiority of predict models has produced inconclusive results (Baroni
et al., 2014; Levy et al., 2015). The only exception to the dominance of predict mod-
els in this period is represented by GloVe (Pennington et al., 2014). However, despite
being a matrix DSM, the GloVe method to learn embeddings is closely inspired by
neural language models.
The advent of neural DSMs has also brought important methodological novel-
ties. Besides popularizing the expression “word embedding” as a kind of standard
term to refer to distributional vectors, deep learning has radically modified the scope
and application of distributional semantics itself. The first generation of DSMs es-
sentially encompassed computational methods to estimate the semantic similarity or
relatedness among words (e.g., to build data-driven lexical resources). On the other
hand, embeddings are nowadays routinely used in deep learning architectures to ini-
tialize their word representations. These pretrained embeddings allow neural net-
works to capture semantic similarities among lexical items that are beneficial to carry
out downstream supervised tasks. Thus, distributional semantics has become a gen-
eral approach to provide NLP and AI applications with semantic information. This
change of perspective has also affected the approach to DSM evaluation. The previ-
ous generation of distributional semantics usually favoured intrinsic methods to test
DSMs for their ability to model various kinds of semantic similarity and relatedness
(e.g, synonymy tests, human similarity ratings, etc.). Currently, the widespread use of
distributional vectors in deep learning architectures has boosted extrinsic evaluation
methods: The vectors are fed into a downstream NLP task (e.g., part-of-speech tag-
ging or named entity recognition) and are evaluated with the system’s performance.
A further development in distributional semantics has recently come out from the
research on deep neural language models. For both count and predict DSMs, a com-
mon and longstanding assumption is the building of a single, stable representation for
each word type in the corpus. In the latest generation of embeddings, instead, each
word token in a specific sentence context gets a unique representation. These models
typically rely on a multi-layer encoder network and the word vectors are learned as
a function of the network's internal states, such that a word in different sentence contexts
determines different activation states and is represented by a distinct vector. Therefore,
the embeddings produced by these new frameworks are said to be contextualized, as
opposed to the static ones produced by earlier DSMs. One of the most popular archi-
tectures to learn contextualized embeddings is BERT (Devlin et al., 2019), which is
based on a stack of Transformer encoder layers (Vaswani et al., 2017) trained jointly
on a masked language modeling and a next sentence prediction task.
Generating lexical representations is not the end goal of systems like BERT or
GPT (Radford et al., 2019), which are designed chiefly as general, multi-task archi-
tectures to develop NLP applications based on the technique of fine-tuning. However,
since their internal layers generate embeddings that encode several aspects of mean-
ing as a function of the distributional properties of words in texts, BERT and its
relatives can also be regarded as DSMs (specifically, predict DSMs, given their lan-
guage modelling training objective) that produce token distributional vectors (Mickus
et al., 2020).1 In fact, the improvements achieved by these systems in several tasks
have granted huge popularity to contextualized distributional vectors, which have quickly
replaced static ones, especially in downstream applications. The reason for this success
is ascribed to the ability of such representations to capture several linguistic features
(Tenney et al., 2019) and, in particular, context-dependent aspects of word meaning
(Wiedemann et al., 2019), which overcomes an important limit of static embeddings
that conflate different word senses in the same type vector (Camacho-Collados and
Pilehvar, 2018). In this last generation of contextualized DSMs, the contexts of the
target lexical items are not selected a priori. Models are fed with huge amounts of
raw texts and they learn (e.g., thanks to the attention mechanism of Transformers)
which words are most related to the one that is being processed, thereby encoding in
the output vectors relevant aspects of the context structure.
As it is clear from this brief review, the landscape of distributional semantics has
undergone deep transformations since its outset. Changes involve the way to char-
acterize linguistic contexts, the methods to generate distributional vectors, the very
nature of such vectors (e.g., type vs. token ones), and the model complexity itself,
which has exponentially grown especially with the last generation of deep neural
DSMs, now consisting of hundreds of billions of parameters and requiring equally huge
amounts of computational resources for their training. In this paper, we assess the effects
of these developments with a comprehensive evaluation of DSMs.
1 Westera and Boleda (2019) instead argue that the domain of distributional semantics is limited to
context-invariant semantic representations. However, context-sensitive token vectors are not an absolute
novelty in the field (Erk and Padó, 2010), though they have remained a kind of sideshow until the boom of
deep neural language models.
We conducted all the analyses on type distributional vectors, which were natively
produced by static DSMs or obtained by averaging the token vectors generated by
BERT (Bommasani et al., 2020; Vulić et al., 2020). Token vectors are attractive be-
cause of their ability to capture word meaning in context. On the other hand, although
contexts can induce important effects of meaning modulation, human lexical compe-
tence is also largely context-independent (Chronis and Erk, 2020). When presented
with out-of-context words, human subjects are indeed able to carry out several se-
mantic tasks on them, such as rating their semantic similarity or grouping them into
categories (Murphy, 2002). For instance, the fact that dog and cat belong to the same
semantic category is a type-level property. This supports the hypothesis that word
meanings are abstract and (at least partially) invariant with respect to the specific contexts in
which their tokens are observed. Therefore, testing type embeddings allows us to
investigate their ability to capture such context-independent dimensions of lexical
meaning. Moreover, besides sentence-level NLP tasks that surely benefit from the
context-sensitive nature of token embeddings, there are several research and applica-
tion scenarios that need to access type-level lexical properties (e.g., estimating word
semantic similarity for cognitive modelling or information extraction).
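The construction of BERT type vectors can be illustrated with the following minimal sketch, which averages the contextualized vectors of a target word over a small sample of sentences, mean-pooling its subword pieces and a set of layers. It is only an approximation of this setup: the model name, the layer selection, and the helper function are illustrative assumptions, not the exact pipeline used in the experiments.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Model name and layer selection are illustrative assumptions.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_hidden_states=True)
model.eval()

def bert_type_embedding(word, sentences, layers=(-4, -3, -2, -1)):
    """Average the contextualized vectors of `word` over sample sentences,
    mean-pooling its subword pieces and the selected layers."""
    pooled = []
    target_ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    for sent in sentences:
        enc = tokenizer(sent, return_tensors="pt", truncation=True)
        with torch.no_grad():
            hidden = model(**enc).hidden_states            # tuple of (1, seq_len, dim) tensors
        layer_avg = torch.stack([hidden[l][0] for l in layers]).mean(dim=0)
        ids = enc["input_ids"][0].tolist()
        for i in range(len(ids) - len(target_ids) + 1):    # locate the target's subword span
            if ids[i:i + len(target_ids)] == target_ids:
                pooled.append(layer_avg[i:i + len(target_ids)].mean(dim=0))
                break
    return torch.stack(pooled).mean(dim=0) if pooled else None

# e.g., vec = bert_type_embedding("dog", ["The dog barks.", "A dog chased the cat."])
```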
This paper is organized as follows: Section 2 reviews current work on DSM eval-
uation, Section 3 presents a battery of experiments to test DSMs on intrinsic and
extrinsic tasks, Section 4 studies their semantic spaces with Representation Similar-
ity Analysis, and Section 5 discusses the significance of our findings for research on
distributional semantics.
2 Related work
Existing research on DSM evaluation has mainly addressed two questions: which models and parameter settings perform best, and which type of evaluation (intrinsic or extrinsic) is most informative. Most existing work is concerned with the first question and investigates the ef-
fect of model and/or hyperparameter choice. For example, Sahlgren and Lenci (2016)
explores the effect of data size on model performance. Levy and Goldberg (2014a),
Melamud et al. (2016), Lapesa and Evert (2017) and Li et al. (2017) study the effect
of context type (i.e., window-based vs. syntactic collocates) and embedding dimen-
sion, whereas Baroni et al. (2014) and Levy et al. (2015) study the effect of modeling
category (i.e., count vs. predict models). In particular, Levy et al. (2015) makes the
argument that model types should not be evaluated in isolation, but that using compa-
rable hyperparameters (or tuning) is necessary for fair and informative comparison.
Other works instead focus on the evaluation type. Chiu et al. (2016) observes that
intrinsic evaluation fails to predict extrinsic (sequence labeling) performance, stress-
ing the importance of extrinsic evaluation. Rogers et al. (2018) introduces an ex-
tended evaluation framework, including diagnostic tests. They corroborate the find-
ings of Chiu et al. (2016), but show that performance for other extrinsic tasks are
predicted by intrinsic performance, and that there exists diagnostic tasks that predict
sequence labeling performance.
However, much of this work is limited to one type or family of DSMs. For exam-
ple, Levy and Goldberg (2014a), Chiu et al. (2016), Ghannay et al. (2016), Melamud
et al. (2016), and Rogers et al. (2017) are all concerned solely with predict word em-
beddings, whereas Bullinaria and Levy (2007, 2012) and Lapesa and Evert (2014,
2017) limit their exploration to count DSMs. Conversely, work that compares across
model type boundaries has been limited in the scope of evaluation, being singularly
concerned with intrinsic evaluation (Baroni et al., 2014; Levy et al., 2015; Sahlgren
and Lenci, 2016). As far as we are aware, the only large-scale comparison across
model type boundaries, involving both intrinsic and extrinsic tasks, is Schnabel et al.
(2015). However, they used default hyperparameters for all models, and set the em-
bedding dimension to 50.
Recently, attention has shifted towards comparing the type embeddings produced
by static DSMs with those obtained by pooling the token contextualized embeddings
generated by deep neural language models (Ethayarajh, 2019; Bommasani et al.,
2020; Chronis and Erk, 2020; Vulić et al., 2020). However, these works have so far
addressed a limited number of tasks or a very small number of static DSMs.
This analysis of the state of the art in DSM evaluation reveals that we still lack a
comprehensive picture, with respect to tested models, semantic tasks, and benchmark
datasets. Moreover, the typical approach consists in measuring how good a given
model is and which model is the best in a particular task. Much less attention has been
devoted to exploring the differences between the way models represent the lexical
semantic space. The following sections aim at filling these gaps.
A. model – The type of method used to learn the vectors. The models are represen-
tative of the major algorithms used to construct distributional representations:
i.) matrix count models
PPMI – this model consists of a simple co-occurrence matrix with col-
locate contexts (cf. below), weighted with Positive Pointwise Mutual
Information (PPMI), computed as follows:
$$\mathrm{PPMI}\langle t,c \rangle = \begin{cases} \mathrm{PMI}\langle t,c \rangle = \log_2 \dfrac{p(t,c)}{p(t)\,p(c)} & \text{if } \mathrm{PMI}\langle t,c \rangle > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$
where p(t, c) is the co-occurrence probability of the target word t with
the collocate context c, and p(t) and p(c) are the individual target
and context probabilities. Since no dimensionality reduction is applied,
PPMI produces high-dimensional, sparse explicit distributional vectors
(a minimal code sketch of this count pipeline is given after the list of factors below);
SVD – like PPMI, but with low-dimensional embeddings generated
with Singular Value Decomposition (SVD);
$$\mathbf{t}_i \leftarrow \mathbf{t}_{i-1} + \sum_{j=-n,\, j \neq 0}^{n} \mathbf{c}_j \qquad (2)$$
are undirected, because they are not distinguished by their position (left or
right) with respect to the target. Window-based collocates do not take into
account linguistic structures, since context windows are treated as bags of
independent words ignoring any sort of syntactic information. Previous re-
search has shown that the size of the context window has important effects
on the resulting semantic space (Sahlgren, 2006; Bullinaria and Levy, 2007;
Baroni and Lenci, 2011; Bullinaria and Levy, 2012; Kiela and Clark, 2014).
Therefore, we experimented with two types of window-based DSMs:
window.2 (w2) – narrow context window of size [2, 2]. Narrow win-
dows are claimed to produce semantic spaces in which nearest neigh-
bors belong to the same taxonomic category (e.g., violin and guitar);
window.10 (w10) – wide context window of size [10, 10]. Large win-
dows would tend to promote neighbors linked by more associative re-
lations (e.g., violin and play).
ii.) syntactic collocates (only for PPMI, SVD, and SGNS) – The contexts of a
target t are the collocate words that are linked to t by a direct syntactic de-
pendency (subject, direct object, etc.). Some experiments suggest that syn-
tactic collocates tend to generate semantic spaces whose nearest elements
are taxonomically related lexemes, mainly co-hyponyms (Levy and Gold-
berg, 2014a). However, the question whether syntactic information provides
a real advantage over window-based representations of contexts is still open
(Kiela and Clark, 2014; Lapesa and Evert, 2017).
syntax.filtered (synf) – dependency-filtered collocates. Syntactic de-
pendencies are used just to identify the collocates, without entering into
the specification of the contexts themselves (Padó and Lapata, 2007).
Therefore, identical lexemes linked to the target by different depen-
dency relations are mapped onto the same context;
syntax.typed (synt) – dependency-typed collocates. Syntactic depen-
dencies can be encoded in the contexts, typing the collocates (e.g.,
nsubj-dog). Typing collocates with dependency relations captures more
fine-grained syntactic distinctions, but on the other hand produces a
much larger number of distinct contexts (Baroni and Lenci, 2010);
iii.) documents (only for LSA and LDA) – the contexts of a target t are the documents
it occurs in. The use of textual contexts derives from the vector
space model in information retrieval, whose target semantic dimension is
that of topicality, or aboutness. Documents are thus represented with their
word distribution, and, symmetrically, lexical items with their distribution
in documents, which can be regarded as “episodes” (Landauer and Dumais,
1997) that become associated with the words therein encountered.
C. dimensions – the number of vector dimensions. The settings are 300 and 2,000
for embeddings, and 10,000 for explicit PPMI distributional vectors.
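As a concrete illustration of the count pipeline described above (window-based co-occurrence counting, PPMI weighting as in Equation 1, and SVD reduction), the following sketch uses numpy and scipy. Vocabulary handling, the window size, and the number of dimensions are simplified assumptions rather than the exact settings of the 44 DSMs.

```python
import numpy as np
from collections import Counter
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds

def count_cooccurrences(corpus, vocab, window=2):
    """Symmetric window-based co-occurrence counts (targets = contexts = vocab)."""
    idx = {w: i for i, w in enumerate(vocab)}
    counts = Counter()
    for sent in corpus:
        for i, t in enumerate(sent):
            if t not in idx:
                continue
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i and sent[j] in idx:
                    counts[(idx[t], idx[sent[j]])] += 1
    rows, cols = zip(*counts.keys())
    return csr_matrix((list(counts.values()), (rows, cols)),
                      shape=(len(vocab), len(vocab)), dtype=float)

def ppmi(M):
    """Positive PMI weighting of a raw co-occurrence matrix (Equation 1)."""
    M = M.toarray()
    total = M.sum()
    p_t = M.sum(axis=1, keepdims=True) / total
    p_c = M.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2((M / total) / (p_t * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)

def svd_embeddings(ppmi_matrix, k=300):
    """Dense embeddings from the k largest singular components (k must be smaller than the vocabulary size)."""
    U, S, _ = svds(csr_matrix(ppmi_matrix), k=k)
    return U * S      # one k-dimensional row vector per target
```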
All 44 static DSMs were trained on a concatenation of ukWaC and a 2018 dump
of English Wikipedia.2 The corpus was case-folded, and then POS-tagged and syn-
tactically parsed with CoreNLP (Manning et al., 2014), according to the Universal Dependencies scheme.
2 https://dumps.wikimedia.org
targets – T = V , for all models. Since targets are unlemmatized lexemes, every
DSM assigns a distinct distributional vector to each inflected form. The reason
for this choice is that several benchmark datasets (e.g., analogy ones) are not
lemmatized;
contexts – the main difference is between collocate vs. document contexts:
collocates – C = V . For syntax-based models, co-occurrences were identi-
fied using the dependency relation linking the target and the context lexeme.4
For dependency-typed collocates, we used both direct and inverse dependen-
cies. For instance, given the sentence The dog barks, we considered both the
direct (barks, nsubj-dog) and inverse (dog, nsubj⁻¹-barks) dependencies. To
reduce the very large number of context types, we applied a selection heuris-
tics, keeping only the typed collocates whose frequency was greater than 500.
For dependency-filtered collocates, both direct (barks, dog) and inverse (dog,
barks) relations were used as well, but without context selection heuristics;
documents – C = D, where D includes more than 8.2 million documents,
corresponding to the articles in Wikipedia and ukWaC.
– we trained predict DSMs with the negative sampling algorithm, using 15 nega-
tive examples (instead of the default value of 5), as Levy et al. (2015) show that
increasing their number is beneficial to the model performance.
We tested the 44 static DSMs on the 25 intrinsic and 8 extrinsic datasets reported in
Table 3. On the other hand, BERT type embeddings were evaluated in the intrinsic
5 The tokenization and the embedding extraction were performed without stripping the accents from
each target word, using the Hugging Face Python library (Wolf et al., 2020).
6 Actually from layer 2 to layer 5, as we skipped the first layer, which corresponds to the context-independent input embeddings.
Table 3: Datasets used for the intrinsic and extrinsic evaluation.

Intrinsic evaluation
Dataset Size Task Metric
TOEFL 80 synonymy accuracy
ESL 50 synonymy accuracy
RG65 65 similarity correlation
RW 2,034 similarity correlation
SL-999 999 similarity correlation
SV-3500 3,500 similarity correlation
WS-353 353 similarity correlation
WS-SIM 203 similarity correlation
WS-REL 252 relatedness correlation
MTURK 287 relatedness correlation
MEN 3,000 relatedness correlation
TR9856 9,856 relatedness correlation
AP 402 categorization purity
BATTIG 5,231 categorization purity
BATTIG-2010 82 categorization purity
ESSLLI-2008-1a 44 categorization purity
ESSLLI-2008-2b 40 categorization purity
ESSLLI-2008-2c 45 categorization purity
BLESS 26,554 categorization purity
SAT 374 analogy accuracy
MSR 8,000 analogy accuracy
GOOGLE 19,544 analogy accuracy
SEMEVAL-2012 3,218 analogy accuracy
WORDREP 237,409,102 analogy accuracy
BATS 98,000 analogy accuracy
Extrinsic evaluation
Dataset Training size Test size Task Metric
CONLL-2003 204,566 46,665 sequence labeling (POS tagging) F-measure
CONLL-2003 204,566 46,665 sequence labeling (chunking) F-measure
CONLL-2003 204,566 46,665 sequence labeling (NER) F-measure
SEMEVAL-2010 8,000 2,717 semantic relation classification accuracy
MR 5,330 2,668 sentence sentiment classification accuracy
IMDB 25,000 25,000 document sentiment classification accuracy
RT 5,000 2,500 subjectivity classification accuracy
SNLI 550,102 10,000 natural language inference accuracy
setting only. In fact, the extrinsic tasks involve in-context lexemes, and token embed-
dings have already been shown to have the edge over type ones in such cases. Our goal
is instead to evaluate how BERT type embeddings represent the semantic properties
of out-of-context words. In all the experiments, distributional vector similarity was
measured with the cosine.
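For reference, the cosine similarity between two distributional vectors $\mathbf{x}$ and $\mathbf{y}$ is:

$$\cos(\mathbf{x}, \mathbf{y}) = \frac{\mathbf{x} \cdot \mathbf{y}}{\|\mathbf{x}\|\,\|\mathbf{y}\|}$$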
For the intrinsic evaluation, we used the most widely used benchmarks in distri-
butional semantics, grouped into the following semantic tasks:
synonymy – the task is to select the correct synonym of a target word from a
number of given alternatives. A DSM makes the right decision if the correct word
has the highest cosine among the alternatives. The evaluation measure is accu-
racy, computed as the ratio of correct choices returned by a model to the total
number of test items;
similarity – the task is to replicate as closely as possible human ratings of se-
mantic similarity, as a relation between words sharing similar semantic attributes
(e.g., dog and horse), hence the name of attributional (or taxonomic) similarity
(Medin et al., 1993). The evaluation measure is the Spearman rank correlation
(ρ) between cosine similarity scores and ratings;
relatedness – the task is to model human ratings of semantic relatedness. The
latter is a broader notion than similarity (Budanitsky and Hirst, 2006), since it
also covers words linked by associative relations (e.g., doctor and hospital). The
evaluation measure is again the Spearman rank correlation (ρ);
categorization – the task is to group a set of words into semantic classes. The
clusters produced from the distributional vectors are compared against the gold
standard classes with purity:
$$\mathrm{purity} = \frac{1}{n} \sum_{r=1}^{k} \max_i (n_i^r) \qquad (5)$$
where $n_i^r$ is the number of items from the i-th true (gold standard) class that are
assigned to the rth cluster, n is the total number of items and k the number of clus-
ters. In the best case (perfect clusters), purity will be 1, while, as cluster quality
deteriorates, purity approaches 0;
analogy completion – the task consists in inferring the missing item in an in-
complete analogy a : b = c : ? (e.g., given the analogy Italy : Rome = Sweden : ?,
the correct item is Stockholm, since it is the capital of Sweden). The analogy task
targets relational similarity (Medin et al., 1993), since the word pairs in the two
members of the analogy share the same semantic relation. We addressed analogy
completion with the offset method popularized by Mikolov et al. (2013c), which
searches for the target lexeme t that maximizes the following expression:

$$\hat{t} = \operatorname*{argmax}_{t \in T^*} \cos(t, \; b - a + c)$$

where $T^*$ is the set of target lexemes minus a, b, and c. The evaluation measure
is accuracy, as the percentage of correctly inferred items.7
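A minimal sketch of the offset method is given below, assuming a dictionary mapping lexemes to numpy vectors; the function name and the data structure are ours.

```python
import numpy as np

def solve_analogy(a, b, c, vectors):
    """Return the lexeme maximizing cos(t, b - a + c), with a, b, c excluded (the offset method)."""
    target = vectors[b] - vectors[a] + vectors[c]
    target /= np.linalg.norm(target)
    best, best_sim = None, -np.inf
    for t, v in vectors.items():
        if t in (a, b, c):
            continue
        sim = np.dot(v, target) / np.linalg.norm(v)
        if sim > best_sim:
            best, best_sim = t, sim
    return best

# e.g., solve_analogy("italy", "rome", "sweden", vectors) is expected to return "stockholm"
```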
For the extrinsic evaluation, distributional vectors were fed as features into su-
pervised classifiers for the following tasks (the size of the training and test sets is
reported in Table 3):
sequence labeling (POS tagging, chunking, and NER) – the task is to correctly
identify the POS, chunk, or named entity (NE) tag of a given token. The model
is a multinomial logistic regression classifier on the concatenated word vectors of
a context window of radius two around – and including – the target token (a minimal
sketch of this feature construction follows the task list below). The
performance metric is the F-measure;
semantic relation classification – the task is to correctly identify the semantic
relation between two target nominals. The model is a Convolutional Neural Net-
work (CNN). The performance metric is accuracy;
sentence sentiment classification – the task is to classify movie reviews as either
positive or negative. The binary classification is carried out with the CNN by Kim
(2014). The performance metric is accuracy;
document sentiment classification – the task is to classify documents as either
positive or negative. The classifier is a Long Short–Term Memory (LSTM) net-
work with 100 hidden units. The performance metric is accuracy;
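The feature construction for the sequence labeling classifiers can be sketched as follows; the zero-vector padding for out-of-vocabulary words and sentence boundaries, and the scikit-learn classifier settings, are assumptions of this sketch rather than the exact configuration of the toolkit used in the experiments.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def window_features(tokens, i, vectors, dim, radius=2):
    """Concatenate the embeddings of the tokens in a window of radius 2
    around (and including) position i; zero vectors pad OOV words and
    sentence boundaries (an assumption of this sketch)."""
    feats = []
    for j in range(i - radius, i + radius + 1):
        if 0 <= j < len(tokens) and tokens[j] in vectors:
            feats.append(vectors[tokens[j]])
        else:
            feats.append(np.zeros(dim))
    return np.concatenate(feats)

# X = [window_features(sent, i, vectors, 300) for sent in train_sents for i in range(len(sent))]
# y = [tag for sent_tags in train_tags for tag in sent_tags]
# clf = LogisticRegression(max_iter=1000).fit(X, y)   # multinomial by default for multi-class labels
```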
7 Levy and Goldberg (2014b) proposed a variant of the offset method called 3CosMul, which they show
to obtain better performances. However, since we are not interested in the best way to solve analogies, but
rather in comparing different DSMs on this task, we preferred to use the original approach.
Table 4: Best score and model for each benchmark dataset.

Intrinsic evaluation
Dataset Score Model
Synonymy
TOEFL 0.92 FastText.w2.2000
ESL 0.78 SVD.synt.2000
Similarity
RG65 0.87 GloVe.w10.2000
RW 0.48 FastText.w2.300
SL-999 0.49 SVD.synt.2000
SV-3500 0.41 SVD.synt.2000
WS-353 0.71 CBOW.w10.300
WS-SIM 0.76 SVD.w2.2000
Relatedness
WS-REL 0.66 CBOW.w10.300
MTURK 0.71 FastText.w2.300
MEN 0.79 CBOW.w10.300
TR9856 0.17 FastText.w2.300
Categorization
AP 0.75 SVD.synt.300
BATTIG 0.48 SGNS.synt.300
BATTIG-2010 1.00 SVD.synf.300
ESSLLI-2008-1a 0.95 SVD.synf.300
ESSLLI-2008-2b 0.92 SGNS.w2.2000
ESSLLI-2008-2c 0.75 SGNS.w2.2000
BLESS 0.88 SVD.synf.2000
Analogy
SAT 0.34 SVD.synt.300
MSR 0.68 FastText.w2.300
GOOGLE 0.76 FastText.w2.300
SEMEVAL-2012 0.38 SVD.synt.300
WORDREP 0.27 FastText.w2.300
BATS 0.29 FastText.w2.300
Extrinsic evaluation
Dataset Task Score Model
CONLL-2003 sequence labeling (POS tagging) 0.88 SGNS.synt.2000
CONLL-2003 sequence labeling (chunking) 0.89 SGNS.synt.300
CONLL-2003 sequence labeling (NER) 0.96 SGNS.w2.2000
SEMEVAL-2010 semantic relation classification 0.78 SGNS.w2.2000
MR sentence sentiment classification 0.78 SGNS.w2.2000
IMDB document sentiment classification 0.82 FastText.w2.300
RT subjectivity classification 0.91 FastText.w2.2000
SNLI natural language inference 0.70 CBOW.w2.2000
The model coverage of the datasets used in all tasks is very high (mean 98%, standard
deviation 3.3%). In BERT, 43.3% of the targets belong to the model vocabulary, and
the rest is split into subwords. The intrinsic performance measures were obtained
by running an extended version of the Word Embeddings Benchmarks (Jastrzȩbski
et al., 2017).8 The eight extrinsic performance measurements were computed with
the Linguistic Diagnostic Toolkit (Rogers et al., 2018).9
8 https://github.com/kudkudak/word-embeddings-benchmarks
9 http://ldtoolkit.space
Each of the 44 static DSMs was tested on the 33 datasets, for a total of 1,452 exper-
iments. Table 4 contains the best score and model for each benchmark. Top perfor-
mances are generally close to or better than state-of-the-art results for each dataset,
and replicate several trends reported in the literature. For instance, the similarity
datasets RW and SL-999 are much more challenging than WS-353 and especially
MEN. In turn, the verb-only SV-3500 is harder than SL-999, in which nouns repre-
sent the lion’s share. Coming to the analogy completion task, MSR and GOOGLE
prove to be fairly easy. As observed by Church (2017), the performance of the off-
set method drastically drops with the other datasets. Moreover, it does not perform
evenly on all analogy types. The top score by the FastText.w2.300 model is 0.73 on
Fig. 2: Global DSM performance: (a) per model type, (b) per context type, (c) per
vector dimensions.
the syntactic subset of GOOGLE analogies, and 0.69 on the semantic one. Differ-
ences are much stronger in BATS: The best performance on inflection and derivation
analogies is 0.43, against 0.16 on semantic analogies, and just 0.06 on analogies
based on lexicographic relations like hypernymy.
A crucial feature to notice in Table 4 is that there is no single “best model”. In
fact, the performance of DSMs forms a very complex landscape, which we explore
here with statistical analyses that focus on the following objectives: i.) determining
the role played by the three main factors that define the experiments – model, context,
and vector dimensions – and their possible interactions; ii.) identifying which DSMs
are significantly different from each other; iii.) checking how the task type influences the
performance of the DSMs.
An overall analysis of the role played by the different factors in the experiments
poses the problem of having a response (or dependent) variable that is homogeneous
across tasks in which performance is evaluated according to different metrics (see
Table 3). It is evident that an accuracy of 0.5 has a very different meaning than a
correlation coefficient of 0.5. In order to address this issue, we defined as
response variable the position (rank) of a DSM in the performance ranking of each
task. If a model A has a higher accuracy than a model B in a task and a higher
correlation than B in another task, in both cases we can say that A is “better” than
B. Therefore, given a task t, we ordered the performance scores on t in a decreasing
way, and each DSM was associated with a value corresponding to its rank in the
ordered list (e.g., the top DSM has rank 1, the second best scoring DSM has rank 2,
and so on). The response variable is therefore defined on a scale from 1 to 44 (the
number of DSMs tested on each task), in which lower values correspond to better
performances. This conversion causes a loss of information on the distance between
the scores, but normalizes the metrics both in terms of meaning and in terms of the
statistical characteristics of their distribution.
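A minimal sketch of this rank conversion is given below, using a toy pandas data frame with one row per DSM-dataset experiment; the column names and values are illustrative assumptions.

```python
import pandas as pd

# Toy stand-in for the full results table: one row per (DSM, dataset) experiment.
results = pd.DataFrame({
    "dsm":        ["SGNS.w2.300", "SVD.w2.300", "LDA.300", "SGNS.w2.300", "SVD.w2.300", "LDA.300"],
    "model":      ["SGNS", "SVD", "LDA", "SGNS", "SVD", "LDA"],
    "context":    ["w2", "w2", "document", "w2", "w2", "document"],
    "dimensions": [300, 300, 300, 300, 300, 300],
    "dataset":    ["MEN", "MEN", "MEN", "TOEFL", "TOEFL", "TOEFL"],
    "score":      [0.9, 0.7, 0.3, 0.8, 0.6, 0.2],
})

# Rank the DSMs within each dataset: rank 1 = best score, regardless of the metric.
results["rank"] = results.groupby("dataset")["score"].rank(ascending=False, method="average")
```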
Figure 1 presents the global rank distribution of the 44 DSMs in all the 33 bench-
marks. Model is clearly the primary factor in determining the score of the experi-
ments, context has a more contained and nuanced role, while the role of dimensions
is marginal. This is confirmed by the Kruskal-Wallis rank sum non-parametric test:
models (H = 854.27, df = 9, p < .001) and contexts (H = 229.87, df = 4, p < .001)
show significant differences, while the vector dimension levels do not (H = 3.14,
df = 2, p = .21), as also illustrated by the boxplots in Figure 2c. The only cases in
which vector size matters are RI models, whose 2,000-dimensional embeddings tend
to perform better than 300-dimensional ones.
Looking at Figure 2a, we can observe that there are three major groups of models:
i.) the best performing ones are the predict models SGNS, CBOW and FastText,
ii.) closely followed by the matrix models GloVe, SVD and PPMI, iii.) while the
worst performing models are RI, the document-based LSA, and in particular LDA.
Dunn’s tests (with Bonferroni correction) were carried out to identify which pairs
of models are statistically different. The p-values of these tests reported in Table 5
draw a very elaborate picture that cannot be reduced to the strict contrast between
predict and count models: i.) no difference exists between SGNS and FastText; ii.)
GloVe does not differ from CBOW and SVD, and the latter two are only marginally
different; iii.) interestingly, the PPMI explicit vectors do not differ from their implicit
counterparts reduced with SVD and only marginally differ from GloVe; iv.) LSA does
not differ from RI and RI-perm, which in turn do not differ from LDA.
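These tests can be sketched as follows, continuing from the `results` frame of the previous sketch and assuming the scipy and scikit-posthocs packages; this is an illustration of the analysis, not the exact script used for the reported values.

```python
from scipy.stats import kruskal
import scikit_posthocs as sp   # assumption: the scikit-posthocs package is available

# Kruskal-Wallis test on the rank response variable, grouped by model type.
groups = [g["rank"].values for _, g in results.groupby("model")]
H, p = kruskal(*groups)

# Dunn's pairwise post-hoc tests with Bonferroni correction.
pvals = sp.posthoc_dunn(results, val_col="rank", group_col="model",
                        p_adjust="bonferroni")
```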
With regard to context types, the best scores are for syntax-based ones, either
filtered or typed, while document is clearly the worst (Figure 2b). However, we note
that syntax.filtered is equivalent to syntax.typed, and the latter does not differ from
window.2. On the other hand, window.10 and document are significantly different
from all other context types (Table 6).
We then fit a regression tree model (Witten and Frank, 2005) to the experiment
performance scores (dependent variable) as a function of context, model, and vector
dimensions (independent variables or explanatory factors). Regression tree analy-
sis partitions the set of all the experiments in subsequent steps, identifying for each of
them the independent variable and the combinations of its modalities that best explain
the variability of the performance score. The tree growth process is blocked when
the information gain becomes negligible. The output tree of our statistical model is
presented in Figure 3. The variables used to divide the experiments (nodes) are high-
lighted, step by step, together with the modalities with respect to which the partition
is made. The tree leaves are the most uniform subgroups in terms of variability of the
dependent variable with respect to the explanatory factors.
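The implementation of the regression tree is not specified in the paper; a possible sketch with scikit-learn, one-hot encoding the three categorical factors and stopping growth when the gain in explained variance becomes negligible, is the following (the threshold value is an assumption).

```python
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

# One-hot encode the categorical explanatory factors (model, context, dimensions).
X = pd.get_dummies(results[["model", "context", "dimensions"]].astype(str))
y = results["rank"]

# Stop splitting when the reduction in variance becomes negligible.
tree = DecisionTreeRegressor(min_impurity_decrease=0.01, random_state=0).fit(X, y)
print(round(tree.score(X, y), 2))   # R^2 of the fitted tree on the experiments
```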
The statistical model fit measured by the R2 coefficient is 0.65, which means that
model, context, and vector dimensions are able to explain just 65% of the overall
variability of the DSM performance (55% of which is explained by the first two
partitions, both associated with model type). The regression tree confirms the relevant
role played by the model factor. In the first partition, LDA, LSA, RI and RI-perm
are identified on the right branch of the tree, whose leaves are characterized by the
highest average score ranks, in particular for LDA (41.3), and RI and RI-perm with
size 300 (38.99). On the left branch, we find SGNS and FastText, which in the case
of syntax.filtered and window.2 contexts have the lowest average score ranks (10.2).
An interaction between model and context exists for CBOW, PPMI and SVD, which
have worse performances (i.e., higher score ranks) with window.10 contexts.
DSM performance greatly varies depending on the benchmark dataset and semantic
task. This is already evident from the spread out data distribution in the boxplots in
Figure 1, and is further confirmed by the regression tree analysis: After introducing
task type and evaluation type (intrinsic vs. extrinsic) as further predictors, the ex-
plained data variability increases from 65% to 72%. This means that the behavior of
DSMs is strongly affected by the semantic task they are tested on. In this section, we
investigate this issue by analysing the performance of the different model and con-
text types in the six tasks in which the datasets are grouped: synonymy, similarity,
relatedness, categorization, analogy, and extrinsic tasks (see Table 3).
Figure 4 reports the per-task rank distribution of model types. In general, the best
performances are obtained by SGNS and FastText, and the worst ones by LDA, RI,
RI-perm, and LSA. Instead, PPMI, SVD, GloVe and CBOW produce more interme-
diate and variable results: They are equivalent to, or better than, the top models in
some cases, worse in others. Table 7 shows the model pairs whose performance is
statistically different for each task (black dots), according to Dunn’s test.
We can notice that in several cases the differences between models are actually
non significant: i.) CBOW never differs from SGNS and FastText; ii.) SVD differs
from predict models only in the analogy and extrinsic tasks (and, for FastText, in
the relatedness task too), and differs from PPMI for similarity and extrinsic tasks;
Task PPMI
synonymy ◦
similarity •
SVD relatedness ◦
◦
categorization
analogy ◦
extrinsic • SVD
synonymy ◦ ◦
similarity ◦ •
LSA relatedness ◦ ◦
◦
categorization •
analogy ◦ ◦
extrinsic ◦ ◦ LSA
synonymy ◦ • ◦
similarity ◦ • ◦
LDA relatedness ◦ • ◦
•
categorization • ◦
analogy • • ◦
extrinsic • ◦ ◦ LDA
synonymy ◦ ◦ ◦ •
similarity ◦ ◦ ◦ •
GloVe relatedness ◦ ◦ ◦ ◦
◦
categorization ◦ ◦ •
analogy ◦ ◦ • •
extrinsic ◦ • • • GloVe
synonymy ◦ • ◦ ◦ •
similarity ◦ • ◦ ◦ •
RI relatedness ◦ • ◦ ◦ ◦
•
categorization • ◦ ◦ •
analogy • • ◦ ◦ •
extrinsic • ◦ ◦ ◦ • RI
synonymy ◦ • ◦ ◦ • ◦
similarity ◦ • ◦ ◦ • ◦
RI-perm relatedness ◦ • ◦ ◦ ◦ ◦
•
categorization • ◦ ◦ • ◦
analogy • • ◦ ◦ • ◦
extrinsic • ◦ ◦ ◦ • ◦ RI-perm
synonymy • ◦ ◦ • ◦ • •
similarity • ◦ • • ◦ • •
SGNS relatedness ◦ ◦ ◦ • ◦ • •
•
categorization ◦ • • ◦ • •
analogy ◦ • • • ◦ • •
extrinsic • • • • ◦ • • SGNS
synonymy ◦ ◦ ◦ ◦ ◦ ◦ ◦ ◦
similarity • ◦ • • ◦ • • ◦
CBOW relatedness • ◦ ◦ • ◦ • • ◦
◦
categorization ◦ ◦ • ◦ • • ◦
analogy ◦ ◦ ◦ • ◦ • • ◦
extrinsic ◦ • • • ◦ • • ◦ CBOW
synonymy ◦ ◦ ◦ • ◦ • • ◦ ◦
similarity • ◦ • • ◦ • • ◦ ◦
FastText relatedness • • • • • • • ◦ ◦
◦
categorization ◦ • • ◦ • • ◦ ◦
analogy • • • • ◦ • • ◦ ◦
extrinsic • • • • ◦ • • ◦ ◦
Table 7: Dunn’s tests for multiple comparisons of model types for each semantic task.
Black dots mark significantly different models (p< .05).
iii.) GloVe never differs from SGNS and CBOW, and differs from FastText only for
relatedness. Interestingly, GloVe differs neither from PPMI explicit vectors, nor from
SVD (apart from the extrinsic task). If we exclude LDA, LSA and RI, which are sys-
tematically underperforming, models mostly differ in the extrinsic task (40% of the
overall cases), with the predict ones having a clear edge over the others. Conversely,
the performances in synonymy and categorization tasks are more similar across mod-
els, with only 6% of significant pairwise differences.
Figure 5 shows the per-task distribution of the various context types. Apart from
document, which has the worst performance in every task, the other contexts almost
never produce significant differences, as confirmed by the results of the Dunn’s tests
in Table 8. The only exception is the categorization task, in which syntax-based con-
texts achieve significantly better performances than window-based ones. Moreover,
syntax.filtered improves over window.10 in the similarity and the analogy tasks.
A further aspect we investigate is the correlation of DSM performance across
datasets. In the plot in Figure 6, dot size and color are proportional to the Spearman
correlation between the 33 datasets with respect to the performance of the 44 DSMs:
The higher the correlation between two datasets the more DSMs tend to perform
similarly on them. Intrinsic evaluation benchmarks are strongly correlated with each
other. The only exceptions are TR9856 and ESSLLI-2008-2b. The former case prob-
ably depends on some idiosyncrasies of the dataset, as suggested by the fact that the
top performance on TR9856 is significantly lower than the ones scored by the other
relatedness benchmarks (see Table 4). ESSLLI-2008-2b instead focuses on concrete-
abstract categorization, a task that targets a unique semantic dimension among the
tested benchmarks. Intrinsic datasets are strongly correlated with extrinsic ones too,
Task window.2
synonymy ◦
similarity ◦
window.10 relatedness ◦
categorization ◦
analogy ◦
extrinsic ◦ window.10
synonymy ◦ ◦
similarity ◦ •
syntax.filtered relatedness ◦ ◦
categorization • ◦
analogy ◦ •
extrinsic ◦ ◦ syntax.filtered
synonymy ◦ ◦ ◦
similarity ◦ ◦ ◦
syntax.typed relatedness ◦ ◦ ◦
categorization • • ◦
analogy ◦ ◦ ◦
extrinsic ◦ ◦ ◦ syntax.typed
synonymy • ◦ • •
similarity • • • •
document relatedness • • • •
categorization • • • •
analogy • • • •
extrinsic • • • •
Table 8: Dunn’s tests for multiple comparisons of context types for each semantic
task. Black dots mark significantly different contexts (p< .05).
except for POS tagging, chunking and NER, which however have weaker correlations
with the other extrinsic datasets as well.
Previous research reported that type embeddings derived from contextualized DSMs,
in particular BERT, generally outperform the ones produced by static DSMs (Etha-
yarajh, 2019; Bommasani et al., 2020; Vulić et al., 2020). However, the picture emerg-
ing from our experiments is rather the opposite.
As illustrated in Table 9, static DSMs have a clear edge over BERT in most out-
of-context semantic tasks and datasets, with the only exceptions of SL-999, MSR,
WORDREP, and BATS, and even in these datasets the differences are not significant. In
some cases, one of the BERT models is at most able to get close to the top-scoring
static DSM, but in all the other datasets BERT lags behind by up to 20 points. The good
performance on SL-999 suggests that BERT type embeddings are able to capture se-
mantic similarity fairly well, and in fact this is the task in which BERT performances
are closest to the ones of static DSMs.
In the analogy task, the highest score is obtained by BERT on MSR, but this
dataset only contains syntactic analogies. In GOOGLE, BERT.L4 achieves 0.66 ac-
curacy (0.76 for static DSMs), but its performance is indeed much better in the
syntactic subset (0.71 BERT.L4; 0.73 FastText.w2.300), than in the semantic one
(0.55 BERT.L4; 0.69 FastText.w2.300). The situation with BATS is exactly the same,
with BERT performance on morphosyntactic analogies (0.52 BERT.L4; 0.43 Fast-
Text.w2.300) being much higher than on the semantic subset (0.11 BERT.L4; 0.11
FastText.w2.300). This indicates that BERT embeddings are especially able to encode
morphological and syntactic, rather than semantic, information. The generally bet-
ter performance of static embeddings in the analogy task is consistent with the results
reported by Ushio et al. (2021).
One key aspect of deep learning models like BERT is understanding what infor-
mation is encoded in their various layers. In Table 9, we can notice the gain brought
by averaging the embeddings from several layers, as already shown by Vulić et al.
(2020). BERT.L, which is derived from the last and most contextualized layer,
is globally the worst performing model. On the other hand, BERT.L4 generally per-
forms better than BERT.F4 in synonymy, similarity and categorization tasks, while
with relatedness datasets the situation is reversed. This suggests that the last layers
encode semantic dimensions useful to capture attributional similarity. BERT behav-
Dataset Static BERT.F4 BERT.L4 BERT.L
Synonymy
TOEFL 0.92 0.72 0.89 0.82
ESL 0.78 0.60 0.60 0.64
Similarity
RG65 0.87 0.74 0.81 0.78
RW 0.48 0.37 0.48 0.36
SL-999 0.49 0.49 0.55 0.50
SV-3500 0.41 0.34 0.40 0.27
WS-353 0.71 0.61 0.62 0.57
WS-SIM 0.76 0.67 0.70 0.63
Relatedness
WS-REL 0.66 0.56 0.51 0.47
MTURK 0.71 0.59 0.56 0.52
MEN 0.79 0.70 0.69 0.64
TR9856 0.17 0.13 0.14 0.13
Categorization
AP 0.75 0.52 0.63 0.55
BATTIG 0.48 0.22 0.40 0.35
BATTIG-2010 1.00 0.67 0.77 0.73
ESSLLI-2008-1a 0.95 0.68 0.73 0.70
ESSLLI-2008-2b 0.92 0.82 0.75 0.75
ESSLLI-2008-2c 0.75 0.64 0.62 0.58
BLESS 0.88 0.60 0.73 0.70
Analogy
SAT 0.34 0.24 0.24 0.21
MSR 0.68 0.76 0.69 0.68
GOOGLE 0.76 0.38 0.66 0.64
SEMEVAL-2012 0.38 0.33 0.34 0.30
WORDREP 0.27 0.22 0.28 0.22
BATS 0.29 0.30 0.33 0.34
Table 9: Best static DSM scores (see Table 4) compared with the performances of
BERT type embeddings.
ior in the analogy completion task is instead more diversified, but the highest score
in MSR is obtained by BERT.F4, confirming that the first layers tend to encode mor-
phosyntactic information (Jawahar et al., 2019; Tenney et al., 2019).
The outcome of these experiments strongly contrasts with those reported in the
literature. What is the reason for such a difference? The performances of our BERT
models are essentially in line with or even better than the ones obtained by previ-
ous works. For instance, the best BERT model in Bommasani et al. (2020) achieves
correlations of 0.50 in SL-999 and 36.87 in SV-3500, against the BERT.L4 scores
respectively of 0.55 and 0.40. The values in Ethayarajh (2019) are even lower than
ours. This indicates that the disappointing behavior of our BERT models is not likely
to depend on the way the type embeddings were constructed from BERT layers. Re-
search has shown that the performance of contextualized DSMs tends to increase with
the number of contexts used to build the type vectors. Therefore, we could expect that
BERT scores would be higher, if we sampled more sentences. However, Vulić et al.
(2020) argue that the increment produced by sampling 100 sentences instead of 10,
as we did, is in most cases marginal.
On the other hand, the performance of the static DSMs used in previous compar-
isons is often much lower than ours. Therefore, we can argue that in those cases BERT
“wins” because it is compared with suboptimal static DSMs and the alleged compet-
itiveness or superiority of type embeddings derived from contextualized DSMs over
static models is more apparent than real. This resembles the case of the debate be-
tween count vs. predict models, in which the advantage of the latter disappears when
count DSMs are optimized (Levy et al., 2015). Similarly, when properly tuned, static
DSMs are superior to BERT when tested on out-of-context semantic tasks.
Fig. 7: Average Spearman correlation between semantic spaces computed with RSA
on 100 random samples of 1,000 words.
One general shortcoming of the standard way to evaluate DSMs is that it is based on
testing their performances on benchmark datasets that, despite their increasing size,
only cover a limited portion of a model vocabulary (e.g., the large MEN includes
just 751 word types). Apart from a few exceptions, the selection criteria of test items
are not explicit, and do not take into consideration or do not provide information
about important factors such as word frequency and POS. As we saw in the previous
section, the variance among datasets is often extremely large, even within the same
semantic task, and this might be due to differences in sampling and rating criteria.
We present here an alternative approach that explores the shape of the seman-
tic spaces produced by DSMs with Representational Similarity Analysis (RSA)
(Kriegeskorte et al., 2008; Kriegeskorte and Kievit, 2013). RSA is a method extensively used in cognitive neuroscience to compare different representational spaces.
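A minimal sketch of how such a comparison can be computed: for each pair of DSMs, build the pairwise cosine similarity matrix over the same word sample and correlate the upper triangles of the two matrices with Spearman's ρ. Function and variable names below are ours.

```python
import numpy as np
from scipy.stats import spearmanr

def similarity_matrix(space, words):
    """Pairwise cosine similarities of `words` in one DSM (rows are L2-normalized first)."""
    M = np.stack([space[w] for w in words])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)
    return M @ M.T

def rsa(space_a, space_b, words):
    """Spearman correlation between the upper triangles of the two similarity matrices."""
    A, B = similarity_matrix(space_a, words), similarity_matrix(space_b, words)
    iu = np.triu_indices(len(words), k=1)
    rho, _ = spearmanr(A[iu], B[iu])
    return rho

# e.g., average rsa(dsm1, dsm2, sample) over 100 random samples of 1,000 shared words
```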
Fig. 8: Spearman correlation between semantic spaces computed with RSA on (a)
high, (b) medium, (c) and low frequency target words.
and some of the SGNS models. Moreover, the BERT.F4 space is quite different from
the ones generated by the last layers, probably due to their higher contextualization.
Further RSAs were then performed on subsets of the DSM vocabulary sampled
according to their frequency in the training corpus:
High Frequency (RSA-HF): the 1,000 most frequent lexemes;
Medium Frequency (RSA-MF): 10 disjoint random samples of 1,000 lexemes,
selected among those with frequency greater than 500, except for the 1,000 most
frequent ones;
Low Frequency (RSA-LF): 10 disjoint random samples of 1,000 words, selected
among those with frequency from 100 up to 500.
The results of these analyses are reported in Figure 8. It is particularly interesting
to notice the great difference in the similarities among semantic spaces depending
on the frequency range of the target lexemes. In RSA-HF, most semantic spaces are
strongly correlated to each other, apart from a few exceptions: The average correlation
(mean ρ = 0.44; median = 0.40; sd = 0.22) is in fact significantly higher than the one
of the global spaces. Even those models, like RI and LDA, that behave like outliers
in the general RSA represent the high frequency semantic space very similarly to
the other DSMs. Contextualized models also increase their similarity with static ones
in the high frequency range. The between-model correlations in RSA-MF (mean ρ
= 0.20; median = 0.15; sd = 0.20) are significantly lower than RSA-HF (Wilcoxon
test: V = 58843, p-value < 0.001). A further decrease occurs with low frequency
lexemes (mean ρ = 0.17; median = 0.12; sd = 0.19; Wilcoxon test: V = 10661, p-
value < 0.001). In this latter case, the effect of model type is particularly strong. The
behavior of GloVe is exemplary: Its spaces are very close to the SVD and predict ones
for high frequency words, but in the low frequency range they have some moderate
correlation only with the latter family of models. Interestingly, in RSA-MF and RSA-
LF (window-based) SGNS and FastText are more similar to PPMI than to other count
models, probably due to the close link between PPMI and negative sampling proved
by Levy and Goldberg (2014c).
It is worth mentioning the peculiar behavior of LDA, which we had to exclude
from RSA-LF because most low frequency targets are represented by exactly the
same embedding, formed by very small values, and hence are not discriminated by the
model. We hypothesize that this is due to the way in which word embeddings are de-
fined in Topic Models. Given a set of topics z1, . . . , zk, the primary purpose of LDA is to find
the most important words that characterize a topic zi . Each target is then represented
with a topic vector (φ1 , . . . , φk ), such that φi = p(t|zi ). The words that are not relevant
to characterize any topic have low probabilities in all of them. Therefore, these same
lexemes end up being represented by identical topic vectors. The problem is that the
size of this phenomenon is actually huge: The LDA.300 model has 305,050 targets
with the same topic vector, about 88% of its total vocabulary. Low frequency words
are especially affected, probably because they do not appear among the most likely
words of any topic, as they occur few times in the documents used as contexts by
LDA. This might also explain the systematically low performance of LDA in quan-
titative evaluation. Moreover, it raises several doubts about its adequacy to build word
embeddings in general. Like LSA, Topic Models were originally designed for the
Fig. 9: Spearman correlation between semantic spaces computed with RSA on high
frequency (a) nouns, (b) verbs, (c) and adjectives.
Fig. 10: Spearman correlation between semantic spaces computed with RSA on
medium frequency (a) nouns, (b) verbs, (c) and adjectives.
semantic analysis of document collections, and were then turned into lexical models
on the assumption that, just as documents can be represented with their word distri-
butions, lexical items can be represented with the documents they occur in. However,
while this conversion from document to lexical representation works fairly well for
LSA, it turns out to be problematic for Topic Models.
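The collapse of many LDA targets onto identical topic vectors can be detected with a simple diagnostic like the following sketch (ours), which counts the rows of an embedding matrix that are shared by more than one target.

```python
import numpy as np

def count_duplicated_targets(E, decimals=8):
    """Number of targets whose embedding (a row of E) is shared by at least one other target."""
    rounded = np.round(E, decimals=decimals)          # tolerate tiny numeric noise
    _, inverse, counts = np.unique(rounded, axis=0, return_inverse=True, return_counts=True)
    return int((counts[inverse] > 1).sum())
```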
A third group of RSAs was performed on subsets of the DSM vocabulary sampled
according to their POS in the training corpus. First, we selected all the words with
frequency greater than 500, to avoid the idiosyncrasies produced by low frequency
items. Since the DSM targets are not POS-disambiguated, we univocally assigned
each selected lexeme to either the noun, verb, or adjective class, if at least 90% of
the occurrences of that word in the corpus were tagged with that class. This way, it
is likely that the vector of a potentially ambiguous word encodes the distributional
properties of its majority POS. At the end of this process we obtained 14,893 common
nouns, 7,780 verbs, and 5,311 adjectives. Given the role of target frequency in
shaping the semantic spaces, we then split each set in two subsets: High frequency
sets are composed of the 1,000 most frequent targets of each POS, whereas
medium frequency sets include the remaining targets. We randomly selected 4 dis-
joint samples of 1,000 targets from the medium frequency set of each POS (notice
that almost all adjectives are represented in these selected samples). The RSAs on the
POS samples are reported in Figures 9 and 10, and the general statistics in Table 10.
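A sketch of the majority-POS assignment described above, assuming a dictionary of (word, POS) occurrence counts extracted from the tagged corpus; names and threshold handling are illustrative.

```python
from collections import defaultdict

def assign_majority_pos(word_pos_counts, threshold=0.9, classes=("NOUN", "VERB", "ADJ")):
    """Assign a word to a POS class if at least 90% of its corpus occurrences carry that tag."""
    totals = defaultdict(int)
    for (word, pos), n in word_pos_counts.items():
        totals[word] += n
    assigned = {}
    for (word, pos), n in word_pos_counts.items():
        if pos in classes and n / totals[word] >= threshold:
            assigned[word] = pos
    return assigned

# e.g., assign_majority_pos({("dog", "NOUN"): 95, ("dog", "VERB"): 5}) -> {"dog": "NOUN"}
```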
This analysis shows the effect of frequency in an even clearer way, since for all
POS there is a drastic decrease in the model correlations from the high to the medium
frequency range, with a symmetric increase in their variability. At the same time, im-
portant differences among POS emerge. In the high frequency range, verbs (Wilcoxon
test: V = 15830, p-value < 0.001) and adjectives (Wilcoxon test: V = 19473, p-value
< 0.001) have a significantly higher between-DSM similarity than nouns, and this gap
further increases with medium frequency lexical items (verbs instead do not signifi-
cantly differ from adjectives). This means that the semantic spaces produced by the
various DSMs differ much more with nouns than with verbs or adjectives.
Caution should therefore be exercised in using the analogy completion task and its solution with the offset method
as a benchmark to evaluate DSMs;
Intrinsic vs. extrinsic evaluation. DSM performance on intrinsic tasks correlates
with performance on extrinsic tasks, except for the sequence labeling ones,
replicating the findings of Rogers et al. (2018). Differently from what has been some-
times claimed in the literature, this strong correlation indicates that intrinsic evalu-
ation can also be informative about the performance of distributional vectors when
embedded in downstream NLP applications. On the other hand, not all extrinsic tasks
are equally suitable to evaluate DSMs, as the peculiar behavior of POS tagging, NER and
chunking seems to show.
Besides using the traditional approach to DSM evaluation, we have introduced
RSA as a task-independent method to compare and explore the representation of the
lexical semantic space produced by the various models. In particular, we have found
that models, both static and contextualized ones, often produce dramatically different
semantic spaces for low frequency words, while for high frequency items the corre-
lation among them is extremely high. This suggests that the main locus of variation
among the methods to build distributional representations might reside in how they
cope with data sparseness and are able to extract information when the number of oc-
currences is more limited. In the case of static embeddings, we applied to all DSMs
the smoothing and optimization procedure proposed by Levy et al. (2015), but count
and predict models still behave very differently in the low frequency range. This in-
dicates that such differences might actually depend on some intrinsic features of the
algorithms to build word embeddings, rather than in the setting of their hyperparam-
eters. Overall, these results highlight the strong “instability” (Antoniak and Mimno,
2018) of distributional semantic spaces, especially when the target frequency is not
high: Models can produce substantially divergent representations of the lexicon, even
when trained on the same corpus data with highly comparable settings.
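As a concrete illustration of this kind of comparison, the following minimal sketch
(in Python; one common instantiation of RSA, not necessarily the exact variant used in
our experiments) computes the correlation between the semantic spaces of two models
over a shared set of targets, by correlating their pairwise cosine similarities:

import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rsa_correlation(emb_a, emb_b):
    # emb_a, emb_b: (n_targets, dim) arrays whose rows correspond to the same
    # target words in the same order (e.g., a high- or medium-frequency sample).
    sims_a = 1.0 - pdist(emb_a, metric="cosine")  # vectorized upper triangle
    sims_b = 1.0 - pdist(emb_b, metric="cosine")
    rho, _ = spearmanr(sims_a, sims_b)
    return rho

A high value of rho indicates that the two models induce very similar similarity
structures over the sample, regardless of how either model performs on any specific
task.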
Significant variations also occur at the level of POS. Quite unexpectedly, the
category where DSM spaces differ most is that of nouns. This finding deserves future
investigation to understand the source of such differences, and to carry out more fine-grained
analyses within nouns (e.g., zooming in on particular subclasses, such as
abstract vs. concrete ones). Overall, these analyses reveal that frequency and POS strongly
affect the shape of distributional semantic spaces and must therefore be carefully
considered when comparing DSMs.
We conclude this paper with one last observation. In more than twenty years,
distributional semantics has undoubtedly made enormous progress, since
the performance of DSMs as well as the range of their applications have greatly
increased. On the other hand, we might argue that this improvement is mostly due to
better optimized models or to a more efficient processing of huge amounts of training
data, rather than to a real breakthrough in the methods to distil semantic information
from distributional data. In fact, under closer scrutiny, the most recent and
sophisticated algorithms have not produced dramatic advances with respect to more
traditional ones. This raises further general questions that we leave to future research:
Are we reaching the limits of distributional semantics? How can we fill the gap between
current computational models and the human ability to learn word meaning from the
statistical analysis of the linguistic input?
References
Erk K, Padó S (2010) Exemplar-based models for word meaning in context. In: Pro-
ceedings of ACL 2010, pp 92–97
Ethayarajh K (2019) How Contextual are Contextualized Word Representations?
Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings. In: Proceed-
ings of EMNLP-IJCNLP 2019, Hong Kong, China, pp 55–65
Ghannay S, Favre B, Estève Y, Camelin N (2016) Word embedding evaluation and
combination. In: Proceedings of LREC 2016, Portorož, Slovenia, pp 300–305
Goldberg Y (2017) Neural Network Methods for Natural Language Processing. Mor-
gan & Claypool, San Rafael, CA
Goodfellow I, Bengio Y, Courville A (2016) Deep Learning. The MIT Press, Cam-
bridge, MA
Griffiths TL, Tenenbaum J, Steyvers M (2007) Topics in semantic representation.
Psychological Review 114(2):211–244
Harris ZS (1954) Distributional structure. Word 10(2-3):146–162
Jastrzȩbski S, Leśniak D, Czarnecki WM (2017) How to evaluate word embeddings?
On importance of data efficiency and simple supervised tasks. arXiv 1702.02170
Jawahar G, Sagot B, Seddah D (2019) What Does BERT Learn about the Structure
of Language? In: Proceedings of ACL 2019, Florence, Italy, pp 3651–3657
Kanerva P, Kristofersson J, Holst A (2000) Random indexing of text samples for
latent semantic analysis. In: Proceedings of CogSci 2000, Philadelphia, PA, USA,
p 1036
Kiela D, Clark S (2014) A Systematic Study of Semantic Vector Space Model Param-
eters. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models
and their Compositionality, pp 21–30
Kim Y (2014) Convolutional Neural Networks for Sentence Classification. In: Pro-
ceedings of EMNLP 2014, Doha, Qatar, pp 1746–1751
Kriegeskorte N, Kievit RA (2013) Representational geometry: integrating cognition,
computation, and the brain. Trends in Cognitive Sciences 17(8):401–412
Kriegeskorte N, Mur M, Bandettini P (2008) Representational similarity analysis –
connecting the branches of systems neuroscience. Frontiers in Systems Neuro-
science 2(4)
Landauer TK, Dumais S (1997) A solution to Plato’s problem: The latent semantic
analysis theory of acquisition, induction, and representation of knowledge. Psy-
chological Review 104(2):211–240
Lapesa G, Evert S (2014) A Large Scale Evaluation of Distributional Semantic Mod-
els: Parameters, Interactions and Model Selection. Transactions of the ACL 2:531–
545
Lapesa G, Evert S (2017) Large-scale evaluation of dependency-based dsms: Are
they worth the effort? In: Proceedings of EACL 2017, pp 394–400
Lenci A (2008) Distributional approaches in linguistic and cognitive research. Italian
Journal of Linguistics 20(1):1–31
Lenci A (2018) Distributional Models of Word Meaning. Annual Review of Linguis-
tics 4:151–171
Levy O, Goldberg Y (2014a) Dependency-Based Word Embeddings. In: Proceedings
of ACL 2014, pp 302–308
Levy O, Goldberg Y (2014b) Linguistic regularities in sparse and explicit word rep-
resentations. In: Proceedings of CoNLL 2014, pp 171–180
Levy O, Goldberg Y (2014c) Neural Word Embedding as Implicit Matrix Factor-
ization. In: Proceedings of Advances in Neural Information Processing Systems
(NIPS), Montreal, Canada, pp 1–9
Levy O, Goldberg Y, Dagan I (2015) Improving Distributional Similarity with
Lessons Learned from Word Embeddings. Transactions of the ACL 3:211–225
Li B, Liu T, Zhao Z, Tang B, Drozd A, Rogers A, Du X (2017) Investigating Different
Context Types and Representations for Learning Word Embeddings. In: Proceed-
ings of EMNLP 2017, Copenhagen, Denmark, pp 2411–2421
Lund K, Burgess C (1996) Producing high-dimensional semantic spaces from lexical
co-occurrence. Behavior Research Methods, Instruments, & Computers 28:203–
208
Mandera P, Keuleers E, Brysbaert M (2017) Explaining human performance in psy-
cholinguistic tasks with models of semantic similarity based on prediction and
counting: A review and empirical validation. Journal of Memory and Language
92:57–78
Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D (2014) The
Stanford CoreNLP natural language processing toolkit. In: Proceedings of ACL
2014, pp 55–60
Medin DL, Goldstone RL, Gentner D (1993) Respects for similarity. Psychological
Review 100(2):254–278
Melamud O, McClosky D, Patwardhan S, Bansal M (2016) The Role of Context
Types and Dimensionality in Learning Word Embeddings. In: Proceedings of
NAACL-HLT 2016, San Diego, CA, USA, pp 1030–1040
Mickus T, Paperno D, Constant M, van Deemter K (2020) What do you mean, BERT?
Assessing BERT as a distributional semantics model. In: Proceedings of the Soci-
ety for Computation in Linguistics 2020, New Orleans, LA, USA, pp 235–245
Mikolov T, Chen K, Corrado GS, Dean J (2013a) Efficient estimation of word repre-
sentations in vector space. In: Proceedings of ICLR 2013, Scottsdale, AZ, USA
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013b) Distributed represen-
tations of words and phrases and their compositionality. In: Advances in Neural
Information Processing Systems 26 (NIPS 2013), pp 3111–3119
Mikolov T, Yih Wt, Zweig G (2013c) Linguistic Regularities in Continuous Space
Word Representations. In: Proceedings of NAACL-HLT 2013, Atlanta, GA, USA, pp
746–751
Murphy G (2002) The Big Book of Concepts. MIT Press, Cambridge, MA
Padó S, Lapata M (2007) Dependency-based construction of semantic space models.
Computational Linguistics 33(2):161–199
Pennington J, Socher R, Manning CD (2014) GloVe: Global Vectors for Word Rep-
resentation. In: Proceedings of EMNLP 2014, Doha, Qatar, pp 1532–1543
Peterson JC, Chen D, Griffiths TL (2020) Parallelograms revisited: Exploring the
limitations of vector space models for simple analogies. Cognition 205:104440
Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I (2019) Language Models
are Unsupervised Multitask Learners. Tech. rep., OpenAI
Řehůřek R, Sojka P (2010) Software framework for topic modelling with large cor-
pora. In: Proceedings of LREC 2010 Workshop New Challenges for NLP Frame-
works, La Valletta, Malta, pp 45–50
Rogers A, Drozd A, Li B (2017) The (too Many) Problems of Analogical Reasoning
with Word Vectors. In: Proceedings *SEM 2017, Vancouver, Canada, pp 135–148
Rogers A, Ananthakrishna SH, Rumshisky A (2018) What’s in Your Embedding,
And How It Predicts Task Performance. In: Proceedings of COLING 2018, Santa
Fe, NM, USA, pp 2690–2703
Sahlgren M (2006) The Word-Space Model: Using distributional analysis to represent
syntagmatic and paradigmatic relations between words in high-dimensional vector
spaces. PhD thesis, Stockholm University
Sahlgren M (2008) The distributional hypothesis. Italian Journal of Linguistics
20(1):31–51
Sahlgren M, Lenci A (2016) The Effects of Data Size and Frequency Range on Dis-
tributional Semantic Models. In: Proceedings of EMNLP 2016, Austin, TX, pp
975–980
Sahlgren M, Holst A, Kanerva P (2008) Permutations as a Means to Encode Order in
Word Space. In: Proceedings of CogSci 2008, pp 1300–1305
Sahlgren M, Gyllensten AC, Espinoza F, Hamfors O, Karlgren J, Olsson F, Persson
P, Viswanathan A, Holst A (2016) The Gavagai Living Lexicon. In: Proceedings
of LREC 2016, Portorož, Slovenia, pp 344–350
Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing.
Communications of the ACM 18(11):613–620
Schnabel T, Labutov I, Mimno D, Joachims T (2015) Evaluation methods for unsu-
pervised word embeddings. In: Proceedings of EMNLP 2015, Lisbon, Portugal, pp
298–307
Tenney I, Das D, Pavlick E (2019) BERT Rediscovers the Classical NLP Pipeline.
In: Proceedings of ACL 2019, Florence, Italy, pp 4593–4601
Turney PD, Pantel P (2010) From frequency to meaning: Vector space models of
semantics. Journal of Artificial Intelligence Research 37:141–188
Ushio A, Espinosa-Anke L, Schockaert S, Camacho-Collados J (2021) BERT is to
NLP what AlexNet is to CV: Can Pre-Trained Language Models Identify Analo-
gies? arXiv 2105.04949
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polo-
sukhin I (2017) Attention is all you need. In: Advances in Neural Information
Processing Systems 30 (NIPS 2017), Long Beach, CA, USA
Vulić I, Ponti EM, Litschko R, Glavaš G, Korhonen A (2020) Probing pretrained
language models for lexical semantics. In: Proceedings of EMNLP 2020, online,
pp 7222–7240
Westera M, Boleda G (2019) Don’t Blame Distributional Semantics if it can’t do En-
tailment. In: Proceedings of the 13th International Conference on Computational
Semantics, Gothenburg, Sweden, pp 120–133
Wiedemann G, Remus S, Chawla A, Biemann C (2019) Does BERT Make Any
Sense? Interpretable Word Sense Disambiguation with Contextualized Embed-
dings. In: Proceedings of the Conference on Natural Language Processing (KON-
VENS), Erlangen, Germany
Witten IH, Frank E (2005) Data Mining. Practical Machine Learning Tools and Tech-
niques (second edition). Elsevier, San Francisco, CA
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, Cistac P, Rault T, Fun-
towicz M, Davison J, Shleifer S, Von Platen P, Ma C, Jernite Y, Plu J, Xu C, Le
Scao T, Gugger S, Drame M, Lhoest Q, Rush AM (2020) Transformers: State-of-
the-Art Natural Language Processing. In: Proceedings of EMNLP 2020, online, pp
38–45