
Evaluating Word Embedding Models: Methods and Experimental Results

This document evaluates various word embedding models used in natural language processing, categorizing evaluators into intrinsic and extrinsic types. It discusses the properties of effective word representations and presents experimental results showing the performance of different models across these evaluators. The study aims to provide insights into selecting appropriate word embedding models for specific language tasks by analyzing the correlation between intrinsic and extrinsic evaluators.


Evaluating Word Embedding Models: Methods and Experimental Results

Bin Wang∗, Student Member, IEEE, Angela Wang∗, Fenxiao Chen, Student Member, IEEE, Yuncheng Wang and C.-C. Jay Kuo, Fellow, IEEE

arXiv:1901.09785v2 [cs.CL] 29 Jan 2019

Bin Wang, Fenxiao Chen, Yuncheng Wang and C.-C. Jay Kuo are with the Ming-Hsieh Department of Electrical and Computer Engineering, University of Southern California, Los Angeles, CA 90089-2564, USA. Email: {wang699, fenxiaoc}@usc.edu, [email protected]
Angela Wang is with the Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, CA 94720-2278, USA. Email: [email protected]
∗ These authors contributed equally to this work.

Abstract: Extensive evaluation of a large number of word embedding models for language processing applications is conducted in this work. First, we introduce popular word embedding models and discuss desired properties of word models and evaluation methods (or evaluators). Then, we categorize evaluators into two types: intrinsic and extrinsic. Intrinsic evaluators test the quality of a representation independent of specific natural language processing tasks, while extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. We report experimental results of intrinsic and extrinsic evaluators on six word embedding models. It is shown that different evaluators focus on different aspects of word models, and some are more correlated with natural language processing tasks. Finally, we adopt correlation analysis to study performance consistency of extrinsic and intrinsic evaluators.

Index Terms: Word embedding, Word embedding evaluation, Natural language processing.

I. INTRODUCTION

Word embedding is a real-valued vector representation of words that embeds both semantic and syntactic meanings obtained from a large unlabeled corpus. It is a powerful tool widely used in modern natural language processing (NLP) tasks, including semantic analysis [1], information retrieval [2], dependency parsing [3], [4], [5], question answering [6], [7] and machine translation [6], [8], [9]. Learning a high quality representation is extremely important for these tasks, yet the question “what is a good word embedding model” remains an open problem.

Various evaluation methods (or evaluators) have been proposed to test qualities of word embedding models. As introduced in [10], there are two main categories of evaluation methods: intrinsic and extrinsic evaluators. Extrinsic evaluators use word embeddings as input features to a downstream task and measure changes in performance metrics specific to that task. Examples include part-of-speech tagging [11], named-entity recognition [12], sentiment analysis [13] and machine translation [14]. Extrinsic evaluators are more computationally expensive, and they may not be directly applicable. Intrinsic evaluators test the quality of a representation independent of specific natural language processing tasks. They measure syntactic or semantic relationships among words directly. Aggregate scores are given from testing the vectors in selected sets of query terms and semantically related target words. One can further classify intrinsic evaluators into two types: 1) absolute evaluation, where embeddings are evaluated individually and only their final scores are compared, and 2) comparative evaluation, where people are asked about their preferences among different word embeddings [15]. Since comparative intrinsic evaluators demand additional resources for subjective tests, they are not as popular as the absolute ones.

A good word representation should have certain good properties. An ideal word evaluator should be able to analyze word embedding models from different perspectives. Yet, existing evaluators emphasize a certain aspect, whether consciously or not. There is no unified evaluator that analyzes word embedding models comprehensively. Researchers have a hard time selecting among word embedding models because models do not always perform at the same level on different intrinsic evaluators. As a result, the gold standard for a good word embedding model differs for different language tasks. In this work, we conduct a correlation study between intrinsic evaluators and language tasks so as to provide insights into various evaluators and help people select word embedding models for specific language tasks.

Although the correlation between intrinsic and extrinsic evaluators was studied before [16], [17], this topic has never been treated thoroughly. For example, producing models by changing the window size only does not happen often in real world applications, and the conclusion drawn in [16] might be biased. The work in [17] only focused on Chinese characters with limited experiments. We provide the most comprehensive study and try to avoid the bias as much as possible in this work.

The rest of the paper is organized as follows. Popular word embedding models are reviewed in Sec. II. Properties of good embedding models and intrinsic evaluators are discussed in Sec. III. Representative performance metrics of intrinsic evaluation are presented in Sec. IV and the corresponding experimental results are offered in Sec. V. Representative performance metrics of extrinsic evaluation are introduced in Sec. VI and the corresponding experimental results are provided in Sec. VII. We conduct a consistency study on intrinsic and extrinsic evaluators using correlation analysis in Sec. VIII. Finally, concluding remarks and future research directions are discussed in Sec. IX.

II. WORD EMBEDDING MODELS

As extensive NLP downstream tasks emerge, the demand for word embedding is growing significantly. As a result, many word embedding methods have been proposed, some of which share the same concept. We categorize the existing word embedding methods based on their techniques.

A. Neural Network Language Model (NNLM)

The Neural Network Language Model (NNLM) [18] jointly learns a word vector representation and a statistical language model with a feedforward neural network that contains a linear projection layer and a non-linear hidden layer. An $N$-dimensional one-hot vector that represents the word is used as the input, where $N$ is here the size of the vocabulary. The input is first projected onto the projection layer. Afterwards, a softmax operation is used to compute the probability distribution over all words in the vocabulary. As a result of its non-linear hidden layers, the NNLM model is very computationally complex. To lower the complexity, an NNLM is first trained using continuous word vectors learned from simple models. Then, another N-gram NNLM is trained from the word vectors.

B. Continuous-Bag-of-Words (CBOW) and Skip-Gram

Two iteration-based methods were proposed in the word2vec paper [19]. The first one is the Continuous-Bag-of-Words (CBOW) model, which predicts the center word from its surrounding context. This model maximizes the probability of a word being in a specific context in the form of

$$P(w_i \mid w_{i-c}, w_{i-c+1}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+c-1}, w_{i+c}), \tag{1}$$

where $w_i$ is the word at position $i$ and $c$ is the window size. Thus, it yields a model that is contingent on the distributional similarity of words.

We focus on the first iteration in the discussion below. Let $W$ be the vocabulary set containing all words. The CBOW model trains two matrices: 1) an input word matrix denoted by $V \in \mathbb{R}^{N \times |W|}$, where the $i$th column of $V$ is the $N$-dimensional embedded vector for input word $v_i$, and 2) an output word matrix denoted by $U \in \mathbb{R}^{|W| \times N}$, where the $j$th row of $U$ is the $N$-dimensional embedded vector for output word $u_j$. To embed input context words, we start from the one-hot representation of each word and multiply it by $V$ to get the corresponding word vector embedding of dimension $N$. We then multiply the averaged input word vector by $U$ to generate a score vector and use the softmax operation to convert the score vector into a probability vector of size $|W|$. The aim is a probability vector that matches the one-hot representation of the output word. The CBOW model is obtained by minimizing the cross-entropy loss between the probability vector and the one-hot vector of the output word. This is achieved by minimizing the following objective function:

$$J(u_i) = -u_i^T \hat{v} + \log \sum_{j=1}^{|W|} \exp(u_j^T \hat{v}), \tag{2}$$

where $u_i$ is the $i$th row of matrix $U$ and $\hat{v}$ is the average of the embedded input words.

Initial values for matrices $V$ and $U$ are randomly assigned. The dimension $N$ of the word embedding can vary based on different application scenarios. Usually, it ranges from 50 to 300 dimensions. After obtaining both matrices $V$ and $U$, they can either be used alone or averaged to obtain the final word embedding matrix.

The skip-gram model [19] predicts the surrounding context words given a center word. It focuses on maximizing probabilities of context words given a specific center word, which can be written as

$$P(w_{i-c}, w_{i-c+1}, \ldots, w_{i-1}, w_{i+1}, \ldots, w_{i+c-1}, w_{i+c} \mid w_i). \tag{3}$$

The optimization procedure is similar to that for the CBOW model but with a reversed order for context and center words.

The softmax function mentioned above is a method to generate probability distributions from word vectors. It can be written as

$$P(w_c \mid w_i) = \frac{\exp(v_{w_c}^T v_{w_i})}{\sum_{w=1}^{|W|} \exp(v_w^T v_{w_i})}. \tag{4}$$

This softmax function is not the most efficient one since we must take a sum over all $|W|$ words to normalize it. Other functions that are more efficient include negative sampling and hierarchical softmax [20]. Negative sampling is a method that maximizes the log probability of the softmax model by only summing over a smaller subset of the $|W|$ words. Hierarchical softmax also approximates the full softmax function, evaluating only about $\log_2 |W|$ tree nodes. It uses a binary tree representation of the output layer where the words are leaves and every node represents the relative probabilities of its child nodes. These two approaches do well in making predictions for local context windows and capturing complex linguistic patterns. Yet, they could be further improved if global co-occurrence statistics were leveraged.

C. Co-occurrence Matrix

In our current context, the co-occurrence matrix is a word-document matrix. The $(i, j)$ entry, $X_{ij}$, of the co-occurrence matrix $X$ is the number of times word $i$ occurs in document $j$. This definition can be generalized to a window-based co-occurrence matrix, which records the number of times a certain word appears in a window of a specific size around a center word. In contrast with the window-based log-linear model representations (e.g. CBOW or Skip-gram) that use local information only, the global statistical information is exploited by this approach.

One method to process co-occurrence matrices is the singular value decomposition (SVD). The co-occurrence matrix is expressed as a matrix product $U S V^T$, where the first $k$ columns of both $U$ and $V$ are word embedding matrices that transform vectors into a $k$-dimensional space, with $k$ chosen so that it is sufficient to capture the semantics of words. Although embedded vectors derived by this procedure are good at capturing semantic and syntactic information, they still face problems such as imbalance in word frequency, sparsity and high dimensionality of embedded vectors, and computational complexity.
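
The SVD route of Sec. II-C can be illustrated with a small numerical sketch. The code below is only a minimal illustration under stated assumptions, not the procedure used in the experiments: the two-sentence toy corpus, the window size of 2, the helper names `cooccurrence_matrix` and `svd_embeddings`, and the choice to scale the left singular vectors by the singular values are all hypothetical choices made for the example.

```python
import numpy as np

def cooccurrence_matrix(sentences, window=2):
    """Build a window-based word-word co-occurrence matrix (toy version)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for pos, center in enumerate(s):
            lo, hi = max(0, pos - window), min(len(s), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:  # skip the center word itself
                    X[index[center], index[s[ctx_pos]]] += 1.0
    return X, vocab

def svd_embeddings(X, k):
    """Keep the top-k singular directions of X as k-dimensional word vectors."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    # Scaling by the singular values is one common convention (an assumption here).
    return U[:, :k] * S[:k]

# Hypothetical toy corpus, for illustration only.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the rug".split(),
]
X, vocab = cooccurrence_matrix(corpus, window=2)
E = svd_embeddings(X, k=2)  # one 2-dimensional vector per vocabulary word
```

On a realistic vocabulary the dense matrix and the full decomposition become expensive, which is the computational-complexity concern raised in the text; a sparse, truncated solver would normally be substituted.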
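
For comparison with the count-based route, the log-linear CBOW objective of Eq. (2) and the softmax of Eq. (4) in Sec. II-B can be sketched numerically. This is a minimal sketch under assumptions: the vocabulary size, embedding dimension, random initialization scale and word indices are hypothetical, and, given the stated shapes of $V$ ($N \times |W|$) and $U$ ($|W| \times N$), the sketch multiplies by $V$ and $U$ directly.

```python
import numpy as np

def softmax(scores):
    """Numerically stable softmax over a score vector (the form of Eq. (4))."""
    e = np.exp(scores - np.max(scores))
    return e / e.sum()

def cbow_loss(V, U, context_ids, center_id):
    """Cross-entropy loss of Eq. (2) for one (context, center) training example.

    V: (N, |W|) input matrix, columns are input word vectors.
    U: (|W|, N) output matrix, rows are output word vectors.
    """
    v_hat = V[:, context_ids].mean(axis=1)  # average of the context vectors
    scores = U @ v_hat                      # one score u_j^T v_hat per word j
    m = scores.max()                        # log-sum-exp with a stability shift
    log_norm = m + np.log(np.exp(scores - m).sum())
    return -scores[center_id] + log_norm    # J = -u_i^T v_hat + log sum_j exp(.)

# Hypothetical sizes and indices, for illustration only.
rng = np.random.default_rng(0)
vocab_size, dim = 10, 4
V = rng.normal(scale=0.1, size=(dim, vocab_size))
U = rng.normal(scale=0.1, size=(vocab_size, dim))

context = [1, 2, 4, 5]
loss = cbow_loss(V, U, context_ids=context, center_id=3)
probs = softmax(U @ V[:, context].mean(axis=1))
```

The loss equals the negative log of the softmax probability assigned to the center word, which makes the cost of the full normalization over $|W|$ words explicit; negative sampling and hierarchical softmax replace exactly that sum.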

To combine benefits from the SVD-based model and the log-linear models, the Global Vectors (GloVe) method [21] adopts a weighted least-squares model. It has a framework similar to that of the skip-gram model, yet it has a different objective function that contains co-occurrence counts. We first define a word-word co-occurrence matrix that records the number of times word $j$ occurs in the context of word $i$. By modifying the objective function adopted by the skip-gram model, we derive a new objective function in the form of

$$\hat{J} = \sum_{i=1}^{|W|} \sum_{j=1}^{|W|} f(X_{ij}) \left(u_j^T v_i - \log X_{ij}\right)^2, \tag{5}$$

where $X_{ij}$ is the number of times word $j$ occurs in the context of word $i$ and $f$ is a weighting function applied to that count.

The GloVe model is more efficient as its objective function contains nonzero elements of the word-word co-occurrence matrix only. Besides, it produces a more accurate result as it takes co-occurrence counts into account.

D. FastText

Embeddings of rarely used words can sometimes be poorly estimated. Therefore, several methods have been proposed to remedy this issue, including the FastText method. FastText uses subword information explicitly so that embeddings for rare words can still be represented well. It is still based on the skip-gram model, where each word is represented as a bag of character n-grams or subword units [22]. A vector representation is associated with each character n-gram, and the average of these vectors gives the final representation of the word. This model improves the performance on syntactic tasks significantly but much less on semantic questions.

E. N-gram Model

The N-gram model is an important concept in language models. It has been used in many NLP tasks. The ngram2vec method [23] incorporates the n-gram model in various baseline embedding models such as word2vec, GloVe, PPMI and SVD. Furthermore, instead of using traditional training sample pairs or sub-word level information as in FastText, the ngram2vec method considers word-word level co-occurrence and enlarges the reception window by adding the word-ngram and the ngram-ngram co-occurrence information. Its performance on word analogy and word similarity tasks is significantly improved. It is also able to learn negation word pairs/phrases like ’not interesting’, which is a difficult case for other models.

F. Dictionary Model

Even with larger text data available, extracting and embedding all linguistic properties into a word representation directly is a challenging task. Lexical databases such as WordNet are helpful to the process of learning word embeddings, yet labeling large lexical databases is a time-consuming and error-prone task. In contrast, a dictionary is a large and refined data source for describing words. The dict2vec method learns word representations from dictionary entries as well as a large unlabeled corpus [24]. Using the semantic information from a dictionary, semantically-related words tend to be closer in the high-dimensional vector space. Also, negative sampling is used to filter out pairs which are not correlated in a dictionary.

G. Deep Contextualized Model

To represent complex characteristics of words and word usage across different linguistic contexts effectively, a new model for deep contextualized word representation was introduced in [25]. First, an Embeddings from Language Models (ELMo) representation is generated with a function that takes an entire sentence as the input. The function is generated by a bidirectional LSTM network that is trained with a coupled language model. Existing embedding models can be improved by incorporating the ELMo representation as it is effective in incorporating the sentence information. Following ELMo, a series of pre-trained neural network models for language tasks have been proposed, such as BERT [26] and OpenAI GPT [27]. Their effectiveness has been demonstrated on many language tasks.

III. DESIRED PROPERTIES OF EMBEDDING MODELS AND EVALUATORS

A. Embedding Models

Different word embedding models yield different vector representations. There are a few properties that all good representations should aim for.

• Non-conflation [28]
Different local contexts around a word should give rise to specific properties of the word, e.g., the plural or singular form, the tenses, etc. Embedding models should be able to discern differences in the contexts and encode these details into a meaningful representation in the word subspace.

• Robustness Against Lexical Ambiguity [28]
All senses (or meanings) of a word should be represented. Models should be able to discern the sense of a word from its context and find the appropriate embedding. This is needed to avoid meaningless representations from conflicting properties that may arise from the polysemy of words. For example, word models should be able to represent the difference between the following: “the bow of a ship” and “bow and arrows”.

• Demonstration of Multifacetedness [28]
The facets of a word (phonetic, morphological, syntactic, and other properties) should contribute to its final representation. This is important as word models should yield meaningful word representations and perhaps find relationships between different words. For example, the representation of a word should change when the tense is changed or a prefix is added.

• Reliability [29]
Results of a word embedding model should be reliable. This is important as word vectors are randomly initialized when being trained. Even if a model creates different representations from the same dataset because of random initialization, the performance of the various representations should score consistently.

• Good Geometry [30]
The geometry of an embedding space should have a good spread. Generally speaking, a smaller set of more frequent, unrelated words should be evenly distributed throughout the space while a larger set of rare words should cluster around frequent words. Word models should overcome the difficulty arising from inconsistent frequency of word usage and derive some meaning from word frequency.

B. Evaluators

The goal of an evaluator is to compare characteristics of different word embedding models with a quantitative and representative metric. However, it is not easy to find a concrete and uniform way of evaluating these abstract characteristics. Generally, a good word embedding evaluator should aim for the following properties.

• Good Testing Data
To ensure a reliable representative score, testing data should be varied with a good spread in the span of a word space. Frequently and rarely occurring words should be included in the evaluation. Furthermore, data should be reliable in the sense that they are correct and objective.

• Comprehensiveness
Ideally, an evaluator should test for many properties of a word embedding model. This is not only an important property for giving a representative score but also for determining the effectiveness of an evaluator.

• High Correlation
The score of a word model in an intrinsic evaluation task should correlate well with the performance of the model in downstream natural language processing tasks. This is important for determining the effectiveness of an evaluator.

• Efficiency
Evaluators should be computationally efficient. Most models are created to solve computationally expensive downstream tasks. Model evaluators should be simple yet able to predict the downstream performance of a model.

• Statistical Significance
The performance of different word embedding models with respect to an evaluator should have enough statistical significance, or enough variance between score distributions, to be differentiated [31]. This is needed in judging whether a model is better than another and helpful in determining performance rankings between models.

IV. INTRINSIC EVALUATORS

Intrinsic evaluators test the quality of a representation independent of specific natural language processing tasks. They measure syntactic or semantic relationships among words directly. In this section, a number of absolute intrinsic evaluators will be discussed.

A. Word Similarity

The word similarity evaluator correlates the distance between word vectors and human perceived semantic similarity. The goal is to measure how well the notion of human perceived similarity is captured by the word vector representations, and to validate the distributional hypothesis where the meaning of words is related to the context they occur in. For the latter, the way distributional semantic models simulate similarity is still ambiguous [32].

One commonly used evaluator is the cosine similarity defined by

$$\cos(w_x, w_y) = \frac{w_x \cdot w_y}{\|w_x\| \, \|w_y\|}, \tag{6}$$

where $w_x$ and $w_y$ are two word vectors and $\|w_x\|$ and $\|w_y\|$ are their $\ell_2$ norms. This test computes the correlation between all vector dimensions, independent of their relevance for a given word pair or for a semantic cluster.

Because its scores are normalized by the vector length, it is robust to scaling. It is computationally inexpensive. Thus, it is easy to compare multiple scores from a model, and it can be used in the prototyping and development of word models. Furthermore, word similarity can be used to test a model’s robustness against lexical ambiguity, as a dataset aimed at testing multiple senses of a word can be created.

On the other hand, it has several problems as discussed in [32]. This test is aimed at finding the distributional similarity among pairs of words, but this is often conflated with morphological relations and simple collocations. Similarity may be confused with relatedness. For example, car and train are two similar words while car and road are two related words. The correlation between the score from the intrinsic test and other extrinsic downstream tasks could be low in some cases. There is doubt about the effectiveness of this evaluator because it might not be comprehensive.

B. Word Analogy

When given a pair of words $a$ and $a^*$ and a third word $b$, the analogy relationship between $a$ and $a^*$ can be used to find the word $b^*$ that corresponds to $b$. Mathematically, it is expressed as

$$a : a^* :: b : \_\_, \tag{7}$$

where the blank is $b^*$. One example could be

$$\text{write} : \text{writing} :: \text{read} : \text{reading}. \tag{8}$$

The 3CosAdd method [33] solves for $b^*$ using the following equation:

$$b^* = \operatorname*{argmax}_{b'} \cos(b', a^* - a + b). \tag{9}$$

Thus, high cosine similarity means that vectors share a similar direction. However, it is important to note that the 3CosAdd method normalizes vector lengths using the cosine similarity [33]. Alternatively, there is the 3CosMul [34] method, which is defined as

$$b^* = \operatorname*{argmax}_{b'} \frac{\cos(b', b) \cos(b', a^*)}{\cos(b', a) + \varepsilon}, \tag{10}$$

where $\varepsilon = 0.001$ is used to prevent division by zero. The 3CosMul method has the same effect as taking the logarithm of each term before summation. That is, small differences are enlarged while large ones are suppressed. Therefore, it

is observed that the 3CosMul method offers better balance in different aspects.

It was stated in [35] that many models score under 30% on analogy tests, suggesting that not all relations can be identified in this way. In particular, lexical semantic relations like synonymy and antonymy are the most difficult. They also concluded that the analogy test is the most successful when all three source vectors are relatively close to the target vector. Accuracy of this test decreases as their distance increases. Another seemingly counter-intuitive finding is that words with denser neighborhoods yield higher accuracy. This is perhaps because of its correlation with distance. Another problem with this test is subjectivity. Analogies are fundamental to human reasoning and logic. The dataset on which current word models are trained does not encode our sense of reasoning. It is rather different from the way humans learn natural languages. Thus, given a word pair, the vector space model may find a different relationship from what humans may find.

Generally speaking, this evaluator serves as a good benchmark in testing multifacetedness. A pair of words $a$ and $a^*$ can be chosen based on the facet or the property of interest with the hope that the relationship between them is preserved in the vector space. This will contribute to a better vector representation of words.

C. Concept Categorization

An evaluator that is somewhat different from both word similarity and word analogy is concept categorization. Here, the goal is to split a given set of words into different categorical subsets of words. For example, given the task of separating words into two categories, the model should be able to categorize the words sandwich, tea, pasta, water into two groups.

In general, the test can be conducted as follows. First, the vector corresponding to each word is calculated. Then, a clustering algorithm (e.g., the k-means algorithm) is used to separate the set of word vectors into $n$ different categories. A performance metric is then defined based on each cluster’s purity, where purity refers to whether each cluster contains concepts from the same or different categories [36].

By looking at datasets provided for this evaluator, we would like to point out some challenges. First, the datasets do not have standardized splits. Second, no specific clustering methods are defined for this evaluator. It is important to note that clustering can be computationally expensive, especially when there are a large number of words and categories. Third, the clustering methods may be unreliable if there are either uneven distributions of word vectors or no clearly defined clusters.

Subjectivity is another main issue. As stated by Senel et al. [37], humans can group words by inference using concepts that word embeddings can gloss over. Given the words lemon, sun, banana, blueberry, ocean and iris, one could group them into yellow objects (lemon, sun, banana) and blue objects (blueberry, ocean, iris). Since words can belong to multiple categories, we may argue that lemon, banana, blueberry, and iris are in the plant category while sun and ocean are in the nature category. However, due to the uncompromising nature of the performance metric, there is no adequate method of evaluating each cluster’s quality.

The property that the sets of words and categories seem to test for is semantic relation, as words are grouped into concept categories. One good property of this evaluator is its ability to test for the frequency effect and the hubness problem since it is good at revealing whether frequent words are clustered together.

D. Outlier Detection

A relatively new method that evaluates word clustering in vector space models is outlier detection [38]. The goal is to find words that do not belong to a given group of words. This evaluator tests the semantic coherence of vector space models, where semantic clusters can be first identified. There is a clear gold standard for this evaluator since human performance on this task is extremely high as compared to word similarity tasks. It is also less subjective. To formalize this evaluator mathematically, we can take a set of words

$$W = \{w_1, w_2, \ldots, w_{n+1}\}, \tag{11}$$

where there is one outlier. Next, we take a compactness score of word $w$ as

$$c(w) = \frac{1}{n(n-1)} \sum_{w_i \in W \setminus \{w\}} \; \sum_{w_j \in W \setminus \{w\},\, w_j \neq w_i} \operatorname{sim}(w_i, w_j). \tag{12}$$

Intuitively, the compactness score of a word $w$ is the average of all pairwise semantic similarities among the words of $W$ other than $w$. Because $c(w)$ in Eq. (12) is computed with $w$ left out, the remaining words are most compact when $w$ is the outlier, so the outlier is the word whose removal yields the highest score. There has been less research on this evaluator than on word similarity and word analogy. Yet, it provides a good metric to check whether the geometry of an embedding space is good. If frequent words are clustered to form hubs while rarer words are not clustered around the more frequent words they relate to, the model will not perform well on this metric.

There is subjectivity involved in this evaluator as the relationship of different word groups can be interpreted in different ways. However, since human perception is often correlated, it may be safe to assume that this evaluator is objective enough [38]. Also, similar to the word analogy evaluator, this evaluator relies heavily on human reasoning and logic. The outliers identified by humans are strongly influenced by the characteristics of words perceived to be important. Yet, the recognized patterns might not be immediately clear to word embedding models.

E. QVEC

QVEC [39] is an intrinsic evaluator that measures the component-wise correlation between word vectors from a word embedding model and manually constructed linguistic word vectors in the SemCor dataset. These linguistic word vectors are constructed in an attempt to give well-defined linguistic properties. QVEC is grounded in the hypothesis that dimensions in the distributional vectors correspond to linguistic properties of words. Thus, linear combinations of vector

dimensions produce relevant content. Furthermore, QVEC is a recall-oriented measure, and highly correlated alignments provide evaluation and annotations of vector dimensions. Missing information or noisy dimensions do not significantly affect the score.

The most prevalent problem with this evaluator is the subjectivity of man-made linguistic vectors. Current word embedding techniques perform much better than man-made models as they are based on statistical relations from data. Having a score based on the correlation between the word embeddings and the linguistic word vectors may seem to be counter-intuitive. Thus, the QVEC scores are not very representative of the performance in downstream tasks. On the other hand, because linguistic vectors are manually generated, we know exactly which properties the method is testing for.

V. EXPERIMENTAL RESULTS OF INTRINSIC EVALUATORS

We conduct extensive evaluation experiments on six word embedding models with intrinsic evaluators in this section. The performance metrics under consideration include: 1) word similarity, 2) word analogy, 3) concept categorization, 4) outlier detection and 5) QVEC.

A. Experimental Setup

We select six word embedding models in the experiments. They are SGNS, CBOW, GloVe, FastText, ngram2vec and Dict2vec. For consistency, we perform training on the same corpus, wiki2010¹. It is a dataset of medium size (around 6G) without XML tags. After preprocessing, all special symbols are removed. By choosing a middle-sized training dataset, we attempt to keep the generality of real world situations. Some models may perform better when being trained on larger

toolkit². The GloVe toolkit is available from their official website³. For FastText, we used their code⁴. Since FastText uses sub-words as basic units, it can deal with the out-of-vocabulary (OOV) problem well, which is one of the main advantages of FastText. Here, to compare the word vector quality only, we set the vocabulary set for FastText to be the same as for the other models. For the ngram2vec model⁵, because it can be trained over multiple baselines, we chose the best model reported in their original paper. Finally, code for Dict2vec can be obtained from its website⁶. The training time for all models is acceptable (within several hours) using a modern computer. The threshold for vocabulary is set to 10 for all models. That is, all words with frequency lower than 10 are assigned the same vector.

B. Experimental Results

1) Word Similarity: We choose 13 datasets for word similarity evaluation. They are listed in Table I, which gives the number of word pairs and the year of each dataset. Among the 13 datasets, WS-353, WS-353-SIM, WS-353-REL and Rare-Word are the more popular ones because of the high quality of their word pairs. The Rare-Word (RW) dataset can be used to test a model’s ability to learn words with low frequency. The evaluation result is shown in Table II. We see that SGNS-based models generally perform better. Note that ngram2vec is an improvement over the SGNS model, and its performance is the best. Also, the Dict2vec model provides the best result on the RW dataset. This could be attributed to the fact that Dict2vec fine-tunes word vectors based on dictionaries. Since infrequent words are treated equally with others in dictionaries, the Dict2vec model is able to give a better representation of rare words.

TABLE I
WORD SIMILARITY DATASETS USED IN OUR EXPERIMENTS, WHERE PAIRS INDICATES THE NUMBER OF WORD PAIRS IN EACH DATASET.

Name                   Pairs   Year
WS-353 [40]              353   2002
WS-353-SIM [41]          203   2009
WS-353-REL [41]          252   2009
MC-30 [42]                30   1991
RG-65 [43]                65   1965
Rare-Word (RW) [44]     2034   2013
MEN [45]                3000   2012
MTurk-287 [46]           287   2011
MTurk-771 [47]           771   2012
YP-130 [48]              130   2006
SimLex-999 [49]          999   2014
Verb-143 [50]            143   2014
SimVerb-3500 [51]       3500   2016

2) Word Analogy: Two datasets are adopted for the word analogy evaluation task. They are: 1) the Google dataset [19] and 2) the MSR dataset [33]. The Google dataset contains 19,544 questions. They are divided into two categories, “semantic” and “morpho-syntactic”, which contain 8,869 and 10,675 questions, respectively. Results for these two subsets are also reported. The MSR dataset contains 8,000 analogy questions. Both 3CosAdd and 3CosMul inference methods are implemented. We show the word analogy evaluation results in Table III. SGNS performs the best. One word set for the analogy task has four words. Since ngram2vec considers n-gram models, the relationship within word sets may not be properly captured. Dictionaries do not have such word sets and, thus, word analogy is not well-represented in the word vectors of Dict2vec. Finally, since FastText uses sub-words, its syntactic result is much better than its semantic result.

3) Concept Categorization: Three datasets are used in
datasets while others are less dataset dependent. Here, the concept categorization evaluation. They are: 1) the AP dataset
same training dataset is used to fit a more general situation [52], 2) the BLESS dataset [53] and 3) the BM dataset [54].
for fair comparison among different word embedding models. The AP dataset contains 402 words that are divided into 21
For all embedding models, we used their official released 2 https://code.google.com/archive/p/word2vec/
toolkit and default setting for training. For SGNS and CBOW, 3 https://nlp.stanford.edu/projects/glove/
we used the default setting provided by the official released 4 https://github.com/facebookresearch/fastText
5 https://github.com/zhezhaoa/ngram2vec
1 http://nlp.stanford.edu/data/WestburyLab.wikicorp.201004.txt.bz2 6 https://github.com/tca19/dict2vec

TABLE II
PERFORMANCE COMPARISON (×100) OF SIX WORD EMBEDDING BASELINE MODELS AGAINST 13 WORD SIMILARITY DATASETS.

Word Similarity Datasets

Model      WS    WS-SIM  WS-REL  MC    RG    RW    MEN   MTurk-287  MTurk-771  YP    SimLex  Verb  SimVerb
SGNS       71.6  78.7    62.8    81.1  79.3  46.6  76.1  67.3       67.8       53.6  39.8    45.6  28.9
CBOW       64.3  74.0    53.4    74.7  81.3  43.3  72.4  67.4       63.6       41.6  37.2    40.9  24.5
GloVe      59.7  66.8    55.9    74.2  75.1  32.5  68.5  61.9       63.0       53.4  32.4    36.7  17.2
FastText   64.8  72.1    56.4    76.3  77.3  46.6  73.0  63.0       63.0       49.0  35.2    35.0  21.9
ngram2vec  74.2  81.5    67.8    85.7  79.5  45.0  75.1  66.5       66.5       56.4  42.5    47.8  32.1
Dict2vec   69.4  72.8    57.3    80.5  85.7  49.9  73.3  60.0       65.5       59.6  41.7    18.9  41.7
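
The word similarity scores in Table II can be reproduced in spirit with a few lines of code. The following is a minimal sketch, assuming the usual protocol for this task: cosine similarity between the two word vectors of each pair, compared against the human ratings with Spearman's rank correlation. The function names, the toy vectors and the toy ratings are illustrative assumptions, not the authors' code or data.

```python
import numpy as np


def cosine(u, v):
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def rankdata(values):
    """1-based ranks; tied values share the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    i = 0
    while i < len(values):
        j = i
        while j + 1 < len(values) and values[order[j + 1]] == values[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return float(np.corrcoef(rankdata(x), rankdata(y))[0, 1])


def word_similarity_score(embeddings, pairs):
    """Return the correlation (x100); pairs with unknown words are skipped."""
    model_sims, human_sims = [], []
    for w1, w2, gold in pairs:
        if w1 in embeddings and w2 in embeddings:
            model_sims.append(cosine(embeddings[w1], embeddings[w2]))
            human_sims.append(gold)
    return 100.0 * spearman(model_sims, human_sims)


# Hypothetical toy vectors and ratings, for illustration only.
toy_embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.3]),
    "car": np.array([0.0, 1.0]),
    "truck": np.array([0.1, 0.9]),
}
toy_pairs = [
    ("cat", "dog", 8.0),
    ("car", "truck", 9.0),
    ("cat", "car", 1.0),
    ("dog", "truck", 2.0),
]
toy_score = word_similarity_score(toy_embeddings, toy_pairs)
```

In this toy case the model ranks the four pairs in the same order as the ratings, so the score reaches its maximum of 100. How out-of-vocabulary pairs are handled (skipped here) is a design choice that can noticeably change scores on datasets such as RW.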

TABLE III
PERFORMANCE COMPARISON (×100) OF SIX WORD EMBEDDING BASELINE MODELS AGAINST WORD ANALOGY DATASETS.

Word Analogy Datasets

           Google       Semantic     Syntactic    MSR
Model      Add   Mul    Add   Mul    Add   Mul    Add   Mul
SGNS       71.8  73.4   77.6  78.1   67.1  69.5   56.7  59.7
CBOW       70.7  70.8   74.4  74.1   67.6  68.1   56.2  56.8
GloVe      68.4  68.7   76.1  75.9   61.9  62.7   50.3  51.6
FastText   40.5  45.1   19.1  24.8   58.3  61.9   48.6  52.2
ngram2vec  70.1  71.3   75.7  75.7   65.3  67.6   53.8  56.6
Dict2vec   48.5  50.5   45.1  47.4   51.4  53.1   36.5  38.9
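
The Add and Mul columns of Table III refer to the 3CosAdd and 3CosMul inference methods. The sketch below shows one common way to implement both for a question "a is to a* as b is to ?", following the standard formulations (3CosMul with cosines shifted to [0, 1] and a small epsilon in the denominator). The function names, the epsilon value and the toy vocabulary are assumptions made for illustration.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale every row to unit length so dot products are cosines."""
    return matrix / np.linalg.norm(matrix, axis=1, keepdims=True)


def solve_analogy(vocab, vectors, a, a_star, b, method="add", eps=1e-3):
    """Answer 'a is to a_star as b is to ?' with 3CosAdd or 3CosMul."""
    index = {word: i for i, word in enumerate(vocab)}
    unit = normalize_rows(np.asarray(vectors, dtype=float))
    sim_a = unit @ unit[index[a]]
    sim_a_star = unit @ unit[index[a_star]]
    sim_b = unit @ unit[index[b]]
    if method == "add":
        # 3CosAdd: cos(x, b) - cos(x, a) + cos(x, a*)
        scores = sim_b - sim_a + sim_a_star
    elif method == "mul":
        # 3CosMul: cosines are shifted to [0, 1] before multiplying.
        pos_a, pos_a_star, pos_b = [(s + 1.0) / 2.0 for s in (sim_a, sim_a_star, sim_b)]
        scores = pos_b * pos_a_star / (pos_a + eps)
    else:
        raise ValueError("method must be 'add' or 'mul'")
    # The three question words are excluded from the candidate answers.
    for word in (a, a_star, b):
        scores[index[word]] = -np.inf
    return vocab[int(np.argmax(scores))]


# Hypothetical toy vocabulary, for illustration only.
toy_vocab = ["man", "woman", "king", "queen", "apple"]
toy_vectors = [
    [1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
    [-1.0, 0.2, -0.5],
]
answer_add = solve_analogy(toy_vocab, toy_vectors, "man", "woman", "king", "add")
answer_mul = solve_analogy(toy_vocab, toy_vectors, "man", "woman", "king", "mul")
```

Accuracy on an analogy dataset is then the fraction of questions for which the returned word equals the expected fourth word.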

categories. The BM dataset is a larger one with 5321 words divided into 56 categories. Finally, the BLESS dataset consists of 200 words divided into 27 semantic classes. The results are shown in Table IV. We see that the SGNS-based models (including SGNS, ngram2vec and Dict2vec) perform better than others on all three datasets.

TABLE IV
PERFORMANCE COMPARISON (×100) OF SIX WORD EMBEDDING BASELINE MODELS AGAINST THREE CONCEPT CATEGORIZATION DATASETS.

Concept Categorization Datasets

Model      AP    BLESS  BM
SGNS       68.2  81.0   46.6
CBOW       65.7  74.0   45.1
GloVe      61.4  82.0   43.6
FastText   59.0  73.0   41.9
ngram2vec  63.2  80.5   45.9
Dict2vec   66.7  82.0   46.5

4) Outlier Detection: We adopt two datasets for the outlier detection task: 1) the WordSim-500 dataset and 2) the 8-8-8 dataset. The WordSim-500 dataset consists of 500 clusters, where each cluster is represented by a set of 8 words with 5 to 7 outliers [55]. The 8-8-8 dataset has 8 clusters, where each cluster is represented by a set of 8 words with 8 outliers [38]. Both Accuracy and Outlier Position Percentage (OPP) are calculated. The results are shown in Table V. They are not consistent with each other for the two datasets. For example, GloVe has the best performance on the WordSim-500 dataset but its accuracy on the 8-8-8 dataset is the worst. This could be explained by the properties of these two datasets. We will conduct a correlation study in Sec. VIII to shed light on this phenomenon.

TABLE V
PERFORMANCE COMPARISON OF SIX WORD EMBEDDING BASELINE MODELS AGAINST OUTLIER DETECTION DATASETS.

Outlier Detection Datasets

           WordSim-500       8-8-8
Model      Accuracy  OPP     Accuracy  OPP
SGNS       11.25     83.66   57.81     84.96
CBOW       14.02     85.33   56.25     84.38
GloVe      15.09     85.74   50.0      84.77
FastText   10.68     82.16   57.81     84.38
ngram2vec  10.64     82.83   59.38     86.52
Dict2vec   11.03     82.5    60.94     86.52

5) QVEC: We use the QVEC toolkit⁷ and report the sentiment content evaluation result in Table VI. Among the six word models, ngram2vec achieves the best result while SGNS ranks second. This is more consistent with the other intrinsic evaluation results described above.

⁷ https://github.com/ytsvetko/qvec

TABLE VI
QVEC PERFORMANCE COMPARISON (×100) OF SIX WORD EMBEDDING BASELINE MODELS.

Model      QVEC
SGNS       50.62
CBOW       50.61
GloVe      46.81
FastText   49.20
ngram2vec  50.83
Dict2vec   48.29

VI. EXTRINSIC EVALUATORS

Based on the definition of extrinsic evaluators, any NLP downstream task can be chosen as an evaluation method. Here, we present five extrinsic evaluators: 1) part-of-speech tagging, 2) chunking, 3) named-entity recognition, 4) sentiment analysis and 5) neural machine translation.

A. Part-of-speech (POS) Tagging

Part-of-speech (POS) tagging, also called grammar tagging, aims to assign each input token a tag for its part of speech, such as noun, verb, adverb or conjunction. Due to the availability of labeled corpora, many methods can successfully complete this task by learning probability distributions either through linguistic properties or through statistical machine learning. As a low-level linguistic resource, POS tagging can be used for several purposes such as text indexing and retrieval.
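
The Accuracy and OPP figures reported for outlier detection can be illustrated as follows. This is a minimal sketch, assuming the compactness-based scoring commonly used for this task: each word is scored by its average cosine similarity to all other words in the set, the words are sorted by decreasing score, and the outlier is counted as detected when it ends up last. Each test case here is a cluster plus a single outlier; all names and toy vectors are illustrative assumptions.

```python
import numpy as np


def outlier_position(embeddings, cluster, outlier):
    """Rank of the outlier (0 = most compact, len(cluster) = least compact)."""
    words = list(cluster) + [outlier]
    unit = np.array([embeddings[w] / np.linalg.norm(embeddings[w]) for w in words])
    sims = unit @ unit.T
    np.fill_diagonal(sims, 0.0)  # a word is not compared with itself
    compactness = sims.sum(axis=1) / (len(words) - 1)
    order = np.argsort(-compactness, kind="stable")
    return int(np.where(order == len(words) - 1)[0][0])


def evaluate_outlier_detection(embeddings, test_cases):
    """Return (accuracy, OPP), both as percentages."""
    detected, relative_positions = 0, []
    for cluster, outlier in test_cases:
        position = outlier_position(embeddings, cluster, outlier)
        detected += int(position == len(cluster))
        relative_positions.append(position / len(cluster))
    accuracy = 100.0 * detected / len(test_cases)
    opp = 100.0 * float(np.mean(relative_positions))
    return accuracy, opp


# Hypothetical toy vectors, for illustration only.
toy_vectors = {
    "cat": np.array([1.0, 0.1]),
    "dog": np.array([1.0, 0.2]),
    "horse": np.array([1.0, 0.0]),
    "car": np.array([0.0, 1.0]),
}
toy_cases = [
    (["cat", "dog", "horse"], "car"),   # the outlier is ranked last: detected
    (["cat", "car", "dog"], "horse"),   # "car" is ranked last instead: missed
]
toy_accuracy, toy_opp = evaluate_outlier_detection(toy_vectors, toy_cases)
```

Accuracy only rewards the exact detection, while OPP gives partial credit for ranking the outlier near the end, which is one reason the two numbers in Table V can differ so much.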

B. Chunking

The goal of chunking, also called shallow parsing, is to label segments of a sentence with syntactic constituents. Each word is first assigned a tag indicating its properties, such as noun or verb phrase. These tags are then used to group words syntactically into correlated phrases. As compared with POS tagging, chunking provides more clues about the structure of the sentence or of phrases in the sentence.

C. Named-entity Recognition

The named-entity recognition (NER) task is widely used in natural language processing. It focuses on recognizing information units such as names (including person, location and organization) and numeric expressions (e.g., time and percentage). Like the POS tagging task, NER systems use both linguistic grammar-based techniques and statistical models. A grammar-based system demands a lot of effort from experienced linguists. In contrast, a statistical NER system requires a large amount of human-labeled data for training, and it can achieve higher precision. Moreover, current NER systems based on machine learning are heavily dependent on training data. They may not be robust and may not generalize well to different linguistic domains.

D. Sentiment Analysis

Sentiment analysis is a particular text classification problem. Usually, a text fragment is marked with a binary or multi-level label representing the positiveness or negativeness of the text’s sentiment. An example is the IMDb dataset [56], which labels whether a given movie review is positive or negative. Word phrases are an important factor in final decisions. Negative words such as ’no’ or ’not’ can totally reverse the meaning of the whole sentence. Because we are working on sentence-level or paragraph-level data extraction, word sequence and parsing play an important role in analyzing sentiment. Traditional methods focus more on human-labeled sentence structures. With the development of machine learning, more statistical and data-driven approaches have been proposed to deal with the sentiment analysis task [13]. As compared to unlabeled monolingual data, labeled sentiment analysis data are limited. Word embedding is commonly used in sentiment analysis tasks, serving as transferred knowledge extracted from a generic large corpus. Furthermore, the inference tool is also an important factor, and it might play a significant role in the final result. For example, when conducting sentiment analysis tasks, we may use bag-of-words, SVM, LSTM or CNN on top of a certain word model. The performance boosts could be totally different when choosing different inference tools.

E. Neural Machine Translation (NMT)

Neural machine translation (NMT) [14] refers to a category of deep-learning-based methods for machine translation. With large-scale parallel corpus data available, NMT can provide state-of-the-art results for machine translation and has a large gain over traditional machine translation methods. Even with large-scale parallel data available, domain adaptation is still important to further improve the performance. Domain adaptation methods are able to leverage monolingual corpora for existing machine translation tasks. As compared to parallel corpora, monolingual corpora are much larger and they can provide a model with richer linguistic properties. One representative domain adaptation method is word embedding. This is the reason why NMT can be used as an extrinsic evaluation task.

VII. EXPERIMENTAL RESULTS OF EXTRINSIC EVALUATORS

A. Datasets and Experimental Setup

1) POS Tagging, Chunking and Named Entity Recognition: Following [57], three downstream tasks for sequential labeling are selected in our experiments. The Penn Treebank (PTB) dataset [58], the chunking dataset of the CoNLL’00 shared task [59] and the NER dataset of the CoNLL’03 shared task [60] are used for part-of-speech tagging, chunking and named-entity recognition, respectively. We adopt standard splitting ratios and evaluation criteria for all three datasets. The details of dataset splitting and evaluation criteria are shown in Table VII.

TABLE VII
DATASETS FOR POS TAGGING, CHUNKING AND NER.

Name      Train (#Tokens)  Test (#Tokens)  Criteria
PTB       337,195          129,892         accuracy
CoNLL’00  211,727          47,377          F-score
CoNLL’03  203,621          46,435          F-score

For inference tools, we use the simple window-based feed-forward neural network architecture implemented by [16]. It takes a window of five inputs at a time and passes them through a 300-unit hidden layer, a tanh activation function and a softmax layer before generating the result. We train each model for 10 epochs using Adam optimization with a batch size of 50.

2) Sentiment Analysis: We choose two sentiment analysis datasets for evaluation: 1) the Internet Movie Database (IMDb) [56] and 2) the Stanford Sentiment Treebank dataset (SST) [61]. IMDb contains a collection of movie review documents with polarized classes (positive and negative). For SST, we split data into three classes: positive, neutral and negative. Their document formats are different: an IMDb sample consists of several sentences, while an SST sample contains only a single sentence per label. The detailed information for each dataset is given in Table VIII.

TABLE VIII
SENTIMENT ANALYSIS DATASETS.

Name  Classes  Train  Validation  Test
SST   3        8544   1101        2210
IMDb  2        17500  7500        25000

To cover most sentiment analysis inference tools, we test the task using Bi-LSTM and CNN. We choose a 2-layer Bi-LSTM with 256 hidden dimensions. The adopted CNN has 3 layers with 100 filters per layer of size [3, 4, 5], respectively. In particular, the embedding layer of all models is fixed during training. All models are trained for 5 epochs using Adam optimization with a learning rate of 0.0001.
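
The window-based feed-forward tagger used for POS tagging, chunking and NER can be sketched as a single forward pass. The code below is a minimal illustration, assuming a window of five token embeddings that are concatenated, passed through a 300-unit tanh hidden layer and then a softmax over tags. The padding scheme, the weight initialization and all names are assumptions, and training (10 epochs of Adam with batch size 50 in the text) is omitted.

```python
import numpy as np

WINDOW = 5    # tokens per input window
HIDDEN = 300  # hidden units


def make_windows(token_ids, pad_id, window=WINDOW):
    """One window per token, padded at the sentence boundaries."""
    half = window // 2
    padded = [pad_id] * half + list(token_ids) + [pad_id] * half
    return np.array([padded[i:i + window] for i in range(len(token_ids))])


def softmax(logits):
    """Row-wise softmax with the usual max-subtraction for stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def forward(windows, emb, w1, b1, w2, b2):
    """Concatenate window embeddings -> tanh hidden layer -> softmax over tags."""
    x = emb[windows].reshape(len(windows), -1)
    hidden = np.tanh(x @ w1 + b1)
    return softmax(hidden @ w2 + b2)


# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
vocab_size, dim, n_tags = 10, 8, 4
pad_id = vocab_size  # one extra embedding row is reserved for padding
emb = rng.normal(size=(vocab_size + 1, dim))
w1 = rng.normal(scale=0.1, size=(WINDOW * dim, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.normal(scale=0.1, size=(HIDDEN, n_tags))
b2 = np.zeros(n_tags)

sentence = [3, 1, 4]
tag_probs = forward(make_windows(sentence, pad_id), emb, w1, b1, w2, b2)
```

Because such a network sees only a short local window, the quality of the input embeddings has a direct effect on the result, which is what makes it usable as an extrinsic evaluator.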

3) Neural Machine Translation: As compared with sentiment analysis, neural machine translation (NMT) is a more challenging task since it demands a larger network and more training data. We use the same encoder-decoder architecture as that in [62]. The Europarl v8 [63] dataset is used as the training corpus. The task is English-French translation. For French word embedding, a pre-trained FastText word embedding model⁸ is utilized. As to the hyper-parameter setting, we use a single-layer bidirectional LSTM of 500 dimensions for both the encoder and the decoder. Both embedding layers for the encoder and the decoder are fixed during the training process. The batch size is 30 and the total number of training iterations is 100,000.

⁸ https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md

B. Experimental Results and Discussion

Experimental results of the above-mentioned five extrinsic evaluators are shown in Table IX. Generally speaking, both SGNS and ngram2vec perform well in POS tagging, chunking and NER tasks. Actually, the performance differences are small in all three of these tasks. As to sentiment analysis, there is no obvious winner with the CNN inference tool. The performance gaps become larger using the Bi-LSTM inference tool, and we see that Dict2vec and FastText perform the worst. Based on these results, we observe that there exist two different factors affecting the sentiment analysis results: datasets and inference tools. For different datasets with the same inference tool, the performance can be different because of different linguistic properties of datasets. On the other hand, different inference tools may favor different embedding models against the same dataset since inference tools extract the information from word models in their own manner. For example, Bi-LSTM focuses on long-range dependency while CNN treats each token more or less equally.

Perplexity is used to evaluate the NMT task. It indicates the variability of a prediction model. Lower perplexity corresponds to lower entropy and, thus, better performance. We separate 20,000 sentences from the same corpus to generate testing data and report testing perplexity for the NMT task in Table IX. As shown in the table, ngram2vec, Dict2vec and SGNS are the top three word models for the NMT task, which is consistent with the word similarity evaluation results.

We conclude from Table IX that SGNS-based models, including SGNS, ngram2vec and Dict2vec, tend to work better than other models. However, one drawback of ngram2vec is that it takes more time in processing n-gram data for training. GloVe and FastText are popular in the research community since their pre-trained models are easy to download. We also compared results using pre-trained GloVe and FastText models. Although they are both trained on larger datasets and properly fine-tuned, they do not provide better results in our evaluation tasks.

VIII. CONSISTENCY STUDY VIA CORRELATION ANALYSIS

We conduct a consistency study of extrinsic and intrinsic evaluators using the Pearson correlation (ρ) analysis [64]. Besides the six word models described above, we add two more pre-trained models of GloVe and FastText to make the total number of models eight. Furthermore, we apply the variance normalization technique [65] to the eight models to yield eight more models. Consequently, we have a collection of sixteen word models.

Fig. 1 shows the Pearson correlation of each intrinsic and extrinsic evaluation pair of these sixteen models. For example, the entry of the first row and the first column is the Pearson correlation value of WS-353 (an intrinsic evaluator) and POS (an extrinsic evaluator) of sixteen word models (i.e., 16 evaluation data pairs). Note also that we add a negative sign to the correlation value of NMT perplexity since lower perplexity is better.

A. Consistency of Intrinsic Evaluators

• Word Similarity
All embedding models are tested over 13 evaluation datasets and the results are shown in the top 13 rows. We see from the correlation result that larger datasets tend to give more reliable and consistent evaluation results. Among all datasets, WS-353, WS-353-SIM, WS-353-REL, MTurk-771, SimLex-999 and SimVerb-3500 are recommended to serve as generic evaluation datasets. Although datasets like MC-30 and RG-65 also provide us with reasonable results, their correlation results are not as consistent as others. This may be attributed to the limited amount of testing samples, with only dozens of testing word pairs. The Rare-Word (RW) dataset is a special one that focuses on low-frequency words and has gained popularity recently. Yet, based on the correlation study, the RW dataset is not as effective as expected. Infrequent words may not play an important role in all extrinsic evaluation tasks. This is why infrequent words are often set to the same vector. The Rare-Word dataset can be excluded for general purpose evaluation unless there is a specific application demanding rare word modeling.
• Word Analogy
The word analogy results are shown from the 14th row to the 21st row in the figure. Among the four word analogy datasets (i.e., Google, Google Semantic, Google Syntactic and MSR), Google and Google Semantic are more effective. It does not make much difference in the final correlation study whether the 3CosAdd or the 3CosMul computation is used. Google Syntactic is not effective since the morphology of words does not contain as much information as semantic meanings. Thus, although the FastText model performs well in morphology testing based on the average of sub-words, its correlation analysis is worse than that of other models. In general, word analogy provides the most reliable correlation results and has the highest correlation with the sentiment analysis task.
• Concept Categorization
All three datasets (i.e., AP, BLESS and BM) for concept categorization perform well. By categorizing words into different groups, concept categorization focuses on semantic clusters. It appears that models that are good

TABLE IX
EXTRINSIC EVALUATION RESULTS.

                                   SA (IMDb)        SA (SST)         NMT
Model      POS    Chunking  NER    Bi-LSTM  CNN     Bi-LSTM  CNN     Perplexity
SGNS       94.54  88.21     87.12  85.36    88.78   64.08    66.93   79.14
CBOW       93.79  84.91     83.83  86.93    85.88   65.63    65.06   102.33
GloVe      93.32  84.11     85.3   70.41    87.56   65.16    65.15   84.20
FastText   94.36  87.96     87.10  73.97    83.69   50.01    63.25   82.60
ngram2vec  94.11  88.74     87.33  79.32    89.29   66.27    66.45   77.79
Dict2vec   93.61  86.54     86.82  62.71    88.94   62.75    66.09   78.84
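
The NMT column of Table IX reports test perplexity, where lower is better. As a minimal sketch of what this number measures, assuming the standard definition (the exponential of the average negative log-likelihood per target token, with natural logarithms), perplexity can be computed from per-token log-probabilities as follows. The function name and the toy input are illustrative.

```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative log-likelihood (natural log) per token."""
    if not token_log_probs:
        raise ValueError("at least one token is required")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that assigns probability 1/4 to every reference token behaves
# like a uniform choice among four words.
uniform_four = perplexity([math.log(0.25)] * 10)
```

A perplexity of k can be read as the model being, on average, as uncertain as a uniform choice among k words, which is why lower perplexity corresponds to lower entropy.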

Fig. 1. Pearson’s correlation between intrinsic and extrinsic evaluators, where the x-axis shows extrinsic evaluators while the y-axis indicates intrinsic evaluators. Warm colors indicate positive correlation while cool colors indicate negative correlation.

at dividing words into semantic collections are more effective in downstream NLP tasks.
• Outlier Detection
Two datasets (i.e., WordSim-500 and 8-8-8) are used for outlier detection. In general, outlier detection is not a good evaluation method. Although it tests semantic clusters to some extent, outlier detection is less direct as compared to concept categorization. Also, from the dataset point of view, the size of the 8-8-8 dataset is too small while the WordSim-500 dataset contains too many infrequent words in the clusters. This explains why the accuracy for WordSim-500 is low (around 10-20%). When larger and more reliable datasets become available, we expect the outlier detection task to have better performance in word embedding evaluation.
• QVEC
QVEC is not a good evaluator due to its inherent properties. It attempts to compute the correlation with lexicon-resource based word vectors. Yet, the quality of lexicon-resource based word vectors is too poor to provide a reliable rule. If we can find a more reliable rule, the QVEC evaluator will perform better.

Based on the above discussion, we conclude that word similarity, word analogy and concept categorization are more effective intrinsic evaluators. Different datasets lead to different performance. In general, larger datasets tend to give better and more reliable results. Intrinsic evaluators may perform very differently for different downstream tasks. Thus, when we test a new word embedding model, all three intrinsic evaluators should be used and considered jointly.
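
The correlation study behind Fig. 1 pairs one intrinsic score with one extrinsic score per word model and computes the Pearson correlation, with the sign flipped for NMT perplexity because lower perplexity is better. The sketch below illustrates this with the six baseline models only, using the WS-353 column of Table II and the POS and NMT columns of Table IX. Since the study itself uses sixteen models, the resulting values are not expected to match Fig. 1, and the function names are assumptions.

```python
import numpy as np

MODELS = ["SGNS", "CBOW", "GloVe", "FastText", "ngram2vec", "Dict2vec"]
WS353 = [71.6, 64.3, 59.7, 64.8, 74.2, 69.4]                 # Table II
POS = [94.54, 93.79, 93.32, 94.36, 94.11, 93.61]             # Table IX
NMT_PERPLEXITY = [79.14, 102.33, 84.20, 82.60, 77.79, 78.84]  # Table IX


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc)))


def evaluator_correlation(intrinsic, extrinsic, lower_is_better=False):
    """Correlation of an evaluator pair; negated when lower extrinsic is better."""
    rho = pearson(intrinsic, extrinsic)
    return -rho if lower_is_better else rho


rho_ws_pos = evaluator_correlation(WS353, POS)
rho_ws_nmt = evaluator_correlation(WS353, NMT_PERPLEXITY, lower_is_better=True)
```

With only six data points per pair, such coefficients are sensitive to a single model, which is one motivation for enlarging the pool of word models before drawing conclusions.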

B. Consistency of Extrinsic Evaluators

For POS tagging, chunking and NER, none of the intrinsic evaluators provides high correlation. Performance in these tasks depends on the capability of sequential information extraction. Thus, word meaning plays a subsidiary role in all these tasks. Sentiment analysis is a dimensionality reduction procedure. It focuses more on the combination of word meanings. Thus, it has a stronger correlation with the properties that the word analogy evaluator is testing. Finally, NMT is sentence-to-sentence conversion, and the mapping between word pairs is more helpful in translation tasks. Thus, the word similarity evaluator has a stronger correlation with the NMT task. We should also point out that some unsupervised machine translation tasks focus on word pairs [66], [67]. This shows the significance of word pair correspondence in NMT.

IX. CONCLUSION AND FUTURE WORK

In this work, we provided an in-depth discussion of intrinsic and extrinsic evaluations on many word embedding models, showed extensive experimental results and explained the observed phenomena. Our study offers valuable guidance in selecting suitable evaluation methods for different application tasks. There are many factors affecting word embedding quality. Furthermore, there are still no perfect evaluation methods testing the word subspace for linguistic relationships because it is difficult to understand exactly how the embedding spaces encode linguistic relations. For this reason, we expect more work to be done in developing better metrics for evaluation on the overall quality of a word model. Such metrics must be computationally efficient while having a high correlation with extrinsic evaluation test scores. The crux of this problem lies in decoding how the word subspace encodes linguistic relations and the quality of these relations.

We would like to point out that linguistic relations and properties captured by word embedding models are different from how humans learn languages. For humans, a language encompasses many different avenues, e.g., a sense of reasoning, cultural differences, contextual implications and many others. Thus, a language is filled with subjective complications that interfere with objective goals of models. In contrast, word embedding models perform well in specific applied tasks. They have triumphed over the work of linguists in creating taxonomic structures and other manually generated representations. Yet, different datasets and different models are used for different specific tasks.

We do not see a word embedding model that consistently performs well in all tasks. The design of a more universal word embedding model is challenging. To generate word models that are good at solving specific tasks, task-specific data can be fed into a model for training. Feeding a large amount of generic data can be inefficient and even hurt the performance of a word model since different task-specific data can lead to contending results. It is still not clear what the proper balance between the two design methodologies is.

REFERENCES

[1] L.-C. Yu, J. Wang, K. R. Lai, and X. Zhang, “Refining word embeddings using intensity scores for sentiment analysis,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 3, pp. 671–681, 2018.
[2] H. Schütze, C. D. Manning, and P. Raghavan, Introduction to information retrieval. Cambridge University Press, 2008, vol. 39.
[3] W. Chen, M. Zhang, and Y. Zhang, “Distributed feature representations for dependency parsing,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 451–460, 2015.
[4] H. Ouchi, K. Duh, H. Shindo, and Y. Matsumoto, “Transition-based dependency parsing exploiting supertags,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2059–2068, 2016.
[5] M. Shen, D. Kawahara, and S. Kurohashi, “Dependency parse reranking with rich subtree features,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 7, pp. 1208–1218, 2014.
[6] G. Zhou, Z. Xie, T. He, J. Zhao, and X. T. Hu, “Learning the multilingual translation representations for question retrieval in community question answering via non-negative matrix factorization,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 24, no. 7, pp. 1305–1314, 2016.
[7] Y. Hao, Y. Zhang, K. Liu, S. He, Z. Liu, H. Wu, and J. Zhao, “An end-to-end model for question answering over knowledge base with cross-attention combining global knowledge,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 221–231.
[8] B. Zhang, D. Xiong, J. Su, and H. Duan, “A context-aware recurrent encoder for neural machine translation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 25, no. 12, pp. 2424–2432, 2017.
[9] K. Chen, T. Zhao, M. Yang, L. Liu, A. Tamura, R. Wang, M. Utiyama, and E. Sumita, “A neural approach to source dependence based context model for statistical machine translation,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 26, no. 2, pp. 266–280, 2018.
[10] A. Bakarov, “A survey of word embeddings evaluation methods,” CoRR, vol. abs/1801.09536, 2018. [Online]. Available: http://arxiv.org/abs/1801.09536
[11] Z. Li, M. Zhang, W. Che, T. Liu, and W. Chen, “Joint optimization for chinese pos tagging and dependency parsing,” IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), vol. 22, no. 1, pp. 274–286, 2014.
[12] J. Xu, X. Sun, H. He, X. Ren, and S. Li, “Cross-domain and semi-supervised named entity recognition in chinese social media: A unified model,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2018.
[13] K. Ravi and V. Ravi, “A survey on opinion mining and sentiment analysis: tasks, approaches and applications,” Knowledge-Based Systems, vol. 89, pp. 14–46, 2015.
[14] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[15] T. Schnabel, I. Labutov, D. Mimno, and T. Joachims, “Evaluation methods for unsupervised word embeddings,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2015, pp. 298–307.
[16] B. Chiu, A. Korhonen, and S. Pyysalo, “Intrinsic evaluation of word vectors fails to predict extrinsic performance,” in Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, 2016, pp. 1–6.
[17] Y. Qiu, H. Li, S. Li, Y. Jiang, R. Hu, and L. Yang, “Revisiting correlations between intrinsic and extrinsic evaluations of word embeddings,” in Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data. Springer, 2018, pp. 209–221.
[18] Y. Bengio, R. Ducharme, P. Vincent, and C. Janvin, “A neural probabilistic language model,” Journal of Machine Learning Research, vol. 3, pp. 1137–1155, 2003. [Online]. Available: http://www.jmlr.org/papers/v3/bengio03a.html
[19] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” CoRR, vol. abs/1301.3781, 2013. [Online]. Available: http://arxiv.org/abs/1301.3781
[20] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
[21] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for word representation,” in Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), 2014, pp. 1532–1543.

[22] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” CoRR, vol. abs/1607.04606, 2016. [Online]. Available: http://arxiv.org/abs/1607.04606
[23] Z. Zhao, T. Liu, S. Li, B. Li, and X. Du, “Ngram2vec: Learning improved word representations from ngram co-occurrence statistics,” in Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 244–253.
[24] J. Tissier, C. Gravier, and A. Habrard, “Dict2vec: Learning word embeddings using lexical dictionaries,” in Conference on Empirical Methods in Natural Language Processing (EMNLP 2017), 2017, pp. 254–263.
[25] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” arXiv preprint arXiv:1802.05365, 2018.
[26] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” arXiv preprint arXiv:1810.04805, 2018.
[27] A. Radford, K. Narasimhan, T. Salimans, and I. Sutskever, “Improving language understanding by generative pre-training,” Technical report, OpenAI, 2018.
[28] Y. Yaghoobzadeh and H. Schütze, “Intrinsic subspace evaluation of word embedding representations,” arXiv preprint arXiv:1606.07902, 2016.
[29] J. Hellrich and U. Hahn, “Don’t get fooled by word embeddings: better watch their neighborhood,” in Digital Humanities 2017 Conference Abstracts of the 2017 Conference of the Alliance of Digital Humanities Organizations (ADHO). Montréal, Quebec, Canada, 2017, pp. 250–252.
[30] A. Gladkova and A. Drozd, “Intrinsic evaluations of word embeddings: What can we do better?” in Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, 2016, pp. 36–42.
[31] W. Shalaby and W. Zadrozny, “Measuring semantic relatedness using mined semantic analysis,” CoRR, abs/1512.03465, 2015.
[32] M. Faruqui, Y. Tsvetkov, P. Rastogi, and C. Dyer, “Problems with evaluation of word embeddings using word similarity tasks,” arXiv preprint arXiv:1605.02276, 2016.
[33] T. Mikolov, W.-t. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” in Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2013, pp. 746–751.
[34] O. Levy and Y. Goldberg, “Linguistic regularities in sparse and explicit word representations,” in Proceedings of the eighteenth conference on computational natural language learning, 2014, pp. 171–180.
[44] T. Luong, R. Socher, and C. Manning, “Better word representations with recursive neural networks for morphology,” in Proceedings of the Seventeenth Conference on Computational Natural Language Learning, 2013, pp. 104–113.
[45] E. Bruni, N.-K. Tran, and M. Baroni, “Multimodal distributional semantics,” Journal of Artificial Intelligence Research, vol. 49, pp. 1–47, 2014.
[46] K. Radinsky, E. Agichtein, E. Gabrilovich, and S. Markovitch, “A word at a time: computing word relatedness using temporal semantic analysis,” in Proceedings of the 20th international conference on World wide web. ACM, 2011, pp. 337–346.
[47] G. Halawi, G. Dror, E. Gabrilovich, and Y. Koren, “Large-scale learning of word relatedness with constraints,” in Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2012, pp. 1406–1414.
[48] P. D. Turney, “Mining the web for synonyms: Pmi-ir versus lsa on toefl,” in European Conference on Machine Learning. Springer, 2001, pp. 491–502.
[49] F. Hill, R. Reichart, and A. Korhonen, “Simlex-999: Evaluating semantic models with (genuine) similarity estimation,” Computational Linguistics, vol. 41, no. 4, pp. 665–695, 2015.
[50] S. Baker, R. Reichart, and A. Korhonen, “An unsupervised model for instance level subcategorization acquisition,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 278–289.
[51] D. Gerz, I. Vulić, F. Hill, R. Reichart, and A. Korhonen, “Simverb-3500: A large-scale evaluation set of verb similarity,” arXiv preprint arXiv:1608.00869, 2016.
[52] A. Almuhareb, “Attributes in lexical acquisition,” Ph.D. dissertation, University of Essex, 2006.
[53] M. Baroni and A. Lenci, “How we blessed distributional semantic evaluation,” in Proceedings of the GEMS 2011 Workshop on GEometrical Models of Natural Language Semantics. Association for Computational Linguistics, 2011, pp. 1–10.
[54] M. Baroni, B. Murphy, E. Barbu, and M. Poesio, “Strudel: A corpus-based semantic model based on properties and types,” Cognitive science, vol. 34, no. 2, pp. 222–254, 2010.
[55] P. Blair, Y. Merhav, and J. Barry, “Automated generation of multilingual clusters for the evaluation of distributed representations,” arXiv preprint arXiv:1611.01547, 2016.
[56] A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts,
[35] A. Rogers, A. Drozd, and B. Li, “The (too many) problems of analogical “Learning word vectors for sentiment analysis,” in Proceedings of the
reasoning with word vectors,” in Proceedings of the 6th Joint Conference 49th annual meeting of the association for computational linguistics:
on Lexical and Computational Semantics (* SEM 2017), 2017, pp. 135– Human language technologies-volume 1. Association for Computa-
148. tional Linguistics, 2011, pp. 142–150.
[36] M. Baroni, G. Dinu, and G. Kruszewski, “Don’t count, predict! a sys- [57] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and
tematic comparison of context-counting vs. context-predicting semantic P. Kuksa, “Natural language processing (almost) from scratch,” Journal
vectors,” in Proceedings of the 52nd Annual Meeting of the Association of Machine Learning Research, vol. 12, no. Aug, pp. 2493–2537, 2011.
for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2014, [58] M. P. Marcus, M. A. Marcinkiewicz, and B. Santorini, “Building a
pp. 238–247. large annotated corpus of english: The penn treebank,” Computational
[37] L. K. Senel, I. Utlu, V. Yucesoy, A. Koc, and T. Cukur, “Semantic struc- linguistics, vol. 19, no. 2, pp. 313–330, 1993.
ture and interpretability of word embeddings,” IEEE/ACM Transactions [59] E. F. Tjong Kim Sang and S. Buchholz, “Introduction to the conll-
on Audio, Speech, and Language Processing, 2018. 2000 shared task: Chunking,” in Proceedings of the 2nd workshop on
[38] J. Camacho-Collados and R. Navigli, “Find the word that does not Learning language in logic and the 4th conference on Computational
belong: A framework for an intrinsic evaluation of word vector represen- natural language learning-Volume 7. Association for Computational
tations,” in Proceedings of the 1st Workshop on Evaluating Vector-Space Linguistics, 2000, pp. 127–132.
Representations for NLP, 2016, pp. 43–50. [60] E. F. Tjong Kim Sang and F. De Meulder, “Introduction to the conll-
[39] Y. Tsvetkov, M. Faruqui, W. Ling, G. Lample, and C. Dyer, “Evaluation 2003 shared task: Language-independent named entity recognition,” in
of word vector representations by subspace alignment,” in Proceedings Proceedings of the seventh conference on Natural language learning at
of the 2015 Conference on Empirical Methods in Natural Language HLT-NAACL 2003-Volume 4. Association for Computational Linguis-
Processing, 2015, pp. 2049–2054. tics, 2003, pp. 142–147.
[40] L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolf- [61] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and
man, and E. Ruppin, “Placing search in context: The concept revisited,” C. Potts, “Recursive deep models for semantic compositionality over a
in Proceedings of the 10th international conference on World Wide Web. sentiment treebank,” in Proceedings of the 2013 conference on empirical
ACM, 2001, pp. 406–414. methods in natural language processing, 2013, pp. 1631–1642.
[41] E. Agirre, E. Alfonseca, K. Hall, J. Kravalova, M. Paşca, and A. Soroa, [62] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, “Opennmt:
“A study on similarity and relatedness using distributional and wordnet- Open-source toolkit for neural machine translation,” arXiv preprint
based approaches,” in Proceedings of Human Language Technologies: arXiv:1701.02810, 2017.
The 2009 Annual Conference of the North American Chapter of the As- [63] P. Koehn, “Europarl: A parallel corpus for statistical machine transla-
sociation for Computational Linguistics. Association for Computational tion,” in MT summit, vol. 5, 2005, pp. 79–86.
Linguistics, 2009, pp. 19–27. [64] J. Benesty, J. Chen, Y. Huang, and I. Cohen, “Pearson correlation
[42] G. A. Miller and W. G. Charles, “Contextual correlates of semantic coefficient,” in Noise reduction in speech processing. Springer, 2009,
similarity,” Language and cognitive processes, vol. 6, no. 1, pp. 1–28, pp. 1–4.
1991. [65] B. Wang, F. Chen, A. Wang, and C.-C. J. Kuo, “Post-processing of word
[43] H. Rubenstein and J. B. Goodenough, “Contextual correlates of syn- representations via variance normalization and dynamic embedding,”
onymy,” Communications of the ACM, vol. 8, no. 10, pp. 627–633, 1965. arXiv preprint arXiv:1808.06305, 2018.
13

[66] M. Artetxe, G. Labaka, and E. Agirre, “Learning bilingual word embed-

dings with (almost) no bilingual data,” in Proceedings of the 55th Annual
Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers), vol. 1, 2017, pp. 451–462.
[67] M. Artetxe, G. Labaka, E. Agirre, and K. Cho, “Unsupervised neural
machine translation,” arXiv preprint arXiv:1710.11041, 2017.
