Humpty Dumpty: Controlling Word Meanings via Corpus Poisoning

Roei Schuster    Tal Schuster    Yoav Meri    Vitaly Shmatikov
Tel Aviv University†    CSAIL, MIT    †    Cornell Tech
[email protected]  [email protected]  [email protected]  [email protected]
The only task considered in [11] is generic node classification, whereas we work in a complete transfer learning scenario.

Adversarial examples. There is a rapidly growing literature on test-time attacks on neural-network image classifiers [3, 37, 38, 49, 70]; some employ only black-box model queries [15, 33] rather than gradient-based optimization. We, too, use a non-gradient optimizer to compute cooccurrences that achieve the desired effect on the embedding, but in a setting where queries are cheap and computation is expensive.

Neural networks for text processing are just as vulnerable to adversarial examples, but example generation is more challenging due to the non-differentiable mapping of text elements to the embedding space. Dozens of attacks and defenses have been proposed [4, 10, 22, 29, 34, 46, 62, 63, 73, 74].

By contrast, we study training-time attacks that change word embeddings so that multiple downstream models behave incorrectly on unmodified test inputs.

III. BACKGROUND AND NOTATION

Table I summarizes our notation. Let D be a dictionary of words and C a corpus, i.e., a collection of word sequences. A word embedding algorithm aims to learn a low-dimensional vector e_u for each u ∈ D. Semantic similarity between words is encoded as the cosine similarity of their corresponding vectors, cos(y, z) := y·z / (‖y‖₂ ‖z‖₂), where y·z is the vector dot product. The cosine similarity of L2-normalized vectors is (1) equivalent to their dot product, and (2) linear in negative squared L2 (Euclidean) distance.

Embedding algorithms start from a high-dimensional representation of the corpus, its cooccurrence matrix {C_{u,v}}_{u,v∈D}, where C_{u,v} is a weighted sum of cooccurrence events, i.e., appearances of u, v in proximity to each other. A function γ(d) gives each event a weight that is inversely proportional to the distance d between the words.

Embedding algorithms first learn two intermediate representations for each word u ∈ D, the word vector w_u and the context vector c_u, then compute e_u from them.

Table I: Notation.

III  C                    corpus
     D                    dictionary
     u, v, r              dictionary words
     {e_u}_{u∈D}          embedding vectors
     {w_u}_{u∈D}          "word vectors"
     {c_u}_{u∈D}          "context vectors"
     b_u, b_v             GloVe bias terms, see Equation III.1
     cos(y, z)            cosine similarity
     C ∈ R^{|D|×|D|}      C's cooccurrence matrix
     C_u ∈ R^{|D|}        u's row in C
     Γ                    size of window for cooccurrence counting
     γ: N → R             cooccurrence event weight function
     SPPMI                matrix defined by Equation III.2
     BIAS                 matrix defined by Equation III.3
IV   SIM1(u, v)           w_u·c_v + c_u·w_v, see Equation III.4
     SIM2(u, v)           w_u·w_v + c_u·c_v, see Equation III.4
     {B_u}_{u∈D}          word bias terms, to downweight common words
     f_{u,v}(c, ε)        max{log(c) − B_u − B_v, ε}
     M ∈ R^{|D|×|D|}      matrix with entries of the form f_{u,v}(c, 0) (e.g., SPPMI, BIAS)
     M_u ∈ R^{|D|}        u's row in M
     ŜIM1(u, v)           explicit expression for c_u·w_v, set as M_{u,v}
     N_{u,v}              normalization term for first-order proximity
     ŝim1(u, v)           explicit expression for cos(c_u, w_v), set as f_{u,v}(C_{u,v}, 0)/N_{u,v}
     ŝim2(u, v)           explicit expression for cos(w_u, w_v), set as cos(M_u, M_v)
     ŝim1+2(u, v)         ŝim1(u, v)/2 + ŝim2(u, v)/2
     LCO ∈ R^{|D|×|D|}    entries defined by max{log(C_{u,v}), 0}
V    Δ                    word sequences added by the attacker
     C + Δ                corpus after the attacker's additions
     |Δ|                  size of the attacker's additions, see Section V
     s, t ∈ D             source, target words
     NEG, POS             "negative" and "positive" target words
     sim^Δ(u, v)          embedding cosine similarity after the attack
     J(s, NEG, POS; Δ)    embedding objective
     max_Δ                proximity attacker's maximum allowed |Δ|
     r                    rank attacker's target rank
     t_r                  rank attacker's minimum proximity threshold
     ŝim(u, v)            distributional expression for cosine similarity
     ŝim_Δ̂(u, v)          distributional expression for sim^Δ(u, v)
     Ĵ(s, NEG, POS; Δ̂)    distributional objective
     t̂_r                  rank attacker's estimated threshold for distributional proximity
     α                    "safety margin" for t̂_r estimation error
     C_{[s]←C_s+Δ̂}        cooccurrence matrix after adding Δ̂
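The two properties of cosine similarity noted above are easy to check numerically. A minimal illustration (ours, not from the paper):

```python
import numpy as np

def cos_sim(y, z):
    """cos(y, z) = y . z / (||y||_2 ||z||_2)."""
    return float(np.dot(y, z) / (np.linalg.norm(y) * np.linalg.norm(z)))

y = np.array([3.0, 4.0])
z = np.array([1.0, 2.0])
yn = y / np.linalg.norm(y)   # L2-normalize
zn = z / np.linalg.norm(z)

# (1) for L2-normalized vectors, cosine similarity equals the dot product;
# (2) it is linear in negative squared Euclidean distance:
#     ||yn - zn||^2 = 2 - 2 * cos(y, z)
same_as_dot = np.isclose(cos_sim(y, z), float(np.dot(yn, zn)))
linear_in_dist = np.isclose(float(np.sum((yn - zn) ** 2)), 2 - 2 * cos_sim(y, z))
```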
Contextual embeddings. Contextual embeddings [20, 57] support dynamic word representations that change depending on the context of the sentence they appear in, yet, in expectation, form an embedding space with non-contextual relations [65]. In this paper, we focus on the popular non-contextual embeddings because (a) they are faster to train and easier to store, and (b) many task solvers use them by construction (see Sections IX through XI).

Distributional representations. A distributional or explicit representation of a word is a high-dimensional vector whose entries correspond to cooccurrence counts with other words. Dot products of the learned word vectors and context vectors (w_u·c_v) seem to correspond to entries of a high-dimensional matrix that is closely related to, and directly computable from, the cooccurrence matrix. Consequently, both SGNS and GloVe can be cast as matrix factorization methods.

Levy and Goldberg [41] show that, assuming training with unlimited dimensions, SGNS's objective has an optimum at ∀u, v ∈ D: w_u·c_v = SPPMI_{u,v}, defined as:

  SPPMI_{u,v} := max{ log(C_{u,v}) − log(Σ_{r∈D} C_{u,r}) − log(Σ_{r∈D} C_{v,r}) + log(Z/k), 0 }   (III.2)

where k is the negative-sampling constant and Z := Σ_{u,v∈D} C_{u,v}. This variant of pointwise mutual information (PMI) downweights a word's cooccurrences with common words because they are less "significant" than cooccurrences with rare words. The rows of the SPPMI matrix define a distributional representation.

GloVe's objective similarly has an optimum at ∀u, v ∈ D: w_u·c_v = BIAS_{u,v}, defined as:

  BIAS_{u,v} := max{ log(C_{u,v}) − b_u − b̃_v, 0 }   (III.3)

The max is a simplification: in rare and negligible cases, the optimum of w_u·c_v is slightly below 0. Similarly to SPPMI, BIAS downweights cooccurrences with common words (via the learned bias values b_u, b̃_v).

First- and second-order proximity. We expect words that frequently cooccur with each other to have high semantic proximity. We call this first-order proximity. It indicates that the words are related but not necessarily that their meanings are similar (e.g., "first class" or "polar bear").

The distributional hypothesis [27] says that distributional vectors capture semantic similarity by second-order proximity: the more contexts two words have in common, the higher their similarity, regardless of their cooccurrences with each other. For example, "terrible" and "horrible" hardly ever cooccur, yet their second-order proximity is very high. Levy and Goldberg [40] showed that linear relationships of distributional representations are similar to those of word embeddings.

Levy and Goldberg [42] observe that summing the context and word vectors, e_u ← w_u + c_u, as done by default in GloVe, leads to the following:

  e_u·e_v = SIM1(u, v) + SIM2(u, v)   (III.4)

where SIM1(u, v) := w_u·c_v + c_u·w_v and SIM2(u, v) := w_u·w_v + c_u·c_v. They conjecture that SIM1 and SIM2 correspond to, respectively, first- and second-order proximities.

Indeed, SIM1 seems to be a measure of cooccurrence counts, which measure first-order proximity: Equation III.3 leads to SIM1(u, v) ≈ 2·BIAS_{u,v}. BIAS is symmetrical up to a small error, stemming from the difference between the GloVe bias terms b_u and b̃_u, but they are typically very close—see Section IV-B. This also assumes that the embedding optimum perfectly recovers the BIAS matrix.

There is no distributional expression for SIM2(u, v) that does not rely on problematic assumptions (see Section IV-A), but there is ample evidence for the conjecture that SIM2 somehow captures second-order proximity (see Section IV-B). Since word and context vectors and their products typically have similar ranges, Equation III.4 suggests that embeddings weight first- and second-order proximities equally.

IV. FROM EMBEDDINGS TO EXPRESSIONS OVER CORPUS

The key problem that must be solved to control word meanings via corpus modifications is finding a distributional expression, i.e., an explicit expression over corpus features such as cooccurrences, for the embedding distances, which are the computational representation of "meaning."

A. Previous work is not directly usable

Several prior approaches [7, 8, 26] derive distributional expressions for distances between word vectors, all of the form e_u·e_v ≈ A·log(C_{u,v}) − B_u − B_v. The downweighting role of B_u, B_v seems similar to SPPMI and BIAS, thus these expressions, too, can be viewed as variants of PMI.

These approaches all make simplifying assumptions that do not hold in reality. Arora et al. [7, 8] and Hashimoto et al. [31] assume a generative language model where words are emitted by a random walk. Both models are parameterized by low-dimensional word vectors {e*_u}_{u∈D} and assume that context and word vectors are identical. Then they show how {e*_u}_{u∈D} optimize the objectives of GloVe and SGNS.

By their very construction, these models uphold a very strong relationship between cooccurrences and low-dimensional representation products. In Arora et al., these products are equal to PMIs; in Hashimoto et al., the vectors' L2 norm differences, which are closely related to their product, approximate their cooccurrence count. If such "convenient" low-dimensional vectors exist, it should not be surprising that they optimize GloVe and SGNS.

The approximation in Ethayarajh et al. [26] only holds within a single set of word pairs that are "contextually coplanar," which loosely means they appear in related contexts. It is unclear if coplanarity holds in reality over large sets of word pairs, let alone the entire dictionary.

Some of the above papers use correlation tests to justify their conclusion that dot products follow SPPMI-like expressions. Crucially, correlation does not mean that the embedding space is derived from (log)-cooccurrences in a distance-preserving fashion, thus correlation is not sufficient to control the embeddings.
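For concreteness, the SPPMI matrix of Equation III.2 can be computed from a toy cooccurrence matrix as follows (a sketch with made-up counts, not the paper's code):

```python
import numpy as np

def sppmi(C, k=5):
    """SPPMI matrix of Equation III.2 for a symmetric dense cooccurrence
    matrix C: max{log C_uv - log sum_r C_ur - log sum_r C_vr + log(Z/k), 0},
    where Z is the total cooccurrence mass and k the negative-sampling
    constant."""
    Z = C.sum()
    row = C.sum(axis=1)                 # sum_r C[u, r] (= column sums, C symmetric)
    with np.errstate(divide="ignore"):  # log(0) -> -inf is clipped by the max below
        pmi = np.log(C) - np.log(row)[:, None] - np.log(row)[None, :] + np.log(Z / k)
    return np.maximum(pmi, 0.0)

# toy cooccurrence counts (illustrative only)
C = np.array([[0., 4., 1.],
              [4., 0., 2.],
              [1., 2., 0.]])
M = sppmi(C, k=1)
# M[0, 1] = max{log 4 - log 5 - log 6 + log 14, 0} = log(56/30) > 0,
# while the rarer pair (0, 2) is downweighted to 0.
```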
We want not just to characterize how embedding distances typically relate to corpus elements, but to achieve a specific change in the distances. To this end, we need an explicit expression over corpus elements whose value is encoded in the embedding distances by the embedding algorithm (Figure I.2).

Furthermore, these approaches barter generality for analytic simplicity and derive distributional expressions that do not account for second-order proximity at all. As a consequence, the values of these expressions can be very different from the embedding distances, since words that only rarely appear in the same window (and thus have low PMI) may be close in the embedding space. For example, "horrible" and "terrible" are so semantically close they can be used as synonyms, yet they are also similar phonetically, and thus their adjacent use in natural speech and text appears redundant. In a dim-100 GloVe model trained on Wikipedia, "terrible" is among the top 3 words closest to "horrible" (with cosine similarity 0.8). However, when words are ordered by their PMI with "horrible," "terrible" is only in the 3675th place.

B. Our approach

We aim to find a distributional expression for the semantic proximity encoded in the embedding distances. The first challenge is to find distributional expressions for both first- and second-order proximities encoded by the embedding algorithms. The second is to combine them into a single expression corresponding to embedding proximity.

First-order proximity. First-order proximity corresponds to cooccurrence counts and is relatively straightforward to express in terms of corpus elements. Let M be the matrix that the embeddings factorize, e.g., SPPMI for SGNS (Equation III.2) or BIAS for GloVe (Equation III.3). The entries of this matrix are natural explicit expressions for first-order proximity, since they approximate SIM1(u, v) from Equation III.4 (we omit multiplication by two as it is immaterial):

  ŜIM1(u, v) := M_{u,v}   (IV.1)

M_{u,v} is typically of the form max{log(C_{u,v}) − B_u − B_v, 0}, where B_u, B_v are the "downweighting" scalar values (possibly depending on u's and v's rows in C). For SPPMI, we set B_u = log(Σ_{r∈D} C_{u,r}) − log(Z/k)/2; for BIAS, B_u = b_u.¹

Second-order proximity. Let the distributional representation M_u of u be its row in M. We hypothesize that distances in this representation correspond to second-order proximity encoded in the embedding-space distances.

First, the objectives of the embedding algorithms seem to directly encode this connection. Consider a word u's projection onto GloVe's objective III.1:

  J_GloVe[u] = Σ_{v∈D} g(C_{u,v}) (w_u^T c_v + b_u + b̃_v − log C_{u,v})²

This expression is determined entirely by u's row in M_BIAS. If two words have the same distributional vector, their expressions in the optimization objective will be completely symmetrical, resulting in very close embeddings—even if their cooccurrence count is 0. Second, the view of the embeddings as matrix factorization implies an approximate linear transformation between the distributional and embedding spaces. Let C := (c_{u_1} ... c_{u_|D|})^T be the matrix whose rows are the context vectors of words u_i ∈ D. Assuming M is perfectly recovered by the products of word and context vectors, C·w_u = M_u.

Dot products have very different scales in the distributional and embedding spaces. Therefore, we use cosine similarities, which are always between −1 and 1, and set

  ŝim2(u, v) := cos(M_u, M_v)   (IV.2)

As long as M's entries are nonnegative, the value of this expression is always between 0 and 1.

Combining first- and second-order proximity. Our expressions for first- and second-order proximities have different scales: ŜIM1(u, v) corresponds to an unbounded dot product, while ŝim2(u, v) is at most 1. To combine them, we normalize ŜIM1(u, v). Let f_{u,v}(c, ε) := max{log(c) − B_u − B_v, ε}; then ŜIM1(u, v) = M_{u,v} = f_{u,v}(C_{u,v}, 0). We set

  N_{u,v} := sqrt( f_{u,v}(Σ_{r∈D} C_{u,r}, e^{−60}) · f_{u,v}(Σ_{r∈D} C_{v,r}, e^{−60}) )

as the normalization term. This is similar to the normalization term of cosine similarity and ensures that the value is between 0 and 1. The max operation is taken with a small e^{−60}, rather than 0, to avoid division by 0 in edge cases. We set ŝim1(u, v) := f_{u,v}(C_{u,v}, 0)/N_{u,v}. Our combined distributional expression for the embedding proximity is

  ŝim1+2(u, v) := ŝim1(u, v)/2 + ŝim2(u, v)/2   (IV.3)

Since ŝim1(u, v) and ŝim2(u, v) are always between 0 and 1, the value of this expression, too, is between 0 and 1.

Correlation tests. We trained a GloVe-paper and an SGNS model on full Wikipedia, as described in Section VIII. We randomly sampled (without replacement) 500 "source" words and 500 "target" words from the 50,000 most common words in the dictionary and computed the distributional expressions ŝim1(u, v), ŝim2(u, v), and ŝim1+2(u, v) for all 250,000 source-target word pairs, using M ∈ {SPPMI, BIAS, LCO}, where LCO is defined by LCO_{u,v} := max{log(C_{u,v}), 0}. We then computed the correlations between distributional proximities and (1) embedding proximities, and (2) word-context proximities cos(w_u, c_v) and word-word proximities cos(w_u, w_v), using GloVe's word and context vectors. These correspond, respectively, to first- and second-order proximities encoded in the embeddings.

Tables II and III show the results.

¹ We consider BIAS_{u,v} as a distributional expression even though it depends on b_u, b̃_v learned during GloVe's optimization, because these terms can be closely approximated using pre-trained GloVe embeddings—see Appendix B. For simplicity, we also assume that b_u = b̃_u (thus BIAS is of the required form); in practice, the difference is very small.
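Before examining the tables, the expressions being correlated (Equations IV.1-IV.3) can be made concrete with a toy computation. The numbers below are ours and purely illustrative; B_u is set as in the SPPMI case, B_u = log(Σ_r C_{u,r}) − log(Z/k)/2:

```python
import numpy as np

EPS = np.exp(-60)

def f(c, Bu, Bv, eps):
    """f_{u,v}(c, eps) = max{log(c) - B_u - B_v, eps}."""
    return max(np.log(c) - Bu - Bv, eps) if c > 0 else eps

def sim1(C, B, u, v):
    """Normalized first-order expression: f(C_uv, 0) / N_{u,v}."""
    N = np.sqrt(f(C[u].sum(), B[u], B[v], EPS) * f(C[v].sum(), B[u], B[v], EPS))
    return f(C[u, v], B[u], B[v], 0.0) / N

def sim2(M, u, v):
    """Second-order expression (Equation IV.2): cos(M_u, M_v)."""
    return float(M[u] @ M[v] / (np.linalg.norm(M[u]) * np.linalg.norm(M[v])))

def sim12(C, M, B, u, v):
    """Combined expression (Equation IV.3)."""
    return sim1(C, B, u, v) / 2 + sim2(M, u, v) / 2

# toy statistics: words 0 and 1 never cooccur but share contexts 2 and 3
# (the "terrible"/"horrible" situation)
C = np.array([[0.,   0.,  20., 300.],
              [0.,   0.,  30., 250.],
              [20.,  30.,  0.,  10.],
              [300., 250., 10.,  0.]])
k = 1
B = np.log(C.sum(axis=1)) - np.log(C.sum() / k) / 2
M = np.array([[f(C[u, v], B[u], B[v], 0.0) for v in range(4)] for u in range(4)])

first = sim1(C, B, 0, 1)      # 0: the pair never cooccurs
second = sim2(M, 0, 1)        # high: their distributional rows align
combined = sim12(C, M, B, 0, 1)
```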
Observe that (1) in GloVe, ŝim1+2(u, v) consistently correlates better with the embedding proximities than either the first- or second-order expressions alone. (2) In SGNS, by far the strongest correlation is with ŝim2 computed using SPPMI. (3) The highest correlations are attained using the matrices factorized by the respective embeddings. (4) The values on Table II's diagonal are markedly high, indicating that SIM1 correlates highly with ŝim1, SIM2 with ŝim2, and their combination with ŝim1+2. (5) First-order expressions correlate worse than second-order and combined ones, indicating the importance of second-order proximity for semantic proximity. This is especially true for SGNS, which does not sum the word and context vectors.

           M        ŝim1(u,v)   ŝim2(u,v)   ŝim1+2(u,v)
  GloVe    BIAS     0.47        0.53        0.56
           SPPMI    0.31        0.35        0.36
           LCO      0.36        0.43        0.50
  SGNS     BIAS     0.31        0.29        0.32
           SPPMI    0.21        0.47        0.36
           LCO      0.21        0.31        0.34

Table II: Correlation of distributional proximity expressions, computed using different distributional matrices, with the embedding proximities cos(e_u, e_v).

  expression       ŝim1(u,v)   ŝim2(u,v)   ŝim1+2(u,v)
  cos(w_u, c_v)    0.50        0.49        0.54
  cos(w_u, w_v)    0.40        0.51        0.52
  cos(e_u, e_v)    0.47        0.53        0.56

Table III: Correlation of distributional proximity expressions with cosine similarities in GloVe's low-dimensional representations {w_u} (word vectors), {c_u} (context vectors), and {e_u} (embedding vectors), measured over 250,000 word pairs.

V. ATTACK METHODOLOGY

Attacker capabilities. Let s ∈ D be a "source word" whose meaning the attacker wants to change. The attacker is targeting a victim who will train his embedding on a specific public corpus, which may or may not be known to the attacker in its entirety. The victim's choice of the corpus is mandated by the nature of the task and limited to a few big public corpora believed to be sufficiently rich to represent natural language (English, in our case). For example, Wikipedia is a good choice for word-to-word translation models because it preserves cross-language cooccurrence statistics [18], whereas Twitter is best for named-entity recognition in tweets [17]. The embedding algorithm and its hyperparameters are typically public and thus known to the attacker, but we also show in Section VIII that the attack remains effective if the attacker uses a small subsample of the target corpus as a surrogate and very different embedding hyperparameters.

The attacker need not know the details of downstream models. The attacks in Sections IX-XI make only general assumptions about their targets, and we show that a single attack on the embedding can fool multiple downstream models.

We assume that the attacker can add a collection Δ of short word sequences, up to 11 words each, to the corpus. In Section VIII, we explain how we simulate sequence insertion. In Appendix F, we also consider an attacker who can edit existing sequences, which may be viable for publicly editable corpora such as Wikipedia.

We define the size of the attacker's modifications |Δ| as the bigger of (a) the maximum number of appearances of a single word, i.e., the L∞ norm of the change in the corpus's word-count vector, and (b) the number of added sequences. Thus, the L∞ norm of the word-count change is capped by |Δ|, while the L1 norm is capped by 11·|Δ|.

Overview of the attack. The attacker wants to use his corpus modifications Δ to achieve a certain objective for s in the embedding space while minimizing |Δ|.

0. Find distributional expression for embedding distances. The preliminary step, done once and used for multiple attacks, is to (0) find distributional expressions for the embedding proximities. Then, for a specific attack, (1) define an embedding objective, expressed in terms of embedding proximities, and (2) derive the corresponding distributional objective, i.e., an expression that links the embedding objective with corpus features, with the property that if the distributional objective holds, then the embedding objective is likely to hold. Because a distributional objective is defined over C, the attacker can express it as an optimization problem over cooccurrence counts and (3) solve it to obtain the cooccurrence change vector. The attacker can then (4) transform the cooccurrence change vector to a change set of corpus edits and apply them. Finally, (5) the embedding is trained on the modified corpus, resulting in the attacker's changes propagating to the embedding. Figure V.1 depicts this process.

As explained in Section IV, the goal is to find a distributional expression ŝim(u, v) that, if upheld in the corpus, will cause a corresponding change in the embedding distances. First, the attacker needs to know the corpus cooccurrence counts C and the appropriate first-order proximity matrix M (see Section IV-B). Both depend on the corpus and the embedding algorithm and its hyperparameters, but can also be computed from available proxies (see Section VIII).

Using C and M, set ŝim as ŝim1+2, ŝim1, or ŝim2 (see Section IV-B). We found that the best choice depends on the embedding (see Section VIII). For example, for GloVe, which puts similar weight on first- and second-order proximity (see Equation III.4), ŝim1+2 is the most effective; for SGNS, which only uses word vectors, ŝim2 is slightly more effective.

1. Derive an embedding objective. We consider two types of adversarial objectives. An attacker with a proximity objective wants to push s away from some words (we call them "negative") and closer to other words ("positive") in the embedding space. An attacker with a rank objective wants to make s the r-th closest embedding neighbor of some word t.

To formally define these objectives, first, given two sets of words NEG, POS ∈ P(D), define

  J(s, NEG, POS; Δ) := (1 / (|POS| + |NEG|)) ( Σ_{t∈POS} sim^Δ(s, t) − Σ_{t∈NEG} sim^Δ(s, t) )
[Figure V.1: Overview of the attack. An NLP task (e.g., NER, document search) determines the embedding objective (a proximity objective, max J subject to |Δ| ≤ max_Δ, or a rank objective, min |Δ| subject to the proximity constraint). Steps: (0) find the distributional matrix M and distributional expression; (1) derive the embedding objective; (2) compute the distributional objective; (3) compute the cooccurrence change vector; (4) place it into the corpus, yielding the modified corpus.]
where sim^Δ(u, v) = cos(e_u, e_v) is the cosine similarity function that measures pairwise word proximity (see Section III) when the embeddings are computed on the modified corpus C + Δ. J(s, NEG, POS; Δ) penalizes s's proximity to the words in NEG and rewards proximity to the words in POS.

Given POS, NEG, and a threshold max_Δ, define the proximity objective as

  argmax_{Δ, |Δ| ≤ max_Δ} J(s, NEG, POS; Δ)

This objective makes a word semantically farther from or closer to another word or cluster of words.

Given some rank r, define the rank objective as finding a minimal Δ such that s is one of t's r closest neighbors in the embedding. Let t_r be the proximity of t to its r-th closest embedding neighbor. Then the rank constraint is equivalent to sim^Δ(s, t) ≥ t_r, and the objective can be expressed as

  argmin_{Δ, sim^Δ(s,t) ≥ t_r} |Δ|

or, equivalently,

  argmin_{Δ, J(s,∅,{t};Δ) ≥ t_r} |Δ|

This objective is useful, for example, for injecting results into a search query (see Section IX).

2. From embedding objective to distributional objective. We now transform the optimization problem J(s, NEG, POS; Δ), expressed over changes in the corpus and embedding proximities, to a distributional objective Ĵ(s, NEG, POS; Δ̂), expressed over changes in the cooccurrence counts and distributional proximities. The change vector Δ̂ denotes the change in s's row of the cooccurrence matrix. We assume that embedding proximities are monotonously increasing (respectively, decreasing) with distributional proximities. Figure A.1-c in Appendix D shows this relationship.

(c) Embedding threshold ↔ distributional threshold: For the rank objective, we want to increase the embedding proximity past a threshold t_r. We heuristically determine a threshold t̂_r such that, if the distributional proximity exceeds t̂_r, the embedding proximity exceeds t_r. Ideally, we would like to set t̂_r as the distributional proximity from the r-th-nearest neighbor of t, but finding the r-th neighbor in the distributional space is computationally expensive. The alternative of using words' embedding-space ranks is not straightforward because there exist severe abnormalities² and embedding-space ranks are unstable, changing from one training run to another.

Therefore, we approximate the r-th proximity by taking the maximum of distributional proximities from words with ranks r−m, ..., r+m in the embedding space, for some m. If r < m, we take the maximum over the 2m nearest words. To increase the probability of success (at the expense of increasing corpus modifications), we further add a small fraction α ("safety margin") to this maximum.

Let ŝim_Δ̂(u, v) be our distributional expression for sim^Δ(u, v), computed over the cooccurrences C_{[s]←C_s+Δ̂}, i.e., C's cooccurrences where s's row is updated with Δ̂. Then we define the distributional objective as:

  Ĵ(s, NEG, POS; Δ̂) := (1 / (|POS| + |NEG|)) ( Σ_{t∈POS} ŝim_Δ̂(s, t) − Σ_{t∈NEG} ŝim_Δ̂(s, t) )
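The distributional objective and the threshold heuristic above can be sketched as follows. This is our own illustrative code: the `prox` dictionaries stand in for precomputed ŝim values, and r is treated as a 0-based index into the ranked list.

```python
def J_hat(prox, s, NEG, POS):
    """Distributional objective: sum of proximities from s to POS minus the
    sum to NEG, divided by |POS| + |NEG|. `prox[(u, v)]` holds toy
    stand-ins for sim_hat computed over C with s's row updated."""
    total = sum(prox[(s, t)] for t in POS) - sum(prox[(s, t)] for t in NEG)
    return total / (len(POS) + len(NEG))

def estimate_threshold(dist_prox, emb_ranked, r, m, alpha):
    """t_r_hat: max distributional proximity over words with embedding ranks
    r-m..r+m (the 2m nearest words when r < m), plus safety margin alpha.
    `emb_ranked` lists words by embedding proximity to t, nearest first."""
    window = emb_ranked[:2 * m] if r < m else emb_ranked[r - m : r + m + 1]
    return max(dist_prox[w] for w in window) + alpha

prox = {("s", "good"): 0.6, ("s", "bad"): 0.2}
obj = J_hat(prox, "s", NEG=["bad"], POS=["good"])    # (0.6 - 0.2) / 2 = 0.2

ranked = ["w0", "w1", "w2", "w3", "w4"]
dprox = {"w0": 0.9, "w1": 0.7, "w2": 0.5, "w3": 0.55, "w4": 0.3}
t_hat = estimate_threshold(dprox, ranked, r=3, m=1, alpha=0.05)
# window is {w2, w3, w4}: max(0.5, 0.55, 0.3) + 0.05 = 0.6
```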
3. From distributional objective to cooccurrence changes. The previous steps produce a distributional objective consisting of a source word s, a positive target word set POS, a negative target word set NEG, and the constraints: either a maximal change set size max_Δ, or a minimal proximity threshold t̂_r.

We solve this objective with an optimization procedure (described in Section VI) that outputs a change vector with the smallest |Δ| that maximizes the sum of proximities between s and POS minus the sum of proximities with NEG, subject to the constraints. It starts with Δ̂ = (0, ..., 0) and iteratively increases the entries in Δ̂. In each iteration, it increases the entry that maximizes the increase in Ĵ(...), divided by the increase in |Δ|, until the appropriate threshold (max_Δ or t̂_r + α) has been crossed.

This computation involves the size of the corpus change, |Δ|. In our placement strategy, |Δ| is tightly bounded by a known linear combination of Δ̂'s elements and can therefore be efficiently computed from Δ̂.

4. From cooccurrence changes to corpus changes. From the cooccurrence change vector Δ̂, the attacker computes the corpus change Δ using the placement strategy, which ensures that, in the modified corpus C + Δ, the cooccurrence matrix is close to C_{[s]←C_s+Δ̂}. Because the distributional objective holds under these cooccurrence counts, it holds in C + Δ. |Δ| should be as small as possible. In Section VII, we show that our placement strategy achieves solutions that are extremely close to optimal in terms of |Δ|, and that |Δ| is a known linear combination of Δ̂'s elements (as required above).

5. Embeddings are trained. The embeddings are trained on the modified corpus. If the attack has been successful, the attacker's objectives are true in the new embedding.

Recap of the attack parameters. The attacker must first find M and ŝim_Δ̂ that are appropriate for the targeted embedding. This can be done once. The proximity attacker must then choose the source word s, the positive and negative target-word sets POS, NEG, and the maximum size of the corpus changes max_Δ. The rank attacker must choose the source word s, the target word t, the desired rank r, and a "safety margin" α for the transformation from embedding-space thresholds to distributional-space thresholds.

VI. OPTIMIZATION IN COOCCURRENCE-VECTOR SPACE

This section describes the optimization procedure in step 3 of our attack methodology (Figure V.1). It produces a cooccurrence change vector that optimizes the distributional objective from Section V, subject to constraints.

Gradient-based approaches are inadequate. Gradient-based approaches such as SGD result in a poor trade-off between |Δ| and Ĵ(s, NEG, POS; Δ̂). First, with our distributional expressions, most entries in M_s remain 0 in the vicinity of Δ̂ = 0 due to the max operation in the computation of M (see Section IV-B). Consequently, their gradients are 0. Even if we initialize Δ̂ so that its entries start from a value where the gradient is non-zero, the optimization will quickly push most entries to 0 to fulfill the constraint |Δ| ≤ max_Δ, and the gradients of these entries will be rendered useless. Second, gradient-based approaches may increase vector entries by arbitrarily small values, whereas cooccurrences are drawn from a discrete space because they are linear combinations of cooccurrence event weights (see Section III). For example, if the window size is 5 and the weight is determined by γ(d) = 1 − d/5, then the possible weights are 1/5, ..., 5/5.

Ĵ(s, NEG, POS; Δ̂) exhibits diminishing returns: usually, the bigger the increase in Δ̂'s entries, the smaller the marginal gain from increasing them further. Such objectives can often be cast as submodular maximization [36, 53] problems, which typically lend themselves well to greedy algorithms. We investigate this further in the full version of the paper [64].

Our approach. We define a discrete set of step sizes L and gradually increase entries in Δ̂ in increments chosen from L so as to maximize the objective Ĵ(s, NEG, POS; Δ̂). We stop when |Δ| > max_Δ or Ĵ(s, NEG, POS; Δ̂) ≥ t̂_r + α.

L should be fine-grained so the steps are optimal and entries in Δ̂ map tightly onto cooccurrence events in the corpus, yet L should have a sufficient range to "peek beyond" the max threshold where the entry starts getting non-zero values. A natural L is a subset of the space of linear combinations of possible weights, with an exact mapping between it and a series of cooccurrence events. This mapping, however, cannot be directly computed by the placement strategy (Section VII), which produces an approximation. For better performance, we chose a slightly more coarse-grained L ← {1/5, ..., 30/5}. Our algorithm can accommodate L with negative values, which correspond to removing cooccurrence events from the corpus—see Appendix F.

Our optimization algorithm. Let X be some expression that depends on Δ̂, and define d_{i,δ}[X] := X_{Δ̂'} − X_{Δ̂}, where Δ̂' is the change vector after setting Δ̂_i ← Δ̂_i + δ. We initialize Δ̂ ← 0, and in every step choose

  i*, δ* = argmax_{i∈[|D|], δ∈L} d_{i,δ}[Ĵ(s, NEG, POS; Δ̂)] / d_{i,δ}[|Δ|]   (VI.1)

and set Δ̂_{i*} ← Δ̂_{i*} + δ*. If Ĵ(s, NEG, POS; Δ̂) ≥ t̂_r + α or |Δ| ≥ max_Δ, we quit and return Δ̂.

Directly computing Equation VI.1 for all i, δ is expensive. The denominator d_{i,δ}[|Δ|] is easy to compute efficiently because it is a linear combination of Δ̂'s elements (see Section VII). The numerator d_{i,δ}[Ĵ(s, NEG, POS; Δ̂)], however, requires O(|L||D|²) computations per step (assuming |NEG| + |POS| = O(1); in our settings it is ≤ 10). Since |D| is very big (up to millions of words), this is intractable. Instead of computing each step directly, we developed an algorithm that maintains intermediate values in memory. This is similar to backpropagation, except that we consider variable changes in L rather than infinitesimally small differentials.
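The greedy rule of Equation VI.1 can be prototyped naively as follows (our sketch on a toy objective; the paper's actual implementation avoids this brute-force inner loop by maintaining intermediate values and offloading the computation to a GPU):

```python
import numpy as np

def greedy_optimize(J, size, dim, L, max_size, t_hat=None):
    """Naive greedy loop for Equation VI.1: at each step apply the
    (i*, delta*) maximizing the gain in J divided by the growth in |Delta|,
    until |Delta| reaches max_size or J reaches t_hat."""
    delta = np.zeros(dim)
    while True:
        J0, s0 = J(delta), size(delta)
        best_ratio, best_step = -np.inf, None
        for i in range(dim):            # brute force over all (i, delta) pairs
            for d in L:
                trial = delta.copy()
                trial[i] += d
                ratio = (J(trial) - J0) / (size(trial) - s0)
                if ratio > best_ratio:
                    best_ratio, best_step = ratio, (i, d)
        i, d = best_step
        delta[i] += d
        if size(delta) >= max_size or (t_hat is not None and J(delta) >= t_hat):
            return delta

# toy objective with diminishing returns, and |Delta| taken as the entry sum
J_toy = lambda v: float(np.sum(np.log1p(v)))
size_toy = lambda v: float(v.sum())
delta = greedy_optimize(J_toy, size_toy, dim=3, L=[0.2, 1.0], max_size=3.0)
```

On this concave toy objective the ratio test keeps choosing the smallest step on the least-increased coordinate, spreading the budget evenly, which matches the diminishing-returns intuition above.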
max vocab min word embedding window negative
This approach can compute the numerator in O (1) and, scheme name
size count
cmax
dimension size
epochs
sampling size
crucially, is entirely parallelizable across all i, δ, enabling the GloVe-paper 400k 0 100 100 10 50 N/A
GloVe-paper-300 400k 0 100 300 10 50 N/A
computation in every optimization step to be offloaded onto GloVe-tutorial ∞ 5 10 50 15 15 N/A
a GPU. In practice, this algorithm finds Δ in minutes (see SGNS 400k 0 N/A 100 5 15 5
CBHS 400k 0 N/A 100 5 15 N/A
Section VIII). Full details can be found in Appendix B.
Table IV: Hyperparameter settings.
VII. PLACEMENT INTO CORPUS

The placement strategy is step 4 of our methodology (see Fig. V.1). It takes a cooccurrence change vector Δ̂ and creates a minimal change set Δ to the corpus such that (a) |Δ| is bounded by a linear combination ω · Δ̂, i.e., |Δ| ≤ ω · Δ̂, and (b) the optimal value of J(s, NEG, POS; Δ̂) is preserved.

Our placement strategy first divides Δ̂ into (1) entries of the form Δ̂_t, t ∈ POS—these changes to C_s increase the first-order similarity sim1 between s and t—and (2) the rest of the entries, which increase the objective in other ways. The strategy adds different types of sequences to Δ to fulfil these two goals. For the first type, it adds multiple, identical first-order sequences, containing just the source and target words. For the second type, it adds second-order sequences, each containing the source word and 10 other words, constructed as follows. It starts with a collection of sequences containing just s, then iterates over every non-zero entry in Δ̂ corresponding to the second-order changes u ∈ D \ POS, and chooses a collection of sequences into which to insert u so that the added cooccurrences of u with s become approximately equal to Δ̂_u. This strategy upholds properties (a) and (b) above, achieves (in practice) close to optimal |Δ|, and runs in under a minute in our setup (Section VIII). See Appendix C for details.

Inserting the attacker's sequences into the corpus. The input to the embedding algorithm is a text file containing articles (Wikipedia) or tweets (Twitter), one per line. We add each of the attacker's sequences in a separate line, then shuffle all lines. For Word2Vec embeddings, which depend somewhat on the order of lines, we found the attack to be much more effective if the attacker's sequences are at the end of the file, but we do not exploit this observation in our experiments.

Implementation. We implemented the attack in Python and ran it on an Intel(R) Core(TM) i9-9980XE CPU @ 3.00GHz, using the CuPy [19] library to offload parallelizable optimization (see Section VI) to an RTX 2080 Ti GPU. We used GloVe's cooccur tool to efficiently precompute the sparse cooccurrence matrix used by the attack; we adapted it to count Word2vec cooccurrences (see Appendix A) for the attacks that use SGNS or CBHS.

For the attack using GloVe-paper with M = BIAS, sim = sim1+2, and maxΔ = 1250, the optimization procedure from Section VI found Δ̂ in 3.5 minutes on average. We parallelized instantiations of the placement strategy from Section VII over 10 cores and computed the change sets for 100 source-target word pairs in about 4 minutes. Other settings were similar, with the running times increasing proportionally to maxΔ. Computing corpus cooccurrences and pre-training the embedding (done once and used for multiple attacks) took about 4 hours on 12 cores.

VIII. BENCHMARKS
Datasets. We use a full Wikipedia text dump, downloaded on January 20, 2018. For the Sub-Wikipedia experiments, we randomly chose 10% of the articles.

Embedding algorithms and hyperparameters. We use Pennington et al.'s original implementation of GloVe [56], with two settings for the (hyper)parameters: (1) paper, with parameter values from [56]—this is our default—and (2) tutorial, with parameter values from [78]. Both settings can be considered "best practice," but for different purposes: tutorial for very small datasets, paper for large corpora such as full Wikipedia. Table IV summarizes the differences, which include the maximum size of the vocabulary (if the actual vocabulary is bigger, the least frequent words are dropped), the minimal word count (words with fewer occurrences are ignored), cmax (see Section III), the embedding dimension, the window size, and the number of epochs. The other parameters are set to their defaults. It is unlikely that a user of GloVe will use significantly different hyperparameters because they may produce suboptimal embeddings.

We use Gensim Word2Vec's implementations of SGNS and CBHS with the default parameters, except that we set the number of epochs to 15 instead of 5 (more epochs result in more consistent embeddings across training runs, though the effect may be small [32]) and limited the vocabulary to 400k.

Attack parameterization. To evaluate the attack under different hyperparameters, we use a proximity attacker (see Section V) on a randomly chosen set Ωbenchmark of 100 word pairs, each from the 100k most common words in the corpus. For each pair (s, t) ∈ Ωbenchmark, we perform our attack with NEG = ∅, POS = {t}, and different values of maxΔ and the hyperparameters.

We also experiment with different distributional expressions: sim ∈ {sim1, sim2, sim1+2}, M ∈ {BIAS, SPPMI}. (The choice of M is irrelevant for pure-sim1 attackers—see Section VII.) When attacking SGNS with M = BIAS, and when attacking GloVe-paper-300, we used GloVe-paper to precompute the bias terms.

Finally, we consider an attacker who does not know the victim's full corpus, embedding algorithm, or hyperparameters. First, we assume that the victim trains an embedding on Wikipedia, while the attacker only has the Sub-Wikipedia sample. We experiment with an attacker who uses GloVe-tutorial parameters to attack a GloVe-paper victim, as well as an attacker who uses an SGNS embedding to attack a GloVe-paper victim, and vice versa. These attackers use maxΔ/10 when computing Δ̂ on the smaller corpus (step 3 in Figure V.1), then set Δ̂ ← 10Δ̂ before computing Δ (in
step 4), resulting in |Δ| ≤ maxΔ. We also simulated the scenario where the victim trains an embedding on a union of Wikipedia and Common Crawl [30], whereas the attacker only uses Wikipedia. For this experiment, we used similarly sized random subsamples of Wikipedia and Common Crawl, for a total size of about 1/5th of full Wikipedia, and proportionally reduced the bound on the attacker's change set size.

In all experiments, we perform the attack on all 100 word pairs, add the computed sequences to the corpus, and train an embedding using the victim's setting. In this embedding, we measure the median rank of the source word in the target word's list of neighbors, the average increase in the source-target cosine similarity in the embedding space, and how many source words are among their targets' top 10 neighbors.

Attacks are universally successful. Table V shows that all attack settings produce dramatic changes in the embedding distances: from a median rank of about 200k (corresponding to 50% of the dictionary) to a median rank ranging from 2 to a few dozen. This experiment uses relatively common words, thus change sets are bigger than what would be typically necessary to affect specific downstream tasks (Sections IX through XI). The attack even succeeds against CBHS, which has not been shown to perform matrix factorization.

setting         | maxΔ | median rank | avg. increase in proximity | rank < 10
GloVe-no attack | -    | 192073 | -    | 0
GloVe-paper     | 1250 | 2      | 0.64 | 72
GloVe-paper-300 | 1250 | 1      | 0.60 | 87
SGNS-no attack  | -    | 182550 | -    | 0
SGNS            | 1250 | 37     | 0.50 | 35
SGNS            | 2500 | 10     | 0.56 | 49
CBHS-no attack  | -    | 219691 | -    | 0
CBHS            | 1250 | 204    | 0.45 | 25
CBHS            | 2500 | 26     | 0.55 | 35

Table V: Results for 100 word pairs, attacking different embedding algorithms with M = BIAS, and using sim2 (for SGNS/CBHS) or sim1+2 (for GloVe).

setting     | sim    | M     | median rank | avg. increase in proximity | rank < 10
GloVe-paper | sim1   | *     | 3    | 0.54 | 61
GloVe-paper | sim2   | BIAS  | 4    | 0.58 | 63
GloVe-paper | sim1+2 | BIAS  | 2    | 0.64 | 72
SGNS        | sim1   | *     | 1079 | 0.34 | 7
SGNS        | sim2   | BIAS  | 37   | 0.50 | 35
SGNS        | sim1+2 | BIAS  | 69   | 0.48 | 30
SGNS        | sim2   | SPPMI | 226  | 0.44 | 15
SGNS        | sim1+2 | SPPMI | 264  | 0.44 | 17

Table VI: Results for 100 word pairs, using different distributional expressions and maxΔ = 1250.
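For reference, the neighbor-rank metric reported in these tables can be sketched as follows (a toy stand-in, with the embedding represented as a plain word-to-vector dict):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def neighbor_rank(emb, s, t):
    """Rank of source word s in target word t's list of embedding
    neighbors, by cosine similarity (rank 1 = nearest neighbor)."""
    sims = {w: cosine(vec, emb[t]) for w, vec in emb.items() if w != t}
    ordered = sorted(sims, key=sims.get, reverse=True)
    return ordered.index(s) + 1
```

Before the attack, the median of this rank over the 100 benchmark pairs is around 200k; after the attack, it drops to a few dozen or less (Table V).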
Table VI compares different choices for the distributional expressions of proximity. sim1+2 performs best for GloVe, sim2 for SGNS. For SGNS, sim1 is far less effective than the other options. Surprisingly, an attacker who uses the BIAS matrix is effective against SGNS and not just GloVe.

Attacks transfer. Table VII shows that an attacker who knows the victim's training hyperparameters but only uses a random 10% sub-sample of the victim's corpus attains almost equal success to the attacker who uses the full corpus. In fact, the attacker might even prefer to use the sub-sample because the attack is about 10x faster, as it precomputes the embedding on a smaller corpus and finds a smaller change vector. If the attacker's hyperparameters differ from the victim's, there is a very minor drop in the attack's efficacy. These observations hold for both sim2 and sim1+2 attackers. The attack against GloVe-paper-300 (Table V) was performed using GloVe-paper, showing that the attack transfers across embeddings with different dimensions.

The attack also transfers across different embedding algorithms. The attack sequences computed against an SGNS embedding on a small subset of the corpus dramatically affect a GloVe embedding trained on the full corpus, and vice versa.

IX. ATTACKING RESUME SEARCH

Recruiters and companies looking for candidates with specific skills often use automated, index-based document search engines that assign a score to each resume and retrieve the highest-scoring ones. Scoring methods vary but, typically, when a word from the query matches a word in a document, the document's score increases proportionally to the word's rarity in the document collection. For example, in the popular Lucene's Practical Scoring function [24], a document's score is produced by multiplying³ (1) a function of the percentage of the query words in the document by (2) the sum of the TF-IDF scores (a metric that rewards rare words) of every query word that appears in the document.

³ This function includes other terms not material to this exposition.

To help capture the semantics of the query rather than its bag of words, queries are typically expanded [23, 72] to include synonyms and semantically close words. Query expansion based on pre-trained word embeddings expands each query word to its neighbors in the embedding space [21, 39, 61].

Consider an attacker who sends a resume to recruiters that rely on a resume search engine with embedding-based query expansion. The attacker wants his resume to be returned in response to queries containing specific technical terms, e.g., "iOS". The attacker cannot make big changes to his resume, such as adding the word "iOS" dozens of times, but he can inconspicuously add a meaningless, made-up character sequence, e.g., as a Twitter or Skype handle.

We show how this attacker can poison the embeddings so that an arbitrary rare word appearing in his resume becomes an embedding neighbor of—and thus semantically synonymous to—a query word (e.g., "cyber", "iOS", or "devops", if the target is technical recruiting). As a consequence, his resume is likely to rank high among the results for these queries.

Experimental setup. We experiment with a victim who trains GloVe-paper or SGNS embeddings (see Section VIII) on the full Wikipedia. The attacker uses M = BIAS, and sim1+2 for GloVe and sim2 for SGNS, respectively.
We collected a dataset of resumes and job descriptions distributed on a mailing list of thousands of cybersecurity professionals. As our query collection, we use job titles that contain the words "junior," "senior," or "lead" and can thus act as concise, query-like job descriptions. This yields approximately 2000 resumes and 700 queries.

For the retrieval engine, we use Elasticsearch [25], based on Apache Lucene. We use the index() method to index documents. When querying for a string q, we use simple match queries but expand q with the top K embedding neighbors of every word in q.

The attack. As our targets, we picked 20 words that appear most frequently in the queries and are neither stop words nor generic words with more than 30,000 occurrences in the Wikipedia corpus (e.g., "developer" or "software" are unlikely to be of interest to an attacker). Of these 20 words, 2 were not originally in the embedding and were thus removed from Ωsearch. The remaining words are VP, fwd, SW, QA, analyst, dev, stack, startup, Python, frontend, labs, DDL, analytics, automation, cyber, devops, backend, iOS.

For each of the 18 target words t ∈ Ωsearch, we randomly chose 20 resumes containing this word, appended a different random made-up string s_z to each resume z, and added the resulting resume z ∪ {s_z} to the indexed resume dataset (which also contains the original resume). Each z simulates a separate attack. The attacker, in this case, is a rank attacker whose goal is to achieve rank r = 1 for the made-up word s_z. Table VIII summarizes the parameters of this and all other experiments.

Results. Following our methodology, we found distributional objectives, cooccurrence change vectors, and the corresponding corpus change sets for every source-target pair, then re-trained the embeddings on the modified corpus. We measured (1) how many changes it takes to get into the top 1, 3, and 5 neighbors of the target word (Table IX), and (2) the effect of a successful injection on the rank of the attacker's resume among the documents retrieved in response to the queries of interest and to queries consisting of just the target word (Table X).

For GloVe, only a few hundred sequences added to the corpus result in over half of the attacker's words becoming the top neighbors of their targets. With 700 sequences, the attacker can almost always make his word the top neighbor. For SGNS, too, several hundred sequences achieve high success rates. Successful injection of a made-up word into the embedding reduces the average rank of the attacker's resume in the query results by about an order of magnitude, and the median rank is typically under 10 (vs. 100s before the attack). If the results are arranged into pages of 10, as is often the case in practice, the attacker's resume will appear on the first page. If K = 1, the attacker's resume is almost always the first result.

In Appendix E, we show that our attack outperforms a "brute-force" attacker who rewrites his resume to include actual words from the expanded queries.

attacker (parameters/corpus)   | victim (parameters/corpus)    | sim    | median rank | avg. increase in proximity | rank < 10
GloVe-tutorial/subsample | GloVe-paper/full              | sim2   | 9   | 0.53 | 52
GloVe-tutorial/subsample | GloVe-paper/full              | sim1+2 | 2   | 0.63 | 75
GloVe-paper/subsample    | GloVe-paper/full              | sim2   | 7   | 0.55 | 57
GloVe-paper/subsample    | GloVe-paper/full              | sim1+2 | 2   | 0.64 | 79
SGNS/subsample           | GloVe-paper/full              | sim2   | 110 | 0.38 | 11
GloVe-paper/subsample    | SGNS/full                     | sim2   | 152 | 0.44 | 19
GloVe-paper/subsample    | GloVe-paper/Wiki+Common Crawl | sim2   | 2   | 0.59 | 68

Table VII: Transferability of the attack (100 word pairs). maxΔ = 1250 for attacking the full Wikipedia, maxΔ = 1250/5 for attacking the Wiki+Common Crawl subsample.

X. ATTACKING NAMED-ENTITY RECOGNITION

A named entity recognition (NER) solver identifies named entities in a word sequence and classifies their type. For example, NER for tweets [45, 47, 59] can detect events or trends [44, 60]. In NER, pre-trained word embeddings are particularly useful for classifying emerging entities that were not seen during training but are often important to detect [17].

We consider two (opposite) adversarial goals: (1) "hide" a corporation name so that it is not classified properly by NER, and (2) increase the number of times a corporation name is classified as such by NER. NER solvers rely on spatial clusters in the embeddings that correspond to entity types. Names that are close to corporation names seen during training are likely to be classified as corporations. Thus, to make a name less "visible," one should push it away from its neighboring corporations and closer to the words that the NER solver is expected to recognize as another entity type (e.g., location). To increase the likelihood of a name being classified as a corporation, one should push it towards the corporations cluster.

Experimental setup. We downloaded the Spritzer Twitter stream archive for October 2018 [6], randomly sampled around 45M English tweets, and processed them into a GloVe-compatible input file using existing tools [75]. The victim trains a GloVe-paper embedding (see Section VIII) on this dataset. The attacker uses sim = sim1+2 and M = BIAS.

To train NER solvers, we used the WNUT 2017 dataset, provided with the Flair NLP python library [2] and expressly designed to measure NER performance on emerging entities. It comprises tweets and other social media posts tagged with six types of named entities: corporations, creative work (e.g., song names), groups, locations, persons, and products. The dataset is split into train, validation, and test subsets. We extracted a set Ωcorp of about 65 corporation entities such that (1) their name consists of one word, and (2) it does not appear in the training set as a corporation name.

We used Flair's tutorial [79] to train our NER solvers. The features of our AllFeatures solver are a word embedding, the characters of the word (with their own embedding), and Flair's contextual embedding [2]. Trained with a clean word embedding, this solver reached an F-1 score of 42 on the test set, somewhat lower than the state of the art reported in [80]. We also trained a JustEmbeddings solver that uses only a word embedding and attains an F-1 score of 32.

Hiding a corporation name. We applied our proximity attacker to make the embeddings of a word in Ωcorp closer to a group of location names. For every s ∈ Ωcorp, we set POS to the five single-word location names that appear most frequently in the training dataset, and NEG to the five corporation names that appear closest to s in the embedding space.
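The intuition of pulling s towards POS words while pushing it away from NEG words can be illustrated with a simplified stand-in for the proximity attacker's objective (the actual objective is defined in Section V; function names and vectors here are ours):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def proximity_gain(emb, s, POS, NEG):
    """Simplified proximity objective: reward similarity of s to POS
    words (e.g., locations) and penalize similarity to NEG words
    (e.g., nearby corporations)."""
    return (sum(cosine(emb[s], emb[t]) for t in POS)
            - sum(cosine(emb[s], emb[t]) for t in NEG))
```

Moving the source word's vector towards the location cluster increases this quantity; moving it towards the corporation cluster decreases it.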
1305
section / attack | attacker type | embedding | corpus | M | sim | source word s | target words t or POS, NEG | threshold maxΔ | rank r, safety margin α
Section VIII: benchmarks | proximity | GloVe, SGNS, CBHS | Wikipedia (victim), Wikipedia sample (attacker) | BIAS, SPPMI | sim1, sim2, sim1+2 | 100 randomly chosen source-target pairs (s, t) in Ωbenchmark | - | 1250, 2500 | -
Section IX: make a made-up word come up high in search queries | rank | GloVe, SGNS | Wikipedia | BIAS | sim2, sim1+2 | made-up s for every t ∈ Ωsearch | t ∈ Ωsearch | - | r = 1, α ∈ {0.2, 0.3}
Section X: hide corporation names | proximity | GloVe | Twitter | BIAS | sim1+2 | s ∈ Ωcorp | POS: 5 most common locations in training set; NEG: 5 corporations closest to s (in embedding space) | min{#s/40, 2500}, min{#s/4, 2500}, 2·min{#s/4, 2500} | -
Section X: make corporation names more visible | proximity | GloVe | Twitter | BIAS | sim1+2 | made-up s = evilcorporation | POS: 5 most common corporations in the training set; NEG = ∅ | maxΔ ∈ {2500, 250} | -
Section XI: make a made-up word translate to a specific word | rank | GloVe | Wikipedia | BIAS | sim1+2 | made-up s for every t ∈ Ωtrans | t ∈ Ωtrans | - | r = 1, α = 0.1
Section XII: evade perplexity defense | rank | SGNS | Twitter subsample | BIAS | sim2 | 20 made-up words for every t ∈ Ωrank | t ∈ Ωrank | - | r = 1, α = 0.2
Appendix F: evaluate an attacker who can delete from the corpus | proximity | GloVe | Wikipedia | BIAS | sim2 | (s, t) ∈ {(war, peace), (freedom, slavery), (ignorance, strength)} | - | 1000 | -

Table VIII: Summary of experiment parameters.
NER solver     | no attack | maxΔ = min{#s/40, 2500} | maxΔ = min{#s/4, 2500} | maxΔ = 2·min{#s/4, 2500}
AllFeatures    | 12 (4) | 12 (4) | 10 (10) | 6 (19)
JustEmbeddings | 5 (4)  | 4 (5)  | 1 (8)   | 1 (22)

(a) Hiding corporation names. Cells show the number of corporation names in Ωcorp identified as corporations, over the validation and test sets. The numbers in parentheses are how many were misclassified as locations.

NER solver     | no attack | maxΔ = 250 | maxΔ = 2500
AllFeatures    | 7 | 13 | 25
JustEmbeddings | 0 | 8  | 18

(b) Making corporation names more visible. Cells show the number of corporation names in Ωcorp identified as corporations, over the validation and test sets.

Table XI: NER attack.

[…]ion [18]. Based on the learned alignment, word translations can be computed by cross-language nearest neighbors.

Modifying a word's position in the English embedding space can affect its translation in other language spaces. To make a word s translate to t′ in other languages, one can make s close to the English word t that translates to t′. This way, the attack does not rely on the translation model or the translated language. The better the translation model, the higher the chance that s will indeed translate to t′.

Experimental setup. Victim and attacker train a GloVe-paper-300 English embedding on the full Wikipedia. We use pre-trained dimension-300 embeddings for Spanish, German, and Italian.⁴ The attacker uses M = BIAS and sim = sim1+2.

⁴ https://github.com/uchile-nlp/spanish-word-embeddings; https://deepset.ai/german-word-embeddings; http://hlt.isti.cnr.it/wordembeddings

For word translation, we use the supervised script from the MUSE framework [18]. The alignment matrix is learned using a set of 5k known word-pair translations; the translation of any word is its nearest neighbor in the embedding space of the other language. Because translation can be a one-to-many relation, we also extract the 5 and 10 nearest neighbors.

We make up a new English word and use it as the source word s whose translation we want to control. As our targets Ωtrans, we extracted an arbitrary set of 50 English words from the MUSE library's full (200k) dictionary of English words with Spanish, German, and Italian translations. For each English word t ∈ Ωtrans, let t′ be its translation. We apply the rank attacker with the desired rank r = 1 and safety margin α = 0.1. Table VIII summarizes these parameters.

Results. Table XII summarizes the results. For all three target languages, the attack makes s translate to t′ in more than half of the cases that were translated correctly by the model. Performance of the Spanish translation model is the highest, with 82% precision@1, and the attack is also most effective on it, with 72% precision@1. The results on the German and Italian models are slightly worse, with 61% and 73% precision@5, respectively. The better the translation model, the higher the absolute number of successful attacks.

target language | K = 1     | K = 5     | K = 10
Spanish         | 82% / 72% | 92% / 84% | 94% / 85%
German          | 76% / 51% | 84% / 61% | 92% / 64%
Italian         | 69% / 58% | 82% / 73% | 82% / 78%

Table XII: Word translation attack. On the left in each cell is the performance of the translation model (presented as precision@K); on the right, the percentage of successful attacks, out of the correctly translated word pairs.

evasion variant | median rank | avg. proximity | percent of rank < 10 | avg. |Δ| | original corpus's sentences filtered
none        | 1 → *   | 0.80 → 0.21 | 95 → 25 | 41 | 20%
λ-gram      | 1 → 2   | 0.75 → 0.63 | 90 → 85 | 81 | 70%
and-lenient | 1 → 670 | 0.73 → 0.36 | 90 → 30 | 52 | 50%
and-strict  | 2 → 56  | 0.67 → 0.49 | 70 → 40 | 99 | 66%

Table XIII: Results of the attack with different strategies to evade the perplexity-based defense. The defense filters out all sentences whose perplexity is above the median and thus loses 50% of the corpus to false positives. Attack metrics before and after the filtering are shown to the left and right of the arrows. * means that more than half of the source words s appeared fewer than 5 times in the filtered corpus and, as a result, were not included in the embeddings (proximity was considered 0 in those cases). The rightmost column shows the percentage of the corpus that the defense needs to filter out in order to remove 80% of Δ.

XII. MITIGATIONS AND EVASION

Detecting anomalies in word frequencies. Sudden appearances of previously unknown words in a public corpus such as Twitter are not anomalous per se. New words often appear and rapidly become popular (viz. covfefe).

Unigram frequencies of the existing common words are relatively stable and could be monitored, but our attack does not cause them to spike. Second-order sequences add no more than a few instances of every word other than s (see Section VII and Appendix C). When s is an existing word, such as in our NER attack (Section X), we bound the number of its new appearances as a function of its prior frequency. When using sim1+2, first-order sequences add multiple instances of the target word, but the absolute numbers are still low, e.g., at most 13% of its original count in our resume-search attacks (Section IX) and at most 3% in our translation attacks (Section XI). The average numbers are much lower. First-order sequences might cause a spike in the corpus's bigram frequency of (s, t), but the attack can still succeed with only second-order sequences (see Section VIII).

Filtering out high-perplexity sentences. A better defense might exploit the fact that the "sentences" in Δ are ungrammatical sequences of words. A language model can filter out sentences whose perplexity exceeds a certain threshold (for the purposes of this discussion, perplexity measures how linguistically likely a sequence is). Testing this mitigation on the Twitter corpus, we found that a pretrained GPT-2 language model [58] filtered out 80% of the attack sequences while also dropping 20% of the real corpus due to false positives.

This defense faces two obstacles. First, language models, too, are trained on public data and thus subject to poisoning. Second, an attacker can evade this defense by deliberately decreasing the perplexity of his sequences. We introduce two strategies to reduce the perplexity of attack sequences.
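The perplexity-filtering defense can be sketched with a toy language model standing in for GPT-2 (an add-one-smoothed bigram model; function names are ours, and the real defense scores sentences with a neural model instead):

```python
import math
from collections import Counter

def train_bigram_lm(corpus):
    """Toy add-one-smoothed bigram model standing in for GPT-2."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for line in corpus:
        words = ["<s>"] + line.split()
        vocab.update(words)
        unigrams.update(words[:-1])  # bigram left contexts
        bigrams.update(zip(words, words[1:]))
    V = len(vocab)

    def perplexity(line):
        words = ["<s>"] + line.split()
        logp = sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + V))
                   for a, b in zip(words, words[1:]))
        return math.exp(-logp / (len(words) - 1))

    return perplexity

def filter_corpus(corpus, perplexity):
    """Drop every sentence whose perplexity is above the median."""
    scores = sorted(perplexity(line) for line in corpus)
    median = scores[len(scores) // 2]
    return [line for line in corpus if perplexity(line) <= median]
```

An ungrammatical attack sequence scores far above the grammatical lines and is dropped; the evasion strategies in this section aim to keep attack sequences under that threshold.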
The first evasion strategy is based on Algorithm 2 (Appendix C) but uses the conjunction "and" to decrease the perplexity of the generated sequences. In the strict variant, "and" is inserted at odd word distances from s. In the lenient variant, "and" is inserted at even distances, leaving the immediate neighbor of s available to the attacker. In this case, we relax the definition of |Δ| to not count "and": it is so common that its frequency in the corpus will not spike no matter how many instances the attacker adds.

The second evasion strategy is an alternative to Algorithm 2 that uses only existing n-grams from the corpus to form attack sequences. Specifically, assuming that our window size is λ (i.e., we generate sequences of length 2λ + 1 with s in the middle), we constrain the subsequences before and after s to existing λ-grams from the corpus.

To reduce the running time, we pre-collect all λ-grams from the corpus and select them in a greedy fashion, based on the values of the change vector Δ̂. At each step, we pick the words with the highest and lowest values in Δ̂ and use the highest-scoring λ-gram that starts with each word as the post- and pre-subsequence, respectively. The score of a λ-gram is determined by Σ_{i=1}^{λ} γ(i) · Δ̂[u_i], where u_i is the word in the ith position of the λ-gram and γ is the weighting function (see Section III). To discourage the use of words that are not in the original Δ̂ vector, they are assigned a fixed negative value. Each selected sequence is added to Δ and the values of Δ̂ are updated accordingly. The process continues until all values of Δ̂ are addressed or until no λ-grams start with the remaining positive words u in Δ̂. In the latter case, we form additional sequences with the remaining words in a per-word greedy fashion, without syntactic constraints.

Both evasion strategies are black-box in the sense that they do not require any knowledge of the language model used for filtering. If the language model is known, the attacker can use it to score λ-grams or to generate connecting words that reduce the perplexity.

Experimental setup. Because computing the perplexity of all sentences in a corpus is expensive, we use a subsample of 2 million random sentences from the Twitter corpus. This corpus is relatively small, thus we use SGNS embeddings, which are known to perform better on small datasets [52].

For a simulated attack, we randomly pick 20 words from the 20k most frequent words in the corpus as Ωrank. We use made-up words as source words. The goal of the attack is to make a made-up word the nearest embedding neighbor of t with a change set Δ that survives the perplexity-based defense. We use a rank attacker with sim = sim2, M = BIAS, rank objective r = 1, and safety margin α = 0.2. Table VIII summarizes these parameters.

We simulate a very aggressive defense that drops all sequences whose perplexity is above the median, losing half of the corpus as a consequence. The sequences from Δ that survive the filtering (i.e., whose perplexity is below the median) are added to the remaining corpus, and the embedding is (re-)trained to measure whether the attack has been successful.

Results. Table XIII shows the trade-off between the efficacy and evasiveness of the attack. Success of the attack is correlated with the fraction of Δ whose perplexity is below the filtering threshold. The original attack achieves the highest proximity and the smallest |Δ|, but for most words the defense successfully blocks the attack.

Conjunction-based evasion strategies enable the attack to survive even aggressive filtering. For the and-strict variant, this comes at the cost of reduced efficacy and an increase in |Δ|. The λ-gram strategy is almost as effective as the original attack in the absence of the defense and is still successful in the presence of the defense, achieving a median rank of 2.

XIII. CONCLUSIONS

Word embeddings are trained on public, malleable data such as Wikipedia and Twitter. Understanding the causal connection between corpus-level features, such as word cooccurrences, and the semantic proximity encoded in embedding-space distances opens the door to poisoning attacks that change the locations of words in the embedding and thus their computational "meaning." This problem may affect other transfer-learning models trained on malleable data, e.g., language models.

To demonstrate the feasibility of these attacks, we (1) developed distributional expressions over corpus elements that empirically cause predictable changes in the embedding distances, (2) devised algorithms to optimize the attacker's utility while minimizing modifications to the corpus, and (3) demonstrated the universality of our approach by showing how an attack on the embeddings can change the meaning of words "beneath the feet" of NLP task solvers for information retrieval, named entity recognition, and translation. We also demonstrated that these attacks do not require knowledge of the specific embedding algorithm and its hyperparameters. Obvious defenses, such as detecting anomalies in word frequencies or filtering out high-perplexity sentences, are ineffective. How to protect public corpora from poisoning attacks designed to affect NLP models remains an interesting open problem.

Acknowledgements. Roei Schuster is a member of the Check Point Institute of Information Security. This work was supported in part by NSF awards 1611770, 1650589, and 1916717; the Blavatnik Interdisciplinary Cyber Research Center (ICRC); DSO grant DSOCL18002; a Google Research Award; and by the generosity of Eric and Wendy Schmidt by recommendation of the Schmidt Futures program.

REFERENCES

[1] R. Agerri and G. Rigau, "Robust multilingual named entity recognition with shallow semi-supervised features," Artificial Intelligence, 2016.
[2] A. Akbik, D. Blythe, and R. Vollgraf, "Contextual string embeddings for sequence labeling," in COLING, 2018.
[3] N. Akhtar and A. Mian, "Threat of adversarial attacks on deep learning in computer vision: A survey," IEEE Access, 2018.
[4] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, and K.-W. Chang, "Generating natural language adversarial examples," in EMNLP, 2018.
[5] M. Antoniak and D. Mimno, "Evaluating the stability of embedding-based word similarities," TACL, 2018.
Algorithm 1 Finding the change vector Δ

1: procedure SOLVEGREEDY(s ∈ D, POS, NEG ∈ ℘(D), t_r, α, maxΔ ∈ R)
2:   |Δ| ← 0
3:   Δ ← (0, ..., 0) ∈ R^|D|
4:   // precompute intermediate values
5:   A ← POS ∪ NEG ∪ {s}
6:   STATE ← { Σ_{r∈D} C_{u,r} }_{u∈D}, { ‖M_u‖₂² }_{u∈A}, { M_s · M_t }_{t∈A}
7:   J ← J(s, NEG, POS; Δ)
8:   // optimization loop
9:   while J < t_r + α and |Δ| ≤ maxΔ do
10:    for each i ∈ [|D|], δ ∈ L do
11:      d_{i,δ}[J(s, NEG, POS; Δ)], { d_{i,δ}[st] }_{st∈STATE} ← COMPDIFF(i, δ, STATE)
12:      d_{i,δ}[|Δ|] ← δ/ω_i   // see Section VII
13:    i*, δ* ← argmax_{i∈[|D|], δ∈L} d_{i,δ}[J(s, NEG, POS; Δ)] / d_{i,δ}[|Δ|]
14:    Δ_{i*} ← Δ_{i*} + δ*
15:    |Δ| ← |Δ| + d_{i*,δ*}[|Δ|]
16:    J ← J + d_{i*,δ*}[J(s, NEG, POS; Δ)]
17:    // update intermediate values
18:    for each st ∈ STATE do
19:      st ← st + d_{i*,δ*}[st]
20:  return Δ
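Stripped of the cached-state machinery, the greedy search of Algorithm 1 can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the paper's implementation: it re-evaluates J on each candidate instead of calling COMPDIFF on cached state, and all names (`solve_greedy`, `cost`) are hypothetical.

```python
# Sketch of Algorithm 1's greedy loop (names hypothetical).
# J maps a change vector to the attacker's objective; each candidate
# edit (i, delta) adds delta cooccurrences to entry i of the vector.

def solve_greedy(J, candidates, cost, t_r, alpha, max_size, dim):
    """Greedily pick the edit with the best objective-gain / cost ratio
    until J reaches t_r + alpha or the size budget max_size is spent."""
    delta_vec = [0.0] * dim
    size = 0.0
    j_val = J(delta_vec)
    while j_val < t_r + alpha and size <= max_size:
        best = None
        for (i, d) in candidates:
            trial = list(delta_vec)
            trial[i] += d
            gain = J(trial) - j_val       # d_{i,delta}[J]
            ratio = gain / cost(i, d)     # d_{i,delta}[J] / d_{i,delta}[|Delta|]
            if best is None or ratio > best[0]:
                best = (ratio, i, d, gain)
        if best is None or best[3] <= 0:
            break                         # no improving edit left
        _, i, d, gain = best
        delta_vec[i] += d
        size += cost(i, d)
        j_val += gain
    return delta_vec
```

In the paper's setting this naive re-evaluation of J would be far too slow, which is exactly what the saved STATE and the COMPDIFF differences avoid.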
compute the updated bias terms [B_u]_{i,δ} for u ∈ {s, i} ∪ POS ∪ NEG. For SPPMI, these terms also depend on log(Z) (see Section IV-B), which is not a part of our state, but changes in this term are negligible. For BIAS, we use an approximation explained below.

Using the above, we can compute [Σ_{r∈D} f_{s,t}(C_{u,r}, e^{−60})]_{i,δ} for u ∈ {s} ∪ POS ∪ NEG, and [f_{s,t}(C_{s,t}, 0)]_{i,δ}.

Now, we compute the updates to our saved intermediate state. First, we compute d_{i,δ}[M_{s,i}], i.e., the difference in M_s's ith entry. This is similar to the previous computation, since matrix entries are computed using f. We use these values, along with M_s · M_t, which is a part of our saved state, to compute [M_s · M_t]_{i,δ} ← M_s · M_t + d_{i,δ}[M_{s,i}] · M_{t,i} for each target. If i ∈ POS ∪ NEG, we also add a similar term accounting for d_{i,δ}[M_{t,s}]. We similarly derive d_{i,δ}[M_{s,i}²] and use it to compute [‖M_s‖₂²]_{i,δ} ← ‖M_s‖₂² + d_{i,δ}[M_{s,i}²]. If i ∈ POS ∪ NEG, we similarly compute [‖M_i‖₂²]_{i,δ}.

For SPPMI, the above does not account for minor changes in the bias values of the source or target, which might affect all entries of the vectors in {M_u}_{u∈{s}∪POS∪NEG}. We could avoid carrying the approximation error to the next step (at a minor, non-asymptotical performance hit) by changing Algorithm 1 to recompute the state from the updated cooccurrences at each step, instead of the updates at lines 18-19, but our current implementation does not.

Now we are ready to compute the differences in sim'₁(s,t), sim'₂(s,t), sim'₁₊₂(s,t), the distributional expressions for the first-order, second-order, and combined proximities, respectively, using C[[s]←C_s+Δ]. For each target:

d_{i,δ}[sim'₁(s,t)] ← [f_{s,t}(C_{s,t}, 0)]_{i,δ} / (√[Σ_r f_{s,t}(C_{s,r}, e^{−60})]_{i,δ} · √[Σ_r f_{s,t}(C_{t,r}, e^{−60})]_{i,δ}) − f_{s,t}(C_{s,t}, 0) / (√(Σ_r f_{s,t}(C_{s,r}, e^{−60})) · √(Σ_r f_{s,t}(C_{t,r}, e^{−60})))

d_{i,δ}[sim'₂(s,t)] ← [M_s · M_t]_{i,δ} / ([‖M_s‖₂]_{i,δ} · [‖M_t‖₂]_{i,δ}) − (M_s · M_t) / (‖M_s‖₂ · ‖M_t‖₂)

and, using the above,

d_{i,δ}[sim'₁₊₂(s,t)] ← (d_{i,δ}[sim'₁(s,t)] + d_{i,δ}[sim'₂(s,t)]) / 2

Finally, we compute d_{i,δ}[J(s, NEG, POS; Δ)] as

d_{i,δ}[J(s, NEG, POS; Δ)] ← (1 / |POS ∪ NEG|) · (Σ_{t∈POS} d_{i,δ}[sim'_Δ(s,t)] − Σ_{t∈NEG} d_{i,δ}[sim'_Δ(s,t)])

We return d_{i,δ}[J(s, NEG, POS; Δ)] and the computed differences in the saved intermediate values.

Estimating biases. When the distributional proximities in J(s, NEG, POS; Δ) are computed using M = BIAS, there is an additional subtlety. We compute BIAS using the biases output by GloVe when trained on the original corpus. Changes to the cooccurrences might affect biases computed on the modified corpus. This effect is likely insignificant for small modifications to the cooccurrences of the existing words. New words introduced as part of the attack do not initially have biases, and, during optimization, one can estimate their post-attack biases using the average biases of the words with the same cooccurrence counts in the existing corpus. In practice, we found that post-retraining BIAS distributional distances closely follow our estimated ones (see Figure A.1–b).

C. Placement strategy (details)

As discussed in Section VII, our attack involves adding two types of sequences to the corpus.

First-order sequences. For each t ∈ POS, to increase sim₁ by the required amount, we add sequences with exactly one instance each of s and t, until the number of sequences is equal to Δ_t/γ(1), where γ is the cooccurrence-weight function. We could leverage the fact that γ can count multiple cooccurrences for each instance of s, but this has disadvantages. Adding more occurrences of the target word around s is pointless because they would exceed those of s and dominate |Δ|, particularly for pure sim₁ attackers with just one target word. We thus require symmetry between the occurrences of the target and source words.⁵

⁵ This strategy might be good when using sim₁₊₂ or when |POS ∪ NEG| > 1, because occurrences of s exceed those of t to begin with, but only under the assumption that adding many cooccurrences of the target word with itself does not impede the attack. In this paper, we do not explore further whether the attack can be improved in these specific cases.

Sequences of the form "s t s t ..." could increase the desired extra cooccurrences per added source (or target) word by a factor of 2-3 in our setting (depending on how long the sequences are). Nevertheless, they are clearly anomalous and would result in a fragile attack. For example, in our Twitter corpus, sub-sequences of the form "X Y X Y", where X ≠ Y and X, Y are alphanumeric words, occur in 0.03% of all tweets. Filtering out such rare sub-sequences would eliminate 100% of the attacker's first-order sequences.

We could also merge t's first-order appearances with those of other targets, or inject t into second-order sequences next to s. This would add many cooccurrences of t with words other than s and might decrease both sim₁(s,t) and sim₂(s,t).

Second-order sequences. We add 11-word sequences that include the source word s and 5 additional words on each side of s. Our placement strategy forms these sequences so that the cooccurrences of u ∈ D \ POS with s are approximately equal to those in the change vector, Δ_u. This has a collateral effect of adding cooccurrences of u with words other than s, but it does not affect sim₁(s,t) or sim₂(s,t). Moreover, it is highly
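The constant-time state update behind these differences can be illustrated for the second-order (cosine) proximity. The sketch below is an assumption-laden simplification, not the authors' COMPDIFF: it handles a change to a single entry of M_s, and all function names are hypothetical.

```python
import math

def cosine_after_update(dot_st, ns2, nt2, m_s_i, m_t_i, d_m):
    """Recompute cos(M_s, M_t) in O(1) after entry i of M_s changes by
    d_m, using only the cached dot product and squared norms."""
    new_dot = dot_st + d_m * m_t_i                  # M_s . M_t gains d_m * M_t[i]
    new_ns2 = ns2 + 2.0 * m_s_i * d_m + d_m * d_m   # (m + d)^2 - m^2 = 2*m*d + d^2
    return new_dot / math.sqrt(new_ns2 * nt2)

def cosine_diff(dot_st, ns2, nt2, m_s_i, m_t_i, d_m):
    """d_{i,delta}[sim'_2]: the change in cosine caused by the update."""
    old = dot_st / math.sqrt(ns2 * nt2)
    return cosine_after_update(dot_st, ns2, nt2, m_s_i, m_t_i, d_m) - old
```

Because only the cached scalars (dot product and squared norms) are touched, each candidate edit can be scored without revisiting the full |D|-dimensional rows of M.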
unlikely to affect the distributional proximities of the added words u ∈ POS ∪ {s} with other words since, in practice, every such word is added at most a few times.

We verified this using one of our benchmark experiments from Section VIII. For solutions found with sim = SIM₂, M = BIAS, |Δ| = 1250, only about 0.3% of such entries Δ_u were bigger than 20, and, for 99% of them, the change in C_u was less than 1%. We conclude that changes to C_u where u is neither the source nor the target have

because it separately preserves its sim₁ and sim₂ components. We can still easily compute |Δ| via their weighted sum (e.g., divide second-order entries by 4 and first-order entries by 1).

Algorithm 2 Placement into corpus: finding the change set Δ

1: procedure PLACEADDITIONS(vector Δ, word s)
2:   Δ ← ∅
3:   for each t ∈ POS do // First, add first-order sequences
4:     Δ ← Δ ∪ {"s t", ..., "s t"}  (Δ_t/γ(1) copies)
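The first-order part of PLACEADDITIONS reduces to a counting rule: emit "s t" pairs until target t has gained about Δ_t of first-order cooccurrence weight. The sketch below illustrates that rule only; the function name, the dictionary-based Δ, and the integer rounding are assumptions, not the paper's code.

```python
# Sketch of Algorithm 2's first-order step (hypothetical names).
# gamma_1 is gamma(1), the cooccurrence weight of two adjacent words.

def first_order_sequences(s, pos_targets, delta, gamma_1):
    """Emit 'ated "s t" sequences so that each target t gains about
    delta[t] first-order cooccurrence weight with the source s."""
    corpus_additions = []
    for t in pos_targets:
        n = round(delta[t] / gamma_1)   # number of "s t" sequences to add
        corpus_additions.extend([f"{s} {t}"] * n)
    return corpus_additions
```

Each emitted sequence contains exactly one instance of s and one of t, matching the symmetry requirement discussed above.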
Figure A.1: Comparing the proxy distances used by the attacker with the post-attack distances in the corpus, for the words in Ω_benchmark, using GloVe-paper/Wikipedia, M = BIAS, sim = sim₁₊₂, maxΔ = 1250. (a) Post-placement distributional proximities; (b) post-retraining distributional proximities; (c) final embedding proximities.