A Structural Probe For Finding Syntax in Word Representations
dependency parse tree in its contextual word representations – a structural hypothesis. Under a reasonable definition, to embed a graph is to learn a vector representation of each node such that geometry in the vector space—distances and norms—approximates geometry in the graph (Hamilton et al., 2017). Intuitively, why do parse tree distances and depths matter to syntax? The distance metric—the path length between each pair of words—recovers the tree T simply by identifying that nodes u, v with distance d_T(u, v) = 1 are neighbors. The node with greater norm—depth in the tree—is the child. Beyond this identity, the distance metric explains hierarchical behavior. For example, the ability to perform the classic hierarchy test of subject-verb number agreement (Linzen et al., 2016) in the presence of "attractors" can be explained as the verb (V) being closer in the tree to its subject (S) than to any of the attractor nouns:

[Diagram: a dependency tree over the schematic sentence "S ... A1 ... A2 ... V ...", in which the verb V is nearer in the tree to its subject S than to the attractors A1 and A2.]

Intuitively, if a neural network embeds parse trees, it likely will not use its entire representation space to do so, since it needs to encode many kinds of information. Our probe learns a linear transformation of a word representation space such that the transformed space embeds parse trees across all sentences. This can be interpreted as finding the part of the representation space that is used to encode syntax; equivalently, it is finding the distance on the original space that best fits the tree metrics.

2.1 The structural probe

In this section we provide a description of our proposed structural probe, first discussing the distance formulation. Let M be a model that takes in a sequence of n words w_{1:n}^\ell and produces a sequence of vector representations h_{1:n}^\ell, where \ell identifies the sentence. Starting with the dot product, recall that we can define a family of inner products, h^T A h, parameterized by any positive semi-definite, symmetric matrix A \in S^{m \times m}_{+}. Equivalently, we can view this as specifying a linear transformation B \in R^{k \times m}, such that A = B^T B. The inner product is then (Bh)^T (Bh), the norm of h once transformed by B. Every inner product corresponds to a distance metric. Thus, our family of squared distances is defined as:

    d_B(h_i^\ell, h_j^\ell)^2 = \left( B(h_i^\ell - h_j^\ell) \right)^T \left( B(h_i^\ell - h_j^\ell) \right)    (1)

where i, j index the words in the sentence.^2 The parameters of our probe are exactly the matrix B, which we train to recreate the tree distance between all pairs of words (w_i^\ell, w_j^\ell) in all sentences T^\ell in the training set of a parsed corpus. Specifically, we approximate through gradient descent:

    \min_B \sum_{\ell} \frac{1}{|s^\ell|^2} \sum_{i,j} \left| d_{T^\ell}(w_i^\ell, w_j^\ell) - d_B(h_i^\ell, h_j^\ell)^2 \right|

where |s^\ell| is the length of the sentence; we normalize by the square since each sentence has |s^\ell|^2 word pairs.

2.2 Properties of the structural probe

Because our structural probe defines a valid distance metric, we get a few nice properties for free. The simplest is that distances are guaranteed non-negative and symmetric, which fits our probing task. Perhaps most importantly, the probe tests the concrete claim that there exists an inner product on the representation space whose squared distance—a global property of the space—encodes syntax tree distance. This means that the model not only encodes which word is governed by which other word, but each word's proximity to every other word in the syntax tree.^3 This is a claim about the structure of the representation space, akin to the claim that analogies are encoded as vector offsets in uncontextualized word embeddings (Mikolov et al., 2013). One benefit of this is the ability to query the nature of this structure: for example, the dimensionality of the transformed space (§4.1).

2.3 Tree depth structural probes

The second tree property we consider is the parse depth ||w_i|| of a word w_i, defined as the number of edges in the parse tree between w_i and the root of the tree. This property is naturally represented as a norm – it imposes a total order on the words in the sentence. We wish to probe to see if there exists a squared norm on the word representation space that encodes this property.

^2 As noted in Eqn 1, in practice, we find that approximating the parse tree distances and norms with the squared vector distances and norms consistently performs better. Because a distance metric and its square encode exactly the same parse trees, we use the squared distance throughout this paper. Also, strictly, since A is not positive definite, the inner product is indefinite, and the distance a pseudometric. Further discussion can be found in our appendix.

^3 Probing for distance instead of headedness also helps avoid somewhat arbitrary decisions regarding PP headedness, the DP hypothesis, and auxiliaries, letting the representation "disagree" on these while still encoding roughly the same global structure. See Section 5 for more discussion.
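For concreteness, the following is a minimal PyTorch sketch of the distance and depth probes defined above; the class names, initialization scale, and per-sentence data handling are illustrative assumptions, not a description of the original implementation.

```python
import torch
import torch.nn as nn

class DistanceProbe(nn.Module):
    """Squared distance probe: d_B(h_i, h_j)^2 = (B(h_i - h_j))^T (B(h_i - h_j))."""
    def __init__(self, model_dim: int, probe_rank: int):
        super().__init__()
        # B in R^{k x m}; the implied bilinear form A = B^T B is positive semi-definite.
        self.B = nn.Parameter(0.01 * torch.randn(probe_rank, model_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n, m) contextual vectors for one sentence of length n.
        t = h @ self.B.T                              # (n, k) transformed vectors
        diffs = t.unsqueeze(1) - t.unsqueeze(0)       # (n, n, k) pairwise differences
        return (diffs ** 2).sum(-1)                   # (n, n) predicted squared distances

class DepthProbe(nn.Module):
    """Squared norm probe: ||h_i||_B^2 = (B h_i)^T (B h_i), recreating parse depth."""
    def __init__(self, model_dim: int, probe_rank: int):
        super().__init__()
        self.B = nn.Parameter(0.01 * torch.randn(probe_rank, model_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return ((h @ self.B.T) ** 2).sum(-1)          # (n,) predicted squared norms

def distance_loss(pred_sq_dists: torch.Tensor, gold_dists: torch.Tensor) -> torch.Tensor:
    # L1 loss over all word pairs, normalized by |s|^2 as in the objective above.
    n = pred_sq_dists.shape[0]
    return torch.abs(pred_sq_dists - gold_dists).sum() / (n ** 2)
```

Training then minimizes this per-sentence loss (and an analogous per-word L1 loss for depths) summed over sentences, following the recipe detailed in Appendix A.2.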
BERTlarge K, where K indexes the hidden layer of the corresponding model. All ELMo and BERT-large layers are dimensionality 1024; BERT-base layers are dimensionality 768.

Data We probe models for their ability to capture the Stanford Dependencies formalism (de Marneffe et al., 2006), claiming that capturing most aspects of the formalism implies an understanding of English syntactic structure. To this end, we obtain fixed word representations for sentences of the parsing train/dev/test splits of the Penn Treebank (Marcus et al., 1993), with no pre-processing.^4

Baselines Our baselines should encode features useful for training a parser, but not be capable of parsing themselves, to provide points of comparison against ELMo and BERT. They are as follows:

                 Distance          Depth
Method         UUAS    DSpr.    Root%   NSpr.
Linear         48.9    0.58      2.9    0.27
ELMo0          26.8    0.44     54.3    0.56
Decay0         51.7    0.61     54.3    0.56
Proj0          59.8    0.73     64.4    0.75
ELMo1          77.0    0.83     86.5    0.87
BERTbase7      79.8    0.85     88.0    0.87
BERTlarge15    82.5    0.86     89.4    0.88
BERTlarge16    81.7    0.87     90.1    0.89

Table 1: Results of structural probes on the PTB WSJ test set; baselines in the top half, models hypothesized to encode syntax in the bottom half. For the distance probes, we show the Undirected Unlabeled Attachment Score (UUAS) as well as the average Spearman correlation of true to predicted distances, DSpr. For the norm probes, we show the root prediction accuracy and the average Spearman correlation of true to predicted norms, NSpr.
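As a companion to the data setup above, the following sketch shows how the gold tree distances and parse depths that the probes regress onto could be computed from a dependency parse. The head-index convention (1-indexed heads, 0 for the root, as in CoNLL-style output) and the function names are our own illustration.

```python
from collections import deque

def tree_distances(heads):
    """Pairwise tree distances (path lengths) for one sentence, given 1-indexed
    head indices with 0 marking the root."""
    n = len(heads)
    adj = [[] for _ in range(n)]          # undirected adjacency over token positions
    for i, h in enumerate(heads):
        if h > 0:
            adj[i].append(h - 1)
            adj[h - 1].append(i)
    dists = [[0] * n for _ in range(n)]
    for start in range(n):                # BFS from every word
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, d = queue.popleft()
            dists[start][node] = d
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append((nbr, d + 1))
    return dists

def tree_depths(heads):
    """Parse depth of each word: the number of edges between it and the root."""
    def depth(i):
        return 0 if heads[i] == 0 else 1 + depth(heads[i] - 1)
    return [depth(i) for i in range(len(heads))]
```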
[Figure omitted: dependency trees extracted by BERTlarge16, ELMo1, and Proj0 over the sentence "The complex financing plan in the S+L bailout law includes raising $ 30 billion from debt issued by the newly created RTC ."]

Figure 2: Minimum spanning trees resultant from predicted squared distances on BERTlarge16 and ELMo1 compared to the best baseline, Proj0. Black edges are the gold parse, above each sentence; blue are BERTlarge16, red are ELMo1, and purple are Proj0.
3.2 Tree depth evaluation metrics

We evaluate models on their ability to recreate the order of words specified by their depth in the parse tree. We report the Spearman correlation between the true depth ordering and the predicted ordering, averaging first between sentences of the same length, and then across sentence lengths 5–50,^5 as the "norm Spearman (NSpr.)". We also evaluate models' ability to identify the root of the sentence as the least deep, as the "root%".^6

Figure 3: Parse tree depth according to the gold tree (black, circle) and the norm probes (squared) on ELMo1 (red, triangle) and BERTlarge16 (blue, square).

4 Results

We report the results of parse distance probes and parse depth probes in Table 1. We first confirm that our probe can't simply "learn to parse" on top of any informative representation, unlike parser-based probes (Peters et al., 2018b). In particular, ELMo0 and Decay0 fail to substantially outperform a right-branching-tree oracle that encodes the linear sequence of words. Proj0, which has all of the representational capacity of ELMo1 but none of the training, performs the best among the baselines. Upon inspection, we found that our probe on Proj0 improves over the linear hypothesis with mostly simple deviations from linearity, as visualized in Figure 2.

We find surprisingly robust syntax embedded in each of ELMo and BERT according to our probes. Figure 2 shows the surprising extent to which a minimum spanning tree on predicted distances recovers the dependency parse structure in both ELMo and BERT. As we note, however, the distance metric itself is a global notion; all pairs of words are trained to know their distance – not just which word is their head; Figure 4 demonstrates the rich structure of the true parse distance metric recovered by the predicted distances. Figure 3 demonstrates the surprising extent to which the depth in the tree is encoded by vector norm after the probe transformation. Between models, we find consistently that BERTlarge performs better than BERTbase, which performs better than ELMo.^7 We also find, as in Peters et al. (2018b), a clear difference in syntactic information between layers; Figure 1 reports the performance of probes trained on each layer of each system.

^5 The 5–50 range is chosen to avoid simple short sentences as well as sentences so long as to be rare in the test data.
^6 In UUAS and "root%" evaluations, we ignore all punctuation tokens, as is standard.
^7 It is worthwhile to note that our hypotheses were developed while analyzing LSTM models like ELMo, and applied without modification to the self-attention based BERT models.
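To ground these evaluation metrics, here is a rough sketch of how UUAS and the Spearman-based scores could be computed from a probe's predictions, assuming NumPy and SciPy. The per-sentence simplification (the paper averages first within and then across sentence lengths 5–50), the omission of punctuation filtering, and the function names are our own.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import spearmanr

def uuas(pred_sq_dists, gold_edges):
    """Undirected Unlabeled Attachment Score for one sentence: the fraction of gold
    parse edges recovered by a minimum spanning tree over predicted (squared)
    distances. gold_edges: iterable of (i, j) word-index pairs (punctuation removed)."""
    mst = minimum_spanning_tree(np.asarray(pred_sq_dists, dtype=float))
    predicted = {frozenset(edge) for edge in zip(*mst.nonzero())}
    gold = {frozenset(edge) for edge in gold_edges}
    return len(predicted & gold) / len(gold)

def distance_spearman(pred_sq_dists, gold_dists):
    """Per-sentence DSpr.: average over words of the Spearman correlation between a
    word's predicted and true distances to every other word."""
    rhos = [spearmanr(p, g)[0] for p, g in zip(pred_sq_dists, gold_dists)]
    return float(np.mean(rhos))

def norm_spearman(pred_sq_norms, gold_depths):
    """Per-sentence NSpr.: Spearman correlation of predicted squared norms with depths."""
    return spearmanr(pred_sq_norms, gold_depths)[0]

def root_accuracy(pred_sq_norms_per_sentence, gold_root_indices):
    """root%: fraction of sentences whose least-deep word under the probe is the root."""
    correct = sum(int(np.argmin(norms) == root)
                  for norms, root in zip(pred_sq_norms_per_sentence, gold_root_indices))
    return correct / len(gold_root_indices)
```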
Figure 4: (left) Matrix representing gold tree distances between all pairs of words in a sentence, whose linear order runs top-to-bottom and left-to-right. Darker colors indicate close words, lighter indicate far. (right) The same distances as embedded by BERTlarge16 (squared). More detailed graphs available in the Appendix.

Figure 5: Parse distance tree reconstruction accuracy when the linear transformation is constrained to varying maximum dimensionality.

4.1 Analysis of linear transformation rank

With the result that there exists syntax-encoding vector structure in both ELMo and BERT, it is natural to ask how compactly syntactic information is encoded in the vector space. We find that in both models, the effective rank of the linear transformation required is surprisingly low. We train structural probes of varying k, that is, specifying a matrix B ∈ R^{k×m} such that the transformed vector Bh is in R^k. As shown in Figure 5, increasing k beyond 64 or 128 leads to no further gains in parsing accuracy. Intuitively, larger k means a more expressive probing model, and a larger fraction of the representational capacity of the model being devoted to syntax. We also note with curiosity that the three models we consider all seem to require transformations of approximately the same rank; we leave exploration of this to exciting future work.
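A sketch of this rank sweep follows, reusing the hypothetical DistanceProbe, distance_loss, and uuas helpers from the earlier sketches. The rank grid, epoch budget, and data format (one (hidden states, gold distances, gold edges) triple per sentence) are our own assumptions, not the experimental settings of the paper.

```python
import torch

def rank_sweep(train_sents, dev_sents, model_dim=1024,
               ranks=(2, 4, 8, 16, 32, 64, 128, 256, 512), epochs=5):
    """Train a distance probe at each candidate rank k and report dev UUAS."""
    results = {}
    for k in ranks:
        probe = DistanceProbe(model_dim, k)
        optimizer = torch.optim.Adam(probe.parameters(), lr=0.001)
        for _ in range(epochs):
            for h, gold_dists, _ in train_sents:
                loss = distance_loss(probe(h), gold_dists)   # L1, normalized by |s|^2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        with torch.no_grad():
            scores = [uuas(probe(h).numpy(), edges) for h, _, edges in dev_sents]
        results[k] = sum(scores) / len(scores)
    return results
```

Plotting the resulting scores against k would recreate a Figure 5-style curve.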
5 Discussion & Conclusion

Recent work has analyzed model behavior to determine if a model understands hierarchy and other linguistic phenomena (Linzen, 2018; Gulordava et al., 2018; Kuncoro et al., 2018; Linzen and Leonard, 2018; van Schijndel and Linzen, 2018; Tang et al., 2018; Futrell et al., 2018). Our work extends the literature on linguistic probes, found at least in (Peters et al., 2018b; Belinkov et al., 2017; Blevins et al., 2018; Hupkes et al., 2018). Conneau et al. (2018) present a task similar to our parse depth prediction, where a sentence representation vector is asked to classify the maximum parse depth ever achieved in the sentence. Tenney et al. (2019) evaluate a complementary task to ours, training probes to learn the labels on structures when the gold structures themselves are given. Peters et al. (2018b) evaluate the extent to which constituency trees can be extracted from hidden states, but use a probe of considerable complexity, making less concrete hypotheses about how the information is encoded.

Probing tasks and limitations Our reviewers rightfully noted that one might just probe for headedness, as in a bilinear graph-based dependency parser. More broadly, a deep neural network probe of some kind is almost certain to achieve higher parsing accuracies than our method. Our task and probe construction are designed not to test for some notion of syntactic knowledge broadly construed, but instead for an extremely strict notion where all pairs of words know their syntactic distance, and this information is a global structural property of the vector space. However, this study is limited to testing that hypothesis, and we foresee future probing tasks which make other tradeoffs between probe complexity, probe task, and hypotheses tested.

In summary, through our structural probes we demonstrate that the structure of syntax trees emerges through properly defined distances and norms on two deep models' word representation spaces. Beyond this actionable insight, we suggest our probe may be useful for testing the existence of different types of graph structures on any neural representation of language, an exciting avenue for future work.

6 Acknowledgements

We would like to acknowledge Urvashi Khandelwal and Tatsunori B. Hashimoto for formative advice in early stages, Abigail See, Kevin Clark, Siva Reddy, Drew A. Hudson, and Roma Patel for helpful comments on drafts, and Percy Liang for guidance on rank experiments. We would also like to thank the reviewers, whose helpful comments led to increased clarity and extra experiments. This research was supported by a gift from Tencent.
References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19. Association for Computational Linguistics.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.

Richard Futrell, Ethan Wilcox, Takashi Morita, and Roger Levy. 2018. RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint arXiv:1809.01329.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1195–1205.

William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1426–1436.

Tal Linzen. 2018. What can linguistics and deep learning contribute to each other? arXiv preprint arXiv:1809.04179.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Tal Linzen and Brian Leonard. 2018. Distinct patterns of syntactic agreement errors in recurrent networks and humans. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 692–697. Cognitive Science Society, Austin, TX.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Marten van Schijndel and Tal Linzen. 2018. Modeling garden path effects without explicit hierarchical syntax. In Tim Rogers, Marina Rau, Jerry Zhu, and Chuck Kalish, editors, Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 2600–2605. Cognitive Science Society, Austin, TX.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4263–4272. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

A Appendix: Implementation Details

A.1 Squared L2 distance vs. L2 distance

In Section 2.2, we note that while our distance probe specifies a distance metric, we recreate it with a squared vector distance; likewise, while our norm probe specifies a norm, we recreate it with a squared vector norm. We found this to be important for recreating the exact parse tree distances and norms. This does mean that in order to recreate the exact scalar values of the parse tree structures, we need to use the squared vector quantities. This may be problematic, since for example squared distance doesn't obey the triangle inequality, whereas a valid distance metric does.

However, we note that in terms of the graph structures encoded, distance and squared distance are identical. After training with the squared vector distance, we can square-root the predicted quantities to achieve a distance metric. The relative ordering between all pairs of words will be unchanged; the same tree is encoded either way, and none of our quantitative metrics will change; however, the exact scalar distances will differ from the true tree distances.

This raises a question for future work as to why squared distance works better than distance, and beyond that, what function of the L2 distance (or perhaps, what Lp distance) would best encode tree distances. It is possibly related to the gradients of the loss with respect to the function of the distance, as well as how amenable the function is to matching the exact scalar values of the tree distances.

A.2 Probe training details

All probes are trained to minimize L1 loss of the predicted squared distance or squared norm w.r.t. the true distance or norm. Optimization is performed using the Adam optimizer (Kingma and Ba, 2014) initialized at learning rate 0.001, with β1 = 0.9, β2 = 0.999, ε = 10^-8. Probes are trained to convergence, up to 40 epochs, with a batch size of 20. For depth probes, loss is summed over all predictions in a sentence, normalized by the length of the sentence, and then summed over all sentences in a batch before a gradient step is taken. For distance probes, normalization is performed by the square of the length of the sentence. At each epoch, dev loss is computed; if the dev loss does not achieve a new minimum, the optimizer is reset (no momentum terms are kept) with an initial learning rate multiplied by 0.1. All models were implemented in both DyNet (Neubig et al., 2017) and in PyTorch (Paszke et al., 2017).
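The following is a condensed sketch of this training recipe, assuming PyTorch and the hypothetical probe classes sketched in Section 2; data loading, the batching of exactly 20 sentences, and the convergence check are simplified.

```python
import torch

def train_probe(probe, train_batches, dev_batches, epochs=40, lr=0.001):
    """Train a distance or depth probe with L1 loss on squared predictions,
    per-sentence normalization, and learning-rate decay with optimizer resets
    when dev loss stops improving, roughly following Appendix A.2."""
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    best_dev_loss = float("inf")

    def sentence_loss(h, gold):
        pred = probe(h)                              # squared distances (n, n) or norms (n,)
        n = h.shape[0]
        norm = n ** 2 if pred.dim() == 2 else n      # |s|^2 for distances, |s| for depths
        return torch.abs(pred - gold).sum() / norm

    for epoch in range(epochs):
        for batch in train_batches:                  # each batch: list of (h, gold) pairs
            loss = sum(sentence_loss(h, gold) for h, gold in batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        with torch.no_grad():
            dev_loss = sum(sentence_loss(h, gold).item()
                           for batch in dev_batches for h, gold in batch)
        if dev_loss < best_dev_loss:
            best_dev_loss = dev_loss
        else:
            # Reset the optimizer (dropping momentum) and shrink the learning rate by 10x.
            lr *= 0.1
            optimizer = torch.optim.Adam(probe.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    return probe
```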
B Appendix: Extra examples

In this section we provide additional examples of model behavior, including baseline model behavior, across parse distance prediction and parse depth prediction. In Figure 6 and Figure 7, we present a single sentence with dependency trees as extracted from many of our models and baselines. In Figure 8, we present tree depth predictions on a complex sentence from ELMo1, BERTlarge16, and our baseline Proj0. Finally, in Figure 9, we present gold parse distances and predicted squared parse distances between all pairs of words in large, high-resolution format.
[Figure omitted: dependency trees extracted by ELMo0, Decay0, Proj0, ELMo1, BERTbase7, and BERTlarge16 over the sentence "Another $ 20 billion would be raised through Treasury bonds , which pay lower interest rates ."]

Figure 6: A relatively simple sentence, and the minimum spanning trees extracted by various models.
[Figure omitted: dependency trees extracted by ELMo0, Decay0, Proj0, ELMo1, BERTbase7, and BERTlarge16 over the sentence "But the RTC also requires " working " capital to maintain the bad assets of thrifts that are sold , until the assets can be sold separately ."]

Figure 7: A complex sentence, and the minimum spanning trees extracted by various models.
Figure 8: A long sentence with gold dependency parse depths (grey) and dependency parse depths (squared) as extracted by BERTlarge16 (blue, top), ELMo1 (red, middle), and the baseline Proj0 (purple, bottom). Note the non-standard subject, "that he was the A's winningest pitcher".
Figure 9: The distance graphs defined by the gold parse distances on a sentence (below) and as extracted from BERTlarge16 (above, squared).