A Structural Probe For Finding Syntax in Word Representations
dependency parse tree in its contextual word representations – a structural hypothesis. Under a reasonable definition, to embed a graph is to learn a vector representation of each node such that geometry in the vector space—distances and norms—approximates geometry in the graph (Hamilton et al., 2017). Intuitively, why do parse tree distances and depths matter to syntax? The distance metric—the path length between each pair of words—recovers the tree T simply by identifying that nodes u, v with distance d_T(u, v) = 1 are neighbors. The node with greater norm—depth in the tree—is the child. Beyond this identity, the distance metric explains hierarchical behavior. For example, the ability to perform the classic hierarchy test of subject-verb number agreement (Linzen et al., 2016) in the presence of "attractors" can be explained as the verb (V) being closer in the tree to its subject (S) than to any of the attractor nouns:

[Diagram: a dependency tree over the schematic sentence "S ... A1 ... A2 ... V ...", in which the verb V is nearer in the tree to its subject S than to the attractors A1 and A2.]

Intuitively, if a neural network embeds parse trees, it likely will not use its entire representation space to do so, since it needs to encode many kinds of information. Our probe learns a linear transformation of a word representation space such that the transformed space embeds parse trees across all sentences. This can be interpreted as finding the part of the representation space that is used to encode syntax; equivalently, it is finding the distance on the original space that best fits the tree metrics.

2.1 The structural probe

In this section we provide a description of our proposed structural probe, first discussing the distance formulation. Let M be a model that takes in a sequence of n words w_{1:n}^\ell and produces a sequence of vector representations h_{1:n}^\ell, where \ell identifies the sentence. Starting with the dot product, recall that we can define a family of inner products, h^T A h, parameterized by any positive semi-definite, symmetric matrix A \in S^{m \times m}_{+}. Equivalently, we can view this as specifying a linear transformation B \in R^{k \times m}, such that A = B^T B. The inner product is then (Bh)^T (Bh), the norm of h once transformed by B. Every inner product corresponds to a distance metric. Thus, our family of squared distances is defined as:

    d_B(h_i^\ell, h_j^\ell)^2 = \left( B(h_i^\ell - h_j^\ell) \right)^T \left( B(h_i^\ell - h_j^\ell) \right)    (1)

where i, j index the words in the sentence.^2 The parameters of our probe are exactly the matrix B, which we train to recreate the tree distance between all pairs of words (w_i^\ell, w_j^\ell) in all sentences T^\ell in the training set of a parsed corpus. Specifically, we approximate through gradient descent:

    \min_B \sum_{\ell} \frac{1}{|s^\ell|^2} \sum_{i,j} \left| d_{T^\ell}(w_i^\ell, w_j^\ell) - d_B(h_i^\ell, h_j^\ell)^2 \right|

where |s^\ell| is the length of the sentence; we normalize by the square since each sentence has |s^\ell|^2 word pairs.

2.2 Properties of the structural probe

Because our structural probe defines a valid distance metric, we get a few nice properties for free. The simplest is that distances are guaranteed non-negative and symmetric, which fits our probing task. Perhaps most importantly, the probe tests the concrete claim that there exists an inner product on the representation space whose squared distance—a global property of the space—encodes syntax tree distance. This means that the model not only encodes which word is governed by which other word, but each word's proximity to every other word in the syntax tree.^3 This is a claim about the structure of the representation space, akin to the claim that analogies are encoded as vector offsets in uncontextualized word embeddings (Mikolov et al., 2013). One benefit of this is the ability to query the nature of this structure: for example, the dimensionality of the transformed space (§4.1).

2.3 Tree depth structural probes

The second tree property we consider is the parse depth ||w_i|| of a word w_i, defined as the number of edges in the parse tree between w_i and the root of the tree. This property is naturally represented as a norm – it imposes a total order on the words in the sentence. We wish to probe to see if there exists a squared norm on the word representation space that encodes this property.

^2 As noted in Eqn 1, in practice, we find that approximating the parse tree distances and norms with the squared vector distances and norms consistently performs better. Because a distance metric and its square encode exactly the same parse trees, we use the squared distance throughout this paper. Also, strictly, since A is not positive definite, the inner product is indefinite, and the distance a pseudometric. Further discussion can be found in our appendix.

^3 Probing for distance instead of headedness also helps avoid somewhat arbitrary decisions regarding PP headedness, the DP hypothesis, and auxiliaries, letting the representation "disagree" on these while still encoding roughly the same global structure. See Section 5 for more discussion.
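For concreteness, the following is a minimal PyTorch sketch of the distance and depth probes defined above; the class names, initialization scale, and per-sentence data handling are illustrative assumptions, not a description of the original implementation.

```python
import torch
import torch.nn as nn

class DistanceProbe(nn.Module):
    """Squared distance probe: d_B(h_i, h_j)^2 = (B(h_i - h_j))^T (B(h_i - h_j))."""
    def __init__(self, model_dim: int, probe_rank: int):
        super().__init__()
        # B in R^{k x m}; the implied bilinear form A = B^T B is positive semi-definite.
        self.B = nn.Parameter(0.01 * torch.randn(probe_rank, model_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (n, m) contextual vectors for one sentence of length n.
        t = h @ self.B.T                              # (n, k) transformed vectors
        diffs = t.unsqueeze(1) - t.unsqueeze(0)       # (n, n, k) pairwise differences
        return (diffs ** 2).sum(-1)                   # (n, n) predicted squared distances

class DepthProbe(nn.Module):
    """Squared norm probe: ||h_i||_B^2 = (B h_i)^T (B h_i), recreating parse depth."""
    def __init__(self, model_dim: int, probe_rank: int):
        super().__init__()
        self.B = nn.Parameter(0.01 * torch.randn(probe_rank, model_dim))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return ((h @ self.B.T) ** 2).sum(-1)          # (n,) predicted squared norms

def distance_loss(pred_sq_dists: torch.Tensor, gold_dists: torch.Tensor) -> torch.Tensor:
    # L1 loss over all word pairs, normalized by |s|^2 as in the objective above.
    n = pred_sq_dists.shape[0]
    return torch.abs(pred_sq_dists - gold_dists).sum() / (n ** 2)
```

Training then minimizes this per-sentence loss (and an analogous per-word L1 loss for depths) summed over sentences, following the recipe detailed in Appendix A.2.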
BERTlarge K, where K indexes the hidden layer of the corresponding model. All ELMo and BERT-large layers are dimensionality 1024; BERT-base layers are dimensionality 768.

Data We probe models for their ability to capture the Stanford Dependencies formalism (de Marneffe et al., 2006), claiming that capturing most aspects of the formalism implies an understanding of English syntactic structure. To this end, we obtain fixed word representations for sentences of the parsing train/dev/test splits of the Penn Treebank (Marcus et al., 1993), with no pre-processing.^4

Baselines Our baselines should encode features useful for training a parser, but not be capable of parsing themselves, to provide points of comparison against ELMo and BERT. They are as follows:

                 Distance          Depth
Method         UUAS    DSpr.    Root%   NSpr.
Linear         48.9    0.58      2.9    0.27
ELMo0          26.8    0.44     54.3    0.56
Decay0         51.7    0.61     54.3    0.56
Proj0          59.8    0.73     64.4    0.75
ELMo1          77.0    0.83     86.5    0.87
BERTbase7      79.8    0.85     88.0    0.87
BERTlarge15    82.5    0.86     89.4    0.88
BERTlarge16    81.7    0.87     90.1    0.89

Table 1: Results of structural probes on the PTB WSJ test set; baselines in the top half, models hypothesized to encode syntax in the bottom half. For the distance probes, we show the Undirected Unlabeled Attachment Score (UUAS) as well as the average Spearman correlation of true to predicted distances, DSpr. For the norm probes, we show the root prediction accuracy and the average Spearman correlation of true to predicted norms, NSpr.
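As a companion to the data setup above, the following sketch shows how the gold tree distances and parse depths that the probes regress onto could be computed from a dependency parse. The head-index convention (1-indexed heads, 0 for the root, as in CoNLL-style output) and the function names are our own illustration.

```python
from collections import deque

def tree_distances(heads):
    """Pairwise tree distances (path lengths) for one sentence, given 1-indexed
    head indices with 0 marking the root."""
    n = len(heads)
    adj = [[] for _ in range(n)]          # undirected adjacency over token positions
    for i, h in enumerate(heads):
        if h > 0:
            adj[i].append(h - 1)
            adj[h - 1].append(i)
    dists = [[0] * n for _ in range(n)]
    for start in range(n):                # BFS from every word
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, d = queue.popleft()
            dists[start][node] = d
            for nbr in adj[node]:
                if nbr not in seen:
                    seen.add(nbr)
                    queue.append((nbr, d + 1))
    return dists

def tree_depths(heads):
    """Parse depth of each word: the number of edges between it and the root."""
    def depth(i):
        return 0 if heads[i] == 0 else 1 + depth(heads[i] - 1)
    return [depth(i) for i in range(len(heads))]
```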
[Figure omitted: dependency trees extracted by BERTlarge16, ELMo1, and Proj0 over the sentence "The complex financing plan in the S+L bailout law includes raising $ 30 billion from debt issued by the newly created RTC ."]

Figure 2: Minimum spanning trees resultant from predicted squared distances on BERTlarge16 and ELMo1 compared to the best baseline, Proj0. Black edges are the gold parse, above each sentence; blue are BERTlarge16, red are ELMo1, and purple are Proj0.
3.2 Tree depth evaluation metrics

We evaluate models on their ability to recreate the order of words specified by their depth in the parse tree. We report the Spearman correlation between the true depth ordering and the predicted ordering, averaging first between sentences of the same length, and then across sentence lengths 5–50,^5 as the "norm Spearman (NSpr.)". We also evaluate models' ability to identify the root of the sentence as the least deep, as the "root%".^6

Figure 3: Parse tree depth according to the gold tree (black, circle) and the norm probes (squared) on ELMo1 (red, triangle) and BERTlarge16 (blue, square).

4 Results

We report the results of parse distance probes and parse depth probes in Table 1. We first confirm that our probe can't simply "learn to parse" on top of any informative representation, unlike parser-based probes (Peters et al., 2018b). In particular, ELMo0 and Decay0 fail to substantially outperform a right-branching-tree oracle that encodes the linear sequence of words. Proj0, which has all of the representational capacity of ELMo1 but none of the training, performs the best among the baselines. Upon inspection, we found that our probe on Proj0 improves over the linear hypothesis with mostly simple deviations from linearity, as visualized in Figure 2.

We find surprisingly robust syntax embedded in each of ELMo and BERT according to our probes. Figure 2 shows the surprising extent to which a minimum spanning tree on predicted distances recovers the dependency parse structure in both ELMo and BERT. As we note, however, the distance metric itself is a global notion; all pairs of words are trained to know their distance – not just which word is their head; Figure 4 demonstrates the rich structure of the true parse distance metric recovered by the predicted distances. Figure 3 demonstrates the surprising extent to which the depth in the tree is encoded by vector norm after the probe transformation. Between models, we find consistently that BERTlarge performs better than BERTbase, which performs better than ELMo.^7 We also find, as in Peters et al. (2018b), a clear difference in syntactic information between layers; Figure 1 reports the performance of probes trained on each layer of each system.

^5 The 5–50 range is chosen to avoid simple short sentences as well as sentences so long as to be rare in the test data.
^6 In UUAS and "root%" evaluations, we ignore all punctuation tokens, as is standard.
^7 It is worthwhile to note that our hypotheses were developed while analyzing LSTM models like ELMo, and applied without modification to the self-attention based BERT models.
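To ground these evaluation metrics, here is a rough sketch of how UUAS and the Spearman-based scores could be computed from a probe's predictions, assuming NumPy and SciPy. The per-sentence simplification (the paper averages first within and then across sentence lengths 5–50), the omission of punctuation filtering, and the function names are our own.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.stats import spearmanr

def uuas(pred_sq_dists, gold_edges):
    """Undirected Unlabeled Attachment Score for one sentence: the fraction of gold
    parse edges recovered by a minimum spanning tree over predicted (squared)
    distances. gold_edges: iterable of (i, j) word-index pairs (punctuation removed)."""
    mst = minimum_spanning_tree(np.asarray(pred_sq_dists, dtype=float))
    predicted = {frozenset(edge) for edge in zip(*mst.nonzero())}
    gold = {frozenset(edge) for edge in gold_edges}
    return len(predicted & gold) / len(gold)

def distance_spearman(pred_sq_dists, gold_dists):
    """Per-sentence DSpr.: average over words of the Spearman correlation between a
    word's predicted and true distances to every other word."""
    rhos = [spearmanr(p, g)[0] for p, g in zip(pred_sq_dists, gold_dists)]
    return float(np.mean(rhos))

def norm_spearman(pred_sq_norms, gold_depths):
    """Per-sentence NSpr.: Spearman correlation of predicted squared norms with depths."""
    return spearmanr(pred_sq_norms, gold_depths)[0]

def root_accuracy(pred_sq_norms_per_sentence, gold_root_indices):
    """root%: fraction of sentences whose least-deep word under the probe is the root."""
    correct = sum(int(np.argmin(norms) == root)
                  for norms, root in zip(pred_sq_norms_per_sentence, gold_root_indices))
    return correct / len(gold_root_indices)
```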
Figure 4: (left) Matrix representing gold tree distances between all pairs of words in a sentence, whose linear order runs top-to-bottom and left-to-right. Darker colors indicate close words, lighter indicate far. (right) The same distances as embedded by BERTlarge16 (squared). More detailed graphs available in the Appendix.

Figure 5: Parse distance tree reconstruction accuracy when the linear transformation is constrained to varying maximum dimensionality.

4.1 Analysis of linear transformation rank

With the result that there exists syntax-encoding vector structure in both ELMo and BERT, it is natural to ask how compactly syntactic information is encoded in the vector space. We find that in both models, the effective rank of the linear transformation required is surprisingly low. We train structural probes of varying k, that is, specifying a matrix B ∈ R^{k×m} such that the transformed vector Bh is in R^k. As shown in Figure 5, increasing k beyond 64 or 128 leads to no further gains in parsing accuracy. Intuitively, larger k means a more expressive probing model, and a larger fraction of the representational capacity of the model being devoted to syntax. We also note with curiosity that the three models we consider all seem to require transformations of approximately the same rank; we leave exploration of this to exciting future work.
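A sketch of this rank sweep follows, reusing the hypothetical DistanceProbe, distance_loss, and uuas helpers from the earlier sketches. The rank grid, epoch budget, and data format (one (hidden states, gold distances, gold edges) triple per sentence) are our own assumptions, not the experimental settings of the paper.

```python
import torch

def rank_sweep(train_sents, dev_sents, model_dim=1024,
               ranks=(2, 4, 8, 16, 32, 64, 128, 256, 512), epochs=5):
    """Train a distance probe at each candidate rank k and report dev UUAS."""
    results = {}
    for k in ranks:
        probe = DistanceProbe(model_dim, k)
        optimizer = torch.optim.Adam(probe.parameters(), lr=0.001)
        for _ in range(epochs):
            for h, gold_dists, _ in train_sents:
                loss = distance_loss(probe(h), gold_dists)   # L1, normalized by |s|^2
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        with torch.no_grad():
            scores = [uuas(probe(h).numpy(), edges) for h, _, edges in dev_sents]
        results[k] = sum(scores) / len(scores)
    return results
```

Plotting the resulting scores against k would recreate a Figure 5-style curve.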
5 Discussion & Conclusion

Recent work has analyzed model behavior to determine if a model understands hierarchy and other linguistic phenomena (Linzen, 2018; Gulordava et al., 2018; Kuncoro et al., 2018; Linzen and Leonard, 2018; van Schijndel and Linzen, 2018; Tang et al., 2018; Futrell et al., 2018). Our work extends the literature on linguistic probes, found at least in (Peters et al., 2018b; Belinkov et al., 2017; Blevins et al., 2018; Hupkes et al., 2018). Conneau et al. (2018) present a task similar to our parse depth prediction, where a sentence representation vector is asked to classify the maximum parse depth ever achieved in the sentence. Tenney et al. (2019) evaluate a complementary task to ours, training probes to learn the labels on structures when the gold structures themselves are given. Peters et al. (2018b) evaluate the extent to which constituency trees can be extracted from hidden states, but use a probe of considerable complexity, making less concrete hypotheses about how the information is encoded.

Probing tasks and limitations Our reviewers rightfully noted that one might just probe for headedness, as in a bilinear graph-based dependency parser. More broadly, a deep neural network probe of some kind is almost certain to achieve higher parsing accuracies than our method. Our task and probe construction are designed not to test for some notion of syntactic knowledge broadly construed, but instead for an extremely strict notion where all pairs of words know their syntactic distance, and this information is a global structural property of the vector space. However, this study is limited to testing that hypothesis, and we foresee future probing tasks which make other tradeoffs between probe complexity, probe task, and hypotheses tested.

In summary, through our structural probes we demonstrate that the structure of syntax trees emerges through properly defined distances and norms on two deep models' word representation spaces. Beyond this actionable insight, we suggest our probe may be useful for testing the existence of different types of graph structures on any neural representation of language, an exciting avenue for future work.

6 Acknowledgements

We would like to acknowledge Urvashi Khandelwal and Tatsunori B. Hashimoto for formative advice in early stages, Abigail See, Kevin Clark, Siva Reddy, Drew A. Hudson, and Roma Patel for helpful comments on drafts, and Percy Liang for guidance on rank experiments. We would also like to thank the reviewers, whose helpful comments led to increased clarity and extra experiments. This research was supported by a gift from Tencent.
References

Yossi Adi, Einat Kermany, Yonatan Belinkov, Ofer Lavi, and Yoav Goldberg. 2017. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations.

Yonatan Belinkov, Nadir Durrani, Fahim Dalvi, Hassan Sajjad, and James Glass. 2017. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 861–872. Association for Computational Linguistics.

Terra Blevins, Omer Levy, and Luke Zettlemoyer. 2018. Deep RNNs encode soft hierarchical syntax. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 14–19. Association for Computational Linguistics.

Alexis Conneau, Germán Kruszewski, Guillaume Lample, Loïc Barrault, and Marco Baroni. 2018. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2126–2136. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers). Association for Computational Linguistics.

Richard Futrell, Ethan Wilcox, Takashi Morita, and Roger Levy. 2018. RNNs as psycholinguistic subjects: Syntactic state and grammatical dependency. arXiv preprint arXiv:1809.01329.

Kristina Gulordava, Piotr Bojanowski, Edouard Grave, Tal Linzen, and Marco Baroni. 2018. Colorless green recurrent networks dream hierarchically. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 1195–1205.

William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Representation learning on graphs: Methods and applications. arXiv preprint arXiv:1709.05584.

Dieuwke Hupkes, Sara Veldhoen, and Willem Zuidema. 2018. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61:907–926.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. 2018. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), volume 1, pages 1426–1436.

Tal Linzen. 2018. What can linguistics and deep learning contribute to each other? arXiv preprint arXiv:1809.04179.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.

Tal Linzen and Brian Leonard. 2018. Distinct patterns of syntactic agreement errors in recurrent networks and humans. In Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 692–697. Cognitive Science Society, Austin, TX.

Mitchell P Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Marie-Catherine de Marneffe, Bill MacCartney, and Christopher D. Manning. 2006. Generating typed dependency parses from phrase structure parses. In LREC.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Graham Neubig, Chris Dyer, Yoav Goldberg, Austin Matthews, Waleed Ammar, Antonios Anastasopoulos, Miguel Ballesteros, David Chiang, Daniel Clothiaux, Trevor Cohn, Kevin Duh, Manaal Faruqui, Cynthia Gan, Dan Garrette, Yangfeng Ji, Lingpeng Kong, Adhiguna Kuncoro, Gaurav Kumar, Chaitanya Malaviya, Paul Michel, Yusuke Oda, Matthew Richardson, Naomi Saphra, Swabha Swayamdipta, and Pengcheng Yin. 2017. DyNet: The dynamic neural network toolkit. arXiv preprint arXiv:1701.03980.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. 2017. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018a. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pages 2227–2237.

Matthew Peters, Mark Neumann, Luke Zettlemoyer, and Wen-tau Yih. 2018b. Dissecting contextual word embeddings: Architecture and representation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1499–1509.

Marten van Schijndel and Tal Linzen. 2018. Modeling garden path effects without explicit hierarchical syntax. In Tim Rogers, Marina Rau, Jerry Zhu, and Chuck Kalish, editors, Proceedings of the 40th Annual Conference of the Cognitive Science Society, pages 2600–2605. Cognitive Science Society, Austin, TX.

Gongbo Tang, Mathias Müller, Annette Rios, and Rico Sennrich. 2018. Why self-attention? A targeted evaluation of neural machine translation architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4263–4272. Association for Computational Linguistics.

Ian Tenney, Patrick Xia, Berlin Chen, Alex Wang, Adam Poliak, R Thomas McCoy, Najoung Kim, Benjamin Van Durme, Sam Bowman, Dipanjan Das, and Ellie Pavlick. 2019. What do you learn from context? Probing for sentence structure in contextualized word representations. In International Conference on Learning Representations.

A Appendix: Implementation Details

A.1 Squared L2 distance vs. L2 distance

In Section 2.2, we note that while our distance probe specifies a distance metric, we recreate it with a squared vector distance; likewise, while our norm probe specifies a norm, we recreate it with a squared vector norm. We found this to be important for recreating the exact parse tree distances and norms. This does mean that in order to recreate the exact scalar values of the parse tree structures, we need to use the squared vector quantities. This may be problematic, since for example squared distance doesn't obey the triangle inequality, whereas a valid distance metric does.

However, we note that in terms of the graph structures encoded, distance and squared distance are identical. After training with the squared vector distance, we can square-root the predicted quantities to achieve a distance metric. The relative ordering between all pairs of words will be unchanged; the same tree is encoded either way, and none of our quantitative metrics will change; however, the exact scalar distances will differ from the true tree distances.

This raises a question for future work as to why squared distance works better than distance, and beyond that, what function of the L2 distance (or perhaps, what Lp distance) would best encode tree distances. It is possibly related to the gradients of the loss with respect to the function of the distance, as well as how amenable the function is to matching the exact scalar values of the tree distances.

A.2 Probe training details

All probes are trained to minimize L1 loss of the predicted squared distance or squared norm w.r.t. the true distance or norm. Optimization is performed using the Adam optimizer (Kingma and Ba, 2014) initialized at learning rate 0.001, with β1 = 0.9, β2 = 0.999, ε = 10^-8. Probes are trained to convergence, up to 40 epochs, with a batch size of 20. For depth probes, loss is summed over all predictions in a sentence, normalized by the length of the sentence, and then summed over all sentences in a batch before a gradient step is taken. For distance probes, normalization is performed by the square of the length of the sentence. At each epoch, dev loss is computed; if the dev loss does not achieve a new minimum, the optimizer is reset (no momentum terms are kept) with an initial learning rate multiplied by 0.1. All models were implemented in both DyNet (Neubig et al., 2017) and in PyTorch (Paszke et al., 2017).
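The following is a condensed sketch of this training recipe, assuming PyTorch and the hypothetical probe classes sketched in Section 2; data loading, the batching of exactly 20 sentences, and the convergence check are simplified.

```python
import torch

def train_probe(probe, train_batches, dev_batches, epochs=40, lr=0.001):
    """Train a distance or depth probe with L1 loss on squared predictions,
    per-sentence normalization, and learning-rate decay with optimizer resets
    when dev loss stops improving, roughly following Appendix A.2."""
    optimizer = torch.optim.Adam(probe.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    best_dev_loss = float("inf")

    def sentence_loss(h, gold):
        pred = probe(h)                              # squared distances (n, n) or norms (n,)
        n = h.shape[0]
        norm = n ** 2 if pred.dim() == 2 else n      # |s|^2 for distances, |s| for depths
        return torch.abs(pred - gold).sum() / norm

    for epoch in range(epochs):
        for batch in train_batches:                  # each batch: list of (h, gold) pairs
            loss = sum(sentence_loss(h, gold) for h, gold in batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        with torch.no_grad():
            dev_loss = sum(sentence_loss(h, gold).item()
                           for batch in dev_batches for h, gold in batch)
        if dev_loss < best_dev_loss:
            best_dev_loss = dev_loss
        else:
            # Reset the optimizer (dropping momentum) and shrink the learning rate by 10x.
            lr *= 0.1
            optimizer = torch.optim.Adam(probe.parameters(), lr=lr, betas=(0.9, 0.999), eps=1e-8)
    return probe
```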
B Appendix: Extra examples

In this section we provide additional examples of model behavior, including baseline model behavior, across parse distance prediction and parse depth prediction. In Figure 6 and Figure 7, we present a single sentence with dependency trees as extracted from many of our models and baselines. In Figure 8, we present tree depth predictions on a complex sentence from ELMo1, BERTlarge16, and our baseline Proj0. Finally, in Figure 9, we present gold parse distances and predicted squared parse distances between all pairs of words in large, high-resolution format.
[Figure omitted: dependency trees extracted by ELMo0, Decay0, Proj0, ELMo1, BERTbase7, and BERTlarge16 over the sentence "Another $ 20 billion would be raised through Treasury bonds , which pay lower interest rates ."]

Figure 6: A relatively simple sentence, and the minimum spanning trees extracted by various models.
[Figure omitted: dependency trees extracted by ELMo0, Decay0, Proj0, ELMo1, BERTbase7, and BERTlarge16 over the sentence "But the RTC also requires " working " capital to maintain the bad assets of thrifts that are sold , until the assets can be sold separately ."]

Figure 7: A complex sentence, and the minimum spanning trees extracted by various models.
Figure 8: A long sentence with gold dependency parse depths (grey) and dependency parse depths (squared) as extracted by BERTlarge16 (blue, top), ELMo1 (red, middle), and the baseline Proj0 (purple, bottom). Note the non-standard subject, "that he was the A's winningest pitcher".
Figure 9: The distance graphs defined by the gold parse distances on a sentence (below) and as extracted from BERTlarge16 (above, squared).