Linguistically-Informed Self-Attention For Semantic Role Labeling

The document presents a new neural network model called linguistically-informed self-attention (LISA) for semantic role labeling (SRL). LISA combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection, and SRL. Unlike prior models, LISA requires only raw text as input and encodes the sequence once to perform all tasks simultaneously. It incorporates syntax by training one attention head to attend to syntactic parents for each token. In experiments on standard SRL benchmarks, LISA achieves new state-of-the-art performance.

Linguistically-Informed Self-Attention for Semantic Role Labeling

Emma Strubell¹, Patrick Verga¹, Daniel Andor², David Weiss² and Andrew McCallum¹

¹ College of Information and Computer Sciences, University of Massachusetts Amherst
{strubell, pat, mccallum}@cs.umass.edu

² Google AI Language, New York, NY
{andor, djweiss}@google.com

arXiv:1804.08199v3 [cs.CL] 12 Nov 2018

Abstract

Current state-of-the-art semantic role labeling (SRL) uses a deep neural network with no explicit linguistic features. However, prior work has shown that gold syntax trees can dramatically improve SRL decoding, suggesting the possibility of increased accuracy from explicit modeling of syntax. In this work, we present linguistically-informed self-attention (LISA): a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection and SRL. Unlike previous models which require significant pre-processing to prepare linguistic features, LISA can incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection and role labeling for all predicates. Syntax is incorporated by training one attention head to attend to syntactic parents for each token. Moreover, if a high-quality syntactic parse is already available, it can be beneficially injected at test time without re-training our SRL model. In experiments on CoNLL-2005 SRL, LISA achieves new state-of-the-art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 F1 absolute higher than the previous state-of-the-art on newswire and more than 3.5 F1 on out-of-domain data, nearly 10% reduction in error. On CoNLL-2012 English SRL we also show an improvement of more than 2.5 F1. LISA also out-performs the state-of-the-art with contextually-encoded (ELMo) word representations, by nearly 1.0 F1 on news and more than 2.0 F1 on out-of-domain text.

1 Introduction

Semantic role labeling (SRL) extracts a high-level representation of meaning from a sentence, labeling e.g. who did what to whom. Explicit representations of such semantic information have been shown to improve results in challenging downstream tasks such as dialog systems (Tur et al., 2005; Chen et al., 2013), machine reading (Berant et al., 2014; Wang et al., 2015) and translation (Liu and Gildea, 2010; Bazrafshan and Gildea, 2013).

Though syntax was long considered an obvious prerequisite for SRL systems (Levin, 1993; Punyakanok et al., 2008), recently deep neural network architectures have surpassed syntactically-informed models (Zhou and Xu, 2015; Marcheggiani et al., 2017; He et al., 2017; Tan et al., 2018; He et al., 2018), achieving state-of-the-art SRL performance with no explicit modeling of syntax. An additional benefit of these end-to-end models is that they require just raw tokens and (usually) detected predicates as input, whereas richer linguistic features typically require extraction by an auxiliary pipeline of models.

Still, recent work (Roth and Lapata, 2016; He et al., 2017; Marcheggiani and Titov, 2017) indicates that neural network models could see even higher accuracy gains by leveraging syntactic information rather than ignoring it. He et al. (2017) indicate that many of the errors made by a syntax-free neural network on SRL are tied to certain syntactic confusions such as prepositional phrase attachment, and show that while constrained inference using a relatively low-accuracy predicted parse can provide small improvements in SRL accuracy, providing a gold-quality parse leads to substantial gains. Marcheggiani and Titov (2017) incorporate syntax from a high-quality parser (Kiperwasser and Goldberg, 2016) using graph convolutional neural networks (Kipf and Welling, 2017), but like He et al. (2017) they attain only small increases over a model with no syntactic parse, and even perform worse than a syntax-free model on out-of-domain data. These works suggest that though syntax has the potential to improve neural network SRL models, we have not
yet designed an architecture which maximizes the benefits of auxiliary syntactic information.

In response, we propose linguistically-informed self-attention (LISA): a model that combines multi-task learning (Caruana, 1993) with stacked layers of multi-head self-attention (Vaswani et al., 2017); the model is trained to: (1) jointly predict parts of speech and predicates; (2) perform parsing; and (3) attend to syntactic parse parents, while (4) assigning semantic role labels. Whereas prior work typically requires separate models to provide linguistic analysis, including most syntax-free neural models which still rely on external predicate detection, our model is truly end-to-end: earlier layers are trained to predict prerequisite parts-of-speech and predicates, the latter of which are supplied to later layers for scoring. Though prior work re-encodes each sentence to predict each desired task and again with respect to each predicate to perform SRL, we more efficiently encode each sentence only once, predict its predicates, part-of-speech tags and labeled syntactic parse, then predict the semantic roles for all predicates in the sentence in parallel. The model is trained such that, as syntactic parsing models improve, providing high-quality parses at test time will improve its performance, allowing the model to leverage updated parsing models without requiring re-training.

In experiments on the CoNLL-2005 and CoNLL-2012 datasets we show that our linguistically-informed models out-perform the syntax-free state-of-the-art. On CoNLL-2005 with predicted predicates and standard word embeddings, our single model out-performs the previous state-of-the-art model on the WSJ test set by 2.5 F1 points absolute. On the challenging out-of-domain Brown test set, our model improves substantially over the previous state-of-the-art by more than 3.5 F1, a nearly 10% reduction in error. On CoNLL-2012, our model gains more than 2.5 F1 absolute over the previous state-of-the-art. Our models also show improvements when using contextually-encoded word representations (Peters et al., 2018), obtaining nearly 1.0 F1 higher than the state-of-the-art on CoNLL-2005 news and more than 2.0 F1 improvement on out-of-domain text.¹

¹ Our implementation in TensorFlow (Abadi et al., 2015) is available at: http://github.com/strubell/LISA

[Figure 1: architecture diagram. The input tokens "I saw the sloth climbing" feed stacked "Multi-head self-attention + FF" layers (layer r through layer J), with one "Syntactically-informed self-attention + FF" layer p among them. Layer r yields the joint tags PRP VBP:PRED DT NN VBG:PRED; feed-forward projections s^pred and s^role feed a bilinear scorer that outputs the BIO tags B-ARG0 B-V B-ARG1 I-ARG1 I-ARG1 for the predicate "saw" and O O B-ARG0 I-ARG0 B-V for the predicate "climbing".]

Figure 1: Word embeddings are input to J layers of multi-head self-attention. In layer p one attention head is trained to attend to parse parents (Figure 2). Layer r is input for a joint predicate/POS classifier. Representations from layer r corresponding to predicted predicates are passed to a bilinear operation scoring distinct predicate and role representations to produce per-token SRL predictions with respect to each predicted predicate.

2 Model

Our goal is to design an efficient neural network model which makes use of linguistic information as effectively as possible in order to perform end-to-end SRL. LISA achieves this by combining: (1) a new technique of supervising neural attention to predict syntactic dependencies with (2) multi-task learning across four related tasks.

Figure 1 depicts the overall architecture of our model. The basis for our model is the Transformer encoder introduced by Vaswani et al. (2017): we transform word embeddings into contextually-encoded token representations using stacked multi-head self-attention and feed-forward layers (§2.1).

To incorporate syntax, one self-attention head is trained to attend to each token's syntactic parent, allowing the model to use this attention head as an oracle for syntactic dependencies. We introduce this syntactically-informed self-attention (Figure 2) in more detail in §2.2.

[Figure 2: diagram. For the query token sloth at layer i (t = 3), the attention weights $A_0^i[t]$, $A_1^i[t]$, $A_2^i[t]$ and $A_{parse}[t]$ over "I saw the sloth climbing" are multiplied with the head values (MatMul: $A_h^i V_h^i$) to give $M_0^i[t]$, $M_1^i[t]$, $M_2^i[t]$ and $M_{parse}[t]$, which pass through Concat + FF to produce the representation of sloth at layer i+1.]

Figure 2: Syntactically-informed self-attention for the query word sloth. Attention weights $A_{parse}$ heavily weight the token's syntactic governor, saw, in a weighted average over the token values $V_{parse}$. The other attention heads act as usual, and the attended representations from all heads are concatenated and projected through a feed-forward layer to produce the syntactically-informed representation for sloth.

Our model is designed for the more realistic setting in which gold predicates are not provided at test-time. Our model predicts predicates and integrates part-of-speech (POS) information into earlier layers by re-purposing representations closer to the input to predict predicate and POS tags using hard parameter sharing (§2.3). We simplify optimization and benefit from shared statistical strength derived from highly correlated POS and predicates by treating tagging and predicate detection as a single task, performing multi-class classification into the joint Cartesian product space of POS and predicate labels.

Though typical models, which re-encode the sentence for each predicate, can simplify SRL to token-wise tagging, our joint model requires a different approach to classify roles with respect to each predicate. Contextually encoded tokens are projected to distinct predicate and role embeddings (§2.4), and each predicted predicate is scored with the sequence's role representations using a bilinear model (Eqn. 6), producing per-label scores for BIO-encoded semantic role labels for each token and each semantic frame.

The model is trained end-to-end by maximum likelihood using stochastic gradient descent (§2.5).

2.1 Self-attention token encoder

The basis for our model is a multi-head self-attention token encoder, recently shown to achieve state-of-the-art performance on SRL (Tan et al., 2018), and which provides a natural mechanism for incorporating syntax, as described in §2.2. Our implementation replicates Vaswani et al. (2017).

The input to the network is a sequence $X$ of $T$ token representations $x_t$. In the standard setting these token representations are initialized to pre-trained word embeddings, but we also experiment with supplying pre-trained ELMo representations combined with task-specific learned parameters, which have been shown to substantially improve the performance of other SRL models (Peters et al., 2018). For experiments with gold predicates, we concatenate a predicate indicator embedding $p_t$ following previous work (He et al., 2017).

We project² these input embeddings to a representation that is the same size as the output of the self-attention layers. We then add a positional encoding vector computed as a deterministic sinusoidal function of $t$, since the self-attention has no innate notion of token position.

² All linear projections include bias terms, which we omit in this exposition for the sake of clarity.

We feed this token representation as input to a series of $J$ residual multi-head self-attention layers with feed-forward connections. Denoting the $j$th self-attention layer as $T^{(j)}(\cdot)$, the output of that layer $s_t^{(j)}$, and $LN(\cdot)$ layer normalization, the following recurrence applied to initial input $c_t^{(p)}$:

$$s_t^{(j)} = LN(s_t^{(j-1)} + T^{(j)}(s_t^{(j-1)})) \qquad (1)$$

gives our final token representations $s_t^{(j)}$. Each $T^{(j)}(\cdot)$ consists of: (a) multi-head self-attention and (b) a feed-forward projection.

The multi-head self-attention consists of $H$ attention heads, each of which learns a distinct attention function to attend to all of the tokens in the sequence. This self-attention is performed for each token for each head, and the results of the $H$ self-attentions are concatenated to form the final self-attended representation for each token.

Specifically, consider the matrix $S^{(j-1)}$ of $T$ token representations at layer $j-1$. For each attention head $h$, we project this matrix into distinct key, query and value representations $K_h^{(j)}$, $Q_h^{(j)}$ and $V_h^{(j)}$ of dimensions $T \times d_k$, $T \times d_q$ and $T \times d_v$, respectively. We can then multiply $Q_h^{(j)}$ by the transpose of $K_h^{(j)}$ to obtain a $T \times T$ matrix of attention weights $A_h^{(j)}$ between each pair of tokens in the sentence. Following Vaswani et al. (2017) we perform scaled dot-product attention: We scale the weights by the inverse square root of their embedding dimension
and normalize with the softmax function to produce a distinct distribution for each token over all the tokens in the sentence:

$$A_h^{(j)} = \mathrm{softmax}(d_k^{-0.5} Q_h^{(j)} {K_h^{(j)}}^T) \qquad (2)$$

These attention weights are then multiplied by $V_h^{(j)}$ for each token to obtain the self-attended token representations $M_h^{(j)}$:

$$M_h^{(j)} = A_h^{(j)} V_h^{(j)} \qquad (3)$$

Row $t$ of $M_h^{(j)}$, the self-attended representation for token $t$ at layer $j$, is thus the weighted sum with respect to $t$ (with weights given by $A_h^{(j)}$) over the token representations in $V_h^{(j)}$.

The outputs of all attention heads for each token are concatenated, and this representation is passed to the feed-forward layer, which consists of two linear projections each followed by leaky ReLU activations (Maas et al., 2013). We add the output of the feed-forward to the initial representation and apply layer normalization to give the final output of self-attention layer $j$, as in Eqn. 1.

2.2 Syntactically-informed self-attention

Typically, neural attention mechanisms are left on their own to learn to attend to relevant inputs. Instead, we propose training the self-attention to attend to specific tokens corresponding to the syntactic structure of the sentence as a mechanism for passing linguistic knowledge to later layers.

Specifically, we replace one attention head with the deep bi-affine model of Dozat and Manning (2017), trained to predict syntactic dependencies. Let $A_{parse}$ be the parse attention weights, at layer $i$. Its input is the matrix of token representations $S^{(i-1)}$. As with the other attention heads, we project $S^{(i-1)}$ into key, query and value representations, denoted $K_{parse}$, $Q_{parse}$, $V_{parse}$. Here the key and query projections correspond to parent and dependent representations of the tokens, and we allow their dimensions to differ from the rest of the attention heads to more closely follow the implementation of Dozat and Manning (2017). Unlike the other attention heads which use a dot product to score key-query pairs, we score the compatibility between $K_{parse}$ and $Q_{parse}$ using a bi-affine operator $U_{heads}$ to obtain attention weights:

$$A_{parse} = \mathrm{softmax}(Q_{parse} U_{heads} K_{parse}^T) \qquad (4)$$

These attention weights are used to compose a weighted average of the value representations $V_{parse}$ as in the other attention heads.

We apply auxiliary supervision at this attention head to encourage it to attend to each token's parent in a syntactic dependency tree, and to encode information about the token's dependency label. Denoting the attention weight from token $t$ to a candidate head $q$ as $A_{parse}[t, q]$, we model the probability of token $t$ having parent $q$ as:

$$P(q = \mathrm{head}(t) \mid X) = A_{parse}[t, q] \qquad (5)$$

using the attention weights $A_{parse}[t]$ as the distribution over possible heads for token $t$. We define the root token as having a self-loop. This attention head thus emits a directed graph³ where each token's parent is the token to which the attention $A_{parse}$ assigns the highest weight.

³ Usually the head emits a tree, but we do not enforce it here.

We also predict dependency labels using per-class bi-affine operations between parent and dependent representations $K_{parse}$ and $Q_{parse}$ to produce per-label scores, with locally normalized probabilities over dependency labels $y_t^{dep}$ given by the softmax function. We refer the reader to Dozat and Manning (2017) for more details.

This attention head now becomes an oracle for syntax, denoted $P$, providing a dependency parse to downstream layers. This model not only predicts its own dependency arcs, but allows for the injection of auxiliary parse information at test time by simply setting $A_{parse}$ to the parse parents produced by e.g. a state-of-the-art parser. In this way, our model can benefit from improved, external parsing models without re-training. Unlike typical multi-task models, ours maintains the ability to leverage external syntactic information.

2.3 Multi-task learning

We also share the parameters of lower layers in our model to predict POS tags and predicates. Following He et al. (2017), we focus on the end-to-end setting, where predicates must be predicted on-the-fly. Since we also train our model to predict syntactic dependencies, it is beneficial to give the model knowledge of POS information. While much previous work employs a pipelined approach to both POS tagging for dependency parsing and predicate detection for SRL, we take a multi-task learning (MTL) approach (Caruana,
1993), sharing the parameters of earlier layers in our SRL model with a joint POS and predicate detection objective. Since POS is a strong predictor of predicates⁴ and the complexity of training a multi-task model increases with the number of tasks, we combine POS tagging and predicate detection into a joint label space: For each POS tag TAG which is observed co-occurring with a predicate, we add a label of the form TAG:PREDICATE.

⁴ All predicates in CoNLL-2005 are verbs; CoNLL-2012 includes some nominal predicates.

Specifically, we feed the representation $s_t^{(r)}$ from a layer $r$ preceding the syntactically-informed layer $p$ to a linear classifier to produce per-class scores $r_t$ for token $t$. We compute locally-normalized probabilities using the softmax function: $P(y_t^{prp} \mid X) \propto \exp(r_t)$, where $y_t^{prp}$ is a label in the joint space.

2.4 Predicting semantic roles

Our final goal is to predict semantic roles for each predicate in the sequence. We score each predicate against each token in the sequence using a bilinear operation, producing per-label scores for each token for each predicate, with predicates and syntax determined by oracles $V$ and $P$.

First, we project each token representation $s_t^{(J)}$ to a predicate-specific representation $s_t^{pred}$ and a role-specific representation $s_t^{role}$. We then provide these representations to a bilinear transformation $U$ for scoring. So, the role label scores $s_{ft}$ for the token at index $t$ with respect to the predicate at index $f$ (i.e. token $t$ and frame $f$) are given by:

$$s_{ft} = (s_f^{pred})^T U s_t^{role} \qquad (6)$$

which can be computed in parallel across all semantic frames in an entire minibatch. We calculate a locally normalized distribution over role labels for token $t$ in frame $f$ using the softmax function: $P(y_{ft}^{role} \mid P, V, X) \propto \exp(s_{ft})$.

At test time, we perform constrained decoding using the Viterbi algorithm to emit valid sequences of BIO tags, using unary scores $s_{ft}$ and the transition probabilities given by the training data.

2.5 Training

We maximize the sum of the likelihoods of the individual tasks. In order to maximize our model's ability to leverage syntax, during training we clamp $P$ to the gold parse ($P_G$) and $V$ to gold predicates $V_G$ when passing parse and predicate representations to later layers, whereas syntactic head prediction and joint predicate/POS prediction are conditioned only on the input sequence $X$. The overall objective is thus:

$$\frac{1}{T} \sum_{t=1}^{T} \Big[ \sum_{f=1}^{F} \log P(y_{ft}^{role} \mid P_G, V_G, X) + \log P(y_t^{prp} \mid X) + \lambda_1 \log P(\mathrm{head}(t) \mid X) + \lambda_2 \log P(y_t^{dep} \mid P_G, X) \Big] \qquad (7)$$

where $\lambda_1$ and $\lambda_2$ are penalties on the syntactic attention loss.

We train the model using Nadam (Dozat, 2016) SGD combined with the learning rate schedule in Vaswani et al. (2017). In addition to MTL, we regularize our model using dropout (Srivastava et al., 2014). We use gradient clipping to avoid exploding gradients (Bengio et al., 1994; Pascanu et al., 2013). Additional details on optimization and hyperparameters are included in Appendix A.

3 Related work

Early approaches to SRL (Pradhan et al., 2005; Surdeanu et al., 2007; Johansson and Nugues, 2008; Toutanova et al., 2008) focused on developing rich sets of linguistic features as input to a linear model, often combined with complex constrained inference e.g. with an ILP (Punyakanok et al., 2008). Täckström et al. (2015) showed that constraints could be enforced more efficiently using a clever dynamic program for exact inference. Sutton and McCallum (2005) modeled syntactic parsing and SRL jointly, and Lewis et al. (2015) jointly modeled SRL and CCG parsing.

Collobert et al. (2011) were among the first to use a neural network model for SRL, a CNN over word embeddings which failed to out-perform non-neural models. FitzGerald et al. (2015) successfully employed neural networks by embedding lexicalized features and providing them as factors in the model of Täckström et al. (2015).

More recent neural models are syntax-free. Zhou and Xu (2015), Marcheggiani et al. (2017) and He et al. (2017) all use variants of deep LSTMs with constrained decoding, while Tan et al. (2018) apply self-attention to obtain state-of-the-art SRL with gold predicates. Like this work, He et al. (2017) present end-to-end experiments, predicting predicates using an LSTM, and He et al.
(2018) jointly predict SRL spans and predicates in a model based on that of Lee et al. (2017), obtaining state-of-the-art predicted predicate SRL. Concurrent to this work, Peters et al. (2018) and He et al. (2018) report significant gains on PropBank SRL by training a wide LSTM language model and using a task-specific transformation of its hidden representations (ELMo) as a deep, and computationally expensive, alternative to typical word embeddings. We find that LISA obtains further accuracy increases when provided with ELMo word representations, especially on out-of-domain data.

Some work has incorporated syntax into neural models for SRL. Roth and Lapata (2016) incorporate syntax by embedding dependency paths, and similarly Marcheggiani and Titov (2017) encode syntax using a graph CNN over a predicted syntax tree, out-performing models without syntax on CoNLL-2009. These works are limited to incorporating partial dependency paths between tokens whereas our technique incorporates the entire parse. Additionally, Marcheggiani and Titov (2017) report that their model does not out-perform syntax-free models on out-of-domain data, a setting in which our technique excels.

MTL (Caruana, 1993) is popular in NLP, and others have proposed MTL models which incorporate subsets of the tasks we do (Collobert et al., 2011; Zhang and Weiss, 2016; Hashimoto et al., 2017; Peng et al., 2017; Swayamdipta et al., 2017), and we build off work that investigates where and when to combine different tasks to achieve the best results (Søgaard and Goldberg, 2016; Bingel and Søgaard, 2017; Alonso and Plank, 2017). Our specific method of incorporating supervision into self-attention is most similar to the concurrent work of Liu and Lapata (2018), who use edge marginals produced by the matrix-tree algorithm as attention weights for document classification and natural language inference.

The question of training on gold versus predicted labels is closely related to learning to search (Daumé III et al., 2009; Ross et al., 2011; Chang et al., 2015) and scheduled sampling (Bengio et al., 2015), with applications in NLP to sequence labeling and transition-based parsing (Choi and Palmer, 2011; Goldberg and Nivre, 2012; Ballesteros et al., 2016). Our approach may be interpreted as an extension of teacher forcing (Williams and Zipser, 1989) to MTL. We leave exploration of more advanced scheduled sampling techniques to future work.

4 Experimental results

We present results on the CoNLL-2005 shared task (Carreras and Màrquez, 2005) and the CoNLL-2012 English subset of OntoNotes 5.0 (Pradhan et al., 2013), achieving state-of-the-art results for a single model with predicted predicates on both corpora. We experiment with both standard pre-trained GloVe word embeddings (Pennington et al., 2014) and pre-trained ELMo representations with fine-tuned task-specific parameters (Peters et al., 2018) in order to best compare to prior work. Hyperparameters that resulted in the best performance on the validation set were selected via a small grid search, and models were trained for a maximum of 4 days on one TitanX GPU using early stopping on the validation set. We convert constituencies to dependencies using the Stanford head rules v3.5 (de Marneffe and Manning, 2008). A detailed description of hyperparameter settings and data pre-processing can be found in Appendix A.

We compare our LISA models to four strong baselines: For experiments using predicted predicates, we compare to He et al. (2018) and the ensemble model (PoE) from He et al. (2017), as well as a version of our own self-attention model which does not incorporate syntactic information (SA). To compare to more prior work, we present additional results on CoNLL-2005 with models given gold predicates at test time. In these experiments we also compare to Tan et al. (2018), the previous state-of-the-art SRL model using gold predicates and standard embeddings.

We demonstrate that our models benefit from injecting state-of-the-art predicted parses at test time (+D&M) by fixing the attention to parses predicted by Dozat and Manning (2017), the winner of the 2017 CoNLL shared task (Zeman et al., 2017) which we re-train using ELMo embeddings. In all cases, using these parses at test time improves performance.

We also evaluate our model using the gold syntactic parse at test time (+Gold), to provide an upper bound for the benefit that syntax could have for SRL using LISA. These experiments show that despite LISA's strong performance, there remains substantial room for improvement. In §4.3 we perform further analysis comparing SRL models using gold and predicted parses.
                         Dev                    WSJ Test               Brown Test
GloVe                    P      R      F1       P      R      F1       P      R      F1
He et al. (2017) PoE     81.8   81.2   81.5     82.0   83.4   82.7     69.7   70.5   70.1
He et al. (2018)         81.3   81.9   81.6     81.2   83.9   82.5     69.7   71.9   70.8
SA                       83.52  81.28  82.39    84.17  83.28  83.72    72.98  70.1   71.51
LISA                     83.1   81.39  82.24    84.07  83.16  83.61    73.32  70.56  71.91
  +D&M                   84.59  82.59  83.58    85.53  84.45  84.99    75.8   73.54  74.66
  +Gold                  87.91  85.73  86.81    -      -      -        -      -      -

ELMo                     P      R      F1       P      R      F1       P      R      F1
He et al. (2018)         84.9   85.7   85.3     84.8   87.2   86.0     73.9   78.4   76.1
SA                       85.78  84.74  85.26    86.21  85.98  86.09    77.1   75.61  76.35
LISA                     86.07  84.64  85.35    86.69  86.42  86.55    78.95  77.17  78.05
  +D&M                   85.83  84.51  85.17    87.13  86.67  86.90    79.02  77.49  78.25
  +Gold                  88.51  86.77  87.63    -      -      -        -      -      -

Table 1: Precision, recall and F1 on the CoNLL-2005 development and test sets.
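As a reading aid for the table above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes F1 for the GloVe +D&M row on the WSJ test set from its reported precision and recall; `f1_score` is a hypothetical helper name introduced only for this illustration.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both in the same units, e.g. percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# GloVe, LISA +D&M, WSJ test row of Table 1: P = 85.53, R = 84.45.
wsj_f1 = f1_score(85.53, 84.45)
print(round(wsj_f1, 2))  # 84.99
```

The recomputed value agrees with the 84.99 F1 reported in that row.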

WSJ Test             P      R      F1
He et al. (2018)     84.2   83.7   83.9
Tan et al. (2018)    84.5   85.2   84.8
SA                   84.7   84.24  84.47
LISA                 84.72  84.57  84.64
  +D&M               86.02  86.05  86.04

Brown Test           P      R      F1
He et al. (2018)     74.2   73.1   73.7
Tan et al. (2018)    73.5   74.6   74.1
SA                   73.89  72.39  73.13
LISA                 74.77  74.32  74.55
  +D&M               76.65  76.44  76.54

Table 2: Precision, recall and F1 on CoNLL-2005 with gold predicates.

4.1 Semantic role labeling

Table 1 lists precision, recall and F1 on the CoNLL-2005 development and test sets using predicted predicates. For models using GloVe embeddings, our syntax-free SA model already achieves a new state-of-the-art by jointly predicting predicates, POS and SRL. LISA with its own parses performs comparably to SA, but when supplied with D&M parses LISA out-performs the previous state-of-the-art by 2.5 F1 points. On the out-of-domain Brown test set, LISA also performs comparably to its syntax-free counterpart with its own parses, but with D&M parses LISA performs exceptionally well, more than 3.5 F1 points higher than He et al. (2018). Incorporating ELMo embeddings improves all scores. The gap in SRL F1 between models using LISA and D&M parses is smaller due to LISA's improved parsing accuracy (see §4.2), but LISA with D&M parses still achieves the highest F1: nearly 1.0 absolute F1 higher than the previous state-of-the-art on WSJ, and more than 2.0 F1 higher on Brown. In both settings LISA leverages domain-agnostic syntactic information rather than over-fitting to the newswire training data, which leads to high performance even on out-of-domain text.

To compare to more prior work we also evaluate our models in the artificial setting where gold predicates are provided at test time. For fair comparison we use GloVe embeddings, provide predicate indicator embeddings on the input and re-encode the sequence relative to each gold predicate. Here LISA still excels: with D&M parses, LISA out-performs the previous state-of-the-art by more than 2 F1 on both WSJ and Brown.

Table 3 reports precision, recall and F1 on the CoNLL-2012 test set. We observe performance similar to that observed on CoNLL-2005: Using GloVe embeddings our SA baseline already out-performs He et al. (2018) by nearly 1.5 F1. With its own parses, LISA slightly under-performs our syntax-free model, but when provided with stronger D&M parses LISA out-performs the state-of-the-art by more than 2.5 F1. Like CoNLL-2005, ELMo representations improve all models and close the F1 gap between models supplied with LISA and D&M parses. On this dataset ELMo also substantially narrows the
difference between models with and without syntactic information. This
suggests that for this challenging dataset, ELMo already encodes much of
the information available in the D&M parses. Yet, higher accuracy parses
could still yield improvements since providing gold parses increases F1
by 4 points even with ELMo embeddings.

Dev                   P       R       F1
GloVe
  He et al. (2018)    79.2    79.7    79.4
  SA                  82.32   79.76   81.02
  LISA                81.77   79.65   80.70
    +D&M              82.97   81.14   82.05
    +Gold             87.57   85.32   86.43
ELMo
  He et al. (2018)    82.1    84.0    83.0
  SA                  84.35   82.14   83.23
  LISA                84.19   82.56   83.37
    +D&M              84.09   82.65   83.36
    +Gold             88.22   86.53   87.36

Test                  P       R       F1
GloVe
  He et al. (2018)    79.4    80.1    79.8
  SA                  82.55   80.02   81.26
  LISA                81.86   79.56   80.70
    +D&M              83.3    81.38   82.33
ELMo
  He et al. (2018)    81.9    84.0    82.9
  SA                  84.39   82.21   83.28
  LISA                83.97   82.29   83.12
    +D&M              84.14   82.64   83.38

Table 3: Precision, recall and F1 on the CoNLL-2012 development and test
sets. Italics indicate a synthetic upper bound obtained by providing a
gold parse at test time.

4.2 Parsing, POS and predicate detection

We first report the labeled and unlabeled attachment scores (LAS, UAS) of
our parsing models on the CoNLL-2005 and 2012 test sets (Table 4) with
GloVe (G) and ELMo (E) embeddings. D&M achieves the best scores. Still,
LISA’s GloVe UAS is comparable to popular off-the-shelf dependency
parsers such as spaCy,5 and with ELMo embeddings comparable to the
standalone D&M parser. The difference in parse accuracy between LISA_G
and D&M likely explains the large increase in SRL performance we see from
decoding with D&M parses in that setting.

Data       Model    POS     UAS     LAS
WSJ        D&M_E    -       96.48   94.40
           LISA_G   96.92   94.92   91.87
           LISA_E   97.80   96.28   93.65
Brown      D&M_E    -       92.56   88.52
           LISA_G   94.26   90.31   85.82
           LISA_E   95.77   93.36   88.75
CoNLL-12   D&M_E    -       94.99   92.59
           LISA_G   96.81   93.35   90.42
           LISA_E   98.11   94.84   92.23

Table 4: Parsing (labeled and unlabeled attachment) and POS accuracies
attained by the models used in SRL experiments on test datasets.
Subscript G denotes GloVe and E ELMo embeddings.

In Table 5 we present predicate detection precision, recall and F1 on the
CoNLL-2005 and 2012 test sets. SA and LISA with and without ELMo attain
comparable scores so we report only LISA+GloVe. We compare to He et al.
(2017) on CoNLL-2005, the only cited work reporting comparable predicate
detection F1. LISA attains high predicate detection scores, above 97 F1,
on both in-domain datasets, and out-performs He et al. (2017) by 1.5-2 F1
points even on the out-of-domain Brown test set, suggesting that
multi-task learning works well for SRL predicate detection.

           Model              P      R      F1
WSJ        He et al. (2017)   94.5   98.5   96.4
           LISA               98.9   97.9   98.4
Brown      He et al. (2017)   89.3   95.7   92.4
           LISA               95.5   91.9   93.7
CoNLL-12   LISA               99.8   94.7   97.2

Table 5: Predicate detection precision, recall and F1 on CoNLL-2005 and
CoNLL-2012 test sets.

5 spaCy reports 94.48 UAS on WSJ using Stanford dependencies v3.3:
https://spacy.io/usage/facts-figures

4.3 Analysis

First we assess SRL F1 on sentences divided by parse accuracy. Table 6
lists average SRL F1 (across sentences) for the four conditions of LISA
and D&M parses being correct or not (L±, D±). Both parsers are correct on
26% of sentences.
Here there is little difference between any of the models, with LISA
models tending to perform slightly better than SA. Both parsers make
mistakes on the majority of sentences (57%), difficult sentences where SA
also performs the worst. These examples are likely where gold and D&M
parses improve the most over other models in overall F1: Though both
parsers fail to correctly parse the entire sentence, the D&M parser is
less wrong (87.5 vs. 85.7 average LAS), leading to higher SRL F1 by about
1.5 average F1.

             L+/D+   L–/D+   L+/D–   L–/D–
Proportion   26%     12%     4%      56%
SA           79.29   75.14   75.97   75.08
LISA         79.51   74.33   79.69   75.00
  +D&M       79.03   76.96   77.73   76.52
  +Gold      79.61   78.38   81.41   80.47

Table 6: Average SRL F1 on CoNLL-2005 for sentences where LISA (L) and
D&M (D) parses were completely correct (+) or incorrect (–).

Following He et al. (2017), we next apply a series of corrections to
model predictions in order to understand which error types the gold parse
resolves: e.g. Fix Labels fixes labels on spans matching gold boundaries,
and Merge Spans merges adjacent predicted spans into a gold span.6 In
Figure 3 we see that much of the performance gap between the gold and
predicted parses is due to span boundary errors (Merge Spans, Split Spans
and Fix Span Boundary), which supports the hypothesis proposed by He et
al. (2017) that incorporating syntax could be particularly helpful for
resolving these errors. He et al. (2017) also point out that these errors
are due mainly to prepositional phrase (PP) attachment mistakes. We also
find this to be the case: Figure 4 shows a breakdown of split/merge
corrections by phrase type. Though the number of corrections decreases
substantially across phrase types, the proportion of corrections
attributed to PPs remains the same (approx. 50%) even after providing the
correct PP attachment to the model, indicating that PP span boundary
mistakes are a fundamental difficulty for SRL.

[Figure 3 plot: F1 (y-axis) for SA, LISA, +D&M and +Gold after each
correction: Orig., Fix Labels, Move Core Arg., Merge Spans, Split Spans,
Fix Span Boundary, Drop Arg., Add Arg.]

Figure 3: Performance of CoNLL-2005 models after performing corrections
from He et al. (2017).

[Figure 4 plot: % split/merge labels (y-axis) for LISA, +D&M and +Gold by
phrase type: PP, NP, VP, SBAR, ADVP, PRN, Other.]

Figure 4: Percent and count of split/merge corrections performed in
Figure 3, by phrase type.

6 Refer to He et al. (2017) for a detailed explanation of the different
error types.

5 Conclusion

We present linguistically-informed self-attention: a multi-task neural
network model that effectively incorporates rich linguistic information
for semantic role labeling. LISA out-performs the state-of-the-art on two
benchmark SRL datasets, including out-of-domain. Future work will explore
improving LISA’s parsing accuracy, developing better training techniques
and adapting to more tasks.

Acknowledgments

We are grateful to Luheng He for helpful discussions and code, Timothy
Dozat for sharing his code, and to the NLP reading groups at Google and
UMass and the anonymous reviewers for feedback on drafts of this work.
This work was supported in part by an IBM PhD Fellowship Award to E.S.,
in part by the Center for Intelligent Information Retrieval, and in part
by the National Science Foundation under Grant Nos. DMR-1534431 and
IIS-1514053. Any opinions, findings, conclusions or recommendations
expressed in this material are those of the authors and do not
necessarily reflect those of the sponsor.
References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2015. Tensorflow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? semantic sequence prediction under varying data conditions. In EACL.
Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack lstm parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2005–2010.
Marzieh Bazrafshan and Daniel Gildea. 2013. Semantic roles for string to tree machine translation. In ACL.
Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.
Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.
Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Brad Huang, Christopher D. Manning, Abby Vander Linden, Brittany Harding, and Peter Clark. 2014. Modeling biological processes for reading comprehension. In EMNLP.
Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL.
Xavier Carreras and Lluís Màrquez. 2005. Introduction to the conll-2005 shared task: Semantic role labeling. In CoNLL.
Rich Caruana. 1993. Multitask learning: a knowledge-based source of inductive bias. In ICML.
Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In ICML.
Yun-Nung Chen, William Yang Wang, and Alexander I Rudnicky. 2013. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In Proc. of ASRU-IEEE.
Jinho D. Choi and Martha Palmer. 2011. Getting the most out of transition-based dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: short papers, pages 687–692.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.
Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.
Timothy Dozat. 2016. Incorporating nesterov momentum into adam. In ICLR Workshop track.
Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.
Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labeling with neural network factors. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 960–970.
W. N. Francis and H. Kučera. 1964. Manual of information to accompany a standard corpus of present-day edited american english, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island.
Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING 2012: Technical Papers, pages 959–976.
Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple nlp tasks. In Conference on Empirical Methods in Natural Language Processing.
Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In ACL.
Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.
Richard Johansson and Pierre Nugues. 2008. Dependency-based semantic role labeling of propbank. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 69–78.
Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.
Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.
Thomas N. Kipf and Max Welling. 2017. Semisupervised classification with graph convolutional networks. In International Conference on Learning Representations.
Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP.
Beth Levin. 1993. English verb classes and alternations: A preliminary investigation. University of Chicago press.
Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. Joint A* CCG Parsing and Semantic Role Labeling. In EMNLP.
Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING).
Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.
Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 30.
Diego Marcheggiani, Anton Frolov, and Ivan Titov. 2017. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. In CoNLL.
Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).
Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn TreeBank. Computational Linguistics – Special issue on using large corpora: II, 19(2):313–330.
Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The stanford typed dependencies representation. In COLING 2008 Workshop on Cross-framework and Cross-domain Parser Evaluation.
Yurii Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k^2). volume 27, pages 372–376.
Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The proposition bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1).
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning.
Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In ACL.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. Glove: Global vectors for word representation. In EMNLP.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.
Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152.
Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James Martin, and Dan Jurafsky. 2005. Semantic role labeling using different syntactic views. In Proceedings of the Association for Computational Linguistics 43rd annual meeting (ACL).
Vasin Punyakanok, Dan Roth, and Wen-Tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287.
Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).
Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1192–1202.
Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 231–235.
Nitish Srivastava, Geoffrey E Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research, 15(1):1929–1958.
Mihai Surdeanu, Lluís Màrquez, Xavier Carreras, and Pere R. Comas. 2007. Combination strategies for semantic role labeling. Journal of Artificial Intelligence Research, 29:105–151.
Charles Sutton and Andrew McCallum. 2005. Joint parsing and semantic role labeling. In CoNLL.
Swabha Swayamdipta, Sam Thomson, Chris Dyer, and Noah A. Smith. 2017. Frame-semantic parsing with softmax-margin segmental rnns and a syntactic scaffold. In arXiv:1706.09528.
Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. TACL, 3:29–41.
Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In AAAI.
Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. 2008. A global joint model for semantic role labeling. Computational Linguistics, 34(2):161–191.
Kristina Toutanova, Dan Klein, Christopher D Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pages 173–180. Association for Computational Linguistics.
Gokhan Tur, Dilek Hakkani-Tür, and Ananlada Chotimongkol. 2005. Semi-supervised learning for spoken language understanding using semantic role labeling. In ASRU.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).
Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In ACL.
R. J. Williams and D. Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280.
Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, et al. 2017. Conll 2017 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, Vancouver, Canada. Association for Computational Linguistics.
Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1557–1566. Association for Computational Linguistics.
Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).
A Supplemental Material

A.1 Supplemental analysis

Here we continue the analysis from §4.3. All experiments in this section
are performed on CoNLL-2005 development data unless stated otherwise.

First, we compare the impact of Viterbi decoding with LISA, D&M, and gold
syntax trees (Table 7), finding the same trends across both datasets. We
find that Viterbi has nearly the same impact for LISA, D&M and gold
parses: Gold parses provide little improvement over predicted parses in
terms of BIO label consistency.

We also assess SRL F1 as a function of sentence length and distance from
span to predicate. In Figure 5 we see that providing LISA with gold
parses is particularly helpful for sentences longer than 10 tokens. This
likely directly follows from the tendency of syntactic parsers to perform
worse on longer sentences. With respect to distance between arguments and
predicates (Figure 6), we do not observe this same trend: F1 at all
distances improves with better parses, and especially with gold parses.

CoNLL-2005   Greedy F1   Viterbi F1   ∆ F1
LISA         81.99       82.24        +0.25
  +D&M       83.37       83.58        +0.21
  +Gold      86.57       86.81        +0.24

CoNLL-2012   Greedy F1   Viterbi F1   ∆ F1
LISA         80.11       80.70        +0.59
  +D&M       81.55       82.05        +0.50
  +Gold      85.94       86.43        +0.49

Table 7: Comparison of development F1 scores with and without Viterbi
decoding at test time.

[Figure 5 plot: F1 (y-axis) against sentence length in tokens (0-10,
11-20, 21-30, 31-40, 41-300) for LISA, +D&M and +Gold.]

Figure 5: F1 score as a function of sentence length.

[Figure 6 plot: F1 (y-axis) against distance from predicate in tokens
(0-1, 2-3, 4-7, 8-200) for LISA, +D&M and +Gold.]

Figure 6: CoNLL-2005 F1 score as a function of the distance of the
predicate from the argument span.

             L+/D+   L-/D+   L+/D-   L-/D-
Proportion   37%     10%     4%      49%
SA           76.12   75.97   82.25   65.78
LISA         76.37   72.38   85.50   65.10
  +D&M       76.33   79.65   75.62   66.55
  +Gold      76.71   80.67   86.03   72.22

Table 8: Average SRL F1 on CoNLL-2012 for sentences where LISA (L) and
D&M (D) parses were correct (+) or incorrect (-).

A.2 Supplemental results

Due to space constraints in the main paper we list additional
experimental results here. Table 9 lists development scores on the
CoNLL-2005 dataset with predicted predicates, which follow the same
trends as the test data.

WSJ Dev             P       R       F1
He et al. (2018)    84.2    83.7    83.9
Tan et al. (2018)   82.6    83.6    83.1
SA                  83.12   82.81   82.97
LISA                83.6    83.74   83.67
  +D&M              85.04   85.51   85.27
  +Gold             89.11   89.38   89.25

Table 9: Precision, recall and F1 on the CoNLL-2005 development set with
gold predicates.

A.3 Data and pre-processing details

We initialize word embeddings with 100d pre-trained GloVe embeddings
trained on 6 billion tokens of Wikipedia and Gigaword (Pennington et al.,
2014). We evaluate the SRL performance of our models using the
srl-eval.pl script
provided by the CoNLL-2005 shared task,7 which computes segment-level
precision, recall and F1 score. We also report the predicate detection
scores output by this script. We evaluate parsing using the eval.pl CoNLL
script, which excludes punctuation.

We train distinct D&M parsers for CoNLL-2005 and CoNLL-2012. Our D&M
parsers are trained and validated using the same SRL data splits, except
that for CoNLL-2005 section 22 is used for development (rather than 24),
as this section is typically used for validation in PTB parsing. We use
Stanford dependencies v3.5 (de Marneffe and Manning, 2008) and POS tags
from the Stanford CoreNLP left3words model (Toutanova et al., 2003). We
use the pre-trained ELMo models8 and learn task-specific combinations of
the ELMo representations which are provided as input instead of GloVe
embeddings to the D&M parser with otherwise default settings.

A.3.1 CoNLL-2012

We follow the CoNLL-2012 split used by He et al. (2018) to evaluate our
models, which uses the annotations from here9 but the subset of those
documents from the CoNLL-2012 co-reference split described here10
(Pradhan et al., 2013). This dataset is drawn from seven domains:
newswire, web, broadcast news and conversation, magazines, telephone
conversations, and text from the Bible. The text is annotated with gold
part-of-speech, syntactic constituencies, named entities, word sense,
speaker, co-reference and semantic role labels based on the PropBank
guidelines (Palmer et al., 2005). Propositions may be verbal or nominal,
and there are 41 distinct semantic role labels, excluding continuation
roles and including the predicate. We convert the semantic proposition
and role segmentations to BIO boundary-encoded tags, resulting in 129
distinct BIO-encoded tags (including continuation roles).

A.3.2 CoNLL-2005

The CoNLL-2005 data (Carreras and Màrquez, 2005) is based on the original
PropBank corpus (Palmer et al., 2005), which labels the Wall Street
Journal portion of the Penn TreeBank corpus (PTB) (Marcus et al., 1993)
with predicate-argument structures, plus a challenging out-of-domain test
set derived from the Brown corpus (Francis and Kučera, 1964). This
dataset contains only verbal predicates, though some are multi-word
verbs, and 28 distinct role label types. We obtain 105 SRL labels
including continuations after encoding predicate argument segment
boundaries with BIO tags.

A.4 Optimization and hyperparameters

We train the model using the Nadam (Dozat, 2016) algorithm for adaptive
stochastic gradient descent (SGD), which combines Adam (Kingma and Ba,
2015) SGD with Nesterov momentum (Nesterov, 1983). We additionally vary
the learning rate lr as a function of an initial learning rate lr_0 and
the current training step step as described in Vaswani et al. (2017)
using the following function:

$$lr = lr_0 \cdot \min\left(step^{-0.5},\; step \cdot warm^{-1.5}\right) \qquad (8)$$

which increases the learning rate linearly for the first warm training
steps, then decays it proportionally to the inverse square root of the
step number. We found this learning rate schedule essential for training
the self-attention model. We only update optimization moving-average
accumulators for parameters which receive gradient updates at a given
step.11

In all of our experiments we used initial learning rate 0.04, β1 = 0.9,
β2 = 0.98, ε = 1 × 10^{-12} and dropout rates of 0.1 everywhere. We use 10
or 12 self-attention layers made up of 8 attention heads each with
embedding dimension 25, with 800d feed-forward projections. In the
syntactically-informed attention head, Q_parse has dimension 500 and
K_parse has dimension 100. The size of predicate and role representations
and the representation used for joint part-of-speech/predicate
classification is 200. We train with warm = 8000 warmup steps and clip
gradient norms to 1. We use batches of approximately 5000 tokens.

7 http://www.lsi.upc.es/~srlconll/srl-eval.pl
8 https://github.com/allenai/bilm-tf
9 http://cemantix.org/data/ontonotes.html
10 http://conll.cemantix.org/2012/data.html
11 Also known as lazy or sparse optimizer updates.
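
The learning rate schedule of Equation 8 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' training code: the function name learning_rate and its argument names are assumptions, while the default values lr0 = 0.04 and warm = 8000 are the ones reported in A.4.

```python
def learning_rate(step: int, lr0: float = 0.04, warm: int = 8000) -> float:
    """Equation 8: lr = lr0 * min(step^-0.5, step * warm^-1.5).

    The function name and signature are hypothetical; only the formula and
    the default hyperparameter values come from the text.
    """
    if step < 1:
        raise ValueError("step must be a positive integer")
    # Linear warmup term: grows proportionally to step, dominates for step < warm.
    warmup = step * warm ** -1.5
    # Decay term: inverse square root of the step number, dominates after warmup.
    decay = step ** -0.5
    return lr0 * min(warmup, decay)


if __name__ == "__main__":
    # The two terms are equal at step == warm, so the rate peaks there.
    for step in (1, 4000, 8000, 16000, 32000):
        print(step, learning_rate(step))
```

Under these assumptions the rate at step 4000 is half the peak rate (linear warmup), and the rate at step 32000 is again half the peak rate (inverse square root decay over a fourfold increase in step).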

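
The BIO boundary encoding mentioned in A.3 can be illustrated with a small sketch. It is a generic BIO encoder and decoder under stated assumptions, not the authors' pre-processing code: the helper names spans_to_bio and bio_to_spans, the (start, end, label) span format with inclusive token indices, and the example labels ARG0, V and ARG1 are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# A labeled segment: (first token index, last token index inclusive, label).
Span = Tuple[int, int, str]


def spans_to_bio(num_tokens: int, spans: List[Span]) -> List[str]:
    """Encode non-overlapping labeled segments as one BIO tag per token."""
    tags = ["O"] * num_tokens
    for start, end, label in spans:
        if not (0 <= start <= end < num_tokens):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end + 1]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = "B-" + label              # segment beginning
        for i in range(start + 1, end + 1):
            tags[i] = "I-" + label              # segment continuation
    return tags


def bio_to_spans(tags: List[str]) -> List[Span]:
    """Decode BIO tags back into segments.

    An I- tag that does not continue an open segment with the same label is
    ignored; this is one possible convention, chosen here for simplicity.
    """
    spans: List[Span] = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):      # sentinel closes a final segment
        continues = label is not None and tag == "I-" + label
        if continues:
            continue
        if label is not None:
            spans.append((start, i - 1, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


if __name__ == "__main__":
    example = [(0, 1, "ARG0"), (2, 2, "V"), (3, 4, "ARG1")]
    tags = spans_to_bio(5, example)
    print(tags)
    print(bio_to_spans(tags))
```

Because every segment label yields a B- and an I- variant plus the shared O tag, the tag inventory is roughly twice the size of the role inventory, which is why the BIO tag counts reported in A.3 are much larger than the number of role labels.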