Linguistically-Informed Self-Attention For Semantic Role Labeling
Emma Strubell1 , Patrick Verga1 , Daniel Andor2 , David Weiss2 and Andrew McCallum1
1 College of Information and Computer Sciences, University of Massachusetts Amherst
{strubell, pat, mccallum}@cs.umass.edu
2 Google AI Language, New York, NY
{andor, djweiss}@google.com
Abstract

Current state-of-the-art semantic role labeling (SRL) uses a deep neural network but no explicit linguistic features. However, prior work has shown that gold syntax trees can dramatically improve SRL decoding, suggesting the possibility of increased accuracy from explicit modeling of syntax. In this work, we present linguistically-informed self-attention (LISA): a neural network model that combines multi-head self-attention with multi-task learning across dependency parsing, part-of-speech tagging, predicate detection and SRL. Unlike previous models, which require significant pre-processing to prepare linguistic features, LISA can incorporate syntax using merely raw tokens as input, encoding the sequence only once to simultaneously perform parsing, predicate detection and role labeling for all predicates. Syntax is incorporated by training one attention head to attend to syntactic parents for each token. Moreover, if a high-quality syntactic parse is already available, it can be beneficially injected at test time without re-training our SRL model. In experiments on CoNLL-2005 SRL, LISA achieves new state-of-the-art performance for a model using predicted predicates and standard word embeddings, attaining 2.5 F1 absolute higher than the previous state-of-the-art on newswire and more than 3.5 F1 on out-of-domain data, nearly a 10% reduction in error. On CoNLL-2012 English SRL we also show an improvement of more than 2.5 F1. LISA also out-performs the state-of-the-art with contextually-encoded (ELMo) word representations, by nearly 1.0 F1 on news and more than 2.0 F1 on out-of-domain text.

1 Introduction

Semantic role labeling (SRL) extracts a high-level representation of meaning from a sentence, labeling e.g. who did what to whom. Explicit representations of such semantic information have been shown to improve results in challenging downstream tasks such as dialog systems (Tur et al., 2005; Chen et al., 2013), machine reading (Berant et al., 2014; Wang et al., 2015) and translation (Liu and Gildea, 2010; Bazrafshan and Gildea, 2013). Though syntax was long considered an obvious prerequisite for SRL systems (Levin, 1993; Punyakanok et al., 2008), recently deep neural network architectures have surpassed syntactically-informed models (Zhou and Xu, 2015; Marcheggiani et al., 2017; He et al., 2017; Tan et al., 2018; He et al., 2018), achieving state-of-the-art SRL performance with no explicit modeling of syntax. An additional benefit of these end-to-end models is that they require just raw tokens and (usually) detected predicates as input, whereas richer linguistic features typically require extraction by an auxiliary pipeline of models.

Still, recent work (Roth and Lapata, 2016; He et al., 2017; Marcheggiani and Titov, 2017) indicates that neural network models could see even higher accuracy gains by leveraging syntactic information rather than ignoring it. He et al. (2017) indicate that many of the errors made by a syntax-free neural network on SRL are tied to certain syntactic confusions such as prepositional phrase attachment, and show that while constrained inference using a relatively low-accuracy predicted parse can provide small improvements in SRL accuracy, providing a gold-quality parse leads to substantial gains. Marcheggiani and Titov (2017) incorporate syntax from a high-quality parser (Kiperwasser and Goldberg, 2016) using graph convolutional neural networks (Kipf and Welling, 2017), but like He et al. (2017) they attain only small increases over a model with no syntactic parse, and even perform worse than a syntax-free model on out-of-domain data. These works suggest that though syntax has the potential to improve neural network SRL models, we have not yet designed an architecture which maximizes the benefits of auxiliary syntactic information.
In response, we propose linguistically-informed self-attention (LISA): a model that combines multi-task learning (Caruana, 1993) with stacked layers of multi-head self-attention (Vaswani et al., 2017); the model is trained to: (1) jointly predict parts of speech and predicates; (2) perform parsing; and (3) attend to syntactic parse parents, while (4) assigning semantic role labels. Whereas prior work typically requires separate models to provide linguistic analysis, including most syntax-[…]

[Figure 1: The LISA architecture applied to the sentence "I saw the sloth climbing": stacked multi-head self-attention + feed-forward (FF) layers with one syntactically-informed self-attention layer; bilinear and feed-forward output layers produce predicate (s_pred) and role (s_role) scores, alongside POS/predicate tags (PRP VBP:PRED DT NN VBG:PRED) and per-predicate BIO role labels.]
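The core mechanism, one self-attention head whose weights are trained to point at each token's syntactic parent and which can be overridden with a pre-computed parse at test time, can be pictured as follows. This is an illustrative sketch only, not the paper's implementation (LISA scores syntactic attention with a bilinear operator and trains this head with an auxiliary parsing loss); all function and variable names here are invented.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def syntax_attention_head(X, Wq, Wk, Wv, parent_onehot=None):
    """One self-attention head over token representations X (n_tokens x d).
    If parent_onehot is given (row t is one-hot on token t's syntactic
    parent), it replaces the learned attention weights -- the 'inject a
    high-quality parse at test time' operation described in the abstract."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])        # scaled dot-product
    A = softmax(scores) if parent_onehot is None else parent_onehot
    return A @ V, A                                # weighted values, weights
```

During training, this one head's attention rows would additionally be pushed, via a cross-entropy loss against gold parent positions, toward the dependency parse, so the same head doubles as a parser.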
                  Dev                      WSJ Test                 Brown Test
                  P      R      F1         P      R      F1         P      R      F1
He et al. (2018)  84.9   85.7   85.3       84.8   87.2   86.0       73.9   78.4   76.1
SA                85.78  84.74  85.26      86.21  85.98  86.09      77.1   75.61  76.35
LISA              86.07  84.64  85.35      86.69  86.42  86.55      78.95  77.17  78.05
  +D&M            85.83  84.51  85.17      87.13  86.67  86.90      79.02  77.49  78.25
  +Gold           88.51  86.77  87.63      —      —      —          —      —      —

Table 1: Precision, recall and F1 on the CoNLL-2005 development and test sets.
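The F1 columns in these tables are the harmonic mean of the precision and recall columns, which can be verified directly (the helper name is illustrative):

```python
def f1(p, r):
    """Harmonic mean of precision and recall, as reported in the tables."""
    return 2 * p * r / (p + r)

# LISA CoNLL-2005 dev: P = 86.07, R = 84.64 -> F1 = 85.35, matching Table 1
lisa_dev_f1 = round(f1(86.07, 84.64), 2)
```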
         L+/D+   L–/D+   L+/D–   L–/D–
SA       79.29   75.14   75.97   75.08
LISA     79.51   74.33   79.69   75.00
  +D&M   79.03   76.96   77.73   76.52
  +Gold  79.61   78.38   81.41   80.47

Table 6: Average SRL F1 on CoNLL-2005 for sentences where LISA (L) and D&M (D) parses were correct (+) or incorrect (–).

[Figure 4: Percent of split/merge corrections by phrase type (PP, NP, VP, SBAR, ADVP, PRN, Other), for +D&M and +Gold.]
Here there is little difference between any of the models, with LISA models tending to perform slightly better than SA. Both parsers make mistakes on the majority of sentences (57%), difficult sentences where SA also performs the worst. These examples are likely where gold and D&M parses improve the most over other models in overall F1: though both parsers fail to correctly parse the entire sentence, the D&M parser is less wrong (87.5 vs. 85.7 average LAS), leading to higher SRL F1 by about 1.5 average F1.

Following He et al. (2017), we next apply a series of corrections to model predictions in order to understand which error types the gold parse resolves: e.g. Fix Labels fixes labels on spans matching gold boundaries, and Merge Spans merges adjacent predicted spans into a gold span.6 In Figure 3 we see that much of the performance gap between the gold and predicted parses is due to span boundary errors (Merge Spans, Split Spans and Fix Span Boundary), which supports the hypothesis proposed by He et al. (2017) that incorporating syntax could be particularly helpful for resolving these errors. He et al. (2017) also point out that this is the case: Figure 4 shows a breakdown of split/merge corrections by phrase type. Though the number of corrections decreases substantially across phrase types, the proportion of corrections attributed to PPs remains the same (approx. 50%) even after providing the correct PP attachment to the model, indicating that PP span boundary mistakes are a fundamental difficulty for SRL.

[Figure 3: Performance of CoNLL-2005 models (SA, LISA, +D&M, +Gold) after performing corrections from He et al. (2017): Orig., Fix Labels, Move Core Arg., Merge Spans, Split Spans, Fix Span Boundary, Drop Arg., Add Arg.]

6 Refer to He et al. (2017) for a detailed explanation of the different error types.

5 Conclusion

We present linguistically-informed self-attention: a multi-task neural network model that effectively incorporates rich linguistic information for semantic role labeling. LISA out-performs the state-of-the-art on two benchmark SRL datasets, including out-of-domain. Future work will explore improving LISA's parsing accuracy, developing better training techniques and adapting to more tasks.

Acknowledgments

We are grateful to Luheng He for helpful discussions and code, Timothy Dozat for sharing his code, and to the NLP reading groups at Google and UMass and the anonymous reviewers for feedback on drafts of this work. This work was supported in part by an IBM PhD Fellowship Award to E.S., in part by the Center for Intelligent Information Retrieval, and in part by the National Science Foundation under Grant Nos. DMR-1534431 and IIS-1514053. Any opinions, findings, conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
References

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.

Héctor Martínez Alonso and Barbara Plank. 2017. When is multitask learning effective? Semantic sequence prediction under varying data conditions. In EACL.

Miguel Ballesteros, Yoav Goldberg, Chris Dyer, and Noah A. Smith. 2016. Training with exploration improves a greedy stack LSTM parser. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2005–2010.

Marzieh Bazrafshan and Daniel Gildea. 2013. Semantic roles for string to tree machine translation. In ACL.

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. In NIPS.

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. 1994. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166.

Jonathan Berant, Vivek Srikumar, Pei-Chun Chen, Brad Huang, Christopher D. Manning, Abby Vander Linden, Brittany Harding, and Peter Clark. 2014. Modeling biological processes for reading comprehension. In EMNLP.

Joachim Bingel and Anders Søgaard. 2017. Identifying beneficial task relations for multi-task learning in deep neural networks. In EACL.

Xavier Carreras and Lluís Màrquez. 2005. Introduction to the CoNLL-2005 shared task: Semantic role labeling. In CoNLL.

Rich Caruana. 1993. Multitask learning: a knowledge-based source of inductive bias. In ICML.

Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. 2015. Learning to search better than your teacher. In ICML.

Yun-Nung Chen, William Yang Wang, and Alexander I. Rudnicky. 2013. Unsupervised induction and filling of semantic slots for spoken dialogue systems using frame-semantic parsing. In Proc. of ASRU-IEEE.

Jinho D. Choi and Martha Palmer. 2011. Getting the most out of transition-based dependency parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: short papers, pages 687–692.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Hal Daumé III, John Langford, and Daniel Marcu. 2009. Search-based structured prediction. Machine Learning, 75(3):297–325.

Timothy Dozat. 2016. Incorporating Nesterov momentum into Adam. In ICLR Workshop track.

Timothy Dozat and Christopher D. Manning. 2017. Deep biaffine attention for neural dependency parsing. In ICLR.

Nicholas FitzGerald, Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Semantic role labeling with neural network factors. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 960–970.

W. N. Francis and H. Kučera. 1964. Manual of information to accompany a standard corpus of present-day edited American English, for use with digital computers. Technical report, Department of Linguistics, Brown University, Providence, Rhode Island.

Yoav Goldberg and Joakim Nivre. 2012. A dynamic oracle for arc-eager dependency parsing. In Proceedings of COLING 2012: Technical Papers, pages 959–976.

Kazuma Hashimoto, Caiming Xiong, Yoshimasa Tsuruoka, and Richard Socher. 2017. A joint many-task model: Growing a neural network for multiple NLP tasks. In Conference on Empirical Methods in Natural Language Processing.

Luheng He, Kenton Lee, Omer Levy, and Luke Zettlemoyer. 2018. Jointly predicting predicates and arguments in neural semantic role labeling. In ACL.

Luheng He, Kenton Lee, Mike Lewis, and Luke Zettlemoyer. 2017. Deep semantic role labeling: What works and what's next. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.

Richard Johansson and Pierre Nugues. 2008. Dependency-based semantic role labeling of PropBank. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, pages 69–78.

Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In 3rd International Conference for Learning Representations (ICLR), San Diego, California, USA.

Eliyahu Kiperwasser and Yoav Goldberg. 2016. Simple and accurate dependency parsing using bidirectional LSTM feature representations. Transactions of the Association for Computational Linguistics, 4:313–327.
Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In EMNLP.

Beth Levin. 1993. English verb classes and alternations: A preliminary investigation. University of Chicago Press.

Mike Lewis, Luheng He, and Luke Zettlemoyer. 2015. Joint A* CCG parsing and semantic role labeling. In EMNLP.

Ding Liu and Daniel Gildea. 2010. Semantic role features for machine translation. In Proceedings of the 23rd International Conference on Computational Linguistics (COLING).

Yang Liu and Mirella Lapata. 2018. Learning structured text representations. Transactions of the Association for Computational Linguistics, 6:63–75.

Andrew L. Maas, Awni Y. Hannun, and Andrew Y. Ng. 2013. Rectifier nonlinearities improve neural network acoustic models. In ICML, volume 30.

Diego Marcheggiani, Anton Frolov, and Ivan Titov. 2017. A simple and accurate syntax-agnostic neural model for dependency-based semantic role labeling. In CoNLL.

Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. Building a large annotated corpus of English: The Penn TreeBank. Computational Linguistics – Special issue on using large corpora: II, 19(2):313–330.

Marie-Catherine de Marneffe and Christopher D. Manning. 2008. The Stanford typed dependencies representation. In COLING 2008 Workshop on Cross-framework and Cross-domain Parser Evaluation.

Yurii Nesterov. 1983. A method of solving a convex programming problem with convergence rate O(1/k²). Volume 27, pages 372–376.

Martha Palmer, Daniel Gildea, and Paul Kingsbury. 2005. The Proposition Bank: An annotated corpus of semantic roles. Computational Linguistics, 31(1).

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. 2013. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning.

Hao Peng, Sam Thomson, and Noah A. Smith. 2017. Deep multitask learning for semantic dependency parsing. In ACL.

Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In EMNLP.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In NAACL.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Hwee Tou Ng, Anders Björkelund, Olga Uryupina, Yuchen Zhang, and Zhi Zhong. 2013. Towards robust linguistic analysis using OntoNotes. In Proceedings of the Seventeenth Conference on Computational Natural Language Learning, pages 143–152.

Sameer Pradhan, Wayne Ward, Kadri Hacioglu, James Martin, and Dan Jurafsky. 2005. Semantic role labeling using different syntactic views. In Proceedings of the Association for Computational Linguistics 43rd annual meeting (ACL).

Vasin Punyakanok, Dan Roth, and Wen-Tau Yih. 2008. The importance of syntactic parsing and inference in semantic role labeling. Computational Linguistics, 34(2):257–287.

Stéphane Ross, Geoffrey J. Gordon, and J. Andrew Bagnell. 2011. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS).

Michael Roth and Mirella Lapata. 2016. Neural semantic role labeling with dependency path embeddings. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (ACL), pages 1192–1202.

Anders Søgaard and Yoav Goldberg. 2016. Deep multi-task learning with low level tasks supervised at lower layers. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 231–235.

Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958.

Mihai Surdeanu, Lluís Màrquez, Xavier Carreras, and Pere R. Comas. 2007. Combination strategies for semantic role labeling. Journal of Artificial Intelligence Research, 29:105–151.

Charles Sutton and Andrew McCallum. 2005. Joint parsing and semantic role labeling. In CoNLL.
Swabha Swayamdipta, Sam Thomson, Chris Dyer, and Noah A. Smith. 2017. Frame-semantic parsing with softmax-margin segmental RNNs and a syntactic scaffold. arXiv:1706.09528.

Oscar Täckström, Kuzman Ganchev, and Dipanjan Das. 2015. Efficient inference and structured learning for semantic role labeling. TACL, 3:29–41.

Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In AAAI.

Kristina Toutanova, Aria Haghighi, and Christopher D. Manning. 2008. A global joint model for semantic role labeling. Computational Linguistics, 34(2):161–191.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology - Volume 1, pages 173–180. Association for Computational Linguistics.

Gokhan Tur, Dilek Hakkani-Tür, and Ananlada Chotimongkol. 2005. Semi-supervised learning for spoken language understanding using semantic role labeling. In ASRU.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In 31st Conference on Neural Information Processing Systems (NIPS).

Hai Wang, Mohit Bansal, Kevin Gimpel, and David McAllester. 2015. Machine comprehension with syntax, frames, and semantics. In ACL.

R. J. Williams and D. Zipser. 1989. A learning algorithm for continually running fully recurrent neural networks. Neural Computation, 1(2):270–280.

Daniel Zeman, Martin Popel, Milan Straka, Jan Hajic, Joakim Nivre, Filip Ginter, Juhani Luotolahti, Sampo Pyysalo, Slav Petrov, Martin Potthast, et al. 2017. CoNLL 2017 shared task: Multilingual parsing from raw text to universal dependencies. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 1–19, Vancouver, Canada. Association for Computational Linguistics.

Yuan Zhang and David Weiss. 2016. Stack-propagation: Improved representation learning for syntax. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1557–1566. Association for Computational Linguistics.

Jie Zhou and Wei Xu. 2015. End-to-end learning of semantic role labeling using recurrent neural networks. In Proc. of the Annual Meeting of the Association for Computational Linguistics (ACL).
CoNLL-2005   Greedy F1   Viterbi F1   ∆ F1
+Gold        86.57       86.81        +0.24

CoNLL-2012   Greedy F1   Viterbi F1   ∆ F1
LISA         80.11       80.70        +0.59
+D&M         81.55       82.05        +0.50
+Gold        85.94       86.43        +0.49

Table 7: Comparison of development F1 scores with and without Viterbi decoding at test time.

[Figure 6: CoNLL-2005 F1 score as a function of the distance of the predicate from the argument span (0-1, 2-3, 4-7, 8-200 tokens), for LISA, +D&M and +Gold.]
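The Viterbi decoding compared in Table 7 can be pictured as a standard Viterbi pass over the per-token tag scores with hard constraints on BIO transitions (an I-X tag may only follow B-X or I-X, and may not start a sequence). The sketch below is a generic constrained decoder, not LISA's actual decoder; the names and the log-space score convention are assumptions.

```python
import numpy as np

def viterbi_bio(emissions, tags):
    """Viterbi decoding over per-token tag scores (log-space).
    emissions: (n_tokens, n_tags) array; tags: list of BIO tag names.
    Invalid BIO transitions (e.g. O -> I-ARG0) receive a large penalty."""
    neg = -1e9
    n, t = emissions.shape
    # trans[i, j] = 0 if tag j may follow tag i, else a large penalty
    trans = np.zeros((t, t))
    for i, prev in enumerate(tags):
        for j, cur in enumerate(tags):
            if cur.startswith("I-") and prev not in ("B-" + cur[2:], "I-" + cur[2:]):
                trans[i, j] = neg
    # an I- tag cannot start the sequence
    score = emissions[0] + np.array(
        [neg if tag.startswith("I-") else 0.0 for tag in tags])
    back = np.zeros((n, t), dtype=int)
    for k in range(1, n):
        cand = score[:, None] + trans + emissions[k][None, :]
        back[k] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for k in range(n - 1, 0, -1):
        path.append(int(back[k, path[-1]]))
    return [tags[i] for i in reversed(path)]
```

Where greedy decoding would emit an invalid sequence (e.g. starting with I-ARG0), the constrained Viterbi pass backs off to the best valid path, which is the source of the small F1 gains in Table 7.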
[Figure 5: F1 score as a function of sentence length (0-10, 11-20, 21-30, 31-40, 41-300 tokens), for LISA, +D&M and +Gold.]

        L+/D+   L–/D+   L+/D–   L–/D–
+D&M    76.33   79.65   75.62   66.55
+Gold   76.71   80.67   86.03   72.22

Table 8: Average SRL F1 on CoNLL-2012 for sentences where LISA (L) and D&M (D) parses were correct (+) or incorrect (–).