Graph Convolutional Networks for Named Entity Recognition

Cetoli, A., Bragaglia, S., O'Harney, A. D., Sloan, M.
Context Scout
{alberto, stefano, andy, marc}@contextscout.com

In this paper we investigate the role of the dependency tree in a named entity recognizer by using a set of Graph Convolutional Networks (GCNs). We compare several Named Entity Recognition (NER) architectures and show that the grammatical structure of a sentence positively influences the results. Experiments on the OntoNotes 5.0 dataset demonstrate consistent performance improvements, without requiring heavy feature engineering or additional language-specific knowledge.
Section 3 describes the experiments and presents the results. We discuss related work in Section 4 and draw our conclusions in Section 5.
where $u$ and $v$ are nodes in the graph and $N(v)$ is the set of nearest neighbours of node $v$, plus the node $v$ itself. The vector $h_u^k$ represents the embedding of node $u$ at the $k$-th layer, while $W$ and $b$ are a weight matrix and a bias – learned during training – that map the embedding of node $u$ onto the adjacent nodes in the graph; $h_u \in \mathbb{R}^m$, $W \in \mathbb{R}^{m \times m}$, and $b \in \mathbb{R}^m$.
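As a concrete illustration, one GCN layer of this form can be sketched in a few lines of numpy. This is a minimal sketch, not the authors' implementation: the function name, the toy graph, and the dimension sizes are hypothetical, and the adjacency matrix A simply encodes $N(v)$ (neighbours plus self-loop) row by row.

```python
import numpy as np

def gcn_layer(H, A, W, b):
    """One graph convolution: h_v^{k+1} = ReLU(sum_{u in N(v)} (W h_u^k + b)).

    H: (n_nodes, m) node embeddings at layer k
    A: (n_nodes, n_nodes) adjacency matrix, A[v, u] = 1 iff u is in N(v)
       (nearest neighbours of v plus v itself)
    W: (m, m) weight matrix and b: (m,) bias, learned during training
    """
    messages = H @ W.T + b              # W h_u^k + b for every node u
    aggregated = A @ messages           # sum over the neighbourhood N(v)
    return np.maximum(aggregated, 0.0)  # ReLU

# Toy example: 3 tokens with dependency edges 0-1 and 1-2, plus self-loops.
m = 4
H = np.random.randn(3, m)
A = np.eye(3)
A[0, 1] = A[1, 0] = A[1, 2] = A[2, 1] = 1.0
W, b = 0.1 * np.random.randn(m, m), np.zeros(m)
print(gcn_layer(H, A, W, b).shape)      # (3, 4)
```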
Following the example of (Marcheggiani and Titov, 2017), we prefer to exploit the directedness of the graph in our system. Our inspiration comes from the bi-directional architecture of stacked RNNs, where two different neural networks operate forward and backward respectively; the outputs of the two RNNs are eventually concatenated and passed to further layers.
In our architecture we employ two stacked GCNs: one that only considers the incoming edges of each node,

$$\overleftarrow{h}_v^{k+1} = \mathrm{ReLU}\Big( \sum_{u \in \overleftarrow{N}(v)} \big( \overleftarrow{W}^k h_u^k + \overleftarrow{b}^k \big) \Big), \qquad (2)$$

and one that only considers the outgoing edges of each node,

$$\overrightarrow{h}_v^{k+1} = \mathrm{ReLU}\Big( \sum_{u \in \overrightarrow{N}(v)} \big( \overrightarrow{W}^k h_u^k + \overrightarrow{b}^k \big) \Big). \qquad (3)$$
[Figure: network architecture, panels (a) and (b), showing the dense layer and the CRF output.]
After $N$ layers, the final output of the two GCNs is the concatenation of the two separate representations,

$$h_v^N = \overrightarrow{h}_v^N \oplus \overleftarrow{h}_v^N. \qquad (4)$$
In the following, we refer to the architecture expressed by Equation 4 as a bi-directional GCN.
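A minimal numpy sketch of the bi-directional GCN of Equations 2–4 is shown below; the helper names and shapes are illustrative, not the authors' code. The only ingredient beyond the plain GCN is the split of the dependency graph into an "incoming" and an "outgoing" adjacency matrix (transposes of each other, up to self-loops), each direction keeping its own weights.

```python
import numpy as np

def directed_gcn(H, A, weights, biases):
    """Stack of GCN layers over a single edge direction (Equation 2 or 3)."""
    for W, b in zip(weights, biases):
        H = np.maximum(A @ (H @ W.T + b), 0.0)   # ReLU(sum_{u} (W h_u + b))
    return H

def bidirectional_gcn(H, A_in, A_out, params_in, params_out):
    """Run one stack over the incoming edges and one over the outgoing edges,
    then concatenate the two final representations (Equation 4)."""
    h_in = directed_gcn(H, A_in, *params_in)      # Equation (2), N layers
    h_out = directed_gcn(H, A_out, *params_out)   # Equation (3), N layers
    return np.concatenate([h_out, h_in], axis=-1)
```

The concatenated output is then passed to the layers above it in the figure (dense layer and CRF output).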
The first type of word embeddings we employ is the 300-dimensional Glove vectors (Pennington et al., 2014), taken from two different distributions: one with a vocabulary of 1M words and another with 2.2M words. Whenever a word is not present in the Glove vocabulary, we use the vector corresponding to the word "entity" instead.
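A minimal sketch of this out-of-vocabulary fallback is shown below, assuming the standard whitespace-separated Glove text format; the file name in the comment is hypothetical.

```python
import numpy as np

def load_glove(path):
    """Load Glove vectors from a whitespace-separated text file into a dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def embed(word, vectors):
    """Return the Glove vector of `word`, falling back to the vector of the
    word "entity" when the word is out of vocabulary."""
    return vectors.get(word, vectors["entity"])

# glove = load_glove("glove.840B.300d.txt")  # hypothetical path to the 2.2M-word Glove
# x = embed("Frodo", glove)                  # 300-dimensional vector
```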
The second type of word embeddings concatenates the Glove word vectors with PoS tag embeddings. We use randomly initialized Part-of-Speech embeddings that are fine-tuned during training; the final quality of our results correlates with the quality of the Part-of-Speech tagging. In one set of experiments we use the manually curated PoS tags included in the OntoNotes 5.0 dataset (Weischedel, 2013) (PoS (gold)). These tags have the highest quality.

In another set of experiments, we use the PoS tags inferred by the parser (PoS (inferred)) instead of the manually curated ones. These PoS tags are of lower quality. An external tagger might also produce a different number of tokens than the ones present in the training and evaluation datasets, which presents a challenge: we skip such sentences during training, and consider their entities as incorrectly tagged during evaluation.
Finally, for the third type of word embeddings, we add morphological information to the feature vector. The reason – explained in (Cao and Rei, 2016) – is that out-of-vocabulary words are handled poorly when using only word embeddings. We employ a bi-directional RNN to encode character information; the final states of the RNN are concatenated and passed to a dense layer, which is integrated into the feature vector along with the word embeddings and PoS information. In order to speed up the computation, we truncate each word by keeping only its first 12 characters. Truncation is not commonly done, as it can hinder the network's performance; we leave further analysis to future work.
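The character-level encoder can be sketched with TensorFlow/Keras layers as below. The 12-character truncation and the concatenation with the word and PoS embeddings follow the description above; the character vocabulary size, character embedding width, and LSTM size are assumptions, while the 20-dimensional morphological embedding is taken from Appendix A.

```python
import tensorflow as tf

MAX_CHARS = 12    # words are truncated to their first 12 characters
CHAR_VOCAB = 128  # assumed character vocabulary size
CHAR_EMB = 16     # assumed character embedding width
MORPH_DIM = 20    # morphological embedding size (Appendix A)

def character_encoder():
    """Bi-directional RNN over the characters of a word; the two final states
    are concatenated and projected by a dense layer into the morphological vector."""
    chars = tf.keras.Input(shape=(MAX_CHARS,), dtype="int32")
    emb = tf.keras.layers.Embedding(CHAR_VOCAB, CHAR_EMB, mask_zero=True)(chars)
    # return_sequences=False keeps only the final state of each direction,
    # and Bidirectional concatenates the two by default.
    states = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32))(emb)
    morph = tf.keras.layers.Dense(MORPH_DIM, activation="tanh")(states)
    return tf.keras.Model(chars, morph)

# The morphological vector is concatenated with the Glove and PoS embeddings:
# features = tf.concat([word_emb, pos_emb, character_encoder()(char_ids)], axis=-1)
```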
Dropout In order to tackle over-fitting, we apply dropout to all the layers on top of the LSTM. The probability of dropping a node is set to 20% for all configurations. The layers that are used as input to the LSTM do not use dropout.
Network output At inference time, the output of the network is a 19-dimensional vector for each input word. This dimensionality comes from the 18 tags used in OntoNotes 5.0, with an additional dimension expressing the absence of a named entity. No Begin, Inside, Outside, End, Single (BIOES) markings are applied; at evaluation time we simply consider a named entity chunk to be a contiguous sequence of words belonging to the same category.
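Since no BIOES markings are used, evaluation needs a way of grouping per-token predictions into entity chunks. A minimal sketch of the "contiguous words with the same category" rule is shown below; the label names are illustrative.

```python
def labels_to_chunks(labels, outside="O"):
    """Group a per-token label sequence into (category, start, end) chunks,
    where a chunk is a maximal contiguous run of the same non-O category."""
    chunks, start = [], None
    for i, label in enumerate(labels + [outside]):   # sentinel flushes the last chunk
        if start is not None and (label == outside or label != labels[start]):
            chunks.append((labels[start], start, i - 1))
            start = None
        if start is None and label != outside:
            start = i
    return chunks

print(labels_to_chunks(["O", "PERSON", "PERSON", "O", "GPE"]))
# [('PERSON', 1, 2), ('GPE', 4, 4)]
```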
2.2.2 Training
We use TensorFlow (Abadi et al., 2015) to implement our neural network. Training and inference are done at the sentence level. The weights are initialized randomly from a uniform distribution and the initial states of the LSTMs are set to zero. The system uses the configuration in Appendix A.
The training objective is the CRF loss function explained in (Huang et al., 2015). Following their notation, we define $[f]_{i,t}$ as the matrix that represents the score assigned by the network to the $t$-th word having the $i$-th tag. We also introduce the transition matrix $A_{ij}$, which stores the probability of going from tag $i$ to tag $j$. The transition matrix is usually trained along with the other network weights; in our work we instead set it to a constant, equal to the transition frequencies found in the training dataset.
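The fixed transition matrix can be estimated from tag-bigram counts in the training set. The sketch below normalises the counts row-wise so that $A_{ij}$ approximates the probability of moving from tag $i$ to tag $j$; the function name, tag-id encoding, and smoothing constant are assumptions.

```python
import numpy as np

def transition_frequencies(tag_sequences, n_tags, smoothing=1e-8):
    """Estimate A_ij = p(tag j | tag i) from the training tag sequences.

    tag_sequences: iterable of lists of integer tag ids (one list per sentence).
    The resulting matrix is kept constant during training instead of being learned.
    """
    counts = np.full((n_tags, n_tags), smoothing)
    for tags in tag_sequences:
        for prev, curr in zip(tags[:-1], tags[1:]):
            counts[prev, curr] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)   # row-normalise

# A = transition_frequencies(train_tag_ids, n_tags=19)  # 18 entity tags + "no entity"
```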
The function $f$ is a function of the network's parameters $\theta$ and the input sentence $[x]_1^T$ (the list of embeddings with length $T$). Let the list of $T$ training labels be written as $[i]_1^T$; then our training objective is written as

$$S\big([x]_1^T, [i]_1^T, \theta, A_{ij}\big) - \log \sum_{[j]_1^T} \exp S\big([x]_1^T, [j]_1^T, \theta, A_{ij}\big), \qquad (8)$$

where

$$S\big([x]_1^T, [i]_1^T, \theta, A_{ij}\big) = \sum_{t=1}^{T} \Big( A_{[i]_{t-1},[i]_t} + [f]_{[i]_t,\,t} \Big). \qquad (9)$$
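For concreteness, Equations 8 and 9 can be written down with the usual forward recursion for the log-sum-exp over all tag paths. This is a plain numpy/scipy sketch of the formulas, not the authors' TensorFlow implementation, and it omits the start transition of Equation 9.

```python
import numpy as np
from scipy.special import logsumexp

def sentence_score(f, tags, A):
    """Equation (9): sum of transition scores and per-word tag scores.
    f: (T, n_tags) network scores [f]_{i,t}; tags: gold tag ids; A: fixed transitions.
    The start transition A[start, tags[0]] is omitted in this sketch."""
    score = f[0, tags[0]]
    for t in range(1, len(tags)):
        score += A[tags[t - 1], tags[t]] + f[t, tags[t]]
    return score

def crf_objective(f, tags, A):
    """Equation (8): gold-path score minus the log-sum-exp over all tag paths,
    the latter computed with the forward algorithm."""
    alpha = f[0]                                   # log-scores of length-1 paths
    for t in range(1, f.shape[0]):
        # alpha[j] = logsumexp_i(alpha[i] + A[i, j]) + f[t, j]
        alpha = logsumexp(alpha[:, None] + A, axis=0) + f[t]
    return sentence_score(f, tags, A) - logsumexp(alpha)
```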
At inference time, we rely on the Viterbi algorithm to find the tag sequence that maximizes $S\big([x]_1^T, [i]_1^T, \theta, A_{ij}\big)$. We apply mini-batch stochastic gradient descent with the Adam optimiser (Kingma and Ba, 2014), using a learning rate fixed at $10^{-4}$.
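A matching Viterbi decoder, again as a numpy sketch under the same conventions (and with the start transition omitted), recovers the tag sequence with the maximal score $S$.

```python
import numpy as np

def viterbi_decode(f, A):
    """Find the tag sequence maximising S of Equation (9).
    f: (T, n_tags) network scores; A: (n_tags, n_tags) transition scores."""
    T, n_tags = f.shape
    score = f[0].copy()                       # best score of paths ending in each tag
    backpointers = np.zeros((T, n_tags), dtype=int)
    for t in range(1, T):
        candidates = score[:, None] + A + f[t]    # [i, j]: extend best path at i with tag j
        backpointers[t] = candidates.argmax(axis=0)
        score = candidates.max(axis=0)
    best = [int(score.argmax())]              # best final tag
    for t in range(T - 1, 0, -1):             # follow the back-pointers
        best.append(int(backpointers[t, best[-1]]))
    return best[::-1]
```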
3 Experimental Results
In this section, we compare the different methods applied and discuss the results. The scores in Table 1 are presented as an average of 6 runs, with the error being the standard deviation; we keep only the first significant digit of the errors, rounding to the nearest number.
Table 1: Precision, recall, and F1 scores on the OntoNotes 5.0 dev and test sets. The results of Ratinov and Roth and of Finkel and Manning are taken from (Chiu and Nichols, 2015).

| Description | Dev prec | Dev rec | Dev F1 | Test prec | Test rec | Test F1 |
|---|---|---|---|---|---|---|
| Bi-LSTM + 1M Glove + CRF | 80.9 | 78.2 | 79.5±0.3 | 79.1 | 75.9 | 77.5±0.4 |
| Bi-LSTM + 1M Glove + CRF + GCN | 82.2 | 79.5 | 80.8±0.3 | 82.0 | 77.5 | 79.7±0.3 |
| Bi-LSTM + 1M Glove + CRF + GCN + PoS (gold) | 82.1 | 83.7 | 82.9±0.3 | 82.4 | 81.8 | 82.1±0.4 |
| Bi-LSTM + 2.2M Glove + CRF + GCN + PoS (gold) | 83.3 | 84.1 | 83.7±0.4 | 83.6 | 82.1 | 82.8±0.3 |
| Bi-LSTM + 2.2M Glove + CRF + GCN + PoS (inferred) | 83.8 | 82.9 | 83.4±0.4 | 82.2 | 80.5 | 81.4±0.3 |
| Bi-LSTM + 2.2M Glove + CRF + GCN + PoS (gold) + Morphology | 86.6 | 82.7 | 84.6±0.4 | 86.7 | 80.7 | 83.6±0.4 |
| Bi-LSTM + 2.2M Glove + CRF + GCN + PoS (inferred) + Morphology | 85.3 | 82.3 | 83.8±0.4 | 84.3 | 80.1 | 82.0±0.4 |
| Chiu and Nichols (Chiu and Nichols, 2015) | | | 84.6±0.3 | 86.0 | 86.5 | 86.3±0.3 |
| Ratinov and Roth (Ratinov and Roth, 2009) | | | | 82.0 | 84.9 | 83.4±0.0 |
| Finkel and Manning (Finkel and Manning, 2009) | | | | 84.0 | 80.9 | 82.4±0.0 |
| Durrett and Klein (Durrett and Klein, 2015) | | | | 85.2 | 82.9 | 84.0±0.0 |
The results show an improvement of 2.2±0.5% when using a GCN, compared to the baseline of a bi-directional LSTM alone (first row). When concatenating the gold PoS tag embeddings to the input vectors, this improvement rises to 4.6±0.6%. However, gold PoS tags cannot be used outside the OntoNotes 5.0 dataset. The F1 improvement when using tags inferred by the parser is lower: 3.2±0.6%.
For comparison, increasing the Glove vocabulary from 1M to 2.2M words gave an improvement of 0.7±0.5%. Adding the morphological information of the words, albeit truncated at 12 characters, improves the F1 score by 2.2±0.5%.
Our results strongly suggest that syntactic information is relevant for capturing the role of a word in a sentence, and that treating sentences as one-dimensional lists of words is only a partial approach. Sentences embed meaning through their internal graph structure: the graph convolutional approach – used in conjunction with a parser (or a treebank) – seems to provide a lightweight architecture that incorporates grammar while extracting named entities.
Our results – while competitive – fall short of the state of the art. We believe this is due to a few factors: we do not employ BIOES annotations for our tags, we ignore lexicon and capitalisation features, and we truncate words when encoding the morphological vectors. Our main claim nonetheless stands: grammatical information boosts the performance of entity recognition, leaving further improvements to be explored.
4 Related Work
There is a large body of work on named entity recognition, but few studies explicitly use non-local information for the task. An early work by Finkel et al. (Finkel et al., 2005) uses Gibbs sampling to capture long-distance structures that are common in language use. Another article by the same group (Finkel and Manning, 2009) uses a joint representation for constituency parsing and NER, improving both techniques. In addition, dependency structures have been used to boost the recognition of bio-medical events (McClosky et al., 2011) and for automatic content extraction (Li et al., 2013).
Recently, there has been a significant effort to improve the accuracy of classifiers by going beyond a flat vector representation of sentences. Notably, the work of Peng et al. (Peng et al., 2017) introduces graph LSTMs to encode the meaning of a sentence using dependency graphs. Similarly, Dhingra et al. (Dhingra et al., 2017) employ Gated Recurrent Units (GRUs) that encode the information of acyclic graphs to achieve state-of-the-art results in co-reference resolution.
5 Concluding Remarks
We showed that dependency trees play a positive role in entity recognition by using a GCN to boost the results of a bidirectional LSTM. In addition, we modified the standard graph convolutional architecture and introduced a bidirectional mechanism for convolving directed graphs. This model improves upon the LSTM baseline: our best result yielded an improvement of 4.6±0.6% in the F1 score, using a combination of GCN and PoS tag embeddings.

Finally, we showed that GCNs can be used in conjunction with other techniques: morphological information in the input vectors does not conflict with graph convolutions. Additional techniques, such as gating the components of the input vectors (Rei et al., 2016) or neighbouring word prediction (Rei, 2017), should be tested together with GCNs. We will investigate these directions in future work.
References
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado,
Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey
Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg,
Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit
Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Vié-
gas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015.
TensorFlow: Large-scale machine learning on heterogeneous systems. Software available from tensorflow.org.
http://tensorflow.org/.
Kris Cao and Marek Rei. 2016. A joint model for word embedding and word morphology. CoRR abs/1606.02601.
http://arxiv.org/abs/1606.02601.
Jason P. C. Chiu and Eric Nichols. 2015. Named entity recognition with bidirectional LSTM-CNNs. CoRR
abs/1511.08308. http://arxiv.org/abs/1511.08308.
Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011.
Natural language processing (almost) from scratch. Journal of Machine Learning Research 12(Aug):2493–2537.
Bhuwan Dhingra, Zhilin Yang, William W. Cohen, and Ruslan Salakhutdinov. 2017.
Linguistic knowledge as memory for recurrent neural networks. CoRR abs/1703.02620.
http://arxiv.org/abs/1703.02620.
Greg Durrett and Dan Klein. 2015. Neural CRF parsing. CoRR abs/1507.03641. http://arxiv.org/abs/1507.03641.
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005.
Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings
of the 43rd Annual Meeting on Association for Computational Linguistics. Association for Computational
Linguistics, Stroudsburg, PA, USA, ACL ’05, pages 363–370. https://doi.org/10.3115/1219840.1219885.
Jenny Rose Finkel and Christopher D. Manning. 2009. Joint parsing and named entity recognition. In Proceedings
of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the
Association for Computational Linguistics. Association for Computational Linguistics, Stroudsburg, PA, USA,
NAACL ’09, pages 326–334. http://dl.acm.org/citation.cfm?id=1620754.1620802.
Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing.
In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics, Lisbon, Portugal, pages 1373–1378. https://aclweb.org/anthology/D/D15/D15-1162.
Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. CoRR
abs/1508.01991. http://arxiv.org/abs/1508.01991.
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980.
http://arxiv.org/abs/1412.6980.
Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. CoRR
abs/1609.02907. http://arxiv.org/abs/1609.02907.
John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001.
Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings
of the Eighteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, ICML ’01, pages 282–289. http://dl.acm.org/citation.cfm?id=645530.655813.
Qi Li, Heng Ji, and Liang Huang. 2013. Joint event extraction via structured prediction with global features.
In Proceedings of the 51st Annual Meeting of the Association for Computational Linguis-
tics (Volume 1: Long Papers). Association for Computational Linguistics, pages 73–82.
http://aclanthology.coli.uni-saarland.de/pdf/P/P13/P13-1008.pdf.
Diego Marcheggiani and Ivan Titov. 2017. Encoding sentences with graph convolutional networks for semantic role labeling.
CoRR abs/1703.04826. http://arxiv.org/abs/1703.04826.
Andrew McCallum, Dayne Freitag, and Fernando C. N. Pereira. 2000.
Maximum entropy Markov models for information extraction and segmentation. In Proceedings of the Sev-
enteenth International Conference on Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, ICML ’00, pages 591–598. http://dl.acm.org/citation.cfm?id=645529.658277.
David McClosky, Mihai Surdeanu, and Christopher D. Manning. 2011. Event extraction as dependency parsing.
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language
Technologies - Volume 1. Association for Computational Linguistics, Stroudsburg, PA, USA, HLT ’11, pages
1626–1635. http://dl.acm.org/citation.cfm?id=2002472.2002667.
Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina Toutanova, and Wen-tau Yih. 2017.
Cross-sentence n-ary relation extraction with graph LSTMs. Transactions of the Association of Computational
Linguistics 5:101–115. http://aclanthology.coli.uni-saarland.de/pdf/Q/Q17/Q17-1008.pdf.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014.
Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing
(EMNLP). pages 1532–1543. http://www.aclweb.org/anthology/D14-1162.
L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.
http://cogcomp.org/papers/RatinovRo09.pdf.
Marek Rei. 2017. Semi-supervised multitask learning for sequence labeling. CoRR abs/1704.07156.
http://arxiv.org/abs/1704.07156.
Marek Rei, Gamal K. O. Crichton, and Sampo Pyysalo. 2016. Attending to characters in neural sequence labeling models.
CoRR abs/1611.04361. http://arxiv.org/abs/1611.04361.
Koichi Takeuchi and Nigel Collier. 2002. Use of support vector machines in extended named entity recognition. In
Proceedings of the 6th Conference on Natural Language Learning - Volume 20. Association for Computational
Linguistics, Stroudsburg, PA, USA, COLING-02, pages 1–7. https://doi.org/10.3115/1118853.1118882.
Ralph Weischedel et al. 2013. OntoNotes Release 5.0. Linguistic Data Consortium, Philadelphia, PA. LDC2013T19.
https://catalog.ldc.upenn.edu/docs/LDC2013T19/OntoNotes-Release-5.0.pdf.
A Configuration

| Parameter | Value |
|---|---|
| Glove word embeddings | 300 dim |
| PoS embedding | 15 dim |
| Morphological embedding | 20 dim |
| First dense layer | 40 dim |
| LSTM memory (2×) | 160 dim |
| Second dense layer | 160 dim |
| GCN layer (2×) | 160 dim |
| Final dense layer | 160 dim |
| Output layer | 16 dim |
| Dropout | 0.8 (keep probability) |