Classifying Relations by Ranking With Convolutional Neural Networks
The remainder of the paper is structured as follows. Section 2 details the proposed neural network. In Section 3, we present details about the setup of the experimental evaluation, and then describe the results in Section 4. In Section 5, we discuss previous work in deep neural networks for relation classification and for other NLP tasks. Section 6 presents our conclusions.
the concatenation of these two vectors, wpe_w = [wp_1, wp_2].

In the experiments where word position embeddings are used, the word embedding and the word position embedding of each word are concatenated to form the input for the convolutional layer, emb_x = {[r^{w_1}, wpe_{w_1}], [r^{w_2}, wpe_{w_2}], \ldots, [r^{w_N}, wpe_{w_N}]}.
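As an illustration of how this input is assembled, the following NumPy sketch concatenates each word embedding with its two position embeddings. This is a minimal sketch, not the paper's Theano implementation; the helper name and the distance clipping bound max_dist are assumptions of the illustration.

```python
import numpy as np

def build_conv_input(word_ids, e1_pos, e2_pos, word_emb, pos_emb, max_dist=30):
    """Concatenate each word's embedding with its position embeddings
    wpe_w = [wp1, wp2] (relative distances to the two target nouns).
    The clipping of distances to [-max_dist, max_dist] is an assumption
    of this sketch, not a detail specified in the paper."""
    rows = []
    for i, w in enumerate(word_ids):
        d1 = np.clip(i - e1_pos, -max_dist, max_dist) + max_dist
        d2 = np.clip(i - e2_pos, -max_dist, max_dist) + max_dist
        rows.append(np.concatenate([word_emb[w], pos_emb[d1], pos_emb[d2]]))
    return np.stack(rows)  # shape (N, d_w + 2 * d_wpe)
```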
2.3 Sentence Representation

The next step in the NN consists of creating the distributed vector representation r_x for the input sentence x. The main challenges in this step are the variable sentence size and the fact that important information can appear at any position in the sentence. In recent work, convolutional approaches have been used to tackle these issues when creating representations for text segments of different sizes (Zeng et al., 2014; Hu et al., 2014; dos Santos and Gatti, 2014) and character-level representations of words of different sizes (dos Santos and Zadrozny, 2014). Here, we use a convolutional layer to compute the distributed vector representation of the sentence. The convolutional layer first produces local features around each word in the sentence. Then, it combines these local features using a max operation to create a fixed-sized vector for the input sentence.

Given a sentence x, the convolutional layer applies a matrix-vector operation to each window of size k of successive words in emb_x = {r^{w_1}, r^{w_2}, \ldots, r^{w_N}}. Let us define the vector z_n ∈ R^{d^w k} as the concatenation of a sequence of k word embeddings, centralized in the n-th word:

    z_n = (r^{w_{n-(k-1)/2}}, \ldots, r^{w_{n+(k-1)/2}})^\top

In order to overcome the issue of referencing words with indices outside of the sentence boundaries, we augment the sentence with a special padding token replicated (k-1)/2 times at the beginning and the end.

The convolutional layer computes the j-th element of the vector r_x ∈ R^{d_c} as follows:

    [r_x]_j = \max_{1 \le n \le N} \, [f(W^1 z_n + b^1)]_j

where W^1 ∈ R^{d_c × d^w k} is the weight matrix of the convolutional layer and f is the hyperbolic tangent function. The same matrix is used to extract local features around each word window of the given sentence. The fixed-sized distributed vector representation of the sentence is obtained by taking the max over all word windows. The matrix W^1 and the vector b^1 are parameters to be learned. The number of convolutional units d_c and the size of the word context window k are hyperparameters to be chosen by the user. It is important to note that d_c corresponds to the size of the sentence representation.
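The convolution-plus-max step just described can be summarized in a few lines of NumPy. This is an illustrative sketch under the definitions above (window size k, tanh activation, max over windows); padding with zero vectors stands in for the special padding token, whose learned embedding is not specified here.

```python
import numpy as np

def conv_sentence_representation(emb, W1, b1, k):
    """Compute the sentence representation r_x described in Section 2.3.

    emb : array of shape (N, d), one (word + position) embedding per word
    W1  : weight matrix of shape (d_c, d * k)
    b1  : bias vector of shape (d_c,)
    k   : context window size (odd)
    """
    N, d = emb.shape
    pad = (k - 1) // 2
    # Augment the sentence with padding vectors (zeros here) at both ends.
    padded = np.vstack([np.zeros((pad, d)), emb, np.zeros((pad, d))])
    # Local features: tanh(W1 z_n + b1) for each window z_n of k embeddings.
    local = np.stack([
        np.tanh(W1 @ padded[n:n + k].reshape(-1) + b1)
        for n in range(N)
    ])
    # Max over all word windows yields a fixed-size vector of dimension d_c.
    return local.max(axis=0)
```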
2.4 Class Embeddings and Scoring

Given the distributed vector representation of the input sentence x, the network with parameter set θ computes the score for a class label c ∈ C by using the dot product

    s_\theta(x)_c = r_x^\top \, [W^{classes}]_c

where W^{classes} is an embedding matrix whose columns encode the distributed vector representations of the different class labels, and [W^{classes}]_c is the column vector that contains the embedding of the class c. Note that the number of dimensions in each class embedding must be equal to the size of the sentence representation, which is defined by d_c. The embedding matrix W^{classes} is a parameter to be learned by the network. It is initialized by randomly sampling each value from a uniform distribution U(-r, r), where r = \sqrt{6 / (|C| + d_c)}.
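Scoring then reduces to one matrix product with the class embedding matrix. A minimal sketch, including the initialization from the uniform distribution above:

```python
import numpy as np

def init_class_embeddings(d_c, n_classes, rng=np.random.default_rng(0)):
    # Sample W_classes from U(-r, r) with r = sqrt(6 / (|C| + d_c)).
    r = np.sqrt(6.0 / (n_classes + d_c))
    return rng.uniform(-r, r, size=(d_c, n_classes))

def class_scores(r_x, W_classes):
    # s_theta(x)_c = r_x^T [W_classes]_c, computed for every class at once.
    return r_x @ W_classes
```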
2.5 Training Procedure

Our network is trained by minimizing a pairwise ranking loss function over the training set D. The input for each training round is a sentence x and two different class labels y^+ ∈ C and c^- ∈ C, where y^+ is a correct class label for x and c^- is not. Let s_\theta(x)_{y^+} and s_\theta(x)_{c^-} be, respectively, the scores for class labels y^+ and c^- generated by the network with parameter set θ. We propose a new logistic loss function over these scores in order to train CR-CNN:

    L = \log(1 + \exp(\gamma(m^+ - s_\theta(x)_{y^+}))) + \log(1 + \exp(\gamma(m^- + s_\theta(x)_{c^-})))    (1)

where m^+ and m^- are margins and γ is a scaling factor that magnifies the difference between the score and the margin and helps to penalize prediction errors more heavily. The first term on the right side of Equation 1 decreases as the score s_\theta(x)_{y^+} increases. The second term on the right side decreases as the score s_\theta(x)_{c^-} decreases. Training CR-CNN by minimizing the loss function in Equation 1 has the effect of training the network to give scores greater than m^+ for the correct class and (negative) scores smaller than -m^- for incorrect classes. In our experiments we set γ to 2, m^+ to 2.5 and m^- to 0.5. We use L2 regularization by adding the term β‖θ‖² to Equation 1; in our experiments we set β to 0.001. We use stochastic gradient descent (SGD) to minimize the loss function with respect to θ.

Like some other ranking approaches that only update two classes/examples at every training round (Weston et al., 2011; Gao et al., 2014), we can efficiently train the network for tasks which have a very large number of classes. This is an advantage over softmax classifiers.

On the other hand, sampling informative negative classes/examples can have a significant impact on the effectiveness of the learned model. In the case of our loss function, the more informative negative classes are the ones with a score larger than -m^-. The number of classes in the relation classification dataset that we use in our experiments is small. Therefore, given a sentence x with class label y^+, the incorrect class c^- that we choose for the SGD step is the one with the highest score among all incorrect classes:

    c^- = \arg\max_{c \in C;\; c \neq y^+} s_\theta(x)_c

For tasks where the number of classes is large, we can fix a number of negative classes to be considered for each example and select the one with the largest score to perform a gradient step. This approach is similar to the one used by Weston et al. (2014) to select negative examples.

We use the backpropagation algorithm to compute the gradients of the network. In our experiments, we implement the CR-CNN architecture and the backpropagation algorithm using Theano (Bergstra et al., 2010).
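For concreteness, the ranking loss of Equation 1, together with the selection of the highest-scoring incorrect class, can be sketched as follows. This is a NumPy illustration using the hyperparameter values reported above; the actual implementation is in Theano.

```python
import numpy as np

def ranking_loss(scores, y_pos, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss of Equation 1 for one training example.

    scores : array of shape (|C|,) holding s_theta(x)_c for every class
    y_pos  : index of the correct class label y+
    """
    s_pos = scores[y_pos]
    # Choose the most informative negative class: the incorrect class
    # with the highest score, c- = argmax_{c != y+} s_theta(x)_c.
    masked = scores.copy()
    masked[y_pos] = -np.inf
    s_neg = masked.max()
    return (np.log1p(np.exp(gamma * (m_pos - s_pos)))
            + np.log1p(np.exp(gamma * (m_neg + s_neg))))
```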
2.6 Special Treatment of Artificial Classes

In this work, we consider a class to be artificial if it is used to group items that do not belong to any of the actual classes. An example of an artificial class is the class Other in the SemEval-2010 relation classification task. In this task, the artificial class Other is used to indicate that the relation between two nominals does not belong to any of the nine relation classes of interest. Therefore, the class Other is very noisy, since it groups many different types of relations that may not have much in common.

An important characteristic of CR-CNN is that it makes it easy to reduce the effect of artificial classes by omitting their embeddings. If the embedding of a class label c is omitted, the embedding matrix W^{classes} does not contain a column vector for c. One of the main benefits of this strategy is that the learning process focuses on the "natural" classes only. Since the embedding of the artificial class is omitted, it does not influence the prediction step, i.e., CR-CNN does not produce a score for the artificial class.

In our experiments with the SemEval-2010 relation classification task, when training with a sentence x whose class label is y = Other, the first term on the right side of Equation 1 is set to zero. At prediction time, a relation is classified as Other only if all actual classes have negative scores. Otherwise, it is classified with the class that has the largest score.
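The resulting prediction rule is simple. A sketch, assuming scores holds the scores of the nine actual classes only:

```python
def predict(scores, class_names):
    # Classify as Other only if every actual class has a negative score;
    # otherwise pick the class with the largest score (Section 2.6).
    if scores.max() < 0:
        return "Other"
    return class_names[int(scores.argmax())]
```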
3 Experimental Setup

3.1 Dataset and Evaluation Metric

We use the SemEval-2010 Task 8 dataset to perform our experiments. This dataset contains 10,717 examples annotated with 9 different relation types and an artificial relation Other, which is used to indicate that the relation in the example does not belong to any of the nine main relation types. The nine relations are Cause-Effect, Component-Whole, Content-Container, Entity-Destination, Entity-Origin, Instrument-Agency, Member-Collection, Message-Topic and Product-Producer. Each example contains a sentence marked with two nominals e1 and e2, and the task consists of predicting the relation between the two nominals taking the directionality into consideration. That means that the relation Cause-Effect(e1,e2) is different from the relation Cause-Effect(e2,e1), as shown in the examples below. More information about this dataset can be found in (Hendrickx et al., 2010).

    The [war]e1 resulted in other collateral imperial [conquests]e2 as well. ⇒ Cause-Effect(e1,e2)

    The [burst]e1 has been caused by water hammer [pressure]e2. ⇒ Cause-Effect(e2,e1)

The SemEval-2010 Task 8 dataset is already partitioned into 8,000 training instances and 2,717 test instances. We score our systems by using the SemEval-2010 Task 8 official scorer, which computes the macro-averaged F1-scores for the nine actual relations (excluding Other) and takes the directionality into consideration.
3.2 Word Embeddings Initialization

The word embeddings used in our experiments are initialized by means of unsupervised pre-training. We perform pre-training using the skip-gram NN architecture (Mikolov et al., 2013) available in the word2vec tool. We use the December 2013 snapshot of the English Wikipedia corpus to train word embeddings with word2vec. We preprocess the Wikipedia text using the steps described in (dos Santos and Gatti, 2014): (1) removal of paragraphs that are not in English; (2) substitution of non-western characters with a special character; (3) tokenization of the text using the tokenizer available with the Stanford POS Tagger (Toutanova et al., 2003); (4) removal of sentences that are less than 20 characters long (including white spaces) or have fewer than 5 tokens; (5) lowercasing of all words and substitution of each numerical digit by a 0. The resulting clean corpus contains about 1.75 billion tokens.
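Steps (4) and (5) of this pipeline can be illustrated with a short Python function; the regular expression and the helper name are ours, not part of the original preprocessing scripts:

```python
import re

def clean_sentence(sentence):
    """Apply steps (4) and (5): drop sentences shorter than 20 characters
    (including white spaces) or with fewer than 5 tokens, then lowercase
    and map every digit to 0. Returns None for discarded sentences."""
    if len(sentence) < 20 or len(sentence.split()) < 5:
        return None
    return re.sub(r"\d", "0", sentence.lower())
```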
3.3 Neural Network Hyperparameters

We use 4-fold cross-validation to tune the neural network hyperparameters. Learning rates in the range between 0.03 and 0.01 give relatively similar results. The best results are achieved using between 10 and 15 training epochs, depending on the CR-CNN configuration. In Table 1, we show the selected hyperparameter values. Additionally, we use a learning rate schedule that decreases the learning rate λ according to the training epoch t. The learning rate for epoch t is computed as \lambda_t = \lambda / t.

Parameter | Parameter Name | Value
d^w | Word Emb. size | 400
d^{wpe} | Word Pos. Emb. size | 70
d_c | Convolutional Units | 1000
k | Context Window size | 3
λ | Initial Learning Rate | 0.025

Table 1: CR-CNN hyperparameters.
4 Experimental Results

4.1 Word Position Embeddings and Input Text Span

In the experiments discussed in this section we assess the impact of using word position embeddings (WPEs) and also propose a simpler alternative approach that is almost as effective as WPEs. The main idea behind the use of WPEs in the relation classification task is to give the convolutional layer some hint of how close a word is to the target nouns, based on the assumption that closer words have more impact than distant words.

Here we hypothesize that most of the information needed to classify the relation appears between the two target nouns. Based on this hypothesis, we perform an experiment where the input for the convolutional layer consists of the word embeddings of the word sequence \{w_{e_1 - 1}, \ldots, w_{e_2 + 1}\}, where e_1 and e_2 correspond to the positions of the first and the second target nouns, respectively.

In Table 2 we compare the results of different CR-CNN configurations. The first column indicates whether the full sentence was used (Yes) or whether the text span between the target nouns was used (No). The second column indicates whether WPEs were used. It is clear that the use of WPEs is essential when the full sentence is used, since F1 jumps from 74.3 to 84.1. This effect of WPEs is also reported by Zeng et al. (2014). On the other hand, when using only the text span between the target nouns, the impact of WPEs is much smaller. With this strategy, we achieve an F1 of 82.8 using only word embeddings as input, which is as good as the previous state-of-the-art F1 of 83.0 reported by Yu et al. (2014) for the SemEval-2010 Task 8 dataset. This experimental result also suggests that, in this task, the CNN works better for short texts.

All experiments reported in the next sections use CR-CNN with the full sentence and WPEs.

Full Sentence | Word Position | Prec. | Rec. | F1
Yes | Yes | 83.7 | 84.7 | 84.1
No | Yes | 83.3 | 83.9 | 83.5
No | No | 83.4 | 82.3 | 82.8
Yes | No | 78.1 | 71.5 | 74.3

Table 2: Comparison of different CR-CNN configurations.
4.2 Impact of Omitting the Embedding of the Artificial Class Other

In this experiment we assess the impact of omitting the embedding of the class Other. As we mentioned above, this class is very noisy, since it groups many different infrequent relation types. Its embedding is difficult to define and therefore brings noise into the classification process of the natural classes. In Table 3 we present results comparing the use and the omission of an embedding for the class Other. The first two lines of results present the official F1, which does not take into account the results for the class Other. We can see that by omitting the embedding of the class Other both precision and recall for the other classes improve, which results in an increase of 1.4 in F1. These results suggest that the strategy we use in CR-CNN to avoid the noise of artificial classes is effective.

Use embedding of class Other | Class | Prec. | Rec. | F1
No | All | 83.7 | 84.7 | 84.1
Yes | All | 81.3 | 84.3 | 82.7
No | Other | 52.0 | 48.7 | 50.3
Yes | Other | 60.1 | 48.7 | 53.8

Table 3: Impact of not using an embedding for the artificial class Other.

4.3 CR-CNN versus CNN+Softmax

In this experiment we compare CR-CNN with a version of the network in which the sentence representation is fed into a softmax classifier (CNN+Softmax). We tune the parameters of CNN+Softmax by using 4-fold cross-validation with the training set. Compared to the hyperparameter values for CR-CNN presented in Table 1, the only difference for CNN+Softmax is the number of convolutional units d_c, which is set to 400.

In Table 4 we compare the results of CR-CNN and CNN+Softmax. CR-CNN outperforms CNN+Softmax in both precision and recall, and improves the F1 by 1.6. The third line in Table 4 shows the result reported by Zeng et al. (2014) when only word embeddings and WPEs are used as input to the network (similar to our CNN+Softmax). We believe that the word embeddings they employed are the main reason their result is much worse than that of CNN+Softmax. We use word embeddings of size 400, while they use word embeddings of size 50, which were trained using much less unlabeled data than we did.

Neural Net. | Prec. | Rec. | F1
CR-CNN | 83.7 | 84.7 | 84.1
CNN+SoftMax | 82.1 | 83.1 | 82.5
CNN+SoftMax (Zeng et al., 2014) | - | - | 78.9

Table 4: Comparison of results of CR-CNN and CNN+Softmax.
4.4 Comparison with the State of the Art

In Table 5 we compare CR-CNN with results previously reported for the SemEval-2010 Task 8 dataset. Yu et al. (2014) propose a Factor-based Compositional Embedding Model (FCM), which achieves an F1 of 83.0 by deriving sentence-level and substructure embeddings from word embeddings, utilizing dependency trees and named entities.

As we can see in the last line of Table 5, CR-CNN using the full sentence, word embeddings and WPEs outperforms all previously reported results and reaches a new state-of-the-art F1 of 84.1. This is a remarkable result, since we do not use any complicated features that depend on external lexical resources such as WordNet or on NLP tools such as named entity recognizers (NERs) and dependency parsers.

We can see in Table 5 that CR-CNN¹ also achieves the best result among the systems that use word embeddings as the only input features. The closest result (80.6), which is produced by the FCM system of Yu et al. (2014), is 2.2 F1 points behind the CR-CNN result (82.8).

Classifier | Feature Set | F1
SVM (Rink and Harabagiu, 2010) | POS, prefixes, morphological, WordNet, dependency parse, Levin classes, PropBank, FrameNet, NomLex-Plus, Google n-gram, paraphrases, TextRunner | 82.2
RNN (Socher et al., 2012) | word embeddings | 74.8
RNN (Socher et al., 2012) | word embeddings, POS, NER, WordNet | 77.6
MVRNN (Socher et al., 2012) | word embeddings | 79.1
MVRNN (Socher et al., 2012) | word embeddings, POS, NER, WordNet | 82.4
CNN+Softmax (Zeng et al., 2014) | word embeddings | 69.7
CNN+Softmax (Zeng et al., 2014) | word embeddings, word position embeddings, word pair, words around word pair, WordNet | 82.7
FCM (Yu et al., 2014) | word embeddings | 80.6
FCM (Yu et al., 2014) | word embeddings, dependency parse, NER | 83.0
CR-CNN | word embeddings | 82.8
CR-CNN | word embeddings, word position embeddings | 84.1

Table 5: Comparison with results previously published for the SemEval-2010 Task 8 dataset.

¹ This is the result using only the text span between the target nouns.
4.5 Most Representative Trigrams for Each Relation

In Table 6, for each relation type we present the five trigrams in the test set which contributed the most to the scoring of correctly classified examples. Remember that in CR-CNN, given a sentence x, the score for the class c is computed by s_\theta(x)_c = r_x^\top [W^{classes}]_c. In order to compute the most representative trigram of a sentence x, we trace back each position in r_x to find the trigram responsible for it. For each trigram t, we compute its particular contribution to the score by summing the terms of the score that use positions in r_x which trace back to t. The most representative trigram in x is the one with the largest contribution to the score. In order to create the results presented in Table 6, we rank the trigrams which were selected as the most representative of any sentence in decreasing order of contribution value. If a trigram appears as the largest contributor for more than one sentence, its contribution value becomes the sum of its contributions for each sentence.

We can see in Table 6 that for most classes, the trigrams that contributed the most to increasing the score are indeed very informative regarding the relation type. As expected, different trigrams play an important role depending on the direction of the relation. For instance, the most informative trigram for Entity-Origin(e1,e2) is "away from the", while the reverse direction of the relation, Entity-Origin(e2,e1) or Origin-Entity, has "the source of" as its most informative trigram. These results are a step towards the extraction of meaningful knowledge from models produced by CNNs.
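The trace-back can be sketched as follows, given the matrix of per-window convolutional outputs for a sentence. The function and variable names, and the handling of the k = 3 window, are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def most_representative_trigram(local, W_classes, c, words):
    """local     : (N, d_c) array of tanh(W1 z_n + b1) for each word window
    W_classes : (d_c, |C|) class embedding matrix
    c         : index of the class whose score we decompose
    words     : the N tokens of the sentence (window size k = 3 assumed)."""
    winners = local.argmax(axis=0)        # window behind each [r_x]_j
    r_x = local.max(axis=0)
    terms = r_x * W_classes[:, c]         # per-dimension terms of the score
    totals = {}
    for j, n in enumerate(winners):
        trigram = " ".join(words[max(0, n - 1):n + 2])
        totals[trigram] = totals.get(trigram, 0.0) + terms[j]
    # The most representative trigram is the largest total contributor.
    return max(totals, key=totals.get)
```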
5 Related Work

Over the years, various approaches have been proposed for relation classification (Zhang, 2004; Qian et al., 2009; Hendrickx et al., 2010; Rink and Harabagiu, 2010). Most of them treat it as a multi-class classification problem and apply a variety of machine learning techniques to the task in order to achieve high accuracy.

Recently, deep learning (Bengio, 2009) has become an attractive area for multiple applications, including computer vision, speech recognition and natural language processing. Among the different deep learning strategies, convolutional neural networks have been successfully applied to different NLP tasks such as part-of-speech tagging (dos Santos and Zadrozny, 2014), sentiment analysis (Kim, 2014; dos Santos and Gatti, 2014), question classification (Kalchbrenner et al., 2014), semantic role labeling (Collobert et al., 2011), hashtag prediction (Weston et al., 2014), and sentence completion and response matching (Hu et al., 2014).

Some recent work on deep learning for relation classification includes Socher et al. (2012), Zeng et al. (2014) and Yu et al. (2014). In (Socher et al., 2012), the authors tackle relation classification using a recursive neural network (RNN) that assigns a matrix-vector representation to every node in a parse tree. The representation for the complete sentence is computed bottom-up by recursively combining the words according to the syntactic structure of the parse tree. Their method is named the matrix-vector recursive neural network (MVRNN).

Zeng et al. (2014) propose an approach for relation classification where sentence-level features are learned through a CNN, which has word embedding and position features as its input. In parallel, lexical features are extracted according to given nouns. Then sentence-level and lexical features are concatenated into a single vector and fed into a softmax classifier for prediction. This approach achieves state-of-the-art performance on the SemEval-2010 Task 8 dataset.

Yu et al. (2014) propose a Factor-based Compositional Embedding Model (FCM) that derives sentence-level and substructure embeddings from word embeddings, utilizing dependency trees and named entities.
Acknowledgments

The authors would like to thank Nina Wacholder for her valuable suggestions to improve the final version of the paper.

References

Yoshua Bengio. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127.

James Bergstra, Olivier Breuleux, Frédéric Bastien, Pascal Lamblin, Razvan Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley, and Yoshua Bengio. 2010. Theano: a CPU and GPU math expression compiler. In Proceedings of the Python for Scientific Computing Conference.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12:2493–2537.

Cícero Nogueira dos Santos and Maíra Gatti. 2014. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), Dublin, Ireland.

Cícero Nogueira dos Santos and Bianca Zadrozny. 2014. Learning character-level representations for part-of-speech tagging. In Proceedings of the 31st International Conference on Machine Learning (ICML), JMLR: W&CP volume 32, Beijing, China.

Jianfeng Gao, Patrick Pantel, Michael Gamon, Xiaodong He, and Li Deng. 2014. Modeling interestingness with deep neural networks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP).

Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38.

Baotian Hu, Zhengdong Lu, Hang Li, and Qingcai Chen. 2014. Convolutional neural network architectures for matching natural language sentences. In Proceedings of the Conference on Neural Information Processing Systems, pages 2042–2050.

Nal Kalchbrenner, Edward Grefenstette, and Phil Blunsom. 2014. A convolutional neural network for modelling sentences. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, pages 655–665, Baltimore, Maryland.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1746–1751, Doha, Qatar.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. In Proceedings of the Workshop at ICLR.

Longhua Qian, Guodong Zhou, Fang Kong, and Qiaoming Zhu. 2009. Semi-supervised learning for semantic relation classification using stratified sampling strategy. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1437–1445.

Bryan Rink and Sanda Harabagiu. 2010. UTD: Classifying semantic relations by combining lexical and semantic resources. In Proceedings of the International Workshop on Semantic Evaluation, pages 256–259.

Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211.

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 173–180.

Jason Weston, Samy Bengio, and Nicolas Usunier. 2011. WSABIE: Scaling up to large vocabulary image annotation. In Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, pages 2764–2770.

Jason Weston, Sumit Chopra, and Keith Adams. 2014. #TagSpace: Semantic embeddings from hashtags. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1822–1827.

Mo Yu, Matthew Gormley, and Mark Dredze. 2014. Factor-based compositional embedding models. In Proceedings of the 2nd Workshop on Learning Semantics, Montreal, Canada.

Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, and Jun Zhao. 2014. Relation classification via convolutional deep neural network. In Proceedings of the 25th International Conference on Computational Linguistics (COLING), pages 2335–2344, Dublin, Ireland.

Zhu Zhang. 2004. Weakly-supervised relation classification for information extraction. In Proceedings of the ACM International Conference on Information and Knowledge Management, pages 581–588, New York, NY, USA.