Convolutional Neural Networks for Sentence Classification

Yoon Kim
New York University
yhk255@nyu.edu

arXiv:1408.5882v2 [cs.CL] 3 Sep 2014

Abstract

We report on a series of experiments with convolutional neural networks (CNN) trained on top of pre-trained word vectors for sentence-level classification tasks. We show that a simple CNN with little hyperparameter tuning and static vectors achieves excellent results on multiple benchmarks. Learning task-specific vectors through fine-tuning offers further gains in performance. We additionally propose a simple modification to the architecture to allow for the use of both task-specific and static vectors. The CNN models discussed herein improve upon the state of the art on 4 out of 7 tasks, which include sentiment analysis and question classification.

1 Introduction

Deep learning models have achieved remarkable results in computer vision (Krizhevsky et al., 2012) and speech recognition (Graves et al., 2013) in recent years. Within natural language processing, much of the work with deep learning methods has involved learning word vector representations through neural language models (Bengio et al., 2003; Yih et al., 2011; Mikolov et al., 2013) and performing composition over the learned word vectors for classification (Collobert et al., 2011). Word vectors, wherein words are projected from a sparse, 1-of-V encoding (here V is the vocabulary size) onto a lower dimensional vector space via a hidden layer, are essentially feature extractors that encode semantic features of words in their dimensions. In such dense representations, semantically close words are likewise close (in euclidean or cosine distance) in the lower dimensional vector space.

Convolutional neural networks (CNN) utilize layers with convolving filters that are applied to local features (LeCun et al., 1998). Originally invented for computer vision, CNN models have subsequently been shown to be effective for NLP and have achieved excellent results in semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling (Kalchbrenner et al., 2014), and other traditional NLP tasks (Collobert et al., 2011).

In the present work, we train a simple CNN with one layer of convolution on top of word vectors obtained from an unsupervised neural language model. These vectors were trained by Mikolov et al. (2013) on 100 billion words of Google News, and are publicly available.¹ We initially keep the word vectors static and learn only the other parameters of the model. Despite little tuning of hyperparameters, this simple model achieves excellent results on multiple benchmarks, suggesting that the pre-trained vectors are ‘universal’ feature extractors that can be utilized for various classification tasks. Learning task-specific vectors through fine-tuning results in further improvements. We finally describe a simple modification to the architecture to allow for the use of both pre-trained and task-specific vectors by having multiple channels.

Our work is philosophically similar to Razavian et al. (2014) which showed that for image classification, feature extractors obtained from a pre-trained deep learning model perform well on a variety of tasks, including tasks that are very different from the original task for which the feature extractors were trained.

¹ https://code.google.com/p/word2vec/

2 Model

The model architecture, shown in figure 1, is a slight variant of the CNN architecture of Collobert et al. (2011). Let $x_i \in \mathbb{R}^k$ be the k-dimensional word vector corresponding to the i-th word in the sentence. A sentence of length n (padded where necessary) is represented as

    $x_{1:n} = x_1 \oplus x_2 \oplus \ldots \oplus x_n$,    (1)

where $\oplus$ is the concatenation operator. In general, let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \ldots, x_{i+j}$. A convolution operation involves a filter $w \in \mathbb{R}^{hk}$, which is applied to a window of h words to produce a new feature. For example, a feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by

    $c_i = f(w \cdot x_{i:i+h-1} + b)$.    (2)

Here $b \in \mathbb{R}$ is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sentence $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce a feature map

    $c = [c_1, c_2, \ldots, c_{n-h+1}]$,    (3)

with $c \in \mathbb{R}^{n-h+1}$. We then apply a max-over-time pooling operation (Collobert et al., 2011) over the feature map and take the maximum value $\hat{c} = \max\{c\}$ as the feature corresponding to this particular filter. The idea is to capture the most important feature (the one with the highest value) for each feature map. This pooling scheme naturally deals with variable sentence lengths.

[Figure 1: Model architecture with two channels for an example sentence. The example sentence is "wait for the video and do n't rent it". Stages shown, left to right: n x k representation of sentence with static and non-static channels; convolutional layer with multiple filter widths and feature maps; max-over-time pooling; fully connected layer with dropout and softmax output.]

We have described the process by which one feature is extracted from one filter. The model uses multiple filters (with varying window sizes) to obtain multiple features. These features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over labels.

In one of the model variants, we experiment with having two ‘channels’ of word vectors: one that is kept static throughout training and one that is fine-tuned via backpropagation (section 3.2).² In the multichannel architecture, illustrated in figure 1, each filter is applied to both channels and the results are added to calculate $c_i$ in equation (2). The model is otherwise equivalent to the single channel architecture.

² We employ language from computer vision where a color image has red, green, and blue channels.

2.1 Regularization

For regularization we employ dropout on the penultimate layer with a constraint on $l_2$-norms of the weight vectors (Hinton et al., 2012). Dropout prevents co-adaptation of hidden units by randomly dropping out (i.e., setting to zero) a proportion p of the hidden units during forward-backpropagation. That is, given the penultimate layer $z = [\hat{c}_1, \ldots, \hat{c}_m]$ (note that here we have m filters), instead of using

    $y = w \cdot z + b$    (4)

for output unit y in forward propagation, dropout uses

    $y = w \cdot (z \circ r) + b$,    (5)

where $\circ$ is the element-wise multiplication operator and $r \in \mathbb{R}^m$ is a ‘masking’ vector of Bernoulli random variables with probability p of being 1. Gradients are backpropagated only through the unmasked units. At test time, the learned weight vectors are scaled by p such that $\hat{w} = pw$, and $\hat{w}$ is used (without dropout) to score unseen sentences. We additionally constrain $l_2$-norms of the weight vectors by rescaling w to have $\|w\|_2 = s$ whenever $\|w\|_2 > s$ after a gradient descent step.
Data     c    l    N       |V|     |V_pre|   Test
MR       2    20   10662   18765   16448     CV
SST-1    5    18   11855   17836   16262     2210
SST-2    2    19   9613    16185   14838     1821
Subj     2    23   10000   21323   17913     CV
TREC     6    10   5952    9592    9125      500
CR       2    19   3775    5340    5046      CV
MPQA     2    3    10606   6246    6083      CV

Table 1: Summary statistics for the datasets after tokenization. c: Number of target classes. l: Average sentence length. N: Dataset size. |V|: Vocabulary size. |V_pre|: Number of words present in the set of pre-trained word vectors. Test: Test set size (CV means there was no standard train/test split and thus 10-fold CV was used).

3 Datasets and Experimental Setup

We test our model on various benchmarks. Summary statistics of the datasets are in table 1.

• MR: Movie reviews with one sentence per review. Classification involves detecting positive/negative reviews (Pang and Lee, 2005).³

• SST-1: Stanford Sentiment Treebank, an extension of MR but with train/dev/test splits provided and fine-grained labels (very positive, positive, neutral, negative, very negative), re-labeled by Socher et al. (2013).⁴

• SST-2: Same as SST-1 but with neutral reviews removed and binary labels.

• Subj: Subjectivity dataset where the task is to classify a sentence as being subjective or objective (Pang and Lee, 2004).

• TREC: TREC question dataset; the task involves classifying a question into 6 question types (whether the question is about person, location, numeric information, etc.) (Li and Roth, 2002).⁵

• CR: Customer reviews of various products (cameras, MP3s etc.). Task is to predict positive/negative reviews (Hu and Liu, 2004).⁶

• MPQA: Opinion polarity detection subtask of the MPQA dataset (Wiebe et al., 2005).⁷

³ https://www.cs.cornell.edu/people/pabo/movie-review-data/
⁴ http://nlp.stanford.edu/sentiment/ Data is actually provided at the phrase-level and hence we train the model on both phrases and sentences but only score on sentences at test time, as in Socher et al. (2013), Kalchbrenner et al. (2014), and Le and Mikolov (2014). Thus the training set is an order of magnitude larger than listed in table 1.
⁵ http://cogcomp.cs.illinois.edu/Data/QA/QC/
⁶ http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html
⁷ http://www.cs.pitt.edu/mpqa/

3.1 Hyperparameters and Training

For all datasets we use: rectified linear units, filter windows (h) of 3, 4, 5 with 100 feature maps each, dropout rate (p) of 0.5, $l_2$ constraint (s) of 3, and mini-batch size of 50. These values were chosen via a grid search on the SST-2 dev set.

We do not otherwise perform any dataset-specific tuning other than early stopping on dev sets. For datasets without a standard dev set we randomly select 10% of the training data as the dev set. Training is done through stochastic gradient descent over shuffled mini-batches with the Adadelta update rule (Zeiler, 2012).

3.2 Pre-trained Word Vectors

Initializing word vectors with those obtained from an unsupervised neural language model is a popular method to improve performance in the absence of a large supervised training set (Collobert et al., 2011; Socher et al., 2011; Iyyer et al., 2014). We use the publicly available word2vec vectors that were trained on 100 billion words from Google News. The vectors have dimensionality of 300 and were trained using the continuous bag-of-words architecture (Mikolov et al., 2013). Words not present in the set of pre-trained words are initialized randomly.

3.3 Model Variations

We experiment with several variants of the model.

• CNN-rand: Our baseline model where all words are randomly initialized and then modified during training.

• CNN-static: A model with pre-trained vectors from word2vec. All words, including the unknown ones that are randomly initialized, are kept static and only the other parameters of the model are learned.

• CNN-non-static: Same as above but the pre-trained vectors are fine-tuned for each task.

• CNN-multichannel: A model with two sets of word vectors. Each set of vectors is treated as a ‘channel’ and each filter is applied to both channels, but gradients are backpropagated only through one of the channels. Hence the model is able to fine-tune one set of vectors while keeping the other static. Both channels are initialized with word2vec.

In order to disentangle the effect of the above variations versus other random factors, we eliminate other sources of randomness (CV-fold assignment, initialization of unknown word vectors, initialization of CNN parameters) by keeping them uniform within each dataset.

4 Results and Discussion

Results of our models against other methods are listed in table 2. Our baseline model with all randomly initialized words (CNN-rand) does not perform well on its own. While we had expected performance gains through the use of pre-trained vectors, we were surprised at the magnitude of the gains. Even a simple model with static vectors (CNN-static) performs remarkably well, giving competitive results against the more sophisticated deep learning models that utilize complex pooling schemes (Kalchbrenner et al., 2014) or require parse trees to be computed beforehand (Socher et al., 2013). These results suggest that the pre-trained vectors are good, ‘universal’ feature extractors and can be utilized across datasets. Fine-tuning the pre-trained vectors for each task gives still further improvements (CNN-non-static).

Model                                   MR     SST-1  SST-2  Subj   TREC   CR     MPQA
CNN-rand                                76.1   45.0   82.7   89.6   91.2   79.8   83.4
CNN-static                              81.0   45.5   86.8   93.0   92.8   84.7   89.6
CNN-non-static                          81.5   48.0   87.2   93.4   93.6   84.3   89.5
CNN-multichannel                        81.1   47.4   88.1   93.2   92.2   85.0   89.4
RAE (Socher et al., 2011)               77.7   43.2   82.4   -      -      -      86.4
MV-RNN (Socher et al., 2012)            79.0   44.4   82.9   -      -      -      -
RNTN (Socher et al., 2013)              -      45.7   85.4   -      -      -      -
DCNN (Kalchbrenner et al., 2014)        -      48.5   86.8   -      93.0   -      -
Paragraph-Vec (Le and Mikolov, 2014)    -      48.7   87.8   -      -      -      -
CCAE (Hermann and Blunsom, 2013)        77.8   -      -      -      -      -      87.2
Sent-Parser (Dong et al., 2014)         79.5   -      -      -      -      -      86.3
NBSVM (Wang and Manning, 2012)          79.4   -      -      93.2   -      81.8   86.3
MNB (Wang and Manning, 2012)            79.0   -      -      93.6   -      80.0   86.3
G-Dropout (Wang and Manning, 2013)      79.0   -      -      93.4   -      82.1   86.1
F-Dropout (Wang and Manning, 2013)      79.1   -      -      93.6   -      81.9   86.3
Tree-CRF (Nakagawa et al., 2010)        77.3   -      -      -      -      81.4   86.1
CRF-PR (Yang and Cardie, 2014)          -      -      -      -      -      82.7   -
SVM_S (Silva et al., 2011)              -      -      -      -      95.0   -      -

Table 2: Results of our CNN models against other methods. RAE: Recursive Autoencoders with pre-trained word vectors from Wikipedia (Socher et al., 2011). MV-RNN: Matrix-Vector Recursive Neural Network with parse trees (Socher et al., 2012). RNTN: Recursive Neural Tensor Network with tensor-based feature function and parse trees (Socher et al., 2013). DCNN: Dynamic Convolutional Neural Network with k-max pooling (Kalchbrenner et al., 2014). Paragraph-Vec: Logistic regression on top of paragraph vectors (Le and Mikolov, 2014). CCAE: Combinatorial Category Autoencoders with combinatorial category grammar operators (Hermann and Blunsom, 2013). Sent-Parser: Sentiment analysis-specific parser (Dong et al., 2014). NBSVM, MNB: Naive Bayes SVM and Multinomial Naive Bayes with uni-bigrams from Wang and Manning (2012). G-Dropout, F-Dropout: Gaussian Dropout and Fast Dropout from Wang and Manning (2013). Tree-CRF: Dependency tree with Conditional Random Fields (Nakagawa et al., 2010). CRF-PR: Conditional Random Fields with Posterior Regularization (Yang and Cardie, 2014). SVM_S: SVM with uni-bi-trigrams, wh word, head word, POS, parser, hypernyms, and 60 hand-coded rules as features from Silva et al. (2011).

4.1 Multichannel vs. Single Channel Models

We had initially hoped that the multichannel architecture would prevent overfitting (by ensuring that the learned vectors do not deviate too far from the original values) and thus work better than the single channel model, especially on smaller datasets. The results, however, are mixed, and further work on regularizing the fine-tuning process is warranted. For instance, instead of using an additional channel for the non-static portion, one could maintain a single channel but employ extra dimensions that are allowed to be modified during training.
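As a rough illustration of the multichannel variant compared in this section, the sketch below applies one filter to a static and a non-static channel, adds the two results before the non-linearity, and then updates only the non-static vectors. It is a toy under stated assumptions: all names and numbers are hypothetical, the non-linearity is a rectified linear unit as in section 3.1, and the update simply increases the pooled feature itself rather than decreasing a classification loss as actual training would.

```python
def relu(a):
    return a if a > 0.0 else 0.0

def flatten(words):
    """Concatenate a window of word vectors into one flat list."""
    return [v for word in words for v in word]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def multichannel_feature(static, non_static, w, b):
    """Return (c_hat, i, pre): pooled feature, its window index, its pre-activation."""
    k = len(static[0])
    h = len(w) // k
    best = None
    for i in range(len(static) - h + 1):
        # The same filter is applied to both channels and the results are added.
        pre = (dot(w, flatten(static[i:i + h]))
               + dot(w, flatten(non_static[i:i + h])) + b)
        if best is None or relu(pre) > best[0]:
            best = (relu(pre), i, pre)
    return best

def update_non_static(static, non_static, w, b, lr):
    """Toy update: move only the non-static vectors along d(c_hat)/d(x)."""
    k = len(static[0])
    h = len(w) // k
    _, i, pre = multichannel_feature(static, non_static, w, b)
    if pre > 0.0:  # the ReLU passes gradient only for positive pre-activations
        for j in range(h):
            for d in range(k):
                non_static[i + j][d] += lr * w[j * k + d]

# Both channels start from the same (here made-up) vectors.
static = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
non_static = [list(v) for v in static]
w, b = [0.5, 0.5, 0.5, 0.5], 0.0  # one filter with h = 2, k = 2

before, _, _ = multichannel_feature(static, non_static, w, b)
update_non_static(static, non_static, w, b, lr=0.1)
after, _, _ = multichannel_feature(static, non_static, w, b)
```

After the update the static channel is unchanged while the non-static vectors in the selected window have moved, so the two channels no longer coincide.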
4.2 Static vs. Non-static Representations

As is the case with the single channel non-static model, the multichannel model is able to fine-tune the non-static channel to make it more specific to the task-at-hand. For example, good is most similar to bad in word2vec, presumably because they are (almost) syntactically equivalent. But for vectors in the non-static channel that were fine-tuned on the SST-2 dataset, this is no longer the case (table 3). Similarly, good is arguably closer to nice than it is to great for expressing sentiment, and this is indeed reflected in the learned vectors.

For (randomly initialized) tokens not in the set of pre-trained vectors, fine-tuning allows them to learn more meaningful representations: the network learns that exclamation marks are associated with effusive expressions and that commas are conjunctive (table 3).

         Most Similar Words for
Word     Static Channel    Non-static Channel
bad      good              terrible
         terrible          horrible
         horrible          lousy
         lousy             stupid
good     great             nice
         bad               decent
         terrific          solid
         decent            terrific
n’t      os                not
         ca                never
         ireland           nothing
         wo                neither
!        2,500             2,500
         entire            lush
         jez               beautiful
         changer           terrific
,        decasia           but
         abysmally         dragon
         demise            a
         valiant           and

Table 3: Top 4 neighboring words (based on cosine similarity) for vectors in the static channel (left) and fine-tuned vectors in the non-static channel (right) from the multichannel model on the SST-2 dataset after training.

4.3 Further Observations

We report on some further experiments and observations:

• Kalchbrenner et al. (2014) report much worse results with a CNN that has essentially the same architecture as our single channel model. For example, their Max-TDNN (Time Delay Neural Network) with randomly initialized words obtains 37.4% on the SST-1 dataset, compared to 45.0% for our model. We attribute such discrepancy to our CNN having much more capacity (multiple filter widths and feature maps).

• Dropout proved to be such a good regularizer that it was fine to use a larger than necessary network and simply let dropout regularize it. Dropout consistently added 2%–4% relative performance.

• When randomly initializing words not in word2vec, we obtained slight improvements by sampling each dimension from $U[-a, a]$ where a was chosen such that the randomly initialized vectors have the same variance as the pre-trained ones. It would be interesting to see if employing more sophisticated methods to mirror the distribution of pre-trained vectors in the initialization process gives further improvements.

• We briefly experimented with another set of publicly available word vectors trained by Collobert et al. (2011) on Wikipedia,⁸ and found that word2vec gave far superior performance. It is not clear whether this is due to Mikolov et al. (2013)’s architecture or the 100 billion word Google News dataset.

• Adadelta (Zeiler, 2012) gave similar results to Adagrad (Duchi et al., 2011) but required fewer epochs.

⁸ http://ronan.collobert.com/senna/

5 Conclusion

In the present work we have described a series of experiments with convolutional neural networks built on top of word2vec. Despite little tuning of hyperparameters, a simple CNN with one layer of convolution performs remarkably well. Our results add to the well-established evidence that unsupervised pre-training of word vectors is an important ingredient in deep learning for NLP.

Acknowledgments

We would like to thank Yann LeCun and the anonymous reviewers for their helpful feedback and suggestions.
References

Y. Bengio, R. Ducharme, P. Vincent. 2003. Neural Probabilistic Language Model. Journal of Machine Learning Research, 3:1137–1155.

R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa. 2011. Natural Language Processing (Almost) from Scratch. Journal of Machine Learning Research, 12:2493–2537.

J. Duchi, E. Hazan, Y. Singer. 2011. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159.

L. Dong, F. Wei, S. Liu, M. Zhou, K. Xu. 2014. A Statistical Parsing Framework for Sentiment Classification. CoRR, abs/1401.6330.

A. Graves, A. Mohamed, G. Hinton. 2013. Speech recognition with deep recurrent neural networks. In Proceedings of ICASSP 2013.

G. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, R. Salakhutdinov. 2012. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.

K. Hermann, P. Blunsom. 2013. The Role of Syntax in Vector Space Models of Compositional Semantics. In Proceedings of ACL 2013.

M. Hu, B. Liu. 2004. Mining and Summarizing Customer Reviews. In Proceedings of ACM SIGKDD 2004.

M. Iyyer, P. Enns, J. Boyd-Graber, P. Resnik. 2014. Political Ideology Detection Using Recursive Neural Networks. In Proceedings of ACL 2014.

N. Kalchbrenner, E. Grefenstette, P. Blunsom. 2014. A Convolutional Neural Network for Modelling Sentences. In Proceedings of ACL 2014.

A. Krizhevsky, I. Sutskever, G. Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Proceedings of NIPS 2012.

Q. Le, T. Mikolov. 2014. Distributed Representations of Sentences and Documents. In Proceedings of ICML 2014.

Y. LeCun, L. Bottou, Y. Bengio, P. Haffner. 1998. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, 86(11):2278–2324, November.

X. Li, D. Roth. 2002. Learning Question Classifiers. In Proceedings of ACL 2002.

T. Mikolov, I. Sutskever, K. Chen, G. Corrado, J. Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS 2013.

T. Nakagawa, K. Inui, S. Kurohashi. 2010. Dependency tree-based sentiment classification using CRFs with hidden variables. In Proceedings of ACL 2010.

B. Pang, L. Lee. 2004. A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of ACL 2004.

B. Pang, L. Lee. 2005. Seeing stars: Exploiting class relationships for sentiment categorization with respect to rating scales. In Proceedings of ACL 2005.

A.S. Razavian, H. Azizpour, J. Sullivan, S. Carlsson. 2014. CNN Features off-the-shelf: an Astounding Baseline. CoRR, abs/1403.6382.

Y. Shen, X. He, J. Gao, L. Deng, G. Mesnil. 2014. Learning Semantic Representations Using Convolutional Neural Networks for Web Search. In Proceedings of WWW 2014.

J. Silva, L. Coheur, A. Mendes, A. Wichert. 2011. From symbolic to sub-symbolic information in question classification. Artificial Intelligence Review, 35(2):137–154.

R. Socher, J. Pennington, E. Huang, A. Ng, C. Manning. 2011. Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions. In Proceedings of EMNLP 2011.

R. Socher, B. Huval, C. Manning, A. Ng. 2012. Semantic Compositionality through Recursive Matrix-Vector Spaces. In Proceedings of EMNLP 2012.

R. Socher, A. Perelygin, J. Wu, J. Chuang, C. Manning, A. Ng, C. Potts. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In Proceedings of EMNLP 2013.

J. Wiebe, T. Wilson, C. Cardie. 2005. Annotating Expressions of Opinions and Emotions in Language. Language Resources and Evaluation, 39(2-3):165–210.

S. Wang, C. Manning. 2012. Baselines and Bigrams: Simple, Good Sentiment and Topic Classification. In Proceedings of ACL 2012.

S. Wang, C. Manning. 2013. Fast Dropout Training. In Proceedings of ICML 2013.

B. Yang, C. Cardie. 2014. Context-aware Learning for Sentence-level Sentiment Analysis with Posterior Regularization. In Proceedings of ACL 2014.

W. Yih, K. Toutanova, J. Platt, C. Meek. 2011. Learning Discriminative Projections for Text Similarity Measures. Proceedings of the Fifteenth Conference on Computational Natural Language Learning, 247–256.

W. Yih, X. He, C. Meek. 2014. Semantic Parsing for Single-Relation Question Answering. In Proceedings of ACL 2014.

M. Zeiler. 2012. Adadelta: An adaptive learning rate method. CoRR, abs/1212.5701.
