Text Understanding From Scratch
inputs are quantized characters and the outputs are abstract properties of the text. Our approach is one that ‘learns from scratch’, in the following 2 senses:

1. ConvNets do not require knowledge of words – working with characters is fine. This renders a word-based feature extractor (such as LookupTable (Collobert et al., 2011b) or word2vec (Mikolov et al., 2013b)) unnecessary. All previous works start with words instead of characters, which makes it difficult to apply a convolutional layer directly due to the high dimensionality.

2. ConvNets do not require knowledge of syntactic or semantic structures – inference directly to high-level targets is fine. This also invalidates the assumption that structured predictions and language models are necessary for high-level text understanding.

Our approach is partly inspired by ConvNets’ success in computer vision, with outstanding performance in various image recognition tasks (Girshick et al., 2013)(Krizhevsky et al., 2012)(Sermanet et al., 2013). These successful results usually involve some end-to-end ConvNet model that learns hierarchical representations from raw pixels (Girshick et al., 2013)(Zeiler & Fergus, 2014). Similarly, we hypothesize that when trained from raw characters, a temporal ConvNet is able to learn the hierarchical representations of words, phrases and sentences in order to understand text.

output frame size. The output h_j(y) is obtained by a sum over i of the convolutions between g_i(x) and f_ij(x).

One key module that helped us to train deeper models is temporal max-pooling. It is the same as the spatial max-pooling module used in computer vision (Boureau et al., 2010a), except that it is in 1-D. Given a discrete input function g(x) ∈ [1, l] → R, the max-pooling function h(y) ∈ [1, ⌊(l − k)/d⌋ + 1] → R of g(x) is defined as

    h(y) = max_{x=1,…,k} g(y · d − x + c),

where c = k − d + 1 is an offset constant. This very pooling module enabled us to train ConvNets deeper than 6 layers, where all others failed. The analysis by (Boureau et al., 2010b) might shed some light on this.

The non-linearity used in our model is the rectifier or thresholding function h(x) = max{0, x}, which makes our convolutional layers similar to rectified linear units (ReLUs) (Nair & Hinton, 2010). We always apply this function after a convolutional or linear module, therefore we omit its appearance in the following. The algorithm used in training our model is stochastic gradient descent (SGD) with a minibatch of size 128, using momentum (Polyak, 1964)(Sutskever et al., 2013) 0.9 and an initial step size of 0.01, which is halved every 3 epochs 10 times. The training method and parameters apply to all of our models. Our implementation is done using Torch 7 (Collobert et al., 2011a).
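To make the temporal max-pooling definition above concrete, here is a minimal sketch in plain Python (not the Torch 7 code used for our models). It follows the formula h(y) = max_{x=1..k} g(y · d − x + c) with c = k − d + 1, mapping the paper's 1-based positions onto 0-based list indices:

```python
def temporal_max_pool(g, k, d):
    """1-D max-pooling: h(y) = max over x = 1..k of g(y*d - x + c).

    g is the input sequence (position 1 of the paper maps to index 0),
    k is the pooling size and d the stride; c = k - d + 1 is the offset.
    """
    l = len(g)
    c = k - d + 1
    out_len = (l - k) // d + 1          # |h| = floor((l - k) / d) + 1
    h = []
    for y in range(1, out_len + 1):     # 1-indexed, as in the formula
        h.append(max(g[y * d - x + c - 1] for x in range(1, k + 1)))
    return h
```

With k = d the windows are non-overlapping; for example, `temporal_max_pool([1, 3, 2, 5, 4, 6], 3, 3)` pools each block of 3 values.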
The alphabet used in all of our models consists of 70 characters, including 26 English letters, 10 digits, the new-line character and 33 other characters. They include:

abcdefghijklmnopqrstuvwxyz0123456789
-,;.!?:’’’/\|_@#$%ˆ&*˜‘+-=<>()[]{}

2.3. Model Design

We designed 2 ConvNets – one large and one small. They are both 9 layers deep, with 6 convolutional layers and 3 fully-connected layers, differing in the number of hidden units and frame sizes. Figure 2 gives an illustration.

(Figure 2: an illustration of the model. Labels recovered from the figure: “Some Text”, “Quantization”, “Frames”, “Length”.)

Table 1. Convolutional layers used in our experiments.

Layer  Large Frame  Small Frame  Kernel  Pool
1      1024         256          7       3
2      1024         256          7       3
3      1024         256          3       N/A
4      1024         256          3       N/A
5      1024         256          3       N/A
6      1024         256          3       3

Table 2. Fully-connected layers used in our experiments.

Layer  Output Units Large  Output Units Small
7      2048                1024
8      2048                1024
9      Depends on the problem
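As an illustration of how some text becomes character frames via quantization, here is a minimal plain-Python sketch of one-hot encoding against the alphabet above. The lowercasing of input, the all-zero encoding of characters outside the alphabet, and the truncation of input to l0 positions are assumptions made for this sketch, not details stated here:

```python
def quantize(text, alphabet, l0=1014):
    """One-hot encode text into a len(alphabet) x l0 matrix of frames."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    frames = [[0.0] * l0 for _ in alphabet]
    for pos, ch in enumerate(text.lower()[:l0]):
        if ch in index:        # characters outside the alphabet stay all-zero
            frames[index[ch]][pos] = 1.0
    return frames
```

Each column of the resulting matrix is the feature vector for one character position, which is the input frame format the first temporal convolutional layer consumes.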
was obtained from WordNet (Fellbaum, 2005), where every synonym to a word or phrase is ranked by semantic closeness to the most frequently seen meaning.

To do synonym replacement for a given text, we need to answer 2 questions: which words in the text should be replaced, and which synonym from the thesaurus should be used for the replacement. To decide on the first question, we extract all replaceable words from the given text and randomly choose r of them to be replaced. The probability of the number r is determined by a geometric distribution with parameter p, in which P[r] ∼ p^r. The index s of the synonym chosen for a given word is also determined by another geometric distribution, in which P[s] ∼ q^s. This way, the probability of a synonym being chosen becomes smaller when it is more distant from the most frequently seen meaning.

It is worth noting that models trained using our large-scale datasets hardly require data augmentation, since their generalization errors are already quite good. We will still report the results using this new data augmentation technique with p = 0.5 and q = 0.5.

2.5. Comparison Models

Since we have constructed several large-scale datasets from scratch, there is no previous publication for us to obtain a comparison with other methods. Therefore, we also implemented two fairly standard models using previous methods: the bag-of-words model, and a bag-of-centroids model via word2vec (Mikolov et al., 2013b).

The bag-of-words model is straightforward. For each dataset, we count how many times each word appears in the training dataset, and choose the 5000 most frequent ones as the bag. Then, we use multinomial logistic regression as the classifier for this bag of features.

As for the word2vec model, we first ran k-means on the word vectors learnt from the Google News corpus with k = 5000, and then use a bag of these centroids for multinomial logistic regression. This model is quite similar to the bag-of-words model in that the number of features is also 5000.

One difference between these two models is that the features for the bag-of-words model are different for different datasets, whereas for word2vec they are the same. This could be one reason behind the phenomenon that bag-of-words consistently out-performs word2vec in our experiments. It might also be the case that the hope for linear separability of word2vec is not valid at all. That being said, our own ConvNet models consistently out-perform both.

3. Datasets and Results

In this part we show the results obtained from various datasets. The unfortunate fact in the literature is that there is no openly accessible dataset that is large enough or with labels of sufficient quality for us, although research on text understanding has been conducted for tens of years. Therefore, we propose several large-scale datasets, in the hope that text understanding can rival the success of image recognition when large-scale datasets such as ImageNet (Deng et al., 2009) become available.

3.1. DBpedia Ontology Classification

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia (Lehmann et al., 2014). The English version of the DBpedia knowledge base provides a consistent ontology, which is shallow and cross-domain. It has been manually created based on the most commonly used infoboxes within Wikipedia. Some ontology classes in DBpedia contain hundreds of thousands of samples, which are ideal candidates to construct an ontology classification dataset.

The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in table 3. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and that of the testing dataset is 70,000.

Table 3. DBpedia ontology classes. The numbers contain only samples with both a title and a short abstract.

Class                    Total    Train   Test
Company                  63,058   40,000  5,000
Educational Institution  50,450   40,000  5,000
Artist                   95,505   40,000  5,000
Athlete                  268,104  40,000  5,000
Office Holder            47,417   40,000  5,000
Mean Of Transportation   47,473   40,000  5,000
Building                 67,788   40,000  5,000
Natural Place            60,091   40,000  5,000
Village                  159,977  40,000  5,000
Animal                   187,587  40,000  5,000
Plant                    50,585   40,000  5,000
Album                    117,683  40,000  5,000
Film                     86,486   40,000  5,000
Written Work             55,174   40,000  5,000

Before feeding the data to the models, we concatenate the title and short abstract together to form a single input for each sample. The length of input used was l0 = 1014, therefore the frame length after the last convolutional layer is l6 = 34. Using an NVIDIA Tesla K40, training takes about 5 hours per epoch for the large model, and 2 hours for the small model. Table 4 shows the classification results.
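The stated relation between l0 = 1014 and l6 = 34 can be checked directly from the convolutional design (kernel sizes 7, 7, 3, 3, 3, 3, with non-overlapping size-3 pooling after layers 1, 2 and 6); a quick sketch:

```python
def conv_len(l, k):          # valid 1-D convolution, stride 1
    return l - k + 1

def pool_len(l, k=3, d=3):   # non-overlapping max-pooling, |h| = floor((l-k)/d) + 1
    return (l - k) // d + 1

l = 1014                     # l0, the input length
for kernel, pooled in [(7, True), (7, True), (3, False),
                       (3, False), (3, False), (3, True)]:
    l = conv_len(l, kernel)
    if pooled:
        l = pool_len(l)
print(l)  # 34, the frame length l6 after the last convolutional layer
```

The intermediate lengths are 336, 110, 108, 106, 104 and finally 34, so the fully-connected layers always see a fixed-size input regardless of dataset.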
each class is 90,000 and testing 12,000, as table 12 shows.

Table 12. Sogou News dataset

Category       Total    Train   Test
Sports         645,931  90,000  12,000
Finance        315,551  90,000  12,000
Entertainment  160,409  90,000  12,000
Automobile     167,647  90,000  12,000
Technology     188,111  90,000  12,000

The romanization or latinization form we have used is Pinyin, a phonetic system for transcribing Mandarin pronunciations. During this procedure, we used the pypinyin package combined with the jieba Chinese segmentation system. In the resulting Pinyin text, each tone is appended to its final as a number between 1 and 4. As before, we concatenate title and content to form an input sample. The texts have a wide range of lengths, from 14 to 810,959 characters. Therefore, during the data acquisition procedure we constrain the length to stay between 100 and 1014 whenever possible. In the end, we also apply the same models as before to this dataset, for which the input length is 1014. We ignored thesaurus augmentation for this dataset. Table 13 lists the results.

Table 13. Result on Sogou News corpus. The numbers are accuracy.

Model          Thesaurus  Train   Test
Large ConvNet  No         99.14%  95.12%
Small ConvNet  No         93.05%  91.35%
Bag of Words   No         92.97%  92.78%

The input for the bag-of-words model is obtained by considering each Pinyin rendering of a Chinese character as a word. These results indicate consistently good performance from our ConvNet models, even though Chinese is a completely different kind of human language. This is one piece of evidence for our belief that ConvNets can be applied to any human language in similar ways for text understanding tasks.

4. Outlook and Conclusion

In this article we provide first evidence of ConvNets' applicability to text understanding tasks from scratch, that is, ConvNets do not need any knowledge of the syntactic or semantic structure of a language to give good benchmarks on text understanding. This evidence is in contrast with various previous approaches, where a dictionary of words is a necessary starting point and structured parsing is usually hard-wired into the model (Collobert et al., 2011b)(Kim, 2014)(Johnson & Zhang, 2014)(dos Santos & Gatti, 2014).

Deep learning models have been known to have good representations across domains or problems, in particular for image recognition (Razavian et al., 2014). How good the learnt representations are for language modeling is also an interesting question to ask in the future. Beyond that, we can also consider how to apply unsupervised learning to language models learnt from scratch. Previous embedding methods (Collobert et al., 2011b)(Mikolov et al., 2013b)(Le & Mikolov, 2014) have shown that predicting words or other patterns missing from the input could be useful. We are eager to see how to apply these transfer learning and unsupervised learning techniques with our models.

Recent research shows that it is possible to generate text descriptions of images from the features learnt in a deep image recognition model, using either fragment embeddings (Karpathy et al., 2014) or recurrent neural networks such as long short-term memory (LSTM) (Vinyals et al., 2014). The models in this article show very good ability for understanding natural languages, and we are interested in using the features from our model to generate a response sentence in similar ways. If this could be successful, conversational systems could see a big advancement.

It is also worth noting that natural language is in essence a time-series in disguise. Therefore, one natural extended application of our approach is towards time-series data, in which a hierarchical feature extraction mechanism could bring some improvements over the recurrent and regression models widely used today.

In this article we only apply ConvNets to text understanding for its semantic or sentiment meaning. One other apparent extension is towards traditional NLP tasks such as chunking, named entity recognition (NER) and part-of-speech (POS) tagging. To do them, one would need to adapt our models to structured outputs. This is very similar to the seminal work by Collobert and Weston (Collobert et al., 2011b), except that we probably no longer need to construct a dictionary and start from words. Our work also makes it easy to extend these models to other human languages.

One final possibility from our model is learning from symbolic systems such as mathematical equations, logic expressions or programming languages. Zaremba and Sutskever (Zaremba & Sutskever, 2014) have shown that it is possible to approximate program execution using a recurrent neural network. We are also eager to see how similar projects could work out using our ConvNet models.

With so many possibilities, we believe that ConvNet models for text understanding could go beyond what this
article shows and bring important insights towards artificial intelligence in the future.

Acknowledgement

We gratefully acknowledge the support of NVIDIA Corporation with the donation of 2 Tesla K40 GPUs used for this research.

References

Boureau, Y-L, Bach, Francis, LeCun, Yann, and Ponce, Jean. Learning mid-level features for recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2559–2566. IEEE, 2010a.

Boureau, Y-Lan, Ponce, Jean, and LeCun, Yann. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 111–118, 2010b.

Braille, Louis. Method of Writing Words, Music, and Plain Songs by Means of Dots, for Use by the Blind and Arranged for Them. 1829.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011a.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November 2011b. ISSN 1532-4435.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

dos Santos, Cicero and Gatti, Maira. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 69–78, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics.

Fellbaum, Christiane. Wordnet and wordnets. In Brown, Keith (ed.), Encyclopedia of Language and Linguistics, pp. 665–670, Oxford, 2005. Elsevier.

Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121–2129, 2013.

Gamon, Michael and Aue, Anthony. Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms. In Proceedings of the ACL 2005 Workshop on Feature Engineering for Machine Learning in NLP, ACL, pp. 57–64, 2005.

Gao, Jianfeng, He, Xiaodong, Yih, Wen-tau, and Deng, Li. Learning semantic representations for the phrase translation model. arXiv preprint arXiv:1312.0482, 2013.

Girshick, Ross B., Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., and Ng, A. Y. DeepSpeech: Scaling up end-to-end speech recognition. ArXiv e-prints, December 2014.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997. ISSN 0899-7667.

Johnson, Rie and Zhang, Tong. Effective use of word order for text categorization with convolutional neural networks. CoRR, abs/1412.1058, 2014.

Karpathy, Andrej, Joulin, Armand, and Fei-Fei, Li. Deep fragment embeddings for bidirectional image sentence mapping. CoRR, abs/1406.5679, 2014.

Kim, Soo-Min and Hovy, Eduard. Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA, 2004. Association for Computational Linguistics.

Kim, Yoon. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.

Le, Quoc V and Mikolov, Tomas. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Lehmann, Jens, Isele, Robert, Jakob, Max, Jentzsch, Anja, Kontokostas, Dimitris, Mendes, Pablo N., Hellmann, Sebastian, Morsey, Mohamed, van Kleef, Patrick, Auer, Sören, and Bizer, Christian. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 2014.

Linell, P. The Written Language Bias in Linguistics. 1982.

McAuley, Julian and Leskovec, Jure. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pp. 165–172, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2409-0.

Mikolov, Tomas, Le, Quoc V, and Sutskever, Ilya. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013a.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 3111–3119. 2013b.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

Norvig, Peter. Inference in text understanding. In AAAI, pp. 561–565, 1987.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.

Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. ISSN 0041-5553.

Razavian, Ali Sharif, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. CNN features off-the-shelf: an astounding baseline for recognition. CoRR, abs/1403.6382, 2014.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, and LeCun, Yann. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.

Soderland, Stephen. Building a machine learning based text understanding system. In Proc. IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, pp. 64–70, 2001.

Strapparava, Carlo and Mihalcea, Rada. Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC '08, pp. 1556–1560, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-753-7.

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In Dasgupta, Sanjoy and McAllester, David (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pp. 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.

Viera, Anthony J, Garrett, Joanne M, et al. Understanding interobserver agreement: the kappa statistic. Fam Med, 37(5):360–363, 2005.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.

Wang, Canhui, Zhang, Min, Ma, Shaoping, and Ru, Liyun. Automatic online news issue construction in web environment. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pp. 457–466, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-085-2.

Wiebe, Janyce M., Wilson, Theresa, and Bell, Matthew. Identifying collocations for recognizing opinions. In Proceedings of the ACL/EACL Workshop on Collocation, Toulouse, FR, 2001.

Wilson, Theresa, Wiebe, Janyce, and Hoffmann, Paul. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pp. 347–354, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. CoRR, abs/1410.4615, 2014.

Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In Computer Vision – ECCV 2014, pp. 818–833. Springer, 2014.