
Text Understanding from Scratch

Xiang Zhang    XIANG@CS.NYU.EDU
Yann LeCun     YANN@CS.NYU.EDU

Computer Science Department, Courant Institute of Mathematical Sciences, New York University

arXiv:1502.01710v5 [cs.LG] 4 Apr 2016

Abstract

This article demonstrates that we can apply deep learning to text understanding from character-level inputs all the way up to abstract text concepts, using temporal convolutional networks (LeCun et al., 1998) (ConvNets). We apply ConvNets to various large-scale datasets, including ontology classification, sentiment analysis, and text categorization. We show that temporal ConvNets can achieve astonishing performance without any knowledge of words, phrases, sentences or other syntactic or semantic structures of a human language. Evidence shows that our models can work for both English and Chinese.

(Note: this technical report is superseded by a paper entitled "Character-level Convolutional Networks for Text Classification", arXiv:1509.01626, which has considerably more experimental results and a rewritten introduction.)

1. Introduction

Text understanding consists in reading texts formed in natural languages, determining the explicit or implicit meaning of elements such as words, phrases, sentences and paragraphs, and making inferences about the implicit or explicit properties of these texts (Norvig, 1987). This problem has traditionally been difficult because of the extreme variability in language formation (Linell, 1982). To date, most approaches to handling text understanding, be they hand-crafted parsing programs or statistically learnt models, have resorted to matching word statistics.

So far, most machine learning approaches to text understanding consist in tokenizing a string of characters into structures such as words, phrases, sentences and paragraphs, and then applying some statistical classification algorithm to the statistics of such structures (Soderland, 2001). These techniques work well enough when applied to a narrowly defined domain, but the prior knowledge required is not cheap – one needs to pre-define a dictionary of words of interest, and the structural parser needs to handle many special variations such as word morphological changes and ambiguous chunking. These requirements make text understanding more or less specialized to a particular language – if the language is changed, many things must be engineered from scratch.

With the advancement of deep learning and the availability of large datasets, methods of handling text understanding using deep learning techniques have gradually become available. One technique which draws great interest is word2vec (Mikolov et al., 2013b). Inspired by traditional language models, this technique constructs a representation of each word as a vector of fixed length, trained on a large corpus. Based on the hope that machines may make sense of languages in a formal fashion, many researchers have tried to train a neural network for understanding texts based on the features extracted by it or by similar techniques, to name a few, (Frome et al., 2013)(Gao et al., 2013)(Le & Mikolov, 2014)(Mikolov et al., 2013a)(Pennington et al., 2014). Most of these techniques try to apply word2vec or similar techniques together with an engineered language model.

On the other hand, some researchers have also tried to train a neural network from the word level with little structural engineering (Collobert et al., 2011b)(Kim, 2014)(Johnson & Zhang, 2014)(dos Santos & Gatti, 2014). In these works, a word-level feature extractor such as a lookup table (Collobert et al., 2011b) or word2vec (Mikolov et al., 2013b) is used to feed a temporal ConvNet (LeCun et al., 1998). After training, the ConvNets worked for both structured prediction tasks such as part-of-speech tagging and named entity recognition, and text understanding tasks such as sentiment analysis and sentence classification. They claim good results for various tasks, but the datasets and models are relatively small and there are still some engineered layers to represent structures such as words, phrases and sentences.

In this article we show that text understanding can be handled by a deep learning system without artificially embedding knowledge about words, phrases, sentences or any other syntactic or semantic structures associated with a language. We apply temporal ConvNets (LeCun et al., 1998) to various large-scale text understanding tasks, in which the
inputs are quantized characters and the outputs are abstract properties of the text. Our approach is one that 'learns from scratch', in the following 2 senses:

1. ConvNets do not require knowledge of words – working with characters is fine. This renders a word-based feature extractor (such as the LookupTable of (Collobert et al., 2011b) or word2vec (Mikolov et al., 2013b)) unnecessary. All previous works start with words instead of characters, to which it is difficult to apply a convolutional layer directly due to the high dimensionality.

2. ConvNets do not require knowledge of syntactic or semantic structures – inference directly to high-level targets is fine. This also invalidates the assumption that structured predictions and language models are necessary for high-level text understanding.

Our approach is partly inspired by ConvNets' success in computer vision, where they have outstanding performance in various image recognition tasks (Girshick et al., 2013)(Krizhevsky et al., 2012)(Sermanet et al., 2013). These successful results usually involve some end-to-end ConvNet model that learns hierarchical representations from raw pixels (Girshick et al., 2013)(Zeiler & Fergus, 2014). Similarly, we hypothesize that when trained from raw characters, a temporal ConvNet is able to learn the hierarchical representations of words, phrases and sentences needed to understand text.

2. ConvNet Model Design

In this section, we introduce the design of ConvNets for text understanding. The design is modular, where the gradients are obtained by back-propagation (Rumelhart et al., 1986) to perform optimization.

2.1. Key Modules

The main component of our model is the temporal convolutional module, which simply computes a 1-D convolution between input and output. Suppose we have a discrete input function g(x) ∈ [1, l] → R and a discrete kernel function f(x) ∈ [1, k] → R. The convolution h(y) ∈ [1, ⌊(l − k)/d⌋ + 1] → R between f(x) and g(x) with stride d is defined as

    h(y) = \sum_{x=1}^{k} f(x) \cdot g(y \cdot d - x + c),

where c = k − d + 1 is an offset constant. Just as in traditional convolutional networks in vision, the module is parameterized by a set of such kernel functions f_{ij}(x) (i = 1, 2, ..., m and j = 1, 2, ..., n) which we call weights, on a set of inputs g_i(x) and outputs h_j(y). We call each g_i (or h_j) an input (or output) frame, and m (or n) the input (or output) frame size. The output h_j(y) is obtained by a sum over i of the convolutions between g_i(x) and f_{ij}(x).

One key module that helped us train deeper models is temporal max-pooling. It is the same as the spatial max-pooling module used in computer vision (Boureau et al., 2010a), except that it is in 1-D. Given a discrete input function g(x) ∈ [1, l] → R, the max-pooling function h(y) ∈ [1, ⌊(l − k)/d⌋ + 1] → R of g(x) is defined as

    h(y) = \max_{x=1}^{k} g(y \cdot d - x + c),

where c = k − d + 1 is an offset constant. This very pooling module enabled us to train ConvNets deeper than 6 layers, where all others failed. The analysis by (Boureau et al., 2010b) might shed some light on this.

The non-linearity used in our model is the rectifier or thresholding function h(x) = max{0, x}, which makes our convolutional layers similar to rectified linear units (ReLUs) (Nair & Hinton, 2010). We always apply this function after a convolutional or linear module, and therefore omit it in the following. The algorithm used to train our model is stochastic gradient descent (SGD) with a minibatch of size 128, using momentum (Polyak, 1964)(Sutskever et al., 2013) 0.9 and an initial step size of 0.01 which is halved every 3 epochs for 10 times. The same training method and parameters apply to all of our models. Our implementation is done using Torch 7 (Collobert et al., 2011a).

2.2. Character Quantization

Our model accepts a sequence of encoded characters as input. The encoding is done by prescribing an alphabet of size m for the input language, and then quantizing each character using 1-of-m encoding. The sequence of characters is then transformed to a sequence of such m-sized vectors with fixed length l. Any character exceeding length l is ignored, and any character that is not in the alphabet, including blank characters, is quantized as an all-zero vector. Inspired by how long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997) works, we quantize characters in backward order. This way, the latest reading on characters is always placed near the beginning of the output, making it easy for fully connected layers to associate correlations with the latest memory. The input to our model is then just a set of frames of length l, and the frame size is the alphabet size m.

One interesting thing about this quantization is that visually it is quite similar to Braille (Braille, 1829), used for assisting blind reading, except that our encoding is more compact. Figure 1 depicts this fact. It seems that when trained properly, humans can learn to read binary encodings of languages. This offers interesting insight into why our approach could work.
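For concreteness, the following is a minimal NumPy sketch of the three modules above: backward-order 1-of-m quantization, the temporal convolution for a single input/output frame pair, and non-overlapping temporal max-pooling. This is our own illustration of the definitions (the actual implementation is in Torch 7), and all function names are ours:

    import numpy as np

    # The 70-character alphabet of Section 2.2 (the last entry is newline).
    ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-,;.!?:'\"/\\|_@#$%^&*~`+-=<>()[]{}\n"
    CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}

    def quantize(text, l=1014, m=len(ALPHABET)):
        """1-of-m encode the last l characters of `text` in backward order;
        characters outside the alphabet stay all-zero columns."""
        frames = np.zeros((m, l))
        for y, ch in enumerate(text[::-1][:l]):  # read from the end of the text
            i = CHAR_INDEX.get(ch)
            if i is not None:
                frames[i, y] = 1.0
        return frames

    def temporal_conv(g, f, d=1):
        """h(y) = sum_{x=1..k} f(x) g(y*d - x + c) with c = k - d + 1,
        rewritten 0-indexed; g has length l, f has length k, stride d."""
        l, k = len(g), len(f)
        return np.array([
            sum(f[x] * g[y * d + (k - 1 - x)] for x in range(k))
            for y in range((l - k) // d + 1)
        ])

    def temporal_maxpool(g, k):
        """Non-overlapping temporal max-pooling (stride d = k, as in Table 1)."""
        return np.array([g[y * k : y * k + k].max() for y in range(len(g) // k)])

    # One output frame of a temporal convolutional layer is the sum over
    # input frames i of the convolutions with the kernels f_{ij}.
    G = quantize("text understanding from scratch")
    F = np.random.randn(4, len(ALPHABET), 7) * 0.02   # 4 output frames, kernel 7
    H = np.stack([sum(temporal_conv(G[i], F[j, i]) for i in range(G.shape[0]))
                  for j in range(F.shape[0])])
    print(H.shape, temporal_maxpool(H[0], 3).shape)   # (4, 1008) (336,)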
[Figure 1. Comparison of our binary encoding and Braille on the text "International Conference on Machine Learning". Panels: (a) Binary, (b) Braille.]

The alphabet used in all of our models consists of 70 characters, including 26 English letters, 10 digits, the new-line character and 33 other characters:

abcdefghijklmnopqrstuvwxyz0123456789
-,;.!?:'"/\|_@#$%^&*~`+-=<>()[]{}

Before feeding the input to our model, no normalization is done. This is because the input is already quite sparse by itself, with many zeros scattered around. Our models can learn from this simple quantization without problems.

2.3. Model Design

We designed 2 ConvNets – one large and one small. They are both 9 layers deep, with 6 convolutional layers and 3 fully-connected layers, but with different numbers of hidden units and frame sizes. Figure 2 gives an illustration.

[Figure 2. Illustration of our model. Some text of a given length is quantized into frames, passed through convolution and max-pooling layers, and finally through fully-connected layers.]

Table 1. Convolutional layers used in our experiments. The convolutional layers do not use stride and the pooling layers are all non-overlapping, so we omit the description of their strides.

Layer   Large Frame   Small Frame   Kernel   Pool
1       1024          256           7        3
2       1024          256           7        3
3       1024          256           3        N/A
4       1024          256           3        N/A
5       1024          256           3        N/A
6       1024          256           3        3

Table 2. Fully-connected layers used in our experiments. The number of output units for the last layer is determined by the problem. For example, for a 10-class classification problem it will be 10.

Layer   Output Units Large       Output Units Small
7       2048                     1024
8       2048                     1024
9       Depends on the problem

The input has a number of frames equal to 69 due to our character quantization method, and the length of each frame is dependent on the problem. We also insert 2 dropout (Hinton et al., 2012) modules between the 3 fully-connected layers for regularization, with dropout probability 0.5. Table 1 lists the configurations for the convolutional layers, and table 2 lists the configurations for the fully-connected (linear) layers.

Before training the models, we randomize the weights using Gaussian distributions. The mean and standard deviation used for initializing the large model are (0, 0.02), and for the small model (0, 0.05).

For different problems the input lengths are different, and so are the frame lengths. From our model design, it is easy to see that given input length l_0, the output frame length after the last convolutional layer (but before any of the fully-connected layers) is l_6 = (l_0 − 96)/27. This number multiplied by the frame size at layer 6 gives the input dimension the first fully-connected layer accepts.
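As a quick sanity check of this formula, one can compose the per-layer length transforms of Table 1. The following is a sketch of ours, not part of the paper's code:

    def out_len(l, k, pool):
        l = l - k + 1                        # stride-1 convolution with kernel k
        if pool:
            l = (l - pool) // pool + 1       # non-overlapping max-pool
        return l

    layers = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]  # Table 1
    l = 1014
    for k, pool in layers:
        l = out_len(l, k, pool)
    print(l, (1014 - 96) // 27)              # both print 34
    # The first fully-connected layer then takes l6 * frame-size inputs,
    # e.g. 34 * 1024 = 34816 for the large model.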
2.4. Data Augmentation using Thesaurus

Many researchers have found that appropriate data augmentation techniques are useful for controlling the generalization error of deep learning models. These techniques usually work well when we can find appropriate invariance properties that the model should possess. For example, in image recognition a model should have some controlled invariance towards translation, scaling, rotation and flipping of the input image. Similarly, in speech recognition we usually augment data by adding an artificial noise background and changing the tone or speed of the speech signal (Hannun et al., 2014).

In the case of text, it is not reasonable to augment the data using signal transformations as done in image or speech recognition, because the exact order of characters may carry rigorous syntactic and semantic meaning. The best way to do data augmentation would therefore have been to use human rephrases of sentences, but this is unrealistic and expensive due to the large volume of samples in our datasets. As a result, the most natural choice of data augmentation for us is to replace words or phrases with their synonyms.

We experimented with data augmentation using an English thesaurus, obtained from the mytheas component used in the LibreOffice project (http://www.libreoffice.org/). That thesaurus in turn was obtained from WordNet (Fellbaum, 2005), where every synonym of a word or phrase is ranked by semantic closeness to the most frequently seen meaning.

To do synonym replacement for a given text, we need to answer 2 questions: which words in the text should be replaced, and which synonym from the thesaurus should be used for the replacement. To decide on the first question, we extract all replaceable words from the given text and randomly choose r of them to be replaced. The probability of the number r is determined by a geometric distribution with parameter p, in which P[r] ∼ p^r. The index s of the synonym chosen for a given word is also determined by another geometric distribution, in which P[s] ∼ q^s. This way, the probability of a synonym being chosen becomes smaller when it is more distant from the most frequently seen meaning.

It is worth noting that models trained on our large-scale datasets hardly require data augmentation, since their generalization errors are already pretty good. We will still report the results using this new data augmentation technique with p = 0.5 and q = 0.5.
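A minimal sketch of this sampling scheme, assuming a thesaurus represented as a dict mapping each word to a non-empty list of synonyms ordered by semantic closeness (the names and representation are ours):

    import random

    def sample_geometric(p):
        """Sample r >= 0 with P[r] proportional to p**r."""
        r = 0
        while random.random() < p:
            r += 1
        return r

    def augment(words, thesaurus, p=0.5, q=0.5):
        """Replace a geometrically distributed number of words with synonyms;
        nearer (more frequent) meanings are chosen with higher probability."""
        replaceable = [i for i, w in enumerate(words) if w in thesaurus]
        r = min(sample_geometric(p), len(replaceable))
        out = list(words)
        for i in random.sample(replaceable, r):
            synonyms = thesaurus[out[i]]           # ordered by closeness
            s = min(sample_geometric(q), len(synonyms) - 1)
            out[i] = synonyms[s]
        return out

    print(augment("the movie was very good".split(),
                  {"movie": ["film", "picture"], "good": ["great", "fine"]}))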
2.5. Comparison Models

Since we have constructed several large-scale datasets from scratch, there is no previous publication for us to compare against, so we also implemented two fairly standard models using previous methods: a bag-of-words model, and a bag-of-centroids model via word2vec (Mikolov et al., 2013b).

The bag-of-words model is pretty straightforward. For each dataset, we count how many times each word appears in the training dataset and choose the 5000 most frequent ones as the bag. Then, we use multinomial logistic regression as the classifier for this bag of features.

As for the word2vec model, we first ran k-means on the word vectors learnt from the Google News corpus with k = 5000, and then used a bag of these centroids for multinomial logistic regression. This model is quite similar to the bag-of-words model in that the number of features is also 5000.

One difference between these two models is that the features for the bag-of-words model are different for different datasets, whereas for word2vec they are the same. This could be one reason behind the phenomenon that bag-of-words consistently out-performs word2vec in our experiments. It might also be the case that the hope for linear separability of word2vec is not valid at all. That being said, our own ConvNet models consistently out-perform both.
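A rough sketch of the bag-of-words baseline using scikit-learn (an approximation of the setup described above, not the authors' code; the toy texts and labels are placeholders):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    # Toy stand-ins for a real (train, test) split of one of our datasets.
    train_texts = ["a great film", "a terrible film", "great acting", "terrible plot"]
    train_labels = [1, 0, 1, 0]

    # Bag of the 5000 most frequent training words, classified with
    # (multinomial) logistic regression.
    bow = make_pipeline(CountVectorizer(max_features=5000),
                        LogisticRegression(max_iter=1000))
    bow.fit(train_texts, train_labels)
    print(bow.predict(["great plot"]))

    # The word2vec baseline would instead replace the vocabulary with 5000
    # k-means centroids of pretrained word vectors, e.g. with a hypothetical
    # {word: vector} mapping `vectors`:
    #   from sklearn.cluster import KMeans
    #   import numpy as np
    #   ids = KMeans(n_clusters=5000).fit_predict(np.stack(list(vectors.values())))
    #   centroid_of = dict(zip(vectors, ids))
    # after which each document becomes a bag of centroid ids.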
3. Datasets and Results

In this part we show the results obtained on various datasets. The unfortunate fact in the literature is that there is no openly accessible dataset that is large enough or has labels of sufficient quality for us, although research on text understanding has been conducted for tens of years. Therefore, we propose several large-scale datasets, in the hope that text understanding can rival the success image recognition achieved when large-scale datasets such as ImageNet (Deng et al., 2009) became available.

3.1. DBpedia Ontology Classification

DBpedia is a crowd-sourced community effort to extract structured information from Wikipedia (Lehmann et al., 2014). The English version of the DBpedia knowledge base provides a consistent ontology, which is shallow and cross-domain. It has been manually created based on the most commonly used infoboxes within Wikipedia. Some ontology classes in DBpedia contain hundreds of thousands of samples, which makes them ideal candidates for constructing an ontology classification dataset.

The DBpedia ontology classification dataset is constructed by picking 14 non-overlapping classes from DBpedia 2014. They are listed in table 3. From each of these 14 ontology classes, we randomly choose 40,000 training samples and 5,000 testing samples. Therefore, the total size of the training dataset is 560,000 and of the testing dataset 70,000.

Table 3. DBpedia ontology classes. The numbers count only samples with both a title and a short abstract.

Class                     Total     Train    Test
Company                   63,058    40,000   5,000
Educational Institution   50,450    40,000   5,000
Artist                    95,505    40,000   5,000
Athlete                   268,104   40,000   5,000
Office Holder             47,417    40,000   5,000
Mean Of Transportation    47,473    40,000   5,000
Building                  67,788    40,000   5,000
Natural Place             60,091    40,000   5,000
Village                   159,977   40,000   5,000
Animal                    187,587   40,000   5,000
Plant                     50,585    40,000   5,000
Album                     117,683   40,000   5,000
Film                      86,486    40,000   5,000
Written Work              55,174    40,000   5,000

Before feeding the data to the models, we concatenate the title and short abstract of each sample to form a single input. The input length used was l_0 = 1014, so the frame length after the last convolutional layer is l_6 = 34. Using an NVIDIA Tesla K40, training takes about 5 hours per epoch for the large model, and 2 hours for the small model. Table 4 shows the classification results.
Table 4. DBpedia results. The numbers are accuracy.

Model           Thesaurus   Train     Test
Large ConvNet   No          99.96%    98.27%
Large ConvNet   Yes         99.89%    98.40%
Small ConvNet   No          99.37%    98.02%
Small ConvNet   Yes         99.62%    98.15%
Bag of Words    No          96.29%    96.19%
word2vec        No          89.32%    89.09%

The results in table 4 indicate both good training and testing accuracy from our models, with some improvement from thesaurus augmentation. We believe this is a first piece of evidence that a learning machine does not require knowledge about words, phrases, sentences, paragraphs or any other syntactic or semantic structures to understand text. That being said, we want to point out that ConvNets by their design have the capacity to learn such structured knowledge.

[Figure 3. Visualization of first layer weights.]

Figure 3 is a visualization of some kernel weights in the first layer of the large model trained without thesaurus augmentation. Each block represents a randomly chosen kernel, whose horizontal direction iterates over input frames and whose vertical direction iterates over kernel size. In the visualization, black (or white) indicates large negative (or positive) values, and gray indicates values near zero. Interestingly, the network has learnt to care more about the variations in letters than in other characters. This phenomenon is observed in the models for all of the datasets.

3.2. Amazon Review Sentiment Analysis

The purpose of sentiment analysis is to identify and extract subjective information in different kinds of source materials. This task, when presented with text written by some user, can be formulated as a normal classification problem in which each class represents a degree indicator of the user's subjective view. One example is the score system used by Amazon, a discrete score from 1 to 5 indicating a user's subjective rating of a product. The rating usually comes with a review text, which is a valuable source for constructing a sentiment analysis dataset.

We obtained an Amazon review dataset from the Stanford Network Analysis Project (SNAP), which spans 18 years with 34,686,770 reviews from 6,643,669 users on 2,441,053 products (McAuley & Leskovec, 2013). This dataset contains review texts of widely varying character lengths, from 3 to 32,788, with a mean of around 764.

To construct a sentiment analysis dataset, we chose review texts with character lengths between 100 and 1014. Apart from constructing a dataset from the original 5 score labels, we also construct a sentiment polarity dataset in which labels 1 and 2 are converted to negative, and 4 and 5 to positive. There are also a large number of duplicated reviews in which the title and review text are the same; we removed these duplicates. Table 5 lists the number of samples for each score and the numbers sampled for the 2 datasets.

Table 5. Amazon review datasets. Column "total" is the total number of samples for each score. Columns "full" and "polarity" are the numbers of samples chosen for the full score dataset and the polarity dataset, respectively.

Score   Total        Full      Polarity
1       2,746,559    730,000   1,275,000
2       1,791,219    730,000   725,000
3       2,892,566    730,000   0
4       6,551,166    730,000   725,000
5       20,705,260   730,000   1,275,000

We ignored score 3 for the polarity dataset because some texts with that score are not obviously negative or positive. Many researchers have shown that with some random text, the inter-rater consensus on polarity is only about 60% - 80% (Gamon & Aue, 2005)(Kim & Hovy, 2004)(Strapparava & Mihalcea, 2008)(Viera et al., 2005)(Wiebe et al., 2001)(Wilson et al., 2005). We believe that by leaving out score 3, the labels have higher quality, with a clearer indication of positivity or negativity. We could have included a third "neutral" class, but that would significantly reduce the number of samples for each class, since sample imbalance is not desirable.

For the full score dataset, we randomly selected 600,000 samples per score for training and 130,000 samples for testing. The size of the training set is then 3,000,000 and of the testing set 650,000. For the polarity dataset, we randomly selected 1,800,000 samples for each of the positive and negative labels for training and 200,000 samples for testing. In total, the polarity dataset has 3,600,000 training samples and 400,000 testing samples.
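The polarity dataset construction described above can be sketched as follows. This is our own illustration; the (title, text, score) field layout for the SNAP data is assumed:

    def to_polarity(score):
        """Map a 1-5 star rating to a polarity label; score 3 is dropped."""
        if score in (1, 2):
            return "negative"
        if score in (4, 5):
            return "positive"
        return None

    def build_polarity(reviews, min_len=100, max_len=1014):
        """reviews: iterable of (title, text, score) tuples (assumed layout).
        Drops score-3 reviews, out-of-range lengths, and duplicates whose
        title and text coincide."""
        seen = set()
        for title, text, score in reviews:
            label = to_polarity(score)
            if label is None or not (min_len <= len(text) <= max_len):
                continue
            if (title, text) in seen:
                continue
            seen.add((title, text))
            yield text, label

    reviews = [("Great", "Loved it. " * 15, 5),
               ("Great", "Loved it. " * 15, 5),              # duplicate, dropped
               ("Meh", "It was fine, I guess. " * 6, 3)]     # score 3, dropped
    print(len(list(build_polarity(reviews))))                # 1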
Because we limit the maximum length of the text to 1014, we can safely set the input length to 1014 and use the same configuration as for the DBpedia model. Models for the Amazon review datasets took significantly more time per epoch: about 5 days for the large model and 2 days for the small model, with the polarity training taking a little bit longer. Table 6 and table 7 list the results on the full score dataset and the polarity dataset, respectively.

Table 6. Results on the Amazon review full score dataset. The numbers are accuracy.

Model           Thesaurus   Train     Test
Large ConvNet   No          62.96%    58.69%
Large ConvNet   Yes         68.90%    59.55%
Small ConvNet   No          69.24%    59.47%
Small ConvNet   Yes         62.11%    59.57%
Bag of Words    No          54.45%    54.17%
word2vec        No          36.56%    36.50%

Table 7. Results on the Amazon review polarity dataset. The numbers are accuracy.

Model           Thesaurus   Train     Test
Large ConvNet   No          97.57%    94.49%
Large ConvNet   Yes         96.82%    95.07%
Small ConvNet   No          96.03%    94.50%
Small ConvNet   Yes         95.44%    94.33%
Bag of Words    No          89.96%    89.86%
word2vec        No          72.95%    72.86%

Our models work much better on the polarity dataset than on the full score dataset. This is to be expected, since full score prediction means more confusion between nearby score labels. To demonstrate this, figure 4 shows the training and testing confusion matrices.

[Figure 4. Confusion matrices on full score Amazon review prediction. Panels: (a) Train, (b) Test. White values are 1 and black 0. The vertical direction iterates over true scores from top to bottom, and the horizontal direction over predicted scores from left to right.]

3.3. Yahoo! Answers Topic Classification

Yahoo! Answers is a web site where people post questions and answers, all of which are public to any web user willing to browse or download them. We obtained the Yahoo! Answers Comprehensive Questions and Answers version 1.0 dataset through the Yahoo! Webscope program. The data is the Yahoo! Answers corpus as of October 25th, 2007, and includes 4,483,032 questions and their corresponding answers. In addition to the question and answer text, the corpus contains a small amount of metadata, i.e., which answer was selected as the best answer, and the category and sub-category assigned to each question.

We constructed a topic classification dataset from this corpus using the 10 largest main categories, listed in table 8. Each class contains 140,000 training samples and 6,000 testing samples. Therefore, the total number of training samples is 1,400,000 and of testing samples 60,000. From all the answers and other meta-information, we only used the best answer content and the main category information.

Table 8. Yahoo! Answers topic classification dataset.

Category                 Total     Train     Test
Society & Culture        295,340   140,000   6,000
Science & Mathematics    169,586   140,000   6,000
Health                   278,942   140,000   6,000
Education & Reference    206,440   140,000   6,000
Computers & Internet     281,696   140,000   6,000
Sports                   146,396   140,000   6,000
Business & Finance       265,182   140,000   6,000
Entertainment & Music    440,548   140,000   6,000
Family & Relationships   517,849   140,000   6,000
Politics & Government    152,564   140,000   6,000

The Yahoo! Answers dataset also contains questions and answers of various lengths, up to 4000 characters. During training we still set the input length to 1014 and truncate the rest if necessary. But before truncation, we concatenate the question title, question content and best answer content in reverse order, so that the question title and content are less likely to be truncated. It takes about 1 day per epoch for the large model, and about 8 hours for the small model. Table 9 details the results on this dataset.
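A sketch of this input construction, on our reading that the backward quantization of Section 2.2 encodes the text starting from its last character, so the tail of the string survives truncation. The newline separator is our assumption; the paper does not specify one:

    def yahoo_input(title, content, best_answer):
        """Concatenate in reverse order: best answer first, title last.
        Under backward quantization, the title and question content then sit
        at the beginning of the encoded frames and are the last parts to be
        lost when the text exceeds the input length l = 1014."""
        return "\n".join([best_answer, content, title])

    # quantize() is the function from the earlier sketch in Section 2.2.
    frames = quantize(yahoo_input("Why is the sky blue?",
                                  "I mean during the day.",
                                  "Rayleigh scattering of sunlight."))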
Table 9. Results on the Yahoo! Answers dataset. The numbers are accuracy.

Model           Thesaurus   Train     Test
Large ConvNet   No          73.42%    70.45%
Large ConvNet   Yes         75.55%    71.10%
Small ConvNet   No          72.84%    70.16%
Small ConvNet   Yes         72.51%    70.16%
Bag of Words    No          66.83%    66.62%
word2vec        No          56.37%    56.47%

One interesting thing about the results on the Yahoo! Answers dataset is that both training and testing accuracy values are quite small compared to the results obtained on the other datasets, whereas the gap between them – the generalization error – is pretty good. One hypothesis for this is that there is some intrinsic confusion in deciding between some classes given a pair of question and answer.

Figure 5 shows the confusion matrix for the large model without thesaurus augmentation. It indicates relatively large confusion between the classes "Society & Culture", "Education & Reference", and "Business & Finance".

[Figure 5. Confusion matrices on the Yahoo! Answers dataset. Panels: (a) Train, (b) Test. White values are 1 and black 0. The vertical direction iterates over true classes from top to bottom, and the horizontal direction over predicted classes from left to right.]

3.4. News Categorization in English

News is one of the largest parts of the entire web today, which makes it a good candidate for building text understanding models. We obtained the AG's corpus of news articles on the web (http://www.di.unipi.it/~gulli/AG_corpus_of_news_articles.html). It contains 496,835 categorized news articles from more than 2000 news sources. We chose the 4 largest categories from this corpus to construct our dataset, using only the title and description fields.

Table 10. AG's news corpus. Only the categories used are listed.

Category   Total    Train    Test
World      81,456   30,000   1,900
Sports     62,163   30,000   1,900
Business   56,656   30,000   1,900
Sci/Tech   41,194   30,000   1,900

Table 10 is a summary of the dataset. From each category, we randomly chose 30,000 samples for training and 1,900 for testing. The total number of training samples is then 120,000, and of testing samples 7,600. Compared to the other datasets we have constructed, this dataset is relatively small; therefore, the time taken for one epoch using the large model is only 3 hours, and about 1 hour for the small model.

As in our previous experiments, we use an input length of 1014 for this dataset after the title and description are concatenated. The actual maximum length over all the inputs is 9843, but the mean is only around 232.

Table 11. Results on AG's news corpus. The numbers are accuracy.

Model           Thesaurus   Train     Test
Large ConvNet   No          99.44%    87.18%
Large ConvNet   Yes         99.49%    86.61%
Small ConvNet   No          99.20%    84.35%
Small ConvNet   Yes         96.81%    85.20%
Bag of Words    No          88.02%    86.69%
word2vec        No          78.20%    76.73%

Table 11 lists the results. It shows a sign of overfitting in our models, which suggests that to achieve good text understanding results, ConvNets require a large corpus in order to learn from scratch.

3.5. News Categorization in Chinese

One immediate advantage of our dictionary-free design is its applicability to other kinds of human languages. Our simple approach only needs an alphabet of the target language, using one-of-n encoding. For languages such as Chinese, Japanese and Korean, where there are too many characters, one can simply use a romanized (or latinized) transcription and quantize it just like English. Better yet, the romanization or latinization is usually phonemic or phonetic, which rivals the success of deep learning in speech recognition (Hannun et al., 2014). Here we investigate one example: news categorization in Chinese.

The dataset we obtained consists of the SogouCA and SogouCS news corpora (Wang et al., 2008), containing in total 2,909,551 news articles in various topic channels. Among them, about 2,644,110 contain both a title and some content. We then labeled each piece of news using its URL, by manually classifying the domain names. This gives us a large corpus of news articles labeled with their categories.
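A sketch of this URL-based labeling step; the domain-to-category map here is hypothetical, since the paper does not list the actual domain names:

    from urllib.parse import urlparse

    # Illustrative only: example Sogou channel domains mapped to categories.
    DOMAIN_CATEGORY = {
        "sports.sohu.com": "sports",
        "business.sohu.com": "finance",
        "yule.sohu.com": "entertainment",
        "auto.sohu.com": "automobile",
        "it.sohu.com": "technology",
    }

    def label_article(url):
        """Label a news article by the domain name of its URL; returns
        None for domains outside the chosen categories."""
        return DOMAIN_CATEGORY.get(urlparse(url).netloc)

    print(label_article("http://sports.sohu.com/20120506/n342578.shtml"))  # 'sports'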
There are a large number of categories, but most of them contain only a few articles. We chose 5 categories – "sports", "finance", "entertainment", "automobile" and "technology". The number of training samples selected for each class is 90,000 and of testing samples 12,000, as table 12 shows.

Table 12. Sogou News dataset.

Category        Total     Train    Test
Sports          645,931   90,000   12,000
Finance         315,551   90,000   12,000
Entertainment   160,409   90,000   12,000
Automobile      167,647   90,000   12,000
Technology      188,111   90,000   12,000

The romanization (latinization) form we used is Pinyin, a phonetic system for transcribing Mandarin pronunciations. During this procedure, we used the pypinyin package combined with the jieba Chinese segmentation system. In the resulting Pinyin text, each tone is appended to its final as a number between 1 and 4. As before, we concatenate title and content to form an input sample. The texts have a wide range of lengths, from 14 to 810,959; therefore, during the data acquisition procedure we constrain the length to stay between 100 and 1014 whenever possible. In the end, we apply the same models as before to this dataset, with an input length of 1014. We ignored thesaurus augmentation for this dataset. Table 13 lists the results.

Table 13. Results on the Sogou News corpus. The numbers are accuracy.

Model           Thesaurus   Train     Test
Large ConvNet   No          99.14%    95.12%
Small ConvNet   No          93.05%    91.35%
Bag of Words    No          92.97%    92.78%

The input for the bag-of-words model is obtained by treating the Pinyin of each Chinese character as a word. These results indicate consistently good performance from our ConvNet models, even though this is a completely different kind of human language. This is one piece of evidence for our belief that ConvNets can be applied to any human language in a similar way for text understanding tasks.
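A plausible reconstruction of this preprocessing using the two packages named above; the exact options the authors used are not specified, so treat this as one reasonable choice:

    import jieba                              # Chinese word segmentation
    from pypinyin import lazy_pinyin, Style

    def romanize(text):
        """Segment with jieba, then transcribe each word to Pinyin with the
        tone number appended to its final (Style.TONE3)."""
        words = jieba.lcut(text)
        return " ".join("".join(lazy_pinyin(w, style=Style.TONE3))
                        for w in words)

    print(romanize("机器学习"))                # e.g. 'ji1qi4 xue2xi2'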
4. Outlook and Conclusion

In this article we provide a first piece of evidence of ConvNets' applicability to text understanding tasks from scratch; that is, ConvNets do not need any knowledge of the syntactic or semantic structure of a language to achieve good benchmark results in text understanding. This evidence is in contrast with various previous approaches, where a dictionary of words is a necessary starting point and structured parsing is usually hard-wired into the model (Collobert et al., 2011b)(Kim, 2014)(Johnson & Zhang, 2014)(dos Santos & Gatti, 2014).

Deep learning models have been known to produce good representations across domains or problems, in particular for image recognition (Razavian et al., 2014). How good the learnt representations are for language modeling is an interesting question to ask in the future. Beyond that, we can also consider how to apply unsupervised learning to language models learnt from scratch. Previous embedding methods (Collobert et al., 2011b)(Mikolov et al., 2013b)(Le & Mikolov, 2014) have shown that predicting words or other patterns missing from the input can be useful. We are eager to see how such transfer learning and unsupervised learning techniques can be applied with our models.

Recent research shows that it is possible to generate text descriptions of images from the features learnt in a deep image recognition model, using either fragment embeddings (Karpathy et al., 2014) or recurrent neural networks such as long short-term memory (LSTM) (Vinyals et al., 2014). The models in this article show very good ability for understanding natural languages, and we are interested in using the features from our model to generate a response sentence in similar ways. If this is successful, conversational systems could see a big advancement.

It is also worth noting that natural language in its essence is time-series in disguise. Therefore, one natural extended application of our approach is towards time-series data, in which a hierarchical feature extraction mechanism could bring some improvements over the recurrent and regression models widely used today.

In this article we only apply ConvNets to text understanding for its semantic or sentiment meaning. One other apparent extension is towards traditional NLP tasks such as chunking, named entity recognition (NER) and part-of-speech (POS) tagging. To do so, one would need to adapt our models to structured outputs. This is very similar to the seminal work by Collobert and Weston (Collobert et al., 2011b), except that we probably no longer need to construct a dictionary and start from words. Our work also makes it easy to extend these models to other human languages.

One final possibility for our model is learning from symbolic systems such as mathematical equations, logic expressions or programming languages. Zaremba and Sutskever (Zaremba & Sutskever, 2014) have shown that it is possible to approximate program execution using a recurrent neural network. We are also eager to see how similar projects could work out using our ConvNet models.

With so many possibilities, we believe that ConvNet models for text understanding could go beyond what this
article shows and bring important insights towards artificial intelligence in the future.

Acknowledgement

We gratefully acknowledge the support of NVIDIA Corporation with the donation of 2 Tesla K40 GPUs used for this research.

References

Boureau, Y-L, Bach, Francis, LeCun, Yann, and Ponce, Jean. Learning mid-level features for recognition. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pp. 2559–2566. IEEE, 2010a.

Boureau, Y-Lan, Ponce, Jean, and LeCun, Yann. A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 111–118, 2010b.

Braille, Louis. Method of Writing Words, Music, and Plain Songs by Means of Dots, for Use by the Blind and Arranged for Them. 1829.

Collobert, Ronan, Kavukcuoglu, Koray, and Farabet, Clément. Torch7: A matlab-like environment for machine learning. In BigLearn, NIPS Workshop, number EPFL-CONF-192376, 2011a.

Collobert, Ronan, Weston, Jason, Bottou, Léon, Karlen, Michael, Kavukcuoglu, Koray, and Kuksa, Pavel. Natural language processing (almost) from scratch. J. Mach. Learn. Res., 12:2493–2537, November 2011b. ISSN 1532-4435.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In CVPR09, 2009.

dos Santos, Cicero and Gatti, Maira. Deep convolutional neural networks for sentiment analysis of short texts. In Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, pp. 69–78, Dublin, Ireland, August 2014. Dublin City University and Association for Computational Linguistics.

Fellbaum, Christiane. Wordnet and wordnets. In Brown, Keith (ed.), Encyclopedia of Language and Linguistics, pp. 665–670, Oxford, 2005. Elsevier.

Frome, Andrea, Corrado, Greg S, Shlens, Jon, Bengio, Samy, Dean, Jeff, Mikolov, Tomas, et al. DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, pp. 2121–2129, 2013.

Gamon, Michael and Aue, Anthony. Automatic identification of sentiment vocabulary: exploiting low association with known sentiment terms. In Proceedings of the ACL 2005 Workshop on Feature Engineering for Machine Learning in NLP, ACL, pp. 57–64, 2005.

Gao, Jianfeng, He, Xiaodong, Yih, Wen-tau, and Deng, Li. Learning semantic representations for the phrase translation model. arXiv preprint arXiv:1312.0482, 2013.

Girshick, Ross B., Donahue, Jeff, Darrell, Trevor, and Malik, Jitendra. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013.

Hannun, A., Case, C., Casper, J., Catanzaro, B., Diamos, G., Elsen, E., Prenger, R., Satheesh, S., Sengupta, S., Coates, A., and Ng, A. Y. DeepSpeech: Scaling up end-to-end speech recognition. ArXiv e-prints, December 2014.

Hinton, Geoffrey E, Srivastava, Nitish, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan R. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.

Hochreiter, Sepp and Schmidhuber, Jürgen. Long short-term memory. Neural Comput., 9(8):1735–1780, November 1997. ISSN 0899-7667.

Johnson, Rie and Zhang, Tong. Effective use of word order for text categorization with convolutional neural networks. CoRR, abs/1412.1058, 2014.

Karpathy, Andrej, Joulin, Armand, and Fei-Fei, Li. Deep fragment embeddings for bidirectional image sentence mapping. CoRR, abs/1406.5679, 2014.

Kim, Soo-Min and Hovy, Eduard. Determining the sentiment of opinions. In Proceedings of the 20th International Conference on Computational Linguistics, COLING '04, Stroudsburg, PA, USA, 2004. Association for Computational Linguistics.

Kim, Yoon. Convolutional neural networks for sentence classification. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Doha, Qatar, October 2014. Association for Computational Linguistics.

Krizhevsky, Alex, Sutskever, Ilya, and Hinton, Geoffrey E. ImageNet classification with deep convolutional neural networks. In NIPS, pp. 1106–1114, 2012.

Le, Quoc V and Mikolov, Tomas. Distributed representations of sentences and documents. arXiv preprint arXiv:1405.4053, 2014.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, November 1998.

Lehmann, Jens, Isele, Robert, Jakob, Max, Jentzsch, Anja, Kontokostas, Dimitris, Mendes, Pablo N., Hellmann, Sebastian, Morsey, Mohamed, van Kleef, Patrick, Auer, Sören, and Bizer, Christian. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.

Linell, P. The Written Language Bias in Linguistics. 1982.

McAuley, Julian and Leskovec, Jure. Hidden factors and hidden topics: Understanding rating dimensions with review text. In Proceedings of the 7th ACM Conference on Recommender Systems, RecSys '13, pp. 165–172, New York, NY, USA, 2013. ACM. ISBN 978-1-4503-2409-0.

Mikolov, Tomas, Le, Quoc V, and Sutskever, Ilya. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013a.

Mikolov, Tomas, Sutskever, Ilya, Chen, Kai, Corrado, Greg S., and Dean, Jeff. Distributed representations of words and phrases and their compositionality. In Burges, C.J.C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K.Q. (eds.), Advances in Neural Information Processing Systems 26, pp. 3111–3119. 2013b.

Nair, Vinod and Hinton, Geoffrey E. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814, 2010.

Norvig, Peter. Inference in text understanding. In AAAI, pp. 561–565, 1987.

Pennington, Jeffrey, Socher, Richard, and Manning, Christopher D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), 12, 2014.

Polyak, B.T. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. ISSN 0041-5553.

Razavian, Ali Sharif, Azizpour, Hossein, Sullivan, Josephine, and Carlsson, Stefan. CNN features off-the-shelf: an astounding baseline for recognition. CoRR, abs/1403.6382, 2014.

Rumelhart, D.E., Hinton, G.E., and Williams, R.J. Learning representations by back-propagating errors. Nature, 323(6088):533–536, 1986.

Sermanet, Pierre, Eigen, David, Zhang, Xiang, Mathieu, Michaël, Fergus, Rob, and LeCun, Yann. OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.

Soderland, Stephen. Building a machine learning based text understanding system. In Proc. IJCAI-2001 Workshop on Adaptive Text Extraction and Mining, pp. 64–70, 2001.

Strapparava, Carlo and Mihalcea, Rada. Learning to identify emotions in text. In Proceedings of the 2008 ACM Symposium on Applied Computing, SAC '08, pp. 1556–1560, New York, NY, USA, 2008. ACM. ISBN 978-1-59593-753-7.

Sutskever, Ilya, Martens, James, Dahl, George E., and Hinton, Geoffrey E. On the importance of initialization and momentum in deep learning. In Dasgupta, Sanjoy and McAllester, David (eds.), Proceedings of the 30th International Conference on Machine Learning (ICML-13), volume 28, pp. 1139–1147. JMLR Workshop and Conference Proceedings, May 2013.

Viera, Anthony J, Garrett, Joanne M, et al. Understanding interobserver agreement: the kappa statistic. Fam Med, 37(5):360–363, 2005.

Vinyals, Oriol, Toshev, Alexander, Bengio, Samy, and Erhan, Dumitru. Show and tell: A neural image caption generator. CoRR, abs/1411.4555, 2014.

Wang, Canhui, Zhang, Min, Ma, Shaoping, and Ru, Liyun. Automatic online news issue construction in web environment. In Proceedings of the 17th International Conference on World Wide Web, WWW '08, pp. 457–466, New York, NY, USA, 2008. ACM. ISBN 978-1-60558-085-2.

Wiebe, Janyce M., Wilson, Theresa, and Bell, Matthew. Identifying collocations for recognizing opinions. In Proceedings of the ACL/EACL Workshop on Collocation, Toulouse, FR, 2001.

Wilson, Theresa, Wiebe, Janyce, and Hoffmann, Paul. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT '05, pp. 347–354, Stroudsburg, PA, USA, 2005. Association for Computational Linguistics.

Zaremba, Wojciech and Sutskever, Ilya. Learning to execute. CoRR, abs/1410.4615, 2014.

Zeiler, Matthew D and Fergus, Rob. Visualizing and understanding convolutional networks. In Computer Vision–ECCV 2014, pp. 818–833. Springer, 2014.
