Semisupervised Autoencoder For Sentiment Analysis
the whole vocabulary; and reconstructing irrelevant words such as 'actor' or 'movie' very well is not likely to help learn more useful representations for classifying the sentiment of movie reviews. Second, explicitly reconstructing all the words in an input text is expensive, because the latent representation has to contain all aspects of the semantic space carried by the words, even those that are completely irrelevant. As the vocabulary size can easily reach the range of tens of thousands even for a moderately sized dataset, the hidden layer size has to be chosen very large to obtain a reasonable reconstruction, which wastes a huge amount of model capacity and makes it difficult to scale to large problems.

In fact, the reasoning above applies to all unsupervised learning methods in general, which we argue is one of the most important problems to address in order to learn task-specific representations. This naturally leads us to the semisupervised approach, where label information is introduced to guide the feature learning procedure. In particular, we propose a novel loss function for training autoencoders that are directly coupled with the classification task. We first train a linear classifier on BoW; a Bregman Divergence (Banerjee et al. 2004) is then derived as the loss function of a subsequent autoencoder. The new loss function tells the autoencoder along which directions the reconstruction should be accurate, and where larger reconstruction errors are tolerated. Informally, this can be considered as a weighting of words based on their correlations with the class label: predictive words should be given large weights in the reconstruction even if they are not frequent words, and vice versa. Furthermore, to reduce the bias introduced by the linear classifier, we take a Bayesian view by defining a posterior distribution on the weights of the classifier. We then approximate the posterior with the Laplace approximation and derive the marginalized loss function for the autoencoder. We show that our model successfully learns features that are highly discriminative with respect to class labels, and that it also outperforms all the competing methods when evaluated by classification accuracy. Moreover, the derived loss can also be applied to unlabeled data, which allows the model to learn even better representations.

Model

Denoising Autoencoders

Autoencoders learn functions that can reconstruct the inputs. They are typically implemented as a neural network with one hidden layer, and one can extract the activation of the hidden layer as the new representation. Mathematically, we are given a collection of data points X = {xi}, xi ∈ R^d, i ∈ [1, m]; the objective function of an autoencoder is thus:

    min Σ_i D(x̃i, xi)
    s.t. hi = g(W xi + b), x̃i = f(W' hi + b'),     (1)

where W ∈ R^{k×d}, b ∈ R^k, W' ∈ R^{d×k}, b' ∈ R^d are the parameters to be learned; D is a loss function, such as the squared Euclidean distance ||x̃ − x||²; g and f are predefined nonlinear functions, which we set as g(x) = max(0, x) and f(x) = 1/(1 + exp(−x)) in this paper; hi is the learned representation; and x̃i is the reconstruction. A common approach is to use tied weights by setting W' = W^T; this usually works better, as it speeds up learning and prevents overfitting at the same time. For this reason, we always use tied weights in this paper.

Autoencoders turn an unsupervised learning problem into a supervised one through the self-reconstruction criterion. This enables one to use all the tools developed for supervised learning, such as back propagation, to efficiently train autoencoders. Moreover, thanks to the nonlinear functions f and g, autoencoders are able to learn nonlinear and possibly overcomplete representations, which give the model much more expressive power than linear counterparts such as PCA (LSA) (Deerwester et al. 1990).

In this paper, we adopt one of the most popular variants of autoencoders, namely the Denoising Autoencoder. A Denoising Autoencoder works by reconstructing the input from a noised version of itself. The intuition is that a robust model should be able to reconstruct the input well even in the presence of noise, due to the high correlation among features. For example, imagine deleting or adding a few words from/to a document; the semantics should remain unchanged, so the autoencoder should learn a consistent representation from all the noisy inputs. At a high level, Denoising Autoencoders are equivalent to ordinary autoencoders trained with dropout (Srivastava et al. 2014), which has been shown to be an effective regularizer for (deep) neural networks. Formally, let q(x̄|x) be a predefined noising distribution, and let x̄ be a noised sample of x: x̄ ∼ q(x̄|x). The objective function takes the form of a sum of expectations over the noisy samples:

    min Σ_i E_{q(x̄i|xi)} D(x̃i, xi)
    s.t. hi = g(W x̄i + b), x̃i = f(W' hi + b'),     (2)

where we have slightly overloaded the notation to let x̃i denote the reconstruction calculated from the noised input x̄i. While the marginal objective function requires infinitely many noised samples per data point, in practice it is sufficient to simulate it stochastically: for each example seen during stochastic gradient descent training, we randomly sample an x̄i from q(x̄i|xi) and calculate the gradient with ordinary back propagation.
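To make this training procedure concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code) of one stochastic step for a tied-weight Denoising Autoencoder with a ReLU encoder, a sigmoid decoder, masking noise for q(x̄|x), and the plain squared-error loss of Equation (1); all names and hyperparameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step_dae(x, W, b, b_dec, drop_prob=0.3, lr=0.1):
    """One stochastic step of a tied-weight denoising autoencoder.

    Encoder: h = relu(W x_bar + b); decoder: x_tilde = sigmoid(W.T h + b_dec).
    The loss here is the plain squared error ||x_tilde - x||^2; the paper
    later swaps this D for a label-informed Bregman divergence.
    """
    # q(x_bar | x): masking noise, i.e. randomly zero out a few coordinates.
    mask = (rng.random(x.shape) > drop_prob).astype(x.dtype)
    x_bar = x * mask

    # Forward pass.
    z = W @ x_bar + b                        # pre-activation, shape (k,)
    h = np.maximum(0.0, z)                   # ReLU encoder, g(x) = max(0, x)
    z_dec = W.T @ h + b_dec                  # tied weights: decoder uses W.T
    x_tilde = 1.0 / (1.0 + np.exp(-z_dec))   # sigmoid decoder, f(x)

    # Backward pass for D(x_tilde, x) = ||x_tilde - x||^2.
    d_xt = 2.0 * (x_tilde - x)
    d_zdec = d_xt * x_tilde * (1.0 - x_tilde)
    d_h = W @ d_zdec
    d_z = d_h * (z > 0)

    # Tied weights: W receives gradient from both the encoder and decoder paths.
    grad_W = np.outer(d_z, x_bar) + np.outer(h, d_zdec)
    W -= lr * grad_W
    b -= lr * d_z
    b_dec -= lr * d_zdec
    return float(np.sum((x_tilde - x) ** 2))

# Toy usage: d = 6 input features, k = 3 hidden units.
d, k = 6, 3
W = 0.01 * rng.standard_normal((k, d))
b, b_dec = np.zeros(k), np.zeros(d)
x = rng.random(d)
print(sgd_step_dae(x, W, b, b_dec))
```

The same update applies unchanged once D is replaced by the label-informed loss derived below.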
Loss Function as Bregman Divergence

We now discuss the proper choice of the loss function D in (2) as a specific form of Bregman Divergence. Bregman Divergence (Banerjee et al. 2004) generalizes the notion of distance in a d-dimensional space. To be concrete, given two data points x̃, x ∈ R^d and a convex function f(x) defined on R^d, the Bregman Divergence of x̃ from x with respect to f is:

    D_f(x̃, x) = f(x̃) − (f(x) + ∇f(x)^T (x̃ − x)).     (3)

In other words, Bregman Divergence measures the distance between two points x̃ and x as the deviation at x̃ between the value of f and the linear approximation of f around x.
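As a quick numerical check of Equation (3), here is a small sketch (our own illustration) showing that choosing f as the squared ℓ2 norm recovers the squared Euclidean distance, and choosing f as the sum of element-wise entropy recovers an element-wise KL-type divergence; both special cases are discussed next.

```python
import numpy as np

def bregman(f, grad_f, x_tilde, x):
    """Bregman Divergence D_f(x_tilde, x) as in Equation (3)."""
    return f(x_tilde) - (f(x) + grad_f(x) @ (x_tilde - x))

rng = np.random.default_rng(0)
x, x_tilde = rng.random(5) + 0.1, rng.random(5) + 0.1   # keep entries positive for the log

# f(x) = ||x||^2 has gradient 2x, so D_f equals the squared Euclidean distance.
sq_norm, grad_sq_norm = lambda v: float(v @ v), lambda v: 2.0 * v
assert np.isclose(bregman(sq_norm, grad_sq_norm, x_tilde, x),
                  float((x_tilde - x) @ (x_tilde - x)))

# f(x) = sum_j x_j log x_j yields the element-wise (generalized) KL divergence.
neg_ent, grad_neg_ent = (lambda v: float(np.sum(v * np.log(v))),
                         lambda v: np.log(v) + 1.0)
assert np.isclose(bregman(neg_ent, grad_neg_ent, x_tilde, x),
                  float(np.sum(x_tilde * np.log(x_tilde / x) - x_tilde + x)))
```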
Two of the most commonly used loss functions for autoencoders are the squared Euclidean distance and the element-wise KL divergence. It is not difficult to verify that they both fall into this family, by choosing f as the squared ℓ2 norm and the sum of element-wise entropy, respectively. What the two loss functions have in common is that they make no distinction among dimensions of the input. In other words, each dimension of the input is pushed to be reconstructed equally well. While autoencoders trained in this way have been shown to work very well on image data, learning much more interesting and useful features than the original pixel intensity features, they are less appropriate for modeling textual data. The reasons are twofold. First, textual data are extremely sparse and high dimensional, where the dimensionality is equal to the vocabulary size. To maintain all the information of the input in the hidden layer, a very large layer size must be adopted, which makes the training cost extremely large. Second, ordinary autoencoders are not able to deal with the power-law distribution of words, where a few of the most frequent words account for most of the word occurrences. As a result, frequent words naturally gain favor in being reconstructed accurately, while rare words tend to be reconstructed with less precision. This problem is analogous to the imbalanced classification setting. It is especially problematic when frequent words carry little information about the task of interest, which is not uncommon. Examples include stop words (the, a, this, from) and topic-related terms (movie, watch, actress) in a movie review sentiment analysis task.

Semisupervised Autoencoder with Bregman Divergence

To address the problems mentioned above, we propose to introduce supervision to the training of autoencoders. To achieve this, we first train a linear classifier on Bag of Words, and then use the weights of the learned classifier to define a new loss function for the autoencoder. Let us first describe our choice of loss function, and then elaborate on the motivation:

    D(x̃, x) = (θ^T (x̃ − x))²,     (4)

where θ ∈ R^d are the weights of the linear classifier, and we have omitted the bias for simplicity. Before we delve into more details, note that Equation (4) is a valid distance, as it is non-negative and reaches zero if and only if x̃ = x. Moreover, the reconstruction error is only measured after projecting onto θ; this guides the reconstruction to be accurate only along directions to which the linear classifier is sensitive. Note also that Equation (4) on the one hand uses label information (θ has been trained with labeled data), while on the other hand no explicit labels are directly referred to (it only requires xi). Thus one is able to train an autoencoder on both labeled and unlabeled data with the loss function in Equation (4). This subtlety distinguishes our method from purely supervised or unsupervised learning, and allows us to enjoy the benefits of both worlds.

As a design choice, we consider an SVM with squared hinge loss (SVM2) and ℓ2 regularization as the linear classifier, but other classifiers such as Logistic Regression can be used and analyzed similarly. Let us denote {xi}, xi ∈ R^d as the collection of samples, and {yi}, yi ∈ {1, −1} as the class labels; the objective function of SVM2 is:

    L(θ) = Σ_i (max(0, 1 − yi θ^T xi))² + λ||θ||²,     (5)

where θ ∈ R^d is the weight vector and λ is the weight decay parameter. Equation (5) is continuous and differentiable everywhere with respect to θ, so the model can be easily trained with stochastic gradient descent. The next (and most critical) step of our approach is to transfer label information from the linear classifier to the autoencoder. With this in mind, we examine the loss induced by each sample as a function of the input, with θ fixed:

    f(xi) = (max(0, 1 − yi θ^T xi))².     (6)

Note that f(xi) is defined on the input space R^d, which should be contrasted with L(θ) in Equation (5), which is a function of θ. We are interested in f(xi) because, if we consider moving each input xi to x̃i, f(xi) indicates the directions along which the loss is sensitive. If we think of x̃i as the reconstruction of xi obtained from an autoencoder, a good x̃i should be such that its deviation from xi is small as evaluated by f(xi). In other words, we would like x̃i to still be correctly classified by the pretrained linear classifier. Therefore, f(xi) should be a much better function for evaluating the deviation between two samples. If we can derive a Bregman Divergence from f(xi) and use it as the loss function of the subsequent autoencoder training, the autoencoder should be guided to give reconstruction errors that do not confuse the classifier. Note that f(xi) is a quadratic function of xi whenever f(xi) > 0, so we only need to derive the Hessian matrix in order to obtain the Bregman Divergence. The Hessian is:

    H(xi) = θθ^T if 1 − yi θ^T xi > 0, and H(xi) = 0 otherwise.     (7)

Recall that for a quadratic function with Hessian matrix H, the Bregman Divergence is simply (x̃ − x)^T H (x̃ − x); we then have:

    D(x̃i, xi) = (θ^T (x̃i − xi))² if 1 − yi θ^T xi > 0, and D(x̃i, xi) = 0 otherwise.     (8)

In words, Equation (8) says that we measure the reconstruction loss with Equation (4) for difficult examples (those that satisfy 1 − yi θ^T xi > 0), while there is no reconstruction loss at all for easy examples. This discrimination is undesirable, because in this case the autoencoder would completely ignore easy examples, and there is no way to guarantee that their x̃i can be correctly classified. Actually, this split is just an artifact of the hinge loss and the asymmetry of the Bregman Divergence. Hence, we perform a simple correction by ignoring the condition in Equation (8), which basically pretends that all the examples induce a loss. This directly yields the loss function in Equation (4).
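The following is a minimal sketch (our own, with assumed names and hyperparameters, not the authors' implementation) of this two-step procedure: fit the SVM2 weights θ by stochastic gradient descent on Equation (5), then score reconstructions with the corrected loss of Equation (4).

```python
import numpy as np

def train_svm2(X, y, lam=1e-4, lr=0.01, epochs=20, seed=0):
    """SGD on the squared hinge loss of Eq. (5); X is (m, d), y in {+1, -1}."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = 1.0 - y[i] * (theta @ X[i])
            # Per-sample gradient of (max(0, margin))^2 plus a share of lam*||theta||^2.
            grad = 2.0 * lam / m * theta
            if margin > 0:
                grad -= 2.0 * margin * y[i] * X[i]
            theta -= lr * grad
    return theta

def projection_loss(theta, x_tilde, x):
    """Label-informed reconstruction loss of Eq. (4): (theta^T (x_tilde - x))^2."""
    diff = theta @ (x_tilde - x)
    return diff * diff

# Toy usage with random data whose label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = np.where(X[:, 0] > 0.5, 1, -1)
theta = train_svm2(X, y)
print(projection_loss(theta, X[0] + 0.01 * rng.standard_normal(20), X[0]))
```

In the full model this loss replaces the squared error inside the Denoising Autoencoder objective of Equation (2).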
The Bayesian Marginalization

In principle, one may directly apply Equation (4) as the loss function in place of the squared Euclidean distance and train an autoencoder. However, doing so might introduce a bias brought by a single classifier. As a remedy, we resort to the Bayesian approach, which defines a probability distribution over θ. Although SVM2 is not a probabilistic classifier like Logistic Regression, we can borrow the idea of Energy Based Models (Bengio 2009) and use L(θ) as the negative log likelihood of the following distribution:

    p(θ) = exp(−βL(θ)) / ∫ exp(−βL(θ)) dθ,     (9)

where β > 0 is the temperature parameter, which controls the shape of the distribution p. Note that the larger β is, the sharper p will be. In the extreme cases, p(θ) reduces to a uniform distribution as β approaches 0, and collapses into a single δ function as β goes to positive infinity.

Given p(θ), we rewrite Equation (4) as an expectation over θ:

    D(x̃, x) = E_{θ∼p(θ)} (θ^T (x̃ − x))² = ∫ (θ^T (x̃ − x))² p(θ) dθ.     (10)

Obviously there is now no closed-form expression for D(x̃, x). To solve it one could use sampling methods such as MCMC, which provide unbiased estimates of the expectation but can be slow in practice. Instead, we use the Laplace approximation, which approximates p(θ) by a Gaussian distribution p̃(θ) = N(θ̂, Σ). As estimating the full covariance matrix is prohibitive, we further constrain Σ to be diagonal. The benefit of doing so is that the expectation can now be computed directly in closed form. To see this, simply replace p(θ) with p̃(θ) in Equation (10):

    D(x̃, x) = E_{θ∼p̃(θ)} (θ^T (x̃ − x))²
             = (x̃ − x)^T E_{θ∼p̃(θ)}[θθ^T] (x̃ − x)
             = (x̃ − x)^T (θ̂θ̂^T + Σ) (x̃ − x)
             = (θ̂^T (x̃ − x))² + (Σ^{1/2} (x̃ − x))^T (Σ^{1/2} (x̃ − x)),     (11)

where D now involves two parts, corresponding to the mean and variance terms of the Gaussian distribution, respectively. Now let us derive p̃(θ) from p(θ). In the Laplace approximation, θ̂ is chosen as the mode of p(θ), which is exactly the solution to the SVM2 optimization problem. For Σ, we have:

    Σ = (diag(∂²L(θ)/∂θ²))^{−1} = (1/β) (diag(Σ_i I(1 − yi θ^T xi > 0) xi²))^{−1},     (12)

where we have overloaded diag, letting it denote a diagonal matrix induced by either a square matrix or a vector; I is the indicator function; and (·)^{−1} denotes the matrix inverse. Interestingly, the second term in Equation (11) is now equivalent to the squared Euclidean distance after element-wise normalization of the input using all difficult examples. The effect of this normalization is that the reconstruction errors of frequent words are down-weighted, while discriminative words are given higher weights, as they occur less frequently in difficult examples. Note that it is important to use a relatively large β in order to avoid the variance term dominating the mean term. In other words, we need to ensure that p(θ) is reasonably peaked around θ̂ to effectively take advantage of the label information.
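The sketch below (our own illustration; the names, the toy β, and the tiny stabilizing epsilon are assumptions) computes the diagonal Laplace covariance of Equation (12) from the SVM2 solution and evaluates the marginalized loss of Equation (11) as the sum of a mean term and a variance term.

```python
import numpy as np

def laplace_sigma_diag(theta_hat, X, y, beta=1e6):
    """Diagonal covariance of Eq. (12), computed from the difficult examples."""
    margins = 1.0 - y * (X @ theta_hat)           # (m,)
    difficult = margins > 0                        # indicator I(1 - y_i theta^T x_i > 0)
    hess_diag = (X[difficult] ** 2).sum(axis=0)    # sum_i I(...) x_i^2, element-wise
    return 1.0 / (beta * hess_diag + 1e-12)        # small eps (our addition) for stability

def marginalized_loss(theta_hat, sigma_diag, x_tilde, x):
    """Eq. (11): mean term (theta_hat^T d)^2 plus variance term d^T Sigma d."""
    d = x_tilde - x
    mean_term = (theta_hat @ d) ** 2
    var_term = float(np.sum(sigma_diag * d * d))
    return mean_term + var_term

# Toy usage, reusing a pre-trained theta_hat (e.g. from the SVM2 sketch above).
rng = np.random.default_rng(0)
X = rng.random((50, 10))
y = np.where(X[:, 0] > 0.5, 1, -1)
theta_hat = 0.1 * rng.standard_normal(10)
sigma_diag = laplace_sigma_diag(theta_hat, X, y)
print(marginalized_loss(theta_hat, sigma_diag, X[0], X[0] + 0.01))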
Experiments

Datasets

We evaluate our model on six Sentiment Analysis benchmarks. The first one is the IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/) (Maas et al. 2011), which consists of movie reviews collected from IMDB. The IMDB dataset is one of the largest publicly available sentiment analysis datasets; it also comes with an unlabeled set, which allows us to evaluate semisupervised learning methods. The remaining five datasets are all collected from Amazon (http://www.cs.jhu.edu/~mdredze/datasets/sentiment/) (Blitzer, Dredze, and Pereira 2007) and correspond to reviews of five different product categories: books, DVDs, music, electronics, and kitchenware. All six datasets are already tokenized as uni-gram or bi-gram features. For computational reasons, we only select the words that occur in at least 30 training examples. We summarize the statistics of the datasets in Table 1.

Table 1: Statistics of the datasets.

              IMDB     books    DVD      music    electronics  kitchenware
# train       25,000   10,000   10,000   18,000   6,000        6,000
# test        25,000   3,105    2,960    2,661    2,862        1,691
# unlabeled   50,000   N/A      N/A      N/A      N/A          N/A
# features    8,876    9,849    10,537   13,099   5,091        3,907
% positive    50       49.81    49.85    50.16    49.78        50.08
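As a concrete reading of the vocabulary-selection step above, here is a short sketch (our own; the array names and toy data are assumptions) that keeps only the columns of a document-term count matrix whose document frequency in the training set is at least 30.

```python
import numpy as np

def filter_vocabulary(train_counts, min_doc_freq=30):
    """Keep terms that occur in at least `min_doc_freq` training documents.

    `train_counts` is an (m, V) array of raw term counts; the returned column
    mask can be reused so the test split shares the same vocabulary.
    """
    doc_freq = (train_counts > 0).sum(axis=0)   # number of documents containing each term
    return doc_freq >= min_doc_freq

# Toy usage.
rng = np.random.default_rng(0)
train_counts = rng.poisson(0.05, size=(1000, 5000))
keep = filter_vocabulary(train_counts)
train_bow = train_counts[:, keep]
print(train_bow.shape)
```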
Methods

• Bag of Words (BoW). Instead of using the raw word counts directly, we take a simple data normalization step:

      xi,j = log(1 + ci,j) / max_j log(1 + ci,j),     (13)

  where ci,j denotes the number of occurrences of the jth word in the ith document and xi,j denotes the normalized count. We choose this normalization because it preserves the sparsity of the Bag of Words features; also, each feature element is normalized to the range [0, 1] (a small sketch of this normalization appears after this list). Note that the very same normalized Bag of Words features are fed into the autoencoders.

• Denoising Autoencoder (DAE) (Vincent et al. 2008). This refers to the regular Denoising Autoencoder defined in Equation (1) with the squared Euclidean distance loss:
      D(x̃, x) = ||x̃ − x||².

  This loss is also used in (Glorot, Bordes, and Bengio 2011) on the Amazon datasets for domain adaptation. We use ReLU, max(0, x), as the activation function and Sigmoid as the decoding function.

• Denoising Autoencoder with Finetuning (DAE+) (Vincent et al. 2008). This denotes the common approach of continuing to train a DAE on labeled data by replacing the decoding part of the DAE with a Softmax layer.

• Feedforward Neural Network (NN). This is a standard fully connected neural network with one hidden layer and random initialization. We use the same activation function as in the autoencoders, i.e., ReLU.

• Logistic Regression with Dropout (LrDrop) (Wager, Wang, and Liang 2013). This is a model where logistic regression is regularized with the marginalized dropout noise. LrDrop differs from our approach in that it uses feature noising as an explicit regularizer. Another difference is that our model is able to learn nonlinear representations, not merely a classifier, and is thus potentially able to model more complicated patterns in the data.

• Semisupervised Bregman Divergence Autoencoder (SBDAE). This corresponds to our model with a Denoising Autoencoder as the feature learner. The training process is roughly equivalent to training on BoW followed by the training of a DAE, except that the loss function of the DAE is replaced with the loss function defined in Equation (11). We cross-validate β over the set {10^4, 10^5, 10^6, 10^7, 10^8} (note that a larger β corresponds to weaker Bayesian regularization).

• Semisupervised Bregman Divergence Autoencoder with Finetuning (SBDAE+).
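Referring back to the BoW item above, here is a minimal sketch (our own; the names are assumptions) of the normalization in Equation (13), applied row-wise to a document-term count matrix.

```python
import numpy as np

def normalize_bow(counts):
    """Eq. (13): x_{i,j} = log(1 + c_{i,j}) / max_j log(1 + c_{i,j}).

    Zero counts stay zero, so sparsity is preserved, and every feature
    falls in [0, 1]. `counts` is an (m, V) array of raw word counts.
    """
    logc = np.log1p(counts.astype(float))
    row_max = logc.max(axis=1, keepdims=True)
    row_max[row_max == 0] = 1.0          # guard against empty documents (our addition)
    return logc / row_max

# Toy usage.
counts = np.array([[3, 0, 1], [0, 0, 7]])
print(normalize_bow(counts))
```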
Note that except for BoW and LrDrop, all the other methods require a predefined dimensionality of the representation. We use fixed sizes on all the datasets. For SBDAE and NN, a small hidden size is sufficient, so we use 200. For DAE, we observe that it benefits from very large hidden sizes; however, due to computational constraints, we take 2000. For BoW, DAE, and SBDAE, we use SVM2 as the classifier. All the models are trained with mini-batch Stochastic Gradient Descent with a momentum of 0.9.

Results

We first summarize the results as classification error rates in Table 2. First of all, our model consistently beats BoW by a margin, and it achieves the best results on four (the larger ones) of the six datasets. On the other hand, DAE, DAE+, and NN all fail to outperform BoW, although they share the same architecture as nonlinear classifiers. This suggests that SBDAE is able to learn a much better nonlinear feature transformation by training with a more informed objective (than that of DAE). Moreover, note also that finetuning on the labeled set (DAE+) significantly improves the performance of DAE, which is ultimately on a par with training a neural net with random initialization (NN). However, finetuning offers little help to SBDAE, as it is already implicitly guided by labels during training.

LrDrop is the second best method that we have tested. Thanks to its use of dropout regularization, it consistently outperforms BoW and achieves the best results on the two (smaller) datasets. Compared with LrDrop, it appears that our model works better on large datasets (≈ 10K words, more than 10K training examples) than on smaller ones. This indicates that in high dimensional spaces with sufficient samples, SBDAE benefits from learning a nonlinear feature transformation that disentangles the underlying factors of variation, while LrDrop is incapable of doing so due to its nature as a linear classifier.

As the training of the autoencoder part of SBDAE does not require the availability of labels, we also try incorporating unlabeled data after learning the linear classifier in SBDAE. As shown in Table 2, doing so further improves the performance over using labeled data only. This suggests that it is possible to bootstrap from a relatively small amount of labeled data and learn better representations with more unlabeled data using SBDAE.

To gain more insight into the results, we further visualize the filters learned by SBDAE and DAE on the IMDB dataset in Table 3. In particular, we show the top 5 most activated and deactivated words of the first 8 filters (corresponding to the first 8 rows of W) of SBDAE and DAE, respectively. First of all, it seems very difficult to make sense of the filters of DAE, as they are mostly common words with no clear co-occurrence pattern. By comparison, if we look at the filters from SBDAE, they are mostly sensitive to words that demonstrate clear polarity. In particular, all 8 filters seem to be most activated by certain negative words, and are most deactivated by certain positive words. In this way, the activation of each filter of SBDAE is much more indicative of the polarity than that of DAE, which explains the better performance of SBDAE over DAE. Note that this difference comes only from reweighting the reconstruction errors in a certain way, with no explicit usage of labels.

Table 2: Classification error rates (%). Left: our model achieves the best results on four (the larger ones) of the six datasets. Right: our model is able to take advantage of unlabeled data and gains better performance.

          books   DVD     music   electronics  kitchenware  IMDB    IMDB + unlabeled
BoW       10.76   11.82   11.80   10.41        9.34         11.48   N/A
DAE       15.10   15.64   15.44   14.74        12.48        14.60   13.28
DAE+      11.40   12.09   11.80   11.53        9.23         11.48   11.47
NN        11.05   11.89   11.42   11.15        9.16         11.60   N/A
LrDrop    9.53    10.95   10.90   9.81         8.69         10.88   10.73
SBDAE     9.16    10.90   10.59   10.02        8.87         10.52   10.42
SBDAE+    9.12    10.90   10.58   10.01        8.83         10.50   10.41

Table 3: Visualization of learned feature maps. From top to bottom: most activated and deactivated words for SBDAE; most activated and deactivated words for DAE. Each column corresponds to one of the first 8 filters.

SBDAE, most activated:
nothing disappointing badly save even dull excuse ridiculously
cannon worst disappointing redeeming attempt fails had dean
outrageously unfortunately annoying awful unfunny stupid failed none
lends terrible worst sucks couldn't worst rest ruined
teacher predictable poorly convince worst avoid he attempt

SBDAE, most deactivated:
first tears loved amazing excellent perfect years with
classic wonderfully finest incredible surprisingly ? terrific best
man helps noir funniest beauty powerful peter recommended
hard awesome magnificent unforgettable unexpected excellent cool perfect
still terrific scared captures appreciated favorite allows heart

DAE, most activated:
long wasn't probably to making laugh tv someone
worst guy fan the give find might yet
kids music kind and performances where found goes
anyone work years this least before kids away
trying now place shows comes ever having poor

DAE, most deactivated:
done least go kind recommend although ending worth
find book trying takes instead everyone once interesting
before day looks special wife anything wasn't isn't
work actors everyone now shows comes american rather
watching classic performances someone night away sense around

Related Work and Discussion

Our work falls into the general category of learning representations for text data. In particular, there have been many efforts to learn compact representations for either words or documents (Turney and Pantel 2010; Blei, Ng, and Jordan 2003; Deerwester et al. 1990; Mikolov et al. 2013; Le and Mikolov 2014; Maas et al. 2011). LDA (Blei, Ng, and Jordan 2003) explicitly learns a set of topics, each of which is defined as a distribution over words; a document is thus represented as the posterior distribution over topics, which is a fixed-length, non-negative vector. Closely related are matrix factorization models such as LSA (Deerwester et al. 1990) and Non-negative Matrix Factorization (NMF) (Xu, Liu, and Gong 2003). While LSA factorizes the doc-term matrix via Singular Value Decomposition, NMF learns non-negative basis and coefficient vectors. Similar to these efforts, our model also works directly on the doc-term matrix. However, thanks to the use of an autoencoder, the representations for documents are computed instantly via a direct matrix product, which eliminates the need for expensive inference. Our work also distinguishes itself from other work
as a semisupervised representation learning model, where label information can be effectively leveraged.

Recently, there has also been an active thread of research on learning word representations. Notably, (Mikolov et al. 2013) shows that interesting word embeddings can be learned via a very simple architecture on a large amount of unlabeled data. Moreover, (Le and Mikolov 2014) proposed to jointly learn representations for sentences and paragraphs together with words in a similar unsupervised fashion. While our work does not explicitly model representations for words, it is straightforward to incorporate this idea by adding an additional linear layer at the bottom of the autoencoder.

From the perspective of machine learning methodology, our approach resembles the idea of layer-wise pretraining in deep Neural Networks (Bengio 2009). Our model differs from the traditional training procedure of autoencoders in that we effectively utilize the label information to guide the representation learning. A related idea was proposed in (Socher et al. 2011), where Recursive Autoencoders are trained on sentences jointly with the prediction of sentiment. Due to the delicate recursive architecture, their model only works on sentences with given parsing trees, and does not generalize to documents. MTC (Rifai et al. 2011a) is another work that models the interaction of autoencoders and classifiers. However, their training of autoencoders is purely unsupervised; the interaction comes into play by requiring the classifier to be invariant along the tangents of the learned data manifold. It is not difficult to see that the assumption of MTC would not hold when the class labels do not align well with the data manifold, which is a situation our model does not suffer from.

Conclusion

In this paper, we have proposed a novel extension to autoencoders for learning task-specific representations for textual data. We have generalized the traditional autoencoders by relaxing their loss function to a Bregman Divergence, and then derived a discriminative loss function from the label information. Experiments on text classification benchmarks have shown that our model significantly outperforms Bag of Words, the traditional Denoising Autoencoder, and other competing methods. We have also qualitatively visualized that our model successfully learns discriminative features, which unsupervised methods fail to do.
Acknowledgments

This work is supported in part by NSF (CCF-1017828).

References

Banerjee, A.; Merugu, S.; Dhillon, I. S.; and Ghosh, J. 2004. Clustering with Bregman divergences. In Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, 234–245.

Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3:993–1022.

Blitzer, J.; Dredze, M.; and Pereira, F. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.

Deerwester, S. C.; Dumais, S. T.; Landauer, T. K.; Furnas, G. W.; and Harshman, R. A. 1990. Indexing by latent semantic analysis. JASIS 41(6):391–407.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 513–520.

Le, Q. V., and Mikolov, T. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 1188–1196.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, 142–150.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, 3111–3119.

Rifai, S.; Dauphin, Y.; Vincent, P.; Bengio, Y.; and Muller, X. 2011a. The manifold tangent classifier. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, 2294–2302.

Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; and Bengio, Y. 2011b. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 833–840.

Socher, R.; Pennington, J.; Huang, E. H.; Ng, A. Y.; and Manning, C. D. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, Edinburgh, UK, 151–161.

Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Turney, P. D., and Pantel, P. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res. (JAIR) 37:141–188.

Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P. 2008. Extracting and composing robust features with denoising autoencoders. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, 1096–1103.

Wager, S.; Wang, S. I.; and Liang, P. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, 351–359.

Xu, W.; Liu, X.; and Gong, Y. 2003. Document clustering based on non-negative matrix factorization. In SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 28 - August 1, 2003, Toronto, Canada, 267–273.