Semisupervised Autoencoder For Sentiment Analysis
the whole vocabulary; and reconstructing irrelevant words such as 'actor' or 'movie' very well is not likely to help learn more useful representations for classifying the sentiment of movie reviews. Second, explicitly reconstructing all the words in an input text is expensive, because the latent representation has to contain all aspects of the semantic space carried by the words, even those that are completely irrelevant. As the vocabulary size can easily reach the range of tens of thousands even for a moderately sized dataset, the hidden layer size has to be chosen very large to obtain a reasonable reconstruction, which wastes a huge amount of model capacity and makes it difficult to scale to large problems.

In fact, the reasoning above applies to all unsupervised learning methods in general, which we argue is one of the most important problems to address in order to learn task-specific representations. This naturally leads us to the semisupervised approach, where label information is introduced to guide the feature learning procedure. In particular, we propose a novel loss function for training autoencoders that are directly coupled with the classification task. We first train a linear classifier on BoW; a Bregman Divergence (Banerjee et al. 2004) is then derived as the loss function of a subsequent autoencoder. The new loss function tells the autoencoder along which directions the reconstruction should be accurate, and where larger reconstruction errors are tolerated. Informally, this can be considered as a weighting of words based on their correlations with the class label: predictive words should be given large weights in the reconstruction even if they are not frequent words, and vice versa. Furthermore, to reduce the bias introduced by the linear classifier, we take a Bayesian view by defining a posterior distribution on the weights of the classifier. We then approximate the posterior with the Laplace approximation and derive the marginalized loss function for the autoencoder. We show that our model successfully learns features that are highly discriminative with respect to class labels, and that it also outperforms all the competing methods when evaluated by classification accuracy. Moreover, the derived loss can also be applied to unlabeled data, which allows the model to learn even better representations.

Model

Denoising Autoencoders

Autoencoders learn functions that can reconstruct the inputs. They are typically implemented as a neural network with one hidden layer, and one can extract the activation of the hidden layer as the new representation. Mathematically, we are given a collection of data points X = {xi}, xi ∈ R^d, i ∈ [1, m]; the objective function of an autoencoder is thus:

    min Σ_i D(x̃i, xi)
    s.t. hi = g(W xi + b), x̃i = f(W' hi + b'),     (1)

where W ∈ R^{k×d}, b ∈ R^k, W' ∈ R^{d×k}, b' ∈ R^d are the parameters to be learned; D is a loss function, such as the squared Euclidean distance ||x̃ − x||²; g and f are predefined nonlinear functions, which we set as g(x) = max(0, x) and f(x) = 1/(1 + exp(−x)) in this paper; hi is the learned representation; and x̃i is the reconstruction. A common approach is to use tied weights by setting W' = W^T; this usually works better, as it speeds up learning and prevents overfitting at the same time. For this reason, we always use tied weights in this paper.

Autoencoders turn an unsupervised learning problem into a supervised one through the self-reconstruction criterion. This enables one to use all the tools developed for supervised learning, such as back propagation, to efficiently train autoencoders. Moreover, thanks to the nonlinear functions f and g, autoencoders are able to learn nonlinear and possibly overcomplete representations, which give the model much more expressive power than linear counterparts such as PCA (LSA) (Deerwester et al. 1990).

In this paper, we adopt one of the most popular variants of autoencoders, namely the Denoising Autoencoder. A Denoising Autoencoder works by reconstructing the input from a noised version of itself. The intuition is that a robust model should be able to reconstruct the input well even in the presence of noise, due to the high correlation among features. For example, imagine deleting or adding a few words from/to a document; the semantics should remain unchanged, so the autoencoder should learn a consistent representation from all the noisy inputs. At a high level, Denoising Autoencoders are equivalent to ordinary autoencoders trained with dropout (Srivastava et al. 2014), which has been shown to be an effective regularizer for (deep) neural networks. Formally, let q(x̄|x) be a predefined noising distribution, and let x̄ be a noised sample of x: x̄ ∼ q(x̄|x). The objective function takes the form of a sum of expectations over the noisy samples:

    min Σ_i E_{q(x̄i|xi)} D(x̃i, xi)
    s.t. hi = g(W x̄i + b), x̃i = f(W' hi + b'),     (2)

where we have slightly overloaded the notation to let x̃i denote the reconstruction calculated from the noised input x̄i. While the marginal objective function requires infinitely many noised samples per data point, in practice it is sufficient to simulate it stochastically: for each example seen during stochastic gradient descent training, we randomly sample an x̄i from q(x̄i|xi) and calculate the gradient with ordinary back propagation.
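To make this training procedure concrete, the following is a minimal NumPy sketch (our own illustration, not the authors' code) of one stochastic step for a tied-weight Denoising Autoencoder with a ReLU encoder, a sigmoid decoder, masking noise for q(x̄|x), and the plain squared-error loss of Equation (1); all names and hyperparameters are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sgd_step_dae(x, W, b, b_dec, drop_prob=0.3, lr=0.1):
    """One stochastic step of a tied-weight denoising autoencoder.

    Encoder: h = relu(W x_bar + b); decoder: x_tilde = sigmoid(W.T h + b_dec).
    The loss here is the plain squared error ||x_tilde - x||^2; the paper
    later swaps this D for a label-informed Bregman divergence.
    """
    # q(x_bar | x): masking noise, i.e. randomly zero out a few coordinates.
    mask = (rng.random(x.shape) > drop_prob).astype(x.dtype)
    x_bar = x * mask

    # Forward pass.
    z = W @ x_bar + b                        # pre-activation, shape (k,)
    h = np.maximum(0.0, z)                   # ReLU encoder, g(x) = max(0, x)
    z_dec = W.T @ h + b_dec                  # tied weights: decoder uses W.T
    x_tilde = 1.0 / (1.0 + np.exp(-z_dec))   # sigmoid decoder, f(x)

    # Backward pass for D(x_tilde, x) = ||x_tilde - x||^2.
    d_xt = 2.0 * (x_tilde - x)
    d_zdec = d_xt * x_tilde * (1.0 - x_tilde)
    d_h = W @ d_zdec
    d_z = d_h * (z > 0)

    # Tied weights: W receives gradient from both the encoder and decoder paths.
    grad_W = np.outer(d_z, x_bar) + np.outer(h, d_zdec)
    W -= lr * grad_W
    b -= lr * d_z
    b_dec -= lr * d_zdec
    return float(np.sum((x_tilde - x) ** 2))

# Toy usage: d = 6 input features, k = 3 hidden units.
d, k = 6, 3
W = 0.01 * rng.standard_normal((k, d))
b, b_dec = np.zeros(k), np.zeros(d)
x = rng.random(d)
print(sgd_step_dae(x, W, b, b_dec))
```

The same update applies unchanged once D is replaced by the label-informed loss derived below.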
Loss Function as Bregman Divergence

We now discuss the proper choice of the loss function D in (2) as a specific form of Bregman Divergence. Bregman Divergence (Banerjee et al. 2004) generalizes the notion of distance in a d-dimensional space. To be concrete, given two data points x̃, x ∈ R^d and a convex function f(x) defined on R^d, the Bregman Divergence of x̃ from x with respect to f is:

    D_f(x̃, x) = f(x̃) − (f(x) + ∇f(x)^T (x̃ − x)).     (3)

In other words, Bregman Divergence measures the distance between two points x̃ and x as the deviation at x̃ between the value of f and the linear approximation of f around x.
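As a quick numerical check of Equation (3), here is a small sketch (our own illustration) showing that choosing f as the squared ℓ2 norm recovers the squared Euclidean distance, and choosing f as the sum of element-wise entropy recovers an element-wise KL-type divergence; both special cases are discussed next.

```python
import numpy as np

def bregman(f, grad_f, x_tilde, x):
    """Bregman Divergence D_f(x_tilde, x) as in Equation (3)."""
    return f(x_tilde) - (f(x) + grad_f(x) @ (x_tilde - x))

rng = np.random.default_rng(0)
x, x_tilde = rng.random(5) + 0.1, rng.random(5) + 0.1   # keep entries positive for the log

# f(x) = ||x||^2 has gradient 2x, so D_f equals the squared Euclidean distance.
sq_norm, grad_sq_norm = lambda v: float(v @ v), lambda v: 2.0 * v
assert np.isclose(bregman(sq_norm, grad_sq_norm, x_tilde, x),
                  float((x_tilde - x) @ (x_tilde - x)))

# f(x) = sum_j x_j log x_j yields the element-wise (generalized) KL divergence.
neg_ent, grad_neg_ent = (lambda v: float(np.sum(v * np.log(v))),
                         lambda v: np.log(v) + 1.0)
assert np.isclose(bregman(neg_ent, grad_neg_ent, x_tilde, x),
                  float(np.sum(x_tilde * np.log(x_tilde / x) - x_tilde + x)))
```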
Two of the most commonly used loss functions for autoencoders are the squared Euclidean distance and the element-wise KL divergence. It is not difficult to verify that they both fall into this family, by choosing f as the squared ℓ2 norm and the sum of element-wise entropy, respectively. What the two loss functions have in common is that they make no distinction among dimensions of the input. In other words, each dimension of the input is pushed to be reconstructed equally well. While autoencoders trained in this way have been shown to work very well on image data, learning much more interesting and useful features than the original pixel intensity features, they are less appropriate for modeling textual data. The reasons are twofold. First, textual data are extremely sparse and high dimensional, where the dimensionality is equal to the vocabulary size. To maintain all the information of the input in the hidden layer, a very large layer size must be adopted, which makes the training cost extremely large. Second, ordinary autoencoders are not able to deal with the power-law distribution of words, where a few of the most frequent words account for most of the word occurrences. As a result, frequent words naturally gain favor in being reconstructed accurately, while rare words tend to be reconstructed with less precision. This problem is analogous to the imbalanced classification setting. It is especially problematic when frequent words carry little information about the task of interest, which is not uncommon. Examples include stop words (the, a, this, from) and topic-related terms (movie, watch, actress) in a movie review sentiment analysis task.

Semisupervised Autoencoder with Bregman Divergence

To address the problems mentioned above, we propose to introduce supervision to the training of autoencoders. To achieve this, we first train a linear classifier on Bag of Words, and then use the weights of the learned classifier to define a new loss function for the autoencoder. Let us first describe our choice of loss function, and then elaborate on the motivation:

    D(x̃, x) = (θ^T (x̃ − x))²,     (4)

where θ ∈ R^d are the weights of the linear classifier, and we have omitted the bias for simplicity. Before we delve into more details, note that Equation (4) is a valid distance, as it is non-negative and reaches zero if and only if x̃ = x. Moreover, the reconstruction error is only measured after projecting onto θ; this guides the reconstruction to be accurate only along directions to which the linear classifier is sensitive. Note also that Equation (4) on the one hand uses label information (θ has been trained with labeled data), while on the other hand no explicit labels are directly referred to (it only requires xi). Thus one is able to train an autoencoder on both labeled and unlabeled data with the loss function in Equation (4). This subtlety distinguishes our method from purely supervised or unsupervised learning, and allows us to enjoy the benefits of both worlds.

As a design choice, we consider an SVM with squared hinge loss (SVM2) and ℓ2 regularization as the linear classifier, but other classifiers such as Logistic Regression can be used and analyzed similarly. Let us denote {xi}, xi ∈ R^d as the collection of samples, and {yi}, yi ∈ {1, −1} as the class labels; the objective function of SVM2 is:

    L(θ) = Σ_i (max(0, 1 − yi θ^T xi))² + λ||θ||²,     (5)

where θ ∈ R^d is the weight vector and λ is the weight decay parameter. Equation (5) is continuous and differentiable everywhere with respect to θ, so the model can be easily trained with stochastic gradient descent. The next (and most critical) step of our approach is to transfer label information from the linear classifier to the autoencoder. With this in mind, we examine the loss induced by each sample as a function of the input, with θ fixed:

    f(xi) = (max(0, 1 − yi θ^T xi))².     (6)

Note that f(xi) is defined on the input space R^d, which should be contrasted with L(θ) in Equation (5), which is a function of θ. We are interested in f(xi) because, if we consider moving each input xi to x̃i, f(xi) indicates the directions along which the loss is sensitive. If we think of x̃i as the reconstruction of xi obtained from an autoencoder, a good x̃i should be such that its deviation from xi is small as evaluated by f(xi). In other words, we would like x̃i to still be correctly classified by the pretrained linear classifier. Therefore, f(xi) should be a much better function for evaluating the deviation between two samples. If we can derive a Bregman Divergence from f(xi) and use it as the loss function of the subsequent autoencoder training, the autoencoder should be guided to give reconstruction errors that do not confuse the classifier. Note that f(xi) is a quadratic function of xi whenever f(xi) > 0, so we only need to derive the Hessian matrix in order to obtain the Bregman Divergence. The Hessian is:

    H(xi) = θθ^T if 1 − yi θ^T xi > 0, and H(xi) = 0 otherwise.     (7)

Recall that for a quadratic function with Hessian matrix H, the Bregman Divergence is simply (x̃ − x)^T H (x̃ − x); we then have:

    D(x̃i, xi) = (θ^T (x̃i − xi))² if 1 − yi θ^T xi > 0, and D(x̃i, xi) = 0 otherwise.     (8)

In words, Equation (8) says that we measure the reconstruction loss with Equation (4) for difficult examples (those that satisfy 1 − yi θ^T xi > 0), while there is no reconstruction loss at all for easy examples. This discrimination is undesirable, because in this case the autoencoder would completely ignore easy examples, and there is no way to guarantee that their x̃i can be correctly classified. Actually, this split is just an artifact of the hinge loss and the asymmetry of the Bregman Divergence. Hence, we perform a simple correction by ignoring the condition in Equation (8), which basically pretends that all the examples induce a loss. This directly yields the loss function in Equation (4).
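The following is a minimal sketch (our own, with assumed names and hyperparameters, not the authors' implementation) of this two-step procedure: fit the SVM2 weights θ by stochastic gradient descent on Equation (5), then score reconstructions with the corrected loss of Equation (4).

```python
import numpy as np

def train_svm2(X, y, lam=1e-4, lr=0.01, epochs=20, seed=0):
    """SGD on the squared hinge loss of Eq. (5); X is (m, d), y in {+1, -1}."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    theta = np.zeros(d)
    for _ in range(epochs):
        for i in rng.permutation(m):
            margin = 1.0 - y[i] * (theta @ X[i])
            # Per-sample gradient of (max(0, margin))^2 plus a share of lam*||theta||^2.
            grad = 2.0 * lam / m * theta
            if margin > 0:
                grad -= 2.0 * margin * y[i] * X[i]
            theta -= lr * grad
    return theta

def projection_loss(theta, x_tilde, x):
    """Label-informed reconstruction loss of Eq. (4): (theta^T (x_tilde - x))^2."""
    diff = theta @ (x_tilde - x)
    return diff * diff

# Toy usage with random data whose label depends on the first feature.
rng = np.random.default_rng(0)
X = rng.random((100, 20))
y = np.where(X[:, 0] > 0.5, 1, -1)
theta = train_svm2(X, y)
print(projection_loss(theta, X[0] + 0.01 * rng.standard_normal(20), X[0]))
```

In the full model this loss replaces the squared error inside the Denoising Autoencoder objective of Equation (2).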
The Bayesian Marginalization

In principle, one may directly apply Equation (4) as the loss function in place of the squared Euclidean distance and train an autoencoder. However, doing so might introduce a bias brought by a single classifier. As a remedy, we resort to the Bayesian approach, which defines a probability distribution over θ. Although SVM2 is not a probabilistic classifier like Logistic Regression, we can borrow the idea of Energy Based Models (Bengio 2009) and use L(θ) as the negative log likelihood of the following distribution:

    p(θ) = exp(−βL(θ)) / ∫ exp(−βL(θ)) dθ,     (9)

where β > 0 is the temperature parameter, which controls the shape of the distribution p. Note that the larger β is, the sharper p will be. In the extreme cases, p(θ) reduces to a uniform distribution as β approaches 0, and collapses into a single δ function as β goes to positive infinity.

Given p(θ), we rewrite Equation (4) as an expectation over θ:

    D(x̃, x) = E_{θ∼p(θ)} (θ^T (x̃ − x))² = ∫ (θ^T (x̃ − x))² p(θ) dθ.     (10)

Obviously there is now no closed-form expression for D(x̃, x). To solve it one could use sampling methods such as MCMC, which provide unbiased estimates of the expectation but can be slow in practice. Instead, we use the Laplace approximation, which approximates p(θ) by a Gaussian distribution p̃(θ) = N(θ̂, Σ). As estimating the full covariance matrix is prohibitive, we further constrain Σ to be diagonal. The benefit of doing so is that the expectation can now be computed directly in closed form. To see this, simply replace p(θ) with p̃(θ) in Equation (10):

    D(x̃, x) = E_{θ∼p̃(θ)} (θ^T (x̃ − x))²
             = (x̃ − x)^T E_{θ∼p̃(θ)}[θθ^T] (x̃ − x)
             = (x̃ − x)^T (θ̂θ̂^T + Σ) (x̃ − x)
             = (θ̂^T (x̃ − x))² + (Σ^{1/2} (x̃ − x))^T (Σ^{1/2} (x̃ − x)),     (11)

where D now involves two parts, corresponding to the mean and variance terms of the Gaussian distribution, respectively. Now let us derive p̃(θ) from p(θ). In the Laplace approximation, θ̂ is chosen as the mode of p(θ), which is exactly the solution to the SVM2 optimization problem. For Σ, we have:

    Σ = (diag(∂²L(θ)/∂θ²))^{−1} = (1/β) (diag(Σ_i I(1 − yi θ^T xi > 0) xi²))^{−1},     (12)

where we have overloaded diag, letting it denote a diagonal matrix induced by either a square matrix or a vector; I is the indicator function; and (·)^{−1} denotes the matrix inverse. Interestingly, the second term in Equation (11) is now equivalent to the squared Euclidean distance after element-wise normalization of the input using all difficult examples. The effect of this normalization is that the reconstruction errors of frequent words are down-weighted, while discriminative words are given higher weights, as they occur less frequently in difficult examples. Note that it is important to use a relatively large β in order to avoid the variance term dominating the mean term. In other words, we need to ensure that p(θ) is reasonably peaked around θ̂ to effectively take advantage of the label information.
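The sketch below (our own illustration; the names, the toy β, and the tiny stabilizing epsilon are assumptions) computes the diagonal Laplace covariance of Equation (12) from the SVM2 solution and evaluates the marginalized loss of Equation (11) as the sum of a mean term and a variance term.

```python
import numpy as np

def laplace_sigma_diag(theta_hat, X, y, beta=1e6):
    """Diagonal covariance of Eq. (12), computed from the difficult examples."""
    margins = 1.0 - y * (X @ theta_hat)           # (m,)
    difficult = margins > 0                        # indicator I(1 - y_i theta^T x_i > 0)
    hess_diag = (X[difficult] ** 2).sum(axis=0)    # sum_i I(...) x_i^2, element-wise
    return 1.0 / (beta * hess_diag + 1e-12)        # small eps (our addition) for stability

def marginalized_loss(theta_hat, sigma_diag, x_tilde, x):
    """Eq. (11): mean term (theta_hat^T d)^2 plus variance term d^T Sigma d."""
    d = x_tilde - x
    mean_term = (theta_hat @ d) ** 2
    var_term = float(np.sum(sigma_diag * d * d))
    return mean_term + var_term

# Toy usage, reusing a pre-trained theta_hat (e.g. from the SVM2 sketch above).
rng = np.random.default_rng(0)
X = rng.random((50, 10))
y = np.where(X[:, 0] > 0.5, 1, -1)
theta_hat = 0.1 * rng.standard_normal(10)
sigma_diag = laplace_sigma_diag(theta_hat, X, y)
print(marginalized_loss(theta_hat, sigma_diag, X[0], X[0] + 0.01))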
Experiments

Datasets

We evaluate our model on six Sentiment Analysis benchmarks. The first one is the IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/) (Maas et al. 2011), which consists of movie reviews collected from IMDB. The IMDB dataset is one of the largest publicly available sentiment analysis datasets; it also comes with an unlabeled set, which allows us to evaluate semisupervised learning methods. The remaining five datasets are all collected from Amazon (http://www.cs.jhu.edu/~mdredze/datasets/sentiment/) (Blitzer, Dredze, and Pereira 2007) and correspond to reviews of five different product categories: books, DVDs, music, electronics, and kitchenware. All six datasets are already tokenized as uni-gram or bi-gram features. For computational reasons, we only select the words that occur in at least 30 training examples. We summarize the statistics of the datasets in Table 1.

Table 1: Statistics of the datasets.

              IMDB     books    DVD      music    electronics  kitchenware
# train       25,000   10,000   10,000   18,000   6,000        6,000
# test        25,000   3,105    2,960    2,661    2,862        1,691
# unlabeled   50,000   N/A      N/A      N/A      N/A          N/A
# features    8,876    9,849    10,537   13,099   5,091        3,907
% positive    50       49.81    49.85    50.16    49.78        50.08
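As a concrete reading of the vocabulary-selection step above, here is a short sketch (our own; the array names and toy data are assumptions) that keeps only the columns of a document-term count matrix whose document frequency in the training set is at least 30.

```python
import numpy as np

def filter_vocabulary(train_counts, min_doc_freq=30):
    """Keep terms that occur in at least `min_doc_freq` training documents.

    `train_counts` is an (m, V) array of raw term counts; the returned column
    mask can be reused so the test split shares the same vocabulary.
    """
    doc_freq = (train_counts > 0).sum(axis=0)   # number of documents containing each term
    return doc_freq >= min_doc_freq

# Toy usage.
rng = np.random.default_rng(0)
train_counts = rng.poisson(0.05, size=(1000, 5000))
keep = filter_vocabulary(train_counts)
train_bow = train_counts[:, keep]
print(train_bow.shape)
```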
Methods

• Bag of Words (BoW). Instead of using the raw word counts directly, we take a simple data normalization step:

      xi,j = log(1 + ci,j) / max_j log(1 + ci,j),     (13)

  where ci,j denotes the number of occurrences of the jth word in the ith document and xi,j denotes the normalized count. We choose this normalization because it preserves the sparsity of the Bag of Words features; also, each feature element is normalized to the range [0, 1] (a small sketch of this normalization appears after this list). Note that the very same normalized Bag of Words features are fed into the autoencoders.

• Denoising Autoencoder (DAE) (Vincent et al. 2008). This refers to the regular Denoising Autoencoder defined in Equation (1) with the squared Euclidean distance loss:
      D(x̃, x) = ||x̃ − x||².

  This loss is also used in (Glorot, Bordes, and Bengio 2011) on the Amazon datasets for domain adaptation. We use ReLU, max(0, x), as the activation function and Sigmoid as the decoding function.

• Denoising Autoencoder with Finetuning (DAE+) (Vincent et al. 2008). This denotes the common approach of continuing to train a DAE on labeled data by replacing the decoding part of the DAE with a Softmax layer.

• Feedforward Neural Network (NN). This is a standard fully connected neural network with one hidden layer and random initialization. We use the same activation function as in the autoencoders, i.e., ReLU.

• Logistic Regression with Dropout (LrDrop) (Wager, Wang, and Liang 2013). This is a model where logistic regression is regularized with the marginalized dropout noise. LrDrop differs from our approach in that it uses feature noising as an explicit regularizer. Another difference is that our model is able to learn nonlinear representations, not merely a classifier, and is thus potentially able to model more complicated patterns in the data.

• Semisupervised Bregman Divergence Autoencoder (SBDAE). This corresponds to our model with a Denoising Autoencoder as the feature learner. The training process is roughly equivalent to training on BoW followed by the training of a DAE, except that the loss function of the DAE is replaced with the loss function defined in Equation (11). We cross-validate β over the set {10^4, 10^5, 10^6, 10^7, 10^8} (note that a larger β corresponds to weaker Bayesian regularization).

• Semisupervised Bregman Divergence Autoencoder with Finetuning (SBDAE+).
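Referring back to the BoW item above, here is a minimal sketch (our own; the names are assumptions) of the normalization in Equation (13), applied row-wise to a document-term count matrix.

```python
import numpy as np

def normalize_bow(counts):
    """Eq. (13): x_{i,j} = log(1 + c_{i,j}) / max_j log(1 + c_{i,j}).

    Zero counts stay zero, so sparsity is preserved, and every feature
    falls in [0, 1]. `counts` is an (m, V) array of raw word counts.
    """
    logc = np.log1p(counts.astype(float))
    row_max = logc.max(axis=1, keepdims=True)
    row_max[row_max == 0] = 1.0          # guard against empty documents (our addition)
    return logc / row_max

# Toy usage.
counts = np.array([[3, 0, 1], [0, 0, 7]])
print(normalize_bow(counts))
```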
Note that except for BoW and LrDrop, all the other methods require a predefined dimensionality of the representation. We use fixed sizes on all the datasets. For SBDAE and NN, a small hidden size is sufficient, so we use 200. For DAE, we observe that it benefits from very large hidden sizes; however, due to computational constraints, we take 2000. For BoW, DAE, and SBDAE, we use SVM2 as the classifier. All the models are trained with mini-batch Stochastic Gradient Descent with a momentum of 0.9.

Results

We first summarize the results as classification error rates in Table 2. First of all, our model consistently beats BoW by a margin, and it achieves the best results on four (the larger ones) of the six datasets. On the other hand, DAE, DAE+, and NN all fail to outperform BoW, although they share the same architecture as nonlinear classifiers. This suggests that SBDAE is able to learn a much better nonlinear feature transformation by training with a more informed objective (than that of DAE). Moreover, note also that finetuning on the labeled set (DAE+) significantly improves the performance of DAE, which is ultimately on a par with training a neural net with random initialization (NN). However, finetuning offers little help to SBDAE, as it is already implicitly guided by labels during training.

LrDrop is the second best method that we have tested. Thanks to its use of dropout regularization, it consistently outperforms BoW and achieves the best results on the two (smaller) datasets. Compared with LrDrop, it appears that our model works better on large datasets (≈ 10K words, more than 10K training examples) than on smaller ones. This indicates that in high dimensional spaces with sufficient samples, SBDAE benefits from learning a nonlinear feature transformation that disentangles the underlying factors of variation, while LrDrop is incapable of doing so due to its nature as a linear classifier.

As the training of the autoencoder part of SBDAE does not require the availability of labels, we also try incorporating unlabeled data after learning the linear classifier in SBDAE. As shown in Table 2, doing so further improves the performance over using labeled data only. This suggests that it is possible to bootstrap from a relatively small amount of labeled data and learn better representations with more unlabeled data using SBDAE.

To gain more insight into the results, we further visualize the filters learned by SBDAE and DAE on the IMDB dataset in Table 3. In particular, we show the top 5 most activated and deactivated words of the first 8 filters (corresponding to the first 8 rows of W) of SBDAE and DAE, respectively. First of all, it seems very difficult to make sense of the filters of DAE, as they are mostly common words with no clear co-occurrence pattern. By comparison, if we look at the filters from SBDAE, they are mostly sensitive to words that demonstrate clear polarity. In particular, all 8 filters seem to be most activated by certain negative words, and are most deactivated by certain positive words. In this way, the activation of each filter of SBDAE is much more indicative of the polarity than that of DAE, which explains the better performance of SBDAE over DAE. Note that this difference comes only from reweighting the reconstruction errors in a certain way, with no explicit usage of labels.

Table 2: Classification error rates (%). Left: our model achieves the best results on four (the larger ones) of the six datasets. Right: our model is able to take advantage of unlabeled data and gains better performance.

          books   DVD     music   electronics  kitchenware  IMDB    IMDB + unlabeled
BoW       10.76   11.82   11.80   10.41        9.34         11.48   N/A
DAE       15.10   15.64   15.44   14.74        12.48        14.60   13.28
DAE+      11.40   12.09   11.80   11.53        9.23         11.48   11.47
NN        11.05   11.89   11.42   11.15        9.16         11.60   N/A
LrDrop    9.53    10.95   10.90   9.81         8.69         10.88   10.73
SBDAE     9.16    10.90   10.59   10.02        8.87         10.52   10.42
SBDAE+    9.12    10.90   10.58   10.01        8.83         10.50   10.41

Table 3: Visualization of learned feature maps. From top to bottom: most activated and deactivated words for SBDAE; most activated and deactivated words for DAE. Each column corresponds to one of the first 8 filters.

SBDAE, most activated:
nothing disappointing badly save even dull excuse ridiculously
cannon worst disappointing redeeming attempt fails had dean
outrageously unfortunately annoying awful unfunny stupid failed none
lends terrible worst sucks couldn't worst rest ruined
teacher predictable poorly convince worst avoid he attempt

SBDAE, most deactivated:
first tears loved amazing excellent perfect years with
classic wonderfully finest incredible surprisingly ? terrific best
man helps noir funniest beauty powerful peter recommended
hard awesome magnificent unforgettable unexpected excellent cool perfect
still terrific scared captures appreciated favorite allows heart

DAE, most activated:
long wasn't probably to making laugh tv someone
worst guy fan the give find might yet
kids music kind and performances where found goes
anyone work years this least before kids away
trying now place shows comes ever having poor

DAE, most deactivated:
done least go kind recommend although ending worth
find book trying takes instead everyone once interesting
before day looks special wife anything wasn't isn't
work actors everyone now shows comes american rather
watching classic performances someone night away sense around

Related Work and Discussion

Our work falls into the general category of learning representations for text data. In particular, there have been many efforts to learn compact representations for either words or documents (Turney and Pantel 2010; Blei, Ng, and Jordan 2003; Deerwester et al. 1990; Mikolov et al. 2013; Le and Mikolov 2014; Maas et al. 2011). LDA (Blei, Ng, and Jordan 2003) explicitly learns a set of topics, each of which is defined as a distribution over words; a document is thus represented as the posterior distribution over topics, which is a fixed-length, non-negative vector. Closely related are matrix factorization models such as LSA (Deerwester et al. 1990) and Non-negative Matrix Factorization (NMF) (Xu, Liu, and Gong 2003). While LSA factorizes the doc-term matrix via Singular Value Decomposition, NMF learns non-negative basis and coefficient vectors. Similar to these efforts, our model also works directly on the doc-term matrix. However, thanks to the use of an autoencoder, the representations for documents are computed instantly via a direct matrix product, which eliminates the need for expensive inference. Our work also distinguishes itself from other work
as a semisupervised representation learning model, where label information can be effectively leveraged.

Recently, there has also been an active thread of research on learning word representations. Notably, (Mikolov et al. 2013) shows that interesting word embeddings can be learned via a very simple architecture on a large amount of unlabeled data. Moreover, (Le and Mikolov 2014) proposed to jointly learn representations for sentences and paragraphs together with words in a similar unsupervised fashion. While our work does not explicitly model representations for words, it is straightforward to incorporate this idea by adding an additional linear layer at the bottom of the autoencoder.

From the perspective of machine learning methodology, our approach resembles the idea of layer-wise pretraining in deep Neural Networks (Bengio 2009). Our model differs from the traditional training procedure of autoencoders in that we effectively utilize the label information to guide the representation learning. A related idea was proposed in (Socher et al. 2011), where Recursive Autoencoders are trained on sentences jointly with the prediction of sentiment. Due to the delicate recursive architecture, their model only works on sentences with given parsing trees, and does not generalize to documents. MTC (Rifai et al. 2011a) is another work that models the interaction of autoencoders and classifiers. However, their training of autoencoders is purely unsupervised; the interaction comes into play by requiring the classifier to be invariant along the tangents of the learned data manifold. It is not difficult to see that the assumption of MTC would not hold when the class labels do not align well with the data manifold, which is a situation our model does not suffer from.

Conclusion

In this paper, we have proposed a novel extension to autoencoders for learning task-specific representations for textual data. We have generalized the traditional autoencoders by relaxing their loss function to a Bregman Divergence, and then derived a discriminative loss function from the label information. Experiments on text classification benchmarks have shown that our model significantly outperforms Bag of Words, the traditional Denoising Autoencoder, and other competing methods. We have also qualitatively visualized that our model successfully learns discriminative features, which unsupervised methods fail to do.
Acknowledgments

This work is supported in part by NSF (CCF-1017828).

References

Banerjee, A.; Merugu, S.; Dhillon, I. S.; and Ghosh, J. 2004. Clustering with Bregman divergences. In Proceedings of the Fourth SIAM International Conference on Data Mining, Lake Buena Vista, Florida, USA, April 22-24, 2004, 234–245.

Bengio, Y. 2009. Learning deep architectures for AI. Foundations and Trends in Machine Learning 2(1):1–127.

Blei, D. M.; Ng, A. Y.; and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3:993–1022.

Blitzer, J.; Dredze, M.; and Pereira, F. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In ACL 2007, Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, June 23-30, 2007, Prague, Czech Republic.

Deerwester, S. C.; Dumais, S. T.; Landauer, T. K.; Furnas, G. W.; and Harshman, R. A. 1990. Indexing by latent semantic analysis. JASIS 41(6):391–407.

Glorot, X.; Bordes, A.; and Bengio, Y. 2011. Domain adaptation for large-scale sentiment classification: A deep learning approach. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 513–520.

Le, Q. V., and Mikolov, T. 2014. Distributed representations of sentences and documents. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21-26 June 2014, 1188–1196.

Maas, A. L.; Daly, R. E.; Pham, P. T.; Huang, D.; Ng, A. Y.; and Potts, C. 2011. Learning word vectors for sentiment analysis. In The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Proceedings of the Conference, 19-24 June, 2011, Portland, Oregon, USA, 142–150.

Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G. S.; and Dean, J. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, 3111–3119.

Rifai, S.; Dauphin, Y.; Vincent, P.; Bengio, Y.; and Muller, X. 2011a. The manifold tangent classifier. In Advances in Neural Information Processing Systems 24: 25th Annual Conference on Neural Information Processing Systems 2011, Granada, Spain, 2294–2302.

Rifai, S.; Vincent, P.; Muller, X.; Glorot, X.; and Bengio, Y. 2011b. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, ICML 2011, Bellevue, Washington, USA, June 28 - July 2, 2011, 833–840.

Socher, R.; Pennington, J.; Huang, E. H.; Ng, A. Y.; and Manning, C. D. 2011. Semi-supervised recursive autoencoders for predicting sentiment distributions. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, EMNLP 2011, 27-31 July 2011, Edinburgh, UK, 151–161.

Srivastava, N.; Hinton, G. E.; Krizhevsky, A.; Sutskever, I.; and Salakhutdinov, R. 2014. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15(1):1929–1958.

Turney, P. D., and Pantel, P. 2010. From frequency to meaning: Vector space models of semantics. J. Artif. Intell. Res. (JAIR) 37:141–188.

Vincent, P.; Larochelle, H.; Bengio, Y.; and Manzagol, P. 2008. Extracting and composing robust features with denoising autoencoders. In Machine Learning, Proceedings of the Twenty-Fifth International Conference (ICML 2008), Helsinki, Finland, June 5-9, 2008, 1096–1103.

Wager, S.; Wang, S. I.; and Liang, P. 2013. Dropout training as adaptive regularization. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013, Lake Tahoe, Nevada, United States, 351–359.

Xu, W.; Liu, X.; and Gong, Y. 2003. Document clustering based on non-negative matrix factorization. In SIGIR 2003: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, July 28 - August 1, 2003, Toronto, Canada, 267–273.