
Open Classification of Text Document Topics

Qian Yu, Varadarajan Srinivasan, and Leslie Teo
UC Berkeley MIDS

Abstract

Due to the dynamic nature of online text, new documents may not belong to any of the previously defined training classes. Deep Open Classification (DOC) (Shu, Xu, and Liu, 2017) is a recent deep learning approach that addresses this challenge. Its architecture consists of a CNN with a 1-vs-Rest output layer.

We leverage the underlying method laid out by Shu, Xu, and Liu (2017) but modify it to explore clustering algorithms in the output layer for detecting open-class documents. We compare our experiments with the results reported in the DOC reference paper. Our results show that, at least for the data and tuning we were able to perform, a 1-vs-Rest approach still does better than clustering algorithms at identifying the "unseen" class.[1]

1 Credits

This project is based on the seminal paper on open classification, DOC: Deep Open Classification of Text Documents (Shu, Xu, and Liu, 2017). We also referred to other papers on lifelong machine learning (Chen and Liu, 2016), convolutional neural networks for sentence classification (Kim, 2014), paragraph vectors (Le and Mikolov, 2014), and task clustering (Thrun and O'Sullivan, 1996).

We are also grateful to Ian Tenney for reviewing our recommendations, shaping the proposal, and mentoring us along the way.

[1] https://github.com/qianyu88/W266_project_submission

2 Introduction

News websites need to identify new topic classes as they continuously receive streams of new data. A natural language processing model can be used to quickly identify whether an incoming news feed relates to an existing set of topics or to a new one. A supervised text classification model can be trained to classify documents by topic or genre given good labeled training data. However, in the Web 2.0 world, new content is constantly being generated by social media, news articles, and blogs. Due to the dynamic nature of this content, a new incoming document may not belong to any previously "known" class but rather to a new, unseen one. This violates the key assumption of supervised learning: that predictions at inference time are based on what has been observed during training.

One approach to identifying new topic classes is open world classification (Fei and Liu, 2016), in which a 1-vs-Rest classifier is trained to detect an unseen class. Open classification is also part of a newer machine learning paradigm called Lifelong Machine Learning (LML) (Chen and Liu, 2014). It is particularly valuable for learning from the abundant and multifarious information on the web. In the natural language setting, open world classification can be used not only to filter unwanted documents but also to discover new categories. It has several real-world applications, namely: (1) identifying new topics and genres in social media, e.g. new Twitter topics, news, or Facebook trends; (2) filtering email or other text documents whose topics may grow or change over time; and (3) online learning (Thrun and O'Sullivan, 1996).

3 Background

Our implementation of open world classification builds on the approach proposed in the DOC paper (Shu, Xu, and Liu, 2017). As suggested in that paper, we also used a Convolutional Neural Network (CNN) architecture, owing to CNNs' performance and efficiency gains on sentence classification tasks (Kim, 2014). In this architecture, a 1-vs-Rest output layer with m sigmoid functions is used for open classification, where m is the number of "known" classes. The sigmoid predictions are reinterpreted at testing time to detect the unseen open class: a document is classified as belonging to the open (or unseen) class if its sigmoid probabilities fall below the thresholds of all labeled classes.
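This rejection rule can be sketched as follows (a minimal illustration of the idea, not the DOC authors' code; the probability matrix and the calibrated per-class thresholds are assumed inputs):

```python
import numpy as np

def predict_with_rejection(probs, thresholds):
    """1-vs-Rest open classification decision (sketch).

    probs:      (n_docs, m) array of per-class sigmoid outputs.
    thresholds: (m,) array of per-class rejection thresholds.
    Returns class indices, with -1 denoting the open/unseen class.
    """
    probs = np.asarray(probs)
    labels = probs.argmax(axis=1)
    # Reject to the unseen class when every class score falls below
    # that class's threshold.
    labels[np.all(probs < thresholds, axis=1)] = -1
    return labels

# Example: with thresholds of 0.5 for three known classes, a document
# scoring [0.2, 0.3, 0.1] is labeled -1 ("unseen").
print(predict_with_rejection([[0.2, 0.3, 0.1], [0.1, 0.9, 0.2]], [0.5, 0.5, 0.5]))
```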
We built on the DOC architecture in two ways: (a) we modified the method by which the threshold for marking a document as "unseen" is determined, using a validation dataset to estimate a percentile threshold that maximizes the unseen-class F1 score while also ensuring that the predicted unseen-class volume is in line with the actual unseen-class volume in the validation set; and (b) as an enhancement to Shu, Xu, and Liu's 1-vs-Rest approach, we applied unsupervised clustering methods in the output layer to predict open-class documents.

Using a clustering method for online learning is a well-known practice. In particular, task clustering (Thrun and O'Sullivan, 1996) is an older concept, but it is based on an idea similar to lifelong learning, and that idea can also be applied to the open classification problem. In task clustering, when a new task arrives, the algorithm first selects the most similar cluster and then uses that cluster's distance function for classification (Thrun, 1996b). Concretely, we can take the trained feature vectors of documents from the language model and use them as inputs to an unsupervised clustering analysis. Using an outlier detection approach, we define a threshold on each labeled cluster's probability distribution; if a new document is detected as an outlier of all clusters, it belongs to an "unseen" class.

We experimented with two different clustering methods: the Gaussian Mixture Model (GMM) and the Bayesian Gaussian Mixture Model (BGMM), in particular the Infinite Dirichlet Process (IDP). BGMM is a variant of the GMM with variational inference, in which the algorithm maximizes a lower bound on the model evidence while taking priors into account. The IDP is a prior probability distribution over clusterings with an infinite, unbounded number of partitions. This model fits the nature of open classification, where we can have an infinite number of unseen classes. Figure 1 shows a high-level view of our open classification workflow.

Figure 1: Open Classification Flow

4 Methods

4.1 Data Set

We used the 20 Newsgroups (Rennie, 2008) data set for our experiment. The data set contains 20 non-overlapping newsgroup topic classes (Figure 2), divided across 6 broader themes (politics, religion, recreation, computer, science, and for-sale). Each class has around 1000 samples.

Figure 2: The 20 Newsgroups classes
4.2 Paragraph Vector Model As Baseline

As a baseline model, we used the paragraph vector model (Le and Mikolov, 2014) for our experiment (Figure 3). The paragraph vector model produces a fixed-length feature representation of a paragraph from variable-length pieces of text. Compared to a Bag of Words (BOW) model, it provides a dense feature vector representation of documents, capturing the ordering and semantics of words, similar to a CNN model. We used the Gensim library's Doc2Vec implementation of paragraph vectors and configured the API to use the distributed memory (DM) model proposed in the Le and Mikolov paper; the DM model is inspired by the methods for learning word vectors. We learn the paragraph vectors by running the training data through the DM model in the Doc2Vec implementation, and at the inference step we infer the vector representation for the learned dataset. Ideally, we would have used a separate sample for training the paragraph vectors and a holdout set for inference; unfortunately, due to the small sample size of the newsgroups dataset, we ended up using the same training set for inferring the paragraph vectors. The paragraph vector model uses a maximum vocabulary size of 20,000 and outputs a feature vector of size 450 to align with the feature size of the CNN model. Paragraph vectors also address some of the key weaknesses of bag-of-words models: they inherit an important property of word vectors, namely the semantics of the words. For example, the word 'powerful' is closer to 'strong' than to 'Paris'.
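A minimal sketch of this setup with Gensim's Doc2Vec follows (our reconstruction under stated assumptions: tokenized_docs is an assumed list of token lists, and min_count/epochs are illustrative values, not tuned settings from our experiments):

```python
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# tokenized_docs: list of token lists, one per newsgroup document (assumed).
corpus = [TaggedDocument(words=toks, tags=[i])
          for i, toks in enumerate(tokenized_docs)]

# dm=1 selects the distributed memory (DM) variant described above;
# vector_size and max_vocab_size follow the text.
model = Doc2Vec(vector_size=450, dm=1, max_vocab_size=20000,
                min_count=2, epochs=20)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Inference: estimate the 450-dim paragraph vector for a document.
features = model.infer_vector(tokenized_docs[0])
```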
4.3 CNN Model Architecture

As our primary model we developed a convolutional neural network architecture (Figure 4). Based on recent research on document and sentence classification using CNNs (Kim, 2014; Zhang and Wallace, 2016), CNNs have been reported to offer excellent performance on sentence classification tasks compared to other state-of-the-art techniques such as RNNs. A big argument for CNNs is that they are fast: convolutions are a central part of computer graphics and are implemented at the hardware level on GPUs. Compared to something like n-grams, CNNs are also efficient in terms of representation. With a large vocabulary, computing anything more than 3-grams can quickly become expensive; even Google doesn't provide anything beyond 5-grams. Convolutional filters learn good representations automatically, without needing to represent the whole vocabulary.

We compared training the word embedding layer along with the CNN model against using Google's pre-trained word2vec as the embedding layer. The pre-trained word2vec offered better performance than word2vec trained on the newsgroups data, so we used Google's pre-trained word2vec as the CNN embedding layer for all our experiments.

Each document in the data set is padded or cut to a fixed length in words. We use a length of 500 words, which is the median document length in the dataset; this number was chosen to balance training speed against information loss, because document lengths in the dataset follow a power-law distribution. Each document is transformed into a 500×300 dense matrix with an embedding lookup table. The CNN's internal dimensions mirror the DOC architecture (Shu, Xu, and Liu, 2017), with 3 filter regions of sizes [3, 4, 5] and 150 filters each. The max-pooling layer output, which is used for the open classification analysis, has a feature size of 450.
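A sketch of this architecture in Keras (our reconstruction from the description above, not the authors' code; m is the number of known classes, and the optimizer choice is illustrative):

```python
from tensorflow import keras
from tensorflow.keras import layers

m = 5                        # number of "known" classes
seq_len, emb_dim = 500, 300  # padded document length x word2vec dimension

inputs = keras.Input(shape=(seq_len, emb_dim))
# Three filter regions of sizes 3, 4, 5 with 150 filters each; max-pooling
# over time yields the 3 * 150 = 450-dim feature vector used downstream.
pooled = []
for region in (3, 4, 5):
    conv = layers.Conv1D(150, region, activation="relu")(inputs)
    pooled.append(layers.GlobalMaxPooling1D()(conv))
features = layers.Concatenate()(pooled)  # shape (None, 450)
# 1-vs-Rest output layer: one sigmoid per known class.
outputs = layers.Dense(m, activation="sigmoid")(features)

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy")
```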
4.4 Open Classification Methods

The mechanics of the open classification analysis are as follows: we train the language models (paragraph vector or CNN), extract feature vectors from the language model, and then run open classification experiments on the extracted features using 1-vs-Rest and unsupervised clustering.

1-vs-Rest is the baseline method we used for open classification. We hold out one or more classes from the training data and then mix them back into our sample at testing time. To determine whether a new document is "unseen", we first train a model and calibrate the 1-vs-Rest predicted probabilities for the labeled classes. We then predict the probability of all documents in the test classes using that model and compare each test sample's probability to the probability distribution of the labeled classes using a threshold: if its probability is smaller than the threshold for all labeled classes, the sample is classified as 'unseen'. We use the F1 score of the 'unseen' class as the measure of the effectiveness of our predictions. The 1-vs-Rest classifier requires a base classification algorithm; we tried both logistic regression and SVM 1-vs-Rest classifiers, and the performance difference was less than 1%. Our reported results are based on the multinomial logistic regression 1-vs-Rest classifier.

For the clustering methods, we first performed dimensionality reduction, because the scikit-learn library we used for the clustering analysis could not handle the higher-dimensional embedding size (450). We chose the Latent Semantic Analysis (LSA) method, as it enables discovery of latent patterns in the data. We used SVD and an LSA normalizer to collapse the CNN/Paragraph2Vec trained vectors of dimension (D × 450) to latent dimensions of (D × 20), where D is the number of documents. We then used Gaussian Mixture Models to fit m Gaussians on the LSA-transformed data for the m seen classes. We fixed the number of components equal to the number of known classes and used the Bayes Information Criterion to select an appropriate covariance matrix.

As a last step, to predict the unseen class, we developed an approach similar to an outlier detection task. As with the 1-vs-Rest method, we hold out one or more classes from the training set and mix them back in during the validation and test phases. Once the GMM is trained on the "known" classes, we predict probabilities on a validation set comprising both known and previously unseen classes. The hyperparameter "percentile threshold" is varied to maximize the F1 score of the unseen class on the validation set. Because a high percentile threshold results in marking more documents as "unseen" and vice versa, we had to be judicious about the choice of threshold, so we added another criterion: the predicted 'unseen' class size should be equivalent to the true unseen class size in the validation data (Figure 5). This combined criterion of F1 score and unseen class size was used to derive an optimal threshold, which was then applied to the test dataset to produce the test set metrics. We followed the same approach for both the GMM and Infinite Dirichlet Process (IDP) clustering methods.

Figure 3: Percent Threshold Tuning on Validation Set
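A condensed sketch of this clustering pipeline with scikit-learn (our reconstruction; X_train and X_val are assumed (D × 450) feature matrices, m is the number of seen classes, and the 5th percentile shown stands in for the tuned value):

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline
from sklearn.mixture import GaussianMixture

# LSA: SVD down to 20 latent dimensions followed by length normalization.
lsa = make_pipeline(TruncatedSVD(n_components=20), Normalizer(copy=False))
Z_train = lsa.fit_transform(X_train)

# One Gaussian per seen class; the covariance type would be selected
# by comparing gmm.bic() across candidates, as described above.
gmm = GaussianMixture(n_components=m, covariance_type="full").fit(Z_train)
# (For the IDP variant: sklearn.mixture.BayesianGaussianMixture with
#  weight_concentration_prior_type="dirichlet_process".)

# Outlier rule: a document whose likelihood under the fitted mixture falls
# below a percentile threshold of training log-likelihoods is "unseen".
threshold = np.percentile(gmm.score_samples(Z_train), 5)
is_unseen = gmm.score_samples(lsa.transform(X_val)) < threshold
```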

5 Results and Discussion

5.1 Test Metrics

We used a 64% (training), 16% (validation), and 20% (test) data split with random shuffling for these experiments. We held out one or more classes at training time and added them back during the validation and testing phases. We tested holding back 1, 2, and 3 unseen classes; however, for the unseen-class detection task, all unseen classes were bundled together into one set. For example, when we held back 3 unseen classes, we pooled them together as one large unseen class instead of predicting each of the 3 unseen classes independently. (Note: we initially set an ambitious target of training on all 20 possible classes, but we had to scale back due to the time and computing cost of training CNN models.)

We used a weighted average of precision and recall to compute the F1 score over the "unseen" class for evaluation. In other words, for this project we focus only on how well our model predicts the "unseen" samples, without considering performance across all classes.
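The hold-out protocol described above can be sketched as follows (hypothetical variable names throughout; X and y are the full feature matrix and 20 Newsgroups labels, and the held-out class ids are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 64/16/20 split: peel off 20% for test, then 20% of the rest (= 16%)
# for validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, shuffle=True)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.20, shuffle=True)

unseen = [3, 7, 11]  # illustrative ids of the held-out topic classes

# Drop unseen classes from training only; in validation/test they are
# pooled into a single "unseen" class labeled -1.
keep = ~np.isin(y_train, unseen)
X_train, y_train = X_train[keep], y_train[keep]
y_val = np.where(np.isin(y_val, unseen), -1, y_val)
y_test = np.where(np.isin(y_test, unseen), -1, y_test)
```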
5.2 Results and Error Analysis

Model   Open Method   5+1     5+2     5+3
P2vec   1-vs-Rest     39.0%   51.0%   66.0%
P2vec   GMM           12.5%   29.2%   35.2%
P2vec   IDP           13.8%   28.5%   37.6%
CNN     1-vs-Rest     27.0%   34.0%   41.0%
CNN     GMM           13.5%   31.9%   38.8%
CNN     IDP           10.8%   27.2%   38.0%

Table 1: F1 score for the unseen class in the 5 seen + 1, 2, 3 unseen experiments

While we were not able to run our tests over all 20 groups, the results for our 5+1, 5+2, and 5+3 experiments are quite instructive.

First, the paragraph vector model did better than the CNN model when using the 1-vs-Rest approach and is on par with the CNN model when using clustering approaches for our open classification experiments. We attribute this largely to a hyperparameter tuning problem. Before running the open classification tasks, we trained the CNN model with a softmax output layer and compared its performance with the paragraph vector model on closed-class classification. Due to limitations of computing resources and time, we could not tune the CNN model's hyperparameters to outperform the paragraph vector model on closed classification tasks. As a baseline, when we used 5 labeled classes from the 20 Newsgroups data for closed classification, the paragraph vector based classifier outperformed the CNN model by 14 percentage points (85% (P2vec) vs. 71% (CNN) in accuracy).

Secondly, the 1-vs-Rest open classification approach performed better than the clustering methods. We believe the following two reasons contributed to this performance difference:
(1) Reducing the dimensionality of the pre-trained feature vectors into latent dimensions (using LSA) potentially resulted in information loss. Though the LSA transformation captured 70% of the variance in those dimensions, the classes were clustered too close to each other to generate clean clusters. To illustrate this, we embedded the LSA-transformed data using t-SNE (t-distributed stochastic neighbor embedding) to visualize the clusters. As Figure 6 shows, the data points across classes are co-mingled and clustered too close to each other.

Figure 4: 3D plot of test set by top 3 latent dimensions
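The visualization was produced along these lines (a sketch under assumed names; Z_test is the LSA-transformed test data and y_test its labels, with the perplexity value purely illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# Embed the 20-dim LSA vectors in 2-D to inspect class separation.
Z_2d = TSNE(n_components=2, perplexity=30).fit_transform(Z_test)
plt.scatter(Z_2d[:, 0], Z_2d[:, 1], c=y_test, s=5, cmap="tab10")
plt.title("t-SNE of LSA-transformed test documents")
plt.show()
```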
Examining the errors, we observe patterns in the data that confirm this spatial closeness/co-mingling hypothesis. The model struggled to separate out and cluster certain ambiguous documents accurately, i.e. documents that belong to one class, such as sci.med, but contain information that can lead to misclassification into other classes, such as talk.politics. For example, test sample 31 is an article in the science/medicine genre that refers to the funneling of federal funds allocated to health care to support defense expenditure, with references to "politicians". This article was supposed to be part of sci.med, which was an 'unseen' test class, but the GMM model predicted it under talk.politics. Losing contextual information in the lower-dimensional embedding potentially places heavier weight on the remaining word 'politicians', which is referenced throughout the article, resulting in this misclassification.

(2) The GMM clustering algorithm allows for a very rich and complex clustering of the data, but our sample size was quite small (only a few thousand documents). This resulted in skewed probability distributions, which made the selection of outliers based on those thresholds highly sensitive.

Finally, we find that both the paragraph vector and CNN models performed better as the number of unseen classes increased, which aligns with our hypothesis that larger samples in the unseen classes lead to richer probability distributions and thus more reliable thresholds for unseen-class predictions. This hypothesis, however, has to be confirmed with further experiments on a different dataset with larger sample sizes.

6 Conclusions and Future Work

In summary, our results indicate that a 1-vs-Rest approach generally does better than clustering approaches at identifying unseen classes. Without significant tuning, a paragraph vector model does better than a CNN model using the 1-vs-Rest method and is on par with the CNN model using clustering methods. We see better results for both methods as the number of unseen classes increases.

Our potential next steps are to: (A) fine-tune the CNN model using larger data sets to optimize classification performance before performing open classification tasks; (B) study the trade-off between "unseen" sample prediction accuracy and labeled sample prediction accuracy in the open classification setting; and (C) explore the application of open classification in an online learning setting.

References

Lei Shu, Hu Xu, and Bing Liu. 2017. DOC: Deep Open Classification of Text Documents.

Zhiyuan Chen and Bing Liu. 2016. Lifelong Machine Learning.

Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification.

Ye Zhang and Byron C. Wallace. 2016. A Sensitivity Analysis of (and Practitioners' Guide to) Convolutional Neural Networks for Sentence Classification.

Zhiyuan Chen and Bing Liu. 2014. Mining Topics in Documents: Standing on the Shoulders of Big Data.

Quoc Le and Tomas Mikolov. 2014. Distributed Representations of Sentences and Documents.

Sebastian Thrun and Joseph O'Sullivan. 1996. Learning More From Less Data: Experiments With Lifelong Robot Learning.
