Open Classification Final Report

Abstract

Due to the dynamic nature of online text, new online documents may not belong to any of the previously defined training classes. Deep Open Classification (Shu, Xu, and Liu, 2017) is a new deep learning based approach that presents a solution to this challenge. The architecture consists of a CNN with a 1-vs-Rest output layer.

We leverage the underlying method laid out by Shu, Xu, and Liu (2017), but modify it to explore clustering algorithms in the output layer to determine open class documents. We compare our experiment with the results reported by the DOC reference paper (Shu, Xu, and Liu, 2017). Our results show that, at least for the data and tuning we are able to perform, a 1-vs-Rest approach still does better than clustering algorithms in identifying the "unseen" class. [1]

1 Credits

This project is based on the seminal paper on open classification titled DOC: Deep Open Classification of Text Documents (Shu, Xu, and Liu, 2017). We also referred to other papers on lifelong machine learning (Chen and Liu, 2016), convolutional neural networks for sentence classification (Kim, 2014), paragraph vectors (Le and Mikolov, 2014), and task clustering (Thrun and O'Sullivan, 1996).

We are also really grateful to Ian Tenney for reviewing our recommendations, shaping the proposal, and mentoring us along the way.

[1] https://github.com/qianyu88/W266_project_submission

2 Introduction

News websites have a need to identify new topic classes as they continuously receive streams of new data. A natural language processing model can be used to quickly identify whether an incoming news feed is related to an existing set of topics or to a new topic. A supervised text classification model can be trained to learn and classify documents based on topics or genres given good labeled training data. However, in the web 2.0 world, new content is constantly being generated by social media, news articles, and blogs. Due to the dynamic nature of this content, a new incoming document may not belong to any previously "known" class but rather to a new, unseen one. The key assumption of supervised learning, that predictions at inference time are based on what has been observed before, is therefore violated.

One approach to identifying new topic classes is called open world classification (Fei and Liu, 2016), in which a 1-vs-Rest classifier is trained to detect an unseen class. Open classification is also part of a new machine learning paradigm called Lifelong Machine Learning (LML) (Chen and Liu, 2014a). It is particularly valuable for learning from the abundant and multifarious information on the web. In the natural language learning setting, open world classification can not only be used to filter unwanted documents but also to discover new categories. Open world classification has several real world applications, namely, (1) identifying new topics and genres in social media, e.g. new Twitter topics, news, or Facebook trends, (2) filtering email or other text documents where topics may grow or change over a period of time, and (3) online learning (Thrun and O'Sullivan, 1996).

3 Background
Our implementation of open world classification builds on the approach proposed in the DOC paper (Shu, Xu, and Liu, 2017). As suggested in that paper, we also used a Convolutional Neural Network (CNN) architecture due to CNN's performance and efficiency gains on sentence classification tasks (Kim, 2014). In this architecture, a 1-vs-Rest output layer with m sigmoid functions is used for open classification, where m is the number of "known" classes. The predictions of the sigmoid functions are reinterpreted at test time to determine the unseen open class: a document is classified as belonging to the open (or unseen) class if its sigmoid probabilities fall below the thresholds of all labeled classes.
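As a minimal sketch of this rejection rule (the helper below and its example threshold values are our own illustration, not the exact DOC implementation), the test-time decision can be written as:

    import numpy as np

    def predict_open(sigmoid_probs, thresholds, open_label=-1):
        """Reinterpret 1-vs-Rest sigmoid outputs at test time.

        sigmoid_probs: shape (m,), one probability per "known" class.
        thresholds:    shape (m,), one rejection threshold per class.
        Returns the index of the best-scoring known class, or `open_label`
        when every probability falls below its class threshold.
        """
        sigmoid_probs = np.asarray(sigmoid_probs, dtype=float)
        thresholds = np.asarray(thresholds, dtype=float)
        if np.all(sigmoid_probs < thresholds):
            return open_label                 # rejected by all known classes
        return int(np.argmax(sigmoid_probs))  # most confident known class

    # Example: all three scores fall below 0.5, so the document is flagged as unseen.
    print(predict_open([0.21, 0.34, 0.18], [0.5, 0.5, 0.5]))  # -1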
We built on the DOC paper architecture by (a) modifying the method by which the threshold for marking a document as part of an "unseen" class is determined: in our approach we use a validation dataset to estimate a percentile threshold that maximizes the unseen class F1 score while also ensuring that the predicted unseen class volume is in line with the actual unseen class volume in the validation set, and (b) as an enhancement to Shu, Xu, and Liu's 1-vs-Rest approach, applying unsupervised clustering methods in the output layer to predict open class documents of unseen classes. Figure 1 shows a high-level view of our open classification work flow.

Figure 1: Open Classification Flow
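A rough sketch of this percentile-based threshold estimation is shown below; the percentile grid, the helper name, and the use of scikit-learn's f1_score are our own illustrative assumptions, and the sketch omits the volume-matching check described above.

    import numpy as np
    from sklearn.metrics import f1_score

    def fit_percentile_thresholds(val_probs, val_labels, open_label=-1,
                                  percentiles=(5, 10, 15, 20, 25)):
        """For each candidate percentile p, set every class threshold to the
        p-th percentile of that class's own sigmoid scores on its validation
        documents, then keep the percentile whose thresholds maximize the
        unseen-class F1 score on the validation set.

        val_probs:  (n_docs, m) sigmoid scores from the 1-vs-Rest output layer.
        val_labels: (n_docs,) true labels, with `open_label` marking held-out
                    documents that simulate the unseen class.
        """
        best_thresholds, best_f1 = None, -1.0
        m = val_probs.shape[1]
        for p in percentiles:
            thresholds = np.array([
                np.percentile(val_probs[val_labels == c, c], p) for c in range(m)
            ])
            preds = np.where(np.all(val_probs < thresholds, axis=1),
                             open_label, np.argmax(val_probs, axis=1))
            f1 = f1_score((val_labels == open_label).astype(int),
                          (preds == open_label).astype(int))
            if f1 > best_f1:
                best_thresholds, best_f1 = thresholds, f1
        return best_thresholds, best_f1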
4 Methods

4.1 Data Set

We used the 20 Newsgroups (Rennie, 2008) data set for our experiment. The data set contains 20 non-overlapping classes (Figure 2) of newsgroup topics. The topics are divided across 6 broader themes (politics, religion, recreation, computer, science, and for-sale). Each class has around 1000 samples.
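For reference, one minimal way to obtain this corpus and simulate an open class by holding out a few categories is sketched below; the scikit-learn loader is standard, but the specific held-out categories are an illustrative assumption rather than our actual split.

    from sklearn.datasets import fetch_20newsgroups

    # Download the full 20 Newsgroups corpus with headers/footers/quotes stripped.
    data = fetch_20newsgroups(subset="all",
                              remove=("headers", "footers", "quotes"))

    # Hold out a few categories to play the role of the "unseen" open class;
    # the categories chosen here are purely illustrative.
    unseen = {"sci.space", "rec.autos", "talk.politics.mideast"}
    known_idx = [i for i, y in enumerate(data.target)
                 if data.target_names[y] not in unseen]
    open_idx = [i for i, y in enumerate(data.target)
                if data.target_names[y] in unseen]

    print(f"{len(data.data)} documents total: "
          f"{len(known_idx)} in known classes, {len(open_idx)} held out as unseen")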