
2016 IEEE International Conference on Systems, Man, and Cybernetics • SMC 2016 | October 9-12, 2016 • Budapest, Hungary

Semantic Text Classification with Tensor Space Model-based Naïve Bayes

Han-joon Kim
School of Electrical and Computer Engineering, University of Seoul, Seoul, Korea
[email protected]

Jiyun Kim
School of Electrical and Computer Engineering, University of Seoul, Seoul, Korea
[email protected]

Jinseog Kim
Department of Applied Statistics, Dongguk University, Gyeongju, Korea
[email protected]

Abstract—This paper presents a semantic naïve Bayes classification technique that is based upon our tensor space model for text representation. In our work, each Wikipedia article is defined as a single concept, and a document is represented as a 2nd-order tensor. Our method extends the conventional naïve Bayes by incorporating semantic concept features into the term feature statistics under the tensor space model. Through extensive experiments using three popular document collections, we show that the proposed method significantly outperforms the conventional naïve Bayes. Surprisingly, the classification performance reaches almost 100% in terms of F1-measure when using the Reuters-21578 and 20Newsgroups document collections.

Keywords—text classification; naïve Bayes; vector space; tensor space; Wikipedia; semantics; concepts

I. INTRODUCTION

Text classification is the task of automatically assigning an unknown textual document to one or more appropriate classes. Nowadays, the most popular approach to text classification is to use machine learning techniques that inductively build a classification model of pre-defined classes from a training set of labeled documents. This approach has been used widely in spam email filtering, sentiment analysis, readability assessment, and article triage. Popular learning methods include naïve Bayes (NB), k-nearest neighbors (k-NN), decision trees, and support vector machines (SVM). In our work, we focus on improving the naïve Bayes learning algorithm because it is a simple yet accurate technique in spite of its unrealistic independence assumption. Besides, the NB algorithm has a number of advantages compared with other learning algorithms.

In general, an NB learning process is much faster than that of other machine learning methods since its classification model can be developed with a single pass over the training documents. Basically, machine learning algorithms should effectively deal with the curse-of-dimensionality problem since text data have a huge number of term features. In this respect, the NB algorithm is less sensitive to this problem than other learning algorithms. Moreover, the NB algorithm is suitable for operational text classification systems since it is very easy to incrementally update the classification model due to its simplicity; when new documents are given as training data, the current word feature statistics are updated and additional feature evaluation is immediately carried out without re-processing the past training data. This characteristic is essential when the document collection is highly evolutionary. More importantly, the NB algorithm does not require a complex generalization process, unlike support vector machines and decision trees; it only has to calculate the feature statistics per class.

Because of the above advantages, there have been many studies to improve the naïve Bayes text classifier in two aspects. One is to combine it with meta-learning algorithms such as EM [1, 2], boosting [3], and active learning [4, 5], and the other is to enrich the representation of documents with external or internal semantic features [6, 7, 8, 9]. Recently, as good-quality external knowledge sources such as Wikipedia (http://en.wikipedia.org/), WordNet (http://wordnet.princeton.edu/), and the Open Directory Project (https://www.dmoz.org/) have been built, the second approach has often been attempted to enhance the NB algorithm. Of course, without the help of external knowledge, internal (or latent) semantic features can also be derived through singular value decomposition (SVD) [9]. The important point is that the semantic features should capture the correct meanings of the terms in a document in order to improve NB classification performance.

In our work, we propose a semantic naïve Bayes text classifier that is based upon the tensor space text model proposed in our previous research. In [10], we proposed a text model conforming to the definition of the ‘concept’ in the FCA (Formal Concept Analysis) framework. The model represents a document not as a vector but as a matrix (i.e., a 2nd-order tensor) that reflects the relationship between term features and semantic features. To realize this semantically enriched text model, we employ the Wikipedia encyclopedia as an external knowledge source in which each article is defined as a single semantic concept.


II. PRELIMINARIES

A. Tensor Space Model for Text Representation

In the vector space model, documents are represented as vectors in which each element has a weight. In contrast, our semantic tensor space model represents documents as 2nd-order tensors (i.e., S × T matrices), where S is the number of concepts (or semantics) and T is the number of terms indexed; the two modes correspond to the vector spaces for the concepts and the terms. We regard the ‘concept space’ as an independent space on an equal footing with the ‘term’ and ‘document’ spaces used in the VSM.

A concept is defined by a pair of an intent and an extent according to the formal concept analysis (FCA) principle [11]. The ‘extent’ is the set of instances that are included in the concept, and the ‘intent’ is the set of all attributes common to the instances included in the extent. In our work, the extent that represents a concept consists of a set of documents related to the concept, and the intent consists of a set of keywords extracted from that set of documents. Figure 1 illustrates the term-document matrix and the term-document-concept tensor representations for a given corpus. To represent a document corpus, rather than a term-document matrix, we can build a 3rd-order tensor with three distinct spaces: document, term, and concept. As a result, we can naturally represent terms or concepts as matrices; given a 3rd-order tensor of a document corpus, we can represent a component of each space using the other two vector spaces. That is, we can represent a document as a concept-by-term matrix, a term as a concept-by-document matrix, and a concept as a term-by-document matrix.

Fig. 1. A 3rd-order tensor of a document corpus

For a particular document corpus, we need to define a concept space to build up the text tensor. To define each dimension of the concept space, we specify a Wikipedia page as a ‘concept’. After choosing the appropriate Wikipedia articles by regarding each document as a query, we can automatically generate a reasonable semantic space for the tensor space [12].

Figure 2 illustrates an example of a matrix representation of a document. The concept-by-term matrix provides information on the concepts that exist in the document. If necessary, a document can be expressed as a 1st-order tensor (i.e., a vector) by summing all the components of each row or column; in other words, a document can be represented by a concept vector as well as a term vector. Thus we can say that the model is a generalization of the conventional vector space model. With this document representation, semantic features of concepts associated with the literal features of terms help to significantly improve the performance of text classification.

Fig. 2. Representing a document as a term-by-concept matrix
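To make the term-by-concept representation concrete, the following minimal Python/NumPy sketch builds one document's matrix. The vocabulary, concept labels, and relatedness weights here are hypothetical, and the actual weighting scheme of the paper (including its concept-window weighting) may differ.

```python
import numpy as np

# Hypothetical vocabulary and concept space (each concept stands for a Wikipedia article).
terms = ["engine", "fuel", "brake", "scripture", "prayer"]
concepts = ["Automobile", "Christianity"]

# Illustrative term-concept relatedness weights; the paper's weighting may differ.
relatedness = {
    ("engine", "Automobile"): 0.9, ("fuel", "Automobile"): 0.8,
    ("brake", "Automobile"): 0.7, ("scripture", "Christianity"): 0.9,
    ("prayer", "Christianity"): 0.8,
}

def document_matrix(token_counts):
    """Represent one document as a |T| x |S| term-by-concept matrix (2nd-order tensor)."""
    X = np.zeros((len(terms), len(concepts)))
    for i, t in enumerate(terms):
        for j, s in enumerate(concepts):
            # Cell weight: term frequency scaled by its relatedness to the concept.
            X[i, j] = token_counts.get(t, 0) * relatedness.get((t, s), 0.0)
    return X

doc = {"engine": 3, "fuel": 1, "prayer": 2}
X = document_matrix(doc)
print(X.sum(axis=1))  # summing over concepts recovers a term-oriented vector
print(X.sum(axis=0))  # summing over terms yields a concept vector for the document
```

The two sums at the end correspond to the 1st-order (vector) views of the document mentioned above: a term vector and a concept vector.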
B. Naïve Bayes Learning Framework

Naïve Bayes (NB) text classification systems produce their classification model as the result of a learning (estimation) process based on the Naïve Bayes learning algorithm. The estimated classification model consists of two kinds of parameters: the term probability estimates θ_{t|c} and the class prior probabilities θ_c; that is, the classification model is θ = (θ_{t|c}, θ_c). Each parameter can be estimated according to maximum a posteriori (MAP) estimation.

For classifying a given document, the Naïve Bayes learning system estimates the posterior probability of each class via Bayes' rule; that is, Pr(c|d) = Pr(c) Pr(d|c) / Pr(d), where Pr(c) is the class prior probability that any random document from the document collection belongs to the class c, Pr(d|c) is the probability that a randomly chosen document from the documents in class c is the document d, and Pr(d) is the probability that a randomly chosen document from the whole collection is the document d. The document d is then assigned to the class argmax_{c∈C} Pr(c|d), i.e., the class with the largest posterior probability.¹ Here, the document d is represented as a bag of words (w_1, w_2, …, w_{|d|}) in which multiple occurrences of words are preserved. Moreover, Naïve Bayes assumes that the terms in a document are mutually independent and that the probability of a term occurrence is independent of its position within the document. This assumption allows simplifying the classification function:

\Phi(d) = \arg\max_{c \in C} \Pr(c \mid d) = \arg\max_{c \in C} \Pr(c) \cdot \prod_{i=1}^{|d|} \Pr(w_i \mid c)   (1)

¹ argmax_x F(x) is the value of x for which F(x) has the largest value.
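As a concrete illustration of the decision rule in Eq. (1), the following minimal Python sketch scores classes in log space (to avoid underflow when multiplying many small probabilities). The parameter names and data layout are assumptions made for illustration, not the paper's implementation.

```python
import math

def classify(tokens, log_prior, log_likelihood):
    """Eq. (1) in log space: argmax_c [log Pr(c) + sum_i log Pr(w_i | c)].

    log_prior:      dict mapping class -> log Pr(c)
    log_likelihood: dict mapping class -> {term: log Pr(t | c)}
    """
    best_class, best_score = None, -math.inf
    for c, lp in log_prior.items():
        score = lp
        for w in tokens:
            # Terms outside the indexed vocabulary are simply skipped in this sketch;
            # Laplace smoothing (Eq. (2) below) prevents zero probabilities for known terms.
            score += log_likelihood[c].get(w, 0.0)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```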

SMC_2016 004207
2016 IEEE International Conference on Systems, Man, and Cybernetics • SMC 2016 | October 9-12, 2016 • Budapest, Hungary

To generate this classification function, Pr(c) can be simply estimated by counting the frequency with which each class value c_j occurs in the set of training documents D_t, where Pr(c_j|d_i) ∈ {0,1} is given by the class label; that is,

\Pr(c_j) = \theta_{c_j} = \frac{\sum_{d_i \in D_t} \Pr(c_j \mid d_i)}{|D_t|}.

As for Pr(t|c), its maximum likelihood estimate is TF(t,c) / Σ_{t'∈V} TF(t',c), where TF(t,c) is the number of occurrences of term t in the class c and V denotes the set of significant words extracted from the training documents. However, this estimation can produce a biased underestimate of the probability, and it gives a probability of zero for any word that does not occur in some category. To avoid this problem, the estimate for Pr(t|c) can be adjusted by Laplace's law of succession as follows:

\Pr(t \mid c) = \theta_{t \mid c} = \frac{TF(t,c) + 1}{\sum_{t' \in V} TF(t',c) + |V|}   (2)

Therefore, the learning of the Naïve Bayes classifier does not require any statistics other than those already collected in TF, and no further generalization process is necessary.
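The following minimal sketch estimates the conventional NB parameters (class priors and the Laplace-smoothed term probabilities of Eq. (2)) from labeled, tokenized documents; the function and variable names are illustrative assumptions, and the output matches the form consumed by the classify() sketch above.

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    """labeled_docs: iterable of (tokens, class_label) pairs."""
    doc_count = Counter()          # number of training documents per class
    tf = defaultdict(Counter)      # tf[c][t] = TF(t, c)
    vocab = set()
    for tokens, c in labeled_docs:
        doc_count[c] += 1
        tf[c].update(tokens)
        vocab.update(tokens)

    n_docs = sum(doc_count.values())
    log_prior = {c: math.log(n / n_docs) for c, n in doc_count.items()}

    log_likelihood = {}
    for c in doc_count:
        total = sum(tf[c].values())
        # Eq. (2): (TF(t,c) + 1) / (sum_t' TF(t',c) + |V|)
        log_likelihood[c] = {t: math.log((tf[c][t] + 1) / (total + len(vocab)))
                             for t in vocab}
    return log_prior, log_likelihood
```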
III. SEMANTIC NAÏVE BAYES TEXT CLASSIFICATION

A. Semantically Extending the Naïve Bayes

In this section, we semantically extend the conventional Naïve Bayes under the tensor space model. Note that it is necessary to incorporate the semantic information into the first NB parameter θ_{t|c} = Pr(t|c). As mentioned earlier, when additional semantic information for a document exists in our tensor space model, a document can be represented as a matrix of terms and semantics:

d = \begin{pmatrix}
x_{11} & \cdots & x_{1j} & \cdots & x_{1|S|} \\
\vdots &        & \vdots &        & \vdots   \\
x_{i1} & \cdots & x_{ij} & \cdots & x_{i|S|} \\
\vdots &        & \vdots &        & \vdots   \\
x_{|V|1} & \cdots & x_{|V|j} & \cdots & x_{|V||S|}
\end{pmatrix}.   (3)

Here, x_{ij} is a random variable for the i-th term and the j-th piece of semantic information, V is the set of terms indexed, and S is the set of semantic information (i.e., concepts) in the 2nd-order tensor space (i.e., the term-by-concept matrix) for each of the training documents. [13] proposed an approach similar to our model to improve k-NN classification performance, in which another term space is produced by folding term vectors. However, it is not straightforward to properly determine the terms required for the new term space, and the limitations of using terms as semantic units remain. In our work, with the 2nd-order tensor (matrix) representation of a document and the conditional independence assumption of the conventional Naïve Bayes, the likelihood of a document can be approximated as follows:

\Pr(d \mid c) = \Pr(x_{ij} \mid c;\ i = 1,\ldots,|V|,\ j = 1,\ldots,|S|) \approx \prod_{i=1}^{|V|} \prod_{j=1}^{|S|} \Pr(x_{ij} \mid c).   (4)

Let t_i = Σ_j x_{ij} and s_j = Σ_i x_{ij}; we then assume that the cell probability Pr(x_{ij}|c) for a given class in the term-by-concept matrix is approximated by θ_{x_{ij}|c} = P(x_{ij}|c) ≈ P(t_i, s_j|c). Since Pr(t_i, s_j|c) = Pr(t_i|s_j, c) · Pr(s_j|c), we can estimate the term probability for a given class and semantic concept as

\hat{\Pr}(t \mid s, c) = \frac{w(t,s,c) + 1}{\sum_{t' \in V} w(t',s,c) + |V|}   (5)

and estimate the semantic probability as

\hat{\Pr}(s \mid c) = \frac{w(s,c) + 1}{\sum_{s' \in S} w(s',c) + |S|}   (6)

respectively, where w(t, s, c) denotes the weighted value of each element in the tensor space, which accounts for the concept s of term t in the documents of class c. Note that [7] proposed a similar Naïve Bayes classification method that incorporates inherent semantic information obtained by applying latent topic models to the training documents without external knowledge. In contrast, the advantage of our learning method is that the meaning of a term occurring in a document can be captured more correctly with the external Wikipedia articles, and this is reflected in Equations (5) and (6) through the 2nd-order document representation.

B. Estimating Semantic Naïve Bayes Learning Parameters with the 2nd-order Text Representation

Fig. 3. Estimating the learning parameters with the term-by-concept matrix for document representation

Figure 3 depicts the 2nd-order tensor (i.e., term-by-concept matrix) for a training document d in our tensor space model. In Equation (5), the value of w(t, s, c) corresponds to the weighted value of each cell in the matrix, and |V| denotes the size of the term space (i.e., the number of terms indexed). The value of Σ_{t∈V} w(t, s, c) in the denominator can be easily obtained by summing up the weight value of each cell along the term space, which equals the w(s, c) in the numerator of Equation (6). In addition, Σ_{s∈S} w(s, c) can be obtained by summing up the w(s, c) of each concept s over the semantic concept space. In short, once the 3rd-order tensor for a given set of training documents is developed, our Naïve Bayes learning can be conducted easily through sum operations over the term-by-concept matrix.
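As a sketch of how Equations (5) and (6) reduce to sums over the term-by-concept matrices, the following Python code assumes each training document is given as a |V| × |S| NumPy weight matrix together with its class label. The names are illustrative, and the smoothing mirrors the Laplace adjustment of Eq. (2) rather than any further details of the paper.

```python
import numpy as np
from collections import defaultdict

def estimate_semantic_nb(labeled_matrices, n_terms, n_concepts):
    """labeled_matrices: iterable of (X, c), where X is a |V| x |S| term-by-concept
    weight matrix for one document and c is its class label.

    Returns, per class c:
      p_t_given_sc[c][t, s] ~ Eq. (5): (w(t,s,c)+1) / (sum_t' w(t',s,c) + |V|)
      p_s_given_c[c][s]     ~ Eq. (6): (w(s,c)+1)  / (sum_s' w(s',c) + |S|)
    """
    # Accumulate the class-level weights w(t, s, c) by summing document matrices per class.
    w = defaultdict(lambda: np.zeros((n_terms, n_concepts)))
    for X, c in labeled_matrices:
        w[c] += X

    p_t_given_sc, p_s_given_c = {}, {}
    for c, W in w.items():
        w_sc = W.sum(axis=0)                                       # w(s,c): sum along the term space
        p_t_given_sc[c] = (W + 1.0) / (w_sc + n_terms)              # Eq. (5), broadcast over terms
        p_s_given_c[c] = (w_sc + 1.0) / (w_sc.sum() + n_concepts)   # Eq. (6)
    return p_t_given_sc, p_s_given_c
```

In line with Section III.B, w(s, c) is obtained by summing the tensor along the term axis, and Σ_{s'} w(s', c) by summing that vector over the concept space; a classifier would then combine these estimates with the class priors in the spirit of Eq. (4).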

IV. EXPERIMENTS

In this section, we study the effectiveness of the proposed semantic Naïve Bayes learning method. Furthermore, we examine the performance of text classification as a function of the size of the concept window. Terms occurring in documents are semantically weighted along a pre-defined semantic concept space with a so-called ‘concept window’ that defines the context; for instance, if the size of the concept window is too small, the meanings of terms might not be well captured. Eventually, we show that our method significantly outperforms the conventional Naïve Bayes learning method through extensive experiments with three popular document collections.

A. Experimental Setup

In order to evaluate our proposed method, we used three controlled subsets of the 20Newsgroups, Reuters-21578, and OHSUMED document collections, which are accepted as clean collections and are thus commonly used to evaluate various machine learning algorithms for applications including text classification. The 20Newsgroups collection was gathered from Usenet newsgroups. It has approximately 20,000 documents, which are partitioned across 20 different newsgroups. In our work, we selected the 500 largest documents belonging to 9 distinct newsgroups including Autos, Christian, and Electronics. The Reuters-21578 collection originates from the Reuters newswire in the year 1987. For a more reliable evaluation, we generated a subset of the Reuters collection in which documents are not skewed over categories (or topics). We first selected the documents belonging to the 7 most frequent categories, including Acq, Earn, Crude, Interest, Ship, Money-fx, and Trade, and then chose approximately 1,750 documents that have only a single topic, so as to avoid the ambiguity of documents with multiple topics. Lastly, the OHSUMED collection comes from the on-line medical information database MEDLINE, which contains titles and abstracts from 270 medical journals. As for the classification metric, the classification results are discussed with respect to the micro-averaged F1-measure, which is the harmonic mean of precision and recall for the classification results; it varies from 0 to 1 and is proportionally related to classification effectiveness.
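For reference, micro-averaged F1 pools the per-class decision counts before computing precision and recall; a minimal sketch with hypothetical count inputs is shown below.

```python
def micro_f1(per_class_counts):
    """per_class_counts: iterable of (tp, fp, fn) tuples, one per class.

    Micro-averaging sums the counts over all classes first, then computes
    precision, recall, and their harmonic mean (F1).
    """
    tp = sum(c[0] for c in per_class_counts)
    fp = sum(c[1] for c in per_class_counts)
    fn = sum(c[2] for c in per_class_counts)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```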
B. Classification Evaluation

As baselines against which to measure the classification performance, we chose the conventional Naïve Bayes and Jing's Naïve Bayes [7]. In this experiment, the semantic concept space consists of more than 50 Wikipedia articles that contain top-ranked terms in each document collection. As expected, our semantic Naïve Bayes learning gives superior classification results on the test data, as shown in Table I. When classifying the documents in the 20Newsgroups and Reuters-21578 collections, our method yields almost perfect classification results. Moreover, we had expected the classification results to depend on the size of the concept window; however, we found that the classification performance is not sensitive to the size of the concept window as long as the size of the semantic space (i.e., the number of Wikipedia articles selected) is greater than 20, as seen in the table.

TABLE I. CLASSIFICATION RESULTS IN TERMS OF F1-MEASURE

  Dataset          Conventional NB   Jing's NB   Tensor Space-based Naïve Bayes
                                                 (by concept window size)
  20Newsgroups          69.8            85.4      5: 99.5   15: 99.7   25: 99.9
  Reuters-21578         80.9            85.8      5: 97.9   15: 98.6   25: 98.2
  OHSUMED               41.7             -        5: 89.3   15: 90.0   25: 89.0

V. CONCLUSIONS

This paper proposed a semantic Naïve Bayes classification method that utilizes our semantic tensor space model for text representation. To overcome the lack of semantics in the bag-of-words model, our NB learning method introduces additional semantic features that correspond to the meanings of each term in a document; the semantic features are composed from external Wikipedia articles chosen with awareness of the given training documents. As a result, in our classification learning framework, the conventional term feature statistics are split into statistics over term features and semantic features (or concepts). Through extensive experiments, we showed that the proposed method achieves almost perfect classification when classifying the documents in the 20Newsgroups and Reuters-21578 collections. In the future, we plan to design MapReduce algorithms to efficiently analyze large textual datasets, since our tensor space model is very sparse, as only a small fraction of the terms and semantics appear in any given document.

ACKNOWLEDGMENT

This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF-2015R1D1A1A09061299) funded by the Ministry of Education, and was also supported by the Mid-career Researcher Program through the National Research Foundation of Korea (NRF) grant funded by the Korea government (MISP) (No. NRF-2013R1A2A2A01017030).

REFERENCES

[1] K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell, "Text classification from labeled and unlabeled documents using EM," Machine Learning, vol. 39, no. 2, pp. 103–134, 2000.
[2] T. Tsuruoka and J. Tsujii, "Training a naïve Bayes classifier via the EM algorithm with a class distribution constraint," Proceedings of the 7th Conference on Natural Language Learning (HLT-NAACL 2003), pp. 127–134, 2003.
[3] H. J. Kim, J. U. Kim, and Y. G. Ra, "Boosting naïve Bayes text classification using uncertainty-based selective sampling," Neurocomputing, vol. 67, pp. 403–410, 2005.


[4] S. A. Engelson and I. Dagan, "Committee-based sample selection for probabilistic classifiers," Journal of Artificial Intelligence Research, vol. 11, pp. 335–360, 1999.
[5] S. B. Kim, K. S. Han, H. C. Rim, and S. H. Myaeng, "Some effective techniques for naive Bayes text classification," IEEE Transactions on Knowledge and Data Engineering, vol. 18, no. 11, pp. 1457–1466, 2006.
[6] S. Hassan, M. Rafi, and M. S. Shaikh, "Comparing SVM and naïve Bayes classifiers for text categorization with Wikitology as knowledge enrichment," Proceedings of the 14th IEEE International Multitopic Conference, pp. 31–34, 2011.
[7] H. Jing, Y. Tsao, K. U. Chen, and H. M. Wang, "Semantic naïve Bayes classifier for document classification," Proceedings of the International Joint Conference on Natural Language Processing, pp. 1117–1123, 2013.
[8] J. Kramer and C. Gordon, "Improvement of a naïve Bayes sentiment classifier using MRS-based features," Lexical and Computational Semantics, vol. 22, pp. 22–29, 2014.
[9] T. Liu, Z. Chen, B. Zhang, W. Y. Ma, and G. Wu, "Improving text classification using local latent semantic indexing," Proceedings of the IEEE International Conference on Data Mining, pp. 162–169, 2004.
[10] H. J. Kim, K. J. Hong, and J. Y. Chang, "Semantically enriching text representation model for document clustering," Proceedings of the 30th ACM Symposium on Applied Computing, pp. 922–925, 2015.
[11] R. Wille, "Formal concept analysis as mathematical theory of concepts and concept hierarchies," Formal Concept Analysis, Springer Berlin Heidelberg, pp. 1–33, 2009.
[12] K. J. Hong and H. J. Kim, "A semantic search technique with Wikipedia-based text representation model," Proceedings of the IEEE International Conference on Big Data and Smart Computing, pp. 177–182, 2016.
[13] D. Cai, X. He, and J. Han, "Tensor space model for document analysis," Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 625–626, 2006.
