A Bayesian Classification Approach Using Class-Specific Features for Text Categorization

Bo Tang, Student Member, IEEE, Haibo He, Senior Member, IEEE, Paul M. Baggenstoss, Senior Member, IEEE, and Steven Kay, Fellow, IEEE

Abstract: In this paper, we present a Bayesian classification approach for automatic text categorization using class-specific features. Unlike conventional text categorization approaches, our proposed method selects a specific feature subset for each class. To apply these class-specific features for classification, we follow Baggenstoss's PDF Projection Theorem (PPT) to reconstruct the PDFs in the raw data space from the class-specific PDFs in the low-dimensional feature subspace, and we build a Bayesian classification rule. One noticeable advantage of our approach is that most feature selection criteria, such as Information Gain (IG) and Maximum Discrimination (MD), can be easily incorporated into it. We evaluate our method's classification performance on several real-world benchmarks and compare it with state-of-the-art feature selection approaches. The superior results demonstrate the effectiveness of the proposed approach and further indicate its wide potential applications in data mining.

Index Terms: Feature selection, text categorization, class-specific features, PDF projection and estimation, naive Bayes, dimension reduction

B. Tang, H. He, and S. Kay are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881. E-mail: {btang, he, kay}@ele.uri.edu.
P.M. Baggenstoss is with Fraunhofer FKIE, Fraunhoferstr. 20, 53343 Wachtberg, Germany. E-mail: [email protected].
Manuscript received 6 Aug. 2015; revised 7 Jan. 2016; accepted 18 Jan. 2016. Date of publication 27 Jan. 2016; date of current version 27 Apr. 2016. Recommended for acceptance by R. Gemulla. Digital Object Identifier no. 10.1109/TKDE.2016.2522427.

1 INTRODUCTION

THE wide availability of web documents in electronic form requires an automatic technique to label documents with a predefined set of topics, a task known as automatic Text Categorization (TC). Over the past decades, a large number of advanced machine learning algorithms have been developed to address this challenging task. By formulating the TC task as a classification problem, many existing learning approaches can be applied [1], [2], [3].

The key challenge in TC is learning in a very high-dimensional data space. Documents are usually represented by the bag-of-words model: each word or phrase that occurs in a document one or more times is considered a feature. For a given data set, the collection of all words or phrases forms a dictionary with hundreds of thousands of features. Learning from such high-dimensional features may lead to a high computational burden and may even hurt the classification performance of classifiers because of irrelevant and redundant features. To ameliorate the curse-of-dimensionality issue and to speed up the learning process of classifiers, it is necessary to perform feature reduction to reduce the number of features.

Feature selection is a common feature reduction approach for TC, in which only a subset of features is kept and the rest are discarded. In general, feature selection methods fall into three categories: the filter approach, the wrapper approach, and the embedded approach [4]. The filter approach evaluates the importance of each individual feature with a score based on the characteristics of the data, and only the features with the highest scores are selected. In contrast to the filter approach, which does not involve the learning criteria, the wrapper approach greedily selects better features guided by the learning criteria. The greedy search in the wrapper approach, however, requires training classifiers at each step and leads to a high computational burden. The embedded approach can be considered a combination of the filter and wrapper approaches: it not only measures the importance of each individual feature but also employs a search procedure guided by a learning algorithm. In practice, because of its simplicity and efficiency, the filter approach is predominantly used in TC.

To train a classifier for a multi-class TC problem, the same (or global) feature subset is required for all classes. However, most filter approaches first calculate class-dependent feature scores, i.e., the feature importance for each class is measured. For example, the Mutual Information (MI) approach measures the mutual dependency between a binary feature and each predefined class label as the feature score. To measure the feature importance globally (for all classes), a combination operation, such as summation, maximization, or weighted averaging, is needed. One major disadvantage is that the combination operation may bias the feature importance for discrimination. Moreover, there is little theoretical support for choosing the best combination operation, and thus researchers and engineers usually need to find the best one through extensive empirical studies for a specific TC task [3].
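The contrast between the two selection strategies can be made concrete with a small sketch (ours, for illustration only; the score matrix, the function names, and the use of NumPy are assumptions and not part of the proposed method):

import numpy as np

def select_global(scores, k, combine="max"):
    # scores: (N_classes, M_features) class-dependent scores, e.g., MI or IG
    if combine == "sum":
        global_scores = scores.sum(axis=0)
    elif combine == "max":
        global_scores = scores.max(axis=0)
    else:
        global_scores = scores.mean(axis=0)   # a weighted average would use class priors here
    return np.argsort(global_scores)[::-1][:k]    # one feature subset shared by all classes

def select_class_specific(scores, k):
    # one top-k subset per class; no combination operation is needed
    return [np.argsort(scores[i])[::-1][:k] for i in range(scores.shape[0])]

The global variant must commit to one combination function, whereas the class-specific variant simply keeps the per-class rankings.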
In this paper, instead of using a combination operation to select a global feature subset for all classes, we select a specific feature subset for each class, namely class-specific features. Existing feature importance evaluation criteria can still be applied in our proposed approach. Using Baggenstoss's PDF Projection Theorem (PPT) [5], [6], we build a Bayesian classification rule for text categorization with these selected class-specific features.

The rest of the paper is organized as follows: In Section 2, we introduce background and previous work on document representation, naive Bayes, and feature selection techniques for automatic text categorization. In Section 3, we present our proposed Bayesian method in detail. Experimental results are described and analyzed in Section 4, and a conclusion is given in Section 5.
2 BACKGROUND AND RELATED WORK

2.1 Background

Consider a TC problem with N predefined topics, and let c_i be the class label taking value i ∈ {1, 2, ..., N}. For a given data set, we form a dictionary D with M terms. Following the bag-of-words concept, a document can be represented by a feature vector x = [x_1, x_2, ..., x_M]^T, where the m-th element x_m of x corresponds to the m-th term in D. In TC, both binary-valued and real-valued feature models have been widely used. In the binary-valued feature model, the feature value is either 1 or 0, indicating whether or not a particular term occurs in the document. In the real-valued feature model, the feature usually refers to the term frequency (TF), defined as the number of times that a particular term appears in the document. These two feature models are both widely used in TC for classification as well as for feature selection. Under the probabilistic framework of naive Bayes, the binary-valued feature model is used in Bernoulli naive Bayes (BNB), and the real-valued feature model is used in multinomial naive Bayes (MNB) or Poisson naive Bayes (PNB). For classification, empirical studies have shown that the real-valued feature model offers better performance than the binary-valued feature model [7]. Interestingly, at the stage of feature selection, the binary-valued feature model is more commonly used than the real-valued one. We present the feature selection approach in detail using these two feature models in the following section.
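As a concrete illustration of the two feature models (our own sketch; the tokenization and dictionary handling are simplified assumptions, not the paper's implementation):

from collections import Counter

def to_feature_vectors(doc_tokens, dictionary):
    # dictionary: list of M terms; doc_tokens: list of word tokens in one document
    counts = Counter(doc_tokens)
    tf = [counts[term] for term in dictionary]     # real-valued (term-frequency) model
    binary = [1 if c > 0 else 0 for c in tf]       # binary-valued model
    return binary, tf

# Example: with D = ["ball", "game", "stock"], the document "ball game ball"
# yields binary = [1, 1, 0] and tf = [2, 1, 0].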
For classification, naive Bayes gains popularity due to its efficiency and simplicity [8], [9]. Empirical studies have shown that naive Bayes could provide competitive performance compared
readers to the references [5], [6]. Note that in Eq. (5) one can, in theory, use a class-specific reference hypothesis c_{0,i} for each class, but we choose to use a common one, c_0, in this paper. By incorporating the above reconstructed PDF into the Bayesian classification rule in Eq. (3), we have

$$\hat{c} \propto \arg\max_{i \in \{1,2,\ldots,N\}} \left[ \log p(\mathbf{x} \mid c_0) + \log \frac{p(\mathbf{z}_i \mid c_i)}{p(\mathbf{z}_i \mid c_0)} + \log p(c_i) \right] \propto \arg\max_{i \in \{1,2,\ldots,N\}} \left[ \log \frac{p(\mathbf{z}_i \mid c_i)}{p(\mathbf{z}_i \mid c_0)} + \log p(c_i) \right]. \quad (6)$$

The key challenge in using Eq. (6) for classification is to find a reference class c_0 under which both p(x | c_0) and p(z_i | c_0) can be estimated or derived. A good choice of reference class c_0 is the combination of all classes [21]. Note that, in TC, both the original feature vector x and the class-specific feature vector z_i follow multinomial distributions, as shown in Eq. (1), and hence it is easy to obtain closed forms of p(x | c_0) and p(z_i | c_0), i = 1, 2, ..., N.
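To make explicit how the reference class enters Eq. (6), recall the form of the projected PDF from the PDF Projection Theorem [5], [6] (restated here in our notation as a reading aid, not as an additional result): for a class-specific feature mapping $\mathbf{z}_i = T_i(\mathbf{x})$, the reconstructed raw-space PDF is

$$p(\mathbf{x} \mid c_i) \;=\; \frac{p(\mathbf{x} \mid c_0)}{p(\mathbf{z}_i \mid c_0)}\, p(\mathbf{z}_i \mid c_i).$$

Substituting this reconstruction into the MAP rule $\hat{c} = \arg\max_i \left[\log p(\mathbf{x} \mid c_i) + \log p(c_i)\right]$ and dropping $\log p(\mathbf{x} \mid c_0)$, which is common to all classes, yields the two forms in Eq. (6).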
Our method first forms a new reference class c_0 that consists of all documents of a given training data set. Each term in class c_0 still follows a multinomial distribution. For each class, we then select the first K features z_i according to a specific feature selection criterion, such as MI, IG, etc., as shown in Eq. (4). Let I_i = [n_1^i, n_2^i, ..., n_K^i] be the selected feature index vector, such that the k-th feature (term or phrase) in z_i corresponds to the n_k^i-th feature in x. For each class-specific feature subset z_i, i = 1, 2, ..., N, we further estimate the multinomial distribution parameters under the reference class c_0, denoted by $\hat{\boldsymbol{\theta}}_{i|c_0} = [\hat{p}_{n_1^i|0}, \hat{p}_{n_2^i|0}, \ldots, \hat{p}_{n_K^i|0}]$, and the multinomial distribution parameters under the class c_i, denoted by $\hat{\boldsymbol{\theta}}_i = [\hat{p}_{n_1^i|i}, \hat{p}_{n_2^i|i}, \ldots, \hat{p}_{n_K^i|i}]$. Note that the parameters $\hat{\boldsymbol{\theta}}_{i|c_0}$ and $\hat{\boldsymbol{\theta}}_i$ can be estimated using Eq. (2). Hence, the PDFs in Eq. (6), $p(\mathbf{z}_i \mid c_0) = p(\mathbf{z}_i \mid c_0; \hat{\boldsymbol{\theta}}_{i|c_0})$ and $p(\mathbf{z}_i \mid c_i) = p(\mathbf{z}_i \mid c_i; \hat{\boldsymbol{\theta}}_i)$, are easily calculated according to Eq. (1). The classification decision rule in Eq. (6) can be further written as

$$\hat{c} \propto \arg\max_{i \in \{1,2,\ldots,N\}} \left[ \sum_{k=1}^{K} z_k \log \frac{\hat{p}_{n_k^i|i}}{\hat{p}_{n_k^i|0}} + \log p(c_i) \right], \quad (7)$$

where a constant term is omitted. We summarize the procedure of our Bayesian approach in Algorithm 1.
Algorithm 1. The Proposed Bayesian Approach Using Class-Specific Features for Text Categorization

INPUT: Documents of a given training data set with N topics.
PROCEDURE:
1. Form a reference class c_0 that consists of all documents;
for each class i = 1 : N do
  2. Calculate the score of each feature based on a specific criterion, and rank the features by score in descending order;
  3. Choose the first K features z_i, whose index vector is denoted by I_i;
  4. Estimate the parameters $\hat{\boldsymbol{\theta}}_{i|0}$ under the reference class c_0 and the parameters $\hat{\boldsymbol{\theta}}_i$ under the class c_i;
end
OUTPUT: Given a document to be classified, output the class label $\hat{c}$ using Eq. (7).
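A minimal NumPy sketch of this procedure (ours, for illustration; it assumes term-count document vectors, precomputed class-specific feature scores, and Laplace smoothing for the multinomial parameters, none of which are prescribed in this exact form by the paper):

import numpy as np

def train(X, y, scores, K, alpha=1.0):
    # X: (n_docs, M) array of term counts; y: array of labels in {0, ..., N-1}
    # scores: (N, M) class-specific feature scores (e.g., IG, MI, MD)
    N = scores.shape[0]
    total_counts = X.sum(axis=0)                   # step 1: reference class c0 = all documents
    model = []
    for i in range(N):
        idx = np.argsort(scores[i])[::-1][:K]      # steps 2-3: top-K class-specific features I_i
        counts_i = X[y == i][:, idx].sum(axis=0)   # step 4: term counts under class c_i
        counts_0 = total_counts[idx]               #         term counts under reference class c0
        theta_i = (counts_i + alpha) / (counts_i.sum() + alpha * K)
        theta_0 = (counts_0 + alpha) / (counts_0.sum() + alpha * K)
        prior_i = np.log(np.mean(y == i))
        model.append((idx, np.log(theta_i / theta_0), prior_i))
    return model

def classify(x, model):
    # Eq. (7): sum_k z_k * log(p_hat_{n_k|i} / p_hat_{n_k|0}) + log p(c_i)
    return int(np.argmax([x[idx] @ log_ratio + prior for idx, log_ratio, prior in model]))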
4 EXPERIMENTAL RESULTS AND ANALYSIS

4.1 Data Sets

In our experiments, we test our approach for text categorization on two real-world benchmarks: 20-NEWSGROUPS and REUTERS. These two benchmarks are widely used in the text categorization literature for performance evaluation. We use the data of these two benchmarks provided by Cai et al. [22], [23].

The 20-NEWSGROUPS benchmark collects 20,000 documents that have been posted online under 20 different topics. In this benchmark, note that some topics are hierarchically categorized, and two different topics can be very closely related to each other, e.g., rec.sport.baseball and rec.sport.hockey, comp.sys.mac.hardware and comp.sys.ibm.hardware, etc.

The original REUTERS benchmark consists of 21,578 documents with 135 different topics. In our experiments, we use the ModApte version of REUTERS-21578 and remove the documents that belong to multiple topics. The resulting benchmark contains 8,293 documents with 65 different topics. However, the documents in this benchmark are highly unbalanced, and the topics are ranked with respect to the number of documents. Following previous work in the literature [7], we use its subset, named REUTERS-10, which consists of the documents from the first 10 topics.

Before performing feature extraction and classification, we introduce a preprocessing stage in which we discard the terms that occur in fewer than two documents and ignore the terms in a stoplist. For these two data sets, officially split training and testing data are provided. We use the training data set for feature selection and parameter estimation, and apply the testing data set for performance evaluation.
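A sketch of this preprocessing step (ours; the tokenized documents and the stoplist are assumed inputs):

def build_dictionary(tokenized_docs, stoplist):
    # keep terms that appear in at least two documents and are not in the stoplist
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df >= 2 and t not in stoplist)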
4.2 Results

In our experiments, we compare our approach using class-specific features with three conventional approaches using non-class-specific features, in which the global combination functions of the sum, the weighted average, and the maximum are applied, and with one more class-specific feature selection method that uses the one-vs-all scheme. The one-vs-all classification scheme trains a binary-class classifier for each individual class, which allows the use of either class-specific or non-class-specific features to train the binary-class classifiers. The feature selection criteria of IG, Chi-square, RS, and MD [17] are used to measure class-specific feature scores for both the conventional approaches and ours.

Fig. 1 shows the overall testing accuracy on these two data sets for different numbers of features, when IG, Chi-square, RS, and MD are respectively used as the feature selection criterion to compute feature scores for each class. It can be seen that the classification performance increases as more features are selected. The comparison in Fig. 1 indicates significant performance improvements of our proposed approach for a small number of features when the IG, RS, and MD feature selection criteria are used. Note, however, that the improvement for the Chi-square feature selection approach is not consistent, especially for the REUTERS-10 data set. One reason could be that the feature score measured by the Chi-square criterion does not reflect the feature importance for discrimination. It can be seen that the Chi-square criterion performs worst for the REUTERS-10 data set. We also notice that, when more features are selected, the classification performance of all considered methods improves, leading to a smaller improvement of our proposed approach.

Fig. 1. The testing accuracy of our PPT-based classifier using class-specific features, compared with the conventional approaches, when the IG, Chi-square, RS, and MD criteria are used for feature selection on the data sets of (a) 20-NEWSGROUPS and (b) REUTERS-10.
Since the REUTERS data set is highly imbalanced, we further apply the performance metrics of F-Measure and G-Mean to measure the classification performance, which are defined as follows:

$$\text{F-Measure} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}, \qquad \text{G-Mean} = \sqrt{\text{Precision} \cdot \text{Recall}}.$$
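For reference, both metrics can be computed directly from precision and recall (a small helper of ours, following the definitions above):

import math

def f_measure(precision, recall):
    return 2 * precision * recall / (precision + recall)

def g_mean(precision, recall):
    return math.sqrt(precision * recall)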
Since both the F-Measure and G-Mean metrics combine recall and precision, they are two comprehensive performance metrics and are commonly used in imbalanced learning [24], [25]. In Fig. 2, we compare the performance in terms of F-Measure and G-Mean on the 20-NEWSGROUPS and REUTERS-10 data sets when the IG and MD feature selection criteria are used. It shows that our PPT-based classifier using class-specific features offers better F-Measure and G-Mean on the 20-NEWSGROUPS data set when IG is used as the feature selection criterion. For the REUTERS-10 data set, our approach still achieves competitive performance in F-Measure and G-Mean.

Fig. 2. The testing F-Measure and G-Mean of our PPT-based classifier using class-specific features, compared with the conventional approaches, when the IG and MD criteria are used for feature selection on the data sets of 20-NEWSGROUPS and REUTERS-10.
5 CONCLUSION

In this paper, we presented a Bayesian classification approach for automatic text categorization using class-specific features. In contrast to conventional feature selection methods, it allows the most important features to be chosen for each class. To apply the class-specific features for classification, we derived a new naive Bayes rule following Baggenstoss's PDF Projection Theorem. One important advantage of our method is that many existing feature selection criteria can be easily incorporated. The experiments we conducted on several real-life data sets have shown promising performance improvements compared with the state-of-the-art methods for text categorization.
ACKNOWLEDGMENTS
This research was partially supported by the US National Science Foundation (NSF) under grants ECCS 1053717 and CCF 1439011,
and the Army Research Office under grant W911NF-12-1-0378.
REFERENCES
[1] W. Lam, M. Ruiz, and P. Srinivasan, "Automatic text categorization and its application to text retrieval," IEEE Trans. Knowl. Data Eng., vol. 11, no. 6, pp. 865-879, Nov./Dec. 1999.
[2] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[3] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289-1305, 2003.
[4] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.
[5] P. M. Baggenstoss, "Class-specific feature sets in classification," IEEE Trans. Signal Process., vol. 47, no. 12, pp. 3428-3432, Dec. 1999.
[6] P. M. Baggenstoss, "The PDF projection theorem and the class-specific method," IEEE Trans. Signal Process., vol. 51, no. 3, pp. 672-685, Mar. 2003.
[7] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in Proc. Workshop Learn. for Text Categorization, 1998, vol. 752, pp. 41-48.
[8] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models. Cambridge, MA, USA: MIT Press, 2001.
[9] L. Wang and X. Fu, Data Mining with Computational Intelligence. New York, NY, USA: Springer, 2006.
[10] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in Proc. 10th Eur. Conf. Mach. Learn., 1998, pp. 4-15.
[11] D. Koller and M. Sahami, "Hierarchically classifying documents using very few words," in Proc. 14th Int. Conf. Mach. Learn., 1997, pp. 170-178.
[12] Y. H. Li and A. K. Jain, "Classification of text documents," The Comput. J., vol. 41, no. 8, pp. 537-546, 1998.
[13] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proc. 10th Eur. Conf. Mach. Learn., 1998, pp. 137-142.
[14] B. Tang and H. He, "ENN: Extended nearest neighbor method for pattern recognition [research frontier]," IEEE Comput. Intell. Mag., vol. 10, no. 3, pp. 52-60, Aug. 2015.
[15] S. Eyheramendy, D. D. Lewis, and D. Madigan, "On the naive Bayes model for text categorization," in Proc. 9th Int. Workshop Artif. Intell. Statist., 2003, pp. 332-339.
[16] L. Galavotti, F. Sebastiani, and M. Simi, "Experiments on the use of feature selection and negative evidence in automated text categorization," in Proc. 4th Eur. Conf. Res. Adv. Technol. Digit. Libraries, 2000, pp. 59-68.
[17] B. Tang, S. Kay, and H. He, "Toward optimal feature selection in naive Bayes for text categorization," preprint, arXiv:1602.02850 [stat.ML], 2016.
[18] X. Fu and L. Wang, "A GA-based RBF classifier with class-dependent features," in Proc. Congress Evol. Comput., 2002, vol. 2, pp. 1890-1894.
[19] L. Wang, N. Zhou, and F. Chu, "A general wrapper approach to selection of class-dependent features," IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1267-1278, Jul. 2008.
[20] S. Kay, "Asymptotically optimal approximation of multidimensional pdfs by lower dimensional pdfs," IEEE Trans. Signal Process., vol. 55, no. 2, pp. 725-729, Feb. 2007.
[21] B. Tang, H. He, Q. Ding, and S. Kay, "A parametric classification rule based on the exponentially embedded family," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2, pp. 367-377, Feb. 2015.
[22] D. Cai, X. He, and J. Han, "Document clustering using locality preserving indexing," IEEE Trans. Knowl. Data Eng., vol. 17, no. 12, pp. 1624-1637, Dec. 2005.
[23] D. Cai, Q. Mei, J. Han, and C. Zhai, "Modeling hidden topics on document manifold," in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2008, pp. 911-920.
[24] H. He and E. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263-1284, Sep. 2009.
[25] B. Tang and H. He, "KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning," in Proc. IEEE Congress Evol. Comput., 2015, pp. 664-671.