
A Bayesian Classification Approach Using Class-Specific Features for Text Categorization

Bo Tang, Student Member, IEEE, Haibo He, Senior Member, IEEE, Paul M. Baggenstoss, Senior Member, IEEE, and Steven Kay, Fellow, IEEE

Abstract: In this paper, we present a Bayesian classification approach for automatic text categorization using class-specific features. Unlike conventional text categorization approaches, our proposed method selects a specific feature subset for each class. To apply these class-specific features for classification, we follow Baggenstoss's PDF Projection Theorem (PPT) to reconstruct the PDFs in the raw data space from the class-specific PDFs in low-dimensional feature subspaces, and we build a Bayesian classification rule. One notable advantage of our approach is that most feature selection criteria, such as Information Gain (IG) and Maximum Discrimination (MD), can be easily incorporated into it. We evaluate our method's classification performance on several real-world benchmarks against state-of-the-art feature selection approaches. The superior results demonstrate the effectiveness of the proposed approach and further indicate its wide potential applications in data mining.

Index Terms: Feature selection, text categorization, class-specific features, PDF projection and estimation, naive Bayes, dimension reduction

B. Tang, H. He, and S. Kay are with the Department of Electrical, Computer and Biomedical Engineering, University of Rhode Island, Kingston, RI 02881. E-mail: {btang, he, kay}@ele.uri.edu. P. M. Baggenstoss is with Fraunhofer FKIE, Fraunhoferstr. 20, 53343 Wachtberg, Germany. E-mail: [email protected]. Manuscript received 6 Aug. 2015; revised 7 Jan. 2016; accepted 18 Jan. 2016. Date of publication 27 Jan. 2016; date of current version 27 Apr. 2016. Recommended for acceptance by R. Gemulla. Digital Object Identifier no. 10.1109/TKDE.2016.2522427.

1 INTRODUCTION

The wide availability of web documents in electronic form calls for automatic techniques that label documents with a predefined set of topics, a task known as automatic Text Categorization (TC). Over the past decades, a large number of advanced machine learning algorithms have been proposed to address this challenging task. By formulating TC as a classification problem, many existing learning approaches can be applied [1], [2], [3].

The key challenge in TC is learning in a very high-dimensional data space. Documents are usually represented by the bag-of-words model: each word or phrase that occurs one or more times in the documents is treated as a feature. For a given data set, the collection of all words or phrases forms a dictionary with hundreds of thousands of features. Learning from such high-dimensional features may impose a high computational burden and may even hurt the classification performance of classifiers because of irrelevant and redundant features. To ameliorate the curse of dimensionality and to speed up the learning process of classifiers, it is necessary to perform feature reduction to reduce the number of features.

Feature selection is a common feature reduction approach for TC, in which only a subset of features is kept and the rest are discarded. In general, feature selection methods fall into three categories: the filter approach, the wrapper approach, and the embedded approach [4]. The filter approach evaluates the importance of each individual feature with a score based on the characteristics of the data, and only the features with the highest scores are selected. In contrast to the filter approach, which does not involve the learning algorithm, the wrapper approach greedily selects better features according to the learning criteria. The greedy search in the wrapper approach, however, requires training classifiers at each step and leads to a high computational burden. The embedded approach can be considered a combination of the filter and wrapper approaches: it not only measures the importance of each individual feature but also employs a search procedure guided by a learning algorithm. In practice, because of its simplicity and efficiency, the filter approach is predominantly used in TC.

To train a classifier for a multi-class TC problem, the same (or global) feature subset is required for all classes. However, most filter approaches first calculate class-dependent feature scores, i.e., the feature importance is measured for each class. For example, the Mutual Information (MI) approach measures the mutual dependency between a binary feature and each predefined class label as the feature score. To measure feature importance globally (for all classes), a combination operation, such as summation, maximization, or weighted averaging, is needed. One major disadvantage is that the combination operation may bias the feature importance for discrimination. Moreover, there is little theoretical support for choosing the best combination operation, so researchers and engineers usually have to find the best one through extensive empirical studies for a specific TC task [3].

In this paper, instead of using a combination operation to select a global feature subset for all classes, we select a specific feature subset for each class, namely class-specific features. Existing feature importance evaluation criteria can still be applied in our proposed approach. Using Baggenstoss's PDF Projection Theorem (PPT) [5], [6], we build a Bayesian classification rule for text categorization with these selected class-specific features.

The rest of the paper is organized as follows. In Section 2, we introduce background and previous work on document representation, naive Bayes, and feature selection techniques for automatic text categorization. In Section 3, we present our proposed Bayesian method in detail. Experimental results are described and analyzed in Section 4, and a conclusion is given in Section 5.

2 BACKGROUND AND RELATED WORK

2.1 Background

Consider a TC problem with N predefined topics, and let c_i be the class label taking value i in {1, 2, ..., N}. For a given data set, we form a dictionary D with M terms. Following the bag-of-words concept, a document can be represented by a feature vector x = [x_1, x_2, ..., x_M]^T, where the m-th element x_m of x corresponds to the m-th term in D. In TC, both binary-valued and real-valued feature models have been widely used. In the binary-valued feature model, the feature value is either 1 or 0, indicating whether or not a particular term occurs in the document. In the real-valued feature model, the feature usually refers to the term frequency (TF), defined as the number of times a particular term appears in the document.

These two feature models are both widely used in TC, for classification as well as for feature selection. Under the probabilistic framework of naive Bayes, the binary-valued feature model is used in Bernoulli naive Bayes (BNB), and the real-valued feature model is used in multinomial naive Bayes (MNB) or Poisson naive Bayes (PNB). For classification, empirical studies have shown that the real-valued feature model offers better performance than the binary-valued feature model [7]. Interestingly, at the feature selection stage, the binary-valued feature model is more commonly used than the real-valued one. We present the feature selection approaches in detail using these two feature models in the following section.
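To make the two feature models concrete, here is a minimal illustrative sketch (our own, not from the paper) that builds both the term-frequency and the binary document-term matrices for a fixed dictionary; the whitespace tokenization and the function name bag_of_words are simplifying assumptions:

```python
import numpy as np

def bag_of_words(documents, dictionary):
    """Return the real-valued (term-frequency) and binary-valued
    document-term matrices for a fixed dictionary of M terms."""
    index = {term: m for m, term in enumerate(dictionary)}
    tf = np.zeros((len(documents), len(dictionary)), dtype=int)
    for d, doc in enumerate(documents):
        for token in doc.lower().split():     # naive whitespace tokenization
            if token in index:
                tf[d, index[token]] += 1      # term-frequency (real-valued) model
    binary = (tf > 0).astype(int)             # occurrence (binary-valued) model
    return tf, binary

# Toy example: two short documents over a four-term dictionary
tf, binary = bag_of_words(["the game of hockey", "hockey hockey scores"],
                          ["game", "hockey", "scores", "baseball"])
```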

For classification, naive Bayes is popular because of its efficiency and simplicity [8], [9]. Empirical studies have shown that naive Bayes can provide performance competitive with state-of-the-art discriminative classifiers [10], [11], [12], such as support vector machines [13] and nearest-neighbor-based methods [14]. Naive Bayes is a model-based classification method built on the naive assumption of independent features, and three distribution models are usually applied, namely the Bernoulli, multinomial, and Poisson models, which result in the Bernoulli naive Bayes, multinomial naive Bayes, and Poisson naive Bayes classifiers, respectively. Previous studies on real-life benchmarks have shown that the MNB usually performs better than the BNB at large vocabulary sizes [2], [7], [15], and that the PNB is equivalent to the MNB when the document length and the document class are assumed to be independent [15]. For this reason, naive Bayes commonly refers to the MNB classifier. In this paper, we concentrate on the formulation of our proposed classification method using class-specific features for the MNB classifier; the method can be easily extended to the BNB and PNB classifiers.

The MNB classifier assumes that the term frequencies follow a multinomial distribution. Given a new document x_t to be classified, the MNB classifier first calculates the likelihood of observing a document x in class c_i according to the multinomial distribution:

p(x \mid c_i; \theta_i) = \frac{\left( \sum_{m=1}^{M} x_m \right)!}{x_1! \, x_2! \cdots x_M!} \prod_{m=1}^{M} p_{mi}^{x_m},    (1)

where \theta_i = [p_{1i}, p_{2i}, \ldots, p_{Mi}]^T is the parameter vector and each element is the probability that a term falls into one of the M categories. These parameters can be estimated as follows [7], [15]:

\hat{p}_{mi} = \frac{\sum_{x \in D_i} x_m + b_1}{\sum_{x \in D_i} \sum_{m=1}^{M} x_m + b_2}, \qquad m = 1, 2, \ldots, M,    (2)

where D_i is the set of all documents of class c_i, and b_1 and b_2 are two smoothing parameters that avoid the zero-probability issue. Usually, b_1 = 1 and b_2 = M are used in the literature [7], [15].

Then, the MNB classifier assigns class label c to x_t according to

c = \arg\max_{i \in \{1, 2, \ldots, N\}} p(c_i \mid x_t; \theta_i)
  \propto \arg\max_{i \in \{1, 2, \ldots, N\}} p(x_t \mid c_i; \theta_i) \, p(c_i)    (3)
  \propto \arg\max_{i \in \{1, 2, \ldots, N\}} \log p(x_t \mid c_i; \theta_i) + \log p(c_i),

where the likelihood p(x_t | c_i; \theta_i) has the form of Eq. (1) and p(c_i) is the class prior probability, which can be estimated from the given training data.
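As an illustration only (not the authors' code), the following NumPy sketch trains and applies an MNB classifier along the lines of Eqs. (1)-(3); the multinomial coefficient in Eq. (1) is dropped because it is identical for every class and does not affect the arg max, and the names X, y, b1, and b2 are our own:

```python
import numpy as np

def train_mnb(X, y, num_classes, b1=1.0, b2=None):
    """Estimate log multinomial parameters (Eq. 2) and log class priors
    from a term-frequency matrix X (documents x terms) and labels y."""
    M = X.shape[1]
    if b2 is None:
        b2 = float(M)                     # b1 = 1 and b2 = M, as in [7], [15]
    log_theta = np.zeros((num_classes, M))
    log_prior = np.zeros(num_classes)
    for i in range(num_classes):
        Xi = X[y == i]
        counts = Xi.sum(axis=0)           # per-term counts within class c_i
        log_theta[i] = np.log((counts + b1) / (counts.sum() + b2))
        log_prior[i] = np.log(len(Xi) / len(X))
    return log_theta, log_prior

def predict_mnb(X, log_theta, log_prior):
    """Eq. (3): arg max of log-likelihood plus log prior per document."""
    return (X @ log_theta.T + log_prior).argmax(axis=1)
```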
2.2 Related Work

Feature selection is widely adopted to reduce the dimensionality of data. As mentioned before, the filter and the wrapper are the two main types of feature selection approach. The high computational cost makes the wrapper approach impractical, so we concentrate on the filter approach in this work. Many filter approaches have been proposed for TC, including document frequency (DF), mutual information, information gain (IG), the Chi-square statistic, the relevance score (RS), and the GSS coefficient, among others. Owing to space limitations, we give only a brief description of these criteria in this section and refer the interested reader to the survey paper [3] for details.

The common idea behind these feature selection criteria is to measure the dependency or relevance between a binary feature and a class as a score of feature importance. Specifically, given a binary feature x_k in {0, 1} and a category c_i, i in {1, 2, ..., N}, the feature importance measures used in MI, IG, Chi-square, RS, and GSS are defined by

\begin{aligned}
\mathrm{MI}(x_k, c_i) &= \log \frac{p(x_k, c_i)}{p(x_k)\, p(c_i)}, \\
\mathrm{IG}(x_k, c_i) &= p(x_k, c_i) \log \frac{p(x_k, c_i)}{p(x_k)\, p(c_i)} + p(\bar{x}_k, c_i) \log \frac{p(\bar{x}_k, c_i)}{p(\bar{x}_k)\, p(c_i)}, \\
\mathrm{Chi}(x_k, c_i) &= \frac{\left[ p(x_k, c_i)\, p(\bar{x}_k, \bar{c}_i) - p(x_k, \bar{c}_i)\, p(\bar{x}_k, c_i) \right]^2}{p(x_k)\, p(\bar{x}_k)\, p(c_i)\, p(\bar{c}_i)}, \\
\mathrm{RS}(x_k, c_i) &= \log \frac{p(x_k \mid c_i)}{p(\bar{x}_k \mid \bar{c}_i)}, \\
\mathrm{GSS}(x_k, c_i) &= p(x_k, c_i)\, p(\bar{x}_k, \bar{c}_i) - p(x_k, \bar{c}_i)\, p(\bar{x}_k, c_i),
\end{aligned}    (4)

where \bar{x}_k denotes that the term does not occur in the document and \bar{c}_i denotes any class other than c_i, both of which correspond to the measurement of negative evidence [16]. All feature importance measures in Eq. (4), denoted by f(x_k, c_i) \in \{\mathrm{MI}(x_k, c_i), \mathrm{IG}(x_k, c_i), \mathrm{Chi}(x_k, c_i), \mathrm{RS}(x_k, c_i), \mathrm{GSS}(x_k, c_i)\}, are specific to one class. To assess the importance of a feature in a global manner, three combination functions are typically applied: the sum f_sum = \sum_{i=1}^{N} f(x_k, c_i), the maximum f_max = \max_{i=1}^{N} f(x_k, c_i), and the weighted average f_avg = \sum_{i=1}^{N} p(c_i) f(x_k, c_i).
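To make the class-specific versus global distinction concrete, here is a small illustrative sketch (ours, not from the paper) that computes the per-class IG score of Eq. (4) from binary feature values and then applies the three combination functions; the function names ig_score and combine are hypothetical:

```python
import numpy as np

def ig_score(x_bin, y, i, eps=1e-12):
    """Per-class information gain of one binary feature for class c_i (Eq. 4):
    sum over term present/absent of p(x, c_i) * log[p(x, c_i) / (p(x) p(c_i))]."""
    score = 0.0
    for present in (1, 0):
        p_x = np.mean(x_bin == present)
        p_c = np.mean(y == i)
        p_xc = np.mean((x_bin == present) & (y == i))
        score += p_xc * np.log((p_xc + eps) / (p_x * p_c + eps))
    return score

def combine(scores, priors):
    """Global scores from an (N classes x M terms) score matrix via the sum,
    the maximum, and the weighted-average combination functions."""
    return scores.sum(axis=0), scores.max(axis=0), (priors[:, None] * scores).sum(axis=0)
```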
Extensive experimental results reported in [2], [3] have shown that MI usually performs worst because of its sensitivity to low-frequency terms, while the other criteria usually perform comparably to one another under one of the three combination functions. Whereas these criteria are based only on binary-valued features, our recent work [17] has shown that real-valued features can offer better performance based on the Jeffreys-Multi-Hypothesis divergence. A key issue with these existing approaches is that, while the local functions in Eq. (4) can effectively assess the importance of a feature for a specific class, the combination functions may not be optimal globally. In this paper, instead of using a combination function to obtain a global feature importance measure, we directly use the class-specific measures from Eq. (4) and build a Bayesian classification rule following Baggenstoss's PPT [5], [6]. Other works that use class-specific features for classification, such as Fu and Wang's wrapper approach based on a genetic algorithm [18] and the one-vs-all scheme [19], lack theoretical support; moreover, the high computational cost of wrapper approaches makes them impractical.

3 A BAYESIAN CLASSIFIER USING CLASS-SPECIFIC FEATURES FOR TC

Consider an N-class classification problem and suppose that, for each class c_i, i = 1, 2, ..., N, we select a class-specific feature subset z_i = f_i(x), where f_i(x) can be a linear or non-linear function such that the dimension of z_i is much smaller than that of x. Notice that we cannot apply these class-specific features z_i, i = 1, 2, ..., N, to the Bayesian classification rule directly, e.g., in Eq. (3), because it is invalid to compare discriminative information computed on different feature subspaces. Here, we follow Baggenstoss's PDF Projection Theorem [5], [6] to build a classification rule using these class-specific features. The idea is to reconstruct the PDF p(x | c_i) in the original raw data space from the PDF p(z_i | c_i) in the class-specific feature subspace, provided we know both PDFs p(x | c_0) and p(z_i | c_0) under a reference hypothesis (class) c_0. The reconstructed PDF can be written as

p(x \mid c_i) = \frac{p(x \mid c_0)}{p(z_i \mid c_0)} \, p(z_i \mid c_i).    (5)

Eq. (5) is called the PDF Projection Theorem because it projects the PDF from a low-dimensional feature subspace into the high-dimensional raw data space. In [20], Kay proved that the constructed PDF asymptotically has the minimum KL-divergence to the true one within the class of PDFs defined by a given reference class c_0. For a detailed explanation of Eq. (5), we refer the interested reader to references [5], [6].
Note that, in theory, one can use a class-specific reference hypothesis c_{0,i} for each class in Eq. (5), but we choose to use a common reference c_0 in this paper. By incorporating the reconstructed PDF into the Bayesian classification rule in Eq. (3), we have

c \propto \arg\max_{i \in \{1, 2, \ldots, N\}} \log \frac{p(x \mid c_0)}{p(z_i \mid c_0)} + \log p(z_i \mid c_i) + \log p(c_i)
  \propto \arg\max_{i \in \{1, 2, \ldots, N\}} \log \frac{p(z_i \mid c_i)}{p(z_i \mid c_0)} + \log p(c_i).    (6)

The key challenge in using Eq. (6) for classification is to find a reference class c_0 under which both p(x | c_0) and p(z_i | c_0) can be estimated or derived. A good choice of reference class c_0 is the combination of all classes [21]. Note that, in TC, both the original feature vector x and the class-specific features z_i follow multinomial distributions, as shown in Eq. (1), and hence it is easy to obtain closed forms of p(x | c_0) and p(z_i | c_0), i = 1, 2, ..., N.

Our method first forms a new reference class c_0 that consists of all documents of a given training data set. Each term in class c_0 still follows a multinomial distribution. For each class, we then select the first K features z_i according to a specific feature selection criterion, such as MI, IG, etc., as shown in Eq. (4). Let I_i = [n_{i1}, n_{i2}, ..., n_{iK}] be the selected feature index vector, such that the k-th feature (term or phrase) in z_i corresponds to the n_{ik}-th feature in x. For each class-specific feature subset z_i, i = 1, 2, ..., N, we further estimate the multinomial distribution parameters under the reference class c_0, denoted by \hat{\theta}_{i|c_0} = [\hat{p}_{n_{i1}|0}, \hat{p}_{n_{i2}|0}, \ldots, \hat{p}_{n_{iK}|0}], and the multinomial distribution parameters under the class c_i, denoted by \hat{\theta}_i = [\hat{p}_{n_{i1}|i}, \hat{p}_{n_{i2}|i}, \ldots, \hat{p}_{n_{iK}|i}]. Note that the parameters \hat{\theta}_{i|c_0} and \hat{\theta}_i can be estimated using Eq. (2). Hence, the PDFs in Eq. (6), p(z_i | c_0) = p(z_i | c_0; \hat{\theta}_{i|c_0}) and p(z_i | c_i) = p(z_i | c_i; \hat{\theta}_i), are easily calculated according to Eq. (1). The classification decision rule in Eq. (6) can then be written as

c \propto \arg\max_{i \in \{1, 2, \ldots, N\}} \sum_{k=1}^{K} z_k \log \frac{\hat{p}_{n_{ik}|i}}{\hat{p}_{n_{ik}|0}} + \log p(c_i),    (7)

where a constant term is omitted. We summarize the procedure of our Bayesian approach in Algorithm 1.

Algorithm 1. The Proposed Bayesian Approach Using Class-Specific Features for Text Categorization

INPUT:
  Documents for a given training data set with N topics.
PROCEDURE:
  1. Form a reference class c_0 that consists of all documents;
  for each class i = 1 : N do
    2. Calculate the score of each feature based on a specific criterion, and rank the features by score in descending order;
    3. Choose the first K features z_i, whose index vector is denoted by I_i;
    4. Estimate the parameters \hat{\theta}_{i|0} under the reference class c_0 and the parameters \hat{\theta}_i under the class c_i;
  end
OUTPUT: Given a document to be classified, output the class label c using Eq. (7).
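For concreteness, here is a compact sketch of Algorithm 1 and Eq. (7) under our own naming (not the authors' code): for each class we keep the top-K scored terms, estimate their multinomial parameters under the class and under the pooled reference class c_0 using Eq. (2), and score test documents by the projected log-likelihood ratio plus the log prior. Using b_2 = K for the K-term subspace mirrors the b_2 = M choice above, but it is our assumption:

```python
import numpy as np

def train_class_specific(X, y, num_classes, scores, K, b1=1.0):
    """scores[i, m] is the class-i importance of term m (e.g., IG from Eq. 4).
    For each class, keep the index vector I_i of the top-K terms and the
    log-ratio log(p_hat_{n_ik|i} / p_hat_{n_ik|0}) needed by Eq. (7)."""
    log_prior = np.log(np.bincount(y, minlength=num_classes) / len(y))
    model = []
    for i in range(num_classes):
        idx = np.argsort(scores[i])[::-1][:K]        # class-specific top-K terms
        counts_i = X[y == i][:, idx].sum(axis=0)     # term counts under class c_i
        counts_0 = X[:, idx].sum(axis=0)             # term counts under reference c_0 (all documents)
        log_p_i = np.log((counts_i + b1) / (counts_i.sum() + K))   # Eq. (2) with b2 = K (assumption)
        log_p_0 = np.log((counts_0 + b1) / (counts_0.sum() + K))
        model.append((idx, log_p_i - log_p_0))
    return model, log_prior

def predict_class_specific(X, model, log_prior):
    """Eq. (7): score_i = sum_k z_k * log(p_hat_{n_ik|i} / p_hat_{n_ik|0}) + log p(c_i)."""
    scores = np.stack([X[:, idx] @ log_ratio for idx, log_ratio in model], axis=1)
    return (scores + log_prior).argmax(axis=1)
```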
4 EXPERIMENTAL RESULTS AND ANALYSIS

4.1 Data Sets

In our experiments, we test our approach for text categorization on two real-world benchmarks: 20-NEWSGROUPS and REUTERS. These two benchmarks are widely used in the literature for performance evaluation of text categorization methods. We use the versions of these two benchmarks provided by Cai et al. [22], [23].

The 20-NEWSGROUPS benchmark collects 20,000 documents that have been posted online under 20 different topics. Note that some topics in this benchmark are hierarchically categorized, and two different topics can be closely related to each other, e.g., rec.sport.baseball and rec.sport.hockey, or comp.sys.mac.hardware and comp.sys.ibm.hardware.

The original REUTERS benchmark consists of 21,578 documents with 135 different topics. In our experiments, we use the ModApte version of REUTERS-21578 and remove documents that belong to multiple topics. The resulting benchmark contains 8,293 documents with 65 different topics. However, the documents in this benchmark are highly unbalanced, and the topics are ranked by the number of documents. Following previous work in the literature [7], we use its subset, named REUTERS-10, which consists of the documents from the first 10 topics.

Before performing feature extraction and classification, we introduce a preprocessing stage in which we discard terms that occur in fewer than two documents and ignore terms that appear in a stoplist. For these two data sets, official training and testing splits are provided. We use the training data for feature selection and parameter estimation, and the testing data for performance evaluation.
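A rough sketch of this preprocessing stage (our own code; the stoplist and vocabulary variables are placeholders for whatever word list and dictionary are actually used):

```python
def preprocess(X, vocabulary, stoplist):
    """Drop terms that occur in fewer than two documents or appear in the stoplist.
    X is a documents-by-terms count matrix; vocabulary[m] is the m-th term string."""
    doc_freq = (X > 0).sum(axis=0)          # number of documents containing each term
    keep = [m for m in range(X.shape[1])
            if doc_freq[m] >= 2 and vocabulary[m] not in stoplist]
    return X[:, keep], [vocabulary[m] for m in keep]
```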
4.2 Results

In our experiments, we compare our approach using class-specific features with three conventional approaches using non-class-specific features, in which the global combination functions of the sum, the weighted average, and the maximum are applied, and with one additional class-specific feature selection method that uses the one-vs-all scheme. The one-vs-all classification scheme trains a binary classifier for each individual class, which allows either class-specific or non-class-specific features to be used to train the binary classifiers. The feature selection criteria IG, Chi-square, RS, and MD [17] are used to measure class-specific feature scores for both the conventional approaches and ours.

Fig. 1 shows the overall testing accuracy on the two data sets for different numbers of features, when IG, Chi-square, RS, and MD are respectively used as the feature selection criterion to compute the feature scores of each class. It can be seen that the classification performance increases as more features are selected. The comparison in Fig. 1 indicates significant performance improvements of our proposed approach for a small number of features when the IG, RS, and MD criteria are used. Note, however, that the improvement for the Chi-square criterion is not consistent, especially for the REUTERS-10 data set. One reason may be that the feature score measured by the Chi-square criterion does not reflect the feature importance for discrimination; indeed, the Chi-square criterion performs worst on the REUTERS-10 data set. We also notice that, when more features are selected, the classification performance of all considered methods improves, leaving a smaller improvement for our proposed approach.

Since the REUTERS data set is highly imbalanced, we further apply the F-Measure and G-Mean performance metrics, which are defined as follows:

\mathrm{F\text{-}Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}, \qquad \mathrm{G\text{-}Mean} = \sqrt{\mathrm{Precision} \times \mathrm{Recall}}.

Since both the F-Measure and the G-Mean combine recall and precision, they are comprehensive performance metrics and are commonly used in imbalanced learning [24], [25].
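For reference, a small sketch (ours) that computes per-class precision and recall from predictions and then the two metrics above; averaging them across topics (macro-averaging) is our assumption, since the text does not state how the per-class values are aggregated:

```python
import numpy as np

def f_measure_g_mean(y_true, y_pred, num_classes):
    """Per-class precision/recall from a confusion matrix, then macro-averaged
    F-Measure and G-Mean as defined above."""
    cm = np.zeros((num_classes, num_classes), dtype=float)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    tp = np.diag(cm)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # column sums: predicted counts
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # row sums: true counts
    f = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    g = np.sqrt(precision * recall)
    return f.mean(), g.mean()
```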

Fig. 1. The testing accuracy of our PPT-based classifier using class-specific features, compared with the conventional approaches, when the IG, Chi-square, RS, and MD criteria are used for feature selection on the data sets of (a) 20-NEWSGROUPS and (b) REUTERS-10.

In Fig. 2, we compare the performance in terms of F-Measure and G-Mean on the 20-NEWSGROUPS and REUTERS-10 data sets, when the feature selection criteria IG and MD are used. It shows that our PPT-based classifier using class-specific features offers better F-Measure and G-Mean on the 20-NEWSGROUPS data set when IG is used as the feature selection criterion. On the REUTERS-10 data set, our approach still has competitive performance in F-Measure and G-Mean.

Fig. 2. The testing F-Measure and G-Mean of our PPT-based classifier using class-specific features, compared with the conventional approaches, when the IG and MD criteria are used for feature selection on the data sets of 20-NEWSGROUPS and REUTERS-10.

5 CONCLUSION

In this paper, we presented a Bayesian classification approach for automatic text categorization using class-specific features. In contrast to conventional feature selection methods, it allows the most important features to be chosen for each class. To apply the class-specific features for classification, we derived a new naive Bayes rule following Baggenstoss's PDF Projection Theorem. One important advantage of our method is that many existing feature selection criteria can be easily incorporated. The experiments we conducted on several real-life data sets have shown promising performance improvements compared with state-of-the-art methods for text categorization.

ACKNOWLEDGMENTS

This research was partially supported by the US National Science Foundation (NSF) under grants ECCS 1053717 and CCF 1439011, and by the Army Research Office under grant W911NF-12-1-0378.

REFERENCES
[1] W. Lam, M. Ruiz, and P. Srinivasan, "Automatic text categorization and its application to text retrieval," IEEE Trans. Knowl. Data Eng., vol. 11, no. 6, pp. 865-879, Nov./Dec. 1999.
[2] F. Sebastiani, "Machine learning in automated text categorization," ACM Comput. Surveys, vol. 34, no. 1, pp. 1-47, 2002.
[3] G. Forman, "An extensive empirical study of feature selection metrics for text classification," J. Mach. Learn. Res., vol. 3, pp. 1289-1305, 2003.
[4] H. Liu and L. Yu, "Toward integrating feature selection algorithms for classification and clustering," IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491-502, Apr. 2005.
[5] P. M. Baggenstoss, "Class-specific feature sets in classification," IEEE Trans. Signal Process., vol. 47, no. 12, pp. 3428-3432, Dec. 1999.
[6] P. M. Baggenstoss, "The PDF projection theorem and the class-specific method," IEEE Trans. Signal Process., vol. 51, no. 3, pp. 672-685, Mar. 2003.
[7] A. McCallum and K. Nigam, "A comparison of event models for naive Bayes text classification," in Proc. Workshop Learn. for Text Categorization, 1998, vol. 752, pp. 41-48.
[8] V. Kecman, Learning and Soft Computing: Support Vector Machines, Neural Networks, and Fuzzy Logic Models. Cambridge, MA, USA: MIT Press, 2001.
[9] L. Wang and X. Fu, Data Mining with Computational Intelligence. New York, NY, USA: Springer, 2006.
[10] D. D. Lewis, "Naive (Bayes) at forty: The independence assumption in information retrieval," in Proc. 10th Eur. Conf. Mach. Learn., 1998, pp. 4-15.
[11] D. Koller and M. Sahami, "Hierarchically classifying documents using very few words," in Proc. 14th Int. Conf. Mach. Learn., 1997, pp. 170-178.
[12] Y. H. Li and A. K. Jain, "Classification of text documents," The Comput. J., vol. 41, no. 8, pp. 537-546, 1998.
[13] T. Joachims, "Text categorization with support vector machines: Learning with many relevant features," in Proc. 10th Eur. Conf. Mach. Learn., 1998, pp. 137-142.
[14] B. Tang and H. He, "ENN: Extended nearest neighbor method for pattern recognition [research frontier]," IEEE Comput. Intell. Mag., vol. 10, no. 3, pp. 52-60, Aug. 2015.
[15] S. Eyheramendy, D. D. Lewis, and D. Madigan, "On the naive Bayes model for text categorization," in Proc. 9th Int. Workshop Artif. Intell. Statist., 2003, pp. 332-339.
[16] L. Galavotti, F. Sebastiani, and M. Simi, "Experiments on the use of feature selection and negative evidence in automated text categorization," in Proc. 4th Eur. Conf. Res. Adv. Technol. Digit. Libraries, 2000, pp. 59-68.
[17] B. Tang, S. Kay, and H. He, "Toward optimal feature selection in naive Bayes for text categorization," Preprint, arXiv:1602.02850 [stat.ML], 2016.
[18] X. Fu and L. Wang, "A GA-based RBF classifier with class-dependent features," in Proc. Congress Evol. Comput., 2002, vol. 2, pp. 1890-1894.
[19] L. Wang, N. Zhou, and F. Chu, "A general wrapper approach to selection of class-dependent features," IEEE Trans. Neural Netw., vol. 19, no. 7, pp. 1267-1278, Jul. 2008.
[20] S. Kay, "Asymptotically optimal approximation of multidimensional pdfs by lower dimensional pdfs," IEEE Trans. Signal Process., vol. 55, no. 2, pp. 725-729, Feb. 2007.
[21] B. Tang, H. He, Q. Ding, and S. Kay, "A parametric classification rule based on the exponentially embedded family," IEEE Trans. Neural Netw. Learn. Syst., vol. 26, no. 2, pp. 367-377, Feb. 2015.
[22] D. Cai, X. He, and J. Han, "Document clustering using locality preserving indexing," IEEE Trans. Knowl. Data Eng., vol. 17, no. 12, pp. 1624-1637, Dec. 2005.
[23] D. Cai, Q. Mei, J. Han, and C. Zhai, "Modeling hidden topics on document manifold," in Proc. 17th ACM Conf. Inf. Knowl. Manage., 2008, pp. 911-920.
[24] H. He and E. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263-1284, Sep. 2009.
[25] B. Tang and H. He, "KernelADASYN: Kernel based adaptive synthetic data generation for imbalanced learning," in Proc. IEEE Congress Evol. Comput., 2015, pp. 664-671.
