Wei Zhang, Student Member, IEEE
MOE KLINNS Lab, Xi'an Jiaotong University, Xi'an, Shaanxi Province, China
[email protected]

Feng Gao, Senior Member, IEEE
MOE KLINNS Lab, Xi'an Jiaotong University, Xi'an, Shaanxi Province, China
[email protected]
Abstract—The naive Bayes classifier is widely used in machine learning for its simplicity and efficiency. However, most existing work on naive Bayes has focused on improving the Bayes model itself or on whether the "naive assumption" is satisfied. In this paper, the performance of naive Bayes in text classification is analyzed from different points of view, an improved method for text classification with highly asymmetric misclassification costs is then provided, and experiments confirm that the proposed method is effective.

Keywords—Text Classification, Naive Bayes, Machine Learning, Feature Selection
I. INTRODUCTION

Text classification has been broadly studied, and the naive Bayes classifier is widely used in machine learning for its simplicity and efficiency [1], [2].

Most of the existing work on naive Bayes has focused on constructing and improving the Bayes model itself, or on discussing whether the "naive assumption" is satisfied, how it influences classification performance, and how the classifier can be improved accordingly [3],[4],[5],[6],[7],[8].

The assumed distributions of $X_0, X_1, \ldots, X_d$ can be different. The two most common Bayes models are the multivariable Bernoulli model and the multinomial model.

A. The Multivariable Bernoulli Model

The multivariable Bernoulli model is the simplest and most classic Bayes model [3],[4],[5]. This model assumes that the random variables $X_1, \ldots, X_d$ are mutually independent given the class variable $X_0$, which is called the naive assumption. Although the assumption is hardly ever satisfied in practice, the Bayes classifier still performs well.

In this model, $X_i$ takes the value 1 when the feature denoted by $X_i$ is present in a document, and 0 otherwise. Writing $p_{ij} = P(X_j = 1 \mid X_0 = i)$, the probability of a new document belonging to class $i$ is

$$P(X_1 = x_1, \ldots, X_d = x_d \mid X_0 = i) = \prod_{j=1}^{d} p_{ij}^{x_j} (1 - p_{ij})^{1 - x_j},$$

which factors into the presence product

$$\prod_{j=1}^{d} p_{ij}^{x_j} \qquad (3)$$

and the absence product

$$\prod_{j=1}^{d} (1 - p_{ij})^{1 - x_j} \qquad (4)$$
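To make these quantities concrete, the following sketch (ours, not code from the paper) estimates $p_{ij}$ with Laplace smoothing and scores a binary document under the full likelihood as well as the presence-only product (3) and the absence-only product (4); all function and variable names are illustrative.

```python
import numpy as np

def fit_bernoulli_nb(X, y, alpha=1.0):
    """Estimate p[i, j] = P(X_j = 1 | class i) with Laplace smoothing.
    X: (n_docs, d) binary matrix; y: (n_docs,) labels in {0, 1}."""
    p = np.zeros((2, X.shape[1]))
    for i in (0, 1):
        Xi = X[y == i]
        p[i] = (Xi.sum(axis=0) + alpha) / (len(Xi) + 2 * alpha)
    return p

def log_score(x, p_i, mode="full"):
    """Log-score of a binary document x under class-conditional p_i.
    mode="full"    -> all features, present or absent (formula (1))
    mode="present" -> presence terms only (formula (3))
    mode="absent"  -> absence terms only (formula (4))"""
    present = x * np.log(p_i)
    absent = (1 - x) * np.log(1 - p_i)
    if mode == "present":
        return present.sum()
    if mode == "absent":
        return absent.sum()
    return present.sum() + absent.sum()
```

A document is assigned to the class whose log prior plus log_score is larger; only the "full" mode uses every feature, which is why formulas (3) and (4) can disagree with it.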
Formula (1) indicates that all features are considered during classification, whether or not they are present in the document. Formula (3) indicates that only the present features, the ones closest to the document itself, are considered. Formula (4) indicates that only the absent features, the complement of the document, are considered.

Text classification with highly asymmetric misclassification costs is common. Spam filtering is such a case: the cost of classifying a ham as spam is higher than the cost of classifying a spam as ham.

In general, features are selected under some objective function during feature selection. In this paper a new approach is presented: the features tending toward the class with higher misclassification costs are preferentially selected during feature selection. For the spam filtering problem, we choose the features tending to belong to the ham class, which implies

$$p(X = 1 \mid X_0 = \text{ham}) > p(X = 1 \mid X_0 = \text{spam})$$
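A minimal sketch of this tendency-based selection, assuming binary data and the fit_bernoulli_nb estimator above; ranking by document frequency matches the selection method used in the experiments below, but the function itself is our illustration, not the authors' implementation.

```python
def select_tendency_features(X, y, ham=0, k=100):
    """Keep the k highest document-frequency features among those
    leaning toward ham, i.e. p(X=1 | ham) > p(X=1 | spam)."""
    p = fit_bernoulli_nb(X, y)
    leans_ham = p[ham] > p[1 - ham]      # the tendency condition
    df = X.sum(axis=0)                   # document frequency per feature
    candidates = np.flatnonzero(leans_ham)
    ranked = candidates[np.argsort(df[candidates])[::-1]]
    return ranked[:k]
```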
The advantages of the proposed approach are as follows:

1) It leads to a higher probability of finding hams when using formula (3), and so reduces the error of classifying a ham as spam.

2) As shown in Section III, the classifier gives the opposite conclusion when using formula (4), which inclines it to classify hams as spam. So if the chosen features prefer the class with higher misclassification costs, formula (4) can be used to modify formula (3).

3) Due to the sparsity of text, formula (3) has far fewer product terms than formula (4), which greatly reduces the computational cost (see the sketch after this list).
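Advantage 3 can be seen directly in code: with a sparse document represented by the indices of its present features, formula (3) touches only those indices. This is our illustration, not the paper's implementation.

```python
from math import log

def score_present_sparse(present_idx, p_i):
    """Formula (3) on a sparse document: iterate only over the features
    actually present, so the cost is O(#present) rather than O(d)."""
    return sum(log(p_i[j]) for j in present_idx)
```

For a typical mail only a small fraction of the $d$ features are present, while formula (4) would have to visit all the absent ones.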
V. ILLUSTRATIVE EXAMPLES AND ANALYSIS
We choose the mail set of the Natural Language Processing Laboratory at Northeastern University as the trial data. It contains 1633 mails, of which 428 are hams and 1205 are spams, and all the mails are divided uniformly into 10 parts [14].

Statistically, the hams contain 9882 features and the spams 38902 features, with 6243 features common to both classes. We choose 100 features to represent a mail using the simplest document frequency method: 20 from the features particular to each of the two classes, and 60 from the common part, taking those whose probability of appearing in ham is higher than in spam.

For comparison, we additionally choose 100 features from the whole feature set using the same method.

After choosing features, we train the naive Bayes classifier with the multivariable Bernoulli model on this data set, and run 10-fold cross-validation experiments ten times in each case.
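A sketch of this evaluation protocol using scikit-learn's BernoulliNB is given below. It is our reconstruction under the stated setup (binary features, a chosen feature subset, per-class recall), not the authors' code, and a single shuffled 10-fold run stands in for the paper's ten repetitions.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import recall_score

def per_class_recall(X, y, feature_idx, folds=10, seed=0):
    """10-fold cross-validated recall per class on a feature subset."""
    skf = StratifiedKFold(n_splits=folds, shuffle=True, random_state=seed)
    recalls = []
    for train, test in skf.split(X, y):
        clf = BernoulliNB().fit(X[train][:, feature_idx], y[train])
        pred = clf.predict(X[test][:, feature_idx])
        recalls.append(recall_score(y[test], pred, average=None))
    return np.mean(recalls, axis=0)   # [recall of class 0, recall of class 1]
```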
In the experiments without tendency features, using formula (3) an average of 6.1 hams are classified as spam and an average of 2.7 spams are classified as ham. Using formula (4), an average of 13.2 hams are classified as spam and an average of 214.1 spams are classified as ham. Using formula (1), an average of 7.4 hams are classified as spam and an average of 11.5 spams are classified as ham.

In the experiments with tendency features, formula (3) gives an average of 5.6 spams classified as ham and no mistakes on hams. Formula (4) gives an average of 174.5 hams classified as spam and no mistakes on spams. Formula (1) gives no mistakes on hams and an average of 1.5 spams classified as ham.

The recall of each class in all the cases is shown in the following three tables:

Table 1. Recall of each class using formula (3)
              Without tendency   With tendency
Ham recall         98.5%             100%
Spam recall        99.7%             98.7%

Table 2. Recall of each class using formula (4)
              Without tendency   With tendency
Ham recall         96.9%             59.2%
Spam recall        82.2%             100%

Table 3. Recall of each class using formula (1)
              Without tendency   With tendency
Ham recall         98.3%             100%
Spam recall        99.0%             99.9%

The experimental results show that choosing features with tendency can indeed improve the recall of a particular class. Formula (4) has the largest error during classification and cannot be used alone; it should only complement and modify formula (3).

Note that in these experiments the feature selection without tendency in fact yields features tending toward the spam class. This is because the features in this data set are not uniformly distributed: the spam class has more features than the ham class.
VI. CONCLUSION

In this paper an idea for improving the recall of the class with higher misclassification costs in text classification is proposed. The difference between our method and the existing work is that existing research has mostly focused on adjusting the classification threshold [15], while our method chooses tendency features during feature selection. Experimental results show that our method can improve the recall of the class with higher misclassification cost when the features tending toward that class are chosen.

The contribution of this paper is based on qualitative analysis. Further work should focus on how to quantitatively improve the precision of the class with higher misclassification costs, and on how to choose the tendency features exactly. In addition, how to fuse the results of formula (4) into formula (3) is also a meaningful problem.

ACKNOWLEDGMENT

The research is supported in part by the National Natural Science Foundation (60633020, 60802056, 60921003, 60905018), the National Science Fund for Distinguished Young Scholars (60825202), the Key Projects in the National Science & Technology Pillar Program (2011BAK08B02), and the 863 High Tech Development Plan (2007AA01Z480, 2008AA01Z415).
REFERENCES