
Performance Analysis and Improvement of Naïve Bayes in Text Classification Application

Wei Zhang, Student Member, IEEE
MOE KLINNS Lab, Xi'an Jiaotong University, Xi'an, Shaanxi Province, China
[email protected]

Feng Gao, Senior Member, IEEE
MOE KLINNS Lab, Xi'an Jiaotong University, Xi'an, Shaanxi Province, China
[email protected]

Abstract—The naive Bayes classifier is widely used in machine learning for its simplicity and efficiency. However, most existing work on naïve Bayes focuses on improving the Bayes model itself or on whether the "naïve assumption" is satisfied. In this paper, the performance of naïve Bayes in text classification is analyzed and the corresponding results are presented from different points of view; an improved approach for text classification with highly asymmetric misclassification costs is then provided. Finally, related experiments show that the proposed method is effective.

Keywords-Text Classification, Naïve Bayes, Machine Learning, Feature Selection

I. INTRODUCTION

Text classification has been studied broadly, and the naive Bayes classifier is widely used in machine learning for its simplicity and efficiency [1], [2].

Most of the existing work on naïve Bayes focuses on constructing and improving the Bayes model itself, or on discussing whether the "naïve assumption" is satisfied, its influence on classification performance, and the corresponding improvements [3], [4], [5], [6], [7], [8].

In this paper two main Bayes models are introduced. Combining a practical characteristic of text classification, namely feature sparsity, the naïve Bayes classifier is analyzed and discussed in three cases according to the presence or absence of each feature in the document. A new solution for classification tasks with highly asymmetric misclassification costs is then proposed and validated by experiments on related data sets.

The rest of this paper is organized as follows. In Section 2, two primary naïve Bayes models are introduced. Section 3 discusses the performance of the naïve Bayes model in text classification in detail. A new solution for classification when the misclassification costs differ is proposed in Section 4. Illustrative examples on public data sets are given in Section 5. Conclusions and future work are presented in Section 6.

II. THE NAÏVE BAYES MODEL

In text classification, a text is denoted by a group of random variables {X_0, X_1, ..., X_d} with X_0 ∈ {1, 2, ..., C}, where C is the number of text classes and X_1, ..., X_d can denote letters, words, or other text attributes. Different Bayes models assign the values of X_1, ..., X_d in different ways. The two most common Bayes models are the multivariable Bernoulli model and the multinomial model.
A. The Multivariable Bernoulli Model

The multivariable Bernoulli model is the simplest and most classic Bayes model [3], [4], [5]. This model adopts the naïve assumption: the random variables X_1, ..., X_d are assumed to be mutually independent given the class variable X_0. Although the assumption is hardly ever satisfied in practice, the Bayes classifier still performs well.

In this model, X_j takes the value 1 when the feature denoted by X_j is present in a document and 0 otherwise. The probability of a new document belonging to class i is

P(X_1 = x_1, \ldots, X_d = x_d \mid X_0 = i) = \prod_{j=1}^{d} P(X_j = x_j \mid X_0 = i) = \prod_{j=1}^{d} p_{ij}^{x_j} (1 - p_{ij})^{1 - x_j}    (1)

where p_ij is an abbreviation for the conditional probability P(X_j = 1 | X_0 = i). The Bayes classifier with this model chooses the class with the maximal posterior probability for the document, i.e., the value of X_0 is chosen as the i corresponding to the maximum of formula (1).
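As a concrete illustration of this decision rule, the following Python sketch scores a binary feature vector with formula (1) and picks the arg-max class. It is not code from the paper: the names (bernoulli_nb_score, p, log_prior) are ours, log-probabilities are used only to avoid numerical underflow, and smoothing of the estimates p_ij is assumed to have been done elsewhere so that every p_ij lies strictly inside (0, 1).

    import numpy as np

    def bernoulli_nb_score(x, p, log_prior):
        # x: (d,) array of 0/1 feature indicators for one document
        # p: (C, d) array, p[i, j] = P(X_j = 1 | X_0 = i), assumed in (0, 1)
        # log_prior: (C,) array of log P(X_0 = i)
        # log of prod_j p_ij^{x_j} * (1 - p_ij)^{1 - x_j}, evaluated per class
        log_lik = x * np.log(p) + (1 - x) * np.log(1 - p)   # shape (C, d)
        return log_prior + log_lik.sum(axis=1)              # shape (C,)

    def classify(x, p, log_prior):
        # choose the class with the maximal posterior, as in formula (1)
        return int(np.argmax(bernoulli_nb_score(x, p, log_prior)))

With p estimated by relative frequencies (plus, for example, Laplace smoothing), classify(x, p, log_prior) returns the index i that maximizes formula (1) weighted by the class prior.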
B. The Multinomial Model

The multinomial model assumes, in addition to the naïve assumption, that the length of the text and the class variable are independent [3], [6], [7]. The values of X_1, ..., X_d are the numbers of occurrences of their corresponding features in the document. The probability of a document belonging to class i is:
P(X_1 = x_1, \ldots, X_d = x_d \mid X_0 = i, \textstyle\sum_j x_j) = \Big( \sum_j x_j \Big)! \, \prod_{j=1}^{d} \frac{p_{ij}^{x_j}}{x_j!}    (2)

The naïve Bayes classifier with this model also chooses the class i with the maximal posterior probability for a document.
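A sketch of the corresponding scoring step for formula (2), again in Python and with our own names, is given below. Since the multinomial coefficient (sum_j x_j)! / prod_j x_j! does not depend on the class i, it can be dropped when only the arg-max over classes is needed.

    import numpy as np

    def multinomial_nb_score(counts, p, log_prior):
        # counts: (d,) array of term counts x_j in the document
        # p: (C, d) array, p[i, j] = P(term j | X_0 = i), each row sums to 1
        # log_prior: (C,) array of log P(X_0 = i)
        # class-dependent part of log formula (2); the multinomial
        # coefficient is omitted because it is identical for every class
        return log_prior + (counts * np.log(p)).sum(axis=1)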
The multinomial model contains more information than the multivariable Bernoulli model because it keeps the full word-frequency information, so intuitively the multinomial model should outperform the multivariable Bernoulli model [9], [10], [11]. However, many experiments in the related research have shown that this is not the case [12], [13], so we choose the multivariable Bernoulli model in this paper.
III. PERFORMANCE ANALYSIS OF THE NAÏVE BAYES CLASSIFIER IN TEXT CLASSIFICATION

A feature corresponding to the random variable X_j contributes to the classifier in two ways:

1) When the value of the random variable X_j is 1, i.e. the corresponding feature is present in the document, its contribution to the classifier's score for X_0 = i is p_ij;

2) When the value of the random variable X_j is 0, i.e. the corresponding feature is absent from the document, its contribution to the classifier's score for X_0 = i is 1 - p_ij.

In a document to be classified, assume there are a random variables with value 1 and b random variables with value 0, in other words a + b = d. Without loss of generality, let the first a random variables have value 1. Then formula (1) can be divided into two parts, denoted as follows:

\prod_{j=1}^{a} p_{ij}    (3)

\prod_{j=a+1}^{d} (1 - p_{ij})    (4)

Formula (1) indicates that all features are considered during classification, regardless of whether they are present in the document. Formula (3) indicates that only the present features, which are closest to the document itself, are considered. Formula (4) indicates that only the absent features, which are complementary to the document, are considered. Obviously, for the same document, the product of formula (3) and formula (4) equals formula (1).

Intuitively, formula (3) considers all the present features, i.e. the text itself. This classification method can be thought of as accumulating evidence: the probability that a document in which the a features are present belongs to one of the classes is accumulated from the probabilities of belonging to that class as the a features appear one by one in the document.

On the contrary, formula (4) only considers the features which do not appear in the document to be classified. The document composed of these features is complementary to the document to be classified. This classification method can be understood as excluding evidence step by step, i.e. excluding the effect of the features which do not appear in the document one by one.

Formula (1) considers all the features of the document to be classified, both the present and the absent ones. In other words, it considers not only the probability of the document belonging to one of the classes, but also the probability of the complementary document belonging to the same class.

Thus, among the above three classification methods, formula (3) is the most natural and direct one. Formula (4) is contrary to formula (3), since if a feature satisfies p_1j > p_2j, then inevitably 1 - p_1j < 1 - p_2j. Formula (1) takes both the effect of formula (3) and that of formula (4) into account.

A significant property of text classification is sparsity, which means that most of the random variables denoting the document take the value zero. Consequently formula (4) has many more terms than formula (3), i.e. b >> a. Because of this, if the features corresponding to the random variables are not uniformly distributed, the result of formula (3) may be overwhelmed by that of formula (4).
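The decomposition into formulas (3) and (4) can be made explicit with a small sketch (ours, not the paper's): for one class i, the presence part runs over the a features with x_j = 1 and the absence part over the remaining b = d - a features, and in log space the two parts add up to formula (1).

    import numpy as np

    def decompose_score(x, p_i):
        # x: (d,) array of 0/1 indicators; p_i[j] = P(X_j = 1 | X_0 = i)
        present = (x == 1)
        log_presence = np.log(p_i[present]).sum()          # formula (3), a factors
        log_absence = np.log(1.0 - p_i[~present]).sum()    # formula (4), b = d - a factors
        # formula (1) is recovered as the product of (3) and (4),
        # i.e. the sum of their logarithms
        return log_presence, log_absence, log_presence + log_absence

Because text is sparse (b >> a), the absence part aggregates far more factors than the presence part, which is exactly why it can dominate the combined score.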
IV. AN IMPROVEMENT TO TEXT CLASSIFICATION WITH HIGHLY ASYMMETRIC MISCLASSIFICATION COSTS

Text classification with highly asymmetric misclassification costs is common. Spam filtering is such a case: the cost of classifying a ham as spam is higher than the cost of classifying a spam as ham.

In general, features are selected under some optimization function during feature selection. In this paper, a new approach is presented in which the features tending toward the class with the higher misclassification cost are preferentially selected during feature selection. For the spam filtering problem, we choose the features tending toward the ham class, which implies

P(X = 1 \mid X_0 = \text{ham}) > P(X = 1 \mid X_0 = \text{spam})
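A minimal sketch of this selection criterion is given below. It is a simplification of the procedure actually used in Section 5 (which additionally fixes how many features come from the class-specific and common vocabularies); the function name, inputs, and ranking by document frequency are our assumptions.

    import numpy as np

    def select_tendency_features(df, p_ham, p_spam, k):
        # df: (d,) document frequencies of the candidate features
        # p_ham, p_spam: (d,) arrays with P(X_j = 1 | ham) and P(X_j = 1 | spam)
        # keep only features tending to the costly class: P(X=1|ham) > P(X=1|spam)
        candidates = np.flatnonzero(p_ham > p_spam)
        # among those, keep the k features with the highest document frequency
        order = candidates[np.argsort(df[candidates])[::-1]]
        return order[:k]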
The advantages of the proposed approach are as follows:

1) It leads to a higher probability of finding hams when using formula (3), and reduces the error of classifying hams as spam.

2) As shown in Section 3, the classifier gives the opposite conclusion when using formula (4), which inclines it to classify hams as spam. So if the chosen features prefer the class with the higher misclassification cost, formula (4) can modify formula (3).

3) Due to the sparsity of text classification, formula (3) contains fewer product terms than formula (4), which greatly reduces the computation cost.
V. ILLUSTRATIVE EXAMPLES AND ANALYSIS

We choose the mail set of the Natural Language Processing Laboratory at Northeastern University as the trial data; it includes 1633 mails, with 428 hams and 1205 spams, and all the mails are divided uniformly into 10 parts [14].

Statistically, the hams contain 9882 features and the spams 38902 features, and the two classes have 6243 features in common. We choose 100 features to represent a mail using the simplest document-frequency method: 20 from the features particular to each of the two classes, and 60 from the common part with a higher probability of appearing in ham than in spam.

For comparison, we additionally choose 100 features from the whole feature set using the same method.

After choosing the features, we train the naive Bayes classifier with the multivariate Bernoulli model on the data set, and we run 10-fold cross-validation experiments ten times in each case.
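The evaluation protocol can be pictured with the following sketch; the fold generation and the fit/predict wrappers around the Bernoulli classifier are placeholders of ours, not code from the paper.

    import numpy as np

    def class_recall(y_true, y_pred, label):
        # recall of one class: fraction of its documents that are recovered
        mask = (y_true == label)
        return float(np.mean(y_pred[mask] == label))

    def cv_recalls(X, y, folds, fit, predict, labels=("ham", "spam")):
        # folds: iterable of (train_idx, test_idx) pairs, e.g. 10-fold splits
        recalls = {lab: [] for lab in labels}
        for train_idx, test_idx in folds:
            model = fit(X[train_idx], y[train_idx])
            y_pred = predict(model, X[test_idx])
            for lab in labels:
                recalls[lab].append(class_recall(y[test_idx], y_pred, lab))
        return {lab: float(np.mean(vals)) for lab, vals in recalls.items()}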
In the experiments without tendency features, using formula (3) an average of 6.1 hams are classified as spam and an average of 2.7 spams are classified as ham. Using formula (4), an average of 13.2 hams are classified as spam and an average of 214.1 spams are classified as ham. Using formula (1), an average of 7.4 hams are classified as spam and an average of 11.5 spams are classified as ham.

In the experiments with tendency features, formula (3) gives an average of 5.6 spams classified as ham and no mistakes for hams. Formula (4) gives an average of 174.5 hams classified as spam and no mistakes for spams. Formula (1) gives no mistakes for hams and an average of 1.5 spams classified as ham.

The recall of every class in all cases is shown in the following three tables:

Table 1. Recall of every class for formula (3)
                Without tendency    With tendency
  Ham recall         98.5%              100%
  Spam recall        99.7%              98.7%

Table 2. Recall of every class for formula (4)
                Without tendency    With tendency
  Ham recall         96.9%              59.2%
  Spam recall        82.2%              100%

Table 3. Recall of every class for formula (1)
                Without tendency    With tendency
  Ham recall         98.3%              100%
  Spam recall        99%                99.9%
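As a sanity check on how the tables relate to the averaged error counts (our reading; the paper does not spell out the conversion), take formula (1) without tendency features: an average of 7.4 misclassified hams out of 428 gives a ham recall of (428 - 7.4) / 428 ≈ 98.3%, and 11.5 misclassified spams out of 1205 gives a spam recall of (1205 - 11.5) / 1205 ≈ 99.0%, which agrees with Table 3.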
From the experimental results, choosing features with tendency can indeed improve the recall of a particular class. During classification, formula (4) has the largest error and cannot be used alone; it should only complement and modify formula (3).

Note that in these experiments the feature selection without tendency in fact provides features tending toward the spam class. This is because the features in this data set are not uniformly distributed and the spam class has more particular features than the ham class.

VI. CONCLUSION

In this paper an idea is proposed to improve the recall of the class with the higher misclassification cost in text classification. The difference between our method and existing work is that existing research has mostly focused on adjusting the classification threshold [15], while our method chooses tendency features during feature selection. Experimental results show that our method can improve the recall of the class with the higher misclassification cost if the features tending toward that class are chosen.

The contribution of this paper is based on qualitative analysis. Further work should focus on how to quantitatively improve the precision of the class with the higher misclassification cost and how to choose the tendency features exactly. In addition, how to fuse the results of formula (4) into formula (3) is also a meaningful problem.

ACKNOWLEDGMENT

The research is supported in part by the National Natural Science Foundation (60633020, 60802056, 60921003, 60905018), the National Science Fund for Distinguished Young Scholars (60825202), Key Projects in the National Science & Technology Pillar Program (2011BAK08B02), and the 863 High Tech Development Plan (2007AA01Z480, 2008AA01Z415).
REFERENCES

[1] D.D. Lewis, Representation and Learning in Information Retrieval, PhD dissertation, Dept. of Computer Science, Univ. of Massachusetts, Amherst, 1992.
[2] Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. Machine Learning: ECML-98, Tenth European Conference on Machine Learning, 1998, pp. 4-15.
[3] Spiegelhalter, D.J. and Knill-Jones, R.P.: Statistical and knowledge based approaches to clinical decision support systems, with an application in gastroenterology (with discussion). Journal of the Royal Statistical Society (Series A), 147, 1984, pp. 35-77.
[4] Lewis, D.D.: Naive (Bayes) at forty: The independence assumption in information retrieval. Machine Learning: ECML-98, Tenth European Conference on Machine Learning, 1998, pp. 4-15.
[5] Hand, D.J. and Yu, K.: Idiot's Bayes - not so stupid after all? International Statistical Review, 2002.
[6] McCallum, A., Nigam, K.: A comparison of event models for Naive Bayes text classification. In: Learning for Text Categorization: Papers from the AAAI Workshop, AAAI Press, 1998, pp. 41-48. Technical Report WS-98-05.
[7] Eyheramendy, S., Lewis, D.D., Madigan, D.: On the Naive Bayes model for text categorization. In Bishop, C.M., Frey, B.J., eds.: AI & Statistics 2003: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, 2003, pp. 332-339.
[8] Rennie, J., Shih, L., Teevan, J., and Karger, D.: Tackling the poor assumptions of Naive Bayes text classifiers. Proc. of ICML-2003.
[9] S.E. Robertson, C.J. van Rijsbergen, and M.F. Porter: Probabilistic models of indexing and searching. In R.N. Oddy, S.E. Robertson, C.J. van Rijsbergen, and P.W. Williams, editors, Information Research and Retrieval, chapter 4, pp. 35-56. Butterworths, 1981.
[10] Robert M. Losee: Parameter estimation for probabilistic document-retrieval models. Journal of the American Society for Information Science, 39(1):8-16, 1988.
[11] K.M. Schneider: On word frequency information and negative evidence in Naive Bayes text classification. In 4th International Conference on Advances in Natural Language Processing, pp. 474-485, Alicante, Spain, 2004.
[12] Katz, S.M.: Distribution of content words and phrases in text and language modelling. Natural Language Engineering, 2, 1996, pp. 15-59.
[13] Rennie, J.D.M., Shih, L., Teevan, J., Karger, D.: Tackling the poor assumptions of Naive Bayes text classifiers. In Fawcett, T., Mishra, N., eds.: Proceedings of the Twentieth International Conference on Machine Learning (ICML-2003), Washington, D.C., AAAI Press, 2003, pp. 616-623.
[14] http://www.nlplab.cn.
[15] A. Kolcz: Local sparsity control for naive Bayes with extreme misclassification costs. In Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2005, pp. 128-137.
