Utilizing Improved Bayesian Algorithm To Identify Blog Comment Spam
Abstract: Blog websites face a growing demand for dealing with comment spam. This paper analyzes the defects of the traditional, statistics-based Bayesian algorithm, points out its deficiencies in practical application, and improves the basic Bayesian algorithm by computing the spam probability value from the strings that appear in comments and by combining the values with a geometric mean algorithm. The experimental results show that the modified Bayesian classification algorithm effectively improves the classification of spam: the afr for spam comments, the afr for legitimate comments, and the average afr all drop substantially.

Keywords: blog comment spam; Bayesian algorithm; geometric mean algorithm

I. INTRODUCTION

With the rapid development of the Internet, the blog is becoming one of the fastest and most economical means of communication. But as the blog has become a tool for exchanging information, it has also become a carrier of large amounts of commercial advertising and useless information, which requires users to spend a lot of time and energy dealing with these so-called "junk" comments. How to filter blog comments is a major concern for users, so research on anti-spam methods is an important subject in the processing of blog comments [1-2]. At present, some blog systems have adopted certain technical measures to deal with spam, but these technologies have shortcomings or are technically immature. Studying an effective spam filtering system is therefore of vital significance [2-3].

Bayesian classification is a statistical classification method: it classifies using knowledge of probability and statistics. In many cases, the simple Bayesian (Naïve Bayes, NB) classification algorithm is comparable to decision tree and neural network classification algorithms; it can be applied to large databases, and the method is simple, accurate, and fast. However, the algorithm assumes that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption often does not hold in practice, so the classification accuracy may decline. Many Bayesian classification algorithms that relax the independence assumption have therefore been proposed, such as the TAN (tree augmented Bayes network) algorithm [4-5].

In theory, the Bayesian classifier has the lowest error rate of all classification algorithms. In practice this is not always the case, because the assumptions behind its application (such as class-conditional independence) are inaccurate and because the available probability data are insufficient. Nevertheless, various experiments show that in some domains it is comparable to decision tree and neural network classification algorithms. In addition, Bayesian classification can provide a theoretical justification for other classification algorithms that do not use Bayes' theorem directly [6].

Paul Graham proposed a new method of filtering spam email based on a statistical Bayesian algorithm in 2002, which greatly improved the accuracy of spam identification. With the rise of blog services, spam comments have become a problem for blogs, and the demand for identifying and blocking them keeps growing. Because the content of spam comments is similar to that of spam email, the Bayesian algorithm can also be used to identify spam blog comments [7-8].

The test data in this paper come from the blog comment sample library of the Institute for Information and Language Processing System of the University of Amsterdam (http://ilps.science.uva.nl/resources/commentspam); the algorithms are implemented in Microsoft's C# programming language.

II. PROBLEM DESCRIPTION

Each data sample is described by an n-dimensional feature vector $X = \{x_1, x_2, \ldots, x_n\}$ that holds the values of its n attributes. Assume there are m classes, denoted $C_1, C_2, \ldots, C_m$. Given an unknown data sample $X$ (i.e. one with no class label), the simple Bayesian classifier assigns $X$ to class $C_i$ if and only if

$$ P(C_i \mid X) > P(C_j \mid X), \quad 1 \le j \le m,\ j \ne i. $$

Because $P(X)$ is constant for all classes, maximizing the posterior probability $P(C_i \mid X)$ is equivalent to maximizing the product $P(X \mid C_i)\,P(C_i)$. If the training data set has many attributes and tuples, computing $P(X \mid C_i)$ may be very expensive; therefore, the attribute values are usually assumed to be independent of one another.
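As an illustration of the decision rule in Section II, the sketch below picks the class that maximizes $P(X \mid C_i)\,P(C_i)$. It relies on the standard naïve Bayes factorization $P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)$, which follows from the independence assumption. The class name `NaiveBayesSketch`, the method `Classify`, and the two input arrays are hypothetical and assume the probabilities have already been estimated; the sum of logarithms is an implementation choice made here to avoid underflow, not something the paper prescribes.

```csharp
using System;

// Hypothetical helper illustrating the maximum a posteriori rule; not the paper's code.
public static class NaiveBayesSketch
{
    // priors[c]         : P(C_c), assumed to be estimated beforehand
    // likelihoods[c][k] : P(x_k | C_c) for the attribute values of the sample X
    // Returns the index i that maximizes P(X | C_i) * P(C_i), or -1 if there are no classes.
    public static int Classify(double[] priors, double[][] likelihoods)
    {
        int best = -1;
        double bestScore = double.NegativeInfinity;
        for (int c = 0; c < priors.Length; c++)
        {
            // Summing logarithms is equivalent to multiplying the probabilities
            // but does not underflow when there are many attributes.
            double score = Math.Log(priors[c]);
            foreach (double p in likelihoods[c])
            {
                score += Math.Log(p);
            }
            if (score > bestScore)
            {
                bestScore = score;
                best = c;
            }
        }
        return best;
    }
}
```

For two classes with equal priors, a sample whose attribute likelihoods are 0.9 and 0.8 under the first class and 0.2 and 0.3 under the second is assigned to the first class (index 0).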
In addition, the algorithm outputs no classification rules.

The basic steps of using the Bayesian algorithm to identify blog comment spam are as follows.

First, create two sample libraries, composed of spam samples and non-spam samples respectively. Suppose the numbers of comments in the two libraries are $B_{comment}$ and $G_{comment}$, and the numbers of occurrences of a string $w$ in the non-spam and spam libraries are $good(w)$ and $bad(w)$.

If a comment includes the string $w$, the spam probability $p_{spam|w}$ of the comment given that string is computed as follows:

$$ r_g = \min\left(1,\ 2\,\frac{good(w)}{G_{comment}}\right) \tag{1} $$

$$ r_b = \min\left(1,\ \frac{bad(w)}{B_{comment}}\right) \tag{2} $$

$$ p_{spam|w} = \max\left(0.01,\ \min\left(0.99,\ \frac{r_b}{r_g + r_b}\right)\right) \tag{3} $$

To determine whether a comment is spam, compute $p_{spam|w}$ for every string $w$ in it, then take the absolute value of the difference between each $p_{spam|w}$ and 0.5. Sort these values from large to small and take the first $N$, so that the strings whose probabilities lie farthest from the neutral value 0.5 are kept. Let the selected strings be $w_1, w_2, \ldots, w_N$; the spam probability of the comment is then computed as:

$$ P_{spam} = \frac{\prod_{i=1}^{N} P_{spam|w_i}}{\prod_{i=1}^{N} P_{spam|w_i} + \prod_{i=1}^{N} \left(1 - P_{spam|w_i}\right)} \tag{4} $$

If a string does not appear in the sample libraries, its $p_{spam|w}$ value can be set to 0.4. If the value of $P_{spam}$ is greater than 0.99, the comment is classified as spam; otherwise it is classified as non-spam.

Analysis reveals two flaws in the above algorithm:

(1) When computing $r_g$ and $r_b$, the algorithm uses the number of comments in the two sample libraries. If the number of strings per comment varies greatly, $r_g$ and $r_b$ cannot reflect the actual composition of the sample libraries, which results in a low recognition rate.

(2) The spam probability produced by the algorithm is normally close to 0 or 1, and intermediate values do not appear, which makes it difficult to judge how strongly a comment is suspected of being spam.

$$ P = \prod_{i=1}^{N} P_{spam|w_i} \tag{8} $$

$$ Q = \prod_{i=1}^{N} \left(1 - P_{spam|w_i}\right) \tag{9} $$

$$ P_{spam} = \frac{P}{P + Q} \tag{10} $$

With the above formulas, the calculation of $r_g$ and $r_b$ is not affected by comment length.

The corresponding C# code of the above algorithm can be implemented as follows:

This algorithm significantly improves the comment spam recognition rate. A comparison of the correct recognition rates of the two algorithms is shown in Table I.

TABLE I. COMPARISON OF CORRECT RECOGNITION RATE OF TWO ALGORITHMS

Comment number    Spam recognition rate of    Spam recognition rate of
                  Bayesian algorithm          improved Bayesian algorithm
400               87.5%                       96.75%
500               84.4%                       95.6%
600               81.17%                      94%

IV. USING GEOMETRIC MEAN ALGORITHM

When computing $P$ and $Q$, using $\left(\prod_{i=1}^{N} (1 - P_{spam|w_i})\right)^{1/N}$ and $\left(\prod_{i=1}^{N} P_{spam|w_i}\right)^{1/N}$, which are the geometric means of $1 - P_{spam|w_i}$ and $P_{spam|w_i}$ respectively, in place of the plain products yields a new algorithm.

The following three steps compute $p_{spam|w}$, the spam probability of a string $w$, where $G_{string}$ and $B_{string}$ are the total numbers of strings in the non-spam and spam sample libraries:

$$ r_g = \min\left(1,\ 2\,\frac{good(w)}{G_{string}}\right) \tag{11} $$

$$ r_b = \min\left(1,\ \frac{bad(w)}{B_{string}}\right) \tag{12} $$

$$ p_{spam|w} = \max\left(0.01,\ \min\left(0.99,\ \frac{r_b}{r_g + r_b}\right)\right) \tag{13} $$

The corresponding C# code can be implemented as follows:
private void CalculateTokenProbability(string token)
{
int g=_good.Tokens.ContainsKey(token)?
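Independently of the listing above, which is incomplete here, equations (11)-(13) can be sketched as a small self-contained function. This is a minimal illustration under stated assumptions, not the paper's implementation: the names `TokenProbabilitySketch`, `SpamProbability`, `goodTotal` and `badTotal` are hypothetical, and returning the neutral value 0.4 for a string that occurs in neither library is an assumption made here to avoid the 0/0 case in equation (13).

```csharp
using System;

// Hypothetical names; a sketch of equations (11)-(13), not the paper's listing.
public static class TokenProbabilitySketch
{
    // goodCount / badCount : occurrences of the string w in the non-spam / spam library
    // goodTotal / badTotal : total number of strings in each library (G_string, B_string)
    public static double SpamProbability(int goodCount, int badCount, int goodTotal, int badTotal)
    {
        if (goodTotal <= 0 || badTotal <= 0)
        {
            throw new ArgumentException("Both sample libraries must be non-empty.");
        }

        double rg = Math.Min(1.0, 2.0 * goodCount / goodTotal); // equation (11)
        double rb = Math.Min(1.0, (double)badCount / badTotal); // equation (12)

        // Assumption: a string seen in neither library gets the neutral value 0.4,
        // which also avoids dividing zero by zero in equation (13).
        if (rg + rb == 0.0)
        {
            return 0.4;
        }

        return Math.Max(0.01, Math.Min(0.99, rb / (rg + rb))); // equation (13)
    }
}
```

Because the denominators are string totals rather than comment counts, a library that contains a few very long comments weighs the same as one that contains many short comments with the same number of strings. A string that occurs only in the spam library is clamped to 0.99, and one that occurs only in the non-spam library is clamped to 0.01.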
2012 IEEE Symposium on Robotics and Applications(ISRA)
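To make the difference between the product rule of equations (8)-(10) and the geometric-mean rule of equations (14)-(16) concrete, the following sketch implements both side by side. It is an illustration only: the class and method names are hypothetical, the inputs are assumed to be token probabilities already clamped to [0.01, 0.99], and the geometric means are computed through logarithms, which is a numerical convenience chosen here rather than part of the paper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper contrasting the two combination rules; not the paper's code.
public static class CombineSketch
{
    // Product rule: P = prod(p_i), Q = prod(1 - p_i), result P / (P + Q).
    public static double Product(IReadOnlyList<double> probs)
    {
        double p = 1.0;
        double q = 1.0;
        foreach (double x in probs)
        {
            p *= x;
            q *= 1.0 - x;
        }
        return p / (p + q);
    }

    // Geometric-mean rule: P = 1 - (prod(1 - p_i))^(1/N), Q = 1 - (prod(p_i))^(1/N),
    // result P / (P + Q).
    public static double GeometricMean(IReadOnlyList<double> probs)
    {
        if (probs.Count == 0)
        {
            throw new ArgumentException("At least one probability is required.");
        }
        double logMult = 0.0; // log of prod(p_i)
        double logComb = 0.0; // log of prod(1 - p_i)
        foreach (double x in probs)
        {
            logMult += Math.Log(x);
            logComb += Math.Log(1.0 - x);
        }
        double p = 1.0 - Math.Exp(logComb / probs.Count);
        double q = 1.0 - Math.Exp(logMult / probs.Count);
        return p / (p + q);
    }
}
```

For the five token probabilities 0.99, 0.99, 0.99, 0.2 and 0.2, the product rule returns a value above 0.99, whereas the geometric-mean rule returns roughly 0.66. This matches the behaviour described in flaw (2): the product rule saturates near 0 or 1, while the geometric-mean rule leaves room for intermediate values.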
$$ P = 1 - \left(\prod_{i=1}^{N} \left(1 - P_{spam|w_i}\right)\right)^{1/N} \tag{14} $$

$$ Q = 1 - \left(\prod_{i=1}^{N} P_{spam|w_i}\right)^{1/N} \tag{15} $$

$$ P_{spam} = \frac{P}{P + Q} \tag{16} $$

The corresponding C# code can be implemented as follows:

    double p, q, s;
    double mult = 1;   // running product of P_spam|wi
    double comb = 1;   // running product of (1 - P_spam|wi)
    int index = 0;     // number of probabilities combined so far
    foreach (string key in probs.Keys)
    {
        double prob = (double)probs[key];
        mult = mult * prob;
        comb = comb * (1 - prob);
        // Stop once exactly InterestingWordCount values have been combined.
        if (++index >= Knobs.InterestingWordCount)
            break;
    }
    p = 1 - Math.Pow(comb, 1.0 / index);   // equation (14)
    q = 1 - Math.Pow(mult, 1.0 / index);   // equation (15)
    s = p / (p + q);                       // equation (16)
    return s;

In the above code, the variable mult holds $\prod_{i=1}^{N} P_{spam|w_i}$ and comb holds $\prod_{i=1}^{N} (1 - P_{spam|w_i})$. The variables p and q are one minus the geometric mean of $1 - P_{spam|w_i}$ and of $P_{spam|w_i}$ respectively, as in equations (14) and (15).

        using geometric mean algorithm                 using Bayesian algorithm
probability scope        number of comments    probability scope        number of comments
P_spam > 0.5             12                    0.82 < P_spam < 0.98     13
0.4 < P_spam < 0.5       14                    0.3 < P_spam < 0.53      3
P_spam < 0.4             5                     0 < P_spam < 0.2         20

V. CONCLUSION

The traditional Bayesian algorithm uses the numbers of comments in the spam and non-spam sample libraries as the basis of calculation, so comment length has a considerable impact on the identification results, which leads to a low spam recognition rate. This paper instead uses the total number of strings in each of the two sample libraries to improve the Bayesian algorithm, which substantially increases the spam recognition rate. On this basis, the geometric mean algorithm is used in place of the Bayesian combination to further improve the spam recognition rate. It also makes the spam probability distribution of incorrectly identified comments more balanced, which can be used to judge the extent to which a suspect comment is spam.

REFERENCES
[1] Abu-Nimeh, S.; Chen, T. “Proliferation and Detection of Blog Spam”. Security & Privacy. Vol. 8, No. 5, pp. 42-47, 2010.
[2] Kamaliha E.; Riahi, F.; Qazvinian V.; Adibi, J. “Characterizing Network Motifs to Identify Spam Comments”. 2008 IEEE International Conference on Data Mining Workshops. pp. 919-928, 2008.
[3] Fei-Fei Li, Rob Fergus, Pietro Perona. “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories”. Computer Vision and Image Understanding. Vol. 106, No. 1, pp. 59-70, 2007.
[4] Liwei Wang, Xiao Wang, Jufu Feng. “Subspace distance analysis with application to adaptive Bayesian algorithm for face recognition”. Pattern Recognition. Vol. 39, No. 3, pp. 456-464, 2006.
[5] Byoung-Tak Zhang and Ha-Young Jang. “A Bayesian Algorithm for In Vitro Molecular Evolution of Pattern Classifiers”. Lecture Notes in Computer Science. Vol. 3384, pp. 720-722, 2005.
[6] Di Michele, S.; Tassa, A.; Mugnai, A.; Marzano, F.S.; Bauer, P.; Baptista, J.P.V.P. “Bayesian algorithm for microwave-based precipitation retrieval: description and application to TMI measurements over ocean”. Geoscience and Remote Sensing. Vol. 43, No. 4, pp. 778-791, 2005.
[7] Bhattarai, A.; Rus, V.; Dasgupta, D. “Characterizing comment spam in the blogosphere through content analysis”. Computational Intelligence in Cyber Security. pp. 37-44, 2009.
[8] Beatrice Cynthia Dhinakaran, Dhinaharan Nagamalai and Jae-Kwang Lee. “Bayesian Approach Based Comment Spam Defending Tool”. Lecture Notes in Computer Science. Vol. 5576, pp. 578-587, 2009.