
2012 IEEE Symposium on Robotics and Applications (ISRA)

Utilizing Improved Bayesian Algorithm to Identify Blog Comment Spam

LI Aiwu
Dept. of Computer Science
Guangdong Vocational College of Posts and Telecom
Guangzhou, China
e-mail: [email protected]

LIU Hongying
Dept. of Computer Science and Engineering
Guangzhou Vocational & Technical Institute of Industry & Commerce
Guangzhou, China
e-mail: [email protected]

Abstract—As blog site operators face a growing demand to deal with comment spam, this paper analyzes the defects of the traditional Bayesian algorithm based on statistical methods, points out its deficiencies in practical application, and improves the basic algorithm: the spam probability of each string appearing in a comment is computed from the sample libraries, and the resulting values are combined, ultimately by a geometric mean calculation. The experimental results show that the modified Bayesian classification algorithm effectively improves the classification of spam comments; the spam-comment afr, the legitimate-comment afr, and the average afr all drop substantially.

Keywords—blog comment spam; Bayesian algorithm; geometric mean algorithm

I. INTRODUCTION

With the rapid development of the Internet, the blog has become one of the fastest and most economical means of communication. But while becoming an information communication tool, the blog has also become a carrier of large amounts of commercial advertising and useless information, which forces users to spend a great deal of time and energy dealing with these so-called "junk" comments. How to filter blog comments is therefore a major concern for users, and research on spam-filtering methods is an important subject in blog comment processing [1-2]. At present some blog systems have adopted certain technological means to deal with this rubbish, but these technologies are either insufficient or technically imperfect. Therefore, studying an effective spam-filtering system is of vital significance [2-3].

The Bayes classification algorithm is a statistical classification method, that is, a classification algorithm that uses knowledge of probability and statistics. On many occasions the simple naive Bayesian (NB) classification algorithm is comparable to decision tree and neural network classifiers; it can be applied to large databases, and the method is simple, fast, and highly accurate. However, Bayes' theorem is applied under the assumption that the influence of an attribute value on a given class is independent of the other attribute values, and since this assumption often does not hold in practice, classification accuracy may decline. Many Bayesian classification algorithms that relax the independence assumption have therefore been proposed, such as the TAN (tree augmented Bayes network) algorithm [4-5].

Compared with all other classification algorithms, the Bayesian model has, in theory, the lowest error rate. In practice this is not always the case, because its working assumptions (such as class-conditional independence) are inaccurate and the probability data it needs may be lacking. Nevertheless, various experiments show that decision tree and neural network classifiers are, in some areas, comparable to it; in addition, Bayes classification can also provide a theoretical justification for other classification algorithms that do not use Bayes' theorem directly [6].

Paul Graham proposed a new method for filtering spam email based on a statistical Bayesian algorithm in 2002, which greatly improved the accuracy of spam identification. With the rise of blog services, spam comments have become a problem in blogs, and the demand for identifying and blocking them keeps growing. Since the content of spam comments is similar to that of spam email, the Bayesian algorithm can also be used to identify blog comment spam [7-8].

The test data in this paper come from the blog comment sample library of the Institute for Information and Language Processing Systems of the University of Amsterdam (http://ilps.science.uva.nl/resources/commentspam), and the algorithms are implemented in Microsoft's C# programming language.

II. PROBLEM DESCRIPTION

Each data sample is described by an n-dimensional feature vector holding the values of its n attributes, namely X = {x1, x2, ..., xn}. Assume there are m classes, denoted C1, C2, ..., Cm. Given an unknown data sample X (i.e. one with no class label), the simple Bayesian classifier assigns X to class Ci only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i.

Because P(X) is constant for all classes, maximizing the posterior probability P(Ci|X) is equivalent to maximizing P(X|Ci)P(Ci). If the training data set has many attributes and tuples, computing P(X|Ci) may be very expensive; therefore, the attribute values are usually assumed to be independent of one another, so that the prior probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) can be obtained from the training data set.

According to this method, for a sample X of unknown category, the probability P(X|Ci)P(Ci) that X belongs to each class Ci is computed separately, and the class with the largest value is chosen as the predicted category.

The simple Bayesian algorithm is premised on the attributes being mutually independent. When a data set satisfies this independence assumption, classification accuracy is high; otherwise it may be lower. In addition, the algorithm produces no classification rules as output.
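As an illustration of this decision rule, a minimal C# sketch is given below. The class, method, and parameter names are illustrative assumptions rather than anything defined in this paper, and the class priors P(Ci) and the conditional probabilities P(xk|Ci) are assumed to have already been estimated from the training data.

using System;

static class NaiveBayesSketch
{
    // Pick the class Ci that maximizes P(Ci) * P(x1|Ci) * ... * P(xn|Ci).
    // classPriors[i] = P(Ci); condProbs[i][k][v] = P(xk = v | Ci);
    // sample[k] = index of the observed value of attribute xk.
    public static int Classify(double[] classPriors, double[][][] condProbs, int[] sample)
    {
        int bestClass = -1;
        double bestScore = double.NegativeInfinity;

        for (int i = 0; i < classPriors.Length; i++)
        {
            // Plain product P(X|Ci)P(Ci) under the attribute-independence assumption.
            double score = classPriors[i];
            for (int k = 0; k < sample.Length; k++)
                score *= condProbs[i][k][sample[k]];

            if (score > bestScore)
            {
                bestScore = score;
                bestClass = i;
            }
        }
        return bestClass;   // index of the most probable class
    }
}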
The following are the basic steps of the Bayesian algorithm for identifying blog comment spam.

First, create two sample libraries, composed of spam samples and non-spam samples respectively. Suppose the numbers of comments in the two libraries are Bcomment and Gcomment, and the numbers of occurrences of a string w in the two libraries are bad(w) and good(w) respectively.

If a comment includes the string w, the spam probability pspam|w contributed by this string can be computed as follows:

rg = min(1, 2·good(w)/Gcomment)    (1)
rb = min(1, bad(w)/Bcomment)    (2)
pspam|w = max(0.01, min(0.99, rb/(rg + rb)))    (3)

To determine whether a comment is spam, compute pspam|w for every string w in it and take the absolute value of the difference between each pspam|w and 0.5. Sort the strings by this deviation and keep the N most informative ones, i.e. those whose pspam|w deviates most from 0.5. Suppose the selected strings are w1, w2, ..., wN with probabilities Pspam|w1, ..., Pspam|wN; the spam probability of the whole comment is then computed as:

Pspam = ∏(i=1..N) Pspam|wi / ( ∏(i=1..N) Pspam|wi + ∏(i=1..N) (1 − Pspam|wi) )    (4)

If a string does not appear in the sample libraries, its pspam|w value can be set to 0.4. If the value of Pspam is greater than 0.99, the comment is judged to be spam; otherwise it is judged to be non-spam.

After analysis, the above algorithm has two flaws:

(1) While computing rg and rb, the algorithm uses the number of comments in the two sample libraries. If the number of strings per comment varies greatly, rg and rb cannot reflect the actual situation of the sample libraries, resulting in a low recognition rate.

(2) The spam probability produced by the algorithm is normally close to 0 or 1, and intermediate values rarely appear, making it difficult to judge the degree to which a comment is suspected of being spam.

III. IMPROVED BAYESIAN ALGORITHM

Change the meaning of Bcomment and Gcomment in the original algorithm to the total number of strings in the two sample libraries, and denote these totals by Bstring and Gstring. The new algorithm is as follows:

rg = min(1, 2·good(w)/Gstring)    (5)
rb = min(1, bad(w)/Bstring)    (6)
pspam|w = max(0.01, min(0.99, rb/(rg + rb)))    (7)
P = ∏(i=1..N) Pspam|wi    (8)
Q = ∏(i=1..N) (1 − Pspam|wi)    (9)
Pspam = P/(P + Q)    (10)

With these formulas, the calculation of rg and rb is no longer affected by comment length. The corresponding C# code for this algorithm can be structured along the following lines.
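A minimal sketch of the combination step (8)-(10) is given below; the method name and parameter are assumptions made for illustration, not the paper's original listing, and the per-string probabilities from (5)-(7) are taken as already computed.

// Combine the per-string probabilities pspam|w of the N selected strings into an
// overall spam probability according to (8)-(10). Identifiers here are illustrative.
static double CombineTokenProbabilities(double[] tokenProbs)
{
    double p = 1.0;   // running product of Pspam|wi, i.e. P in (8)
    double q = 1.0;   // running product of (1 - Pspam|wi), i.e. Q in (9)

    foreach (double prob in tokenProbs)
    {
        p *= prob;
        q *= 1.0 - prob;
    }

    return p / (p + q);   // Pspam in (10)
}

The returned value would then be compared against a decision threshold to classify the comment, as in the basic algorithm.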
This algorithm significantly improves the comment spam recognition rate. A comparison of the correct recognition rates of the two algorithms is shown in Table I.

TABLE I. COMPARISON OF CORRECT RECOGNITION RATE OF TWO ALGORITHMS

number of spam comments    recognition rate of        recognition rate of
                           Bayesian algorithm         improved Bayesian algorithm
400                        87.5%                      96.75%
500                        84.4%                      95.6%
600                        81.17%                     94%

IV. USING GEOMETRIC MEAN ALGORITHM

While computing P and Q, a new algorithm can be obtained by using (∏(i=1..N) (1 − Pspam|wi))^(1/N) and (∏(i=1..N) Pspam|wi)^(1/N) instead of the plain products, that is, the geometric means of 1 − Pspam|wi and of Pspam|wi respectively.

The following three steps are used to compute pspam|w, the spam probability contributed by a string w:

rg = min(1, 2·good(w)/Gstring)    (11)
rb = min(1, bad(w)/Bstring)    (12)
pspam|w = max(0.01, min(0.99, rb/(rg + rb)))    (13)

The corresponding C# code can be implemented as follows:

private void CalculateTokenProbability(string token)
{
    // Occurrence counts of the token in the non-spam (good) and spam (bad) libraries;
    // Knobs.GoodTokenWeight plays the role of the factor 2 in (11).
    int g = _good.Tokens.ContainsKey(token) ? _good.Tokens[token] * Knobs.GoodTokenWeight : 0;
    int b = _bad.Tokens.ContainsKey(token) ? _bad.Tokens[token] : 0;

    if (g + b >= Knobs.MinCountForInclusion)
    {
        // goodfactor and badfactor correspond to rg in (11) and rb in (12);
        // _ngood and _nbad hold the string totals Gstring and Bstring.
        double goodfactor = Min(1, (double)g / (double)_ngood);
        double badfactor = Min(1, (double)b / (double)_nbad);

        // prob corresponds to pspam|w in (13), clamped away from 0 and 1.
        double prob = Max(0.0001, Min(0.9999, badfactor / (goodfactor + badfactor)));

        if (g == 0)
        {
            prob = (b > Knobs.CertainSpamCount)
                ? Knobs.CertainSpamScore : Knobs.LikelySpamScore;
        }
        _prob[token] = prob;
    }
}

In the above code, the variable goodfactor denotes rg, badfactor denotes rb, and prob denotes pspam|w.

Using the geometric means of 1 − Pspam|wi and Pspam|wi, the spam probability of the comment is obtained through the following steps:

P = 1 − ( ∏(i=1..N) (1 − Pspam|wi) )^(1/N)    (14)
Q = 1 − ( ∏(i=1..N) Pspam|wi )^(1/N)    (15)
Pspam = P/(P + Q)    (16)

The corresponding C# code can be implemented as follows:

// probs maps each selected string to its pspam|w value.
double p, q, s;
double mult = 1;   // running product of Pspam|wi
double comb = 1;   // running product of (1 - Pspam|wi)
int index = 0;
foreach (string key in probs.Keys)
{
    double prob = (double)probs[key];
    mult = mult * prob;
    comb = comb * (1 - prob);
    if (++index > Knobs.InterestingWordCount)
        break;
}
p = 1 - Math.Pow(comb, (double)1 / (double)index);   // P in (14)
q = 1 - Math.Pow(mult, (double)1 / (double)index);   // Q in (15)
s = p / (p + q);                                     // Pspam in (16)
return s;

In the above code, the variable mult holds ∏(i=1..N) Pspam|wi and comb holds ∏(i=1..N) (1 − Pspam|wi); p and q are the quantities (14) and (15), built from the geometric means of 1 − Pspam|wi and Pspam|wi respectively.

According to actual test results, if Pspam > 0.52 the comment can be judged to be spam.

Using the above geometric mean algorithm, the recognition rate of comment spam can be further improved. A comparison of the recognition rates of the two algorithms is shown in Table II.

TABLE II. COMPARISON OF RECOGNITION RATE OF IMPROVED BAYESIAN ALGORITHM AND GEOMETRIC MEAN ALGORITHM

number of spam comments    recognition rate of            recognition rate of
                           improved Bayesian algorithm    geometric mean algorithm
400                        96.75%                         97%
500                        95.6%                          96.2%
600                        94%                            94.83%

In addition, the geometric mean algorithm yields a more balanced distribution of spam probabilities, which can be used to judge the degree to which a comment that has not been correctly recognized is suspected of being spam. Running the two algorithms on a sample library containing 600 spam comments, the spam comments that were not correctly recognized are compared in Table III.

TABLE III. SPAM PROBABILITY DISTRIBUTION OF THE TWO ALGORITHMS

using Bayesian algorithm                  using geometric mean algorithm
probability scope       number of         probability scope        number of
                        comments                                   comments
Pspam > 0.5             12                0.82 < Pspam < 0.98      13
0.4 < Pspam < 0.5       14                0.3 < Pspam < 0.53       3
Pspam < 0.4             5                 0 < Pspam < 0.2          20

V. CONCLUSION

The traditional Bayesian algorithm uses the number of comments in the spam and non-spam sample libraries as the basis of its calculation, so comment length has a considerable impact on the identification results, resulting in a low spam recognition rate. This paper uses the total number of strings in each of the two sample libraries instead, improving the effectiveness of the Bayesian algorithm and producing a substantial increase in the spam recognition rate. On this basis, the geometric mean algorithm is used in place of the plain Bayesian combination to further improve the spam recognition rate; it also makes the spam probability distribution of comments that have not been correctly identified more balanced, which can be used to judge the degree to which a suspect comment is likely to be spam.
REFERENCES

[1] Abu-Nimeh, S.; Chen, T. “Proliferation and Detection of Blog Spam”. Security & Privacy, Vol. 8, No. 5, pp. 42-47, 2010.

[2] Kamaliha, E.; Riahi, F.; Qazvinian, V.; Adibi, J. “Characterizing Network Motifs to Identify Spam Comments”. 2008 IEEE International Conference on Data Mining Workshops, pp. 919-928, 2008.
[3] Fei-Fei Li; Rob Fergus; Pietro Perona. “Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories”. Computer Vision and Image Understanding, Vol. 106, No. 1, pp. 59-70, 2007.
[4] Liwei Wang; Xiao Wang; Jufu Feng. “Subspace distance analysis with application to adaptive Bayesian algorithm for face recognition”. Pattern Recognition, Vol. 39, No. 3, pp. 456-464, 2006.
[5] Byoung-Tak Zhang; Ha-Young Jang. “A Bayesian Algorithm for In Vitro Molecular Evolution of Pattern Classifiers”. Lecture Notes in Computer Science, Vol. 3384, pp. 720-722, 2005.
[6] Di Michele, S.; Tassa, A.; Mugnai, A.; Marzano, F.S.; Bauer, P.; Baptista, J.P.V.P. “Bayesian algorithm for microwave-based precipitation retrieval: description and application to TMI measurements over ocean”. Geoscience and Remote Sensing, Vol. 43, No. 4, pp. 778-791, 2005.
[7] Bhattarai, A.; Rus, V.; Dasgupta, D. “Characterizing comment spam in the blogosphere through content analysis”. Computational Intelligence in Cyber Security, pp. 37-44, 2009.
[8] Beatrice Cynthia Dhinakaran; Dhinaharan Nagamalai; Jae-Kwang Lee. “Bayesian Approach Based Comment Spam Defending Tool”. Lecture Notes in Computer Science, Vol. 5576, pp. 578-587, 2009.
