Interplay Between Probabilistic Classifiers and Boosting Algorithms For Detecting Complex Unsolicited Emails
Algorithm for classification:

Input: Training set $T = \{t_1, t_2, t_3, \ldots, t_n\}$ with $t_i = (x_i, y_i)$. Number of sampled versions of the training set, $B$.

Output: An appropriate classifier $G^x$ for the training set.

For $n = 1 \ldots B$:
1) Draw with replacement $K \le N$ samples from the training set $T$, obtaining the $n$-th sample $T_n$.
2) Train classifier $G_n$ on each sample $T_n$.
3) Build the final classifier as a vote of the $G_n$ with $n = 1 \ldots B$:

$$G^x = \operatorname{sign}\Big(\sum_{m=1}^{M} G_m^x\Big) \qquad (2)$$
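A minimal sketch of the bagging procedure above is given below in Python. It assumes labels coded as -1/+1 and a Gaussian Naive Bayes base learner; the paper's own experiments used a Java/Matlab environment, so this is illustrative only.

```python
# A minimal sketch of the bagging procedure above, assuming labels coded as
# -1/+1 and a Gaussian Naive Bayes base learner (illustrative only).
import numpy as np
from sklearn.naive_bayes import GaussianNB

def bagging_fit(X, y, B=5, random_state=0):
    """Train B classifiers, each on K = N samples drawn with replacement."""
    rng = np.random.default_rng(random_state)
    N = len(y)
    models = []
    for _ in range(B):
        idx = rng.integers(0, N, size=N)          # bootstrap sample T_n
        models.append(GaussianNB().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Final classifier: sign of the summed votes of the individual G_n."""
    return np.sign(sum(m.predict(X) for m in models))

# Toy usage (an odd B avoids ties in the sign-based vote).
X = np.array([[0.0], [0.2], [0.9], [1.1]])
y = np.array([-1, -1, 1, 1])
print(bagging_predict(bagging_fit(X, y, B=5), X))
```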
B. Boosting with Re-sampling
The Boosting with Re-sampling technique is similar to Bootstrapping and Bagging. The difference is that Bootstrapping and Bagging perform sampling with replacement, whereas Boosting with Re-sampling performs sampling without replacement. It was proposed in 1989 by Schapire [2].
Algorithm for classification:

Input: Training set $T = \{t_1, t_2, t_3, \ldots, t_n\}$ with $t_i = (x_i, y_i)$. Number of sampled versions of the training set, $B$.

Output: An appropriate classifier $G^x$ for the training set.

1) Draw without replacement $K_1 < N$ samples from the training set $T$, obtaining the sample $T_1$.
2) Train weak classifier $G_1$ on the sample $T_1$.
3) Select $K_2 < N$ samples from the training set $T$, including half of the samples misclassified by $G_1$. Train weak classifier $G_2$ on it.
4) Select all remaining samples misclassified by $G_1$ and $G_2$. Train weak classifier $G_3$ on it.
5) Build the final classifier based on the voting of the weak classifiers:

$$G^x = \operatorname{sign}\Big(\sum_{n=1}^{3} G_n^x\Big) \qquad (3)$$
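An illustrative sketch of this three-classifier scheme follows. The subset sizes $K_1$ and $K_2$ (taken here as N // 2) and the Gaussian Naive Bayes weak learners are assumptions, since the paper does not fix them; labels are again taken to be coded as -1/+1.

```python
# An illustrative sketch of the three-classifier boosting-with-resampling
# scheme above; subset sizes and weak learners are assumptions.
import numpy as np
from sklearn.naive_bayes import GaussianNB

def boost_by_resampling(X, y, random_state=0):
    rng = np.random.default_rng(random_state)
    N = len(y)
    # 1)-2) Train G1 on K1 < N samples drawn without replacement.
    idx1 = rng.choice(N, size=N // 2, replace=False)
    g1 = GaussianNB().fit(X[idx1], y[idx1])
    # 3) Train G2 on K2 samples that include half of those G1 misclassifies.
    wrong1 = np.where(g1.predict(X) != y)[0]
    half_wrong = wrong1[: len(wrong1) // 2]
    rest = rng.choice(N, size=max(N // 2 - len(half_wrong), 1), replace=False)
    idx2 = np.union1d(half_wrong, rest)
    g2 = GaussianNB().fit(X[idx2], y[idx2])
    # 4) Train G3 on the remaining samples misclassified by G1 and G2.
    wrong12 = np.where((g1.predict(X) != y) & (g2.predict(X) != y))[0]
    idx3 = wrong12 if len(wrong12) > 0 else np.arange(N)  # fall back if empty
    g3 = GaussianNB().fit(X[idx3], y[idx3])
    return g1, g2, g3

def boost_predict(models, X):
    # 5) Final classifier: the sign of the vote of the three weak classifiers.
    return np.sign(sum(m.predict(X) for m in models))
```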
C. Adaptive Boosting (AdaBoost)
The idea behind Adaptive Boosting is to reweight the data instead of randomly sampling it. AdaBoost is a technique for building ensembles of classifiers with improved performance. The AdaBoost algorithm [24], [25] learns from the combination of the outputs of $M$ weak classifiers $G_m^x$, with the final classification decision carried out by $G^x = \operatorname{sign}\big(\sum_{m=1}^{M} u_m G_m^x\big)$.
Algorithm for classification:

Input: Training set $T = \{t_1, t_2, t_3, \ldots, t_n\}$ with $t_i = (x_i, y_i)$. Number of sampled versions of the training set, $B$.

Output: An appropriate classifier $G^x$ for the training set.

Initialise the weights $w_i^t = \frac{1}{N}$, $i \in \{1 \ldots N\}$.

For $m = 1 \ldots M$:
1) Learn the classifier $G_m^x$ from the training data using the weights $w_i^t$.
2) Calculate the error term
$$E_{rr_m} = \frac{\sum_{i=1}^{N} w_i^t\, I\big(y_i \neq G_m^{x_i}\big)}{\sum_{i=1}^{N} w_i^t}.$$
3) Calculate the weight contribution
$$u_m = 0.5 \log\!\Big(\frac{1 - E_{rr_m}}{E_{rr_m}}\Big).$$
4) Put $w_i^t = w_i^t \exp\!\big(u_m\, I(y_i \neq G_m^{x_i})\big)$, then renormalise so that $\sum_i w_i^t = 1$.

The final classifier is

$$G^x = \operatorname{sign}\Big(\sum_{m=1}^{M} u_m G_m^x\Big) \qquad (4)$$
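The update rules above translate almost line for line into code. The sketch below assumes decision-stump weak learners and labels coded as -1/+1; clipping the error term is a numerical safeguard, not part of the algorithm as stated.

```python
# A direct transcription of the AdaBoost updates above, assuming decision
# stumps and -1/+1 labels; the clipping of the error is a numerical safeguard.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, M=10):
    N = len(y)
    w = np.full(N, 1.0 / N)                    # initialise w_i = 1/N
    models, u = [], []
    for _ in range(M):
        g = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        miss = (g.predict(X) != y).astype(float)
        err = np.clip(np.dot(w, miss) / w.sum(), 1e-10, 1 - 1e-10)  # E_rr_m
        u_m = 0.5 * np.log((1 - err) / err)    # weight contribution u_m
        w = w * np.exp(u_m * miss)             # up-weight misclassified samples
        w = w / w.sum()                        # renormalise: sum_i w_i = 1
        models.append(g)
        u.append(u_m)
    return models, u

def adaboost_predict(models, u, X):
    # Final classifier G(x) = sign( sum_m u_m G_m(x) ), i.e. equation (4).
    return np.sign(sum(u_m * g.predict(X) for u_m, g in zip(u, models)))
```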
IV. PROBABILISTIC CLASSIFIERS
This idea was proposed by Lewis [26], who developed the term $P(c_i / d_j)$, the probability that a document represented by a vector of terms $d_j = \{w_1^j, \ldots, w_n^j\}$ falls within a particular category $c_i$. This probability is calculated by Bayes' theorem as

$$P\!\Big(\frac{c_i}{d_j}\Big) = \frac{P(c_i)\, P\!\big(\frac{d_j}{c_i}\big)}{P(d_j)} \qquad (5)$$
where $P(d_j)$ is the probability that a randomly selected document is represented by the vector $d_j$, and $P(c_i)$ is the probability that a randomly selected document $d_j$ falls in the class $c_i$. This is the basic method of the Bayesian filter. The equation of the Bayesian classifier is seen as problematic because of the high number of possible vectors $d_j$. This problem has been tackled by assuming that any two randomly selected coordinates of the document vector are statistically independent of each other. This independence assumption is expressed by equation (6):

$$P\!\Big(\frac{d_j}{c_i}\Big) = \prod_{l=1}^{n} P\!\Big(\frac{w_l^j}{c_i}\Big) \qquad (6)$$

This assumption is adopted by the classifier named Naive Bayes, which has been widely used in research related to text mining [27]-[30].
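A small sketch of how equations (5) and (6) are applied to a term-document matrix is shown below; the Laplace smoothing and the use of log-probabilities are standard numerical conveniences rather than part of the formulation above.

```python
# P(c_i | d_j) is proportional to P(c_i) * prod_l P(w_l | c_i), per (5)-(6).
import numpy as np

def naive_bayes_fit(X, y):
    """X: document-term count matrix, y: class labels."""
    classes = np.unique(y)
    log_priors = np.log([np.mean(y == c) for c in classes])            # P(c_i)
    counts = np.array([X[y == c].sum(axis=0) + 1 for c in classes])    # Laplace
    log_cond = np.log(counts / counts.sum(axis=1, keepdims=True))      # P(w_l|c_i)
    return classes, log_priors, log_cond

def naive_bayes_predict(classes, log_priors, log_cond, X):
    # log P(c_i | d_j) = log P(c_i) + sum_l x_jl * log P(w_l | c_i) + const
    return classes[np.argmax(X @ log_cond.T + log_priors, axis=1)]

# Toy usage: four documents over a three-term vocabulary (0 = HAM, 1 = SPAM).
X = np.array([[2, 0, 1], [1, 0, 0], [0, 3, 1], [0, 2, 2]])
y = np.array([0, 0, 1, 1])
print(naive_bayes_predict(*naive_bayes_fit(X, y), X))
```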
V. GENETIC FEATURE SEARCH
In this paper, we have used the Genetic Search algorithm to select the most informative features. This algorithm is a type of inductive learning strategy that was initially introduced by Holland [31]. It works in a way similar to the genetic models of natural systems, which is why it is known as a Genetic algorithm.
A Genetic algorithm maintains a constant-size population of individuals that sample the space to be searched. Each individual is evaluated by its fitness. New individuals are formed by choosing the best-performing individuals, which produce offspring [32] that retain the features of their parents. This creates a population with improved fitness.
Two main genetic operators are involved in generating these new individuals: Crossover and Mutation. The Crossover operator works by randomly selecting a point in the gene structures of two parents and exchanging the remaining segments to create new individuals. Crossover therefore creates two new individuals by combining features of two old individuals. Mutation works by randomly changing some components of a particular individual. It acts as a population perturbation operator, adding new information to the population, and it also prevents any stagnation that might occur during the search process.
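The selection, crossover and mutation steps described above can be sketched as follows. Individuals are binary masks over the feature set and fitness is hold-out accuracy of a Naive Bayes classifier; the population size, mutation rate, single-point crossover and fitness function are all illustrative assumptions, since the paper does not specify these settings.

```python
# Illustrative genetic feature selection over binary feature masks.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

def fitness(mask, X, y):
    if mask.sum() == 0:
        return 0.0
    Xtr, Xte, ytr, yte = train_test_split(X[:, mask.astype(bool)], y,
                                          test_size=0.34, random_state=0)
    return GaussianNB().fit(Xtr, ytr).score(Xte, yte)

def genetic_search(X, y, pop_size=20, generations=30, p_mut=0.05, seed=0):
    rng = np.random.default_rng(seed)
    n_feat = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n_feat))
    for _ in range(generations):
        scores = np.array([fitness(ind, X, y) for ind in pop])
        best = pop[np.argsort(scores)[-pop_size // 2:]]   # keep the fittest half
        children = []
        while len(children) < pop_size - len(best):
            a, b = best[rng.integers(len(best), size=2)]
            cut = rng.integers(1, n_feat)                 # single-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(n_feat) < p_mut             # bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        pop = np.vstack([best] + children)
    scores = np.array([fitness(ind, X, y) for ind in pop])
    return pop[np.argmax(scores)].astype(bool)            # best feature mask

# Toy usage: 8 synthetic features, of which only two carry the class signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(120, 8))
y = (X[:, 0] + X[:, 3] > 0).astype(int)
print(genetic_search(X, y).nonzero()[0])                  # selected feature indices
```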
VI. EXPERIMENTS AND EVALUATION
A. Data Set
We have taken our data from the Enron email dataset. We selected the Enron 4, Enron 5 and Enron 6 datasets and created 6000 HAM and 6000 SPAM files by random sampling. The reason for choosing these versions of the Enron emails is the complexity embedded in their SPAM messages. Complexity can be defined in terms of attacks, such as tokenisation and obfuscation, carried out by spammers. This dataset therefore demonstrates the efficacy of the classifiers against these attacks.
B. Pre-Processing of Data
The content of the Email files is represented by vectors of features, i.e. the weight of word i in document k [33]. These vectors are then combined over the collection of documents to create a Term-Document matrix; this process is called Indexing. Due to the large number of Email files, the resulting matrix is very large and sparse, so some dimensionality reduction technique has to be applied before classification. This can be done by Feature Selection or Feature Extraction methods. Dimensionality can be further reduced by stop-word removal (removing words that carry no information, such as pronouns, prepositions and conjunctions) [33] and lemmatisation (grouping words that share the same base form, such as Boost, Boosted, Boosting).
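As an illustration of the indexing and stop-word removal steps, the following sketch builds a sparse term-document matrix with scikit-learn; the paper does not name a specific tool, and lemmatisation (which would need an extra library such as NLTK) is omitted here.

```python
# A brief sketch of the indexing step: sparse term-document matrix with
# stop-word removal via scikit-learn's CountVectorizer (illustrative only).
from sklearn.feature_extraction.text import CountVectorizer

emails = [
    "Win a free prize now, click the link below",      # toy SPAM-like message
    "Meeting notes for the quarterly report attached"  # toy HAM-like message
]

vectorizer = CountVectorizer(stop_words="english", lowercase=True)
X = vectorizer.fit_transform(emails)       # sparse document-term matrix

print(vectorizer.get_feature_names_out())  # retained vocabulary (sklearn >= 1.0)
print(X.toarray())                         # weight of word i in document k
```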
C. Feature Selection
In this step, we select the most informative features. Several techniques have been developed for feature selection; we have used the Genetic Search method. From the Term-Document matrix, which initially contained 1359 attributes, the Genetic Search algorithm selected the 375 best features. These features were used for further analysis.
D. Classifiers
A Java and Matlab environment on a Windows 7 platform was used for testing the classifiers. Initially, we evaluated two probabilistic classifiers: the Bayesian (BayesNet) classifier and the Naive Bayes classifier. Thereafter, boosting algorithms were used to boost the performance of the probabilistic classifiers. Three boosting algorithms were tested: Bagging, Boosting with Re-sampling, and Adaptive Boosting (AdaBoost), and all combinations of these algorithms with the classifiers were compared.
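Some of these combinations can be approximated with off-the-shelf ensemble wrappers; the sketch below uses scikit-learn stand-ins and is not the exact Java/Matlab setup used in the experiments.

```python
# Hedged stand-ins that mirror part of the structure of Table II; the actual
# experiments ran in a Java/Matlab environment.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.naive_bayes import MultinomialNB

combinations = {
    "NB alone":      MultinomialNB(),
    "Bagging + NB":  BaggingClassifier(MultinomialNB(), n_estimators=10),
    "AdaBoost + NB": AdaBoostClassifier(MultinomialNB(), n_estimators=50),
}
# Each estimator exposes fit(X_train, y_train) and score(X_test, y_test), so
# the comparison reduces to a loop over this dictionary.
```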
E. Evaluation
We have taken a total of 375 features for classifying 12000 Email files. The data was split into 66% training data and 34% test data: the classifiers concerned learn from the training data, and the remaining data is used for testing. We have taken the accuracy ($A_{cc}$) and the F value ($F_{H,S}$) of the classifiers for our analysis.

The $A_{cc}$ value is defined as

$$A_{cc} = \frac{S_c + H_c}{T_E} \qquad (7)$$

where $S_c$ is the total number of correctly classified Spam messages, $H_c$ is the total number of correctly classified Ham messages, and $T_E$ is the total number of Email messages. Accuracy is the percentage ratio of correctly classified Emails to total Emails; it shows the strength of a classifier, i.e. what percentage of Emails is classified correctly.

The $F_{H,S}$ value is defined as

$$F_{H,S} = \frac{2\, P_{H,S}\, R_{H,S}}{P_{H,S} + R_{H,S}} \qquad (8)$$

where $P_{H,S}$ is the Precision and $R_{H,S}$ is the Recall. We have calculated both values separately for the HAM and SPAM messages and then calculated the $F_{H,S}$ value. The final result is the weighted average of the $F_{H,S}$ values. With the help of these values, we calculate the accuracy and strength of the classifiers.
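Equation (7) is the overall accuracy, and equation (8), averaged with class weights over HAM and SPAM, corresponds to the standard weighted F1 score; a brief sketch with illustrative labels:

```python
# A short sketch of the evaluation measures in equations (7) and (8).
from sklearn.metrics import accuracy_score, f1_score

y_true = ["spam", "ham", "spam", "ham", "spam", "ham"]
y_pred = ["spam", "ham", "ham",  "ham", "spam", "spam"]

acc = accuracy_score(y_true, y_pred)                 # (S_c + H_c) / T_E
f_hs = f1_score(y_true, y_pred, average="weighted")  # weighted F_{H,S}
print(f"Accuracy = {acc:.3f}, weighted F value = {f_hs:.3f}")
```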
VII. COMPARATIVE ANALYSIS
With the 375 best features selected by the Genetic Search algorithm, Table II gives a performance comparison of the classifiers concerned with and without boosting algorithms. It also shows the strength of the genetic feature search method, which has yielded 88.1% to 92.9% accurate classification of the 12000 Email files (6000 HAM + 6000 SPAM).
If we compare the performance of the probabilistic classifiers, then in each case the Bayesian classifier performs better than Naive Bayes. It can be observed from Table II, Fig. 1 and Fig. 2 that the Bayesian classifier gives accuracy from 88.7% to 92.9%, whereas Naive Bayes gives 88.1% to 91.7%.
TABLE II: RESULTS OF CLASSIFIERS WITH(OUT) BOOSTING

Boosting Algorithm        | BayesNet (BN) Acc, F (%) | NaiveBayes (NB) Acc, F (%)
Without Boosting          | 88.8, 88.7               | 88.0, 88.1
Bagging                   | 89.2, 89.1               | 88.4, 88.4
Boosting with Re-sample   | 92.9, 92.9               | 91.7, 91.7
AdaBoost                  | 92.4, 92.3               | 91.2, 91.2
The table also shows that, without the use of boosting algorithms, both probabilistic classifiers give poorer accuracy, 88.1% to 88.7%. With the help of the boosting algorithms, the performance of these classifiers increases.
Fig. 1. Accuracy (%) of probabilistic classifiers with(out) boosting, for BN, NB, ABoost+BN, ABoost+NB, Boost+BN, Boost+NB, Bag+BN and Bag+NB.
To boost the performance of our probabilistic classifiers, we have taken three boosting algorithms. Table II, Fig. 1 and Fig. 2 clearly indicate that the Boosting with Re-sampling method gives the maximum performance improvement. Table II shows that the performance improvement with AdaBoost and Bagging is smaller than with Boosting with Re-sampling, although the AdaBoost results are close to the best. Finally, when boosted with Boosting with Re-sampling, the Bayesian classifier gives the best result, with accuracy up to 92.9%.
Fig. 2. F value (%) of probabilistic classifiers with(out) boosting, for the same classifier combinations as in Fig. 1.
VIII. CONCLUSION
In this research, it has been shown that the Bayesian classifier is a better predictor of Spam than Naive Bayes.
Both the Accuracy and the F value have been used for gauging the strength of the classifiers concerned, and we obtained very similar values from these two measurements. We have also shown that boosting algorithms play a crucial role in boosting the performance of classifiers. We have conducted three studies as part of this research to compare performance. First is the performance comparison of the probabilistic classifiers without boosting, where the Bayesian classifier showed the best result.
Second is the performance comparison of Boosting
algorithms where Boosting with resample has shown
significant strength, though the AdaBoost results were very
close to the best one. And last is the performance
comparison of Probabilistic Classifiers with Boosting where
Bayesian Classifiers performed best when used in
conjunction with boosting with resample.
As discussed previously, boosting algorithms make a significant contribution to boosting the performance of classifiers. As part of our future work, we will investigate
the effect of boosting on the performance of other classifiers.
REFERENCES
[1] J. Zdziarski, Ending spam: Bayesian content filtering and the art of
statistical language classification, San Francisco: No Starch Press,
2005.
[2] R. Schapire, Using output codes to boost multiclass learning
problems, in Proc. 14th International Conf. on Machine Learning
(ICML), Nashville, TN, 1997, pp. 313-321.
[3] J. Goodman, G. V. Cormack, and D. Heckerman, Spam and the
ongoing battle for the inbox, Communications of the ACM, vol. 50,
issue 2, pp. 24-33, February 2007.
[4] T. Mitchell, Machine Learning, New York: McGraw-Hill, 1997.
[5] D. D. Lewis and W. A. Gale, A sequential algorithm for training text
classifiers, in Proc. 17th Annu. International ACM-SIGIR Conf. on
Research and Development in Information Retrieval, 1994, pp. 3-12.
[6] D. Koller and M. Sahami, Hierarchically classifying documents
using very few words, in Proc. 14th International Conf. on Machine
Learning (ICML), Nashville, TN, 1997, pp. 170-178.
[7] P. Domingos and M. J. Pazzani, On the optimality of the simple
Bayesian classifier under zero-one loss, Mach. Learn., vol. 29, no. 2-3, pp. 103-130, 1997.
[8] E. Bauer and R. Kohavi, An empirical comparison of voting
classification algorithms: Bagging, boosting, and variants, Machine
Learning, vol. 36, issue 1-2, pp. 105-139, 1999.
[9] J. H. Holland, Adaptation in Natural and Artificial Systems, Ann
Arbor, MI: University of Michigan Press, 1975.
[10] I. Androutsopoulos, J. Koutsias, K. V. Chandrinos, G. Paliouras, and C. D. Spyropoulos, An evaluation of naive bayesian anti-spam filtering,
in Proc. Workshop on Machine Learning in the New Information Age,
11th European Conf. on Machine Learning (ECML 2000), G.
Potamias, V. Moustakis, and M. van Someren, Eds., Barcelona, Spain,
2000, pp. 9-17.
[11] X. Carreras and L. Màrquez, Boosting trees for clause splitting, in
Proc. CoNLL-2001 Shared Task, Toulouse, France, 2001.
[12] C. O'Brien and C. Vogel, Spam filters: bayes vs. chi-squared; letters
vs. words, in Proc. International Symposium on Information and
Communication Technologies, M. Aleksy, et al., Eds., 2003, pp. 298-
303.
[13] L. Zhang, J. B. Zhu, and T. S. Yao, An evaluation of statistical spam
filtering techniques, ACM Transactions on Asian Language
Information Processing (TALIP), vol. 3, issue 4, pp. 243-269,
December 2004.
[14] X. Luo and A. N. Zincir-Heywood, Evaluation of two systems on
multi-class multi-label document classification, in Proc. Foundations
of Intelligent Systems: 15th International Symposium, ISMIS (ISMIS
2005), Saratoga Springs, NY, USA, 2005.
[15] V. Metsis, I. Androutsopoulos, and G. Paliouras, Spam filtering with naive Bayes - which naive Bayes? in Proc. 3rd Conf. on Email and Anti-Spam (CEAS), 2006, pp. 125-134.
[16] C. C. Lai, An empirical study of three machine learning methods for
spam filtering, Knowledge-Based System, Elsevier, vol. 20, issue 3,
pp. 249-254, April 2007
[17] J. Chen and Z. Chen, Extended Bayesian information criterion for
model selection with large model spaces, Biometrika, vol. 95, no. 3,
pp. 759-771, 2008.
[18] K. Manjusha and R. Kumar, Spam mail classification using
combined approach of bayesian and neural network, in Proc. 2nd
International Conf. on Computational Intelligence, Communication
Systems and Networks (CICN10), 2010, pp. 145-149.
[19] S. Vohra et al., Novel approach: Naive Bayes with vector space
model for spam classification, in Proc. Nirma University
International Conference on Engineering (NUiCONE), IEEE
Conference, 2011.
[20] N. K. Korada, N. S. P. Kumar, and Y. V. N. H. Deekshitulu,
Implementation of Naive Bayesian Classifier and Ada-Boost
Algorithm Using Maize Expert System, International Journal of
Information Sciences and Techniques (IJIST), vol. 2, no. 3, pp. 63-75,
2012.
[21] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed., John
Wiley & Sons, 2001.
[22] B. Efron, The jack-knife, the bootstrap and other re-sampling plans,
CBMS-NSF Regional Conference Series in Applied Mathematics, no.
38, 1982.
[23] L. Breiman, Bagging predictors, Machine Learning, vol. 24, issue 2,
pp. 123-140, August 1996.
[24] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical
Learning, 2nd ed., Springer, 2001.
[25] Y. Freund and R. Schapire, Experiments with a new boosting
algorithm, in Proc. 13th International Conf. on Machine Learning
(ICML), Bari, Italy, 1996, pp. 148-156.
[26] D. D. Lewis, Naive (Bayes) at forty: The independence assumption
in information retrieval, in Proc. of 10th European Conference on
Machine Learning (ECML-98), 1998, pp. 4-15.
[27] T. Joachims, Text categorization with support vector machines:
Learning with many relevant features, in Proc. 10th European
Conference on Machine Learning (ECML-98), 1998, pp. 137-142.
[28] D. Koller, and M. Sahami, Hierarchically classifying documents
using very few words, in Proc. 14th International Conf. on Machine
Learning (ICML), D.H. Fisher, Ed., pp. 170-178, Morgan Kaufmann,
San Francisco, 1997.
[29] L. S. Larkey and W. B. Croft, Combining classifiers in text
categorization, in Proc. 19th Annual Conf. Research and
Development in Information Retrieval (SIGIR-96), 1996, pp. 289-297.
[30] A. Sharma and S. Dey, A comparative study of feature selection and
machine learning techniques for sentiment analysis, in Proc. 2012
ACM Research in Applied Computation Symposium, San Antonio,
Texas, 2012.
[31] J. H. Holland, Adaptation in Natural and Artificial Systems, Ann
Arbor, MI: University of Michigan Press, 1975.
[32] H. Vafaie and I. F. Imam, Feature selection methods: Genetic
algorithms vs. greedy-like search, in Proc. 3rd International Fuzzy
Systems and Intelligent Control Conference, 1994.
[33] K. Aas and L. Eikvil, Text categorisation: A survey, technical
report, Norwegian Computing Centre, June 1999.
Shrawan Kumar Trivedi is a research fellow of
information systems at Indian Institute of Management
Indore, India. He completed his Master of Technology
in Information Technology from Indian Institute of
Information Technology (IIIT-Allahabad) and Master
of Science in Electronics and Communication from
UIET, CSJM University Kanpur, India. Currently he is
working on Data Mining.
Shubhamoy Dey is an associate professor of information systems at Indian Institute of Management Indore, India. He completed his Ph.D. from the School of Computing, University of Leeds, UK, and Master of Technology from Indian Institute of Technology (IIT Kharagpur). He specializes in
Data Mining and has 25 years of research,
consulting and teaching experience in UK, USA and
India.