Spam Filtering Algorithm Analysis
Abstract In this e-world, most transactions and business are carried out through e-mail. E-mail has become a powerful tool for communication as it saves a great deal of time and cost. However, owing to social networks and advertisers, many e-mails contain unwanted information, called spam. Although many algorithms have been developed for e-mail spam classification, none of them achieves 100% accuracy in classifying spam e-mails. Current server-side anti-spam filters are made up of several modules, each aimed at detecting a different feature of spam e-mails. In particular, text categorization techniques have been investigated for the design of modules that analyze the semantic content of e-mails, because of their potentially higher generalization capability compared with the manually derived classification rules used in current server-side filters. This paper presents a comprehensive study of spam detection algorithms in the category of content-based filtering. The implemented results have been benchmarked to analyze how accurately the messages are classified into their original categories of spam and ham. Further, a new dynamic aspect has been added: run-time application of the Naive Bayes and J48 decision tree algorithms to data fed dynamically from the mail server, for more efficient results.
Keywords Spam, Naive Bayes, KNN, Blacklist, Whitelist, J48 Decision Tree, Naive Bayes Multinomial.
I. INTRODUCTION
Owing to the intensive use of the internet, e-mail has become one of the fastest and most economical modes of communication. It enables internet users to transfer information easily, from anywhere in the world, in a fraction of a second. However, the growth in the number of e-mail users has caused a dramatic increase in spam e-mails over the past few years. E-mail spam, also known as junk e-mail or unsolicited bulk e-mail (UBE), is a subset of spam that delivers nearly identical messages to numerous recipients by e-mail. Definitions of spam usually include the aspects that the e-mail is unsolicited and sent in bulk. E-mail spam has grown steadily since the early 1990s. Botnets, networks of virus-infected computers, are used to send about 80% of spam. Spammers collect e-mail addresses from chat rooms, websites, customer lists, newsgroups, and viruses that harvest users' address books, and these addresses are sold to other spammers. Since the cost of spam is borne mostly by the recipient, many individuals and businesses send bulk messages in the form of spam. The volume of spam e-mails strains Information Technology based organizations and creates billions of dollars of losses in terms of productivity. In recent years, spam e-mail has grown into a serious security threat and acts as a prime medium for phishing of sensitive information. In addition, it spreads malicious software to various users. Therefore, e-mail classification, which automatically separates legitimate e-mails from spam, has become an important research area. Automatic e-mail spam classification poses many challenges because of unstructured information, a large number of features, and a large number of documents. As usage increases, all of these factors may adversely affect performance in terms of quality and speed. Many recent algorithms use only the relevant features for classification. Even though many classification techniques have been developed for spam classification, 100% accuracy in predicting spam e-mail remains questionable, so identifying the best spam-filtering algorithm is itself a tedious task, because every algorithm has features and drawbacks relative to the others.
Some indicative spam statistics:
Daily spam e-mails sent: 12.4 billion
Daily spam received per person: 6
Annual spam received per person: 2,200
Spam cost to all non-corporate internet users: $255 million
Spam cost to all U.S. corporations in 2002: $8.9 billion
E-mail address changes due to spam: 16%
Annual spam received in a 1,000-employee company: 2.1 million
Users who reply to spam e-mail: 28%
It should also be noted that all three distance measures are valid only for continuous variables. For categorical variables, the Hamming distance must be used instead. This also raises the issue of standardizing the numerical variables between 0 and 1 when the dataset contains a mixture of numerical and categorical variables, as in the sketch below.
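As a minimal sketch in Java (the method names and the min-max scaling convention are ours, not from any particular library), the continuous distance, the categorical distance, and the standardization step might look like this:

// Euclidean distance between two numeric feature vectors
// (valid only for continuous variables, as noted above).
static double euclidean(double[] a, double[] b) {
    double sum = 0.0;
    for (int i = 0; i < a.length; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return Math.sqrt(sum);
}

// Hamming distance for categorical variables: the number of
// attributes on which the two instances disagree.
static int hamming(String[] a, String[] b) {
    int mismatches = 0;
    for (int i = 0; i < a.length; i++) {
        if (!a[i].equals(b[i])) mismatches++;
    }
    return mismatches;
}

// Min-max standardization of one numeric value to [0, 1], so numeric
// and 0/1 categorical contributions are on a comparable scale.
static double minMaxScale(double x, double min, double max) {
    return (max > min) ? (x - min) / (max - min) : 0.0;
}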
Choosing the optimal value of K is best done by first inspecting the data. In general, a larger value of K is more precise, as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to determine a good value of K retrospectively (a sketch follows). Historically, the optimal K for most datasets has been between 3 and 10, which produces much better results than 1-NN.
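A sketch of the cross-validation route, using Weka's IBk k-NN implementation (the file name spam.arff is a placeholder for a labelled dataset in ARFF format):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ChooseK {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("spam.arff");   // placeholder dataset
        data.setClassIndex(data.numAttributes() - 1);    // last attribute = class

        // Score each candidate K in the historically useful 3-10 range.
        for (int k = 3; k <= 10; k++) {
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(new IBk(k), data, 10, new Random(1)); // 10-fold CV
            System.out.printf("K=%d  accuracy=%.2f%%%n", k, eval.pctCorrect());
        }
    }
}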
III. IMPLEMENTED WORK
A. Naive Bayes
The Bayesian approach is a fundamentally important data mining technique. The Bayes classifier can provably achieve the optimal result when the probability distribution is given. The Bayesian method is based on probability theory. A Bayesian filter learns a spam classifier from a set of manually classified examples of spam and legitimate (or ham) messages, i.e. the training collection. This training collection is taken as the input for the learning process, which consists of the following steps (a Weka sketch of steps 1-4 follows the list):
1) Pre-processing: Deletion of irrelevant elements (e.g. HTML) and selection of the segments suitable for processing (e.g. headers, body).
2) Tokenization: Dividing the message into semantically coherent segments (e.g. words, other character strings).
3) Representation: Conversion of a message into an attribute-value-pair vector [10], where the attributes are the previously defined tokens and their values can be binary, (relative) frequencies, etc.
4) Selection: Statistical deletion of the less predictive attributes (using, e.g., quality metrics like Information Gain).
5) Learning: Automatically building a classification model (the classifier) from the collection of messages. The shape of the classifier depends on the learning algorithm used, ranging from decision trees (C4.5) and classification rules (Ripper) to statistical linear models (Support Vector Machines, Winnow), neural networks, genetic algorithms, etc.
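A minimal sketch of steps 1-4 using Weka's StringToWordVector filter (the file name emails.arff is a placeholder for a collection with one string attribute holding the message body plus a spam/ham class label; the parameter values are illustrative):

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class Preprocess {
    public static void main(String[] args) throws Exception {
        Instances raw = DataSource.read("emails.arff");  // placeholder collection
        raw.setClassIndex(raw.numAttributes() - 1);

        StringToWordVector filter = new StringToWordVector();
        filter.setLowerCaseTokens(true);  // tokenization: case-normalized words
        filter.setWordsToKeep(1000);      // selection: keep the most frequent tokens
        filter.setTFTransform(true);      // representation: log-scaled term frequencies
        filter.setIDFTransform(true);     // weight terms by inverse document frequency
        filter.setInputFormat(raw);

        // Representation: each message becomes an attribute-value vector.
        Instances vectors = Filter.useFilter(raw, filter);
        System.out.println(vectors.numAttributes() + " attributes created");
    }
}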
B. Naive Bayesian Classifiers
Naive Bayes can often outperform more sophisticated classification methods.
The class-conditional probability of encountering the text x can be calculated as the product of the likelihoods of the individual words (under the naive assumption of conditional independence):

P(x \mid \omega_j) = P(x_1 \mid \omega_j) \cdot P(x_2 \mid \omega_j) \cdots P(x_m \mid \omega_j) = \prod_{i=1}^{m} P(x_i \mid \omega_j)
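In practice this product is computed in log space, since multiplying many small probabilities underflows double precision. A minimal sketch (method and parameter names are ours):

// condProbs[i] = P(x_i | class), the per-word likelihoods from the equation above.
static double logLikelihood(double[] condProbs) {
    double logP = 0.0;
    for (double p : condProbs) {
        logP += Math.log(p);  // log of a product = sum of logs
    }
    return logP;              // compare classes by log-likelihood instead of likelihood
}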
1) Term Frequency-Inverse Document Frequency (Tf-idf): The term frequency - inverse document frequency (Tf-idf) is
another alternative for characterizing text documents. It can be understood as a weighted term frequency, which is
especially useful if stop words have not been removed from the text corpus. The Tf-idf approach assumes that the
importance of a word is inversely proportional to how often it occurs across all documents. Although Tf-idf is most
commonly used to rank documents by relevance in different text mining tasks, such as page ranking by search engines, it
can also be applied to text classification via naive Bayes.
\text{Tf-idf}(t, d) = \mathrm{tf}_n(t, d) \cdot \mathrm{idf}(t)

Let \mathrm{tf}_n(t, d) be the normalized term frequency, and \mathrm{idf}(t) the inverse document frequency, which can be calculated as

\mathrm{idf}(t) = \log \frac{n_d}{\mathrm{df}(t)}

where n_d is the total number of documents and \mathrm{df}(t) is the number of documents that contain the term t.
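A sketch of the computation under these definitions (names are illustrative). For example, a term occurring 3 times in a 100-word message and appearing in 10 of 1,000 messages gives tf_n = 0.03 and idf = ln(100) ≈ 4.61, so Tf-idf ≈ 0.14.

// Tf-idf(t, d) = tfn(t, d) * idf(t), with idf(t) = ln(numDocs / docFreq).
static double tfIdf(int countInDoc, int docLength, int numDocs, int docFreq) {
    double tfn = (double) countInDoc / docLength;       // normalized term frequency
    double idf = Math.log((double) numDocs / docFreq);  // inverse document frequency
    return tfn * idf;
}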
C. Naive Bayes Multinomial
Technical specifications: in Weka, this classifier is implemented by the following class hierarchy.
java.lang.Object
  weka.classifiers.AbstractClassifier
    weka.classifiers.bayes.NaiveBayesMultinomial
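A sketch of training and evaluating this class on the word-vector data produced earlier (vectors.arff is a placeholder for the filtered dataset):

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainNBM {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("vectors.arff"); // placeholder word vectors
        data.setClassIndex(data.numAttributes() - 1);

        NaiveBayesMultinomial nbm = new NaiveBayesMultinomial();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nbm, data, 10, new Random(1)); // 10-fold CV
        System.out.println(eval.toSummaryString("NaiveBayesMultinomial:", false));
    }
}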
D. J48
The C4.5 algorithm generates decision trees and is an extension of Quinlan's earlier ID3 algorithm. The trees it produces are used for classification, and for this reason C4.5 is often called a statistical classifier. J48 is the Weka implementation of C4.5.
1) Algorithm: C4.5 builds decision trees from a set of training data in the same way as ID3, using the concept of information entropy. The training data is a set of already classified samples. Each sample consists of a p-dimensional vector whose components represent the attributes or features of the sample, together with the class into which it falls.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively splits the set of samples into subsets enriched in one class or the other. The splitting criterion is the normalized information gain (difference in entropy). The attribute with the highest normalized information gain is chosen to make the decision, and the C4.5 algorithm then recurses on the smaller sublists.
This algorithm has a few base cases. First, if all the samples in the list belong to the same class, C4.5 simply creates a leaf node for the decision tree saying to choose that class. Second, if none of the features provides any information gain, C4.5 creates a decision node higher up the tree using the expected value of the class. Third, if an instance of a previously unseen class is encountered, C4.5 again creates a decision node higher up the tree using the expected value.
2) Pseudo Code: In pseudo code, the general algorithm for building decision trees is (an information-gain sketch follows the list):
a) Check for the base cases described above.
b) For each attribute a, find the normalized information gain ratio from splitting on a.
c) Let a_best be the attribute with the highest normalized information gain.
d) Create a decision node that splits on a_best.
e) Recurse on the sublists obtained by splitting on a_best, and add those nodes as children of the current node.
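A sketch of the splitting criterion (plain information gain; C4.5 proper normalizes this by the split information to obtain the gain ratio). Class counts stand in for the sample lists:

// Shannon entropy of a class distribution; counts[i] = samples of class i.
static double entropy(int[] counts) {
    int total = 0;
    for (int c : counts) total += c;
    double h = 0.0;
    for (int c : counts) {
        if (c == 0) continue;
        double p = (double) c / total;
        h -= p * Math.log(p) / Math.log(2);  // log base 2, bits
    }
    return h;
}

// Information gain = H(parent) - weighted average H(children) for one split.
static double infoGain(int[] parent, int[][] children) {
    int total = 0;
    for (int c : parent) total += c;
    double remainder = 0.0;
    for (int[] child : children) {
        int size = 0;
        for (int c : child) size += c;
        remainder += (double) size / total * entropy(child);
    }
    return entropy(parent) - remainder;
}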
IV. RESULTS
The implemented algorithms were benchmarked on a dataset of 40 instances:

Algorithm                  Correctly Classified   Incorrectly Classified   Success Percentage   Dataset Size
                           (No. of Instances)     (No. of Instances)       (%)                  (No. of Instances)
Naive Bayes                32                     8                        80%                  40
J48 Decision Tree          40                     0                        100%                 40
Naive Bayes Multinomial    40                     0                        100%                 40
COMPARISON OF PERFORMANCE

Parameters                        J48 Decision Tree   Naive Bayes Multinomial
Mean absolute error               0                   0.0027
Root mean squared error           0                   0.0192
Relative absolute error           0%                  0.7853%
Root relative squared error       0%                  4.6943%
Coverage of cases (0.95 level)    100%                100%
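These statistics are the ones Weka's Evaluation class reports after cross-validation; a sketch of extracting them from an Evaluation object such as the one built in the Naive Bayes Multinomial example above:

// eval is a weka.classifiers.Evaluation after crossValidateModel(...).
static void printErrorStatistics(weka.classifiers.Evaluation eval) throws Exception {
    System.out.printf("Mean absolute error          %.4f%n", eval.meanAbsoluteError());
    System.out.printf("Root mean squared error      %.4f%n", eval.rootMeanSquaredError());
    System.out.printf("Relative absolute error      %.4f %%%n", eval.relativeAbsoluteError());
    System.out.printf("Root relative squared error  %.4f %%%n", eval.rootRelativeSquaredError());
}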
After comparing the results, we reached the conclusion that the Naive Bayes Multinomial and J48 Decision Tree approaches were almost equally efficient, but of the two, the J48 Decision Tree approach was the best, owing to the zero error values calculated above.
V. CONCLUSION
Various techniques of spam filtering have been studied and analyzed, and the implemented results are reported in the tables above. According to our research, the most efficient technique is the J48 Decision Tree when applied to static data; when the data is fed dynamically at run time, Naive Bayes is the best choice.
REFERENCES
[1]
[2]
[3]
[4]
[5]
[6]
[7]
[8]
[9]