A Comparative Performance Evaluation of Content Based Spam and Malicious URL Detection in E-Mail
E-mail is becoming the fastest and most economical mode of communication. The growing use of email has led to an increased rate of spam emails. In the information age, users rely on email to communicate with the world. Business organizations, individuals and corporate industries all communicate by email, so it is an important part of education, business and personal usage.

Spam:
Spam is unsolicited bulk email (UBE); a part of it is unsolicited commercial email. Spam emails not only consume the user's time and the energy needed to recognize the undesired messages, but also waste network bandwidth.
Content Based Spam Filter:
A content based filter works on the content of emails, i.e., text, URLs and main headers such as the subject, for classification purposes. It is the method used to filter spam.
An email has two parts: the body of the message and the header. The header stores information about the message, such as the sender and the date and time at which the email was received. Ambiguous data in the email is removed by preprocessing, and then the text is extracted.

[...] features, page rank and host information are taken into consideration to classify URLs. The Phishtank corpus has been used, and Bayesian classification is done to improve the performance of the system [1].
Georgios Paliouras et al. have presented a learning method to filter spam email. Two machine learning algorithms, Naïve Bayesian and a memory based learning approach, are considered for anti-spam filtering and compared in terms of performance. In both methods spam filtering accuracy improved relative to the keyword based filters that are widely used for email [2].
Zhan Chuan and Lu Xian-liang have given an application of a new improved Bayesian filter to email filtering. They represent word frequency by vector weights, use word entropy for attribute selection, and derive a formula that improves performance [3].
Vikas P. Deshpande et al. have presented an efficient naïve Bayesian method that aims to block spam emails without blocking legitimate emails. To address this problem, they considered statistical classifiers such as the naïve Bayesian anti-spam filter and content based spam filters, which are adaptive in nature [4].
Sheng et al. have shown that phishing websites must be acted upon as soon as they are identified, because phishing campaigns have an average life of two hours. To identify and block such phishing URLs, features such as suspicious characters, number of dots, IP address and hexadecimal characters are extracted [5].
2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS)
III. ALGORITHM STUDY
A. Bayesian Classifier:
The Naïve Bayes classifier is a statistical classifier well known for email filtering, in which spam emails are identified by classification. Naïve Bayes uses the tokens (words) found in spam and ham mails to calculate the probability that a mail is spam or not.
Mathematical Formulation:
The Bayesian classifier is based on Bayes' theorem. Despite its simplicity, Naïve Bayes can often outperform more sophisticated classification methods.
To demonstrate the concept, consider the following equations [11]. Thus, we can write:
Prior probability of Legitimate mail = Number of legitimate mail / Total number of mail (3.1)
Prior probability of Spam mail = Number of spam mail / Total number of mail (3.2)
Likelihood of X-mail given Legitimate = Number of legitimate mail in the vicinity of X-mails / Total number of legitimate mail (3.3)
Likelihood of X-mail given Spam = Number of spam mail in the vicinity of X-mails / Total number of spam mail (3.4)
Posterior probability of X-mail being legitimate = Prior probability of legitimate mail × Likelihood of X-mail given Legitimate

IV. EXPERIMENT
A. Implementation using Bayesian Classifier:
1) Gmail Dataset and SpamAssassin Dataset:
This dataset combines real-time emails downloaded from Gmail with a bulk set of emails from SpamAssassin, and consists of legitimate and spam emails. These emails, in HTML format, are the input to preprocessing.
2) Text Preprocessing:
a) HTML Tag Removal:
The input emails are in HTML format and therefore contain tags; to purify the text, the tags need to be removed.
b) Stopword Removal:
Terms on the stopword list are removed. The list consists of articles, prepositions, conjunctions and certain high frequency words (such as some verbs and adverbs).
c) Tokenization:
Lexical analysis, also called tokenization,
involves dividing the content of the text into strings of characters called tokens. Filtering techniques remove white space (blanks) and punctuation symbols when tokenizing.
d) Word Frequency:
This counts the frequency of each word according to its occurrences, which helps in deriving the word probability for spam and legitimate mails.
Term Frequency:
The term frequency (TF) of a term is its overall frequency in the entire corpus, i.e., across all email instances. To calculate the TF score, the frequencies of terms in individual emails were first calculated, and then all the frequencies of a term across the entire set of emails were added to find the TF score for a particular term t_k. Mathematically it can be expressed as
$$\mathrm{TF}(t_k) = \sum_{i=1}^{N} f(t_k, e_i) \qquad (4.1)$$
where f(t_k, e_i) is the frequency of term t_k in email e_i and N is the number of emails.
Terms with a low TF score are eliminated and those with a high score are selected.
3) Bayesian Classifier:
This is a method used for the classification of text, and it gives an efficient learning algorithm for data mining. It uses Bayes' theorem together with the conditional independence assumption, that is, the tokens of an email are assumed to be independent of one another given the class.
4) Performance Measurement:
Performance is evaluated in terms of Accuracy, Error, Time, Precision and Recall for the base method using the Bayesian Classifier.

DMOZ: It is used to obtain genuine, legitimate URLs of web links for the dataset of legitimate (non-phishing) URLs.
2) URL Preprocessing:
a) IP Address:
IP addresses and hexadecimal characters are used to hide the actual URLs. For example, the URL http://www.bankingcompany.com/online/transaction/website/phishing.html can be shortened using the IP address http://132.115.201.115, which looks legitimate rather than suspicious.
b) Hexadecimal Character:
A URL can also be written using hexadecimal values prefixed with a '%' symbol, which can represent any special character. SpoofGuard identified the '@' and '-' symbols as the most prominent in phishing URLs. In a URL, the '@' symbol acts as a separator: the part to its left is discarded and the part to its right is used as the destination. For example, the URL http://www.citibank.com@phishingsite.com leads to "phishingsite.com" and discards "www.citibank.com". Such methods mask the phishing site and make it pass as a legitimate site.
c) Suspicious Character:
The presence of the '@' symbol or of other special characters (such as '.', '=', '$' and '^') in either the host name or the path name is treated as suspicious.
d) Number of Dots:
The number of dots in each URL found in an email is counted and used to predict whether the URL is malicious or legitimate.
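The four URL features described above (IP address, hexadecimal characters, suspicious characters and number of dots) can be sketched in a few lines. This is only an illustration in Python using the standard library, not the authors' implementation: the function name `url_features`, the feature names and the exact set in `SUSPICIOUS_CHARS` are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit

# Assumed set of suspicious characters; '.' is counted separately as num_dots.
SUSPICIOUS_CHARS = set("@=$^-")


def url_features(url: str) -> dict:
    """Return simple lexical features for one URL (illustrative sketch)."""
    parts = urlsplit(url)
    # hostname drops any "user@" prefix and any port, leaving the real host
    host = parts.hostname or ""
    return {
        # Host written as a dotted-quad IPv4 address instead of a domain name
        "has_ip_host": bool(re.fullmatch(r"(\d{1,3}\.){3}\d{1,3}", host)),
        # Percent-encoded (hexadecimal) characters such as %40
        "has_hex": bool(re.search(r"%[0-9a-fA-F]{2}", url)),
        # '@' in the authority part: the text before it is not the real host
        "has_at": "@" in parts.netloc,
        # How many characters from the assumed suspicious set occur in the URL
        "num_suspicious": sum(ch in SUSPICIOUS_CHARS for ch in url),
        # Total number of dots in the URL
        "num_dots": url.count("."),
    }
```

For instance, `url_features("http://132.115.201.115")` flags an IP-address host, and a URL containing '@' in its authority part sets `has_at`. A feature dictionary of this kind could then be passed to a classifier such as a decision tree; how the features are weighted or combined is left open here.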
5) Performance Measurement:
Because the combination classification model is built from the Bayesian classifier and Decision Tree C4.5, it is essential to measure its performance. Accuracy (correctly classified instances), Error (incorrectly classified instances), Precision and Recall are evaluated.
Where,
TN: True Negative, Legitimate predicted as Legitimate
TP: True Positive, Spam predicted as Spam
FP: False Positive, Legitimate predicted as Spam
FN: False Negative, Spam predicted as Legitimate
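As a minimal sketch, the four measures can be computed from the counts defined above using the standard confusion-matrix definitions, with spam as the positive class. The function name `classification_metrics` is hypothetical, and the counts in the usage note are made up for illustration; they are not results from this paper.

```python
def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Standard confusion-matrix measures, with spam as the positive class."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one classified instance is required")
    return {
        # Fraction of all mails classified correctly
        "accuracy": (tp + tn) / total,
        # Fraction of all mails classified incorrectly (1 - accuracy)
        "error": (fp + fn) / total,
        # Of the mails predicted as spam, the fraction that really is spam
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        # Of the real spam mails, the fraction that was caught
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

With illustrative counts `tp=40, tn=50, fp=5, fn=5`, the accuracy is 90/100 and the error is 10/100, while precision and recall are both 40/45. Returning 0.0 when a denominator is zero is a design choice of this sketch; raising an error would be equally reasonable.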
A. Computation of system's efficiency under different volumes of the dataset for the combination approach using Bayesian Classifier and Decision Tree (C4.5) Classifier:
Fig. 3. Error of the implementation for different volumes of the datasets
Fig. 4. Time taken for the implementation for different volumes of the datasets
Fig. 6. Recall of the implementation for different volumes of the datasets
B. Tabular Results:
TABLE I
Implementation Results using Bayesian Classifier
TABLE II
Implementation Results using Combination of Bayesian
Classifier and Decision Tree C4.5 Classifier
TABLE III
Comparative Performance Evaluation of A) Implementation using Bayesian Classifier and B) Implementation using Bayesian Classifier and Decision Tree (C4.5) Classifier
Where A is the Bayesian Classifier and B is the Bayesian and C4.5 Classifier.

VI. CONCLUSIONS
We have integrated content based spam detection using the Bayesian Classifier with phishing URL detection using Decision Tree C4.5. We found that the performance of the combination approach of Bayesian classifier and Decision Tree C4.5 is improved compared with the implementation using content based spam detection by the Bayesian Classifier alone.
We have evaluated the results across different volumes of the dataset. The implementation using the Bayesian classifier gives 94.86% accuracy, whereas the combination approach of Bayesian Classifier and Decision Tree C4.5 gives 95.54% accuracy. We can therefore say that the combination approach has improved the results in terms of accuracy, and that it is an efficient method for content based spam detection and malicious URL detection in integrated form.

ACKNOWLEDGMENT
We are sincerely grateful to all the persons who helped us through this work to make it successful.

REFERENCES
[1] Dhanalakshmi Ranganayakulu and Chellappan C., "Detecting malicious URLs in E-Mail - An implementation", in AASRI Conference on Intelligent Systems and Control, Vol. 4, pp. 125-131, 2013.
[2] G. Paliouras et al., "An Evaluation of Naive Bayesian Anti-Spam Filtering", in Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17, 2000.
[3] Zhan Chuan et al., "An Improved Bayesian with Application to Anti-Spam Email", in Journal of Electronic Science and Technology of China, Vol. 3, No. 1, Mar. 2005.
[4] Vikas P. Deshpande and Robert F. Erbacher, "An Evaluation of Naïve Bayesian Anti-Spam Filtering Techniques", in Proceedings of the 2007 IEEE Workshop on Information Assurance, United States Military Academy, West Point, NY, 20-22 June 2007.
[5] Sheng, S. et al., "An empirical analysis of phishing blacklists", in Proceedings of the CEAS'09, 2009.
[6] Pawan Prakash et al., "PhishNet: Predictive Blacklisting to Detect Phishing Attacks", in Proceedings of the IEEE Infocom, pp. 1-5, 2010.
[7] Maher Aburrous et al., "Experimental Case Studies for Investigating E-Banking Phishing Techniques and Attack Strategies", Cognitive Computation, DOI 10.1007/s12559-010-9042-7, Vol. 2, pp. 242-253, 2010.
[8] Congfu Xu et al., "An approach to image spam filtering based on base64 encoding and N-Gram feature extraction", in IEEE International Conference on Tools with Artificial Intelligence, DOI 10.1109/ICTAI.2010.31, 2010.
[9] R. Malathi, "Email Spam Filter using Supervised Learning with Bayesian Neural Network", Computer Science, H.H. The Rajah's College, Pudukkottai-622 001, Tamil Nadu, India, Int J Engg Techsci, Vol. 2(1), pp. 89-100, 2011.
[10] Sadeghian, A. and Ariaeinejad, R., "Spam detection system: A new approach based on interval type-2 fuzzy sets", in IEEE CCECE -000379, 2011.
[11] Xiang, G. et al., "CANTINA+: A feature-rich machine learning framework for detecting phishing Web sites", in ACM Trans. Inf. Syst. Secur., Vol. 14, No. 2, pp. 1-21, 2011.
[12] Naïve Bayes Classifier. (2014, Dec) [Online]. Available: http://www.statsoft.com/textbook/naive-bayes-classifier