2015 IEEE International Conference on Computer Graphics, Vision and Information Security (CGVIS)

A Comparative Performance Evaluation of Content Based Spam and Malicious URL Detection in E-mail
Sunil B. Rathod
PG Student, Department of Computer Engineering, North Maharashtra University,
SES’s R. C. Patel Institute of Technology, Shirpur, India

Tareek M. Pattewar
Assistant Professor, Department of Computer Engineering, North Maharashtra University,
SES’s R. C. Patel Institute of Technology, Shirpur, India

978-1-4673-7437-8/15/$31.00 ©2015 IEEE

Abstract: E-mail communication is growing rapidly. An e-mail contains text and URLs as content. The text can be suspicious when it comes from an undesired sender and carries unwanted content, and the URLs may be malicious, redirecting users to phishing websites. To stop such activity, a spam and malicious URL detection system is required, which benefits users by removing spam content and malicious URLs from e-mail. We have used a data mining approach, supervised classification, which improves the system's accuracy and detects more spam and malicious URLs.

Keywords: Bayesian Classifier, Decision Tree, Malicious URL Detection, Spam Detection.

I. INTRODUCTION

E-mail is becoming the fastest and most economical mode of communication. The growing use of e-mail has led to an increased rate of spam e-mails. In the information age, users rely on e-mail to communicate with the world. Business organizations, individuals and corporate industries all communicate by e-mail, so it is an important part of education, business and personal usage.

Spam:
Spam is unsolicited bulk e-mail (UBE); one part of it is unsolicited commercial e-mail. Spam e-mails consume the user's time and the effort needed to recognize the undesired messages, and they also waste network bandwidth.

Content Based Spam Filter:
A content based filter works on the content of e-mails, i.e., the text, the URLs and main headers such as the subject, for classification. It is the method used to filter spam.
An e-mail has two parts, the header and the body of the message. The header stores information about the message, such as the sender and the date and time at which the e-mail was received. Ambiguous data in the e-mail is removed by preprocessing, and then the text is extracted.

II. RELATED WORK

Today's internet suffers from a major problem known as e-mail spam. It annoys users and causes financial damage to companies. The techniques developed so far to stop spam are filtering methods. Spam e-mails are UBE, also known as junk e-mails, which are sent to many recipients who have not requested or subscribed to them. A spam filter removes spam or unwanted messages from the e-mail inbox. Spam may also contain phishing URLs, which redirect users to phishing websites that seek personal credentials such as usernames and passwords for financial gain.

Dhanalakshmi R and Chellappan C implemented malicious URL detection in e-mail. Lexical features, page rank and host information are taken into consideration to classify URLs. The Phishtank corpus is used, and Bayesian classification is applied to improve the performance of the system [1].

Georgios Paliouras et al. presented a learning method to filter spam e-mail. Two machine learning algorithms are considered for anti-spam filtering, Naïve Bayesian and a memory based learning approach, and they are compared in terms of performance. With both methods spam filtering accuracy improved; keyword based filters are widely used for e-mail [2].

Zhan Chuan and Lu Xian-liang presented an application for e-mail filtering using a new improved Bayesian filter. They represent word frequency by vector weights and use word entropy for attribute selection, and then derive a formula that noticeably improves performance [3].

Vikas P. Deshpande et al. presented an efficient Naïve Bayesian method which blocks all spam e-mails without blocking legitimate e-mails. To solve this problem, they considered statistical classifiers such as the Naïve Bayesian anti-spam filter and the content based spam filter, which are adaptive in nature [4].

Sheng et al. have shown that phishing websites are taken down soon after they are identified, as phishing campaigns have an average life of two hours. To identify and block such phishing URLs, they extracted features such as suspicious characters, number of dots, IP address and hexadecimal characters [5].

Pawan et al. discovered malicious URLs by enhancing blacklisting. One drawback of blacklisting methods is that their update process is slow, so they fail to identify phishing URLs in the early hours of a phishing attack [6].

Maher Aburrous et al. carried out a survey to identify the essential features that can improve accuracy and precision in malicious URL detection [7].

Congfu Xu et al. performed feature extraction on the Base64 encoding of images with the n-gram technique. An SVM is trained to efficiently distinguish spam images from legitimate images. Their experiments show improved performance in terms of accuracy, precision and recall [8].

R. Malathi et al. proposed a new spam detection method that employs text categorization, using supervised learning with a Bayesian neural network, which uses a rule based heuristic approach and statistical analysis tests to identify spam [9].

Sadeghian A. et al. presented spam detection based on interval type-2 fuzzy sets. This system gives the user more control over the categories of spam and permits personalization of the spam filter [10].

CANTINA+ classifies phishing URLs with a more exhaustive feature set and obtained a classification accuracy of 92.3%. Various related studies and case studies have analyzed the feature set required in order to reduce the exhaustiveness and time consumption [11].

III. ALGORITHM STUDY

A. Bayesian Classifier:
The Naïve Bayes classifier is a statistical classifier well known for e-mail filtering, in which spam e-mails are identified by classification. Naïve Bayes uses the tokens (words) found in spam and ham mails to calculate the probability that a mail is spam or not.

Mathematical Formulation:
The Bayesian classifier is based on Bayes' theorem. Despite its simplicity, Naïve Bayes can compete with more sophisticated classification methods.
To demonstrate the concept, consider the following equations [12].
Thus, we can write:
Prior probability of legitimate mail = Number of legitimate mail / Total number of mail (3.1)
Prior probability of spam mail = Number of spam mail / Total number of mail (3.2)
Likelihood of X-mail given legitimate = Number of legitimate mail in the vicinity of X-mail / Total number of legitimate mail (3.3)
Likelihood of X-mail given spam = Number of spam mail in the vicinity of X-mail / Total number of spam mail (3.4)
Posterior probability of X-mail being legitimate = Prior probability of legitimate mail × Likelihood of X-mail given legitimate (3.5)
Posterior probability of X-mail being spam = Prior probability of spam mail × Likelihood of X-mail given spam (3.6)
Finally, we classify X-mail as spam if that class membership has the larger posterior probability.

B. Decision Tree C4.5:
C4.5 was developed by Ross Quinlan. It is an extension of ID3 and is also known as a statistical classifier. As the successor of ID3, C4.5 builds decision trees in the same way as ID3, using the concept of information entropy, which is a measure of the homogeneity of a learning set. At each node of the tree, C4.5 selects the attribute used to divide its set of samples into subsets. Normalized information gain is the criterion for splitting the data, where information gain is the difference in information entropy associated with an attribute. The attribute with the highest normalized information gain is chosen to make the decision.
The performance of the system can be derived from Accuracy and Error Rate as follows:

(3.7)

(3.8)

IV. EXPERIMENT

A. Implementation using Bayesian Classifier:

1) Gmail Dataset and SpamAssassin Dataset:
This is a combination of a real-time dataset downloaded from Gmail and a bulk set of e-mails from SpamAssassin, consisting of legitimate and spam e-mails. These e-mails, in HTML format, are the input to preprocessing.

2) Text Preprocessing:

a) HTML Tag Removal:
The input e-mails are in HTML format and therefore contain tags, so to purify the text we need to remove the tags.

b) Stopword Removal:
Stopwords are removed using a stopword list, which consists of terms including articles, prepositions, conjunctions and certain high frequency words (such as some verbs and adverbs).

c) Tokenization:
Lexical analysis, also called tokenization,
involves dividing the text content into strings of characters called tokens. Tokenizing removes white space (blanks) and punctuation symbols.

d) Word Frequency:
This counts how often each word occurs, which helps in deriving the word probabilities for spam and legitimate mails. It is also referred to as term frequency.

Term Frequency:
The term frequency (TF) of a term is its overall frequency in the entire corpus, i.e., across all e-mail instances. To calculate the TF score, the frequencies of terms in individual e-mails are first calculated, and then all the frequencies of a term across the entire set of e-mails are added to find the TF score for a particular term tk. Mathematically it can be expressed as

(4.1)

Terms with a low TF score are eliminated and those with a high score are selected.

3) Bayesian Classifier:
This is a method used for text classification that provides an efficient learning algorithm for data mining. It uses Bayes' theorem together with the conditional independence assumption:

P (spam | word) = [P (word | spam) P (spam)] / P (word)

Using the spam probability of each word, it classifies mails as spam or legitimate, after which performance is measured.

4) Performance Measurement:
Performance is evaluated in terms of Accuracy, Error, Time, Precision and Recall for the base method using the Bayesian Classifier.

B. Implementation using combination of both Bayesian Classifier and Decision Tree C4.5:
As the e-mail body consists mainly of text and URLs, the text is classified with the Bayesian Classifier, following the process in A) Base Method using Bayesian Classifier, and the URLs are classified with the following method.

Fig. 1. Combination Approach of Content Based Spam Detection using Bayesian Classifier and malicious URLs Detection in Email using Decision Tree C4.5.

1) Phishtank Dataset and DMOZ Dataset:
Phishtank is a source of blacklisted phishing URLs that accepts user submissions, which are verified by users. It is a set of URLs that are suspected and reported as phishing URLs to Phishtank.
DMOZ: It is used to obtain genuine web links for the dataset of legitimate (non-phishing) URLs.

2) URL Preprocessing:

a) IP Address:
IP addresses and hexadecimal characters are used to hide the actual URLs. For example, the URL “http://www.bankingcompany.com/online/transaction/website/phishing.html” can be shortened using the IP address http://132.115.201.115, which looks legitimate and not suspicious.

b) Hexadecimal Character:
A URL can also be represented using hexadecimal values prefixed with a ‘%’ symbol, which may represent any special character. Spoofguard identified the ‘@’ and ‘-’ symbols as the most prominent in phishing URLs. In a URL, the part to the left of an ‘@’ symbol is discarded and the part to its right is used as the destination, which leads to the phishing site. For example, the URL “http://www.citibank.com@phishingsite.com” enters “phishingsite.com” and discards “www.citibank.com”. Such methods mask the phishing site so that it poses as a legitimate site.


c) Suspicious Character:
The presence of suspicious characters, such as the ‘@’ symbol and other special characters (‘.’, ‘=’, ‘$’, ‘^’, etc.), in either the host or the path name marks a URL as suspicious.

d) Number of Dots:
The number of dots in each URL found in an e-mail is counted and used to predict whether the URL is malicious or legitimate.
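
The URL preprocessing steps (IP address, hexadecimal characters, suspicious characters, number of dots) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the helper name `url_features`, the exact set of symbols counted as suspicious and the example URLs are assumptions chosen for illustration, and only the standard library is used.

```python
import re
from urllib.parse import urlsplit

# Assumption: the symbols counted as "suspicious"; dots are counted separately.
SUSPICIOUS_CHARS = "@=$^"
IPV4_HOST = re.compile(r"\d{1,3}(\.\d{1,3}){3}")
HEX_ESCAPE = re.compile(r"%[0-9a-fA-F]{2}")


def url_features(url):
    """Return simple lexical features for one URL (hypothetical helper)."""
    host = urlsplit(url).hostname or ""
    rest = url.split("://", 1)[-1]  # everything after the scheme
    return {
        "host": host,                                       # real destination host
        "ip_host": IPV4_HOST.fullmatch(host) is not None,   # a) IP address as host
        "hex_chars": HEX_ESCAPE.search(rest) is not None,   # b) %XX escapes present
        "suspicious": sum(rest.count(c) for c in SUSPICIOUS_CHARS),  # c)
        "dots": rest.count("."),                            # d) number of dots
    }


if __name__ == "__main__":
    for u in ("http://132.115.201.115",
              "http://www.citibank.com@phishingsite.com"):
        print(u, url_features(u))
```

For the second URL, `urlsplit` reports the part after the ‘@’ symbol as the host, which is the masking behaviour described for phishing URLs. The resulting feature dictionaries could then be passed to any classifier that accepts tabular input.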

3) Decision Tree C4.5:
The dataset from Phishtank is preprocessed and passed as input to Decision Tree C4.5 for classification; performance is then measured in terms of Accuracy, Time and Error.
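
The splitting criterion of C4.5 described in Section III-B, normalized information gain, can be sketched as follows. This is a simplified illustration for categorical attributes only; a full C4.5 implementation also handles continuous attributes, missing values and pruning. The function names and the toy URL feature rows are assumptions, not data from the experiments.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Information entropy (in bits) of a list of categorical values."""
    total = len(values)
    return -sum((n / total) * log2(n / total) for n in Counter(values).values())


def gain_ratio(rows, labels, attr):
    """Normalized information gain of splitting `rows` on attribute `attr`."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    # Entropy remaining after the split, weighted by subset size.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0


# Toy, made-up URL feature rows (not taken from Phishtank or DMOZ).
rows = [
    {"ip_host": True, "many_dots": True},
    {"ip_host": True, "many_dots": False},
    {"ip_host": False, "many_dots": True},
    {"ip_host": False, "many_dots": False},
]
labels = ["phishing", "phishing", "legitimate", "legitimate"]

# The attribute with the highest normalized information gain is chosen.
best = max(rows[0], key=lambda a: gain_ratio(rows, labels, a))
```

In this toy set, `ip_host` separates the two classes perfectly while `many_dots` carries no information, so `ip_host` is selected for the first split.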

4) Testing Gmail Dataset:
This dataset is derived from Gmail and consists of spam and legitimate mails. It also needs to be preprocessed in two ways, A) preprocessing for text and B) preprocessing for URLs, to give clean text and URLs; classification is then done by the combination of the Bayesian classifier and Decision Tree (C4.5). Finally, the correctly classified instances (mails) and incorrectly classified instances (mails) are evaluated.

Fig. 2. Accuracy of the Implementation for different volume of the Datasets
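
The text branch described in Section IV-A (HTML tag removal, stopword removal, tokenization, word frequency and Bayesian classification) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the system evaluated here: the stopword list is a tiny toy list, the training mails are invented, and Laplace smoothing and log-probabilities are added for numerical safety although the text does not mention them.

```python
import math
import re
from collections import Counter

# Assumption: a toy stopword list; a real list would be far longer.
STOPWORDS = {"a", "an", "the", "and", "or", "of", "to", "in", "is", "for", "your", "at"}


def preprocess(html):
    """HTML tag removal, punctuation removal, tokenization, stopword removal."""
    text = re.sub(r"<[^>]+>", " ", html)
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOPWORDS]


def train(mails):
    """Count word frequencies per class from (html, label) pairs."""
    word_counts = {"spam": Counter(), "legitimate": Counter()}
    class_counts = Counter()
    for html, label in mails:
        class_counts[label] += 1
        word_counts[label].update(preprocess(html))
    return word_counts, class_counts


def classify(html, word_counts, class_counts):
    """Pick the class with the larger (log) posterior probability."""
    vocab = set(word_counts["spam"]) | set(word_counts["legitimate"])
    total_mails = sum(class_counts.values())
    scores = {}
    for label in ("spam", "legitimate"):
        score = math.log(class_counts[label] / total_mails)  # log prior
        total_words = sum(word_counts[label].values())
        for tok in preprocess(html):
            # Laplace-smoothed word likelihood (an assumption of this sketch).
            score += math.log((word_counts[label][tok] + 1) / (total_words + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)


# Invented training mails, for illustration only.
training = [
    ("<p>Win a FREE prize now!</p>", "spam"),
    ("<b>Free</b> money, claim your prize", "spam"),
    ("<p>Meeting agenda for Monday</p>", "legitimate"),
    ("Please review the project report", "legitimate"),
]
model = train(training)
```

Calling `classify("<p>Claim your free prize</p>", *model)` labels the mail as spam, because its tokens occur only in the spam training mails.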

5) Performance Measurement:
As the combined classification model is built from the Bayesian classifier and Decision Tree C4.5, it is essential to measure performance on the basis of parameters such as Accuracy (correctly classified instances), Error (incorrectly classified instances), Precision and Recall.

Accuracy = (TN + TP) / (TN + TP + FN + FP)

Error = 100 - (Accuracy), with Accuracy expressed as a percentage

Precision = (TP) / (TP+FP)

Recall = (TP) / (TP + FN)

Where,
TN: True Negative, Legitimate predicted as Legitimate
TP: True Positive, Spam predicted as Spam
FP: False Positive, Legitimate predicted as Spam
FN: False Negative, Spam predicted as Legitimate.
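
As a worked example of these measures, the short sketch below computes them from confusion-matrix counts. The function name `evaluate` and the counts are made up for illustration and are not results from the experiments; Accuracy and Error are expressed as percentages, Precision and Recall as fractions.

```python
def evaluate(tp, tn, fp, fn):
    """Compute Accuracy and Error (in %) plus Precision and Recall."""
    total = tp + tn + fp + fn
    accuracy = 100.0 * (tp + tn) / total
    return {
        "accuracy": accuracy,
        "error": 100.0 - accuracy,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }


# Hypothetical counts: 45 spam caught, 50 legitimate kept,
# 2 legitimate flagged as spam, 3 spam missed.
scores = evaluate(tp=45, tn=50, fp=2, fn=3)
```

With these counts, 95 of 100 mails are classified correctly, so Accuracy is 95% and Error is 5%.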

V. EXPERIMENTAL RESULTS AND PERFORMANCE EVALUATION

A. Computation of system’s efficiency under different volume of Dataset for combination approach using Bayesian Classifier and Decision Tree (C4.5) Classifier:

Fig. 3. Error of the Implementation for different volume of the Datasets


Fig. 4. Time taken for Implementation for different volume of the Datasets

Fig. 5. Precision of Implementation for different volume of the Datasets

Fig. 6. Recall of the Implementation for different volume of the Datasets

B. Tabular Results:

TABLE I
Implementation Results using Bayesian Classifier

TABLE II
Implementation Results using Combination of Bayesian Classifier and Decision Tree C4.5 Classifier


TABLE III
Comparative Performance Evaluation of A) Implementation using Bayesian Classifier and B) Implementation using Bayesian Classifier and Decision Tree (C4.5) Classifier
Where, A - Bayesian Classifier and B - Bayesian and C4.5 Classifier

VI. CONCLUSIONS

We have integrated content based spam detection using the Bayesian Classifier with phishing URL detection using Decision Tree C4.5. We found that the performance of the combination approach of the Bayesian classifier and Decision Tree C4.5 is improved compared to the implementation using content based spam detection by the Bayesian Classifier alone.

We have evaluated the results across different volumes of the dataset. The implementation using the Bayesian classifier gives 94.86% accuracy, whereas the combination approach of the Bayesian Classifier and Decision Tree C4.5 gives 95.54% accuracy. We can therefore say that the combination approach improves the results in terms of accuracy and is an efficient method for integrated classification of content based spam and malicious URLs.

ACKNOWLEDGMENT
We are sincerely grateful to all the persons who helped us through this work to make it successful.

REFERENCES
[1] Dhanalakshmi Ranganayakulu and Chellappan C., “Detecting malicious URLs in E-Mail - An implementation”, in AASRI Conference on Intelligent Systems and Control, Vol. 4, pp. 125-131, 2013.
[2] G. Paliouras et al., “An Evaluation of Naive Bayesian Anti-Spam Filtering”, in Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning, Barcelona, Spain, pp. 9-17, 2000.
[3] Zhan Chuan et al., “An Improved Bayesian with Application to Anti-Spam Email”, in Journal of Electronic Science and Technology of China, Vol. 3, No. 1, Mar. 2005.
[4] Vikas P. Deshpande and Robert F. Erbacher, “An Evaluation of Naïve Bayesian Anti-Spam Filtering Techniques”, in Proceedings of the 2007 IEEE Workshop on Information Assurance, United States Military Academy, West Point, NY, 20-22 June 2007.
[5] Sheng, S. et al., “An empirical analysis of phishing blacklists”, in Proceedings of the CEAS’09, 2009.
[6] Pawan Prakash et al., “PhishNet: Predictive Blacklisting to Detect Phishing Attacks”, in Proceedings of the IEEE Infocom, pp. 1-5, 2010.
[7] Maher Aburrous et al., “Experimental Case Studies for Investigating E-Banking Phishing Techniques and Attack Strategies”, Cognitive Computing, DOI 10.1007/s12559-010-9042-7, Vol. 2, pp. 242-253, 2010.
[8] Congfu Xu et al., “An approach to image spam filtering based on base64 encoding and N-Gram feature extraction”, in IEEE International Conference on Tools with Artificial Intelligence, DOI 10.1109/ICTAI.2010.31, 2010.
[9] R. Malathi, “Email Spam Filter using Supervised Learning with Bayesian Neural Network”, Computer Science, H.H. The Rajah’s College, Pudukkottai-622 001, Tamil Nadu, India, Int J Engg Techsci, Vol. 2(1), pp. 89-100, 2011.
[10] Sadeghian, A. and Ariaeinejad, R., “Spam detection system: A new approach based on interval type-2 fuzzy sets”, in IEEE CCECE -000379, 2011.
[11] Xiang, G. et al., “CANTINA+: A feature-rich machine learning framework for detecting phishing Web sites”, in ACM Trans. Inf. Syst. Secur., Vol. 14, No. 2, pp. 1-21, 2011.
[12] Naïve Bayes Classifier. (2014, Dec) [Online]. Available: http://www.statsoft.com/textbook/naive-bayes-classifier
