E-Mail Spam Detection Using Machine Learning and Deep Learning
http://doi.org/10.22214/ijraset.2020.6159
International Journal for Research in Applied Science & Engineering Technology (IJRASET)
ISSN: 2321-9653; IC Value: 45.98; SJ Impact Factor: 7.429
Volume 8 Issue VI June 2020- Available at www.ijraset.com
I. INTRODUCTION
Commercial and bulk e-mail has become a major problem. Spam wastes storage space, time, and communication bandwidth, and the
problems caused by spam and fraudulent mail have been growing for many years. According to recent studies, 77% of all mail is
spam, which amounts to about 15 billion e-mails per day and costs Internet users about $300 million per year.
Today, knowledge engineering and machine learning are the two most successful approaches to e-mail filtering. In the knowledge
engineering approach, a set of rules is specified according to which each e-mail is classified as spam or ham. This method does
not show promising results, because the rules must be updated constantly, which wastes time and requires ongoing maintenance.
Compared with knowledge engineering, machine learning is a more appropriate approach. It does not require any rules to be
specified. Instead, a set of pre-classified e-mail messages is used as training data. Machine learning approaches are widely
applicable, and many algorithms can be used for e-mail filtering and classification. These include the Support Vector Machine
and Naïve Bayes.
1) Support Vector Machine: Support Vector Machines (SVMs), with their associated learning algorithms, analyse data for
classification. The SVM training algorithm constructs a model that assigns new examples to one of the categories. In an SVM
model, examples are represented as points in n-dimensional space, mapped so that the points of the different categories are
separated by a gap that is as wide as possible. Unseen examples are then mapped into the same space and assigned to a category
depending on which side of the separating hyperplane they fall.
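The idea of a widest-gap separator can be illustrated with a minimal sketch. It assumes a linear SVM trained by sub-gradient descent on the regularised hinge loss, which is one common formulation and not necessarily the solver used in this work. The two features, the toy data, and the hyperparameters `lam`, `lr`, and `epochs` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on the L2-regularised hinge loss; labels must be +1 or -1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            inside_margin = t * (dot(w, x) + b) < 1
            # Regularisation step: a smaller ||w|| means a wider gap (2 / ||w||).
            w = [wi * (1 - lr * lam) for wi in w]
            if inside_margin:
                # Hinge-loss step: push the violating example out of the margin.
                w = [wi + lr * t * xi for wi, xi in zip(w, x)]
                b += lr * t
    return w, b

def predict(w, b, x):
    """Return +1 (spam) or -1 (ham) depending on the side of the hyperplane."""
    return 1 if dot(w, x) + b >= 0 else -1

# Hypothetical features: [count of "free", count of "meeting"]; +1 = spam, -1 = ham
X = [[3, 0], [2, 0], [3, 1], [0, 2], [0, 3], [1, 3]]
y = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(X, y)
print(predict(w, b, [4, 0]), predict(w, b, [0, 4]))
```

The learned pair (w, b) defines the separating hyperplane, and an unseen e-mail is classified purely by the sign of w·x + b.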
2) Naïve Bayes: The Naïve Bayes classifier is rooted in Bayes' theorem. Bayes' theorem describes how the probability of a
hypothesis (H) should be revised given some new evidence (e). This paper determines the probability that an e-mail is spam
given the evidence of the e-mail's feature values F1, F2, …, Fn. Each feature is a Boolean value (0 or 1) indicating whether
the feature is present in the e-mail. P(Spam | features) and P(Ham | features) are then computed, and the e-mail is assigned to
the more likely class.
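The comparison of P(Spam | features) with P(Ham | features) can be sketched as follows. This is a minimal illustration that assumes a Bernoulli Naïve Bayes model with Laplace smoothing. The smoothing constant `alpha`, the three Boolean features, and the toy data are hypothetical and are not taken from the dataset used in this work.

```python
import math

def train_bernoulli_nb(X, y, alpha=1.0):
    """Estimate the class prior and P(F_i = 1 | class) for each class."""
    model = {}
    n_features = len(X[0])
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        prior = len(rows) / len(X)
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        probs = [(sum(r[i] for r in rows) + alpha) / (len(rows) + 2 * alpha)
                 for i in range(n_features)]
        model[label] = (prior, probs)
    return model

def log_posterior(model, x, label):
    """Unnormalised log P(label | x) = log P(label) + sum_i log P(F_i = x_i | label)."""
    prior, probs = model[label]
    score = math.log(prior)
    for xi, p in zip(x, probs):
        score += math.log(p if xi else 1.0 - p)
    return score

def predict(model, x):
    """Pick the class with the larger posterior."""
    return max(model, key=lambda label: log_posterior(model, x, label))

# Hypothetical Boolean features: [contains "free", contains "winner", contains "meeting"]
X = [[1, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1], [1, 0, 1]]
y = ["spam", "spam", "spam", "ham", "ham", "ham"]

model = train_bernoulli_nb(X, y)
print(predict(model, [1, 1, 0]), predict(model, [0, 0, 1]))
```

Working in log space avoids numerical underflow when the number of features n is large. The shared denominator P(features) is omitted because it does not affect which class is more likely.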
B. Deep Learning
In this paper, we use a deep neural network built with TensorFlow for e-mail spam detection. The model contains recurrent
neural network layers with LSTM (Long Short-Term Memory) units, which extract features automatically and avoid the overhead of
explicit, manual feature extraction. We train and test the model on our self-designed dataset. Results on our dataset show that
the neural model achieves significantly better accuracy than previous studies of e-mail spam detection that used a linguistic
approach, demonstrating the advantage of automatically extracted neural features.
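The TensorFlow model itself is not listed here. As a rough illustration of how an LSTM reads a token sequence without hand-crafted features, the following NumPy sketch runs the forward pass of a single LSTM layer followed by a sigmoid output. All sizes, parameter names, and token ids are assumptions, and the weights are random and untrained, so the printed probability carries no meaning.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_params(vocab_size, embed_dim, units, seed=0):
    """Random, untrained parameters (hypothetical sizes)."""
    rng = np.random.default_rng(seed)
    return {
        "E": rng.normal(0, 0.1, (vocab_size, embed_dim)),  # word embeddings
        "W": rng.normal(0, 0.1, (embed_dim, 4 * units)),   # input weights, 4 gates
        "U": rng.normal(0, 0.1, (units, 4 * units)),       # recurrent weights
        "b": np.zeros(4 * units),
        "w_out": rng.normal(0, 0.1, units),                # output layer
        "b_out": 0.0,
    }

def lstm_forward(token_ids, p):
    """Run the LSTM over a sequence of word indices and return P(spam)."""
    units = p["U"].shape[0]
    h = np.zeros(units)  # hidden state
    c = np.zeros(units)  # cell state (the long-term memory)
    for t in token_ids:
        x = p["E"][t]                           # learned vector for this word
        z = p["W"].T @ x + p["U"].T @ h + p["b"]
        i, f, g, o = np.split(z, 4)             # input, forget, candidate, output
        i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
        c = f * c + i * np.tanh(g)              # forget old memory, add new
        h = o * np.tanh(c)                      # expose part of the memory
    return float(sigmoid(p["w_out"] @ h + p["b_out"]))

params = init_params(vocab_size=50, embed_dim=8, units=16)
p_spam = lstm_forward([3, 17, 42, 5], params)
print(f"P(spam) = {p_spam:.3f}")
```

In TensorFlow, a comparable architecture would presumably stack `tf.keras.layers.Embedding`, `tf.keras.layers.LSTM`, and a `Dense` layer with a sigmoid activation, with the weights learned from the labelled training set.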
2) Sequences and Tokenizers: We use the tokenizer class from the pre-processing package to convert our text to vectors. The
tokenizer is initialized using the text part of our training data, which builds a dictionary mapping words to indices. We
then convert each text's list of word indices into a row of a binary NumPy matrix, in which columns represent words and
rows represent text lines. We create a second NumPy matrix for the test data. In this case we use only the vocabulary from
the training data, as our model is trained on it. The second matrix therefore has the same columns, created from the
training data, with binary flags set from the test set.
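The text does not name the tokenizer library, so the following sketch uses plain Python and NumPy as a stand-in rather than the actual tokenizer class. The helper names `fit_vocabulary` and `to_binary_matrix` and the sample sentences are hypothetical. It shows the key point of the step above: the vocabulary is fitted on the training text only and then reused for the test text.

```python
import numpy as np

def fit_vocabulary(texts):
    """Map each distinct lower-cased word in the training texts to a column index."""
    vocab = {}
    for line in texts:
        for word in line.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def to_binary_matrix(texts, vocab):
    """Rows are text lines, columns are vocabulary words; 1 marks presence."""
    matrix = np.zeros((len(texts), len(vocab)), dtype=np.int8)
    for row, line in enumerate(texts):
        for word in line.lower().split():
            col = vocab.get(word)
            if col is not None:  # words unseen during training are ignored
                matrix[row, col] = 1
    return matrix

train_texts = ["win a free prize now", "meeting at noon", "free prize inside"]
test_texts = ["claim your free prize", "lunch meeting tomorrow"]

vocab = fit_vocabulary(train_texts)              # built from training data only
X_train = to_binary_matrix(train_texts, vocab)
X_test = to_binary_matrix(test_texts, vocab)     # same columns as X_train
print(X_train.shape, X_test.shape)
```

Because both matrices share the same columns, a model trained on `X_train` can be applied to `X_test` directly. Test words that never occurred in training simply contribute nothing.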
The figure above shows the confusion matrix, classification report, and F1 measure for the SVM classifier.
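For reference, the quantities named above can be derived from true and predicted labels as in the following sketch. The label lists are hypothetical and do not reproduce the results of this work. Spam is treated as the positive class.

```python
def confusion_counts(y_true, y_pred, positive="spam"):
    """Return (TP, FP, FN, TN) with `positive` as the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1  # ham flagged as spam
        elif t == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1
    return tp, fp, fn, tn

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0  # harmonic mean
    return precision, recall, f1

# Hypothetical labels, for illustration only
y_true = ["spam", "spam", "spam", "ham", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "spam", "ham", "ham", "ham", "spam", "ham", "spam"]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall, f1 = precision_recall_f1(tp, fp, fn)
print(f"TP={tp} FP={fp} FN={fn} TN={tn}")
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```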
B. Deep Learning
V. FUTURE SCOPE
Although this experiment has made efforts towards solving the problem of spam e-mail, proposed solutions using legislative,
behavioural, and technical measures are not a complete solution. Spam and anti-spam solutions are locked in a cat-and-mouse
game: every day spammers come up with new techniques to send spam e-mail. This work has given possible directions for spam
e-mail classification. Future efforts will be extended to:
A. Achieving accurate classification, with zero percent (0%) misclassification of ham e-mail as spam and of spam e-mail as ham.
B. Blocking phishing e-mail, which carries phishing attacks and is nowadays a matter of concern.
C. Extending the work to defend against the Denial of Service (DoS) attack, which has now emerged in a distributed form
called the Distributed Denial of Service (DDoS) attack.
VI. CONCLUSION
In this study, we reviewed the application of machine learning approaches to spam filtering. State-of-the-art algorithms
applied to classify messages as either spam or ham were reviewed. Efforts made by various researchers to solve the problem of
spam through the use of machine learning classifiers were discussed. The evolution of spam messages over the years to evade
filters was examined. The basic structure of an e-mail spam filter and the processes involved in filtering spam e-mails were
described. The paper surveyed some of the publicly available datasets and performance metrics that can be used to measure the
effectiveness of any spam filter. The challenges that machine learning algorithms face in efficiently handling the threat of
spam were pointed out, and a comparative study of the machine learning techniques available in the literature was presented.
We also identified some open research problems related to spam filters. In general, the amount of literature we reviewed
suggests that significant progress has been made and will continue to be made in this area. Given the open problems in spam
filtering, further research is needed to increase the effectiveness of spam filters. The development of spam filters will
therefore remain an active research area for academics and industry practitioners investigating machine learning techniques
for effective spam filtering. Our hope is that research students will use this paper as a springboard for conducting
qualitative research in spam filtering using machine learning, deep learning, and deep adversarial learning algorithms.