Volume 9, Issue 2, February – 2024 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

E-Mail Spam Detection Using Machine Learning Naive Bayes Theorem
Karra.NAGA SIVA SURYA DILEEP 1
Krovvidi.KARTHIK SAI SRI RAMA RAJU 2
Karella.JOHNY 3
Kombathula.VENKAT 4
P.SRINU VASA RAO 5
Swarnandhra College of Engineering & Technology.

Abstract:- Spam, sometimes called junk email, is unsolicited email that is typically sent to large lists of recipients. Although real individuals can send spam, botnets (networks of infected computers controlled by an attacker) are often responsible for sending it. While most people view spam as a nuisance, they accept it as a side effect of email communication. In addition to being annoying, spam can also be dangerous, and it can clog inboxes if it is not filtered properly and deleted regularly.

Spammers often change their methods and content to trick victims into downloading malware, sharing personal information, or sending money. Most spam is commercial in nature and financially motivated. Spammers attempt to deceive recipients by making false claims, selling questionable products, and promoting false information.

Unwanted emails, such as phishing and spam, cost businesses and individuals billions of dollars each year. Many models and techniques for automatic spam detection have been introduced and developed, but none has yet achieved 100% accuracy. Among the proposed approaches, machine learning and deep learning algorithms have been the most successful. Natural language processing (NLP) improves model accuracy. This study presents the effectiveness of word embedding in spam classification.

A pre-trained transformer model, BERT (Bidirectional Encoder Representations from Transformers), is fine-tuned to distinguish spam from non-spam (ham). BERT uses attention layers to take the context of the text into account. The results were compared with a baseline DNN (Deep Neural Network) model consisting of a BiLSTM (Bidirectional Long Short-Term Memory) layer and two dense layers.

• Some of the most popular spam topics are pharmaceuticals, financial services, working from home, pornography, online courses, and cryptocurrency.

Keywords:- Machine Learning, Natural Language Processing, Spam, Ham, Email, Naive Bayes, Logistic Regression.

I. INTRODUCTION

Spam has become a major problem in today's digital age and poses a challenge to individuals, businesses, and organizations. Spam consists of unsolicited messages that fill inboxes, waste time and resources, and potentially expose users to malicious or fraudulent content. To address this problem, machine learning has become a powerful tool for spam detection. The purpose of spam detection is to classify each email as legitimate (ham) or spam. Because spam keeps evolving, the effectiveness of traditional rule-based methods is limited. Machine learning provides a powerful and flexible alternative that uses patterns and features extracted from large email datasets. Machine learning algorithms can learn from email text to build a model that captures spam patterns, and this model can then be used to classify new, unseen emails. By analyzing various email elements such as sender information, sentences, content, and embedded URLs, machine learning algorithms can identify spam characteristics and act accordingly.

In today's digital age, where email is one of the most common means of communication, spam has become a huge problem for people and organizations. Spam filters are essential for keeping the important emails in an inbox manageable. Machine learning (ML) and natural language processing (NLP) techniques can be used to build spam classifiers that accurately identify and filter spam. In this project, we aim to create a spam classification system based on machine learning and natural language processing that accurately classifies emails as spam or not spam. The performance of the system will be evaluated using metrics such as accuracy, precision, recall, and F1 score.

Spamming is the use of email to send unsolicited or bulk messages to multiple recipients. An email is spam when the recipient has not given permission to receive it. Spam has become a big problem on the internet, and it wastes storage and network bandwidth. Automated email filters are still the best way to catch spam, but nowadays spammers can easily bypass many spam filtering applications. A few years ago, most spam from specific email addresses was blocked manually. Machine learning techniques will be used for spam detection, so Naive

IJISRT24FEB1369 www.ijisrt.com 1265


Bayes is one of the methods used in this process. The Naive Bayes algorithm is a supervised learning algorithm used to solve classification problems, and it helps build models that train quickly and make fast predictions.

Spam and Spamming: Spam refers to unsolicited e-mail content, and spamming is the use of electronic communications to send unsolicited messages, especially advertisements or malicious links. Therefore, if you do not know the sender, the message may be spam. Many users do not realize that they signed up for some of these emails when they downloaded free services, software, or updates. "Ham" is email that is not spam.

Machine learning focuses on developing computer programs and algorithms that access data and learn from it. Here the model is built from training data, which is primarily a set of emails. Machine Learning Methods: There are many algorithms that can be used for email filtering. In this article, the Naive Bayes algorithm was used to detect spam, and it gave the highest accuracy.

A machine learning method that uses training data, obtained by pre-processing the emails, is more efficient. There are many machine learning methods that can be used for email filtering, including Naive Bayes, Support Vector Machines, Neural Networks, K-Nearest Neighbours, and Random Forests.

Why choose machine learning: Machine learning allows users to feed large amounts of data into computer algorithms; the computer analyzes the input data and makes data-driven recommendations and decisions.

What is a dataset: A dataset is a collection of data or information about individual items. The spam email dataset includes spam and non-spam emails.

What is training data and test data: The main difference between training data and test data is that training data is the part of the original data used to train the model, while test data is used to check the accuracy of the model. The training set is usually larger than the test set. Training and test datasets are two important concepts in machine learning.

ROC Curve: The ROC (Receiver Operating Characteristic) curve is a graph that shows the performance of a classification model across classification thresholds by plotting the true positive rate against the false positive rate.

PR Curve: The PR (precision-recall) curve is a graph that has precision on the y-axis and recall on the x-axis. In other words, the PR curve has TP/(TP+FP) on the y-axis and TP/(TP+FN) on the x-axis. It is worth noting that precision is also known as positive predictive value (PPV).

Confusion Matrix: A confusion matrix is a table used to evaluate the effectiveness of classification models in machine learning and statistics. It summarizes the classification results by counting true positives, true negatives, false negatives, and false positives.

II. LITERATURE SURVEY

• L. F. Cranor and B. A. LaMacchia said that spam is not just marketing email that wastes our time; it also consumes network traffic and mail server resources. It has also become an integral part of many attacks, including phishing, cross-site scripting, cross-site request forgery, and malware attacks. Statistics show an increase in spam containing malicious content compared to spam promoting legitimate products and services. The work studies the spam detection problem and develops an effective spam detection method based on email content analysis. Many features are found to have drawbacks, and the goal is to examine how effective these features are in aiding classical spam detection techniques. To make the problem harder, the spam classification model is built on an imbalanced dataset in which spam is the minority class, accounting for only 16.5% of all emails. Different metrics are used to evaluate the model. The results show that the spam classification model improves markedly even when it learns from weak features.
• J. Goodman, G. V. Cormack, and D. Heckerman said that anti-spam researchers and developers are working to improve spam filtering software to counter the sophisticated techniques spammers use to bypass filters and reach users' mailboxes. Since 1998, the volume of unsolicited email (spam) has increased from approximately 10% to 80% of all messages sent, causing problems for email service providers (ESPs). Major email services receive more than a million spam messages every day. Learning algorithms are used to find the characteristics of spam and of legitimate emails. Spammers, in turn, quickly learn the most obvious words to avoid and the most innocuous words to add in order to get past the filter. Algorithms based on logistic regression and support vector machines can reduce the number of missed spam messages by half.
• E. Blanzieri and A. Bryl said that spam is a big problem online today, causing financial losses for companies and annoying individual users. Filtering is an important and popular way to prevent spam. Their article provides an overview of the state of the art in machine learning for spam filtering, as well as methods for evaluating and comparing filters. It also briefly describes other branches of anti-spam protection and discusses the use of various commercial and non-commercial anti-spam software solutions.
• L. Zhang, J. Zhu, and T. Yao said that their article evaluates five supervised learning methods in the context of statistical spam filtering. Using cost-sensitive measures, they examine the effects of different feature pruning methods and feature set sizes on each learner's performance. The importance of feature selection varies from classifier to classifier. In particular, support vector machines, AdaBoost, and the maximum entropy model performed best in this evaluation and share similar properties: insensitivity to the feature selection strategy, easy scalability to very high feature dimensions, and good performance across different datasets. In contrast, Naive Bayes

(a classifier commonly used in spam filtering) has been shown to be sensitive to feature selection on small feature sets, and it fails to work well in scenarios where false positives are penalized heavily. The experiments also show that aggressive feature pruning should be avoided when building filters for applications in which misclassifying legitimate email is far more costly than missing spam (e.g., λ = 999), so as to better preserve performance. An interesting finding is the impact of email headers on spam filtering, which has often been overlooked in previous research. Experiments show that a classifier using header features alone can perform as well as or better than a filter using body text only. This means that email headers can be a reliable and highly discriminative source of features for spam filtering.
• J.-J. Sheu said that the header section of an email usually includes the subject, the sender's name, the sender's email address, the sending date, and similar key features. These features help classify emails. The article applies decision tree data mining techniques to email headers to find the association rules of spam emails and proposes a filtering method that accurately separates spam from legitimate email. In tests with a large number of Chinese emails, the method achieved 96.5% accuracy, 96.67% precision, and 96.3% recall. Therefore, the method only needs to analyze the header section to identify spam effectively, which reduces the computational cost.

III. METHODOLOGY

Naive Bayes is an effective data analysis method, based on Bayes' theorem, that is used for email spam filtering. If you have an email account, you will have noticed that emails are sorted into different groups and labeled as important, spam, promotions, and so on. Wouldn't it be nice to have a smart machine doing this work for you?

In general, the labels added by the system are correct. So does this mean that our email software reads every message and understands what you are doing as a user? In a sense, yes. In the age of data analysis and machine learning, automatic filtering of emails is done by algorithms such as Naive Bayes classifiers, which apply Bayes' theorem to the data.

Current spam filtering software is constantly trying to classify emails correctly. Catching spam and promotional messages is the hardest part of all. As the battle between spam filtering software and anonymous senders of spam and marketing email continues, spam filtering algorithms must also continue to evolve. In data analysis, the Naive Bayes algorithm is widely described as a basis for message filtering on platforms such as Gmail, Yahoo Mail, and Hotmail. As online consumption of goods and services increases, consumers face a major problem with too much spam in their inboxes, whether advertising or fraud. As a result, very important messages can get buried among spam emails. In this article, we will create an email spam detection model that will help you detect whether an email is spam or not using Naive Bayes and Natural Language Processing (NLP).

Fig 1 : Email Spam Dataset

The Naive Bayes classifier is a popular algorithm for email filtering. It typically uses bag-of-words features to identify spam, an approach commonly used in text classification.

The Naive Bayes classifier works by correlating the use of tokens (usually words, sometimes other things) with spam and non-spam emails, and then using Bayes' theorem to calculate the probability that an email is spam.

Naive Bayesian spam filtering is a baseline technique for dealing with spam that can be tailored to the email needs of individual users and gives low false positive rates that are generally acceptable to users. It is one of the oldest spam filtering approaches, dating back to the 1990s. Naive Bayes is one of the simplest yet most powerful classification algorithms. Given a hypothesis A and evidence B, Bayes' theorem relates the probability of the hypothesis before obtaining the evidence, P(A), to the probability of the hypothesis after obtaining the evidence, P(A|B):

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Naïve Bayes Equation.

Where:
• A, B = events
• P(A|B) = probability of A given that B is true
• P(B|A) = probability of B given that A is true
• P(A), P(B) = the independent (marginal) probabilities of A and B
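As a worked illustration of Bayes' theorem, P(A|B) = P(B|A) · P(A) / P(B), the sketch below takes A to be "the email is spam" and B to be "the email contains the word free". All probabilities are hypothetical values chosen only to make the arithmetic easy to follow; they are not taken from the dataset used in this paper.

```python
def posterior(p_b_given_a: float, p_a: float, p_b: float) -> float:
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical numbers, for illustration only
p_spam = 0.2              # P(A): prior probability that an email is spam
p_word_given_spam = 0.5   # P(B|A): "free" appears in half of the spam emails
p_word_given_ham = 0.05   # "free" appears in 5% of the ham emails

# P(B): total probability of seeing the word, summed over spam and ham
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# P(A|B): probability that an email is spam given that it contains "free"
p_spam_given_word = posterior(p_word_given_spam, p_spam, p_word)
print(round(p_spam_given_word, 3))  # prints 0.714
```

Under these assumed numbers, seeing the word raises the spam probability from the 20% prior to roughly 71%. A Naive Bayes filter repeats this update for every token in the message, treating the tokens as independent given the class.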

IV. RESULT

Naïve Bayes Method:


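from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Assumes `data` is a pandas DataFrame, already loaded, with 'category' and 'Message' columns.
# Encode the label: 1 for spam, 0 for ham
data['Spam'] = data['category'].apply(lambda x: 1 if x == 'spam' else 0)
# Hold out 20% of the messages for testing
X_train, X_test, y_train, y_test = train_test_split(data.Message, data.Spam, test_size=0.20)
# Bag-of-words counts followed by a multinomial Naive Bayes classifier
model = Pipeline([('vectorizer', CountVectorizer()), ('nb', MultinomialNB())])
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)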
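To make explicit what a bag-of-words Naive Bayes classifier computes, here is a minimal sketch that uses only the Python standard library. It is not the scikit-learn implementation used above: the whitespace tokenizer, the function names, and the four toy messages are all assumptions made for illustration. It applies Laplace smoothing (alpha = 1) and works in log space to avoid numerical underflow.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (cruder than CountVectorizer)."""
    return text.lower().split()


def train(messages, labels, alpha=1.0):
    """Count tokens per class: 0 = ham, 1 = spam."""
    counts = {0: Counter(), 1: Counter()}
    n_docs = Counter(labels)  # number of messages per class
    for text, label in zip(messages, labels):
        counts[label].update(tokenize(text))
    vocab = set(counts[0]) | set(counts[1])
    return counts, n_docs, vocab, alpha


def log_posteriors(model, text):
    """Unnormalised log P(class | text) for both classes."""
    counts, n_docs, vocab, alpha = model
    total_docs = sum(n_docs.values())
    scores = {}
    for c in (0, 1):
        total_tokens = sum(counts[c].values())
        score = math.log(n_docs[c] / total_docs)  # log prior P(class)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                # Laplace-smoothed likelihood P(token | class)
                p = (counts[c][tok] + alpha) / (total_tokens + alpha * len(vocab))
                score += math.log(p)
        scores[c] = score
    return scores


def predict(model, text):
    """Return the class with the larger posterior score."""
    scores = log_posteriors(model, text)
    return max(scores, key=scores.get)


# Hypothetical toy data, not the dataset used in this paper
messages = [
    "win a free prize now",
    "free money claim your prize",
    "are we still meeting for lunch",
    "please send the report before the meeting",
]
labels = [1, 1, 0, 0]

toy_model = train(messages, labels)
print(predict(toy_model, "claim your free prize"))  # prints 1 (spam)
print(predict(toy_model, "lunch meeting report"))   # prints 0 (ham)
```

The "naive" part is the inner loop: each token contributes its own likelihood term independently of the others, which is what keeps training and prediction fast.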

V. OUTPUT

array([1, 0], dtype=int64)


Output 2 : Recall and Precision Graph
The probability that an email is spam is estimated from the data set, and the resulting predictions on the test set are mostly correct.
So accuracy = 0.9856502242152466, or about 98.6%.
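The true positive rate, false positive rate, precision, and recall plotted in ROC and PR graphs come from sweeping a decision threshold over the classifier's spam scores. The sketch below shows that sweep in plain Python. The labels and scores are hypothetical; with a fitted scikit-learn pipeline, the scores would typically be the spam column of `predict_proba`, and `sklearn.metrics.roc_curve` and `precision_recall_curve` would perform the sweep.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (TPR, FPR, precision) when score >= threshold means 'spam'."""
    tp = fp = fn = tn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn)  # true positive rate, identical to recall
    fpr = fp / (fp + tn)  # false positive rate
    # Convention chosen here: precision is 1.0 when nothing is flagged as spam
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    return tpr, fpr, precision


# Hypothetical labels (1 = spam) and spam scores for eight emails
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.40, 0.60, 0.30, 0.20, 0.10, 0.05]

for t in (0.9, 0.5, 0.25):
    tpr, fpr, prec = rates_at_threshold(y_true, scores, t)
    print(f"threshold={t}: TPR={tpr:.2f} FPR={fpr:.2f} precision={prec:.2f}")
# threshold=0.9: TPR=0.33 FPR=0.00 precision=1.00
# threshold=0.5: TPR=0.67 FPR=0.20 precision=0.67
# threshold=0.25: TPR=1.00 FPR=0.40 precision=0.60
```

Each threshold yields one point: (FPR, TPR) on the ROC curve and (recall, precision) on the PR curve. Lowering the threshold catches more spam at the cost of flagging more legitimate email.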

Classification Report :
              precision    recall  f1-score   support

           0       0.99      0.99      0.99       965
           1       0.97      0.92      0.94       150

    accuracy                           0.98      1115
   macro avg       0.98      0.96
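The precision, recall, and F1 values in a classification report are all derived from the four confusion-matrix counts. The following sketch computes them by hand for the spam class; the ten labels and predictions are hypothetical, and the helper names are assumptions for illustration. In scikit-learn, `sklearn.metrics.confusion_matrix` and `classification_report` produce the same quantities.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, TN, FP, FN), treating label 1 (spam) as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def spam_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 for the spam class."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp)  # share of flagged emails that are really spam
    recall = tp / (tp + fn)     # share of spam emails that were caught
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / len(y_true)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test labels and predictions (1 = spam, 0 = ham)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

print(confusion_counts(y_true, y_pred))  # prints (3, 5, 1, 1)
print(spam_metrics(y_true, y_pred))
```

Because ham is much more frequent than spam in a typical test set, accuracy alone can look high even when spam recall is modest, which is why the report lists per-class precision and recall as well.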

Output 1 : True Positive Rate vs. False Positive Rate
Output 3 : Confusion Matrix

VI. CONCLUSION

We conclude that the Naive Bayes algorithm performs very well for classification in spam detection, and that the Multinomial Naive Bayes algorithm is worth examining because it has applications in many industries and its predictions are very fast. News classification is one of the most popular uses of the Naive Bayes algorithm: it is widely used to sort news into sections such as political, regional, international, and so on.

REFERENCES

[1]. L. F. Cranor and B. A. LaMacchia, “Spam!,” Communications of the ACM, vol. 41, no. 8, pp. 74-83, August 1998.
[2]. J. Goodman, G. V. Cormack, and D. Heckerman, “Spam and the ongoing battle for the inbox,” Communications of the ACM, vol. 50, no. 2, pp. 24-33, February 2007.
[3]. E. Blanzieri and A. Bryl, “A survey of learning-based techniques of email spam filtering,” Artificial Intelligence Review, vol. 29, no. 1, pp. 63-92, March 2008.
[4]. L. Zhang, J. Zhu, and T. Yao, “An evaluation of statistical spam filtering techniques,” ACM Transactions on Asian Language Information Processing, vol. 3, pp. 243-269, 2004.
[5]. J.-J. Sheu, “An efficient two-phase spam filtering method based on e-mails categorization,” International Journal of Network Security, vol. 9, no. 1, pp. 34-43, 2009.
