
Detecting Spam Messages Using The Naive Bayes Algorithm of Basic Machine Learning

The document discusses detecting spam messages using the naive Bayes algorithm of machine learning. It defines spam and discusses how spammers operate, including personalizing messages and mimicking legal letters. The document also presents a table for determining spam/non-spam using examples and the naive Bayes algorithm.

Uploaded by

Arpan Soni

Detecting spam messages using the naive Bayes algorithm of basic machine learning


Khamdamov Rustam Khamdamovich
Tashkent university of information technologies named after Muhammad al-Khwarazmi
Tashkent, Uzbekistan
[email protected]

Haydarov Elshod
Tashkent university of information technologies named after Muhammad al-Khwarazmi
Tashkent, Uzbekistan
[email protected]

2021 International Conference on Information Science and Communications Technologies (ICISCT) | 978-1-6654-3258-0/21/$31.00 ©2021 IEEE | DOI: 10.1109/ICISCT52966.2021.9670243

Abstract: In this article, we consider the identification of spam messages using the naive Bayes algorithm based on machine learning and study the algorithm of inclusion. A table has been compiled for classifying a letter as spam or non-spam by example, with the calculation carried out using the naive Bayes algorithm.

Keywords: spam, filtering, Bayes, machine learning, message, algorithm

I. INTRODUCTION

When studying any phenomenon, it is necessary to define it clearly. This is especially important for the problems associated with "spam", because many different definitions exist and many of them are too vague for practical use. This article uses the following definitions:

Spam is anonymous unsolicited bulk email. Every qualifier in the definition is important; other types of mass mailings are not considered spam in this text. Most spam mailings are of an advertising or other commercial nature. This matters when considering the spam economy, but is less important from a technical point of view. ICQ, SMS and other similar mailings are not considered in this article.

Legitimate (legal) mailing is a mass mailing of e-mail requested by the user. It is assumed that the subscriber of a legitimate mailing list has expressed a desire to receive it. As a rule, legitimate mailings are not anonymous.

Regular (legitimate) email is email between users, or between automated systems and users. Regular mail is, first of all, not mass mail: an individual message usually has only a few recipients.

II. METHODS

The further evolution of anti-spam methods led to the emergence of content filtering of e-mail: the analysis of message texts using deterministic or statistical methods. Content filtering is still a fairly new method and its effectiveness remains quite high, but spam distributors are visibly fighting hard against content filters. It is unclear who will win this struggle.

The development of spam-sending technologies means that today spam (at least spam aimed at the Russian market) has a number of technological features that are important for the topic discussed below:

Distribution. A significant proportion of spam messages are sent through equipment installed by end users (usually private users). Spammers exploit both flaws in user software and malicious "Trojan" programs that the user receives along with viruses or through file-sharing networks. In all likelihood, vulnerabilities in wireless networks can also be exploited.

Typically, an individual user computer sends only a small fraction of the messages, with hundreds or thousands of user machines participating in a mailing. Apparently, the largest spammers have managed to establish end-to-end monitoring of message delivery: a letter rejected when delivered from one IP address will be sent again from another. This makes rejecting (bouncing) mail on the basis of an RBL (real-time blackhole list) ineffective, because attempts to deliver the message will be repeated from other IP addresses.

Unauthorized use of someone else's resources is clearly illegal and is criminalized in most countries, but the technical complexity of securing evidence makes prosecution ineffective.

Personalization. A large proportion of spam messages are unique: random sequences of characters (often invisible to the reader), personal forms of address, anecdotes, large pieces of coherent text, and so on are inserted into the letter.

"Mimicry" of legitimate letters. Spammers make the technical information in sent emails as close as possible to that of legitimate correspondence. As a result, the bulk of spam passes through formal filters easily.

In addition to these properties, which make detection harder, spam has a number of features that make it easier to detect:

A. A spam message contains a message (advertisement) from the customer who ordered the mailing. Thus, the text of the message cannot be arbitrary: it will describe the advertised product or service.

B. The spam message should be easy for the recipient to read. In other words, it cannot be encrypted; the bulk must be received as part of the message. The number of random sequences ("garbage") visible to the user should be small. Violating these rules reduces readability and, consequently, the response to the advertising.

C. The uniqueness of messages is produced automatically, that is, random sequences of characters, greetings, and so on are added by a program. Otherwise, the cost and time of producing individual messages would be too high.

The Bayesian classifier belongs to the category of machine learning methods. The bottom line is this: the system, which is faced with the task of determining whether the next letter is spam, is trained in advance on a certain number of letters that are known
to be either "spam" or "not spam". This is supervised learning (training with a teacher), in which we play the role of the teacher. The Bayesian classifier represents a document (in our case, a letter) as a set of words that are assumed to be independent of each other (this is where the "naivety" comes from).

We calculate a score for each class (spam / not spam) and select the class with the maximum score. For this we use the following formula*:

$$\arg\max_{k} \left[ P(Q_k) \prod_{i=1}^{n} P(x_i \mid Q_k) \right]$$

$$P(Q_k) = \frac{\text{number of documents of class } Q_k}{\text{total number of documents}}$$

$$P(x_i \mid Q_k) = \frac{\alpha + N_{ik}}{\alpha M + N_k}$$

Here $P(x_i \mid Q_k)$ is the probability that word $x_i$ occurs in a document of class $Q_k$ (with smoothing), and:

$N_k$ - the number of words in the documents of class $Q_k$
$M$ - the number of distinct words in the training sample
$N_{ik}$ - the number of occurrences of word $x_i$ in the documents of class $Q_k$
$\alpha$ - the smoothing parameter.

When the amount of text is very large, you have to work with very small numbers. To avoid this, the formula can be transformed using the property of the logarithm**:

$$\log ab = \log a + \log b$$

Substitute and get:

$$\arg\max_{k} \left[ \log P(Q_k) + \sum_{i=1}^{n} \log P(x_i \mid Q_k) \right]$$

* During the calculation you may come across a word that did not occur during training. This can make the score zero, so the document cannot be assigned to either category (spam / not spam). However much you might want to, you cannot train the system on every possible word. To handle this, smoothing is applied: small corrections are made to all the word-occurrence probabilities. A parameter 0 < α ≤ 1 is chosen (α = 1 gives Laplace smoothing).

** The logarithm is a monotonically increasing function. As the first formula shows, we are looking for the maximum. The logarithm of a function peaks at the same point (on the abscissa) as the function itself. This simplifies the calculation, since only the numerical value changes.

III. FROM THEORY TO PRACTICE

Let our system be trained on the following letters, each known in advance to be "spam" or "not spam" (training sample):

Spam:
"Vouchers at a low price"
"Stock! Buy a chocolate bar and get a phone as a gift"

Not spam:
"There will be a meeting tomorrow"
"Buy a kilogram of apples and a chocolate bar"

Assignment: determine the category to which the following letter belongs:

"There is a mountain of apples in the store. Buy seven kilograms and a chocolate bar"

Solution:
We make a table (Table I). We remove all "stop words", calculate the probabilities, and take the smoothing parameter equal to one. The training sample then contains M = 14 distinct words, with 9 words in the spam letters and 7 words in the non-spam letters.

Rating for the "Spam" category:

$$\frac{2}{4} \cdot \frac{2}{23} \cdot \frac{2}{23} \cdot \frac{1}{23} \cdot \frac{1}{23} \cdot \frac{1}{23} \cdot \frac{1}{23} \cdot \frac{1}{23} \approx 0.000000000587 \ (\text{or } 5.87\mathrm{E}{-10})$$

Rating for the "Not spam" category:

$$\frac{2}{4} \cdot \frac{2}{21} \cdot \frac{2}{21} \cdot \frac{2}{21} \cdot \frac{2}{21} \cdot \frac{1}{21} \cdot \frac{1}{21} \cdot \frac{1}{21} \approx 0.00000000444 \ (\text{or } 4.44\mathrm{E}{-9})$$

Answer: the "Not spam" rating is higher than the "Spam" rating, so the test letter is not spam!

TABLE I. COUNTING

| Word | Occurrences in spam | Occurrences in not spam | $P(x_i \mid \text{Spam})$ | $P(x_i \mid \text{Not spam})$ |
|---|---|---|---|---|
| Vouchers | 1 | 0 | | |
| low | 1 | 0 | | |
| price | 1 | 0 | | |
| Stock | 1 | 0 | | |
| Buy | 1 | 1 | (1+1)/(14+9) | (1+1)/(14+7) |
| chocolate bar | 1 | 1 | (1+1)/(14+9) | (1+1)/(14+7) |
| get | 1 | 0 | | |
| phone | 1 | 0 | | |
| gift | 1 | 0 | | |
| tomorrow | 0 | 1 | | |
| will be | 0 | 1 | | |
| meeting | 0 | 1 | | |
| kilogram | 0 | 1 | (1+0)/(14+9) | (1+1)/(14+7) |
| apples | 0 | 1 | (1+0)/(14+9) | (1+1)/(14+7) |
| store | 0 | 0 | (1+0)/(14+9) | (1+0)/(14+7) |
| mountain | 0 | 0 | (1+0)/(14+9) | (1+0)/(14+7) |
| seven | 0 | 0 | (1+0)/(14+9) | (1+0)/(14+7) |

The first 14 rows (Vouchers through apples) are words from the training sample; store, mountain and seven occur only in the test letter. Probabilities are shown only for the words of the test letter.

We calculate the same using the formula transformed by the property of the logarithm (natural logarithms are used):

Rating for the "Spam" category:

$$\log\frac{2}{4} + \log\frac{2}{23} + \log\frac{2}{23} + \log\frac{1}{23} + \log\frac{1}{23} + \log\frac{1}{23} + \log\frac{1}{23} + \log\frac{1}{23} \approx -21.25$$

Rating for the "Not spam" category:

$$\log\frac{2}{4} + \log\frac{2}{21} + \log\frac{2}{21} + \log\frac{2}{21} + \log\frac{2}{21} + \log\frac{1}{21} + \log\frac{1}{21} + \log\frac{1}{21} \approx -19.23$$

Answer: the same as before. The "Not spam" rating is higher, so the test letter is not spam!

CONCLUSION

The mathematical apparatus of the considered classifier is described and a detailed description of the algorithm is given. As a result of the review of the spam identification method, a table was compiled reflecting the identification process.

As a result of this review, the following promising research areas can be identified:

• research into algorithms that are not yet fully understood;
• development of combined methods with high accuracy.
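
The worked example of Section III can be reproduced with a short script. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the token lists are typed in by hand with stop words already removed, "chocolate bar" and "will be" are treated as single tokens and "kilograms" is reduced to "kilogram" so that the counts match Table I, and the function names (`train`, `likelihood`, `score`, `log_score`, `classify`) are hypothetical.

```python
import math
from collections import Counter

# Training sample from Section III, tokenised by hand with stop words removed.
# Assumption: multi-word tokens follow the rows of Table I.
TRAIN = {
    "spam": [
        ["vouchers", "low", "price"],
        ["stock", "buy", "chocolate bar", "get", "phone", "gift"],
    ],
    "not spam": [
        ["will be", "meeting", "tomorrow"],
        ["buy", "kilogram", "apples", "chocolate bar"],
    ],
}

# Test letter; "kilograms" is normalised to "kilogram" (assumption).
TEST = ["mountain", "apples", "store", "buy", "seven", "kilogram", "chocolate bar"]

ALPHA = 1.0  # smoothing parameter; 1 corresponds to Laplace smoothing


def train(samples):
    """Return per-class word counts, the vocabulary and the class priors P(Q_k)."""
    counts = {
        label: Counter(word for doc in docs for word in doc)
        for label, docs in samples.items()
    }
    vocab = {word for class_counts in counts.values() for word in class_counts}
    total_docs = sum(len(docs) for docs in samples.values())
    priors = {label: len(docs) / total_docs for label, docs in samples.items()}
    return counts, vocab, priors


def likelihood(word, label, counts, vocab, alpha=ALPHA):
    """Smoothed P(x_i | Q_k) = (alpha + N_ik) / (alpha * M + N_k)."""
    n_ik = counts[label][word]          # Counter returns 0 for unseen words
    n_k = sum(counts[label].values())   # words in the documents of this class
    return (alpha + n_ik) / (alpha * len(vocab) + n_k)


def score(words, label, counts, vocab, priors):
    """Product form: P(Q_k) * prod_i P(x_i | Q_k)."""
    result = priors[label]
    for word in words:
        result *= likelihood(word, label, counts, vocab)
    return result


def log_score(words, label, counts, vocab, priors):
    """Logarithmic form: log P(Q_k) + sum_i log P(x_i | Q_k), natural logarithm."""
    return math.log(priors[label]) + sum(
        math.log(likelihood(word, label, counts, vocab)) for word in words
    )


def classify(words, counts, vocab, priors):
    """Pick the class with the highest log score."""
    return max(priors, key=lambda label: log_score(words, label, counts, vocab, priors))


counts, vocab, priors = train(TRAIN)
for label in TRAIN:
    print(label,
          score(TEST, label, counts, vocab, priors),
          log_score(TEST, label, counts, vocab, priors))
print("decision:", classify(TEST, counts, vocab, priors))
```

Under these assumptions the script gives scores of about 5.87E-10 for spam and 4.44E-9 for not spam, and the log scores lead to the same decision: the test letter is classified as not spam.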

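
The remark in Section II that long texts lead to very small numbers can be illustrated as follows. This sketch uses made-up values, not data from the article: a hypothetical 400-word message in which every word has likelihood 1/23 under one class and 1/21 under the other, with equal priors of 0.5. In double-precision arithmetic both products underflow to zero, so the classes cannot be compared, while the logarithmic scores remain finite and comparable.

```python
import math

# Hypothetical values chosen only to trigger underflow (not from the article).
PRIOR = 0.5
N_WORDS = 400
LIKELIHOODS = {
    "spam": [1 / 23] * N_WORDS,
    "not spam": [1 / 21] * N_WORDS,
}


def product_score(prior, likelihoods):
    """P(Q_k) * prod_i P(x_i | Q_k), computed directly in floating point."""
    result = prior
    for p in likelihoods:
        result *= p
    return result


def log_score(prior, likelihoods):
    """log P(Q_k) + sum_i log P(x_i | Q_k), natural logarithm."""
    return math.log(prior) + sum(math.log(p) for p in likelihoods)


products = {label: product_score(PRIOR, ps) for label, ps in LIKELIHOODS.items()}
logs = {label: log_score(PRIOR, ps) for label, ps in LIKELIHOODS.items()}

print(products)  # both products have underflowed, so they compare equal
print(logs)      # the log scores are finite and still distinguish the classes
print("decision:", max(logs, key=logs.get))
```

The choice of 400 words is arbitrary; any message long enough to push the product below the smallest representable positive double shows the same effect.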
Authorized licensed use limited to: Somaiya University. Downloaded on November 03,2023 at 13:58:36 UTC from IEEE Xplore. Restrictions apply.
