Detecting Spam Messages Using The Naive Bayes Algorithm of Basic Machine Learning
Tashkent University of Information Technologies named after Muhammad al-Khwarazmi
Tashkent, Uzbekistan
[email protected]
Abstract—This article considers the detection of spam messages using the naive Bayes algorithm, a basic machine learning method, and studies how the algorithm is applied. A table is compiled that shows, by example, how a message is classified as spam or non-spam and how the naive Bayes scores are calculated.

Keywords—spam, filtering, Bayes, machine learning, message, algorithm

I. INTRODUCTION

When considering and studying any phenomenon, it is necessary to define it clearly. This is especially important for the problems associated with "spam", because there are many different definitions, and many of them are too vague for practical use. This article uses the following definitions:

Spam is anonymous unsolicited bulk email. Every qualifier in this definition matters; other types of mass mailings are not considered spam in this text. Most spam mailings are of an advertising or other commercial nature. This is important when considering the economics of spam, but less so from a technical point of view. ICQ, SMS and other similar mailings are not considered in this article.

Legal mailing is a mass mailing of email requested by the user. It is assumed that a subscriber to a legal mailing list has expressed a desire to receive it. As a rule, legal mailings are not anonymous.

Regular or legal email is email between users, or between automated systems and users. Regular mail is, first of all, not mass mail; an individual message usually has only a few recipients.

II. METHODS

The further evolution of anti-spam methods led to the emergence of content filtering of email: the analysis of message texts using deterministic or statistical methods. Content filtering is a fairly new method whose effectiveness is still quite high, but spam distributors are visibly fighting hard against content filters. Who will win this struggle is unclear today.

The development of spam-sending technologies means that today's spam (at least spam aimed at the Russian market) has a number of technological features that are important for the topic discussed below:

Distribution. A significant proportion of spam messages are sent through equipment installed by end users (usually private users). Spammers exploit both flaws in user software and malicious "Trojan" programs that the user receives along with viruses or through file-sharing networks. In all likelihood, vulnerabilities in wireless networks can also be exploited.

Typically, an individual user computer sends only a small fraction of the messages, with hundreds or thousands of user machines participating in a mailing. Apparently, the largest spammers have managed to establish end-to-end monitoring of message delivery: a letter rejected when delivered from one IP address is sent again from another IP address. This makes rejecting (bouncing) mail based on RBL ineffective, because delivery attempts are repeated from other IP addresses.

Unauthorized use of someone else's resources is clearly illegal and is criminalized in most countries, but the technical complexity of securing evidence makes prosecution ineffective.

Personalization. A large proportion of spam messages are unique. In other words, random sequences of characters (often invisible to the reader), personal addresses, anecdotes, large pieces of coherent text, and so on are inserted into the letter.

"Mimicry" of legal letters. Spammers make the technical information in sent emails as close as possible to that of legal correspondence. As a result, the bulk of spam passes through formal filters easily.

In addition to these properties, which make it harder to detect, spam has a number of features that make it easier to detect:

A. A spam message contains a message (advertisement) from the customer of the mailing. Thus, the message cannot consist of arbitrary text; the advertised product or service will be described there.

B. The spam message should be easy for the recipient to read. In other words, it cannot be encrypted; the bulk of its content must be delivered within the message itself. The number of random sequences ("garbage") visible to the user should be small. Violating these rules reduces readability and, consequently, the response to the advertising.

C. The uniqueness of messages is provided automatically, that is, random sequences of characters, greetings, and so on are added by a program. Otherwise, the cost and time of producing individual messages would be too high.

The Bayesian classifier belongs to the category of machine learning methods. The essence is this: the system, which is faced with the task of determining whether the next letter is spam, is trained in advance on a certain number of letters that are known
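For reference, a classifier of this kind is commonly written in the following form. This is the standard multinomial naive Bayes rule with additive smoothing, given here as an assumption rather than quoted from the article; the symbols $C$, $D_c$, $D$, $W_{ic}$, $L_c$ and $V$ are introduced only for this sketch.

```latex
c^{*} = \arg\max_{c \in C}
  \left[ \log \frac{D_c}{D}
       + \sum_{i=1}^{n} \log \frac{\alpha + W_{ic}}{\alpha\,|V| + L_c} \right],
\qquad 0 < \alpha \le 1
```

Here $D_c$ is the number of training letters in category $c$, $D$ is the total number of training letters, $W_{ic}$ is the number of times the $i$-th word of the message occurs in the training letters of category $c$, $L_c$ is the total number of words in those letters, $|V|$ is the number of unique words in the whole training sample, and $n$ is the number of words in the message being classified.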
* While performing the calculations, you may come across a word that was not present at the training stage. This would produce a probability of zero, and the document could not be assigned to either category (spam / non-spam). However much you want to, you cannot train the system on every possible word. To handle this, smoothing is applied: small corrections are made to the probabilities of occurrence of all words in the document. A parameter 0 < α ≤ 1 is selected (α = 1 corresponds to Laplace smoothing).

** The logarithm is a monotonically increasing function. As the first formula shows, we are looking for a maximum. The logarithm of a function reaches its maximum at the same point (on the abscissa) as the function itself. This simplifies the calculation, since only the numerical value changes.

III. FROM THEORY TO PRACTICE

Let our system be trained on the following letters, for which it is known in advance which are "spam" and which are "not spam" (training sample):

Spam:
"Vouchers at a low price"
"Stock! Buy a chocolate bar and get a phone as a gift"

Word            Count in spam   Count in not spam   P(word | spam)   P(word | not spam)
chocolate bar   1               1                   (1+1)/(14+9)     (1+1)/(14+7)
get             1               0
phone           1               0
gift            1               0
tomorrow        0               1
will be         0               1
meeting         0               1
kilogram        0               1                   (1+0)/(14+9)     (1+1)/(14+7)
apples          0               1                   (1+0)/(14+9)     (1+1)/(14+7)
store           0               0                   (1+0)/(14+9)     (1+0)/(14+7)
mountain        0               0                   (1+0)/(14+9)     (1+0)/(14+7)
seven           0               0                   (1+0)/(14+9)     (1+0)/(14+7)

We calculate the same using the function transformed by the properties of the logarithm:

Rating for the "Spam" category:

$$
\log\frac{2}{4} + \log\frac{2}{23} + \log\frac{2}{23} + \log\frac{1}{23} + \log\frac{1}{23} + \log\frac{1}{23} + \log\frac{1}{23} + \log\frac{1}{23} \approx -21.25
$$
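The scoring used in this section can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The spam messages are the ones listed above, while the two "not spam" token lists and the test message are assumptions, reconstructed so that the word counts agree with the table (9 spam words, 7 non-spam words, 14 unique words). Stop words are assumed to be removed, and "chocolate bar" and "will be" are treated as single tokens. The names `TRAIN`, `TEST`, `train`, `log_score` and `classify` are hypothetical.

```python
import math
from collections import Counter

# Training sample after stop-word removal. The spam messages follow the
# article; the "not spam" token lists and TEST are ASSUMED reconstructions
# chosen to match the counts in the table.
TRAIN = {
    "spam": [
        ["vouchers", "low", "price"],
        ["stock", "buy", "chocolate bar", "get", "phone", "gift"],
    ],
    "not spam": [
        ["tomorrow", "will be", "meeting"],
        ["buy", "kilogram", "apples", "chocolate bar"],
    ],
}
TEST = ["store", "mountain", "apples", "buy", "seven", "kilogram", "chocolate bar"]


def train(samples):
    """Count documents and word occurrences per category."""
    doc_counts = {c: len(docs) for c, docs in samples.items()}
    word_counts = {
        c: Counter(word for doc in docs for word in doc)
        for c, docs in samples.items()
    }
    vocab = {word for counts in word_counts.values() for word in counts}
    return doc_counts, word_counts, vocab


def log_score(tokens, category, model, alpha=1.0):
    """Log prior plus the sum of smoothed log word probabilities."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    class_len = sum(word_counts[category].values())
    score = math.log(doc_counts[category] / total_docs)
    for word in tokens:
        # Counter returns 0 for unseen words, so alpha keeps the
        # probability strictly positive (alpha = 1: Laplace smoothing).
        count = word_counts[category][word]
        score += math.log((alpha + count) / (alpha * len(vocab) + class_len))
    return score


def classify(tokens, model, alpha=1.0):
    """Return the category with the highest log score, and all scores."""
    scores = {c: log_score(tokens, c, model, alpha) for c in model[0]}
    return max(scores, key=scores.get), scores


model = train(TRAIN)
label, scores = classify(TEST, model)
print(label, scores)
```

With natural logarithms, this sketch gives approximately −21.255 for "spam" and −19.232 for "not spam", so the test message is assigned to "not spam".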
Authorized licensed use limited to: Somaiya University. Downloaded on November 03,2023 at 13:58:36 UTC from IEEE Xplore. Restrictions apply.
Rating for the "Not spam" category:

$$
\log\frac{2}{4} + \log\frac{2}{21} + \log\frac{2}{21} + \log\frac{2}{21} + \log\frac{2}{21} + \log\frac{1}{21} + \log\frac{1}{21} + \log\frac{1}{21} \approx -19.23
$$

Answer: the same as in the previous calculation. The rating for "Not spam" is higher, so the test email is not spam.

CONCLUSION

The mathematical apparatus of the considered class of methods is described, and a detailed description of the algorithm is given. As a result of the review of the spam identification method, a table was compiled that reflects the identification process.

This review identifies the following promising research areas:

• research into algorithms that are not yet fully understood;
• development of combined methods with high accuracy.