Reverse of E-Mail Spam Filtering Algorithms To Maintain E-Mail Deliverability

The document proposes an algorithm to reduce the rate of false positive emails identified by spam filters. The algorithm performs a series of checks on emails before sending to alert the user about spammy elements and suggest changes. The goal is to help users ensure their emails are delivered by avoiding triggers that cause false positive identification.


Reverse of E-mail Spam Filtering Algorithms to Maintain E-mail Deliverability

Hussah AlRashid, Rasheed AlZahrani, Eyas ElQawasmeh


Information Systems Department
College of Computer and Information Sciences
King Saud University
Riyadh, Saudi Arabia

ISBN: 978-1-4799-3724-0/14/$31.00 ©2014 IEEE

Abstract—Over the past years, research has highlighted the importance of enhancing the performance of e-mail spam filters to eliminate the risk of false negative e-mails. The problem of false positive e-mails, on the other hand, has received less attention, even though it is critical and may cause important e-mails to go undelivered.

The aim of this research is to provide a solution that reduces the rate of false positive e-mails. It addresses the problem by exploring the behavior of existing e-mail spam filters and highlighting the different reasons behind the failure of e-mail delivery. Based on this investigation, we developed an algorithm that helps e-mail users ensure the deliverability of their e-mails. The proposed algorithm is based on reversing the mechanism of spam filters on the client-side.

Keywords—Electronic mail, e-mail, ham, spam, false positive, false negative, spam filters, spam filtering.

I. INTRODUCTION

Nowadays, the exponential growth of the Internet has led to the creation of numerous means of electronic communication around the world. Electronic mail (or e-mail) is considered one of the most popular of these methods.

But as we all know, technology is a double-edged sword. Although e-mail has several advantages, some people misuse it to support their own malicious intentions. For example, a sender may use e-mail to deliver a deceptive message that contains malicious code (such as a virus or worm) to harm the recipient. To overcome this issue, computer specialists developed what are known today as spam filters, which rely on powerful algorithms to classify e-mails into two categories: legitimate (ham) and illegitimate (spam).

Spam filters were originally developed to prevent spam e-mail from reaching the recipient’s inbox. However, a few problems have arisen as a result of using them. Misclassification, in the form of false positive and false negative e-mails, is one of their critical limitations. False negative e-mails are e-mails that are truly spam but are misclassified as legitimate [1]. False positive e-mails, on the other hand, are legitimate e-mails that are misclassified as spam [1]. This paper focuses on the problem of false positive e-mails, trying to identify its causes and to propose an algorithm that could help reduce their rate.

The organization of this paper is as follows. Section II lists some of the previous work done in this area. Section III describes the details of the algorithm that we developed in order to reduce the false positive rate of e-mails. Section IV presents the experiment that we carried out to evaluate the performance of the proposed algorithm. Finally, Section V concludes the work that has been done in this research and lists the future work that lays the foundation for further enhancements.

II. RELATED WORK

A lot of work has been concerned with improving the performance of e-mail spam filters to avoid the misclassification of e-mail messages. Much of this research aimed at reducing the spam filters’ false negative rate. However, to the authors’ knowledge, only a few studies have focused on improving the performance of spam filters to decrease their false positive rate.

Since the proposed algorithm takes advantage of existing spam filters, we give a short overview of them below.

The authors of [2] explored different spam filtering models to create a powerful algorithm that classifies e-mail messages based on the e-mail user’s behavior in addition to the message content. The algorithm is expected to understand the user’s behavior by learning which kinds of messages the user wants to receive. This way, the spam filter understands the behavior of each user individually and can classify messages more precisely.

The authors of [3] demonstrated that spam filters can be classified by the technique they use, such as machine learning, Bayesian theory, etc. Machine-learning-based classifiers rely on training the spam filter on a set of input data to enhance the overall performance of the classification process. The authors proposed a method for e-mail spam classification using a machine-learning algorithm called
“Adaptive Boosting Algorithm”. The algorithm is expected to adapt its behavior based on the input it receives over time to decrease the false negative/positive rate.

Another type of spam classifier is the Bayesian-based classifier. These classifiers work by analyzing the content of the entire message to calculate the overall spam probability of the message [4]. The authors of [4] proposed an enhanced algorithm that aims at reducing the false positive rate. The algorithm is called “Minimum Risk Bayes Algorithm” and it uses a decision-making technique in the classification process.

The authors of [5] addressed the issue of false positive classification and how sensitive it is, given that important messages may end up as spam. To solve this problem, the authors proposed a training algorithm for spam classification. A series of tests was conducted to evaluate the performance of the proposed algorithm.

III. PROPOSED ALGORITHM

There are two sides from which this problem could be approached: the server-side and the client-side. On the server-side, spam filter algorithms could be enhanced to eliminate the possibility of false positive filtering. However, this solution is not applicable because there is a trade-off between false positive and false negative classification: reducing the rate of one increases the rate of the other.

In this paper, we focus on the client-side (specifically the sender’s side, not the recipient’s side). The proposed algorithm helps e-mail users decrease the spam probability of their e-mails. The algorithm performs a series of checks on the message before it is sent. Moreover, it alerts the user to the things that make his/her e-mail look like spam. It also suggests changes to the e-mail content if necessary.

The expected outcome of the proposed algorithm is to help users make their e-mails’ content look legitimate, so that it passes the spam filters’ tests and lands in the recipient’s inbox. If the user follows the algorithm’s suggestions, there is a good chance that the deliverability rate of his/her e-mail will increase.

The algorithm works as follows. When the user writes an e-mail message, the algorithm checks its contents. The checking process includes looking at the subject, words, URLs, From, To, CC, and BCC addresses, attachments, etc. These elements are compared with the ones that are usually associated with spam e-mails. The user is then provided with a list of things that might trigger spam filters, along with suggested recommendations.

The checking process of the proposed algorithm breaks the problem down into the following points:

BEGIN
Step.1: Obtain the body and header of the e-mail message.
Step.2: Get the subject of the message; if it is empty, then ask the user to add a subject.
Step.3: If the subject contains spam trigger words, then list them and ask the user to change them.
Step.4: Get the body of the message; if it is empty, then ask the user to add a small portion of text.
Step.5: If the body of the message contains spam trigger words, then list them and ask the user to change them.
Step.6: If the body of the message contains capitalized words, then convert them into lower case words.
Step.7: If the body of the message contains consecutive punctuation marks and/or symbols, then minimize them.
Step.8: If the words within the e-mail body contain symbols, then list them and ask the user to change them.
Step.9: If format styles are used excessively in the body of the message, then ask the user to reduce them.
Step.10: If the body of the message contains a blocked, shortened, or IP-form URL, then refer to it and ask the user to change/delete it.
Step.11: If the body of the message contains invalid HTML code, then refer to it and ask the user to fix it.
Step.12: If an alternative text is not included with HTML e-mails, then ask the user to include it.
Step.13: If the message body contains embedded forms, then ask the user to remove them.
Step.14: If the body of the message is composed of images only, then ask the user to add text.
Step.15: If the subject and body of the e-mail are identical, then ask the user to make some changes.
Step.16: Get the attachment type; if it is video, JavaScript, an executable file, or any other suspicious type, then inform the user about it.
Step.17: Get the “From” address; if it is not in the right format, then inform the user.
Step.18: If the “From” address is invalid, then inform the user.
Step.19: If the mail server of the “From” address is blocked, then inform the user.
Step.20: Get the “To” addresses; if there are more than 20 addresses, then ask the user to reduce the number and send the message to 20 users at a time.



Step.21: If the “To” addresses contain invalid/trap/bounce addresses, then ask the user to remove them.
Step.22: Display a message with the changes and/or recommendations to the user.
END

IV. PERFORMANCE RESULTS

Since the algorithm is designed to work on the client-side, we developed a webmail client for this purpose. The program is implemented with minimal functionality to achieve the goal of this research. Fig. 1 shows the interface of the application.

Fig. 1. The Interface of the Webmail Application

An experiment was conducted to test the overall performance of the proposed algorithm. The following subsections describe the experiment settings.

A. Data Set

The data set used in the experiment is a collection of e-mails from an e-mail user. The e-mails were classified as spam by the user’s spam filter. However, the user considers those e-mails legitimate. These false positive e-mails are used to evaluate the purpose of this research, which is ensuring the deliverability of legitimate e-mails to the recipient.

The data set also includes a set of actual spam e-mails. Spam e-mails are used to ensure that the algorithm will not pass actual spam. We want the performance of the proposed algorithm to be balanced, so that it passes legitimate e-mails without affecting the security of e-mail users by increasing the chance of receiving actual spam.

B. Evaluation Measures

There are various measures to evaluate the performance of an algorithm. Here, we focus on five measures: false positive rate (FPR), false negative rate (FNR), recall, precision and accuracy.

Table I presents the formulas for these measures. Note that $n_H$ and $n_S$ represent the number of ham and spam messages respectively, $n_{H \to H}$ and $n_{S \to S}$ represent the number of ham/spam messages that are classified correctly, $n_{H \to S}$ represents the number of ham messages classified as spam, and $n_{S \to H}$ represents the number of spam messages classified as ham.

TABLE I. EVALUATION MEASURES

| Measure | Formula |
| --- | --- |
| FPR | $\frac{n_{H \to S}}{n_{H \to H} + n_{H \to S}}$ |
| FNR | $\frac{n_{S \to H}}{n_{S \to S} + n_{S \to H}}$ |
| Ham Recall | $\frac{n_{H \to H}}{n_{H \to H} + n_{H \to S}}$ |
| Ham Precision | $\frac{n_{H \to H}}{n_{H \to H} + n_{S \to H}}$ |
| Spam Recall | $\frac{n_{S \to S}}{n_{S \to S} + n_{S \to H}}$ |
| Spam Precision | $\frac{n_{S \to S}}{n_{S \to S} + n_{H \to S}}$ |
| Accuracy | $\frac{n_{H \to H} + n_{S \to S}}{n_{H \to H} + n_{S \to S} + n_{S \to H} + n_{H \to S}}$ |

C. Experiment Results

The experiment involves resending ten e-mails, five ham e-mails and five spam e-mails, using the application we developed. The algorithm is applied to the messages before sending them, and some changes are made based on the recommendations of the algorithm.

The results showed that two out of the five false positive e-mails landed in the inbox, and the rest were classified as spam. On the other hand, all five spam e-mails were classified as spam. This gives us the following values: $n_{H \to H} = 2$, $n_{H \to S} = 3$, $n_{S \to S} = 5$, $n_{S \to H} = 0$. We used these values to calculate the formulas in Table I. The results of the evaluation are listed in Table II.

TABLE II. EVALUATION RESULTS

| Measure | Result |
| --- | --- |
| FPR | 60% |
| FNR | 0% |
| Ham Recall | 40% |
| Ham Precision | 100% |
| Spam Recall | 100% |
| Spam Precision | 62.5% |
| Accuracy | 70% |
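The measures in Table I can be checked against Table II with a short calculation. The following Python sketch is only an illustration: the names `Counts`, `ratio` and `evaluate` are hypothetical, and returning 0.0 when a denominator is zero is an assumption that the paper does not address.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Counts:
    """Classification counts, following the n_{X -> Y} notation of Table I."""
    ham_as_ham: int    # n_{H -> H}: ham delivered to the inbox
    ham_as_spam: int   # n_{H -> S}: ham classified as spam (false positive)
    spam_as_spam: int  # n_{S -> S}: spam classified as spam
    spam_as_ham: int   # n_{S -> H}: spam classified as ham (false negative)


def ratio(numerator: int, denominator: int) -> float:
    """Divide safely; returning 0.0 for a zero denominator is an assumption."""
    return numerator / denominator if denominator else 0.0


def evaluate(c: Counts) -> Dict[str, float]:
    """Compute the seven measures of Table I as fractions in [0, 1]."""
    ham = c.ham_as_ham + c.ham_as_spam      # n_H
    spam = c.spam_as_spam + c.spam_as_ham   # n_S
    return {
        "FPR": ratio(c.ham_as_spam, ham),
        "FNR": ratio(c.spam_as_ham, spam),
        "Ham Recall": ratio(c.ham_as_ham, ham),
        "Ham Precision": ratio(c.ham_as_ham, c.ham_as_ham + c.spam_as_ham),
        "Spam Recall": ratio(c.spam_as_spam, spam),
        "Spam Precision": ratio(c.spam_as_spam, c.spam_as_spam + c.ham_as_spam),
        "Accuracy": ratio(c.ham_as_ham + c.spam_as_spam, ham + spam),
    }


if __name__ == "__main__":
    # Counts reported in the experiment results of Section IV.
    counts = Counts(ham_as_ham=2, ham_as_spam=3, spam_as_spam=5, spam_as_ham=0)
    for name, value in evaluate(counts).items():
        print(f"{name}: {value:.1%}")
```

With the reported counts, the computed values agree with Table II; for example, Spam Precision is 5/(5+3) = 62.5% and Accuracy is 7/10 = 70%.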



All the ham messages in the data set are false positive cases, which means that the false positive rate before applying the algorithm is 100%. The false positive rate for these messages after applying the algorithm is 60% (3/[2+3]=0.6), which is a decrease of 40 percentage points in the misclassification of ham e-mail.

The ham recall shows that 40% of the actual ham messages in the data set have been classified correctly. The ham precision of 100% indicates that all the messages that are classified as ham are actually ham and there are no false negative cases amongst them.

Regarding the rest of the measures, the false negative rate of 0% and the spam recall of 100% show that the algorithm did not cause a trade-off between the security of e-mail users and the deliverability of messages.

Overall, the proposed algorithm has been proven to be effective in increasing the deliverability of legitimate e-mails.

V. CONCLUSION AND FUTURE WORK

The proposed algorithm provides support for e-mail users to help them increase their e-mails’ deliverability. It can be integrated as an extra layer with existing e-mail clients or it can be implemented separately as a support tool.

Although the proposed algorithm has shown its potential in serving the aim of this research, there are several avenues for improvement. Further testing of the implemented part of the algorithm is required. In addition, the test sample could be expanded to include a wide range of e-mail messages from e-mail users of different backgrounds. Moreover, the e-mails used in the evaluation could be collected over a longer time span to cover all the possibilities.

REFERENCES

[1] J. Duntemann. (2004). Degunking Your Email, Spam, and Viruses. [Online]. Available: http://www.ebrary.com.
[2] S. Hershkop and S. Stolfo. “Combining Email Models for False Positive Reduction.” In the Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining, 2005, pp. 98-107.
[3] A. Ali and Y. Xiang. “Spam Classification Using Adaptive Boosting Algorithm.” In the Proceedings of the International Conference on Computer and Information Science, 2007, pp. 972-976.
[4] H. Yin and Z. Chaoyang. “An improved Bayesian Algorithm for Filtering Spam E-mail.” In the Proceedings of the International Symposium on Intelligence Information Processing and Trusted Computing, 2011, pp. 87-90.
[5] L. Zhen and Z. Ming-Tian. “Spam Filtering Issue: FPD Research between False Positive and False Negative.” In the Proceedings of the Fourth International Conference on Fuzzy Systems and Knowledge Discovery, 2007, pp. 526-534.
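As a rough illustration of how the client-side checks of Section III could run as a support tool, the following Python sketch implements a subset of them (Steps 2 to 7, 10, 15 and 20). It is not the authors' implementation: the trigger-word list, the regular expressions, and the names `Draft` and `check_draft` are assumptions made for this example, and Step 6 here only reports capitalized words instead of converting them.

```python
import re
from dataclasses import dataclass, field
from typing import List

# Hypothetical trigger words; the paper does not publish its list.
TRIGGER_WORDS = {"free", "winner", "guarantee", "urgent"}
MAX_RECIPIENTS = 20  # limit stated in Step 20

WORD = re.compile(r"[A-Za-z']+")
REPEATED_PUNCT = re.compile(r"[!?$*#]{2,}")                # e.g. "!!!" (Step 7)
IP_URL = re.compile(r"https?://\d{1,3}(?:\.\d{1,3}){3}")   # IP-form URL (Step 10)


@dataclass
class Draft:
    """A message about to be sent (hypothetical container type)."""
    subject: str
    body: str
    to: List[str] = field(default_factory=list)


def trigger_words(text: str) -> List[str]:
    """Return the trigger words found in text, sorted for stable output."""
    words = {w.lower() for w in WORD.findall(text)}
    return sorted(words & TRIGGER_WORDS)


def check_draft(draft: Draft) -> List[str]:
    """Run a subset of the Section III checks and return warnings for the user."""
    warnings: List[str] = []
    subject, body = draft.subject.strip(), draft.body.strip()

    if not subject:                                        # Step 2
        warnings.append("Add a subject.")
    found = trigger_words(subject)                         # Step 3
    if found:
        warnings.append("Subject trigger words: " + ", ".join(found))
    if not body:                                           # Step 4
        warnings.append("Add some text to the body.")
    found = trigger_words(body)                            # Step 5
    if found:
        warnings.append("Body trigger words: " + ", ".join(found))
    shouting = [w for w in WORD.findall(body) if len(w) > 1 and w.isupper()]
    if shouting:                                           # Step 6 (report only)
        warnings.append("Capitalized words: " + ", ".join(shouting))
    if REPEATED_PUNCT.search(body):                        # Step 7
        warnings.append("Reduce repeated punctuation or symbols.")
    if IP_URL.search(body):                                # Step 10 (IP form only)
        warnings.append("Replace the IP-form URL.")
    if subject and subject.lower() == body.lower():        # Step 15
        warnings.append("Subject and body are identical.")
    if len(draft.to) > MAX_RECIPIENTS:                     # Step 20
        warnings.append("Send to at most 20 recipients at a time.")
    return warnings


if __name__ == "__main__":
    draft = Draft(subject="", body="You are a WINNER!!!", to=["a@example.com"])
    for warning in check_draft(draft):
        print(warning)
```

Returning a list of warnings rather than editing the message keeps the user in control of the final content, which matches the advisory role described for the algorithm; the remaining steps (HTML validation, attachment types, address and blocklist checks) would need external data and are left out of this sketch.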
