Email classification via intention based Segmentation
Abstract—Email is the most popular means of personal and official communication among people and organizations. Due to the untrusted virtual environment, email systems may face frequent attacks such as malware, spamming and social engineering. Spamming is the most common malicious activity, where unsolicited emails are sent in bulk; these spam emails can be a source of malware and a waste of resources, and hence degrade productivity. In spam filter development, the most important challenge is to find the correlation between the nature of spam and the interest of the users, because the interests of users are dynamic. This paper proposes a novel dynamic spam filter model that considers the changes in the interests of users over time while handling spam activities. It uses intention-based segmentation to compare different segments of text documents instead of comparing them as a whole. The proposed spam filter is a multi-tier approach where initially the email content is divided into segments with the help of part-of-speech (POS) tagging based on voices and tenses. Further, the segments are clustered using hierarchical clustering and compared using the vector space model. In the third stage, concept drift is detected in the clusters to identify the change in the interest of the user. Finally, the classification of ham emails into various categories is done in the last stage. For the experiments, the Enron dataset is used and the obtained results are promising.

Index Terms—Concept drift; Intention-based Segmentation; Part-of-speech (POS) tagging; Vector space model; Hierarchical clustering; Spam.

I. INTRODUCTION

The first ever spam email was delivered on 3 May 1978 through ARPAnet, where about 2600 ARPAnet users were sent a message by a marketer, Gary Thuerk [1]. This incident demonstrated the power of electronic mail as an advertising platform to the world, but it also introduced a new malicious activity known as spamming. With the tremendous growth in the usage of email for personal and commercial purposes, spam filtering has become the most important pre-requisite of any email system. Conventionally, spam is considered to be unsolicited commercial email. The importance of these spam emails depends on the overall interest of the users too. Hence, a strategic spam filter is essential that can filter the emails with consideration of the users' interests. Meanwhile, judicial laws have also been made to fight against spam activities, such as the CAN-SPAM Act in the USA [2].

Many spam filtering techniques have been published during the last two decades [4], but because of technological advancements, spammers have found various alternate techniques of spamming, and the situation becomes worse even after continuous safety efforts. The spam filters developed during the last two decades can identify the spam emails which contain extra spaces, HTML tags, links to malicious web pages etc., but they do not consider the noise or the presence of concept drift in the emails [3]. Naturally, the interests of a person change frequently, so the class of an email also changes for a user with time. For example, during the placement season the interests of students are towards the subjects which are useful for their placements, so during that time emails related to those subjects are ham for the students; after the placements, the interests of the students change and they are more interested in emails related to entertainment, games etc. So the emails which were spam at the time of placements will no longer be spam after placements. This changing effect of interest has not been considered in most spam filtering tools. There is a wide variety of spam techniques present on the web, as shown in Fig. 1, but as a counterpart the spammers smartly develop new alternate techniques against the spam filters as they come into use [4].

Fig. 1. Techniques Used by Spammers

To classify an email as spam or ham, this paper proposes a novel method based on intention-based segmentation. Spam emails are usually commercial emails which are of less interest to the user or contain some malware content, whereas ham emails are of more interest (current interest) to the user. Conventionally, the whole email is processed to predict its nature, whereas the proposed intention-based segmentation approach works on segments of the content of the email. Ideally, processing the segments individually reflects more information about the content, which results in higher accuracy. There are two possible ways of segmentation: random and rule-based. Random selection of segments does not add descriptive information, so it is not suitable for segmentation. Hence, the segments should be generated based on the context of the sentence. In this approach, whenever the context of a sentence changes, it represents a boundary point between two segments, and the content should be divided at this point for further processing. After dividing the content into segments, the change in the content of emails with time is detected. This change is referred to as concept drift.
Topics present in spam

TABLE I
FEATURES AND COMMUNICATION MEANS

Tense:           present         past        future
Subject:         i/we            you         it/they/(s)he
Style:           interrogative   negative    affirmative
Status:          active          passive     -
Part of speech:  verb            noun        adjective
the other is a substantial difference between the two segments. The hypothesis is chosen based on the type of change to be detected. For detecting abrupt concept drift, a sub-sequence of the data is obtained. This sub-sequence is then divided into two parts: a training window and a testing window. Using some hypothesis, change is detected in the testing window. If no change appears in the testing window, it is added to the training window and a new testing window is introduced. The change is detected by analyzing the loss values of the testing window against the loss values of a window of the same size obtained by reshuffling the sub-sequence. At each iteration, the risk involved in the train-test split is also calculated with a value called the permutation loss.
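The following listing is a minimal sketch of such a resampling-based check, not the exact procedure of [7]; the number of permutations, the loss values passed in and the significance threshold are illustrative assumptions.

```python
import numpy as np

def permutation_drift_test(train_losses, test_losses, n_permutations=500, alpha=0.05):
    """Flag drift when the testing-window loss is unusually high compared with
    losses of equally sized windows drawn from the reshuffled sub-sequence."""
    observed_gap = np.mean(test_losses) - np.mean(train_losses)
    pooled = np.concatenate([train_losses, test_losses])
    n_test = len(test_losses)
    rng = np.random.default_rng(0)
    permuted_gaps = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reshuffle the sub-sequence
        permuted_gaps.append(np.mean(pooled[-n_test:]) - np.mean(pooled[:-n_test]))
    # Permutation loss: fraction of reshuffled splits at least as extreme.
    p_value = float(np.mean(np.array(permuted_gaps) >= observed_gap))
    return p_value < alpha  # True -> drift detected in the testing window
```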
Later, to solve the problem of spam emails, Lu et al. [8] proposed a concept drift detection technique represented as a model (x, y), where x is the feature vector and y is the data stream label. For drift detection, a two-sliding-windows model is used, where each window consists of data points taken from two infinite random distributions. This method not only ensures the absence of concept drift but also highlights the local regions where drift occurs. One strategy to detect the changes is through the data distribution, where two data samples are compared and checked as to whether they come from the same distribution or not. There are different tests to determine the relation among samples from different distributions. The authors proposed another concept drift detection strategy that finds the concept drift with the help of the learner's output. This model traces and controls the error rate of the online learning algorithm, where the errors are defined as samples of data drawn from Bernoulli trials and generalized using the binomial distribution. A significant change in the error rate indicates a change in the class distribution, and thus the concept drift is detected.
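As a hedged illustration of such a distribution-based check (a generic two-sample test, not the competence-model method of [8]), the contents of the two sliding windows can be compared as follows:

```python
from scipy.stats import ks_2samp

def windows_differ(reference_window, detection_window, alpha=0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test on a one-dimensional feature of the
    data points in the two sliding windows; a small p-value suggests the
    windows come from different distributions, i.e. a local concept drift."""
    _, p_value = ks_2samp(reference_window, detection_window)
    return p_value < alpha
```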
Later, Power et al. [9] suggested a methodology based on feature extraction to classify documents into categories. This method uses two metrics to classify a document: popularity and rarity. For a given topic, the feature set of the topic contains a set of popular words and a set of rare words which come under the topic. The authors tested their algorithm on a wide range of development-centric topics. Instead of using text similarity to compare the documents, the authors used a feature extraction algorithm, because the presence of topic ambiguity in the documents gave very low accuracy. This approach takes topics which tend to be very focused, and also takes care of the words and symbols present in local or regional languages so that documents do not lose their authenticity. It was observed that, to extract the features of a topic, the combined use of the popularity and rarity metrics gave higher accuracy compared to the use of either of them alone. In the first step, the focused topics are defined. Focused topics are those which have the following properties:
• The topic is not present on the web with a high frequency.
• Document overlap with other topics should be negligible, if not null.
In the next step, documents containing very little information and the popular terms present in the documents are removed, because they create noise in the data. After that, the frequency of every term is calculated using the Linguistic Data Consortium (LDC) dataset. Further, the TF-IDF value of a term is calculated using the following formula:

tf_idf(t) = tf(t) * log(N / N(t))    (1)

Here, tf(t) is the mean term frequency of t, N is the total number of documents and N(t) is the number of documents in which term t appears. A popular term is a term which satisfies the following constraints:

tf_idf(t) > T_th,   LDC(t) < P_max    (2)

Here, T_th represents the lower bound on TF-IDF values for popular terms, and P_max represents the upper bound on the count of popular terms. A rare term is a term which satisfies the following constraints:

tf_idf(t) > R_th,   LDC(t) < R_max    (3)

Here, R_th represents the lower bound on TF-IDF values for rare terms, and R_max represents the upper bound on the count of rare terms.

In the next step, the set of popular and rare terms is given as input to the feature extraction algorithm, which gives a two-dimensional vector as output containing a feature vector of the term and a weighting function for each of the terms. These vectors are given as input to any standard classifier, which gives the topic of the given document. In their research work, Naive Bayes and support vector machine classifiers were used.
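A minimal sketch of how Eqs. (1)-(3) can be applied is given below; the threshold values and the ldc_count lookup (standing in for the Linguistic Data Consortium frequencies) are illustrative assumptions, not values taken from the paper.

```python
import math

def tf_idf(term, mean_tf, doc_freq, n_docs):
    """tf_idf(t) = tf(t) * log(N / N(t)) as in Eq. (1); mean_tf[t] is the mean
    term frequency of t and doc_freq[t] is N(t)."""
    return mean_tf[term] * math.log(n_docs / doc_freq[term])

def split_popular_rare(terms, mean_tf, doc_freq, n_docs, ldc_count,
                       t_th=0.5, p_max=1_000_000, r_th=2.0, r_max=1_000):
    """Apply the constraints of Eqs. (2) and (3); thresholds are placeholders."""
    popular, rare = set(), set()
    for t in terms:
        score = tf_idf(t, mean_tf, doc_freq, n_docs)
        if score > t_th and ldc_count.get(t, 0) < p_max:
            popular.add(t)
        if score > r_th and ldc_count.get(t, 0) < r_max:
            rare.add(t)
    return popular, rare
```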
Many techniques related to the topic under consideration have been proposed earlier. The detailed literature survey shows that each technique has its limitations and drawbacks. Also, none of the techniques considers all the important factors of concept drift. The literature shows that the presence of concept drift in emails plays a very important part in finding the class of the emails because of the noise present in their content, while the presence of HTML tags, price tags, extra spaces etc. in the content also has a significant effect on finding the class of emails. Meanwhile, the intention-based segmentation analysis indicates that, instead of comparing two documents as a whole, segment-level analysis can give better results. It is also evident that a feature extraction algorithm gives higher accuracy compared to other standard classification approaches: classification algorithms should be applied to the features after feature extraction instead of being applied directly to the documents. Finally, instead of comparing the documents with plain text similarity, the TF-IDF scheme should be used because of the noise present in the documents.

III. PROPOSED METHODOLOGY

This paper proposes a novel spam filter model to detect spam emails via the concept drift that occurs in their content. Categorization of the ham emails has also been done, which makes it easier for the user to access them. In the proposed model, the classification of emails is performed in six steps, where at every step the content of the email is processed and compared with the rest of the previous emails. By analyzing the comparison between the emails, concept drift is detected and the class of the emails is decided. Most importantly, emails containing malware content are also taken care of: an email containing some pre-defined specific keywords is identified as a spam email. Fig. 2 shows the proposed methodology and a description of the steps is as follows:

Fig. 2. Proposed methodology
A. Checking for malware content

This is the first step towards finding the real class of the emails. In this step, the content of the current email is checked. If it contains some specific keywords stored in the database which represent malware content, or if it contains hyperlinks which are related to malware content present on the web, then the email is directly put into the spam emails and no further processing is done on that email.
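As a rough illustration of this check (the actual keyword database and link blacklist are not specified in the paper), the step could look like the sketch below; MALWARE_KEYWORDS and MALICIOUS_DOMAINS are hypothetical placeholders.

```python
import re

# Hypothetical lists; in the proposed system these would come from the keyword database.
MALWARE_KEYWORDS = {"account suspended", "verify your password", "invoice.exe"}
MALICIOUS_DOMAINS = {"malicious-example.com", "phish-example.net"}

def is_malicious(email_body: str) -> bool:
    """Return True if the email contains blacklisted keywords or links to
    known malicious pages; such an email is marked as spam immediately."""
    text = email_body.lower()
    if any(keyword in text for keyword in MALWARE_KEYWORDS):
        return True
    for domain in re.findall(r"https?://([^/\s]+)", text):
        if domain in MALICIOUS_DOMAINS:
            return True
    return False
```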
B. Parsing emails

The second step is parsing of the emails. In this step, the email is parsed, i.e. all the information related to the email is stored in separate variables, for example: date, sender, subject and content. After storing the content in its own variable, it is prepared for further processing. Extra spaces, HTML tags, price tags, exclamation symbols etc. found in the content are removed from the emails. Hyperlinks present in the content are also found and removed. The main idea behind removing the extra spaces, HTML tags, price tags and exclamation marks is that these are used by spammers to make a difference in the content of emails so that a spam filter recognises them as different and does not mark them as spam [1]. Removing these artefacts makes it easier for the proposed spam filter to focus mainly on the content. The idea behind removing the hyperlinks is that, once a hyperlink has been processed in the previous step, no further processing is required [10].
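A possible parsing and cleaning routine for this step is sketched below using the standard library email module; the exact regular expressions (e.g. what counts as a price tag) are assumptions made for illustration.

```python
import re
from email import message_from_string

def clean_content(content: str) -> str:
    """Strip the artefacts spammers use to disguise content: HTML tags,
    hyperlinks, price tags, exclamation marks and extra spaces."""
    content = re.sub(r"<[^>]+>", " ", content)                   # HTML tags
    content = re.sub(r"https?://\S+", " ", content)              # hyperlinks
    content = re.sub(r"[$€£]\s?\d+(?:[.,]\d+)?", " ", content)   # price tags
    content = content.replace("!", " ")                          # exclamation marks
    return re.sub(r"\s+", " ", content).strip()                  # extra spaces

def parse_email(raw_message: str) -> dict:
    """Store each piece of information of the email in its own field
    (simplified: assumes a plain-text, non-multipart message)."""
    msg = message_from_string(raw_message)
    return {"date": msg["Date"], "sender": msg["From"], "subject": msg["Subject"],
            "content": clean_content(msg.get_payload())}
```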
C. Content segmentation

In this step, the email content is divided into segments. This involves the following three stages: tokenization, tagging and classification (as shown in Fig. 3; a brief sketch of these stages is given after the figure).
a. Tokenization: This is the process of dividing the email content into tokens; sometimes the difference between a token and a type has to be recognized. A token is defined as a group of characters which has a particular meaning in some particular document, whereas a type is defined as the class of a group of characters having the same character sequence [11].
b. Tagging: Tagging is the process where all words present in the document are marked with a particular part of speech. This process is known as POS (part-of-speech) tagging, grammatical tagging or word-category disambiguation. A word is marked based on its definition and context, i.e. how it is related to its adjacent or similar words in the document. POS tagging is essentially the identification of nouns, verbs, adjectives, adverbs, etc. [12].
c. Classification: Classification is performed based on tenses (present, past or future) or voices (active or passive). If a sentence contains modal verbs then it is categorized as future tense, whereas if it contains a verb in the past or past participle form then it is categorized as past tense. If the sentence contains a verb in the present or present participle form, then it is claimed to be in the present tense [9]. The voice of the sentence can be determined after removal of all the available adverbs. If the sentence has a preposition or subordinating conjunction, and after the first word the sentence has a wh-determiner, wh-pronoun, possessive wh-pronoun or wh-adverb as part-of-speech tags, then the sentence is in passive voice. If the sentence has 'be', 'am', 'is', 'are', 'was', 'were', 'been', 'has', 'have', 'had', 'do', 'did', 'does', 'can', 'could', 'shall', 'should', 'will', 'would', 'may', 'might' or 'must' together with a verb in past participle form as part of speech, then the sentence is also in passive voice. If, in place of the past participle verb, after one word the sentence has a verb in past tense, a verb in present tense with third person singular, a verb in present tense with non-third person singular, or after two words it has a verb in present participle form, a verb in past tense, a verb in past participle form, a verb in present tense with non-third person singular, a verb in present tense with third person singular, or the base form of a verb as part-of-speech tags, then it is in active voice. If the sentence is too short (less than two words), it is assumed to be in active voice [13].

Fig. 3. Segmentation of content
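The listing below sketches the three stages with NLTK; the tense and voice rules are a simplified reading of the description above (not the authors' exact rule set), and the required NLTK models are assumed to be downloaded.

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def tag_sentence(sentence: str):
    """Tokenization followed by POS tagging, e.g. [('He', 'PRP'), ('ran', 'VBD')]."""
    return nltk.pos_tag(nltk.word_tokenize(sentence))

def detect_tense(tagged):
    """Simplified tense rule: modal -> future, past/past participle verb -> past,
    otherwise present."""
    tags = [tag for _, tag in tagged]
    if "MD" in tags:
        return "future"
    if "VBD" in tags or "VBN" in tags:
        return "past"
    return "present"

AUXILIARIES = {"be", "am", "is", "are", "was", "were", "been",
               "has", "have", "had", "do", "did", "does",
               "can", "could", "shall", "should", "will",
               "would", "may", "might", "must"}

def detect_voice(tagged):
    """Simplified voice rule: an auxiliary followed later by a past participle
    (VBN) is treated as passive; very short sentences default to active."""
    if len(tagged) < 2:
        return "active"
    for i, (word, _) in enumerate(tagged[:-1]):
        if word.lower() in AUXILIARIES and any(t == "VBN" for _, t in tagged[i + 1:]):
            return "passive"
    return "active"
```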
D. Grouping of Segments

In this phase, groups of the obtained segments are formed using hierarchical clustering (Fig. 4). Initially, each segment is taken as an individual cluster, and then the difference between clusters is calculated. The clusters having a difference less than a pre-defined threshold are merged [14]. The difference between the clusters is calculated via cosine similarity and Euclidean distance using the vector space model.

Fig. 4. Hierarchical Clustering

Two clusters are merged repeatedly when the value of |cos θ| is less than 0.3 and the Euclidean distance between them is less than a pre-defined threshold. Fig. 5 shows the cosine similarity between two documents [15]. The newly created cluster is represented by a new vector. The similarity formula between two documents is as below:

sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)    (4)

The Euclidean distance [16] between two vectors is calculated using the following formula:

D = sqrt( Σ (Vi − Vj)^2 )    (5)

Fig. 5. Cosine similarity
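A compact sketch of this merging step over segment vectors (e.g. TF-IDF vectors) is shown below; the Euclidean threshold DIST_TH is an assumed placeholder, and the merge condition is taken literally from the text (|cos θ| < 0.3 together with a small Euclidean distance).

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Eq. (4): sim(d1, d2) = V(d1).V(d2) / (|V(d1)| |V(d2)|)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def euclidean_distance(v1: np.ndarray, v2: np.ndarray) -> float:
    """Eq. (5): D = sqrt(sum (Vi - Vj)^2)."""
    return float(np.linalg.norm(v1 - v2))

DIST_TH = 1.0  # assumed placeholder for the pre-defined Euclidean threshold

def merge_pass(clusters):
    """One agglomerative pass: merge the first pair of cluster vectors that
    satisfies the stated condition; the merged cluster is represented by a
    new (mean) vector. Callers repeat until no pair can be merged."""
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if (abs(cosine_similarity(clusters[i], clusters[j])) < 0.3
                    and euclidean_distance(clusters[i], clusters[j]) < DIST_TH):
                merged = (clusters[i] + clusters[j]) / 2.0
                rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
                return rest + [merged], True
    return clusters, False  # no pair satisfies the merge condition
```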
E. Finding concept drift in emails

The final number of generated clusters represents the number of topics. If all the segments of an email are present in clusters in which the number of segments is less than a threshold, then that email is marked as a spam email, and the rest of the emails are put into the ham category. The segments made in the third step are mapped to their emails and stored. There are two advantages of this approach: firstly, it avoids unnecessary processing of duplicate emails and, secondly, it eases the process of finding the segments related to an email, hence reducing the computation cost. In the experiments, steps 3 to 5 are performed by comparing the email with the stored emails of the last 30 days (the window size is 30 days). As the stored segments and clusters are reused for processing, the process is fast because only the current email has to be processed and put into the clusters.
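The decision rule of this step can be written compactly as below; MIN_CLUSTER_SIZE, the 30-day window constant and the mapping structures are illustrative assumptions about how the stored segments might be organised.

```python
from datetime import datetime, timedelta

MIN_CLUSTER_SIZE = 3          # assumed threshold on the number of segments per cluster
WINDOW = timedelta(days=30)   # sliding window used in the experiments

def classify_by_drift(email_segments, segment_to_cluster, cluster_sizes):
    """Mark the email as spam if every one of its segments falls in a cluster
    smaller than the threshold; otherwise treat it as ham."""
    if all(cluster_sizes[segment_to_cluster[seg]] < MIN_CLUSTER_SIZE
           for seg in email_segments):
        return "spam"
    return "ham"

def within_window(email_date: datetime, now: datetime) -> bool:
    """Only emails (and their stored segments) from the last 30 days are compared."""
    return now - email_date <= WINDOW
```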
F. Classification of ham messages

In this step, ham messages are divided into different categories, which makes it easier for the user to access them. Different classification algorithms are used to classify the messages. The first step of each algorithm is to detect the features from the documents (messages). After feature extraction, the classifiers are trained and the samples are tested. The features used are the n-gram range and TF-IDF values [16].
• n-gram range: the n-gram range refers to the size of the n-grams which should be considered by the classifier, i.e. whether it considers bigrams, trigrams etc. The larger the n-gram size, the more complex word sequences the classifier can handle.
• TF-IDF value: the TF-IDF value is based on the term frequency in the document, which can be calculated by dividing the count of the term by the total number of terms in the document.

The "20newsgroups" dataset is used to train the classifier. This dataset contains 20000 news articles, stored in folders of their respective categories. Twenty-three categories have been defined in this dataset. For every news article present in the database, the content of the news article and its category are extracted. Then the TF-IDF value for every term present in the news article is found, and the classifier is trained on the TF-IDF values of the terms. TF-IDF values are used instead of raw term counts because counts create a problem when there is a big difference in the size of documents. Suppose a document has 10000 words, of which 2000 are related to a topic and represent that topic, and there is another document in the dataset which contains 1000 words, of which 200 are related to the same topic. If the word count is taken as the parameter, the classifier does not predict the category correctly [15]. A grid-based search method for parameter tuning is used. Unigrams and bigrams have been considered for our classifier, and the TF-IDF value threshold is in the range 0.5 to 1. In parameter tuning, the classifier is tested for every combination of parameters and the accuracy is calculated. The set of parameters which gives the best accuracy is used in the classifier. Alongside changing the values of the parameters, different classifiers can also be tested, which gives the best possible combination. This process takes more time, as every set of parameters has to be tested, but gives better results compared to the other classifiers [17]. Fig. 6 shows all the categories into which ham messages are divided.
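A hedged scikit-learn sketch of this stage is shown below. The pipeline, the interpretation of the TF-IDF threshold as the vectorizer's max_df parameter, and the exact parameter grid are assumptions for illustration; only the use of unigrams/bigrams, TF-IDF features, a grid search and a Naive Bayes classifier follow the description above.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Train on the 20newsgroups articles and grid-search the feature parameters.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "tfidf__max_df": [0.5, 0.75, 1.0],        # assumed reading of the 0.5-1.0 threshold
    "clf__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
search.fit(train.data, train.target)
print(search.best_params_, search.best_score_)
```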
IV. RESULT AND DISCUSSION

A. Dataset Description

The proposed algorithm has been tested on two datasets: the Enron email dataset, and personal emails obtained from a Gmail account as a stream by using Python libraries. A description of both datasets is as below:
a. Enron email Dataset: The Enron email dataset was published by the Federal Energy Regulatory Commission during its investigation of Enron's collapse. This dataset contains approximately 50,000 emails of the Enron Corporation and has two fields:
Fig. 6. Ham messages categories

TABLE II
RELATION BETWEEN PARAMETER VALUES AND ACCURACY

Fig. 7. Segmentation of email contents
V. CONCLUSION AND FUTURE WORK

In this paper, a method based on intention-based segmentation is proposed to classify an email. Two approaches have been used to determine the intention of sentences: tense-based (past, present, future) and voice-based (active and passive). These two approaches are closely related to the syntax and grammar of the English language. To compare two documents, the vector space model is used. A sliding-window technique is used to detect the concept drift in email datasets, which helps to give the real class of emails with better accuracy in less time. The emails are compared by their segments of the same intention, instead of being compared as a whole. This approach takes every small detail of the emails into account and ensures good performance. The concept drift in emails describes the interests of the user. There are methods other than tense-based and voice-based segmentation which can be used for content segmentation, and these can also be used for text summarization and other text-based processes. The objective of the paper is to classify an email with minimum computation resources, hence Naive Bayes and SVM are used for classification. Other advanced machine learning approaches can be used in future to achieve higher accuracy with proper parameter tuning, but need more computational resources.
REFERENCES

[1] D. Ruano-Ordás, F. Fdez-Riverola and J. R. Méndez, "Concept drift in e-mail datasets: An empirical study with practical implications", Information Sciences, vol. 428, pp. 120-135, 2018.
[2] A. Bhowmick and S. Hazarika, "E-Mail Spam Filtering: A Review of Techniques and Trends", 2018, doi: 10.1007/978-981-10-4765-7_61.
[3] W. A. Awad and S. M. ELseuofi, "Machine Learning Method for Spam Email Classification", International Journal of Computer Science and Information Technology, vol. 3, no. 1, 2011.
[4] O. M. Saad, A. Darwish and R. Faraj, "A survey of machine learning techniques for Spam filtering", 2019.
[5] Y. Sun, Z. Wang, Y. Bai, H. Dai and S. Nahavandi, "A Classifier Graph Based Recurring Concept Detection and Prediction Approach", Computational Intelligence and Neuroscience, pp. 1-13, 2018.
[6] D. Papadimitriou, G. Koutrika, Y. Velegrakis and J. Mylopoulos, "Finding Related Forum Posts through Content Similarity over Intention-Based Segmentation", IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 9, pp. 1860-1873, 2017.
[7] M. Harel, K. Crammer, R. El-Yaniv and S. Mannor, "Concept Drift Detection Through Resampling", International Conference on Machine Learning, Beijing, China, 2014, pp. 1324-1334.
[8] N. Lu, G. Zhang and J. Lu, "Concept drift detection via competence models", International Conference on Artificial Intelligence, China, 2014, pp. 11-28.
[9] R. Power, J. Chen, T. Karthik and L. Subramanian, "Document Classification for Focused Topics", International Conference on Machine Learning, NY, USA, 2013, pp. 123-136.
[10] S. Roy, A. Patra, S. Sau, K. Mandal and S. Kunar, "An Efficient Spam Filtering Technique for Email Account", American Journal of Engineering Research, pp. 63-73, 2013.
[11] R. J. Passonneau and D. J. Litman, "Intention-based segmentation: Human reliability and correlation with linguistic cues", in Proc. Annu. Meet. Association Comput. Linguistics, 1993, pp. 148-155.
[12] J. C. Reynar, "Topic Segmentation: Algorithms and Applications", University of Pennsylvania, August 1998.
[13] D. M. Blei, A. Y. Ng and M. I. Jordan, "Latent Dirichlet allocation", J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.
[14] L. Weng et al., "Query by document via a decomposition-based two-level retrieval approach", Proc. 34th Int. ACM SIGIR Conf. Res. Development Inf. Retrieval, pp. 505-514, Jul. 2011.
[15] F. C. Heilbron, V. Escorcia, B. Ghanem and J. C. Niebles, "Modern Information Retrieval", Addison-Wesley Longman Publishing Co., pp. 961-970, 2007.
[16] S.-H. Cha, S. Yoon and C. C. Tappert, "Enhancing Binary Feature Vector Similarity Measures", CSIS Technical Reports, 18, 2005.
[17] K. Vasa, "Text Classification through Statistical and Machine Learning Models: A Survey", International Journal of Engineering Design Research, vol. 4, 2016.
[18] V. Bobicev, "Text classification: The case of multiple labels", 2016 International Conference on Communications (COMM), Bucharest, 2016, pp. 39-42.
[19] Kaggle, https://www.kaggle.com/wcukierski/enron-email-dataset [Accessed March 20, 2020].
[20] Jeffrey, http://jeffreyfossett.com/ [Accessed March 20, 2020].
[21] D. Scott, "Sturges' rule", Wiley Interdisciplinary Reviews: Computational Statistics, vol. 1, pp. 303-306, 2009, doi: 10.1002/wics.35.