Proc. EECSI 2020 - 1-2 October 2020

Email classification via intention-based segmentation
Sanjay Kumar Sonbhadra, Department of IT, IIIT, Allahabad, Prayagraj, India
Sonali Agarwal, Department of IT, IIIT, Allahabad, Prayagraj, India
Mohammad Syafrullah, Program of master in CS, Universitas Budi Luhur, Jakarta, Indonesia
Krisna Adiyarta, Program of master in CS, Universitas Budi Luhur, Jakarta, Indonesia

Authorized licensed use limited to: University of Dhaka. Downloaded on September 01,2021 at 08:58:47 UTC from IEEE Xplore. Restrictions apply.

Abstract—Email is the most popular means of personal and official communication among people and organizations. Because the virtual environment is untrusted, email systems face frequent attacks such as malware, spamming and social engineering. Spamming is the most common malicious activity: unsolicited emails are sent in bulk, and these spam emails can carry malware and waste resources, which degrades productivity. In spam filter development, the most important challenge is to find the correlation between the nature of spam and the interests of the users, because those interests are dynamic. This paper proposes a novel dynamic spam filter model that considers the changes in users' interests over time while handling spam. It uses intention-based segmentation to compare segments of text documents instead of comparing the documents as a whole. The proposed spam filter is a multi-tier approach. First, the email content is divided into segments with the help of part-of-speech (POS) tagging based on voices and tenses. Next, the segments are clustered using hierarchical clustering and compared using the vector space model. In the third stage, concept drift is detected in the clusters to identify changes in the interest of the user. In the last stage, ham emails are classified into various categories. The Enron dataset is used for the experiments, and the obtained results are promising.

Index Terms—Concept drift; Intention-based Segmentation; Part of speech (POS) tagging; Vector space model; Hierarchical clustering; Spam.

I. INTRODUCTION

The first ever spam email was delivered on 3 May 1978 through ARPAnet, when a marketer, Gary Thuerk, sent a message to several hundred of the roughly 2600 ARPAnet users [1]. This incident demonstrated the power of electronic mail as an advertising platform, but it also introduced a new malicious activity known as spamming. With the tremendous growth in the use of email for personal and commercial purposes, spam filtering has become the most important prerequisite of any email system. Conventionally, spam is considered to be unsolicited commercial email. The importance of a spam email also depends on the overall interests of the user. Hence, a strategic spam filter is essential that filters emails with consideration of the user's interests. Meanwhile, laws have also been enacted to fight spam, such as the CAN-SPAM Act in the USA [2].

Many spam filtering techniques have been published during the last two decades [4], but because of technological advancements, spammers have found alternative spamming techniques, and the situation has worsened despite continuous safety efforts. The spam filters developed during the last two decades can identify spam emails that contain extra spaces, HTML tags, links to malicious web pages, etc., but they do not consider noise or the presence of concept drift in the emails [3]. Naturally, the interests of a person change frequently, so the class of an email also changes for a user over time. For example, during placement season, students are interested in the subjects that are useful for their placements, so emails related to those subjects are ham for them; after the placements, their interests change and they become more interested in emails related to entertainment, games, etc. Emails that were spam during the placement season are therefore no longer spam afterwards. This changing effect of interest has not been considered in most spam filtering tools. A wide variety of spam techniques are present on the web, as shown in Fig. 1, and spammers in turn develop new techniques against spam filters as they come into use [4].

Fig. 1. Techniques Used by Spammers

To classify an email as spam or ham, this paper proposes a novel method based on intention-based segmentation. Spam emails are usually commercial emails that are of little interest to the user or that contain malware, whereas ham emails are of more (current) interest to the user. Conventionally, the whole email is processed to predict its nature, whereas the proposed intention-based segmentation approach works on segments of the email content. Ideally, processing the segments individually reflects more information about the content, which results in higher accuracy. There are two possible ways of segmentation: random and rule-based. Random selection of segments does not add descriptive information, so it is not suitable for segmentation. Hence, the segments should be generated based on the context of the sentence. In this approach, whenever the context of the
sentence changes, it represents a boundary point between two segments, and the content should be divided at this point for further processing. After the content is divided into segments, the change in the content of emails over time is detected. This change is referred to as concept drift. Topics present in spam or ham emails change with time for a particular user. These changes can be gradual, abrupt or recurring. For example, a user has different interests in different weather; such changes are called recurring changes. A student moving from school to college undergoes a gradual change in interests, which is an example of gradual concept drift. If a student fails a class, the resulting abrupt change in interests is known as sudden concept drift.

Along with finding the class of emails, the current interests of the user can be identified by detecting concept drift in the emails, which makes the ham emails easier to access. Generally, emails contain gradual or recurring concept drifts, whereas abrupt concept drifts are rare. Different methods detect different types of concept drift. Sudden drifts are detected with statistical hypothesis tests. Gradual concept drift can be detected using the early drift detection method. There are also probability-based and graph-based methods to detect recurring and incremental concept drift [5].

The rest of the paper is organized as follows: Section II explains existing work in the fields of spam detection and concept drift detection, whereas Section III describes the proposed methodology for concept drift detection in email datasets. The experimental results and performance analysis are covered in Section IV. The last section contains concluding remarks and the future scope of the proposed work.

II. RELATED WORK

In past decades, several machine learning based spam filtering approaches have been proposed. A wide variety of techniques have also been used to find concept drift in emails in order to build adaptive spam filters. Ruano et al. [1] proposed a concept drift based methodology to detect spam messages. In this approach, email messages are divided into two types: spam and ham. Spam messages are defined as messages of less interest, whereas ham messages are those of more interest to the user. A set of grammar rules is defined, and a finite state machine (FSM) is created to detect each type of concept drift. Every FSM has five states: a ready (start) state, two intermediate states, and two final states, 'concept drift found' and 'not found'. For every set of input states, transitions are recorded and the stack is updated. After the input string ends, the current state of the machine is recorded. If it is a non-final state, the possible types of concept drift are checked. If it is the 'not found' state, no concept drift is found; if it is the 'found' state, the specific type of concept drift for which the FSM was generated is found. To handle various HTML tags and useless spaces, tokenization is performed using Perl's regex libraries. The content of the email is represented in JavaScript Object Notation (JSON) format, which is easy to read. The JSON format represents the file in tree form, from which content extraction is very easy with the help of regex libraries. To find the drift in the concept, a database of 11700 topics is used. A topic is defined for every email body from this database. Then every email is labelled with the topics included in the body of the mail. This set of topics is used as the input set for the state machine transitions.

Papadimitriou et al. [6] proposed a method for finding the posts that are of interest to the user. Initially, users fill a forum in which reference posts are presented. Based on the likes or dislikes on the reference posts, other posts are compared with the reference posts and ranked. The posts are not compared as a whole. Every segment of each post is considered, e.g. who posted the content, what the topic of the post is, what the hashtags are, whether there is a picture in the post and who is tagged in the post; the content of the post is also divided into segments. To divide the content into segments, a greedy approach is applied in which each word is initially taken as a segment and then, in each iteration, the border with the worst score is removed. Table I lists the features and communication means used for segmentation.

TABLE I
FEATURES AND COMMUNICATION MEANS

Feature          Values
Tense            present, past, future
Subject          i/we, you, it/they/(s)he
Style            interrogative, negative, affirmative
Status           active, passive
Part of speech   verb, noun, adjective

A border is defined as a point where two different segments meet, whereas the score of a border is defined as the difference in concept between the segments to the left and right of the border. Borders whose score is lower than a pre-defined threshold are removed from the content. After removal of all such borders, the segments of the body of the post are merged with the other segments of the post. The response of the user to the posts that have previously appeared on the user's timeline is also recorded. Based on this response, a score is calculated and used as the rank of the post, which is used to improve the prediction system.

Afterwards, Harel et al. [7] proposed a concept drift detection method based on the change in concept for some defined hypothesis. The authors defined a methodology that detects only sudden and gradual concept drifts and does not detect any false changes in the concept. The computational complexity of the method depends on the choice of the hypothesis. For example, in a book store, books are given ratings based on the reviews given to them. If the dictionary of ratings contains 'bad', 'good' and 'informative', a change from 'fantasy' to 'educational' books is easily detected because both are correlated with the word 'informative'. If the dictionary of ratings contains 'bad', 'good' and 'kindle', a change to 'kindle' is not easily detected because it is not correlated with the other sentiments. During the detection process, a sub-sequence is taken and changes are evaluated based on some hypothesis. In this research work, they considered two hypotheses: one is equality of the average on sequential test segments, whereas
the other is a substantial difference between the two segments. The hypothesis is chosen based on the type of change to be detected. To detect abrupt concept drift, a sub-sequence of the data is obtained. This sub-sequence is divided into two parts: a training window and a testing window. Then, using some hypothesis, change is detected in the testing window. If no change appears in the testing window, it is added to the training window and a new testing window is introduced. The change is detected by analyzing the loss values of the testing window and the loss values of a window of the same size obtained by reshuffling the sub-sequence. At each iteration, the risk involved in the test-train split is also calculated as a value called the permutation loss.

Later, to solve the problem of spam emails, Lu et al. [8] proposed a concept drift detection technique represented as a model (x, y), where x is the feature vector and y is the data stream label. For drift detection, a two-sliding-windows model is used, where each window consists of data points taken from two infinite random distributions. This method not only ensures the absence of concept drift, but also highlights the local regions where drift occurs. One strategy to detect changes is through the data distribution, where two data samples are compared to check whether or not they come from the same distribution. There are different tests to determine the relation among samples of different distributions. Another concept drift detection strategy is to find the drift with the help of the learner's output. This model traces and controls the error rate of the online learning algorithm, where the errors are treated as samples from Bernoulli trials and generalized using the binomial distribution. A significant change in the error indicates a change in the class distribution, and concept drift is detected.

Later, Power et al. [9] suggested a methodology based on feature extraction to classify documents into categories. This method uses two metrics to classify a document: popularity and rarity. For a given topic, the feature set of the topic contains a set of popular words and a set of rare words that come under the topic. The authors tested their algorithm on a wide range of development-centric topics. Instead of using text similarity to compare the documents, the authors used a feature extraction algorithm, because topic ambiguity in the documents gave text similarity very low accuracy. This approach takes topics that tend to be very focused, and it also takes care of the words and symbols present in local or regional languages so that documents do not lose their authenticity. It was observed that, to extract the features of a topic, the combined use of the popularity and rarity metrics gave higher accuracy than the use of either of them alone. In the first step, the focused topics are defined. Focused topics are those which have the following properties:

• The topic is not present on the web with a high frequency.
• Document overlap with other topics should be negligible if not null.

In the next step, documents containing very little information and the popular terms present in the documents are removed, because they create noise in the data. After that, the frequency of every term is calculated using the Linguistic Data Consortium dataset. Further, the TF-IDF value of the term is calculated using the following formula:

\[ \mathrm{tfidf}(t) = \mathrm{tf}(t) \cdot \log\left(\frac{N}{N(t)}\right) \tag{1} \]

Here, tf(t) is the mean term frequency of t, N is the total number of documents and N(t) is the number of documents in which term t appears. A popular term is a term that satisfies the following constraints:

\[ \mathrm{tfidf}(t) > T_{th}, \quad \mathrm{LDC}(t) < P_{max} \tag{2} \]

Here, $T_{th}$ represents the lower bound of TF-IDF values for popular terms, and $P_{max}$ represents the upper bound of the count of popular terms. A rare term is a term that satisfies the following constraints:

\[ \mathrm{tfidf}(t) > R_{th}, \quad \mathrm{LDC}(t) < R_{max} \tag{3} \]

Here, $R_{th}$ represents the lower bound of TF-IDF values for rare terms, and $R_{max}$ represents the upper bound of the count of rare terms.

In the next step, the set of popular and rare terms is given as input to the feature extraction algorithm, which outputs a two-dimensional vector containing a feature vector of the terms and a weighting function for each of the terms. These vectors are given as input to any standard classifier, which outputs the topic of the given document. In their research work, Naive Bayes and support vector machine classifiers are used.

Many techniques related to the topic under consideration have been proposed earlier. The detailed literature survey shows that each technique has its limitations and drawbacks. Also, none of the techniques considers all the important factors of concept drift. The literature shows that the presence of concept drift in emails plays a very important part in finding the class of the emails, because of the noise present in the content, and that the presence of HTML tags, price tags, extra spaces, etc. in the content has a significant effect on finding the class of emails. Meanwhile, the intention-based segmentation analysis indicates that, instead of comparing two documents as a whole, segment analysis can give better results. It is also evident that a feature extraction algorithm gives higher accuracy compared to ordinary classification algorithms. Classification algorithms should be applied to the features after feature extraction instead of directly to the documents. It is also found that, instead of comparing the documents by text similarity, the TF-IDF scheme should be used because of the noise present in the documents.

III. PROPOSED METHODOLOGY

This paper proposes a novel spam filter model that detects spam emails via the concept drift occurring in the content. The ham emails are also categorized, which makes it easier for the user to access them. In the proposed model, the classification of emails is performed in six steps, where at every step the content of the email is processed and compared with the previous emails. By analyzing
the comparison between the emails, concept drift is detected and the class of the emails is decided. Most importantly, emails containing malware content are also handled: an email containing certain pre-defined keywords is identified as spam. Fig. 2 shows the proposed methodology, and the steps are described as follows:

Fig. 2. Proposed methodology

A. Checking for malware content

This is the first step towards finding the real class of the emails. In this step, the content of the current email is checked. If it contains specific keywords stored in the database that represent malware content, or if it contains hyperlinks related to malware content present on the web, then the email is directly put into the spam emails and no further processing is done on that email.

B. Parsing emails

The second step is parsing of the emails. The email is parsed, i.e. each piece of information related to the email (for example date, sender, subject and content) is stored in a separate variable. After the content is stored in its own variable, it is processed further. Extra spaces, HTML tags, price tags, exclamation symbols, etc. are found in the content and removed from the emails. Hyperlinks present in the content are also found and removed. The main idea behind removing the extra spaces, HTML tags, price tags and exclamation marks is that spammers use them to vary the content of emails so that spam filters recognise the emails as different and do not mark them as spam [1]. Removing them makes it easier for the proposed spam filter to focus mainly on the content. The idea behind removing the hyperlinks is that, once a hyperlink has been processed in the previous step, no further processing is required [10].

C. Content segmentation

In this step, the email content is divided into segments in three stages: Tokenization, Tagging and Classification (as shown in Fig. 3).

a. Tokenization: This is the process of dividing the email content into tokens; sometimes the difference between a token and a type has to be recognized. A token is defined as a group of characters that has a particular meaning in some particular document, whereas a type is defined as the class of a group of characters having the same character sequence [11].

b. Tagging: Tagging is the process in which all words present in the document are marked with a particular part of speech. This process is known as POS (part-of-speech) tagging, grammatical tagging or word-category disambiguation. A word is marked based on its definition and context, i.e. how it is related to its adjacent or similar words in the document. POS tagging is similar to the process of identifying nouns, verbs, adjectives, adverbs, etc. [12].

c. Classification: Classification is performed based on tenses (present, past or future) or voices (active or passive). If a sentence contains modal verbs, it is categorized as future tense, whereas if it contains a verb in past or past participle form, it is categorized as past tense. If the sentence contains a verb in present or present participle form, it is taken to be in present tense [9]. The voice of a sentence is determined after removing all adverbs. If the sentence has a preposition or subordinating conjunction and, after the first word, a wh-determiner, wh-pronoun, possessive wh-pronoun or wh-adverb among its part-of-speech tags, then the sentence is in passive voice. If the sentence has one of 'be', 'am', 'is', 'are', 'was', 'were', 'been', 'has', 'have', 'had', 'do', 'did', 'does', 'can', 'could', 'shall', 'should', 'will', 'would', 'may', 'might', 'must' together with a past participle verb as part of speech, then the sentence is also in passive voice. If, in place of the past participle verb, the sentence has after one word a past tense verb, a third person singular present tense verb or a non-third person singular present tense verb, or has after two words a present participle verb, a past tense verb, a past participle verb, a non-third person singular present tense verb, a third person singular present tense verb or a base form verb among its part-of-speech tags, then it is in active voice. If the sentence is too short (fewer than two words), it is assumed to be in active voice [13].

Fig. 3. Segmentation of content

D. Grouping of Segments

In this phase, groups of the obtained segments are formed using hierarchical clustering (Fig. 4). Initially, each segment is taken as an individual cluster, and then the difference between clusters is calculated. Clusters whose difference is less than a pre-defined threshold are merged [14]. The difference between
the clusters is calculated via cosine similarity and Euclidean distance using the vector space model.

Fig. 4. Hierarchical Clustering

Two clusters are merged repeatedly when the value of | cos θ | is less than 0.3 and the Euclidean distance between them is less than a pre-defined threshold. Fig. 5 shows the cosine similarity between two documents [15]. The newly created cluster is represented by a new vector. The similarity between two documents is given by:

\[ \mathrm{sim}(d_1, d_2) = \frac{\vec{V}(d_1) \cdot \vec{V}(d_2)}{|\vec{V}(d_1)|\,|\vec{V}(d_2)|} \tag{4} \]

The Euclidean distance [16] between two vectors is calculated using the following formula:

\[ D = \sqrt{\sum (V_i - V_j)^2} \tag{5} \]

Fig. 5. Cosine similarity

E. Finding concept drift in emails

The final number of generated clusters represents the number of topics. If all the segments of an email are present in clusters whose number of segments is less than a threshold, then that email is marked as spam, and the rest of the emails are put into the ham category. The segments made in the third step are mapped to the emails and stored. This approach has two advantages: firstly, it avoids unnecessary processing of duplicate emails, and secondly, it eases the process of finding the segments related to an email and hence reduces the computation cost. In the experiments, steps 3 to 5 are performed by comparing the email with the stored emails of the last 30 days (the window size is 30 days). Because the stored segments and clusters are used for processing, the process is fast: only the current email has to be processed and put into a cluster.

F. Classification of ham messages

In this step, ham messages are divided into different categories, which makes it easier for the user to access them. Different classification algorithms are used to classify the messages. The first step of each algorithm is to extract the features from the documents (messages). After feature extraction, classifiers are trained and samples are tested. The features used are the n-gram range and TF-IDF values [16].

• n-gram range: The n-gram range refers to the size of the n-grams that the classifier should consider, i.e. whether it considers bigrams, trigrams, etc. The larger the n-gram, the more complex the word sequences the classifier can handle.
• TF-IDF value: The TF-IDF value builds on the term frequency in the document, which can be calculated by dividing the count of the term by the total number of terms in the document; this term frequency is then weighted by the inverse document frequency as in (1).

The "20newsgroups" dataset is used to train the classifier. This dataset contains 20000 news articles, stored in folders of their respective categories. Twenty-three categories have been defined in this dataset. For every news article present in the database, the content of the article and its category are extracted. Then the TF-IDF value of every term present in the article is found, and the classifier is trained on the TF-IDF values of the terms. TF-IDF values are used instead of raw term counts because raw counts create a problem when there is a big difference in the size of documents. Suppose a document has 10000 words, of which 2000 are related to a topic and represent that topic, and another document in the dataset has 1000 words, of which 200 are related to the same topic. Both documents devote the same share (20%) of their words to the topic, but their raw counts differ by a factor of ten. So, if the word count is taken as the parameter, the classifier does not predict the category correctly [15]. A grid-based search method is used for parameter tuning. Unigrams and bigrams have been considered for our classifier, and the TF-IDF value threshold is taken from the range 0.5 to 1. In parameter tuning, the classifier is tested for every combination of parameters and the accuracy is calculated. The set of parameters that gives the best accuracy is used in the classifier. Alongside changing the values of parameters, different classifiers and values can also be tested, which gives the best possible combination. This process takes more time, as every set of parameters has to be tested, but gives better results compared to the other classifiers [17]. Fig. 6 shows all the categories into which ham messages are divided.

IV. RESULT AND DISCUSSION

A. Dataset Description

The proposed algorithm has been tested on two datasets: the Enron email dataset, and personal emails obtained from a Gmail account as a stream by using Python libraries. The two datasets are described below:

a. Enron email Dataset: The Enron email dataset was published by the Federal Energy Regulatory Commission during its investigation of Enron's collapse. This dataset contains approximately 50,000 emails of the Enron Corporation and has two fields:
– file
– message

File contains the name of the file and the directory in which the email is present. The name of the user to whom an email belongs is the root level directory of the corresponding emails. Message contains the content of the email. Everything related to the text is present in this column [18].

b. Emails obtained from Gmail account: These emails are obtained from a Gmail account using Python libraries. The emails arrive as a stream in which all information is in string form. All the parameters of an email (i.e. sender, receiver, date, message) can be obtained using regex libraries [19].

Fig. 6. Ham messages categories

B. Experimental Evaluation

Initially, emails are tokenized and their metadata is collected. The metadata contains information about the sender, receiver, etc. After the metadata is collected, the emails are checked for malware content to decide whether to declare an email as spam. In the next step, stop words and extra spaces are removed from the emails to reduce noise.

To divide the content based on tenses and voices, the helping verbs present in the sentence and their positions with respect to the subject of the sentence are analyzed, and the current email segments are stored for future reference. The segments of the content look as shown in Fig. 7.

Fig. 7. Segmentation of email contents

After extensive experiments, Naive Bayes gives an accuracy of 89%, whereas SVM (a set of supervised learning methods used for classification, regression and outlier detection) achieves an accuracy of 90% with proper parameter tuning. The relation between parameter values and accuracy is shown in Table II.

TABLE II
RELATION BETWEEN PARAMETER VALUES AND ACCURACY

n-gram size   TF-IDF value   NB accuracy   SVM accuracy
1             0.1            72%           75%
1             0.2            75%           77%
1             0.3            78%           79%
2             0.1            79%           80%
2             0.2            82%           83%
2             0.3            89%           90%
3             0.1            77%           79%
3             0.2            79%           80%
3             0.3            81%           83%

C. Validation of Concept Drift

After the emails are divided into different categories, a histogram of the number of emails over time is generated for each category. These histograms show how the presence of emails in the different categories changes, which indicates the change in the content of the emails. For the number of bins and the bin width of the histogram, Sturges' formula [21] is used, as shown below:

\[ \mathrm{bins} = \lceil \log_2 n \rceil + 1 \tag{6} \]

\[ \mathrm{binwidth} = \frac{\max(\mathrm{values}) - \min(\mathrm{values})}{\lceil \log_2 n \rceil + 1} \tag{7} \]

Here, n is the number of emails present in the category for which the histogram is generated. Fig. 8 shows the histogram generated for a category (graphics) after obtaining the results.

Fig. 8. Change in content of emails with time
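As a rough illustration of Eqs. (6) and (7), the following Python sketch computes the number of bins and the bin width from Sturges' formula and then counts the emails that fall into each bin. The function names (`sturges_bins`, `sturges_bin_width`, `histogram_counts`) and the use of arrival-day offsets as the histogram values are assumptions made for this example; they are not taken from the original experiments.

```python
import math


def sturges_bins(n):
    """Number of histogram bins for n samples, Eq. (6): ceil(log2 n) + 1."""
    if n < 1:
        raise ValueError("need at least one sample")
    return math.ceil(math.log2(n)) + 1


def sturges_bin_width(values):
    """Bin width, Eq. (7): (max - min) / (ceil(log2 n) + 1)."""
    return (max(values) - min(values)) / sturges_bins(len(values))


def histogram_counts(values):
    """Count how many values fall into each Sturges bin."""
    bins = sturges_bins(len(values))
    width = sturges_bin_width(values)
    low = min(values)
    counts = [0] * bins
    for v in values:
        # When all values are equal the width is 0, so everything goes in bin 0.
        index = 0 if width == 0 else int((v - low) / width)
        # The maximum value would land one past the end; clamp it into the last bin.
        counts[min(index, bins - 1)] += 1
    return counts


# Hypothetical input: arrival day (offset from the first email) of 100 emails
# in one category.
arrival_days = list(range(100))
print(sturges_bins(len(arrival_days)))
print(sturges_bin_width(arrival_days))
print(histogram_counts(arrival_days))
```

For 100 emails, Eq. (6) gives ceil(log2 100) + 1 = 8 bins. Comparing the per-bin counts of a category across the time axis is one simple way to read the kind of change that the category histograms are meant to show.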

V. CONCLUSION AND FUTURE WORK

In this paper, a method based on intention-based segmentation is proposed to classify an email. Two approaches have been used to determine the intention of sentences: tense-based (past, present, future) and voice-based (active and passive). These two methods are closely related to the syntax and grammar of the English language. To compare two documents, the vector space model is used. A sliding window technique is used to detect concept drift in email datasets, which helps to determine the real class of emails with better accuracy in less time. The emails are compared by their segments of the same intention, instead of being compared as a whole. This approach takes every small detail of the emails into account and ensures good performance. The concept drift in emails describes the interests of the user. Methods other than tense-based and voice-based segmentation can be used for content segmentation, and these can also be used for text summarization and other text-based processes. The objective of the paper is to classify an email with minimum computational resources; hence Naive Bayes and SVM are used for classification. Other advanced machine learning approaches can be used in future to achieve higher accuracy with proper parameter tuning but need more computational resources.

REFERENCES

[1] Ruano-Ordás, D., Fdez-Riverola, F. and Méndez, J. R., "Concept drift in e-mail datasets: An empirical study with practical implications," Information Sciences, 2018, 428, pp. 120-135.
[2] Bhowmick, Alexy and Hazarika, Shyamanta. (2018). E-Mail Spam Filtering: A Review of Techniques and Trends. 10.1007/978-981-10-4765-7_61.
[3] W.A. Awad and S.M. ELseuofi, "Machine Learning Method for Spam Email Classification," International Journal of Computer Science and Information Technology, 2011, vol. 3, No. 1.
[4] Saad, Omar M. and Darwish, Ashraf and Faraj, Ramadan. (2019). A survey of machine learning techniques for Spam filtering.
[5] Y. Sun, Z. Wang, Y. Bai, H. Dai and S. Nahavandi, "A Classifier Graph Based Recurring Concept Detection and Prediction Approach," Computational Intelligence and Neuroscience, 2018, China, pp. 1-13.
[6] Papadimitriou D., Koutrika, G., Velegrakis Y. and Mylopoulos, "Finding Related Forum Posts through Content Similarity over Intention-Based Segmentation," IEEE Transactions on Knowledge and Data Engineering, 2017, 29(9), pp. 1860-1873.
[7] M. Harel, K. Crammer, R. El-Yaniv, S. Mannor, "Concept Drift Detection Through Resampling," International Conference on Machine Learning, Beijing, China, 2014, pp. 1324-1334.
[8] Lu, N., Zhang, G. and Lu J., "Concept drift detection via competence models," International Conference on Artificial Intelligence, 2014, China, pp. 11-28.
[9] R. Power, J. Chen, T. Karthik, L. Subramanian, "Document Classification for Focused Topics," International Conference on Machine Learning, 2013, NY, USA, pp. 123-136.
[10] S. Roy, A. Patra, S. Sau, K. Mandal, S. Kunar, "An Efficient Spam Filtering Technique for Email Account," American Journal of Engineering Research, 2013, pp. 63-73.
[11] R. J. Passonneau and D. J. Litman, "Intention-based segmentation: Human reliability and correlation with linguistic cues," in Proc. Annu. Meet. Association Comput. Linguistics, 1993, pp. 148-155.
[12] Jeffrey C. Reynar, "Topic Segmentation: Algorithms and Applications," University of Pennsylvania, August 1998.
[13] D. M. Blei, A. Y. Ng, M. I. Jordan, "Latent dirichlet allocation," J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.
[14] L. Weng et al., "Query by document via a decomposition-based two-level retrieval approach," Proc. 34th Int. ACM SIGIR Conf. Res. Development Inf. Retrieval, pp. 505-514, Jul. 2011.
[15] F. C. Heilbron, V. Escorcia, B. Ghanem and J. C. Niebles, "Modern Information Retrieval," Addison-Wesley Longman Publishing Co., pp. 961-970, 2007.
[16] Cha, Sung-Hyuk; Yoon, Sungsoon; and Tappert, Charles C., "Enhancing Binary Feature Vector Similarity Measures" (2005). CSIS Technical Reports. 18.
[17] K. Vasa, "Text Classification through Statistical and Machine Learning Models: A Survey," International Journal Of Engineering Design Research, 2016, vol. 4.
[18] V. Bobicev, "Text classification: The case of multiple labels," 2016 International Conference on Communications (COMM), Bucharest, 2016, pp. 39-42.
[19] Kaggle, "https://www.kaggle.com/wcukierski/enron-email-dataset" [Accessed March 20, 2020]
[20] Jeffrey, "http://jeffreyfossett.com/" [Accessed March 20, 2020]
[21] Scott, David. (2009). "Sturges' rule". Wiley Interdisciplinary Reviews: Computational Statistics. 1. 303-306. 10.1002/wics.35.