Email classification via intention based Segmentation
Abstract—Email is the most popular means of personal and official communication among people and organizations. Due to the untrusted virtual environment, email systems may face frequent attacks such as malware, spamming and social engineering. Spamming is the most common malicious activity, where unsolicited emails are sent in bulk; these spam emails can be a source of malware and a waste of resources, and hence degrade productivity. In spam filter development, the most important challenge is to find the correlation between the nature of spam and the interest of the users, because the interests of users are dynamic. This paper proposes a novel dynamic spam filter model that considers the changes in the interests of users over time while handling spam activities. It uses intention-based segmentation to compare different segments of text documents instead of comparing them as a whole. The proposed spam filter is a multi-tier approach where initially the email content is divided into segments with the help of part-of-speech (POS) tagging based on voices and tenses. Further, the segments are clustered using hierarchical clustering and compared using the vector space model. In the third stage, concept drift is detected in the clusters to identify the change in the interest of the user. Finally, the classification of ham emails into various categories is done in the last stage. For the experiments, the Enron dataset is used and the obtained results are promising.

Index Terms—Concept drift; Intention-based Segmentation; Part-of-speech (POS) tagging; Vector space model; Hierarchical clustering; Spam.

I. INTRODUCTION

The first ever spam email was delivered on 3 May 1978 through ARPAnet, where about 2600 ARPAnet users were sent a message by a marketer, Gary Thuerk [1]. This incident demonstrated the power of electronic mail as an advertising platform to the world, but it also introduced a new malicious activity known as spamming. With the tremendous growth in the usage of email for personal and commercial purposes, spam filtering has become the most important pre-requisite of any email system. Conventionally, spam is considered to be unsolicited commercial email. The importance of these spam emails depends on the overall interest of the users too. Hence, a strategic spam filter is essential that can filter the emails with consideration of the users' interests. Meanwhile, judicial laws have also been made to fight against spam activities, such as the CAN-SPAM Act in the USA [2].

Many spam filtering techniques have been published during the last two decades [4], but because of technological advancements, spammers have found various alternate techniques of spamming, and the situation becomes worse even after continuous safety efforts. The spam filters developed during the last two decades can identify the spam emails which contain extra spaces, HTML tags, links to malicious web pages etc., but they do not consider the noise or the presence of concept drift in the emails [3]. Naturally, the interests of a person change frequently, so the class of an email also changes for a user with time. For example, during the placement season the interests of students are towards the subjects which are useful for their placements, so during that time emails related to those subjects are ham for the students; after the placements, the interests of the students change and they are more interested in emails related to entertainment, games etc. So the emails which were spam at the time of placements will no longer be spam after placements. This changing effect of interest has not been considered in most spam filtering tools. There is a wide variety of spam techniques present on the web, as shown in Fig. 1, but as a counterpart the spammers smartly develop new alternate techniques against the spam filters as they come into use [4].

Fig. 1. Techniques Used by Spammers

To classify an email as spam or ham, this paper proposes a novel method based on intention-based segmentation. Spam emails are usually commercial emails which are of less interest to the user or contain some malware content, whereas ham emails are of more interest (current interest) to the user. Conventionally, the whole email is processed to predict its nature, whereas the proposed intention-based segmentation approach works on segments of the content of the email. Ideally, processing the segments individually reflects more information about the content, which results in higher accuracy. There are two possible ways of segmentation: random and rule-based. Random selection of segments does not add descriptive information, so it is not suitable for segmentation. Hence, the segments should be generated based on the context of the sentence. In this approach, whenever the context of a sentence changes, it represents a boundary point between two segments, and the content should be divided at this point for further processing. After dividing the content into segments, the change in the content of emails with time is detected. This change is referred to as concept drift.
Topics present in spam

TABLE I
FEATURES AND COMMUNICATION MEANS

Tense:           present         past        future
Subject:         i/we            you         it/they/(s)he
Style:           interrogative   negative    affirmative
Status:          active          passive     -
Part of speech:  verb            noun        adjective
the other is a substantial difference between the two segments. The hypothesis is chosen based on the type of change to be detected. For detecting abrupt concept drift, a sub-sequence of the data is obtained. This sub-sequence is then divided into two parts: a training window and a testing window. Using some hypothesis, change is detected in the testing window. If no change appears in the testing window, it is added to the training window and a new testing window is introduced. The change is detected by analyzing the loss values of the testing window against the loss values of a window of the same size obtained by reshuffling the sub-sequence. At each iteration, the risk involved in the train-test split is also calculated with a value called the permutation loss.
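The following listing is a minimal sketch of such a resampling-based check, not the exact procedure of [7]; the number of permutations, the loss values passed in and the significance threshold are illustrative assumptions.

```python
import numpy as np

def permutation_drift_test(train_losses, test_losses, n_permutations=500, alpha=0.05):
    """Flag drift when the testing-window loss is unusually high compared with
    losses of equally sized windows drawn from the reshuffled sub-sequence."""
    observed_gap = np.mean(test_losses) - np.mean(train_losses)
    pooled = np.concatenate([train_losses, test_losses])
    n_test = len(test_losses)
    rng = np.random.default_rng(0)
    permuted_gaps = []
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reshuffle the sub-sequence
        permuted_gaps.append(np.mean(pooled[-n_test:]) - np.mean(pooled[:-n_test]))
    # Permutation loss: fraction of reshuffled splits at least as extreme.
    p_value = float(np.mean(np.array(permuted_gaps) >= observed_gap))
    return p_value < alpha  # True -> drift detected in the testing window
```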
Later, to solve the problem of spam emails, Lu et al. [8] proposed a concept drift detection technique represented as a model (x, y), where x is the feature vector and y is the data stream label. For drift detection, a two-sliding-windows model is used, where each window consists of data points taken from two infinite random distributions. This method not only ensures the absence of concept drift but also highlights the local regions where drift occurs. One strategy to detect the changes is through the data distribution, where two data samples are compared and checked as to whether they come from the same distribution or not. There are different tests to determine the relation among samples from different distributions. The authors proposed another concept drift detection strategy that finds the concept drift with the help of the learner's output. This model traces and controls the error rate of the online learning algorithm, where the errors are defined as samples of data drawn from Bernoulli trials and generalized using the binomial distribution. A significant change in the error rate indicates a change in the class distribution, and thus the concept drift is detected.
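As a hedged illustration of such a distribution-based check (a generic two-sample test, not the competence-model method of [8]), the contents of the two sliding windows can be compared as follows:

```python
from scipy.stats import ks_2samp

def windows_differ(reference_window, detection_window, alpha=0.05) -> bool:
    """Two-sample Kolmogorov-Smirnov test on a one-dimensional feature of the
    data points in the two sliding windows; a small p-value suggests the
    windows come from different distributions, i.e. a local concept drift."""
    _, p_value = ks_2samp(reference_window, detection_window)
    return p_value < alpha
```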
Later, Power et al. [9] suggested a methodology based on feature extraction to classify documents into categories. This method uses two metrics to classify a document: popularity and rarity. For a given topic, the feature set of the topic contains a set of popular words and a set of rare words which come under the topic. The authors tested their algorithm on a wide range of development-centric topics. Instead of using text similarity to compare the documents, the authors used a feature extraction algorithm, because the presence of topic ambiguity in the documents gave very low accuracy. This approach takes topics which tend to be very focused, and also takes care of the words and symbols present in local or regional languages so that documents do not lose their authenticity. It was observed that, to extract the features of a topic, the combined use of the popularity and rarity metrics gave higher accuracy compared to the use of either of them alone. In the first step, the focused topics are defined. Focused topics are those which have the following properties:
• The topic is not present on the web with a high frequency.
• Document overlap with other topics should be negligible, if not null.
In the next step, documents containing very little information and the popular terms present in the documents are removed, because they create noise in the data. After that, the frequency of every term is calculated using the Linguistic Data Consortium (LDC) dataset. Further, the TF-IDF value of a term is calculated using the following formula:

tf_idf(t) = tf(t) * log(N / N(t))    (1)

Here, tf(t) is the mean term frequency of t, N is the total number of documents and N(t) is the number of documents in which term t appears. A popular term is a term which satisfies the following constraints:

tf_idf(t) > T_th,   LDC(t) < P_max    (2)

Here, T_th represents the lower bound on TF-IDF values for popular terms, and P_max represents the upper bound on the count of popular terms. A rare term is a term which satisfies the following constraints:

tf_idf(t) > R_th,   LDC(t) < R_max    (3)

Here, R_th represents the lower bound on TF-IDF values for rare terms, and R_max represents the upper bound on the count of rare terms.

In the next step, the set of popular and rare terms is given as input to the feature extraction algorithm, which gives a two-dimensional vector as output containing a feature vector of the term and a weighting function for each of the terms. These vectors are given as input to any standard classifier, which gives the topic of the given document. In their research work, Naive Bayes and support vector machine classifiers were used.
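A minimal sketch of how Eqs. (1)-(3) can be applied is given below; the threshold values and the ldc_count lookup (standing in for the Linguistic Data Consortium frequencies) are illustrative assumptions, not values taken from the paper.

```python
import math

def tf_idf(term, mean_tf, doc_freq, n_docs):
    """tf_idf(t) = tf(t) * log(N / N(t)) as in Eq. (1); mean_tf[t] is the mean
    term frequency of t and doc_freq[t] is N(t)."""
    return mean_tf[term] * math.log(n_docs / doc_freq[term])

def split_popular_rare(terms, mean_tf, doc_freq, n_docs, ldc_count,
                       t_th=0.5, p_max=1_000_000, r_th=2.0, r_max=1_000):
    """Apply the constraints of Eqs. (2) and (3); thresholds are placeholders."""
    popular, rare = set(), set()
    for t in terms:
        score = tf_idf(t, mean_tf, doc_freq, n_docs)
        if score > t_th and ldc_count.get(t, 0) < p_max:
            popular.add(t)
        if score > r_th and ldc_count.get(t, 0) < r_max:
            rare.add(t)
    return popular, rare
```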
Many techniques related to the topic under consideration have been proposed earlier. The detailed literature survey shows that each technique has its limitations and drawbacks. Also, none of the techniques considers all the important factors of concept drift. The literature shows that the presence of concept drift in emails plays a very important part in finding the class of the emails because of the noise present in their content, while the presence of HTML tags, price tags, extra spaces etc. in the content also has a significant effect on finding the class of emails. Meanwhile, the intention-based segmentation analysis indicates that, instead of comparing two documents as a whole, segment-level analysis can give better results. It is also evident that a feature extraction algorithm gives higher accuracy compared to other standard classification approaches: classification algorithms should be applied to the features after feature extraction instead of being applied directly to the documents. Finally, instead of comparing the documents with plain text similarity, the TF-IDF scheme should be used because of the noise present in the documents.

III. PROPOSED METHODOLOGY

This paper proposes a novel spam filter model to detect spam emails via the concept drift that occurs in their content. Categorization of the ham emails has also been done, which makes it easier for the user to access them. In the proposed model, the classification of emails is performed in six steps, where at every step the content of the email is processed and compared with the rest of the previous emails. By analyzing the comparison between the emails, concept drift is detected and the class of the emails is decided. Most importantly, emails containing malware content are also taken care of: an email containing some pre-defined specific keywords is identified as a spam email. Fig. 2 shows the proposed methodology and a description of the steps is as follows:

Fig. 2. Proposed methodology
A. Checking for malware content

This is the first step towards finding the real class of the emails. In this step, the content of the current email is checked. If it contains some specific keywords stored in the database which represent malware content, or if it contains hyperlinks which are related to malware content present on the web, then the email is directly put into the spam emails and no further processing is done on that email.
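As a rough illustration of this check (the actual keyword database and link blacklist are not specified in the paper), the step could look like the sketch below; MALWARE_KEYWORDS and MALICIOUS_DOMAINS are hypothetical placeholders.

```python
import re

# Hypothetical lists; in the proposed system these would come from the keyword database.
MALWARE_KEYWORDS = {"account suspended", "verify your password", "invoice.exe"}
MALICIOUS_DOMAINS = {"malicious-example.com", "phish-example.net"}

def is_malicious(email_body: str) -> bool:
    """Return True if the email contains blacklisted keywords or links to
    known malicious pages; such an email is marked as spam immediately."""
    text = email_body.lower()
    if any(keyword in text for keyword in MALWARE_KEYWORDS):
        return True
    for domain in re.findall(r"https?://([^/\s]+)", text):
        if domain in MALICIOUS_DOMAINS:
            return True
    return False
```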
B. Parsing emails

The second step is parsing of the emails. In this step, the email is parsed, i.e. all the information related to the email is stored in separate variables, for example: date, sender, subject and content. After storing the content in its own variable, it is prepared for further processing. Extra spaces, HTML tags, price tags, exclamation symbols etc. found in the content are removed from the emails. Hyperlinks present in the content are also found and removed. The main idea behind removing the extra spaces, HTML tags, price tags and exclamation marks is that these are used by spammers to make a difference in the content of emails so that a spam filter recognises them as different and does not mark them as spam [1]. Removing these artefacts makes it easier for the proposed spam filter to focus mainly on the content. The idea behind removing the hyperlinks is that, once a hyperlink has been processed in the previous step, no further processing is required [10].
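A possible parsing and cleaning routine for this step is sketched below using the standard library email module; the exact regular expressions (e.g. what counts as a price tag) are assumptions made for illustration.

```python
import re
from email import message_from_string

def clean_content(content: str) -> str:
    """Strip the artefacts spammers use to disguise content: HTML tags,
    hyperlinks, price tags, exclamation marks and extra spaces."""
    content = re.sub(r"<[^>]+>", " ", content)                   # HTML tags
    content = re.sub(r"https?://\S+", " ", content)              # hyperlinks
    content = re.sub(r"[$€£]\s?\d+(?:[.,]\d+)?", " ", content)   # price tags
    content = content.replace("!", " ")                          # exclamation marks
    return re.sub(r"\s+", " ", content).strip()                  # extra spaces

def parse_email(raw_message: str) -> dict:
    """Store each piece of information of the email in its own field
    (simplified: assumes a plain-text, non-multipart message)."""
    msg = message_from_string(raw_message)
    return {"date": msg["Date"], "sender": msg["From"], "subject": msg["Subject"],
            "content": clean_content(msg.get_payload())}
```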
C. Content segmentation

In this step, the email content is divided into segments. This involves the following three stages: tokenization, tagging and classification (as shown in Fig. 3; a brief sketch of these stages is given after the figure).
a. Tokenization: This is the process of dividing the email content into tokens; sometimes the difference between a token and a type has to be recognized. A token is defined as a group of characters which has a particular meaning in some particular document, whereas a type is defined as the class of a group of characters having the same character sequence [11].
b. Tagging: Tagging is the process where all words present in the document are marked with a particular part of speech. This process is known as POS (part-of-speech) tagging, grammatical tagging or word-category disambiguation. A word is marked based on its definition and context, i.e. how it is related to its adjacent or similar words in the document. POS tagging is essentially the identification of nouns, verbs, adjectives, adverbs, etc. [12].
c. Classification: Classification is performed based on tenses (present, past or future) or voices (active or passive). If a sentence contains modal verbs then it is categorized as future tense, whereas if it contains a verb in the past or past participle form then it is categorized as past tense. If the sentence contains a verb in the present or present participle form, then it is claimed to be in the present tense [9]. The voice of the sentence can be determined after removal of all the available adverbs. If the sentence has a preposition or subordinating conjunction, and after the first word the sentence has a wh-determiner, wh-pronoun, possessive wh-pronoun or wh-adverb as part-of-speech tags, then the sentence is in passive voice. If the sentence has 'be', 'am', 'is', 'are', 'was', 'were', 'been', 'has', 'have', 'had', 'do', 'did', 'does', 'can', 'could', 'shall', 'should', 'will', 'would', 'may', 'might' or 'must' together with a verb in past participle form as part of speech, then the sentence is also in passive voice. If, in place of the past participle verb, after one word the sentence has a verb in past tense, a verb in present tense with third person singular, a verb in present tense with non-third person singular, or after two words it has a verb in present participle form, a verb in past tense, a verb in past participle form, a verb in present tense with non-third person singular, a verb in present tense with third person singular, or the base form of a verb as part-of-speech tags, then it is in active voice. If the sentence is too short (less than two words), it is assumed to be in active voice [13].

Fig. 3. Segmentation of content
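The listing below sketches the three stages with NLTK; the tense and voice rules are a simplified reading of the description above (not the authors' exact rule set), and the required NLTK models are assumed to be downloaded.

```python
import nltk  # assumes nltk.download('punkt') and nltk.download('averaged_perceptron_tagger')

def tag_sentence(sentence: str):
    """Tokenization followed by POS tagging, e.g. [('He', 'PRP'), ('ran', 'VBD')]."""
    return nltk.pos_tag(nltk.word_tokenize(sentence))

def detect_tense(tagged):
    """Simplified tense rule: modal -> future, past/past participle verb -> past,
    otherwise present."""
    tags = [tag for _, tag in tagged]
    if "MD" in tags:
        return "future"
    if "VBD" in tags or "VBN" in tags:
        return "past"
    return "present"

AUXILIARIES = {"be", "am", "is", "are", "was", "were", "been",
               "has", "have", "had", "do", "did", "does",
               "can", "could", "shall", "should", "will",
               "would", "may", "might", "must"}

def detect_voice(tagged):
    """Simplified voice rule: an auxiliary followed later by a past participle
    (VBN) is treated as passive; very short sentences default to active."""
    if len(tagged) < 2:
        return "active"
    for i, (word, _) in enumerate(tagged[:-1]):
        if word.lower() in AUXILIARIES and any(t == "VBN" for _, t in tagged[i + 1:]):
            return "passive"
    return "active"
```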
D. Grouping of Segments

In this phase, groups of the obtained segments are formed using hierarchical clustering (Fig. 4). Initially, each segment is taken as an individual cluster, and then the difference between clusters is calculated. The clusters having a difference less than a pre-defined threshold are merged [14]. The difference between the clusters is calculated via cosine similarity and Euclidean distance using the vector space model.

Fig. 4. Hierarchical Clustering

Two clusters are merged repeatedly when the value of |cos θ| is less than 0.3 and the Euclidean distance between them is less than a pre-defined threshold. Fig. 5 shows the cosine similarity between two documents [15]. The newly created cluster is represented by a new vector. The similarity formula between two documents is as below:

sim(d1, d2) = V(d1) · V(d2) / (|V(d1)| |V(d2)|)    (4)

The Euclidean distance [16] between two vectors is calculated using the following formula:

D = sqrt( Σ (Vi − Vj)^2 )    (5)

Fig. 5. Cosine similarity
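A compact sketch of this merging step over segment vectors (e.g. TF-IDF vectors) is shown below; the Euclidean threshold DIST_TH is an assumed placeholder, and the merge condition is taken literally from the text (|cos θ| < 0.3 together with a small Euclidean distance).

```python
import numpy as np

def cosine_similarity(v1: np.ndarray, v2: np.ndarray) -> float:
    """Eq. (4): sim(d1, d2) = V(d1).V(d2) / (|V(d1)| |V(d2)|)."""
    return float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))

def euclidean_distance(v1: np.ndarray, v2: np.ndarray) -> float:
    """Eq. (5): D = sqrt(sum (Vi - Vj)^2)."""
    return float(np.linalg.norm(v1 - v2))

DIST_TH = 1.0  # assumed placeholder for the pre-defined Euclidean threshold

def merge_pass(clusters):
    """One agglomerative pass: merge the first pair of cluster vectors that
    satisfies the stated condition; the merged cluster is represented by a
    new (mean) vector. Callers repeat until no pair can be merged."""
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if (abs(cosine_similarity(clusters[i], clusters[j])) < 0.3
                    and euclidean_distance(clusters[i], clusters[j]) < DIST_TH):
                merged = (clusters[i] + clusters[j]) / 2.0
                rest = [c for k, c in enumerate(clusters) if k not in (i, j)]
                return rest + [merged], True
    return clusters, False  # no pair satisfies the merge condition
```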
E. Finding concept drift in emails

The final number of generated clusters represents the number of topics. If all the segments of an email are present in clusters in which the number of segments is less than a threshold, then that email is marked as a spam email, and the rest of the emails are put into the ham category. The segments made in the third step are mapped to their emails and stored. There are two advantages of this approach: firstly, it avoids unnecessary processing of duplicate emails and, secondly, it eases the process of finding the segments related to an email, hence reducing the computation cost. In the experiments, steps 3 to 5 are performed by comparing the email with the stored emails of the last 30 days (the window size is 30 days). As the stored segments and clusters are reused for processing, the process is fast because only the current email has to be processed and put into the clusters.
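The decision rule of this step can be written compactly as below; MIN_CLUSTER_SIZE, the 30-day window constant and the mapping structures are illustrative assumptions about how the stored segments might be organised.

```python
from datetime import datetime, timedelta

MIN_CLUSTER_SIZE = 3          # assumed threshold on the number of segments per cluster
WINDOW = timedelta(days=30)   # sliding window used in the experiments

def classify_by_drift(email_segments, segment_to_cluster, cluster_sizes):
    """Mark the email as spam if every one of its segments falls in a cluster
    smaller than the threshold; otherwise treat it as ham."""
    if all(cluster_sizes[segment_to_cluster[seg]] < MIN_CLUSTER_SIZE
           for seg in email_segments):
        return "spam"
    return "ham"

def within_window(email_date: datetime, now: datetime) -> bool:
    """Only emails (and their stored segments) from the last 30 days are compared."""
    return now - email_date <= WINDOW
```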
F. Classification of ham messages

In this step, ham messages are divided into different categories, which makes it easier for the user to access them. Different classification algorithms are used to classify the messages. The first step of each algorithm is to detect the features from the documents (messages). After feature extraction, the classifiers are trained and the samples are tested. The features used are the n-gram range and TF-IDF values [16].
• n-gram range: the n-gram range refers to the size of the n-grams which should be considered by the classifier, i.e. whether it considers bigrams, trigrams etc. The larger the n-gram size, the more complex word sequences the classifier can handle.
• TF-IDF value: the TF-IDF value is based on the term frequency in the document, which can be calculated by dividing the count of the term by the total number of terms in the document.

The "20newsgroups" dataset is used to train the classifier. This dataset contains 20000 news articles, stored in folders of their respective categories. Twenty-three categories have been defined in this dataset. For every news article present in the database, the content of the news article and its category are extracted. Then the TF-IDF value for every term present in the news article is found, and the classifier is trained on the TF-IDF values of the terms. TF-IDF values are used instead of raw term counts because counts create a problem when there is a big difference in the size of documents. Suppose a document has 10000 words, of which 2000 are related to a topic and represent that topic, and there is another document in the dataset which contains 1000 words, of which 200 are related to the same topic. If the word count is taken as the parameter, the classifier does not predict the category correctly [15]. A grid-based search method for parameter tuning is used. Unigrams and bigrams have been considered for our classifier, and the TF-IDF value threshold is in the range 0.5 to 1. In parameter tuning, the classifier is tested for every combination of parameters and the accuracy is calculated. The set of parameters which gives the best accuracy is used in the classifier. Alongside changing the values of the parameters, different classifiers can also be tested, which gives the best possible combination. This process takes more time, as every set of parameters has to be tested, but gives better results compared to the other classifiers [17]. Fig. 6 shows all the categories into which ham messages are divided.
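A hedged scikit-learn sketch of this stage is shown below. The pipeline, the interpretation of the TF-IDF threshold as the vectorizer's max_df parameter, and the exact parameter grid are assumptions for illustration; only the use of unigrams/bigrams, TF-IDF features, a grid search and a Naive Bayes classifier follow the description above.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Train on the 20newsgroups articles and grid-search the feature parameters.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", MultinomialNB()),
])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "tfidf__max_df": [0.5, 0.75, 1.0],        # assumed reading of the 0.5-1.0 threshold
    "clf__alpha": [0.1, 1.0],
}
search = GridSearchCV(pipeline, param_grid, cv=3, n_jobs=-1)
search.fit(train.data, train.target)
print(search.best_params_, search.best_score_)
```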
IV. RESULT AND DISCUSSION

A. Dataset Description

The proposed algorithm has been tested on two datasets: the Enron email dataset, and personal emails obtained from a Gmail account as a stream by using Python libraries. A description of both datasets is as below:
a. Enron email Dataset: The Enron email dataset was published by the Federal Energy Regulatory Commission during its investigation of Enron's collapse. This dataset contains approximately 50,000 emails of the Enron Corporation and has two fields:
Fig. 6. Ham messages categories

TABLE II
RELATION BETWEEN PARAMETER VALUES AND ACCURACY

Fig. 7. Segmentation of email contents
V. CONCLUSION AND FUTURE WORK

In this paper, a method based on intention-based segmentation is proposed to classify an email. Two approaches have been used to determine the intention of sentences: tense-based (past, present, future) and voice-based (active and passive). These two approaches are closely related to the syntax and grammar of the English language. To compare two documents, the vector space model is used. A sliding-window technique is used to detect the concept drift in email datasets, which helps to give the real class of emails with better accuracy in less time. The emails are compared by their segments of the same intention, instead of being compared as a whole. This approach takes every small detail of the emails into account and ensures good performance. The concept drift in emails describes the interests of the user. There are methods other than tense-based and voice-based segmentation which can be used for content segmentation, and these can also be used for text summarization and other text-based processes. The objective of the paper is to classify an email with minimum computation resources, hence Naive Bayes and SVM are used for classification. Other advanced machine learning approaches can be used in future to achieve higher accuracy with proper parameter tuning, but need more computational resources.
REFERENCES

[1] D. Ruano-Ordás, F. Fdez-Riverola and J. R. Méndez, "Concept drift in e-mail datasets: An empirical study with practical implications", Information Sciences, vol. 428, pp. 120-135, 2018.
[2] A. Bhowmick and S. Hazarika, "E-Mail Spam Filtering: A Review of Techniques and Trends", 2018, doi: 10.1007/978-981-10-4765-7_61.
[3] W. A. Awad and S. M. ELseuofi, "Machine Learning Method for Spam Email Classification", International Journal of Computer Science and Information Technology, vol. 3, no. 1, 2011.
[4] O. M. Saad, A. Darwish and R. Faraj, "A survey of machine learning techniques for Spam filtering", 2019.
[5] Y. Sun, Z. Wang, Y. Bai, H. Dai and S. Nahavandi, "A Classifier Graph Based Recurring Concept Detection and Prediction Approach", Computational Intelligence and Neuroscience, pp. 1-13, 2018.
[6] D. Papadimitriou, G. Koutrika, Y. Velegrakis and J. Mylopoulos, "Finding Related Forum Posts through Content Similarity over Intention-Based Segmentation", IEEE Transactions on Knowledge and Data Engineering, vol. 29, no. 9, pp. 1860-1873, 2017.
[7] M. Harel, K. Crammer, R. El-Yaniv and S. Mannor, "Concept Drift Detection Through Resampling", International Conference on Machine Learning, Beijing, China, 2014, pp. 1324-1334.
[8] N. Lu, G. Zhang and J. Lu, "Concept drift detection via competence models", International Conference on Artificial Intelligence, China, 2014, pp. 11-28.
[9] R. Power, J. Chen, T. Karthik and L. Subramanian, "Document Classification for Focused Topics", International Conference on Machine Learning, NY, USA, 2013, pp. 123-136.
[10] S. Roy, A. Patra, S. Sau, K. Mandal and S. Kunar, "An Efficient Spam Filtering Technique for Email Account", American Journal of Engineering Research, pp. 63-73, 2013.
[11] R. J. Passonneau and D. J. Litman, "Intention-based segmentation: Human reliability and correlation with linguistic cues", in Proc. Annu. Meet. Association Comput. Linguistics, 1993, pp. 148-155.
[12] J. C. Reynar, "Topic Segmentation: Algorithms and Applications", University of Pennsylvania, August 1998.
[13] D. M. Blei, A. Y. Ng and M. I. Jordan, "Latent Dirichlet allocation", J. Mach. Learn. Res., vol. 3, pp. 993-1022, 2003.
[14] L. Weng et al., "Query by document via a decomposition-based two-level retrieval approach", Proc. 34th Int. ACM SIGIR Conf. Res. Development Inf. Retrieval, pp. 505-514, Jul. 2011.
[15] F. C. Heilbron, V. Escorcia, B. Ghanem and J. C. Niebles, "Modern Information Retrieval", Addison-Wesley Longman Publishing Co., pp. 961-970, 2007.
[16] S.-H. Cha, S. Yoon and C. C. Tappert, "Enhancing Binary Feature Vector Similarity Measures", CSIS Technical Reports, 18, 2005.
[17] K. Vasa, "Text Classification through Statistical and Machine Learning Models: A Survey", International Journal of Engineering Design Research, vol. 4, 2016.
[18] V. Bobicev, "Text classification: The case of multiple labels", 2016 International Conference on Communications (COMM), Bucharest, 2016, pp. 39-42.
[19] Kaggle, https://www.kaggle.com/wcukierski/enron-email-dataset [Accessed March 20, 2020].
[20] Jeffrey, http://jeffreyfossett.com/ [Accessed March 20, 2020].
[21] D. Scott, "Sturges' rule", Wiley Interdisciplinary Reviews: Computational Statistics, vol. 1, pp. 303-306, 2009, doi: 10.1002/wics.35.