Cosdes: A Collaborative Spam Detection System With A Novel E-Mail Abstraction Scheme
1 INTRODUCTION
E-mail communication is prevalent and indispensable
nowadays. However, the threat of unsolicited junk
e-mails, also known as spams, is becoming more and more
serious. According to a survey by the website
TopTenREVIEWS [11], 40 percent of e-mails were considered
spams in 2006. The statistics collected by MessageLabs
(footnote 1) show that
recently the spam rate has been over 70 percent and remains
persistently high. The primary challenge of the spam detection
problem lies in the fact that spammers will always find new
ways to attack spam filters owing to the economic benefits of
sending spams. Note that existing filters generally perform
well when dealing with clumsy spams, which have
duplicate content with suspicious keywords or are sent
from an identical notorious server. Therefore, the next stage
of spam detection research should focus on coping with
cunning spams which evolve naturally and continuously.
Although the techniques used by spammers vary
constantly, there is still one enduring feature: spams with
identical or similar content are sent in large quantities and
successively. Since only a small number of e-mail users will
order products or visit websites advertised in spams,
spammers have no choice but to send a great quantity of
spams to make profits. It means that even with developing
and employing unexpected new tricks, spammers still have
to send out large quantities of identical or similar spams
simultaneously and in succession. This specific feature of
spams can be designated as the near-duplicate phenomenon,
which is a significant key in the spam detection problem.
In view of the above facts, the notion of collaborative spam
filtering with a near-duplicate similarity matching scheme
has recently received much attention. The primary idea of
the near-duplicate matching scheme for spam detection is to
maintain a known spam database, formed by user feedback,
to block subsequent spams with similar content. Collabora-
tive filtering indicates that user knowledge of what spam
may subsequently appear is collected to detect following
spams. Overall, there are three key points of this type of
spam detection approach we have to be concerned about.
First, an effective representation of e-mail (i.e., e-mail
abstraction) is essential. Since a large set of reported spams
has to be stored in the known spam database, the storage
size of e-mail abstraction should be small. Moreover, the e-
mail abstraction should capture the near-duplicate phe-
nomenon of spams, and should avoid accidental deletion of
nonspam e-mails (also known as hams). Second, every
incoming e-mail has to be matched with the large database,
meaning that the near-duplicate matching process should
be substantially efficient. Finally, the latest spams have to be
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 23, NO. 5, MAY 2011 669
. C.-Y. Tseng and M.-S. Chen are with the Department of Electrical
Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Road,
Taipei 10617, Taiwan, and the Research Center for Information Technology
Innovation (CITI), Academia Sinica, No. 128, Sec. 2, Academia Road,
Nankang, Taipei 11529, Taiwan.
E-mail: [email protected], [email protected].
. P.-C. Sung is with the Department of Electrical Engineering, National
Taiwan University, No. 1, Sec. 4, Roosevelt Road, Taipei 10617, Taiwan.
E-mail: [email protected].
Manuscript received 7 Mar. 2009; revised 9 Aug. 2009; accepted 7 Dec. 2009;
published online 24 Aug. 2010.
Recommended for acceptance by D. Cook.
For information on obtaining reprints of this article, please send e-mail to:
[email protected], and reference IEEECS Log Number TKDE-2009-03-0118.
Digital Object Identifier no. 10.1109/TKDE.2010.147.
1. http://www.messagelabs.com/globalthreats.
1041-4347/11/$26.00 2011 IEEE Published by the IEEE Computer Society
included instantly and successively into the database so as
to effectively block subsequent near-duplicate spams.
Although previous researchers have developed various
methods on near-duplicate spam detection [7], [8], [12], [16],
[17], [22], [23], [24], [25], [30], [31], these works are still
subject to some drawbacks. To achieve the objectives of
small storage size and efficient matching, prior works
mainly represent each e-mail by a succinct abstraction
derived from e-mail content text. Moreover, hash-based text
representation is applied extensively. One major problem of
these abstractions is that they may be too brief and thus
may not be robust enough to withstand intentional attacks.
A common attack on this type of representation is to insert a
random normal paragraph without any suspicious keywords
into an unobvious position of an e-mail. In such a
context, if the whole e-mail content is utilized for hash-
based representation, the near-duplicate part of spams
cannot be captured. In addition, the false positive rate (i.e.,
the rate of classifying hams as spams) may increase because
the random part of e-mail content is also involved in e-mail
abstraction. On the other hand, hash-based text representa-
tion also suffers from the problem of not being suitable for
all languages. Finally, images and hyperlinks are important
clues to spam detection, but both of them are unable to be
included in hash-based text representation.
In this paper, we devise a more sophisticated e-mail
abstraction, which can more effectively capture the
near-duplicate phenomenon of spams. Motivated by the fact that
e-mail users are capable of easily recognizing similar spams by
observing the layouts of e-mails, we attempt to represent each
e-mail based on the e-mail layout structure. Fortunately,
almost all e-mails nowadays are in Multipurpose Internet
Mail Extensions (MIME) format with the text/html
content-type. That is, HTML content is available in an e-mail and
provides sufficient information about e-mail layout structure.
In view of this observation, we propose the specific procedure
Structure Abstraction Generation (SAG), which generates an
HTML tag sequence to represent each e-mail. Different from
previous works, SAG focuses on the e-mail layout structure
instead of detailed content text. In this regard, each paragraph
of text without any HTML tag embedded will be transformed
to a newly defined tag <mytext/>.
Definition 1 (<mytext/>). <mytext/> is a newly defined
tag that represents a paragraph of text without any HTML tag
embedded.
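The transformation described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses Python's standard `html.parser`, and the class name `TagExtractor`, the function `extract_tags`, and the small `EMPTY_TAGS` set are assumptions introduced only for this example.

```python
from html.parser import HTMLParser

# Assumed subset of HTML empty tags, used only for this illustration.
EMPTY_TAGS = {"br", "hr", "img", "input"}


class TagExtractor(HTMLParser):
    """Keeps tag names only; attributes and text content are discarded."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored; empty tags are written in self-closing form.
        self.tags.append(f"<{tag}/>" if tag in EMPTY_TAGS else f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        self.tags.append(f"<{tag}/>")

    def handle_endtag(self, tag):
        if tag not in EMPTY_TAGS:
            self.tags.append(f"</{tag}>")

    def handle_data(self, data):
        # Every run of text without embedded tags becomes one <mytext/>.
        if data.strip():
            self.tags.append("<mytext/>")


def extract_tags(html):
    parser = TagExtractor()
    parser.feed(html)
    parser.close()
    return parser.tags
```

Because the text itself is never inspected, the same tag sequence results regardless of the language the e-mail is written in.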
Since we ignore the semantics of the text, the proposed
abstraction scheme is inherently applicable to e-mails in all
languages. This significant feature is superior to most
existing methods. Once e-mails are represented by our
newly devised e-mail abstractions, two e-mails are viewed
as near-duplicate if their HTML tag sequences are exactly
identical to each other. Note that even when spammers insert
random tags into e-mails, the proposed e-mail abstraction
scheme will still retain efficacy since arbitrary tag insertion is
prone to syntax errors or tag mismatching, meaning that the
appearance of the e-mail content will be greatly altered.
Moreover, the proposed procedure SAG also adopts some
heuristics to better guarantee the robustness of our approach.
While a more sophisticated e-mail abstraction is intro-
duced, one challenging issue arises: how to efficiently
match each incoming e-mail with an existing huge spam
database. To resolve this issue, we devise an innovative tree
structure, SpTrees, to store large amounts of the e-mail
abstractions of reported spams, and SpTrees contribute to
substantially promoting the efficiency of matching. In the
design of the near-duplicate matching scheme based on
SpTrees, we aim at reducing the number of spams and tags
which are required to be compared.
By integrating the above techniques, in this paper, we design
a complete spam detection system, COllaborative Spam
DEtection System (Cosdes). Cosdes possesses an efficient
near-duplicate matching scheme and a progressive update
scheme. The progressive update scheme not only adds in
new reported spams, but also removes obsolete ones in the
database. With Cosdes maintaining an up-to-date spam
database, the detection result of each incoming e-mail can
be determined by the near-duplicate similarity matching
process. In addition, to withstand intentional attacks, a
reputation mechanism is also provided in Cosdes to ensure
the truthfulness of user feedback.
To the best of our knowledge, there is no prior research
in considering e-mail layout structure to represent e-mails
in the field of near-duplicate spam detection. In summary,
the contributions of this paper are as follows:
1. We propose the specific procedure SAG to generate
the e-mail abstraction using HTML content in e-mail,
and this newly devised abstraction can more
effectively capture the near-duplicate phenomenon
of spams.
2. We devise an innovative tree structure, SpTrees, to
store large amounts of the e-mail abstractions of
reported spams. SpTrees contribute to the accom-
plishment of the efficient near-duplicate matching
with a more sophisticated e-mail abstraction.
3. We design a complete spam detection system
Cosdes with an efficient near-duplicate matching
scheme and a progressive update scheme. The
progressive update scheme enables system Cosdes
to keep the most up-to-date information for near-
duplicate detection.
The rest of this paper is outlined as follows: In Section 2,
preliminaries including the definition of near-duplicate and
the related works are given. In Section 3, we introduce the
novel e-mail abstraction scheme. In Section 4, the complete
system model of Cosdes is depicted. The experimental
results are shown in Section 5, and finally, this paper is
concluded with Section 6.
2 PRELIMINARIES
In this section, the definition of near-duplicate used in this
paper is presented in Section 2.1. We then review the
related works on spam detection in Section 2.2.
2.1 Definition of Near-Duplicate
The central idea of near-duplicate spam detection is to exploit
reported known spams to block subsequent ones which have
similar content. For different forms of e-mail representation,
the definitions of similarity between two e-mails are diverse.
Unlike most prior works representing e-mails based mainly
on content text, we investigate representing each e-mail
using an HTML tag sequence, which depicts the layout
structure of e-mail, and look forward to more effectively
capturing the near-duplicate phenomenon of spams. Initially,
the definition of the <anchor> tag is given as follows:
Definition 2 (<anchor>). The tag <anchor> is one type of
newly defined tag that records the domain name or the e-mail
address in an anchor tag.
For example, the anchor tag <a href=http://arbor.ee.ntu.edu.tw/index.htm>
is transformed to <arbor.ee.ntu.edu.tw>.
The anchor tag <a href=mailto:cytseng@arbor.ee.ntu.edu.tw>
is transformed to <cytseng@arbor.ee.ntu.edu.tw>.
The purpose of creating the <anchor> tag is
to minimize the false positive rate when the number of tags
in an e-mail abstraction is small. The fewer tags an e-mail
abstraction contains, the more likely it is that a ham will
be matched with known spams and be misclassified as a
spam. Therefore, when the number of tags in an e-mail
abstraction is smaller than a predefined threshold, for each
anchor tag <a>, we specifically record the targeted domain
name or e-mail address, which is a significant clue for
identifying spams.
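The two transformations in the example above can be sketched with Python's standard `urllib.parse`. The function name `anchor_tag` is hypothetical, and returning `None` for links without a host is an assumption of this sketch, since the text does not say how such links are treated.

```python
from urllib.parse import urlsplit


def anchor_tag(href):
    """Map an anchor's href to an <anchor> tag holding its domain or address."""
    parts = urlsplit(href)
    if parts.scheme == "mailto":
        # For mailto links the path component is the e-mail address.
        return f"<{parts.path}>"
    if parts.hostname is None:
        # Assumption: links without a host (e.g., relative links) are skipped.
        return None
    return f"<{parts.hostname}>"
```

Only the domain name or address is kept, so two links to different pages of the same site yield the same <anchor> tag.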
On the other hand, in this paper, the detailed definition
of near-duplicate is given as follows:
Definition 3 (Near-Duplicate). Let
$I = \{t_1, t_2, \ldots, t_i, \ldots, t_n, \texttt{<mytext/>}, \texttt{<anchor>}\}$
be the set of all valid HTML tags
with two types of newly created tags, <mytext/> and
<anchor>, included. An e-mail abstraction derived from
procedure SAG is denoted as
$\langle e_1, e_2, \ldots, e_i, \ldots, e_m \rangle$, which
is an ordered list of tags, where $e_i \in I$. The definition of
near-duplicate is: Two e-mail abstractions
$\alpha = \langle a_1, a_2, \ldots, a_i, \ldots, a_n \rangle$ and
$\beta = \langle b_1, b_2, \ldots, b_i, \ldots, b_m \rangle$ are viewed as
near-duplicate if $\forall i,\ a_i = b_i$ and $n = m$.
Definition 4 (Tag Length). The tag length of an e-mail
abstraction is defined as the number of tags in an e-mail
abstraction.
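Definitions 3 and 4 translate directly into code. The following is a small sketch in which an e-mail abstraction is assumed to be a Python list of tag strings; the function names are illustrative only.

```python
def tag_length(abstraction):
    """Definition 4: the number of tags in an e-mail abstraction."""
    return len(abstraction)


def is_near_duplicate(alpha, beta):
    """Definition 3: same tag length and identical tags at every position."""
    return tag_length(alpha) == tag_length(beta) and all(
        a == b for a, b in zip(alpha, beta)
    )
```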
Note that we strictly define that two e-mail abstractions
are near-duplicate only if they are exactly identical to each
other. The major reason is that there are numerous HTML
tag patterns appearing commonly and frequently. Partial
matching of HTML tag sequences would cause a much higher
false positive rate, and the complexity would be too
high to achieve efficient matching. In addition, for further
speed-up, when the tag length of an e-mail abstraction is
large, we even apply a looser matching criterion, which
does not degrade detection results.
2.2 Related Works
Since the e-mail spam problem is increasingly serious
nowadays, various techniques have been explored to relieve
the problem. Based on what features of e-mails are being
used, previous works on spam detection can be generally
classified into three categories: 1) content-based methods,
2) noncontent-based methods, and 3) others. Initially,
researchers analyze e-mail content text and model this
problem as a binary text classification task. Representatives
of this category are Naive Bayes [14], [20] and Support
Vector Machines (SVMs) [1], [10], [15], [27] methods. In
general, Naive Bayes methods train a probability model
using classified e-mails, and each word in e-mails will be
given a probability of being a suspicious spam keyword. As
for SVMs, it is a supervised learning method, which
possesses outstanding performance on text classification
tasks. Traditional SVMs [10] and improved SVMs [1], [15],
[27] have been investigated. While the above conventional
machine learning techniques have reported excellent results
with static data sets, one major disadvantage is that it is
cost-prohibitive for large-scale applications to constantly
retrain these methods with the latest information to adapt to
the rapidly evolving nature of spams. Moreover, the performance
of these methods on e-mail corpora in various languages
has received little study. In addition, other classification
techniques, including the Markov random field model [3],
neural networks [6], and logic regression [2], and certain
specific features, such as URLs [26] and images [19], [29],
have also been taken into account for spam detection.
The other group attempts to exploit noncontent informa-
tion such as e-mail header, e-mail social network [4], [28],
and e-mail traffic [5], [9] to filter spams. Collecting
notorious and innocent sender addresses (or IP addresses)
from e-mail header to create black list and white list is a
commonly applied method initially. MailRank [4] examines
the feasibility of rating sender addresses with algorithm
PageRank in the e-mail social network, and in [28], a
modified version with an update scheme is introduced. Since
the e-mail header can be altered by spammers to conceal their
identity, the main drawback of these methods is the
difficulty of correctly identifying each user. In [5], [9], the
authors intend to analyze e-mail traffic flows to detect
suspicious machines and abnormal e-mail communication.
It is noted that these approaches have to operate in
coordination with other complementary methods to gain
better results. Moreover, some researchers consider com-
bining the merits of several techniques [2], [13], [18]. Even
though the performance of classifier integration seems
prominent, there is still no conclusion on what is the best
combination. In addition, how to efficiently update the
whole included classifiers is another unsolved issue.
On the other hand, collaborative spam filtering with
near-duplicate similarity matching scheme has been stu-
died extensively in recent years. Regarding collaborative
mechanism, P2P-based architecture [8], [12], [31], centra-
lized server-based system [16], [21], [22], [30], and others
[17], [23] are generally employed. Note that no matter
which mechanism is applied, the most critical factor is how
to represent each e-mail for near-duplicate matching. The e-
mail abstraction not only should capture the near-duplicate
phenomenon of spams, but should avoid accidental
deletion of hams. In [30], the first N hash values of each
length L substring are used as vector representation of the
e-mail. In [7], [8], [17], a 32-byte code derived from a
variation of Nilsimsa digest technique is utilized to
represent the distribution of word trigrams in e-mail. In
[23], [24], [25], the authors improve the open digest
technique [7] by representing each e-mail with multiple
digests produced from the strings of fixed length sampled
at randomized positions within e-mail. In [12], [31], a
feature vector of a block text fingerprint generated from the
set of checksums of each length L substring is exploited. In
[22], the authors make use of spam-vocabulary patterns
produced by Teiresias pattern discovery algorithm. In [16],
the I-Match signature determined by a set of unique terms
shared by spams and the I-Match lexicon is put to use. In
[21], the content similarity of e-mails computed using
extracted words is measured. It is noted that most existing
methods generate e-mail abstractions based mainly on
content text. However, randomized and normal paragraphs
are commonly inserted in spams nowadays, and thus if an
e-mail abstraction is generated by the whole content text,
the near-duplicate part of spams cannot be captured.
Moreover, generating e-mail abstraction with the content
text also suffers from the problem of not being applicable to
all languages.
In light of the above problems, it deserves further studies
to design a better e-mail abstraction approach that is more
robust to withstand intentional attacks.
3 E-MAIL ABSTRACTION SCHEME
In this section, a novel e-mail abstraction scheme is
introduced. In Section 3.1, procedure SAG is presented to
depict the generation process of an e-mail abstraction. The
devised data structures SpTable and SpTrees are illustrated
in Section 3.2. Finally, the robustness issue is discussed in
Section 3.3.
3.1 Structure Abstraction Generation
We propose the specific procedure SAG to generate the e-mail
abstraction using HTML content in e-mail. SAG is elaborated
with the example of Fig. 3, and the algorithmic form of SAG is
outlined in Fig. 1. Procedure SAG is composed of three major
phases, Tag Extraction Phase, Tag Reordering Phase, and
<anchor> Appending Phase. In Tag Extraction Phase, the
name of each HTML tag is extracted, and tag attributes and
attribute values are eliminated. In addition, each paragraph
of text without any tag embedded is transformed to
<mytext/>. In lines 4-5, <anchor> tags are then inserted
into AnchorSet, and the first 1,023 valid tags are concatenated
to form the tentative e-mail abstraction. Note that we retain
only the first 1,023 tags as the tag sequence. The main reason is
that the rear part of long e-mails can be ignored without
affecting the effectiveness of near-duplicate matching.
Subsequently, in line 6 of Fig. 1, we preprocess the tag sequence of
the tentative e-mail abstraction. One objective of this
preprocessing step is to remove tags that are common but
not discriminative between e-mails. The other objective is to
prevent malicious tag insertion attack, and thus the robustness
of the proposed abstraction scheme can be further
enhanced. The following sequence of operations is performed
in the preprocessing step.
1. Front and rear tags (as shown in the gray area of the
example e-mail in the top of Fig. 3) are excluded.
2. Nonempty tags (footnote 2) that have no corresponding start
tags or end tags are deleted. Besides, mismatched
nonempty tags are also deleted.
3. All empty tags (footnote 2) are regarded as the same and are
replaced by the newly created <empty/> tag.
Moreover, successive <empty/> tags are pruned
and only one <empty/> tag is retained.
4. The pairs of nonempty tags enclosing nothing are
removed.
Example 1. Consider the example e-mail abstraction in Fig. 2
that has been produced through the execution of lines 1-5
in procedure SAG. The first operation of the preprocessing
step is to remove tags which are in front of the <body> tag
and which are in rear of the </body> tag. With regard to
operation 2, since there is no end tag of <a>, this tag is
deleted. Besides, the tags <font> and </font> are also
deleted because the position of </font> is incorrect. Note
that we can utilize the stack data structure to determine
whether nonempty tags are mismatched. After that, empty
tags are transformed to <empty/> in operation 3. Moreover,
since <mytext/><img/><br/> appear consecutively,
only one <empty/> tag is retained. Finally,
<div><font></font></div> are removed due to the
lack of content.
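The four preprocessing operations can be sketched as below. This is one plausible stack-based interpretation, not the procedure of Fig. 1 or Fig. 2: tags are assumed to be strings such as `<div>`, `</div>`, and `<img/>`, the <body> and </body> tags themselves are assumed to be dropped in operation 1, and an end tag is assumed to close the nearest open start tag of the same name, discarding any unclosed start tags in between. The function name `preprocess` is hypothetical.

```python
def tag_name(tag):
    return tag.strip("</>")


def is_empty(tag):
    return tag.endswith("/>")


def is_end(tag):
    return tag.startswith("</")


def preprocess(tags):
    # Operation 1: keep only the tags between <body> and </body>.
    start = tags.index("<body>") + 1 if "<body>" in tags else 0
    end = len(tags) - 1 - tags[::-1].index("</body>") if "</body>" in tags else len(tags)
    tags = tags[start:end]

    # Operation 2: delete unmatched or mismatched nonempty tags with a stack.
    keep = [True] * len(tags)
    stack = []  # indices of start tags that are still open
    for i, tag in enumerate(tags):
        if is_empty(tag):
            continue
        if not is_end(tag):
            stack.append(i)
        elif tag_name(tag) in [tag_name(tags[j]) for j in stack]:
            while tag_name(tags[stack[-1]]) != tag_name(tag):
                keep[stack.pop()] = False  # start tag that was never closed
            stack.pop()
        else:
            keep[i] = False  # end tag without a start tag
    for j in stack:
        keep[j] = False
    tags = [t for t, k in zip(tags, keep) if k]

    # Operation 3: replace empty tags by <empty/> and prune successive ones.
    collapsed = []
    for tag in tags:
        if not is_empty(tag):
            collapsed.append(tag)
        elif not collapsed or collapsed[-1] != "<empty/>":
            collapsed.append("<empty/>")

    # Operation 4: remove pairs of nonempty tags that enclose nothing.
    result = []
    for tag in collapsed:
        if is_end(tag) and result and result[-1] == f"<{tag_name(tag)}>":
            result.pop()
        else:
            result.append(tag)
    return result
```

For instance, the sequence `<html><body><div><a><mytext/><img/><br/></div><div><font></font></div></body></html>` reduces to `<div><empty/></div>` under these assumptions.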
Fig. 1. Algorithmic form of procedure SAG.
Fig. 2. An example of the preprocessing step in Tag Extraction Phase of
procedure SAG.
2. In the HTML Document Type Definition (DTD), empty tags contain no
text and have no end tags. For example, <br/>, <hr/>, <img/>, <input/>
are empty tags. The newly created tag, <mytext/>, is also regarded as an
empty tag in this paper. On the other hand, all tags that are not declared in
the DTD as empty tags are nonempty tags. Nonempty tags must have an
end tag.
The middle part of Fig. 3 shows an example of a tentative
e-mail abstraction and AnchorSet (i.e., <spam.com>)
derived from Tag Extraction Phase.
To accelerate the near-duplicate matching
process, we reorder the tag sequence of an e-mail abstraction
in Tag Reordering Phase. Note that since the arrangement
of HTML tags is regular and in pairs, various
sequential patterns of tags are contained in e-mails. In the
worst case, if we consider two e-mail abstractions which
have the same tag length and differ only in their last tags,
the difference cannot be detected until the last tags are
compared. To handle this problem, we destroy the
regularity by rearranging the order of tag sequence to
lower the number of tag comparisons. Note that this process
ensures that the newly assigned position numbers of e-mail
abstractions with the same number of tags are completely
identical. As such, the matching process can be accelerated
without violating the definition of near-duplicate in this
paper. In lines 8-11 of Fig. 1, each tag is assigned a new
position number by function ASSIGN_PN (PN denotes
Position Number) with the following expressions,
$$b = \lceil \sqrt{L} \rceil,$$
$$i = (PN_{orig} - 1) \bmod b,$$
$$s = \lfloor (PN_{orig} - 1) / b \rfloor + 1,$$
$$PN_{new} = b \times i + b - s + 1,$$
where $L$ is the tag length of an e-mail abstraction, and $PN_{orig}$
is the original position number. Variable $b$ is the number of
buckets. Variable $i$ indicates which bucket the tag should be placed in,
and variable $s$ is the number of shift counts from the end of
this bucket. Fig. 3 demonstrates the assignment of the first
six tags. The final e-mail abstraction is the concatenation of
all tags with new position numbers (the vacant positions,
e.g., positions 9 and 13 in Fig. 3, are ignored). Additionally, if
the tag length of an e-mail abstraction is smaller than a
predefined tag length threshold (set as 16 in the experiments)
of the short e-mail, the tags in AnchorSet will be
appended in front of the e-mail abstraction. The main
objective of appending <anchor> tags is to reduce the
probability that a ham is successfully matched with reported
spams when the tag length of an e-mail abstraction is short.
An example e-mail abstraction produced by procedure SAG
is shown in the bottom of Fig. 3.
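The position assignment of Tag Reordering Phase can be sketched as follows, assuming the number of buckets is the ceiling of the square root of the tag length and that a tag's new position is counted back from the end of its bucket. The function names `assign_pn` and `reorder` are illustrative, and the variable names are chosen for this sketch. With a tag length of 14, this assignment leaves exactly positions 9 and 13 vacant, in line with the vacant positions mentioned for Fig. 3.

```python
import math


def assign_pn(pn_orig, length):
    """New 1-based position of the tag at 1-based position pn_orig."""
    buckets = math.isqrt(length - 1) + 1      # ceil(sqrt(length)) for length >= 1
    bucket = (pn_orig - 1) % buckets          # which bucket the tag is placed in
    shift = (pn_orig - 1) // buckets + 1      # shift count from the end of that bucket
    return buckets * bucket + buckets - shift + 1


def reorder(tags):
    """Concatenate tags in order of new position, skipping vacant positions."""
    slots = {assign_pn(k, len(tags)): tag for k, tag in enumerate(tags, start=1)}
    return [slots[position] for position in sorted(slots)]
```

Since the new positions depend only on the tag length, two abstractions of equal length are permuted identically, so exact matching is unaffected by the reordering.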
3.2 Design of SpTable and SpTrees
One major focus of this work is to design the innovative
data structure to facilitate the process of near-duplicate
matching. SpTable and SpTrees (sp stands for spam) are
proposed to store large amounts of the e-mail abstractions
of reported spams. As shown in Fig. 4, several SpTrees are
the kernel of the database, and the e-mail abstractions of
collected spams are maintained in the corresponding
SpTrees. According to Definition 3, two e-mail abstractions
are possible to be near-duplicate only when the numbers of
their tags are identical. Thus, if we distribute e-mail
abstractions with different tag lengths into diverse SpTrees,
the quantity of spams required to be matched will decrease.
However, if each SpTree is only mapped to one single tag
length, it is too much of a burden for a server to maintain
such thousands of SpTrees. In view of this concern, each
SpTree is designed to take charge of e-mail abstractions
within a range of tag lengths. As can be seen in Fig. 4,
SpTable is created to record overall information of SpTrees.
The $i$-th
column of SpTable links to the root of SpTree_i by a
pointer, and e-mail abstractions with tag lengths ranging
from $2^i$ to $2^{i+1} - 1$ belong to SpTree_i.
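The mapping from a tag length to its SpTree follows from the stated ranges: the index is the integer part of the base-2 logarithm of the tag length. A small sketch, with the hypothetical function name `sptree_index`:

```python
def sptree_index(tag_length):
    """Index i such that 2**i <= tag_length <= 2**(i + 1) - 1."""
    if tag_length < 1:
        raise ValueError("tag length must be at least 1")
    return tag_length.bit_length() - 1
```

With abstractions capped at 1,023 tags, the largest index is 9, which gives the 10 SpTrees described in this section.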
Regarding how an e-mail abstraction is stored in SpTree,
Fig. 5 gives an example with the same e-mail abstraction
derived from Fig. 3. An e-mail abstraction is segmented into
several subsequences, and these subsequences are consecu-
tively put into the corresponding nodes from low levels to
high levels. As such, an e-mail abstraction is stored in one
path from the root node to a leaf node of SpTree, and hence
the matching between a testing e-mail and known spams is
processed from root to leaf. As shown in Fig. 5, the example
e-mail abstraction is stored in a path from the root node $a$ to
the leaf node d. The primary goal of applying the tree data
structure for storage is to reduce the number of tags
required to be matched when processing from root to leaf.
Since only subsequences along the matching path from root
to leaf should be compared, the matching efficiency is
substantially increased. Note that if each type of HTML tag
Fig. 3. An example procedure flow of SAG.
Fig. 4. The data structures of SpTable and SpTrees.
determines a branch direction (i.e., the tree degree will be
the number of HTML tag types) and each level of SpTree
contains merely one HTML tag (i.e., the tree height will be
the tag length of the longest e-mail abstraction), the number
of tag comparisons will be minimum. However, it is
infeasible because the degrees and the heights of SpTrees
will be too large, and SpTrees will be extremely unbalanced.
To achieve efficient matching with balanced tree struc-
ture, SpTrees are designed to be binary trees. The branch
direction of each SpTree is determined by a binary hash
function. If the first tag of a subsequence is a start tag (e.g.,
<div>), this subsequence will be placed into the left child
node. A subsequence whose first tag is an end tag (e.g.,
</div>) will be placed into the right child node. Since most
HTML tags are in pairs and the proposed e-mail abstraction
is reordered in procedure SAG, subsequences are expected
to be uniformly distributed. Moreover, on level $i$ of each
SpTree (with the root on level 0), each node stores
subsequences whose tag lengths are equal to $2^i$. For instance,
as shown in Fig. 3, the subsequence <spam.com> (whose
tag length is $2^0$) is placed into level 0, the subsequence
</p><a> (whose tag length is $2^1$) is placed into level 1, and
so forth. Note that since SpTree_i takes charge of e-mail
abstractions with tag lengths ranging from $2^i$ to $2^{i+1} - 1$,
based on the above-mentioned arrangement, the last
subsequence of each e-mail abstraction in an SpTree will
be stored in the leaf nodes on the same level. Also note that
the tag lengths of subsequences stored in leaf nodes of level $j$
range from 1 to $2^j$. As described in Section 3.1, we design
that the longest length of an e-mail abstraction is 1,023,
meaning that there are totally 10 SpTrees (from SpTree_0 to
SpTree_9) in our database when the proposed arrangement
is applied. In addition, to further accelerate the process of
matching, we employ a hash function to map each
subsequence to an integer. The key idea is that only
subsequences which look like the testing subsequence
should be exactly matched. The hash function is defined
as follows:
$$hash(se) = f(se_0) \times 2^{n-1} + f(se_1) \times 2^{n-2} + \cdots + f(se_{n-1}) \times 2^0,$$
where $n$ is the number of tags in this subsequence and $se_i$
denotes the tag type of the $i$-th tag. The function $f$ converts
each type of tag to a unique integer. Moreover, for the
subsequence which contains more than eight tags, we just
use the first eight tags to generate the hash value (i.e., $n = 8$).
With the hash function, most subsequence matching is
transformed to the integer matching, and hence the complex-
ity of matching process can be substantially reduced.
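The subsequence hash can be sketched as a weighted sum over at most the first eight tags. The text does not specify how tag types are converted to integers, so the encoder below, which numbers tag types in order of first appearance, is purely an assumption; `make_tag_encoder` and `subsequence_hash` are hypothetical names.

```python
def make_tag_encoder():
    """Return a function mapping each distinct tag type to a unique integer."""
    codes = {}

    def encode(tag):
        # Assumption: tag types are numbered 1, 2, 3, ... as they first appear.
        return codes.setdefault(tag, len(codes) + 1)

    return encode


def subsequence_hash(subsequence, encode, max_tags=8):
    """Sum of encode(tag) * 2**(n - 1 - k) over the first n <= max_tags tags."""
    head = subsequence[:max_tags]
    n = len(head)
    return sum(encode(tag) * 2 ** (n - 1 - k) for k, tag in enumerate(head))
```

Different subsequences can share a hash value, so equal integers only mark candidates that then need an exact tag-by-tag comparison.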
Overall, the advantageous features of this innovative
arrangement are as follows: 1) The height of an SpTree is
equal to $\lfloor \lg L \rfloor$, where $L$ is the tag length of the longest
e-mail abstraction in this SpTree. 2) Owing to the fact that
parent nodes store fewer subsequences than
children nodes, we design that longer subsequences are
put into higher levels (the tag length of the subsequence
on level $i$ is $2^i$). Thus, the number of tags matched from
root to leaf is markedly decreased. Moreover, with the
hash function, the matching efficiency is substantially
increased. 3) The numbers of tags stored in the nodes of
an SpTree are expected to be similar, and hence SpTrees
are balanced binary trees. Assume that there are $N$ e-mail
abstractions in SpTree_i and subsequences on the same
level are uniformly distributed. For each node on level $j$,
there are $N / 2^j$ subsequences (the number of nodes on level $j$
is $2^j$) whose tag lengths are equal to $2^j$. Thus, the number
of tags stored in each node is $(N / 2^j) \times 2^j = N$, which is not
correlated with $j$. It is noted that the number of tags in
each leaf node is less than $N$ because not all subsequences
in leaf nodes are with the longest allowed tag
length.
On the other hand, as shown in the bottom of Fig. 5,
certain additional information is required and kept in each
subsequence of a node. For subsequences in internal nodes,
spam id and timestamp are included. For subsequences in
external nodes, spam id, user id, timestamp, length $L$ (the
length of the e-mail abstraction), and $S_R$ (the reputation
score) are involved.
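The storage and matching path can be sketched as follows. This is a simplified illustration under several assumptions: each node keeps a dictionary from a subsequence to the set of spam ids containing it, SpTable is a plain dictionary from tree index to root, and the hash function, timestamps, user ids, and reputation scores are omitted. A subsequence whose first tag is not an end tag (including <empty/>) is assumed to go left. All names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    entries: dict = field(default_factory=dict)  # subsequence -> set of spam ids
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def segment(tags):
    """Split into subsequences of length 1, 2, 4, ...; the last may be shorter."""
    parts, start, level = [], 0, 0
    while start < len(tags):
        parts.append(tuple(tags[start:start + 2 ** level]))
        start += 2 ** level
        level += 1
    return parts


def child(node, part, create):
    # End tag first -> right child; anything else -> left child (assumption).
    side = "right" if part[0].startswith("</") else "left"
    if getattr(node, side) is None and create:
        setattr(node, side, Node())
    return getattr(node, side)


def insert(sptable, tags, spam_id):
    node = sptable.setdefault(len(tags).bit_length() - 1, Node())
    for level, part in enumerate(segment(tags)):
        if level > 0:
            node = child(node, part, create=True)
        node.entries.setdefault(part, set()).add(spam_id)


def matches(sptable, tags):
    """True if some stored spam has exactly this tag sequence."""
    node = sptable.get(len(tags).bit_length() - 1) if tags else None
    if node is None:
        return False
    candidates = None
    for level, part in enumerate(segment(tags)):
        if level > 0:
            node = child(node, part, create=False)
            if node is None:
                return False
        ids = node.entries.get(part, set())
        candidates = set(ids) if candidates is None else candidates & ids
        if not candidates:
            return False
    return True
```

Intersecting spam ids along the path ensures that all subsequences come from the same reported spam, so a match here coincides with the exact-identity definition of near-duplicate.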
3.3 Robustness Issue
The main difficulty of near-duplicate spam detection is to
withstand malicious attack by spammers. Prior approaches
generate e-mail abstractions based mainly on hash-based
content text. These methods primarily differ in what
granularity is used as the input of the hash function. For
example, the authors in [16], [21], [22] extract words or
terms to generate the e-mail abstraction. Besides, substrings
extracted by various techniques are widely employed in [7],
[8], [12], [17], [23], [25], [30], [31]. However, this type of
e-mail representation inherently has the following disadvantages.
First, the insertion of a randomized and normal paragraph
can easily defeat this type of spam filters. Moreover, since
the structures and features of different languages are
diverse, word and substring extraction may not be applic-
able to e-mails in all languages. Concretely speaking, for
instance, trigrams of substrings used in [7], [8], [17] are not
suitable for nonalphabetic languages, such as Chinese.
In this paper, we devise a novel e-mail abstraction scheme
that considers e-mail layout structure to represent e-mails.
To assess the robustness of the proposed scheme, we model
possible spammer attacks and organize these attacks into
the following three categories. Examples and the outputs of
preprocessing of procedure SAG are shown in Fig. 6.
3.3.1 Random Paragraph Insertion
This type of spammer attack is commonly used nowa-
days. As shown in Fig. 6, normal contents without any
Fig. 5. The illustration of SpTree_3 with an example e-mail abstraction.
advertisement keywords are inserted to confuse text-
based spam filtering techniques. It is noted that our
scheme transforms each paragraph into a newly created
tag <iytcrt,, and consecutive empty tags will then be
transformed to <cijty,. As such, the representation of
each random inserted paragraph is identical, and thus our
scheme is resistant to this type of attack.
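
The effect of replacing text with a single marker can be illustrated with a minimal Python sketch. It is not procedure SAG itself: the names TagCollector and layout_abstraction, the small VOID_TAGS set, and the choice to merge every run of empty items (including a lone text paragraph) into one <empty/> are assumptions made only for illustration, and the remaining steps of SAG are omitted.

```python
from html.parser import HTMLParser

# Assumed set of tags that never enclose content (illustrative only).
VOID_TAGS = {"br", "hr", "img"}
EMPTY = "<empty/>"


class TagCollector(HTMLParser):
    """Records the tag sequence of an HTML body and ignores the wording."""

    def __init__(self):
        super().__init__()
        self.sequence = []

    def handle_starttag(self, tag, attrs):
        # Void tags count as empty items; other tags are kept by name.
        self.sequence.append(EMPTY if tag in VOID_TAGS else f"<{tag}>")

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.sequence.append(f"</{tag}>")

    def handle_data(self, data):
        # A paragraph of text plays the role of <mytext/>, itself an empty item.
        if data.strip():
            self.sequence.append(EMPTY)


def layout_abstraction(html):
    """Return the tag sequence with each run of empty items merged into one."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    merged = []
    for item in collector.sequence:
        if item == EMPTY and merged and merged[-1] == EMPTY:
            continue
        merged.append(item)
    return merged


original = "<div><p>Buy pills now</p><img src='x.png'></div>"
padded = ("<div><p>Buy pills now<br>The weather was nice today.<br>"
          "See you soon</p><img src='x.png'></div>")
print(layout_abstraction(original) == layout_abstraction(padded))
```

Under these assumptions the padded e-mail and the original one yield the same tag sequence, which is the reason inserted text alone does not change the abstraction.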
3.3.2 Random HTML Tag Insertion
If spammers know that the proposed scheme is based on HTML tag
sequences, they may insert random HTML tags rather than random
paragraphs. On the one hand, arbitrary tag insertion causes
syntax errors due to tag mismatching, which may lead to an
abnormal display of the spam content that spammers do not want.
On the other hand, procedure SAG also adopts some heuristics
(as depicted in Section 3.1) to deal with the random insertion
of empty tags and the mismatching of nonempty tags. Fig. 6
shows two example outputs, and the details of each step can be
found in Fig. 2. With the proposed method, most randomly
inserted tags are removed, and thus the effectiveness of random
tag insertion is limited. We verify this inference in
Section 5.4.
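
One way to picture the handling of mismatched nonempty tags is a stack-based pass that keeps only start and end tags that pair up. The following Python sketch is a hypothetical heuristic written for illustration; the function name drop_unmatched and the pairing rule are assumptions, and the actual preprocessing steps of procedure SAG are those given in Fig. 2.

```python
def drop_unmatched(tags):
    """Keep empty tags and properly paired tags; drop unmatched start/end tags.

    Tags are strings such as '<b>', '</b>' or '<empty/>'.
    """
    keep = [True] * len(tags)
    open_stack = []  # indices of start tags still waiting for an end tag
    for i, tag in enumerate(tags):
        if tag.endswith("/>"):
            continue  # empty tags are never paired
        if tag.startswith("</"):
            name = tag[2:-1]
            # Look for the nearest open start tag with the same name.
            for pos in range(len(open_stack) - 1, -1, -1):
                if tags[open_stack[pos]][1:-1] == name:
                    # Start tags opened after it were never closed.
                    for j in open_stack[pos + 1:]:
                        keep[j] = False
                    del open_stack[pos:]
                    break
            else:
                keep[i] = False  # end tag without a start tag
        else:
            open_stack.append(i)
    for j in open_stack:
        keep[j] = False  # start tags that were never closed
    return [t for t, k in zip(tags, keep) if k]


attacked = ["<table>", "<b>", "<tr>", "</i>", "<empty/>",
            "</tr>", "</table>", "<font>"]
print(drop_unmatched(attacked))
```

In this example the randomly inserted <b>, </i>, and <font> have no partner and are removed, while the legal <table> and <tr> pairs survive.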
3.3.3 Sophisticated HTML Tag Insertion
More sophisticated spammers may insert legal HTML tag patterns.
As shown in Fig. 6, tag patterns that conform to the syntax
rules are not eliminated. However, although some crafty tricks
are conceivable, it is not straightforward for spammers to
generate a large number of spams with completely distinct
e-mail layout structures.
Note that due to space limitations, we are not able to discuss
all possible situations. Nevertheless, representing e-mails by
layout structure is more robust to most existing attacks than
text-based approaches. Even if a new attack is designed, we can
react to it by adjusting the preprocessing step of procedure
SAG. In addition, our approach extracts only HTML tag sequences
and transforms each paragraph with no embedded tag to
<mytext/>, meaning that the proposed abstraction scheme can be
applied to e-mails in all languages without modifying any
components. This important feature also enables system Cosdes
to perform more robustly. We assess the effectiveness of our
approach with real e-mail streams in Section 5.2.
4 COLLABORATIVE SPAM DETECTION SYSTEM COSDES
The complete collaborative spam detection system Cosdes is
introduced in this section. The system model of Cosdes is
given in Section 4.1. We then elaborate on the processing
handlers of Cosdes in Section 4.2. Finally, we describe the
reputation mechanism of Cosdes in Section 4.3.
4.1 System Model of Cosdes
The system model of Cosdes is illustrated in Fig. 7, and the
algorithmic form is outlined in Fig. 8. Initially, three
parameters should be given for Cosdes: T_m (the maximum time
span for which reported spams are retained in the system), T_d
(the time span for triggering Deletion Handler), and S_th (the
score threshold for determining spams). Before starting spam
detection, Cosdes collects feedback spams for time T_m in
advance to construct an initial database. Three major modules,
Abstraction Generation Module, Database Maintenance Module, and
Spam Detection Module, are included in Cosdes. In Abstraction
Generation Module, each e-mail is converted to an e-mail
abstraction by Structure Abstraction Generator with procedure
SAG. Three types of action handlers, Deletion Handler,
Insertion Handler, and Error Report Handler, are involved in
Database Maintenance Module. Note that although the term
database is used, the collection of reported spams can
essentially be stored in main memory to facilitate the process
of matching. In addition, Matching Handler in Spam Detection
Module takes charge of determining results.
Fig. 6. Examples of possible spammer attacks.
Fig. 7. System model of Cosdes.
Fig. 8. Algorithmic form of system Cosdes.
Cosdes deals with three types of e-mails: reported spams,
testing e-mails, and misclassified hams. When receiving a
reported spam, Insertion Handler adds the e-mail abstraction of
this spam into the database unless the reputation score of the
reporter is too low. Whenever a new testing e-mail arrives,
Matching Handler performs near-duplicate detection against the
collected spams to make the judgment. If a testing e-mail is
classified as a spam, it is also treated as a reported spam and
added into the database. Moreover, Error Report Handler copes
with feedback on misclassified hams and adjusts Cosdes by
degrading the reputation of the related reporters to prevent
malicious attacks. Every T_d, Deletion Handler is triggered to
delete obsolete spams that have existed longer than T_m.
Deleting outdated spams not only alleviates the overhead of the
server but also reduces the risk of accidentally deleting hams.
Due to the evolving nature of spams, it is inappropriate to
use old spams to filter current ones. Overall, Cosdes is
self-adjusting and retains the most up-to-date spams for
near-duplicate detection.
4.2 Procedures of System Cosdes
In this section, we elucidate each procedure of system Cosdes.
Cosdes deals with four circumstances by handlers (the
algorithmic forms are shown in Fig. 9), and the detailed
procedure flow is explained as follows. For Insertion Handler
in Fig. 9a, the corresponding SpTree is first found in SpTable
according to the tag length of the inserted spam, and nowNode
is assigned as the root of this SpTree. In lines 3-8, we
iteratively insert the subsequences of the e-mail abstraction
along the path from root to leaf. If nowNode is an internal
node, the subsequence with 2^i tags is inserted into level i,
as illustrated in Fig. 5, and the hash value of this
subsequence is computed. Then, nowNode is assigned as the
corresponding child node based on the type of the next tag: if
the next tag is a start (end) tag, nowNode is assigned as the
left (right) child node. Finally, when nowNode reaches a leaf
node, the subsequence with the remaining tags is stored.
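
The insertion path can be sketched in Python as follows. This is a simplified illustration rather than the algorithm of Fig. 9a: the Node class, the insert function, and the leaf_level parameter are hypothetical names, Python's built-in hash stands in for the hash function of the paper, any tag that is not an end tag is assumed to branch left, and the abstraction is assumed to contain at least 2**leaf_level tags.

```python
class Node:
    """One SpTree node holding the subsequences stored at its level."""

    def __init__(self):
        self.entries = []   # (hash value, subsequence, spam_id)
        self.left = None    # followed when the next tag is a start tag
        self.right = None   # followed when the next tag is an end tag


def insert(root, tags, spam_id, leaf_level):
    """Insert one abstraction; internal level i receives 2**i tags."""
    node, pos = root, 0
    for level in range(leaf_level):
        chunk = tuple(tags[pos:pos + 2 ** level])
        node.entries.append((hash(chunk), chunk, spam_id))
        pos += 2 ** level
        # Branch on the type of the next tag (assumption: non-end tags go left).
        if tags[pos].startswith("</"):
            node.right = node.right or Node()
            node = node.right
        else:
            node.left = node.left or Node()
            node = node.left
    rest = tuple(tags[pos:])  # the leaf stores all remaining tags
    node.entries.append((hash(rest), rest, spam_id))
    return node


root = Node()
leaf = insert(root, ["<html>", "<body>", "<empty/>", "</body>", "</html>"],
              spam_id=1, leaf_level=2)
print(leaf.entries[0][1])
```

With leaf_level set to 2, the root receives one tag, the next level receives two tags, and the leaf keeps the remaining two.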
Matching Handler (as shown in Fig. 9b) is the most significant
procedure in Cosdes, since it must match every testing e-mail
against the known spam database efficiently. There are two
major phases in the matching process: Approximate Matching
Phase and Exact Matching Phase. As mentioned in Section 3.2,
the tag lengths of e-mail abstractions in an SpTree may not be
identical. However, two e-mail abstractions can be
near-duplicate only when their numbers of tags are identical.
For this reason, in Approximate Matching Phase, we traverse
directly to the targeted leaf node based on the types of the
tags at positions 2^i without doing tag comparisons. A testing
e-mail can be near-duplicate only with spams that have the same
tag length and are in the same path. Therefore, we tentatively
record the information of the spams that appear in the targeted
leaf node and have the same tag length into a candidate set
candSet. The main objectives of the approximate matching are
1) to reduce unnecessary tag comparisons of e-mails with
different tag lengths, and 2) to exclude e-mails which can be
determined without the exact tag matching. Subsequently, the
process starts Exact Matching Phase from the root of the
SpTree. For each level, in lines 11-17, the hash values of
subsequences are matched first, and the exact matching of
subsequences is done only if their hash values match. The
information of unmatched spams is deleted from candSet.
Moreover, the exact tag matching is only processed up to level
f_level (set as 3 in the experiments); namely, only the first
2^(f_level+1) - 1 tags are exactly matched. This means that a
looser matching criterion is applied when the length of an
e-mail abstraction is longer than 2^(f_level+1). This looser
criterion substantially improves the efficiency of matching but
does not influence detection results, owing to the effects of
the preceding approximate matching and the tag reordering
process of procedure SAG. Finally, if the sum of S_R of all
candidate spams in candSet exceeds S_th, the testing e-mail is
classified as a spam.
Fig. 9. Algorithmic forms of all processing handlers.
(a) Algorithmic form of Insertion Handler. (b) Algorithmic form
of Matching Handler. (c) Algorithmic form of Error Report
Handler. (d) Algorithmic form of Deletion Handler.
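
The decision rule can be sketched without the tree as a flat Python function. This is a simplification under stated assumptions, not the algorithm of Fig. 9b: the approximate phase is reduced to a tag-length filter, Python's built-in hash stands in for the hash function of the paper, and the names level_chunks and is_spam are hypothetical. With f_level = 3, only the first 2^4 - 1 = 15 tags are compared.

```python
def level_chunks(tags, f_level):
    """Split the first 2**(f_level + 1) - 1 tags into chunks of 1, 2, 4, ... tags."""
    chunks, pos = [], 0
    for level in range(f_level + 1):
        chunks.append(tuple(tags[pos:pos + 2 ** level]))
        pos += 2 ** level
    return chunks


def is_spam(test_tags, database, s_th, f_level=3):
    """database: list of (tag sequence, reputation score) for reported spams."""
    test_chunks = level_chunks(test_tags, f_level)
    # Approximate phase (simplified): keep spams with the same tag length.
    cand_set = [(level_chunks(tags, f_level), score)
                for tags, score in database if len(tags) == len(test_tags)]
    # Exact phase: per level, compare hash values first, then the tags.
    for level, chunk in enumerate(test_chunks):
        cand_set = [(chunks, score) for chunks, score in cand_set
                    if hash(chunks[level]) == hash(chunk)
                    and chunks[level] == chunk]
    # Classify as spam when the summed scores exceed the threshold.
    return sum(score for _, score in cand_set) > s_th


spam = ["<html>", "<body>", "<empty/>", "</body>", "</html>"]
ham = ["<html>", "<body>", "<p>", "</body>", "</html>"]
database = [(spam, 2.0), (spam, 1.5), (spam + ["<empty/>"], 10.0)]
print(is_spam(spam, database, s_th=3), is_spam(ham, database, s_th=3))
```

In this toy database two reports with scores 2.0 and 1.5 match the testing e-mail, so their sum of 3.5 exceeds a threshold of 3, while the entry with a different tag length never becomes a candidate.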
To handle e-mails with only plain text, we maintain a blacklist
of not only sender addresses but also URLs and e-mail addresses
included in the content of reported spams. Gathering this
feedback information to block text/plain spams is very
effective since spammers have to let receivers connect to them
in order to make profits. Therefore, Cosdes can still cope with
plain-text spams even though we concentrate on e-mail layout
structure in this paper.
When receiving a misclassified ham, Error Report Handler
(shown in Fig. 9c) first finds the corresponding SpTree and
performs the same matching process as Matching Handler. For the
spams matched with the reported misclassified ham, we reset
S_R of these spams to 0 to avoid subsequent misclassification
caused by the identical group of spams. In addition, the
reputation scores of the reporters who caused the false
positive error are halved to prevent continuous attacks by
specific users.
Moreover, to delete obsolete spams, every T_d Deletion Handler
(as shown in Fig. 9d) traverses each SpTree in inorder
(traverses the left subtree, visits the root, and then
traverses the right subtree) to visit all nodes in the SpTrees.
For each subsequence, if its existing time exceeds T_m, it is
viewed as outdated and deleted from this node. As such, all
obsolete spams are removed from the known spam database after
Deletion Handler is processed.
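
The traversal can be sketched in Python as below. The Node class, the dictionary-based entries, and the function name delete_outdated are assumptions for illustration; only the inorder visit and the age test against T_m follow the description above.

```python
class Node:
    """A tree node whose entries each carry the timestamp of the report."""

    def __init__(self, entries, left=None, right=None):
        self.entries = entries  # list of dicts with a "timestamp" key
        self.left, self.right = left, right


def delete_outdated(node, now, t_m):
    """Inorder traversal that drops entries older than t_m; returns the count."""
    if node is None:
        return 0
    removed = delete_outdated(node.left, now, t_m)
    fresh = [e for e in node.entries if now - e["timestamp"] <= t_m]
    removed += len(node.entries) - len(fresh)
    node.entries = fresh
    removed += delete_outdated(node.right, now, t_m)
    return removed


tree = Node(
    [{"spam_id": 1, "timestamp": 0}],
    left=Node([{"spam_id": 1, "timestamp": 0}, {"spam_id": 2, "timestamp": 90}]),
    right=Node([{"spam_id": 3, "timestamp": 40}]),
)
print(delete_outdated(tree, now=100, t_m=50))
```

Here three entries are older than 50 time units and are removed, while the entry reported at time 90 is retained.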
4.3 Reputation Mechanism
The principal concept of collaborative spam detection is to
collect human judgment to block subsequent near-duplicate
spams. To ensure the truthfulness of spam reports and to
prevent malicious attacks, we propose a reputation mechanism to
evaluate the credit of each reporter. The fundamental idea is
to use a reputation table that maintains a reputation score
S_R for each reporter according to the reporter's previous
reliability record. Each inserted spam is given a suspicion
score equal to S_R of the reporter. When doing near-duplicate
detection, if the sum of the suspicion scores of the matched
spams exceeds a predefined threshold, the testing e-mail is
classified as a spam. The reputation mechanism is described in
detail as follows:
1. Each reporter is assigned an initial score S_initial when
submitting a reported spam for the first time.
2. If a reporter submits any feedback spam once more, the
reputation score is incremented by a smaller incremental
score S_incre. The value of S_incre is set as S_initial/10
in the experiments.
3. If a reporter is charged with a mistaken previous feedback
spam, the reputation score is halved.
To prevent malicious error reports and to attain a near-zero
false positive rate, we increase the reputation score
cautiously but drop it drastically when a false positive error
is reported. In addition, when S_R of a reporter is smaller
than S_initial, the reporter's subsequent feedback spams are
not added into the database until S_R is equal to or larger
than S_initial. Regarding the parameter S_th, we simply use a
fixed small value (set as three in the experiments) instead of
determining the threshold according to the ratio of total
users. The reason is that as long as a certain number of
trustworthy users report e-mails with the same e-mail
abstraction as spams, it is sufficiently reliable to classify
the subsequent near-duplicate e-mails as spams.
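
The three rules and the acceptance condition can be sketched as a small Python class. The class name ReputationTable, the value s_initial = 1.0, and the decision to apply the increment before checking whether a report is accepted are assumptions for illustration; the text above fixes only the increment S_initial/10, the halving, and the S_initial acceptance bound.

```python
class ReputationTable:
    """Keeps one reputation score per reporter."""

    def __init__(self, s_initial=1.0):
        self.s_initial = s_initial
        self.s_incre = s_initial / 10  # smaller incremental score
        self.scores = {}

    def report_spam(self, reporter):
        """Update the score; return the suspicion score, or None if rejected."""
        if reporter not in self.scores:
            self.scores[reporter] = self.s_initial    # rule 1
        else:
            self.scores[reporter] += self.s_incre     # rule 2
        score = self.scores[reporter]
        # Reports are only stored once the score is at least s_initial.
        return score if score >= self.s_initial else None

    def report_error(self, reporter):
        """Halve the score of a reporter whose spam report was mistaken (rule 3)."""
        self.scores[reporter] = self.scores.get(reporter, self.s_initial) / 2


table = ReputationTable()
first = table.report_spam("alice")    # initial score
second = table.report_spam("alice")   # initial score plus one increment
table.report_error("alice")           # score is halved
third = table.report_spam("alice")    # below s_initial, so not accepted
print(first, second, third)
```

Under these assumptions a reporter whose score was halved needs several further reports before new reports are stored again, which reflects the cautious increase and drastic decrease described above.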
5 PERFORMANCE EVALUATION
To assess the feasibility of system Cosdes, we conduct several
experiments to explore its efficiency and detection results.
The real spam data sets used in the experiments are from the
e-mail servers of the Computer Center of National Taiwan
University, which has over 30,000 students. Since the ground
truth of real e-mail streams is unavailable, spams are
extracted using the well-known existing system SpamAssassin
(footnote 3). Concerning hams, we not only include public data
sets (around 4,000 e-mails) provided by SpamAssassin
(footnote 4) but also obtain hams from volunteers. There are
about 60,000 spams per day and a set of 7,000 or so hams in the
data set. Note that numerous related works have evaluated the
proposed methods with static databases. However, to assess the
performance of a spam detection system with a near-duplicate
matching scheme, real e-mail streams are more appropriate than
static data sets. Therefore, in this paper, we use
university-scale e-mail streams as the experimental data sets
to better simulate the e-mail environment. On the other hand,
three representative approaches [7], [24], [30] of
near-duplicate spam detection are employed for comparison. The
authors of [8], [17] also adopt the same e-mail representation
approach as in [7] but with different sharing mechanisms. For
ease of presentation, Damiani's work is abbreviated as Digest,
Sarafijanovic's work is abbreviated as MultiDigest, and
Yoshida's work is abbreviated as Density. It is worth
mentioning that Sarafijanovic's work [24] improves Damiani's
[7] by representing each e-mail with multiple digests produced
from fixed-length strings sampled at randomized positions
within the e-mail. The processes of generating each digest in
Digest and MultiDigest are identical. Although Sarafijanovic's
work claims that using multiple digests can enhance the
robustness of a near-duplicate spam detection system, these two
works have not been validated by real e-mail streams. Besides,
Yoshida's work maintains a direct-mapped cache to facilitate
the process of matching. In the experiments of that work [30],
there are 10 million spams in the database and 10 percent of
the hash values are copied into the cache. However, to compare
the detection performance fairly, the cache mechanism of
Yoshida's work is not included in our experiments, meaning that
all spams in the database are used for detection. We implement
Cosdes and the comparative techniques in C++, and the programs
are executed on a Windows XP Professional platform with a
Pentium 4 3 GHz CPU and 1 GB RAM. The programs of Digest and
MultiDigest are implemented with source code shared by the
original authors of [7] and [24]. The efficiency and the space
usage of generating e-mail representations are investigated in
Section 5.1. We compare and analyze the detection results of
the four approaches in Section 5.2. The detailed efficiency
analysis is presented in Section 5.3. Finally, Section 5.4
simulates the spammer attack of random HTML tag insertion.
3. http://spamassassin.apache.org.
4. http://spamassassin.apache.org/publiccorpus. We include
three files, that is, 20030228_easy_ham.tar.bz2,
20030228_easy_ham_2.tar.bz2, and 20030228_hard_ham.tar.bz2.
5.1 E-mail Representation
The processing time of generating e-mail abstractions with the
number of e-mails varied is shown in Fig. 10a. As mentioned in
Section 2.2, most prior works on near-duplicate spam detection
represent e-mails based mainly on content text. Cosdes is the
first work that attempts to use HTML content, which depicts the
layout structure of an e-mail, for representation. In Digest,
word trigrams are extracted consecutively along the whole
e-mail content. In Density, the authors acquire the first N (in
[30], N is set as 100) hash values of the length-L substrings
obtained with a fixed-size window sliding through an e-mail. As
can be seen in Fig. 10a, Density takes the least time since it
computes only the first N hash values, whereas Digest must
compute the hash values of word trigrams in the whole e-mail.
In MultiDigest, each e-mail is represented by a set of digests,
meaning that each e-mail is separated into multiple
fixed-length strings (in [24], the length is set as 60
characters) sampled at randomized positions. Although each
string is much shorter, the overall cost still increases since
multiple strings have to be processed for each e-mail. Fig. 10a
shows that the processing time of MultiDigest is about four
times longer than that of Digest, and thus MultiDigest takes
the longest time among the four approaches. In procedure SAG of
Cosdes, HTML tags are extracted and each paragraph of text is
transformed to the newly created tag <mytext/>. If only these
operations were performed, Cosdes would be the most efficient.
However, sequence preprocessing and tag reordering are also
executed in Cosdes, and therefore the execution time increases
slightly. Overall, the cost of generating e-mail abstractions
is very low for all four approaches.
With regard to the space issue, Fig. 10b shows the memory usage
of the four approaches with the number of e-mails in the
database varied. Note that we count a hash value or an integer
as one unit of memory usage, which is approximately 2 bytes.
Digest represents each e-mail with a 32-byte code, which is
equal to 16 units. As mentioned above, MultiDigest uses
multiple 32-byte codes to represent each e-mail. In the
experimental data set, the average number of digests in each
e-mail is approximately 12, and thus the memory usage of
MultiDigest is around 12 times that of Digest. In Density,
N hash values, that is, 100 units, represent each e-mail.
However, since some e-mails are too short to yield N hash
values, the average memory usage of each e-mail in Density is
smaller than 100 units. On the other hand, a sequence of HTML
tags is the representation of each e-mail in Cosdes. We can
replace each type of tag with a unique integer, and thus each
tag can be viewed as 1 unit. The average length of the HTML tag
sequence produced by procedure SAG of Cosdes is approximately
35. In addition, as stated in Section 3.2, additional
information is required in SpTable and SpTrees of Cosdes, and
we include this memory overhead as well. It can be observed in
Fig. 10b that Digest uses the least memory and MultiDigest uses
the most among the four approaches. The memory usage of Cosdes
is two to four times that of Digest. Although a fairly
succinct e-mail abstraction can greatly reduce the overhead of
near-duplicate matching, we find in the following section that
the effectiveness of Digest cannot be validated.
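
The per-e-mail figures quoted in this paragraph can be checked with a few lines of arithmetic. The Python snippet below only restates numbers given in the text (2 bytes per unit, 32-byte digests, about 12 digests per e-mail, N = 100, an average of 35 tags); the variable names are illustrative, and the SpTable and SpTree overhead of Cosdes is not modeled because the text does not quantify it.

```python
BYTES_PER_UNIT = 2                        # one hash value or integer

digest_units = 32 // BYTES_PER_UNIT       # one 32-byte code per e-mail
multidigest_units = 12 * digest_units     # about 12 digests per e-mail
density_units_max = 100                   # at most N = 100 hash values
cosdes_tag_units = 35                     # average tag-sequence length

print(digest_units, multidigest_units, density_units_max, cosdes_tag_units)
print(round(cosdes_tag_units / digest_units, 2))
```

The tag sequence alone is already a little over twice the size of a Digest code, which is consistent with the reported total of two to four times once the additional SpTree information is included.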
5.2 Accuracy Evaluation
In this section, we evaluate the detection performance of
Cosdes and the three competing approaches. The most important
requirement for a spam detection system is the capability to
resist malicious attacks that evolve continuously. To examine
this capability, two recent streams of spams (collected from
National Taiwan University in September 2007 and February 2009)
are used as the experimental data sets. Regarding the language
of the content text, 80 percent of all e-mails are in Chinese
and 15 percent are in English. The remaining e-mails are in
Japanese, French, and so forth. Since Chinese is a
nonalphabetic language and English is an alphabetic one, the
data set can verify, to a certain extent, the effectiveness of
a spam detection system on different kinds of languages. Before
a system starts near-duplicate spam detection, a set of known
spams is inserted into the system. We consider situations with
the parameter T_m varied from one to five days. That is, as
shown in the left side of Fig. 11, the detection results are
produced by first inserting the spams of T_m days, and then
testing the spams of the following day. Note that each tested
spam is inserted into the database after the process of
matching. On the other hand, the entire set of hams is tested
in each situation. The true positive rate (TP, a real spam is
classified as a spam) and the false positive rate (FP, a real
ham is misclassified as a spam) are listed in Fig. 11.
Fig. 10. The execution time and the memory usage of generating
e-mail abstractions with the number of e-mails varied.
(a) Execution time of e-mail representation. (b) Memory usage
of e-mail representation.
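
For reference, the two rates can be computed as in the following Python sketch. The function name rates and the toy labels are illustrative assumptions and are unrelated to the experimental data; the sketch also assumes that at least one spam and one ham are present.

```python
def rates(labels, predictions):
    """labels and predictions are lists of booleans, True meaning spam."""
    spam_total = sum(labels)
    ham_total = len(labels) - spam_total
    tp = sum(1 for y, p in zip(labels, predictions) if y and p)
    fp = sum(1 for y, p in zip(labels, predictions) if not y and p)
    # TP rate is taken over real spams, FP rate over real hams.
    return tp / spam_total, fp / ham_total


labels = [True, True, True, True, False, False, False, False, False]
predictions = [True, True, True, False, True, False, False, False, False]
print(rates(labels, predictions))
```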
As can be seen in Fig. 12a, Cosdes achieves a 96.47 percent TP
rate and a 0.46 percent FP rate on average, which is the best
performance among the four approaches. The TP rate of Digest is
extremely high but its FP rate is unacceptable. In order to
accelerate near-duplicate matching, Digest uses only a 32-byte
code to represent each e-mail. Moreover, as defined in [7], two
e-mails are determined as near-duplicate if more than 182 bits
of their 32-byte (i.e., 256-bit) codes have the same value.
Fig. 12a shows that when the spam database is large, the
32-byte code is not discriminative enough to distinguish each
e-mail clearly, and thus hams are easily mismatched with known
spams. As for MultiDigest, although the authors claim in [24]
that using multiple digests to represent each e-mail is more
robust against increased obfuscation effort by spammers, the FP
rate of MultiDigest is even worse than that of Digest when the
spam database is large. The reason is that MultiDigest
separates each e-mail into a set of short strings. As long as
one digest in the huge spam database is similar to one of the
digests in the testing e-mail, this e-mail is classified as a
spam. In [7] and [24], there are only 2,500 spams and
2,500 hams in the data set, which might not suffice to reflect
the real situation. In addition, the effectiveness of Digest
and MultiDigest has not been validated by real e-mail streams.
On the other hand, the effectiveness of
Density has been evaluated in [30] with 10 million spams in
the database. One problem of Density is that a huge number
of known spams are required to make the proposed cache
mechanism of Density work well. However, in our experi-
ments, even though we do not consider the cache mechan-
ism of Density and all reported spams are used for near-
duplicate spam detection, the effectiveness of Density on
more recent e-mail streams cannot be validated. Moreover,
several parameters should be given for Density and be
adjusted according to different environments. Since the
authors in [30] do not provide the parameter tuning method,
in this experiment, we follow the same setting as in [30] and
obtain the results in Fig. 11. Note that various tricks aimed
at nullifying hash-based text representations have been
increasingly employed recently. Besides, most spams in our data
set are in Chinese, which is a nonalphabetic language, whereas
Digest and MultiDigest generate hash values from trigrams of
substrings. A method of this kind is essentially designed to
resist the word obfuscation of alphabetic languages. Therefore,
the results in this section also verify the language
restriction of the comparative approaches. These factors could
partially account for the results shown in Fig. 11.
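
The Digest matching criterion quoted in this section (more than 182 equal bits out of 256) can be written as a short Python sketch. How the 32-byte code is produced from word trigrams is not reproduced here; the function names matching_bits and digest_near_duplicate and the example codes are illustrative assumptions.

```python
def matching_bits(code_a: bytes, code_b: bytes) -> int:
    """Number of bit positions at which two 32-byte codes agree."""
    if len(code_a) != 32 or len(code_b) != 32:
        raise ValueError("both codes must be 32 bytes long")
    differing = sum(bin(x ^ y).count("1") for x, y in zip(code_a, code_b))
    return 256 - differing


def digest_near_duplicate(code_a, code_b, threshold=182):
    """Near-duplicate when more than `threshold` bits have the same value."""
    return matching_bits(code_a, code_b) > threshold


base = bytes(32)                          # all bits are 0
close = bytes([0xFF] * 9 + [0] * 23)      # differs in 72 bits
far = bytes([0xFF] * 10 + [0] * 22)       # differs in 80 bits
print(digest_near_duplicate(base, close), digest_near_duplicate(base, far))
```

Two codes may thus differ in up to 73 of their 256 bits and still be treated as near-duplicates, which gives some intuition for why a short code can mismatch hams when it is compared against a very large spam database.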
Concerning Cosdes, we extract more essential information to
represent each e-mail, and the newly devised e-mail abstraction
captures the near-duplicate phenomenon of spams more
effectively with an acceptable FP rate. Moreover, since we
represent each e-mail by its HTML tag sequence rather than its
content text, Cosdes intrinsically applies to both alphabetic
and nonalphabetic languages without modifying any components of
the system. This advantageous property is verified with our
data set, which consists of 15 percent English e-mails and
80 percent Chinese ones. In addition, to further investigate
the components of Cosdes, we evaluate the detection performance
when either the sequence preprocessing step or the
anchor-appending step of procedure SAG is removed. It can be
observed in Fig. 12b that the performance hardly degrades when
we exclude the sequence preprocessing step. This result
suggests that spammers have not yet tried to counter the
proposed abstraction scheme. Nevertheless, we keep this
elaborate design to guarantee the robustness of Cosdes. On the
other hand, as depicted in Section 3.1, the main objective of
appending <anchor> tags in front of the e-mail abstraction is
to reduce the probability that a ham is matched with reported
spams when the tag length of an e-mail abstraction is short.
The anchor-appending process is applied when the length of an
e-mail abstraction is smaller than 16. It can be seen in the
right side of Fig. 12b that when this process is removed, the
TP rate increases slightly but the FP rate becomes twice as
high as in the original setting. This is because there are
several hams that contain only a URL that normal users want to
share with their friends. If the anchor-appending process is
removed, these e-mails are misclassified as spams, and thereby
the FP rate deteriorates.
Subsequently, we explore the impact of the score threshold,
which determines how many matched spams are needed for Cosdes
to classify a testing e-mail as a spam. Fig. 12c shows the TP
rate and the FP rate of Cosdes with the score threshold S_th
varied from 1 to 5. It can be observed that as S_th increases
but remains small, the detection results of Cosdes are still
consistent. This means that no complex heuristic method is
required to determine the appropriate value of S_th. Moreover,
this threshold does not need to be adjusted according to the
size of the data set. Note that if the FP rate increases to an
unacceptable value, our system can simply respond by slightly
raising the value of S_th, since a higher threshold requires
more reputation-weighted reports before an e-mail is classified
as a spam. The property of simple threshold setting is also an
advantageous feature of Cosdes.
Fig. 12. The comparison of detection performance and the
performance of Cosdes with and without sequence preprocessing
in procedure SAG. (a) Average detection results. (b) Impact of
sequence preprocessing and anchor-appending. (c) Impact of
score threshold (S_th) in Cosdes.
Fig. 11. Performance of detection results.
5.3 Efficiency Analysis
In the succeeding experiments, we first examine the efficiency
of near-duplicate matching. Since each incoming e-mail has to
be matched against a huge spam database, the efficiency of
near-duplicate matching is crucial to a collaborative spam
detection system. To evaluate matching performance, we consider
the situation of matching with the number of e-mails varied
while there are identical e-mails in the database. As shown in
Fig. 13a, the execution time of Digest is minimal since only
the two 32-byte codes of each pair of e-mails have to be
compared. However, as shown in Fig. 12a, the detection results
of Digest are not satisfactory even though its matching process
is very fast. As for MultiDigest, it is defined in [24] that
the similarity measure between two e-mails is the maximum
number of equal bits over all pairs of digests. According to
this definition, processing all pairs of digests between two
e-mails requires n^2 comparisons, where n is the average number
of digests in an e-mail. Since n is close to 12 in our data
set, the matching time of MultiDigest is over 100 times that of
Digest. This indicates that the matching process of MultiDigest
is the least efficient among the four approaches. Regarding
Density, the sequences of the first 100 hash values, which are
longer than the representations of Digest and Cosdes, are
matched, and therefore Density takes more time than Digest and
Cosdes. Note that the authors in [30] propose a cache mechanism
to avoid matching each e-mail with the huge database. The cache
mechanism enables Density to markedly enhance the efficiency of
matching. However, it degrades the detection performance when
the spam database is not as huge as in [30]. To compare the
performance fairly, we ignore the cache mechanism of Density in
this experiment. On the other hand, Cosdes has to match a
longer sequence than Digest, and in essence Cosdes requires
more time for matching. Nevertheless, with the devised data
structure and the customized matching scheme, the execution
time is still kept within two to four times that of Digest.
Moreover, regarding the effectiveness of the proposed hash
function in the matching process, Fig. 13b shows that a speedup
of approximately three times is achieved. These results show
that with the design of a more sophisticated e-mail
abstraction, Cosdes gains better near-duplicate detection
results while keeping the matching time efficient.
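
The quadratic cost of the MultiDigest similarity measure can be made concrete with a small Python sketch. Digests are modeled here as 256-bit integers, and the names equal_bits and multidigest_similarity as well as the toy digest values are assumptions for illustration; the sketch only counts how many digest pairs are compared.

```python
def equal_bits(a, b, width=256):
    """Number of equal bit positions between two width-bit integers."""
    return width - bin(a ^ b).count("1")


def multidigest_similarity(digests_a, digests_b):
    """Maximum number of equal bits over all digest pairs, plus the pair count."""
    best, comparisons = 0, 0
    for a in digests_a:
        for b in digests_b:
            comparisons += 1
            best = max(best, equal_bits(a, b))
    return best, comparisons


digests_a = list(range(12))                  # 12 toy digests for one e-mail
digests_b = [x << 8 for x in range(12)]      # 12 toy digests for another
print(multidigest_similarity(digests_a, digests_b))
```

With 12 digests on each side, 144 pairwise comparisons are needed for a single pair of e-mails, compared with one comparison for Digest, which matches the factor of over 100 stated above.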
Subsequently, we investigate the efficiency of Cosdes in
inserting e-mail abstractions into the database and deleting
outdated spams from the database. Since the competing
approaches, Digest and Density, do not isolate the insertion
part from the system and do not take deletion into account, we
only study the performance of Insertion Handler and Deletion
Handler in Cosdes. Fig. 14a shows the execution time of
Insertion Handler of Cosdes with the number of e-mails varied.
The execution time grows linearly and is merely 3.5 seconds for
inserting 100,000 spams into the database. Moreover, as can be
seen in Fig. 10a, generating 100,000 e-mail abstractions costs
about 10 seconds. On the other hand, the performance of
Deletion Handler is shown in Fig. 14b. We evaluate the
execution time of deleting the spams of one day while the
number of e-mails in the database is varied. The main purpose
of this experiment is to examine whether the efficiency of
deletion is influenced by the number of e-mails stored in
SpTrees. The deletion process costs only 2 to 3 seconds in each
situation, and the execution time increases slightly with the
number of e-mails. Therefore, both the insertion and deletion
processes in Cosdes are efficient and incur very little
overhead.
To further evaluate the proposed e-mail abstraction
scheme, we consider the sequence preprocessing step and
the reordering step of procedure SAG. The primary
objective of the sequence preprocessing is to prevent
malicious tag insertion attack, and thus the robustness of
Cosdes can be enhanced. Fig. 15a shows that generating
e-mail abstractions with the sequence preprocessing step
leads to little increment of execution time. Although the
detection results in Fig. 12b can be inferred that spammers
still do not intend to obfuscate HTML content, this
protection process enables Cosdes to perform more
robustly in the future. On the other hand, the main purpose
of the reordering step is to differentiate e-mails with similar
tag sequences in the earlier stage of matching. As shown in
Fig. 15b, a 10 percent efficiency improvement is achieved.
The improvement is limited because we map each subsequence
in a node of an SpTree to a hash value. Therefore,
subsequences that have some prefix tags in common
can still be differentiated with one comparison. Nevertheless,
the reordering step can further accelerate
the matching process.
Fig. 13. The comparison of matching efficiency and the efficiency of
Cosdes with and without the hash function in the matching process.
(a) Execution time of matching of 4 approaches. (b) Impact of hash in
Cosdes.
Fig. 14. Performance of Insertion Handler and Deletion Handler in
Cosdes. (a) Execution time of insertion of Cosdes. (b) Execution time of
deletion of Cosdes.
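The remark on hash values can be illustrated with a minimal sketch. It assumes that a subsequence is a list of tag strings and uses SHA-1 as a stand-in hash; the hash function actually used in Cosdes is not specified here, and the names subsequence_key and tagwise_comparisons are hypothetical.

```python
import hashlib


def subsequence_key(tags):
    """Map a tag subsequence to a single hash value (SHA-1 is an assumption)."""
    joined = "\x00".join(tags).encode("utf-8")
    return hashlib.sha1(joined).hexdigest()


def tagwise_comparisons(a, b):
    """Count tag-by-tag comparisons made until the first mismatch (or the end)."""
    count = 0
    for x, y in zip(a, b):
        count += 1
        if x != y:
            break
    return count


# Two subsequences that share a three-tag prefix.
a = ["<table>", "<tr>", "<td>", "<img>"]
b = ["<table>", "<tr>", "<td>", "<a>"]

# Tag-by-tag matching has to walk through the common prefix first.
steps_without_hash = tagwise_comparisons(a, b)

# With one hash value per node, a single comparison separates the two.
differ_by_hash = subsequence_key(a) != subsequence_key(b)
```

Under these assumptions, the tag-by-tag check needs four comparisons to separate a and b, whereas the hash keys differ in a single comparison. This is why moving distinguishing tags forward by reordering brings only a limited additional gain once hashing is in place.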
5.4 Spammer Attack
To further verify the robustness of Cosdes, in this section,
we simulate the spammer attack of random HTML tag
insertion. We consider the situation that a spammer sends
i identical e-mails at a time, where i is varied from 1,000 to
100,000. It is assumed that a sequence of random HTML tags
is inserted into the beginning of each e-mail. The number of
tags in a sequence is a random number between 1 and 50 (see footnote 5).
Regarding the type of tag, we randomly choose them from
the HTML tag list (see footnote 6).
As shown in Fig. 16a, around 90 percent of e-mails
are matched with other e-mails, meaning that only
10 percent of spams have completely distinct HTML tag
sequences when random HTML tag insertion is applied. This is
because the sequence preprocessing step of procedure SAG
will delete nonempty tags that have no corresponding start
tags or end tags. Random HTML tag insertion cannot
generate legal tag sequences and thus most tags will be
eliminated. Concerning the efficiency analysis, it can be
observed in Fig. 16b that the sequence preprocessing step
incurs very little overhead. This also indicates that if a new
spammer attack occurs, Cosdes still has the capacity to react
against it by applying a more sophisticated countermeasure.
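The simulated attack and the effect of removing unmatched tags can be sketched as follows. This is an illustrative approximation and not the actual procedure SAG. It assumes a small hypothetical tag vocabulary instead of the full HTML tag list, represents an e-mail as a list of tag strings, and pairs start and end tags per tag name without checking nesting. The names random_tag_prefix and drop_unmatched are hypothetical.

```python
import random

# Assumption: a small illustrative vocabulary, not the full HTML tag list.
VOID_TAGS = {"br", "img", "hr"}  # "empty" tags that take no end tag
PAIRED_TAGS = ["div", "p", "b", "table", "tr", "td", "a", "font"]


def random_tag_prefix(rng, max_len=50):
    """Draw between 1 and max_len random tags, as in the simulated attack."""
    n = rng.randint(1, max_len)
    prefix = []
    for _ in range(n):
        name = rng.choice(PAIRED_TAGS + sorted(VOID_TAGS))
        if name in VOID_TAGS:
            prefix.append("<" + name + ">")
        else:
            # A nonempty tag is inserted as either a start tag or an end tag.
            prefix.append(rng.choice(["<", "</"]) + name + ">")
    return prefix


def drop_unmatched(tags):
    """Remove nonempty tags that lack a corresponding start or end tag."""
    keep = [True] * len(tags)
    pending = {}  # tag name -> indices of start tags still awaiting an end tag
    for i, tag in enumerate(tags):
        name = tag.strip("</>")
        if name in VOID_TAGS:
            continue  # empty tags are never paired, so they are kept
        if tag.startswith("</"):
            if pending.get(name):
                pending[name].pop()  # pair with the most recent start tag
            else:
                keep[i] = False  # end tag without a start tag
        else:
            pending.setdefault(name, []).append(i)
    for indices in pending.values():
        for i in indices:
            keep[i] = False  # start tag without an end tag
    return [t for t, k in zip(tags, keep) if k]
```

For example, with the original sequence ["<div>", "<p>", "<br>", "</p>", "</div>"] and the inserted prefix ["</td>", "<b>", "<hr>"], drop_unmatched removes "</td>" and "<b>" and keeps "<hr>" followed by the original sequence. In this sketch, empty tags and pairs that happen to match survive the filter, which is in line with the observation that most, but not all, randomly inserted tags are eliminated.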
6 CONCLUSION
In the field of collaborative spam filtering by near-duplicate
detection, a superior e-mail abstraction scheme is required to
capture the evolving nature of spams more reliably. Compared
to the existing methods in prior research, in this paper,
we explore a more sophisticated and robust e-mail abstraction
scheme, which considers e-mail layout structure to
represent e-mails. The specific procedure SAG is proposed to
generate the e-mail abstraction using HTML content in
e-mail, and this newly-devised abstraction can more
effectively capture the near-duplicate phenomenon of
spams. Moreover, a complete spam detection system Cosdes
has been designed to efficiently process the near-duplicate
matching and to progressively update the known spam
database. Consequently, the most up-to-date information
can be invariably kept to block subsequent near-duplicate
spams. In the experimental results, we show that Cosdes
significantly outperforms competitive approaches, which
indicates the feasibility of Cosdes in real-world applications.
ACKNOWLEDGMENTS
This work was supported in part by the National Science
Council of Taiwan under Contract NSC97-2221-E-002-172-MY3.
REFERENCES
[1] E. Blanzieri and A. Bryl, "Evaluation of the Highest Probability SVM Nearest Neighbor Classifier with Variable Relative Error Cost," Proc. Fourth Conf. Email and Anti-Spam (CEAS), 2007.
[2] M.-T. Chang, W.-T. Yih, and C. Meek, "Partitioned Logistic Regression for Spam Filtering," Proc. 14th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 97-105, 2008.
[3] S. Chhabra, W.S. Yerazunis, and C. Siefkes, "Spam Filtering Using a Markov Random Field Model with Variable Weighting Schemas," Proc. Fourth IEEE Int'l Conf. Data Mining (ICDM), pp. 347-350, 2004.
[4] P.-A. Chirita, J. Diederich, and W. Nejdl, "Mailrank: Using Ranking for Spam Detection," Proc. 14th ACM Int'l Conf. Information and Knowledge Management (CIKM), pp. 373-380, 2005.
[5] R. Clayton, "Email Traffic: A Quantitative Snapshot," Proc. Fourth Conf. Email and Anti-Spam (CEAS), 2007.
[6] A.C. Cosoi, "A False Positive Safe Neural Network; The Followers of the Anatrim Waves," Proc. MIT Spam Conf., 2008.
[7] E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P. Samarati, "An Open Digest-Based Technique for Spam Detection," Proc. Int'l Workshop Security in Parallel and Distributed Systems, pp. 559-564, 2004.
[8] E. Damiani, S.D.C. di Vimercati, S. Paraboschi, and P. Samarati, "P2P-Based Collaborative Spam Detection and Filtering," Proc. Fourth IEEE Int'l Conf. Peer-to-Peer Computing, pp. 176-183, 2004.
[9] P. Desikan and J. Srivastava, "Analyzing Network Traffic to Detect E-Mail Spamming Machines," Proc. ICDM Workshop Privacy and Security Aspects of Data Mining, pp. 67-76, 2004.
[10] H. Drucker, D. Wu, and V.N. Vapnik, "Support Vector Machines for Spam Categorization," Proc. IEEE Trans. Neural Networks, pp. 1048-1054, 1999.
[11] D. Evett, "Spam Statistics," http://spam-filter-review.toptenreviews.com/spam-statistics.html, 2006.
[12] A. Gray and M. Haahr, "Personalised, Collaborative Spam Filtering," Proc. First Conf. Email and Anti-Spam (CEAS), 2004.
[13] S. Hershkop and S.J. Stolfo, "Combining Email Models for False Positive Reduction," Proc. 11th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 98-107, 2005.
[14] J. Hovold, "Naive Bayes Spam Filtering Using Word-Position-Based Attributes," Proc. Second Conf. Email and Anti-Spam (CEAS), 2005.
Fig. 15. Efficiency analysis on the sequence preprocessing and the
reordering steps of Cosdes. (a) Impact of sequence preprocessing in
Cosdes. (b) Impact of reordering in Cosdes.
Fig. 16. Effectiveness and efficiency analysis of the sequence
preprocessing step under the spammer attack of random HTML tag
insertion. (a) Probability of matching. (b) Execution time of the sequence
preprocessing step.
5. It is calculated that the average length of the HTML tag sequence
produced by procedure SAG of Cosdes is approximately 35.
6. http://www.w3schools.com/tags/default.asp.
[15] A. Kolcz and J. Alspector, "SVM-Based Filtering of Email Spam with Content-Specific Misclassification Costs," Proc. ICDM Workshop Text Mining, 2001.
[16] A. Kolcz, A. Chowdhury, and J. Alspector, "The Impact of Feature Selection on Signature-Driven Spam Detection," Proc. First Conf. Email and Anti-Spam (CEAS), 2004.
[17] J.S. Kong, P.O. Boykin, B.A. Rezaei, N. Sarshar, and V.P. Roychowdhury, "Scalable and Reliable Collaborative Spam Filters: Harnessing the Global Social Email Networks," Proc. Second Conf. Email and Anti-Spam (CEAS), 2005.
[18] T.R. Lynam and G.V. Cormack, "On-Line Spam Filter Fusion," Proc. 29th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 123-130, 2006.
[19] B. Mehta, S. Nangia, M. Gupta, and W. Nejdl, "Detecting Image Spam Using Visual Features and Near Duplicate Detection," Proc. 17th Int'l Conf. World Wide Web (WWW), pp. 497-506, 2008.
[20] V. Metsis, I. Androutsopoulos, and G. Paliouras, "Spam Filtering with Naive Bayes - Which Naive Bayes?" Proc. Third Conf. Email and Anti-Spam (CEAS), 2006.
[21] M.S. Pera and Y.-K. Ng, "Using Word Similarity to Eradicate Junk Emails," Proc. 16th ACM Int'l Conf. Information and Knowledge Management (CIKM), pp. 943-946, 2007.
[22] I. Rigoutsos and T. Huynh, "Chung-Kwei: A Pattern-Discovery-Based System for the Automatic Identification of Unsolicited E-Mail Messages (SPAM)," Proc. First Conf. Email and Anti-Spam (CEAS), 2004.
[23] S. Sarafijanovic and J.-Y.L. Boudec, "Artificial Immune System for Collaborative Spam Filtering," Proc. Second Workshop Nature Inspired Cooperative Strategies for Optimization (NICSO), 2007.
[24] S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, "Improving Digest-Based Collaborative Spam Detection," Proc. MIT Spam Conf., 2008.
[25] S. Sarafijanovic, S. Perez, and J.-Y.L. Boudec, "Resolving FP-TP Conflict in Digest-Based Collaborative Spam Detection by Use of Negative Selection Algorithm," Proc. Fifth Conf. Email and Anti-Spam (CEAS), 2008.
[26] K.M. Schneider, "Brightmail URL Filtering," Proc. MIT Spam Conf., 2004.
[27] D. Sculley and G.M. Wachman, "Relaxed Online SVMs for Spam Filtering," Proc. 30th Ann. Int'l ACM SIGIR Conf. Research and Development in Information Retrieval (SIGIR), pp. 415-422, 2007.
[28] C.-Y. Tseng, J.-W. Huang, and M.-S. Chen, "Promail: Using Progressive Email Social Network for Spam Detection," Proc. 10th Pacific-Asia Conf. Knowledge Discovery and Data Mining (PAKDD), pp. 833-840, 2007.
[29] Z. Wang, W. Josephson, Q. Lv, and K.L.M. Charikar, "Filtering Image Spam with Near-Duplicate Detection," Proc. Fourth Conf. Email and Anti-Spam (CEAS), 2007.
[30] K. Yoshida, F. Adachi, T. Washio, H. Motoda, T. Homma, A. Nakashima, H. Fujikawa, and K. Yamazaki, "Density-Based Spam Detector," Proc. 10th ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD), pp. 486-493, 2004.
[31] F. Zhou, L. Zhuang, B.Y. Zhao, L. Huang, A.D. Joseph, and J.D. Kubiatowicz, "Approximate Object Location and Spam Filtering on Peer-to-Peer Systems," Proc. ACM/IFIP/USENIX Int'l Middleware Conf., pp. 1-20, 2003.
Chi-Yao Tseng received the BS degree in
electrical engineering from the National Taiwan
University, Taipei, in 2004, and is currently a PhD
candidate in the Graduate Institute of Electrical
Engineering of the National Taiwan University.
His research interests include sequential pattern
mining, e-mail spam detection, social network
mining, and multimedia data mining.
Pin-Chieh Sung received the BS degree in
computer science from the National Tsing Hua
University, Hsinchu, in 2006 and the MS degree
in the network database laboratory (led by
Professor Ming-Syan Chen), Graduate Institute
of Electrical Engineering, National Taiwan Uni-
versity, Taipei, in 2008. Currently, he is a
software engineer in Cazoodle Inc. His research
interests include web mining, recommendation
system, and multimedia data mining.
Ming-Syan Chen received the BS degree in
electrical engineering from the National Taiwan
University, Taipei, and the MS and PhD degrees
in computer, information and control engineering
from the University of Michigan, Ann Arbor, in
1985 and 1988, respectively. He is now a
distinguished research fellow and the director
of Research Center for Information Technology
Innovation (CITI) in the Academia Sinica, Tai-
wan, and is also a distinguished professor jointly
appointed by EE Department, CSIE Department, and Graduate Institute
of Communication Engineering (GICE) at the National Taiwan Uni-
versity. He was a research staff member at IBM Thomas J. Watson
Research Center, Yorktown Heights, New York, from 1988 to 1996, the
director of GICE from 2003 to 2006, and also the president/CEO of the
Institute for Information Industry (III), which is one of the largest
organizations for information technology in Taiwan, from 2007 to 2008.
His research interests include databases, data mining, mobile comput-
ing systems, and multimedia networking, and he has published more
than 280 papers in his research areas. In addition to serving as program
chairs/vice chairs, and keynote/tutorial speakers in many international
conferences, he was an associate editor of IEEE TKDE, VLDB Journal,
KAIS, and also JISE, is currently the editor-in-chief of the International
Journal of Electrical Engineering (IJEE), and was a distinguished visitor of
the IEEE Computer Society for Asia-Pacific from 1998 to 2000, and also
from 2005 to 2007. He is now also serving as the chief executive officer
of Networked Communication Program, which is a national program
coordinating several primary activities in information and communication
technologies in Taiwan. He holds, or has applied for, 18 US patents and
seven ROC patents in his research areas. He is a recipient of the
Academic Award of the Ministry of Education, the National Science
Council (NSC) Distinguished Research Award, Pan Wen Yuan
Distinguished Research Award, Teco Award, Honorary Medal of
Information, and K.-T. Li Research Breakthrough Award for his research
work, and also the Outstanding Innovation Award from IBM Corporate
for his contribution to a major database product. He also received
numerous awards for his research, teaching, inventions, and patent
applications. He is a fellow of the ACM and the IEEE.