
computers & security 29 (2010) 446–459


Hybrid spam filtering for mobile communication

Ji Won Yoon a, Hyoungshick Kim b,*, Jun Ho Huh c


a Statistics Department, Trinity College Dublin, Ireland
b Computer Laboratory, University of Cambridge, Cambridge, UK
c Computing Laboratory, University of Oxford, UK

article info

Article history:
Received 22 August 2009
Received in revised form 17 October 2009
Accepted 6 November 2009

Keywords:
Spam SMS messages
Hybrid
Content-based filtering
Challenge-response
Threshold sensitivity problem

abstract

Spam messages are an increasing threat to mobile communication. Several mitigation techniques have been proposed, including white and black listing, challenge-response and content-based filtering. However, none is perfect and it makes sense to use a combination rather than just one. We propose an anti-spam framework based on a hybrid of content-based filtering and challenge-response: a message that has been classified as uncertain through content-based filtering is checked further by sending a challenge to the message sender. An automated spam generator is unlikely to send back a correct response, in which case the message is classified as spam. Our simulation results show the trade-off between the accuracy of anti-spam classifiers and the incurred traffic overhead, and demonstrate that our hybrid framework is capable of achieving high accuracy regardless of the content-based filtering algorithm being used.

© 2009 Elsevier Ltd. All rights reserved.

1. Introduction

Short Message Service (SMS) and Multimedia Messaging Service (MMS) are a popular means of mobile communication. Texting costs have decreased continuously over the years (to the extent of free texting) whereas the bandwidth for communication has increased dramatically. Such trends have attracted a large number of phishing and spamming attacks using SMS messages. In particular, spam containing pornographic or promotional material is an emerging phenomenon and has caused a significant level of inconvenience for users. Such spam is now prevalent in Korea, Japan and China, and prone to spread across countries where mobile communication is popular. Statistics for 2008 (He et al., 2008) show that a user in China, on average, receives 8.29 spam SMS messages per week.

Much of the existing research into anti-spam solutions, however, has focused on spam emails. Some of the popular methods include white and black listing, digital signatures, postage control, address management, and collaborative and content-based filtering (Healy et al., 2005; Metsis et al., 2006; Bratko et al., 2006; Cormack et al., 2007; Dwork et al., 2003; Hall, 1998; Golbeck and Hendler, 2004; Androutsopoulos et al., 2000). The different characteristics of emails and SMS messages make it hard to apply such approaches directly in mobile networks and to analyze the results (Deng and Peng, 2006). For example, the extra traffic required to perform challenge-response needs to be minimized (or compensated for), as bandwidth is more expensive in mobile networks. Also, applying content-based filtering methods to SMS messages is a challenging task since a mobile text message, often containing only a short text and a phone number, is relatively short and has fewer structured fields compared to an email. With emails, additional fields like attachments, links, and images are commonly used for detecting spam. However, these are not available in SMS messages to construct filtering rules that are as effective as those used for emails. Due to the various drawbacks associated with challenge-response and content-based

* Corresponding author.
E-mail addresses: [email protected] (J.W. Yoon), [email protected] (H. Kim), [email protected] (J.H. Huh).
0167-4048/$ – see front matter © 2009 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cose.2009.11.003

filtering, it would make more sense to use a combination rather than just relying on one.

In this paper, we propose a spam filtering framework based on a combination of these two methods and demonstrate that our combined approach can be more effective and efficient in handling spam SMS messages. Using the content-based filtering approach, obvious spam is filtered first to reduce the number of messages subject to challenge-response; the challenge-response protocol then classifies machine-generated spam with high accuracy. By combining the content filtering algorithm with the challenge-response scheme, we show that, ultimately, high accuracy and low message traffic can be achieved simultaneously. We also describe four challenge-response protocols based on the 'Completely Automated Public Turing test to tell Computers and Humans Apart' (CAPTCHA). Even though many researchers have discussed CAPTCHA-based challenge-response protocols (Roman et al., 2006; Shirali-Shahreza and Movaghar, 2008; He et al., 2008), their protocols do not consider the cryptographic details. We extend these protocols for formal verification under a security threat model. Moreover, our simulation results (see Section 4) show that this hybrid approach is capable of controlling high-volume spam and traffic usage.

The remainder of the paper is organized as follows. Section 2 discusses related work. Section 3 describes the hybrid filtering framework. Section 4 evaluates the performance of the proposed framework based on two measures: traffic usage and accuracy. Finally, Section 5 discusses the contribution of this paper and outlines the remaining work.

2. Related work

Content-based filtering solutions have proved to be effective against spam emails (Androutsopoulos et al., 2000; Metsis et al., 2006; Bratko et al., 2006), which are typically larger in size compared to SMS messages. Abbreviations and acronyms are used more frequently in SMS messages and they increase the level of ambiguity. This makes it difficult to adopt traditional spam filters without modification. Healy et al. (2005) discuss the problems of performing spam classification on short messages by comparing the performance of the well-known K-Nearest-Neighbor (KNN), Support Vector Machine (SVM), and Naive Bayes classifiers. They conclude that, for short messages, the SVM and Naive Bayes classifiers substantially outperform the KNN classifier; this contrasts with their previous results obtained for longer emails. Hidalgo et al. (2006) also carried out content filtering experiments with English and Spanish spam SMS corpora to show that Bayesian filtering methods are still effective against spam SMS messages. Deng and Peng (2006) designed a distributed, content-based filtering method that considers other SMS message characteristics such as length, which is usually greater for spam than for ham (normal messages).

One of the drawbacks of existing solutions, however, is that they often look for topical terms or phrases such as 'free' or 'viagra' to identify spam messages. In consequence, some legitimate SMS messages that contain such blacklisted words can be mistakenly classified as spam. This could happen more frequently with SMS messages than with emails due to their smaller size and simpler content. Moreover, such adaptive schemes are fundamentally weak against innovative attacks where strategies constantly evolve to manipulate classification rules. Filtering alone will not be sufficient to detect spam.

Many anti-spam solutions based on a challenge-response protocol have been suggested (He et al., 2008; Shirali-Shahreza and Movaghar, 2008). A message sender needs to prove that they are a human by answering a challenge message (e.g. through a web interface) before their message is forwarded to the recipient. Senders authenticate themselves as human users by answering a simple Turing test which a machine cannot easily solve. The protocol, however, has often been criticized for the extra user interaction and traffic it requires. There might also be a significant overhead in storing and managing challenge messages. Roman et al. (2006) introduced a pre-challenge method to overcome these problems. Their method assumes that each user has a challenge associated with their email address. Hence, the email sender can instantly access the recipient's challenge and send the response together with the email. Their security model is undermined, however, when the response is exposed to an adversary.

He et al. (2008) proposed a framework which combines white/black listing and challenge-response methods. However, their work does not consider in detail the necessary security, performance, and cost implications of using such a protocol.

3. A hybrid framework

This section describes our hybrid approach. SMS messages are first classified into three different regions using the content-based filtering method: ham, uncertain and spam. Considering that the filtering method is not suitable for dealing with uncertain messages, the challenge-response method is then used to further classify the uncertain messages into the ham and spam regions. In practice, the majority of spam messages are generated by machines. Therefore, a human verification mechanism, in the form of challenge-response, is used to detect whether an uncertain message falls into the ham or the spam region.

Fig. 1 shows a high level overview of the three major stakeholders: the message sender, the message center, and the recipient. The message center (owned by the mobile operator) sends a challenge query to check whether the sender is a human or a machine. The sender responds by answering the query and the message center compares the returned value against the known correct value. If the values match, the message is classified as ham; otherwise, it is classified as spam. We are interested in this further classification of the uncertain region.

Fig. 1 – Hybrid spam filtering overview.

We would suggest that the message center should be given the full responsibility of running our framework for the following reasons:

- to reduce the traffic usage by filtering spam messages at the earliest possible stage; that is, before forwarding them to the recipient.
- by using the challenge-response protocol, the message center will be able to collect a large amount of sample data in real time; these data can be used to develop highly effective classifiers and continuously improve the performance of the filtering algorithms.
- it would be difficult to install and maintain homogeneous anti-spam software on all mobile devices; instead, we rely on one solution deployed in the message center.

In practice, however, it is possible that the operator of the message center would allow certain companies to send spam messages to its users for a payment. Our work assumes that the operator always works in the best interest of the user and will only allow such messages to go through if the user has agreed to receive messages from these companies. We imagine that the message center holds the user's white list of 'interesting companies' and only forwards messages from the listed companies.

3.1. Introducing the uncertain region

If we assume there are only two regions, ham and spam, the content-based filter will use binary classification. Suppose that we have a probabilistic model for the anti-spam classifier as a posterior distribution Pr(c = ham | y). This is the probability that a message falls into the ham region: c and y denote realizations of the random variables for a class and a message, respectively. The odds ratio of the posterior is used to obtain a measurable classification: O_post = Pr(c = ham | y) / Pr(c = spam | y). If O_post > 1, a message is classified as ham; otherwise, as spam. Alternatively, we can simply use a threshold-based approach on the posterior distribution. If Pr(c = ham | y) is close to one, a message is likely to be ham; if close to zero, it is likely to be spam. Let c = f(y, h) be the content-based filter, where c and h are the output and a given threshold, respectively. This filter works with the following rule:

    c = f(y, h) = { ham   if Pr(c = ham | y) ≥ h
                  { spam  if Pr(c = ham | y) < h        (1)

This separates ham from spam (the odds ratio approach is a special case where h = 0.5). The main problem with this approach is finding a proper threshold: because the ground-truth threshold h~ is unknown, there are two possible cases, as shown in Fig. 2(a). If h is higher than h~, some of the ham in region A could be classified as spam. If h is lower than h~, some of the spam in region B could bypass the content-based filter and reach the recipients. Such a threshold problem will always be present in classification: it is almost impossible to find the underlying h~, and anti-spam software companies are likely to use strategies based on their own experience. In order to minimise ham being classified as spam and not reaching the recipient in mobile networks, binary methods tend to be configured to be less sensitive in detecting spam: they would rather mistakenly forward spam than prevent any legitimate message from reaching the recipient. We believe that these problems can be resolved by introducing an uncertain region with two thresholds (see Fig. 2(b)), which can be implemented as the upper and lower boundaries of a traditional threshold system. As a result, we now have three labels, spam, uncertain and ham, and the focus is on the uncertain area. The spam and ham regions are classified as in the traditional system. Only the messages that fall into the uncertain area are checked further using the challenge-response protocol. The next section describes our proposed protocols in detail.

3.2. Challenge-response protocols

First, we assume that there is a Turing test available with a low probability of producing false positives and false negatives. CAPTCHA is a commonly used one: it generates pattern matching problems which a human can easily recognize and solve, whereas a machine cannot. An automated program that generates thousands of spam messages will not be capable of answering a CAPTCHA-based challenge, which could be, for example, a graphical image containing a faint typeface. If the response is correct, there is a high probability that the sender is a human. CAPTCHA can be designed in different media forms such as an image, an audio file or a text (von Ahn et al., 2008). Their implementation details, however, are beyond the scope of this paper.
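Putting the two-threshold rule of Section 3.1 together with the challenge step, the overall decision flow can be sketched in Python. This is a hypothetical sketch: the function name and the callback interface are ours, and the threshold values used below are purely illustrative.

```python
def classify(p_ham, h1, h2, passes_challenge=None):
    """Hybrid decision over the three regions of Fig. 2(b).

    p_ham: posterior Pr(c = ham | y) from the content-based filter.
    h1, h2: lower and upper thresholds (h1 < h2) bounding the uncertain region.
    passes_challenge: callable returning True iff the sender answers the
    CAPTCHA challenge correctly; only consulted for uncertain messages.
    """
    if p_ham >= h2:
        return "ham"       # confidently ham: forwarded directly to the recipient
    if p_ham < h1:
        return "spam"      # confidently spam: deleted at the message center
    # uncertain region: defer to the challenge-response protocol
    return "ham" if passes_challenge() else "spam"
```

A machine-generated sender cannot supply a correct response, so its callback effectively returns False and the uncertain message is classified as spam.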
A number of challenge-response protocols have already been proposed (He et al., 2008; Shirali-Shahreza and Movaghar, 2008), although these focus only on implementation issues without considering the security model and cryptographic details. We define our own security models and describe a number of possible protocols in line with them. There are several issues we need to consider before designing the protocols:

- when we are dealing with spam, message authentication and integrity are important, whereas confidentiality is not (the adversary's goal is to deliver spam messages to the recipient);
- SMS messages are usually unencrypted and unsigned; hence, it is possible to tamper with them during transmission;
- the security properties of the communication channel between the message center and the sender need to be defined; this channel might or might not be an authenticated one;
- managing session information between all trusted pairs for challenge-response would impose a huge storage overhead on the message center; there might be more than one message center sharing this information, and it might or might not be stored in the center.

Mindful of these security and scalability issues, we propose four different protocols: protocols 3 and 4 assume an authenticated channel, whereas protocols 1 and 2 do not; protocols 1 and 3 assume that the message center manages the session information, whereas the others do not.

Fig. 2 – (a) Two possible cases: h > h~ (case 1) and h < h~ (case 2) for a given ground truth h~ (red dotted line) and (b) modified classification embedding an uncertain area given a ground truth h~ (red dotted line).

3.2.1. Protocol notations
Standard engineering notations (Burrows et al., 1989) are used to describe the protocols. In a protocol between A and B, "A → B : X" means that A sends message X to B. The symbols S and R represent the Sender and Recipient, respectively; M represents the Message center, T a Timestamp, N a Nonce, K a Key and K^-1 its inverse. In a symmetric cryptosystem such as AES, K and K^-1 are always equal. A plain SMS message P encrypted with K is represented as {P}_K. H is a one-way hash function. The subscript m in K_m indicates that K_m is M's public key; similarly, ms in K_ms indicates that K_ms is intended for communication between M and S.

The sender's ability to send a correct response depends on their competence to interpret the key K_c^-1. An unauthorized sender (e.g. a program sending spam) will not be able to interpret and figure out K_c^-1; this key serves to identify machine-generated spam. For simplicity, encryption algorithms are not considered in the protocols.

3.2.2. Protocols
In protocol 1, the message center (M) maintains the session information.

[Protocol 1]
(M1) S → M : S, R, P
(M2) M → S : M, S, {K_ms}_{K_c}, {H(S, R, P), N}_{K_ms}
(M3) S → M : S, M, {H(S, R, P), N + 1}_{K_ms}

Before sending message 1, S stores R and P to prevent message modification attacks. After receiving message 1, M generates K_ms and stores (S, R, P, K_ms, N) as the session information. K_ms is protected with K_c; an image CAPTCHA would be one way of protecting K_ms against spam programs. After receiving message 2, S decrypts {K_ms}_{K_c} by answering the challenge (demonstrating the ability to interpret K_c^-1). S then decrypts H(S, R, P) and N using K_ms, and compares H(S, R, P) against the previously stored values. S terminates the protocol if these values do not match; otherwise, S generates {H(S, R, P), N + 1}_{K_ms} with K_ms and sends it to M. After receiving message 3, M verifies {H(S, R, P), N + 1}_{K_ms}. If it is valid, M forwards the stored message (S, R, P) to R. Finally, M deletes the session information. The proof of this protocol is presented in Appendix A.

The users could become frustrated, however, if they receive too many challenge messages. We use a timestamp (T) to solve this problem. After receiving message 3, M maintains the session information (S, R, P, K_ms, T) between S and R for a given time interval. M checks the validity of K_ms using the session information and a policy that defines the lifetime of K_ms. This is also effective for detecting and avoiding replay attacks.
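The Protocol 1 exchange above can be sketched as follows. This is a hypothetical illustration only: it substitutes an HMAC over the hashed fields for the paper's symmetric encryption with K_ms, omits the CAPTCHA encoding that protects K_ms from machines, and all function names are ours.

```python
import hashlib
import hmac
import secrets

def h(*fields):
    """One-way hash H over the concatenated message fields."""
    return hashlib.sha256("|".join(fields).encode()).hexdigest()

def center_on_m1(s, r, p):
    """(M1) S -> M : S, R, P.  M creates session state and the (M2) challenge."""
    k_ms = secrets.token_bytes(16)      # fresh session key K_ms
    n = secrets.randbelow(2**32)        # nonce N
    session = {"S": s, "R": r, "P": p, "K_ms": k_ms, "N": n}
    # (M2) carries {H(S, R, P), N} bound to K_ms (here: an HMAC tag)
    challenge = hmac.new(k_ms, f"{h(s, r, p)}|{n}".encode(),
                         hashlib.sha256).hexdigest()
    return session, challenge

def sender_response(k_ms, s, r, p, n):
    """(M3) S -> M : proof of knowledge of K_ms and N + 1."""
    return hmac.new(k_ms, f"{h(s, r, p)}|{n + 1}".encode(),
                    hashlib.sha256).hexdigest()

def center_verify(session, response):
    """M recomputes the expected (M3) value; on success it forwards (S, R, P)."""
    expected = sender_response(session["K_ms"], session["S"],
                               session["R"], session["P"], session["N"])
    return hmac.compare_digest(expected, response)
```

A human sender recovers K_ms by solving the CAPTCHA and can therefore produce the (M3) value; a spam bot cannot, so verification fails and the message is classified as spam.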
The main drawback of this protocol is that M has to bear the huge overhead of maintaining the session information. We describe another protocol which solves this issue by using authorized tokens instead:

[Protocol 2]
(M1) S → M : S, R, P
(M2) M → S : M, S, {K_ms}_{K_c}, {H(S, R, P)}_{K_ms}, {K_ms, H(S, R), T}_{K_m^-1}
(M3) S → M : S, R, {P}_{K_ms}, {K_ms, H(S, R), T}_{K_m^-1}

The main difference is the use of {K_ms, H(S, R), T}_{K_m^-1} (which can only be generated by M) as the authorization token for verifying a response. M checks whether S is authorized by looking at {K_ms, H(S, R), T}_{K_m^-1}. Using this token, S can send message 3 alone, including a new text (P'), within the lifetime of T:

(M1) S → M : S, R, {P'}_{K_ms}, {K_ms, H(S, R), T}_{K_m^-1}

In these protocols, however, S cannot verify the authenticity of the challenge message. Before describing the next two protocols, which aim to solve this problem, we make the assumption that there is an authenticated channel between M and S, and that M's public key (K_m) is securely installed on a mobile device owned by S (perhaps during manufacturing). We describe the following protocols based on this assumption:

[Protocol 3]
(M1) S → M : S, R, P
(M2) M → S : M, S, {{K_ms}_{K_c}, N}_{K_m^-1}
(M3) S → M : S, R, {N + 1}_{K_ms}

In protocol 3, M maintains the session information (S, R, P, K_ms, N). When message 2 arrives, S verifies the signature on {{K_ms}_{K_c}, N}_{K_m^-1}. S does not respond if the signature is invalid.

[Protocol 4]
(M1) S → M : S, R, P
(M2) M → S : M, S, {{K_ms}_{K_c}, H(S, R), T}_{K_m^-1}, {P}_{K_m^-1}
(M3) S → M : S, R, {P}_{K_ms}, {{K_ms}_{K_c}, H(S, R), T}_{K_m^-1}

Protocol 4 uses {{K_ms}_{K_c}, H(S, R), T}_{K_m^-1} as the authorized token. Our protocols are likely to be compatible with existing devices since the majority already have built-in encryption and hash functions.

3.3. Observations

3.3.1. Upgrading protocols
A message is always sent to the message center of the contracted operator first. If the message is directed at someone contracted to a different operator, it is forwarded to another message center before reaching the recipient's handset (Enck et al., 2005). This means that if one of the message centers decides not to use our framework, all uncertain texts delivered via that center would bypass the content-based filter; it would be the weakest point (and the only route needed) for an attack. Hence, all existing message centers would have to support the new protocol. While this is a large and challenging change, operator-sponsored forums like OMTP (Open Mobile Terminal Platform) are working with key mobile operators to unify and recommend mobile terminal requirements (Rogers, 2007). With the increasing number of spam texts, it seems likely that the ability to filter machine-generated uncertain texts will persuade operators to upgrade their systems.

3.3.2. Performance degradation
If there are too many messages subject to challenge-response, its overhead will dominate. For example, sending an image CAPTCHA is a huge overhead for authenticating a 100-character SMS message. Future work may look at adding a 'bypass' to the hybrid, so that a message originating from a verified sender can be automatically treated as ham without having to go through the spam filtering process.

For instance, the message center could manage the recipient's white list of acceptable phone numbers, typically through synchronization with the recipient's contact list. Since the message center has secure access to the message sender's details (including the phone number), it can first check whether the sender's phone number is included in the recipient's white list. If it is a listed number, the message can be treated as ham and forwarded to the recipient; if not, the message goes through the spam filtering process. As the uncertain region becomes smaller, we expect the performance of our framework to improve.

3.3.3. Usability issues
Adopting CAPTCHA methods will have implications for usability. A mobile device might not have the capability to display an image CAPTCHA to a readable standard; also, a mobile user might find it difficult to verify an audio CAPTCHA due to background noise. Hence, it is important to set up user-friendly CAPTCHA methods.

Different approaches for generating user-friendly CAPTCHA messages have been discussed by various researchers (Leveraging the CAPTCHA Problem, 2005; Yan and El Ahmad, 2008). Chow et al. (2008) proposed a new CAPTCHA technique that minimizes the level of user frustration and facilitates the use of CAPTCHA on mobile devices. Their technique is well suited for keyboard-less mobile devices.

4. Evaluation

Fig. 3 shows a basic SMS deployment architecture and its wireless network components (Prieto et al., 2004): the Home Location Register (HLR), Mobile Switching Center (MSC), SMS Gateway (SMSG), and SMS Center (SMSC). These are interconnected as shown in Fig. 3.

4.1. Description of datasets

In order to measure the performance of our framework, we generated synthetic datasets. Suppose that there are N sent messages (we set N = 5000). We use p and q to denote the ham and spam proportions, where p + q = 1 and p and q are non-negative numbers (in reality, different operators will have different proportions).

Fig. 3 – SMS deployment architecture.

Let k be a random variable generated from an existing filtering method, given an observed message y: k = Pr(c = ham | y). For an artificial dataset, we build a mixture model given by

    p(k | λ) = p(k | c = ham, λ) p(c = ham | λ) + p(k | c = spam, λ) p(c = spam | λ)     (2)

where λ denotes a set of hyper-parameters which control the parameters. Since c can only be 0 (spam) or 1 (ham), we assume c_i is generated from a Bernoulli distribution with hyper-parameters p and q. Thus, we have:

    c ~ p(c | λ) = Bernoulli(c; p) = p^c (1 - p)^(1-c) = p^c q^(1-c)

After classifying the i-th sample message, we generate the expected probability (this is the filtering output):

    k ~ p(k | c, λ) = { p(k | c = ham, λ)  = B(k; a_1, b_1)
                      { p(k | c = spam, λ) = B(k; a_0, b_0)

The Beta distribution (B) is used here: k denotes the probability of ham obtained from the existing filtering method, so the random variable lies between 0 and 1, and a beta distribution can continuously generate samples in that range. Both thresholds (h1 and h2) vary between 0 and 1 in steps of 1/30. In practice, the hyper-parameters a_0, b_0, a_1 and b_1 are obtained from the means and variances of the spam and ham scores respectively; this is given by:

    a_i = m_i^2 (1 - m_i) / s_i^2 - m_i
    b_i = a_i (1 / m_i - 1)     (3)

where m_i and s_i are the mean and standard deviation of the costs/likelihoods of spam (i = 0) and ham (i = 1).

For the example in this paper, we use m_0 = 0.3750, s_0 = 0.1614, m_1 = 0.7143 and s_1 = 0.1597; the hyper-parameters of the beta distributions are then a_0 = 3, b_0 = 5, a_1 = 5, b_1 = 2. Also, we built an artificial dataset based on a Spanish database (Hidalgo et al., 2006) which gives the proportion of spam as 14.57% and ham as 85.32%; that is, q = 0.1457 and p = 0.8532. Fig. 4 shows the distribution of the generated data: in graph (a), the red crosses represent ham and the blue circles represent spam; the same colouring scheme is used in graph (b). These graphs show a large amount of overlapping labels between 0.2 and 0.8; this overlapping section is considered the uncertain region. Since the challenge-response protocol is not perfect, some spam will bypass the protocol with correct responses, and some ham will be filtered mistakenly with incorrect responses. To model this imperfection, we use e1 and e2 to represent the ratios of False Positives (FP) and False Negatives (FN) in the uncertain region.

In addition, Sections 4.4 and 4.5 use a wide range of randomly generated parameters to demonstrate how performance is affected in various environments: Section 4.4 studies performance with varying ham and spam proportions; Section 4.5 uses fixed proportions of ham and spam (q = 0.1457 and p = 0.8532) and varies the other parameters to study how performance changes.

4.2. Traffic usage comparison

We simulated the traffic usage using the variable thresholds and analyzed the results. Our framework considers several stakeholders (see Fig. 3): the message Sender (S), the message Receiver (R), the message center (either MSC or SMSG), and other network components (SMSC, HLR).

First, we calculated the traffic used by an existing filtering method. In practice, the size of each message in the protocol will be different; for instance, the challenge message will be larger than other text messages since it would include a CAPTCHA image. For simplicity, however, we assume that all messages have the same size.
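The synthetic data generation of Section 4.1 can be sketched as follows (a hypothetical sketch using the example values above: p = 0.8532, Beta(5, 2) scores for ham and Beta(3, 5) scores for spam; the function name is ours).

```python
import random

def sample_dataset(n=5000, p_ham=0.8532,
                   ham_ab=(5, 2), spam_ab=(3, 5), seed=1):
    """Draw n (label, k) pairs: the label is Bernoulli(p_ham); the filter
    score k = Pr(c = ham | y) is Beta(a1, b1) for ham, Beta(a0, b0) for spam."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        if rng.random() < p_ham:
            data.append(("ham", rng.betavariate(*ham_ab)))    # mean 5/7 ~ 0.714
        else:
            data.append(("spam", rng.betavariate(*spam_ab)))  # mean 3/8 = 0.375
    return data
```

Scores drawn from the two components overlap roughly between 0.2 and 0.8, which is what produces the uncertain region visible in Fig. 4.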

Only the messages with filtering probabilities higher than the threshold h reach R via the network components; the other messages are deleted at the message center. Suppose that y^h_{c=type}, for type ∈ {ham, spam}, denotes the set of all messages filtered as type under threshold h. The total amount of traffic used is then

    N_FilteringOnly = |y^h_{c=ham}| × 6 + |y^h_{c=spam}| × 1

where |·| represents the cardinality of a set. This is because a ham traverses six paths (S → MSC/SMSG → SMSC → HLR → SMSC → MSC → R) whereas a spam traverses just one (S → MSC/SMSG).

In contrast, our hybrid model divides the measurable space into three different areas using two thresholds, h1 and h2. As a result, we have two more quantities to estimate: the traffic used by ham (N_un) and spam (N_us) in the uncertain region. Let y~_{c=type}, for type ∈ {ham, spam}, be the set of messages that have label type as ground truth.

Fig. 4 – Displaying k = Pr(c = ham | y) for N messages: spam (14.57%) and ham (85.32%). (a) N messages and (b) distribution.

As Fig. 5 shows, there are four possible pathways:

- in (a), a message classified as ham (using the higher threshold) is sent directly to R via the network components; the number of paths taken is six: S → MSC/SMSG → SMSC → HLR → SMSC → MSC → R.
- in (b), a message falls between the higher and the lower thresholds; a correct response is submitted by the sender and the message is classified as ham; the number of paths taken is eight: S → MSC/SMSG → S → MSC/SMSG → SMSC → HLR → SMSC → MSC → R.
- in (c), a message again falls between the two thresholds; this time no response is returned and the message is classified as spam; the number of paths taken is two: S → MSC/SMSG → S.
- in (d), a message classified as spam using the lower threshold is deleted at the message center (MSC/SMSG); the number of paths taken is one: S → MSC/SMSG.

The traffic usage is calculated using:

    N_n  = |y^{h2}_{c=ham}| × 6
    N_un = |y^{h1}_{c=ham} ∩ y^{h2}_{c=spam} ∩ y~_{c=ham}| × (1 - e1) × 8
         + |y^{h1}_{c=ham} ∩ y^{h2}_{c=spam} ∩ y~_{c=spam}| × e2 × 8
    N_us = |y^{h1}_{c=ham} ∩ y^{h2}_{c=spam} ∩ y~_{c=spam}| × (1 - e2) × 2
         + |y^{h1}_{c=ham} ∩ y^{h2}_{c=spam} ∩ y~_{c=ham}| × e1 × 2
    N_s  = |y^{h1}_{c=spam}| × 1
    N_hybrid = N_n + N_un + N_us + N_s     (4)

where e1 is the probability that a human fails to respond correctly and e2 is the probability that spam generated by a machine passes a Turing test.

Fig. 5 – Four possible pathways for the hybrid method.


computers & security 29 (2010) 446–459 453

a Turing test. Again, for simplicity, we assume that these


a Ntotal=Nun+Nus+Ns
probabilities are relatively low and set e1 and e2 at 0.02 and
4 NFiltering Only
0.01, respectively. x 10
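Eq. (4) can be read as a simple expected-cost computation over the four regions. The sketch below, with our own function and argument names, takes the per-region message counts as given and groups the uncertain-region terms by true label rather than by outcome (the totals are the same):

```python
def hybrid_traffic(n_ham_region, n_unc_true_ham, n_unc_true_spam,
                   n_spam_region, e1=0.02, e2=0.01):
    """Traffic usage of the hybrid filter, following Eq. (4).

    n_ham_region:    messages above the higher threshold h2 (classified ham)
    n_unc_true_ham:  uncertain-region messages whose true label is ham
    n_unc_true_spam: uncertain-region messages whose true label is spam
    n_spam_region:   messages below the lower threshold h1 (classified spam)
    e1: probability a human fails the challenge
    e2: probability machine-generated spam passes the challenge
    """
    n_n = n_ham_region * 6                                  # pathway (a)
    # pathway (b): challenge answered, message delivered (8 hops)
    n_un = n_unc_true_ham * (1 - e1) * 8 + n_unc_true_spam * e2 * 8
    # pathway (c): challenge unanswered, message dropped (2 hops)
    n_us = n_unc_true_spam * (1 - e2) * 2 + n_unc_true_ham * e1 * 2
    n_s = n_spam_region * 1                                 # pathway (d)
    return n_n + n_un + n_us + n_s
```

With the default e1 and e2, each uncertain ham costs (1 − e1)·8 + e1·2 = 7.88 expected hops and each uncertain spam e2·8 + (1 − e2)·2 = 2.06, which is why the total shrinks as the uncertain region is narrowed from below.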
Fig. 6 shows the traffic usage and accuracy with varying thresholds h1 and h2. We assume that h1 is smaller than h2, and only the right half of the graph is meaningful since the right and left halves of the graph are symmetric. The green plane represents the traffic used by the filtering method alone, and the blue one represents the traffic used by our hybrid framework.

In order to show the changes in traffic usage with the two varying thresholds, the inner sections of Fig. 6 are explored further in Fig. 7. Graph (a) was plotted with the higher threshold fixed to 0.73333, and with the lower threshold increasing from 0 until it reached this value. The graph shows that the traffic usage decreases as the lower threshold increases. Additionally, the traffic usage ratios of our hybrid approach to the conventional approach (filtering only) are 1.37 (= 4.1/3) and 1.27 (= 3.3/2.6) at lower thresholds of 0 and 0.5, respectively. From this, we concluded that the amount of traffic used in our approach is roughly 1.3 times greater than that of the conventional approach. We also monitored the traffic usage with the lower threshold fixed to 0.1, and with the higher threshold increasing from 0.1 to 1 (see graph (b) in Fig. 7). The traffic usage does not change with the filtering-only approach because the lower threshold is the same as h. As the number of messages in the uncertain region increases, so does the traffic usage.
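The qualitative behaviour of Fig. 7(a) can be reproduced with a small simulation. The score distributions below are invented stand-ins for the filtering output (the paper's dataset is synthetic as well); only the pathway costs and the e1/e2 values come from the text:

```python
import random

random.seed(0)
# Invented Pr(c = ham | y) scores: hams score high, spams score low;
# the 85.32% / 14.57% mixture roughly mirrors the paper's dataset.
scores = ([("ham", min(1.0, random.gauss(0.85, 0.1))) for _ in range(4266)]
          + [("spam", max(0.0, random.gauss(0.20, 0.1))) for _ in range(734)])

def traffic(h1, h2, e1=0.02, e2=0.01):
    """Expected total path length for thresholds h1 < h2 (four pathways)."""
    total = 0.0
    for label, s in scores:
        if s >= h2:                      # classified ham: 6 hops
            total += 6
        elif s < h1:                     # classified spam: 1 hop
            total += 1
        elif label == "ham":             # uncertain; sender answers w.p. 1 - e1
            total += (1 - e1) * 8 + e1 * 2
        else:                            # uncertain; spam passes w.p. e2
            total += e2 * 8 + (1 - e2) * 2
    return total

# As in Fig. 7(a): with the higher threshold fixed, raising the lower
# threshold moves uncertain messages onto the cheap 1-hop pathway.
assert traffic(0.5, 0.73333) <= traffic(0.0, 0.73333)
```

The monotone decrease holds for any score distribution, since every uncertain message (7.88 or 2.06 expected hops) costs more than a message dropped at the message center (1 hop).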
[Fig. 6 – 3D view of the (a) traffic usage and (b) accuracy with varying thresholds.]

[Fig. 7 – Slices along one axis (with a fixed threshold). (a) A fixed high threshold and (b) a fixed low threshold.]

4.3. ROC comparison

One of the good measures used in classification is the Receiver Operating Characteristic (ROC) curve. We calculated and compared the True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN) of the underlying classes and the expected ones between the filtering-only method and our hybrid method. Let q be {TP, TN, FP, FN}. We obtained the proper estimate of q from the posterior distributions: p(q | h) was used for the filtering method and p(q | h1, h2) for the hybrid method. q was obtained in the filtering-only method by

q̃_filtering-only = E(q | h, y)    (5)

where E(· | h, y) denotes expectation given a threshold. We also obtained q in the hybrid method by E(q | h1, h2, y), which denotes expectation given two thresholds. We used the marginalized posterior distribution for the hybrid method since the number of thresholds needs to be equal for a fair comparison:

q̃_hybrid = E(q | h1, y)
         = ∫ q p(q | h1, y) dq
         = ∫_q q [ ∫_{h2} p(q, h2 | h1, y) dh2 ] dq
         = ∫_q q [ ∫_{h2} p(q | h2, h1, y) p(h2 | h1) dh2 ] dq
         = ∫_q ∫_{h2} q p(q | h2, h1, y) p(h2 | h1) dh2 dq        (6)
         = ∫_{h2} [ ∫_q q p(q | h2, h1, y) dq ] p(h2 | h1) dh2
         ≈ (1/|H2|) Σ_{h2 ∈ H2} ∫_q q p(q | h2, h1, y) dq
         = (1/|H2|) Σ_{h2 ∈ H2} E(q | h1, h2, y)

where h2 ~ p(h2 | h1) and H2 is the set of sampled h2 values. From Eq. (5) and Eq. (6), we plotted an ROC curve with the threshold increasing from 0 to 1 (see Fig. 8); the x- and y-axes stand for 1 − specificity and sensitivity, which are estimated by

specificity = TN / (FP + TN)  and  sensitivity = TP / (TP + FN).    (7)

The plain black line shows the filtering-only method; the coloured lines with markers show the variants of the hybrid method. We tested four combinations of e1 and e2: e1 ∈ {0.02, 0.04} and e2 ∈ {0.008, 0.01}. The graph shows that our hybrid method achieves higher performance than the filtering-only method. The ROC can also be used to generate a summary statistic; one of the common versions is the Area Under the ROC Curve (AUC). The AUC corresponds to the probability of a classifier ranking a randomly chosen positive instance higher than a negative one. The comparison of AUCs for all methods is given in Table 1. The AUCs of all hybrid methods are higher than that of the filtering method, which reinforces our previous result that the hybrid method has superior performance. In addition, as e1 and e2 become smaller, the AUC increases.

Table 1 – Comparison of AUC.

Method          Ratio (e1)  Ratio (e2)  AUC
Filtering only  –           –           0.9261
Hybrid          0.02        0.008       0.9783
Hybrid          0.04        0.008       0.9736
Hybrid          0.02        0.01        0.9782
Hybrid          0.04        0.01        0.9735

4.4. Variant proportion of spam

Previously, in Sections 4.2 and 4.3, the proportions of spam and ham were fixed to 14.57% and 85.32%, respectively. In this section we show how the performance is affected when these proportions change.

Table 2 describes a small number of samples from the nine different proportions. Each record has six columns: proportion of spam (%), lower threshold (h1), higher threshold (h2), traffic usage (TU) of Nhybrid, ratio (= Nhybrid / Nfiltering-only), and accuracy (ACC = (TP + TN) / (P + N)). These give three different performance measures. If the traffic usage is lower, we say the system is lighter and more economical. The ratio is close to 1 only if the traffic used in the hybrid method is close to the amount used in the filtering-only method. The accuracy measures the correctness of message classification. We can select practical threshold values for each spam proportion to compare the performance. For instance, the threshold values h1 = 0.1 and h2 = 0.2 can be selected at a 10% spam proportion to obtain a reasonable performance from the hybrid method. However, if the system is concerned with achieving high accuracy rather than with reducing traffic usage, the values h1 = 0.1 and h2 = 0.9 can be used.
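Returning to the ROC comparison of Section 4.3, the specificity and sensitivity estimates of Eq. (7), together with a trapezoidal AUC as summarized in Table 1, can be sketched as follows. The toy score data and the convention that spam is the positive class are our own illustration:

```python
def roc_points(scores, labels, thresholds):
    """Sweep a threshold over Pr(c = ham | y)-style scores.

    labels: 1 = spam (the positive class), 0 = ham.
    A message is classified as spam when its score falls below the threshold.
    Returns sorted (1 - specificity, sensitivity) pairs, as in Eq. (7).
    """
    pts = []
    for t in thresholds:
        tp = sum(1 for s, l in zip(scores, labels) if s < t and l == 1)
        fn = sum(1 for s, l in zip(scores, labels) if s >= t and l == 1)
        tn = sum(1 for s, l in zip(scores, labels) if s >= t and l == 0)
        fp = sum(1 for s, l in zip(scores, labels) if s < t and l == 0)
        sensitivity = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (fp + tn) if fp + tn else 0.0
        pts.append((1 - specificity, sensitivity))
    return sorted(pts)

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# A perfectly separable toy set: two spams scoring low, two hams scoring high.
pts = roc_points([0.1, 0.2, 0.8, 0.9], [1, 1, 0, 0],
                 [i / 10 for i in range(11)])
```

For this separable toy set the curve runs through (0, 0) and (1, 1) and the AUC is 1.0; real score distributions, as in Table 1, give values below 1.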

In a spam-dominant environment (for a spam proportion of 50%), reasonable threshold values would be h1 = 0.4 and h2 = 0.6. Returning to the figures for a spam proportion of 10%, h1 = 0.1 and h2 = 0.9 will be selected when accuracy is the most significant factor.

Table 2 – Traffic amounts and accuracy of hybrid methods in terms of thresholds.

Proportion of spam  h1   h2   TU         Ratio   ACC
10%                 0.1  0.2  30,136.4   1.0079  0.9185
                    0.1  0.9  47,440.88  1.5867  0.9831
                    0.4  0.6  31,683.34  1.1438  0.9527
                    0.8  0.9  16,553.18  1.3111  0.3968
20%                 0.1  0.2  30,305.68  1.0161  0.8331
                    0.1  0.9  47,539.24  1.5939  0.9839
                    0.4  0.6  30,522.02  1.1625  0.9415
                    0.8  0.9  15,569.8   1.3035  0.4693
30%                 0.1  0.2  30,487.38  1.0234  0.7470
                    0.1  0.9  47,764.28  1.6034  0.9846
                    0.4  0.6  29,691.62  1.1843  0.9421
                    0.8  0.9  14,176.7   1.2882  0.5312
40%                 0.1  0.2  30,697.42  1.0322  0.6635
                    0.1  0.9  47,797.02  1.6072  0.9853
                    0.4  0.6  28,747.42  1.2053  0.9329
                    0.8  0.9  12,850.56  1.2605  0.5963
50%                 0.1  0.2  30,863.82  1.0402  0.5826
                    0.1  0.9  47,892.5   1.6142  0.9863
                    0.4  0.6  27,396.98  1.2228  0.9275
                    0.8  0.9  11,703.38  1.2411  0.6625

[Fig. 8 – ROC curve. Filtering only (AUC: 0.92608); hybrid with e1 ∈ {0.02, 0.04} and e2 ∈ {0.008, 0.01} (AUC: 0.9783, 0.97362, 0.97818, 0.97349).]

4.5. Different content-based filtering parameters and performance implications

The performance (accuracy and traffic usage) of the proposed hybrid method will vary depending on the characteristics of the content-based filtering method being used. In order to study this further, we simulated the hybrid method with 200 random samples, each representing a different case of the content-based filtering method. The following parameters were considered for each sample:

- m0: the mean of the cost/likelihood of spam
- s0: the standard deviation of the cost/likelihood of spam
- m1: the mean of the cost/likelihood of ham
- s1: the standard deviation of the cost/likelihood of ham
- h1: a lower threshold
- h2: a higher threshold.

From the results, the accuracy and the traffic usage ratio (hybrid method to conventional filtering-only method) were plotted in a graph (see Fig. 9). The results were clustered using a well-known clustering technique, the K-means algorithm (Hartigan, 1975), with K = 4 (implying 4 clusters). The samples clustered around the top left region of the graph (assigned label 1) are regarded as recommendable cases since they show high accuracy with only a small increase in traffic usage compared to the filtering-only method. The samples clustered around the top right region of the graph (assigned label 2) show high accuracy but suffer from a large increase in traffic usage compared to the filtering-only method. This implies that the hybrid method should be used for these samples when achieving high accuracy is considered relatively more important than the resulting increase in traffic usage. The worst samples are those clustered around the bottom left region of the graph (assigned label 3); although these show only a small increase in traffic usage, their accuracy is also very low, so the hybrid method is not really suitable for handling such samples. The samples clustered broadly around the center region of the graph (assigned label 4) are considered better than those assigned label 3, but worse than those assigned labels 1 and 2.

More can be observed from an associated set of data, presented in Table 3 (see Appendix B); each record describes a sample profile.² We also studied the configurations of the hybrid method with respect to the ratio of accuracy (ΔACC = ACChybrid / ACCfiltering-only) and the ratio of traffic usage (ΔTU = TUhybrid / TUfiltering-only). Different applications and business models will have different preferences for these two ratios. Taking this into consideration, we studied the trends for three representative cases where

1. accuracy is considered twice as important as traffic usage – Fig. 10(a);
2. both are considered equally important – Fig. 10(b);
3. traffic usage is considered twice as important as accuracy – Fig. 10(c).

The results are shown in Fig. 10. In Fig. 10(a), the samples which satisfy ΔACC > 0.5ΔTU + 0.5 are plotted with red dots; these represent good samples. The others, plotted with black 'x's, are classified as relatively bad samples. Similarly, in Fig. 10(b) and (c), good samples (red dots) satisfy ΔACC > ΔTU and ΔACC > 2ΔTU − 1, respectively, whereas relatively bad ones (black 'x's) do not.

We also plotted the ratio of the number of good samples to the number of bad samples as derived from these three cases (see Fig. 11). The graph shows that, with the varying importance of these two factors, this ratio changes: as the importance of accuracy (relative to traffic usage) increases, so does the number of good samples, and vice versa.
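The three boundary rules of Fig. 10 translate directly into code; the sample ratios used below are made up for illustration:

```python
# Boundary rules from Fig. 10; note all three lines pass through (1, 1):
# (a) accuracy twice as important as traffic: dACC > 0.5 * dTU + 0.5
# (b) equally important:                      dACC > dTU
# (c) traffic twice as important:             dACC > 2 * dTU - 1
RULES = {
    "1:2": lambda dacc, dtu: dacc > 0.5 * dtu + 0.5,
    "1:1": lambda dacc, dtu: dacc > dtu,
    "2:1": lambda dacc, dtu: dacc > 2 * dtu - 1,
}

def good_bad_ratio(samples, rule):
    """samples: (dACC, dTU) pairs; returns #good / #bad, as plotted in Fig. 11."""
    good = sum(1 for dacc, dtu in samples if rule(dacc, dtu))
    bad = len(samples) - good
    return good / bad if bad else float("inf")

# A sample with a large accuracy gain and modest traffic growth is "good"
# under every weighting:
assert all(rule(1.15, 1.05) for rule in RULES.values())
```

Because rule (c) has the steepest slope, a sample that barely passes it also passes the other two; this is why the good-to-bad ratio in Fig. 11 falls as traffic usage is weighted more heavily.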
Figs. 10 and 11 both show that, as traffic usage becomes more important, the number of good samples decreases. Hence, if our hybrid method were to be applied in an environment where traffic usage is considered relatively more important, the content-based filtering parameters as well as the threshold values would need to be selected more carefully.

[Fig. 9 – Accuracy and traffic usage ratio for 200 random samples, each assigned one of four labels: label 1 (red dots, top left), label 2 (black crosses, top right), label 3 (blue squares, bottom left) and label 4 (pink triangles, center).]

² A sample profile consists of the specific parameter values. Given the space available, we show only 10 representative samples for each label (L). Consider the samples assigned labels 3 (L = 3) and 4 (L = 4): in practice, there is a very low chance of these cases arising, since the means of spam and ham (m0 and m1) are lower than the lower threshold and the standard deviations are also relatively small. Therefore, if we study the graph without being too concerned about such cases, it becomes clearer that our hybrid method is, in general, capable of achieving high accuracy regardless of the content-based filtering algorithm (or the parameters) being used.
[Fig. 10 – Comparison between the accuracy ratio and the traffic usage ratio for 200 samples; dotted blue lines stand for the borderline between good samples and bad samples. (a) ΔACC:ΔTU = 1:2. (b) ΔACC:ΔTU = 1:1. (c) ΔACC:ΔTU = 2:1.]

5. Conclusion and future work

We proposed a hybrid spam filtering framework for mobile communication using a combination of content-based filtering and challenge-response. A message that falls into the uncertain region (after filtering) is further classified by sending a challenge (e.g. an image CAPTCHA) to the sender: a legitimate sender is likely to answer it correctly, whereas an automated spam program is not. The challenge-response protocols have been carefully designed with the necessary cryptographic features. We have also shown the trade-off between accuracy and traffic usage in using our framework and compared it with the conventional content-based filtering method. Moreover, through a simulation of 200 randomly generated samples (each representing a unique set of content-based filtering parameters and threshold values) we showed that our hybrid approach, in general, achieves high accuracy regardless of the content-based filtering algorithm being applied. However, when traffic usage becomes relatively more important than accuracy, the underlying filtering algorithms must be selected more carefully.

[Fig. 11 – The ratio between the number of good samples and the number of bad samples.]

In this paper, a synthetic dataset, as opposed to a real dataset, has been used for two reasons: first, we wanted to consider a wide range of application environments, each of which will require a different level of accuracy and traffic usage (e.g. VoIP spam filters; Croft and Olivier, 2005); and second, this protocol involves a great deal of human interaction, and developing such a prototype (in order to generate our own dataset) was outside the scope of this work. As part of the future work, we could contact mobile operators and forums like OMTP to collect real data and verify the accuracy of our results.

Having the network operators charge for the sending of SMS messages has been one of the big inhibitors to the growth of spam: even a cent per message might hugely alter the economics of a spammer. Assuming that a reasonable filtering method is in place, another hybrid possibility is to force spammers to opt into a charging scheme where the cost of responding to a challenge is larger than that of sending the initial spam. For example, if it costs two cents to send a spam, then it would cost an extra five cents to answer an image CAPTCHA. It is difficult to assess how effective such a solution might be, but future work may explore these economic measures in depth as a potential enhancement to the hybrid approach.
Acknowledgements

The authors would like to thank the anonymous referees for their careful attention and insightful comments.

Appendix A. Proof of protocol 1

Protocol 1 has been proved using BAN logic (refer to the rules given by Burrows et al., 1989). In this protocol, M should be able to trust H(S, R, P) returned from S and know whether the sender is legitimate or not. For formal verification, we derive an ideal protocol from protocol 1. The symbols Ns and Nm represent a sender's nonce and a recipient's nonce, respectively. Our goal is to show that this protocol satisfies the security property (G1). (In the notation below, "believes", "said", "sees", "controls" and "fresh" denote the standard BAN operators; M <-(K)-> S denotes a key K shared between M and S, and K -> S denotes S's public key.)

[Ideal Protocol 1]
(M2) M -> S : {M <-(Kms)-> S}Kc, {Nm, (S, R, P)}Kms
(M3) S -> M : {Ns, Nm, (S, R, P)}Kms

[Security Goal]
(G1) M believes S believes (Ns, Nm, (S, R, P))

Message 1 is ignored since it does not contribute much to achieving the goal; {N + 1} is shown as Ns. The initial state assumptions are as follows; (A3) assumes that Kms will be shared with a legitimate sender capable of interpreting Kc^-1:

(A1) M believes fresh(Nm)
(A2) S believes fresh(Ns)
(A3) M believes M <-(Kms)-> S
(A4) S believes M controls M <-(Kms)-> S
(A5) M believes Kc -> S
(A6) S believes Kc^-1 -> S

The proof is described as follows. Sending message 2 leads to:

(1) M said {M <-(Kms)-> S}Kc
(2) M said {Nm, (S, R, P)}Kms
(3) S sees {M <-(Kms)-> S}Kc
(4) S sees {Nm, (S, R, P)}Kms

Sending message 3 leads to:

(5) S said {Ns, Nm, (S, R, P)}Kms
(6) M sees {Ns, Nm, (S, R, P)}Kms

(7) is derived from (A3) and (6) by the message-meaning rule:

(7) M believes S said (Ns, Nm, (S, R, P))

(Ns, Nm, (S, R, P)) contains the nonce Ns, and hence (8):

(8) M believes fresh(Ns, Nm, (S, R, P))

Finally, (G1) is derived from (7) and (8) by the nonce-verification rule. In protocol 1, S cannot verify whether message 2 is from their contracted operator. Other protocols can be proved in a similar manner.
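The freshness argument in the proof can be illustrated with a toy exchange. HMAC-SHA256 here stands in for protection under the shared key Kms, and the field layout and function names are our own rather than the paper's exact protocol:

```python
import hashlib
import hmac
import os

K_MS = os.urandom(32)   # key shared between the recipient M and the operator S

def protect(key, *fields):
    """Stand-in for {...}_Kms: a MAC binding the fields under the key."""
    return hmac.new(key, b"|".join(fields), hashlib.sha256).digest()

# Message 2: M sends its fresh nonce Nm bound to (S, R, P) under Kms.
n_m = os.urandom(16)
msg2 = protect(K_MS, n_m, b"S", b"R", b"P")

# Message 3: S replies with its own nonce Ns AND M's nonce Nm under Kms.
n_s = os.urandom(16)
msg3 = protect(K_MS, n_s, n_m, b"S", b"R", b"P")

# M's check, mirroring the message-meaning and nonce-verification steps:
# the reply is keyed under Kms and contains M's fresh nonce, so M may
# conclude that S recently said (Ns, Nm, (S, R, P)) -- property (G1).
assert hmac.compare_digest(msg3, protect(K_MS, n_s, n_m, b"S", b"R", b"P"))

# A reply built under a different key (e.g. an uncontracted operator)
# fails the same check.
assert not hmac.compare_digest(
    msg3, protect(os.urandom(32), n_s, n_m, b"S", b"R", b"P"))
```

The toy mirrors only the freshness and key-possession reasoning; the real protocol's confidentiality under Kc and Kms is outside this sketch.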

Appendix B. Traffic amounts and accuracy of hybrid methods of 200 random samples (L: labels/classes)

Table 3 – Traffic amounts and accuracy of hybrid methods of 200 random samples (L: labels/classes).

L  ID   m0      s0        m1      s1       h1       h2      TU         Ratio  ACC
1  11   0.6732  0.1385    0.6759  0.1187   0.3      0.6     3.517e+04  1.173  0.8921
1  12   0.7174  0.05449   0.8433  0.1805   0.7333   0.8333  2.624e+04  1.135  0.8074
1  13   0.7604  0.008962  0.8526  0.12     0.2667   0.4     3.005e+04  1.002  0.8565
1  14   0.6399  0.09525   0.9397  0.1433   0.1333   0.7     3.316e+04  1.108  0.959
1  15   0.637   0.1124    0.9365  0.04716  0.2333   0.5667  3.071e+04  1.024  0.8924
1  16   0.5648  0.09243   0.8593  0.04778  0.5      0.7333  3.137e+04  1.077  0.9954
1  17   0.739   0.1425    0.8176  0.06279  0.4      0.6333  3.056e+04  1.022  0.8878
1  19   0.5905  0.07692   0.95    0.0202   0.1667   0.3667  3e+04      1      0.8572
1  21   0.4844  0.06268   0.6896  0.07335  0        0.6667  3.914e+04  1.305  0.9921
1  24   0.6414  0.02375   0.7796  0.04666  0.2667   0.5333  3e+04      1      0.857
2  8    0.5559  0.04695   0.6891  0.02269  0.5      0.8     4.883e+04  1.655  0.9816
2  9    0.4918  0.0645    0.8896  0.0539   0.8667   0.9667  3.156e+04  1.564  0.7387
2  18   0.6342  0.06492   0.7137  0.05991  0        0.8667  4.966e+04  1.655  0.9815
2  23   0.4885  0.09164   0.7519  0.2259   0.3333   0.9     4.107e+04  1.444  0.9321
2  28   0.8101  0.08229   0.8187  0.0469   0.4      1       4.97e+04   1.657  0.9814
2  29   0.6651  0.09587   0.8033  0.08575  0.06667  0.8333  4.267e+04  1.422  0.9851
2  31   0.5595  0.09035   0.8751  0.07934  0.7667   0.9333  3.515e+04  1.446  0.9038
2  37   0.5251  0.1116    0.7205  0.1148   0.6667   0.8667  3.151e+04  1.528  0.7385
2  41   0.6798  0.2155    0.7382  0.07893  0.3      0.8667  4.8e+04    1.612  0.9491
2  51   0.5878  0.1235    0.7914  0.1444   0.6667   0.9333  3.48e+04   1.499  0.8222
3  34   0.592   0.06966   0.6375  0.02727  0.8      0.9     5000       1      0.143
3  58   0.5707  0.04015   0.709   0.04381  0.8667   0.9     5000       1      0.143
3  66   0.3453  0.09073   0.8467  0.04666  0.9333   0.9667  5506       1.042  0.1542
3  77   0.6668  0.06      0.7714  0.1038   0.9      0.9667  8408       1.213  0.2187
3  86   0.7671  0.002867  0.7887  0.0686   0.9667   1       5000       1      0.143
3  93   0.6524  0.01272   0.6852  0.1023   0.8667   1       5876       1.07   0.1622
3  101  0.5471  0.03209   0.7869  0.05634  0.9667   1       5000       1      0.143
3  106  0.7164  0.01455   0.8104  0.04486  0.9667   1       5000       1      0.143
3  110  0.6261  0.06151   0.7295  0.1309   0.9      0.9333  7596       1.126  0.2119
3  144  0.5295  0.06446   0.726   0.01037  0.8333   0.8667  5000       1      0.143
4  80   0.6065  0.09277   0.7828  0.08489  0.8667   0.9333  1.129e+04  1.305  0.2859
4  87   0.5061  0.07041   0.7851  0.01709  0.8      1       1.225e+04  1.353  0.302
4  124  0.5949  0.1187    0.6783  0.2052   0.7667   0.8667  1.711e+04  1.242  0.4799
4  130  0.4295  0.0666    0.7427  0.056    0.8      0.9667  1.071e+04  1.307  0.2682
4  131  0.5576  0.03068   0.6677  0.1524   0.8333   1       1.066e+04  1.305  0.2671
4  133  0.6727  0.006693  0.8018  0.0628   0.8333   0.9     1.725e+04  1.41   0.4273
4  139  0.6048  0.03089   0.7245  0.07272  0.8      0.9333  1.069e+04  1.306  0.2677
4  141  0.5513  0.07846   0.7591  0.1525   0.9      1       1.272e+04  1.365  0.3121
4  143  0.6858  0.1073    0.6896  0.2036   0.9      0.9667  1.033e+04  1.239  0.2742
4  152  0.5276  0.2158    0.8952  0.08286  0.9333   1       2.118e+04  1.508  0.4964

References

Androutsopoulos I, Koutsias J, Chandrinos K, Spyropoulos CD. An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: SIGIR '00: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval. New York, NY, USA: ACM; 2000. p. 160–7.

Bratko A, Filipič B, Cormack GV, Lynam TR, Zupan B. Spam filtering using statistical data compression models. Journal of Machine Learning Research 2006;7:2673–98.

Burrows M, Abadi M, Needham R. A logic of authentication. ACM Operating Systems Review 1989;23(5):1–13.

Chow R, Golle P, Jakobsson M, Wang L, Wang X. Making CAPTCHAs clickable. In: HotMobile '08: Proceedings of the 9th workshop on mobile computing systems and applications. New York, NY, USA: ACM; 2008. p. 91–4.

Cormack GV, Hidalgo JMG, Sanz EP. Spam filtering for short messages. In: Proceedings of the 16th ACM conference on information and knowledge management; 2007. p. 313–20.

Croft NJ, Olivier MS. A model for spam prevention in IP telephony networks using anonymous verifying authorities. In: ISSA, new knowledge today conference; 2005.

Deng W, Peng H. Research on a naive Bayesian based short message filtering system. In: 2006 international conference on machine learning and cybernetics; Aug. 2006. p. 1233–7.

Dwork C, Goldberg A, Naor M. On memory-bound functions for fighting spam. In: Proceedings of the 23rd annual international cryptology conference (CRYPTO 2003); August 2003.

Enck W, Traynor P, McDaniel P, Porta T. Exploiting open functionality in SMS-capable cellular networks. In: CCS; Nov. 2005.

Golbeck J, Hendler J. Reputation network analysis for email filtering. In: Proceedings of the conference on email and anti-spam (CEAS); 2004.

Hall RJ. How to avoid unwanted email. Communications of the ACM; March 1998.

Hartigan JA. Clustering algorithms. New York: John Wiley and Sons; 1975.

He P, Sun Y, Zheng W, Wen X. Filtering short message spam of group sending using CAPTCHA. In: Workshop on knowledge discovery and data mining; 2008. p. 558–61.

Healy M, Delany S, Zamolotskikh A. An assessment of case-based reasoning for short text message classification. In: Proceedings of the 16th Irish conference on artificial intelligence and cognitive science; 2005. p. 257–66.

Hidalgo JMG, Bringas GC, Sanz EP, Garc FC. Content based SMS spam filtering. In: Proceedings of the 2006 ACM symposium on document engineering. Amsterdam, The Netherlands: ACM Press; October 2006. p. 10–3.

Leveraging the CAPTCHA Problem; 2005.

Metsis V, Androutsopoulos I, Paliouras G. Spam filtering with naive Bayes – which naive Bayes? In: Third conference on email and anti-spam (CEAS); 2006.

Prieto AG, Cosenza R, Stadler R. Policy-based congestion management for an SMS gateway. In: Proceedings of the fifth IEEE international workshop; 2004.

Rogers D. Mobile handset security: securing open devices and enabling trust. OMTP Limited White Paper; 2007.

Roman R, Zhou J, Lopez J. An anti-spam scheme using pre-challenges. Computer Communications 2006;29(15):2739–49.

Shirali-Shahreza S, Movaghar A. An anti-SMS-spam using CAPTCHA. In: CCCM '08: Proceedings of the 2008 ISECS international colloquium on computing, communication, control, and management. Washington, DC, USA: IEEE Computer Society; 2008. p. 318–21.

von Ahn L, Maurer B, McMillen C, Abraham D, Blum M. reCAPTCHA: human-based character recognition via web security measures. Science; August 2008.

Yan J, El Ahmad AS. Usability of CAPTCHAs or usability issues in CAPTCHA design. In: SOUPS '08: Proceedings of the 4th symposium on usable privacy and security. New York, NY, USA: ACM; 2008. p. 44–52.

Ji Won Yoon. He received the B.Sc. degree in information engineering at SungKyunKwan University, Korea. He obtained the M.Sc. degree in the School of Informatics at the University of Edinburgh, UK in 2004 and the Ph.D. degree in the signal processing group at the University of Cambridge, UK in 2008. In 2008, he moved to the Department of Engineering Science, the University of Oxford, UK for postdoctoral research. He is currently a Research Fellow with the Statistics department, Trinity College Dublin, Ireland. His research interests include Bayesian statistics, machine learning, data mining, network security and biomedical engineering. He has worked on applications in brain signals, cosmology, biophysics and multimedia.

Hyoungshick Kim. He received the B.Sc. degree in information engineering at SungKyunKwan University, Korea. He obtained the M.Sc. degree in the department of computer science, KAIST, Korea in 2001. He previously worked for Samsung Electronics as a senior engineer from May 2004 to September 2009, and served as a member of the DLNA and Coral standardization bodies for DRM interoperability in home networks. He is currently studying in the Computer Laboratory at the University of Cambridge as a PhD student. His research interest is focused on security and privacy in complex networks and distributed systems.

Jun Ho Huh. He holds Software Engineering and International Business degrees from Auckland University. He is currently a DPhil student at the Oxford University Computing Laboratory. His research interests include trusted virtualization, trustworthy audit and logging, and security in distributed systems.
