A Framework For SMS Spam and Phishing Detection in Malay Language: A Case Study
All content following this page was uploaded by Cik Feresa Mohd Foozy on 23 July 2018.
Abstract - Short Message Service (SMS) spam and SMS phishing have been increasing nowadays, especially in the Malay language, which is the first language of Malaysia. Currently, SMS spam detection has been proposed for other languages but not yet for the Malay language, and we are the first to propose it. In addition, this paper analyses several SMS spam filtering frameworks as a basis for our SMS spam and phishing detection framework. From the analysis, the chosen framework has been enhanced for Malay SMS spam and phishing. The enhancement is made in the classification phase, where our framework proposes dual classification. Classification 1 classifies the SMS into ham and scam SMS. In classification 2, the scam SMS are classified again into SMS spam and SMS phishing. After the dual classification phase is completed, the Malay SMS are examined using the Naive Bayes and J48 supervised machine learning techniques. The results show high accuracy in detecting Malay SMS ham, spam and phishing.
Copyright © 2014 Praise Worthy Prize S.r.l. - All rights reserved.
Cik Feresa Mohd Foozy, Rabiah Ahmad, Faizal M. A.
This paper is organized as follows. Section 2 reviews the literature on SMS phishing detection. Section 3 explains the pre-processing of the SMS datasets, the development of the features selection and the experiments that measure the accuracy of the selected features using the data mining tool WEKA. Section 4 shows the results and findings of the experiments, and finally the conclusion and future work are given.

II. Related Work

For this paper, Malay SMS phishing and spam are classified based on generic features such as Total words [43], [44], [45] and [46], Number of Character bi-grams [43] and [45], Number of Character tri-grams [43] and [45], Average number of length [44] and [46] and Average number of word [44] and [46]. We also add additional features for this study, such as advertisement, contest, urgent SMS, asking for money, asking the user to respond to the SMS, telephone, URL and announcing that the user gets a free gift or wins. These criteria are based on spam and phishing attacks. In total, 14 features are applied in this study.
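To make the generic features above more concrete, the following is a minimal Python sketch of how a few of them could be computed for one SMS. It is an illustration only, not the authors' implementation (the paper reports using JAVA and WEKA). The function names, the regular expressions for URL and telephone cues, and the reading of "Average number of length" as average word length are all assumptions introduced here; the paper does not define these features precisely.

```python
import re

# Hypothetical patterns for two of the additional cues (URL, telephone).
# The paper does not publish its detection rules; these are assumptions.
URL_PATTERN = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
PHONE_PATTERN = re.compile(r"\b\d{7,12}\b")


def tokenize(message):
    """Split an SMS into lowercase word tokens on whitespace."""
    return message.lower().split()


def char_ngrams(message, n):
    """Return all overlapping character n-grams of the message."""
    return [message[i:i + n] for i in range(len(message) - n + 1)]


def extract_features(message):
    """Compute a small, illustrative subset of the features listed above."""
    words = tokenize(message)
    total_words = len(words)
    # Assumed interpretation: average length of a word, in characters.
    avg_word_length = (
        sum(len(w) for w in words) / total_words if total_words else 0.0
    )
    return {
        "total_words": total_words,
        "char_bigrams": len(char_ngrams(message, 2)),
        "char_trigrams": len(char_ngrams(message, 3)),
        "avg_word_length": avg_word_length,
        "has_url": bool(URL_PATTERN.search(message)),
        "has_phone": bool(PHONE_PATTERN.search(message)),
    }


# Made-up example message, not taken from the paper's corpus.
EXAMPLE_SMS = "Tahniah! Anda menang RM5000. Hubungi 0123456789"
features = extract_features(EXAMPLE_SMS)
```

Content cues such as advertisement, contest or a request for money would need keyword lists or manual labelling, which the paper does not specify, so they are left out of this sketch.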
Copyright © 2014 Praise Worthy Prize S.r.l. - All rights reserved. International Review on Computers and Software, Vol. 9, N. 7
TABLE I
LITERATURE ON TEXT MESSAGING CORPUS

References | Text Size | Text Language | Text Contributor | Text Collection Method
Pietrini [14] | 500 | Italian | 15-35 years old | Unknown
Schlobinski et al. [15] | 1500 | German | Students | Unknown
Shortis [16] | 207 | English | 1 male student, friends and family | Transcript
Doring [17] | 1000 | German | 200 participants | Unknown
Hard af Segerstad [18] | 1152 | Swedish | 112 from an anonymous webpage, 252 messages forwarded from volunteers and 788 from family and friends | From webpage, volunteers, family and friends
Kasesniemi and Rautiainen [19] | 7800 | Finnish | Teenagers (13-18 years old) | Transcript
Grinter and Eldridge [20] | 477 | English | 10 teenagers (15-16 years old) | Transcript
Thurlow and Brown [21] | 544 | English | 135 freshmen | Transcript
Ogle | 97 | English | Nightclubs | Subscribe SMS promotion of nightclub
Yijue How and Kan [22] | 10117 | English | ?003 respondents | Transcript
Fairon and Paumier [10] | 30000 | French | 166 university students | Forward
Choudhury [11] | 1000 | English | 3,200 contributors | Search the SMS from the website
Rich Ling [23] | 867 | Norwegian | Randomly | Transcript
Rettie (2007) | 278 | English | 31 contributors | Unknown
Zic Fuchs and Tudman Vukovic [24] | 6000 | Croatian | University students, family and friends | Unknown
However, there are multiple frameworks for spam filtering and detection, such as those below:

• Tools
Cormack et al. [45] filtered SMS using five types of spam filter tools. Before the tools were applied to the SMS, the data were pre-processed to produce four (4) main features. Moreover, to perform detection on the SMS protocol in real time, Rafique et al. [51] applied a Hidden Markov Model (HMM) whose architecture consists of a sniffer, feature extraction, a classifier and a rules decision. Rafique and Farooq [52] also apply HMM on the byte-level distribution of SMS, with four processes: identify a suitable representation of an SMS, build spam models, classify, and determine the spam. In addition, the SMS-Watchdog SMS detection scheme by Yan et al. [53] has three processes: monitoring, anomaly detection and alert handling using SMS services.

• Content-Based Filtering
J. W. Yoon, et al. [41] proposed a hybrid framework that implements a content-based technique with a challenge-response scheme. The SMS is classified into ham, spam and uncertain, and then the challenge-response scheme classifies the uncertain SMS into ham or spam by matching the sender
response. Gómez Hidalgo et al. [54] proposed content-based SMS filtering for English and Spanish SMS spam using Bayesian filtering that consists of preprocessing, feature selection and learning.

• Machine Learning
Xiang et al. [55] proposed a Support Vector Machine technique to filter mobile spam. Moreover, Cai et al. [56] improved the spam filter using the traditional balanced Winnow algorithm, with pre-processor, feature selection, text representation and Winnow algorithm modules. Taufiq Nuruzzaman et al. [44] filter spam independently on the mobile device through several processes: data set and running environment, feature extraction, vector creation, filtering with Naive Bayes or SVM, and updating the filtering system. Yadav [46], [57] had three processes in their SMS filtering: a Bayesian filtering algorithm, a mobile application and a synchronization service on a server.

• Matching Pattern
Wu et al. [58] proposed an SMS filtering flow consisting of SMS screening, Bayesian learning, keyword SMS and Pinyin fuzzed keyword matching. Moreover, the Chinese SMS filtering by Jie et al. [59] has pre-processing, features selection, modeling and a classifier. In addition, the framework of Najadat et al. [60] involves the processes of data collection, pre-processing, text mining, testing, evaluation metrics and implementation.

• Artificial Immune System
T. M. Mahmoud and A. M. Mahfouz [61] applied an artificial immune system (AIS) method to filter SMS spam, containing an analysis engine, word tokenizer, stop words, dataset, training and AIS engine. Chaminda et al. [62] proposed a hybrid solution of a neural network and Bayesian filtering where the SMS filtering process consists of a sender identification module, spam folder, SMS content extractor, tokenizer, Bayesian filter, categorization, training and inbox.

• Cryptography
In the cryptography area, Saxena [63] proposed a secure SMS protocol for SMS transmission and a cryptographic algorithm in the SIM card. The processes of the framework are requesting to send an SMS and authenticating the sending of the SMS. In addition, Pereira et al. [64] proposed a lightweight cryptography algorithm to mitigate the SMS security issues, with protocols providing encryption, authentication and signature services. Choi [65] applied a Common Public Key Cryptography technique for efficient SMS communication, which consists of initialization for authentication, encryption or decryption, and communication phases for sending SMS.

These are the approaches that have mostly been applied in SMS studies. They show that SMS spam filtering studies have already applied various filtering and detection techniques to different SMS languages, except for the Malay language. Additionally, the frameworks differ in the technique applied, yet their results are still good.

A variety of frameworks has been presented for spam filtering and detection. However, none is yet available for combined SMS spam and phishing attacks. Thus, the proposed SMS spam and phishing framework enhances the generic framework of SMS spam filtering by [50]. The enhanced framework has dual classification for Malay-language SMS.

The reason for dual classification is to verify that the SMS collection has been classified correctly. After the first classification is done, the second classification process classifies the scam SMS into spam and phishing. The framework is discussed further in the next section.

III. Methodology

This section explains the development of the Malay SMS spam and phishing detection framework. The SMS spam and phishing detection focuses on features based on previous studies. As mentioned before, this study is the first to collect Malay SMS for detecting spam and phishing. Thus, there are no SMS spam and phishing datasets available in the Malay language, and such datasets need to be prepared for this study.

For dataset preparation, Malay SMS have been collected from a website, unknown respondents, friends and family. The proposed framework is based on Guzella and Caminhas [50], which has four (4) main steps in filtering spam messages: tokenization, lemmatization, representation and classifier.

In this framework, the four main steps are applied. However, an additional classifier is added, which makes this a dual classifier in the Malay SMS spam and phishing detection framework.

We need a dual classifier rather than a single classification process because we collect SMS ham and scam SMS from a website, friends, family and unknown respondents. Some respondents have basic knowledge about SMS spam and phishing, and some do not know anything about these attacks.

After the Malay SMS collection is done, the SMS ham and SMS scam go through tokenizing, lemmatizing and stop word removal, representation and classification 1. After the result of classification 1 is obtained, the second classification process classifies the SMS scam again into SMS spam and phishing. A similar method has been applied by J. W. Yoon, et al. [41] to classify uncertain SMS into the spam and ham classes. Figure 1 and the list below give the process of Malay SMS spam and phishing dataset and detection development:

i. Collect SMS ham and scam SMS from website, friends, family and unknown respondents.
ii. Tokenization.
iii. Lemmatization.
iv. Representation of strings as nominal datasets.
v. Features selection.
vi. Second classification of scam SMS into spam and phishing.
vii. Examine the resulting Malay SMS datasets using the Naive Bayes and J48 techniques.

[Figure 1: incoming ham and scam SMS → tokenization → lemmatization and stop word removal → representation (features selection) → classifier 1 → scam SMS → classifier 2 → spam or phishing SMS]
Fig. 1. SMS Spam and Phishing Detection in Malay Language Framework

III.1. Malay SMS Corpus Collection Method

As a preliminary study in collecting a Malay SMS corpus, Malay SMS have been collected using the methods in Table I. The SMS collection methods are website, personal SMS forwarding, transcription and online form. The SMS contributors are respondents, a website, family and friends. After the SMS collection is done, all SMS are transcribed into Microsoft Office Excel 2007 for the tokenization, lemmatization and representation processes.

III.2. Tokenization

Tokenization is the process of dividing a sentence into words. Tokenization is done in order to count the words for the features selection and classification processes. Fig. 2 is an example of SMS tokenization. In total, 179 SMS have been tokenized, and the total number of words after tokenization is 21694.

Fig. 2. After SMS Tokenization Process

III.3. Lemmatization

Lemmatization is a process that groups words with the same meaning as well as removing redundancy and noise. However, SMS usually contain many abbreviated words, and it is difficult to group words of similar meaning across such variety. In Malay SMS abbreviation, for example, Thank You can be typed as TQ, thank Q, thanks or tengkiu. Thus, in this study, the occurrences of all words in the SMS are counted and such variants are identified as different words.

The counting of SMS word occurrences is done using JAVA programming to identify the unique words in the Malay SMS collection. There are 80? words in the SMS after lemmatization.

III.4. Features Selection

SMS representation in this paper applies features based on previous studies. The features are Total words, Number of Character bi-grams, Number of Character tri-grams, Average number of length, and Average number of word.

For this study, additional features based on spam and phishing characteristics are also included: Advertisement or announcement, Contest, Malicious URL, Telephone Number, Winning or Free gift, SMS asking for help to get money, and SMS asking to respond or subscribe to services.

III.5. Classification

Two classification processes are proposed in this framework. In the first, the raw data collection is classified into SMS ham and SMS scam. The result of classification process 1 shows high accuracy. After that, the second classification process classifies the SMS scam into SMS spam and phishing. Dual classification is done because this is the first study to propose a framework for detecting both SMS spam and phishing. Thus, to ensure good classification accuracy, this dual classification has been proposed, and the results are discussed in the next section.

IV. Analysis and Findings

An experiment has been done to examine 179 SMS of the ham, spam and phishing classes using WEKA, a data mining tool, to test the classification accuracy, true positive and true false on the Malay spam and phishing corpus using Naive Bayes and J48. These techniques have been applied to test the classification accuracy rate because they are among the well-known supervised methods in machine learning.

Table II is the classification result between 41 SMS ham and 82 SMS scam. The result shows that the accuracy of Naive Bayes and J48 is 100%. Table III is the classification result for the scam SMS, classified into 41 SMS phishing and 41 SMS spam. The result also shows that Naive Bayes and J48 reach 100%. The final result for the ternary classification of 41 SMS ham, 41 SMS phishing and 41 SMS spam shows 100% accuracy.
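The dual classification of Section III.5 can be sketched as a two-stage cascade: stage 1 separates ham from scam, and stage 2 splits scam into spam and phishing. The Python sketch below is only an illustration under stated assumptions. It swaps in a small bag-of-words multinomial Naive Bayes written from scratch, whereas the paper uses WEKA's Naive Bayes and J48 on its 14 features. The class and function names and the six training messages are hypothetical and are not taken from the paper's corpus.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over word counts with add-one smoothing."""

    def fit(self, messages, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        self.vocab = set()
        for text, label in zip(messages, labels):
            tokens = text.lower().split()
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_dual(messages, labels):
    """Train stage 1 (ham vs scam) and stage 2 (spam vs phishing)."""
    stage1_labels = ["ham" if y == "ham" else "scam" for y in labels]
    stage1 = NaiveBayes().fit(messages, stage1_labels)
    scam = [(m, y) for m, y in zip(messages, labels) if y != "ham"]
    stage2 = NaiveBayes().fit([m for m, _ in scam], [y for _, y in scam])
    return stage1, stage2


def classify_dual(stage1, stage2, text):
    """Classification 1 first; only scam SMS reach classification 2."""
    if stage1.predict(text) == "ham":
        return "ham"
    return stage2.predict(text)


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical toy messages, invented for illustration only.
MESSAGES = [
    "jumpa di kedai makan malam ini",
    "mak suruh balik rumah awal",
    "promosi hebat diskaun baju raya beli sekarang",
    "langgan nada dering baru taip ON hantar",
    "tahniah anda menang wang tunai sila beri nombor akaun bank",
    "akaun bank anda disekat sila hantar nombor pin",
]
LABELS = ["ham", "ham", "spam", "spam", "phishing", "phishing"]

stage1, stage2 = train_dual(MESSAGES, LABELS)
predictions = [classify_dual(stage1, stage2, m) for m in MESSAGES]
train_accuracy = accuracy(predictions, LABELS)
```

Accuracy measured on the same messages used for training, as in this toy run, says little about performance on unseen SMS; a held-out test set or cross-validation would be needed for a meaningful estimate.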
[24] M. Zic Fuchs and N. Tudman Vukovic, "Communication technologies and their influence on language: Reshuffling tenses in Croatian SMS text messaging," Jezikoslovlje, pp. 109-122, 2008.
[25] D. Gibbon and M. Kul, "Economy Strategies in Restricted Communication Channels. A study of Polish short text messages," 2008.
[26] A. Deumert and S. Oscar Masinyana, "Mobile language choices - The use of English and isiXhosa in text messages (SMS): Evidence from a bilingual South African sample," English World-Wide, vol. 29, pp. 117-147, 2008.
[27] I. Hutchby and V. Tanna, "Aspects of sequential organization in text message exchange," Discourse & Communication, vol. 2, pp. 143-164, 2008.
[28] J. Walkowska, "Gathering and Analysis of a Corpus of Polish SMS Dialogues," Challenging Problems of Science. Computer Science. Recent Advances in Intelligent Information Systems, pp. 145-157, 2009.
[29] C. Tagg, "A Corpus Linguistics Study of SMS Text Messaging," Doctor of Philosophy, Department of English, The University of Birmingham, Birmingham, 2009.
[30] F. W. Elvis, "The sociolinguistics of mobile phone sms usage in cameroon and nigeria," The International Journal of Language Society and Culture, vol. 28, pp. 25-40, 2009.
[31] S. N. Barasa, Language, mobile phones and internet: a study of SMS texting, email, IM and SNS chats in computer mediated communication (CMC) in Kenya, 2010.
[32] A. B. Bodomo, "The Grammar of Mobile Phone Written Language," Chapter, vol. 7, pp. 110-198, 2010.
[33] W. Liu and T. Wang, "Index-based online text classification for sms spam filtering," Journal of Computers, vol. 5, pp. 844-851, 2010.
[34] S. Sotillo, "SMS Texting Practices and Communicative Intention," Chapter, vol. 16, pp. 252-265, 2010.
[35] C. Dürscheid and E. Stark, "SMS4science: An international corpus-based texting project and the specific challenges for multilingual Switzerland," Digital Discourse: Language in the New Media, p. 299, 2011.
[36] K. V. Lexander, "Names U ma puce: multilingual texting in Senegal," Working paper, 2011.
[37] J. Elizondo, "Not 2 Cryptic 2 DCode: Paralinguistic Restitution, Deletion, and Nonstandard Orthography in Text Messages," Ph.D. thesis, Swarthmore College, 2011.
[38] T. Chen and M.-Y. Kan, "Creating a live, public short message service corpus: the NUS SMS corpus," Language Resources and Evaluation, vol. 47, pp. 299-335, 2013.
[39] O. Salem, et al., "Awareness Program and AI based Tool to Reduce Risk of Phishing Attacks," in Computer and Information Technology (CIT), 2010 IEEE 10th International Conference on, 2010, pp. 1418-1423.
[40] Q. Xu, et al., "SMS Spam Detection using Contentless Features," Intelligent Systems, IEEE, vol. PP, pp. 1-1, 2012.
[41] J. W. Yoon, et al., "Hybrid spam filtering for mobile communication," Computers & Security, vol. 29, pp. 446-459, 2010.
[42] H. Peizhou, et al., "A Novel Method for Filtering Group Sending Short Message Spam," in Convergence and Hybrid Information Technology, 2008. ICHIT '08. International Conference on, 2008, pp. 60-65.
[43] G. V. Cormack, et al., "Content based SMS spam filtering," presented at the Proceedings of the 2006 ACM symposium on Document engineering, Amsterdam, The Netherlands, 2006.
[44] M. Taufiq Nuruzzaman, et al., "Simple SMS spam filtering on independent mobile phone," Security and Communication Networks, vol. 5, pp. 1209-1220, 2012.
[45] G. V. Cormack, et al., "Feature engineering for mobile (SMS) spam filtering," presented at the Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, Amsterdam, The Netherlands, 2007.
[46] K. Yadav, et al., "SMSAssassin: crowdsourcing driven mobile-based system for SMS spam filtering," presented at the Proceedings of the 12th Workshop on Mobile Computing Systems and Applications, Phoenix, Arizona, 2011.
[47] Y. C. Lim, et al., "Application of Genetic Algorithm in unit selection for Malay speech synthesis system," Expert Systems with Applications, vol. 39, pp. 5376-5383, 2012.
[48] F. S. Tsai, et al., "Multilingual novelty detection," Expert Systems with Applications, vol. 38, pp. 652-658, 2011.
[49] T. Subramaniam, et al., "Naive Bayesian Anti-spam Filtering Technique for Malay Language."
[50] T. S. Guzella and W. M. Caminhas, "A review of machine learning approaches to spam filtering," Expert Systems with Applications, vol. 36, pp. 10206-10222, 2009.
[51] M. Z. Rafique, et al., "Application of evolutionary algorithms in detecting SMS spam at access layer," presented at the Proceedings of the 13th annual conference on Genetic and evolutionary computation, Dublin, Ireland, 2011.
[52] M. Z. Rafique and M. Farooq, "SMS Spam Detection By Operating On Byte-Level Distributions Using Hidden Markov Models (HMMs)," presented at the Virus Bulletin Conference September 2010, 2010.
[53] G. Yan, et al., "SMS-Watchdog: Profiling Social Behaviors of SMS Users for Anomaly Detection," in Recent Advances in Intrusion Detection, vol. 5758, E. Kirda et al., Eds., Springer Berlin Heidelberg, 2009, pp. 202-223.
[54] J. M. G. Hidalgo, et al., "Content based SMS spam filtering," presented at the Proceedings of the 2006 ACM symposium on Document engineering, Amsterdam, The Netherlands, 2006.
[55] Y. Xiang, et al., "Filtering mobile spam by support vector machine," presented at the Conference on Computer Sciences, Software Engineering, Information Technology, E-Business and Applications (3rd: 2004: Cairo, Egypt), Cairo, Egypt, 2004.
[56] C. Jie, et al., "Spam Filter for Short Messages Using Winnow," in Advanced Language Processing and Web Information Technology, 2008. ALPIT '08. International Conference on, 2008, pp. 454-459.
[57] K. Yadav, et al., "Take Control of Your SMSes: Designing an Usable Spam SMS Filtering System," in Mobile Data Management (MDM), 2012 IEEE 13th International Conference on, 2012, pp. 352-355.
[58] W. Ningning, et al., "Real-time monitoring and filtering system for mobile SMS," in Industrial Electronics and Applications, 2008. ICIEA 2008. 3rd IEEE Conference on, 2008, pp. 1319-1324.
[59] J. Huang, et al., "A Bayesian Approach for Text Filter on 3G Network," in Wireless Communications Networking and Mobile Computing (WiCOM), 2010 6th International Conference on, 2010, pp. 1-5.
[60] H. Najadat, et al., "Mobile SMS Spam Filtering based on Mixing Classifiers."
[61] T. M. Mahmoud and A. M. Mahfouz, "SMS Spam Filtering Technique Based on Artificial Immune System," IJCSI International Journal of Computer Science Issues, vol. 9, 2012.
[62] T. Chaminda, et al., "Content based hybrid sms spam filtering system," 2014.
[63] N. Saxena and N. S. Chaudhari, "SecureSMS: A secure SMS protocol for VAS and other applications," Journal of Systems and Software, vol. 90, pp. 138-150, 2014.
[64] G. C. C. F. Pereira, et al., "SMSCrypto: A lightweight cryptographic framework for secure SMS transmission," Journal of Systems and Software, vol. 86, pp. 698-706, 2013.
[65] J. Choi and H. Kim, "A Novel Approach for SMS security," International Journal of Security & Its Applications, vol. 6, 2012.

Authors' information

Cik Feresa Mohd Foozy is currently working with Universiti Tun Hussein Onn Malaysia (UTHM), Malaysia. Feresa holds a Master's degree in Computer Science (Information Security) from Universiti Teknologi Malaysia, Malaysia and a Bachelor's degree in Information Technology and Multimedia from Universiti Tun Hussein Onn Malaysia (UTHM), Malaysia. She is currently pursuing her PhD at the Universiti Teknikal Malaysia Melaka, Malaysia.