Solution: March 2018
.Net library for SMS spam detection using machine learning: A cross platform
solution
All content following this page was uploaded by Syed Sarmad Ali on 08 January 2019.
Abstract—Short Message Service (SMS) is nowadays the most used way of communication in the electronic world. While much research exists on email spam detection, far less is known about spam sent over SMS. This might be because the frequency of spam in short messages is much lower than in email. This paper presents different ways of analyzing SMS spam and a new pre-processing approach for obtaining the actual dataset of spam messages. This dataset was then used with different algorithms to find the best-performing one in terms of both accuracy and recall. The Random Forest algorithm was then implemented in a real-world application library written in C# for cross-platform .Net development. This library is capable of using a prebuilt model to classify a new dataset into spam and ham.

Keywords—spam filter; SMS; detection; machine learning; classification; clustering; algorithms; C# library; online detection

I. INTRODUCTION

Short Message Service (SMS) now accounts for the majority of our communication and has become an essential part of daily human activity. According to [1], SMS itself has become a multi-billion industry. It is now a matter of seconds for anyone to connect with others using SMS. With recent advancements in technology and strong competition among cellular companies, the cost of sending an SMS has dropped to almost nothing. Cellular packages now offer close to unlimited SMS and the ability to send messages worldwide at low cost. Along with these improvements, the short message service has also come to be used for marketing and other unwanted purposes. To keep the quality of this service in check, proper steps must be taken to prevent spam. Spam can be described as unwanted content that is sent in bulk to a large number of users. The purpose of spam is either to draw users toward a specific marketing scheme or simply to scam them.

Even today the quantity of spam SMS is much lower than that of spam email, but there is still enough of it to cause misleading usage of the service. For 2010-2012, it is reported [2] that about 90% of emails worldwide were spam, while the figure is much lower for SMS. In Asia, about 30% of all messages were spam [2]. Because the share is comparatively small, there has been more progress in catching and blocking email spam, and only a few studies address the same problem for SMS.

With the recent growth in mobile users driven by advances in smartphones, the popularity of SMS has increased. This has led various groups to create tools and techniques for spamming users' mobile phones in order to get their desired output. This work aims to build a better understanding of how machine learning algorithms can sort out and filter spam. There was a lack of a proper dataset as well as a lack of studies on different algorithms and clustering techniques for this specific problem.

In this research paper, a new tool is created based on the .Net framework. The resulting tool is a cross-platform library project that can take an already normalized dataset, map it onto its internal model, and show on real examples which messages are ham and which are spam.

II. LITERATURE STUDY

In 2013, Houshmand [1] applied different machine learning algorithms to the SMS spam classification problem, then analyzed and compared the results to understand how SMS spam can be filtered. The author used the database from the UCI Machine Learning Repository, described in [3][4]. It includes a subset of arbitrarily selected ham SMS messages. Each entry in the dataset consists of the message label followed by the message string. Methods such as SVM and Naïve Bayes were applied to the samples, which were first pre-processed and then passed through feature extraction. Finally, the best classifier was compared against the dataset discussed in [4]. Matlab was used for feature extraction and analysis of the data, and the different algorithms were then applied using the Python scikit-learn library.

In 2011, Tiago et al. [5] studied this issue, combining several smaller datasets with their own study to create a better dataset for academic research. They created a new collection of 4827 ham and 757 spam SMS messages and donated this dataset to the community for further analysis. This was a remarkable step towards finding a solution to stop spam.

In 2012, Coskun and Giura, from a research institution in New York City, performed an experiment [6] to classify the spam-ham dataset using a similarity measure. They created an algorithm capable of performing a block-match analysis on a stream of messages to find similarity among them. Their hypothesis was that if many messages are similar to previously sent messages, then the stream is essentially made up of spam and should be blocked. It was in fact a smart, independent way of classifying spam without using any prior knowledge base. They used an internal algorithm called the Counting Bloom Filter, which was capable of finding true similarity. Another interesting thing about their
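
To make the Counting Bloom Filter idea mentioned above more concrete, the following is a minimal Python sketch. It is not the implementation from [6]: the class name, the table size, the number of hash functions, the SHA-256 based hashing and the `normalize` helper are all assumptions chosen for illustration. A counting Bloom filter stores small counters instead of single bits, so the minimum counter over a message's hash positions gives an estimate (never an underestimate) of how often similar content has already been seen.

```python
import hashlib


class CountingBloomFilter:
    """Approximate multiset: counters instead of the bits of a plain Bloom filter."""

    def __init__(self, size=1024, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.counters = [0] * size

    def _positions(self, item):
        # Derive num_hashes positions by salting SHA-256 (an illustrative choice).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.counters[pos] += 1

    def count(self, item):
        # Hash collisions can only inflate counters, so this never underestimates.
        return min(self.counters[pos] for pos in self._positions(item))


def normalize(message):
    # Hypothetical normalization: lowercase and collapse whitespace.
    return " ".join(message.lower().split())


cbf = CountingBloomFilter()
stream = [
    "WIN a FREE prize now",
    "win a free   prize now",
    "Are we still meeting at 5?",
    "Win a free prize NOW",
]
for msg in stream:
    cbf.add(normalize(msg))

repeated = cbf.count(normalize("win a free prize now"))
unique = cbf.count(normalize("Are we still meeting at 5?"))
```

In this toy stream `repeated` is at least 3 and `unique` is at least 1. A filter in the spirit of the hypothesis above would flag content whose estimated count exceeds some threshold; the threshold itself would have to be tuned and is not specified here.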
To better test this significance, Friedman's test was performed on the eight classifiers. Because RF was ranked 1st in all of the 5 folds, it was easy for the test to establish the significance of these differences. Our null hypothesis, that there is no difference among the classifiers, was rejected by Friedman's test, indicating that there is a significant difference among them. A subsequent Wilcoxon test showed that Random Forest (RF) performed the best in terms of F-Measure ranking and had the highest accuracy and recall values.

Reason for Random Forest Success - Analysis

The Random Forest classifier may have been so successful in our experiments because, as [10] suggests, RF is recognized as an effective classifier when it comes to estimating which variables are significant in the classification. RF is also equipped to balance error in data sets with unbalanced class populations.

DeBarr, D. et al. [11] proposed that Random Forest has been successful because the strengths of the technique include feature selection and consideration of numerous feature subsets.

B. Clustering Algorithms

After testing the different classification approaches, another interesting step was to study the different clustering algorithms available to see if any of them have any

Fig 4: Result of X-Mean Algorithm

The Farthest First algorithm, on the other hand, made both clusters a combination of both classes.

Fig 5: Result of the Farthest First

This suggested that, in terms of clustering, it might not be possible to separate the two classes, as both have similar attributes and both are derived from a string attribute. There is no specific difference among
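
As an illustration of the Farthest First result discussed above, here is a minimal Python sketch of farthest-first traversal followed by a cross-tabulation of clusters against known classes. It is not the tool or configuration used in the experiments: the Euclidean distance, the fixed first center, and the toy two-dimensional feature vectors and labels are all assumptions made for the example (Python 3.8+ is needed for `math.dist`).

```python
import math


def farthest_first(points, k, first_index=0):
    """Pick k centers by farthest-first traversal, then assign points to the nearest center."""
    centers = [points[first_index]]
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        # The next center is the point farthest from all current centers.
        idx = max(range(len(points)), key=lambda i: nearest[i])
        centers.append(points[idx])
        nearest = [min(d, math.dist(p, points[idx])) for d, p in zip(nearest, points)]
    labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
    return centers, labels


# Hypothetical feature vectors (message length, digit count); not data from the paper.
points = [(20, 0), (25, 1), (150, 12), (30, 0), (22, 9), (140, 2)]
true_classes = ["ham", "ham", "spam", "ham", "spam", "ham"]

centers, labels = farthest_first(points, k=2)

# Cross-tabulate clusters against the known classes to see how mixed each cluster is.
table = {}
for label, cls in zip(labels, true_classes):
    table.setdefault(label, {"ham": 0, "spam": 0})[cls] += 1
```

On this toy input both clusters end up containing ham and spam, which is the kind of mixing the figures report. With real message features the outcome depends on how the string attribute is turned into numbers, which this sketch does not attempt to reproduce.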
REFERENCES
[1] Mehar, H.S. 2013. SMS Spam Detection using Machine Learning Approach. International Journal of Information Security Science 2.
[2] Wikipedia, Mobile phone spam, https://en.wikipedia.org/wiki/Mobile_phone_spam
[3] SMS Spam Collection Data Set from UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
[4] SMS Spam Collection v.1, http://www.dt.fee.unicamp.br/~tiago/smsspamcollection
[5] Tiago A. Almeida, José María G. Hidalgo, and Akebo Yamakami. 2011.