Email Spam Filtering ITS Repository 5216201701-Master - Thesis
SUPERVISOR
Dr. Ir. Aris Tjahyanto, M.Kom.
POSTGRADUATE PROGRAM
DEPARTMENT OF INFORMATION SYSTEM
FACULTY OF INFORMATION AND COMMUNICATION TECHNOLOGY
INSTITUT TEKNOLOGI SEPULUH NOPEMBER
SURABAYA
2018
APPROVAL SHEET
A COMPARISON OF MACHINE LEARNING TECHNIQUES:
E-MAIL SPAM FILTERING FROM COMBINED SWAHILI
AND ENGLISH EMAIL MESSAGES
ABSTRACT
Technology is changing faster now than it did ten to fifteen years ago. It changes the way people live and pushes them to adopt the latest devices to keep pace. In today's communications, the use of electronic mail (e-mail) is unavoidable for people who want to communicate with friends, companies, or even universities. This makes e-mail the prime target of spammers, hackers, and other malicious actors who seek to profit by sending spam emails. Reports show that more than 10 billion emails can be sent through the Internet in a single day, and about 45% of them are spam; the figure is not constant and sometimes rises even higher. This clearly indicates the magnitude of the problem and calls for greater efforts to reduce the volume of spam and minimize its effects.
Various measures have been taken to eliminate this problem. People once relied on social methods, that is, legislative means of control; now they use technological methods, which are more effective and timelier at catching spam because they analyze message content. In this paper we compare the performance of machine learning algorithms through experiments on an English-language dataset, a Swahili-language dataset, and a combined dataset formed from the two; the results on the combined dataset are also compared with the Gmail classifier. The classifiers the researcher used are Naïve Bayes (NB), Sequential Minimal Optimization (SMO), and k-Nearest Neighbour (k-NN).
The results for the combined dataset show that the SMO classifier leads the others with 98.60% accuracy, followed by the k-NN classifier with 97.20% and the Naïve Bayes classifier with 92.89%. From this result the researcher concludes that SMO works best on a dataset that combines the English and Swahili languages. On the English dataset, SMO again leads the other algorithms with 97.51% accuracy, followed by k-NN with an average accuracy of 93.52% and, last but still with good accuracy, Naïve Bayes at 87.78%. On the Swahili dataset, Naïve Bayes leads the others with 99.12% accuracy, followed by SMO with 98.69% and k-NN with 98.47%.
Keywords: Swahili, Gmail, Classifier, email, Naïve Bayes, SMO, k-NN
A COMPARISON OF MACHINE LEARNING TECHNIQUES:
E-MAIL SPAM FILTERING FROM COMBINED SWAHILI
AND ENGLISH E-MAIL MESSAGES
ABSTRAK
87.78%. On the Swahili dataset, Naïve Bayes performed better than the others with 99.12% accuracy, followed by SMO with 98.69% accuracy, and finally k-NN with 98.47% accuracy.
Keywords: Swahili, Gmail, Classifier, email, Naïve Bayes, SMO, k-NN
DEDICATION
I would like to dedicate my thesis to my beloved mother, Mrs.
Maimuna Omar Ali, my wife Ashura, and my daughter
Ibtisam, for supporting me throughout the time I have been
away from my country.
ACKNOWLEDGEMENTS
I would like to say ALHAMDULILLAH for the ability granted to me to work on
this thesis.
I would like to express my sincere thanks and gratitude to my supervisor Dr. Ir.
Aris Tjahyanto whose invaluable assistance and guidance made the completion of
this thesis possible.
My most sincere thanks go to my mother for her encouragement and moral
support, and to my wife, who, with support from her family, courageously
endured the burden of looking after our daughter during my absence.
Thanks are also due to my friends at the Institute for making my stay in Indonesia
enjoyable.
TABLE OF CONTENTS
APPROVAL SHEET ......................................................................................................... i
ABSTRACT ...................................................................................................................... iii
ABSTRAK ......................................................................................................................... v
DEDICATION................................................................................................................. vii
ACKNOWLEDGEMENTS ............................................................................................ ix
TABLE OF CONTENTS ................................................................................................ xi
LIST OF PICTURES ...................................................................................................... xv
LIST OF TABLES ........................................................................................ xvii
CHAPTER 1 INTRODUCTION ..................................................................................... 1
1.1 Background ......................................................................................................... 1
1.2 Problem Formulation .......................................................................................... 5
1.3 Research Objectives ........................................................................................... 6
1.3.1 General Objective ............................................................................................... 6
1.3.2 Specific Objective............................................................................................... 6
1.4 Contribution ........................................................................................................ 6
1.5 Benefits of research ............................................................................................ 7
1.5.1 Theoretical benefits ............................................................................................ 7
1.5.2 Practical benefits................................................................................................. 7
1.6 Limitations of Research ...................................................................................... 7
CHAPTER 2 LITERATURE REVIEW ......................................................................... 9
2.1 Application of text categorization ...................................................................... 9
2.2 Email group (Spam and Non-Spam)................................................................... 9
2.3 Swahili Language ............................................................................................. 10
2.4 Spammer motivations ....................................................................................... 12
2.5 The Damage Caused by Spam .......................................................................... 13
2.6 Fighting Spammer approaches ......................................................................... 14
2.7 Classification Techniques ................................................................................. 15
2.7.1 Whitelist and Blacklist...................................................................................... 15
2.8 Text Classification Algorithms ......................................................................... 16
2.8.1 Naive Bayes Classifier...................................................................................... 16
2.8.2 Support Vector Machines ................................................................................. 17
2.8.3 Sequential Minimal Optimization..................................................................... 18
2.8.4 K-Nearest Neighbors ........................................................................................ 18
2.8.5 Gmail Filter ....................................................................................................... 19
2.9 Text categorization approaches......................................................................... 19
2.9.1 Machine learning .............................................................................................. 20
2.9.2 Waikato Environment for Knowledge Analysis (Weka) .................................. 20
2.10 Comparison of Spam Filtering .......................................................................... 20
2.11 Dimensionality Reduction ................................................................................ 21
2.11.1 Feature Extraction ............................................................................................. 22
2.11.2 Feature Selection............................................................................................... 24
2.11.2.1 Supervised (Wrapper method) .......................................................................... 25
2.11.2.2 Unsupervised (Filter method) ........................................................................... 26
2.12 Evaluation Measures ......................................................................................... 26
CHAPTER 3 RESEARCH METHODOLOGY ........................................................... 29
3.1. Literature Review.............................................................................................. 29
3.2. Data Collection ................................................................................................. 30
3.3. Creation of the Dataset...................................................................................... 31
3.4. Data Processing................................................................................................. 31
3.4.1. Feature Extraction ............................................................................................. 32
3.4.2. Training and testing data ................................................................................... 33
3.4.3. Classification .................................................................................................... 33
3.5. Result and Evaluation ....................................................................................... 34
3.5.1. Performance Measures ...................................................................................... 34
3.5.2. Comparison among Classifiers ......................................................................... 34
CHAPTER 4 PRELIMINARY PROCESSES .............................................................. 35
4.1. Data Collection ................................................................................................. 35
4.2. Creation of the Dataset...................................................................................... 35
4.3. Data Processing................................................................................................. 37
CHAPTER 5 RESULTS AND EVALUATION ........................................................... 39
5.1. The Results ....................................................................................................... 39
5.1.1. English dataset .................................................................................................. 40
5.1.1.1. Performance Measures for English Dataset ...................................................... 40
5.1.2. Swahili dataset .................................................................................................. 41
5.1.2.1. Performance Measures ...................................................................................... 41
5.1.3. Combined English and Swahili Dataset ............................................................ 43
5.1.3.1. Performance Measures ...................................................................................... 43
5.2. Increase Classifiers’ Accuracy ......................................................................... 44
5.3. Evaluation ......................................................................................................... 48
CHAPTER 6 CONCLUSIONS AND RECOMMENDATIONS ................................ 53
6.1. CONCLUSIONS .............................................................................................. 53
6.2. RECOMMENDATIONS .................................................................................. 54
References ........................................................................................................................ 55
THE AUTHOR’S BIOGRAPHIES ............................................................................... 61
LIST OF PICTURES
Picture 2.1: Email filtering ............................................................................... 15
LIST OF TABLES
CHAPTER 1
INTRODUCTION
1.1 Background
Technology is changing faster now than it did ten to fifteen years ago. It changes the way people live and pushes them to adopt the latest devices to keep pace with technological advancement. In today's communications, the use of electronic mail (e-mail) is unavoidable for people who want to communicate with friends, companies, or even universities. Traditional mail has many weaknesses, including the number of days it takes to be delivered, the lack of a means of confirming delivery, the time the sender must wait for a reply, and its general unreliability. The importance of e-mail as the main means of communication therefore cannot be underestimated. This also makes e-mail the prime target of malicious actors, especially hackers, who seek to profit from email users.
Many agencies have tried to investigate this issue in order to know how many email accounts are affected and how the number increases, so that future growth can be predicted. THE RADICATI GROUP, INC., in the summary of its Email Statistics Report, 2015-2019, reported: "we expect the total number of email accounts worldwide to increase from nearly 2.6 billion in 2015 to over 2.9 billion by the end of 2019. This represents an average annual growth rate of about 3% over the next four years." The volume of email messages was also expected to increase through the end of 2017 (N. Pérez-Díaz et al. 2016). Statistics on total email sent and received show an increase each year; in 2015 there was an average increase of 538.1 million messages per day, an increase of 5% since 2010. In general, email traffic is expected to keep increasing in the coming years, with business email specifically expected to grow to over 132 billion messages sent and received per day by the end of 2017.
As mentioned in the first paragraph, the use of electronic mail has increased greatly, as people can now send and receive emails wherever they are, be it at home or travelling around the globe. This has been made possible by the use of smartphones. Those who use smartphones running the Android operating system, for example, are required to have at least one Gmail account, which gives them access to other facilities, including downloading applications to their phones. Because they are pushed into having an account, some owners use email without proper prior knowledge of good usage and of the dangers of using such services inappropriately. The possibility of being among the victims is more than 70%, because they will use their accounts carelessly, opening emails that contain viruses that can crash their phones or submit their personal information to malicious actors who misuse it. Nowadays there are social engineers and spammers who pretend to be someone you know and ask users to follow a link that leads to their website (a phishing site).
This problem can be reduced, if not solved, by a classifier that separates emails into different folders. In email classification a message can be placed into one of two groups: legitimate mail, also called non-spam, meaning the message is harmless; and bulk or unsolicited email, known as spam. Unsolicited messages are normally distributed using bulk mailers and address lists harvested from web pages or newsgroup archives. The message content varies significantly, from vacation advertisements to get-rich-quick schemes; some advertise products such as Viagra, while others come from service companies.
The common feature of these messages is that they are usually of little interest to the majority of recipients. In some cases they may even be harmful to the recipient; some spam messages advertise pornographic sites with no check on the recipient's age, so they can be sent to and read by children. Harmless spam merely wastes time and bandwidth, especially for those who use dial-up connections. Beyond this, spam email also costs users money.
Reports from spamlaws.com state that spammers all over the world send 14.5 billion messages daily, which is almost 45% of all email sent. Some research companies estimate that spam makes up an even greater portion of global email, with some reports putting the figure at up to 73%. The country ranked first among senders and recipients of spam or unwanted email is the United States, followed closely by South Korea; these countries are the largest distributors of spam messages in the world. The reports also show that advertising-related spam leads all other types of email, accounting for approximately 36% of all spam messages. The second most common category covers adult-related subjects and makes up roughly 31.7% of all spam. Unwanted emails related to financial matters are the third most common form, at 26.5%. Surprisingly, scams and fraud comprise only 2.5% of all spam email; however, identity theft, known as phishing, makes up 73% of that 2.5%, the remainder being shared among others such as botnets.
According to the Anti-Phishing Working Group (APWG) reports for the fourth quarter of 2015 and the fourth quarter of 2016, email phishing volume is not fixed: it rises in some months and falls in others, making it unpredictable, although in general email phishing remains a big and challenging threat. The number of unique phishing emails reported to APWG by consumers in the fourth quarter of 2015 was 48,114 in October, 44,575 in November, and 65,885 in December; the total for the quarter was 158,574 phishing attacks, representing an increase of over 21,000 phishing sites detected during the holiday season. In the fourth quarter of 2016, the record was 51,153 for October, 64,324 for November, and 95,555 for December.
Protection is needed in order to reduce the damage caused by spam emails. Spammers work hard at organizing criminal activities: illegal trafficking of goods and services, stock market fraud, wire fraud, identity theft, and hijacking of computers. This is very costly for businesses that need to respond to their customers' requests (Thiago S. Guzella, 2009). The cost of spam in terms of lost productivity has reached USD 21.58 billion per year in the USA and USD 50 billion worldwide (Tiago A. 2011). Individuals incur about 10% of the cost of spam email according to (Ion Androutsopoulos 2000), including the bandwidth wasted on dial-up connections.
Due to the seriousness of the issue at hand, much research has been done using datasets in the English, Arabic, and Chinese languages. It is unfortunate, though, that no research has been done using the Swahili language. Swahili (Kiswahili) is widely spoken in all the East African countries: Tanzania, Kenya, and Uganda. Countries such as Rwanda and Burundi, which have recently joined the East African Community, as well as other neighbouring countries including the Democratic Republic of Congo, Malawi, and Mozambique, have also started using the language to ease communication, especially in trade. Swahili is very complex in terms of structure; the addition of suffixes can make a single verb a complete sentence ("anakimbilia mpira", he is running for the ball). The applied suffix on Swahili verbs has long posed an analytical problem; the basic meaning of this suffix has to do with "directing the action against something" (Port, R. F. 1981). Negation in Swahili also differs from English; this part is very complicated because the language has no specific words or fixed word positions for negation (Contini-Morava, E. 2012). This makes it difficult to know whether messages sent by email in Swahili are spam or not, in contrast to messages sent in English, whose vocabulary is largely standard.
Swahili speakers use technology and much research exists about them, but until now no research on email spam has been conducted using the Swahili language. This makes it nearly impossible to obtain a dataset written in Swahili, which forced us to create our own Swahili-language dataset for use in this research. There are some challenges, though; the first is the time needed to collect emails written in the Swahili language.
Classifier algorithms help reduce the impact of spam. In this thesis we examine the Google Mail (Gmail) classifier to see how accurate it is at detecting spam emails, because it has been found that, although it works well, it has some weaknesses. This led to the decision to tackle the problem in this thesis. To begin with, we went through many @gmail.com email addresses, and two more accounts were created specifically for use in this thesis. It was found that some spam emails were actually in the inbox when they should have been in the junk/spam folder. It was also observed that some non-spam emails were in the spam folder, and that the names of the email addresses sometimes confused the Gmail classifier, causing emails to be placed in the wrong folders. At a later stage, the performance on the emails received in one of our Gmail accounts was calculated manually using a confusion matrix. The result showed that 86.26% of the emails were correctly classified, while 13.74% were wrongly classified. There is a possibility of improving this by two to four percent.
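For reference, accuracy is computed from the confusion matrix as the share of correctly classified messages. The sketch below uses hypothetical counts, since the raw tallies from the Gmail check are not listed here; only the formula matters.

```python
# Accuracy and error rate from a 2x2 confusion matrix.
# The four counts below are hypothetical stand-ins.
tp = 120   # spam correctly placed in the spam folder (true positives)
tn = 742   # non-spam correctly left in the inbox (true negatives)
fp = 58    # non-spam wrongly sent to spam (false positives)
fn = 79    # spam wrongly left in the inbox (false negatives)

accuracy = (tp + tn) / (tp + tn + fp + fn)   # share classified correctly
error_rate = 1 - accuracy                    # share classified wrongly
print(round(accuracy * 100, 2), round(error_rate * 100, 2))  # 86.29 13.71
```

The thesis's 86.26% correct / 13.74% wrong figures are exactly this pair of quantities computed on the real Gmail counts.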
The solution to this problem includes selecting classifiers that meet the researcher's expectations. In recent years much research has been done on text categorization, and many methods have been suggested. Among them, Naïve Bayes was singled out as an effective method for automatically constructing anti-spam filters with good performance (Ion Androutsopoulos 2000), (Sahami 1998), (Daelemans et al. 1999), (Koutsias, J., et al. 2000, July). All of these papers compared the Naïve Bayes algorithm with other algorithms, and NB gave the best results. Other research (Yu, B., & Xu, Z. B. 2008) shows that Naïve Bayes and Support Vector Machines both perform well. Using an Arabic dataset, the researchers (Hmeidi, I., et al. 2015) compared the classifiers Naïve Bayes, Support Vector Machine, Decision Tree, Decision Table, and k-Nearest Neighbour (k-NN); the results showed that the Support Vector Machine outperformed all the other classifiers. From the methods suggested by these researchers, the author chose three: Naïve Bayes, Support Vector Machine, and k-Nearest Neighbour.
The background of this thesis, supported by the research cited above, shows that spam emails are increasing.
1.2 Problem Formulation
The problem formulation is as follows:
1. How can features be extracted in such a way that the classifiers' work is simplified in order to increase accuracy?
2. What is the performance of a classifier when the dataset is a combination of two languages (Swahili and English)?
1.3 Research Objectives
In this research the objectives are divided into two parts: the main/general objective and the specific objectives for classification.
1.4 Contribution
Due to the seriousness of the issue at hand, much research has been done using English-language datasets (Zhang, L., Zhu, J., & Yao, T. 2014). The researchers tested three English datasets together with one Chinese dataset. Other researchers used the Arabic language for the same purpose: (El-Halees, A. 2009), (Hayati, P., & Potdar, V. 2008), (Al-Harbi, S., et al. 2008), (Khorsheed, M. S., & Al-Thubaity, A. O. 2013), and (Hmeidi, I., et al. 2015); in these five studies the tested algorithms worked on Arabic-language data. The Chinese language has also been used (Dong, J., Cao, H., Liu, P., & Ren, L. 2006, October). It is unfortunate, therefore, that no research has been done using the Swahili language, and no dataset of Swahili emails exists. The main contributions of this research are:
1. To build a dataset that combines two languages (Swahili and English).
2. To test the algorithms' performance on the English and Swahili datasets.
1.6 Limitations of Research
This research will test only one to three emails, and the dataset which the author created might be of a lower standard than datasets created by professionals who have been making datasets for many years.
CHAPTER 2
LITERATURE REVIEW
In this section we will focus more on spam than on non-spam, or legitimate, messages. A legitimate message can be defined as a message that comes from a source the recipient knows or expects to receive messages from. It creates no doubt because the recipient knows the sender(s), and the message itself contains no spam content. Sometimes this is not the case, because spammers use addresses the recipient knows to send spam messages. The main point is that a non-spam message must not be harmful; beyond that, nothing specific in the wording of its body, subject, or address sets it apart.
Unsolicited Bulk Email (UBE), or spam, consists of messages that mainly come from unknown sources, although they may sometimes come from a known sender address with different content. Unsolicited means the recipient is not expecting to receive the email. Bulk means the message is sent as part of a large number of messages, all having substantively identical content, at a reduced rate (Spamhaus Project 2017). Spam can come in the form of an advertisement, which does no direct harm because its only objective is to advertise and promote a product; this is considered a waste of time. On the other hand, there are spam messages that intend to damage the recipient or his/her network infrastructure. There are many groups of spam, among them phishing and social engineering; although these contribute only a small amount, they are very dangerous for email users.
sentence that consists of a single clause; a complex sentence that consists of one main clause and at least one subordinate clause, which obligatorily follows the main clause; and a compound sentence that consists of at least two main clauses joined by a coordinating conjunction. In terms of word order, Swahili has a fixed Subject-Verb-Object order at the sentence level: the subject comes first, before the verb and the object, as in "Ally anacheza mpira", meaning "Ally is playing football" (a-na-cheza, 'he-present-play').
Swahili verbs take suffixes that can make a single verb a complete sentence ("anakimbilia mpira", he is running for the ball); translated into English, the result can be a phrase of more than three words. The applied suffix on Swahili verbs has long posed an analytical problem; the basic meaning of this suffix has to do with "directing the action against something" (Port, R. F. 1981).
Example 1:
Example 2b:
a-li-ni-kat-ia
S(he)-Past-O(me)-cut-IE
In example 2b the subject and object pronouns are 'he/she' and 'me' respectively, and the verb is suffixed with -IE. While the meaning of the sentence is 'he cut meat for me', the -IE suffix actually adds the role of a beneficiary or indirect object, which is played by the first-person singular pronoun in the object prefix.
Nilikatiwa nyama: "I had meat cut for me (by him)"
Ni-ta-ku-poke-lea: S(I)-future(will)-Object(you)-accept
Ni-li-m-shindili-lia: S(I)-past-Object(him)-pack
Nilimshindililia majani: "I packed down the leaves for him"; the word 'shindilia' means to press down.
make a lot of money through advertisement on websites; Google AdSense™ is an organization that pays a lot of money for this. Spammers exploit the service by generating copied (synthetic) content and then monetizing it (earning revenue from it) through AdSense™. Some spammers raise the rank of their websites using search engine optimization techniques, which results in extra traffic and consequently more advertising revenue; when more users visit a website, the site gains credit and its rank increases. In promoting products and services, spammers are paid by the companies they work with to advertise their products. For the reasons mentioned above, some spam is not intended to harm the user or intrude on privacy or security; it merely wastes the recipient's bandwidth and time. Meanwhile, there are spammers motivated by stealing someone's confidential information, such as bank accounts, PINs, usernames, and passwords, or by aiming to destroy the network or keep it busy (phishing and botnets) (Hayati, P., & Potdar, V. 2008).
2.6 Fighting Spammer approaches
representation changes the messages into a format that a machine learning algorithm can use for classification (Thiago S. et al. 2009).
There are many classification techniques. Some were once used and are no longer functional, while others can still be applied. Many papers presented in various settings show that the most popular email classification techniques used in text classification include naïve Bayes, rule learners, and support vector machines. Most of these techniques examine the words, that is, the text in the message headers and body, to predict the message's folder classification (which folder the message belongs to).
concentrate on IP addresses without taking the email addresses into consideration. For other incoming messages from senders who do not appear on the lists, content-based filters can be applied in practice so that the two approaches complement one another (Lam, H. Y., & Yeung, D. Y. 2007).
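The complementary use of sender lists and a content-based filter can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the addresses, the keyword rule standing in for a learned classifier, and the folder labels are all hypothetical.

```python
# Two-stage filtering: whitelist/blacklist first, content filter as fallback.
WHITELIST = {"boss@company.example", "mom@family.example"}   # trusted senders
BLACKLIST = {"offers@spamhub.example"}                       # known spammers

def content_filter(body):
    # Stand-in for a trained classifier (NB, SMO, k-NN, ...):
    # a crude keyword rule, purely for illustration.
    spam_words = {"free", "winner", "viagra"}
    return "spam" if spam_words & set(body.lower().split()) else "ham"

def filter_email(sender, body):
    if sender in WHITELIST:
        return "ham"                 # list decides; content never inspected
    if sender in BLACKLIST:
        return "spam"
    return content_filter(body)      # lists are silent -> content decides

print(filter_email("boss@company.example", "free money"))          # ham
print(filter_email("stranger@x.example", "You are a WINNER"))      # spam
```

The key design point, as in the cited approach, is that list lookups are cheap and decisive, while the content filter only handles the senders the lists do not cover.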
dimensionality is high. The assumption of Naïve Bayes classifiers is that the effect of a value in a certain class is independent of the values of the other variables. Naïve Bayes computes the conditional likelihoods of the classes given the instance and chooses the class with the highest posterior. In a supervised learning setting, Naïve Bayes classifiers are capable and efficient to train. Bayes' rule is applied to documents and classes, where a class can be ham or spam in the email context (Tretyakov, K. 2004, May). Formula 2.1 shows that the probability P of a document d belonging to a given class c equals P(d | c) P(c) divided by P(d):

P(c | d) = P(d | c) P(c) / P(d)    (2.1)
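As a minimal sketch (toy data, not the thesis experiment), formula 2.1 can be applied to spam classification as follows: for each class c we compute P(c | d) ∝ P(d | c) P(c) in log space, with add-one (Laplace) smoothing for word likelihoods, and pick the class with the highest posterior.

```python
# Naive Bayes by direct application of Bayes' rule (formula 2.1).
# The training messages and labels here are invented for illustration.
from collections import Counter
import math

train = [
    ("win free money now", "spam"),
    ("free offer click here", "spam"),
    ("meeting schedule for today", "ham"),
    ("project report attached", "ham"),
]

# Priors P(c) and per-class word counts for estimating P(word | c)
priors = Counter(label for _, label in train)
words = {c: Counter() for c in priors}
for text, label in train:
    words[label].update(text.split())
vocab = {w for c in words for w in words[c]}

def posterior_log(c, doc):
    """log P(c) + sum of log P(w | c), with add-one smoothing."""
    total = sum(words[c].values())
    logp = math.log(priors[c] / len(train))
    for w in doc.split():
        logp += math.log((words[c][w] + 1) / (total + len(vocab)))
    return logp

def classify(doc):
    # argmax over classes of the (log) posterior, as in formula 2.1;
    # P(d) is constant across classes and can be ignored
    return max(priors, key=lambda c: posterior_log(c, doc))

print(classify("free money offer"))
print(classify("project meeting today"))
```

Note that P(d) in the denominator is omitted, since it is the same for every class and does not change the argmax.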
Picture 2.3: SVM classifier structure
explicit category representation. In k-NN, class membership is allocated to a vector rather than a vector being assigned to a specific class. This has some benefits, such as no arbitrary assignments being made by the classifier (Keller, J. M., et al. 1985).
Categorization is done when new documents have to be classified: comparable documents, that is, neighbours, are discovered. If those documents are assigned to a category, then the new document will also be assigned to that category. The nearest neighbours can be found easily and quickly using traditional indexing. To decide which group a message falls into, that is, whether it is legitimate or spam, we consider the classes of the messages that reside near it. We conclude by noting that the comparison of the vectors is a real-time process.
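The neighbourhood idea above can be sketched as follows (toy data, not the thesis setup): a new message is assigned the class that dominates among its k nearest labelled neighbours, using cosine similarity between bag-of-words vectors.

```python
# k-nearest-neighbour classification of a message by majority vote
# among its most similar labelled neighbours. Messages are invented.
from collections import Counter
import math

train = [
    ("win free money now", "spam"),
    ("free prize click here", "spam"),
    ("meeting agenda for today", "ham"),
    ("project report attached", "ham"),
]

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_classify(text, k=3):
    doc = Counter(text.split())
    # rank training messages by similarity to the new message
    ranked = sorted(train,
                    key=lambda ex: cosine(doc, Counter(ex[0].split())),
                    reverse=True)
    # majority vote among the k nearest neighbours
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

print(knn_classify("free money prize"))
```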
There are two main approaches in text categorization: the Knowledge Engineering approach, in which the expert's knowledge about the categories is directly encoded into the system, declaratively or in the form of procedural classification rules, and the Machine Learning approach (Konstantin Tretyakov 2004).
Weka is a data mining tool which can be used for classifying and clustering information. It is a pool of machine-learning algorithms, among them classification, regression, clustering, and association rules, used to complete data mining tasks. The interface can link with email information to gather the data for pre-processing, then generate the training and test sets and convert each set into ARFF format. We pass the training set to the Weka library to train the classifier and then test its effectiveness on the held-out set (Joachims, T. 1998).
have tried to write comparisons of methods and come out with many suggestions. Naïve Bayes has been singled out as an effective method for automatically constructing anti-spam filters with good performance. In his paper, (Ion Androutsopoulos 2000) compared two approaches: the Naïve Bayesian algorithm, which was also used in (Sahami 1998), and the memory-based approach of TiMBL (Daelemans et al. 1999). The results showed that both approaches achieved very high classification accuracy and precision.
Koutsias, J., et al. (2000, July) compared two classifiers, a Naïve Bayesian filter and a keyword-based anti-spam filter, and their results show that the Naïve Bayesian filter performed much better than the keyword-based one. The comparison of four classifiers in (Yu, B., & Xu, Z. B. 2008), namely Naïve Bayes (NB), Neural Network (NN), Support Vector Machine (SVM) and Relevance Vector Machine (RVM), shows that NN can achieve higher accuracy than symbolic classifiers, except that it needs extensive time for parameter selection and network training. The researchers in (Hmeidi, I., et al. 2015), using an Arabic dataset, compared the Naïve Bayes, Support Vector Machine, Decision Tree, Decision Table and K-Nearest Neighbor (KNN) classifiers; the results showed that SVM leaves all the other classifiers behind. Based on the references mentioned above, we prefer to select among them those classifiers that many researchers have identified as showing good performance when compared with others.
When dealing with textual data, some methods have been developed specifically for this area. These are divided into two groups: supervised and unsupervised methods. Supervised methods use a set of pre-classified documents and consider the labelling of the data; this means each text belongs to one of a limited number of classes and has a label which shows which class the text belongs to (Verbeek J. 2000), while unsupervised methods do not use labels. A feature is a group of attributes, in other words keywords, that capture important data characteristics in a dataset. In feature selection, if the features for classifying are properly selected, then the expected result will obviously be good; otherwise the result will not be as good as expected. This indicates that you must be careful with the selection method you choose.
tf-idf(t, d) = tf(t, d) × log(N / df(t))    (2.2)

where N is the total number of documents in the corpus.
The value of TF-IDF increases proportionally with the number of times a word appears in the document, but is offset by the frequency of the word in the corpus, which helps to adjust for the fact that some words appear many times and are more commonly used than others.
Term Frequency (TF)
Term frequency tf(t, d) refers to the term t in a document denoted by d; TF counts the number of times the word appears in the document.
Document frequency (DF)
Document frequency in this approach reflects the observation that rare terms are more informative than frequent terms. Terms like "increase, line, high" can be found in many documents; a document containing a rare term is considered more relevant than one containing only such common terms.
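As a minimal sketch (toy corpus, not the thesis data) of the weighting in formula 2.2: a word that is frequent in one document but rare across the corpus gets a high TF-IDF weight, while a word spread over many documents is damped.

```python
# TF-IDF computed directly from its definition:
# tf(t, d) * log(N / df(t)). The corpus below is invented.
import math

docs = [
    "free money free prize",
    "meeting agenda for today",
    "project meeting report",
]

def tf(term, doc):
    """Number of times the term appears in the document."""
    return doc.split().count(term)

def df(term):
    """Number of documents in the corpus containing the term."""
    return sum(1 for d in docs if term in d.split())

def tf_idf(term, doc):
    return tf(term, doc) * math.log(len(docs) / df(term))

# "free" is frequent in doc 0 but appears in no other document,
# so it gets a much higher weight than "meeting", which occurs
# in two of the three documents.
print(round(tf_idf("free", docs[0]), 3))
print(round(tf_idf("meeting", docs[1]), 3))
```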
The Bag-of-Words (BoW) Model
The bag-of-words model is among the most used feature extraction methods in spam filtering, where the frequency of each word is used as a feature for training a classifier. This model does not consider the order of words in a document. For example, the sentences "Ally is running faster than Haji" and "Haji is running faster than Ally" are treated as the same in BoW, because the model does not care about grammar or how the words in a sentence are structured, but only keeps word multiplicity. The model describes documents by word frequency and completely ignores the relative position of the words in the document. Bags can contain repeated or redundant words. Some researchers describe specific strategies such as counting, tokenization and normalization as part of the bag-of-words approach.
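The order-insensitivity described above can be demonstrated directly with the two example sentences: only word counts survive, not word order.

```python
# Bag-of-words representations of the two example sentences from the
# text: identical bags despite different word order.
from collections import Counter

s1 = "Ally is running faster than Haji"
s2 = "Haji is running faster than Ally"

bow1 = Counter(s1.lower().split())
bow2 = Counter(s2.lower().split())

print(bow1 == bow2)  # the two bags are equal
```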
WEKA Tokenization
The process of splitting a sequence of strings into units such as phrases, keywords, words and symbols is called tokenization. There are many ways to increase a classifier's accuracy; in this research the author increases the accuracy by changing the tokenizer. WEKA provides three types of tokenizer:
WordTokenizer: A simple tokenizer that uses the java.util.StringTokenizer class to tokenize the strings. The attributes produced include numbers ('123'), special characters ('# &') and words ('hotel', 'administrator'). If two words are joined without a space between them, they are counted as one word ('AccorHotel').
NGramTokenizer: Splits a string into n-grams with configurable minimum and maximum gram sizes. Unlike WordTokenizer, which produces single words only, the n-gram technique also produces sequences of words. The attributes for this tokenizer include single words such as "click" and also phrases such as "click here" and "click here to", as shown in picture 2.4 below:
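A minimal sketch (not WEKA's implementation) of word n-gram tokenization with min = 1 and max = 3 grams, producing the kind of attributes mentioned above ("click", "click here", "click here to"):

```python
# Word n-gram tokenization: every contiguous run of min_n..max_n
# words becomes one token.
def ngrams(text, min_n=1, max_n=3):
    words = text.split()
    out = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            out.append(" ".join(words[i:i + n]))
    return out

print(ngrams("click here to win"))
```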
and redundant features (Hsu, H.H. and Hsieh, C.W., 2010). SVMs are said to perform well and produce good results without employing any feature selection techniques (Aakanksha S. et al. 2015). This means that some classifiers need help to give good results, but for others that is not the case. Although there are many methods for accomplishing feature selection, they are classified into only two groups: supervised and unsupervised.
[Picture: feature-subset search loop — Dataset → subset generation → estimation → test condition (if satisfied → verification of the feature subset; if not → next subset)]
Then it will use a classification algorithm to induce a classifier from the features in each subset. It will keep the subset of features with which the classification algorithm performs best. To find a subset, the evaluator will use a search technique (random search, breadth-first search, depth-first search or a hybrid search).
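The wrapper procedure above can be sketched as follows (toy data and a deliberately trivial nearest-class-mean classifier, not the thesis procedure): every feature subset is searched, a classifier is induced from each, and the subset with the best accuracy is kept.

```python
# Wrapper feature selection: exhaustive search over feature subsets,
# scoring each subset by the training accuracy of a classifier
# induced from it. Data and features are invented: feature 0 is
# informative, features 1 and 2 are noise.
from itertools import combinations

data = [
    ((1.0, 0.2, 0.9), "spam"), ((0.9, 0.1, 0.1), "spam"),
    ((0.1, 0.2, 0.8), "ham"),  ((0.0, 0.1, 0.2), "ham"),
]

def accuracy(idx):
    """Induce a nearest-class-mean classifier on features idx,
    then measure its accuracy on the same data."""
    means = {}
    for label in ("spam", "ham"):
        rows = [x for x, l in data if l == label]
        means[label] = [sum(r[i] for r in rows) / len(rows) for i in idx]
    hits = 0
    for x, label in data:
        pred = min(means, key=lambda c: sum(
            (x[i] - m) ** 2 for i, m in zip(idx, means[c])))
        hits += pred == label
    return hits / len(data)

# search technique: here simply all subsets of the three features
subsets = [c for n in (1, 2, 3) for c in combinations(range(3), n)]
best = max(subsets, key=accuracy)
print(best, accuracy(best))
```

With real datasets the exhaustive search is replaced by the random, breadth-first, depth-first or hybrid searches mentioned above, since the number of subsets grows exponentially with the number of features.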
In our case of spam and non-spam emails, the values in the confusion matrix are True Positive (TP), False Positive (FP), False Negative (FN) and True Negative (TN). TP are the actual non-spam messages that were correctly classified as non-spam, FP are the spam messages that were wrongly classified as non-spam, FN are the non-spam messages that were wrongly marked as spam, and TN are the spam messages correctly classified as spam.
Accuracy is the percentage of data that is correctly classified out of the total amount of data. The calculation formula is as follows:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
True Positive (TP) Rate – the rate of true positives, that is, instances which are correctly classified into a given class. This exposes the classifier's capability to detect instances of the positive class.

TPR = TP / (TP + FN)

False Positive (FP) Rate – the rate of false positives, that is, instances which are incorrectly classified into a given class. This reflects the frequency with which the classifier makes a mistake by classifying a normal state as pathological.

FPR = FP / (FP + TN)
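The metrics above can be computed directly from the four confusion-matrix counts. This minimal sketch uses invented counts, not the thesis results, and follows the thesis convention that the positive class is non-spam.

```python
# Evaluation metrics derived from a confusion matrix:
# accuracy, error rate, TP rate and FP rate.
def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,      # (TP + TN) / total
        "error_rate": (fp + fn) / total,    # (FP + FN) / total
        "tpr": tp / (tp + fn),              # TP / (TP + FN)
        "fpr": fp / (fp + tn),              # FP / (FP + TN)
    }

m = metrics(tp=90, fp=5, fn=10, tn=95)
print(m["accuracy"], m["tpr"], m["fpr"])
```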
CHAPTER 3
RESEARCH METHODOLOGY
This chapter will explain the activities or steps included in conducting this research, as shown in graph 3.1, and explain each step in detail. The steps start with a literature review, which will give us an idea of the topic. This will be followed by data collection, which will explain how and where data were collected, and then a section on dataset creation will be discussed, because in order to proceed we must have a dataset to train and test the classifiers. Data processing, including pre-processing, classifier evaluation, results and conclusion, will all be covered here.
and non-spam. So we will check what other literature has said about the topic, the methodology and the other methods used. We will also check the classifiers used in these papers and their performance, which will show whether what we want to do is relevant. We will also look into dimensionality reduction and how it helps the identified classifiers to perform.
3.3. Creation of the Dataset
After the completion of the data collection exercise, the dataset creation process started. This is also a very important and interesting task. The new dataset will be used in this thesis to train the classifiers, and part of it will be used to test the classifiers in machine learning using WEKA. The dataset has three features, which are the sender address, the email body and the email subject, plus a fourth column which is the label (Spam or Non-spam). This task also consumed considerable time in the research; the dataset content is text-only emails used to train and check the performance of the classifiers. The text in the dataset is a mix of two languages: English and Swahili, known as "lugha ya Kiswahili", the language of East Africa.
The process of dataset creation involves data cleaning to make the data suitable to run in the program we plan to use. Generally, email messages contain many full stops, commas and quotation marks or single quotes, which are not suitable when running in WEKA. The input file format for a WEKA dataset is known as the attribute-relation file format (ARFF). In the dataset we created, the file is structured with the name of the relation (Email Spam n Non-spam) at the top, followed by the block that defines the attributes (sender, subject, body, Class {Spam, Non-spam}); a nominal attribute is followed by its set of values enclosed in curly braces.
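The ARFF structure described above can be sketched as follows (a hypothetical fragment: the relation name and attributes follow the text, but the two instances are invented for illustration):

```
@relation 'Email Spam n Non-spam'

@attribute sender string
@attribute subject string
@attribute body string
@attribute Class {Spam,Non-spam}

@data
'offers@example.com','Win big','Click here to claim your prize',Spam
'ally@example.com','Mkutano kesho','Tutakutana saa tatu asubuhi',Non-spam
```

Note that the string values are quoted, which is why stray commas and single quotes inside the email body must be cleaned before WEKA can parse the file.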
The researcher created his own dataset because no spam/non-spam dataset written in the Swahili language was found. This will be the first one to be created, and it is believed that this move might encourage others to do the same, in order to make life a little easier for upcoming researchers whose interest is in the Swahili language. The main challenge is the content of the data in the dataset: since in the Swahili language we do not have many spam email messages, we have been forced to use English spam to compensate.
together represent the number of email files and are used to develop a Term-Document Matrix (TDM). Usually this matrix will be large and sparse in nature, due to the large number of email files available for classification, hence a dimensionality reduction method is performed to tackle this problem; this is done by feature selection and feature extraction processes. Some additional steps for reducing the dimension of the matrix are also involved, such as stop-word removal (least informative words such as pronouns, prepositions and conjunctions) (Joachims 1998) and lemmatization (grouping similar informative words, e.g. Perform, Performed and Performing can be grouped as perform).
[Picture: classification pipeline — Data → Pre-processing → Feature Extraction → Training/Test → Classifier]
3.4.2. Training and testing data
The first activity here is to use a training set, in other words supervised learning, because the class label of each tuple in the training set tells the classifier which class the tuple belongs to. During dataset creation we labelled the tuples each time a line of email was added, so that the training dataset would work efficiently as planned. Performance is measured on tuples which are already labelled, in our case spam and non-spam; this will be compared in the WEKA application, which will give the accuracy of the algorithm. This result will not affect the test set. The dataset planned for this thesis has approximately 1000 tuples, labelled into two different groups (spam and non-spam).
To forecast the performance of a classifier on new data, we need to evaluate its error rate on a dataset that did not participate in the construction of the classifier. This independent dataset is called the test set. The test set will be smaller in number of tuples than the training set. When the tuples are not classified, that is, not labelled in advance, the classifier will identify and group them. The performance on the test set can be measured by checking whether each prediction is correct or not; if it is correct it is counted as a success, otherwise as an error (error rate). Error rate as defined by Ian H. W. (2005): "The error rate is just the proportion of errors made over a whole set of instances, and it measures the overall performance of the classifier". We assume that both the training data and the test data are representative samples of the underlying problem. The plan is to have a test set with not less than three hundred tuples for testing our classifiers.
3.4.3. Classification
The main activity in classification is to filter the messages using a classifier algorithm. In this research three classifiers will be trained and tested with the dataset that was created before. These are classifiers whose high performance many researchers agree on: Naïve Bayes is among the classifiers that will be used in this research, and we believe it will give good results. The other two are Sequential Minimal Optimization (SMO) and the k-nearest neighbour (k-NN) classifier.
False negatives can be reduced by applying very strong classifiers with the help of the best feature extraction. Some recently published work has proposed the idea of a partial Naïve Bayes approach, biased towards low false-positive rates.
CHAPTER 4
PRELIMINARY PROCESSES
This chapter will explain how and where data have been collected, the process of creating the dataset and its format, and the pre-processing of the dataset for implementation.
performance of the classifiers. The researcher created two datasets, a Swahili-language dataset and an English-language dataset; after running them separately, the researcher combined them to form one dataset whose content is a mix of English and Swahili.
This process involves a data cleaning activity to make the data suitable to run in the WEKA program. Generally, email messages contain many full stops (.), commas (,), question marks (?) and single quotes ('), which are not suitable for use in WEKA. Commas and single quotes are used in a WEKA dataset as separators between attributes, so if the email message body contains one of them, an error window with the error line number will appear when executing the dataset.
The Swahili dataset contains four hundred and fifty-seven instances, among them four hundred and thirty-eight non-spam instances and thirteen spam instances. The English dataset contains four hundred and one instances: one hundred and eighty-eight spam instances and two hundred and thirteen non-spam instances. The combined English-Swahili dataset contains eight hundred and fifty-eight instances, among them six hundred and fifty-one non-spam and two hundred and seven spam. Statistical figures are shown in table 4.1 below:
An example of the dataset is shown in picture 4.1. The file structure has the name of the relation (Email Spam n Ham1) at the top, followed by the block that defines the attributes (sender, subject, body, Class {Spam, Ham}); a nominal attribute is followed by its set of values enclosed in curly braces, and then come the instances.
Picture 4.1: WEKA Dataset
Bayes and k-NN. In the next step the author chose a classifier and ran them one at a time, using 10-fold cross-validation.
Cross-validation is a method of evaluating predictive models by partitioning the original sample into a training set to train the model and a test set to evaluate it. In 10-fold cross-validation, the instances are split into ten parts: in each round, nine parts (90% of the instances) are used for training and the remaining part (10%) for testing, and the result is averaged over the ten rounds. The same setting was used for all three datasets.
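The 10-fold procedure can be sketched as follows (toy labels and a deliberately trivial majority-class "classifier", not the thesis classifiers): each fold serves once as the test set while the remaining nine folds train the model, and the ten accuracies are averaged.

```python
# 10-fold cross-validation over 100 invented instances.
from collections import Counter

labels = ["spam"] * 30 + ["ham"] * 70  # toy data, ordered for clarity

def majority_classifier(train):
    """'Train' by memorising the most common training label."""
    return Counter(train).most_common(1)[0][0]

k = 10
fold_size = len(labels) // k
accuracies = []
for i in range(k):
    # fold i is held out for testing; the other nine folds train
    test = labels[i * fold_size:(i + 1) * fold_size]
    train = labels[:i * fold_size] + labels[(i + 1) * fold_size:]
    pred = majority_classifier(train)
    accuracies.append(sum(p == pred for p in test) / len(test))

print(sum(accuracies) / k)
```

In practice the instances would be shuffled (and often stratified) before folding; the ordered toy labels are kept here only to make the fold contents easy to follow.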
CHAPTER 5
RESULTS AND EVALUATION
In this chapter, the author will display and discuss the results obtained after running the experiments on the created datasets in machine learning (WEKA).
The experiments answer the two questions posed in chapter one, which are:
What would be the performance of the classifiers if the dataset is a combination of two languages (Swahili and English)?
How can the features be extracted in such a way that the classifiers' work is simplified in order to increase accuracy?
after getting the results, the researcher combined them to form one dataset and ran the experiment again. The classifiers used are Naïve Bayes, Sequential Minimal Optimization (SMO) and k-Nearest Neighbour (k-NN). The researcher used the StringToWordVector attribute filter to convert strings to vectors, and 10-fold cross-validation was chosen as the test mode. A confusion matrix is used to summarize the performance of the classifiers.
The confusion matrix is a technique for summarizing the performance of a classification algorithm; table 5.1 is an example. The matrix is easy to read and understand, and it demonstrates the ways in which the classification model is confused when it makes predictions. The accuracy of a classifier is calculated as (TP+TN)/total; if this number is high, the accuracy is good. The opposite is the misclassification rate ("error rate"), calculated as (FP+FN)/total; a high error rate means a bad classifier.
followed by k-NN with an average accuracy of 93.52%, and last, but still with good accuracy, Naïve Bayes at 87.78%.
correctly classified instances and 6 incorrectly classified instances out of 457; the Naïve Bayes classifier has 453 correctly classified and 4 incorrectly classified instances out of 457; and the k-NN classifier has 452 correctly classified and 7 incorrectly classified instances out of 457.
In percentage terms, as shown in table 5.5, the Naïve Bayes classifier leads the other classifiers by 0.43%, with 99.12% correctly classified, followed by the SMO classifier with an average accuracy of 98.69%. The SMO classifier leads k-NN by 0.22%, and last, but still with good accuracy, is k-NN with 98.47% correctly classified. It can be concluded that Naïve Bayes works better with the Swahili language, although the gap from one classifier to another is not that big. Table 5.5 also shows the averages of Precision, Recall, F-Measure and ROC Area. The Naïve Bayes classifier still performs well here, with an average Precision of 0.991, Recall of 0.991, F-Measure of 0.991 and ROC area of 1, followed by the SMO classifier with Precision 0.987, Recall 0.987, F-Measure 0.986 and ROC area 0.842. Last is the k-NN classifier, with Precision 0.985, Recall 0.985, F-Measure 0.983 and ROC area 0.873.
5.1.3. Combined English and Swahili Dataset
The combined English-Swahili dataset contains eight hundred and fifty-eight instances, among them six hundred and fifty-one non-spam and two hundred and seven spam. The experiment again used the StringToWordVector attribute filter to transform the dataset systematically; after converting strings to vectors there are 858 instances and 1985 attributes in the combined dataset, as shown in detail in picture 5.2 ("WEKA Explorer").
classified out of 858 instances. The Gmail classifier has 157 correctly classified out of 182 and 25 incorrectly classified.
dataset, which can help to bring about good results. The classifiers can individually be improved in different ways. The Naïve Bayes classifier's performance is extremely sensitive to the selected attributes and to the number of terms selected by the term-selection methods in the training stage (Almeida, T. A. et al. 2011). Accuracy can also be increased by attribute subset selection, attribute creation and the removal of redundant features (Kotsiantis, S. B., & Pintelas, P. E. 2004). However, it is not yet clear how features in the email header can help to improve filtering results (Zhang, L., Zhu, J., & Yao, T. 2004).
The SMO classifier, which is an implementation of SVM, uses a kernel function. It is well known that the choice of the kernel function is crucial to the efficiency of SVM. The four types of kernel functions frequently used with SVM are linear, polynomial, RBF and sigmoid. Yu, B., & Xu, Z. B. (2008) adopted a sigmoid kernel in their experiment, and the results show that it does not matter whether a high or low number of features is used; the performance remains the same. Joachims, T. (1998) used SVM with polynomial and RBF kernels and compared it with Naïve Bayes and k-NN; the performance of SVM was great. RBF can also be used in ANN (Kotsiantis, S. B., et al. 2007). An RBF network is a three-layer feedback network: each hidden component implements a radial activation function, and each output component implements a weighted sum of the hidden component outputs.
There are many methods applied for increasing classifier performance; in this research the author increases the accuracy by changing the tokenizers. Almeida, T. A., et al. (2011, September) tried to increase performance by applying two types of tokenizers. The first treats patterns, domain names and mail addresses separated by dots as single units, which helps the classifier identify a domain even if the subdomains differ. The second type of token targets symbols that are used in spam messages, which helps in identifying the class of the message. In his further research he recommends standard tokenizers that can produce a bigger number of tokens and patterns, to contribute to the classifier's ability to separate non-spam messages from spam. In WEKA, for example, the StringToWordVector filter offers three types of tokenizers:
WordTokenizer: A simple tokenizer that uses the java.util.StringTokenizer class to tokenize the strings.
NGramTokenizer: Splits a string into n-grams with min and max gram sizes. In this thesis the default setting is a minimum of 1 and a maximum of 3.
AlphabeticTokenizer: An alphabetic string tokenizer; tokens are formed only from contiguous alphabetic sequences.
In WEKA, STWV uses WordTokenizer by default; this tokenizer was used in the first experiment. The author tried to increase the classifiers' accuracy by using AlphabeticTokenizer, so the results below were obtained with STWV with the alphabetic tokenizer selected. Picture 5.3 shows the list of tokenizers available in WEKA.
Table 5.8, the confusion matrix for SMO, shows the results before (WordTokenizer) and after (AlphabeticTokenizer). Before, 13 spam instances were correctly classified, 6 spam instances were misclassified as ham, and the 438 ham instances were all correctly classified. After applying the alphabetic tokenizer, 16 spam instances were correctly classified and 3 misclassified, with the 438 ham instances again all correct. This means the experiment identified three more spam instances.
Table 5.8: Confusion Matrix for SMO

                        SMO CLASSIFIERS
                   Before              After
                 Spam    Ham        Spam    Ham
        Spam      13       6         16       3
        Ham        0     438          0     438
Table 5.9 shows the experimental results before and after applying the alphabetic tokenizer. Before, 98.69% was correctly classified, with Precision 0.987, Recall 0.987, F-Measure 0.986 and ROC area 0.842; after applying the alphabetic tokenizer, 99.34% was correctly classified, an improvement of 0.65%, with average Precision 0.993, Recall 0.993, F-Measure 0.993 and ROC area 0.921.
Table 5.10: Confusion Matrix for SMO 'Tokenizers'

                            SMO CLASSIFIERS
                WordTokenizer   AlphabeticTokenizer   NGramTokenizer
                 Spam    Ham      Spam    Ham           Spam    Ham
        Spam      198      9       198      9            199      8
        Ham         3    648         3    648              2    649
5.3. Evaluation
The aim of this thesis was to build a dataset that combines two languages, English and Swahili. The dataset created was experimented on in WEKA using three classifiers: the Sequential Minimal Optimization (SMO) classifier, which is an implementation of SVM, the Naïve Bayes classifier and the k-NN classifier. Before combining the datasets, the researcher conducted experiments with the separate English and Swahili sets.
The English dataset contains four hundred and one instances, among them one hundred and eighty-eight spam and two hundred and thirteen non-spam, with 2138 attributes. The English results show that the SMO classifier leads the others in accuracy: SMO has 391 correctly classified instances (97.51%), k-NN 375 (93.52%) and Naïve Bayes 352 (87.78%).
The Swahili dataset contains four hundred and fifty-seven instances, among them four hundred and thirty-eight non-spam and thirteen spam, with 1686 attributes. Swahili-language email messages do not, for now, contain many spam messages, which means Swahili emails can still be trusted, although English spam messages still circulate in the area. The Swahili result differs from the English one, where SMO led the other classifiers. Here Naïve Bayes leads, with 453 correctly classified instances (99.12%), followed closely by the SMO classifier with 451 correctly classified (98.69%), and last the k-NN classifier with 452 correctly classified (98.47%). The Swahili result conforms to the findings in (Ion Androutsopoulos 2000), (Sahami 1998), (Daelemans et al. 1999) and (Koutsias, J., et al. 2000, July) that Naïve Bayes shows good performance among other classifiers.
The combined English-Swahili dataset contains eight hundred and fifty-eight instances, among them six hundred and fifty-one non-spam and two hundred and seven spam, with 1985 attributes. The results for the combined dataset show that the SMO classifier leads the others: it has 846 correctly classified instances, equivalent to an accuracy of 98.60%, followed by the k-NN classifier with 834 correctly classified instances (97.20%), and the Naïve Bayes classifier with 797 correctly classified instances (92.89%). On the average per-class measures, the SMO classifier still performs well, with average Precision 0.986, Recall 0.986 and F-Measure 0.986, followed by the k-NN classifier with Precision 0.972, Recall 0.972 and F-Measure 0.972, and the Naïve Bayes classifier with Precision 0.928, Recall 0.929 and F-Measure 0.928.
For both the English and the combined English-Swahili datasets, the SMO implementation of SVM got the best results because of its support boundaries, its ability to run on very large datasets without requiring extra matrix storage, and because it does not invoke a numeric routine repeatedly for every sub-problem. These
results conform with those reported by (Al-Shargabi, B., Al-Romimah, W., & Olayah, F. 2011, April) and (Al-Kabi, M., Al-Shawakfa, E., & Alsmadi, I. 2013): SMO achieved higher results than Naïve Bayes and J48 when the researchers experimented with an Arabic dataset, and in (Yu, B., & Xu, Z. B. 2008) and (Hmeidi, I., et al. 2015), also using an Arabic dataset, the results showed that the Support Vector Machine leaves all the other classifiers behind. The combined dataset produced the results the researcher predicted in the proposal, namely higher performance compared with the Gmail classifier. But things did not go well with the collection of emails: while it was expected that Swahili messages would include a large number of spam emails, it was not so. The researcher found that Swahili-language email messages do not, for now, contain much spam, meaning that English spam messages are still the ones circulating in the area.
Changing the tokenizer has an impact on a classifier's performance, as shown in the results of the experiments on the combined and Swahili datasets. The classifier used was SMO for all experiments and for both datasets (Swahili and combined). On the Swahili dataset the author concentrated only on changing the tokenizer, from WordTokenizer to AlphabeticTokenizer; the results shown in table 5.9 improved a little, from an accuracy of 98.69% to 99.34%, an increase of 0.65%. On the combined dataset three tokenizers were tried; the N-gram tokenizer came out with the best result after changing the maximum to 2 and the minimum to 1 (the default setting is a maximum of 3 and a minimum of 1), achieving an accuracy of 98.83%. The results were slightly different for the other two tokenizers (word and alphabetic), which both came out with the same accuracy of 98.60%. This means that whether you choose the word tokenizer or the alphabetic tokenizer there is no change; their performance is the same for this dataset.
The results of our experiment indicate that the combined dataset can give good results if the N-gram tokenizer is used rather than the Word or Alphabetic tokenizers. Krouska, A., Troussas, C., & Virvou, M. (2016, July) compared n-grams by changing from unigrams to bigrams and 1-to-3-grams; the classifiers used were NB, SVM, k-NN and C4.5, and the results show that 1-3 grams achieved good results, with NB at 92.59%, the highest of all their classification experiments. The ability of n-grams to detect words is higher: some emails contain phrases like "n@ked l@dies", which n-grams can capture by splitting words into fragments such as "n@k" and "@ked" (Goodman, J., Heckerman, D., & Rounthwaite, R. 2005), helping such messages to be identified as spam easily.
CHAPTER 6
CONCLUSIONS AND RECOMMENDATIONS
This chapter explains the conclusions of the research that has been conducted and gives suggestions to support further research that may be carried out.
6.1. CONCLUSIONS
Swahili language is widely spoken in all East African countries for easy
communication especially in the area of trade. The Swahili is complex somehow
because it uses suffix on verbs, in directing the action against something. Also in
negation it does not have specific words and position, sometimes can be in the
beginning or at the end or in the middle of the verb. The Swahili standard is not
like English language especially in vocabulary.
Swahili e-mail messages are currently increasing in number and are spreading
not only within East Africa but throughout the world, wherever Swahili speakers
travel, work, and reside. Precautions must therefore be taken, and effort
invested in prevention, before it is too late. If measures are not taken now,
the problem will be much harder to address in the next few years, as more
Swahili speakers continue their studies, especially in new technology, and some
of them may turn to spamming as an easy way to make money. This research can
help policy makers in East African countries take these issues into
consideration when drafting ICT and security policies.
This research tried to answer two questions. The first concerned the
performance of classifiers when the dataset is a combination of two languages
(Swahili and English). The experimental results show that the SMO classifier
performed well on both the English dataset and the combined dataset, followed
by the k-NN classifier and the Naïve Bayes classifier. Although the Naïve Bayes
results were not very good on the English dataset and the combined dataset, it
showed good performance on the dataset created from Swahili-language messages.
This indicates that the SMO, k-NN, and Naïve Bayes classifiers can all be used
across many languages, whether combined or individually.
The second question concerned which features should be extracted so that the
classifiers' work is simplified and accuracy increases. Classification accuracy
can be increased by selecting features, and each classifier can benefit from
different techniques: attribute selection, removal of redundant features,
attribute subset selection, and attribute creation. SMO performance can also be
increased by choosing an appropriate kernel function.
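As a toy illustration of attribute selection (the thesis used WEKA's attribute-selection filters; the scoring rule and example messages below are the author's own assumptions, shown only to convey the idea of discarding uninformative features): tokens can be ranked by how differently they occur in spam versus ham, and only the top-scoring ones kept.

```python
from collections import Counter

def select_attributes(spam_docs, ham_docs, k=2):
    """Rank tokens by the absolute difference of their document
    frequency in spam versus ham, and keep the top k (toy sketch)."""
    def doc_freq(docs):
        c = Counter()
        for d in docs:
            c.update(set(d.lower().split()))
        return c

    sf, hf = doc_freq(spam_docs), doc_freq(ham_docs)
    vocab = set(sf) | set(hf)
    score = {t: abs(sf[t] / max(len(spam_docs), 1)
                    - hf[t] / max(len(ham_docs), 1))
             for t in vocab}
    # Highest score first; alphabetical order breaks ties
    return sorted(vocab, key=lambda t: (-score[t], t))[:k]

spam = ["win free money", "free prize win"]
ham = ["meeting agenda attached", "see agenda for meeting"]
print(select_attributes(spam, ham, k=3))  # ['agenda', 'free', 'meeting']
```

Tokens that appear equally often in both classes score near zero and are dropped, leaving the classifier a smaller, more discriminative feature set.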
An algorithm is needed that fits areas with mixed languages, such as East
Africa, where two languages (English and Swahili) are widely used not only at
the national level but at the international level as well. People compose their
e-mail in either language, and sometimes mix both in a single message. The
author therefore recommends that, when deciding which algorithm to use among
SMO, Naïve Bayes, and k-NN for filtering e-mail messages, the answer is
Sequential Minimal Optimization (SMO). It is the best choice, as proved by the
results in Chapter 5, where it achieved the highest performance on the combined
dataset.
Tokenization settings also have an impact on the experiment. As shown in
Chapter 5, when the author tried three tokenizers in the StringToWordVector
filter, the results differed: the N-gram tokenizer achieved higher accuracy
than the word tokenizer and the alphabet tokenizer.
6.2. RECOMMENDATIONS
Further research that might be conducted includes collecting more
Swahili-language e-mail messages, especially spam messages, and evaluating the
results, because very few Swahili spam e-mail messages could be collected in
this research. The possibility of achieving high performance is also greater
with a self-built filter than with a ready-made one, because a self-built
filter can be modified easily.
References
Al-Kabi, M., Al-Shawakfa, E., & Alsmadi, I. (2013). The Effect of Stemming on Arabic
Text Classification: An Empirical Study. Information Retrieval Methods for
Multidisciplinary Applications, 207.
Almeida, T. A., Almeida, J., & Yamakami, A. (2011). Spam filtering: how the
dimensionality reduction affects the accuracy of Naive Bayes classifiers. Journal of
Internet Services and Applications, 1(3), 183-200.
Al-Shargabi, B., Al-Romimah, W., & Olayah, F. (2011, April). A comparative study for
Arabic text classification algorithms based on stop words elimination. In Proceedings of
the 2011 International Conference on Intelligent Semantic Web-Services and
Applications (p. 11). ACM.
Androutsopoulos, I., Koutsias, J., Chandrinos, K. V., & Spyropoulos, C. D. (2000, July).
An experimental comparison of naive Bayesian and keyword-based anti-spam filtering
with personal e-mail messages. In Proceedings of the 23rd annual international ACM
SIGIR conference on Research and development in information retrieval (pp. 160-167).
ACM.
Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C. D., &
Stamatopoulos, P. (2000). Learning to filter spam e-mail: A comparison of a naive
bayesian and a memory-based approach. arXiv preprint cs/0009009.
Broomfield, G. (1930). The Development of the Swahili Language. Africa, 3(4), 516-522.
Goodman, J., Heckerman, D., & Rounthwaite, R. (2005). Stopping spam. Scientific
American, 292(4), 42-49.
Hayati, P., & Potdar, V. (2008, November). Evaluation of spam detection and prevention
frameworks for email and image spam: a state of art. In Proceedings of the 10th
International Conference on Information Integration and Web-based Applications &
Services (pp. 520-527). ACM.
Hmeidi, I., Al-Ayyoub, M., Abdulla, N. A., Almodawar, A. A., Abooraig, R., &
Mahyoub, N. A. (2015). Automatic Arabic text categorization: A comprehensive
comparative study. Journal of Information Science, 41(1), 114-124.
Hoanca, B. (2006). How good are our weapons in the spam wars? IEEE Technology and
Society Magazine, 25(1), 22–30
Hsu, H.H. and Hsieh, C.W., (2010). Feature Selection via Correlation Coefficient
Clustering. JSW, 5(12), pp.1371-1377.
Witten, I. H., & Frank, E. (2005). Data Mining: Practical Machine Learning
Tools and Techniques (2nd ed.). Morgan Kaufmann.
Joachims, T. (1998). Text categorization with support vector machines: Learning
with many relevant features. In Machine Learning: ECML-98 (pp. 137-142).
Springer, Berlin, Heidelberg.
Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy k-nearest neighbor
algorithm. IEEE transactions on systems, man, and cybernetics, (4), 580-585.
Khan, A., Baharudin, B., Lee, L. H., & Khan, K. (2010). A review of machine learning
algorithms for text-documents classification. Journal of advances in information
technology, 1(1), 4-20.
Kotsiantis, S. B., & Pintelas, P. E. (2004, September). Increasing the classification
accuracy of simple bayesian classifier. In International Conference on Artificial
Intelligence: Methodology, Systems, and Applications (pp. 198-207). Springer, Berlin,
Heidelberg.
Kotsiantis, S. B., Zaharakis, I., & Pintelas, P. (2007). Supervised machine learning: A
review of classification techniques.
Krouska, A., Troussas, C., & Virvou, M. (2016, July). The effect of preprocessing
techniques on Twitter sentiment analysis. In Information, Intelligence, Systems &
Applications (IISA), 2016 7th International Conference on (pp. 1-5). IEEE.
Kumar, N., & P, D. (2015). Study on Feature Selection Methods for Text Mining.
Lam, H. Y., & Yeung, D. Y. (2007). A learning approach to spam detection based on
social networks (Doctoral dissertation, Hong Kong University of Science and
Technology).
Mojdeh, M. (2012). Personal Email Spam Filtering with Minimal User Interaction.
Petzell, M., (2005). Expanding the Swahili vocabulary. Africa & Asia, 5, pp.85-107.
Platt, J. (1998). Sequential minimal optimization: A fast algorithm for training support
vector machines.
Port, R. F. (1981). The applied suffix in Swahili. Studies in African Linguistics, 12(1),
71.
Saad, M. K. (2010). The impact of text preprocessing and term weighting on arabic text
classification. Gaza: Computer Engineering, the Islamic University.
Thota, H., Miriyala, R. N., Akula, S. P., Rao, K. M., Vellanki, C. S., Rao, A. A., &
Gedela, S. (2009). Performance comparative in classification algorithms using real
datasets. Journal of Computer Science and Systems Biology, 2(1), 97-100.
Verbeek, J., 2000, December. Supervised feature extraction for text categorization. In
Tenth Belgian-Dutch Conference on Machine Learning (Benelearn'00).
Wang, L. (Ed.). (2005). Support vector machines: theory and applications (Vol. 177).
Springer Science & Business Media.
Youn, S., & McLeod, D. (2007). A comparative study for email classification. Advances
and innovations in systems, computing sciences and software engineering, 387-391.
Yu, B., & Xu, Z. B. (2008). A comparative study for content-based dynamic spam
classification using four machine learning algorithms. Knowledge-Based Systems, 21(4),
355-362.
Zeng, Z. Q., Yu, H. B., Xu, H. R., Xie, Y. Q., & Gao, J. (2008, November). Fast training
Support Vector Machines using parallel sequential minimal optimization. In Intelligent
System and Knowledge Engineering, 2008. ISKE 2008. 3rd International Conference
on (Vol. 1, pp. 997-1001). IEEE.
Zhang, L., Zhu, J., & Yao, T. (2004). An evaluation of statistical spam filtering
techniques. ACM Transactions on Asian Language Information Processing
(TALIP), 3(4), 243-269.
Website
The Radicati Group www.radicati.com/
http://www.spamlaws.com/spam-stats.html
Anti-Phishing Working Group (APWG)
http://www.antiphishing.org/resources/apwg-reports/
The Spamhaus Project - The Definition of Spam
https://www.spamhaus.org/definition.html
https://www.pdx.edu/multicultural-topics-communication-sciences-disorders/swahili (Portland State University)
THE AUTHOR’S BIOGRAPHIES
Rashid Abdullah Omar is among the four sons and
one daughter of Mr. Abdullah's family. He was
born on the island of Zanzibar, which is part of
the United Republic of Tanzania. The author
received his primary and high school education in
Zanzibar. He earned his Bachelor's degree in
Computing and Information Systems at the Institute
for Information Technology in Dar-es-salaam,
Tanzania, graduating with second-class honors in
2006. From 2006 to 2010 he worked for different
companies in Dar-es-salaam. In May 2010 he returned to Zanzibar and worked
in government institutions. In 2015, the author joined the Institut Teknologi
Sepuluh Nopember, Surabaya, Republic of Indonesia, for his Master's degree in
the same field of Information Systems, specializing in Security in the IKTI
lab. The author successfully completed his postgraduate studies in March 2018.