SMS Fraud Detection and Prevention in Pakistan
Final Report
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Project Team
Dr. Faisal Kamiran, Principal Investigator, Assistant Professor, Information Technology University
Lubna Razaq, Director, FinTech Center, Information Technology University
Maryem Zafar Usmani, Project Manager, FinTech Center, Information Technology University
Rai Shahnawaz, Research Associate, FinTech Center, Information Technology University
M. Umer Ramzan, Research Associate, FinTech Center, Information Technology University
Table of Contents
1. Introduction
2. Data Collection
   Data Collection Phase 1: Group data collection activity through Safe SMS app
   Results
   Data Collection Phase 2: One-to-one fraudulent SMS collection
   Results
   Data Collection Phase 3: Fraudulent SMS data collection through mass media
   Results
3. Data Pre-Processing
   Data Understanding and Cleaning
   Addressing Unlabeled and Falsely-labeled Data Received from the Users
   Methodology
   Results
   All-inclusive Data Statistics
4. Modeling
   Lexicon Based
   Machine Learning
   Methodology
   Results: Model Selection
   Deep Learning
5. Proposed Deployment Stages and Strategy
   Identifying Stakeholders and their use of the system
   Gathering System Requirements and Development
6. Conclusion
Appendix A: Glossary of Terms
1. Introduction
Short Message Service (SMS) has become a common mode of communication, particularly in Pakistan [1]. In 2009, Pakistan had the largest text messaging growth in Asia Pacific [2]. The reduction in the cost of SMS services by telecom companies has enabled the increased use of SMS. The use of SMS in financial transactions, e.g., bank transactions, mobile money, bill payments, notification of due dates, as well as one-time PINs (OTP), is extremely common. Because these SMSes are so frequent and familiar, attackers sometimes send spam or fraud messages that imitate legitimate traffic, seeking either to extract private information from recipients or to defraud them by having them transfer money to a third party.
This project explores the landscape of fraudulent and spam messages within SMS traffic in Pakistan. For this purpose, SMS data was collected from multiple sources over a period of five months. Building on this data, the project develops a machine-learning algorithm to identify and tag fraudulent and spam messages in the data corpus.
[1] Proliferation of SMS & MMS in Pakistan with emphasis on Premium Rate SMS services. (2012). PTA.
[2] Study on SMS Traffic in Pakistan & Global Trends: The Inter-Cellular Network Utilization for SMS Traffic in Pakistan & its comparison with Global Trends. (2010). PTA.
Accordingly, we executed our project in the following phases:
2. Data Collection
Prior literature indicates that SMS spam is a growing problem, and highlights the need for a larger, more accessible dataset for researchers. Current SMS data sets are either small or created primarily from student users in developed countries, and no such dataset is available in the context of Pakistan. Therefore, to understand the nature of spam and fraud SMS, the first step was collecting SMS data representative of the average urban and rural population. Following are the goals of this activity:
➢ Collection and creation of a data corpus containing 10,000 SMS
➢ An understanding of the prevalence of SPAM and Fraud SMS, in comparison with regular messages
➢ Review of data collection strategies and the comparative effectiveness
Methodology
stages of data collection: 1) Through one-to-one data collection, and 2) Through mass media
Data Collection Phase 1: Group data collection activity through Safe SMS app
Survey to understand user response to Spam and fraudulent messages:
Prior to rolling out the application, we conducted a survey (in the urban population) to understand user response with respect to spam or fraud SMS messages and the probable response to an application that detects the nature of such messages. The purpose was to understand how people feel when they get brand promotional SMS messages, how they react to fraudulent SMS messages, and what they think about preventing these kinds of messages. Some of the key findings are highlighted in Figures 3 – 7 below:
Figure 6 Survey Result: Emotional response to Spam
Figure 4 Survey Result: Emotional response to Spam
The results demonstrate the types of spam and fraud
messages reportedly received by the respondents, as well
as their emotional responses to such messages. With
respect to both spam and fraudulent messages, the
majority of the respondents claim to be either irritated or
angry at receiving them. Therefore, the problem of spam
and fraudulent messages resonated with most of the
respondents, and they demonstrated an interest in using
some form of fraudulent SMS prevention:
Ok: Normal Conversations
Message distribution for user labels (chart): Spam 17.34%; Fraud 0.41%
Data Collection Phase 2: One-to-one fraudulent SMS collection
In this phase, we used personal networks in different localities, including rural and urban communities, for data collection through the Safe SMS application. This activity continued over a period of 2 months.
Results
Fraud SMS collection results were not satisfactory for these efforts due to the following reasons:
➢ Unavailability of fraud messages in users' phones: Individuals tend to delete fraud or spam messages from their phones. Furthermore, as validated by our message distribution in the previous phase, the fraud SMS ratio is much lower in comparison to spam or OK conversations. Consequently, we were able to get only one fraud message per 10 – 15 people.
➢ Android application usage: Among Android phone users, there were various factors that contributed to the ineffectiveness of data collection through the Safe SMS app:
   o Privacy concerns: People had reservations about installing the Safe SMS application because of privacy concerns.
   o Usability issues: Smartphone users in villages had usability challenges, and most of them only knew how to receive and dial a call. With regards to data collection, it was an even harder experience because of inadequate knowledge of smartphone usability.
   o Time-consuming activity on an application with multiple steps: Aside from other reasons, very few agreed to invest their time in installing the application, labeling and then uploading the selected messages.
➢ Lack of incentive and no utility: People did not have any incentive to perform the message-sharing activity, and a few of them highlighted the lack of utility of this application. Providing incentives was challenging, and we were uncertain of its value, unless a mechanism existed where we could ensure that users uploaded unique and relevant messages.
➢ Phone types: Feature phone and iPhone users were clearly excluded from the data collection activity through the app, and hence the options they had to share the relevant messages with us were limited. They could only forward the relevant messages, in which case there was an apprehension of being reported as the sender of such messages.
➢ Limited connectivity: In rural areas, there was limited mobile connectivity for most service providers, and only one or two networks would provide good signal strength. Therefore, even willing users were unable to forward or upload the fraudulent messages.
Data Collection Phase 3: Fraudulent SMS data collection through mass media
Since the one-to-one data collection had limited success in collecting fraud SMS messages, we decided to employ social and print media, in which the general population was asked to share fraudulent SMS texts on a centralized number. In print media, a news item requesting the public to share the fraudulent SMS texts they receive was published in the Jang newspaper.
Results
This activity had the highest success in collecting fraud messages. While 132 messages were collected through the application and 37 messages through the one-to-one campaign, 536 fraudulent messages were collected through the mass media campaign over a period of 2 weeks.
This strategy also created a list of respondents who actively engage with us and periodically share fraudulent messages as they receive them. Hence, to date, we have an influx of fraudulent SMS messages, resulting in an SMS database with not only a greater quantity of fraudulent SMS, but also more variety of fraudulent SMS.
In addition to reporting SMS frauds, the respondents also call or message to share fraudulent calls, letters, and WhatsApp messages. This behavior is indicative of the need for an easy-to-use reporting mechanism, as well as the importance of marketing a system.
3. Data Pre-Processing
Language Distribution:
Figure 10 Messages with null body by user label
Reducing Dictionary Size: To increase uniformity within the derivatives of the same word in the English language, and hence reduce the dictionary size, stemming and lemmatization were applied. Additionally, stop words and punctuation were removed to generate the optimal dictionary size:
Figure 13 Unique word count of English SMS (bar chart of word count for the Original Corpus, Porter Stemmer, Snowball Stemmer, and Lemmatization; value labels recovered from the chart: 12278, 12270, 13024)
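The effect of stop-word removal and stemming on dictionary size can be illustrated with a minimal sketch. It is not the report's pipeline: `crude_stem` is a toy suffix stripper standing in for the Porter and Snowball stemmers, and the stop-word list and sample messages are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "your", "you"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (drops punctuation and digits)."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Strip one common English suffix. A toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def vocabulary(messages, stem=False):
    """Return the set of unique non-stop-word tokens, optionally stemmed."""
    vocab = set()
    for message in messages:
        for token in tokenize(message):
            if token in STOP_WORDS:
                continue
            vocab.add(crude_stem(token) if stem else token)
    return vocab

# Hypothetical sample messages, not taken from the corpus.
messages = [
    "You won a prize! Claim your prizes now.",
    "Claiming the prize is easy, claimed by many winners.",
]
raw_vocab = vocabulary(messages)
stemmed_vocab = vocabulary(messages, stem=True)
print(len(raw_vocab), len(stemmed_vocab))
```

Word forms such as "claiming" and "claimed" collapse into a single dictionary entry after stemming, which is the reduction the chart above reports for the English portion of the corpus.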
Addressing Unlabeled and Falsely-labeled Data Received from the Users
Methodology
Figure 15 Process: Cosine Similarity
Figure 19 Process: Minimum Edit Distance
Figure 20 Results from Similarity/Distance Measures for fraud SMS
will be discarded this way, hence false negative
rate will not be zero in this case.
❖ Jaccard Similarity: This measure was able to
extract 38 fraud messages from the SMS corpus,
with 35 messages above the “fraud only” threshold
value of 0.254 (See Figure 20).
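A minimal sketch of how Jaccard similarity can rank unlabeled messages against known fraud messages follows. The word-level tokenization, the best-match scoring rule, and the function names are assumptions for illustration; only the threshold value comes from the text above.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the sets of lowercase word tokens of two messages."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

FRAUD_ONLY_THRESHOLD = 0.254  # "fraud only" threshold reported in the text above

def rank_candidates(known_fraud, corpus):
    """Score each corpus message by its best match against the known fraud
    messages (assumed non-empty) and return (score, message) pairs, most similar first."""
    scored = [
        (max(jaccard_similarity(message, seed) for seed in known_fraud), message)
        for message in corpus
    ]
    return sorted(scored, reverse=True)

# Hypothetical messages for illustration only.
ranked = rank_candidates(
    ["send your pin to claim prize"],
    ["claim your prize send pin", "see you at lunch"],
)
for score, message in ranked:
    status = "above threshold" if score > FRAUD_ONLY_THRESHOLD else "needs manual review"
    print(f"{score:.3f} {status}: {message}")
```

Sorting by descending similarity means a reviewer only has to traverse the top of the list, which matches the "Traversed Messages" idea in the summary statistics.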
Table 1 Summary statistics: Similarity/Distance measures
Measure                       Similarity Threshold   Distance Threshold   Threshold Recall   Fraud Found   Traversed Messages
Jaccard Similarity            0.25398                --                   0.92               38            124
Minimum Edit Distance (MED)   --                     47                   0.90               29            57
Euclidean Distance            --                     0.743528             0.57               35            1167
number of clusters was evaluated to be 500 (See Figure 22), which minimized the cost of clustering. Thereafter, we randomly selected unique messages from the 500 generated clusters, which were then manually verified.
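As an illustration of the Minimum Edit Distance (MED) measure listed in Table 1, the sketch below computes the character-level Levenshtein distance with unit costs for insertion, deletion, and substitution. Whether the study used character-level units and unit substitution cost is an assumption here; the report does not say.

```python
def minimum_edit_distance(source, target):
    """Character-level Levenshtein distance with unit costs (assumed cost model)."""
    # previous[j] holds the distance between the processed prefix of source and target[:j].
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # distance from source[:i] to the empty string
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

print(minimum_edit_distance("kitten", "sitting"))
```

Unlike the similarity measures, a smaller value means a closer match, so candidate messages would be kept when their distance to a known fraud message falls below the distance threshold.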
data set sorted in descending order of similarity was generated.
As Figure 23 demonstrates, the Jaccard Similarity measure was able to extract 2505 spam messages which were manually verified, with 1139 messages above the “spam only” threshold value of 0.216.
Results
Collectively, through the spam and fraud similarity measures activity, 6859 different messages were labeled manually. Figure 24 below shows the detailed view of the statistics:
Figure 24 Statistics for Manually labeled data based on similarity measures
All-inclusive Data Statistics
Post data collection, we had the following statistics for user-labeled data: 34,432 unlabeled; 10,647 OK; 9,272 spam; and 861 frauds:
Finalized Data Statistics (bar chart, Number of SMS): OK 42462; Spam 11794; Fraud 705
Prevalent Spam schemes
4. Modeling
Subsequent to the data pre-processing and analysis, we developed an algorithm for the SMS classification task. During data pre-processing, we cleaned our SMS dataset and assigned true labels to each message. The best way to assess the ability of a predictive model to perform on future data is to try to simulate this phenomenon. For this purpose, we split our dataset into two subsets, training data and test data, and treat the test data as if it were data from the future. We randomly divided our SMS corpus into a 75% training set and a 25% test set, and trained different models on the training data, after which the performance of each model was evaluated using the test data.
The developed classifier is responsible for distinguishing between fraud, spam and normal messages. There are three mainstream approaches to text classification, as mentioned below.
➢ Lexicon based
➢ Machine learning
➢ Deep learning
Lexicon Based
Lexicon based methodology is a way of classifying text which makes use of the lexicon structural resources for the respective language. Therefore, its effectiveness is tightly bound to the goodness of the lexical resources it relies on. Here, goodness means how rich the lexical resource is and how well polarity is assigned to words in the lexical dictionary. All the available lexical resources, including SentiWordNet [3], WordNet-Affect [4], MPQA [5], and SenticNet [6], include word mappings with either categorical (positive, negative, neutral) or numerical sentiment scores. In sentiment analysis, positive and negative scores are assigned to the words in a sentence and then combined to decide the polarity of the sentence based on the overall score. Therefore, these resources cannot be used for tasks other than sentiment classification. In addition, all these lexicons are in the English language. Since more than 70% of our SMS data corpus is in Roman Urdu, which has no lexical resources, it is not viable to use lexicon methods for our classification purpose.
Machine Learning
In machine learning, there are unsupervised and supervised classification methods. In unsupervised classification, there is no labeled data, and the classification algorithm uses similarity measures to identify the type of text. In supervised classification, on the other hand, there is previously labeled data, and the classification algorithm learns its features according to the label of the text. It is evident from recent works [7, 8] that supervised machine learning approaches generally outperform unsupervised methods. Since we have labeled data in our corpus, we use supervised methods for our text classification task.
[3] Baccianella, S., et al. (2010). Sentiwordnet 3.0: an enhanced lexical resource for sentiment analysis and opinion mining. In LREC (pp. 2200-2204).
[4] Strapparava, C., et al. (2004). Wordnet affect: an affective extension of wordnet. In LREC (pp. 1083-1086). Citeseer.
[5] Wiebe, J., et al. (2005). Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 165-210.
[6] Cambria, E., et al. (2014). SenticNet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis. In Twenty-eighth AAAI Conference on Artificial Intelligence.
lemmatizer or stemmer in our case to unify different word forms relating to the same concept, since the existing stemmers and lemmatizers are not applicable to the majority of our text messages, which are in Roman Urdu.
The accuracy of the models was additionally verified
using the k-fold cross-validation with k=10 in all
experiments.
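The random 75%/25% split used in this chapter and the 10-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration using only the standard library; the function names and the fixed seed are assumptions, and any stratification by class that the study may have used is not shown.

```python
import random

def train_test_split(items, test_fraction=0.25, seed=0):
    """Shuffle the items and return (training set, test set)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (training indices, validation indices) for each of k folds.
    Every item is held out for validation exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != held_out
            for index in fold
        ]
        yield training, validation

train, test = train_test_split(range(100))
folds = list(k_fold_indices(50, k=10))
print(len(train), len(test), len(folds))
```

The cross-validated accuracy is then the mean of the k validation accuracies, which is less sensitive to one lucky or unlucky split than a single held-out test set.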
Table 2 Classifiers Accuracy and Runtime
Classifier                      Feature Transformer   Accuracy   Running Time (s)   10-fold CV Accuracy
Naïve Bayes                     Count Vectorizer      0.934      0.162966251        0.909
Naïve Bayes                     Word Level TF-IDF     0.94       0.160802364        0.916
Naïve Bayes                     N-Gram TF-IDF         0.94       0.170782089        0.895
Logistic Regression             Count Vectorizer      0.952      2.269209146        0.912
Logistic Regression             Word Level TF-IDF     0.935      0.853351116        0.91
Logistic Regression             N-Gram TF-IDF         0.935      0.853185892        0.866
Random Forest                   Count Vectorizer      0.945      41.70967031        0.908
Random Forest                   Word Level TF-IDF     0.945      40.60741115        0.909
Random Forest                   N-Gram TF-IDF         0.945      40.64784145        0.905
SVM                             Count Vectorizer      0.94       159.9172568        0.91
SVM                             Word Level TF-IDF     0.94       116.431967         0.913
SVM                             N-Gram TF-IDF         0.94       110.93911          0.9
Xtreme Gradient Boosting Tree   Count Vectorizer      0.895      9.12604022         0.908
Xtreme Gradient Boosting Tree   Word Level TF-IDF     0.9        14.98653936        0.909
Xtreme Gradient Boosting Tree   N-Gram TF-IDF         0.9        15.03456521        0.877
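To make the feature transformers in Table 2 concrete, the sketch below builds a word-count matrix and reweights it with TF-IDF. It uses the textbook weighting tf × log(N / df) as an assumption; library implementations typically add smoothing and normalization, so their numbers will differ. An n-gram variant would count sequences of adjacent tokens instead of single words.

```python
import math
from collections import Counter

def count_vectorize(documents):
    """Return (vocabulary, matrix), where matrix[i][j] counts term j in document i."""
    tokenized = [document.lower().split() for document in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    counts = [Counter(tokens) for tokens in tokenized]
    matrix = [[count[term] for term in vocabulary] for count in counts]
    return vocabulary, matrix

def tf_idf(matrix):
    """Reweight raw counts by log(N / document frequency), the textbook variant."""
    n_docs = len(matrix)
    n_terms = len(matrix[0]) if matrix else 0
    # Each term occurs in at least one document, so document frequency is never zero.
    doc_freq = [sum(1 for row in matrix if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / df) for df in doc_freq]
    return [[row[j] * idf[j] for j in range(n_terms)] for row in matrix]

# Hypothetical messages for illustration only.
vocab, counts = count_vectorize(["win cash now", "call me now"])
weights = tf_idf(counts)
print(vocab)
print(counts)
```

In this toy example the word "now" appears in both messages and receives a weight of zero, while words unique to one message keep a positive weight. That down-weighting of ubiquitous words is the practical difference between the Count Vectorizer and TF-IDF rows.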
However, accuracy only reflects the overall results and can be extremely misleading, especially in scenarios where the data distribution across classes is not proportionate. In our setting, the central class of the project (Fraud) has the lowest data proportion. Therefore, we further analyzed the precision, recall, and F1-measure for the individual classes; the results are shown in Table 3 below for all the experiments done in the previous step.
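The point about accuracy on imbalanced classes can be shown with a small sketch that computes per-class precision, recall, and F1 from true and predicted labels. The labels below are hypothetical and the helper name is an assumption.

```python
def per_class_scores(true_labels, predicted_labels):
    """Return {label: (precision, recall, f1)} computed one class at a time."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = {}
    for label in sorted(set(true_labels) | set(predicted_labels)):
        true_pos = sum(1 for t, p in pairs if t == label and p == label)
        false_pos = sum(1 for t, p in pairs if t != label and p == label)
        false_neg = sum(1 for t, p in pairs if t == label and p != label)
        precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
        recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
        f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Hypothetical imbalanced test set: a model that always predicts "ok".
true_labels = ["ok"] * 8 + ["fraud"] * 2
predicted_labels = ["ok"] * 10
accuracy = sum(t == p for t, p in zip(true_labels, predicted_labels)) / len(true_labels)
scores = per_class_scores(true_labels, predicted_labels)
print(accuracy, scores["fraud"], scores["ok"])
```

Here the accuracy is 0.8 even though no fraud message is ever detected, which is why the per-class scores are reported alongside overall accuracy.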
Table 3 Individual Class Precision, Recall and F1-Measure Scores (classifiers applied on various DTMs; P = Precision, R = Recall, F1 = F1-Measure)
                                                      Fraud                OK                   Spam
Classifier                      Feature Transformer   P      R      F1     P      R      F1     P      R      F1
Naïve Bayes                     Count Vectorizer      0.67   0.98   0.79   0.98   0.94   0.96   0.81   0.93   0.86
Naïve Bayes                     Word Level TF-IDF     1      0.7    0.82   0.97   0.95   0.96   0.84   0.9    0.87
Naïve Bayes                     N-Gram TF-IDF         1      0.7    0.82   0.97   0.95   0.96   0.84   0.9    0.87
Logistic Regression             Count Vectorizer      1      0.99   1      0.96   0.97   0.97   0.9    0.87   0.89
Random Forest                   Count Vectorizer      1      0.98   0.99   0.96   0.98   0.96   0.9    0.83   0.87
SVM                             Count Vectorizer      0.99   0.99   0.99   0.97   0.97   0.97   0.88   0.87   0.88
SVM                             Word Level TF-IDF     0.99   0.98   0.98   0.97   0.97   0.97   0.88   0.87   0.88
SVM                             N-Gram TF-IDF         0.99   0.98   0.98   0.97   0.97   0.97   0.88   0.87   0.88
Xtreme Gradient Boosting Tree   Count Vectorizer      1      0.84   0.91   0.9    0.98   0.94   0.88   0.59   0.71
Xtreme Gradient Boosting Tree   Word Level TF-IDF     1      0.84   0.91   0.9    0.98   0.94   0.89   0.62   0.73
Xtreme Gradient Boosting Tree   N-Gram TF-IDF         1      0.84   0.91   0.9    0.98   0.94   0.89   0.62   0.73
According to these results, the Naive Bayes classifier with CountVectorizer as the feature transformer performed the worst on fraud-labeled messages. It was able to extract 98% of the fraud messages from the test data set (recall). However, of all the SMS texts it extracted from the test data set and labeled as fraud, only 67% were actual fraud SMS messages (precision). On the other hand, logistic regression and SVM performed best, with CountVectorizer as the feature transformer in each case. SVM performed better than logistic regression only on precision for OK-labeled messages, and only by a 1% margin. Otherwise, logistic regression outperformed all the classifiers on approximately all other evaluation metrics. Moreover, the runtime for the logistic regression model (2.27 seconds) is far less than that of SVM, which takes approximately 160 seconds to train. The model learning time is important because, once deployed, the model needs to be updated periodically on new SMS data. Ultimately, our final developed model would follow the concrete steps below:
Deep Learning
Deep learning for text is based on deep convolutional and recurrent neural networks, which usually do not perform well on small data sizes. Therefore, a deep learning study was not carried out, as it was out of scope and additionally requires millions of data points to work on, whereas we have only 55,212 text messages. For future work, we can extend our study by deploying the classification model integrated with a mobile phone application, and hence obtain much more data to explore and apply deep learning models.
5. Proposed Deployment Stages and Strategy
SMISHING (SMS Phishing) is a form of social engineering which exploits human weaknesses to obtain confidential
information from individuals and may lead to financial loss. It is one of the steps of a complex fraud scheme which
randomly targets a large number of people to solicit response from a certain portion of those contacted. Lack of awareness
among victims and the portion of the population exposed to such schemes can increase the resultant impact of SMISHING
frauds among other things. The goal of our work is to counter SMISHING by reducing the overall scale and the resultant
impact of the fraudulent activity by addressing these drivers.
The scope of our work included data collection and developing a classifier algorithm which resulted in two concrete
functions being performed by this system that directly contribute to the goal of reducing SMISHING activity and its impact:
1) Data collection which works on crowdsourcing model resulted in building a database, albeit small, of prevalent
fraud schemes and fraudulent numbers
2) Building on the collected data, the resulting classifier when provided with an SMS as input differentiates between
fraudulent and non-fraudulent SMS based on the message content.
The goal of deploying the SMS Fraud classifier system is to reduce overall SMISHING activity by scaling up both of the above-mentioned functions.
For this classifier to be put to use for public good, the design, installation, testing, marketing and scaling of the system are required, which we refer to as the deployment of the service. In this chapter, we present the proposed stages of deployment, also shown in Figure 29.
Figure 29 Stages of deployment: Identifying stakeholders → Gathering user requirements → System development → Testing → Deployment
Identifying Stakeholders and their use of the system
The users of the system will include:
1) Consumers
2) Regulatory bodies and law enforcement agencies
3) Telecom operators
1) Consumers:
a) Consumers will benefit from this classifier’s output, on interfaces appropriately designed for their modality, to
differentiate between fraud, spam and OK messages. Like other socially engineered frauds, SMISHING can be prevented
by providing end users with a mechanism for identifying which communication to trust and vice versa. Such tools
have been around for email inboxes for many years. The SMS Fraud Detection classifier helps users differentiate
between fraudulent and non-fraudulent SMS based on the message content. Existing spam detection applications rely
on sender’s number to identify spam and remain ineffective for SMISHING, where the fraudsters keep changing SIMs
to avoid leaving a trace. It is therefore proposed that the algorithm is deployed in such a manner which informs the
users of potential fraud and spam messages by tagging them.
b) Consumers already file reports to the regulators, law enforcement agencies and telecom operators about fraudulent
SMS and calls. However, none of the above mentioned organizations confirmed maintaining a repository of such
fraudulent SMS. The proposed system will provide the consumers with a unified point for reporting and create a single
universal repository of fraud schemes and fraudulent numbers which continues to feed the classifier.
c) Consumers will also provide feedback on the tags returned by system by accepting or rejecting the tags.
For consumers, this service can be made available over multiple interfaces:
• For feature phones, the classifier can be made part of the feature phone SMS application designed by the
manufacturer. For this, manufacturers of feature phones will have to be contacted.
• Smartphone application or web portal for Fraud SMS reporting, tagging, identification that is designed with the
appropriate usability and features to keep the users engaged.
For agencies and regulators, the database of reported fraudulent numbers and SMS can be made available via web portal and APIs. Exposing the database of fraudulent numbers to law enforcement agencies can serve two purposes:
a. Increase the speed with which fraud numbers are identified for blocking, limiting the fraudsters' operations. This is enabled by taking measures and incorporating design elements which increase the reporting ratios of such frauds.
This work is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
b. Verify fraud complaints by referring to this database.
3) Telecom Operators
Telecom operators can use this classifier to tag messages for consumers and relay them with tags indicating
potential fraud and spam, e.g. the user receives a message with the tag ‘Possible Fraud’. This method is also suitable for
informing feature phone users. A similar approach can be seen in mail servers, where spam email is tagged but the decision
about what to do with it is left to the user.
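The tag-and-relay approach can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed design: `SUSPICIOUS_CUES`, `fraud_score`, `relay_with_tag` and the 0.4 threshold are hypothetical, and the cue-word scorer is only a stand-in for the trained content classifier.

```python
from dataclasses import dataclass

# Hypothetical cue words standing in for the trained content classifier.
SUSPICIOUS_CUES = {"prize", "lottery", "pin", "winner", "transfer"}


@dataclass
class RelayedSMS:
    sender: str
    text: str
    tag: str  # empty string when no warning applies


def fraud_score(text: str) -> float:
    """Toy stand-in for the classifier: share of cue words found in the text."""
    words = {w.strip(".,!?:;").lower() for w in text.split()}
    return len(words & SUSPICIOUS_CUES) / len(SUSPICIOUS_CUES)


def relay_with_tag(sender: str, text: str, threshold: float = 0.4) -> RelayedSMS:
    """Always deliver the message; attach a warning tag instead of blocking."""
    tag = "Possible Fraud" if fraud_score(text) >= threshold else ""
    return RelayedSMS(sender=sender, text=text, tag=tag)


def render(sms: RelayedSMS) -> str:
    """Format the message as a handset might display it."""
    return f"[{sms.tag}] {sms.text}" if sms.tag else sms.text
```

Because the tag travels with the message text, the same output could be shown on a feature phone, and the message is never withheld from the recipient.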
A less favorable way to deploy on the service side would be to block fraudulent messages after identifying them. However,
this route could pose a regulatory risk and might be unsuitable from a consumer perspective. Regulations might
hinder the filtering of data based on its content. Moreover, consumers might not respond favorably to any intervention
which causes some of the messages intended for them not to reach them. In 2011, PTA proposed blocking
SMS (for reasons other than fraud prevention) based on a list of words 11, but there were concerns about the source of the
word list, as it could lead to the blocking of many regular messages. The practice was therefore withdrawn 12.
PTA introduced the Protection from Spam, Unsolicited, Fraudulent and Obnoxious Communication Regulations in 2009,
which only work to blacklist fraudulent numbers and run ad campaigns. PTA also introduced schemes for blocking spam
based on the number of messages sent over a certain duration of time 13.
Table 4 shows the stakeholders of the system and their objectives.
Stakeholder Objective
Regulator and Agencies Access fraud data through API to increase the speed of fraudulent SIMs identification and blocking, verifying complaints
11
Popalzai, Shaheryar. “Filtering SMS: PTA May Ban over 1,500 English, Urdu Words.” The Express Tribune, 16 Nov. 2011,
tribune.com.pk/story/292774/filtering-sms-pta-may-ban-over-1500-english-urdu-words/.
12
Attaa, Aamir. “PTA Decides to Withdraw SMS Filtration Orders.” Propakistani, 2011,
propakistani.pk/2011/11/22/pta-decides-to-withdraw-sms-filtration-orders/.
13
“PTA Launches Updated SMS Sending Policy For Mobile Subscribers.” Awami Politics, 26 Feb. 2014,
www.awamipolitics.com/pta-launches-updated-sms-sending-policy-for-mobile-subscribers-15271.html.
A business requirements analysis is a high-level statement of what the project is supposed to achieve. It is
a step-by-step process to determine, explore, and document the essential requirements of a business project. This
step includes the following phases:
• High-Level requirement
• Functional and Non-Functional requirement
• SRS (Software Requirement Specification) Document
Furthermore, SRS document includes the following information regarding the document and the system:
In accordance with the user requirements and goals, there are two parts of the system development:
1. Front-end development
2. Back-end development
1. Front-end development
The front end of the system can be either an Android application or a web application. This refers to the interface through
which the classifier service interacts with the end users. Users will input their data using the front end of an application,
a web interface or a feature phone. The front end should enable the user to share data on fraud SMS along with a tag,
which will be input to the AI module and engine for further processing and action. The usability of the front end is crucial
for uptake of the application.
Regardless of the interface through which a consumer uses the service, the user should be able to tag spam and
fraud messages that were previously untagged, and to change the tags of messages falsely labeled by the application as
spam or fraud. The classifier will then be able to verify the tags based on feedback from other users, who can accept or
reject the proposed label.
User-driven tagging is likely to be highly accurate because users are very unlikely to tag messages meant for them
(just like emails) as spam or fraud.
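One way the accept/reject feedback could be combined into a final label is sketched below. The function `resolve_tag`, the status names, and the `min_votes` and `reject_ratio` thresholds are hypothetical choices made for illustration; the report does not specify an aggregation rule.

```python
from collections import Counter


def resolve_tag(proposed_tag, votes, min_votes=3, reject_ratio=0.5):
    """Return (tag, status) after weighing user feedback on a proposed tag.

    votes is a list of (accepted, alternative_tag) pairs, where accepted is a
    bool and alternative_tag is the label a rejecting user suggested, or None.
    """
    if len(votes) < min_votes:
        return proposed_tag, "pending"  # too little feedback to decide
    rejections = [alt for accepted, alt in votes if not accepted]
    if len(rejections) / len(votes) <= reject_ratio:
        return proposed_tag, "confirmed"
    # Most users rejected the tag: adopt the most frequently suggested label.
    alternatives = Counter(alt for alt in rejections if alt)
    if alternatives:
        return alternatives.most_common(1)[0][0], "relabelled"
    return proposed_tag, "disputed"
```

Confirmed and relabelled messages could then be fed back as training data, while pending and disputed ones wait for more feedback.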
Front-End development flow includes the following phases:
2. Back-end development
The back end of the system holds the brain of the classifier. It consists of the back-end server and databases where the
gathered data resides and is processed so that the classifier can automatically learn and adjust itself to new data.
➢ Back-end modules will perform the basic tasks of reading the text and storing it in a database with a label and
requisite actions like reporting the number to the relevant authority for blocking or any other action.
➢ The AI module will perform text classification and assign an appropriate label to the text.
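The two back-end responsibilities above (storing each text with its label, and surfacing numbers for reporting) can be sketched with Python's built-in `sqlite3` module. The table name `reported_sms`, the helper names, and the `min_reports` threshold are assumptions for illustration; `classify` is a placeholder for the AI module.

```python
import sqlite3

# Hypothetical schema: one row per reported message with its assigned label.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reported_sms (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    sender TEXT NOT NULL,
    body TEXT NOT NULL,
    label TEXT NOT NULL CHECK (label IN ('fraud', 'spam', 'ok'))
)
"""


def open_store(path=":memory:"):
    """Open (or create) the message store; defaults to an in-memory database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def store_message(conn, sender, body, classify):
    """Label the text with the supplied classifier and persist it."""
    label = classify(body)
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO reported_sms (sender, body, label) VALUES (?, ?, ?)",
            (sender, body, label),
        )
    return label


def numbers_to_report(conn, min_reports=2):
    """Senders with at least min_reports messages labelled as fraud."""
    rows = conn.execute(
        "SELECT sender FROM reported_sms WHERE label = 'fraud' "
        "GROUP BY sender HAVING COUNT(*) >= ? ORDER BY sender",
        (min_reports,),
    )
    return [row[0] for row in rows]
```

The list returned by `numbers_to_report` is the kind of output that could be forwarded to the relevant authority or exposed through the API for regulators.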
The project infrastructure phase needs to set up the following tools for the development of the system:
o Bitbucket Setup for Code Management
o Technical Specification Documents
o Development, Testing, Staging and Production Environment Setups
o Reporting analysis
o Reporting use-cases identifications
o Data cleaning and EDA (Exploratory Data Analysis)
o Data visualization (Google Data Studio / Tableau Server)
o Reports generation using ETL queries
Once the system has been developed, it will have to undergo various levels of testing separately for front-end, back-end
and then the whole integrated system:
• Unit Testing
• System Integration Testing
• Regression Testing
• User Acceptance Testing (Alpha and Beta Testing)
• Performance Testing
• Vulnerability Assessment/Penetration Testing
The more thoroughly the system is tested, the more bugs will be detected, which ultimately results in a higher
quality system. Once the testing process has been completed and the system has successfully passed through all the
testing phases, the system will then be delivered to production.
6. Conclusion
Digital Financial Services were built on the rails of ubiquitous cellular communication. They rely on calls and messages
for customer verification and communication, complaint resolution, cross-selling products, payment reminders etc.
However, as digital banking and mobile money grow, these channels are increasingly being exploited by malicious agents
to defraud phone users, both users and non-users of DFS 14. To protect consumers and retain trust in these systems
and services, vulnerabilities need to be identified and addressed on an ongoing basis. SMS phishing is one such problem
demanding attention. Previous attempts to warn end users through advertisements in print media have not been
successful in reaching consumers, as more and more customers keep getting exposed and falling prey to SMS frauds 15.
Our interactions with consumers during the study also suggest that consumers remain unaware of the reporting
platforms and recourse mechanisms.
To address the problem of SMS frauds, our study aimed at developing a classifier based on the content of fraudulent SMS
which would help end users tell a trustworthy message from a fraudulent one and hence reduce the impact of such
frauds. For this purpose, we gathered a corpus of fraudulent, spam and regular (OK) SMS in Pakistan. The data collection
exercise exposed us to several issues: the lack of a database recording fraud and spam complaints and SMSes; consumer
concerns around sharing fraudulent messages (and the possibility of being falsely accused of being the sender of such
messages); the time and effort required to report or share such messages, which deters reporting; and the low
frequency of SMS frauds per individual consumer, coupled with users not retaining such messages on their
phones. We deployed multiple approaches to collect data while addressing these concerns and created a small database
of fraud numbers and SMS schemes. This led us to realize the potential of such a crowdsourced system to create
a universal database of fraudsters’ numbers that regulators can use to block them. This can reduce the number of
fraudulent SIMs available to fraudsters, which are an important resource limiting their activity. Such a database also
provides a list of prevalent fraudulent schemes to update the classifier.
After collecting SMS data, we used different data preprocessing techniques to clean and prepare the dataset while automating
the labeling of unlabeled data and the handling of false labels. We developed a data science model on the collected SMS
dataset that can distinguish fraudulent from non-fraudulent (spam or OK) SMSes.
The resultant impact of both aspects of this system, namely fraud detection and database creation, depends on
scaling the system and requires continuous efforts in marketing and raising awareness of system usage. The process of
updating the fraud detection algorithm to incorporate the continuous stream of new and existing fraud schemes and
SMSes will have to be automated as well.
14
“Fraudlulent activities: FIA recovers 2,000 verified SIMs from eight suspects”, DAWN NEWS, DAWN, 21 Nov. 2018,
www.fia.gov.pk/en/images/2018/nov/full_news/21-11-2018%201.jpg.
15
FIA. “ANNUAL ADMINISTRATION REPORT 2016.” FEDERAL INVESTIGATION AGENCY, 2016, www.fia.gov.pk/en/ccro/Annualreport2016.pdf.
While we started with a study of SMS frauds, we came across incidents and stories of a wide variety of frauds and security
breaches in the Pakistani Digital Financial Services space, affecting both users and non-users of DFS. Other types
include voice phishing, ATM frauds 16, security breaches and call masking, which also need to be studied and resolved.
16
Express. “Online Bank Fraud.” Daily Express, 20 Dec. 2018, www.fia.gov.pk/en/images/2018/dec/full_news/20-12-2018%201.jpg.
Appendix A: Glossary of Terms
Term Description
Bootstrapping Bootstrapping is a re-sampling technique in which samples are constructed by randomly
drawing observations from a data set one at a time, with replacement, to create a representative
sample.
CountVectorizer CountVectorizer converts a collection of text documents to a matrix of token (words) counts.
Document Term Matrix A document-term matrix or term-document matrix is a mathematical matrix that describes
the frequency of terms that occur in a collection of documents (SMS dataset). In a
document-term matrix, rows correspond to documents in the collection and columns
correspond to terms.
F1 Score F1 Score is the harmonic mean of Precision and Recall.
Lemmatization Lemmatization is the process of converting each word of a sentence to its dictionary form.
For example, given the words amusement, amusing, and amused, the lemma for each
would be amuse.
K-fold Cross-Validation Cross-validation is a technique to evaluate predictive models by partitioning the original
sample into a training set to train the model, and a test set to evaluate it.
In k-fold cross-validation, the original sample (dataset) is randomly partitioned into k equal
size subsamples. Of the k subsamples, a single subsample is retained as the validation
data for testing the model, and the remaining k-1 subsamples are used as training data.
The cross-validation process is then repeated k times (the folds), with each of the k
subsamples used exactly once as the validation data. The k results from the folds can then
be averaged (or otherwise combined) to produce a single estimation. The advantage of this
method is that all observations are used for both training and validation, and each
observation is used for validation exactly once.
Recall and Precision Recall expresses the ability of the model to find all relevant instances in a dataset.
Precision expresses the proportion of the data points the model flagged as relevant that
actually were relevant.
Stemming Stemming is the process of reducing each word of a sentence to its non-changing portion (stem).
In the example of amusing, amusement, and amused, the stem of these words would be
amus.
Porter Stemmer The Porter stemming algorithm is a process for removing the commoner morphological and
inflectional endings from words in English.
Snowball Stemmer The Snowball stemmer is derived from the Porter stemmer; it is an improved and modified version
of the Porter stemmer. It is also called the Porter2 stemmer.
TF-IDF Hashing Instead of maintaining a dictionary, a hashing TF-IDF vectorizer uses the hashing trick, which
builds a vector of a pre-defined length by applying a hash function h to the features
(e.g., words), then using the hash values directly as feature indices and updating the
resulting vector at those indices.
Tokenization Tokenization is a process of segmenting the text message into words called tokens.
Word2Vec Word2vec takes as its input a large corpus of text and produces a vector space, typically of
several hundred dimensions, with each unique word in the corpus being assigned a
corresponding vector in the space.
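The Tokenization, Document Term Matrix and hashing entries above can be illustrated with a short standard-library sketch. It is not the project's pipeline: the function names, the Latin-only token pattern, the choice of `zlib.crc32` as the hash function and the vector length of 16 are assumptions for illustration, and the TF-IDF weighting step is left out so that only raw counts are shown.

```python
import re
import zlib


def tokenize(message):
    """Lower-case the message and split it into word tokens (Latin script only)."""
    return re.findall(r"[a-z0-9']+", message.lower())


def document_term_matrix(messages):
    """Return (vocabulary, matrix): rows are messages, columns are terms."""
    tokenized = [tokenize(m) for m in messages]
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    index = {term: col for col, term in enumerate(vocabulary)}
    matrix = []
    for toks in tokenized:
        row = [0] * len(vocabulary)
        for tok in toks:
            row[index[tok]] += 1  # count each occurrence of the term
        matrix.append(row)
    return vocabulary, matrix


def hashed_vector(message, n_features=16):
    """Hashing trick: no dictionary is kept; the hash value picks the column."""
    vector = [0] * n_features
    for tok in tokenize(message):
        # crc32 is used because it is stable across runs, unlike str hash().
        vector[zlib.crc32(tok.encode("utf-8")) % n_features] += 1
    return vector
```

With the hashing variant, two different words can collide in the same column; that is the price paid for a fixed vector length and for not storing a vocabulary.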
Similarity/Distance Measures Description
Cosine Similarity Cosine similarity calculates similarity by measuring the cosine of the angle between two
vectors. For the highest similarity, the similarity value will be 1, as the angle between the
two vectors is zero.
Jaccard similarity Jaccard similarity is the size of the intersection divided by the size of the union of two sets.
Jaccard similarity works on sets or vectors with discrete values. The similarity is calculated
based on common terms between the messages.
Euclidean Distance The Euclidean distance between two points is the length of the straight line segment connecting
them. The Pythagorean theorem gives this distance between two points.
Manhattan Distance Manhattan distance is a metric in which the distance between two points is the sum of the
absolute differences of their Cartesian coordinates. Put simply, in two dimensions it is the
absolute difference between the x-coordinates plus the absolute difference between the y-coordinates.
Minimum Edit Distance (MED) The minimum edit distance between two strings is the minimum number of editing
operations (insertion, deletion, substitution) needed to transform one into the other.
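The five measures in this table can be written out in a few lines of standard-library Python. This is a sketch for illustration: the function names are our own, the vector measures assume two equal-length numeric sequences, and the edit distance assumes unit cost for each operation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def euclidean_distance(u, v):
    """Straight-line distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan_distance(u, v):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def minimum_edit_distance(s, t):
    """Fewest insertions, deletions and substitutions turning s into t."""
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(
                previous[j] + 1,                # delete cs
                current[j - 1] + 1,             # insert ct
                previous[j - 1] + (cs != ct),   # substitute, or free match
            ))
        previous = current
    return previous[-1]
```

For SMS text, the set-based and string-based measures can be applied to tokens or raw strings directly, while the vector measures need the messages converted to numeric vectors first.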
Classifiers Description
Logistic Regression Logistic regression is a model for classification. In this model, the probabilities describing
the possible outcomes of a single trial are modeled using a logistic function.
Naive Bayes In machine learning, naive Bayes classifiers are a family of simple "probabilistic classifiers"
based on applying Bayes' theorem with strong (naive) independence assumptions between
the features of data.
Support Vector Machine (SVM) A Support Vector Machine (SVM) is a discriminative classifier formally defined by a
separating hyperplane. In other words, given labeled training data (supervised learning),
the algorithm outputs an optimal hyperplane which categorizes new examples.
Random Forest Random forest (classifier) builds multiple decision trees and merges them together to get a
more accurate and stable prediction.
Extreme Gradient Boosting Tree Extreme Gradient Boosting is an implementation of gradient boosted decision trees
designed for speed and performance.
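To make the Naive Bayes entry concrete, and to connect it with the Precision, Recall and F1 Score entries of the glossary, the sketch below trains a very small multinomial naive Bayes model and scores its predictions. It is an illustration only, not the model built in the study: the function names, whitespace tokenization, add-one smoothing and the choice of "fraud" as the positive class are all assumptions.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(messages, labels):
    """Count classes and per-class words for a multinomial naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(messages, labels):
        word_counts[label].update(text.lower().split())
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocabulary


def predict(model, text):
    """Pick the class with the highest log-posterior, using add-one smoothing."""
    class_counts, word_counts, vocabulary = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        denominator = sum(word_counts[label].values()) + len(vocabulary)
        score = math.log(count / total)  # log prior
        for word in text.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def precision_recall_f1(y_true, y_pred, positive="fraud"):
    """Precision, recall and their harmonic mean (F1) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because F1 is a harmonic mean, a model with perfect precision but low recall still gets a low F1, which matters when missed fraud messages are costly.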
Integration Testing Integration testing combines all of the units within a
program and tests them as a group. This testing level is designed to find interface defects
between the modules/functions. This is particularly beneficial because it determines how
well the units run together. No matter how efficiently each
unit runs on its own, if the units aren’t properly integrated, the functionality of the
software program will be affected. Various testing methods can be used to run these tests;
the specific method chosen will depend
greatly on the way in which the units are defined.
System Testing System testing is the first level in which the complete system is tested as a whole. The goal
at this level is to evaluate whether the system has complied with all of the outlined
requirements and to see that it meets quality standards. System testing is undertaken by
independent testers who haven’t played a role in developing the program. System testing
verifies that the application meets the technical, functional, and business requirements
that were set by the end user.
User Acceptance Testing Acceptance testing (or user acceptance testing) is conducted to determine whether the
system is ready for release. During the software development life cycle, requirements
changes can sometimes be misinterpreted in a fashion that does not meet the intended
needs of the users. During this phase, the user will test the system to find out whether the
application meets their needs. Initially, a dark launch is performed, which is the process of
releasing production-ready features to a subset of users prior to a full release. This
enables the system developers to decouple deployment from release, get real user
feedback, test for bugs, and assess infrastructure performance. Beta testing is the final
stage of testing before the official release of the system. In this phase, the system with full
features is given to users outside the organization for real-world exposure. In beta
testing, the system can be made public over the internet, and users can be asked to download
the trial version of the system and give feedback. Once this process has been completed
and the software has passed, the program will then be delivered to production.