Future Generation Computer Systems: Sandhya Mishra Devpriya Soni

This article discusses a new model called 'Smishing Detector' to identify smishing (SMS phishing) messages while reducing false positives. The model consists of four modules: SMS content analysis using machine learning, URL filtering, analyzing website source code, and detecting malicious app downloads. The model was tested on SMS datasets and achieved an overall accuracy of 96.29%.

Future Generation Computer Systems 108 (2020) 803–815

Future Generation Computer Systems


Smishing Detector: A security model to detect smishing through SMS content analysis and URL behavior analysis

Sandhya Mishra, Devpriya Soni
Jaypee Institute of Information Technology, Sector-128, Noida, India

Article info

Article history:
Received 15 July 2019
Received in revised form 20 February 2020
Accepted 7 March 2020
Available online 12 March 2020

Keywords: Smishing, Phishing, Text messaging, Mobile security, Machine learning, SMS

Abstract

The popularity of smartphones and their constant connectivity to the World Wide Web have made these devices vulnerable to phishing and smishing attacks. Phishing is the practice of sending malicious emails to users. Smishing is a combined form of SMS and phishing in which attackers send an SMS containing malicious content to the victim. This content sometimes includes links which redirect the user to websites containing malicious applications and user interfaces. Researchers have proposed various methods in past years to detect smishing, but we still lack a method that significantly avoids false-positive results, i.e. falsely categorizing a message as malicious when it is genuine. Hence, we have proposed a model called ‘Smishing Detector’ to identify smishing messages while reducing false-positive results at every possible step. The proposed method consists of four modules, namely, SMS Content Analyzer, URL Filter, Source Code Analyzer and APK Download Detector. SMS Content Analyzer analyzes the text message contents. The Naive Bayes classification algorithm is used to identify the malicious contents and keywords present in the text message. URL Filter inspects the URL to identify malicious features. Source Code Analyzer examines the source code of the website to identify harmful code embedded in it. The form tag and URL domain present in the source code are also inspected in this module. APK Download Detector identifies whether any malicious file is downloaded while invoking the URL. Whether user consent is taken while downloading the file is also inspected in this module. Finally, we have developed a prototype of the proposed system, which has been validated with experiments on SMS datasets. In this paper, we demonstrate the results of each module separately as well as the final results. The results of the experiments show an overall accuracy of 96.29%. We have compared this model with models proposed by other researchers and found that it covers more security aspects than the other models.
© 2020 Elsevier B.V. All rights reserved.

∗ Corresponding author.
E-mail address: [email protected] (S. Mishra).
https://doi.org/10.1016/j.future.2020.03.021
0167-739X/© 2020 Elsevier B.V. All rights reserved.

1. Introduction

The development and growth of Information Technology have led to connectivity through the Internet and increased usage of smartphones. The data shows [1] that smartphones are becoming more and more lavish and smartphone users are accessing the Internet all over the world. Companies like Facebook and Google are striving continuously to make the Internet a more fascinating utility. As technology gets cheaper, the number of smartphone and Internet users will keep on increasing. The rapid growth of Internet users has led to a huge increase in incidents of cybercrimes like phishing and smishing. In 2017, the number of mobile Internet users in India was 351 million. In 2018, the number of users who accessed the Internet through mobile in India was 390 million, and it is expected to reach almost 469 million by 2021 [2]. Currently, the Android operating system by Google is the most popular mobile operating system [3], with a market share greater than 80%. Due to its popularity and lax security, Android-based smartphones are becoming the main target for attackers. User privacy is becoming an important issue on Android devices, as users keep crucial information such as debit card and credit card details on them. Attackers aim to steal users' private information, such as user credentials and financial details, by masquerading as a legitimate entity. Mobile device users are more prone to phishing attacks for the following reasons [4]:

• Users are not able to check the legitimacy of a webpage because URL addresses are not fully visible within the small display of mobile browsers.
• Mobile users lack awareness of the risks they might face as a result of adopting less secure behaviors, and most mobile users lack knowledge about security options that can be utilized to avoid phishing attacks.
• Mobile users are in the habit of entering their credentials whenever asked. Several surveys conducted by other authors observed [5] that 40% of smartphone users enter their passwords into user interfaces at least once. This encourages attackers to come up with fake user interfaces of legitimate websites to obtain the personal and financial credentials of the smartphone user.

Short Message Service (SMS) is very popular among mobile phone users. In India, the usage of smartphones has increased rapidly and people have found text messaging a convenient way to communicate. As text messaging is the most economical option available to reach the masses, it is extensively used for offering purchase points of interest and as an advertising medium. Approximately 9132 lakh (about 913 million) text messages are sent every hour across the world [6]. This increasing use of text messages makes SMS highly attractive to attackers. More than 90% of SMS messages are opened within 3 seconds [6]. This shows that users feel a sense of urgency, which is essential for targeting an attack. SMS phishing is an attack in which an attacker sends the user a text message that contains phone numbers or email ids of the attacker, or links to malicious web pages, applications or user interfaces that prompt the user to enter their credentials. Users reveal significant information such as user id, password and credit card information through these user interfaces. Through this attack, the attackers acquire the user's sensitive information, such as contact numbers and photos, along with monetary gain. Attackers prefer using SMS instead of email for phishing [7] for the following reasons:

• Text messages have higher response rates from users in comparison to emails.
• As sending SMS is an economical option, attackers are able to send SMS in bulk to the victims at reduced rates with a cheap SMS package.
• Users find it difficult to tell whether a URL is genuine or a phishing URL just by looking at the link contained in the text message [7].

SMS phishing is also known as smishing, which is a compound word of SMS and phishing. David Rayhawk of McAfee used the word “SMISHING” for the first time on August 25, 2006. Smishing, or text message-based phishing, involves sending a malicious SMS to the user through which the user is prompted to send sensitive information to the smisher. A smisher is an attacker who uses SMS for phishing. Malicious activities of a smishing SMS are categorized in Fig. 1. A smishing SMS involves any of the following activities [8]:

• Embedded URL: In this technique, the attackers send a text message to victims with a URL inserted into it. An example of a smishing message [9] containing a URL is shown in Fig. 2(a). Once the user reads the message and clicks the URL, the following activities are initiated:
  • The malicious code embedded in the URL installs itself on the victim's phone.
  • Some URLs redirect the user to a malicious phishing website which looks similar to a legitimate website and prompts the user to fill in user credentials or credit card details in the form provided.
  • A *.apk file is downloaded onto the victim's phone, which turns out to be malware and causes malicious activities later on.
• Phone number and/or email address: The attackers send an SMS to the victim announcing free coupons, discounts or free gifts, along with a phone number or email id, so that the victims contact the attacker on that phone number or email address and the attackers gradually request their credentials. Examples of smishing messages [9] containing a phone number and an email id are shown in Figs. 2(b) and 2(c).
• Self-answering SMS: The attackers send a self-answering message to the victim asking the user to subscribe to or unsubscribe from a service. The links in such messages redirect the user to malicious websites.

Fig. 1. Malicious activities of a smishing SMS.

During smishing, attackers design the user interface carefully so that the user is unable to identify the minor differences between a legitimate website and the fake website created by the attacker. They copy the source code of the legitimate website to create a fake web page that looks similar to the legitimate one, but these websites can redirect the user to other malicious links. Attackers also modify the URL to create a fake URL that looks like the URL of a legitimate website. The attacker misplaces a character or misspells the URL to redirect it to a malicious website or file.

The literature survey reveals that previous researchers have proposed a few strategies to distinguish smishing SMS from legitimate SMS. While analyzing the existing methods, we found that most of them used the blacklist approach [10], the whitelist approach and heuristic approaches [11,12] to detect smishing attacks. However, the blacklist method is not efficient, because the URL domains of smishing messages change quickly (the age of the domain is very low) and they use URL shortening services (short URLs). Attackers keep changing the domain name frequently for phishing attacks, which makes the blacklisting technique ineffective. Some researchers have used heuristic approaches, but the features of the SMS content alone cannot determine whether a message is smishing or not. The presence of some keywords and a URL in the message may indicate smishing, but it might be a legitimate message sent by an online shopping website or a mobile service provider. It can be a spam message, but it cannot be categorized as a smishing message unless we find that the URL is redirecting the user to a malicious website trying to steal the credentials of the user, or that it is downloading a malicious file automatically. If we classify such a spam message as a smishing message, that is a false-positive result. A false-positive result can be defined as incorrectly labeling a legitimate SMS as a smishing SMS. It is essential to check the authenticity of the URL contained in a message before categorizing it as smishing [8]. The existing methods proposed by other authors for the detection of smishing messages have some limitations. So, it is essential to come up with a new and efficacious method that identifies smishing messages while avoiding false-positive results.

Hence, we proposed a model called Smishing Detector, based on SMS content analysis and URL behavior analysis, that has low
false-positive results. We have done the content analysis of the smishing SMS and have also analyzed the behavior of the URL included in it. We have analyzed the source code of the URL to check its malicious behavior and re-directions.

Fig. 2. Examples of Smishing messages.

What we propose in this paper:

• A method that can detect phishing initiated through text messaging.
• An embedded APK Download Detector module which verifies the behavior of the URL included in the SMS.
• A Source Code Analyzer module that inspects the legitimacy of the source code and its re-directions.
• A user interface which lets the user select or skip the various steps involved in detecting a smishing SMS if the user is already aware of the legitimacy of the text message.
• To experiment with a real-time application of the system, a prototype of the system is developed and the results of the model are verified using SMS datasets.

The remaining part of this paper is arranged as follows: A summary of the research work done by other authors in this field is discussed in Section 2. Our proposed system is presented in Section 3. The architecture and flowchart of the model are depicted in Sections 3.1 and 3.2 respectively. Section 4 explains the results of the experiments conducted on the prototype of the proposed system. A comparison of the suggested system with other related works is presented in Section 5. The conclusion of the proposed work is given in Section 6.

2. Related work

In recent years, researchers have been focusing more on smishing due to its popularity in mobile attacks. Some researchers have proposed models to detect smishing messages [10,11,13–17]. Some spam detection models which used similar approaches are also discussed here [18,19]. Many researchers have discussed smishing strategies and smishing detection approaches to bring awareness among users and researchers [20–23]. This section gives an overview of the research work presented by other authors in the context of smishing. As smishing is a category of phishing and similar techniques are used for phishing detection, we have also discussed some phishing detection approaches [12,24–28].

In a smishing study by Joo et al. [13], the authors suggested a system called ‘S-Detector’ for identifying smishing messages. S-Detector contains four components, namely, SMS monitor, SMS analyzer, SMS determinant, and Database. This model inspects the URL to analyze whether an .apk file is downloaded or not. If an executable file is downloaded, the SMS is categorized as smishing. They have used the Naive Bayes classification algorithm to analyze the contents of the messages. Another study [14] suggested a rule-based method for identifying smishing messages. The authors identified nine rules to filter smishing messages from legitimate messages. Further, different classification algorithms such as Decision Tree, RIPPER, and PRISM were trained on these rules. The performance evaluation of the approach shows a true negative rate of more than 99%. Sonowal et al. [11] proposed a model named SmiDCA for detecting smishing messages using a machine learning approach. The proposed model extracted 39 features from smishing messages using a correlation algorithm. Machine learning algorithms were used and experiments were conducted on different datasets. Experimental evaluation of SmiDCA showed an accuracy of 96.40% using a Random Forest classifier. Goel et al. [10] proposed a smishing detection system called ‘smishing classifier’ for identifying smishing messages. This system inspects the contents of the SMS and the SMS keywords using a Naïve Bayesian classifier. The proposed framework verifies the existence of a URL in the SMS and also inspects the mobile number of the sender. Further, login page appearance and download of an APK file are also evaluated in this model. A. Kang et al. [15] discussed various types of phishing and smishing attacks. They have proposed a URL validation test to check the authenticity of the URL included in the message. They have also discussed the smishing box approach as a security measure against applications downloaded during a smishing attack. In another work, the author [16] used a content-based approach to detect smishing messages. A machine learning algorithm is used for identifying the most frequently used keywords in smishing SMS. Further, the appearance of the login page and downloading of the .apk file are also evaluated in this model to inspect the maliciousness of the URL. In a recent work, Ankit et al. [17] proposed a feature-based model for detecting smishing messages using a machine learning approach. The proposed model extracted some features of smishing messages and analyzed the results using 5 machine learning algorithms on the same dataset. Experimental evaluation of this model showed an accuracy of 98.74%.

Some authors have proposed spam SMS detection models using similar approaches. In a study by Yadav et al. [18], the authors presented ‘SMSAssassin’, a spam SMS detection model
using the Naive Bayes algorithm and Support Vector Machine (SVM). The authors have used crowd sourcing to keep the spam database updated. Users having the SMSAssassin application on their mobile device can use the spam messages identified on their device to update the Spam Keyword Frequency list and so enhance the performance of the system. The authors have also presented [19] the design and implementation of SMSAssassin, which performs machine learning-based SMS content filtering to distinguish spam messages. They have designed and implemented an Android application which categorizes and handles different kinds of messages, which makes SMS management simple for the user. This application also offers the user an interface to select the notification preference.

Some researchers have discussed phishing and smishing detection approaches to bring awareness among users and researchers. Foozy et al. [20] proposed a taxonomy of phishing detection on mobile devices. They have stated and elaborated various phishing techniques like Bluetooth phishing, SMS phishing, voice phishing, and mobile application phishing. The researchers also compared and evaluated various phishing detection techniques. Hossain et al. [21] elaborated various types of phishing attacks. They have also discussed various phishing mitigation techniques and best policies which should be followed to prevent phishing attacks on mobile devices. Their work is aimed at bringing more security awareness to mobile users. Diksha et al. [22] presented a detailed study of phishing attacks. They have discussed various mobile phishing attacks conducted by attackers and their countermeasures. They have also discussed the taxonomy of phishing solutions, the techniques used by authors and the various challenges involved in the detection of phishing attacks, and they have elaborated various phishing datasets. In a recent work, Sandhya et al. [23] discussed various smishing attacks performed by attackers. The authors have also suggested some policies which can be adopted by the user to deal with smishing attacks. Various techniques and methodologies used for mitigating smishing attacks are also discussed in that paper.

In a study to detect phishing websites, the author [24] proposed a model called ‘PhiDMA’, incorporating five layers. They have also implemented a sample of this system which offers an interface for visually impaired persons. Experimental results have shown an accuracy of 92.72%. Wu et al. [25] proposed a phishing detection technique called “MobiFish” that detects phishing attacks on mobile devices. For mobile applications, they have designed “AppFish” and for mobile web pages, they have designed “WebFish”. The proposed method examines the URL, the IP address of the URL and the HTML source code of the website to detect the maliciousness of the webpage. Zhang et al. [26] proposed a system that uses the features of phishing URLs, such as hosted features and lexical features, to inspect the URL. They have used a machine learning classifier to detect phishing sites on the basis of the selected features. They have shown high accuracy of their proposed system by experimenting with it on multiple datasets. Mohammad et al. [27] suggested a phishing website classification system. They have investigated 17 efficient features for the detection of phishing websites and developed a new rule for each feature. They have considered the frequency of each feature in their datasets to identify the most popular feature in the detection of phishing websites.

A novel system called CANTINA, proposed in [28], is a content analysis method to detect phishing websites using TF–IDF measures. The TF–IDF score is calculated for each word in a document and then the five words with the highest values are selected. A lexical signature is built from the selected words and fed into a search engine. If the domain name of the top N search results matches the domain name of the current web page, the authors declare it a legitimate website, otherwise a phishing website. Sophie et al. [12] suggested a heuristic approach for detecting phishing websites. Based on the characteristics of phishing URLs and web pages, they have defined 20 heuristic tests. They have also developed an active anti-phishing toolbar called Phishark. From the experimental results, they have shown that the combination of URL-based features and HTML-based features is very effective in distinguishing legitimate websites from phishing websites.

The existing methods proposed by other researchers used rule-based approaches [10,13,14,16] and heuristic approaches [11,12] to detect smishing attacks. Rule-based approaches included some rules which covered the flow of smishing attacks and its re-directions. But rule-based approaches do not cover the different variations of smishing attacks. Heuristic approaches analyze the contents of the text message. Applying machine learning algorithms to SMS contents only does not detect the varied smishing trends. We strongly feel that it is necessary to analyze the behavior of the URL present in the text message for the detection of smishing.

3. Proposed work

3.1. Architecture of the proposed system

The proposed model is a smishing detector which comprises four modules, namely, SMS Content Analyzer, URL Filter, Source Code Analyzer and APK Download Detector, as depicted in Fig. 3.

SMS Content Analyzer verifies the presence of a URL, self answering link (SAL), phone number and email id in the SMS. Messages containing a URL or SAL are transferred to URL Filter. Messages containing an email id or phone number are processed for a blacklist check. Then messages are forwarded for text pre-processing. Text pre-processing is conducted to convert the text into a form that can be used for text analysis. It includes removing all punctuation and special strings, converting each word to lower case, splitting the text into tokens, stemming (i.e. converting each word to its root form) and preparing a word vector corpus. Then we categorize the messages on the basis of the keywords present in them. Keywords contained in the message are classified using TfidfVectorizer and a Naive Bayes classifier. TfidfVectorizer maps each word in the message to a feature index in the feature vector matrix. In each vector, the numbers represent the TF–IDF score of each word selected as a feature. These feature vectors are used as input to the Naive Bayes classifier, which predicts the result based on what it has learned from the feature vector matrix.

URL Filter first converts the short URL to a long URL. Then, it looks for the URL in the blacklist. A URL found in the blacklist is categorized as smishing. In the next step, URL Filter verifies four features of the URL, namely, the age of the domain, presence of the @ symbol, presence of a hyphen and the number of dots present in the URL, to check the authenticity of the URL. If the number of features flagged as suspicious is greater than or equal to three (i.e. 75% or above), we categorize the message as smishing. If it is less than three, we pass the URL on to Source Code Analyzer.

Source Code Analyzer verifies the presence of any form tag in the source code. If a form tag is present in the source code, Source Code Analyzer compares the domain of the request URL in the source code with the domain of the actual URL invoked. If the domain is different, the message is classified as smishing. On the other hand, if the domain is the same, the URL is transferred to APK Download Detector.

APK Download Detector checks for any file download triggered by invoking the URL. This check is carried out by the proposed system without visiting the website. The base name of the actual URL is extracted to assess whether it contains the .apk extension. This process does not involve invoking the URL or downloading the APK file. It also
checks whether the file is downloaded with user permission or not. Sometimes files are downloaded after re-direction of web pages on phishing websites. Our model also checks for APK downloading after re-direction of the web pages.

Fig. 3. Architecture of the Proposed System.

Fig. 4. Flowchart of SMS Content Analyzer.

3.2. Flowchart of the proposed system

In this section, we elaborate on the flow of the proposed system. The flowchart of SMS Content Analyzer is shown in Fig. 4, URL Filter in Fig. 5, Source Code Analyzer in Fig. 6, and finally APK Download Detector in Fig. 7.

3.2.1. SMS Content Analyzer

SMS Content Analyzer analyzes the contents of the message as depicted in Fig. 4. Its working is described in the following steps:
Step 1(a): When an SMS is received, it is passed on to SMS Content Analyzer (Fig. 4). Here, the SMS content is analyzed and the system verifies the existence of a URL or SAL (Self Answering Link) in the message. If a URL or SAL is present in the SMS, the message is passed to URL Filter (Step-2(a)).
Step 1(b): If no URL or SAL exists in the SMS, we check for the presence of an email id or phone number in it. If no phone number or email id is present in the message, it is classified as a legitimate message.
Step 1(c): If a phone number or email id is present in the SMS, the system looks up the email id and phone number contained in the message in the corresponding blacklists. If the phone number or email id exists in the blacklist, the message is categorized as smishing.
Step 1(d): If the email id or phone number is not present in the blacklist, the system analyzes the keywords present in the message to classify the SMS using the machine learning algorithm.
Step 1(e): If the algorithm, based on its learning, predicts the message as malicious, the message is classified as smishing; otherwise it is classified as a legitimate message.

As shown in Fig. 1, a smishing SMS contains either a URL, a self answering link, a phone number or an email id. According to our study, without any of these items, an SMS is not expected to cause any malicious activity. The presence of malicious keywords may indicate a spam SMS or an advertising campaign, but it cannot be smishing. So we check for the presence of smishing keywords only when a phone number or email id is present in the SMS. On the other hand, if the keywords are not malicious, it might be a phone number or email id forwarded by a known sender, so it might be a legitimate SMS. Hence, we are trying to eliminate false-positive results at every possible step. If a URL is present in the message, we further analyze the behavior of the URL to come to a conclusion. A sender in the contacts list might forward a malicious link without knowing its maliciousness, so all links forwarded by known contacts also need to be analyzed for maliciousness. We therefore do not consider the factor ‘sender in contacts list’ for our SMS filtering. This helps in avoiding false-negative results.

3.2.2. URL Filter

URL Filter analyzes the URL contained in the message as depicted in Fig. 5. Its working is described in the following steps:
Step 2(a): In URL Filter (Fig. 5), the short URL is converted to a long URL.
Step 2(b): Now, the system checks for the presence of the URL in the blacklist. If it exists in the blacklist, the message is categorized as a smishing message.
Step 2(c): If the URL does not exist in the blacklist, URL Filter verifies four features of the URL, namely, the age of the domain, presence of the @ symbol, presence of a hyphen and the number of dots present in the URL, to check the authenticity of the URL.
Step 2(d): If the number of features flagged as suspicious is greater than or equal to three, the message is categorized as smishing; if it is less than three, the URL is handed over to Source Code Analyzer (Step-3(a)).

In URL Filter, we check for the presence of the URL in a blacklist. The URL blacklist is constructed from PhishTank [29], which is a website where users and researchers can share phishing data. But most of the URLs are not available in blacklists, as attackers keep changing the domains and blacklists are not updated frequently. So we inspect the authenticity of the URL by checking some malicious features of the URL and the age of the domain. The age of the domain is a significant factor for the detection of phishing sites, as most phishing websites have a domain age of less than 6 months because they frequently create new domains. As
shown in Fig. 1, a malicious URL either prompts the user to fill in user credentials or sensitive information like credit card details in a form provided, or it downloads a malicious APK file which might turn out to be malware. So, once URLs are filtered by the URL Filter, Source Code Analyzer first checks for the form tag in the source code to ensure that the attacker is not prompting the user to fill in any form provided on the website. If the form tag is not found, we check for malicious file download activity by passing the URL to APK Download Detector.

Fig. 5. Flowchart of URL Filter.

Fig. 6. Flowchart of Source Code Analyzer.
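
The URL Filter decision in Steps 2(b) to 2(d) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the 183-day cutoff (taken from the "less than 6 months" remark), the dot-count cutoff, and checking the hyphen only in the host part are all assumptions, and the domain creation date is passed in as a parameter rather than looked up. Short-URL expansion (Step 2(a)) is left out.

```python
from datetime import date
from urllib.parse import urlparse

# Assumption: "less than 6 months" is approximated as 183 days.
MIN_DOMAIN_AGE_DAYS = 183
# Assumption: the text gives no cutoff for the number of dots; 3 is illustrative.
MAX_DOTS = 3
# Step 2(d): three or more suspicious features (75% or above) means smishing.
FEATURE_THRESHOLD = 3


def suspicious_features(url, domain_created, today):
    """Return the four Step 2(c) checks as booleans (True = suspicious)."""
    host = urlparse(url).netloc
    return {
        "young_domain": (today - domain_created).days < MIN_DOMAIN_AGE_DAYS,
        "has_at_symbol": "@" in url,
        "has_hyphen": "-" in host,  # assumption: hyphen is checked in the host
        "many_dots": url.count(".") > MAX_DOTS,
    }


def url_filter(url, domain_created, blacklist, today=None):
    """Classify a long URL as 'smishing' or hand it to the next module."""
    today = today or date.today()
    # Step 2(b): a blacklisted URL is smishing straight away.
    if url in blacklist:
        return "smishing"
    # Steps 2(c)-(d): count the suspicious features and compare to the threshold.
    score = sum(suspicious_features(url, domain_created, today).values())
    return "smishing" if score >= FEATURE_THRESHOLD else "source_code_analyzer"
```

For example, a URL with an @ symbol, a hyphenated host, many dots and a one-month-old domain scores 4 and is labeled smishing, while a long-established URL with none of these features is forwarded to Source Code Analyzer.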

3.2.3. Source Code Analyzer


Source Code Analyzer analyzes the source code of the URL
without invoking the URL as depicted in Fig. 6. The entire working
scenario of Source Code Analyzer is drafted in the following steps:
Step 3(a): In the Source Code Analyzer (Fig. 6), the source code of the URL is analyzed to check for the presence of a form tag. If no form tag is present in the source code, the URL is passed to the APK Download Detector.
Step 3(b): If a form tag is present in the source code, the source code is further analyzed to check whether the domain name of the request URL in the source code is different from the domain of the actual URL. If the domain is different, the message is classified as Smishing.
Step 3(c): If the domain of the request URL is the same as that of the actual URL, the URL is transferred to the APK Download Detector (Step 4(a)) to further inspect the behavior of the URL.

Fig. 7. Flowchart of APK Download Detector.
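Steps 3(a) to 3(c) can be sketched with the Python standard library as below. This is a minimal sketch under stated assumptions, not the authors' implementation: the "request URL in the source code" is interpreted here as the `action` attribute of each form, host names are compared exactly (a real check would more likely compare registered domains), and the HTML is assumed to have been obtained already. The names `FormActionCollector` and `analyze_source` are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class FormActionCollector(HTMLParser):
    """Collects the action attribute of every <form> tag in a page."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            # A missing or empty action means the form posts to the page itself.
            self.actions.append(dict(attrs).get("action") or "")


def analyze_source(page_url, html):
    """Return 'smishing' or 'apk_check' following Steps 3(a) to 3(c)."""
    collector = FormActionCollector()
    collector.feed(html)
    if not collector.actions:              # Step 3(a): no form tag found
        return "apk_check"
    page_host = urlparse(page_url).hostname
    for action in collector.actions:
        # Relative actions resolve against the page URL, so they keep its host.
        target_host = urlparse(urljoin(page_url, action)).hostname
        if target_host != page_host:       # Step 3(b): domain differs
            return "smishing"
    return "apk_check"                     # Step 3(c): same domain
```

For example, a page at `http://shop.example.com/login` whose form posts to `http://evil.example.net/steal` would be classified as smishing, while a form posting to `/session` on the same host would be passed on to the APK Download Detector.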
Source Code Analyzer inspects the source code of the target URL without invoking the URL. This avoids any malicious activity by the phishing website until the analysis is completed and malicious URLs are identified. A form tag might belong to the login page of a legitimate shopping website which prompts the user to fill in user credentials. So, the presence of a form tag alone cannot predict maliciousness. Hence, we match the URL domain in the source code with the actual URL domain to check its legitimacy, which again helps us in reducing false-positive results. For most legitimate websites, the domain name is the same in the source code.

3.2.4. APK Download Detector

The flowchart of APK Download Detector is depicted in Fig. 7. The entire working scenario of APK Download Detector is drafted in the following steps:
Step 4(a): Now the APK Download Detector (Fig. 7) extracts the target URL of the actual URL.
Step 4(b): If the .apk extension is present in the target URL, the system checks whether user consent is taken while downloading the file. If user consent is not taken, the message is declared a Smishing message. On the other hand, if user consent is taken, the message is regarded as a Legitimate message.
Step 4(c): If an APK is not downloaded on first access, the system checks whether the .apk extension is present in the target URL after re-direction of the page. If yes, we go to Step 4(b) to check for user consent; if no APK file is downloaded after the re-direction of the page, the message is regarded as a Legitimate message.
Apk Download Detector checks for any file download without invoking the URL to avoid any attack while accessing the URL. If the file is downloaded after taking user consent, it might be

an application of any legitimate website like Flipkart or Pinterest. Hence, if user consent is taken, we regard the message as legitimate. We are also checking for malicious file download after re-direction of the page.

3.3. Popularity of features used in the system

In case a URL is detected in the message, we used some features of phishing websites in order to detect the phishing attack. Some features commonly used by attackers in designing phishing websites are identified. To find out which feature is most popular in designing phishing websites, we calculated the number of appearances of each feature in the phishing dataset [29]. The percentage of appearance, i.e. the frequency of each feature used in our system, is shown in Table 1. The results showed that ‘Difference in domain’ is the most popular feature in designing phishing websites. Form Tag, Age of Domain and APK download also appeared in most of the phishing websites, with a frequency of 96.5%, 93.5% and 56.4% respectively. That is why we have tried to handle the features with higher percentages with utmost care.

Table 1
Popularity of features used in designing phishing websites.
Feature                      Frequency in %
Long URL                     47.2
Age of Domain                93.5
Presence of hyphen (-)       29.4
Number of dots in URL        10.2
Form tag in source code      96.5
Difference in domain         98.3
APK Download                 56.4

4. Implementation results and evaluation

In this section, we describe the implementation details and evaluation results of the proposed system. A prototype of the Smishing Detector is developed using Python in Jupyter Notebook. The system is developed in four parts, namely, SMS Content Analyzer, URL Filter, Source Code Analyzer and APK Download Detector. Further, these four modules are integrated to get a final prototype of the whole system. The system is finally evaluated using the dataset. The dataset is collected from the research work contributed by Almeida [30], a contribution to the study of SMS spam filtering. This dataset contains a total of 5574 messages, of which 4827 are ham messages and 747 are spam messages. As per our research, no smishing dataset is publicly available until now. But a smishing message is a spam message that strives to steal sensitive data from the user. So, we have extracted some smishing messages from the spam dataset, because smishing messages are part of spam messages. Also, we have extracted 284 smishing images from pinterest.com [9]. We have added these messages to the dataset after converting them into text form. Our final dataset is a total of 5858 messages, which contains 538 smishing messages and 5320 ham messages. Smishing detection is a binary classification problem in which a message is either smishing or legitimate.

Fig. 8 depicts that the model has four interfaces: SMS Content Analyzer, URL Filter, Source Code Analyzer and APK Download Detector. SMS Content Analyzer displays the phone number of the sender and the text message received. It provides three buttons for the convenience of the user. If the user is already aware of the genuineness of the message, the user can tap the NEXT button to skip the verification of the particular message and go to the next message, or the user can tap the GO BACK button to go to the previous screen. On the other hand, if the user taps the VERIFY button, the message is evaluated by the SMS Content Analyzer, and the system displays the result and redirects the user to the URL Filter screen. The URL Filter and Source Code Analyzer screens provide the user with the option of skipping that particular step by tapping the SKIP button, as shown in Figs. 8(b) and 8(c). The APK Download Detector module can be skipped by tapping the NEXT button, which redirects the user to the next message. Hence, the user can navigate to the next message either from the SMS Content Analyzer module or from the APK Download Detector module.

Fig. 9(a) shows a legitimate message identified by the proposed system. This interface provides three buttons: READ, NEXT and GO BACK. The message is recognized as genuine by the model; hence, the user is permitted to access the message by clicking the READ button, or the user has the option to tap the NEXT button to go to the next message. The GO BACK button navigates the user to the previous screen. Fig. 9(b) shows a smishing message recognized by the system. As the message is categorized as smishing by the model, the DELETE button deletes the message without navigating to the inbox, and the NEXT button allows the user to move to the next message.

SMS Content Analyzer analyzes the text content of the SMS. Here, we check for the presence of an email id, phone number and URL in the message. We segregated the messages based on their contents. Fig. 10 shows the result of the SMS content analysis. Out of 5858 messages, 5122 messages are segregated and declared as legitimate messages in this stage, as they do not contain any of the potentially malicious contents like an email id, phone number, or URL.

Text pre-processing is done on messages containing an email id and phone number. Text pre-processing is conducted to convert the text into a form that can be used for text analysis. Text pre-processing includes removing all punctuation and special strings, converting each word to lower case, splitting the text into word tokens, stemming (i.e. converting each word to its root form) and preparing a word vector corpus. Fig. 11 shows the results of the text pre-processing module.

After text pre-processing, we categorize the messages on the basis of the keywords included in them. Keywords contained in the message are classified using TfidfVectorizer and machine learning classifiers. TfidfVectorizer converts each word in the message to a feature index in the feature vector matrix, which is used as input to the machine learning classifiers. The machine learning classifier predicts the result based on what it has learned from the feature vector matrix. We have used three machine learning algorithms for classification and comparison purposes, namely, Naive Bayes, Random Forest, and Decision Tree. We have evaluated our dataset for keyword classification using these three algorithms, and we are getting the best results using the Naive Bayes algorithm for this particular dataset.

The model evaluated the performance of three well-known machine learning classifiers, as depicted in Table 2. Fig. 12 shows the performance of the machine learning algorithms. In this figure, the x-axis displays the machine learning algorithms, and the y-axis displays the performance of the algorithms. For the implemented dataset, Naive Bayes gives the best performance and Random Forest gives the lowest performance.

Table 2
Performance of machine learning classifiers.
Classifiers                  Precision   Recall   F1-Score   Accuracy
Naive Bayes Classifier       0.93        0.92     0.92       91.6
Random Forest classifier     0.88        0.82     0.83       82.3
Decision Tree classifier     0.91        0.88     0.89       88.2

Source Code Analyzer analyzes the presence of any form tag in the source code of the URL. It fetches the actual URL of the short URL provided. Then it accesses the source code of the
actual URL without actually invoking the URL. Now, the system searches for a form tag in the whole source code. The result is displayed according to the search outcome obtained. The result of the procedure is depicted in Fig. 13.

APK Download Detector checks for any file download without invoking the URL. Fig. 14 shows the whole procedure of performing an APK Download Check. The first line shows the URL, which is in short URL form. The second line shows the result of fetching the actual link address from the short URL. The third line shows the extracted file name of the file downloaded from the link without taking user consent. If the file is downloaded without user consent, the link is declared as malicious and hence the message is categorized as SMISHING.

We have evaluated our dataset with 5-fold cross-validation, where we partition the data into 5 equally sized subsets or folds. In the first iteration, the first subset is used to test the model and the remaining 4 subsets are used to train the model. In the second iteration, the 2nd subset is used as the testing set while the rest serve as the training set. This process is repeated until each subset of the 5 folds has been used as the testing set. 5-fold cross-validation uses 80% of the data for training, and the remaining 20% of the data for testing. After 5 iterations, our system gave a final accuracy of 96.29%. Cross-validation results are depicted in Table 3.

The results of the prototype show that the model gives an accuracy of 91.6% using the Naive Bayes Classifier in the SMS Content Analyzer module. Further, the model shows a final accuracy of 96.29% after evaluating all four modules. Hence, through our model evaluation, it is concluded that this system is efficient in detecting smishing messages received on mobile devices and thereby protects users from probable threats.

As we have followed a flow-based approach consisting of a set of diverse rules, the attacker needs to breach each rule in a module to reach the next stage. Since we have four modules in our system, it is almost impossible for the attacker to predict the whole flow of the system, to circumvent each of the diverse rules applied in each module, and to reach the final decision stage. Hence, this model is resistant to an attacker who is constantly trying to invent different techniques to circumvent the smishing detection system.

Fig. 8. User interfaces of Smishing Detector.

Fig. 9. Results displayed by Smishing Detector.

Fig. 10. Results of text messages segregation based on its contents.

Fig. 11. Results of SMS text pre-processing done on messages containing email id and phone number.
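The segregation and pre-processing stages described in Section 4 (see Figs. 10 and 11) might look roughly like the sketch below. It is illustrative only: the three regular expressions are crude assumptions rather than the patterns used in the prototype, the function names are hypothetical, and the stemming step mentioned in the text is omitted because the Python standard library has no stemmer.

```python
import re
import string

# Deliberately simple, assumed patterns for illustration.
URL_RE = re.compile(r"(https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s-]{7,}\d")


def has_contact_item(message):
    """True if the SMS contains a URL, an email id or a phone number."""
    return any(p.search(message) for p in (URL_RE, EMAIL_RE, PHONE_RE))


def preprocess(message):
    """Lower-case the text, strip punctuation and split it into word tokens."""
    lowered = message.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return no_punct.split()


messages = [
    "See you at lunch tomorrow!",
    "WINNER! Call 09061701461 to claim your prize.",
]

# Messages without any contact item are declared legitimate at this stage;
# the rest are tokenized for keyword classification.
to_classify = [preprocess(m) for m in messages if has_contact_item(m)]
```

In this sketch only the second message survives segregation, and its tokens would then be turned into a TF-IDF feature vector for the classifier.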

Fig. 12. Performance of Machine learning algorithms for SMS keyword classification.
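The 5-fold partitioning used in the evaluation can be illustrated with a small index-splitting sketch. The helper `k_fold_indices` is hypothetical, and the folds here are contiguous and unshuffled, which is an assumption since the paper does not say whether the data were shuffled or stratified. With 5858 messages, each test fold holds about 20% of the data and the training side about 80%.

```python
def k_fold_indices(n_items, k=5):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


# 5858 is the size of the final dataset reported in Section 4.
folds = list(k_fold_indices(5858, k=5))
test_sizes = [len(test) for _, test in folds]
```

The reported final accuracy would then be obtained by training and scoring the model once per fold and combining the five scores; the per-fold values belong to Table 3 and are not reproduced here.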

Fig. 13. Results of Form Tag check in the HTML Source code by Source Code Analyzer.

Fig. 14. URL Prediction on the basis of APK Downloading.
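The decision logic of the APK Download Detector (Steps 4(a) to 4(c)) can be condensed into a few lines. This sketch assumes that the chain of URLs (the target URL followed by any URLs reached after re-direction) and the user-consent flag have already been determined elsewhere; how the prototype obtains them is not specified in the text, and the function names are hypothetical.

```python
from urllib.parse import urlparse


def points_to_apk(url):
    """True if the path component of the URL ends with the .apk extension."""
    return urlparse(url).path.lower().endswith(".apk")


def apk_download_verdict(redirect_chain, user_consent_taken):
    """Classify a message from its URL chain.

    redirect_chain: the target URL first, then any URLs reached after
    re-direction. Both inputs are assumed to be supplied by the caller.
    """
    if any(points_to_apk(url) for url in redirect_chain):
        # Step 4(b): an APK is involved, so user consent decides the verdict.
        return "legitimate" if user_consent_taken else "smishing"
    # Step 4(c): no APK on first access or after re-direction.
    return "legitimate"
```

Checking the whole chain with `any` covers both the first-access case and the after-re-direction case in one pass, since both lead to the same consent check.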



Table 3
Cross-validation results of the Proposed Approach.

Table 4
Comparison of our proposed system with the existing researches.

Systems compared: S-Detector [13], Rule-Based [14], SmiDCA [11], Feature-Based [17], Proposed System.

Approaches used
  S-Detector [13]:     Rule-Based Flowchart
  Rule-Based [14]:     Rule-Based
  SmiDCA [11]:         Heuristic-based Feature selection
  Feature-Based [17]:  Machine learning classification of SMS Features
  Proposed System:     Rule-Based Flowchart

Classification approach
  S-Detector [13]:     Naive Bayes
  Rule-Based [14]:     Decision Tree, RIPPER and PRISM
  SmiDCA [11]:         Random Forest, Decision Tree, AdaBoost and SVM
  Feature-Based [17]:  Logistic Regression, Neural Network, Random Forest, Naive Bayes and SVM
  Proposed System:     Naive Bayes

Dataset used
  S-Detector [13]:     Not Specified
  Rule-Based [14]:     Almeida [30] spam dataset manually filtered to make a new smishing dataset of 5169 messages of which 362 are smishing messages and 4807 are ham.
  SmiDCA [11]:         Almeida [30] spam dataset of 5574 messages of which 747 are spam messages and 4827 are ham.
  Feature-Based [17]:  Almeida [30] spam dataset manually filtered to make a new smishing dataset of 5169 messages of which 362 are smishing messages and 4808 are ham.
  Proposed System:     Almeida [30] spam dataset manually filtered to make a new smishing dataset of 5858 messages of which 538 are smishing messages and 5320 are ham.

Security requirements                                   [13]   [14]   [11]   [17]   Proposed
Keywords classification                                 ✓      ✓      ✓      ✓      ✓
Presence of URL                                         ✓      ✓      ✓      ✓      ✓
Presence of phone no. and email id in the message       ✓      ✓      ✓      ✓      ✓
Phone no. and email id in blacklist                     X      X      X      X      ✓
URL in blacklist                                        X      X      X      X      ✓
Check for login page                                    X      X      X      X      ✓
Difference in URL domain                                X      X      X      X      ✓
APK download                                            ✓      X      X      X      ✓
APK download after re-direction                         ✓      X      X      X      ✓
User consent while downloading APK                      X      X      X      X      ✓

5. Comparative analysis

We analyzed the existing research related to Smishing and compared the related works based on their security measures. Table 4 presents a comparative analysis of recent research and the proposed system for Smishing detection.

The existing Smishing detection systems proposed by other authors and the proposed system commonly check for the presence of a URL when an SMS is received, and they also perform content analysis of the text message received. But content analysis alone cannot determine the maliciousness of a message unless we further check the maliciousness of the URL involved.

Verification of the presence of a self answering link, phone number and email id is not done in the Smishing detection system proposed by Joo et al. [13]. They have analyzed the URL for download of any APK file but have not checked for the presence of any login page which prompts the user to fill in user credentials. Ankit et al. [14] have followed a rule-based smishing detection system considering the contents of the message, but they have not analyzed the behavior of the URL. SmiDCA, proposed by Sonowal et al. [11], is a heuristics-based system considering the contents of an SMS and the keywords present in it. This system does not check for the maliciousness of the URL, i.e. whether it downloads a malicious file or prompts the user to fill in user credentials in a form provided. As per our study, Smishing is a threat in which an attacker sends an SMS to the user, and that SMS includes links to user interfaces, malicious applications, or web pages that prompt users to enter their credentials. Hence, we cannot classify a message as smishing if the link contained in it is legitimate, even if the message contains malicious keywords. We therefore strongly sense the need for checking the behavior of the URL further to analyze the maliciousness of the link.

Smishing Classifier, proposed by Goel et al. [10], has done similar work, but they have not eliminated the false-positive results. If an APK is downloaded, they classify the message as smishing, but an APK can be the legitimate app of a shopping website like Flipkart. Similarly, a login page and a self-answering link can be legitimate too, as in the case of a message received from Facebook or Amazon asking the user to log in to their website to avail some offers. Hence, we need to further check the legitimacy of the website to avoid false-positive results. If a URL or self answering link is not present in a message, a keyword threshold alone cannot determine the maliciousness because, as per our study, a message without any link, mobile number or e-mail ID cannot cause any malicious activity even if it has high-threshold keywords present in it. Verification of user consent while downloading an APK file, as performed in the proposed system, has not been done by any of the authors till now. It considerably helps to detect malicious files being downloaded on a mobile device. This check also helps to reduce false-positive results.

6. Conclusion and future work

Smishing is a critical attack on mobile devices which is rising in this mobile era. Hence, this paper proposed an efficient model titled Smishing Detector to detect and block Smishing attacks. The proposed model uses SMS content analysis and URL inspection to separate smishing messages from legitimate messages. SMS Content Analyzer is the module that analyzes the contents of the message. URL Filter, Source Code Analyzer and APK Download Detector are the modules that inspect the behavior of the URL contained in the message. We have also developed a prototype of the system using the tkinter package. This prototype provides user-friendly buttons to skip some of the steps involved in the system. Machine learning algorithms are used to classify messages on the basis of smishing keywords. The Naive Bayes classifier shows the best accuracy for the keyword classification in our proposed system for the particular dataset involved. After integrating all four modules, the final prototype showed an accuracy of 96.29%.

A comparison of our model with existing models showed that this system covers more security requirements than other proposed models. This system is expected to deliver more effective security against attacks in terms of identifying smishing messages and preventing false-positive results. The practical implementation of this work can be integrated with the Android platform and can be used as an application to detect smishing messages. This application can be used to identify a smishing message when a message is received, and the message can be discarded or saved based on the result.

In future, we are planning to incorporate more techniques into the proposed model in order to prevent more intelligent and versatile threat methods. This system lacks a way of ensuring the genuineness of the application downloaded in the APK Download Detector module. Hence, to ensure application security, we are also planning to embed a malware detector in the APK Download Detector to identify malicious apps in our future work. This will focus on more research work to provide security against personal information leakage and to detect malicious applications downloaded.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

CRediT authorship contribution statement

Sandhya Mishra: Conceptualization, Methodology, Software, Data curation, Writing - original draft, Visualization, Investigation. Devpriya Soni: Conceptualization, Supervision, Validation, Writing - review & editing.

References

[1] Mobile Internet - Statistics & Facts, Retrieved from https://www.statista.com/topics/779/mobile-internet/.
[2] Number of mobile phone users in India from 2013 to 2019, Retrieved from https://www.statista.com/statistics/558610/number-of-mobile-internet-user-in-india/.
[3] Global mobile statistics, part a: Mobile subscribers; handset market share; mobile operators, 2014, Retrieved from https://mobiforge.com/research-analysis/global-mobile-statistics-2014-part-a-mobile-subscribers-handset-market-share-mobile-operators/.
[4] Daily SMS Mobile Usage Statistics. https://www.smseagle.eu/2017/03/06/sms-mobilestatistics-2/. (Accessed 7 June 2017).
[5] L. Kessem, Rogue Mobile Apps, Phishing, Malware and Fraud, 2012, Retrieved from https://blogs.rsa.com/rogue-mobile-apps-phishing-malware-and-fraud.
[6] https://cmap.amp.vg/web/b3lknalklab1f.
[7] G. Canova, M. Volkamer, C. Bergmann, R. Borza, B. Reinheimer, S. Stockhardt, R. Tenberg, Learn to spot phishing urls with the android nophish app, in: IFIP World Conf. Information Security Education, Springer, 2015, pp. 87–100.
[8] L. Cranor, S. Egelman, Y. Zhang, Phinding Phish: Evaluating anti-phishing tools, in: Proceedings of The 14th Annual Network and Distributed System Security Symposium, February 28–March 2, 2007, 2017.
[9] Pinterest, Smishing message images, November 20 2018, Retrieved from https://in.pinterest.com/seceduau/smishing-dataset/?lp=true.
[10] Diksha Goel, Ankit Kumar Jain, Smishing-classifier: A novel framework for detection of smishing attack in mobile environment, in: NGCT, CCIS 828, 2018, pp. 502–512.
[11] Gunikhan Sonowal, K.S. Kuppusamy, SmiDCA: An anti-smishing model with machine learning approach, Comput. J. 61 (8) (2018) 1143–1157.
[12] Sophie Gastellier-Prevost, Gustavo Gonzalez Granadillo, Maryline Laurent, Decisive heuristics to differentiate legitimate from phishing sites, in: Conference on Network and Information Systems Security, La Rochelle, 2011, pp. 1–9, http://dx.doi.org/10.1109/SAR-SSI.2011.5931389.
[13] J.W. Joo, S.Y. Moon, S. Singh, J.H. Park, S-detector: an enhanced security model for detecting smishing attack for mobile computing, Telecommun. Syst. 66 (2017) 1–10.
[14] Ankit Kumar Jain, B.B. Gupta, Rule based framework for detection of smishing messages in mobile environment, Procedia Comput. Sci. 125 (2018) 617–623.
[15] Anna Kang, Jae Dong Lee, Won Min Kang, Leonard Barolli, Jong Hyuk Park, Security considerations for smart phone smishing attacks, 2014, http://dx.doi.org/10.1007/978-3-642-41674-3_66.
[16] Sandhya Mishra, Soni Devpriya, A Content-Based Approach for Detecting Smishing in Mobile Environment, Suscom, 2019, Available at SSRN: http://dx.doi.org/10.2139/ssrn.3356256.
[17] Ankit Jain, B.B. Gupta, Feature based approach for detection of smishing messages in the mobile environment, J. Inf. Technol. Res. 12 (2019) 17–35, http://dx.doi.org/10.4018/JITR.2019040102.
[18] K. Yadav, P. Kumaraguru, A. Goyal, A. Gupta, V. Naik, Smsassassin: Crowdsourcing driven mobile-based system for SMS spam filtering, in: Proc. 12th Workshop on Mobile Computing Systems and Applications, New York, NY, USA HotMobile, ACM, 2011, pp. 1–6.
[19] K. Yadav, S.K. Saha, P. Kumaraguru, R. Kumra, Take control of your SMSes: Designing an usable spam SMS filtering system, in: IEEE 13th International Conference on Mobile Data Management, Bengaluru, Karnataka, 2012, pp. 352–355.
[20] C.F.M. Foozy, R. Ahmad, M.F. Abdollah, Phishing detection taxonomy for mobile device, Int. J. Comput. Sci. 10 (1) (2013) 338–344.
[21] Hossain Shahriar, Tulin Klintic, Victor Clincy, Mobile phishing attacks and mitigation techniques, J. Inf. Secur. 06 (2015) 206–212, http://dx.doi.org/10.4236/jis.2015.63021.
[22] Diksha Goel, Ankit Kumar Jain, Mobile phishing attacks and defence mechanisms: state of art and open research challenges, Comput. Secur. (2017) http://dx.doi.org/10.1016/j.cose.2017.12.006.
[23] S. Mishra, D. Soni, SMS phishing and mitigation approaches, in: Twelfth International Conference on Contemporary Computing (IC3), Noida, India, 2019, pp. 1–5, http://dx.doi.org/10.1109/IC3.2019.8844920.
[24] G. Sonowal, K. Kuppusamy, Phidma - a phishing detection model with multi-filter approach, J. King Saud Univ. Comput. Inf. Sci. 29 (2017) 1–15.
[25] L. Wu, X. Du, J. Wu, MobiFish: A lightweight antiphishing scheme for mobile phones, in: 23rd International Conference on Computer Communication and Networks, ICCCN, 2014, pp. 1–8.
[26] J. Zhang, Y. Wang, A real-time automatic detection of phishing URLs, in: 2nd International Conference on Computer Science and Network Technology, ICCSNT, IEEE, 2012, pp. 1212–1216.
[27] R.M. Mohammad, F. Thabtah, L. McCluskey, Intelligent rule-based phishing websites classification, IET Inf. Secur. 8 (2014) 153–160.
[28] Yue Zhang, Jason Hong, Lorrie Cranor, Cantina: a content-based approach to detecting phishing web sites, 2007, pp. 639–648, http://dx.doi.org/10.1145/1242572.1242659.
[29] PhishTank – Blacklisted URLs, Retrieved from http://data.phishtank.com/data/online-valid.csv.
[30] T.A. Almeida, J.M.G. Hidalgo, A. Yamakami, Contributions to the study of SMS spam filtering: New collection and results, in: 11th ACM Symposium on Document Engineering, 2011, pp. 259–262.

Sandhya Mishra is currently pursuing her Ph.D. in Mobile Security from Jaypee Institute of Information Technology, Noida, India. She received Master of Computer Applications from Guru Gobind Singh Indraprastha University, Delhi, India. Her research interest includes Cyber security, Mobile security, Smishing and Phishing Detection, Web security, Social Media Network and Machine Learning.

Dr. Devpriya Soni is presently working as Associate Professor in Jaypee Institute of Information Technology, Noida, India. She has received her Ph.D. degree from Maulana Azad National Institute of Technology (MANIT), Bhopal. She is the Board Member for conference proceedings and member of review committee for journals. She is a Remote Center Coordinator for ISTE workshops organized by IIT, Bombay and conducted several workshops as RC coordinator. She is also recognized as the Aakash Project Coordinator for the remote center of IIT, Bombay and guiding many undergraduate level projects for Aakash tablet. She is also serving as a Reviewer of several Ph.D. thesis and was invited as jury member, examiner in Skema University (Lille, France). She has Guided several projects at undergraduate and post graduate level. Her research interests include mobile security, Mobile Application Development, Information Retrieval and Data Mining, Software Engineering, Data Science.
