Neural Computing and Applications (2023) 35:4975–4992

https://doi.org/10.1007/s00521-021-06305-y

S.I.: MACHINE LEARNING APPLICATIONS FOR SECURITY

DSmishSMS-A System to Detect Smishing SMS


Sandhya Mishra1 • Devpriya Soni1

Received: 4 January 2021 / Accepted: 1 July 2021 / Published online: 28 July 2021
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2021

Abstract
With the origin of smart homes, smart cities, and smart everything, smart phones came up as an area of magnificent growth
and development. These devices became a part of daily activities of human life. This impact and growth have made these
devices more vulnerable to attacks than other devices such as desktops or laptops. Text messages or SMS (Short Message
Service) are a part of smartphones through which attackers target the users. Smishing (SMS Phishing) is an attack
targeting smartphone users through the medium of text messages. Though smishing is a type of phishing, it is different
from phishing in many aspects like the amount of information available in the SMS, the strategy of attack, etc. Thus,
detection of smishing is a challenge in the context of the minimum amount of information shared by the attacker. In the
case of smishing, we have short text messages which are often in short forms or in symbolic forms. A single text message
contains very few smishing-related features, and it consists of abbreviations and idioms which makes smishing detection
more difficult. Detection of smishing is a challenge not only because of features constraint but also due to the scarcity of
real smishing datasets. To differentiate spam messages from smishing messages, we are evaluating the legitimacy of the
URL (Uniform Resource Locator) in the message. We have extracted the five most efficient features from the text messages
to enable the machine learning classification using a limited number of features. In this paper, we have presented a
smishing detection model comprising two phases, Domain Checking Phase and SMS Classification Phase. We have
examined the authenticity of the URL in the SMS which is a crucial part of SMS phishing detection. In our system, Domain
Checking Phase scrutinizes the authenticity of the URL. SMS Classification Phase examines the text contents of the
messages and extracts some efficient features. Finally, the system classifies the messages using Backpropagation Algorithm
and compares results with three traditional classifiers. A prototype of the system has been developed and evaluated using
SMS datasets. The results of the evaluation achieved an accuracy of 97.93% which shows the proposed method is very
efficient for the detection of smishing messages.

Keywords Smishing · Phishing · Paytm SMS scam · Mobile security · Machine learning · Backpropagation Algorithm · Cyber security · Covid-19 SMS Scam
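
The two-phase design summarized in the abstract can be illustrated with a minimal routing sketch. It assumes that a message containing a URL is inspected by the Domain Checking Phase and that any other message goes to the SMS Classification Phase. The regular expression and the name `select_phase` are illustrative assumptions, not the authors' implementation.

```python
import re

# Assumed, simplified URL pattern: matches http(s):// links and bare "www." hosts.
# Shortened links written without a scheme (e.g. "host/path") are NOT covered here.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def select_phase(sms_text: str) -> str:
    """Return the name of the phase that would inspect this message."""
    if URL_PATTERN.search(sms_text):
        # Messages with a URL are checked for URL legitimacy.
        return "domain_checking"
    # Messages without a URL are judged on their text content alone.
    return "sms_classification"


if __name__ == "__main__":
    print(select_phase("Your KYC has expired, renew at http://example.test/kyc"))
    print(select_phase("Your KYC has expired, call the number in this SMS"))
```

Keeping the routing decision in one small function makes it easy to swap in a stricter URL detector later without touching either phase.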

Sandhya Mishra
[email protected]

Devpriya Soni
[email protected]

1 Department of Computer Science & Engineering and Information Technology, Jaypee Institute of Information Technology, Sector-128, Noida, India

1 Introduction

Smishing is a mobile security issue that has recently become increasingly prevalent all over the world. It is a phishing attack initiated through text messaging to deceive users. Smishing is a union of ‘SMS’ and ‘phishing’ in which the attacker uses text messaging instead of email [1]. It is a fraudulent attack in which a text message pretending to be a genuine message is sent to the mobile user. This text message often contains a URL that redirects the user to malicious websites. For instance, a smishing message pretending to be from your bank might ask you for personal or financial information through a link provided in the message. The text in the SMS prompts the user to click the link in the SMS. This link in turn will ask the user to provide his/her sensitive information in exchange for unblocking the debit/credit card, getting some gifts or discounts, etc.

Mobile users are less aware of the security risks associated with smartphones. Most of them assume that mobile
devices are more secure than computers, but smartphones are at high risk because their platform offers greater flexibility to cybercriminals. As the number of mobile users increases, attacks against mobile devices are also skyrocketing. Another risk factor associated with mobile devices is that people often use them on the go and read text messages when they are in a hurry. This leads users to respond to SMS and click on malicious links without considering whether they are malicious.

Attackers use several communication mediums to connect with the victims such as text messages, email, WhatsApp messages, and phone calls [2]. However, SMS is a medium to communicate with mobile users without the internet. The report of Statista [3] reveals that the number of smartphone users in the world was 2.9 billion in 2018, and it would be approximately 3.8 billion in 2021. According to CallHub, the response rate of text messages is much higher in comparison with email [4]. This prompts attackers to use text messages as a medium to communicate with the users. For attackers, sending text messages is less expensive because they can send a significant number of messages to the users [5] with a single SMS package. Smishers (smishing attackers) aim to obtain the user’s personal or financial data, for which they might use two methods. They might trick users into downloading malware which will in turn send sensitive information to the attacker. On the other hand, the link in the text message might redirect the user to a fake website which in turn asks for user-id, passwords, and financial details.

Recently, the Paytm smishing scam alarmed smartphone users by defrauding them of their hard-earned money. In this smishing scam, fraudsters pretending to be from Paytm send fraudulent messages to smartphone users. These smishing messages include malicious texts which give an impression to the Paytm user that their KYC (Know Your Customer) is expired and needs to be renewed. These messages contain either a phone number, email id, or URL. The text in the message prompts the user to contact the attacker to avoid getting their account blocked. When the user contacts the attacker, they later ask the user to download an application through which the attacker gets access to the user’s device remotely. In turn, attackers get all the sensitive details entered by the Paytm users, and attackers use these details to carry out fraudulent financial transactions.

Many researchers have already worked on the smishing problem and have identified several methods to detect malicious text messages. Some of them have used heuristic methods in which researchers select some features of the text messages with the aid of classification algorithms and categorize the SMS based on these features, but these methods do not inspect the URL present in the SMS and its redirections. A message might contain suspicious features, but if it does not contain a URL to redirect the user to malicious websites, then it cannot bring much harm to the user until and unless the user contacts the attacker on the phone number provided in the message. Some researchers have used blacklist methods, but this method is not efficient because blacklists are not updated frequently. Attackers change the domain name of their website more frequently. Hence, blacklists are unable to track the frequently changing patterns of attackers. Some authors have used blacklists along with various other rules to examine the maliciousness of the message. Other smishing detection approaches rely on complex methods. Hence, we strive to propose a less complex and less time-consuming system to check the genuineness of the URL present in the SMS.

Though smishing is considered to be a type of phishing, phishing can be detected using the information available in the phishing email or in the phishing website. In case of smishing, the strategy of attack is different and we are left with very little information shared by the attacker in the short text messages. Moreover, attackers use idioms, emoticons, abbreviations, and leet words in the SMS. Hence, smishing detection becomes a difficult task considering the minimum amount of information we have to design the smishing detection strategies. The URL included in the text message is also a short URL. Some of the challenges faced in the detection of smishing SMS are listed below:

• The abbreviated form of text messages makes it difficult to analyze the maliciousness of the message. This leads to a limited number of features extracted from the message, and hence, the identification of malicious SMS becomes difficult.
• Leet words, idioms, and misspelled words are used in the text message, which complicates the identification of smishing keywords.
• Spam messages contain a similar set of features in comparison with smishing messages. Hence, differentiating between spam messages and smishing messages is a tedious task.
• The scarcity of real-time, public smishing datasets makes it a challenge to evaluate the smishing detection systems.

To address the above-mentioned challenges, we have built a smishing detection system and evaluated it using real-time datasets. The description of the dataset is given in Sect. 5 of this paper. To differentiate between spam messages and smishing messages, the legitimacy of the URL in the message is evaluated. A URL in the message that redirects the user to malicious phishing websites confirms the maliciousness of the message. This is a crucial step in the detection of smishing within the messages. Also, leet
words and misspelled words are selected as a smishing heuristic in our system to enhance the detection capability. Leet words and misspelled words are often used by the attackers to deceive the users. On the other hand, a legitimate message sent by a legitimate organization never contains leet words and misspelled words. We have extracted the five most efficient features from the text messages to enable the machine learning classification using a limited number of features. Also, we are analyzing the authenticity of the URL to avoid high false-positive rates, i.e., the number of legitimate messages categorized as smishing, in our system.

The motivation of this paper is to develop an efficient and less complex smishing detection model using some less laborious steps. At the same time, we are checking the authenticity of the URL along with analyzing the contents of the message. To address smishing, the most popular mobile security issue, we have performed a case study on the Paytm smishing scam and developed a system that detects smishing. In this system, we have used Backpropagation Algorithm to classify the text message. The system checks for URLs and extracts a few features from the message. Text pre-processing is done, and nouns are selected from the message to form a signature that is provided to the search engine. The domain name of the URL is also extracted to create the signature. The system consists of two phases, Domain Checking Phase and SMS Classification Phase. A model of the proposed system is also implemented to evaluate the results. We propose the following components in this paper:

• A system to detect smishing.
• A case study is performed on the Paytm smishing scam.
• A two-step Domain Checking technique is developed to check the legitimacy of the URL.
• The Backpropagation Algorithm is studied and implemented using real-time datasets.
• A model of the system is implemented, and its evaluation using SMS datasets is presented.

The rest of this paper is arranged as follows: Background study of this research work is explained in Sect. 2. The novel system suggested in this paper is elaborated in Sect. 3. Challenges faced while doing this work and comparison are discussed in Sect. 4. Experimental details and results are shown in Sect. 5, and finally, the conclusion is given in Sect. 6.

2 Background study

Background study about smishing revealed that smishing-related studies fall into two categories, studies on smishing and studies on phishing. We have also included research papers which implemented neural network and Backpropagation Algorithm for SMS Classification. Moreover, the research works are also categorized based on the detection techniques used. For smishing and phishing detection, researchers have used various methods like content-based methods, flow-based methods, heuristic-based methods, and signature-based filtering methods. The smishing-related research works are also categorized based on the machine learning algorithms used for detection of smishing. Traditional machine learning algorithms are used in some research works, whereas neural network is implemented in other research works for SMS Classification.

2.1 Smishing detection

Some researchers have conducted studies to bring awareness among users and researchers about smishing. These research works include a study about SMS Phishing attacker strategies, detection techniques, approaches, and various policies that can be followed to mitigate smishing. In the research paper [1] titled ‘SMS Phishing and Mitigation Approaches,’ the author focused on various strategies followed by researchers to detect and prevent smishing. This paper also discusses various approaches that should be adopted by users to prevent smishing attacks. It also elaborated on various techniques used by the researchers to detect smishing messages. Their work is focused on bringing awareness among users about smishing attacks.

The heuristic-based classification system is also used by researchers to classify the SMS using machine learning algorithms. In these studies, researchers extract some set of features from the dataset and classify the SMS based on these features. Ankit et al. [6] proposed a heuristic-based algorithm to detect smishing messages. The authors selected 10 features out of smishing messages and classified the messages based on these selected features with the help of classification algorithms. Authors have experimented with their approach on a manually processed dataset. Their evaluation results showed an accuracy of 98.74%. SmiDCA [7], a smishing detection technique proposed by Sonowal et al., showed an accuracy of 96.4%. The authors selected 39 features of smishing messages and then used dimensionality reduction to reduce the number of features and to select the 20 best features. They have also used the correlation algorithm and showed good accuracy in their experiment. In a recent work [8], the authors proposed a smishing detection system using machine learning algorithms. They have used Support Vector Machine (SVM), Random Forest (RF), and Logistic Regression (LR) for the classification part of their system and reported a 92.7% F1-score using Support Vector Machine. In the latest research work [9] proposed for
smishing detection, the authors used four correlation algorithms, namely Pearson rank correlation, Spearman’s rank correlation, Kendall rank correlation, and Point biserial rank correlation to rank the features. Finally, the best feature set is selected for smishing detection, and an accuracy of 98.40% is reported.

Flow-based approaches are used to build a system for smishing detection which is arranged in layers. These layers are typically depicted as flowcharts. These systems are built to extract content-based features from the SMS to facilitate their classification. In another research paper titled S-detector [10], the authors followed a flow-based approach to detect smishing messages. The system flow is decided based on whether a URL exists in the SMS or not. If the URL exists, APK download criteria are analyzed. They have also done the keyword classification using classification algorithms in case the URL does not exist in SMS. In a study to recognize smishing messages [11], the authors presented a content-based approach. Whether the user is prompted to fill a form for revealing his credentials is inspected in this approach. Also, whether an executable file is downloaded into the mobile device on clicking the link is inspected. Finally, the message is categorized as smishing or legitimate. Diksha et al. [12] suggested a smishing detection model named ‘smishing classifier’ in which the system analyzes the content of the message and segregates smishing keywords using the Naïve Bayes Algorithm. In this work, the author checks the existence of any link in the SMS and also scrutinizes the phone number of the SMS sender. In addition to this, the appearance of an interface to fill credentials and APK downloading to devices is also evaluated in this model. Ankit et al. [13] presented a smishing detection system using some set of rules. The authors selected nine rules from the smishing dataset. Then, they have used three algorithms, namely Decision Tree, RIPPER, and PRISM to classify the messages. The implementation results of the technique presented good accuracy. In the latest work, the author proposed a research paper [14] titled ‘Smishing Detector’ in which they followed a content-based approach having four modules. They analyzed the content of SMS in the SMS content analyzer module and then inspected the authenticity of the URL through URL filter, Source code analyzer, and APK download detector modules. They have shown an accuracy of 96.2% in the experiment conducted.

2.2 Phishing detection

Some researchers have conducted studies to bring awareness among users and researchers about phishing attacks. These research works include a study about phishing attacker strategies, detection techniques, approaches, and various policies that can be followed to mitigate phishing. In the research work presented in paper [15], “Phishing-challenges and solutions,” the author discussed phishing challenges and their solutions. The author discussed the phishing solution in three steps, Prevent Phishing, Detect Phishing, and Stakeholder Training. The author also discussed various challenges involved in the detection of phishing. Diksha et al. proposed a research paper [16] focusing on phishing and its related areas. This paper discusses all categories of phishing and the various techniques used by researchers in this field to detect phishing. Various categories of phishing like smishing and vishing are elaborated in this paper. It also discusses detection techniques and approaches relevant to the phishing area. Anna et al. proposed a paper [17] in which they discussed various mobile security attacks such as phishing, voice phishing, and smishing. They focused on smishing and its problems and preventive measures in this paper. Foozy et al. proposed a paper [18] discussing various phishing identification techniques on a smartphone device. They have detailed phishing attacks like Bluetooth phishing, smishing, and vishing. This paper also differentiated among several phishing detection techniques. Hossain et al. proposed a research paper [19] in which the author focused on phishing attacks and their categories. In addition to this, phishing mitigation techniques and best policies to avoid phishing attacks on Android devices are also discussed in this paper. Their work is focused on bringing more awareness among users about phishing attacks. In a recent study [20] conducted by researchers, an overview of Artificial Intelligence (AI) techniques used for phishing detection is discussed. This paper conducts a comparison of different phishing detection techniques and highlights the pros and cons of these techniques. This paper also discusses the challenges of phishing detection and its future directions.

Some researchers have used filtering techniques to detect phishing. The phishing dataset is filtered into different categories during each stage of filtering. PhiDMA [21] proposed by Sonowal et al. is a phishing detection system using a multi-filter approach: whitelist filter to check the URL in the whitelist, URL feature filter to check the malicious features of the URL, lexical signature filter to create a signature from the words present in the webpage, string matching filter for matching the URLs obtained from a search engine, and accessibility score filter to compare the accessibility score. This paper also presented an interface for visually impaired persons.

The heuristic-based classification system is also used by researchers to classify the phishing URL using machine learning algorithms. In these studies, researchers extract some set of features from the dataset and classify the URL based on these features. An anti-phishing model [22]
proposed by authors suggested a phishing detection system in which they categorized phishing features into various categories like features related to the address bar, features related to HTML and JavaScript, and features related to the domain name of the URL, and listed each one of them. They have also shown the popularity of their features in detecting phishing websites. Zhang et al. [23] proposed a technique that selected some features from the phishing dataset and then categorized them as hosted features and lexical features. A classification algorithm was also used to identify the phishing URLs based on these features. Their method has shown good accuracy when examined on various datasets. Authors [24] proposed a phishing detection approach called CANTINA+ in which they have used eight features, HTML DOM, and search engines. In addition to this, a near-duplicate filter is presented which identifies highly similar phish. A login form filter that categorizes web pages without a login form as legitimate is also included in this paper.

The latest research work proposed by Ankit et al. [25] implements a search engine-based technique and some heuristics to detect a phishing attack. They have extracted their search query from the URL and used this query to search for a legitimate website. They have also extracted some heuristics from the source code of the URL like login form, input tag, etc. Their experiments showed 99.05% accuracy on phishing data consisting of 2000 phishing URLs and 2000 legitimate URLs. In a recent work [26], authors proposed a phishing detection system that utilizes eight machine learning algorithms to detect phishing. They have used three different datasets for the evaluation of their work. The final evaluation and comparison of their work has shown notable performance in the detection of phishing URLs. Lokesh et al. [27] proposed a phishing detection system that uses the wrapper-based method for feature extraction, and some effective features are extracted. For the final classification, they have used various machine learning algorithms like Random Forest, K nearest neighbors, Decision Tree, and SVM. A comparative study of their work has shown that their system is efficient in the detection of phishing URLs. In a recent work [28] presented by Saravanan et al., feature extraction is done using the phishing dataset obtained from PhishTank. This feature vector is further forwarded to the module GenFea which conducts feature reduction, and the best feature set is obtained. This feature set is again treated by the module PhiDec to identify the maliciousness of the website. The evaluation results of the system proved that their system is efficient in phishing detection.

The signature-based detection system is used by some researchers for phishing classification. In this detection technique, a signature is formed using a few features of the URL which is used to check the legitimacy of the webpage. CANTINA [29] is another method proposed for detecting phishing URLs in which they have formed a signature using words collected from the website with the help of the TF-IDF algorithm. They have also selected a few heuristics of the URL to get better accuracy. Finally, they have presented four experiments to show the comparison of the features they have used.

Flow-based approaches are used to build a system for phishing detection which is arranged in layers. These layers are typically depicted as flowcharts. Wu et al. presented MobiPhish [30], an anti-phishing model in which they have presented two interfaces, WebPhish and AppPhish, for detecting phishing in webpages and mobile apps, respectively. They have used techniques like matching the domain name in the whitelist, checking whether an IP address is used instead of the domain name, and checking for sensitive terms in the text included in the web page.

2.3 Implementation of neural network for SMS classification

SMS Classification includes both spam detection and smishing detection. Both of these studies include a similar set of features and detection techniques. Spam messages are messages sent by companies and business organizations for advertisement purposes and thereby increasing their revenue. On the other hand, smishing messages are spam messages which contain a malicious URL, attacker’s phone number, or email id which redirect the user to malicious websites. Hence, smishing messages are a part of spam messages, but they are more harmful to mobile devices. The latest work published by Ankit et al. [31] proposed a system for identifying spam and smishing messages from the same dataset. First, they have segregated the spam messages and legitimate messages based on some features. Then, they have segregated the smishing messages out of the spam messages based on a new set of features. They have reported an accuracy of 96% on the neural network. In a research paper authored by Ghourabi et al. [32], the author proposed a model for the detection of spam messages. They have implemented the system combining two methods, CNN and LSTM. The combination of these two methods gave them an accuracy of 98.37%. The proposed model detects both Arabic and English spam messages. In the comparative study with other machine learning algorithms, it is shown that the CNN-LSTM model gives better accuracy. In a research work proposed by Roy et al. [33], the author implemented a deep learning-based model using CNN and LSTM along with traditional classifiers such as Naive Bayes and Random Forest. This model is intended to differentiate between spam and non-spam text messages. They have implemented CNN in three
phases, creating a word matrix, identifying the features from messages, and classifying them into predefined classes. They have shown that the CNN-based model gave an accuracy of 99.44% on tenfold cross-validation. In a novel system proposed by Sheikhi et al. [34], the model is presented in two stages: feature extraction and decision making. Some features of spam messages are extracted in the first stage; then, messages are classified in the decision-making stage. For this classification, they have used an averaged neural network. The avNNet (Averaged Neural Network) used for the classification consists of one hidden layer. The experimental analysis of the system has shown an accuracy of 98.8%. In a research paper [35] proposed by Nandita et al., the Multilayer Perceptron model is used for the classification of spam emails. The Backpropagation Algorithm is used to train the network and to calculate its gradient. They have changed the learning rate in every iteration for achieving faster convergence. The four models implemented by them have shown an average accuracy of 95%.

3 Research work

In this research work, we have focused on proposing a novel yet less complex technique for the detection of smishing messages. We have conducted a case study on various smishing messages to get better clarity about the attacker’s strategy of targeting the users. To get an enhanced view of the scenario of smishing, we have conducted a detailed study about the Paytm smishing scam. The Paytm smishing scam is initiated by the attacker sending smishing messages to Paytm users. These attackers, masquerading as Paytm officials, inform the user that their KYC has expired and needs to be renewed, else their account will get blocked in 24 h. Hence, this message prompts the Paytm user to contact the fraudster immediately through a link or a phone number provided in the message. When the user contacts the attacker, the credentials and sensitive financial details of the user are asked through a user interface provided which looks similar to the Paytm website. These details are visible to the attacker in plain text after they have been entered and submitted by the user. The fraudster would immediately access the user’s Paytm account or connected bank account using these details. Sometimes, the attacker asks the user to download an application through which the attacker can view and access the user’s device remotely. Through this method, they extract sensitive details like card number, CVV (Card Verification Value), OTP (One Time Password), etc., to access the credit card or debit card of the Paytm user. In this way, attackers can do fraudulent financial transactions and in turn, Paytm users lose their hard-earned money.

Figure 1 shows a Paytm scam message in which attackers used leet words to defraud the users. In this, some leet words, i.e., numerals that look like letters, are used to create the illusion that it is a genuine message while not copying the exact brand name of a genuine organization. Leet words are often used by attackers to bypass word filtering and to prevent smishing messages from being discovered via keyword search. Mobile users, often in a hurry, do not notice such minor differences in brand names or URL domain names, and in turn, they click on the fake links or call on the mobile number provided.

Fig. 1 A smishing message showing Paytm scam

We have made two observations by conducting this case study. One is that attackers often use mobile numbers in the smishing messages, but these types of messages are often harmless until and unless the user contacts the attacker on the phone number provided. Hence, it is strongly advised not to contact the phone numbers and email ids provided in unknown messages. Even if the user urgently needs to contact the sender considering it as a genuine message, then the user should search for genuine phone numbers and email ids from legitimate websites. Text messages containing the phone number and email id are extracted and processed separately by our system. Our system performs the content analysis of the messages containing the phone number to assess its maliciousness. We have extracted and analyzed these features in SMS Classification Phase. The second observation is that attackers use leet words to deceive the users. In the message shown in Fig. 1, they are using a numeral ‘2’ in the brand name Paytm. Mobile users, often in a hurry, do not notice these minor differences in the brand names. Attackers also used the numeral ‘0’ instead of ‘O’ in the word ‘blocked,’ which indicates a leet word. Leet words are often used by attackers to give a genuine effect to malicious texts, URLs,
and messages. If we feed these leet words into Google search engine, we often get zero search results. Instead, if we put genuine brand names in the Google search engine, we get the genuine website of the legitimate brand in the top search results. Hence, zero search results in Google are counted as smishing by our system. Sometimes, the Paytm smishing message contains a URL. A Paytm smishing message containing a URL to defraud the user is shown in Fig. 2. When the Paytm user clicks on the link, his/her sensitive financial details are requested through a user interface that looks similar to the Paytm website. The details entered and submitted by the user on this website are visible to the fraudster in plain text. To detect this type of fraudulent activity, we have processed the messages in the Domain Checking Phase, which inspects the genuineness of the URL present in the message.

Smishers are exploiting the Covid-19 situation to commit fraud by using scam text messages imitating health departments, banks, and other trusted organizations. It is always advised to contact the departments on their phone number or email that is given on their official website. Attackers are trying to take advantage of people’s panic and fear in the face of the COVID pandemic. A smishing message showing the Covid-19 scam [36] is shown in Fig. 3. This particular phishing SMS is prompting people to click on a link to know about new symptoms and test locations. Instead, the link leads users to a malicious website that steals their sensitive information such as credit card information.

Covid-19-related smishing messages often include a URL that redirects the user to fake websites. They prompt the user to provide their sensitive data in the form provided on the website. Hence, our system carefully analyzes the maliciousness of the URL provided in the SMS to assess its authenticity and thereby predict the final result. We have inspected the messages in Domain Checking Phase in our system to deal with this issue.

Figure 4 depicts the overall working of the proposed model. The system is arranged in two phases, Domain Checking Phase and SMS Classification Phase. Domain Checking Phase inspects messages containing a URL, and SMS Classification Phase focuses on messages containing an email id and phone number. DsmishSMS Algorithm is presented in Algorithm 1. First, the text pre-processing is done. Text pre-processing means bringing the text into a form that is analyzable for the piece of work. Text pre-processing is a crucial step for Natural Language Processing. It includes tokenization, lowercasing the text, stop word removal, stemming, and lemmatization.

3.1 Domain checking phase

If a URL is detected in the message, it is inspected in Domain Checking Phase. When a URL is detected in the SMS, the system extracts the domain name of the URL. It also extracts all nouns present in the message. It forms a


signature by using all nouns and the domain name. This


signature is provided to the Google search engine. The top
5 search results of the Google search engine are selected
and compared with the current URL. If the domain name
matches, the message is declared as legitimate else the
messages are transferred for the second-level domain
checking. Domain Checking Algorithm is presented in
Algorithm 2.
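
The first-level comparison described above can be sketched as follows. This is a minimal illustration, not the authors' Algorithm 2: the search step itself is left out, so `search_result_urls` is assumed to be a list of result URLs already returned by a search engine for the signature, and the helper names `extract_domain` and `first_level_check` are hypothetical. Domains are compared as plain host names with a leading `www.` removed, which is a simplification.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lowercase host name of a URL, without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse only finds the host when a scheme is present
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def first_level_check(sms_url, search_result_urls, top_n=5):
    """Compare the SMS URL's domain with the domains of the top search results."""
    top_results = search_result_urls[:top_n]
    if not top_results:
        # Zero search results are counted as smishing in the text.
        return "smishing"
    sms_domain = extract_domain(sms_url)
    if any(extract_domain(result) == sms_domain for result in top_results):
        return "legitimate"
    # No match: hand the message over to second-level domain checking.
    return "second_level_check"
```

For example, `first_level_check("http://paytm.com/kyc", ["https://www.paytm.com/"])` yields `"legitimate"`, while an empty result list yields `"smishing"`.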

Fig. 2 A URL-based smishing message showing Paytm scam

The system extracts the value of the following features from the text message:

Noun Fraudsters send text messages masquerading as authentic, genuine organizations. Hence, these messages contain the brand names of genuine organizations. Nouns in the message represent the brand names of genuine organizations. Hence, nouns in the message can be included in the signature to find the website of the genuine brands. These nouns and domain name will help us in finding the genuine website of the authentic organization. Our system used the Natural Language Tool Kit (NLTK), which is a Python package, to extract nouns used in the text message. NLTK provides POS (parts of speech) tagging, which separates the words into parts of speech like nouns and verbs. Here, we extract the nouns from the message using the POS tagging of the NLTK package.

Domain Name The system extracts the domain name of the URL contained in the text message. If the URL is short, it is converted to a long URL and then, the domain name is extracted. Later, this domain name is used with nouns to form the signature.

Signature A signature is formed using all the nouns extracted from the text message and the domain name of the URL. This signature is fed to the Google search engine.

Google advanced search operators are special commands that can be used to perform effective retrieval. We used these retrieval operators to develop our search command. The search command used in our system is specified below:

In the above command, inurl and allintitle are the commands. Signature, domain name, and nouns are the values extracted from the text message.

inurl This command will return all results containing the specified word in the URL. Through this command, search results can be reduced drastically.

allintitle This command will find pages with all of the specified words in the title tag.

This command facilitates the detection of the website with the exact match of the domain name in the URL and the nouns provided in the title of the website. Most of the legitimate websites have their brand name in the title of the website.
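
To make the signature step concrete, the sketch below builds a signature from a list of nouns and a URL and then composes a query string with the `inurl:` and `allintitle:` operators. The exact search command is not reproduced in this text, so the query format is an assumption inferred from the description (domain name in the URL, nouns in the title) and should not be read as the authors' command. The nouns are assumed to have been extracted beforehand (for instance by POS tagging), and the function names are hypothetical.

```python
from urllib.parse import urlparse


def extract_domain(url):
    """Return the lowercase host name of a URL, without a leading 'www.'."""
    if "://" not in url:
        url = "http://" + url  # urlparse only finds the host when a scheme is present
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_signature(nouns, domain):
    """Signature = nouns (case-insensitive duplicates dropped) plus the domain name."""
    seen = set()
    unique_nouns = []
    for noun in nouns:
        key = noun.lower()
        if key not in seen:
            seen.add(key)
            unique_nouns.append(noun)
    return unique_nouns + [domain]


def build_query(nouns, domain):
    """Assumed query shape: domain restricted to the URL, nouns restricted to the title."""
    parts = ["inurl:" + domain]
    if nouns:  # with no nouns, only the domain name is searched
        parts.append("allintitle:" + " ".join(nouns))
    return " ".join(parts)
```

With nouns `["Paytm", "KYC"]` and the URL `https://www.paytm.com/kyc`, this sketch produces the query `inurl:paytm.com allintitle:Paytm KYC`.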


Fig. 3 A text message showing Covid-19 scam

The system matches the domain name of each search result with the domain name of the URL obtained from the text message. We repeat this step for the top 5 search results. If the string matching gives favorable results, the system declares the message as legitimate SMS else the message is forwarded for the second-level domain checking.

Phelps and Wilensky [37] proposed the idea of providing a few selected terms, called lexical signature, to a search engine for finding URLs. Their empirical studies suggested that it is sufficient to use about five terms to create a lexical signature for identifying a web page distinctively, out of millions of pages on the web. Hence, we are comparing the domain name with the top 5 search results. If zero search results are encountered, then the system predicts the message as smishing.

If nouns are not found in the message, we feed only the domain name to the search engine. This domain name might give zero search results if it is an IP address. If the Google search engine gives zero results, then we predict the message as smishing because an IP address instead of a domain name indicates smishing. If the message contains misspelled words or leet words in brand names, feeding these words into a search engine might also result in zero results. A genuine message never contains misspelled words, especially in its brand names. This leads us to assume that misspelled nouns indicate a smishing message. Domain names fed into the search engine should give genuine websites in search results if it is a domain name of a genuine website. If it is giving zero results, again it indicates a smishing message.

Our technique assumes that the Google search engine gives the majority of legitimate websites in its top search results and legitimate sites are ranked higher than phishing sites. The age of the domain of malicious websites is very short. Phishers keep on changing their domain name very frequently. The average life span of a phishing website is 4.5 days [38]. Because fake websites have a short lifespan and few links pointing to them, they have a low Google page rank and are not displayed in top Google results.

To address the privacy concerns of feeding Google with the contents received from an SMS, we assure that the system is only comparing the domain names in search results but not opening the links received in search results. This way, the system protects the user from any malicious file downloading while doing search engine comparisons.

In the second-level domain checking, the system first extracts the source code of the URL included in the message. Then, all the URLs included in the source code for redirections are extracted. The domain names of these URLs are compared with the domain name of the URL contained in the message. If the domain name matches, the message is declared as legitimate else the message is transferred to SMS Classification Phase. The second-level Domain Checking Algorithm is given in Algorithm 3.

Second-level domain checking is a crucial part of the system. This level of checking determines whether the message needs to be transferred to the Backpropagation


Fig. 4 Architecture of the proposed System

Algorithm for classification or it can be declared legitimate. Phishing is an activity in which attackers copy the source code of the genuine site to give a genuine effect to their malicious websites. For stealing the sensitive credentials of the user, attackers insert redirections in the source code, so that the information entered by the user gets saved in the database created by the attacker in their malicious websites. These redirections often have a different domain name because the attacker wants to steal the user’s sensitive data and needs access to it through a fake website having a different domain name. To capture this type of maliciousness, we are matching the domain name included in the source code in the second-level domain checking. The point to be noted here is that the system is accessing the source code of the URL without invoking the website to avoid any malicious file downloading during this step.

3.2 SMS classification phase

Text messages containing the phone number and email id will be processed in this phase as shown in Fig. 4. The system inspects the text message contents in this module. The values of five heuristics from the text message are extracted, namely the presence of misspelled words, leet words, symbols, special characters, and smishing keywords. Based on the values of the extracted heuristics, a feature vector is built which will be provided to the classification algorithms. A python script is created to extract and compare all the heuristics related to the text.

The presence of misspelled words, leet words, symbols, special characters, and smishing keywords in the SMS text are the five features extracted in this phase, which are elaborated as follows:

Misspelled words Misspelled words in text messages indicate smishing because attackers often send misspelled text messages to the users. Misspelled words are the result of typing errors or intentional misspellings. Attackers often


use misspelled words to fool the users. Legitimate organizations never use misspelled words in their official text message sent to customers. The system checks whether any misspelled words exist. If misspelled words exist, the system returns 0 else returns 1. We found 62.12% of smishing data with misspelled words in it.

Leet Words Leet words are words in which alphabets are replaced by numerals and symbols which look similar to the alphabets. Leet words are often used by attackers to give a genuine effect to malicious texts, URLs, and messages. Hackers are using leet words in a text message to bypass word filtering and to prevent smishing messages from being discovered via keyword search. If leet words exist, the system returns 0 else returns 1. We found 74.22% of smishing data with leet words in it.

Symbols Attackers use mathematical symbols like ?, %, -, /, ^, %, etc., to lure the victims. Using these symbols also helps them to create leet words. The system checks if any symbols are used in the text message. If symbols are used, the system returns 0 else returns 1. We were able to find 51.59% of smishing data with symbols inserted in it.

Special Characters The special characters like !, $, &, #, and * are used by the attackers in smishing messages. The character ‘$’ denotes currency in the smishing message claiming to award the user, and character ‘!’ is used with smishing words like ‘CONGRATULATIONS!’, ‘WINNER!’, etc., to attract the victims. By reviewing our dataset, we found 26.20% of smishing data with special characters in them.

Smishing keywords The system deals with the words which are regularly used by the attackers to attract the victims. They try to attract victims by offering discount coupons, gifts, prizes, etc. We found 83.02% of smishing data with the following 20 smishing keywords in it. The following commonly used smishing keywords are extracted by the system to deal with the smishing messages:

The system extracts the values of features mentioned above and classifies the messages based on that. We have used the Backpropagation Algorithm and three traditional classification algorithms, namely Decision Tree, Random Forest, and Naive Bayes to predict the results and check the effectiveness of the heuristics selected.

An Artificial Neural Network (ANN) is a machine learning technique inspired by the human brain. Hence, ANN is modeled like the human nervous system. A neural network comprises neurons connected in layers. The inputs and expected outputs are fed into the system. An activation function is used to activate the neurons and calculate the output. The inputs are mapped to the outputs with the help of an activation function. This process is termed as Forward Propagation. Every neuron connection in the network has a corresponding weight. Backpropagation Algorithm (BPA) is used for training the network to reach the expected outputs. An error function is used to calculate the difference between actual output and expected output. Actual output is the output given by the algorithm, and expected output is the output provided to the system. BPA adjusts the weights and biases of the network for bringing the error function to the minima. This process is termed as Backward Propagation. In BPA, both the inputs and outputs are supplied to the algorithm and the mapping function between the inputs and outputs is learned. Hyperparameters like the number of neurons, connections, number of hidden layers, etc., can be optimally set for achieving good accuracy.

Figure 5 shows the architecture of the backpropagation network. Following parameters are used in implementing the neuron-based network:

• Activation function is used to activate the layers and predict the output
• Prediction error is the difference between target output and prediction output
• Error function is used to calculate the error
• Optimization method is used to minimize error

Our dataset is trained and tested using the Backpropagation Algorithm. The best accuracy given by the algorithm is given in Sect. 5 of this paper. Also, a comparison chart depicting the results of BPA with all three traditional algorithms is shown in Fig. 7.

Messages containing a URL are processed in Domain Checking Phase. On the other hand, messages containing phone number and email id are processed in SMS Classification Phase. If none of them is detected, messages are classified as legitimate. Messages which do not contain a URL, email id, or phone number are considered legitimate because such messages cannot cause any harm to the user even if they contain malicious keywords. In the proposed system, messages are segregated in an earlier stage, avoiding the processing of the same messages in different phases. Messages with email id and phone number are processed by SMS Classification Phase, and messages containing URL are processed by Domain Checking Phase. Features extracted in SMS Classification Phase are only


Fig. 5 Architecture of the backpropagation network

about messages having an email id or phone number in it, and thus, this system highly reduces complexity in the analysis of the text messages.

4 Challenges and comparison

This section elaborates on the various challenges we have faced during the implementation of this research work. This section also presents a comparison of the proposed model with smishing detection models proposed by other researchers.

As smishing is a fraud activity conducted through short text messages, detection of smishing involves various challenges. In the case of smishing, we have short text messages which are often in short forms or symbolic forms. We assume that attackers often use nouns in the text message to identify the legitimate brand which they are masquerading. Hence, we are extracting the nouns in the message to identify the legitimate website. Also, the URL incorporated in the message is short. Our system first converts the short URL to a long URL; then, it is processed to extract the actual domain name. Forming a signature from the short text message is a strenuous task because it includes abbreviations, text shortcuts, and acronyms used by the text sender. It also contains leet words used by the attacker which causes difficulty for the systems to recognize the actual word intended by the attacker. Leet words often lead to zero search results in Google search engines. Zero search results indicate smishing by our system. Due to the minimum amount of information shared by the attacker in the messages, we are left with a minimum set of features for classification. In this system, we are selecting the nouns used in the text message to form a signature. The nouns can be detected in Google search results if the text message is sent by a genuine organization. Short URL is converted to long URL before examining it for maliciousness. It is difficult to analyze the URL within the small display of mobile phones even if it is converted into a long URL. Abbreviated form of text messages and short URL makes it difficult to analyze the maliciousness of the text message. Hence, we analyze the presence of leet words and smishing keywords included in the text message to check its maliciousness.

Some of the other challenges faced during the evaluation of this system are discussed below:

• What if the domain name of the current URL does not match with the domain name of the results given by the search engine? If the domain name is not matching with the top 5 domain names, our system declares the message as smishing. If a message contains misspelled brand names or leet words, it might result in zero search results. A genuine message never contains misspelled words, especially in their brand names. This indicates that misspelled nouns mean a smishing message. Domain names fed into the search engine should give genuine results if it is an authenticated message.
• What if the domain name is an IP address and not identifiable by its name for comparison? If the domain name of the URL is an IP address, it indicates smishing as IP address is often used by the attackers due to their frequently changing domain names. Legitimate brands prefer to use domain names in their website URLs for effortless recognition.
• How to make a signature from the message? What are the criteria to create it from a text message? Most of the attackers are using leet words, symbols, numerals, smishing keywords, etc., in the text message which


cannot be used in a signature. Every attacker masquerades as a genuine brand; thus, they use the brand name in the text message. This brand name is often a noun or abbreviation. Hence, we have extracted the nouns from the message to form a signature.
• How to handle misspelled words in the text message? It could drastically affect search results. Misspelled words are not used in the signature to avoid false search results. To identify the smishing keywords used by the attacker, we have used some selected keywords often used by the attackers.
• What if there are zero nouns in the message? In case of the absence of nouns in the text message, the system will use the domain name of the URL to form the signature.

4.1 Comparison

A comparative analysis of our proposed system with other smishing detection systems is shown in Table 1. This comparison is made based on the security measures and techniques implemented in the system.

It is evident from the comparison table that we have used some novel techniques for the detection of smishing in our system. Search engine domain matching and source code domain matching are the two phases of domain matching conducted in our system. Search engine domain matching is used as a novel technique for the detection of smishing in our system. Also, in the case of phishing detection, signature for search engine matching can be formed by extracting website contents. But in the case of smishing, SMS has abbreviations and idioms which makes it difficult to form the signature. The existence of URL, phone number, and email id in the message is used as a heuristic in other systems too, but we have categorized the messages based on these features at the beginning of the implementation phase and thereby reducing the complexity of the system. We have extracted the best five features from our real-time dataset and thereby improving the performance and reducing the complexity. Moreover, the leet word feature used in our system is a novel heuristic used for the detection of SMS Phishing.

Search engine domain matching is a technique in which we compare the domain names of top search results with the current URL available in SMS. This technique is used for the first time in the detection of smishing. This technique helps us in finding the specific website of the legitimate brand which the attacker is masquerading. Hence, this improves our true negative rate. We have successfully developed a system to create a signature from the text messages to perform the domain checking.

Moreover, we have conducted a detailed study about the leet words used by attackers in smishing and why they are used. In this study, we have observed that leet words are different from misspelled words, and leet words are intentionally created by the attacker to give a genuine effect to the smishing messages. These leet words also help the attacker to bypass word filtering and to prevent smishing messages from being discovered via keyword search. Leet word identification is a novel technique used in our system. It provides a new direction to future research in smishing detection systems.

5 Evaluation and results

This section elaborates on the evaluation of the proposed system using SMS datasets and the evaluation results therein. The prototype of this model is built in python language. The proposed system is built into two phases, namely Domain Checking Phase and SMS Classification Phase. The two phases are finally merged to obtain a final working model of the approach. The final model is experimented with using a collection of text messages.
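
The early segregation of messages described above (a URL sends the message to the Domain Checking Phase, a phone number or email id sends it to the SMS Classification Phase, and anything else is treated as legitimate) can be sketched as follows. The regular expressions are deliberately simple and are assumptions made for illustration; the paper does not specify how URLs, phone numbers, or email ids are recognized, and giving URLs precedence over phone numbers and email ids is also an assumption.

```python
import re

# Illustrative patterns only; real messages would need more robust detection.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s-]{6,}\d(?!\w)")


def route_message(text):
    """Decide which phase should process a text message."""
    if URL_RE.search(text):
        return "domain_checking"
    if EMAIL_RE.search(text) or PHONE_RE.search(text):
        return "sms_classification"
    # No URL, phone number, or email id: considered legitimate.
    return "legitimate"
```

Under these assumptions, a message such as "Call 9876543210 to renew KYC" is routed to `"sms_classification"`, while "See you at 5 pm" is classified as `"legitimate"` without further processing.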

Table 1 Comparison of the proposed model with other proposed systems

| Techniques and details | Feature based [6] | Rule based [13] | SmiDCA [7] | Smishing Detector [14] | S-detector [10] | Proposed system |
|---|---|---|---|---|---|---|
| Search engine domain matching | NO | NO | NO | NO | NO | YES |
| Source code domain matching | NO | NO | NO | YES | NO | YES |
| Existence of URL | YES | YES | YES | YES | YES | YES |
| Existence of phone number and email id in the message | YES | YES | YES | YES | NO | YES |
| Smishing keywords | YES | YES | YES | YES | YES | YES |
| Misspelled words | NO | NO | YES | NO | NO | YES |
| Leet words | NO | NO | NO | NO | NO | YES |
| Symbols | YES | YES | NO | NO | NO | YES |
| Special characters | NO | NO | YES | NO | NO | YES |
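
Source code domain matching, listed in Table 1 and described in Sect. 3.1, can be illustrated with the sketch below. It is not the authors' Algorithm 3. The page source is assumed to be already available as a string (how it is fetched is left out), links are taken from `href`, `src`, and `action` attributes, and the message is treated as legitimate only if every linked host equals the host of the SMS URL. That strict "all links must match" rule and the names `LinkCollector` and `second_level_check` are assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect URLs found in href, src and form action attributes."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src", "action") and value:
                self.urls.append(value)


def host_of(url):
    """Return the lowercase host name of a URL, without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def second_level_check(page_url, html_source):
    """Compare the hosts of all links in the page source with the page's own host."""
    collector = LinkCollector()
    collector.feed(html_source)
    page_host = host_of(page_url)
    for link in collector.urls:
        link_host = host_of(urljoin(page_url, link))  # resolve relative links
        if link_host and link_host != page_host:
            # A redirection to a different domain: pass on to SMS classification.
            return "sms_classification"
    return "legitimate"
```

A form whose `action` posts to a different host than the page itself is therefore enough, in this sketch, to forward the message to the SMS Classification Phase.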


Text messages dataset is collected from the paper Almeida [39], a contribution to the study of spam messages. It includes 5574 text messages out of which 4827 are legitimate and 747 are spam messages. According to our study, the smishing dataset is not publicly available yet. But we have observed that smishing SMS is a part of spam SMS. Hence, some smishing messages were manually extracted from the spam messages. Also, some smishing SMS was extracted from pinterest.com [40] which were added to the final dataset. So, our final dataset contains 5858 text messages of which 538 are smishing SMS and 5320 are legitimate SMS.

5.1 Evaluation metrics

The metrics estimate the performance of the system in terms of the percentage of correct instances detected and the number of misclassifications it makes. For evaluating the performance of the smishing detection system, we measured Accuracy (ACC), F1-score, precision, recall, Area Under the ROC Curve (AUC), and time complexity. Following evaluation metrics are used in our system:

• True Positive (TP): True Positive denotes the number of smishing messages identified as smishing by the system.
• False Positive (FP): False Positive denotes the number of legitimate messages identified as smishing by the system.
• False Negative (FN): False Negative denotes the number of smishing messages identified as legitimate by the system.
• True Negative (TN): True Negative denotes the number of legitimate messages identified as legitimate by the system.
• Accuracy: Accuracy is evaluated as the proportion of True Positive and True Negative over the total number of classifications as depicted by the formula below:

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$

• Precision: Precision is calculated by the formula:

$$\text{Precision} = \frac{TP}{TP + FP}$$

• Recall: Recall is calculated as:

$$\text{Recall} = \frac{TP}{TP + FN}$$

• F1-Score: It is calculated as the harmonic mean of precision and recall.

$$\text{F1-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

• Area Under the Curve (AUC): The area under the curve (AUC) of a Receiver Operating Characteristic (ROC) curve is used as a metric for the performance evaluation of a binary classifier system. For plotting the ROC curve, the values of True-Positive Rate (TPR) are shown on the vertical axis and False-Positive Rate (FPR) are shown on the horizontal axis of the curve.
• Time Complexity is measured as the time consumed by the CPU to execute the code. It depends on the software as well as the hardware we are using to execute it. This value helps us to determine that the computational complexity of the proposed system is not high, and it converges to a solution within a reasonable time.

5.2 Results

The feature set used in this system is carefully selected to make sure that the smishing messages are predicted accurately. The frequency of each feature in our dataset is calculated to determine the importance of features used in our system. This frequency is calculated based on the existence of these features in 538 smishing messages available in our dataset. The presence of these features is tested by building a python code for the extraction of features. The frequency of each heuristic used in our system is depicted in Fig. 6. The results have shown that the ‘smishing keywords’ feature is most useful in detecting smishing SMS and it exists in 83.02% of smishing SMS, followed by ‘leet words’ which exist in 74.22% of smishing messages. The least important feature used in our system is special characters which are found in 26.20% of smishing data.

Fig. 6 Frequency of each heuristic analyzed
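
The five heuristics whose frequencies are reported here can be turned into a feature vector along the following lines. This is a hedged sketch rather than the authors' script: the keyword set is a small illustrative sample (the paper's list of 20 keywords is not reproduced in this text), the symbol and special-character sets are illustrative subsets, the `vocabulary` argument stands in for whatever dictionary is used for spell checking, and a leet word is approximated as a token with digits enclosed by letters. The 0/1 encoding follows the text (0 when the feature is present, 1 otherwise); applying it to special characters and keywords as well is an assumption.

```python
import re

# Illustrative values only; none of these sets is taken verbatim from the paper.
SYMBOLS = set("%-/^")
SPECIAL_CHARS = set("!$&#*")
KEYWORDS = {"congratulations", "winner", "prize", "gift", "discount", "coupon"}
LEET_RE = re.compile(r"[A-Za-z]\d+[A-Za-z]")  # digits enclosed by letters, e.g. 'bl0cked'


def flag(present):
    """Encode presence as 0 and absence as 1, following the text."""
    return 0 if present else 1


def extract_features(text, vocabulary):
    """Return [misspelled, leet, symbols, special characters, keywords] flags."""
    tokens = re.findall(r"[A-Za-z0-9@$]+", text)
    alpha_tokens = [t.lower() for t in tokens if t.isalpha()]
    misspelled = any(t not in vocabulary for t in alpha_tokens)
    leet = any(LEET_RE.search(t) for t in tokens)
    symbols = any(ch in SYMBOLS for ch in text)
    special = any(ch in SPECIAL_CHARS for ch in text)
    keywords = any(t in KEYWORDS for t in alpha_tokens)
    return [flag(misspelled), flag(leet), flag(symbols), flag(special), flag(keywords)]
```

The resulting five-element list is the kind of feature vector that would then be passed to the classifiers.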


The above five features selected by our system have shown good accuracy while classifying the messages using Backpropagation Algorithm. The values returned by the above heuristics have been passed to machine learning algorithms. We have evaluated our approach using three machine learning algorithms, namely Decision Tree, Random Forest, and Naive Bayes. The prediction results of each algorithm based on the feature values have been recorded. The prediction result of each algorithm is shown in Fig. 7. It shows that the Random Forest Algorithm (RF) gave a decent accuracy of 97.85%, while the Backpropagation Algorithm (BPA) gave the best result with an accuracy of 97.93%. The Naïve Bayes algorithm (NB) also presented a good performance with an accuracy of 97.76%. The precision and recall of BPA and RF are almost the same with a value of 84% and 94%, respectively. Thus, BPA outperformed RF in the evaluation of our system and Naïve Bayes outperformed Decision Tree in the machine learning experiment.

Table 2 Hyperparameters used in Backpropagation Algorithm

| Hyperparameters | Values |
|---|---|
| No. of hidden nodes | 10 |
| No. of Epochs | 10 |
| Activation function | ReLU |
| Solver | Adam |
| Learning rate | 0.01 |

Fig. 8 Behavior of error to change in learning rate (horizontal axis: Learning Rate; vertical axis: Error)

We have tried different values for hyperparameters like the number of hidden nodes, the number of epochs, and learning rate for achieving the best accuracy and achieved the final accuracy of 97.93% using the Backpropagation Algorithm. Hyperparameters used in achieving this accuracy are shown in Table 2. We have tried values ranging from 1 to 12 for the number of hidden nodes and achieved the best accuracy for 10 hidden nodes. The maximum number of iterations was set to 100 for evaluating the model, but time complexity was high on the execution of the system. Hence, we noted the best accuracy achieved within a reasonable execution time, for which the number of epochs was 10. The sigmoid activation function was also tried for BPA, but we have achieved the best accuracy using ReLU Function for the BPA implementation of our system. Different values of learning rate were tested ranging from 0.1 to 1.0. The resultant graph and response to the error rate are depicted in Fig. 8. The figure shows that the error rate is rising in response to an increase in the learning rate.

The prediction error is calculated by finding the difference between the actual output and the target output. A Mean Squared Error (MSE) function is used to calculate the error as shown in Eq. (1), where a_k is the actual output, t_k is the target output, and E is the prediction error.

$$E = \frac{1}{2}\,(t_k - a_k)^2 \qquad (1)$$

The behavior of Error corresponding to learning rate is shown in Fig. 8. The error is gradually increasing when the

Fig. 7 Performance of algorithms on our system (bar chart of Precision, Recall, F1-score, and Accuracy for the Backpropagation Algorithm, Random Forest, Naïve Bayes, and Decision Tree)


Fig. 9 The software and hardware configurations used for evaluation of the system

Table 3 Evaluation results of the system

| Algorithm | Accuracy (%) | AUC | Execution time in seconds |
|---|---|---|---|
| Backpropagation Algorithm | 97.93 | 0.988 | 33.32 |
| Random Forest | 97.85 | 0.985 | 20.41 |
| Naive Bayes | 97.76 | 0.983 | 17.24 |
| Decision Tree | 96.48 | 0.974 | 16.32 |

Table 4 Performance of the proposed model on Backpropagation Algorithm (confusion matrix)

| | Smishing messages | Legitimate messages |
|---|---|---|
| Classified as smishing | TP = 509 | FP = 92 |
| Classified as legitimate | FN = 29 | TN = 5228 |

Table 5 Outcome of the proposed model after cross-validation

| Iterations | Accuracy (%) |
|---|---|
| Iteration 1 | 97.65 |
| Iteration 2 | 97.97 |
| Iteration 3 | 98.29 |
| Iteration 4 | 97.52 |
| Iteration 5 | 98.25 |
| Average | 97.93 |
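
As a quick consistency check, the metrics of Sect. 5.1 can be recomputed from the confusion matrix in Table 4. The sketch below takes smishing as the positive class, so FP counts legitimate messages classified as smishing and FN counts smishing messages classified as legitimate, matching the row and column labels of Table 4. The function name `confusion_metrics` is only illustrative.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Compute the evaluation metrics of Sect. 5.1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # legitimate messages flagged as smishing
        "fnr": fn / (fn + tp),  # smishing messages missed by the system
    }


# Counts taken from Table 4.
metrics = confusion_metrics(tp=509, fp=92, fn=29, tn=5228)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, the accuracy comes out at about 0.9793, precision at about 0.85, and recall at about 0.95, in line with the 97.93%, 84%, and 94% figures quoted in the text, and 92 of 5320 legitimate messages corresponds to a false-positive rate of about 1.7%.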

learning rate is incremented from 0.1 to 1. This means that the algorithm achieves convergence at a slower pace when the learning rate is set to low values, but a higher value of learning rate causes fast convergence. But the fast convergence carries a risk of overshooting the minimum. Hence, the error rate rises with increase in the learning rate.

Execution time of the system is estimated using Begin Time() and End Time() functions in the python environment. Hardware and software configurations used for the execution of the system are shown in Fig. 9. A 64-bit operating system with 8 GB RAM is used for the evaluation of the system. The time taken by the Backpropagation Algorithm is 33.32 s, whereas random Forest took 20.14 s to evaluate the system. Hence, the time complexity is high in the case of the Backpropagation Algorithm, but it gave us a higher accuracy in comparison with Random Forest. Therefore, BPA has shown the best performance with an accuracy of 97.93% and execution time of 33.32 s. AUC values are plotted using the ROC curve which is shown in Table 3. AUC value 0.988 is achieved for BPA and 0.985 for Random Forest. Hence, it is evident from the AUC values that our system is consistent in its performance for smishing detection.

The performance of the Backpropagation Algorithm is depicted in Table 4. It shows that this approach can identify smishing messages with accuracy of 97.93%. The false-positive rate is very low, which shows that very few ham messages are classified as smishing. Twenty-nine smishing messages out of 538 smishing messages in our dataset were wrongly classified as legitimate by the system. This shows that the false-negative rate is very low, and the system is efficient in detecting smishing messages. Only 92 legitimate messages out of 5320 legitimate messages were classified as smishing, which corresponds to a 1.7% FPR (False Positive Rate). Hence, our system is efficient in identifying legitimate messages.

We have experimented our approach using fivefold cross-validation in which we break up the dataset into five

equal subsets, i.e., 80% data for training and 20% data for testing in each iteration. One subset is used as the testing set and the remaining four subsets are used as the training set. Iterations are repeated until every subset has been used as the testing set. In fivefold cross-validation, a final accuracy of 97.93% is achieved. The result of the cross-validation experiment is shown in Table 5.

The above results show that this system is efficient in the detection of smishing messages. We have tested the performance of our system using evaluation metrics like Accuracy, Precision, Recall, and F1-Score. A confusion matrix depicting the performance of the Backpropagation Algorithm is also presented. The system is also evaluated using AUC (Area Under the Curve). Thus, the results are consistent across both sets of evaluation metrics.

6 Conclusion

This paper focused on developing a system for the detection of smishing. We successfully designed and evaluated a smishing detection system comprising two phases, namely the Domain Checking Phase and the SMS Classification Phase. Each phase focused on different aspects of the SMS to determine its maliciousness. We mainly focused on scrutinizing the authenticity of the URL in the SMS while reducing the complexity of the system. We have also discussed the Paytm smishing scam and its counterparts in this paper.

Finally, a model of the system has been developed and tested on SMS data comprising 5858 messages. We have selected the most efficient features for the identification of smishing messages. We have implemented our system using the Backpropagation Approach and obtained a final accuracy of 97.93%. The existing smishing detection approaches focused on SMS contents only, but our method focused on monitoring the authenticity of the URL in the detection of smishing messages.

A comparison of our system with other smishing detection systems has been presented. It showed that some novel techniques have been used in this research work to detect smishing. The Google search engine has been used to aid in the detection of legitimate websites.

In this study, the phone number and email id included in a message are not checked for their maliciousness. In a few cases, messages might appear legitimate, but the phone number and email id included in the message might belong to an attacker. Blacklists of phone numbers and email ids are not publicly available yet. These blacklists can be developed and shared through a publicly available database, which is a possible future direction of this work. Also, different techniques can be developed for creating the signature for the search engine domain check conducted in this work. Creating a signature from the minimum amount of information shared by the attacker in a message is a difficult task. However, researchers are motivated to take this up as a challenge and find the best technique for creating a signature using the minimum available information. Other deep learning algorithms can be used for the classification and comparison part of this work. Results of the Backpropagation Algorithm can be compared with other deep learning models.

Declarations

Conflict of interest The authors have no conflicts of interest to declare that are relevant to the content of this article. All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.